Text Mining

Due to the increasing size of literature repositories, there is a strong need for tools that identify and gather the relevant information from publications and place it in the context of current biomedical knowledge. Our research line on text mining is focused on:

  • the development of tools to extract information from the scientific literature, such as diseases and their associated genes and genetic alterations, or drugs and their therapeutic and side effects.
  • the development of tools to extract information from clinical records, such as diagnoses, treatments and procedures.

Text mining on biomedical literature

Named Entity Recognition

We develop dictionary-based methods and tools to identify genes/proteins, sequence variants and diseases. An example of this approach is the Osiris system:

Sequence variants, in particular Single Nucleotide Polymorphisms (SNPs), are considered key elements in fields such as genetic epidemiology and pharmacogenomics [Palmer and Cardon, 2005]. Researchers in these areas are interested in finding genes associated with diseases or with drug responses, and in selecting the relevant sequence variants of candidate genes for genotyping studies. Several public databases provide sequence information on genes and proteins (NCBI Entrez, SwissProt and many others), and data on sequence variants can be found in other public resources such as NCBI dbSNP and HapMap. In contrast, information about the phenotypic consequences of sequence variants is generally found as unstructured text in the biomedical literature. Identifying the relevant documents and extracting the information from them is often hampered by the lack of a widely accepted standard notation for genes, proteins and sequence variants, and by the large size of current literature databases. Automatic systems that identify gene/protein entities and their corresponding sequence variants in biomedical texts are therefore required.

Our group has previously reported the development of OSIRIS, a search system that integrates different sources of information and incorporates ad hoc tools for synonym generation, with the aim of retrieving literature about the sequence variation of a gene using the PubMed search engine. We have developed a new version of OSIRIS as a first step towards an integrated text mining system for the extraction of information about genes, sequence variants and related phenotypes. The new implementation (OSIRISv1.2) incorporates a new entity recognition module and is built on top of a local mirror of the MEDLINE collection and HgenetInfoDB, a database that integrates data on human genes from the NCBI Gene database and dbSNP.

The entity recognition module is based on a corpus of articles annotated with gene identifiers and on a new search algorithm, which uses a pattern-based search strategy and a sequence variant nomenclature dictionary to identify terms denoting SNPs and other sequence variants and to map them to dbSNP entries. Running OSIRISv1.2 generates a corpus of annotated literature linked to sequence database entries (NCBI Gene and dbSNP). The search results are stored in a database that can be queried and, in the future, used for the extraction of relationships among biological entities. OSIRISv1.2 was evaluated on a manually annotated corpus, achieving 99% precision at 82% recall and an F-score of 0.9.
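
To give a rough idea of what a pattern-based search for sequence variant terms can look like, here is a minimal Python sketch. It is not the OSIRIS algorithm or its nomenclature dictionary: the three regular expressions, the names VARIANT_PATTERNS and find_variant_mentions, and the example sentence are all assumptions made for illustration. They cover only dbSNP rs identifiers, three-letter amino acid substitutions and simple nucleotide substitutions. Mapping the recognised terms to dbSNP entries would additionally require a gene-specific lookup, which is not shown.

```python
import re

# Three-letter amino acid codes (illustrative subset of variant nomenclature).
AA3 = ("Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|"
       "Phe|Pro|Ser|Thr|Trp|Tyr|Val")

# Hypothetical patterns, NOT the OSIRIS nomenclature dictionary.
VARIANT_PATTERNS = [
    # dbSNP reference identifiers, e.g. "rs4680"
    ("rsid", re.compile(r"\brs\d+\b")),
    # Amino acid substitutions, e.g. "Val158Met" or "p.Val158Met"
    ("aa_substitution",
     re.compile(rf"(?<![\w.])(?:p\.)?(?:{AA3})\d+(?:{AA3})\b")),
    # Nucleotide substitutions, e.g. "677C>T" or "c.677C>T"
    ("nt_substitution",
     re.compile(r"(?<![\w.])(?:c\.)?\d+[ACGT]>[ACGT]\b")),
]


def find_variant_mentions(text):
    """Return (start, end, label, matched_text) tuples sorted by position."""
    mentions = []
    for label, pattern in VARIANT_PATTERNS:
        for match in pattern.finditer(text):
            mentions.append(
                (match.start(), match.end(), label, match.group(0))
            )
    return sorted(mentions)


if __name__ == "__main__":
    sentence = ("The COMT Val158Met polymorphism (rs4680) and "
                "MTHFR c.677C>T were genotyped.")
    for start, end, label, term in find_variant_mentions(sentence):
        print(f"{start:3d}-{end:3d}  {label:16s} {term}")
```

A real system needs far more patterns than this (one-letter amino acid codes, insertions and deletions, free-text descriptions), which is one reason a curated nomenclature dictionary is combined with the patterns.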

Another example is the BeFree system.

Extraction of relationships

We are interested in the identification of relationships between biomedical entities such as genes, proteins, diseases, chemicals and drugs. We are also interested in relationships between sequence variants and diseases, with a particular focus on the functional effect of the sequence variant.

We have developed BeFree, a text mining system to unlock the information contained in biomedical documents. It is composed of a Named Entity Recognition module (BioNER) and a relation extraction module based on Support Vector Machines (SVM). We have applied BeFree to the identification of disease-related biomarkers and to the extraction of information about diseases and their associated genes from the literature. Learn more about this project here.
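
A two-stage design of this kind (entity recognition followed by a classifier) usually needs an intermediate step that turns the recognised entities in a sentence into candidate pairs for the classifier to accept or reject. The Python sketch below illustrates only that generic step and makes no claim about how BeFree implements it: the Entity type, the functions locate, mask and candidate_pairs, the placeholder tokens GENE_A and DISEASE_B, and the example sentence are all hypothetical. It assumes non-overlapping entity spans and leaves out the SVM itself.

```python
from itertools import product
from typing import Iterator, List, NamedTuple, Tuple


class Entity(NamedTuple):
    start: int   # character offset where the mention begins
    end: int     # character offset just past the mention
    kind: str    # "GENE" or "DISEASE" in this sketch
    text: str


def locate(sentence: str, text: str, kind: str) -> Entity:
    """Build an Entity from the first occurrence of `text` (demo helper)."""
    start = sentence.index(text)
    return Entity(start, start + len(text), kind, text)


def mask(sentence: str, gene: Entity, disease: Entity) -> str:
    """Replace the two mentions with placeholders so that a classifier
    sees the sentence context rather than the specific names."""
    replacements = [(gene, "GENE_A"), (disease, "DISEASE_B")]
    # Substitute from right to left so earlier offsets stay valid.
    replacements.sort(key=lambda pair: pair[0].start, reverse=True)
    for entity, placeholder in replacements:
        sentence = sentence[:entity.start] + placeholder + sentence[entity.end:]
    return sentence


def candidate_pairs(
    sentence: str, entities: List[Entity]
) -> Iterator[Tuple[Entity, Entity, str]]:
    """Yield every gene-disease pair in the sentence with its masked text."""
    genes = [e for e in entities if e.kind == "GENE"]
    diseases = [e for e in entities if e.kind == "DISEASE"]
    for gene, disease in product(genes, diseases):
        yield gene, disease, mask(sentence, gene, disease)


if __name__ == "__main__":
    sentence = "Mutations in BRCA1 increase the risk of breast cancer."
    entities = [
        locate(sentence, "BRCA1", "GENE"),
        locate(sentence, "breast cancer", "DISEASE"),
    ]
    for gene, disease, masked in candidate_pairs(sentence, entities):
        print(gene.text, "|", disease.text, "|", masked)
```

Each masked sentence would then be converted into features and passed to the trained classifier, which decides whether the sentence actually asserts a relationship between the two entities.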


We have developed the following corpora:

Osiris corpus

EU-ADR corpus

GAD corpus

In addition, we participated in the development of the CALBC corpus.

For more information on these corpora, go here.

Related publications

Bravo, À.; Piñero, J.; Queralt, N.; Rautschka, M.; Furlong, L.I. Extraction of relations between genes and diseases from text and large-scale data analysis: implications for translational research. BMC Bioinformatics 2015, 16:55. doi:10.1186/s12859-015-0472-9

Carbonell P, Mayer MA, Bravo A. Exploring brand-name drug mentions on twitter for pharmacovigilance. Stud Health Technol Inform 2015; 210:55-9.

À. Bravo, M. Cases, N. Queralt-Rosinach, F. Sanz, L.I. Furlong, A knowledge-driven approach to extract disease-related biomarkers from the literature. Biomed Res Int. 2014;2014:253128. doi: 10.1155/2014/253128

Erik M. van Mulligen, Annie Fourrier-Reglat, David Gurwitz, Mariam Molokhia, Ainhoa Nieto, Gianluca Trifiro, Jan A. Kors, Laura I. Furlong. The EU-ADR Corpus: Annotated Drugs, Diseases, Targets, and their Relationships. J Biomed Inform. 2012 Apr 25. PubMed

Philippe E. Thomas, Roman Klinger, Laura I. Furlong, Martin Hofmann-Apitius and Christoph M. Friedrich. Challenges in the Association of Human Single Nucleotide Polymorphism Mentions with Unique Database Identifiers. BMC Bioinformatics 2011, 12(Suppl 4):S4. PubMed

Dietrich Rebholz-Schuhmann, Antonio Jimeno Yepes, Chen Li, Senay Kafkas, Ian Lewin, Ning Kang, Peter Corbett, David Milward, Ekaterina Buyko, Elena Beisswanger, Kerstin Hornbostel, Alexandre Kouznetsov, René Witte, Jonas B. Laurila, Christopher J.O. Baker, Chen-Ju Kuo, Simone Clematide, Fabio Rinaldi, Richárd Farkas, György Móra, Kazuo Hara, Laura Furlong, Michael Rautschka, Mariana Lara Neves, Alberto Pascual-Montano, Qi Wei, Nigel Collier, Md. Faisal Mahbub Chowdhury, Alberto Lavelli, Rafael Berlanga, Roser Morante, Vincent Van Asch, Walter Daelemans, José Luís Marina, Erik van Mulligen, Jan Kors, Udo Hahn. Assessment of NER solutions against the first and second CALBC Silver Standard Corpus. Journal of Biomedical Semantics 2011, 2(Suppl 5):S11. PubMed

Hofmann-Apitius M, Fluck J, Furlong L, Fornes O, Kolarik C, Hanser S, Boeker M, Schulz S, Sanz F, Klinger R, Mevissen T, Gattermayer T, Oliva B, Friedrich CM. Knowledge environments representing molecular entities for the virtual physiological human. Philos Transact A Math Phys Eng Sci. 2008 Jun 17. PubMed

Furlong L.I., Dach H, Hofmann-Apitius M, Sanz F. OSIRISv1.2: a named entity recognition system for sequence variants of genes in biomedical literature. BMC Bioinformatics 2008 5;9(1):84. PubMed

Klinger R., Furlong L.I., Friedrich C.M., Mevissen H.T., Fluck J., Sanz F., Hofmann-Apitius M. Identifying Gene Specific Variations In Biomedical Text. J Bioinform Comput Biol. 2007 Dec;5(6):1277-96. PubMed

Bonis J., Furlong L.I., Sanz F. OSIRIS: a tool for retrieving literature about sequence variants. Bioinformatics 2006 22: 2567-2569. PubMed