BeFree is a text mining tool to unlock the information contained in biomedical documents. BeFree is composed of a module for Biomedical Named Entity Recognition (BioNER), which is based on dictionaries and uses fuzzy and pattern matching methods to find and uniquely identify entity mentions in the literature, and a module for Relation Extraction (RE) based on Support Vector Machines (SVM).
BeFree is used in a text mining workflow aimed at extracting information on biological associations from scientific publications. Briefly, after document selection, the first step is the recognition and normalization of the entities in biomedical publications by means of the BioNER module. In the second step, candidate relationships between these entities are identified by their co-occurrence in sentences, and the RE module processes these co-occurrences to predict which of them express true associations. The different steps of the text mining workflow are illustrated in Figure 1.
Figure 1. Text mining workflow using BeFree.
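The co-occurrence step of the workflow can be sketched in a few lines of Python. This is only an illustration, not BeFree's code: the `Mention` record, the function name and the identifiers (`GENE:A`, `DISEASE:X`) are hypothetical, and the mentions are assumed to have already been recognized and normalized by a NER step.

```python
from itertools import product
from typing import NamedTuple


class Mention(NamedTuple):
    sentence_id: int   # index of the sentence containing the mention
    entity_type: str   # "gene" or "disease"
    identifier: str    # normalized identifier (made-up values below)
    text: str          # surface form found in the sentence


def cooccurrence_candidates(mentions):
    """Pair every gene with every disease mentioned in the same sentence."""
    by_sentence = {}
    for m in mentions:
        by_sentence.setdefault(m.sentence_id, []).append(m)

    candidates = []
    for sentence_id in sorted(by_sentence):
        group = by_sentence[sentence_id]
        genes = [m for m in group if m.entity_type == "gene"]
        diseases = [m for m in group if m.entity_type == "disease"]
        for gene, disease in product(genes, diseases):
            candidates.append((sentence_id, gene.identifier, disease.identifier))
    return candidates


# Hypothetical NER output: sentence 0 has two genes and one disease,
# sentence 1 has a gene but no disease and so yields no candidate.
mentions = [
    Mention(0, "gene", "GENE:A", "EHD3"),
    Mention(0, "gene", "GENE:B", "FREM3"),
    Mention(0, "disease", "DISEASE:X", "MDD"),
    Mention(1, "gene", "GENE:A", "EHD3"),
]
candidates = cooccurrence_candidates(mentions)
print(candidates)
```

Each candidate would then be passed to a relation classifier, which decides whether the co-occurrence expresses an actual association.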
BioNER is a Named Entity Recognition (NER) system based on dictionaries and fuzzy matching methods. In the current implementation, BioNER recognizes gene and disease mentions in free text and annotates them with database identifiers.
An important aspect of BioNER is the development and curation of dictionaries. The main aspects of the development of these dictionaries are summarized below.
Figure 2. An example of the variability in terminology for genes depending on the primary sources.
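As a rough illustration of dictionary lookup with fuzzy matching, the sketch below normalizes a mention and searches for the closest dictionary term using Python's standard `difflib` module. It does not reproduce BioNER's actual matching rules: the dictionary entries, the identifiers, the `normalize` and `lookup` helpers and the similarity cutoff are all assumptions made for the example.

```python
import difflib
import re

# Toy dictionary: normalized term -> identifier. Terms and identifiers are made up.
DICTIONARY = {
    "major depressive disorder": "DISEASE:0001",
    "breast cancer": "DISEASE:0002",
    "ehd3": "GENE:0001",
}


def normalize(text):
    """Lowercase and collapse punctuation/whitespace so simple variants compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def lookup(mention, cutoff=0.85):
    """Return the identifier of the exact or closest dictionary term, or None."""
    key = normalize(mention)
    if key in DICTIONARY:
        return DICTIONARY[key]
    # Fall back to approximate string matching against all dictionary terms.
    close = difflib.get_close_matches(key, list(DICTIONARY), n=1, cutoff=cutoff)
    return DICTIONARY[close[0]] if close else None


print(lookup("Major Depressive-Disorder"))   # matches after normalization
print(lookup("major depresive disorder"))    # misspelling, matched approximately
print(lookup("aspirin"))                     # not in the dictionary
```

Short gene symbols are easily confused by approximate matching, so a real system would likely need stricter rules for them than for long disease names.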
The Relation Extraction (RE) module is composed of a combination of kernels based on a Shallow Linguistic Kernel (KSL) and our Dependency Kernel (KDEP).
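The text does not state how the two kernels are combined. One common option, shown here purely as an assumption, is to normalize each Gram matrix and add them element-wise; the sum of two valid kernels is itself a valid kernel. The matrices and function names below are made up for the example.

```python
import math


def normalize_kernel(gram):
    """Cosine-normalize a Gram matrix: K'(i, j) = K(i, j) / sqrt(K(i, i) * K(j, j)).

    Assumes every diagonal entry is strictly positive.
    """
    n = len(gram)
    return [
        [gram[i][j] / math.sqrt(gram[i][i] * gram[j][j]) for j in range(n)]
        for i in range(n)
    ]


def combine_kernels(gram_a, gram_b):
    """Element-wise sum of two normalized Gram matrices of the same size."""
    a, b = normalize_kernel(gram_a), normalize_kernel(gram_b)
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


# Made-up Gram matrices over two training instances.
k_sl = [[4.0, 2.0], [2.0, 9.0]]
k_dep = [[1.0, 0.5], [0.5, 1.0]]
combined = combine_kernels(k_sl, k_dep)
print(combined)
```

A matrix combined in this way could be supplied to any SVM implementation that accepts precomputed kernels.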
Figure 3. Different linguistic representations of a sentence containing an association between a gene and a disease. a) The sentence extracted from a MEDLINE abstract (PMID:22337703) expresses the association between the disease MDD (Major Depressive Disorder) and the genes EHD3 and FREM3. We will focus on the association between EHD3 and MDD to illustrate the features considered in each kernel. b and c) The local context kernel (KLC) uses orthographic and shallow linguistic features (POS, lemma, stem) of the tokens located at the left and right (window size of 2) of the candidate entities (EHD3 and MDD). d) The global context kernel (KGC) is based on the assumption that an association between two entities (in this case EHD3 and MDD) is more likely to be expressed within one of three patterns (fore-between, between, between-after). In this example the association between EHD3 and MDD is expressed in the between pattern. e) In the KGC we consider both trigrams and sparse bigrams in each pattern.
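The shallow features described in Figure 3 can be illustrated with a small sketch. The tokenized sentence is invented (it is not the sentence from PMID:22337703), the function names are hypothetical, and the definition of a sparse bigram used here (the first and last token of a trigram, skipping the middle one) is an assumption rather than BeFree's documented definition.

```python
def context_patterns(tokens, i, j):
    """Split a sentence into the three global-context patterns around entities at i < j."""
    fore, between, after = tokens[:i], tokens[i + 1:j], tokens[j + 1:]
    return {
        "fore-between": fore + between,
        "between": between,
        "between-after": between + after,
    }


def trigrams(tokens):
    """All runs of three consecutive tokens."""
    return [tuple(tokens[k:k + 3]) for k in range(len(tokens) - 2)]


def sparse_bigrams(tokens):
    """First and last token of each trigram, skipping the middle one (assumed definition)."""
    return [(t[0], t[2]) for t in trigrams(tokens)]


def local_window(tokens, i, size=2):
    """Tokens up to `size` positions to the left and to the right of position i."""
    return tokens[max(0, i - size):i], tokens[i + 1:i + 1 + size]


# Invented example sentence; the candidate entities are at positions 3 and 7.
tokens = ["We", "found", "that", "EHD3", "is", "associated", "with", "MDD",
          "in", "this", "cohort"]
gene_pos, disease_pos = tokens.index("EHD3"), tokens.index("MDD")

patterns = context_patterns(tokens, gene_pos, disease_pos)
print(patterns["between"])
print(trigrams(patterns["between"]), sparse_bigrams(patterns["between"]))
print(local_window(tokens, gene_pos), local_window(tokens, disease_pos))
```

In a full system each token would also carry its POS tag, lemma and stem; only surface forms are used here to keep the example short.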
Figure 4. Different linguistic representations of a sentence containing an association between a gene and a disease (cont). a) Dependency graph representation of the sentence. Solid lines represent the shortest path between the two candidates. The token “associated” is the Least Common Subsumer (LCS) of both candidates. b) Subgraph representing the shortest path between EHD3 and MDD, where syntactic dependencies are represented as edges and tokens as nodes. c) The e-walk and v-walk features for the node.
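The shortest-path and walk features of Figure 4 can be sketched as follows. The dependency edges and their labels are invented for illustration, and the walk definitions (a v-walk as a node-edge-node triple, an e-walk as an edge-node-edge triple) follow the usual walk-kernel convention, which is assumed rather than taken from the text above.

```python
from collections import deque


def shortest_path(edges, source, target):
    """Breadth-first search over the dependency graph, treated as undirected.

    Returns (nodes, labels) along the path, or None if no path exists.
    """
    neighbours = {}
    for head, label, dep in edges:
        neighbours.setdefault(head, []).append((dep, label))
        neighbours.setdefault(dep, []).append((head, label))

    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for nxt, label in neighbours.get(node, []):
            if nxt not in previous:
                previous[nxt] = (node, label)
                queue.append(nxt)
    if target not in previous:
        return None

    # Walk back from the target to the source, then reverse.
    nodes, labels = [target], []
    while previous[nodes[-1]] is not None:
        node, label = previous[nodes[-1]]
        labels.append(label)
        nodes.append(node)
    return nodes[::-1], labels[::-1]


def v_walks(nodes, labels):
    """node - edge - node triples along the path."""
    return [(nodes[k], labels[k], nodes[k + 1]) for k in range(len(labels))]


def e_walks(nodes, labels):
    """edge - node - edge triples centred on each inner node of the path."""
    return [(labels[k - 1], nodes[k], labels[k]) for k in range(1, len(nodes) - 1)]


# Invented dependency edges: (head, dependency label, dependent).
edges = [
    ("associated", "nsubjpass", "EHD3"),
    ("associated", "auxpass", "is"),
    ("associated", "prep_with", "MDD"),
]
nodes, labels = shortest_path(edges, "EHD3", "MDD")
print(nodes, labels)
print(v_walks(nodes, labels))
print(e_walks(nodes, labels))
```

Tokens serve as node names here, so the sketch assumes every token is unique in the sentence; a real implementation would use token positions as node keys.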
The development and application of the BeFree system have been described in the following publications:
 À. Bravo, M. Cases, N. Queralt-Rosinach, F. Sanz, and L. I. Furlong, "A Knowledge-Driven Approach to Extract Disease-Related Biomarkers from the Literature", BioMed Research International, vol. 2014, Article ID 253128, 11 pages, 2014. doi:10.1155/2014/253128. (Article, for the "Big Data and Network Biology" special issue at BioMed Research International).
 À. Bravo, J. Piñero, N. Queralt, M. Rautschka and L.I. Furlong, "Extraction of relations between genes and diseases from text and large-scale data analysis: implications for translational research". BMC Bioinformatics 2015, Article, doi:10.1186/s12859-015-0472-9.
Current biomedical research relies on the successful exploitation of information reported in publications. Automated text mining approaches, in particular those aimed at finding relationships between entities, are key for the identification of actionable knowledge from free text repositories. We report on the development of the BeFree system aimed at identifying relationships between biomedical entities with a special focus on genes and their associated diseases. BeFree exploits morpho-syntactic information of the text and performs competitively for the identification of gene-disease relationships from free text, but also for drug-disease and drug-target associations. We show the value of the gene-disease associations extracted by BeFree through a number of analyses and integration with other data sources. The application of BeFree to real-case scenarios shows its potential for extracting relevant information for translational research. For instance, BeFree is able to identify genes associated with one of the most prevalent diseases, depression, which are not present in public databases. Moreover, large-scale extraction and analysis of gene-disease associations provided interesting insights into the kind of information that can be found in the literature, and raised challenges regarding data prioritization and curation. For example, only a small proportion of the gene-disease associations discovered by text mining are collected in expert-curated databases. Thus, there is a pressing need to find alternative strategies to manual curation to review, prioritize and curate text-mining data and incorporate it into domain-specific databases. We present our strategy for data prioritization and discuss its implications for supporting biomedical applications.
Table 1. Results obtained by 10-fold cross-validation on the different relationships available in the AIMED, EU-ADR and GAD corpora. The first column indicates the experiment number. The second column shows whether KSL is used with sparse bigrams (TG+SBG), without them (TG), or not used at all (blank). The next two columns focus on the KDEP walk features, indicating the use of one of the following features: token (T), stem (S), lemma (L), POS-tag (P), role (R) or none (-). Finally, the last columns show the results obtained in each experiment as Precision (P), Recall (R) and F-score (F).
|1||GAD||F Vs N/Y model
|2||F Vs N Vs Y model
Table 2. Evaluation of BeFree trained on EU-ADR and GAD to identify genes associated with depression.
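For readers unfamiliar with the evaluation measures reported in the tables, the sketch below shows how Precision, Recall and F-score can be computed from gold and predicted labels, together with a simple way to produce 10 cross-validation folds. It assumes the balanced F-score (F1) and an interleaved fold assignment; neither detail is specified in the captions, and the labels are invented.

```python
def precision_recall_f1(gold, predicted):
    """Precision, recall and balanced F-score for binary labels (True = association)."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists; fold f tests every k-th item starting at f."""
    for fold in range(k):
        test = list(range(fold, n_items, k))
        train = [i for i in range(n_items) if i % k != fold]
        yield train, test


# Invented labels: 2 true positives, 1 false positive, 1 false negative.
gold = [True, True, True, False, False]
predicted = [True, True, False, True, False]
p, r, f = precision_recall_f1(gold, predicted)
print(round(p, 3), round(r, 3), round(f, 3))

folds = list(k_fold_indices(20, k=10))
print(len(folds), folds[0][1])
```

In cross-validation the classifier is trained on each `train` split and scored on the corresponding `test` split, and the per-fold scores are then aggregated.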
Corpora with specific entities and relationships annotated are essential to train and evaluate text-mining systems that are developed to extract specific structured information from a large corpus. In this paper we describe an approach where a named-entity recognition system produces a first annotation and annotators revise this annotation using a web-based interface. The agreement figures achieved show that the inter-annotator agreement is much better than the agreement with the system-provided annotations. The corpus has been annotated for drugs, disorders, genes and their inter-relationships. For each of the drug-disorder, drug-target, and target-disorder relations three experts have annotated a set of 100 abstracts. These annotated relationships will be used to train and evaluate text-mining software to capture these relationships in texts. The corpus is available here. The dataset used for training the BeFree system can be downloaded here.
The Genetic Association Database (GAD) is an archive of human genetic association studies of complex diseases, including summary data extracted from publications on candidate gene and GWAS studies. We use GAD for the development of a corpus on associations between genes and diseases (downloaded on January 21st, 2013). We considered the annotations of relationships between a gene and a disease in a single sentence provided as a reference set to build this corpus. GAD contains over 130,000 records with different types of information. We selected the records satisfying the following requirements: (i) the association between gene and disease is annotated as positive or negative, (ii) the association is expressed in one sentence and (iii) the Entrez Gene identifier for the gene is provided. Although GAD provides the sentence in which a gene-disease association is stated, there is no information on the exact location of the gene and disease entities in the text. In order to develop a corpus suitable for training a gene-disease relation extraction system, the exact location of the interacting entities in the text is required. To achieve that, we applied BeFree to identify the gene and disease entities in the text and normalize them to NCBI Gene and UMLS identifiers, respectively. Then, the sentences in which a given gene was found together with a specific disease, and for which this gene-disease association was annotated by GAD curators as positive or negative, were labelled as TRUE. In order to create a dataset containing false associations (FALSE) between a gene and a disease, that is, a gene and a disease that co-occur in a sentence but are semantically not associated, we selected the sentences with co-occurrences between a disease and a gene found by the BioNER system that were not annotated by GAD curators as gene-disease associations. Table 1 shows the number of TRUE and FALSE associations that represent the GAD corpus.
Figure 5 summarizes the methodology followed to derive the GAD corpus, with an example of this extraction using a record from GAD. The dataset used for training the BeFree system can be downloaded here.
Figure 5. An example of the methodology used to create our dataset from GAD.
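The labelling procedure used to derive the GAD corpus can be sketched as below. The record fields (`sentence_id`, `gene_id`, `disease_id`, `association`) and the identifiers are hypothetical and do not reflect the real GAD schema. The sketch also assumes that GAD diseases have already been mapped to the same identifiers as the recognized entities, and that FALSE examples are drawn only from sentences of selected records.

```python
def build_corpus(gad_records, cooccurrences):
    """Label sentence-level gene-disease co-occurrences as TRUE or FALSE."""
    selected = [
        r for r in gad_records
        if r["association"] in {"positive", "negative"}   # requirement (i)
        and r["sentence_id"] is not None                    # requirement (ii)
        and r["gene_id"] is not None                        # requirement (iii)
    ]
    curated = {(r["sentence_id"], r["gene_id"], r["disease_id"]) for r in selected}
    sentences = {r["sentence_id"] for r in selected}

    corpus = []
    for sentence_id, gene_id, disease_id in cooccurrences:
        if sentence_id not in sentences:
            continue  # sentence does not belong to a selected record
        key = (sentence_id, gene_id, disease_id)
        corpus.append(key + ("TRUE" if key in curated else "FALSE",))
    return corpus


# Hypothetical curated records and NER co-occurrences.
gad_records = [
    {"sentence_id": 0, "gene_id": "G1", "disease_id": "D1", "association": "positive"},
    {"sentence_id": 1, "gene_id": "G2", "disease_id": "D1", "association": "negative"},
    {"sentence_id": 2, "gene_id": "G3", "disease_id": "D2", "association": None},
]
cooccurrences = [(0, "G1", "D1"), (0, "G4", "D1"), (1, "G2", "D1"), (2, "G3", "D2")]
corpus = build_corpus(gad_records, cooccurrences)
print(corpus)
```

Both positive and negative curated associations are labelled TRUE, since in either case the sentence makes a statement about the gene-disease pair; FALSE is reserved for pairs that merely co-occur.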
Integrative Biomedical Informatics Group, Research Programme on Biomedical Informatics (GRIB) IMIM-UPF.
Please send questions or comments to: lfurlong(at)imim(dot)es
The research leading to these results has received support from Instituto de Salud Carlos III-Fondo Europeo de Desarrollo Regional (PI13/00082), the Innovative Medicines Initiative Joint Undertaking under grant agreements n°  (eTOX) and n°  (Open PHACTS), resources of which are composed of financial contribution from the European Union's Seventh Framework Programme (FP7/2007-2013) and EFPIA companies’ in-kind contribution. À.B. and L.I.F. received support from Instituto de Salud Carlos III-Fondo Europeo de Desarrollo Regional (CP10/00524). The Research Unit on Biomedical Informatics (GRIB) is a node of the Spanish National Institute of Bioinformatics (INB).