We have developed several corpora in the biomedical domain (OSIRIS corpus, CALBC corpus, EU-ADR corpus, GAD corpus).

OSIRIS corpus


The OSIRIS corpus is a set of MEDLINE abstracts manually annotated with human variation mentions. The corpus is distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (Furlong et al, BMC Bioinformatics 2008, 9:84).

The OSIRIS corpus can be used to assess the performance of both variation entity recognition and variation entity disambiguation to NCBI dbSNP identifiers.
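As a rough illustration of such an assessment, the sketch below scores a system's output against gold annotations using precision, recall and F1. The choice of metrics, the tuple layout (document id, start offset, end offset, identifier) and the placeholder identifiers are assumptions made for this example and are not taken from the corpus files.

```python
def precision_recall_f1(gold, predicted):
    """Compare two collections of annotations; return (precision, recall, F1)."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical annotations: (document id, start, end, placeholder identifier).
# The identifiers are invented labels, not real dbSNP entries.
gold = {
    ("doc1", 10, 15, "id-A"),
    ("doc1", 40, 47, "id-B"),
    ("doc2", 5, 12, "id-C"),
}
predicted = {
    ("doc1", 10, 15, "id-A"),
    ("doc2", 5, 12, "id-D"),  # correct span, wrong identifier
}

# Entity recognition: only the text span has to match.
ner_scores = precision_recall_f1(
    {a[:3] for a in gold}, {a[:3] for a in predicted}
)

# Entity disambiguation: the span and the identifier both have to match.
norm_scores = precision_recall_f1(gold, predicted)

print("recognition:", ner_scores)
print("disambiguation:", norm_scores)
```

Scoring the two tasks separately shows whether errors come from missed mentions or from mentions that were found but mapped to the wrong identifier.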

For a detailed description of how the corpus was developed, see Furlong LI, Dach H, Hofmann-Apitius M, Sanz F. OSIRISv1.2: a named entity recognition system for sequence variants of genes in biomedical literature. BMC Bioinformatics 2008, 9:84.

What is a variation entity?

We use the term variation to refer to any kind of short-range change in the nucleotide sequence of the genome. SNPs are the most studied type of sequence variation, but the class also includes short insertions or deletions, named variations such as Alu sequences, and other types of variations collected in the dbSNP database. These variations can map to the exonic regions of genes, where they may produce a change at the protein level, or to introns, untranslated regions or the regions between genes. Some variations, such as non-synonymous SNPs, may alter protein function; others may alter processes related to the regulation of gene expression.

From the point of view of a Named Entity Recognition system, a variation entity is defined by the combination of tokens that specify the location of the variation in the sequence and the original and altered alleles. This information can be expressed in terms of the nucleotide sequence or the amino acid sequence. For instance, the term G894T can be interpreted in two ways: as a variation in the protein sequence, in which a glycine residue at position 894 of the protein changes to a threonine residue, or as a variation at the DNA level, in which the guanine at position 894 of the gene sequence changes to a thymine.
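The ambiguity of a mention such as G894T can be made concrete with a small sketch. The code below splits a mention of the form letter, position, letter into original allele, location and altered allele, then lists every reading that is consistent with the one-letter nucleotide and amino acid codes. The regular expression and the function name are illustrative assumptions; this is not the method used by OSIRIS.

```python
import re

# Standard one-letter codes. A, C, G and T are valid in both alphabets,
# which is the source of the ambiguity.
NUCLEOTIDES = {"A": "adenine", "C": "cytosine", "G": "guanine", "T": "thymine"}
AMINO_ACIDS = {
    "A": "alanine", "R": "arginine", "N": "asparagine", "D": "aspartic acid",
    "C": "cysteine", "E": "glutamic acid", "Q": "glutamine", "G": "glycine",
    "H": "histidine", "I": "isoleucine", "L": "leucine", "K": "lysine",
    "M": "methionine", "F": "phenylalanine", "P": "proline", "S": "serine",
    "T": "threonine", "W": "tryptophan", "Y": "tyrosine", "V": "valine",
}

# Hypothetical pattern for mentions written as <letter><position><letter>.
MENTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def interpretations(mention):
    """Return the possible (level, original, position, altered) readings."""
    match = MENTION.match(mention)
    if match is None:
        return []
    original, altered = match.group(1), match.group(3)
    position = int(match.group(2))
    readings = []
    if original in NUCLEOTIDES and altered in NUCLEOTIDES:
        readings.append(
            ("DNA", NUCLEOTIDES[original], position, NUCLEOTIDES[altered])
        )
    if original in AMINO_ACIDS and altered in AMINO_ACIDS:
        readings.append(
            ("protein", AMINO_ACIDS[original], position, AMINO_ACIDS[altered])
        )
    return readings


for reading in interpretations("G894T"):
    print(reading)
```

A mention built from letters outside A, C, G and T, such as R123W, yields only the protein reading, so surface form alone resolves some mentions but not G894T; those need context from the surrounding text.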


Corpus statistics

Number of articles: 105
Number of articles with NCBI Gene annotations: 105
Number of articles with NCBI dbSNP annotations: 57
Number of variations normalized to NCBI dbSNP identifiers: 105
Number of variations not normalized to NCBI dbSNP identifiers: 212


Corpus format

The corpus is distributed in two formats: an XML file and a WordFreak format file. The XML file was edited with the Vex editor within the Eclipse platform. The corpus in WordFreak format contains a finer level of annotation of the variations: the location and the alleles are annotated separately. This format is suitable for machine-learning applications. For an example application, see Klinger R, Friedrich CM, Mevissen HT, Fluck J, Hofmann-Apitius M, Furlong LI, Sanz F. Identifying gene-specific variations in biomedical text. J Bioinform Comput Biol. 2007 Dec;5(6):1277-96, which uses this corpus to develop a Conditional Random Fields based NER system for variations.



XML format
XML file DTD


WordFreak format

CALBC corpus

We have also contributed to the development of the CALBC silver standard corpus. This is a large-scale corpus annotated with different biomedical entities through the harmonisation of annotations from automatic text mining tools. More information can be found here.

EU-ADR corpus

Corpora with specific entities and relationships annotated are essential to train and evaluate text-mining systems that are developed to extract specific structured information from a large corpus. In this paper we describe an approach where a named-entity recognition system produces a first annotation and annotators revise this annotation using a web-based interface. The agreement figures achieved show that the inter-annotator agreement is much better than the agreement with the system-provided annotations. The corpus has been annotated for drugs, disorders, genes and their inter-relationships. For each of the drug-disorder, drug-target, and target-disorder relations, three experts have annotated a set of 100 abstracts. These annotated relationships will be used to train and evaluate text-mining software to capture these relationships in texts.
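To illustrate how agreement between annotators and agreement with the system could be compared, here is a minimal sketch. The text above does not state which agreement measure was used, so the measure below (relations marked by both parties divided by relations marked by either) is an assumption, and all annotator names and relation triples are invented for the example.

```python
from itertools import combinations


def pairwise_agreement(first, second):
    """Share of relations marked by either party that both parties marked."""
    first, second = set(first), set(second)
    union = first | second
    return len(first & second) / len(union) if union else 1.0


# Hypothetical annotations: (abstract id, entity 1, entity 2).
annotations = {
    "expert_1": {("abs1", "drugA", "disorderX"), ("abs2", "drugB", "disorderY"),
                 ("abs3", "drugC", "disorderZ")},
    "expert_2": {("abs1", "drugA", "disorderX"), ("abs2", "drugB", "disorderY")},
    "expert_3": {("abs1", "drugA", "disorderX"), ("abs2", "drugB", "disorderY"),
                 ("abs3", "drugC", "disorderZ")},
}
system = {("abs1", "drugA", "disorderX"), ("abs4", "drugD", "disorderW")}

# Agreement for every pair of experts.
expert_pairs = {
    (a, b): pairwise_agreement(annotations[a], annotations[b])
    for a, b in combinations(sorted(annotations), 2)
}

# Agreement of each expert with the system-provided annotations.
versus_system = {
    name: pairwise_agreement(marked, system)
    for name, marked in annotations.items()
}

print(expert_pairs)
print(versus_system)
```

With these invented inputs the expert pairs agree more with each other than any expert agrees with the system, which is the pattern the paragraph describes; real figures would have to come from the corpus itself.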

The dataset used for training the BeFree system can be downloaded from this link.

GAD corpus

This corpus contains annotations on the relationships between genes and diseases and has been developed from the Genetic Association Database (GAD). GAD is an archive of human genetic association studies of complex diseases, including summary data extracted from publications on candidate gene and GWAS studies. We use GAD for the development of a corpus on associations between genes and diseases by a semi-automatic annotation procedure. More information is provided here.

The gene-disease association corpus based on GAD is available here.