Sequence variants, in particular Single Nucleotide Polymorphisms (SNPs), are considered key elements in fields such as genetic epidemiology and pharmacogenomics [Palmer and Cardon, 2005]. Researchers in these areas are interested in finding genes associated with diseases or with drug responses, as well as in selecting the relevant sequence variants of candidate genes for genotyping studies. Several public databases contain sequence information on genes and proteins (NCBI Entrez, SwissProt and many others). Data on sequence variants can be found in other public resources such as NCBI dbSNP and HapMap. In contrast, information about the phenotypic consequences of sequence variants is generally found as unstructured text in the biomedical literature. Identifying the relevant documents and extracting information from them are often hampered by the lack of a widely accepted standard notation for genes, proteins and sequence variants in the biomedical literature, and by the large size of current literature databases. Automatic systems for the identification of gene/protein entities and their corresponding sequence variants in biomedical texts are therefore required. Our group has previously reported the development of OSIRIS, a search system that integrates different sources of information and incorporates ad hoc tools for synonym generation, with the aim of retrieving literature about the sequence variation of a gene using the PubMed search engine. We have developed a new version of OSIRIS as a first step towards an integrated text mining system for the extraction of information about genes, sequence variants and related phenotypes. The new implementation (OSIRISv1.2) incorporates a new entity recognition module and is built on top of a local mirror of the MEDLINE collection and HgenetInfoDB, a database that integrates data on human genes from the NCBI Gene database and dbSNP.
The entity recognition module is based on a corpus of articles annotated with gene identifiers and on a new search algorithm, which uses a pattern-based search strategy and a sequence variant nomenclature dictionary to identify terms denoting SNPs and other sequence variants and to map them to dbSNP entries. Running OSIRISv1.2 generates a corpus of annotated literature linked to sequence database entries (NCBI Gene and dbSNP). The search results are stored in a database that can be queried and, in the future, used for the extraction of relationships among biological entities. The performance of OSIRISv1.2 was evaluated on a manually annotated corpus, yielding 99% precision at 82% recall, and an F-score of 0.9.
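To make the idea of a pattern-based search combined with a nomenclature dictionary more concrete, the following is a minimal Python sketch. It is not the OSIRISv1.2 algorithm: the regular expressions, the function names `find_variant_mentions` and `map_to_dbsnp`, and the two-entry toy dictionary are all assumptions introduced here for illustration only.

```python
import re

# Hypothetical patterns for illustration; not the patterns used by OSIRISv1.2.
AMINO_ACIDS = (
    "Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|"
    "Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val"
)

PATTERNS = {
    # dbSNP reference identifiers, e.g. "rs1042522"
    "rs_id": re.compile(r"\brs\d+\b"),
    # three-letter amino acid substitutions, e.g. "Arg72Pro"
    "protein_substitution": re.compile(
        rf"\b(?:{AMINO_ACIDS})\d+(?:{AMINO_ACIDS})\b"
    ),
    # nucleotide substitutions, e.g. "677C>T" or "-786T>C"
    "nucleotide_substitution": re.compile(
        r"(?<![\w-])-?\d+[ACGT]>[ACGT]\b"
    ),
}

# Toy nomenclature dictionary (assumed structure): (gene, term) -> dbSNP id.
TOY_DICTIONARY = {
    ("TP53", "Arg72Pro"): "rs1042522",
    ("MTHFR", "677C>T"): "rs1801133",
}


def find_variant_mentions(text):
    """Return (kind, term) pairs for every pattern match, in text order."""
    mentions = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            mentions.append((match.start(), kind, match.group(0)))
    return [(kind, term) for _, kind, term in sorted(mentions)]


def map_to_dbsnp(gene, term, dictionary=TOY_DICTIONARY):
    """Look up a variant term for a gene; return None when it is unknown."""
    if term.startswith("rs"):
        return term  # already a dbSNP identifier
    return dictionary.get((gene, term))


sentence = "The Arg72Pro polymorphism (rs1042522) and the 677C>T variant were genotyped."
mentions = find_variant_mentions(sentence)
```

In this sketch the lookup is scoped by gene, because the same surface term (for example a codon position) can denote different variants in different genes. A real system would need a far larger set of patterns and dictionary entries to cope with the notational variety found in abstracts.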
Subarachnoid hemorrhage as a consequence of intracranial aneurysm is one of the most devastating cerebrovascular diseases due to its high morbidity and mortality. Cerebral aneurysms are extensions of arterial vessels in the brain. These extensions are balloon-type bulges that form in apparently healthy people with a comparatively high frequency (about 2% of the general population have an intracranial aneurysm) [Rinkel et al 1998]. The interaction of genetic and environmental risk factors is thought to play an important role in the pathogenesis of the disease. In addition to smoking, hypertension, atherosclerosis and alcohol intake, hemodynamic stress at arterial bifurcations is believed to contribute to the development of aneurysms [Krischek and Inoue, 2006]. Several studies have claimed that the presence of certain allelic variants may increase an individual's susceptibility to developing intracranial aneurysms, and that these variants might be used as biomarkers for the risk of aneurysm rupture [Krischek and Inoue, 2006, Ruigrok et al 2005]. In this context, the identification of the sequence variants associated with the disease phenotype in specific populations is of high value for early diagnosis and treatment, and also for understanding the pathogenesis of the disease. OSIRISv1.2 was used to collect sequence variant data for a set of 302 genes related to the disease, and to recognize and extract SNP terms from MEDLINE abstracts, with the aim of collecting the available information on the variants under study in the disease. The set of aneurysm-related genes was automatically selected on the basis of their occurrence in abstracts pertaining to the disease, in the context of the EU project @neurIST. The abstracts were retrieved from PubMed using search queries with disease-related terms, and scanned with ProMiner [Hanisch et al 2006] for the recognition of gene terms.
From an initial set of 790 genes, results were obtained for 302 genes by scanning the entire MEDLINE collection; these results were stored in the database and are available for browsing. The results presented here are a collection of the citations that refer to sequence variants of these genes.
Breast cancer is the most prevalent type of cancer in women after nonmelanoma skin cancer, and is the second leading cause of cancer deaths after lung cancer. For 2007, an estimated 180,510 new cases and 40,910 deaths from breast cancer were projected. Only 15-20% of cases occur in families carrying a strong predisposing mutation, for instance in the BRCA-1 and BRCA-2 genes. The remaining cases arise from a combination of environmental and genetic factors, the latter each having a small individual effect. During the development of a tumour, genetic changes acquired by cells during the initial phases provide proliferative advantages, such as the acquisition of constitutive mitogenic signals and the ability to resist growth-inhibiting signals and apoptosis and to induce angiogenesis. In addition, cells acquire more mutant alleles that enable them to metastasize to other tissues [Balmain et al, 2003]. In this context, the identification of the sequence variants associated with the disease phenotype in specific populations is of high value for early diagnosis and treatment, and for understanding the mechanisms leading to tumour development and metastasis and the differential response of individual patients to therapy. OSIRISv1.2 was used to collect sequence variant data for a set of 5182 genes related to the disease, and to recognize and extract SNP terms from MEDLINE abstracts, with the aim of collecting the available information on the variants under study in the disease. The set of breast cancer-related genes was automatically selected on the basis of their occurrence in abstracts pertaining to the disease, in the context of the EU project @neurIST. The abstracts were retrieved from PubMed using search queries with disease-related terms, and scanned with ProMiner [Hanisch et al 2006] for the recognition of gene terms.
From an initial set of 5182 genes, results were obtained for 1055 genes by scanning the entire MEDLINE collection; these results were stored in the database and are available for browsing. The results presented here are a collection of the citations that refer to sequence variants of these genes.
The toxicity of prescription drugs is a known issue of current drug therapies. However, the mechanisms that underlie the side effects of drugs are not completely understood, although genetic factors might be important in some cases. Thus, the elucidation of the genetic basis of drug toxicities is relevant for improving our understanding in this area and also for pinpointing putative biomarkers. OSIRISv1.2 was used to collect sequence variant data for a set of 79 genes related to drug toxicity in general, and to recognize and extract SNP terms from MEDLINE abstracts, with the aim of collecting the available information on the variants under study in this context. From an initial set of 124 genes, results were obtained for 79 genes by scanning the entire MEDLINE collection; these results were stored in the database and are available for browsing. The results presented here are a collection of the citations that refer to sequence variants of these genes.
Browse the results of the OSIRISv1.2 searches conducted for the three case studies described above. The first case study pertains to genes related to the cerebrovascular disease intracranial aneurysm and subarachnoid hemorrhage, the second to genes related to breast cancer, and the third to genes related to drug toxicity. Alternatively, you can browse the results for all the genes available in our database.
The results of the OSIRISv1.2 searches can also be browsed by disease. Terms from the Disease category of the MeSH hierarchy were extracted from the set of abstracts associated with each dbSNP entry and used to describe that set of documents. Each term was assigned a weight that indicates its relevance in describing the abstracts associated with the variation. This representation provides an easy way to retrieve the variations associated with a disease category, according to the MeSH controlled vocabulary.
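As a rough illustration of how such term weights could support both kinds of lookup, here is a minimal Python sketch. The weighting scheme is not specified above, so the one used here (the fraction of a variant's abstracts that carry a given MeSH disease term) is purely an assumption, as are the function names, the `min_weight` threshold and the placeholder variant identifiers.

```python
from collections import Counter


def mesh_term_weights(abstract_terms):
    """Weight each MeSH disease term by the share of abstracts that carry it.

    abstract_terms: list of sets, one set of MeSH disease terms per abstract.
    This relative-frequency weight is an assumed stand-in for the real scheme.
    """
    total = len(abstract_terms)
    if total == 0:
        return {}
    counts = Counter(term for terms in abstract_terms for term in set(terms))
    return {term: count / total for term, count in counts.items()}


def variants_for_term(weights_by_variant, term, min_weight=0.5):
    """Return the variants whose weight for `term` reaches `min_weight`."""
    return sorted(
        variant
        for variant, weights in weights_by_variant.items()
        if weights.get(term, 0.0) >= min_weight
    )


# Placeholder identifiers and annotations, invented for this example.
annotations = {
    "variant_A": [
        {"Breast Neoplasms"},
        {"Breast Neoplasms", "Ovarian Neoplasms"},
        {"Lung Neoplasms"},
        {"Breast Neoplasms"},
    ],
    "variant_B": [
        {"Intracranial Aneurysm"},
        {"Intracranial Aneurysm", "Subarachnoid Hemorrhage"},
    ],
}

weights_by_variant = {
    variant: mesh_term_weights(abstracts)
    for variant, abstracts in annotations.items()
}
```

With these toy annotations, `weights_by_variant["variant_A"]` gives the disease terms describing one variant, while `variants_for_term(weights_by_variant, "Breast Neoplasms")` goes the other way and lists the variants linked to a disease term.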
Retrieve the variations associated with the main categories of disease terms in the MeSH hierarchy:
Retrieve the MeSH disease terms associated with a dbSNP identifier:
| Statistic | Count |
| --- | --- |
| Number of dbSNP entries found in MEDLINE | 5167 |
| Number of MEDLINE abstracts annotated to SNPs | 13485 |
| Number of NCBI genes mapped to the SNPs | 1827 |
Furlong LI, Dach H, Hofmann-Apitius M, Sanz F. OSIRISv1.2: a named entity recognition system for sequence variants of genes in biomedical literature. BMC Bioinformatics 2008, 9:84.
Furlong LI, Sanz F. Identification of sequence variants of genes from biomedical literature: the OSIRIS approach. In: Information Retrieval for Biomedicine: Natural Language Processing for Knowledge Integration, in press.
This work is part of a joint effort between the Integrative Biomedical Informatics Group at the Research Unit on Biomedical Informatics (GRIB) and the Department of Bioinformatics at the Fraunhofer Institute for Algorithms and Scientific Computing (SCAI). The Integrative Biomedical Informatics Group promotes synergistic and integrative approaches across the diverse research lines developed by the research groups of the Research Unit on Biomedical Informatics (GRIB). The group focuses on the application of methods and software developed in-house to tackle human health issues, including disease prevention and diagnosis and therapeutic technologies. One of our research lines is devoted to the development of new strategies and tools for text mining, focused on literature retrieval and classification, particularly for documents dealing with genetic variation. Visit our resources page at IBI.
Comments and suggestions: Laura I. Furlong (lfurlong@imim.es), Integrative Biomedical Informatics Group, Research Unit on Biomedical Informatics (GRIB), Institut Municipal d'Investigació Médica (IMIM) and Universitat Pompeu Fabra (UPF).
Updated: November 2008
This work has been carried out in the framework of the following projects: @neurIST IP (financed by the European Commission through contract no. IST-2005-027703), INFOBIOMED NoE (financed by the European Commission through contract no. IST-2002-507585), the INBIOMED ISCIII network and the ALERT project (European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 215847).