Liu, H, Christiansen T, Baumgartner Jr WA, Verspoor K.
2012.
BioLemmatizer: a lemmatization tool for morphological processing of biomedical text. Journal of Biomedical Semantics. 3(3)
AbstractBackground
The wide variety of morphological variants of domain-specific technical terms contributes to the complexity of performing natural language processing of the scientific literature related to molecular biology. For morphological analysis of these texts, lemmatization has been actively applied in the recent biomedical research.
Results
In this work, we developed a domain-specific lemmatization tool, BioLemmatizer, for the morphological analysis of biomedical literature. The tool focuses on the inflectional morphology of English and is based on the general English lemmatization tool MorphAdorner. The BioLemmatizer is further tailored to the biological domain through incorporation of several published lexical resources. It retrieves lemmas based on the use of a word lexicon, and defines a set of rules that transform a word to a lemma if it is not encountered in the lexicon. An innovative aspect of the BioLemmatizer is the use of a hierarchical strategy for searching the lexicon, which enables the discovery of the correct lemma even if the input Part-of-Speech information is inaccurate. The BioLemmatizer achieves an accuracy of 97.5% in lemmatizing an evaluation set prepared from the CRAFT corpus, a collection of full-text biomedical articles, and an accuracy of 97.6% on the LLL05 corpus. The contribution of the BioLemmatizer to accuracy improvement of a practical information extraction task is further demonstrated when it is used as a component in a biomedical text mining system.
Conclusions
The BioLemmatizer outperforms other tools when compared with eight existing lemmatizers. The BioLemmatizer is released as an open source software and can be downloaded from http://biolemmatizer.sourceforge.net.
Liu, H, Hunter L, Keselj V, Verspoor K.
2013.
Approximate Subgraph Matching-Based Literature Mining for Biomedical Events and Relations, 04. PLoS ONE. 8:e60954., Number 4: Public Library of Science
AbstractThe biomedical text mining community has focused on developing techniques to automatically extract important relations between biological components and semantic events involving genes or proteins from literature. In this paper, we propose a novel approach for mining relations and events in the biomedical literature using approximate subgraph matching. Extraction of such knowledge is performed by searching for an approximate subgraph isomorphism between key contextual dependencies and input sentence graphs. Our approach significantly increases the chance of retrieving relations or events encoded within complex dependency contexts by introducing error tolerance into the graph matching process, while maintaining the extraction precision at a high level. When evaluated on practical tasks, it achieves a 51.12% F-score in extracting nine types of biological events on the GE task of the BioNLP-ST 2011 and an 84.22% F-score in detecting protein-residue associations. The performance is comparable to the reported systems across these tasks, and thus demonstrates the generalizability of our proposed approach.
Lu, Z, Kao HY, Wei CH, Huang M, Liu J, Kuo CJ, Hsu CN, Tsai R, Dai HJ, Okazaki N, others, Verspoor K, Livingston K, Wilbur WJ.
2011.
The gene normalization task in BioCreative III. BMC Bioinformatics. 12(Suppl 8):S2., Number Suppl 8: BioMed Central Ltd