Text Mining For Science

This page is dedicated to "Text Mining for Science".

Text Mining applied to scientific domains, and to the biomedical domain in particular, is in many ways similar to processing any other natural language text. As in any natural language text, the language is:

  • Highly ambiguous: 
    • words may have many meanings 
      (A search for "p53" in EntrezGene returns over 5000 results! A more familiar example is the word "patient", which can refer to a clinical subject or to a psychological characteristic of a person.)
    • phrase interpretation can be difficult to resolve
      (Consider: "The inability of E1A gene products to induce cytolytic susceptibility during infection ... ". Does "during infection" modify "induce" or "susceptibility"? Here it probably modifies "induce", but deciding this requires understanding the biological context.)
  • Richly expressive:
    • there are many ways to say the same thing
      (A simple concept like "DNA metabolism" can also be expressed as "the DNA metabolic process" or with the verb "metabolizes".)
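
To make the "many ways to say the same thing" point concrete, here is a minimal sketch in Python of mapping several surface forms to one concept. The concept identifier, the variant patterns, and the function name find_concepts are all hypothetical and chosen for illustration only; a real system would likely draw its variants from a curated terminology rather than a hand-written list.

```python
import re

# Hypothetical concept identifier and hand-written variant patterns,
# for illustration only (not taken from any real ontology).
VARIANTS = {
    "CONCEPT:dna_metabolism": [
        r"DNA metabolism",
        r"DNA metabolic process",
        r"metaboli[sz]es? DNA",
    ],
}


def find_concepts(text):
    """Return the sorted concept IDs whose variant patterns occur in text."""
    found = set()
    for concept_id, patterns in VARIANTS.items():
        for pattern in patterns:
            if re.search(pattern, text, flags=re.IGNORECASE):
                found.add(concept_id)
                break  # one matching variant is enough for this concept
    return sorted(found)
```

With this sketch, a sentence containing "the DNA metabolic process" and one containing "metabolizes DNA" both map to the same hypothetical identifier. Simple pattern matching of this kind does nothing to resolve the ambiguity problems described above.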

However, the biomedical domain also has specific characteristics that make solutions built for general (English) text transfer poorly to these texts:

  • highly specialized vocabulary
    (words like "phosphorylate", "mutate", "methylation", "epigenetics")
  • complex relationships and events
    (understanding the role of genes, proteins, chemicals and their interactions in the context of specific diseases or drug treatments is the challenge of modern biomedical research)
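
One way to picture such relationships and events is as a small data structure that links typed entities through a trigger word and records the context in which the interaction holds. The following Python sketch is only an assumption about how this could be represented: the class names Entity and Event, the role names, and the placeholder entity names (KINASE_A, PROTEIN_B, DISEASE_X) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    text: str
    kind: str  # e.g. "gene", "protein", "chemical", "disease"


@dataclass
class Event:
    trigger: str                                       # word expressing the interaction
    participants: dict = field(default_factory=dict)   # role name -> Entity
    context: list = field(default_factory=list)        # e.g. disease or drug entities

    def describe(self):
        """Render the event as a compact, human-readable string."""
        roles = ", ".join(
            f"{role}={entity.text}" for role, entity in self.participants.items()
        )
        ctx = ", ".join(entity.text for entity in self.context) or "none"
        return f"{self.trigger}({roles}) [context: {ctx}]"


# Placeholder names only; they do not refer to real genes, proteins or diseases.
example = Event(
    trigger="phosphorylate",
    participants={
        "agent": Entity("KINASE_A", "protein"),
        "target": Entity("PROTEIN_B", "protein"),
    },
    context=[Entity("DISEASE_X", "disease")],
)
```

Calling example.describe() renders the event as a single line, which shows how much structure has to be recovered from one sentence: the participants, their roles, and the disease context.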

Extracting information correctly from these texts therefore requires specialized solutions.