The Science of Text Mining

This is a page dedicated to the "Science of Text Mining".

What is text mining?

The phrase "text mining" has two common usages. The first is generic: it refers to the analysis of text, or to the use of features extracted from a textual data source. Another term sometimes used for this is text analytics. Researchers who work on natural language processing of scientific text, or on information extraction from it, often refer to their work as text mining because that term seems to be more familiar to their biomedical collaborators.

The second usage is more specific, and more directly analogous to the notion of mining in data mining. This usage was pioneered by Marti Hearst at UC Berkeley. She defines text mining as "the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources." Professor Hearst elaborates, "A key element is the linking together of the extracted information together to form new facts or new hypotheses to be explored further by more conventional means of experimentation."
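
To make the distinction concrete, here is a minimal sketch of the two senses working together. It is an illustration only, not Professor Hearst's method or any published system: the sentences are invented, and the pattern, verb list, and function names (`extract_relations`, `link_relations`) are hypothetical. The first function does text mining in the generic sense by pulling simple relations out of text. The second does it in the specific sense by chaining those relations into a candidate link that no single sentence states.

```python
import re
from collections import defaultdict

# Toy sentences, invented for illustration; not drawn from any real corpus.
SENTENCES = [
    "compound_x reduces factor_y.",
    "factor_y aggravates condition_z.",
    "compound_x reduces factor_w.",
]

# Hypothetical pattern: "<subject> <verb> <object>." with a small fixed verb list.
PATTERN = re.compile(r"^(\w+) (reduces|aggravates) (\w+)\.$")


def extract_relations(sentences):
    """Generic sense: pull (subject, verb, object) triples out of raw text."""
    triples = []
    for sentence in sentences:
        match = PATTERN.match(sentence)
        if match:
            triples.append(match.groups())
    return triples


def link_relations(triples):
    """Specific sense: chain A -> B and B -> C into a candidate A -> C link
    that no single sentence states directly."""
    stated = {(subj, obj) for subj, _, obj in triples}
    by_subject = defaultdict(list)
    for subj, verb, obj in triples:
        by_subject[subj].append((verb, obj))
    candidates = []
    for a, _, b in triples:
        for _, c in by_subject[b]:
            if a != c and (a, c) not in stated:
                candidates.append((a, b, c))
    return candidates


triples = extract_relations(SENTENCES)
hypotheses = link_relations(triples)
for a, b, c in hypotheses:
    print(f"candidate link: {a} -> {c} (via {b})")
```

A candidate link produced this way is only a hypothesis. In keeping with the definition above, it still has to be explored by more conventional means of experimentation.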

Achieving text mining in the latter sense requires text mining in the former sense. My work involves a bit of both.