Text Mining

What is now known as "text mining" is an extension of the better-known data mining. Data mining software analyzes billions of numbers to extract the statistics and trends hidden in a company's data. This kind of analysis has been applied successfully in business as well as in military, social, and government settings. But only about 20% of the data on intranets and on the World Wide Web consists of numbers; the rest is text. The information contained in that text, roughly 80% of the data, is invisible to the data mining programs that analyze the information flow in corporations.

The state of the art in information technology now allows the development of software that can analyze text the same way data mining software analyzes numbers. The growth of "big data" analysis is impressive (a fivefold increase from 2011 to 2012), and many companies and research groups are racing to grab a piece of what is estimated to be a multi-billion-dollar market for text mining.

An early push for this kind of technology came from a seminal event in the 1990s known as MUC, the Message Understanding Conference, funded by the US defense research agency DARPA. MUC was a series of international trials of research systems. Groups working in text processing were given a common task and a few months to train their systems on it. They were then given fresh data to test how well each system extracted the target knowledge, and the results were ranked to show which systems worked best. A decade later, the two technologies, data mining and text analysis, are merging. With the combination, the data miner suddenly has access to the data locked in text.

The data mining approach to text processing uses information retrieval techniques. Pattern matching, keyword matching, collocation analysis, and word frequency analysis are used to discover what a document is about, essentially treating a text document as if it were numbers. Statistical techniques are used to locate documents on a particular topic or to route documents within an organization. A familiar example is spam e-mail filtering: all messages that fit the statistical profile of junk e-mail are routed to a particular folder, as in the sketch below.
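To make the idea concrete, here is a minimal sketch of that statistical approach: a Naive Bayes spam filter built with the scikit-learn library, which classifies messages from word counts alone. The tiny training corpus is invented for illustration; any labelled set of messages would work the same way.

```python
# A minimal sketch of the statistical approach: treating each message
# as a bag of word counts ("numbers", not language) and routing it by
# its statistical profile. Training data is invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = [
    "win a free prize now",               # spam
    "claim your free cash reward",        # spam
    "meeting moved to 3pm today",         # not spam
    "please review the attached report",  # not spam
]
train_labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

# Turn each document into a vector of word frequencies.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_texts)

# Fit a Naive Bayes model to the word-count profiles.
classifier = MultinomialNB()
classifier.fit(X, train_labels)

# Route a new message according to its statistical profile.
new_message = vectorizer.transform(["free prize inside, claim now"])
print(classifier.predict(new_message))  # [1] -> route to spam folder
```

Note that nothing in this pipeline "understands" the messages; the classifier sees only which words occur how often, which is exactly the sense in which this approach treats text as numbers.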

The linguistic approach, on the other hand, consists of analyzing the text, or more generally applying natural language processing (NLP) techniques to text mining. These techniques allow the construction of systems that actually "understand" text and extract information from it, rather than just scanning it as a list of strings. This is also known as information extraction. For example, such systems comb through large volumes of text, finding references to companies, people, places, and so on, with the goal of answering who, what, when, and where questions.
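As an illustration of the linguistic approach, the sketch below tags named entities in a sentence using the spaCy NLP library. It assumes the small English model (en_core_web_sm) has been installed, and the exact entities found depend on the model version.

```python
import spacy

# Load a pretrained English pipeline (install it beforehand with:
#   python -m spacy download en_core_web_sm).
nlp = spacy.load("en_core_web_sm")

doc = nlp("Microsoft opened a research lab in Cambridge in 1997, "
          "led by Roger Needham.")

# Each recognized entity helps answer a who/what/when/where question.
for ent in doc.ents:
    print(ent.text, ent.label_)

# Typical output (may vary by model version):
#   Microsoft ORG
#   Cambridge GPE
#   1997 DATE
#   Roger Needham PERSON
```

Unlike the spam filter above, this pipeline parses the sentence linguistically before labelling spans, which is why it can distinguish a company from a person even when both are just capitalized strings.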

Knowledge may be discovered from many sources of information, yet unstructured text remains the largest readily available source of knowledge. The problem of Knowledge Discovery from Text (KDT) is to extract explicit and implicit concepts, and semantic relations between concepts, using NLP techniques; its aim is to gain insight into large quantities of text data. While deeply rooted in NLP, KDT draws on methods from statistics, machine learning, reasoning, information extraction, knowledge management, cognitive science, and other fields for its discovery process. KDT plays an increasingly significant role in emerging applications such as text understanding, machine translation, and ontology development.
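One crude ingredient of KDT can be sketched in a few lines: counting which terms co-occur in the same sentence, as a rough statistical proxy for an implicit relation between concepts. The miniature corpus below is invented for illustration; a real KDT system would combine such counts with parsing, information extraction, and reasoning.

```python
from collections import Counter
from itertools import combinations

# A miniature corpus (invented for illustration).
sentences = [
    "aspirin reduces fever",
    "aspirin reduces inflammation",
    "ibuprofen reduces inflammation and fever",
]

# Count unordered term pairs that appear in the same sentence.
pair_counts = Counter()
for sentence in sentences:
    terms = sorted(set(sentence.lower().split()))
    pair_counts.update(combinations(terms, 2))

# Frequent pairs hint at related concepts (e.g. drugs and symptoms);
# a real system would filter stopwords and use parsed relations.
for pair, count in pair_counts.most_common(5):
    print(pair, count)
```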