Text Mining

What is now known as "text mining" is an extension of the better-known field of data mining. Data mining software analyzes billions of numbers to extract the statistics and trends emerging from a company's data. This kind of analysis has been applied successfully in business as well as for military, social and government needs. However, only about 20% of the data on intranets and on the World Wide Web is numeric - the rest is text. The information contained in that text (about 80% of the data) is invisible to the data mining programs that analyze the information flow in corporations.

The state of the art in information technology today allows the development of software that can analyze text in the same way that data mining software analyzes numbers. Many companies and research groups are racing to grab the largest piece of the text mining market, a pie estimated to be worth five billion dollars over the next two years.

An early push for this kind of technology came from a seminal event in the 1990s known as MUC - the Message Understanding Conference, founded by the US defense research agency DARPA. MUC was a series of international trials of research systems. Groups working in text processing were given a common task and a few months to train their systems to do it. They were then given test data to measure how well each system extracted the knowledge. The results were ranked to show which systems worked best. A decade later, the two technologies, data mining and text analysis, are merging. With the combination, the data miner suddenly has access to the data in the text.

The data mining approach to text processing uses information retrieval techniques. Pattern matching, keyword matching, collocation analysis and word frequency analysis are used to discover what a document is about - essentially treating a text document as if it were numbers. Statistical techniques are used to locate documents on a particular topic, or to route documents within an organization. A familiar example is spam e-mail filtering: all messages that fit the statistical profile of junk e-mail are routed to a particular folder.
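
To make the statistical approach concrete, here is a minimal sketch of a spam filter built on word frequencies. It assumes a naive Bayes classifier with add-one smoothing, which is only one of several statistical techniques that could fill this role; the class name, the labels "spam" and "ham", and the training messages are all made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesFilter:
    """Tiny multinomial naive Bayes classifier (hypothetical, for illustration)."""

    LABELS = ("spam", "ham")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()

    def train(self, text, label):
        """Record the word frequencies of one labeled message."""
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def score(self, text, label):
        """Return log P(label) plus the sum of log P(word | label)."""
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_words = sum(self.word_counts[label].values())
        total_docs = sum(self.doc_counts.values())
        # Assumes every label has at least one training message.
        log_prob = math.log(self.doc_counts[label] / total_docs)
        for word in tokenize(text):
            count = self.word_counts[label][word]
            # Add-one smoothing keeps unseen words from zeroing the score.
            log_prob += math.log((count + 1) / (total_words + len(vocab)))
        return log_prob

    def classify(self, text):
        """Pick the label whose statistical profile fits the message best."""
        return max(self.LABELS, key=lambda label: self.score(text, label))


# Made-up training messages, for illustration only.
spam_filter = NaiveBayesFilter()
spam_filter.train("win a free prize now", "spam")
spam_filter.train("free money click now", "spam")
spam_filter.train("meeting agenda for monday", "ham")
spam_filter.train("please review the project report", "ham")

print(spam_filter.classify("claim your free prize"))
print(spam_filter.classify("agenda for the project meeting"))
```

Note that the classifier never looks at word order or meaning. It treats each message as a bag of counts, which is exactly the sense in which this approach handles text as if it were numbers.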

The linguistic approach, on the other hand, consists of analyzing the text itself or, more generally, applying natural language processing techniques to text mining. These techniques make it possible to build systems that actually "understand" text and extract information from it, rather than just scanning it as a list of strings. This is also known as information extraction. For example, such systems go through large amounts of text finding references to companies, people, places, etc., with the goal of answering who, what, when and where questions.
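
The kind of output an information extraction system produces can be sketched with a toy rule-based extractor. This is a deliberate simplification: real systems rely on linguistic analysis (tokenization, part-of-speech tagging, parsing), whereas the sketch below assumes only regular expressions and two tiny hand-made word lists. The names `COMPANIES`, `PLACES` and `extract_entities`, and the sample sentence, are hypothetical.

```python
import re

# Hypothetical gazetteers: tiny hand-made lists standing in for real lexicons.
COMPANIES = {"Acme Corp", "Globex"}
PLACES = {"Boston", "Paris"}

# A person is assumed to be a title followed by capitalized words.
PERSON_PATTERN = re.compile(r"\b(?:Mr|Ms|Dr)\. ([A-Z][a-z]+(?: [A-Z][a-z]+)*)")

# A date is assumed to look like "March 3, 2004".
DATE_PATTERN = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{1,2}, \d{4}\b"
)


def extract_entities(text):
    """Return the people, organizations, places and dates found in the text."""
    return {
        "who": PERSON_PATTERN.findall(text),
        "organizations": [name for name in sorted(COMPANIES) if name in text],
        "where": [name for name in sorted(PLACES) if name in text],
        "when": DATE_PATTERN.findall(text),
    }


sentence = "Dr. Jane Smith joined Acme Corp in Boston on March 3, 2004."
for slot, values in extract_entities(sentence).items():
    print(slot, values)
```

The result is a structured record rather than a list of matching documents, which is what lets such a system answer who, when and where questions directly.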

Language technology is not yet widespread in the ordinary user market, but in specialized commercial areas it is finding a ready market for its tools, consultancy and training. This technology is at roughly the same stage as OCR was ten years ago: everyone wants it, but no one is sure how it will turn out. But all the signs are that text mining is going to be huge, and that the technologies which it is developing are going to change the face of computing.

Knowledge may be discovered from many sources of information, yet unstructured texts remain the largest readily available source of knowledge. The problem of Knowledge Discovery from Text (KDT) is to extract explicit and implicit concepts, and the semantic relations between them, using Natural Language Processing (NLP) techniques. Its aim is to gain insight into large quantities of text data. KDT, while deeply rooted in NLP, draws on methods from statistics, machine learning, reasoning, information extraction, knowledge management, cognitive science and other fields for its discovery process. KDT plays an increasingly significant role in emerging applications such as Text Understanding, Machine Translation and Ontology Development.

Text Mining Links:

  1. About Scatter-Gather
  2. Book - Information Retrieval
  3. CELI - Language and Information Technology
  4. CIIR at UMass
  5. Content Analysis Resources
  6. H5 Technologies
  7. IBM - Abstracts of Rakesh Agrawal's Publications
  8. IBM - Clustering Publications
  9. Knowledge Discovery in Databases Archive
  10. Knowledge Management
  11. LSI - Latent Semantic Indexing Web Site
  12. Making a Semantic Web
  13. Marti Hearst: What Is Text Mining?
  14. Megaputer Intelligence -- Software Evaluation
  15. MetaCarta - Technology
  16. Open Directory - Reference Knowledge Management Knowledge Discovery
  17. Proteus Project Technical Reports
  18. SRA International, Inc. - NetOwl Product Family Information
  19. Suffix Trees
  20. Summarization
  21. Text Analysis Info
  22. Text Analysis, Text Mining and Information Retrieval Software
  23. The STRAND Bilingual Databases
  24. WEBSOM Publications
  25. About Scatter/Gather
  26. About the Cat-a-Cone
  27. About TileBars
  28. Automated Info Solutions
  29. Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text 
  30. AvaQuest
  31. Compris Intelligence
  32. Delft-Cluster TextMiner
  33. Eidetica
  34. Extraction of Knowledge from Unstructured Text
  35. Indexer from Xanalys
  36. Intelligent Miner for Text
  37. Leximancer
  38. Machine Learning in Automated Text Categorization
  39. NetOwl - Intelligent Content Management
  40. Pertinence Mining
  41. SAS Text Miner
  42. Synthema
  43. Systems Services - Web Data Retrieval
  44. Text Mining and the Knowledge Management Space
  45. Text Mining at Waikato
  46. Text Mining Community
  47. Text Mining, Web Mining, Information Retrieval and Extraction from the WWW References
  48. TextAI: Text Analysis International
  49. TextAnalyst
  50. TextMining.org
  51. Untangling Text Data Mining
  52. WebAnalyst
  53. Web Search and Mining: Introduction
  54. WordStat