Machine Translation

The idea of a computer system for translating from one language to another is almost as old as the idea of computer systems. Warren Weaver wrote about "mechanical translation" as early as 1949. He and many others were inspired by the success of the Allied efforts to break the German military code produced by the Enigma machine, and by the apparent similarity between decoding an encoded message and translating one language into another. By 1954 there was an MT project at Georgetown University which succeeded in correctly translating several sentences from Russian into English. Soon there were MT projects at MIT, Harvard, and the University of Pennsylvania.

In 1964, the National Academy of Sciences commissioned the Automatic Language Processing Advisory Committee (ALPAC) to write a study on the status of MT. The committee, headed by John R. Pierce, wrote a now-famous report in which it expressed doubt that a fully-automatic MT system could ever be produced. That report caused the end of easy funding for MT research, and MT was neglected for many years afterwards.

Human analysis of natural language relies on information which is not present in the words that make up the message. This fact led the linguist Yehoshua Bar-Hillel to declare that MT was impossible, as famously illustrated by the example:

The pen is in the box. [i.e. the writing instrument is in the container]

The box is in the pen. [i.e. the container is in the playpen or the pigpen]

There are two ways that a person could correctly understand these sentences.

First, if there is a context preceding these sentences, it could make clear which meaning of pen is being used in which sentence. There is now an entire branch of linguistics, called discourse analysis, devoted to the study of how context affects the meaning of words and sentences. To infer the correct meaning of an ambiguous sentence in this way, computers will have to learn how to "remember" a context and use it to interpret words and sentences within that context.

Second, what "pen" might mean in each of these sentences is determined via extralinguistic knowledge about the size and function of the two objects referred to by the word. Bar-Hillel concluded that "no existing or imaginable program will enable an electronic computer to determine [the meaning of] the word pen in the given sentence within the given context. [...] A translation machine should not only be supplied with a dictionary but also with a universal encyclopedia." In fact, this suggestion has led to several projects for creating ontologies or encoding tacit knowledge.
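The first route, using the preceding context, can be illustrated with a small sketch. It is a simplified word-overlap heuristic (in the spirit of Lesk-style disambiguation), not a description of any system mentioned here; the cue-word lists and the function name disambiguate_pen are hypothetical and chosen only for this example.

```python
# Hypothetical cue words for each sense of "pen" (illustrative only).
SENSE_CUES = {
    "writing instrument": {"ink", "write", "wrote", "paper", "desk", "letter"},
    "enclosure": {"baby", "toys", "play", "pig", "farm", "fence"},
}

def disambiguate_pen(context):
    """Pick the sense of 'pen' whose cue words overlap most with the context.

    Returns None when the context offers no cue at all, which is exactly
    the situation of the isolated sentence "The box is in the pen."
    """
    words = set(context.lower().replace(".", " ").split())
    scores = {sense: len(cues & words) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(disambiguate_pen("The baby threw his toys around. The box is in the pen."))
print(disambiguate_pen("The box is in the pen."))
```

With a preceding sentence that mentions a baby and toys, the sketch picks the enclosure sense; without any context it cannot decide. It does nothing for the second route, which needs knowledge about the relative sizes of boxes and pens.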

The most primitive approach to automatic translation is called the direct MT strategy, also known as word-for-word translation. This approach always works between a specific pair of languages and relies on good glossaries and morphological analysis.
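A minimal sketch of the direct strategy, assuming a toy Spanish-to-English glossary and a tiny lookup table standing in for morphological analysis. All entries and function names are hypothetical, and real direct systems use far larger resources.

```python
# Hypothetical toy glossary (Spanish lemma -> English lemma).
GLOSSARY = {"el": "the", "gato": "cat", "negro": "black",
            "comer": "eat", "pescado": "fish"}

# Hypothetical table standing in for morphological analysis:
# inflected form -> (lemma, grammatical feature).
LEMMAS = {"come": ("comer", "3sg"), "gatos": ("gato", "pl")}

def analyze(word):
    """Return (lemma, feature); unknown forms are treated as their own lemma."""
    return LEMMAS.get(word, (word, None))

def inflect(english, feature):
    """Very rough English generation: add -s for 3rd person singular or plural."""
    return english + "s" if feature in ("3sg", "pl") else english

def translate_direct(sentence):
    """Translate one word at a time, keeping the source word order."""
    out = []
    for word in sentence.lower().split():
        lemma, feature = analyze(word)
        target = GLOSSARY.get(lemma, word)  # unknown words pass through unchanged
        out.append(inflect(target, feature))
    return " ".join(out)

print(translate_direct("el gato negro come pescado"))  # the cat black eats fish
```

The output keeps the Spanish noun-adjective order ("cat black"), which shows the main weakness of word-for-word translation: nothing in the procedure can reorder or restructure the sentence.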

The next most advanced system is called the transfer MT strategy. First, the text in the source language is parsed into a product-specific representation. This representation is then "transferred" into the corresponding structures of the target language, from which a translation is generated. This approach is more advanced theoretically, but it also translates only between specific pairs of languages. Both the direct MT strategy and the transfer MT strategy can take advantage of similarities between languages. Logos, LionBridge's Barcelona technology, Systran, and ProMT could be classified under this category.
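The three stages (analysis, transfer, generation) can be sketched as follows. This is a deliberately tiny illustration, assuming a hypothetical tagged lexicon and a single structural rule (Spanish noun-adjective becomes English adjective-noun); it does not reflect the internals of any of the products named above.

```python
# Hypothetical lexicon: Spanish word -> (English word, part of speech).
LEXICON = {
    "el": ("the", "DET"),
    "gato": ("cat", "NOUN"),
    "negro": ("black", "ADJ"),
    "come": ("eats", "VERB"),
    "pescado": ("fish", "NOUN"),
}

def parse(sentence):
    """Analysis: tag each source word with its part of speech."""
    return [(w, LEXICON[w][1] if w in LEXICON else "UNK")
            for w in sentence.lower().split()]

def transfer(tagged):
    """Transfer: map source structure to target structure.

    Single rule for this sketch: NOUN ADJ (Spanish) -> ADJ NOUN (English).
    """
    out = list(tagged)
    i = 0
    while i < len(out) - 1:
        if out[i][1] == "NOUN" and out[i + 1][1] == "ADJ":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2
        else:
            i += 1
    return out

def generate(tagged):
    """Generation: emit target-language words for the transferred structure."""
    return " ".join(LEXICON[w][0] if w in LEXICON else w for w, _ in tagged)

def translate_transfer(sentence):
    return generate(transfer(parse(sentence)))

print(translate_transfer("el gato negro come pescado"))  # the black cat eats fish
```

Because the reordering rule is written for one specific language pair, a new pair needs its own transfer rules, which is why the transfer strategy remains pair-specific.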

Another advanced system is called the Interlingua MT strategy. The Interlingua is an artificial language which ideally shares all the features and makes all the distinctions of all languages. To translate between two languages, an analyzer "transforms" the source language text into the Interlingua, and a generator then "transforms" the Interlingua into the target language.
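The analyzer/generator split can be sketched with a toy "Interlingua" made of language-neutral concept identifiers. The vocabularies, concept names and function names below are hypothetical, and a real Interlingua would have to represent structure and meaning far beyond single words.

```python
# Hypothetical analysis tables: word -> language-neutral concept id.
ANALYSIS = {
    "en": {"water": "WATER", "bread": "BREAD"},
    "es": {"agua": "WATER", "pan": "BREAD"},
    "de": {"wasser": "WATER", "brot": "BREAD"},
}

# Generation tables are the inverse mapping: concept id -> word.
GENERATION = {
    lang: {concept: word for word, concept in table.items()}
    for lang, table in ANALYSIS.items()
}

def analyze(text, source_lang):
    """Analyzer: source text -> sequence of Interlingua concepts."""
    return [ANALYSIS[source_lang][w] for w in text.lower().split()]

def generate(concepts, target_lang):
    """Generator: sequence of Interlingua concepts -> target text."""
    return " ".join(GENERATION[target_lang][c] for c in concepts)

def translate(text, source_lang, target_lang):
    # Neither step knows anything about the other language.
    return generate(analyze(text, source_lang), target_lang)

print(translate("agua pan", "es", "de"))  # wasser brot
```

The design advantage is in the module count: n languages need n analyzers and n generators, whereas a transfer system needs a separate module for each of the n(n-1) translation directions. In this sketch, words outside the toy vocabulary raise a KeyError.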

An Interlingua variant is called knowledge-based machine translation (KBMT). The text is converted into an intermediate form independent of any specific language. The knowledge-based approach attempts, mostly in the fashion of knowledge engineering (KE) in traditional symbolic AI, to acquire and encode various kinds of knowledge (e.g., encyclopedic knowledge) for the purpose of disambiguation, but the source of knowledge remains a serious problem.

Using a rule-based approach to MT presents huge difficulties, because no human language conforms rigidly to a set of rules. Furthermore, translation necessarily involves more than one language, which adds further complexity. For these reasons, linguists have begun to look at alternative approaches to MT systems. Typically these make use of a corpus, or body, of already translated text, which acts as a pool of ready-made examples of language use. This approach is therefore called example-based machine translation (EBMT): the translation is produced by comparing the input with a corpus of typical translated examples, extracting the closest matches and using them as a model for the target text.
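The matching step of EBMT can be sketched with Python's standard difflib module. The example corpus is hypothetical, and the sketch only retrieves the closest stored example; a full EBMT system would go on to adapt and recombine the matched fragments into a translation of the actual input.

```python
from difflib import SequenceMatcher

# Hypothetical corpus of already translated examples (English, German).
EXAMPLES = [
    ("where is the station", "wo ist der Bahnhof"),
    ("where is the hotel", "wo ist das Hotel"),
    ("how much is the ticket", "wie viel kostet die Fahrkarte"),
]

def similarity(a, b):
    """Word-level similarity between two sentences, between 0.0 and 1.0."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()

def closest_example(query):
    """Return the (source, target) pair whose source side best matches the query."""
    return max(EXAMPLES, key=lambda pair: similarity(query.lower(), pair[0]))

source, target = closest_example("where is the train station")
print(source, "->", target)
```

For the query above the closest stored example is "where is the station", so its translation would serve as the model for the output; the unmatched word "train" is the part a real system would still have to translate and insert.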

A similar proposal is known as statistics-based MT. Since 1988, it has been suggested (IBM, AT&T, CMU) that it may be possible to construct machine translation systems automatically. Instead of codifying the human translation process from introspection, these researchers proposed machine learning techniques to induce models of the process from examples of its input and output. The proposal generated much excitement, because it held the promise of automating a task that forty years of research had proven to be very labor-intensive and error-prone. Teams at IBM, Johns Hopkins University, the University of Pennsylvania, USC/ISI and elsewhere have tried to improve upon this approach and achieve the promised results. Yet, as these systems have become increasingly complex and unintuitive, the improvements in the results have been minuscule.
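To illustrate what "inducing a model from examples" can mean, the sketch below estimates word translation probabilities from a toy parallel corpus with expectation-maximization, in the style of the simplest IBM word-alignment model (Model 1, without the NULL word). The corpus and function names are assumptions made for this example; it is not a description of any particular team's system.

```python
from collections import defaultdict

# Hypothetical toy parallel corpus: (German words, English words).
CORPUS = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]

def train_model1(corpus, iterations=20):
    """Estimate t[(e, f)], the probability that source word f translates as e."""
    # Start uniform over all word pairs that co-occur in some sentence pair.
    t = {(e, f): 1.0 for fs, es in corpus for e in es for f in fs}
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f being translated as e
        total = defaultdict(float)  # expected count of f being translated at all
        for fs, es in corpus:
            for e in es:
                norm = sum(t[(e, f)] for f in fs)
                for f in fs:
                    frac = t[(e, f)] / norm  # E-step: soft alignment of e to f
                    count[(e, f)] += frac
                    total[f] += frac
        for (e, f), c in count.items():      # M-step: renormalize per source word
            t[(e, f)] = c / total[f]
    return t

def best_translation(t, f):
    """Most probable target word for source word f under the learned table."""
    candidates = {e: p for (e, src), p in t.items() if src == f}
    return max(candidates, key=candidates.get)

table = train_model1(CORPUS)
for word in ("das", "haus", "buch", "ein"):
    print(word, "->", best_translation(table, word))
```

No dictionary is supplied: the pairing of "haus" with "house" emerges only because "das" is better explained by "the" in the other sentence pairs. The same property explains the cost noted for such systems, since every sentence pair is revisited on every iteration.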

Today, there are several approaches to building hybrid systems that use some of the features from each approach or even from other fields of information retrieval such as latent semantic analysis, aligned bitexts, probabilistic parsers, vector space distance, etc. Most of these efforts are focused on improving specific tasks without modifying the overall approach. Language Weaver was founded in 2002 to productize and commercialize a statistical MT approach. Its proprietary statistical translation technology is the result of twenty person-years of invention and development at the University of Southern California's Information Sciences Institute (USC/ISI) by Drs. Kevin Knight and Daniel Marcu, Language Weaver's founders, and their students. The company can now also rely on a strategic investment from In-Q-Tel Inc., the private venture group funded by the Central Intelligence Agency (CIA).

At the Technical University of Aachen, as well as at the University of Koeln and City University of Hong Kong, several researchers are trying combinations of probabilistic models with EBMT and other more traditional approaches.

Other interesting research in new directions is being conducted by Sehda, Inc. Sehda's phrase-based technology allows them to represent the information contained in texts and conversations potentially containing several billions of words with a thesaurus containing only thousands of entries. Due to this enormous reduction in complexity, they believe their ability to create a natural language system can be significantly advanced, the cost of system building significantly reduced, while the efficiency of the recognition engine is substantially increased.

Given the complex solutions offered by all of the approaches considered, FAHQT (fully automated human quality translation) as a goal has been considered to have a very low ROI (return on investment). Even the probabilistic approaches have hardware and software requirements so high that only a few organizations can afford to experiment with them. Even IBM's resources were not enough: all sentences longer than 30 words were excluded from their training set because "decoding them would take too long". Human-crafted rules, on the other hand, are considered to require an effort on the order of 500 to 1,000 person-years to create a system capable of translating any kind of text. Building a specialized bilingual system (on the order of 10,000 concepts) would require approximately 100 person-years.

A fresh approach is offered by Fluent Machines (Meaningful Machines), a startup in New York City. Dr. Jaime Carbonell is quoted as saying: "[Fluent Machines'] Method is clearly the most promising and theoretically important MT development in the past several years (and probably since the advent of MT itself). It is the one recent development with the greatest possibility of making a major advance in practical large-scale diffusion of MT technology."

The technology focuses on identifying the basic "building blocks of language". Once this is done, any sentence or phrase can be created by linking together the correct sequence of language building blocks. The Fluent Machines system breaks down a source language into its components and reassembles them in the target language using two processes: the first automatically builds large cross-language databases of basic word-string combinations, and the second accurately reassembles word-string translations across a pair of languages, thereby producing translated text. This system is described in more detail on the Fluent Machines website.

Microsoft, which has built one of the largest research teams in natural language processing and machine translation, is making progress with its hybrid approach, which builds knowledge from parallel text but also uses parsing, dictionaries and a kind of interlingua called LF (logical forms). The LFs are used to generate the target language text.

ESTeam too has an approach which takes advantage of pre-existing translation memories. Their system is a Machine Translation module, which can be used together with their Translation Memory module. They claim that processing, once the language resources have been gathered, is fairly language independent, enabling the fast expansion of the MT to accommodate new languages and support all the new translation directions.

MT links: