Machine translation of human language is a long-standing and elusive goal in computer science. Since the earliest translation efforts in the 1950s, it has been widely assumed that, in order to translate a sentence from (for instance) Chinese into English, a system must first analyze the Chinese sentence, extract its meaning, and then reformulate that meaning in English. This approach requires solving numerous hard problems such as lexical, syntactic, and semantic analysis; knowledge representation; and language generation. Systems based on it require intensive knowledge engineering, many years, and millions of dollars to develop.
The internet has made a new resource available to translation researchers: millions of sentences of already translated data. Books and web sites are published in multiple languages. Multilingual governments and news agencies generate tremendous volumes of translated text as a byproduct of their day-to-day activities.
The widespread availability of translated texts has enabled a new approach known as statistical machine translation. In this approach, we view translation as a machine learning problem: we present a learning algorithm with a corpus of existing translations, and the trained system then produces translations for new input sentences. Knowledge engineering is no longer necessary; data is the only requirement. Many state-of-the-art research systems are now based on statistical methods. In principle, we can use this approach to create a translator for a new language pair in a single day.
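To make the "learning from existing translations" idea concrete, here is a minimal sketch of one of the simplest statistical translation models, IBM Model 1, trained with expectation-maximization. The toy Spanish-English corpus, the function name, and all parameter choices are illustrative assumptions, not material from the talk:

```python
from collections import defaultdict

def train_ibm_model1(corpus, iterations=10):
    """Estimate word-translation probabilities t(f | e) from sentence pairs
    (a toy EM training loop in the style of IBM Model 1)."""
    # Foreign-side vocabulary, used to initialize t(f | e) uniformly.
    f_vocab = {f for fs, _ in corpus for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts c(f, e)
        total = defaultdict(float)  # expected counts c(e)
        # E-step: collect expected alignment counts from every sentence pair.
        for fs, es in corpus:
            for f in fs:
                z = sum(t[(f, e)] for e in es)  # normalization for word f
                for e in es:
                    delta = t[(f, e)] / z
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: re-normalize the expected counts into probabilities.
        t = defaultdict(lambda: 1.0 / len(f_vocab),
                        {(f, e): count[(f, e)] / total[e] for (f, e) in count})
    return t

# A toy "already translated" corpus: pairs of (foreign, English) token lists.
corpus = [
    (["la", "casa"], ["the", "house"]),
    (["la"], ["the"]),
    (["casa", "verde"], ["green", "house"]),
]
t = train_ibm_model1(corpus)
```

Even on three sentence pairs, EM breaks the initial symmetry: because "la" co-occurs only with "the", the model is pushed to pair "casa" with "house" and, in turn, "verde" with "green", with no hand-built dictionary involved.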
In this talk I will present a tutorial overview of statistical machine translation, including current best methods, open problems, and future directions.