Google Summer of Code 2009 Organization The Apertium Project

The Apertium Project

Web Page: http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code

Mailing List: apertium-stuff@lists.sourceforge.net

The Apertium project (http://www.apertium.org) works on open-source machine translation and language technology. We try to focus our efforts on lesser-resourced and marginalised languages, but also work with larger languages. The project, including language data, translation engine and auxiliary tools, is being developed at several universities and companies around the world; most of the development on the engine is done by the Transducens research group of the Universitat d'Alacant (Alacant, Spain) and Prompsit Language Engineering. There are currently 17 published language pairs within the project (including a number of "firsts", for example Spanish-Occitan and Basque-Spanish), and several more in development.

Projects

Apertium going SOA
To translate many documents, many Apertium processes are created, and each one loads dictionaries, transducers, grammars etc. from scratch, which wastes resources and reduces scalability. A solution is to implement an Apertium server that does not need to reload all the resources for every translation task. Such a service could handle multiple requests at the same time, would scale better, and could be included in existing business processes with little effort.
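The idea of loading resources once and serving many requests from them can be sketched as follows. This is only a minimal Python illustration under stated assumptions, not the proposed server: `ResourceCache`, `toy_loader`, `translate` and `translate_many` are hypothetical names, and the word-for-word lookup stands in for the real translation pipeline.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ResourceCache:
    """Loads the (hypothetical) resources for a language pair at most once."""

    def __init__(self, loader):
        self._loader = loader      # expensive function: pair -> resources
        self._resources = {}
        self._lock = threading.Lock()
        self.load_count = 0        # how many times the loader actually ran

    def get(self, pair):
        # The lock ensures concurrent requests never trigger a second load.
        with self._lock:
            if pair not in self._resources:
                self._resources[pair] = self._loader(pair)
                self.load_count += 1
            return self._resources[pair]


def toy_loader(pair):
    # Stand-in for reading dictionaries, transducers and grammars from disk.
    if pair == "es-ca":
        return {"hola": "hola", "mundo": "món"}
    return {}


def translate(cache, pair, text):
    # Toy word-for-word lookup; a real engine would run its full pipeline here.
    table = cache.get(pair)
    return " ".join(table.get(word, word) for word in text.split())


def translate_many(cache, pair, texts, workers=4):
    # Serve several requests concurrently against the same loaded resources.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda t: translate(cache, pair, t), texts))
```

With this sketch, calling `translate_many(cache, "es-ca", [...])` on any number of texts runs the loader once, which is the saving the proposal is after.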
Apertium nb2nn: machine translation between Norwegian Bokmål and Nynorsk
Norwegian has two written variants, Bokmål and Nynorsk. The Apertium translation pair for Bokmål-Nynorsk has many available resources (morphological dictionaries, CG tagger), but work is needed to make it a full-blown translation system. This involves expanding the dictionaries with closed-class words and adjectives, writing transfer rules for the structurally different noun phrases, and improving the coverage of the translation dictionaries.
Apertium-sv-da: Machine translation between Swedish and Danish
I'm a pharmacy student from Denmark interested in languages and computers. I would like to develop the Swedish-Danish language pair for the Apertium Project.
Conversion of Anubadok: Building an English-Bengali Language Pair For Apertium
Anubadok is an experimental English-to-Bengali machine translation system developed by G M Hossain. I'm going to port the existing Anubadok system to the framework of Apertium. I plan to implement a Bengali morphological generator. The tagging system also needs to be standardized and then a new transfer system can be written. I think Apertium’s flexible framework will allow me to implement a language pair as good as Anubadok.
Highly scalable web service architecture for Apertium
A web service application that lets programmers access, from their desktop or web applications, the same operations that are available in a local installation of Apertium. It is intended to support high loads by scheduling and prioritizing pending translations according to the server-side resources available, and by balancing load across a static or dynamic number of servers. The availability of highly scalable web services for Apertium will catalyze its worldwide adoption.
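One way to picture the scheduling and load balancing described above is a priority queue of pending jobs combined with least-loaded server selection. The following Python sketch is an assumption made for illustration, not the proposed architecture: the class `TranslationScheduler`, its methods, and the use of text length as a load estimate are all hypothetical.

```python
import heapq
import itertools


class TranslationScheduler:
    """Priority queue of pending jobs plus least-loaded server selection."""

    def __init__(self, servers):
        # Assumed load measure: pending characters assigned to each server.
        self._load = {name: 0 for name in servers}
        self._queue = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order

    def submit(self, text, priority=10):
        # A lower number means a more urgent job.
        heapq.heappush(self._queue, (priority, next(self._order), text))

    def dispatch(self):
        """Assign the most urgent pending job to the least-loaded server."""
        if not self._queue:
            return None
        _, _, text = heapq.heappop(self._queue)
        server = min(self._load, key=self._load.get)
        self._load[server] += len(text)
        return server, text

    def finished(self, server, text):
        # Release the load once the server reports the job as done.
        self._load[server] -= len(text)
```

A dynamic pool of servers could be modelled by adding or removing keys in the load table; that part is left out here to keep the sketch short.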
Implement a Trigram Tagger for Apertium and support-tools for training it
Implement a part-of-speech tagger using a second-order hidden Markov model and the Viterbi algorithm, together with several training algorithms: maximum likelihood estimation (MLE), Kupiec's method, Baum-Welch expectation maximization, and parameter smoothing (of state-to-state transition and emission probabilities). Provide tools to train the trigram tagger based on both source- and target-language information. Integrate the Baum-Welch and supervised methods implemented in att-tools into the Apertium bigram tagger.
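To make the second-order model concrete, here is a minimal Python sketch of supervised MLE training with add-alpha smoothing, followed by Viterbi decoding over pairs of tags. It is an illustration under assumptions, not the tagger this proposal would produce: the names `train_mle` and `viterbi`, the `<s>` start symbol and the choice of add-alpha smoothing are all hypothetical, and Kupiec's method and Baum-Welch are not shown.

```python
import math
from collections import Counter

START = "<s>"  # assumed padding symbol for the two positions before a sentence


def train_mle(tagged_sentences, alpha=1.0):
    """Estimate add-alpha smoothed trigram transition and emission log-probabilities."""
    tri = Counter()        # (t[i-2], t[i-1], t[i]) -> count
    context = Counter()    # (t[i-2], t[i-1]) -> count
    emit = Counter()       # (tag, word) -> count
    tag_count = Counter()
    vocab = set()
    for sentence in tagged_sentences:
        prev2, prev1 = START, START
        for word, tag in sentence:
            tri[(prev2, prev1, tag)] += 1
            context[(prev2, prev1)] += 1
            emit[(tag, word)] += 1
            tag_count[tag] += 1
            vocab.add(word)
            prev2, prev1 = prev1, tag
    tags = sorted(tag_count)

    def log_trans(prev2, prev1, tag):
        num = tri[(prev2, prev1, tag)] + alpha
        return math.log(num / (context[(prev2, prev1)] + alpha * len(tags)))

    def log_emit(tag, word):
        # One extra vocabulary slot is reserved for unknown words.
        num = emit[(tag, word)] + alpha
        return math.log(num / (tag_count[tag] + alpha * (len(vocab) + 1)))

    return tags, log_trans, log_emit


def viterbi(words, tags, log_trans, log_emit):
    """Return the most probable tag sequence under the second-order HMM."""
    if not words:
        return []
    # best[(previous tag, current tag)] = (log-probability, best path so far).
    # Paths are stored directly instead of back-pointers to keep the sketch short.
    best = {(START, START): (0.0, [])}
    for word in words:
        new_best = {}
        for (prev2, prev1), (score, path) in best.items():
            for tag in tags:
                cand = score + log_trans(prev2, prev1, tag) + log_emit(tag, word)
                key = (prev1, tag)
                if key not in new_best or cand > new_best[key][0]:
                    new_best[key] = (cand, path + [tag])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]
```

Because each state is a pair of tags, one decoding step considers every (pair, next tag) combination, so the cost grows with the cube of the tagset size per word. That is the main practical difference from a bigram tagger.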
Java port of Apertium lttoolbox proposal
The Apertium project works on open-source machine translation and language technology. If Apertium is to work on platforms such as mobile phones one day, lttoolbox must be ported to Java. I have found a Java XML library that works similarly to the libxml2 library used in the C++ version of lttoolbox. This will make it easy to use the existing Java code.
Multi-Engine Machine Translation
I would like to build a multi-engine machine translation (MEMT) program that uses Moses and statistical merging with Apertium during the other half of the GSoC period. The goal is to provide better translations for all languages, but I will work on Welsh-English in particular.