GSoC/GCI Archive
Google Summer of Code 2014

Association Tatoeba

License: GNU General Public License version 3.0 (GPLv3)

Web Page: http://en.wiki.tatoeba.org/articles/show/gsoc_ideas

Mailing List: https://groups.google.com/forum/#!forum/tatoebaproject

Tatoeba is a libre/free database of example sentences translated into many languages. Our goal is to create a resource for people studying languages, whether to learn them or to research them. The database is currently used as a source of example sentences by free dictionaries and language learning websites (such as Jim Breen’s WWWJDIC; Jim Breen is also a member). It is also a rich resource for language learners, who can find out how to use words or how to translate grammatical constructs and idioms. The main site currently has about 1 million page views and 250 thousand unique visitors monthly, as reported by Google Analytics, and the corpus is growing steadily by 3% or more every month.

Projects

  • Administrative scripts and better export scripts: The aim of this project is to write clean administrative scripts that ease the task of setting up a development or production server from scratch and automate basic administrative tasks such as backup, restore, import, and export. It also includes writing other scripts that perform important tasks specific to the website. In addition, the project improves the existing export scripts to include important features that are currently missing.
  • Complete Python Rewrite of Tatoeba and Revamping of Its Architecture: A rewrite of the codebase in a higher-level language and framework will make it more maintainable, cut down on development time, and attract more developers. A move towards a graph database, or the use of graph algorithms on top of a relational database, will also greatly reduce server load and improve page response time. Finally, an API will greatly reduce the complexity of interacting with the website.
  • Export to Anki: A user uploads their Anki deck, and Tatoeba uses that data to generate a list of cards to add, based on i+1 (sentences that introduce just one item the learner does not yet know) and other requirements.
  • Mass Importing Sentences from Open texts to Fill Gaps in Tatoeba: A mass import system that mines sentences from open texts in different languages, with a public interface (a responsive, mobile-friendly site) through which quality sentences selected by the system are proofread by the crowd before being submitted to the database. Parallel texts are also aligned, and their sentences paired, when good confidence levels are achieved.
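As an illustration of the i+1 idea in the Export to Anki project, here is a minimal sketch in Python. It assumes that "i+1" means a sentence containing exactly one word the learner has not yet seen, and that the words in a user's deck are already available as a plain set. The function names (`tokenize`, `i_plus_one`) and the sample data are hypothetical and are not part of Tatoeba or Anki.

```python
import re


def tokenize(sentence):
    """Lowercase the sentence and split it into word tokens.

    A crude stand-in for real tokenization, which would have to be
    language-specific (this regex does not segment Japanese, for example).
    """
    return re.findall(r"\w+", sentence.lower())


def i_plus_one(sentences, known_words):
    """Return (sentence, new_word) pairs with exactly one distinct unknown word."""
    known = {word.lower() for word in known_words}
    picks = []
    for sentence in sentences:
        unknown = {tok for tok in tokenize(sentence) if tok not in known}
        if len(unknown) == 1:
            picks.append((sentence, next(iter(unknown))))
    return picks


# Hypothetical data: words taken from a user's deck, and candidate sentences.
known = {"the", "cat", "is", "on", "a"}
corpus = [
    "The cat is on the mat.",                  # one unknown word: "mat"
    "The cat is on a table near the window.",  # three unknown words
    "The cat is on the cat.",                  # no unknown words
]

for sentence, word in i_plus_one(corpus, known):
    print(f"{word}: {sentence}")
```

Only the first sentence is selected, because it is the only one with exactly one unknown word. A real implementation would also need to handle the "other requirements" the project description mentions, which are not specified here.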