GSoC/GCI Archive
Google Code-in 2013 Apertium

scrape a freely available dictionary using tesseract (Crimean Tatar and Russian) [0]

completed by: Pylypchuk Ljudmyla

mentors: Jonathan Washington, Francis Tyers

Use tesseract to scrape a freely available dictionary that exists in some image format (pdf, djvu, etc.). Be sure to scrape grammatical information if available, as well as stems (e.g., some dictionaries might provide entries like АЗНА·Х, where the stem is азна), and all possible translations. Ideally the output should resemble bidix format, but if there's no grammatical information and no way to guess at it, a flat machine-readable format is fine.