GSoC/GCI Archive
Google Code-in 2011 MoinMoin Wiki

Research existing MS Office text extractors

completed by: qxcv

mentors: Bastian Blank, Reimar Bauer, Thomas Waldmann, Prashant Kumar, Eugene Syromyatnikov

Abstract

Research existing solutions for extracting text from proprietary Microsoft file formats.

Details

For moin2, we already have quite a few converters (including one for Open Document Format [OpenOffice / LibreOffice]), but nothing for Microsoft Office formats. We now need a survey of GPL2+-compatible code that can extract text from these proprietary file formats.

We need to know:

  • is the license compatible with GPL2+?
    • for Python libraries, e.g. GPL, BSD, MIT, ... (not: Apache License 2)
    • in general: a free software license, not any proprietary license
  • the programming language used
    • library code in Python is strongly preferred (we can just call it)
    • a command-line tool that we can call as a subprocess might also work (which platforms does it support?)
  • Windows-only solutions are not wanted
  • compatibility with different file formats (mainly Word, but also Excel and PowerPoint)
  • compatibility with different format versions (e.g. .DOC and .DOCX)
  • reliability (is it well-maintained code, is it recently updated?)
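
To make two of the criteria above more concrete, here is a minimal sketch (not part of the original task, and not moin2 code). It assumes Python 3 and shows (a) how the legacy binary formats and the newer XML-based formats can be told apart by their leading bytes, and (b) what "calling a command-line tool as a subprocess" could look like. The function names sniff_container and run_extractor are hypothetical, and no particular extractor tool is implied; the caller supplies the full command line.

```python
import subprocess
import sys

# Leading bytes of the two container formats used by MS Office files.
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy .doc/.xls/.ppt
ZIP_MAGIC = b"PK\x03\x04"                          # .docx/.xlsx/.pptx


def sniff_container(data):
    """Return 'ole2', 'ooxml-zip' or 'unknown' for the first bytes of a file."""
    if data.startswith(OLE2_MAGIC):
        return "ole2"
    if data.startswith(ZIP_MAGIC):
        # Any ZIP archive matches here, so "OOXML" is only a first guess.
        return "ooxml-zip"
    return "unknown"


def run_extractor(argv, timeout=30):
    """Run an external extractor and return its stdout as text.

    argv is the complete command line as a list. Which tool and which
    options to use is left to the caller; nothing here is tool-specific.
    """
    result = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Stand-in for a real extractor: a Python one-liner that prints text.
    print(run_extractor([sys.executable, "-c", "print('extracted text')"]))
    print(sniff_container(b"PK\x03\x04..."))
```

A library written in Python would avoid the subprocess step entirely, which is why the list above prefers it; the subprocess route additionally depends on the tool being installed on every platform the wiki runs on.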

Deliverable: wiki page

Benefits

Many Moin users would like to have a platform-independent, pure-Python way to extract text for indexing.

Researching the existing code base is a first step in this direction.

Skill Requirements

You'll need to do a lot of searching on the web. Discuss with the moin devs on IRC.

This task refers to moin2 (http://moinmo.in/MoinMoin2.0)!

You can discuss this issue in the MoinMoin wiki: http://moinmo.in/EasyToDo/TextExtractors