GSoC/GCI Archive
Google Summer of Code 2012

Open Bioinformatics Foundation

Web Page: http://www.open-bio.org/wiki/Google_Summer_of_Code

Mailing List: http://lists.open-bio.org/mailman/listinfo/gsoc

The Open Bioinformatics Foundation (OBF) is a nonprofit, volunteer-run organization focused on supporting open-source programming in bioinformatics. It acts as an umbrella organization for the BioPerl, Biopython, BioJava, BioRuby, BioSQL, and BioLib projects, and organizes conferences and workshops to promote and support open-source bioinformatics.

Projects

  • Diff My DNA: Development of a Genomic Variant Toolkit for Biopython
    Genomic variants are small alterations in DNA sequences that often have important biological significance. A Biopython toolkit that facilitates working with genomic variant files will make these datasets more accessible to the biology and bioinformatics communities.
  • Multiple Alignment Format parser for BioRuby
    The MAF (Multiple Alignment Format) file format has become popular in bioinformatics for representing similarities, in the form of sequence alignments, between multiple whole genomes. Such multiple sequence alignments enable many kinds of analysis, including examining the phylogenetic relationships between organisms, identifying conserved regions of DNA that may indicate functionally important sequences, and making inferences about genomes by reference to more fully annotated genomes (Blankenberg et al., 2011). Although BioRuby supports multiple sequence alignments through the bio-alignment gem, it does not currently support the MAF format. This project aims to rectify that by creating a bio-alignment plugin that provides full MAF support through a native BioRuby interface. This is particularly important because MAF files can run to hundreds of gigabytes and are often queried and filtered in complex ways; programmatic access to them is valuable for the same reasons that programmatic access to databases is indispensable. Native BioRuby support for MAF will thus make the sizable BioRuby toolset usable with this data, and make BioRuby a more viable tool for an important class of problems.
  • Robust and fast parallel BAM parser in D for binding against dynamic languages
    The SAM/BAM data formats have become ubiquitous in biomedical research because of the wide use of next-generation sequencing. Existing tools either do not use parallelism or are hard to bind against commonly used dynamic languages. This project aims to fill this gap.
  • SearchIO Implementation in Biopython
    Biopython is a widely used Python-based toolset for working with biological data. It was built mainly to simplify biological data analysis workflows, for example by providing parsers for various data file formats, wrappers for command-line programs, and interfaces for remote data sources. However, it still lacks a common framework for interacting with the outputs of sequence search programs. These programs allow similarity-based searching across various biological sequence databases, a task inseparable from modern biology research. Unfortunately, extracting information from their outputs is often difficult because of the volume of results produced and the density of the information packed into them. To solve this problem, this project aims to add a submodule called SearchIO to Biopython. SearchIO will allow more systematic information extraction from sequence-search programs’ outputs and easier interaction with various output formats through a common programming interface.
  • The world's fastest parallelized GFF3/GTF parser in D, and an interfacing biogem plugin for Ruby
    This project is about creating the fastest parallel GFF3/GTF parser using the low-level, next-generation D programming language, together with an interface plugin for BioRuby. Beyond delivering a state-of-the-art GFF3/GTF parser, the project will also provide a proof of concept for building fast parallelized parser and algorithm implementations and making them available in higher-level scripting languages. It will also lay the groundwork for the new BioLib/HPC library.
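
As a rough illustration of the kind of programmatic access the MAF project above describes, the following is a minimal sketch of a streaming MAF reader. It is written in Python rather than Ruby purely for illustration and is not the project's bio-alignment plugin. The names MafSequence, MafBlock, iter_maf_blocks, and blocks_with_source are hypothetical, and the sample data is invented. The sketch assumes the standard MAF layout, in which an "a" line opens an alignment block and each "s" line carries the fields src, start, size, strand, srcSize, and text; other line types are ignored.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Iterator, List


@dataclass
class MafSequence:
    """One 's' line of a MAF block (hypothetical container for this sketch)."""
    src: str        # source name, conventionally "assembly.chromosome"
    start: int      # zero-based start of the aligned region in the source
    size: int       # number of non-gap characters in the aligned text
    strand: str     # "+" or "-"
    src_size: int   # total length of the source sequence
    text: str       # aligned text, with "-" marking gaps


@dataclass
class MafBlock:
    """One alignment block: the 'a' line attributes plus its 's' lines."""
    attributes: Dict[str, str] = field(default_factory=dict)
    sequences: List[MafSequence] = field(default_factory=list)


def iter_maf_blocks(lines: Iterable[str]) -> Iterator[MafBlock]:
    """Yield blocks one at a time so the whole file never sits in memory."""
    block = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("#"):
            continue  # header and comment lines carry no alignment data
        if not line:
            # A blank line terminates the current block, if any.
            if block is not None:
                yield block
                block = None
            continue
        fields = line.split()
        if fields[0] == "a":
            if block is not None:
                yield block
            # Attributes on the 'a' line are key=value pairs, e.g. score=10.0.
            block = MafBlock(dict(f.split("=", 1) for f in fields[1:]))
        elif fields[0] == "s" and block is not None:
            src, start, size, strand, src_size, text = fields[1:7]
            block.sequences.append(
                MafSequence(src, int(start), int(size), strand, int(src_size), text)
            )
        # Other line types ('i', 'e', 'q') are skipped in this sketch.
    if block is not None:
        yield block


def blocks_with_source(blocks: Iterable[MafBlock], prefix: str) -> Iterator[MafBlock]:
    """Keep only blocks containing a sequence whose source starts with prefix."""
    return (b for b in blocks if any(s.src.startswith(prefix) for s in b.sequences))


# Invented sample data, for illustration only.
SAMPLE = """##maf version=1
a score=10.0
s hg18.chr7 100 8 + 158545518 ACGT-ACGT
s mm9.chr6 200 7 + 149517037 ACGT-AC-T

a score=5.0
s hg18.chr7 500 4 + 158545518 GGCC
s panTro2.chr7 700 4 - 160261443 GGCC
"""

all_blocks = list(iter_maf_blocks(SAMPLE.splitlines()))
mouse_blocks = list(blocks_with_source(all_blocks, "mm9."))
```

Because iter_maf_blocks is a generator, it can be pointed at an open file handle and combined with filters such as blocks_with_source without loading the file. A real tool for files of hundreds of gigabytes would likely also need an index to support random access, which this sketch does not attempt.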