Google Summer of Code 2012

National Evolutionary Synthesis Center

Web Page: http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2012

Mailing List: mailto:phylosoc@nescent.org

NESCent facilitates synthetic research on grand challenge questions in evolutionary biology and also works to address critical needs in software infrastructure and education through promoting open, collaborative development of interoperable and standards-supporting open-source software. The Center is located in Durham, North Carolina, is jointly operated by Duke University, the University of North Carolina at Chapel Hill, and North Carolina State University, and receives its core funding from the National Science Foundation (NSF). As part of our cyberinfrastructure program, NESCent has run five collaborative software source code and vocabulary development sprints aimed at improving interoperability in phyloinformatics, engaging developers of scientific software tools, promoting integration among online data resources, and sustaining the development of shared vocabularies. These events, and our past Summer of Code participation, continue to have significant and lasting impacts on the landscape of collaborative software development in our field. The Center is committed to FLOSS and sharing of scientific data (see for example the NESCent Data and Software Policy); all software products of the Center are released as open source and established as collaborative projects on sites such as SourceForge, Google Code, and GitHub. Members of the Center's Informatics team are lead developers in several open-source projects, and one of our Assistant Directors has been active for ten years on the Board of the Open Bioinformatics Foundation, the umbrella organization for the Bio* projects.

Projects

  • A program to compute probabilities of ranked gene tree topologies in species trees
    A polynomial-time algorithm has been described for computing probabilities of ranked gene tree topologies given species trees. In a ranked gene tree topology, we distinguish the relative order of the times of nodes on the gene tree, but not the real-valued branch lengths. Once implemented, ranked gene tree probabilities could be used to infer species trees, although inferring species trees is beyond the scope of the project.
  • Google Maps like Matrix browsing
    Matrices are basic data stores used throughout evolutionary studies. Large (1000 by 1000), collaboratively built matrices of quantitative and qualitative data are becoming increasingly common on the web, and a generic mechanism for browsing and interacting with these data in the browser through a "windowing" mechanism would be broadly useful. The project focuses on building an interface that allows users to browse, view, and analyse large matrices stored on the web. This requires building a jQuery library that produces a Google-Maps-like, tile-based interface. Users can drag to the area that interests them and see the data there more clearly. To improve performance, neighbouring cells should be loaded in the background. Although the original idea is for the plugin to load and visualize large matrices, a well-built tool for this task could also display more complex objects (e.g. heat maps or highly detailed images), since these can be treated as matrices too. The aim is therefore to allow an open format for the cell values (e.g. an HTML div) that can be extended to hold any kind of data in the future.
  • MASTodon - a Java tool for the summary and visualization of large sets of phylogenetic trees
    MASTodon is a Java application that looks for common subtrees in large sets of phylogenetic trees. It provides a user-friendly graphical interface, automatic pruning algorithms, and powerful manual pruning options.
  • NeXML to MIAPA Mapping & ISAtab Transformation Project Plan
    MIAPA is a proposed minimum information standard for phylogenetic data sharing and reuse. Barriers to its adoption include a lack of detail and definition in the standard itself (a draft standard has only recently been produced) and a lack of tools supporting it. My project will focus on the latter. It will identify the data elements within NeXML that are MIAPA-compliant, extract them via XSL, and then transform them into the ISAtab format via Java and XSLT. This will facilitate MIAPA use within ISAtools, a software system for collection, sharing, and repository submission with built-in support for minimum information standards. I also propose extensions to the project in case I am ahead of schedule.
  • Optimizing R code for approximate bayesian computing
    Approximate Bayesian computing is becoming a more powerful approach for phylogenetic comparative methods. Unfortunately, there are no truly user-friendly open-source packages for running this often computationally intensive process. The R package TreEvo hopes to fill this void, but it contains a very large amount of code that may or may not function optimally. I propose to go through this code, testing and optimizing large loops, in order to speed up TreEvo's simulation-based Bayesian architecture.
  • Phylosoc 2012: Apply Machine Learning Algorithm(s) to Ecology Data
    With the prevalence of 16S sequence data, ecologists need to classify the different populations associated with different conditions. To this end, the goal of this project is to create a program that allows microbial ecologists to apply machine learning algorithms (e.g. random forests, classifiers) to microbial ecology data so that they can identify bacterial populations associated with differences between health and disease.
  • Phylowood.js: Browser-based Interactive Animations of Ancestral Dispersal and Diversity Patterns
    This project will develop an open-source JavaScript and D3.js package to generate interactive browser-based animations from phylogeographical and biogeographical datasets and inferences. The objective is to design a tool that facilitates scientific discourse, both among researchers and between scientists and the public.
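
As a rough illustration of the "windowing" idea in the matrix browsing project above, the sketch below works out which tiles of a large matrix overlap the current viewport and which neighbouring tiles could be loaded in the background. It is written in Python purely for illustration (the project itself targets a jQuery library), and the function names, the fixed square tile size, and the one-tile prefetch ring are all assumptions rather than details of the project.

```python
from typing import Set, Tuple

Tile = Tuple[int, int]  # (tile_row, tile_col); hypothetical addressing scheme


def visible_tiles(top: int, left: int, view_rows: int, view_cols: int,
                  tile_size: int, n_rows: int, n_cols: int) -> Set[Tile]:
    """Return the tiles that overlap the viewport, clipped to the matrix bounds.

    (top, left) is the matrix cell shown in the viewport's upper-left corner;
    view_rows and view_cols give the viewport size in cells.
    """
    if view_rows <= 0 or view_cols <= 0:
        return set()
    first_r = max(top, 0) // tile_size
    first_c = max(left, 0) // tile_size
    # Last visible cell index, clipped to the matrix, mapped to its tile.
    last_r = (min(top + view_rows, n_rows) - 1) // tile_size
    last_c = (min(left + view_cols, n_cols) - 1) // tile_size
    return {(r, c)
            for r in range(first_r, last_r + 1)
            for c in range(first_c, last_c + 1)}


def prefetch_tiles(visible: Set[Tile], tile_size: int,
                   n_rows: int, n_cols: int) -> Set[Tile]:
    """Return the one-tile ring around the visible tiles (assumed policy)."""
    max_r = (n_rows - 1) // tile_size
    max_c = (n_cols - 1) // tile_size
    ring: Set[Tile] = set()
    for r, c in visible:
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if 0 <= nr <= max_r and 0 <= nc <= max_c:
                    ring.add((nr, nc))
    # Tiles already on screen do not need to be prefetched.
    return ring - visible


if __name__ == "__main__":
    # A 1000 by 1000 matrix cut into 100 by 100 tiles, with a viewport of
    # 200 rows by 300 columns whose upper-left cell is (150, 250).
    shown = visible_tiles(150, 250, 200, 300, 100, 1000, 1000)
    print(sorted(shown))
    print(sorted(prefetch_tiles(shown, 100, 1000, 1000)))
```

When the user drags the view, a client could recompute both sets, request any visible tiles it has not cached, and queue the prefetch set at lower priority. Whether a one-tile ring is the right amount of look-ahead would depend on tile size and network latency.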