GSoC/GCI Archive
Google Summer of Code 2012

Apache Software Foundation

Established in 1999, the all-volunteer Foundation oversees nearly one hundred fifty leading Open Source projects, including Apache HTTP Server, the world's most popular Web server software. Through the ASF's meritocratic process known as "The Apache Way," more than 350 individual Members and 3,000 Committers successfully collaborate to develop freely available enterprise-grade software, benefiting millions of users worldwide: thousands of software solutions are distributed under the Apache License; and the community actively participates in ASF mailing lists, mentoring initiatives, and ApacheCon, the Foundation's official user conference, trainings, and expo. The ASF is a US 501(c)(3) not-for-profit charity, funded by individual donations and corporate sponsors including AMD, Basis Technology, Cloudera, Facebook, Google, IBM, HP, Hortonworks, Matt Mullenweg, Microsoft, PSW Group, SpringSource/VMware, and Yahoo!.


  • [XERCESJ-1130] Validating against multiple XML schemas with the same namespace
    Apache Xerces-J is a high-performance, fully compliant XML parser and XML Schema processor written in Java[1]. Xerces offers more features than the W3C specification requires, but some design limitations in its implementation remain to be resolved. This project addresses one of those limitations, “Validating against multiple XML schemas with the same namespace”[2], which will make life easier for users.
  • Add ODF 1.2 RDF Metadata Support to ODF Toolkit
    New in ODF 1.2 is the ability to annotate parts of the document and the document itself with arbitrary metadata using the W3C RDF standard. I'll design and develop a new feature of the ODF Toolkit library that allows adding arbitrary RDF graphs to the document or parts of it, identified by XML file with xml:id, summarized in a manifest.rdf, and transforming ODF XML metadata to RDF metadata.
  • Adding the Datetime Type as a Primitive for Pig
    Apache Pig is a platform for analyzing large data sets based on Hadoop. Currently Pig does not support a primitive datetime type, which is a desired feature. In this proposal, I explain my plan to implement the primitive datetime type, including the details of my solution and schedule. Additionally, I briefly introduce my background and my motivation for applying to GSoC'12.
  • Apache Stanbol: Entity Disambiguation
    I would like to work on entity disambiguation in Stanbol. Suppose I have "George Bush" in a sentence or document. There may be two matches: 1. George H. W. Bush - 41st US President 2. George W. Bush - 43rd US President. The idea is to enable the Stanbol enhancer to disambiguate between 1 and 2 on the basis of context from the document and the data associated with 1 and 2.
  • Automated Publishing of GeoRSS Data from OODT File Manager to SIS
    This proposal involves the 'Object Oriented Data Technology' (OODT) and 'Spatial Information Systems' (SIS) projects, both run by the Apache Software Foundation. The goal of this project is to establish a well-designed, formal, robust connection between OODT and SIS so that data from OODT that is tagged with geographical metadata can be read in and understood by SIS. This is expected to bring significant benefits to both the OODT and SIS communities.
  • CUBE operation in Pig
    Computing aggregates over a cube of several dimensions is a common operation in data warehousing. In Online Analytical Processing (OLAP) systems, a cube is a way of organizing data in N dimensions so as to perform analysis over some measure of interest. "Measure" is a term used to refer to numerical facts that can be algebraic (SUM, COUNT etc.) or holistic (DISTINCT, TOP-K etc.). The aim of this project is to provide support for cube computation over massive datasets using Apache Pig. It extends my current naïve implementation of the cube operator to support efficient cube computation for algebraic and holistic measures.
  • Developing a metadata model to manage photos in Apache Photark
    Implementing a proper way to manage photos by structurally maintaining the meta information of the photos in Photark.
  • Distributed mailbox indexing over HBase/HDFS
    Currently, James mailbox supports email indexing over Lucene; the directory implementation of the Lucene search and indexing relies on relational databases or file-system storage. As the number of indexes increases with the number of clients using the mailbox, indexing performance degrades. An implementation over a NoSQL database like HBase would solve this problem by distributing the indexes and documents on a system designed for large amounts of data.
  • Eclipse XML editor based lightweight Tapestry visual editor
    Apache Tapestry supplies a powerful component structure and offers a number of other powerful features that proved to be critical in our work. It allows a clean separation between Java and HTML, makes it possible for design work on the application to continue well after the code has been completed, and is becoming more and more popular. This project aims to build a lightweight Tapestry visual editor on the Eclipse XML editor.
  • FUNCTIONAL TESTING PLAN FOR THE OFBIZ
    The functional testing plan for the OFBIZ, which includes the estimated number of iterations and the tasks to be completed within a specific iteration.
  • Google Summer of Code: ODF Command Line Tools (Draft Proposal)
    The ODF Command Line Tools (OCLT) would be a suite of cross-platform utilities that would make it easy to do both simple and complex text manipulation tasks from the command line. The OCLT would be able to manipulate ODF documents both in terms of structure *and* content. The target audience of the OCLT would be command-line aficionados, power users, QA testers, web developers, and others who may be, but are not necessarily, professional developers.
  • Gora - Amazon DynamoDB datastore for Gora
    Provide a gora-amazondynamodb module for Gora in order to offer a popular datastore to the community. The main objective is to encourage the use of Gora within the open source community.
  • Hupa evolution
    Hupa is a rich IMAP-based webmail application written in GWT, but no release has been delivered yet. This proposal's aim is to make it fully functional by adding many missing features and replacing outdated dependencies with the new functionality in the latest GWT version.
  • Implement a Collection of Integration Patterns for Synapse
    The core idea of the project is to implement a library of built-in integration patterns for Synapse. The Enterprise Integration Patterns are well-defined solutions to recurring problems within a given SOA, ESB, or Web Service development. Each pattern is defined as a framework for designing and building messaging and integration systems, as well as a common language for architecting solutions. The purpose of the project is to strengthen Apache Synapse to support most of the Enterprise Integration Patterns from the book by Gregor Hohpe and Bobby Woolf.
  • Implement the W3C XML Digital Signatures for Widgets Specification in Apache Wookie
    Apache Wookie is a Java server application in incubation at the ASF. It allows administrators to upload and deploy widgets packaged according to the W3C Widgets specification. These widgets are then hosted in container applications, which connect to the Wookie server and let their users pick widgets to be added to their pages. This GSoC project aims to implement the W3C XML Digital Signatures for Widgets specification in Wookie. With this feature, organizations can automate the installation and updating of widgets by verifying the signature.xml of the widget to be deployed or updated.
  • Implementing a module for creating document thumbnails, displayed on the whiteboard, without using OpenOffice as a mandatory dependency. Project Apache OpenMeetings
    My goal is to improve Apache OpenMeetings performance by implementing a system for posting documents on the whiteboard without using OpenOffice.
  • Improve Derby's Code Coverage
    Apache Derby is an open source relational database implemented entirely in Java. The purpose of this project is to improve the code coverage of Derby by writing additional tests. It is important to have tests that cover the entire code base in order to ensure proper functionality. This project aims to identify classes with poor code coverage and create tests to cover them.
  • Improve JSON support in Apache Axis2 with google-gson java library
    The existing JSON support in Apache Axis2/Java deals with the Badgerfish[1] format, which is an XML representation of a JSON object. On the server side it converts the Badgerfish string to the relevant XML format and handles it like a normal XML SOAP message. As described in the Axis2 documentation[1], it also partially supports the mapped format, which is another XML representation of a JSON object. This JSON support can be improved to handle pure JSON objects without using any format that converts them into XML, since JSON is a lightweight alternative to XML and the existing support doesn't expose its advantages at all. Here I propose a solution to improve the existing JSON support of Axis2, with some limitations, by using the google-gson[2] library, a rich library for converting a JSON string to a Java object and vice versa. As it is also released under the Apache license, we can use this library to improve JSON support in Axis2 and overcome the above problem.
  • Improvement of testing and adjusting of media content transmitting in Apache OpenMeetings conferences
    I want to make the Apache OpenMeetings software handier and more flexible to set up, which would improve the routine of video conferences. I want to add several new features connected with users’ custom preferences for media content transmission.
  • Improving drawing support for Apache POI HSSF module
    HSSF is the Apache POI Project’s implementation of the Excel binary file format (.xls). One of the drawbacks of the current implementation is its limited support for the Excel drawing layer.
  • Improvised code coverage for Derby
    This project aims at improving code coverage in Derby. Having good code coverage helps assure that the most important conditions are met and tested. Since Derby has lots of legacy code and is used in a variety of environments, including commercial deployments, it becomes more important to provide high code coverage for important classes.
  • Integrate Apache Sling to Apache Airavata Registry
    The project’s main idea is to implement Sling support for Apache Airavata. Currently the Airavata API is used to store and retrieve data from the Jackrabbit content repository. But it is not scalable when it comes to handling large amounts of data, because the session is kept on the client side. Sling supports accessing Jackrabbit in a RESTful way. So the basic requirement is to come up with another API which looks like the current API but uses Jackrabbit through the Sling interface.
  • LUCENE-3312 Break out StorableField from IndexableField
    Lucene is an open-source full-text search library written in Java and ported to many other languages. It relies on the concept of a document, the core Lucene holder of information that needs to be indexed or stored. This project will decouple indexing and storing operations over fields in a document, as well as separate the document classes used at indexing time and at search time.
  • LUCENE-3892: Add a useful intblock postings format
    My project aims to improve search performance by integrating a new postings format with a search engine library (Lucene). The work includes implementation of an encoder/decoder, modifications to the previous postings format interface, and performance comparisons between the new postings format and current approaches.
  • Message Queue component for Nuvem
    Apache Nuvem is an open application programming interface for common cloud application services, allowing applications to be easily ported across the most popular cloud platforms. The eventual goal of this project is to develop a standard vendor-independent message queue API for Nuvem.
  • Non rowkey predicate pushdown in Hive-HBase handler (Hive jira - 1643)
    This project aims at optimizing network IO in Hive queries on the HBase StorageHandler by converting non-row-key predicates into corresponding HBase filters and pushing them down in the query tree, and by using row-key predicates to determine scan ranges.
  • openmeetings plugin for zimbra
    The plugin would add two features to Zimbra: 1. There could be an OpenMeetings conference event in the Zimbra calendar. Every person invited to the calendar event receives a mail in Zimbra. The email will also have a link to enter the conference room. 2. A conference invitation could be attached to emails, with a link to enter the conference room.
  • Photo galleria: A hybrid mobile app developed using Phonegap
    There is no hybrid mobile app that lets a user view, in one place, all the photos stored on various online social networking and storage sites. The goal of this project is to solve that problem. Through this project, I intend to develop a hybrid app that will stream all the photos a user has uploaded to sites such as Facebook, Picasa, and Flickr by making use of the APIs of these sites. This app will be developed using Phonegap.
  • PIG: Move Grunt from Javacc to ANTLR
    At the end of the program, I will deliver a patch with the following set of changes. 1) Refactored Grunt parser code to use ANTLR instead of javacc. 2) Updated unit tests and e2e tests to accommodate the above changes. 3) Build script changes to remove the javacc parser generator code and replace it with ANTLR codegen. 4) Removal of Pig’s dependency on the javacc library, as it would no longer be used.
  • Provide cloud bursting like capabilities to Airavata computational workflows integrating with Apache Whirr
    Apache Airavata provides capabilities to construct, execute and monitor computational workflows, with built-in support for executing compute-intensive applications on grid computing resources and cloud resources. This project aims to implement support for cloud (grid) bursting for compute-intensive Hadoop jobs. Migration of partially completed Hadoop jobs between cloud (grid) resources and a local cluster, in either direction, will also be implemented as part of the project.
  • Provide JSR 223 implementation for Velocity
    Apache Velocity is a well-known Java-based template engine. It’s a powerful development tool that makes it easy to create and render documents that format and present data. For such a widely used product, JSR 223 API support would be a valuable addition.
  • REST Interface for Libcloud
    Apache Libcloud is a standard Python library that abstracts away differences among multiple cloud provider APIs. Currently Libcloud has a big limitation: you can only use it with Python. I would like to implement an HTTP interface for Libcloud which exposes all the Libcloud functionality through a RESTful API. Adding a REST interface would allow users to leverage Libcloud functionality from any language that knows how to talk HTTP.
  • Sparse Matrix-Vector multiplication (SpMV) on Hama
    This proposal is about implementing SpMV with the Hama Bulk Synchronous Parallel framework.
  • Statistical Inference Operator on Pig (List Operators)
    The use of statistical inference methods on a set of data provides the most basic analytical intuition of how the collection of data could be summarized: for instance, a sequential identifier according to a specific order, density among the elements of particular subgroups, distribution over partitions, and so on. These functionalities imply efficiency and reliability in operations performed on large datasets. Currently, statistical operations on large data sets can be done with SQL instructions, by means of a DBMS or some other framework based on SQL statements; this is done through countless complicated nested queries, raising performance concerns for users unfamiliar with SQL. On a high-level layer, it is also possible to have a functional implementation of these statistical operations, tightly tied to the performance of each language and to portability concerns among different DBMSs. On an intermediate layer, it is also possible to recreate these methods through a sequence of operators in Pig, without the complexity of SQL statements and with the main advantage of running on a distributed platform. In this sense, one feasible improvement to Pig is to provide a set of named operators that implement the statistical inference methods with a standard performance level. This improvement offers a functional integration for users experienced with statistical frameworks who do not know SQL or any programming techniques.
  • VXQuery Built On Parallel Processing (Hyracks)
    A VXQuery ticket outlines the desire to build an XQuery process that supports a parallel runtime environment. According to a recent conversation with VXQuery committers, they have several pieces ready (Algebricks operators) and just need the first-order functions that actually deal with XQuery processing.
  • Workflow Debugging Framework for Apache Airavata
    The workflow debugging framework is a new feature for Apache Airavata, proposed by the Airavata community. The XBaya GUI of Airavata has a debugging feature, but it is limited to executing the workflow in the XBaya GUI. The new debugging module will therefore be implemented at the backend of Airavata, so that the XBaya GUI can connect to it and debug executions on the server. With this proposal I’m proposing a plan to develop the identified initial requirements of this feature.
  • WSDL customization framework for Axis2 and stabilize WSDL 2.0 features.
    Apache Axis2 is an open source framework which acts as an engine for Web Services, SOAP and WSDL. It supports automatic generation and publishing of both WSDL 1.1 and WSDL 2.0 documents at runtime. As it stands, Axis2 does not provide any API to customize and control the behavior of WSDL generation. The main objective of this proposal is to provide an easy-to-use WSDL customization framework for both WSDL 1.1 and WSDL 2.0. As of yet, Axis2 is the only known web service implementation that supports WSDL 2.0. But Axis2’s WSDL 2.0 features are not as up to date as its WSDL 1.1 features. Some of the issues identified are: not supporting the “useOriginalWsdl” parameter, not supporting the WSDLSupplier interface, and not supporting WS-Policy publishing. Addressing these limitations is another significant goal of this proposal. Beyond these two goals, the GSoC project idea pages mention a number of issues as optional tasks. Addressing some of those issues is also included in the scope of this proposal.
  • XALAN - Add More EXSLT Functions
    Implement a more complete set of EXSLT functions in the Xalan-C XPath environment.
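
For readers unfamiliar with the cube computation mentioned in the "CUBE operation in Pig" entry above, the following is a minimal sketch in plain Python of what a naive cube with an algebraic measure (SUM) computes. It is not Pig code and does not reflect the proposal's implementation; the function name `cube_sum`, the field names, and the sample rows are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def cube_sum(rows, dims, measure):
    """Naive cube: SUM of `measure` for every subset of `dims`.

    Hypothetical helper for illustration only. Returns a dict keyed by
    (grouping, key), where `grouping` is a tuple of dimension names and
    `key` is the tuple of values a row has for those dimensions.
    """
    totals = defaultdict(int)
    # Enumerate all 2^N groupings, from the empty grouping (grand total)
    # up to the full set of dimensions.
    for size in range(len(dims) + 1):
        for grouping in combinations(dims, size):
            for row in rows:
                key = tuple(row[d] for d in grouping)
                totals[(grouping, key)] += row[measure]
    return dict(totals)


# Hypothetical sample data: two dimensions and one measure.
rows = [
    {"region": "east", "product": "a", "sales": 10},
    {"region": "east", "product": "b", "sales": 5},
    {"region": "west", "product": "a", "sales": 7},
]
totals = cube_sum(rows, ("region", "product"), "sales")
```

This sketch rescans the input once per grouping, which is the kind of cost an efficient implementation would aim to avoid. SUM can be merged from partial results; holistic measures such as DISTINCT cannot be combined this way and need a different strategy.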