GSoC/GCI Archive
Google Summer of Code 2013

Shogun Machine Learning Toolbox

Web Page: http://shogun-toolbox.org/page/Events/gsoc2013_ideas

Mailing List: http://news.gmane.org/gmane.comp.ai.machine-learning.shogun

SHOGUN is a machine learning toolbox designed for unified large-scale learning across a broad range of feature types and learning settings. It offers a considerable number of machine learning models, such as support vector machines for classification and regression, hidden Markov models, multiple kernel learning, linear discriminant analysis, linear programming machines, and perceptrons. Most of the specific algorithms can handle several different data classes, including dense and sparse vectors and sequences, using floating point or discrete data types. We have used this toolbox in several applications from computational biology, some with no fewer than 10 million training examples and others with 7 billion test examples. With more than a thousand installations worldwide, SHOGUN is already widely adopted in the machine learning community and beyond.

SHOGUN is implemented in C++ and provides interfaces to major languages, including MATLAB, R, Octave, Python, Lua, Java, C#, and Ruby, as well as a stand-alone command line interface. The source code is freely available under the GNU General Public License, Version 3 at http://www.shogun-toolbox.org.

During Summer of Code 2013 we are looking to extend the library in two different ways:

  1. Improving accessibility to shogun by improving i/o support (more file formats), developing machine learning demos, and integrating with mloss.org/mldata.org.
  2. Integrating existing and new machine learning algorithms.

A set of suggested projects is listed on the web page above.

Please use the scheme shown below for your student application. If you have any questions, ask on the mailing list (shogun-list@shogun-toolbox.org, please note that you have to be subscribed in order to post).

Projects

  • Fast Reading and writing of shogun features / objects in standard file formats
    Hello, I'm Evgeniy Andreev, a third-year undergraduate student at the Samara State Aerospace University. I participated in GSoC last year with a SHOGUN-related project. This year I want to solve issues related to I/O of shogun objects in various file formats.
  • Gaussian Processes for Classification
    Gaussian Processes provide a probabilistic approach to supervised machine learning. The SHOGUN Toolbox already implements a flexible Gaussian Process (GP) framework for regression. This project is about extending the existing GP framework to classification.
  • Implement algorithms for Blind Source Separation (BSS) and Independent Component Analysis (ICA) based on Approximate Joint Diagonalization (AJD) of matrices.
    ICA/BSS can be done via the approximate joint diagonalization (AJD) of matrices. AJD is the problem of finding a matrix, or set of basis vectors, that best diagonalizes a set of input matrices. It is an important tool that plays a critical role in many applications, including ICA and BSS. For machine learning in particular, ICA can be used for pre-processing, automatic feature selection, and dimensionality reduction for visualization. AJD would be a valuable addition to the SHOGUN toolbox.
  • Implement estimators of large-scale sparse Gaussian densities
    Computing the log-likelihood of a Gaussian distribution requires the log-determinant of the covariance (or precision) matrix. The usual approach, based on Cholesky factorization, often suffers from huge memory requirements caused by fill-in when the covariance matrix is large and sparse. This project aims to compute the log-determinant efficiently using techniques from numerical linear algebra and complex analysis. The objective is to approximate the matrix logarithm to arbitrary precision and evaluate the log-determinant with reduced memory requirements, gaining speed by computing the components involved in parallel.
  • Implement Metric Learning Algorithms with Applications to Metagenomics
    Metric learning algorithms constitute an interesting approach in which a transformation of the data is sought in order to maximize classification accuracy. This property, together with the use of the rather successful kNN algorithm, makes these algorithms suitable for real-world problems in fields such as bioinformatics. We aim to implement the large margin nearest neighbor classifier and expose it in an easy-to-use manner, contributing to the metagenomics research community.
  • Large-Scale Learning of General Structured Output Models
    This proposal focuses on two extensions of the current structured output framework in Shogun: 1) extending the framework to support general graphical models by introducing factor graphs and general MAP inference; 2) implementing online solvers for structural SVMs so that the framework can deal with large-scale problems.
  • My Proposal for "Large Scale Learning: loglinear time kernel expansion (aka Fast Food)"
    Hello, my name is Vangelis and this is my proposal for the Google Summer of Code 2013 concerning the "Large Scale Learning: loglinear time kernel expansion (aka Fast Food)" project from your ideas page. I hope you will find that I am well equipped to deal with this project and that we will spend the summer coding together!
  • Proposal for "Develop interactive machine learning demos"
    In this year's GSoC, I will contribute to shogun-toolbox by developing demos for all of its available algorithms. I have strong knowledge of and experience in website development and have studied some machine learning, so I meet all the requirements of this idea. I am passionate about machine learning and shogun-toolbox, and enthusiastic about open-source development. I'll devote myself to the project this summer for shogun!
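
As an illustration of the log-determinant project in the list above, here is a minimal Python sketch; it is not Shogun code. It relies on the identity log det(A) = tr(log(A)) for a symmetric positive definite matrix A and estimates the trace with random sign vectors (a Hutchinson-style estimator). All function names and the test matrix are hypothetical. The matrix logarithm is formed densely through an eigendecomposition purely to keep the example short; a large-scale method would instead approximate the products log(A) z without ever building log(A).

```python
import numpy as np


def logdet_cholesky(a):
    """Exact log-determinant of an SPD matrix via its Cholesky factor."""
    chol = np.linalg.cholesky(a)
    return 2.0 * np.sum(np.log(np.diag(chol)))


def logdet_trace_estimate(a, num_probes=200, seed=0):
    """Stochastic estimate of log det(a) = tr(log(a)).

    Assumption for illustration only: log(a) is built densely from an
    eigendecomposition. This defeats the memory savings the project
    targets but keeps the estimator itself easy to read.
    """
    rng = np.random.default_rng(seed)
    eigvals, eigvecs = np.linalg.eigh(a)
    log_a = (eigvecs * np.log(eigvals)) @ eigvecs.T

    n = a.shape[0]
    # Rademacher probes: E[z^T M z] = tr(M) for entries drawn from {-1, +1}.
    z = rng.choice([-1.0, 1.0], size=(n, num_probes))
    quadratic_forms = np.einsum("ij,ij->j", z, log_a @ z)
    return float(np.mean(quadratic_forms))


def make_spd(n=50, seed=1):
    """Hypothetical test matrix: a well-conditioned SPD matrix."""
    rng = np.random.default_rng(seed)
    b = rng.standard_normal((n, n))
    return b @ b.T / n + np.eye(n)


if __name__ == "__main__":
    a = make_spd()
    print("Cholesky log-det:  ", logdet_cholesky(a))
    print("Stochastic estimate:", logdet_trace_estimate(a))
```

The quadratic forms for different probe vectors are independent of each other, which is what makes this kind of estimator a natural fit for parallel computation.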
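
To make the AJD problem from the project list concrete, here is a minimal Python sketch of the off-diagonality criterion that a joint diagonalizer should make small. It is not Shogun code and not one of the AJD algorithms the project would implement. The function names are hypothetical, and the example assumes the easy special case where the input matrices are symmetric and share an exact common eigenbasis, so that the eigenvectors of a random linear combination diagonalize all of them. Real AJD algorithms handle matrices that can only be diagonalized approximately.

```python
import numpy as np


def off_diagonal_cost(matrices, v):
    """Sum of squared off-diagonal entries of v^T C v over all C."""
    total = 0.0
    for c in matrices:
        t = v.T @ c @ v
        total += np.sum(t ** 2) - np.sum(np.diag(t) ** 2)
    return total


def diagonalizer_from_combination(matrices, seed=0):
    """Eigenvectors of a random linear combination of the inputs.

    Assumption: the matrices are symmetric and commute, so they share an
    eigenbasis; this shortcut is not valid for general AJD inputs.
    """
    rng = np.random.default_rng(seed)
    weights = rng.standard_normal(len(matrices))
    combination = sum(w * c for w, c in zip(weights, matrices))
    _, eigvecs = np.linalg.eigh(combination)
    return eigvecs


def make_commuting_set(n=4, count=5, seed=1):
    """Hypothetical test data: matrices Q D_k Q^T with a shared orthogonal Q."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return [q @ np.diag(rng.uniform(1.0, 2.0, n)) @ q.T for _ in range(count)]


if __name__ == "__main__":
    mats = make_commuting_set()
    v = diagonalizer_from_combination(mats)
    print("cost with identity:   ", off_diagonal_cost(mats, np.eye(4)))
    print("cost with diagonalizer:", off_diagonal_cost(mats, v))
```

In this special case the cost drops to numerical zero; with noisy or non-commuting inputs it can only be reduced, which is why the problem is called approximate joint diagonalization.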