Google Summer of Code Project Proposal

Project Title

Evolve Unix phyloinformatics tools into Ajax applications

Synopsis

Phylogenetics is the study of evolutionary relatedness among various groups of organisms. Many of the powerful software applications that scientists in this field use, like xrate or PHAST, are only available via the Unix command line. Biologists are often more comfortable working in a browser than at a command line, so these tools need to become accessible over the web in a user-friendly way. This project focuses on providing an Ajax interface for xrate.


Project Background

xrate (1) is an open source software tool for genome annotation based on the phylo-grammar, a probabilistic model combining continuous-time Markov chains and stochastic grammars. The software takes two data types as input: a multiple sequence alignment in Stockholm format and a grammar in a format based on Lisp S-expressions. Depending on the mode in which it is run (training, annotation or phylogeny mode), the software produces different output, which can be used to estimate rate parameters or make predictions.
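To make the first input type concrete, here is a minimal sketch of reading a Stockholm alignment in Python. It is only an illustration, not xrate's own parser: the function name `parse_stockholm` is hypothetical, and all `#=` markup lines are skipped rather than interpreted.

```python
def parse_stockholm(text):
    """Parse a Stockholm alignment into {sequence name: aligned sequence}.

    Simplified sketch: markup lines (starting with '#') are ignored, and
    interleaved blocks are joined by concatenating fragments per name.
    """
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith("# STOCKHOLM"):
        raise ValueError("missing '# STOCKHOLM' header line")

    alignment = {}
    for raw in lines[1:]:
        line = raw.strip()
        if line == "//":  # end-of-alignment marker
            break
        if not line or line.startswith("#"):
            continue  # blank line or markup (#=GF, #=GC, ...)
        parts = line.split()
        if len(parts) != 2:
            raise ValueError(f"cannot parse sequence line: {raw!r}")
        name, fragment = parts
        alignment[name] = alignment.get(name, "") + fragment

    # Every row of an alignment must have the same number of columns.
    if len({len(seq) for seq in alignment.values()}) > 1:
        raise ValueError("aligned sequences differ in length")
    return alignment


if __name__ == "__main__":
    demo = """# STOCKHOLM 1.0
#=GF ID demo
seq1 ACGU-A
seq2 ACGUUA
//
"""
    print(parse_stockholm(demo))
```

A web front end could run a check like this on uploaded alignments before handing them to xrate, so that malformed files are rejected in the browser session rather than on the command line.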

Project Plan

The first step is to build a web-based interface for xrate, comparable to what Pise (2) has accomplished for many other command-line bioinformatics tools. Next, we want to enhance the user experience by using asynchronous JavaScript for the output (e.g. phylogenetic tree viewers and navigators). JavaScript toolkits such as Dojo, Script.aculo.us, MochiKit, and Zimbra will be evaluated and used where appropriate.

During the development of the Ajax Genome Browser, much of the code for dragging, track management, tile caching, etc. had to be written from scratch (slide 20). Dojo now provides many of these tools (e.g. drag-and-drop frameworks, events, I/O) out of the box. The Ajax-enabled xrate would build on the alignment viewer, adding asynchronous functionality so that xrate can be called on the server, and a client-server rendering scheme that produces bubble plots and Graphviz state diagrams (for the xrate format files) and alignments and phylogenetic trees (for the Stockholm format files). As in Google Maps, you will be able to seamlessly zoom in and out, drag around, and toggle labels on and off in these diagrams and trees.

Likewise, the input could be enhanced by providing an interface that validates the S-expressions (or even removes the need to “write” them altogether). Finally, an API should provide an easy interface to other web-based bioinformatics platforms, so that the output of one program can be passed in machine-readable form to another. (Overview of which programs can be “piped” into each other.)

The project will be built in rapid iterations, following the eXtreme Programming guidelines, with extensive testing and documentation.
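The S-expression validation mentioned in the plan could start from something as small as the following Python sketch. It only checks that parentheses balance and builds a nested list; it knows nothing about the actual xrate grammar vocabulary, it assumes atoms are separated by whitespace (no quoted strings or comments), and the function names and the atoms in the example are made up for illustration.

```python
def parse_sexpr(text):
    """Parse S-expression text into nested lists of string atoms.

    Raises ValueError when the parentheses do not balance.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    stack = [[]]  # stack[0] collects the top-level expressions
    for token in tokens:
        if token == "(":
            stack.append([])
        elif token == ")":
            if len(stack) == 1:
                raise ValueError("unexpected ')'")
            finished = stack.pop()
            stack[-1].append(finished)
        else:
            stack[-1].append(token)
    if len(stack) > 1:
        raise ValueError(f"{len(stack) - 1} unclosed '('")
    return stack[0]


def validate_sexpr(text):
    """Return (is_valid, message) so a web form can show the error inline."""
    try:
        parse_sexpr(text)
    except ValueError as exc:
        return False, str(exc)
    return True, ""


if __name__ == "__main__":
    # Hypothetical atoms, not real xrate grammar keywords.
    print(parse_sexpr("(grammar (name demo) (rate 0.5))"))
    print(validate_sexpr("(grammar (name demo)"))
```

Returning a message instead of raising keeps the validator easy to call from an asynchronous request handler, which can send the message straight back to the browser.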

Deliverables

  • website with a working demo of an Ajax-enabled xrate
  • website describing the project, including software documentation
  • API for interaction with other web-based applications.
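As a rough idea of how the server side of such an API could call a command-line tool asynchronously, here is a minimal Python sketch. The `JobRunner` class and its method names are hypothetical, and the example runs a stand-in Python command rather than xrate, because the exact xrate command line is not specified in this proposal.

```python
import json
import subprocess
import sys
import tempfile
import uuid


class JobRunner:
    """Run command-line jobs in the background and report their state as plain dicts."""

    def __init__(self):
        self._jobs = {}  # job id -> (process, file capturing its output)

    def submit(self, argv):
        """Start argv without waiting for it; return an id the client can poll."""
        out = tempfile.TemporaryFile(mode="w+")
        proc = subprocess.Popen(argv, stdout=out, stderr=subprocess.STDOUT)
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = (proc, out)
        return job_id

    def status(self, job_id):
        """Return a JSON-serializable description of the job."""
        proc, out = self._jobs[job_id]
        code = proc.poll()
        if code is None:
            return {"id": job_id, "state": "running"}
        out.seek(0)
        return {
            "id": job_id,
            "state": "done" if code == 0 else "failed",
            "exit_code": code,
            "output": out.read(),
        }

    def wait(self, job_id):
        """Block until the job has finished, then return its final status."""
        proc, _ = self._jobs[job_id]
        proc.wait()
        return self.status(job_id)


if __name__ == "__main__":
    runner = JobRunner()
    # Stand-in command; a real deployment would pass the xrate command line here.
    job = runner.submit([sys.executable, "-c", "print('result goes here')"])
    print(json.dumps(runner.wait(job)))
```

The browser would call `submit` through one request and poll `status` with further requests. Because the status is a plain dictionary, it can be serialized to JSON for other web-based tools as well.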

Project Schedule

I expect to complete this project in 3 months.

March 14 - March 26 : Students submit their applications

March 27 - April 10 : Mentors review applications

April 11 : Accepted student applications are announced

April 12 - April 19 : Introduce myself to the community

April 20 - April 29 : Read the scientific literature about phylo-grammars

April 30 - May 6 : Experiment with different JavaScript toolkits

May 7 - May 13 : Experiment with the Ajax Genome Browser

May 14 - May 20 : Experiment with SWIG

May 21 - May 27 : Make first sketches for implementation

May 28 - June 3 : Start coding

June 4 - June 10 : Write the web-based interface

June 11 - June 17 : Ajaxify the xrate output

June 18 - June 24 : Write the API

June 25 - July 1 : Early prototype is finished

July 2 - July 8 : Upload code to code.google.com/hosting

July 9 - July 15 : Mentors begin mid-term evaluations

July 16 : Mid-term evaluation deadline

July 23 - July 29 : Usability testing with end-users

July 30 - August 5 : Extensive debugging

August 6 - August 12 : Write documentation

August 13 - August 19 : Students begin final evaluations

August 20 - August 26 : Mentors begin final evaluations

August 31 : Final evaluation deadline


Bio

In 2004 I obtained a Master's degree in Biology from the University of Antwerp with a bioinformatics thesis titled “comparison of methods to prioritize candidate disease genes according to their presumed involvement in human hereditary diseases” (in Dutch). For it, I wrote a computer program in Perl which combined and compared four techniques to prioritize candidate disease genes. The program worked, but involved a lot of screen scraping and some sub-optimal statistics.

I also noticed that my knowledge of algorithm design was lacking, so I enrolled at the Open University Netherlands to obtain an additional Master's degree, this time in Computer Science. I’m currently in my first year, specializing in bioinformatics.

I used Perl for my thesis project, but the main scripting language I currently use is Python. Occasionally, I also play with Ruby, Haskell, Erlang, J and Scheme48. I am in the process of learning Java and C for my university courses.

Since 2002, I have used only Unix-like operating systems (first Red Hat, then FreeBSD, now Ubuntu). This has given me a good understanding of UNIX system administration.

I have wide-ranging interests; see, for example, my del.icio.us bookmarks. Tags that frequently appear include bioinformatics (358 times), webdesign (211), Ajax (97), Javascript (94), data visualization (57) and machine learning (29).

Motivation

Since my youth I have had a passion for both design and science. My father is an industrial designer, and he made me aware from a very young age of things like typography, minimalist architecture and ergonomics. My mother, on the other hand, stimulated my inquisitive mind and, as a biologist herself, introduced me to the scientific method. While growing up, I became interested in diverse fields such as information visualization, data mining and graphic design, and noticed how each of them solved an isolated part of the larger problem: making sense of the ever-growing amount of data. The Aha-erlebnis came when I saw how Ben Fry brought these disparate fields together in a singular process titled Computational Information Design. That’s also the way I want to go with this project: bridging diverse cultures by making the software more accessible, and presenting the data in an enhanced visual way.

Name

Jeroen Van Goey
on the net, also known as BioGeek (for example at reddit)

Contact

‘jeroen.vangoey+soc’ (@gmail.com)
http://jeroen.vangoey.be

IM

GoogleTalk

References

  1. XRate: a fast prototyping, training and annotation tool for phylo-grammars. Peter S Klosterman, Andrew V Uzilov, Yuri R Bendaña, Robert K Bradley, Sharon Chao, Carolin Kosiol, Nick Goldman, and Ian Holmes. BMC Bioinformatics. 2006; 7: 428.
  2. A Web interface generator for molecular biology programs in Unix. Catherine Letondal. Bioinformatics. 2001; 17: 73.
