Google Summer of Code Project Proposal
Project Title
Evolve Unix phyloinformatics tools into Ajax applications
Synopsis
Phylogenetics is the study of evolutionary relatedness among various groups of organisms. Many of the powerful software applications that scientists in this field use, such as xrate or PHAST, are only available via the Unix command line. Biologists are often more comfortable working with a browser than with a CLI, so these tools need to become accessible over the web in a user-friendly way. This project would focus on providing an Ajax interface for xrate.
Project Background
xrate (1) is an open-source software tool for genome annotation based on the phylo-grammar, a probabilistic model combining continuous-time Markov chains and stochastic grammars. The software takes two inputs: a multiple sequence alignment in Stockholm format and a grammar in a format based on Lisp S-expressions. Depending on the mode in which it is run (training, annotation or phylogeny mode), the software produces different output, which can be used to estimate rate parameters or make predictions.
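To make the first input type concrete, here is a minimal sketch of how a web front end might read a Stockholm alignment before handing it to xrate. It is not taken from xrate itself: the function name `parse_stockholm` and the example data are hypothetical. It covers only the basic layout (a `# STOCKHOLM` header, `#=GF` per-file annotations, `name sequence` rows and a closing `//`) and ignores the other annotation types.

```python
def parse_stockholm(text):
    """Parse one Stockholm alignment into (sequences, file_annotations).

    Hypothetical helper: handles the header, '#=GF' lines, sequence rows
    (possibly interleaved over several blocks) and the '//' terminator.
    """
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith("# STOCKHOLM"):
        raise ValueError("missing '# STOCKHOLM' header")

    sequences = {}    # sequence name -> aligned residues
    annotations = {}  # '#=GF' tag -> free-text value

    for line in lines[1:]:
        line = line.strip()
        if line == "//":          # end of this alignment
            break
        if not line:              # blank lines separate interleaved blocks
            continue
        if line.startswith("#=GF"):
            parts = line.split(None, 2)
            if len(parts) == 3:
                _, tag, value = parts
                annotations[tag] = annotations.get(tag, "") + value
            continue
        if line.startswith("#"):  # other annotation types are skipped here
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            raise ValueError("malformed sequence line: %r" % line)
        name, residues = parts
        sequences[name] = sequences.get(name, "") + residues.replace(" ", "")

    # All rows of an alignment must have the same number of columns.
    if len({len(s) for s in sequences.values()}) > 1:
        raise ValueError("aligned sequences differ in length")
    return sequences, annotations


# Made-up example data, for illustration only.
example = """# STOCKHOLM 1.0
#=GF NH (human:0.1,mouse:0.2);
human ACGT-A
mouse ACGTTA
//
"""
seqs, ann = parse_stockholm(example)
```

A check like this on the server side would let the interface reject malformed uploads with a readable message instead of passing them on to the command-line tool.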
Project Plan
The first step would be to build a web-based interface for xrate, comparable to what Pise (2) has accomplished for many other command-line bioinformatics tools. Next, we want to enhance the user experience by using asynchronous JavaScript for the output (e.g. phylogeny tree viewers and navigators). JavaScript toolkits such as Dojo, Script.aculo.us, MochiKit, and Zimbra will be evaluated and used where appropriate. During the development of the Ajax Genome Browser, much of the code for dragging, track management, tile caching, etc. had to be written from scratch (slide 20). Dojo now provides many of these tools (e.g. drag-and-drop frameworks, events, I/O) out of the box.
The Ajax-enabled xrate would build on the alignment viewer, adding asynchronous functionality so that xrate can be called on the server, and a client-server rendering scheme that produces bubble plots and Graphviz state diagrams (for the xrate format files) and alignments and phylogenetic trees (for the Stockholm format files). As in Google Maps, you will be able to zoom in and out seamlessly, drag around, and toggle labels on and off in these diagrams and trees. Likewise, the input could be enhanced by providing an interface that validates the S-expressions (or even removes the need to “write” them altogether).
Finally, an API should provide an easy interface to other web-based bioinformatics platforms, so that the output of one program can be passed in machine-readable form to another. (Overview of which programs can be “piped” into each other.) The project will be built in rapid iterations, following the eXtreme Programming guidelines, with extensive testing and documentation.
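The S-expression validation mentioned in the plan could start with something as simple as a parenthesis check that reports where a grammar is unbalanced. The sketch below is only an illustration: the function name `check_sexpr` is hypothetical, and it assumes that `;` starts a comment running to the end of the line. It does not handle quoted strings and knows nothing about which keywords xrate actually accepts.

```python
def check_sexpr(text):
    """Return a list of (line, column, message) problems; empty if balanced.

    Hypothetical helper. Assumption: ';' begins a comment that runs to the
    end of the line. Quoted strings are not treated specially.
    """
    problems = []
    stack = []  # (line, column) of every '(' that is still open

    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, char in enumerate(line, start=1):
            if char == ";":       # ignore the rest of a commented line
                break
            if char == "(":
                stack.append((line_no, col))
            elif char == ")":
                if stack:
                    stack.pop()
                else:
                    problems.append((line_no, col, "unmatched ')'"))

    # Anything left on the stack was opened but never closed.
    for line_no, col in stack:
        problems.append((line_no, col, "unclosed '('"))
    return problems
```

Because the function returns line and column positions, the web form could highlight the offending parenthesis directly in the text area rather than only reporting that the grammar is invalid.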
Deliverables
- website with a working demo of an Ajax-enabled xrate
- website describing the project, including software documentation
- API for interaction with other web-based applications
Project Schedule
I expect to complete this project in 3 months.
March 14 - March 26 : Students submit their applications
March 27 - April 10 : Mentors review applications
April 11 : Accepted student applications are announced
April 12 - April 19 : Introduce myself to the community
April 20 - April 29 : Read the scientific literature about phylo-grammars
April 30 - May 6 : Experiment with different javascript toolkits
May 7 - May 13 : Experiment with the Ajax Genome Browser
May 14 - May 20 : Experiment with SWIG
May 21 - May 27 : Make first sketches for implementation
May 28 - June 3 : Start coding
June 4 - June 10 : Write the web-based interface
June 11 - June 17 : Ajaxify the xrate output
June 18 - June 24 : Write the API
June 25 - July 1 : Early prototype is finished
July 2 - July 8 : Upload code to code.google.com/hosting
July 9 - July 15 : Mentors begin mid-term evaluations
July 16 : Mid-term evaluation deadline
July 23 - July 29 : Usability testing with end-users
July 30 - August 5 : Extensive debugging
August 6 - August 12 : Write documentation
August 13 - August 19 : Students begin final evaluations
August 20 - August 26 : Mentors begin final evaluations
August 31 : Final evaluation deadline
Bio
In 2004 I obtained a Master's degree in Biology from the University of Antwerp with a bioinformatics thesis titled “comparison of methods to prioritize candidate disease genes according to their presumed involvement in human hereditary diseases” (in Dutch). For it, I wrote a computer program in Perl which combined and compared four techniques to prioritize candidate disease genes. The program worked, but involved a lot of screen scraping and some sub-optimal statistics.
I also noticed that my knowledge of algorithm design was lacking, so I enrolled at the Open University Netherlands to obtain an additional Master's degree, this time in Computer Science. I’m currently in my first year, specializing in bioinformatics.
Besides Perl for my thesis project, the main scripting language I currently use is Python. Occasionally, I also play with Ruby, Haskell, Erlang, J and Scheme48. I am in the process of learning Java and C for my university courses.
Since 2002, I have been using Unix-like systems as my only operating system (first Red Hat, then FreeBSD, now Ubuntu). This has given me a good understanding of UNIX system administration.
I have wide-ranging interests; see, for example, my del.icio.us bookmarks. Tags that frequently appear include bioinformatics (358 times), webdesign (211), Ajax (97), Javascript (94), data visualization (57) and machine learning (29).
Motivation
Since my youth I have had a passion for both design and science. My father is an industrial designer, and from a very young age he made me aware of things like typography, minimalist architecture and ergonomics. My mother, on the other hand, stimulated my inquisitive mind and, as a biologist herself, introduced me to the scientific method. While growing up, I became interested in diverse fields such as information visualization, data mining and graphic design, and noticed how each of them solved an isolated part of a larger problem: making sense of the ever-growing amount of data. The Aha-Erlebnis came when I saw how Ben Fry brought these disparate fields together in a single process called Computational Information Design. That’s also the way I want to go with this project: bridging diverse cultures by making the software more accessible, and presenting the data in an enhanced visual way.
Name
Jeroen Van Goey
On the net, also known as BioGeek (for example at reddit)
Contact
‘jeroen.vangoey+soc’ (@gmail.com)
http://jeroen.vangoey.be
IM
GoogleTalk
References
(1) XRate: a fast prototyping, training and annotation tool for phylo-grammars. Peter S Klosterman, Andrew V Uzilov, Yuri R Bendaña, Robert K Bradley, Sharon Chao, Carolin Kosiol, Nick Goldman, and Ian Holmes. BMC Bioinformatics. 2006; 7:428.
(2) A Web interface generator for molecular biology programs in Unix. Catherine Letondal. Bioinformatics. 2001; 17:73.