Implementation of a new quality control program for RNA sequencing data
Modern high-throughput sequencers can generate tens of millions of sequences in a single run. Before analysing these sequences to draw biological conclusions, one should always perform some simple Quality Control (QC) checks to ensure that the raw data looks good and that there are no problems or biases in the data that may affect how it can be used.
Most sequencers will generate a Quality Control report as part of their analysis pipeline, but this is usually only focused on identifying problems which were generated by the sequencer itself.
For the Google Summer of Code 2011 I would like to develop a stand-alone Python program that performs base-level and transcript-level Quality Control measures on RNA sequencing data from various sequencing platforms (e.g., Illumina paired-end, ABI SOLiD).
The program will build on existing scripts in Perl and R and go beyond existing QC programs, such as FastQC, by providing analyses on known transcripts, exons and junctions.
Community Bonding Period (April 25 – May 22):
Get to know mentors, read documentation, read FastQC source code, and determine with mentors how we can integrate this project as closely as possible into the GenMAPP, Cytoscape, and WikiPathways workflows.
Start of Program (May 23 – July 14)
Start coding, port the Java, Perl and R codebases to Python, and implement the different algorithms.
Midterm Evaluation (July 15 – August 15)
Write documentation, write tests, format output in nice tables and graphical plots.
Pencils down (August 15 – August 21)
Scrub code, improve tests and documentation.
Final evaluation deadline (August 26)
List of Deliverables
A stand-alone Python program with:
- base-level Quality Control:
- analysis of the base composition
- information about the error rates (e.g., quality per base position over read length)
- alignment statistics (mapped, unmapped, non-unique mappings)
- transcript-level Quality Control:
- transcript read density variation (5’ vs. 3’, exon vs. junction, exon vs. intron, normalization correction bias)
- replicate comparison (quantile-quantile aligned read count plots)
- known versus novel exons/junctions and expression of a panel of known housekeeping genes
Output will include tables and graphical plots (PMW). Implementation without calls to external Python libraries (e.g., NumPy) would be preferable.
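To illustrate the base-level measures listed above, here is a minimal sketch that uses only the Python standard library. It assumes four-line FASTQ records and Phred+33 quality encoding; the function names `read_fastq` and `base_level_qc` and the two example reads are hypothetical and do not come from the existing Perl, R or Java code.

```python
import io
from collections import Counter, defaultdict


def read_fastq(handle):
    """Yield (sequence, quality_string) pairs from a FASTQ stream.

    Assumes exactly four lines per record (no wrapped sequences).
    """
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # skip stray blank lines
        sequence = handle.readline().strip()
        handle.readline()  # separator line starting with '+'
        quality = handle.readline().strip()
        yield sequence, quality


def base_level_qc(records, phred_offset=33):
    """Return per-position base composition and mean quality.

    phred_offset=33 is an assumption; older Illumina pipelines used 64.
    """
    composition = defaultdict(Counter)  # position -> counts per base
    quality_sum = defaultdict(int)      # position -> summed Phred scores
    quality_count = defaultdict(int)    # position -> number of reads covering it
    for sequence, quality in records:
        for pos, (base, qchar) in enumerate(zip(sequence, quality)):
            composition[pos][base] += 1
            quality_sum[pos] += ord(qchar) - phred_offset
            quality_count[pos] += 1
    mean_quality = {
        pos: quality_sum[pos] / float(quality_count[pos]) for pos in quality_sum
    }
    return composition, mean_quality


# Two made-up reads, used only to exercise the functions.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGA\n+\nII5!\n")
composition, mean_quality = base_level_qc(read_fastq(example))

for pos in sorted(mean_quality):
    print(pos, dict(composition[pos]), mean_quality[pos])
```

The per-position dictionaries could be written out directly as a table or passed to a plotting routine. Reads of different lengths are handled because each position keeps its own read count.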
In 2004 I obtained a Master's degree in Biology from the University of Antwerp, Belgium, on a bioinformatics thesis whose title translates to “comparison of methods to prioritize candidate disease genes according to their presumed involvement in human hereditary diseases”. For it, I wrote a computer program in Perl which combined and compared four techniques to prioritize candidate disease genes.
Then I worked a few years in the computer industry at EDS (now part of Hewlett-Packard), obtained a Java EE software developer certificate and did an internship at a startup where I wrote an extension in Eclipse BIRT for their real-time quantitative PCR analytics software.
Since September I have again been a full-time student, now at the Catholic University of Leuven, Belgium, studying towards a Master of Bioinformatics degree.
Besides using R and Matlab for homework assignments, the main scripting language I use is Python. I love solving Project Euler problems with it. Occasionally, I also play with Haskell, J and Scheme.
Since 2002, I have been using Linux as my only operating system (first Red Hat, then FreeBSD, now Ubuntu). It has given me a good understanding of UNIX system administration.
I have wide-ranging interests; see, for example, my del.icio.us bookmarks. Tags that frequently appear include bioinformatics (479 times), data visualization (57) and machine learning (64).
Jeroen Van Goey
On the net, also known as BioGeek (for example at reddit)
Preferred method of communication
Google Talk, Skype