2008/12/18

Why I decided to work on bacteria

After working exclusively on bacterial genomes for a couple of months, I recently went back to the world of eukaryotic genomics to compile some data sets for comparison with my findings in bacteria. Now that I am totally spoiled by the availability and quality of bacterial genome sequences, working on eukaryotes is excruciatingly frustrating and I just have to vent here.

1. The taxon sampling is so sparse (both within and across phylogenetic groups). Often the lineages are either too closely or too distantly related, which makes it really hard to do comparative genomics.

2. When I get lucky and find a group that has just the right level of divergence, some of the genome sequencing projects seem to have been "in progress" forever. Seeing that the last update was several years ago, I really doubt that they intend to finish what they started.

3. Okay now, what about the published genomes? Everyone knows that you are supposed to deposit the sequences in GenBank/EMBL/DDBJ when you publish the genome paper, so people can get the sequences to do more analysis later. Well, as it turns out, this is quite tricky as well. While most groups follow this guideline, many of the deposited genome sequences do not contain annotation of any kind. To get the annotation, you have to hunt down the files from various sources, and needless to say, everyone uses whatever file format and convention caught their fancy.

4. Just when I think I am done with all the painful data collection and file format conversion and am ready to roll, my whole data analysis pipeline simply blows up in my face. As it turns out, some annotations are just plain horrible; there are annotated proteins with fewer than 60 amino acids and more than 3 in-frame stop codons, or worse yet, "genes" with one single amino acid. I mean, come on guys, even a first-pass, fully automatic annotation can do better than that. While my pipeline works well for what it was designed to do, there is very little that it can do about the classic garbage-in-garbage-out problem.
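
For what it is worth, the kind of sanity check that catches these cases is easy to sketch. The snippet below is only an illustration, not my actual pipeline: it assumes the annotated proteins are already translated into strings where "*" marks a stop codon, and the function name, the parameter names, and the example records are all made up. The thresholds are simply the ones from the complaint above.

```python
def flag_suspicious(proteins, min_length=60, max_internal_stops=3):
    """Return {protein_id: reason} for annotations that look like garbage.

    `proteins` maps an ID to a translated sequence in which '*' marks a
    stop codon (an assumption of this sketch, not a standard of any
    particular database).
    """
    flagged = {}
    for pid, seq in proteins.items():
        # A single trailing '*' is just the normal terminal stop codon.
        body = seq[:-1] if seq.endswith("*") else seq
        internal_stops = body.count("*")
        length = len(body) - internal_stops  # amino acids only
        if length <= 1:
            flagged[pid] = "one amino acid or fewer"
        elif length < min_length and internal_stops > max_internal_stops:
            flagged[pid] = (
                f"{length} aa with {internal_stops} in-frame stop codons"
            )
    return flagged


# Hypothetical records, invented purely for illustration.
example = {
    "ok": "M" + "A" * 99 + "*",
    "single": "M*",
    "bad": "MA*KL*GT*PQ*RS*",
}
flagged = flag_suspicious(example)
for pid, reason in flagged.items():
    print(pid, reason)
```

A filter like this only removes the most obvious junk before the real analysis starts; it cannot tell you whether the remaining gene models are any good.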

All these frustrations reminded me why I decided to switch from eukaryotes to bacteria last year, and I am really glad that I did.
