Wednesday, October 3, 2012

Counting nucleotides

Rosalind is a nice web site I recently discovered. Like 99 scala problems, Topcoder or similar sites it poses little programming challenges but with a special focus on bioinformatics.

The first and simplest programming challenges is to count the nucleotides within a DNA sequence (see here). For instance, given a sequence "AGCTTTTCATTCTGACTGC", what are the frequencies of the four nucleotides A,C,T,G?

It is a trivial problem, I used to solve with a dictionary but reading the problem description, which requires an ordered output of the frequencies, I realized that there is an even more elegant solution.

An efficient but ugly solution would be to count the individual nucleotide occurrences within a loop, e.g.

s = "AGCTTTTCATTCTGACTGC"
A,C,T,G = 0,0,0,0
for e in s:
    if e=='A': A +=1
    if e=='C': C +=1
    if e=='G': T +=1
    if e=='T': G +=1            

print A,C,T,G

The repetition of the four ifs is not nice and could be replaced as follows

s = "AGCTTTTCATTCTGACTGC"
idx = {'A':0, 'C':1, 'G':2, 'T':3}
counts = [0]*4
for e in s:
    counts[idx[e]] += 1           
print counts 

However, it remains a rather ugly solution and I find the following line much more elegant

print map(s.count, "ACTG") 

The good news is, it is even pretty efficient. The sequence gets parsed 4 times and the complexity is therefore O(4n) = O(n), which is not worse than the complexity of the dictionary approach:

counts = dict()
for e in s:
    counts[e] = counts.get(e,0)+1
print counts  
print map(s.count, "ACTG") 

Just for fun and comparison the same in Scala. First the dictionary approach

def counts = s.foldLeft(Map[Char,Int]() withDefaultValue 0){
  (m,c) => m + ((c,m(c)+1))
}
println(counts)

which is rather convoluted. Less efficient but much more readable is the version that uses "groupBy":

val s = "AGCTTTTCATTCTGACTGC"
def counts = s.groupBy(identity).mapValues(_.size)
println(counts)

and mapping the count function onto the nucleotides again leads to the most elegant code

val s = "AGCTTTTCATTCTGACTGC"
val counts = "ACTG".map(ch => s.count(_==ch))
println(counts)

A slightly different version I saw on Rosalind

val s = "AGCTTTTCATTCTGACTGC"
println( "ACGT" map (s count _.==) mkString " " )