Monday, May 22, 2017

Reading line-structured files (FASTA) with Python

Many files contain data in a line-structured format, where a line prefix indicates a record or fields in a record. An example is the FASTA format, where ';' or '>' indicates header or comment lines, followed by multiple lines of sequence data:

  
;LCBO - Prolactin precursor
; a sample sequence in FASTA format
MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSS
EMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPLYHL
VTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDED
ARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC*

>MCHU - Calmodulin 
ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*

>gi|5524211|gb|AAD44166.1
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY

Files in these formats are surprisingly unpleasant to read if they are large and cannot be loaded completely into memory. The reason is that we need an iterator to read the file line by line and cannot go back.

Before looking at the solutions below, give it a try yourself! The task is to read the file above (without loading it completely into memory via open().read()) and return tuples of the form (header, sequence), where header is only the first of possibly multiple header lines and sequence is the concatenation of all sequence data. In other words, we want a generator or iterator that produces the following tuples:

  
(';LCBO - Prolactin precursor', 'MDSKGSS...NNC*')

('>MCHU - Calmodulin', 'ADQLTEE...TAK*')

('>gi|5524211|gb|AAD44166.1', 'LCLYTHIG...ENY')

The sequence strings above are shortened (...) for readability. Note that newlines within the sequence data need to be removed and that there can be multiple blank lines. Here is my solution in plain Python:

def read_seqs():
    header, seqs = None, []
    isheader, prev_isheader = False, False
    for line in open('fasta.txt'):
        line = line.strip()
        if not line:                  # skip blank lines
            continue
        prev_isheader = isheader
        isheader = line[0] in {'>', ';'}
        if isheader and not prev_isheader:
            if header:                # emit the previous record
                yield header, ''.join(seqs)
            header, seqs = line, []   # start a new record
        elif not isheader:
            seqs.append(line)         # collect sequence lines
    yield header, ''.join(seqs)       # emit the last record

Now we can read and print the sequences one-by-one:

for h, s in read_seqs():
    print '{}\n{}\n'.format(h, s)

However, the code is ugly and complex, and it took me embarrassingly long to write. A simple task such as reading a line-structured file should not be that difficult, and we will look at a much better solution later. But first a quick discussion of the code above.

We read the file 'fasta.txt' with open(), which gives us an iterator over the lines in the file. Newlines are removed and empty lines are skipped. isheader tells us whether a line starts with a header prefix, and prev_isheader holds this state for the previous line. If we find a header prefix and the previous line was not a header line, we start a new record. If it is not a header line, we collect the sequence data. Note that we have to call yield twice: once when we start a new record, to return the record we have just finished reading, and a second time when we reach the end of the file. Two flags, two yields, multiple nested if statements, and more than 10 lines of code to read a simple file is shockingly complex.

Luckily, with itertools the same problem can be solved much more elegantly. In just four lines (not counting imports), to be precise:

from itertools import groupby, izip

lines = (l.strip() for l in open('fasta.txt') if l.strip())
isheader = lambda line: line[0] in {'>', ';'}
chunks = (list(g) for _, g in groupby(lines, isheader))
seqs = ((h[0], ''.join(s)) for h,s in izip(chunks, chunks))

Now this is beautiful! Much shorter, more readable and easy to implement. The key is the groupby function that simplifies our life dramatically here. But let's start from the top.

The first line is a generator that iterates over the lines in the file, strips newlines and skips empty lines. The second line defines a lambda function isheader that recognizes header lines. In the third line it gets interesting. groupby groups a stream of data (here lines) according to a key function. If you are not familiar with groupby, here is a small example to show how it works:

>>> from itertools import groupby
>>> numbers = [1, 2, 3, 4, 5, 1]
>>> gt2 = lambda x : x > 2
>>> [(key, list(group)) for key, group in groupby(numbers, gt2)]
[(False, [1, 2]), (True, [3, 4, 5]), (False, [1])]

We group a list of numbers depending on whether they are greater than 2 or not. groupby reads the numbers one by one, and whenever the key function gt2 changes its value it creates a new group. groupby returns an iterator over the keys and groups it produces, and each group itself is an iterator over its elements. In the example above we use list to convert each group iterator to a list and a list comprehension to print the keys and groups.
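
One caveat worth knowing: each group shares its underlying data with the groupby object itself, so a group must be consumed before groupby is advanced to the next one. If we collect the group iterators first and convert them to lists later, the data is gone. A small demonstration (here groupby uses the default identity key):

>>> groups = [g for _, g in groupby([1, 1, 2, 2])]
>>> [list(g) for g in groups]
[[], []]

This is why we convert each group with list(g) right away in our chunks generator.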

Note that groupby almost does what we need to read the records in our FASTA file. We can group the lines and start a new group whenever we encounter a header line. Since we don't need the result of the key function we can simply write:

chunks = (list(g) for _, g in groupby(lines, isheader))

Each chunk is a list of lines, containing either all header lines or all sequence lines. For instance, if we print the chunks

for chunk in chunks:
    print chunk

we get the following output (sequences are shortened for readability):

[';LCBO - Prolactin precursor', '; a sample sequence in FASTA format']
['MDS...LSS', 'EMF...YHL', 'VTE...DED', 'ARY...NNC*']
['>MCHU - Calmodulin']
['ADQ...GTID', 'FPE...REA', 'DID...TAK*']
['>gi|5524211|gb|AAD44166.1']
['LCL...NLV', 'EWI...DFLG', 'LLI...IVIL', 'GLM...AGX', 'IENY']

What is left to do is to bring the header lines and their sequence lines together, and there is a very neat trick to achieve this. Remember this one! As you know, zip zips iterables:

>>> zip([1, 2, 3], 'abc')
[(1, 'a'), (2, 'b'), (3, 'c')]

Under Python 2.x zip returns a list while under Python 3.x it returns an iterator. izip under Python 2.x does the same as zip under Python 3.x and returns an iterator (a quick check in the Python 2 shell; the object address will vary):
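
>>> from itertools import izip
>>> izip([1, 2, 3], 'abc')
<itertools.izip object at 0x...>
>>> list(izip([1, 2, 3], 'abc'))
[(1, 'a'), (2, 'b'), (3, 'c')]

More importantly, however, zip and izip operate on iterators, and here comes the trick! If we want to group the elements of an iterable, say in tuples of size 2, we can pass the same iterator twice and zip will do the job: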

>>> numbers = iter([1, 2, 3, 4, 5, 6])
>>> zip(numbers, numbers)
[(1, 2), (3, 4), (5, 6)]

Note that we wrap numbers into an iterator by calling iter(). Without that, the result would be different:

>>> numbers = [1, 2, 3, 4, 5, 6]
>>> zip(numbers, numbers)
[(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)]

What's going on here? zip reads an element from numbers and then reads again to build the first tuple. Since numbers is an iterator, the element pointer progresses and we get (1, 2). We can equally easily group numbers in triples:

>>> numbers = iter([1, 2, 3, 4, 5, 6])
>>> zip(numbers, numbers, numbers)
[(1, 2, 3), (4, 5, 6)]
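
As a side note, the same trick can be written without naming the iterator: replicating a single iterator n times and unpacking it into zip is a common grouping idiom. For n = 3:

>>> numbers = [1, 2, 3, 4, 5, 6]
>>> zip(*[iter(numbers)] * 3)
[(1, 2, 3), (4, 5, 6)]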

Obviously, this also works for reading our FASTA file. We can group the chunks of header or sequence lines in tuples of size 2 to bring header and sequence together. We use izip because we don't want to collect all sequences in memory. So the fourth line of

seqs = ((h[0], ''.join(s)) for h,s in izip(chunks, chunks))

the itertools solution above means that we group header lists and sequence lists in tuples h, s. But we only need the first header line and the concatenation of the sequence lines, and therefore return (h[0], ''.join(s)) in the generator expression.
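
For reuse, the four lines can be wrapped into a generator function. Here is a minimal sketch (read_fasta is just an illustrative name) that also closes the file once the generator is exhausted:

from itertools import groupby, izip

def read_fasta(fname):
    isheader = lambda line: line[0] in {'>', ';'}
    with open(fname) as f:             # closes the file when done
        lines = (l.strip() for l in f if l.strip())
        chunks = (list(g) for _, g in groupby(lines, isheader))
        for h, s in izip(chunks, chunks):
            yield h[0], ''.join(s)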

There you go! A much simpler implementation for reading FASTA files and other line-structured files. Can we do even better? Using a library for data flows such as nuts-flow leads to even shorter and more readable code:

from nutsflow import *
from nutsflow import _

isheader = lambda line: line[0] in {'>', ';'}
seqs = (open('fasta.txt') >> Map(str.strip) >> Filter(_) >> 
        ChunkBy(isheader) >> Chunk(2) >> MapCol(0,next) >> MapCol(1,''.join))

The >> operator in nuts-flow indicates the chaining of iterators and defines the data flow. Here we open the file, map the string-stripping function over each line to remove whitespace, and then filter out empty lines. The underscore in Filter(_) functions as an anonymous variable here. Similar to Python's groupby, ChunkBy groups the lines according to the isheader predicate. We then call Chunk(2) to bring header and sequence lines together in tuples of the form (header, sequence) -- no need for the zip trick we used before. Finally, we map next on column 0 and ''.join on column 1 of these tuples to extract the first header line and to concatenate the sequence lines.
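
Since seqs is again an iterable of (header, sequence) tuples, printing the records works exactly as before:

for h, s in seqs:
    print '{}\n{}\n'.format(h, s)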

That's it. Can't get better than that ;)

Tuesday, May 2, 2017

Fun with itertools in Python

Let's say we are given a file 'numbers1.txt' that contains a header, numbers and blank lines, and the task is to sum up all the numbers in the file. Here is an example file

  
MyNumbers
1
2

3
4

If the file can be loaded into memory this is easy

  
>>> lines = list(open('numbers1.txt'))
>>> sum(int(l) for l in lines[1:] if l.strip())
10

open() returns an iterator over the lines in the file, which we collect in a list. Alternatively we could have used open().readlines(). We slice with lines[1:] to skip the header line and test for non-empty lines with if l.strip(). A sum over a generator expression finally returns the wanted result. Note that for this simple example we don't bother with closing the file, which you should do for robust code.
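
If you want the file closed deterministically, a context manager does the job; here is the same computation as a small sketch:

with open('numbers1.txt') as f:   # file is closed when the block ends
    lines = list(f)
print sum(int(l) for l in lines[1:] if l.strip())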

If the file is too big to be loaded into memory, things get a bit ugly. open() returns an iterator, which is good, but we need to skip the first header line and cannot apply list slicing to an iterator. However, we can advance the iterator by calling next() to solve this issue

  
>>> lines = open('numbers1.txt')
>>> next(lines)
'MyNumbers\n'
>>> sum(int(l) for l in lines if l.strip())
10

But what if there are multiple header lines and multiple files we want to sum over? Let's implement this in plain Python and then use itertools to solve this more elegantly.

from glob import glob

def filesum(fname, skip=1):   # sum numbers in file
    lines = open(fname)
    for _ in xrange(skip):    # skip header lines
        next(lines)
    return sum(int(l) for l in lines if l.strip())

sum(filesum(fname) for fname in glob('numbers*.txt'))

As you can see, things are getting a tad ugly, especially with respect to the skipping of multiple header lines. We are going to use islice, ifilter, and imap from the itertools library to clean things up. Note that the names of several functions in itertools start with an 'i' to indicate that they return iterators and do not collect results in memory.

Similar to list slicing, islice() takes a slice out of an iterator. Here we use it to skip one header line

>>> from itertools import islice
>>> list(islice(open('numbers1.txt'), 1, None))
['1\n', '2\n', '\n', '3\n', '4']

where islice(iterable, start, stop[, step]) takes an iterable as input and, when called with None for the stop parameter, skips the first start elements of it. Since islice returns an iterator we wrap it in a list to display the result.
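
For illustration, here is islice with an explicit stop and with a step, applied to a plain list:

>>> list(islice([10, 20, 30, 40, 50], 1, 4))
[20, 30, 40]
>>> list(islice([10, 20, 30, 40, 50], 0, None, 2))
[10, 30, 50]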

The next step is to filter out the blank lines and we employ ifilter() for this purpose

>>> from itertools import islice, ifilter

>>> lines = islice(open('numbers1.txt'), 1, None)
>>> list(ifilter(lambda l: l.strip(), lines))
['1\n', '2\n', '3\n', '4']

In the next step we convert the number strings into integers by mapping the int function on the lines of the file using imap()

>>> from itertools import islice, ifilter, imap

>>> alllines = islice(open('numbers1.txt'), 1, None)
>>> numlines = ifilter(lambda l: l.strip(), alllines)
>>> list(imap(int, numlines))
[1, 2, 3, 4]

Note that imap, in contrast to map, returns an iterator in Python 2, while in Python 3 map is equivalent to imap. Having an iterator over integer numbers, we can now sum them up

>>> from itertools import islice, ifilter, imap
>>> SKIP = 1
>>> alllines = islice(open('numbers1.txt'), SKIP, None)
>>> numlines = ifilter(lambda l: l.strip(), alllines)
>>> sum(imap(int, numlines))
10

What is left to do is to aggregate over multiple files, and similarly to our first implementation we define a separate function filesum for this purpose

from glob import glob
from itertools import islice, ifilter, imap

def filesum(fname, skip=1):
    isnumber = lambda l: l.strip()
    lines = islice(open(fname), skip, None)
    return sum(imap(int, ifilter(isnumber, lines)))
    

sum(imap(filesum, glob('numbers*.txt')))
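
As an aside, itertools.chain allows an alternative without a helper function: we chain the header-skipped lines of all files into one stream and sum once. A sketch, assuming each file has exactly one header line:

from glob import glob
from itertools import chain, ifilter, imap, islice

# one stream of number lines across all files, headers skipped per file
lines = chain.from_iterable(islice(open(f), 1, None)
                            for f in glob('numbers*.txt'))
print sum(imap(int, ifilter(lambda l: l.strip(), lines)))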

If you compare both implementations, I hope you will find the itertools-based code a bit more readable and elegant than the plain Python code presented at the start. Still, the code seems overly complex for such a simple problem. An alternative to itertools is nuts-flow, which allows implementing data flows, such as the one above, much more easily. For instance, to sum up the numbers within the file we could simply write

>>> from nutsflow import Drop, Map, Sum, nut_filter
>>> IsNumber = nut_filter(lambda l: l.strip())

>>> open('numbers1.txt') >> Drop(1) >> IsNumber() >> Map(int) >> Sum()
10

where Drop(1) skips the header line, IsNumber keeps only the non-empty lines (the numbers), Map(int) converts them to integers and Sum sums them up. The explicit data flow and linear sequence of processing steps renders this code considerably more readable, and it can easily be extended to operate over multiple files

>>> from glob import glob
>>> from nutsflow import Drop, Map, Sum, nut_filter

>>> IsNumber = nut_filter(lambda l: l.strip())
>>> filesum = lambda f: open(f) >> Drop(1) >> IsNumber() >> Map(int) >> Sum()
>>> glob('numbers*.txt') >> Map(filesum) >> Sum()
10

More about nuts-flow in a follow-up post. That's it for today ;)

Monday, May 1, 2017

lambda functions in Python

The most common (and recommended) way to implement a function in Python is via the def keyword and a function name. For instance

  
def add(a, b): return a + b 

is a function that adds two numbers and can be called as

  
>>> add(1, 2)
3

We can (re)define the same function as a lambda function

  
add = lambda a, b: a + b

and it can be invoked with exactly the same syntax as before

  
>>> add(1, 2)
3

However, we can also define and invoke a lambda function anonymously (without giving it a name). Here are some examples

  
>>> (lambda a, b: a + b)(1, 2)      # add
3
>>> (lambda a: a * 2)(2)            # times 2
4
>>> (lambda a: sum(a))([1, 2, 3])   # sum of list
6
>>> (lambda *a: min(a))(1, 2, 3)    # min of list
1

The syntactic differences between conventional function definitions via def and lambda functions are that lambda functions have no parentheses around their arguments, don't use the return statement, have no name and must be implemented as a single expression.
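
Since the body of a lambda must be a single expression, branching requires a conditional expression instead of if statements. For example:

>>> parity = lambda x: 'even' if x % 2 == 0 else 'odd'
>>> parity(3)
'odd'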

So, what are they good for? Fundamentally, there is no need for lambda functions and we could happily live without them, but they make code shorter and more readable when a simple function is needed only once. The most common use case is as a key function for sorting:

  
>>> names = ['Fred', 'Anna', 'John']
>>> sorted(names)                        # alphabetically
['Anna', 'Fred', 'John']

>>> sorted(names, key=lambda n: n[1])    # second letter
['Anna', 'John', 'Fred']

Key functions come in especially handy for sorting collections of data structures

  
>>> contacts = [('Fred', 18), ('Anna', 22)]

>>> sorted(contacts)                               # name then age
[('Anna', 22), ('Fred', 18)]

>>> sorted(contacts, key=lambda (name, _): name)   # name
[('Anna', 22), ('Fred', 18)]

>>> sorted(contacts, key=lambda (_, age): age)     # age
[('Fred', 18), ('Anna', 22)]

>>> sorted(contacts, key=lambda (n, a): (a, n))    # age then name
[('Fred', 18), ('Anna', 22)]

Note the brackets in lambda (name, _) and similar calls that result in tuple unpacking and a very readable sorting implementation by name or age. Be aware that tuple parameter unpacking was removed in Python 3 (see PEP 3113); there you have to index into the tuple instead, for instance via operator.itemgetter:
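
>>> from operator import itemgetter
>>> sorted(contacts, key=itemgetter(1))            # age, works in Python 3
[('Fred', 18), ('Anna', 22)]

We could have achieved the same by defining a conventional function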

  
>>> def by_age((name, age)): return age
>>> sorted(contacts, key=by_age)
[('Fred', 18), ('Anna', 22)]

which is also very readable but longer and pollutes the namespace with a by_age function that is needed only once. Note that lambda functions are not needed if we already have a suitable function! For instance

  
>>> names = ['Fred', 'Anna', 'John']

>>> sorted(names, key=lambda n: len(n))    # don't do this
['Fred', 'Anna', 'John']

>>> sorted(names, key=len)                 # simply do this
['Fred', 'Anna', 'John']

An elegant example for the usage of lambda functions is the implementation of argmax, which returns the index of the largest element in a list (and the element itself)

  
>>> argmax = lambda ns: max(enumerate(ns), key=lambda (i,n): n)
>>> argmax([2, 3, 1])
(1, 3)

In Python 3 we can use the following implementation instead, which is also short but less readable

  
>>> argmax = lambda ns: max(enumerate(ns), key=lambda t: t[1])
>>> argmax([2, 3, 1])
(1, 3)

While applications for lambda functions in typical Python code are limited, they occur frequently in APIs for graphical user interfaces, e.g. to react to button presses, or in code that employs a functional programming style using filter, map, reduce or similar functions that take functions as arguments. For instance, the cumulative difference of positive numbers provided as strings could be computed (in functional style) as follows

>>> from operator import sub  
>>> numbers = ['10', '-5', '3', '2']  # we want to compute 10-3-2
>>> reduce(sub, filter(lambda x: x > 0, map(int, numbers)))  
5

However, list comprehensions and generators are often more readable and recommended. Here is the same computation using a generator expression

>>> reduce(sub, (int(n) for n in numbers if int(n) > 0)) 
5

Note, however, that we have to convert the strings to integers twice and that there is no elegant way to compute the cumulative difference without reduce, which is no longer a built-in in Python 3 :( So in plain, boring, imperative Python we would have to write

numbers = ['10', '-5', '3', '2']
diff = None
for nstr in numbers:
    n = int(nstr)
    if n <= 0:      # keep positive numbers only
        continue
    diff = n if diff is None else diff - n

Pay special attention to the ugly edge case of the first element when accumulating the differences. We need to initialize diff with a marker value, here None, and then test for it in the last line. Ugly! The functional implementation is clearly more readable and shorter in this case (and for similar problems).
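
As a side note, the marker value can be avoided by seeding the accumulator via next(); a sketch of the same loop:

numbers = ['10', '-5', '3', '2']
nums = (int(n) for n in numbers if int(n) > 0)   # yields 10, 3, 2
diff = next(nums)    # seed with the first positive number
for n in nums:
    diff -= n        # diff is now 10 - 3 - 2 = 5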

Luckily, the functools library comes to the rescue and brings back reduce under Python 3. For instance:
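
>>> from functools import reduce   # required in Python 3; a built-in in Python 2
>>> from operator import sub
>>> numbers = ['10', '-5', '3', '2']
>>> reduce(sub, (int(n) for n in numbers if int(n) > 0))
5

Also have a look at itertools, which provides many useful functions for functional programming. Similarly, there is nuts-flow, which allows even more elegant code. Here is the cumulative difference using nuts-flow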

>>> from operator import sub 
>>> from nutsflow import _, Reduce, Map, Filter

>>> numbers = ['10', '-5', '3', '2']
>>> numbers >> Map(int) >> Filter(_ > 0) >> Reduce(sub)
5

That's it. Have fun ;)