Monday, May 22, 2017

Reading line-structured files (FASTA) with Python

Many files store data in a line-structured format, where a line prefix indicates a record or fields within a record. An example is the FASTA format, where ';' or '>' marks header or comment lines, which are followed by multiple lines of sequence data:

;LCBO - Prolactin precursor
; a sample sequence in FASTA format

>MCHU - Calmodulin 


Files in these formats are surprisingly difficult to read if they are large and cannot be loaded completely into memory. The reason is that we must use an iterator to read the file line by line and cannot go back.

Before looking at the solutions below, give it a try yourself! The task is to read the file above (without loading it in memory via open().read()) and return tuples of the form (header, sequence), where header is only the first line of multiple header lines, if there are any, and sequence is the concatenation of all sequence data. In other words, we want a generator or iterator that produces the following tuples:

(';LCBO - Prolactin precursor', 'MDSKGSS...NNC*')

('>MCHU - Calmodulin', 'ADQLTEE...TAK*')

('>gi|5524211|gb|AAD44166.1', 'LCLYTHIG...ENY')

The sequence strings above are shortened (...) for readability. Note that newlines within the sequence data need to be removed and that there can be multiple blank lines. Here is my solution in plain Python:

def read_seqs():
    header, seqs = None, []
    isheader, prev_isheader = False, False
    for line in open('fasta.txt'):
        line = line.strip()
        if not line:
            continue
        prev_isheader = isheader
        isheader = line[0] in {'>', ';'}
        if isheader and not prev_isheader:
            if header:
                yield header, ''.join(seqs)
            header, seqs = line, []
        elif not isheader:
            seqs.append(line)
    yield header, ''.join(seqs)

Now we can read and print the sequences one-by-one:

for h,s in read_seqs():
    print '{}\n{}\n'.format(h,s)

However, the code is ugly and complex, and it took me embarrassingly long to write. A simple task such as reading a line-structured file should not be this difficult, and we will look at a much better solution later. But first a quick discussion of the code above.

We read the file 'fasta.txt' with open(), which gives us an iterator over the lines in the file. Newlines are removed and empty lines are skipped. isheader tells us whether a line starts with a header prefix, and prev_isheader stores this state for the previous line. If we find a header prefix and the previous line was not a header line, we start a new record. If it is not a header line, we collect the sequence data. Note that we have to call yield twice: once when we start a new record, to return the record we have just finished reading, and a second time when we reach the end of the file. Two flags, two yields, multiple nested if statements, and more than 10 lines to read a simple file is shockingly complex.

Luckily, with itertools the same problem can be solved much more elegantly. In just four lines (not counting imports), to be precise:

from itertools import groupby, izip

lines = (l.strip() for l in open('fasta.txt') if l.strip())
isheader = lambda line: line[0] in {'>', ';'}
chunks = (list(g) for _, g in groupby(lines, isheader))
seqs = ((h[0], ''.join(s)) for h,s in izip(chunks, chunks))

Now this is beautiful! Much shorter, more readable and easy to implement. The key is the groupby function that simplifies our life dramatically here. But let's start from the top.

The first line is a generator that iterates over the lines in the file, strips newlines and skips empty lines. The second line defines a lambda function isheader that recognizes header lines. In the third line it gets interesting: groupby groups a stream of data (here lines) according to a key function. If you are not familiar with groupby, here is a small example to show how it works:

>>> from itertools import groupby
>>> numbers = [1, 2, 3, 4, 5, 1]
>>> gt2 = lambda x : x > 2
>>> [(key, list(group)) for key, group in groupby(numbers, gt2)]
[(False, [1, 2]), (True, [3, 4, 5]), (False, [1])]

We group a list of numbers depending on whether they are greater than 2 or not. groupby reads the numbers one by one, and whenever the key function gt2 changes its value it creates a new group. groupby returns an iterator over the keys and groups it produces, and each group itself is an iterator over its elements. In the example above we use list to convert each group iterator to a list and then use a list comprehension to print the keys and groups.

Note that groupby almost does what we need to read the records in our FASTA file. We can group the lines and start a new group whenever we encounter a header line. Since we don't need the result of the key function we can simply write:

chunks = (list(g) for _, g in groupby(lines, isheader))

Each chunk is a list of lines, containing either all header lines or all sequence lines. For instance, if we print the chunks

for chunk in chunks:
    print chunk

we get the following output (sequences are shortened for readability):

[';LCBO - Prolactin precursor', '; a sample sequence in FASTA format']
['MDS...LSS', 'EMF...YHL', 'VTE...DED', 'ARY...NNC*']
['>MCHU - Calmodulin']
['ADQ...GTID', 'FPE...REA', 'DID...TAK*']
['>gi|5524211|gb|AAD44166.1']
['LCL...NLV', 'EWI...DFLG', 'LLI...IVIL', 'GLM...AGX', 'IENY']

What is left to do is to bring the header lines and their sequence lines together, and there is a very neat trick to achieve this. Remember this one! As you know, zip zips iterables:

>>> zip([1, 2, 3], 'abc')
[(1, 'a'), (2, 'b'), (3, 'c')]

Under Python 2.x zip returns a list and under Python 3.x it returns an iterator; izip under Python 2.x does the same as zip under Python 3.x and returns an iterator. More importantly, however, zip and izip operate on iterators, and here comes the trick: if we want to group the elements of an iterable, let's say in tuples of size 2, we can pass the same iterator twice and zip will do the job:

>>> numbers = iter([1, 2, 3, 4, 5, 6])
>>> zip(numbers, numbers)
[(1, 2), (3, 4), (5, 6)]

Note that we wrap numbers into an iterator by calling iter(). Without that, the result would be different:

>>> numbers = [1, 2, 3, 4, 5, 6]
>>> zip(numbers, numbers)
[(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)]

What's going on here? zip reads an element from numbers and then reads again to build the first tuple. Since numbers is an iterator, the element pointer progresses and we get (1, 2). We could equally easily group numbers in triples:

>>> numbers = iter([1, 2, 3, 4, 5, 6])
>>> zip(numbers, numbers, numbers)
[(1, 2, 3), (4, 5, 6)]

Obviously, this also works for our FASTA file reading. We can group the chunks of header or sequence lines in tuples of size 2 to bring header and sequence together. We use izip because we don't want to collect all sequences in memory. So the fourth line of

seqs = ((h[0], ''.join(s)) for h,s in izip(chunks, chunks))

the itertools solution above means that we group header lists and sequence lists in tuples h, s. But we need only the first header line and the concatenation of the sequence lines and therefore return (h[0], ''.join(s)) in the generator expression.
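For Python 3 readers: izip is gone there, but the built-in zip is already lazy, so the same four-line pipeline carries over. A sketch, assuming the same 'fasta.txt' layout as above:

```python
from itertools import groupby

def read_seqs(fname='fasta.txt'):
    with open(fname) as f:
        # strip newlines and skip blank lines
        lines = (l.strip() for l in f if l.strip())
        isheader = lambda line: line[0] in {'>', ';'}
        # one chunk per run of header lines or sequence lines
        chunks = (list(g) for _, g in groupby(lines, isheader))
        # zipping the same iterator with itself pairs header chunks with sequence chunks
        yield from ((h[0], ''.join(s)) for h, s in zip(chunks, chunks))
```

As before, for h, s in read_seqs(): ... iterates over (header, sequence) tuples without loading the whole file into memory.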

There you go! A much simpler implementation to read FASTA files and other line-structured files. Can we do even better? Using a library for data flows such as nuts-flow leads to even shorter and more readable code:

from nutsflow import *
from nutsflow import _

isheader = lambda line: line[0] in {'>', ';'}
seqs = (open('fasta.txt') >> Map(str.strip) >> Filter(_) >> 
        ChunkBy(isheader) >> Chunk(2) >> MapCol(0,next) >> MapCol(1,''.join))

The >> operator in nuts-flow indicates the chaining of iterators and defines the data flow. Here we open the file, map the string stripping function on each line to remove white spaces and then filter out empty lines. The underscore in Filter(_) functions as an anonymous variable here. Similar to Python's groupby, ChunkBy groups the lines according to the isheader predicate. We then call Chunk(2) to bring header and sequence lines together in tuples of the form (header, sequence) -- no need for the zip-trick we used before. Finally, we map next() on column 0 and join on column 1 of these tuples to extract the first header line and to concatenate the sequence lines.

That's it. Can't get better than that ;)

Tuesday, May 2, 2017

Fun with itertools in Python

Let's say we are given a file 'numbers1.txt' that contains a header, numbers and blank lines, and the task is to sum up all numbers in the file. Here is an example file



If the file can be loaded into memory this is easy

>>> lines = list(open('numbers1.txt'))
>>> sum(int(l) for l in lines[1:] if l.strip())
10

open() returns an iterator over the lines in the file, which we collect in a list. Alternatively we could have used open().readlines(). We call lines[1:] to skip the header line and test for non-empty lines with if l.strip(). A sum over a list comprehension finally returns the wanted result. Note that for this simple example we don't bother with closing the file, which you should do for robust code.

If the file is too big to be loaded into memory, things get a bit ugly. open() returns an iterator, which is good, but we need to skip the first header line and cannot apply list slicing to an iterator. However, we can advance the iterator by calling next() to solve this issue

>>> lines = open('numbers1.txt')
>>> next(lines)
>>> sum(int(l) for l in lines if l.strip())
10

But what if there are multiple header lines and multiple files we want to sum over? Let's implement this in plain Python and then use itertools to solve this more elegantly.

from glob import glob

def filesum(fname, skip=1):   # sum numbers in file
    lines = open(fname)
    for _ in xrange(skip):    # skip header lines
        next(lines)
    return sum(int(l) for l in lines if l.strip())

sum(filesum(fname) for fname in glob('numbers*.txt'))

As you can see, things get a tad ugly, especially with respect to skipping multiple header lines. We are going to use islice, ifilter, and imap from the itertools library to clean things up. Note that the names of most functions in itertools start with an 'i' to indicate that they return iterators and do not collect results in memory.

Similar to list slicing, islice() takes a slice out of an iterator. Here we use it to skip one header line

>>> from itertools import islice
>>> list(islice(open('numbers1.txt'), 1, None))
['1\n', '2\n', '\n', '3\n', '4']

where islice(iterable, start, stop[, step]) takes an iterable as input and, when called with None for the stop parameter, skips the first start elements and returns the rest. Since islice returns an iterator, we wrap it in a list to display the result.
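As an aside, islice also accepts a step argument, just like list slicing; a quick illustration on a range instead of a file:

```python
from itertools import islice

# every third element, starting at index 2
every_third = list(islice(range(10), 2, None, 3))
print(every_third)  # [2, 5, 8]
```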

The next step is to filter out the blank lines and we employ ifilter() for this purpose

>>> from itertools import islice, ifilter

>>> lines = islice(open('numbers1.txt'), 1, None)
>>> list(ifilter(lambda l: l.strip(), lines))
['1\n', '2\n', '3\n', '4']

In the next step we convert the number strings into integers by mapping the int function on the lines of the file using imap()

>>> from itertools import islice, ifilter, imap

>>> alllines = islice(open('numbers1.txt'), 1, None)
>>> numlines = ifilter(lambda l: l.strip(), alllines)
>>> list(imap(int, numlines))
[1, 2, 3, 4]

Note that imap, in contrast to map, returns an iterator in Python 2, while in Python 3 map is equivalent to imap. Having an iterator over integer numbers, we can now sum them up

>>> from itertools import islice, ifilter, imap
>>> SKIP = 1
>>> alllines = islice(open('numbers1.txt'), SKIP, None)
>>> numlines = ifilter(lambda l: l.strip(), alllines)
>>> sum(imap(int, numlines))
10

What is left to do is to aggregate over multiple files, and similarly to our first implementation we define a separate function filesum for this purpose

from glob import glob
from itertools import islice, ifilter, imap, chain

def filesum(fname, skip=1):
    isnumber = lambda l: l.strip()
    lines = islice(open(fname), skip, None)
    return sum(imap(int, ifilter(isnumber, lines)))

sum(imap(filesum, glob('numbers*.txt')))
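For Python 3, where ifilter and imap are gone because the built-in filter and map are already lazy, the same pipeline can be sketched with built-ins only (the same hypothetical 'numbers*.txt' files are assumed):

```python
from glob import glob
from itertools import islice

def filesum(fname, skip=1):
    # skip header lines, drop blank lines, then convert and sum lazily
    with open(fname) as f:
        lines = islice(f, skip, None)
        return sum(map(int, filter(str.strip, lines)))

total = sum(map(filesum, glob('numbers*.txt')))
```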

If you compare both implementations, I hope you will find the itertools-based code a bit more readable and elegant than the plain Python code presented at the start. But still the code seems overly complex for such a simple problem. An alternative to itertools is nuts-flow, which allows implementing data flows, such as the one above, much more easily. For instance, to sum up the numbers within the file we could simply write

>>> from nutsflow import Drop, Map, Sum, nut_filter
>>> IsNumber = nut_filter(lambda l: l.strip())

>>> open('numbers1.txt') >> Drop(1) >> IsNumber() >> Map(int) >> Sum()
10

where Drop(1) skips the header line, IsNumber keeps the non-empty lines (the numbers), Map(int) converts them to integers and Sum adds them up. The explicit data flow and linear sequence of processing steps render this code considerably more readable, and it can easily be extended to operate over multiple files

>>> from glob import glob
>>> from nutsflow import Drop, Map, Sum, nut_filter

>>> IsNumber = nut_filter(lambda l: l.strip())
>>> filesum = lambda f: open(f) >> Drop(1) >> IsNumber() >> Map(int) >> Sum()
>>> glob('numbers*.txt') >> Map(filesum) >> Sum()

More about nuts-flow in a follow-up post. That's it for today ;)

Monday, May 1, 2017

lambda functions in Python

The most common (and recommended) way to implement a function in Python is via the def keyword and a function name. For instance

def add(a, b): return a + b 

is a function that adds two numbers and can be called as

>>> add(1, 2)
3

We can (re)define the same function as a lambda function

add = lambda a, b: a + b

and it can be invoked with exactly the same syntax as before

>>> add(1, 2)
3

However, we can also define and invoke a lambda function anonymously (without giving it a name). Here are some examples

>>> (lambda a, b: a + b)(1, 2)      # add
>>> (lambda a: a * 2)(2)            # times 2
>>> (lambda a: sum(a))([1, 2, 3])   # sum of list
>>> (lambda *a: min(a))(1, 2, 3)    # min of list

The syntactic differences between conventional function definitions via def and lambda functions are that lambda functions have no parentheses around their arguments, don't use the return statement, have no name and must be implemented as a single expression.

So, what are they good for? Fundamentally there is no need for lambda functions and we could happily live without them, but they make code shorter and more readable when a simple function is needed only once. The most common use case is as a key function for sorting:

>>> names = ['Fred', 'Anna', 'John']
>>> sorted(names)                        # alphabetically
['Anna', 'Fred', 'John']

>>> sorted(names, key=lambda n: n[1])    # second letter
['Anna', 'John', 'Fred']

Key functions come in especially handy when sorting collections of data structures

>>> contacts = [('Fred', 18), ('Anna', 22)]

>>> sorted(contacts)                               # name then age
[('Anna', 22), ('Fred', 18)]

>>> sorted(contacts, key=lambda (name, _): name)   # name
[('Anna', 22), ('Fred', 18)]

>>> sorted(contacts, key=lambda (_, age): age)     # age
[('Fred', 18), ('Anna', 22)]

>>> sorted(contacts, key=lambda (n, a): (a, n))    # age then name
[('Fred', 18), ('Anna', 22)]

Note the parentheses in lambda (name, _) and similar calls, which result in tuple unpacking (a Python 2 feature removed in Python 3, see PEP 3113) and a very readable sorting implementation by name or age. We could have achieved the same by defining a conventional function

>>> def by_age((name, age)): return age
>>> sorted(contacts, key=by_age)
[('Fred', 18), ('Anna', 22)]

which is also very readable but longer, and it pollutes the namespace with a by_age function that is needed only once. Note that lambda functions are not needed if we already have a function! For instance

>>> names = ['Fred', 'Anna', 'John']

>>> sorted(names, key=lambda n: len(n))    # don't do this
['Fred', 'Anna', 'John']

>>> sorted(names, key=len)                 # simply do this
['Fred', 'Anna', 'John']
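If you are on Python 3, where tuple parameters in lambdas no longer exist, operator.itemgetter is the idiomatic way to get the same readable field-based sorting:

```python
from operator import itemgetter

contacts = [('Fred', 18), ('Anna', 22)]
print(sorted(contacts, key=itemgetter(1)))     # sort by age
print(sorted(contacts, key=itemgetter(1, 0)))  # sort by age, then name
```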

An elegant example of the use of lambda functions is the implementation of argmax, which returns the index of the largest element in a list (and the element itself)

>>> argmax = lambda ns: max(enumerate(ns), key=lambda (i,n): n)
>>> argmax([2, 3, 1])
(1, 3)

In Python 3 we can use the following implementation instead, which is also short but less readable

>>> argmax = lambda ns: max(enumerate(ns), key=lambda t: t[1])
>>> argmax([2, 3, 1])
(1, 3)

While applications for lambda functions in typical Python code are limited, they occur frequently in APIs for graphical user interfaces, e.g. to react to button presses, and in code that employs a functional programming style, using filter, map, reduce or similar functions that take functions as arguments. For instance, the accumulative difference of the positive numbers provided as strings could be computed (in functional style) as follows

>>> from operator import sub  
>>> numbers = ['10', '-5', '3', '2']  # we want to compute 10-3-2
>>> reduce(sub, filter(lambda x: x > 0, map(int, numbers)))
5

However, list comprehension and generators are often more readable and recommended. Here the same computation using a generator expression

>>> reduce(sub, (int(n) for n in numbers if int(n) > 0))
5

Note, however, that we have to convert the strings to integers twice and that there is no elegant way to compute the accumulative difference without using reduce, which is no longer a built-in in Python 3 :( So in plain, boring, imperative Python we would have to write

numbers = ['10', '-5', '3', '2']
diff = None
for nstr in numbers:
    n = int(nstr)
    if n <= 0:
        continue
    diff = n if diff is None else diff - n

Pay special attention to the ugly edge case of the first element when accumulating the differences: we need to initialize diff with a marker value, here None, and then test for it in the last line. Ugly! The functional implementation is clearly more readable and shorter in this case (and for similar problems).
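For the record, the functional version survives in Python 3 once reduce is imported from functools; a sketch:

```python
from functools import reduce
from operator import sub

numbers = ['10', '-5', '3', '2']
ints = (int(n) for n in numbers)                # convert each string once, lazily
diff = reduce(sub, (n for n in ints if n > 0))  # 10 - 3 - 2
print(diff)  # 5
```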

Luckily the functools library comes to the rescue and brings back reduce. Also have a look at itertools, which provides many useful functions for functional programming. Similarly, there is nuts-flow, which allows even more elegant code. Here is the accumulative difference using nuts-flow

>>> from operator import sub 
>>> from nutsflow import _, Reduce, Map, Filter

>>> numbers = ['10', '-5', '3', '2']
>>> numbers >> Map(int) >> Filter(_ > 0) >> Reduce(sub)
5

That's it. Have fun ;)

Saturday, November 7, 2015

Dynamic programming: a systematic approach

As Wikipedia says: dynamic programming is a method for solving a complex problem by breaking it down into a collection of simpler sub-problems, solving each of those sub-problems just once, and storing their solutions.

Apart from being a frequent topic in coding interviews, dynamic programming allows us to answer seemingly very difficult questions efficiently and with little code. However, the tutorials I read said little about a systematic approach to solving dynamic programming problems. Here, I outline the steps and apply them to an example problem:


  1. Define a function F that measures the quality of a solution
  2. Find the problem state S that function F(S) depends on
  3. Define a recursive formula for F(S) from simpler states
  4. Implement the algorithm
    1. Top down: recursive with memoization
    2. Bottom up: iterative with tabulation
  5. Keep track of the states that resulted in the optimal solution


As an example we will solve the following, common interview question via dynamic programming:

Find the minimum number of coins with values 1, 3 or 5 that sum up to 11.

This question appears in various disguises but always boils down to finding the minimum number of values that sum up to some target value.

Step 1

"Define a function F that measures the quality of a solution". The problem states that the minimum number is to be found. The quality of a solution is therefore described by the number of coins required (the lower the better):

  F = number of coins

For instance, we can sum up 3+3+3+1+1 (5 coins), or 5+5+1 (3 coins) to achieve 11. The latter solution with 3 coins is obviously the better (actually the optimal) one.

Step 2

The second step is: "Find the problem state S that function F(S) depends on". We are looking for a state that F depends upon. In this case the number of coins obviously depends on the target value. For example, if the target value is 1, then only one coin is needed. If the value is 2, then we need two coins (1+1); if the value is 3, then one coin is sufficient again, and so on.

  S = target value (0 to 11)

Step 3

The third step is the hardest: "Define a recursive formula for F(S) from simpler states". What is a simpler state? Here, the smaller the target value the easier the solution. Since we need a recursive formula we always pick the simplest possible state first (target value is zero) and make that the exit/start case for our recursive formula. If the target value is zero (S = 0) then we need no coins at all:

  F(0) = 0

Next we need to compute F(S) from smaller states S. Let's write down some intermediate formula for that:

  F(S) = 1 + F(S_smaller)   with S_smaller < S

We compute F(S) (= number of coins for target value S) by taking F(S_smaller) for some smaller state and adding one coin (remember that F is the number of coins, not the target value). Which coin that is we don't know yet, but we know we will need one extra coin to go from S_smaller to S; therefore F(S) = 1 + F(S_smaller).

Let's say we have several smaller states (= target values) to choose from; which one do we pick? It should be the one that requires the smallest number of coins, because that way we add the least number of coins to F. Therefore we write:

  F(S) = 1 + min F(S_smaller)
           S_smaller < S

What are the smaller states we can produce? Given a target value S, we can produce a smaller one by taking away a coin (S_smaller = S-c). We have three different coins (1, 3, 5). We try all three of them and take the one that results in the smallest number of coins. We just have to make sure that S_smaller never becomes negative (S-c >= 0). With that, the complete recursive formula becomes:

  F(0) = 0
  F(S) = 1 + min F(S-c) 
           c in [1,3,5] 
           S-c >= 0

To summarize: If the target value is zero (S=0) we need no coins (F(0) = 0). Otherwise the number of coins F(S) is one coin plus the minimum number of coins (min F(S-c)) we find by taking one of the three coins (1,3,5) away and provided the resulting smaller target value does not become negative (S-c >= 0).

Step 4

From now on things become easy and the implementation is almost mechanical. The recursive formula can be translated directly into Python code:

def F(S):  
  if S == 0: return 0
  return 1 + min( F(S-c) for c in [1,3,5] if S-c >= 0 )

Note that the recursive solution above has exponential complexity and is not yet a dynamic programming solution, which we will implement next.

Step 4.1

For the top-down dynamic programming algorithm we only need to add memoization to the recursive implementation. This is achieved by storing already computed results in a cache and using them when available:

fs = {0: 0}   # cache
def F(S):
  if S in fs: return fs[S]
  fs[S] = 1 + min( F(S-c) for c in [1,3,5] if c <= S )
  return fs[S]

Note that we can fold the exit condition of the recursion into the initial value of the cache. I also rewrote S-c >= 0 as the equivalent c <= S, because it is a bit shorter.
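As a side note, in Python 3 the cache dictionary can be replaced by functools.lru_cache, which memoizes automatically (available from Python 3.2 on); a sketch:

```python
from functools import lru_cache

@lru_cache(maxsize=None)   # the cache replaces the hand-written fs dictionary
def F(S):
    if S == 0:
        return 0
    return 1 + min(F(S - c) for c in [1, 3, 5] if c <= S)

print(F(11))  # 3 coins: 5 + 5 + 1
```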

Step 4.2

The bottom up solution with tabulation can easily be derived by allocating the cache beforehand and filling it iteratively from the start (bottom up):

S_max = 11
F = [0]*(S_max+1)  #tabulation == cache
for S in range(1, len(F)):
  F[S] = 1 + min( F[S-c] for c in [1,3,5] if c <= S)  

Since F[S] is computed from smaller states, we know that F[S-c] is already filled when we need it. The starting point F[0] = 0 comes for free, since the array is initialized with zeros.

As for the computational complexity: if you look at the code, you see that there is a single loop over the length of array F, which itself depends on the target value S. The computation of min(F[S-c]) is constant for a fixed number of coins, and therefore we have linear time complexity.

Step 5

The algorithms above allow us to compute the minimum number of coins that sum up to a given target value. But often we also want to know which coins were used. In general, this is achieved by keeping track of the optimal intermediate states and their coins, and then tracing back from the end state.

The following code adds tracing to the recursive, top-down solution. At each step, we store the coin c_min that was used to produce an optimal solution f_min for state S in a dictionary cs. The function coins() recursively traces back from the end state (S=11) and returns a list of coins:

fs = {0:0}
cs = {0:0}
def F(S):
  if S in fs: return fs[S]
  f_min, c_min = min( (F(S-c), c) for c in [1,3,5] if c <= S )
  fs[S], cs[S] = 1 + f_min, c_min
  return fs[S]
def coins(S): return [cs[S]] + coins(S-cs[S]) if S else []    
print("minimum number:", F(11))
print("coins", coins(11))

If you find coins(S) confusing, this more explicit but longer version that prints out the coins may help your understanding:

def coins(S):
  if S > 0:
    c = cs[S]       # coin used in the optimal solution for state S
    print(c)
    coins(S - c)    # trace back with the reduced target value

To complete the example let us also add tracing to the iterative solution. It is essentially identical to the recursive algorithm but we store the states and coins in an array instead of a dictionary.

S_max = 11
F = [0]*(S_max+1)
cs = [0]*(S_max+1)
for S in range(1, len(F)):
  f_min,c_min = min( (F[S-c],c) for c in [1,3,5] if c <= S)
  F[S], cs[S] = 1 + f_min, c_min
def coins(S): return [cs[S]] + coins(S-cs[S]) if S else []    
print("minimum number:", F[S_max])
print("coins", coins(S_max))

As you can see, dynamic programming can produce very elegant and short algorithms for complex problems. However, two conditions must hold: overlapping sub-problems and optimal substructure. The overlapping sub-problems condition is easy to understand. Let us look at the recursive algorithm without memoization again and print out the F(S) values it computes for a smaller target value of 5:

def F(S):
  print("F(%d)"%S, end=" ")
  if S == 0: return 0
  return 1 + min( F(S-c) for c in [1,3,5] if S-c >= 0 )


The printout shows that some F(S) are computed multiple times. For instance, F(1) is calculated three times:

>>>F(5) F(4) F(3) F(2) F(1) F(0) F(0) F(1) F(0) F(2) F(1) F(0) F(0) 

Memoization or tabulation stores already computed solutions and so avoids repeated computation of the same result. If, however, each result is unique and never used again, the recursive algorithm is as efficient as the dynamic programming approach, because caching results saves nothing. For instance, if there is only one coin value, no result is computed repeatedly:

def F(S):
  print("F(%d)"%S, end=" ")
  if S == 0: return 0
  return 1 + min( F(S-c) for c in [1] if S-c >= 0 )

>>>F(5) F(4) F(3) F(2) F(1) F(0) 

Overlapping sub-problems are required, and the larger the overlap, the more efficient dynamic programming via memoization or tabulation becomes compared to a recursive or brute-force solution.

The second condition of an optimal substructure requires that an optimal solution computed for a smaller state can be used to derive an optimal solution of a bigger state. Otherwise, we cannot come up with a recursive formula that allows us to compute F(S) from smaller states.

A final note: the time and space complexities of the top-down and bottom-up algorithms are the same (linear for the coins problem). The top-down (recursive) algorithm is easier to derive from the recursive formula, but the iterative (bottom-up) solution is a little shorter and maybe easier to understand.

Sunday, October 11, 2015

Interview questions for Data Scientists

In preparation for a coding interview I collected numerous coding questions for Data Scientists. The questions are not sorted and I have done little to clean them up. Still, they might be useful for others:

What is: lift, KPI, robustness, model fitting, design of experiments, 80/20 rule?

What is: collaborative filtering, n-grams, map reduce, cosine distance?

How to optimize a web crawler to run much faster, extract better information, and better summarize data to produce cleaner databases?

How would you come up with a solution to identify plagiarism?

How to detect individual paid accounts shared by multiple users?

Should click data be handled in real time? Why? In which contexts?

What is better: good data or good models? And how do you define "good"? Is there a universal good model? Are there any models that are definitely not so good?

What is probabilistic merging (AKA fuzzy merging)? Is it easier to handle with SQL or other languages? Which languages would you choose for semi-structured text data reconciliation?

How do you handle missing data? What imputation techniques do you recommend?

Compare SAS, R, Python, Perl

What is the curse of big data?

You are about to send one million emails (marketing campaign). How do you optimize delivery? How do you optimize response? Can you optimize both separately? (answer: not really)

How would you turn unstructured data into structured data? Is it really necessary? Is it OK to store data as flat text files rather than in an SQL-powered RDBMS?

What are hash table collisions? How are they avoided? How frequently do they happen?

How to make sure a mapreduce application has good load balance? What is load balance?

Examples where mapreduce does not work? Examples where it works very well?

Is it better to have 100 small hash tables or one big hash table, in memory, in terms of access speed (assuming both fit within RAM)? What do you think about in-database analytics?

Why is naive Bayes so bad? How would you improve a spam detection algorithm that uses naive Bayes?

What is star schema? Lookup tables?

Define: quality assurance, six sigma, design of experiments. Give examples of good and bad designs of experiments.

What are the drawbacks of general linear model? Are you familiar with alternatives (Lasso, ridge regression, boosted trees)?

Do you think 50 small decision trees are better than a large one? Why?

Give examples of data that does not have a Gaussian distribution, nor log-normal. Give examples of data that has a very chaotic distribution?

Why is mean square error a bad measure of model performance? What would you suggest instead?

How can you prove that one improvement you've brought to an algorithm is really an improvement over not doing anything? Are you familiar with A/B testing?

What is sensitivity analysis? Is it better to have low sensitivity (that is, great robustness) and low predictive power, or the other way around? How to perform good cross-validation? What do you think about the idea of injecting noise in your data set to test the sensitivity of your models?

Compare logistic regression w. decision trees, neural networks. How have these technologies been vastly improved over the last 15 years?

Do you know / used data reduction techniques other than PCA? What do you think of step-wise regression? What kind of step-wise techniques are you familiar with? When is full data better than reduced data or sample?

How would you build non parametric confidence intervals, e.g. for scores? (see the AnalyticBridge theorem)

Are you familiar either with extreme value theory, monte carlo simulations or mathematical statistics (or anything else) to correctly estimate the chance of a very rare event?

What is root cause analysis? How to identify a cause vs. a correlation? Give examples.

How would you define and measure the predictive power of a metric?

How to detect the best rule set for a fraud detection scoring technology? How do you deal with rule redundancy, rule discovery, and the combinatorial nature of the problem (for finding optimum rule set - the one with best predictive power)? Can an approximate solution to the rule set problem be OK? How would you find an OK approximate solution? How would you decide it is good enough and stop looking for a better one?

What is POC (proof of concept)?

Is it better to have too many false positives, or too many false negatives?

Are you familiar with pricing optimization, price elasticity, inventory management, competitive intelligence? Give examples.

How does Zillow's algorithm work (to estimate the value of any home in the US)?

Have you used time series models? Cross-correlations with time lags? Correlograms? Spectral analysis? Signal processing and filtering techniques? In which context?

What is an efficiency curve? What are its drawbacks, and how can they be overcome?

What is a recommendation engine? How does it work?

What is an exact test? How and when can simulations help us when we do not use an exact test?

What is the computational complexity of a good, fast clustering algorithm? What is a good clustering algorithm? How do you determine the number of clusters? How would you perform clustering on one million unique keywords, assuming you have 10 million data points - each one consisting of two keywords, and a metric measuring how similar these two keywords are? How would you create this 10 million data points table in the first place?

Give a few examples of "best practices" in data science.

You design a robust non-parametric statistic (metric) to replace correlation or R square, that (1) is independent of sample size, (2) always between -1 and +1, and (3) based on rank statistics. How do you normalize for sample size? Write an algorithm that computes all permutations of n elements. How do you sample permutations (that is, generate tons of random permutations) when n is large, to estimate the asymptotic distribution for your newly created metric? You may use this asymptotic distribution for normalizing your metric. Do you think that an exact theoretical distribution might exist, and therefore, we should find it, and use it rather than wasting our time trying to estimate the asymptotic distribution using simulations?

More difficult, technical question related to previous one. There is an obvious one-to-one correspondence between permutations of n elements and integers between 1 and n! Design an algorithm that encodes an integer less than n! as a permutation of n elements. What would be the reverse algorithm, used to decode a permutation and transform it back into a number? Hint: An intermediate step is to use the factorial number system representation of an integer. Feel free to check this reference online to answer the question. Even better, feel free to browse the web to find the full answer to the question (this will test the candidate's ability to quickly search online and find a solution to a problem without spending hours reinventing the wheel).

You're about to get on a plane to Seattle. You want to know if you should bring an umbrella. You call 3 random friends of yours who live there and ask each independently if it's raining. Each of your friends has a 2/3 chance of telling you the truth and a 1/3 chance of messing with you by lying. All 3 friends tell you that "Yes" it is raining. What is the probability that it's actually raining in Seattle?
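The Seattle question is a direct application of Bayes' rule. A sketch in Python, assuming a uniform prior of 1/2 for rain (the question does not state a prior, so this is an assumption worth flagging in an interview):

```python
from fractions import Fraction

def p_rain_given_yes(prior, p_truth=Fraction(2, 3), n_friends=3):
    """Posterior P(rain) after n_friends independent 'yes' answers."""
    p_lie = 1 - p_truth
    p_yes_if_rain = p_truth ** n_friends   # all three tell the truth
    p_yes_if_dry = p_lie ** n_friends      # all three lie
    return (prior * p_yes_if_rain) / (
        prior * p_yes_if_rain + (1 - prior) * p_yes_if_dry)

print(p_rain_given_yes(Fraction(1, 2)))  # 8/9
```

With prior 1/2 the posterior is 8/9; with a more realistic prior for rain in Seattle the number changes, which is the point the interviewer usually wants you to notice.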

Find the second largest element in a Binary Search Tree
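One way to sketch the BST question: the largest value sits in the rightmost node, and the second largest is either the maximum of that node's left subtree or its parent. A minimal Python version (the `Node` class is my own scaffolding, and a tree with at least two nodes is assumed):

```python
class Node:
    def __init__(self, val, left=None, right=None):
        self.val, self.left, self.right = val, left, right

def second_largest(root):
    parent, node = None, root
    while node.right:              # walk down to the largest (rightmost) node
        parent, node = node, node.right
    if node.left:                  # rightmost has a left subtree:
        node = node.left           # its maximum is the second largest
        while node.right:
            node = node.right
        return node.val
    return parent.val              # otherwise it's the rightmost node's parent

tree = Node(5, Node(3), Node(8, Node(7), Node(9)))
print(second_largest(tree))  # 8
```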

Write a function that takes in two sorted lists and outputs a sorted list that is their union
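A possible sketch of the merge using the classic two-pointer technique, in O(len(a) + len(b)); if "union" is meant set-wise, a deduplication step would be added:

```python
def merge_sorted(a, b):
    """Merge two sorted lists into one sorted list in linear time."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:])   # at most one of these two
    out.extend(b[j:])   # extends is non-empty
    return out

print(merge_sorted([1, 3, 5], [2, 3, 6]))  # [1, 2, 3, 3, 5, 6]
```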

Generating a sorted vector from two sorted vectors

How many people must be gathered together in a room before there is a greater than 50/50 chance that at least two of them share a birthday?

How do you take millions of users with hundreds of transactions each, amongst tens of thousands of products, and group the users together into meaningful segments?

How do you know if one algorithm is better than another?

You are compiling a report for user content uploaded every month and notice a spike in uploads in October. In particular, a spike in picture uploads. What might you think is the cause of this, and how would you test it?

Given n samples from a uniform distribution [0, d], how do you estimate d?

(The interviewer drew a hypothetical histogram of the number of purchases per user.) There are a lot of users that make a small number of purchases, and few users that make a large number of purchases, so the histogram is peaked near zero with a long tail to the right. Based on this, what do you expect the plot of average revenue per user to look like?

There are 25 horses. You can race any 5 of them at once, and all you get is the order they finished. How many races would you need to find the 3 fastest horses?

Dynamic programming/backward induction on a multi-stage decision making problem

Imagine you have N pieces of rope in a bucket. You reach in and grab one end-piece, then reach in and grab another end-piece, and tie those two together. What is the expected value of the number of loops in the bucket?
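The rope puzzle has a neat closed form: when k ropes remain there are 2k free ends, and the next tie closes a loop with probability 1/(2k-1), so the expected number of loops is the sum of 1/(2k-1). A sketch that computes the formula and checks it by simulation (the simulation setup is my own construction):

```python
import random
from fractions import Fraction

def expected_loops(n):
    """E[#loops] = sum over k of 1/(2k-1), one term per tie."""
    return sum(Fraction(1, 2 * k - 1) for k in range(1, n + 1))

def simulate(n, trials=4000, seed=42):
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        ends = list(range(2 * n))
        partner = {2 * i: 2 * i + 1 for i in range(n)}
        partner.update({b: a for a, b in partner.items()})
        loops = 0
        while ends:
            a = ends.pop(rng.randrange(len(ends)))
            b = ends.pop(rng.randrange(len(ends)))
            if partner[a] == b:
                loops += 1                 # tied both ends of one (chained) rope
            else:                          # two ropes fuse into a longer one
                pa, pb = partner[a], partner[b]
                partner[pa], partner[pb] = pb, pa
        total += loops
    return total / trials

print(float(expected_loops(3)))  # 1 + 1/3 + 1/5 ≈ 1.533
```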

How do you test whether a new credit risk scoring model works? What data would you look at?

Given a set of specifications, make a program that would find if a given list of words was in a provided grid of letters

How would you sort a billion integers? Merge sort algorithm. Palindrome test using recursion. Statistics: A/B testing process.

Big-data-related questions: the Hadoop system, top-k record selection, external merge sort. Very basic Java questions: anonymous classes, implement some function.

How would you classify a group of claim adjuster write-ups to detect fraud?

Find the median of a large dataset

Given a particular data set, suggest three algorithms, one regression, one clustering, one classification

If we were testing product X, what metrics would you look at to determine if it is a success

You're at a casino with two dice; if you roll a 5 you win and get paid $10. What is your expected payout? If you play until you win (however long that takes) and then stop, what is your expected payout?
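A quick sanity check of the dice question, assuming "roll a 5" means the two dice sum to 5 (an interpretation, since the question is ambiguous):

```python
from fractions import Fraction

# outcomes of two dice whose faces sum to 5: (1,4), (2,3), (3,2), (4,1)
wins = sum(1 for a in range(1, 7) for b in range(1, 7) if a + b == 5)
p_win = Fraction(wins, 36)
print(p_win, float(10 * p_win))  # 1/9, about $1.11 expected payout per game

# Playing until you win, the payout is always exactly $10, so the expected
# payout is $10 (only the expected number of games, 1/p = 9, is random).
```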

In a social network, calculate the minimum number of steps between any two nodes

How would you test if survey responses were filled at random by certain individuals, as opposed to truthful selections

How would you build and test a metric to compare two users' ranked lists of movie/TV show preferences?

Write a SQL query to compute a frequency table of a certain attribute involving two joins. What if you want to GROUP or ORDER BY some attribute? What changes would you need to make? How would you account for NULLs?

Given two binary strings, write a function that adds them . You are not allowed to use any built in string to int conversions or parsing tools. E.g. Given "100" and "111" you should return "1011". What is the time and space complexity of your algorithm?
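A possible sketch of the binary-string addition with manual carry propagation, O(max(len(a), len(b))) in both time and space:

```python
def add_binary(a, b):
    """Add two binary strings without any int conversion."""
    i, j, carry, out = len(a) - 1, len(b) - 1, 0, []
    while i >= 0 or j >= 0 or carry:
        total = carry
        if i >= 0:
            total += a[i] == '1'; i -= 1   # bool counts as 0 or 1
        if j >= 0:
            total += b[j] == '1'; j -= 1
        out.append('1' if total % 2 else '0')
        carry = total // 2
    return ''.join(reversed(out))

print(add_binary("100", "111"))  # 1011
```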

Let's say the population on Facebook clicks ads with a click-through rate of P. We select a sample of size N and examine the sample's conversion rate, denoted by hat{P}. What is the minimum sample size N such that Probability( ABS(hat{P} - P) < DELTA ) = 95%? In other words (this is my translation), find the minimum sample size N such that our sample estimate hat{P} is within DELTA of the true click-through rate P, with 95% confidence.

Return the largest palindrome from a very long string

The graph-based six-degrees-of-Bacon question is their favorite.

How many birthday posts occur on Facebook on a given day?

In the midst of presenting work I had done on time-series data using Python, I was asked how to fill in missing values using PostgreSQL commands.

How to determine the size of a heel from pictures using computer analysis

SQL merge joins; how will you apply machine learning to a business case? Explain the algorithm and why you chose it.

What goes into the calculation of test duration

Walk us through a complex analysis or experiment that you conducted

Extract data from SQL database and attempt to find differences between fraudulent and non-fraudulent internet purchases (unsupervised)

How would you get all the unique values from an attribute in R?

What is the probability of getting a 5 on throwing dice 7 times?

Difference between box plot and histogram?

What model would you use to categorize new observations from a training set?

How do you treat collinearity? How do you inspect missing data, and when is it important?

Explain what an ROC curve is

Did I use a Q-Q plot?

Given activity logs from the product, what features would you build to predict churn?

What is the GIL in Python?

Fit a model to data

Explain what target leakage is

Write a function that receives a string s and an integer k and returns the top k most frequent words in the string
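One way to sketch the top-k words question with the standard library (breaking frequency ties alphabetically is my own choice, since the question does not specify):

```python
from collections import Counter

def top_k_words(s, k):
    """Top k most frequent words; ties broken alphabetically."""
    counts = Counter(s.lower().split())
    ranked = sorted(counts.items(), key=lambda wc: (-wc[1], wc[0]))
    return [w for w, _ in ranked[:k]]

print(top_k_words("the cat and the dog and the bird", 2))  # ['the', 'and']
```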

There is a building with 100 floors and you are given 2 identical eggs. How do you use the 2 eggs to find the threshold floor N, such that the egg will definitely break when dropped from floor N or any floor above it?
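The classic answer to the egg question is 14 drops: drop the first egg from floor n, then n-1 floors higher, then n-2 higher, and scan the remaining gap linearly with the second egg, so the worst case is the smallest n with n(n+1)/2 >= 100. A sketch:

```python
def min_drops(floors):
    """Smallest n with n(n+1)/2 >= floors (worst-case drops with 2 eggs)."""
    n = 0
    while n * (n + 1) // 2 < floors:
        n += 1
    return n

print(min_drops(100))  # 14
```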

You randomly draw a coin from 100 coins - 1 unfair coin (head-head), 99 fair coins (head-tail) - and flip it 10 times. If the result is 10 heads, what's the probability that the coin is unfair?
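This is again Bayes' rule; a quick check with exact fractions:

```python
from fractions import Fraction

p_unfair = Fraction(1, 100)             # prior: 1 two-headed coin in 100
lik_unfair = Fraction(1)                # P(10 heads | two-headed) = 1
lik_fair = Fraction(1, 2) ** 10         # P(10 heads | fair) = 1/1024

posterior = p_unfair * lik_unfair / (
    p_unfair * lik_unfair + (1 - p_unfair) * lik_fair)
print(posterior, float(posterior))  # 1024/1123, about 0.912
```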

How to build a data structure for arithmetic expressions

Given a list of integers (positive & negative), write an algorithm to find whether there's at least a pair of integers that sum up to zero. How do you improve your algorithm? (re: Big-O notation)
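A possible sketch for the zero-sum pair: the brute-force answer checks all pairs in O(n^2); a hash set of seen values improves it to O(n) time at the cost of O(n) space:

```python
def has_zero_pair(nums):
    """True if some pair of elements sums to zero; O(n) time, O(n) space."""
    seen = set()
    for x in nums:
        if -x in seen:
            return True
        seen.add(x)
    return False

print(has_zero_pair([3, -1, 5, 1]))  # True  (pair: -1, 1)
print(has_zero_pair([3, 2, 5]))      # False
```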

Derive logistic regression

Given a list of words, find all the anagrams
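A common sketch for the anagram question: two words are anagrams exactly when their sorted letters match, so the sorted string works as a grouping key:

```python
from collections import defaultdict

def anagram_groups(words):
    """Group words that share the same sorted-letter key."""
    groups = defaultdict(list)
    for w in words:
        groups[''.join(sorted(w))].append(w)
    return [g for g in groups.values() if len(g) > 1]

print(anagram_groups(["listen", "silent", "enlist", "google", "banana"]))
# [['listen', 'silent', 'enlist']]
```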

Given a list of integers, find all combinations that sum to a given integer

Detailed Machine learning algorithms and pattern recognition methods with their objective functions

What is your favorite model and why?

Given a huge collection of books, how would you tag each book based on genre?

How would you build profiles for user searches on Google?

Write a sorting algorithm for a numerical dataset in Python?

Explain the SIFT (scale-invariant feature transform) workflow, Support Vector Machines, and Newton's optimization method.

How do you count a table of likes with timestamps?

What was the angle between the clock hands at 3:15?

List 3 types of unit root tests in addition to the augmented Dickey-Fuller test

Probabilities when rolling two dice twice with the second roll depending on the outcome of the first, the German tank problem, breaking a 4-digit code with different prior information, etc. In total: know your combinatorics and Bayes' rule inside out!

You own a museum and consider putting up an exhibition by a new artist. How would you go about the decision? What kind of information/data would you look for, and how would you use it? - You want to buy a TV. How do you make a decision? - You performed A/B testing on a change in a search engine algorithm. Although in B the coders made a mistake which led to really poor search results, users clicked on more results as well as on more ads. How do you interpret the results, and what do you do next?

How to compute an inverse matrix faster by playing around with some computational tricks

Hypothesis testing, p value, different classification algorithms, logistic regression vs neural network?

What metrics would you use to track whether Uber's strategy of using paid advertising to acquire customers works? How would you then figure out an acceptable cost of customer acquisition?

How would you go about assessing whether insider trading is occurring, using the data of your choice?

Removing missing values: what if it causes bias? What will you do then?

Find prime numbers less than 1 billion using map reduce framework.

How do you draw a uniform random sample from a circle in polar coordinates?
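The key point of the circle-sampling question is that drawing the radius uniformly over-samples the center, because the area inside radius r grows like r^2; taking the square root of a uniform variate corrects the density. A sketch:

```python
import math
import random

def sample_in_circle(radius=1.0, rng=random):
    """Uniform point inside a disk, returned in polar coordinates (r, theta)."""
    r = radius * math.sqrt(rng.random())   # sqrt corrects the radial density
    theta = 2 * math.pi * rng.random()
    return r, theta

random.seed(0)
pts = [sample_in_circle() for _ in range(10000)]
print(sum(r * r for r, _ in pts) / len(pts))  # close to E[r^2] = 1/2
```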

If there are 30 groups of servers, each with an uptime of 0.99, and you select 1 at random, what is the probability that at least one server in the group chosen will be up?

Case study: what features would you use to determine credit risk, given transaction history from the past two years? Explain a simple map-reduce problem. Read in a very large file of tab-delimited numbers using Python and count the frequency of each number (don't overthink this; use Python, done in a few lines). What's more likely: getting at least one six in 6 rolls, at least two sixes in 12 rolls, or at least 100 sixes in 600 rolls? Find all the combinations of change you can make for a given amount.

Newton's method to calculate the square root?
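A sketch of Newton's method applied to f(x) = x^2 - a, whose update rule is x_{n+1} = (x_n + a/x_n)/2 (assumes a >= 0):

```python
def newton_sqrt(a, tol=1e-12):
    """Iterate x <- (x + a/x) / 2 until x*x is within tol (relative) of a."""
    if a == 0:
        return 0.0
    x = a if a > 1 else 1.0   # any positive starting guess converges
    while abs(x * x - a) > tol * a:
        x = (x + a / x) / 2
    return x

print(newton_sqrt(2))  # 1.414213562...
```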

What do you know about A/B testing in the context of streaming?

1) Check if an integer is a palindrome (do not convert the integer to string) 2) Given 2 sorted arrays of integers, code to find a number from each array such that their sum are closest to some integer K? 3) What is the degree of freedom for lasso?

k-means clustering, linear classifiers, MySQL replication, stored procedures, the R language

Given the set a a b b a a c c b b of unknown length, write an algorithm that figures out which occurs most frequently and with the most continuous repetition.

How do you model a very low probability event?

Consider a game with 2 players, A and B. Player A has 8 stones, player B has 6. Game proceeds as follows. First, A rolls a fair 6-sided die, and the number on the die determines how many stones A takes over from B. Next, B rolls the same die, and the exact same thing happens in reverse. This concludes the round. Whoever has more stones at the end of the round wins and the game is over. If players end up with equal # of stones at the end of the round, it is a tie and another round ensues. What is the probability that B wins in 1, 2, ..., n rounds?

What is a statistical method to control the number of features for large sparse matrices?

How to put the same number of red and blue beans into two baskets, so if you randomly grab a basket you have a better chance to get red beans?

Write a Hadoop map/reduce job to compute the product of N matrices.

Given you have a regression with m variables and another n are available, would you want to include them? What would you do next?

How do you tell the difference between classifiers? What is the difference between Lasso and Ridge regression?

Sunday, February 1, 2015

Download Japanese radicals

I wanted to create a database of Japanese Kanji radicals and their readings/meanings. If you have no idea what I am talking about, this is the time to leave, because nothing of the following is likely to be of any interest to you ;)

Now, Wikipedia has a nice table of Japanese radicals but it is in HTML and the data is not in a directly accessible format. So, time for some Python-fu!

The code below downloads the Wikipedia page, extracts the data using regular expressions, fixes some of the broken table entries and writes a tab-separated file of the following format:

1 一 1 one いち 42
2 丨 1 line ぼう 21
3 丶 1 dot てん 10
4 丿 1 bend の 33
5 乙,乚 1 second,latter おつ 42
6 亅 1 hook はねぼう 19
7 二 2 two ふた 29
8 亠 2 lid なべぶた 38
213 龜 11 turtle,tortoise かめ 24
214 龠 17 flute やく 19

Note that this is some quick and dirty code that is not especially elegant or efficient but it does the job and if you ever need to download some radical data, here it is:

import re
import urllib.request

def load_page(url):
    request = urllib.request.Request(url)
    response = urllib.request.urlopen(request)
    return response.read().decode('utf-8')

def parse_rows():
    url = ''   # URL of the Wikipedia page with the radical table
    html = load_page(url)
    r_radical = r'<a.*?title="wikt:Index:Chinese radical/.">(.+?)'
    r_alt = r'</a>(.*?)</span></td>\n'
    r_strokes = r'<td>(.*?)</td>\n'
    r_meaning = r'<td>(.*?)'
    r_reading = r'<span.+?xml:lang="ja">(.*?)</span>.*?</span></td>\n'
    r_freq = r'<td>(.*?)</td>'
    regex = re.compile(r_radical+r_alt+r_strokes+r_meaning+r_reading+r_freq)
    for i,(radical,alt,strokes,meaning,reading,freq) in enumerate(regex.findall(html)):
        no = i+1
        alt = [ch for ch in set(alt) if ord(ch) > 256]
        if no == 125: alt = ['耂']   # fix for bad format of radical 125
        radicals = ','.join([radical]+alt)
        meaning = meaning.replace(', ',',').strip()
        if no == 80:   # fix for bad format of radical 80
            reading = 'なかれ;はは'
        if no == 132:   # fix for bad format of radical 132
            reading = meaning
            meaning = 'own'
        yield no,radicals,strokes,meaning,reading,freq

def write_rows(filepath):
    with open(filepath, mode='w', encoding='utf-8') as f:
        for row in parse_rows():
            f.write('\t'.join(str(field) for field in row) + '\n')

if __name__ == "__main__":
    write_rows('radicals.txt')   # output file; pick any path you like

Sorry for the mediocre formatting and the lack of syntax highlighting, but the Python plugin gets confused by regular expressions containing HTML code, so this is hand-formatted code. A nicer version of the code can be found in my learning blog.

Monday, January 19, 2015

Unicode output in Eclipse console

After much googling and trying I finally found the trivial (and obvious) solution on how to correctly display unicode characters in the Eclipse console window. One simply has to specify UTF-8 output in the Run Configuration (click on the image to enlarge it):

There is no special code (e.g. PrintWriter with UTF-8 format) required. Something like

System.out.println("\u00C6\u00D8\u00C5"); // Danish letters Æ Ø Å

will work just fine.