Thursday, June 13, 2013

The beauty of FP (and Scala)

I recently came across a coding problem that beautifully demonstrates the benefits of functional programming and - in this case - Scala. The following (Python) code shows a simplified version of the problem.

numbers1 = [1,2,3]
numbers2 = [1,2,3]
pairs = [(n1,n2) for n1 in numbers1 for n2 in number2]
processed = [process(n1,n2) for n1,n2 in pairs]
for p in processed: print p,  

As you can see it is trivial code that pairs the numbers in both lists, processes the pairs and prints out the results of the process function. Obviously, one could simply write

for n1 in numbers1:
  for n2 in numbers2:
    print process(n1,n2) 

and it would be better code but the previous example resembles the structure of the real, more complex code that has several additional processing steps.

To come back to the previous example, there are two possible problems: 1) let's assume it is not a few numbers but many 2) and that the process function is computationally expensive. For large number-lists memory consumption becomes an issue (since all pairs are stored in a list) and the processing will be time consuming.

Let's address the first problem first. We can easily replace list comprehensions by generators and would then iterate over number-pairs instead of storing them and their results. Easy.

pairs = ((n1,n2) for n1 in numbers1 for n2 in number2)
processed = (process(n1,n2) for n1,n2 in pairs)
for p in processed: print p,  

Problem two is harder. Let's assume the order of the results is of no importance and on a multi-core computer it would be nice to distribute the processing over the cores. But as far as I am aware there is no really elegant way to achieve this with Python. I'll show you soon what I mean by "elegant". Let's switch to Scala.

  
  val numbers1 = List(1,2,3)
  val numbers2 = List(1,2,3)
  val pairs = for(n1 <- numbers1; n2 <- numbers2) yield (n1,n2)
  val processed = pairs map { case (n1,n2) => process(n1,n2)}
  processed foreach (print)

As you can see the Scala code does the same as the first Python program. To change from list comprehensions to generators/iterators in the Scala code it is sufficient to transform the first list into an iterator (using toIterator).

  
  val numbers1 = List(1,2,3).toIterator
  val numbers2 = List(1,2,3)
  val pairs = for(n1 <- numbers1; n2 <- numbers2) yield (n1,n2)
  val processed = pairs map {case (n1,n2) => process(n1,n2)}
  processed foreach (print)

In the Python version we had to replace two list comprehensions by generators. In the Scala version we need to modify only one list. pairs and processed automatically become iterators because their inputs are iterators. In languages without generators this simple change would require substantial restructuring of the code.

Let's assume memory is not the problem but we want to speed things up. We simply add .par after pair to convert the list to a parallel collection and the processing is now performed concurrently. Beautiful!

    
  val numbers1 = List(1,2,3)
  val numbers2 = List(1,2,3)
  val pairs = for(n1 <- numbers1; n2 <- numbers2) yield (n1,n2)
  val processed = pairs.par map {case (n1,n2) => process(n1,n2)}
  processed foreach (print)

Can we have both - speed and low memory consumption? Not without effort. A parallel collection cannot be an iterator and vice versa. We could use futures but it wouldn't be trivial. However, this example shows how easy it is to tune functional code or parts of code to be memory efficient or fast/concurrent.

Tuesday, June 4, 2013

You shall not use tabs

There have been many debates whether spaces or tabs are better for indentation to structure code. There are good arguments for both sides but no clear winner, which indicates that in the end it doesn't really matter (The hotter the argument the less the value).

Where everyone agrees is that they should be used consistently and that the layout of code is important. Mixing spaces and tabs will lead to messed up layouts and will cause problems in languages where indentation has meaning, e.g. Python.

I have a slight preference for space over tabs for three reasons:

  • Spaces are absolute while tabs are relative. When switching editors the indentation does not change. The true intent/indentation is preserved.
  • Many IDEs support the automatic conversion from tabs to spaces but usually not the other way around.
  • Spaces constitute the lowest common denominator. Every editor must support spaces. Tabs are optional.

None of the above reasons is especially strong but if it has to be spaces or tabs, I vote for tabs.

ps. For more coding laws see here.

Sunday, June 2, 2013

The holy laws of coding

Well, here we go. My holy laws of coding. Let there be irritation, indignation and infuriation.

  1. You shall not use tabs
  2. You shall not use bad names
  3. You shall not ignore the idioms and style guide
  4. You shall not write more than 80 characters per line

  5. You shall not write more than 10 lines per function
  6. You shall not write more than 10 functions per class/module/file
  7. You shall not duplicate code

  8. You shall not use a file when you can use a data structure
  9. You shall not ignore the libraries
  10. You shall not miss to refactor your code

To help coding novices, I designed 10 "holy laws", which I think are useful guides to write "good" code (whatever THAT is). I prefer to call them laws because rules are too easily broken but they can and should be broken IF there is a good reason.

Why are they important? What are the benefits? More details will follow in subsequent posts but as an outline: I think there are three main issues beginners struggle with that these laws try to address.

  • Readability
  • Complexity
  • Efficiency

Readability: Due to lack of experience beginners tend to write code that is difficult to read by other and themselves. The first four laws aim to improve readability.

I believe the mind of an experienced programmer works a bit like that of an experienced chess player. An advanced player sees a position and with one look acquires a rough understanding of its possibilities and flaws. Similarly, an advanced programmer often can get an understanding what a piece of code does, just be glancing at it. However, if one swaps chess pieces with jelly beans or writes code that violates the common style, understanding is much harder to gain.

Complexity: What causes beginners and multi-million dollar projects to fail is unmanaged complexity. A function with thousand lines of code or hundred functions with 10 lines: to the machine it is the same but to the human mind it is entirely different. Complex problems not broken into small and independent pieces are a certain recipe for disaster. Rules 5 to 7 help to manage complexity and violating them will limit the difficulty of problems one can solve.

Efficiency: Programming novices tend to ignore available libraries or data structures and stick to simple file-manipulation approaches that are inefficient and error prone. Rules 8 to 10 aim to increase efficiency and robustness, and also foster the progression of skills.

Finally have a look at How To Write Unmaintainable Code for great examples of bad code ;)

Sunday, May 26, 2013

Reading FASTA files with Scala

FASTA is a simple file format to store nucleotide or peptide sequences. A sequence in FASTA format starts with a greater-than sign followed by the sequence name (and possibly other information) and then several lines of nucleoide or amino acid letters. A multi-FASTA file contains multiple FASTA sequences. Here an example taken from wikipedia:

>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

It is a common task to read FASTA sequences from a file and I was aiming for an elegant/short implementation in Scala. Here is a recursive version I like:

  
  def readFasta(filename: String) = {
     import scala.io.Source
     def parse(lines: Iterator[String]): List[(String,String)] = {
       if(lines.isEmpty) return List()
       val name = lines.next.drop(1)
       val (seq,rest) = lines.span(_(0)!='>')
       (name, seq.mkString)::parse(rest)
     }
     parse(Source.fromFile(filename).getLines())
  }

It defines an internal function parse that takes an iterator over the lines in the file (generated by Source.fromFile(filename).getLines() and returns a list of tuples, where each tuple contains the sequence name and the string of nucleotide or amino acid letters.

The recursion stops when the iterator is empty (lines.isEmpty) otherwise a line is read, which must be the name and the first letter is dropped to get rid of the greater-than sign (lines.next.drop(1)). After the line with the name span(_(0)!='>') is used to take the following lines until the next sequence name is found and store them in seq. _(0)!='>' tests whether the first letter of the line is the greater sign. The remaining lines are stored in rest.

Actually, "storing" is the wrong word, since Scala beautifully returns two new iterators when span is called on an iterator. It is worth noting, that nowhere a list of the read lines is stored. It is turtles/iterators all the way through except for the last line ((name, seq.mkString)::parse(rest)), where the return list is created. (name, seq.mkString) builds a tuple with the sequence name and the the sequence letters - the latter is constructed by gluing all sequence lines together with mkString. This tuple is the head of the return list and the tail is build recursively by calling parse for the remaining lines.

Let's add three minor improvements. It will make the code less elegant but more robust. Firstly, lines could have leading or trailing white spaces. They shouldn't but hey, you know how it is. To play it save we add a to the part where the lines are read from the file (last line in code). Secondly, there could be empty lines that we filter out via filterNot(_.isEmpty()).

Thirdly, the line with the sequence name frequently contains additional annotation that is typically separated by the pipe "|" symbol. But there is really no solid standard and it could be in any format - and it will. We therefore define a regular expression to extract the name from the header line and tweak it if necessary. The expression """>(.+?)(\\|.+)?""".r has two groups. The first captures the name the second captures any annotation starting with the pipe "|" symbol, which needs to be escaped (because it functions as OR in Regex language). The final code looks like this:

  
  def readFasta(filename:String) = { 
     import scala.io.Source
     val header = """>(.+)(\\|.+)?""".r
     def parse(lines: Iterator[String]): List[(String,String)] = {
       if(lines.isEmpty) return List()
       val header(name,annotation) = lines.next
       val (seq,rest) = lines.span(_(0)!='>')
       (name, seq.mkString)::parse(rest)
     }
     parse(Source.fromFile(filename).getLines().map(_.trim).filterNot(_.isEmpty()))
  }    

While elegant the above solution has the problem of being recursive (specifically not even tail recursive), which will cause a stack overflow for files with large numbers of sequences. The following iterative implementation is ugly but I haven't found any really nice alternative. Take it as a temporary implementation ;)

  
  def readFasta(filename:String) = {
    val header = """>(.+)(\\|.+)?""".r
    var lines = Source.fromFile(filename).getLines().filterNot(_.isEmpty())
    var sequences = List[(String,String)]()
    while(lines.hasNext) {
      val header(name,annotation) = lines.next
      val (seq,rest) = lines.span(_(0)!='>')
      sequences = (name, seq.mkString)::sequences
      lines = rest
    }
    sequences.reverse
  }  

Also interesting is this solution that uses parser combinators to read FASTA sequences.

Alexandre Masselot has posted an interesting problem (see below). What if you want an iterator over the sequences instead of a list? We can easily modify the code above to achieve this by creating and returning an iterator when calling readFasta. An iterator needs a hasNext method, which essentially replaces the condition of the while loop used above and a next method, which contains the body of the while loop:

  
  def readFasta(filename: String) = 
    new Iterator[(String, Iterator[Char])] {
    val header = """>(.+)(\|.+)?""".r
    var lines = Source.fromFile(filename).getLines().filterNot(_.isEmpty())
    def hasNext = lines.hasNext
    def next = {
      val header(name, annotation) = lines.next
      val (seq, rest) = lines.span(_(0) != '>')
      lines = rest
      (name, seq.flatMap(_.iterator))
    }
  }

If you look at the code closely, you will notice on more change. I have replaced (name, seq.mkString) by (name, seq.flatMap(_.iterator)). This is where Scala shines. Instead of concatenating the lines to a sequence via mkString we can simply call flatMap(_.iterator) to convert lines to iterators over line characters and then concatenate the iterators via flatMap.

In practice, you will find an iterator over sequences frequently useful, but an iterator over sequence characters less so. Many computational evaluations of sequence data will require random access. If memory consumption is an issue, compressing the sequence is probably a better way. Also in many cases a set of sequence data will be related and storing only the differences to a reference sequence (e.g. the human reference genome) can be a very efficient representation of large sequence data.

Friday, May 24, 2013

Test if string has all unique characters

Here comes another typical question from coding interviews: "Test whether all characters within a string are unique". An inefficient solution with O(n^2) complexity is to loop over all characters of the string and test in a second, inner loop whether an equal character can be found:

  
def all_unique(str):
    for ch1 in str:
        for ch2 in str:
            if ch1==ch2:
                 return False
    return True 

This is obviously not what the interviewer is after. A simple and efficient solution is to convert the string to a set of characters and to compare the number of elements in the set with the length of the string:

 
def all_unique(str): 
    return len(set(str)) == len(str)   

Too easy and therefore the interviewer doesn't like it either ;-). What is wanted is a mapping of characters to a boolean/bit array that indicates which character has been found in the string so far:

  
def all_unique(str):
    test = [False]*256
    for ch in str:
        idx = ord(ch)
        if test[idx]: return False
        test[idx] = True
    return True 

This approach has linear time complexity but requires an array with the length of the size of the character set.

Implement a queue using two stacks

Implementing a queue using two stacks is a typical question for coding interviews. It is a bit artificial, at least I haven't come across any real need, but hey, maybe you need to implement a queue in FORTH. Apart from that it is a nice puzzle ;)

The key insight is that a stack reverses order (while a queue doesn't). A sequence of elements pushed on a stack comes back in reversed order when popped. Consequently, two stacks chained together will return elements in the same order, since reversed order reversed again is original order.

How to do it? We use an in-stack that we fill when an element is enqueued and the dequeue operation takes elements from an out-stack. If the out-stack is empty we pop all elements from the in-stack and push them onto the out-stack. Here is the implementation in Python:

  
class Queue(object):
    def __init__(self):
        self.instack = []
        self.outstack = []
    
    def enqueue(self,element):
        self.instack.append(element)
   
    def dequeue(self):
        if not self.outstack:
            while self.instack:
                self.outstack.append(self.instack.pop())
        return self.outstack.pop()    

Let's try it:

  
q = Queue()
for i in xrange(5):
    q.enqueue(i)
for i in xrange(5):
    print q.dequeue()
0 1 2 3 4

The complexity is amortized O(1) for enqueue and dequeue.

Tree iterator in Python

Sometimes you have a tree data structure and want to iterate over the nodes of the tree in breadth-first or depth-first order. This also seems to be a question that is occasionally posed in coding interviews. Since it is fun and especially easy to implement in Python let's do it. First things first. We design a Node class that represents the nodes of the tree.

  
class Node(object):

    def __init__(self, value, children=[]):
        self.value = value
        self.children = children

The constructor __init__ takes the value of the node and a list of child nodes that is empty by default. We can now construct a tree such as the following

  
root = Node(1, [Node(2, [Node(4), Node(5)]), Node(3)])

To be able to display the tree we add two method to the Node class. __str__ returns a string representation of a node and showTree performs a recursive, depth-first traversal of the tree and prints the node values indented by their hierarchical level.

  
class Node(object):

    def __init__(self, value, children=[]):
        self.value = value
        self.children = children

    def __str__(self):
        return str(self.value)

    def showTree(self, level=0):
        print "%s%s" % ("."*level, self)
        for node in self.children:
            node.showTree(level+2)

Running root.showTree() for the tree defined above produces the following output. 1 is the root node with two children 2 and 3, where node 2 has 4 and 5 as children.

1
..2
....4
....5
..3

Now, the iterator. There is two ways (actually there are more) to iterate over the nodes. Level by level (= breadth first) or the same way showTree works (depth first). Both are equally easy. The only difference is one implementation uses a stack while the other uses a queue. Let's start with breadth first. The following code implements an iterator class. As such it has to provide two methods: __iter__, which returns the class object itself and next, which returns the next node every time it is called.

  
class BreadthFirst(object):

    def __init__(self, node):
        self.queue = [node]

    def __iter__(self):
        return self

    def next(self):
        if not self.queue: raise StopIteration
        node = self.queue.pop(0)
        self.queue += node.children
        return node

The constructor takes a node and puts it into a list that we will treat as a queue. Therefore we name it "queue" and not "nodes" or even worse "list". What happens in the next method? When the queue is empty (not self.queue) a StopIteration exception is raised that stops the iterator. If the queue is not empty a node is dequeued (terrible word) via queue.pop(0) from the beginning of the list and the children of the node are added to the end of the list (enqueued). The iterator is then called as follows and returns the tree nodes in breadth-first order.

  
for node in BreadthFirst(root):
    print node,

1 3 2 5 4

Using lists as queues is not very efficient and there is a Queue class in Python, which is likely to be faster but the beauty lies in the fact that the implementation of the breadth-first iterator is mirror image of the depth-first iterator. As mentioned before, the only difference is that the list is now used as a stack:

  
class DepthFirst(object):

    def __init__(self, node):
        self.stack = [node]

    def __iter__(self):
        return self

    def next(self):
        if not self.stack: raise StopIteration
        node = self.stack.pop()
        self.stack += node.children
        return node

C'mon, admit it! This IS beautiful. Finally, using the iterator generates the following output

  
for node in DepthFirst(root):
    print node,
1 2 3 4 5

That's it.

Wednesday, May 22, 2013

Fun with Hopfield and Numpy

Hopfield networks are fun to play with and are very easily implemented in Python using the Numpy library. Simple as they are, they are the basis of modern machine learning techniques such as Deep Learning and programming models for quantum computers such as Adiabatic quantum computation.

We are going to use a Hopfield network for optical character recognition. This is obviously a toy example and there are much more better machine learning techniques for this kind of problem (e.g. deep learning) but it is a nice demonstration.

Among other things Hopfield networks work as autoassociative memories, allowing to store and to recall patterns, e.g. images of letters. We want to store two letters and then try to recognize noisy versions of those letters. In the following we first create images of the letters "A" and "Z" from a string representation:

  
A = """
.XXX.
X...X
XXXXX
X...X
X...X
"""

Z = """
XXXXX
...X.
..X..
.X...
XXXXX
"""

Hopfield networks operate on binary vectors and we therefore define a function to_pattern() that takes a letter image and transforms it into a bipolar vector by gluing all image rows together, and converting "X" to +1 and "." to -1.

  
def to_pattern(letter):
    from numpy import array
    return array([+1 if c=='X' else -1 for c in letter.replace('\n','')])

For instance, calling to_pattern(A) returns then

[-1  1  1  1 -1  1 -1 -1 -1  1  1  1  1  1  1  1 -1 -1 -1  1  1 -1 -1 -1  1]

To visualize a pattern as an image we write a function display(pattern) that reshapes the pattern vector to a matrix and displays it as an image using the pylab function imshow().

def display(pattern):
    from pylab import imshow, cm, show
    imshow(pattern.reshape((5,5)),cmap=cm.binary, interpolation='nearest')
    show()

Calling display(to_pattern(A)); display(to_pattern(Z)) produces the following pictures:

Having all the basic elements in place let's get serious. First thing to do is to create the Hopfield network from the given patterns. Training a Hopfield network is extremely simple and requires only computation of the outer products of all pattern vectors with each others and summation of the resulting matrices.

$$W = ∑↙{k}↖n \bo p_{k} \bo p^T_{k} $$

The resulting matrix W specifies the weights of the edges between the nodes of the network and the nodes represent the features of the patterns (= pixels in an image). There are variants but typically the weight matrix is normalized (division by the number of patterns) and diagonal elements are set to zero (subtraction of the identity matrix I), which leads to the following expression:

$$W = 1/n ∑↙{k}↖n \bo p_{k} \bo p^T_{k} - I$$

Let us organize all our patterns in a matrix (rows are patterns)

 
patterns = array([to_pattern(A), to_pattern(Z)])
and the implementation of the training formula is straight forward:

  
def train(patterns):
    from numpy import zeros, outer, diag_indices 
    r,c = patterns.shape
    W = zeros((c,c))
    for p in patterns:
        W = W + outer(p,p)
    W[diag_indices(c)] = 0
    return W/r

To recall a stored pattern from a Hopfield network the pattern itself (or a partial or nosy version of it) the dot product of the pattern vector and the weight matrix is computed. This produces a new pattern vector that is binarized and then fed back into the system:

$$\bo p^{(t+1)} = sgn( \bo W \bo p^{(t)})$$

The operation is repeated until the pattern vector becomes stable or after a specified number of iteration has been performed. We keep the implementation simple and stop after 5 steps, which is typically sufficient.

  
def recall(W, patterns, steps=5):
    from numpy import vectorize, dot
    sgn = vectorize(lambda x: -1 if x<0 else +1)
    for _ in xrange(steps):        
        patterns = sgn(dot(patterns,W))
    return patterns

Note that in contrast to the recall formula the code performs the recall on all pattern within the pattern matrix at once, while the formula describes the recall for an individual pattern vector. Also the pattern matrix contains the patterns a rows and therefore the dot product W*p in the formula becomes dot(patterns,W) in the code (Since W is symmetric, transposing W is not required).

To test the network I created noisy version of the training patterns by randomly flipping 10 bits. The image sequence below shows the resulting input patterns and the output patterns after the second and third iteration. As you see the original patterns have been recalled perfectly after three iterations (from left to right).

Patterns are stored as low energy states (attractors) within the system and the recall procedure describes the trajectory of an input pattern to its attractor. Each step reduces the energy and the energy for a pattern is defined as follows:

$$E(\bo p) = \bo p \bo W \bo p^T$$

Here is the corresponding code to compute the energy for the pattern in the pattern matrix.

  
def hopfield_energy(W, patterns):
    from numpy import array, dot
    return array([-0.5*dot(dot(p.T,W),p) for p in patterns])

A few final comments. The Hopfield network does not only store the original patterns but also their inverse versions and possibly linear combinations of patterns. Furthermore the capacity of the network is limited. For random patterns the network can store roughly 0.14*N pattern without errors (N = Number of network nodes). There are many variations of the Hopfield network with different training and recall procedures that give better performance but the beauty of the original Hopfield network is in its simple implementation but complex behavior.

Last but not least cudos to the developers of jqmath, the JavaScript module used here to display the mathematical expressions.

more reading

Hopfield network
Hopfield network
Weave a neural net with Python
Associative recall of memory without errors
Minimum Probability Flow Learning
Finite memory loading in hairy neurons
Error tolerant associative memory

Friday, May 17, 2013

Coding interview questions

In preparation for a coding interview I collected numerous coding questions from the internet. Since they might be useful for other job applicants as well, I post them here.

The question are not sorted in any way, of varying difficulty and there are probably some duplicates. I have fixed only a few typographical errors but the questions are largely as I found them. Don't hold me responsible for bad spelling or grammar ;-)

Two books that I also found quite useful for my preparation were "Cracking the Coding Interview" by Gayle Laakmann and "Programming Interviews Exposed" by John Mongan. Well, here comes the list of questions:



Find a string in a very long string - a needle in a haystack.

Given a binary search tree and an integer k, describe the most efficient way to locate two nodes of the tree that sum to k.

Please describe a data structure that accomplishes the functions INSERT and FIND-MEDIAN in the best possible asymptotic time.

If you have one billion numbers and one hundred computers, what is the best way to locate the median of the numbers?

Say you are implementing exponentiation for a certain processor. How would you minimize the number of multiplications required for this operation?

You are given a grid, with points on the intersections (think a map of streets, people are standing on random corners). Write code to calculate the point on the grid that is the shortest distance from every point on the grid.

Write a function that finds the median of a set of three numbers, also find the Big O. Can it be done with only 2 comparisons, or do you need 3? .

Design and describe a system/application that will most efficiently produce a report of the top 1 million Google search requests.

Design and write a Sudoku solver.

How to computer the longest increasing sequence in an array with only one extra variable.

Implement core functionality of a skip list.

Given an array of different integers, print out all the inversion pairs. (i.e, a[i] > a[j] where i < j).

Given a 2D boolean array. Return the length of the biggest square where all elements inside it is true.

Get the two highest number in a binary search tree.

Find the maximum sum possible in an array of positive integers by selecting the elements in such a way that no two elements are next to each other.

Same as above but another one is taken out again. Find the two missing numbers.

Given a set of strings, a number 'n', and a function that takes a string and gives back a score, find the n largest scored strings in the set.

Given two numbers m and n, write a method to return the first number r that is divisible by both (e.g., the least common multiple.

You are given a text file too large to fit in memory and 3 strings A, B, C. For each string, you have a sorted array listing the positions of the string in the file (e.g., inverted indices). Find the smallest window containing the 3 strings. Order of appearance does not matter.

Given 2 numbers x and y, check if x is an integer power of y. For instance, x = 8, y = 2 would return true and x = 10 and y = 2 would return false since the first one is an integer power but the second isn't.

Suppose you have an arbitrarily connected graph with n nodes. Come up with an algorithm to identify each set of connected nodes (i.e. identify all the islands in the graph). What's the complexity? Can you find a solution in O(n log n)?

Implement a random number generator such that the random number generated is always in a particular range. Perform through time complexity analysis of it. How would you improve the solution.

Get the kth largest number from two sorted arrays.

Find median given two sorted arrays.

Write a function to convert a collection of strings into a single string and a function to convert it back.

Write a method which will remove any given character from a String.

Write a program to find all prime number up to a given numbers.

Write program for word-wrap which should work on any screen size.

Write code for Generate Random No in a range from min to max.

Describe the design and implementation of a Tic Tac Toe game.

Given a graph, find if it represents a tree.

You have a sorted array of real numbers drawn from the uniform distribution between 0 and 1. How can you quickly find a number in it?

If you have a ransom letter and magazines, can you construct the ransom letter from the words in the magazine.

Find the intersection of two integer lists.

Write a program that reads a file, randomizes it's lines and writes the output to another file.

Find the balance point in an array. (The index where the sum of the elements to the left it is the same as the sum of the elements to the right of it.) .

Implement iterator for a list of lists which may contain empty lists.

Given a list of sorted words, output the order of the letters used to sort them.

Given a base 10 number, print the hexidecimal (base 16) representation of that number.

Sort integer array in linear time.

Given a bunch of N random points in space, how would you draw a line such that N/2 of them were to the left of the line and N/2 to the right of it. Complexity.

In a sorted array, find the number closest to a given number .

Describe the algorithm for a depth-first graph traversal.

Given List of Strings. Return a list of lists of Strings where each to anagram Strings in the input list exist in the same list in the output list.

How would you calculate the number of 1's in a list of 10,000 integers.

Given a file consisting of names, what data structure would you use to valid if a name is in the list. What if now the we say a name is valid if the if it differs no more than one character against a name in the file?

Write a method that generates a random sequence of numbers of specific percentages.

If you have 10 bags of marbles with 10 marbles each and one bag has marbles that weigh differently than the others, how would you figure it out from one weighing.

Given a binary tree, print out the whole tree by levels.

Implement a function that takes a string with '?' marks and returns all possible values when '?' is replaced with '0' or '1.

Implement a hashmap.

Write function to compute n-th Fibonacci number? Both iterative and recursive?

How do you reverse a singly linked list?

How do you find depth of binary tree?

Write code to print Inorder traversal of a tree?

Print out all leaf node of a binary tree?

Write a program to sort numbers using quick sort?

Write a program to implement binary search algorithm.

In an array 1-100 exactly one number is duplicate how do you find it?

Given two arrays, 1,2,3,4,5 and 2,3,1,0,5 find which number is not present in the second array.

How will you determine two graphs are the same.

What is the strategy pattern in Java.

Find the number of connected components in a Graph.

Write code to find the second shortest path between two given nodes in an undirected graph.

Write a function to return if a number is a palindrome (eg, 113848311) .

Reverse the word order in a string.

Reverse bits of a digit.

Write a method that finds depth of a (non-balanced) binary tree.

Convert decimal number 99 to base 7.

Define an algorithm that converts a string to an integer without using a built in method like int() .

Write a program to find depth of binary search tree without using recursion.

Write an iterator over multiple collections.

Most phones now have full keyboards. Before there there three letters mapped to a number button. Describe how you would go about implementing spelling and word suggestions as people type.

Given a dictionary of English words, find sets of anagrams. For instance, “pots”, “stop”, and “tops” are all anagrams of one another because each can be found by permuting the letters of the others.

Although Quicksort uses only O(log n) stack space on the average, it can use linear space in the worst case. Explain why, then modify the program to use only logarithmic space in the worst case.

Print all permutation of String both iterative and Recursive way.

Write code to check a String is palindrome or not?

Write code to check whether a number is power of two or not?

Design an algorithm to find the frequency of occurrence of a word in an article?

Large data set, in memory sorting with limited RAM.

Implement a special type of integer iterator. It will be a wrapper over a regular integer iterator, except that this iterator will only iterate over negative numbers. Show how you'd write the next() and hasNext() functions to work with the next negative integer instead of just the next integer.

Sorting in general with big-O complexity, followed by partition sort (quick sort), followed by finding the k-th largest element without full sorting.

Implement peek() and pop() from java iterator(). for example [1,2,3,4,5], peek() = 1, pop() = 1, peek() = 2, peek() = 2, pop() = 2 .

Mutable, non mutable classes in Java.

Declaring constants in Java.

Count the number of set bits for an 32 bits integer. How to improve the process for the second time use if the memory is unlimited.

What is the complexity of Quicksort? (mean and worst case).

Find the minimum window in a string which contains a list of characters .

Implement a set class with the following method: get(), put(), getRandom() .

Given a graph node, write a code to return a copy of the sub-graph starting with that node.

Create a Java method that clones a DAG (Directed Acyclic Graph).

Describe preorder, inorder, postorder.

What's the difference between a hashtable and a hashmap (in Java).

Explain the difference between Array Lists, Linked Lists, Vectors, Hash Maps, (from Java's JDK) etc. and when one choice is better of another.

Delete an element from a singly linked list in Java.

Out of 10 coins, one weighs less then the others. You have a scale. How can you determine which one weighs less in 3 weighs? Now how would you do it if you didn't know if the odd coin weighs less or more?

Addition of 2 big numbers (represented by 2 int arrays, assuming each integer can only be 0-9) .

Compute all the intersections of two sets of segments in a line.

Given a set of cities, each with a given population, select randomly a city with a probability that is proportional to the population.

You have a collection of html documents grouped by a source web site . You need to count frequencies of bi-grams across all documents and present it in a sorted order (may be return top N frequent bi-grams). How would you approach it? How to filter out non-significant bi-grams? How to distribute and merge? What data structures to use to keep counters?

Write code to shift a string with rotation [a, b, c << 2 = 2 c, a, b].

Convert a one dimensional array of Strings containing [a,b,c, d,e ...z] to 2 dimensional array of a2[aa,bb,cc] .

Write a string splitting function that does not copy the string.

Partition an array in such way zeros to be moved on the left side of the array, other numbers on the right side of the array. Extra storage not allowed, only in-place.

Write an algorithm to test if n is power of 2.

How to replace a string in a templatized string where each placeholder starts with a specific delimiter, you are given a map that maps the placeholders to associated values.

Ref to the game, Boggle, given a 3x3 grid of letters represented as a char array, generate all possible word combinations. The legal character sequences are horizontally or vertically adjacent chars. So in [A, B, C; D, E, F; G, H, I], ABCFI is legal but DFH is illegal.

Write the code to determine if a 2D point is inside a 2D closed polygon.

Implement a blocking queue in Java.

Implement heap-based priority queues to run as quickly as possible; at what values of n are they faster than sequential structures.

Algorithm for converting uniformly distributed random numbers into a logarithmic distribution.

Given a picture of a skip list data structure, implement the algorithm to maintain skip lists (add, search, delete).

Base class has one method takes argument as Object and inherited class overrides it to pass String as parameter.If called with argument of Object whcih method will be called.

Design and implement a garbage collection algorithm in any programming language. also needed to give out run time analysis.

you have a sequence where each number is a multiple of 2 or 5 (so: 2^i * 5^j). he gave the beginning of the sequence as 1,2,3,4,5,8,10,16... and asked me to find an algorithm to calculate the next number in the sequence.

Write code to check whether every element of an array has been visited (you are given a starting position and each position stores a pointer to a subsequent position).

Sample uniformly on a circle centered at zero with radius 1.

Write algorithm to compute a Log to the base 2 of a number (integral results no need for floating point). Solution should not assume a particular size of integer .

Implement a base 3 adder which takes two strings as input and returns a string .

Quad tree bitmap: describe data structure and an algorithm.

For a n-node tree, you are given a left leaf node pointer and right leaf node pointer. Write a function to find the number of nodes between the leaves pointed by the left and the right pointer.

You have a simple search consisting of search terms that are OR'ed together, ie: Cats AND Dogs. You have a set of documents possibly containing those terms and you have the positions of those terms in each document. How would you search for Cats AND Dogs.

You're given a string, and you want to split it into as few strings as possible such that each string is a palindrome.

What are the issues with floating point calculation? For example, 1.0f + e-30, is there any issue with the statement.

Given a sorted array A[1..n] with n integers, and an integer t, find all pairs (x,y) of elements in A such that x+y is smaller than t.

You are given a huge stream of geo coordinates. Unordered. Return list of objects in a specified radius from some point X.

Write classes representing expression tree for a simple calculator. You care only about constants and basic binary operators. Write function evaluating expressions.

Suppose you have a set objects, identified by a unique name. How do you store them so that you can retrieve them easily?

Given a tree and leaf node, write the code to make the leaf node as root of tree.

Find the longest word in a dictionary, such that the word can be built one character at a time, and all substrings are words in the dictionary. A new character can be added anywhere.

You have a n number of cities. Lets say city 1 has some information that needs to be sent to all other n-1 cities using minimal cost. Cost between each pair of cities is given. any number of cities can transmit the information once they receive the information but the overall total cost should be minimum.

Write a function to modify a singly linked list, appending copies of the first N-1 elements to the end in reverse order. For example, ['A', 'B', 'C', 'D'] becomes ['A', 'B', 'C', 'D', 'C', 'B', 'A'].

Given an array of structs: write a function that finds equal values in all the structs in the array.

How to find the max number of a shift buffer queue. For instance, there is an array like 5, 6,8,11,1,2, how to find the max number, which is 11 in this example.

Given inputs from Google Search, you have K chunks. Each chunk is individually alphabetically ordered (apple, banana, cat) ... (*apple, *banan, *cat). You want to merge all chunks into a single list. How would you do it? What limitations are there to your approach?

What's the difference between finally, final and finalize in Java.

Can an abstract class be instantiated in Java? And does it have a constructor.

If we had a list of n nodes, what are the maximum number of edges there can be for a directed acyclic graph.

Write an algorithm for integer multiplication.

Design a class to serialize / deserialze a graph.

Describe a method to sift through dozens of emails and rank them based on 'importance'.

Given a list of integers of at least length 7, print the average of each consecutive 7 number long subsequence .

Write a memory manager oriented towards millions of very small allocations (less than 10 bytes).

Design an in memory word processor. Implement cut, copy, paste, formatting, recoverability.

What's the difference between a non-exclusive lock and an exclusive lock.

Generalized tower of Hanoi problem (n pegs).

5 stacks of coins that looks identical. each stack has 5 coins each. 4 of those stacks weigh the same, 1 stack has coins that weigh exactly 1 pound less. using a digital scale once, find out the position of the stack that is less weight.

Given standard 3x4 telephone touch pad, give all number combinations of letters you can make using only the chess "rook" move (ie, L shape: 3 down, 1 over). I think it was all 7 digit combinations. Letters are standard phone letters (2 button is "ABC", etc). * and # not valid.

Implement a program to play the battleship game.

Serialize and deserialize a collection of strings into a single one.

Design an iterator for a collection of collections in java. The iterator should hide the nesting, allowing you to iterate all of the elements belonging to all of the collections as if you were working with a single collection.

Millions of cache entries, each with a unique key, stored in several servers. how to sort all entries whose key falls within a specified range using Map Reduce? What if one server has a very small range of keys (say 1 to 100) and another server has a very large range of keys (say 100 to 1000000)? .

Rotate a one-dimensional vector of n elements left by i positions. For instance, with n=8 and i=3, the vector abcdefg is rotated to defghabc. Simple code uses an n-element intermediate vector to do the job in n steps. Can you rotate the vector in time proportional to n using only a few dozen extra bytes of storage?

Write functions for the following date problems: given two dates, compute the number of days between them; given a date, return it’s day of the week; given a month and year, produce a calendar of the month as an array of characters.

Build the fastest possible complete function to generate a sorted array of random integers without duplicates. (You need feel constrained to use any prescribed interface for set representation).

How do you sort Java object using Comparator?

Was asked how I would implement division without using division or mod operators. The answer should return in the form quotient r remainder.

Write a program to implement blocking queue in Java?

Write a program for producer-consumer problem?

Estimate the total storage size of GMAIL.

How to add a counter to www.google.com to track the billionth user.

You have a 64bit integer counter set to 0.

How long it will take to overflow the counter given that you are incrementing it at 4Ghz speed.

Quickly estimate 2^64 without using a pen/papar.

Given an integer, its digits are stored in the array, the very last digit is the very last element of the array. Write a function that adds one to this integer.

Given a large network of computers, each keeping log files of visited urls, find the top ten of the most visited urls. The result list must be exact, and the maps are too large to transmit over the network (especially sending all of them to a central server or using MapReduce directly, is not allowed.

A specific implementation of B-Tree when looked from one side.

You are given a sorted list of disjoint intervals and an interval, e.g. [(1, 5), (10, 15), (20, 25)] and (12, 27). Your task is to merge them into a sorted list of disjoint intervals: [(1, 5), (10, 27)].

Given a picture of pixels represented with RGB values, convert it to 256-colors (each pixel represented with an index from a 256-sized table holding the actual colors) .

Implement a counter with a method to retrieve the number of increments in the last 60 seconds.

How many ways can you make change given the following denominations? ie, if numCentsToMake is 6 and denominations is [25, 10, 5, 1], then it should return 2: either a nickel and a penny or 6 pennies. int numWaysToMakeChange(int numCentsToMake, int[] denominations) .

Implement a stack that pops out the most frequently added item.

Discuss and implementation of flattening a binary search tree into string representations.

How would you go about testing a distributed system such as Gmail, before releasing it to the public. How would you simulate realistic server load ?

We have m slots for ads and n ads, each ads will have different revenue on different slot, design an algorithm to find out the best fit (find m ads in n ads and order them so that they can make max money,.

Given a word that consists of two words such as "aman" give all the combination of valid two words. You are given a dictionary to test against. Example: aman--> a man, am an. watermelon--> water melo.

How to design a search engine? If each document contains a set of keywords, and is associated with a numeric attribute, how to build indices?

how many ways you can write n as N summation of other numbers 5 = 1+ 2 +2 = 2+ 2 + 1 = 1 + 3 +1 = 1 + 4 etc.

If a user types in a n digit number on the telephone, how do you write a function to deduce if the number constitutes a valid word. For example, if the user enters 123, then can a valid word be made out of (a/b/c) + (d/e/f) + (g/h/i) .

How to select a random sample of size K from a stream of numbers.

Implement a fast exponentiation function.

Write a function to calculate the angle between hour and minute hand on a clock.

Write a program to compute if one string is a rotation of another.

Given a set of integers from 1 to 100, find out the number of combination that sum up to 100. Each integers can only be used once.

You need to write a function to climb n steps you can climb either 1 step at a time or 2 steps a time, write a function to return number of ways to climb a ladder with n step.

How do you find middle element of a linked list in single pass. How do you find 3rd element from last in single pass.

How would you determine if someone has won a game of tic-tac-toe on a board of any size?

Given an array of integers which is circularly sorted, how do you find a given integer.

Find a loop in a linked list. How long is the loop.

Algorithm to get Index of a permutation when all permutations arranged in lexicographical order and its complexity.

Given two binary search trees, write function which tells if two such trees are the same – (i.e. same info in the nodes, same branching to left and right at every node.

Come out with an algorithm for getting the column number provided the column name in a excel sheet and vice versa. Excel has a naming convention of A,B..Z,AA,AB,AC..ZZ,AAA... This had to be converted to the column numbers. A will be 1 and AA will 27. Also the algorithm to find the name provided column number.

Write a function that computes the intersection of two arrays. The arrays are sorted. Then, what if one array is really larger than the other array.

Find the common numbers in two text files.

Given a list of integer e.g. (1,2,4,5,6,7,8...). Find a pair with a given sum.

Find 3 numbers in an array adding up to a given sum S.

In a sorted matrix (sorted by rows and columns), find an element.

Find kth largest element in unsorted array.

Given the pre-order and in-order traversing result of a binary tree, write a function to rebuild the tree.

Given two strings. Remove the letters in order from the first string that are in the second.

Create a function to return the angle between the clock hands if given the hours and minutes.

Giving a windows size K and an array of size N, find the minimum of each window as it slides through the array.

Write code for Fibonacci algorithm (iterative or recursive) and explain what's the performance.

For a given maze, how to write a function to check if there's a way from the top-left grid to bottom-right grid with obstacles in the middle.

Given an array of numbers, there is one number that has a duplicate. How would you find the number.

Find element in circular sorted array.

You are climbing a stair case. Each time you can either make 1 step or 2 steps. The staircase has n steps. In how many distinct ways can you climb the staircase.

Remove a node from a singly linked list without knowing the head node. All you have is the node itself.

Given a list of strings return the substring representing the longest common prefix .

Find longest common substring.

Write a priority queue<E> with three methods. Insert(E Element, int priority). E Pop(), update(E Element, int new_priority).

Describe what happens when making a google search, specifically with client and server .

Implement a binary tree iterator class .

Given search terms, find the minimum window containing all the terms in a string .

How would you find the power set of a set of numbers .

How would you implement a stack to achieve constant time for "push", "pop" and "find mininum" operations.

You are given a Binary-Search-tree. Find duplicated values.

How to reverse the words in a sentence in place.

If you have a computer with no division operator, how do you do it .

How to find two missing integers in an unsorted array.

You are given a number of coins, how can you generate the probability of 1/3, 2/3, etc.

How would you sort an array of one millions integers.

Given a grid that contains rectangles, write a function that will return all the rectangles that overlap with each other.

You have all of the prices for a given stock for the next year. You can buy once and sell once in that year. How do you determine when to buy and sell to maximize your profit?

Write an algorithm to sort and merge 2 LARGE data streams on a system where storage is essentially unlimited (but slow) and RAM is limited.

Find the median of a very large dataset distributed over several network nodes.

Given unsorted sequence of billions of numbers that cannot all fit in memory at the same time, find the median of these values.

Implement a method that matches an entire string with star wildcard pattern, e.g. returns true for ("*ogle", "Google"), but false for ("fragile*", "agile") - without using regular expression language support.

How to count frequencies of characters in a string. What if the string is too huge? When is it reasonable to distribute across machines? How to find the most frequent character in the distributed scenario with a minimum data exchange between machines?

Write a program for finding the kth-smallest element in the array x[0...n-1] in O(n) expected time. Your algorithm may permute the elements of x.

You are given two sets of integers, sizes M and N with M < N. Perform inner equal join on these two sets (i.e., find intersection of the two lists). How to perform it if both the lists are in files and the available memory is of size K < M < N.

You are a parking lot attendant in a lot that has one open spot, and you want to move the cars from their original positions into a new arrangement. Create a program that will print out instructions on how to move the cars most efficiently.

Given a sequential file that contains at most four billion 32-bit integers in random order, find a 32-bit integer that isn’t in the file (and there must be at least one missing...why?). How would you solve this with unlimited main memory? How would you solve it if you could use several external files but only a few bytes of main memory?

Given a file containing at most ten million 7-digit integers with no duplicates. What is an efficient way to print these numbers in ascending order using just 1.5Mb RAM and reading the data just once? What are the consequences of only having 1Mb of RAM and no other storage? How would your answer change if duplicates were permitted?

N pots, each with some number of gold coins, are arranged in a line. You are playing a game against another player. You take turns picking a pot of gold. You may pick a pot from either end of the line, remove the pot, and keep the gold pieces. The player with the most gold at the end wins.

Transmute one word to another by the shortest path of 1-letter changes through a path of actual words.

Create a cache with fast look up that only stores the N most recently accessed items (LRU policy). O(1) time complexity is expected.

Find the smallest window/substring in an string that contains all letters in a given string.

Write a function to find out longest palindrome in a given string.

Saturday, March 23, 2013

Testing for a path through a maze

This is another coding interview question. Find a path through a labyrinth. Let's assume the labyrinth is given as a string of the following form:

maze = """
XXXXXXXXXXX
X         X
X XXXXX   X
X  XX   X X
X   XX    X
XXXXXXXXXXX
"""

The first thing to do is to convert this string into a more suitable data structure. A boolean matrix springs to mind, where true indicates an empty cell and false a stone. It is a one-liner in Python to convert the string given in maze into the boolean matrix:

maze = [[cell!='X' for cell in row] for row in maze.split('\n')[1:]]

The function to test for a path from a starting point start given as tuple (row,col) to an end point end can then be implemented as follows:

def find(maze,start,end):
    stack = [start]
    while stack:
        i,j = stack.pop()
        if not maze[i][j]: continue
        if (i,j) == end: return True
        maze[i][j] = False        
        if maze[i][j+1]: stack.append((i,j+1))
        if maze[i][j-1]: stack.append((i,j-1))
        if maze[i+1][j]: stack.append((i+1,j))
        if maze[i-1][j]: stack.append((i-1,j))
    return False

Instead of a recursive solution we use a stack to explore all possible paths with backtracking. We initialize the stack with the starting point and then loop as long as there are points on the stack or the end point has been found. In each iteration we take the top point from the stack (i,j = stack.pop()) and test if the corresponding cell of the maze is empty. If not we jump to the beginning of the loop. If the cell is empty we mark it as visited (maze[i][j] = False) and leave the loop if it is the end point. If we have not reached the end point we push the coordinates of all neighboring empty cells on the stack and keep iterating.

Let's give it a go:

start = (1,1)
end   = (4,9)      
print find(maze, start, end)
True

One last thing. The code to test for empty neighboring cells is simple and clear but repetitive. Another solution is to replace those four lines by a loop. It isn't much shorter (three instead of four lines) and the code becomes more complex though:

def find(maze,start,end):
    stack = [start]
    while stack:
        i,j = stack.pop()
        if (i,j) == end: return True
        maze[i][j] = False
        if (i,j) == end: return True
        for io,jo in [(0,+1),(0,-1),(+1,0),(-1,0)]:
            ni,nj = i+io,j+jo
            if maze[ni][nj]: stack.append((ni,nj))
    return False

Saturday, January 12, 2013

Java to Scala: Motivation

If you are a Java programmer and haven't heard about Scala already you might wonder if it is worth the effort learning. Here I show two examples that motivated me to switch from Java to Scala.

This first one I stole from somewhere ― I just can't remember where from, sigh ― and modified it slightly. It is a simple implementation of a class that stores the contact information (name and phone number) of a person.

public class Contact implements Serializable {
    private final String name;
    private final String phone;

    public Contact(String name, String phone) {
        this.name = name;
        this.phone = phone;
    }

    public String getName() {
        return name;
    }

    public String getPhone() {
        return phone;
    }

    public Contact withName(String name) {
        return new Contact(name, phone);
    }

    public Contact withPhone(String phone) {
        return new Contact(name, phone);
    }

    public boolean equals(Object o) {
        if (this == o)  return true;
        if (o == null || getClass() != o.getClass()) return false;
        
        Contact p = (Contact) o;
        if (name != null ? !name.equals(p.name) : p.name != null) {
            return false;
        }
        if (phone != null ? !phone.equals(p.phone) : p.phone != null) {
            return false;
        }
        return true;
    }

    public int hashCode() {
        int result = name != null ? name.hashCode() : 0;
        result = 31 * result + (phone != null ? phone.hashCode() : 0);
        return result;
    }

    public String toString() {
        return "Contact(" + name + "," + phone + ")";
    }
} 

There are no surprises or difficulties in it. Just a lot of boilerplate code. Any Java programmer will have written hundreds of classes similar to this one - most of them a lot more complex. With the Contact class one can create new contacts from scratch or use one of the copy constructors as shown below:

Contact stefanOffice = new Contact("Stefan", "62606");
Contact johnOffice   = new Contact("John", stefan.getPhone());
Contact stefanHome   = stefanOffice.withPhone("401251497");

These 20 plus lines of the the Contact class in Java boil down to a single line in Scala with essentially identical functionality:

case class Contact(name: String, phone: String)


val stefanOffice = Contact("Stefan", "62606")
val johnOffice   = Contact("John", stefan.phone)
val stefanHome   = stefanOffice.copy(phone = "401251497")

A case class in Scala provides default implementations for setters, getters, equality tests and copy constructors. In many cases they are sufficient and if not they can easily be overridden.

While this example is a bit extreme with respect to the amount of code reduction (typically the factor is between two and five) it is not an unrealistic example. But I also wanted to demonstrate the benefits of Scala on a slightly more complex problem.

We are going to look at some code that generates an index of over keywords. For instance, let's say we have the keywords "Apple", "Ananas", "Mango", "Banana", "Blackberry" and we would like to have an alphabetically sorted index that lists the keywords (again alphabetically sorted) starting with the same letter:

A: Ananas, Apple
B: Banana, Blackberry
M: Mango

Not very challenging; should be easy. Well, here comes the Java implementation and as you can see it is far from elegant:

import java.util.*;

public class Index {    
  public static void main(String[] args) {
    List<String> keywords = 
       Arrays.asList("Apple", "Ananas", "Mango", "Banana", "Blackberry"); 
    Map<Character, List<String>> ch2words = new TreeMap<>(); 
    for(String keyword : keywords) {   
      char firstChar = keyword.charAt(0);     
      if(!ch2words.containsKey(firstChar))      
        ch2words.put(firstChar, new ArrayList<String>());       
      ch2words.get(firstChar).add(keyword); 
    } 
    
    for(List<String> list : ch2words.values()) 
      Collections.sort(list); 
      
    StringBuilder str = new StringBuilder();  
    for(Character firstChar : ch2words.keySet()) {
      str.append(firstChar+": ");
      Iterator<String> words = ch2words.get(firstChar).iterator(); 
      while(words.hasNext()) {
        str.append(words.next());  
        str.append(words.hasNext() ? ", " : "\n");  
      }
    }
    System.out.print(str);      
  }
}

We take the keywords and put them into a TreeMap to map the first letter to the list of words. I chose a TreeMap to have the map entries automatically sorted according to the first character. We then need to sort the keywords in each of the entries of the TreeMap and finally use a StringBuilder to write the formatted index to a String.

Now I am going to write some bad and truly disgusting Scala code that essentially is a one-to-one mapping from Java to Scala. While far from optimal it shows what already be gained without using any novel concepts such as functional programming or pattern matching.

import scala.collection.mutable.{Map,StringBuilder}

val keywords = List("Apple", "Ananas", "Mango", "Banana", "Blackberry")
val ch2words = Map[Char, List[String]]()
for(keyword <- keywords) {
  val firstChar = keyword(0)
  if(!ch2words.contains(firstChar))      
    ch2words(firstChar) = List()       
  ch2words(firstChar) = keyword::ch2words(firstChar)
}
val str = new StringBuilder()
for(ch <- ch2words.keys.toSeq.sorted) {
  str ++= ch2words(ch).sorted.mkString(ch+": ", ", ", "\n")      
}  
println(str)

Largely due to type inference in Scala the code becomes less cluttered and the logical flow is easier to recognize. Also the confusing distinction between primitive data types such as char and their corresponding objects (Character) is gone.

Now let's do it again and see how a solution that takes advantage of Scala features could look like:

val keywords = List("Apple", "Ananas", "Mango", "Banana", "Blackberry")
val ch2words = keywords.groupBy(_(0)).mapValues(_.sorted).toList.sortBy(_._1)
val lines = ch2words.map{case (ch,words) => ch+": "+words.mkString(", ")}
print(lines.mkString("\n"))

First of all it is much shorter. But in addition, it is also more readable provided an understanding of some basic concepts in Scala has been acquired. Let's have a closer lock at the code.

The first line is trivial. It creates a list of all the keywords and that's it. The second line is interesting. First the keywords get grouped by their first character (keywords.groupBy(_(0))). That produces a map of letters to keyword lists:

Map(M -> List(Mango), A -> List(Apple, Ananas), B -> List(Banana, Blackberry))

As you can see neither the letters nor the lists of keyword are sorted. We take the values of the map (those are the lists of keywords) and sort them. mapValues(_.sorted) does the job.

Map(M -> List(Mango), A -> List(Ananas, Apple), B -> List(Banana, Blackberry))

To sort the map according to letters we convert the map to a list, which returns a list of tuples, containing letters and their keywords, and sort the list according to the first tuple component ― the letter (toList.sortBy(_._1)).

List((A,List(Ananas, Apple)), (B,List(Banana, Blackberry)), (M,List(Mango)))

Now the data structure is sorted and we simply need to create a formatted string. Pattern matching is used to extract the letters and the corresponding keyword lists (case (ch,words)) and the lines of the index are constructed by mapping (ch,words) tuples to strings. Finally, all lines are glued together via mkString("\n") and printed out.

A: Ananas, Apple
B: Banana, Blackberry
M: Mango

To conclude, in almost all cases Scala allows one to write shorter code than in Java; and it is frequently even more readable. But it also makes it possible to write code that causes a PERL coder to blush. To quote Spider-man: "With great powers come great responsibilities!".

Wednesday, January 9, 2013

An example class in Scala

This is just a little example class to remind me of the syntax and some of the nifty features in Scala that I tend to forget. I essentially take the example of the Rational class from the excellent book "Programming in Scala" by Odersky et al. and modify it slightly.

Let's start very simple. We call the class "Frac" for fraction instead of Rational and its constructor takes the numerator and denominator of the fraction. The val specifier ensures that the class is immutable. We also override the toString method to get a nice print out.

class Frac(val num:Int, val den:Int) {
  override def toString = "%d/%d" format (num,den)
}

With this class at hand we can now create a fraction and print it and its components:

val a = new Frac(1,2)
println(a)
println(a.num)
1/2
1

Now let's add a companion object Frac that allows us to omit the new keyword when creating a fraction. We also want to ensure that the denominator is greater than zero and Scala provides the handy require function to test pre-conditions. Furthermore, We add a toDouble variable that contains the floating point value of the fraction. Note that we could also have said def toDouble = .. instead of val toDouble = ... . It is a trade off between speed and memory consumption. The nice thing about Scala is that the call interface remains the same and the implementation could be changed without any dramas.

class Frac(val num:Int, val den:Int) {
  require(den > 0)
  val toDouble = num/den.toDouble 
  override def toString = "%d/%d" format (num,den)
}

object Frac {
  def apply(num:Int, den:Int) = new Frac(num,den)
}  

println(Frac(1,2))
println(Frac(1,2).toDouble)
1/2
0.5

So far so good. But one problem with the current implementation is that comparisons such as Frac(1,2) == Frac(1,2) return false. We need to override the equals() method and consequently the method hashCode should also be overridden:

class Frac(val num:Int, val den:Int) {
  ...
  override def equals(other:Any) = other match {
    case that:Frac => (this eq that) || 
       (that.num == this.num && this.den == that.den)
    case _ => false
  }

  override def hashCode = 13*(num+13*den)
  ...
}

Apart from testing for equality it would be convenient if fractions could be ordered. The current class implementation does not allow to say Frac(1,3) < Frac(1,2), for instance. But implementing the compare method of the Ordered trait does the trick.

class Frac(val num:Int, val den:Int) extends Ordered[Frac] {
  ...
  def compare(that:Frac):Int = 
    (this.num*that.den) - (that.num*this.den)     
  ...
}

Operator overloading is another feature of Scala and useful when applied appropriately. For a fraction class it certainly makes sense. To keep things simple the following code is limited to the multiplication of fractions with fractions and fractions with scalars.

class Frac(val num:Int, val den:Int)  {
  ...
  def *(that:Frac):Frac = Frac(this.num*that.num, this.den*that.den)
  def *(c:Int):Frac = Frac(num*c, den)   
  ...
}

This enables us to compute Frac(1,3) * Frac(1,2) and Frac(1,3) * 2 but we still cannot calculate 2 * Frac(1,3) because there is no multiplication method for Int that takes a fraction. For that we need an implicit function, preferably defined within the companion object:

object Frac {
  def apply(num:Int, den:Int) = new Frac(num,den)
  implicit def int2Frac(num:Int):Frac = Frac(num,1)
}  

Depending on scope it might be necessary to import the implicit function of the companion object but then things work as expected:

import Frac.int2Frac
println(Frac(1,3) * Frac(1,2))    // prints 1/6
println(Frac(1,3) * 2)            // prints 2/3
println(2 * Frac(1,3))            // prints 2/3

When multiplying a fraction with an integer scalar we would like to get a fraction back. However, when multiplying with a floating point number we need a floating point number as a result. The implementation is similar; just the direction is different.

class Frac(val num:Int, val den:Int) {
  ...
  val toDouble = num/den.toDouble
  def *(c:Int):Frac = Frac(num*c, den)   // Frac * Int => Frac
  def *(c:Double):Double = toDouble*c    // Frac * Double => Double
  ...
}

object Frac {
  def apply(num:Int, den:Int) = new Frac(num,den)
  implicit def int2Frac(num:Int):Frac = Frac(num,1)
  implicit def frac2double(frac:Frac):Double = frac.toDouble
}  
import Frac._
println(Frac(1,2) * 2.5)   // prints 1.25
println(2.5 * Frac(1,2))   // prints 1.25

Alright, that it's. Of course, there is some functionality missing such as addition, subtraction and division operations. Furthermore, fractions should be normalized to the greatest common divisor as shown in Odersky's code. But I wanted to keep this example simple. To conclude let us put together what have now and see what we can do:

class Frac(val num:Int, val den:Int) extends Ordered[Frac] {
  require(den > 0)
  val toDouble = num/den.toDouble
  def *(that:Frac):Frac = Frac(this.num*that.num, this.den*that.den)
  def *(c:Int):Frac = Frac(num*c, den)   
  def *(c:Double):Double = toDouble*c   
  def compare(that:Frac):Int = 
    (this.num*that.den) - (that.num*this.den)     
  override def equals(other:Any) = other match {
    case that:Frac => (this eq that) || 
       (that.num == this.num && this.den == that.den)
    case _ => false
  }
  override def hashCode = 13*(num+13*den)
  override def toString = "%d/%d" format (num,den)
}

object Frac {
  def apply(num:Int, den:Int) = new Frac(num,den)
  implicit def int2Frac(num:Int):Frac = Frac(num,1)
  implicit def frac2double(frac:Frac):Double = frac.toDouble
}  
import Frac._
println(Frac(1,3) == Frac(1,2))
println(Frac(1,3) < Frac(1,2))
println(Frac(1,3) * Frac(1,2))
println(Frac(1,2) * 2.5)
println(2.5 * Frac(1,2))
println(Frac(1,2) * 2)
println(2 * Frac(1,2))