Monday, December 10, 2012

Newick tree format parser in Scala

Parser combinators in Scala are nice for small, non-time-critical parsers. While typically slower than parser generators such as ANTLR or hand-written recursive descent parser (see also here), they are easier to implement and do not require external syntax or output files. If speed is an issue there is also Parboiled, a parser generator very similar in style to the parser combinators in Scala but apparently faster. I haven't tried it, however.

I needed a parser to read phylogentic tree data and Scala's parser combinators seemed just right. Without going into any details concerning phylogenetic trees, the following figure shows an example tree with a root node R, two internal nodes X and Y, and the leaf nodes A,B,C,D. Each branch has a length annotation, e.g. 0.1.

There are various formats to describe phylogentic trees but to keep it simple let's start with a format defined by the following EBNF:

  tree    ::= identifier length [subtree]
  subtree ::= "(" tree {"," tree} ")"
  length  ::= ":" floatingPointNumber 

Here a tree has a node identifier followed by the branch length (distance to parent), followed by an optional subtree. A subtree is a sequence of trees enclosed in brackets and separated by commas. The branch length is given by a colon followed by a floating point number.

Assuming branch lengths of 0.1 for all nodes (except the root node R), the tree above would be written as

R:0.0 (X:0.1 (A:0.1, B:0.1), Y:0.1 (A:0.1, B:0.1))

The given EBNF can easily be translated into a parser combinator that can read this tree description:

import scala.util.parsing.combinator._

class SimpleParser extends JavaTokenParsers {
  def tree                = ident ~ length ~ opt(subtree)
  def subtree:Parser[Any] = "(" ~> repsep(tree, ",") <~ ")"
  def length              = ":" ~> floatingPointNumber
}

Note that [subtree] translates to opt(subtree) and tree {"," tree} to repsep(tree, ","). Furthermore, subtree requires a return type, since it is a recursive function. I am lazy here and say Parser[Any], since we are going to improve on this code anyway. Let's add a method read() to read a tree description and return the parsing result:

class SimpleParser extends JavaTokenParsers {
  def tree = ident ~ length ~ opt(subtree)
  def subtree:Parser[Any] = "(" ~> repsep(tree, ",") <~ ")"
  def length = ":" ~> floatingPointNumber

  def read(text:String) = parseAll(tree, text)
}

val parser = new SimpleParser()
println( parser.read("R:0.0(X:0.1(A:0.1,B:0.1),Y:0.1(C:0.1,D:0.1))") )

The output is a bit convoluted but the structure reflects the tree structure. Note that opt() returns an Option, which is the reason for the Some and None in the output:

parsed: ((R~0.0)~Some(List(((X~0.1)~Some(List(((A~0.1)~None), ((B~0.1)~None)))), ((Y~0.1)~Some(List(((C~0.1)~None), ((D~0.1)~None)))))))

So far so good, but the current parser output is not very useful. What we want to get back from the parser is a proper data structure of nodes and their children. In the next steps we are going to refactor and extend the current code a bit. We start with a Node class that has a display() method to print a tree:

case class Node(name:String, length:Double, descendants:List[Node]) {
  def display(level:Int=0) {
    printf("%s%s:%.2f\n","  "*level,name,length)
    descendants.foreach(_.display(level+1))
  }
}

To keep the grammar separated from the rest and also to allow comments we factor out the read() method into an abstract super class that takes the node type as type parameter T:

abstract class TreeParser[T] extends JavaTokenParsers {
  val comment = """\\[.+?\\]"""

  def tree:Parser[T]

  def read(file:File):T = read(Source.fromFile(file).getLines().mkString("\n"))

  def read(text:String):T = parseAll(tree, text.replaceAll(comment,"")) match {
    case Success(result, next) => result
    case failure => throw new Exception(failure.toString)
  }
}

Comments are everything between two rectangular brackets (defined by a non-greedy, regular expression) and removed from the input text before running the parser. We also overload the read() method to read a tree from a file and tree is an abstract method to be implement by a subclass. Note that the read() method returns a type T (a node), which will be the root node of the tree.

parseAll() returns a ParseResult with sub classes Success, Failure or Error. We are only interested in Success and match against it to retrieve the parsing result (the root node of type T). In all other cases an exception is thrown, showing the parsing error.

Now we can rewrite the SimpleParser class as follows:

class SimpleParser[T](nf: (String,Double,List[T]) => T) extends TreeParser[T] {
  def tree = (ident ~ length ~ opt(subtree)) ^^ {case n~l~d => nf(n,l,d.getOrElse(Nil))}
  def subtree:Parser[List[T]] = "(" ~> repsep(tree, ",") <~ ")"
  def length = ":" ~> floatingPointNumber ^^ { _.toDouble }
}

There are two main difference to the previous implementation. First, a node factory (nf) is provided that takes a node name (String), the branch length (Double) and a list of child nodes (List[T]) to create a node of type T. This allows us to use the same parser to create trees of different node types, which is especially handy for testing. Note that the function subtree() now has a tight return type Parser[List[T]] and is not Parser[Any] anymore.

The second change is that actions have been added to the parsers. { _.toDouble } returns a Double for the parsed floating point number of the branch length and {case n~l~d => nf(n,l,d.getOrElse(Nil))} calls the node factory to create a node. If there is no subtree the list of descendants is Nil (d.getOrElse(Nil)).

If we run the parser now, we get back a tree structure of nodes to play with. I am a bit sneaky with the node factory and simply provide the apply() method of the node singleton that is automatically generated for the case class.

val parser = new SimpleParser(Node.apply)
val root = parser.read("R:0.0(X:0.1(A:0.1,B:0.1),Y:0.1(C:0.1,D:0.1))")
root.display()
R:0.00
  X:0.10
    A:0.10
    B:0.10
  Y:0.10
    C:0.10
    D:0.10

Now we are ready for the real deal: A parser for phylogenetic trees in Newick format. I took the following grammar from here and adapted it bit.

  tree        ::= subtree ";"
  descendants ::= "(" subtree {subtree ","} ")"
  subtree     ::= descendants name length | leaf
  leaf        ::= name length
  name        ::= [quoted | unquoted]
  unquoted    ::= identifier
  quoted      ::= "'" { not "'" | "''"} "'"
  length      ::= [":" floatingPointNumber]

From that we can derive the corresponding parser:

class NewickParser[T](nf: (String,Double,List[T]) => T) extends TreeParser[T] {
  def tree = subtree <~ ";"
  def descendants:Parser[List[T]] = "(" ~> repsep(subtree, ",") <~ ")"
  def subtree = descendants~name~length ^^ {case t~n~l => nf(n,l,t)} | leaf
  def leaf = name~length ^^ {case n~l => nf(n,l,Nil)}
  def name = opt(quoted | unquoted) ^^ { _.getOrElse("") }
  def unquoted = ident
  def quoted = """'([^']|'')*'""".r  ^^ { _.drop(1).dropRight(1).replace("''","'") }
  def length = opt(":" ~> floatingPointNumber) ^^ { _.getOrElse("0").toDouble }
}

The Newick format allows to leave out node names or branch length information, which can lead to rather wicked tree definitions such as

"(A,(B,C)E)F;"
(((,),));
(,B)C:0.3;

With the lengthy introduction in mind the NewickParser class is hopefully not beyond comprehension. The only difficult and admittedly ugly part is the action { _.drop(1).dropRight(1).replace("''","'") }, which simply removes the enclosing quotes from a quoted identifier and replaces pairs of single quotes ('') by single quotes. The quoted parser could be implemented more nicely as:

def quoted = "'" ~> """([^']|'')*""".r <~ "'"  ^^ { _.replace("''","'") }

but in this case leading and trailing white spaces of quoted identifiers are lost, which might be acceptable. The astute reader will notice two other limitation. Firstly, nested comments, e.g. [ a comment [ another comment]] are not permitted and identifiers cannot contain brackets. I have not found a really nice way of supporting this functionality. Either one extends the grammar to allow comments where ever white spaces can appear, which makes the grammar unwieldy. Or one overrides the white space parser by a custom parser that supports comments. Or the last option that I chose is to implement a separate comment parser to remove comments from the input text instead of the regular expression employed in TreeParser.

class CommentParser extends JavaTokenParsers {
  override val skipWhitespace = false

  /** removes comments from given text */
  def remove(text:String) = parseAll(rest, text)

  def rest = rep(quoted | comment | any) ^^ { _.mkString }
  def any = """.""".r
  def quoted = """'([^']|'')*'""".r
  def comment: Parser[String] = ("["~rep(not("]")~(comment | any))~"]") ^^ {_ => ""}
}

I found the increase in complexity not worth the additional functionality and stayed with the limited parser that does not allow nested comments.

Finally, some links I came across recently that I found useful for parser developement:
Code examples for combinator parsing
Ten grammars for simple integer arithmetic expressions