How to

When solving these problems, learning by doing is much better than learning by reading. I encourage to you read only as far in the solution as you need, then trying to solve the problem. If you get stuck, try reading a little further. And of course, let me know if you find a better solution!

Thursday, February 23, 2012

Data structures and DAWGS


Problem:  (Thank you to Stanford’s CS 106X class).  Discuss a data structure, under 64k bytes, which can in near constant time determine whether a string is a valid word in the English language, or is invalid.

Solution:
Let’s start by asking a few more assumptions about this problem.  First, do we have to build the data structure?  No – we just need to figure out what sort of data structure would be able to do this.

Well, let’s start out with one of the parts of the questions – ‘near constant time.’  Well, there are very few common data structures which can achieve this.  Some data structures that fail immediately are binary search trees (log n), sorted arrays (log n), linked lists (n or log n).

One common data structure that is nearly constant time is a hash table.  And, at first this seems like a great data structure.  We could hash based upon the string, end up in a hash bucket, and so long as we keep the hash buckets relatively small, we could get a near constant time data structure.  Well, this is true.  This would work as being nearly constant time.   Pic 1 shows such a hash table.


Now, let’s see if this would work on the space constraint.  Let’s make an assumption that there are 50k words in English (especially with variations like sing, sang, sung, singing, singer, sings, etc.).  This seems about right.  Maybe it’s higher, but let’s see if we can create a hash table with a dataset of this size in the space that we have.

Let’s also assume that we use ascii and the average word is 5 letters (seems low, but let’s go with it).  So, the average word is 6 bytes long, assuming that it’s null terminated.  Storing all of these words, not include the over head of the hash table, we would have a data structure of 6 * 50k = 300k.  This is much larger than we want, and probably it would be larger as our estimate of words is probably low.  With the overhead of the hash table, we would have an even bigger amount.

So, a hash table can’t work as its data store will be too large.

So, this creates a problem that any data structure would have – we cannot uniquely store each word and meet the space requirements.

Well, where to go from here?

Well, let’s think about what we know about the English language to look for optimizations.  We know that the words are not random character strings, but words with many common parts.  

Again, a good example is:  sing, sang, sung, sings, singer, singing.  Here, we have six distinct words.  In a hash table, we would have six distinct entries and we know that this approach will not work.

However, looking at the words, we can see that they have both common prefix and common suffix.  Four of the words have the common prefix ‘sing’ and four of them end with ‘ng’ and 3 with ‘ing.’

Let’s think about what we could do here.  Well, this screams for a tree approach where individual characters are nodes.  We previously rejected a tree because of the time required, but let’s go with this for now.  We can make ach node only branches when it has a new letter. 
By doing this, we have a tree with s as the root and then the structure as shown in pic 2.  


We could put a bit on each bit indicating whether stopping at the node is a valid word or not.
This tree would be fairly quick to search to see if a word were in it, but the structure isn’t quite good enough for searching.  For example, we would have to search all of the lower nodes to see if a descendant node with the character we’re looking for.

One quick optimization is to create nodes with all 26 characters as pointers and a ‘true’ bit.  We can then easily go down the tree and see if we have a valid word or not.  And, this is almost constant time as we never have to look up more elements the number of characters in the word.  This type of structure is shown in pic 3, and is called a Directed Acyclical Word Graph (DAWG).   


If we think of the size of this structure, it probably reduces the size of the words by a factor of 10 given all of the common prefixes in words (probably  a greater reduction as over time, most prefixes are pretty common, but seems pretty close).  This will get us close to the size we’re looking for, but let’s think about how we can do better.

So far, we have done a good job constructing a data structure that optimizes around prefixes, but if we look at the example of the words like, ‘sing’, we see that we also have lots of common suffixes.  As simple optimizations, we could imagine uniting common prefixes. For example, many words in English end in ‘ing’; we could have all of these words point to a common ‘ing’ ending.

This would be a further optimization, but would obviously require balancing and make the function which inserts words longer. 

This would give us a much optimized and fast data structure.

This is a good example where it’s good to try the first thought of what could work (e.g., a hash table), but then move on and look at optimizations to get to the correct answer.

/* Please let me know if you find bugs or know alternative solutions.  Thanks, Noah */




2 comments:

  1. Nicely done. I think this is a question I was asked during a Microsoft interview.

    Regarding the suffix optimization, I think you're attempting to "zip" the data from both the head of the word and the tail of the word. It's potentially problematic (I can't vividly see the solution yet) because either (1) you zip from tail during insertion time, in which case you're not sure where to let the variable body length remain unoptimized, or (2) you zip as a post-tree-creation processing step, in which case it's expensive to decide how to cluster them.

    Regardless, I would propose this heuristic: Depending on whether the language is "head-optimizable" or "tail-optimizable", I would calculate both corresponding trees (maintain two trees during insertion time) and pick the one with a smaller size.

    For example, hypothetically, English language might be tail-optimizable, because as you mentioned, a lot of words end with "+ing" (sing, wing, bring), or "+ination" (procastination, imagination, combination) so a tail-optimized tree probably takes less space than the head-optimized tree.

    This makes me nostalgic for painful all-nighters :)

    ReplyDelete
    Replies
    1. Thx Amin. It does make me remember some late nights! One easy way to do a suffix-optimized (tail optimized) would be just to compare putting the words in backwards and see if that makes a smaller data structure.

      Noah

      Delete