Problem (thanks to Stanford’s CS 106X class): Describe a data structure, under 64 KB, that can determine in near constant time whether a string is a valid word in the English language or not.
Solution:
Let’s start by clarifying a few assumptions about this problem. First, do we have to build the data structure? No – we just need to figure out what sort of data structure would be able to do this.
Let’s start with one part of the question: ‘near constant time.’ There are very few common data structures that can achieve this. Some data structures that fail immediately are binary search trees (log n), sorted arrays (log n), and linked lists (n).
One common data structure that is nearly constant time is a hash table, and at first this seems like a great fit. We could hash the string, end up in a hash bucket, and so long as we keep the hash buckets relatively small, we would have a near constant time data structure. This would work as far as being nearly constant time goes. Pic 1 shows such a hash table.
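To make this concrete, here is a minimal sketch in Python (the file name and function names are my own illustration, not part of the original problem). Python’s built-in set is itself a hash table, so membership tests are near constant time on average.

# Minimal sketch of the hash-table approach (illustrative only).
# Assumes a hypothetical newline-separated word list such as "words.txt".
def load_dictionary(path):
    """Build a hash-based set of words; set membership is O(1) on average."""
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

def is_valid_word(dictionary, candidate):
    """Near constant time: hash the string and probe its bucket."""
    return candidate.lower() in dictionary

# Example usage:
# words = load_dictionary("words.txt")
# is_valid_word(words, "singing")   # True if the list contains it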
Now, let’s see if this would work under the space constraint. Let’s assume that there are 50k words in English (counting variations like sing, sang, sung, singing, singer, sings, etc.). This seems about right; maybe it’s higher, but let’s see if we can create a hash table over a dataset of this size in the space that we have.
Let’s also assume that we use ASCII and the average word is 5 letters (seems low, but let’s go with it). So, the average word is 6 bytes long, assuming that it’s null terminated. Storing all of these words, not including the overhead of the hash table, we would have a data structure of 6 * 50k = 300k bytes. This is much larger than we want, and it would probably be larger still, since our estimate of the number of words is probably low. With the overhead of the hash table, the total would be even bigger.
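As a quick back-of-the-envelope check (the per-entry overhead figure below is my own assumption, just to show the order of magnitude):

# Back-of-the-envelope size estimate for the hash-table approach.
WORD_COUNT = 50_000           # assumed number of English words
AVG_WORD_BYTES = 6            # 5 ASCII letters plus a null terminator

string_bytes = WORD_COUNT * AVG_WORD_BYTES      # 300,000 bytes of raw strings
overhead = WORD_COUNT * 4                       # assumed ~4 bytes of table overhead per entry
print(string_bytes)              # 300000 -- well over 4x the 64 KB (65,536 byte) budget
print(string_bytes + overhead)   # 500000 once the assumed table overhead is included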
So, a hash table can’t work as its data store will be too
large.
So, this points to a problem that any data structure would have – we cannot store each word in full and still meet the space requirement. Where do we go from here?
Well, let’s think about what we know about the English
language to look for optimizations. We
know that the words are not random character strings, but words with many
common parts.
Again, a good example is: sing, sang, sung, sings, singer, singing. Here, we have six distinct words. In a hash table, we would have six distinct entries, and we know that approach will not work. However, looking at the words, we can see that they have both common prefixes and common suffixes. Four of the words have the common prefix ‘sing’, four of them end with ‘ng’, and two end with ‘ing.’
Let’s think about what we could do here. This screams for a tree approach where individual characters are nodes. We previously rejected trees because of the time required, but let’s go with this for now. We can make each node branch only when a new letter appears.
By doing this, we have a tree with ‘s’ as the root and the structure shown in pic 2. We can put a bit on each node indicating whether stopping at that node forms a valid word or not.
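As a rough sketch (the class and field names are my own), each node in this first version might look like the following; note that finding the right child still means scanning a list.

# Sketch of the unoptimized tree node: children kept in a list,
# so finding the child for a letter requires a linear scan.
class Node:
    def __init__(self, letter):
        self.letter = letter
        self.is_word = False      # the "valid word" bit described above
        self.children = []        # child Node objects, one per next letter seen

    def find_child(self, letter):
        for child in self.children:   # linear scan -- the weakness discussed next
            if child.letter == letter:
                return child
        return None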
This tree would be fairly quick to search to see if a word were in it, but the structure isn’t quite good enough yet. At each step, we would have to scan all of a node’s children to see whether there is one with the character we’re looking for.
One quick optimization is to create nodes with an array of 26 character pointers plus a ‘valid word’ bit. We can then easily go down the tree and see if we have a valid word or not. And this is almost constant time, as we never have to look up more elements than the number of characters in the word. This type of structure is shown in pic 3; it is a trie, and once common suffixes are shared (as discussed below) it becomes a Directed Acyclic Word Graph (DAWG).
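Here is a minimal sketch of that structure, assuming lowercase ASCII words only (the class and function names are mine, not from the original write-up):

# Sketch of the optimized node: 26 child slots (one per lowercase letter)
# plus a bit that marks the end of a valid word.
class TrieNode:
    def __init__(self):
        self.children = [None] * 26   # index 0 = 'a', ..., 25 = 'z'
        self.is_word = False

def insert(root, word):
    node = root
    for ch in word:
        i = ord(ch) - ord('a')
        if node.children[i] is None:
            node.children[i] = TrieNode()
        node = node.children[i]
    node.is_word = True

def contains(root, word):
    """Near constant time: at most len(word) array lookups."""
    node = root
    for ch in word:
        node = node.children[ord(ch) - ord('a')]
        if node is None:
            return False
    return node.is_word

# Example:
# root = TrieNode()
# for w in ("sing", "sang", "sung", "sings", "singer", "singing"):
#     insert(root, w)
# contains(root, "singer")   # True
# contains(root, "son")      # False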
If we think about the size of this structure, sharing prefixes probably reduces the size of the stored words by a factor of 10, given all of the common prefixes in English (it may be a greater reduction, since most prefixes are quite common, but this seems like a reasonable estimate). This will get us close to the size we’re looking for, but let’s think about how we can do better.
So far, we have done a good job constructing a data structure that optimizes around prefixes, but if we look at examples like ‘sing’, we see that we also have lots of common suffixes. As a further optimization, we could imagine uniting common suffixes. For example, many words in English end in ‘ing’; we could have all of these words point to a common, shared ‘ing’ ending.
This would be a further optimization, but it would make the structure more complex and the word-insertion routine longer, since we would need to detect and merge identical endings.
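One way this could be done (a sketch of one possible post-processing pass, building on the TrieNode sketch above, not necessarily the only approach) is to merge subtrees that are structurally identical, so that, for example, every word ending in ‘ing’ shares a single ‘ing’ tail.

# Sketch of suffix sharing: collapse structurally identical subtrees.
# Assumes the TrieNode class from the earlier sketch (26 children + is_word).
def merge_suffixes(node, registry=None):
    """Return a canonical node; identical subtrees collapse into one object."""
    if registry is None:
        registry = {}
    if node is None:
        return None
    # Canonicalize the children first (bottom-up).
    node.children = [merge_suffixes(child, registry) for child in node.children]
    # A subtree is identified by its word bit plus the identities of its children.
    signature = (node.is_word, tuple(id(child) for child in node.children))
    if signature in registry:
        return registry[signature]    # reuse an existing identical subtree
    registry[signature] = node
    return node

# Example:
# root = merge_suffixes(root)   # common endings are now stored only once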
This would give us a compact and fast data structure.
This is a good example of a problem where it’s good to try the first idea that could work (e.g., a hash table), but then move on and look at optimizations to get to the correct answer.
/* Please let me know if you find bugs or know alternative
solutions. Thanks, Noah */
Nicely done. I think this is a question I was asked during a Microsoft interview.
Regarding the suffix optimization, I think you're attempting to "zip" the data from both the head of the word and the tail of the word. It's potentially problematic (I can't vividly see the solution yet) because either (1) you zip from the tail during insertion time, in which case you're not sure where to let the variable body length remain unoptimized, or (2) you zip as a post-tree-creation processing step, in which case it's expensive to decide how to cluster them.
Regardless, I would propose this heuristic: Depending on whether the language is "head-optimizable" or "tail-optimizable", I would calculate both corresponding trees (maintain two trees during insertion time) and pick the one with a smaller size.
For example, hypothetically, the English language might be tail-optimizable, because as you mentioned, a lot of words end with "+ing" (sing, wing, bring), or "+ination" (procrastination, imagination, combination), so a tail-optimized tree probably takes less space than the head-optimized tree.
This makes me nostalgic for painful all-nighters :)
Thx Amin. It does make me remember some late nights! One easy way to do a suffix-optimized (tail-optimized) version would be just to insert the words backwards and see if that makes a smaller data structure.
Noah