A novel approach to word segmentation featuring high speed and allowing noisy text by incorporating spelling correction. Source code on GitHub.
For people in the West it seems obvious that words are separated by space, while in Chinese, Japanese, Korean (CJK languages), Thai and Javanese words are written without spaces between words.
Even the Classical Greek and late Classical Latin were written without those spaces. This was known as Scriptio continua.
And it seems we haven’t yet lost our capabilities: we can easily decipher
the quick brown fox jumps over the lazy dog
Our brain does this somehow intuitively and unconsciously. Our reading speed slows down just a bit, caused by all the background processing our brain has to do. How much that really is we will see if we attempt to do it programmatically.
But why would we want to implement word segmentation programmatically, if people can read the unsegmented text anyway?
For CJK languages without spaces between words, it’s more obvious.
Web search engines like Google or Baidu have to index that text in a way that allows efficient and fast retrieval. Those web search engines are based on inverted indexes. During crawling for each word a separate list of links is created. Each list contains links to all web pages where that specific word occurs.
If we search for a single word then all links contained in the list are valid search results. If we do a boolean search (AND) for two words, then those two lists are intersected and only the links that are contained in both list are returned as results.
Spelling correction allows to correct misspelled words. This is done by looking up the possible misspelled word in a dictionary. If the word is found the word is deemed correct. If the word is not found then those words from the dictionary which are closest to the candidate (according to an edit distance metric like Damerau-Levenshtein) are presented as spelling correction suggestions. Of course, the naive approach of calculating the edit distance for each dictionary entry is very inefficient. SymSpell is a much faster algorithm. But anyway, the precondition is to identify the word boundaries first in order to have an input for the spelling correction algorithm in the first place.
Machine translation, Language understanding, Sentiment analysis and many other information processing tasks are also based on words.
But why would we need word segmentation for languages that have already spaces between words? Well, even in those languages that normally have spaces they get sometimes missing! There are surprisingly many application fields:
- Normalizing English compound nouns that are variably written: for example, ice box = ice-box = icebox; pig sty = pig-sty = pigsty) for search & indexing.
- Word segmentation für compounds: both original word and split word parts should be indexed.
- Typing errors may cause missing spaces.
- Conversion errors: during conversion, some spaces between word may get lost, e.g. by removing line breaks instead of replacing them with spaces.
- OCR errors: inferior quality of original documents or handwritten text may cause that not all spaces between words are recognized properly.
- Transmission errors: during the transmission over noisy channels spaces can get lost or spelling errors introduced.
- Keyword extraction from URL addresses, domain names, table column description or programming variables that are written without spaces.
- For password analysis, the extraction of terms from passwords can be required.
- Speech recognition, if spaces between words are not properly recognized in spoken language.
- Automatic CamelCasing of programming variables.
But word segmentation might be of interest also beyond Natural languages, e.g. for segmenting DNA sequence into words (PDF).
Segmentation variant generation
The naive approach would be to generate all possible variants to segment a string into substrings (word candidates) and lookup those candidates in a dictionary. Those segmentation variants where all substrings are a valid word are returned as results.
The following recursive algorithm enumerates all compositions:
public void wordsegment(string input, string composition=””)
for (int i=1;i<=input.Length;i++)
string part1 = input.Substring(0, i);
//recursion with the remainder of the string
if (part1.Length < input.Length)
For “isit” as input it will generate all 8 possible compositions:
i s i t
i s it
i si t
is i t
But there is a problem. Each string of length n can be segmented into 2^n−1 possible compositions. This is huge!
The naive algorithm is exponential and will not scale!
E.g. our string thequickbrownfoxjumpsoverthelazydog has a length of 35. This results in more than 17 billion segmentation variants to be generated, all their words to be verified in the dictionary, and finally, the variants to be ranked by probability.
We don’t usually have so much time and computing power. We need to come up with something smarter and more efficient.
One way of improving the algorithm is using Dynamic programming:
This a method for solving a complex problem by breaking it down into a collection of simpler subproblems, solving each of those subproblems just once, and storing their solutions. The next time the same subproblem occurs, instead of recomputing its solution, one simply looks up the previously computed solution, thereby saving computation time. Each of the subproblem solutions is indexed in some way, typically based on the values of its input parameters, so as to facilitate its lookup. The technique of storing solutions to subproblems instead of recomputing them is called “memoization”.[Source: Wikipedia]
Every time before our program makes a recursive call to segment the remaining string we check whether this exact substring has been segmented before. If so we just dismiss the recursive call and retrieve the optimum segmentation result for that exact substring from a cache (where it has been stored the first time that specific substring has been segmented).
That technique has been used for word segmentation several times:
- Peter Norvig’s word segmentation Python code can be found in the chapter Natural Language Corpus Data of the book Beautiful Data
- Grant Jenks python_wordsegment.
- Jeremy Kun word segmentation
A Strictly Incremental Algorithm
SymSpell.WordSegment takes a different approach.
We don’t need to generate all compositions in order to find the best one!
In the naive algorithm, the whole tree of segmentation variants is generated. Therefore for each partial segmentation of length n’ the segmentation of the reminder of length r=n-n’ is generated again and again. And only at the end, all generated complete segmentations are compared to find the best one.
i s i t
is i t
But for each partial segmentations of length 2 (i s and is) we need to segment the remainder of length 2 ( it ) only once. We just choose the best partial segmentations for each length, and continue only from that one with further segmenting the remainder. This holds true only under the assumption that consecutive partial segmentations are independent.
i s i t
Incrementally selecting the partition optimum of each substring saves a lot of time.
Maximum word length
Word candidates are generated from a string only up to a maximum length. This maximum length is derived from the longest word present in the dictionary. While this is heuristic, it is very unlikely that a valid genuine word exists which is larger than the longest contained in the dictionary. But that upper bound can be overwritten manually as well to either allow for even longer words or speed up processing by restricting to shorter words.
SymSpell.WordSegmentation has a linear runtime O(n) to find the optimum composition. The exact relationship between text length n and runtime is O(n*Min(n, maximum dictionary word length) / 2).
While the Dynamic Programming solutions reduce the time to solve repeating patterns by memoization, the SymSpell algorithm prevents the generation of repeating problems in the first place.
That saves the computation time and memory consumption required for generating, storing and retrieving repetitive patterns. It also allows forgoing recursion and call stacks.
Segmentation variant selection
There might be several valid variants to split a string into words though:
isit might be i sit or is it
Word frequencies (or word occurrence probabilities) are used to determine the most likely segmentation variant.
Word occurrence probability P=word count C / number of words N in the text corpus used to generate the frequency dictionary.
N equals the sum of all counts c in the frequency dictionary only if the dictionary is complete, but not if the dictionary is truncated or filtered.
Naive Bayes Rule
We assume the word probabilities of two words to be independent. Therefore the resulting probability of the word combination is the product of the two word probabilities:
Instead of computing the product of probabilities we are computing the sum of the logarithm of probabilities. Because the probabilities of words are about 10^-10, the product of many such small numbers could exceed (underflow) the floating number range and become zero.
In order to calculate the word occurrence probabilities we can use different approaches:
- Create our own word frequency dictionary by counting the words in a large text corpus. If the dictionary is derived from a corpus similar to the documents you will process, the similar word occurrence probabilities distribution will ensure optimum results.
- Use an existing frequency dictionary which contains word counts, e.g. the Google Books Ngram data. The drawback is that Google Books Ngram data contains not only valid words but also spelling errors and other fragments. While many spelling errors will anyway have no influence due to their low probabilities, that is not always guaranteed. E.g. frequent spelling errors of frequent words could win over genuine rare words.
- Dictionary quality is paramount for segmentation quality. In order to achieve this two data sources can be combined by intersection: Google Books Ngram data which provides representative word frequencies (but contains many entries with spelling errors) and SCOWL — Spell Checker Oriented Word Lists which ensures genuine English vocabulary (but contained no word frequencies required for ranking of suggestions within the same edit distance).
We can’t fully rely on a dictionary. No dictionary will be ever complete. There are uncommon words, new words, foreign words, personal names, brand names, product names, abbreviations, informal words, slang and spelling errors. Even for words not contained in the dictionary, we want the word segmentation algorithm to select the best possible segmentation variant.
Therefore we have to estimate a word occurrence probability to unknown words. The information we have even for unknown words is their length.
Approximate word occurrence probability P=10 / (N * 10^word length l)
O course we could think of more sophisticated calculation, e.g. by using Markov chains to estimate the probability of a specific substring to be a genuine word.
Word segmentation and spelling correction
As mentioned above several word segmentation solutions exist.
SymSpell.WordSegment is unique in the capability of combining word segmentation and spelling correction.
This has been only feasible by utilizing the extreme lookup speed of SymSpell. During segmentation many potential word candidates are generated — all of them need to be spell checked and the suggestion candidates to be ranked.
By a parameter for maximum edit distance, we can control how much spelling correction we want to allow. With maxEditDistance=0 we have pure word segmentation without spelling correction. With maxEditDistance=2 we consider all words from the dictionary as valid candidates for a substring if they are within a Damerau-Levenshtein edit distance ≤2.
Higher maximum edit distance allows to correct more errors, but also introduces false-positive corrections and generally slows down the segmentation performance. The actual trade-off depends on the quality of the input material. It can make buggy input readable, but it can also introduce errors into an otherwise perfect input.
Norvig: is it
WordSegment ed=0: is it
WordSegment ed=1: visit
LookupCompound ed=0: i sit
LookupCompound ed=1: visit
Norvig: in depend end
WordSegment ed=0: in depend end
WordSegment ed=1: independent
LookupCompound ed=0: independend
LookupCompound ed=1: independent
Norvig: who couqdn’t read
WordSegment ed=0: who co uqdn’t read
WordSegment ed=1: who couldn’t read
LookupCompound ed=0: whocouqdn’tread
LookupCompound ed=1: whocouqdn’tread
*In order to make algorithms comparable for SymSpell.WordSegment and Norvig’s word segmentation the same frequency dictionary has been used.
WordSegment vs. LookupCompound
- mistakenly inserted space within a correct word led to two incorrect terms
- mistakenly omitted space between two correct words led to one incorrect combined term
- multiple input terms with/without spelling errors
Splitting errors, concatenation errors, substitution errors, transposition errors, deletion errors and insertion errors can be mixed within the same word.
So, what’s the difference then between SymSpell.WordSegment and SymSpell.LookupCompound, and what is similar?
- Both can insert and remove spaces from multi-word input strings, as well as correct spelling errors.
- LookupCompound can insert only a single space into a token (string fragment separated by existing spaces). It is intended for spelling correction of word segmented text but can fix an occasional missing space. There are fewer variants to generate and evaluate because of the single space restriction per token. Therefore it is faster and the quality of the correction is usually better.
- WordSegment can insert as many spaces as required into a token. Therefore it is suitable also for long strings without any space. The drawback is a slower speed and correction quality, as many more potential variants exist, which need to be generated, evaluated and chosen from.