Many users find that a spell checker is a tool they cannot live without; they mostly know how to spell, but once in a while the correct spelling of a word simply slips their mind. With the help of a spell checker, the search engine tries to guess the user's intention and recommends what is likely to be the correct spelling of whatever the user typed. In this project, we will learn the basic theory behind how a spell checker works and explore different ways of implementing a basic one.
Given a user-typed word \(w\), the spell checker's job is to find the correct spelling \(c\) of \(w\). Clearly, given a user input \(w\), there is no way to know its correct spelling \(c\) for sure (for example, should "kives" be "knives" or "hives" or ...?). Thus a good way to model the spelling check problem is to use a probabilistic approach; we are trying to find the correction \(c\), out of all possible candidate corrections \(C\), that maximizes the probability that \(c\) is the intended correction, given the original word \(w\):
\[ \operatorname{argmax}_{c \in C} P(c|w) \]

By Bayes' Theorem, this is equivalent to:

\[ \operatorname{argmax}_{c \in C} P(c)\,P(w|c)\,/\,P(w) \]

Since \(P(w)\) is the same for every possible candidate \(c\), we can factor it out, giving:

\[ \operatorname{argmax}_{c \in C} P(c)\,P(w|c) \]

The four parts of this expression are:

- Selection mechanism: \(\operatorname{argmax}\). We choose the candidate with the highest combined probability.
- Candidate model: \(c \in C\). The set of candidate corrections to consider.
- Language model: \(P(c)\). The probability that \(c\) appears as a word on its own.
- Error model: \(P(w|c)\). The probability that \(w\) would be typed when the user meant \(c\).
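To make the selection step concrete, here is a small, purely illustrative sketch of the argmax computation (the maps lm and em are hypothetical stand-ins for the language and error models developed in this project):

import java.util.HashMap;
import java.util.Map;

public class ArgmaxSketch {
    // Returns the candidate c maximizing P(c) * P(w|c).
    // "lm" maps each candidate c to P(c); "em" maps it to P(w|c).
    static String argmaxCorrection(String w, Map<String, Double> lm, Map<String, Double> em) {
        String best = w;        // fall back to the original word
        double bestScore = 0.0;
        for (Map.Entry<String, Double> e : em.entrySet()) {
            String c = e.getKey();
            double score = lm.getOrDefault(c, 0.0) * e.getValue();
            if (score > bestScore) {
                bestScore = score;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Double> lm = new HashMap<>();
        Map<String, Double> em = new HashMap<>();
        lm.put("knives", 0.0002); lm.put("hives", 0.00005); // toy P(c) values
        em.put("knives", 1.0);    em.put("hives", 1.0);     // toy P(w|c) values
        System.out.println(argmaxCorrection("kives", lm, em)); // prints "knives"
    }
}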
In Task 1, we start our exploration of spell correction algorithms with the simplest model:

- Language model: \(P(c) = 1\) if \(c\) is in the dictionary, and 0 otherwise.
- Error model: \(P(w|c) = 1\) for every candidate \(c\).
- Candidate model: the dictionary words within the shortest edit distance of \(w\), where only insertions and deletions are allowed.
As you can see, this is a very naive model, which simply returns a dictionary word \(c\) within the shortest edit distance of \(w\). The "edit operations" are very limited as well; only insertions and deletions are considered, so a transposition (swapping two adjacent characters) or a substitution (replacing one character with another) counts as two edit operations instead of one. In later parts of this project, we will remove these limitations and see how these "improvements" impact the performance of our spell correction algorithm.
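For intuition, the insertion/deletion-only distance between two strings \(a\) and \(b\) equals \(|a| + |b| - 2\,\mathrm{LCS}(a, b)\), where \(\mathrm{LCS}\) is the length of their longest common subsequence: every character outside a common subsequence must be deleted or inserted. A small sketch (not part of the provided code):

public class InsDelDistanceSketch {
    // Edit distance when only insertions and deletions are allowed:
    // |a| + |b| - 2 * LCS(a, b), computed with the standard LCS DP table.
    static int insDelDistance(String a, String b) {
        int[][] lcs = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); ++i) {
            for (int j = 1; j <= b.length(); ++j) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    lcs[i][j] = lcs[i - 1][j - 1] + 1;
                } else {
                    lcs[i][j] = Math.max(lcs[i - 1][j], lcs[i][j - 1]);
                }
            }
        }
        return a.length() + b.length() - 2 * lcs[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(insDelDistance("teh", "the"));      // 2: a transposition costs two ops
        System.out.println(insDelDistance("kives", "knives")); // 1: a single insertion
    }
}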
In project3.zip, we included Java code, SpellChecker.java, that implements this basic model. It also includes a basic "dictionary" file that you can use for this task, dict1.txt, and a "spelling correction benchmark dataset" file, testset.txt. The dictionary file contains one "dictionary word" per line, and each line of the testset contains a "correctly-spelled word:" followed by a list of misspellings. Open and look at the three files to understand what they contain and how to interpret them.
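As an illustration, here is a minimal sketch of reading testset.txt under the format just described (assuming the misspellings are whitespace-separated after the colon):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class TestsetReaderSketch {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader("testset.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split(":"); // "correct_word: misspelling1 misspelling2 ..."
                if (parts.length < 2) continue;   // skip lines with no misspellings
                String correct = parts[0].trim();
                String[] misspellings = parts[1].trim().split("\\s+");
                System.out.println(correct + " has " + misspellings.length + " misspellings");
            }
        }
    }
}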
You can run our provided SpellChecker code by executing the following command (after you successfully compile the Java code using the javac command, of course):
cs246@cs246:~$ java SpellChecker dict1.txt testset.txt
The above command loads the dictionary entries in dict1.txt and runs the simple correction model against the testset in testset.txt. It then prints a sequence of "misspelled words" from the testset and the "correct spellings" proposed by our simple model, with a 0/1 indicator of whether the proposed correction is indeed the correct spelling:
Finished loading dictionary. Total size: 29066
gallary -> allay: 0
inetials -> initials: 1
diagrammaticaally -> diagrammaticaally: 0
...
neccesary -> necessary: 1
rember -> remember: 1
FINAL RESULT: 161 of 265 correct. ( 60.75% accuracy 5.66% unknown )

As the last output, the Java code prints the overall accuracy of the spelling correction algorithm, which is 60.75%. "5.66% unknown" means that 5.66% of the "correctly spelled words" in the testset do not exist in our dictionary, so there was no way our algorithm could predict the correct spelling for them. An accuracy of 60% is not too bad for such a simple model, but of course we want to do better. Our goal for this project is to raise the accuracy to around 75% by improving this simple model step by step.
The first improvement that we will consider is a better language model \(P(c)\). In the simple model of Task 1, we assume that every candidate \(c \in C\) has the same probability \(P(c) = 1\), but as you can imagine, this is too crude an approximation. We want to assign a more realistic probability to each \(c\). But where can we obtain a more realistic \(P(c)\)?
The most common way to obtain a better \(P(c)\) is to collect a large and relatively clean corpus (one with few, if any, spelling errors) and use the relative frequency of each word in the corpus as the estimate of \(P(c)\). In the project3.zip file, we included an example corpus of a million words that can be used for this purpose, corpus.txt. This is not a big corpus by today's standards, but we will consider it "good enough" for our project. (Of course, the larger and higher-quality our corpus is, the better our estimate of \(P(c)\) gets, improving the overall accuracy of the spell corrector.)
Now your job is to change the provided SpellChecker class so that it uses the relative frequency of each word \(c\) in the corpus as \(P(c)\). To save you from writing code to tokenize the corpus file and count the frequency of each word, we have already collected the frequency statistics of every word in corpus.txt and made them available in the dict2.txt file of project3.zip:
a 21124
aah 1
aaron 5
ab 2
...
zweck 1
zygoma 1
zygomatic 1
Each line of dict2.txt corresponds to a unique word in the corpus together with its frequency count. For example, the first line indicates that the word "a" appears 21,124 times in corpus.txt. (Note that dict2.txt contains exactly the same list of words as dict1.txt. In fact, the word list in dict1.txt was obtained simply by removing the frequency counts from dict2.txt.)
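For example, since the corpus contains about a million words in total, this estimate gives \(P(c) \approx 21124 / 10^{6} \approx 0.021\) for \(c = \) "a". (The exact denominator should be the sum of all frequency counts in dict2.txt.)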
In order to use this new dictionary file with our SpellChecker, you just need to provide the name of the new dictionary file as follows:
cs246@cs246:~$ java SpellChecker dict2.txt testset.txt
Finished loading dictionary. Total size: 29066
gallary -> allay: 0
...
rember -> remember: 1
FINAL RESULT: 161 of 265 correct. ( 60.75% accuracy 5.66% unknown )
Unfortunately, as you can see from the above result, the current implementation of SpellChecker does not use the word frequency information in calculating the language model \(P(c)\) (it simply returns \(P(c) = 1\) if \(c\) is in the dictionary)! Its result is exactly the same as when you provided dict1.txt as the dictionary.
Now modify the SpellChecker class so that it uses the frequency information of \(c\) in calculating \(P(c)\). In the class, the function calculateLM() is responsible for calculating the language model probability \(P(c)\):
/* calculateLM: returns the language model P(word)
   Input: word
   Output: P(word)
*/
public double calculateLM(String word) {
    // Currently, we simply return P(word)=1 if the word appears in the dictionary
    //
    // TODO: modify this function so that it uses the frequency of the word in computing P(word).
    //       You can access the frequency of word from "dict" like dict.get(word)
    //
    return dict.containsKey(word) ? 1.0 : 0.0;
}
so you will need to modify (at least) the above function.
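As a hint, here is one minimal sketch of such a change. It assumes dict maps each word to its numeric count (per the TODO comment above) and that you add a field, here called totalCount (an assumed name, not part of the provided skeleton), that you sum up while loading dict2.txt:

// Sketch: relative frequency as the language model.
// "totalCount" is an assumed field, computed once while loading dict2.txt.
public double calculateLM(String word) {
    if (!dict.containsKey(word)) {
        return 0.0; // unknown words get zero probability
    }
    return dict.get(word) / totalCount; // count(word) / total word occurrences
}

With this definition, frequent words like "a" receive a much higher \(P(c)\) than rare words, which is exactly the signal the naive model was missing.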
Once you finish modifying the code, compile and run it on the provided testcases. Even with modest changes, it is easy to reach an overall accuracy of 66% (a gain of more than five percentage points over the earlier model, just from using real-world statistics!), so if you get less than 66% accuracy, you may want to review your code for obvious mistakes and errors. Make sure that your code completes all testcases without generating any runtime errors, such as null-pointer or division-by-zero exceptions.
We were able to improve the accuracy of our spell corrector from 60% to 66% by using a better language model, but we want to do better. In Task 3, we will further improve its accuracy with a better candidate generation model, using an improved edit-distance function. For a spell-correction task, Damerau-Levenshtein distance is a popular edit-distance function. Compared with the simple edit distance used in Tasks 1 and 2, Damerau-Levenshtein includes two additional edit operations: substitution (replacing a character with another character) and transposition of two adjacent characters, which correspond to common misspelling patterns by humans. For example, the two words "teh" and "the" are within edit distance 1 under the Damerau-Levenshtein distance (as opposed to 2 under our naive model) because one is obtained from the other by transposing the second and third characters. Similarly, the two words "practice" and "practise" are within edit distance 1 under the Damerau-Levenshtein distance because one is obtained from the other by substituting "s" for the seventh character "c".
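For reference, the full (restricted) Damerau-Levenshtein distance, also known as optimal string alignment, is usually computed with dynamic programming, as in the sketch below. This is for intuition only; the provided SpellChecker generates candidate strings instead of filling in a distance matrix:

public class DamerauLevenshteinSketch {
    static int distance(String a, String b) {
        int m = a.length(), n = b.length();
        int[][] d = new int[m + 1][n + 1];
        for (int i = 0; i <= m; ++i) d[i][0] = i;
        for (int j = 0; j <= n; ++j) d[0][j] = j;
        for (int i = 1; i <= m; ++i) {
            for (int j = 1; j <= n; ++j) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(
                        d[i - 1][j] + 1,         // deletion
                        d[i][j - 1] + 1),        // insertion
                        d[i - 1][j - 1] + cost); // substitution
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1); // transposition
                }
            }
        }
        return d[m][n];
    }

    public static void main(String[] args) {
        System.out.println(distance("teh", "the"));           // 1 (transposition)
        System.out.println(distance("practice", "practise")); // 1 (substitution)
    }
}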
Now modify the SpellChecker class so that it uses the Damerau-Levenshtein distance as the edit distance between two strings. Since our original edit distance already includes the insertion and deletion operators, you will need to implement two additional operators: (1) transposition of two adjacent characters and (2) substitution of a single character. Inside the SpellChecker class, the function EditDist1() is where the edit-distance operators are implemented:
/* EditDist1: computes the candidate set C within edit distance 1 of the input word
   Input: word - the input word
   Output: all strings within edit distance 1 of "word";
           the value of the output map is the edit distance of the string, which is 1.0
*/
private Map<String, Double> EditDist1(String word) {
    Map<String, Double> cands = new HashMap<>();
    for (int i = 0; i <= word.length(); ++i) {
        // split the word at position i
        String L = word.substring(0, i);
        String R = word.substring(i);
        String temp;
        // "deletion at [i]": drop the character at the current position i
        if (R.length() > 0) {
            temp = L + R.substring(1);
            cands.put(temp, 1.0); // 1.0 indicates edit distance 1
        }
        // "insertion at [i]": insert a letter (a-z) at the current position i
        for (int j = 0; j < alphabet.length(); ++j) {
            temp = L + alphabet.charAt(j) + R;
            cands.put(temp, 1.0); // 1.0 indicates edit distance 1
        }
        // TODO: implement transposition operator here
        // TODO: implement substitution operator here
    }
    return cands;
}
Implement the two new operators in the above function, then compile and run your modified code on the provided testcases and check the result. If implemented correctly, you are likely to see an overall accuracy of around 74%, another eight-point jump from Task 2! Again, make sure that your code completes all testcases without generating any runtime errors, such as null-pointer or division-by-zero exceptions.
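If you are unsure where to start, the sketch below fills in the two TODOs using the skeleton's existing L/R split, alphabet string, and cands map. Treat it as one possible implementation, not the official solution:

// "transposition at [i]": swap the two adjacent characters at positions i and i+1
if (R.length() > 1) {
    temp = L + R.charAt(1) + R.charAt(0) + R.substring(2);
    cands.put(temp, 1.0); // 1.0 indicates edit distance 1
}
// "substitution at [i]": replace the character at position i with each letter (a-z)
if (R.length() > 0) {
    for (int j = 0; j < alphabet.length(); ++j) {
        temp = L + alphabet.charAt(j) + R.substring(1);
        cands.put(temp, 1.0); // 1.0 indicates edit distance 1
    }
}

Note that the substitution loop also regenerates the original word itself (when the inserted letter equals the character being replaced); this is harmless for candidate generation.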
Now that you have reached an accuracy of 74% by using a better language model and an improved edit-distance function, your required tasks for Project 3 are over. As long as you complete Tasks 1 through 3 successfully and obtain an accuracy of at least 74%, you will get full credit for Project 3. But if you are still interested in further improving the accuracy, the sky is the limit!
There are many options that you can explore. For example, we used a very simple error model, \(P(w|c) = 1\) if \(c \in C\), but you could incorporate the edit distance from \(w\) to \(c\) in calculating \(P(w|c)\). (The calculation of the error model \(P(w|c)\) is done inside the function calculateEM() in our code.) Going further, you may want to assign different "weights" or "distances" to each edit operation depending on how common it is. For example, substituting "e" with "i" is likely to be much more common than substituting "b" with "j". You may even be able to "learn" the error model \(P(w|c)\) by training the probabilities on a large "misspelling corpus" such as the Birkbeck spelling error corpus from the Oxford Text Archive. Again, all these potential improvements are optional and not required for this project.
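As one illustration of the first idea, you could make \(P(w|c)\) decay with the edit distance between \(w\) and \(c\). The sketch below assumes you change calculateEM to receive the candidate's edit distance; both the extra parameter and the constant ALPHA are assumptions, not part of the provided skeleton:

// Hypothetical distance-sensitive error model: P(w|c) = ALPHA^dist,
// so candidates at edit distance 1 are strongly preferred over distance 2.
private static final double ALPHA = 0.05; // illustrative value; tune on the testset

public double calculateEM(String word, String cand, double dist) {
    return Math.pow(ALPHA, dist);
}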
If you decide to do this optional part, please submit a version that includes your work for Task 4 as a separate file named SpellChecker2.java. That is, submit two Java files: SpellChecker.java (your work through Task 3) and SpellChecker2.java (including your optional improvements). You will also have to rename the Java class to SpellChecker2 for the Task 4 submission. This ensures that you won't be penalized even if your optional "improvement" actually degrades accuracy on our hidden test cases.
In Project 3, you need to submit two files: SpellChecker.java and log.txt. (If you completed the optional Task 4, include SpellChecker2.java as well.) All your code should be written within the SpellChecker class. You can implement helper functions whenever necessary, but do not change the main() or runTestcases() functions! Any modification to these functions may break the API of the grading script.
Your project must be submitted electronically before the deadline through CCLE. Navigate to Projects on the left of the page, and click on the Project 3 submission section. If you submit multiple times, we will grade only the latest submission.
You should submit a single file, project3.zip, created using a zip compression utility (e.g., by running the "zip -r project3.zip *" command in the VM). It must have the following packaging structure:
project3.zip
|
+- SpellChecker.java
|
+- log.txt
|
+- SpellChecker2.java (optional)
|
+- README.txt (optional)
You can test your submission with the provided p3_test.sh grading script:

cs246@cs246:~$ ./p3_test.sh project3.zip

(If your project3.zip file is not located in the current directory, you need to prepend the path to the zip file. You may need to run "chmod +x p3_test.sh" if you get a permission error.)
You MUST test your submission using the script before your final submission to minimize the chance of an unexpected error during grading. When everything runs properly, you will see output similar to the following from the grading script:
Compiling SpellChecker.java...
Testing SpellChecker...
Finished loading dictionary. Total size: 4
pararrel -> parallel: 1
tehi -> they: 1
FINAL RESULT: 2 of 2 correct. ( 100.00% accuracy 0.00% unknown )
SUCCESS!