CS246 Project 3

Overview

Many users find that a spell checker is a tool they cannot live without; they mostly know how to spell, but once in a while the correct spelling of a word simply slips their mind. With the help of a spell checker, a search engine tries to guess the user's intention and recommends what is likely to be the correct spelling of whatever the user typed. In this project, we will learn the basic theory behind how a spell checker works and explore different ways of implementing a basic spell checker.

Spell Correction: Learning Basics

Given a user-typed word \(w\), the spell checker's job is to find the correct spelling \(c\) of \(w\). Clearly, given a user input \(w\), there is no way to know its correct spelling \(c\) for sure (for example, should "kives" be "knives" or "hives" or ...?). Thus a good way to model the spelling check problem is to use a probabilistic approach; we are trying to find the correction \(c\), out of all possible candidate corrections \(C\), that maximizes the probability that \(c\) is the intended correction, given the original word \(w\):

\[ \operatorname{argmax}_{c \in C} P(c|w) \]

By Bayes' Theorem, this is equivalent to:

\[ \operatorname{argmax}_{c \in C} \frac{P(c)\,P(w|c)}{P(w)} \]

Since \(P(w)\) is the same for every possible candidate \(c\), we can factor it out, giving:

\[ \operatorname{argmax}_{c \in C} P(c)\,P(w|c) \]

The four parts of this expression are:
  1. Selection Mechanism: argmax
    We choose the candidate with the highest combined probability.
  2. Candidate Model: \(c \in C\)
    This tells us which candidate corrections \(c\) to consider. When generating the candidate set \(C\), we can make use of the edit distance metric. A rule of thumb in spelling correction is that about 75% of misspellings are within edit distance 1 of the intended word and about 98% are within edit distance 2. Thus in this project, we will limit the candidate set \(C\) to those words that are within edit distance 2 of the user-provided input \(w\).
  3. Language Model: P(c)
    The probability that \(c\) appears as a word of English text. For example, occurrences of "the" make up about 7% of English text, so we should have \(P(\text{the}) \approx 0.07\).
  4. Error Model: P(w|c)
    The probability that w would be typed by a user when the user meant c. For example, P(teh|the) is relatively high, but P(theeexyz|the) would be very low.
In the rest of this project, we will experiment with multiple ways to compute each part of the above equation, starting with the simplest possible approaches and moving to more sophisticated ones that use real data. Through these experiments, we will see that the overall accuracy of the spelling correction algorithm improves as we build a better "model" for each part of the above equation.
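
Putting the four parts together, the whole corrector boils down to a short argmax loop over the candidate set. The sketch below is only illustrative: candidates() is a hypothetical helper, and the exact names and signatures in the provided SpellChecker.java may differ:

public String correct(String w) {
    String best = w;
    double bestScore = -1.0;
    for (String c : candidates(w)) {      // candidate model: c ∈ C
        double score = calculateLM(c)     // language model: P(c)
                     * calculateEM(w, c); // error model: P(w|c)
        if (score > bestScore) {          // selection mechanism: argmax
            bestScore = score;
            best = c;
        }
    }
    return best;
}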

Task 1: Dictionary-based spell corrector

In Task 1, we start our exploration of spell correction algorithms with the simplest model:

  1. Language model: we assume a "binary" model, where we set \(P(c)=1\) if \(c\) appears in our dictionary and \(P(c)=0\) if not.
  2. Edit distance: We define the "distance" between \(w\) and \(c\) as the minimum number of characters that must be inserted and/or deleted to transform \(w\) into \(c\). That is, our edit distance model includes only two operations: insertion and deletion.
  3. Candidate model: we use the following algorithm to produce the candidate set \(C\) (see the sketch after this list):
    1. if \(w\) is a dictionary word, the candidate set \(C\) contains just \(w\) itself
    2. If not, \(C\) contains all dictionary words within edit distance 1 of \(w\)
    3. If there is no dictionary word within edit distance 1 of \(w\), \(C\) contains all dictionary words within edit distance 2 of \(w\).
  4. Error model: Here we use a binary model again. As long as \(c \in C\), we set \(P(w|c) = 1\), regardless of \(w\).
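
A sketch of this fallback logic is shown below, written as a possible body for the candidates() helper sketched earlier. EditDist1() and dict exist in the provided code; EditDist2() and knownWords() (which keeps only the strings that appear in the dictionary) are hypothetical helpers named here purely for illustration, and the sketch assumes the usual java.util imports:

private Set<String> candidates(String w) {
    // Step 1: if w is itself a dictionary word, it is the only candidate.
    if (dict.containsKey(w)) {
        return Collections.singleton(w);
    }
    // Step 2: otherwise, try dictionary words within edit distance 1.
    Set<String> dist1 = knownWords(EditDist1(w).keySet());
    if (!dist1.isEmpty()) {
        return dist1;
    }
    // Step 3: fall back to dictionary words within edit distance 2.
    return knownWords(EditDist2(w).keySet());
}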

As you can see, this is a very naive model, which simply returns a dictionary word \(c\) within the shortest edit distance of \(w\). The "edit operations" are very limited as well; only insertions and deletions are considered, so a transposition (swapping two adjacent characters) or a substitution (replacing one character with another) is counted as two edit operations instead of one. In later parts of this project, we will remove these limitations and see how these "improvements" impact the performance of our spell correction algorithm.

In project3.zip, we included Java code, SpellChecker.java, that implements this basic model. It also includes a basic "dictionary" file that you can use for this task, dict1.txt, and a "spelling correction benchmark dataset" file, testset.txt. The dictionary file includes one "dictionary word" per line, and each line of the testset contains a correctly-spelled word, a colon, and a list of its misspellings. Open and look at the three files to understand what they contain and how to interpret them.
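
Based on this description and the sample output shown later in this section, a line in testset.txt should look roughly like the following (illustrative only; check the actual file for the exact format):

initials: inetials
necessary: neccesary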

You can run the provided SpellChecker code by executing the following command (after you successfully compile the Java code using the javac command, of course):

cs246@cs246:~$  java SpellChecker dict1.txt testset.txt

The above command will load the dictionary entries in dict1.txt and run the simple correction model against the testset in testset.txt. It then prints out a sequence of "misspelled words" in the testset and the "correct spellings" proposed by our simple model, with a 0/1 indicator of whether the proposed correction is indeed the correct spelling:

Finished loading dictionary. Total size: 29066
gallary -> allay: 0
inetials -> initials: 1
diagrammaticaally -> diagrammaticaally: 0
...
neccesary -> necessary: 1
rember -> remember: 1
FINAL RESULT: 161 of 265 correct. ( 60.75% accuracy 5.66% unknown )

As the last line of output, the Java code prints the overall accuracy of the spelling correction algorithm, which is 60.75%. "5.66% unknown" means that 5.66% of the "correctly spelled words" in the testset do not exist in our dictionary, so there was no way our algorithm could predict the correct spelling for them. An accuracy of 60% is not too bad for such a simple model, but of course we want to do better. Our goal for this project is to raise the accuracy to around 75% by improving this simple model step by step.

Task 2: Improving Language Model Using Real Corpus

The first improvement that we will consider is a better language model \(P(c)\). In the simple model of Task 1, we assumed that every \(c \in C\) has the same probability \(P(c)=1\), but as you can imagine, this is too crude an approximation. We want to assign a more realistic probability to each \(c\). But where can we obtain a more realistic \(P(c)\)?

The most common way to obtain a better \(P(c)\) is to collect a large and relatively clean corpus (one with few spelling errors, if any) and use the relative frequency of each word in the corpus as the estimate of \(P(c)\). In the project3.zip file, we included an example corpus of about a million words that can be used for this purpose, corpus.txt. This is not a big corpus by today's standards, but we will consider it "good enough" for our project. (Of course, the larger and higher-quality our corpus, the better our estimate of \(P(c)\) will be, improving the overall accuracy of the spell corrector.)

Now your job is to change the provided SpellChecker class so that it can use the relative frequency of every word \(c\) in the corpus as \(P(c)\). To save you from writing code to tokenize the corpus file and count the frequency of each word, we have already collected the frequency statistics of every word in corpus.txt and made them available in the dict2.txt file of project3.zip:

a	21124
aah	1
aaron	5
ab	2
...
zweck	1
zygoma	1
zygomatic	1

Each line of dict2.txt corresponds to a unique word in the corpus with its frequency count. For example, the first line indicates that the word "a" appears 21,124 times in corpus.txt. (Note that dict2.txt contains exactly the same list of words as dict1.txt. In fact, the word list in dict1.txt was obtained simply by removing the frequency counts from dict2.txt.)
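
In formula form, this relative-frequency estimate is

\[ P(c) = \frac{\text{count}(c)}{N} \]

where \(N\) is the total number of word occurrences in the corpus. For example, taking \(N \approx 10^6\) (the corpus is roughly a million words), the counts above give \(P(\text{a}) \approx 21124/10^6 \approx 0.021\).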

In order to use this new dictionary file for our SpellChecker, you just need to provide the name of the new dictionary file as follows:

cs246@cs246:~$  java SpellChecker dict2.txt testset.txt
Finished loading dictionary. Total size: 29066
gallary -> allay: 0
...
rember -> remember: 1
FINAL RESULT: 161 of 265 correct. ( 60.75% accuracy 5.66% unknown )

Unfortunately, as you can see from the above result, the current implementation of SpellChecker does not use the word frequency information when calculating the language model \(P(c)\) (it simply returns \(P(c)=1\) if \(c \in C\))! Its result is exactly the same as when you provided dict1.txt as the dictionary.

Now modify the SpellChecker class so that it uses the frequency information of \(c\) in calculating \(P(c)\). In the class, the function calculateLM() is responsible for calculating the language model probability \(P(c)\):

/*
    calculateLM: returns the language model P(word)
    Input: word 
    Output: P(word)
 */
public double calculateLM(String word) {
    // Currently, we simply return P(word)=1 if the word appears in the dictionary
    //
    // TODO: modify this function so that it uses the frequency of the word in computing P(word).
    //       You can access frequency of word from "dict" like dict.get(word)
    //
    return dict.containsKey(word) ? 1.0 : 0.0;
}

so you will need to modify (at least) the above function.
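
As a starting point, a frequency-based version might look like the sketch below. It assumes that dict maps each word to its count from dict2.txt and that you add a totalCount field holding the sum of all counts (computed once while loading the dictionary); neither name is guaranteed to match the provided code:

public double calculateLM(String word) {
    // Return the word's relative frequency in the corpus as P(word).
    Integer count = dict.get(word);
    if (count == null) {
        return 0.0; // unseen word: P(word) = 0 in this simple model
    }
    return (double) count / totalCount; // totalCount > 0, so no division by zero
}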

Once you finish modifying the code, compile it and run it on the provided testcases. Even with modest changes, it is easy to obtain an overall accuracy of 66% (about a 10% improvement over the earlier model, just by using real-world statistics!), so if you get less than 66% accuracy, you may want to review your code for obvious mistakes and errors. Make sure that your code successfully completes all testcases without generating any runtime errors, like null-pointer or division-by-zero exceptions.

Task 3: Improving Candidate Generation Model

We were able to improve the accuracy of our spell corrector from 60% to 66% by using a better language model, but we want to do better. In Task 3, we will further improve its accuracy with a better candidate generation model that uses an improved edit-distance function. For spell correction, the Damerau-Levenshtein distance is a popular edit distance function. Compared with the simple edit distance used in Tasks 1 and 2, Damerau-Levenshtein includes two additional edit operations: substitution (replacing a character with another character) and transposition of two adjacent characters, both of which correspond to common human misspelling patterns. For example, the two words "teh" and "the" are within edit distance 1 under the Damerau-Levenshtein distance (as opposed to 2 under our naive model) because transforming one into the other requires only a transposition of the second and third characters. Similarly, the two words "practice" and "practise" are within edit distance 1 under the Damerau-Levenshtein distance because transforming one into the other requires only a substitution of the 7th character "c" with "s".

Now modify the SpellChecker class so that it uses the Damerau-Levenshtein distance as the edit distance between two strings. Since our original edit distance already includes the insertion and deletion operators, you will need to implement two additional operators: (1) transposition of two adjacent characters and (2) substitution of a single character. Inside the SpellChecker class, the function EditDist1() is where the edit distance operators are implemented:

    /*
        EditDist1: computes the candidate set C within edit distance 1 of input word
        Input: word - the input word
        Output: all strings within edit distance 1 of "word"
                the value of the output map is the edit distance of the string, which is 1.0
    */
    private Map<String, Double> EditDist1(String word) {
        Map<String, Double> cands = new HashMap<>();
        for (int i = 0; i <= word.length(); ++i) {

            // split the word at position i
            String L = word.substring(0, i);
            String R = word.substring(i);
            String temp;

            // "deletion at [i]": drop the character at the current position i
            if (R.length() > 0) {
                temp = L + R.substring(1);
                cands.put(temp, 1.0); // 1.0 indicates the edit distance 1
            }
            // "insertion at [i]" insert an alphabet (a-z) at the current position i
            for (int j = 0; j < alphabet.length(); ++j) {
                temp = L + alphabet.charAt(j) + R;
                cands.put(temp, 1.0); // 1.0 indicates the edit distance 1
            }

            // TODO: implement transposition operator here

            // TODO: implement substitution operator here
        }
        return cands;
    }
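
If you want a hint, the two missing operators follow the same split-based pattern as the deletion and insertion code above. One possible sketch, reusing the L, R, temp, alphabet, and cands variables from EditDist1() (your implementation may differ):

            // "transposition at [i]": swap the adjacent characters at positions i and i+1
            if (R.length() > 1) {
                temp = L + R.charAt(1) + R.charAt(0) + R.substring(2);
                cands.put(temp, 1.0); // 1.0 indicates the edit distance 1
            }

            // "substitution at [i]": replace the character at position i with a letter (a-z)
            if (R.length() > 0) {
                for (int j = 0; j < alphabet.length(); ++j) {
                    temp = L + alphabet.charAt(j) + R.substring(1);
                    cands.put(temp, 1.0); // 1.0 indicates the edit distance 1
                }
            }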

Implement the two new operators in the above function, then compile and run your modified code on the provided testcases and check the result. If implemented correctly, you are likely to see an overall accuracy of around 74%, which is another 10% improvement over Task 2! Again, make sure that your code successfully completes all testcases without generating any runtime errors, like null-pointer or division-by-zero exceptions.

Task 4: Further Improvement (Optional)

Now that you have obtained an accuracy of 74% by using a better language model and by improving the edit-distance function, your required tasks for Project 3 are done. As long as you complete Tasks 1 through 3 successfully and obtain an accuracy of at least 74%, you will get full credit for Project 3. But if you are still interested in further improving the accuracy, the sky is the limit!

There are many options that you can explore. For example, we used a very simple error model with \(P(w|c) = 1\) for every \(c \in C\), but you may be able to include the edit distance from \(w\) to \(c\) in calculating \(P(w|c)\). (The calculation of the error model \(P(w|c)\) is done inside the function calculateEM() in our code.) Going further, you may want to assign different "weights" or "distances" to each edit operation depending on how common those operations are. For example, substituting "e" with "i" is likely to be much more common than substituting "b" with "j". You may even be able to "learn" the error model \(P(w|c)\) by training the probabilities on a large "misspelling corpus" such as the Birkbeck spelling error corpus from the Oxford Text Archive. Again, all of these potential improvements are optional and not required for this project.
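
As one illustration of the first idea, an edit-distance-weighted error model could look like the hypothetical sketch below. The actual calculateEM() signature in the provided code may differ, and the constant is something you would tune empirically on the testset:

// Hypothetical edit-distance-weighted error model for Task 4 (not the provided API).
// Assumes the candidate's edit distance from w is available at the call site.
public double calculateEM(String word, String candidate, double editDistance) {
    final double ALPHA = 0.05; // assumed per-edit penalty; tune on the testset
    return Math.pow(ALPHA, editDistance); // P(w|c) decays with each extra edit
}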

If you decide to do this optional part, please submit your work for Task 4 as a separate file named SpellChecker2.java. That is, submit two Java files: SpellChecker.java (your work through Task 3) and SpellChecker2.java (including your optional improvements). You will also have to rename the Java class to SpellChecker2 for the Task 4 submission. This ensures that you won't be penalized even if your optional "improvement" actually degrades accuracy on our hidden test cases.

Your Final Submission

In Project 3, you need to submit the following files:

  1. The SpellChecker.java file that includes your changes for Task 2 and Task 3 (but not Task 4).
  2. A log file named log.txt that records the result of Task 3, i.e., the output from your code.
  3. (Optional) SpellChecker2.java: if you made any further improvements for Task 4, include in this file your code with all changes for Tasks 2 through 4. Make sure that your Java class is named SpellChecker2 as well.
  4. (Optional) README.txt: if you made any further improvements for Task 4, explain here how you made them.

All your code should be written within the SpellChecker class. You can implement helper functions whenever necessary, but do not change the main() or runTestcases() functions! Any modification to these functions may break the API of the grading script.

Your project must be submitted electronically before the deadline through CCLE. Navigate to Projects on the left of the page and click on the Project 3 submission section. If you submit multiple times, we will grade only the latest submission.

What to Submit

The zip file that you submit must be named project3.zip, created using a zip compression utility (e.g., by running "zip -r project3.zip *" in the VM). You should submit this single file, project3.zip, with the following packaging structure:
project3.zip
    |
    +- SpellChecker.java
    |
    +- log.txt
    |
    +- SpellChecker2.java (optional)
    |
    +- README.txt (optional)

Testing Your Submission

Grading is a difficult and time-consuming process, and following the file naming and packaging conventions is essential so that we can test your submission without any errors. To help you ensure the correct packaging of your submission, we have made a "grading script" p3_test.sh available. In essence, the grading script unzips your submission to a temporary directory and executes your files to test whether they are likely to run OK on the grader's machine. Download the grading script and execute it in the VM like this:
cs246@cs246:~$ ./p3_test.sh project3.zip
(If your project3.zip file is not located in the current directory, you need to prepend its path to project3.zip in the command above. You may need to run "chmod +x p3_test.sh" if you get a permission error.)

You MUST test your submission using the script before your final submission to minimize the chance of an unexpected error during grading. When everything runs properly, you will see output similar to the following from the grading script:

Compiling SpellChecker.java...
Testing SpellChecker...
Finished loading dictionary. Total size: 4
pararrel -> parallel: 1
tehi -> they: 1
FINAL RESULT: 2 of 2 correct. ( 100.00% accuracy 0.00% unknown )
SUCCESS!

Grading

The overall grading breakdown is as below. If you make further improvements on top of Task 3 and achieve better-than-expected accuracy on the hidden test set, you will get up to 20% extra credit.