I spent a good part of today working through Gale and Sampson's paper introducing Simple Good-Turing smoothing, "Good-Turing Frequency Estimation Without Tears". Unfortunately, that's still 20 pages of academic writing to wade through, so this post is the more simplified "Good-Turing Smoothing Without Tears" Without Tears.
What is smoothing?
In statistics, smoothing a data set refers to using an approximating function rather than the raw data, in an attempt to "smooth out" the noise associated with using observations directly. I'll illustrate the concept in our current context, where the question is: "How likely is a specific word sequence?"
One crucial component of many natural language processing (NLP) applications is developing a language model. In a Bayesian sense, a language model can be thought of as a set of priors for how likely different words, or word sequences (n-grams) are. For example, in speech recognition, these priors allow the algorithm to prefer "Dear Nancy" over "Deer Nancy" although "dear" and "deer" are homophones, and indistinguishable based on sound alone.
Dealing with sparse data
The naive approach to this problem is to take a large corpus of text, and simply look at how often different patterns occur. For example, "the" might occur in 10k out of 10m words, so we might infer that the prior probability of the word "the" is .1%. Similarly, "the tiger" might occur in 20 out of 10m bigrams, "the tiger ate" might occur in 1 out of 10m trigrams, and so on. Note that with a larger and larger space (trigrams as opposed to unigrams), the data is more and more sparse. Thus, if we were to imagine collecting a different corpus of 10m words, the trigram "the tiger ate" could easily occur 0 or 2 times rather than 1, which would lead us to predict that the probability of this trigram is $0$, $10^{-7}$, or $2 \times 10^{-7}$ - very different estimates! We see that as the data becomes sparser, our probability estimates become less reliable.
One particularly egregious flaw is that many sequences like "the tiger waits" or "the tiger runs" might not occur in our corpus. Despite constituting perfectly valid English, these trigrams would be assigned prior probability 0, clearly disastrous for a Bayesian model.
We want to assign some probability mass to the case where a new event occurs, when we encounter a new word sequence. This is the intuition behind smoothing.
A first attempt at smoothing
A first approach might be Add-One Smoothing. Drawing intuition from Laplace's law of succession, one might estimate the probability of an event not simply as [observed count]/[total observations], but instead add 1 to each observed count, adding 1 to the total observations for each separate class. So if C is the total number of classes, the probability of a particular class is [observed count+1]/[total observations+C].
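A quick sketch of this scheme (the function and variable names here are mine, purely for illustration):

```python
from collections import Counter

def add_one_probs(observations, num_classes):
    """Add-One (Laplace) smoothing: add 1 to every class count,
    so even classes never observed get nonzero probability."""
    counts = Counter(observations)
    total = len(observations) + num_classes  # one extra count per class
    return lambda cls: (counts[cls] + 1) / total

# Toy corpus of 10 words drawn from a 5-word vocabulary.
p = add_one_probs(["the"] * 4 + ["tiger"] * 2 + ["ate"] * 4, num_classes=5)
# p("the") = (4+1)/(10+5); an unseen word gets (0+1)/(10+5)
```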
(The Expected Likelihood Estimator (ELE), referenced in Gale and Sampson's paper, which adds 1/2 rather than 1, is very similar; the difference arises because Laplace's law of succession assumes a uniform prior, while the ELE assumes the uninformative prior.)
While the Add-One approach may work well when dealing with a few classes, it runs into trouble when we have many possible classes. Imagine our corpus of 10m words: if there are 20k distinct words, there are $20000^3 = 8 \times 10^{12}$ possible trigrams, of which fewer than .01% occur in our original corpus. If we were to use Add-One smoothing here, we would assign over 99.99% of the probability mass to trigrams we have never seen!
Clearly, this approach will not work here. There are many other approaches to smoothing, including the Add-Lambda approach (based on cross-validation), Witten-Bell, and backoff models. Here we will focus on an approach called Good-Turing, which builds on the intuition that we can estimate the probability of novel events based on the number of classes which are only observed once (singletons).
Turing Estimators
Let's approach this problem from a different angle. If we want to estimate the probability of a class occurring $r$ times, let's imagine: if we were to make a new observation now, what would be the probability that it belongs to a class we have already observed $r$ times? We approximate this answer with another question: if I had seen $N-1$ of the data points already, and had to predict the number of occurrences of the last class, how likely would it be for it to have already occurred exactly $r$ times? This is the justification through leave-one-out cross-validation, and I think it is based on a similar intuition to bootstrapping.

We can answer this approximation easily: out of our $N$ simulations, if we observe that our held-out data point belongs to a class which occurred $r$ times in the training set of the remaining $N-1$ points, then this class must occur $r+1$ times in total. If we let $N_r$ denote the "frequency of frequencies" - the number of classes which occur $r$ times - then in the future, the probability of observing any class which has occurred $r$ times is $\frac{(r+1)N_{r+1}}{N}$. Of particular interest, this yields that we allot probability $N_1/N$ to novel events, rather than 0 as before. The intuition is the same - singletons in the full set are simply the novel events in leave-one-out cross-validation - but it also coheres with the intuition that we expect to see few singletons if the number of possible classes is small (relative to the number of data points), since then we would keep coming back to the same classes.

To assign probabilities to specific classes: since there are $N_r$ classes that occur $r$ times, the probability of observing a particular class that occurred $r$ times is $\frac{(r+1)N_{r+1}}{N_r N}$. Good-Turing often deals with "adjusted counts" rather than probabilities, so for a class that we observed $r$ times, we would use a count of $r^* = \frac{(r+1)N_{r+1}}{N_r}$.
This formula for expected counts is the Turing estimator, and it provides a (loosely) motivated method for estimating class probabilities which assigns non-zero probability to novel events.
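As a concrete sketch (the function name and return convention are mine), the Turing adjusted counts and the novel-event mass can be computed directly from the frequency-of-frequencies table:

```python
from collections import Counter

def turing_adjusted_counts(class_counts):
    """Map each observed count r to the Turing adjusted count
    r* = (r+1) * N_{r+1} / N_r, where N_r is the number of classes
    observed exactly r times. Returns (adjusted, p0), where
    p0 = N_1 / N is the probability mass reserved for novel events."""
    freq_of_freq = Counter(class_counts.values())  # N_r
    N = sum(class_counts.values())
    adjusted = {}
    for cls, r in class_counts.items():
        # If N_{r+1} = 0, the raw Turing estimate collapses to 0 --
        # exactly the sparsity problem that smoothing fixes later.
        adjusted[cls] = (r + 1) * freq_of_freq.get(r + 1, 0) / freq_of_freq[r]
    p0 = freq_of_freq.get(1, 0) / N
    return adjusted, p0
```

For example, with counts `{"a": 1, "b": 1, "c": 2}` we have $N_1 = 2$, $N_2 = 1$, $N = 4$, so the singletons get adjusted count $2 \cdot 1/2 = 1$ and half the mass ($N_1/N$) is reserved for novel events.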
Smoothing the Turing Estimator
In answering the question "how likely is a class which has been observed $r$ times to occur again?" in the previous section, we implicitly assumed that our observed counts are good estimators. Here, sparsity strikes again! While our observed counts for $N_1$ and $N_2$ are likely fairly accurate, our observed count for $N_{20}$ is a very noisy estimate.
An example may be helpful here. Gale and Sampson's paper cites the research of Sproat and Shih, who were studying Chinese words pluralized by the addition of the character "men2" (hence, each word which can be pluralized is its own class). Here is the chart of frequency of frequencies:
With 268 observations for $N_1$, this count is unlikely to be off by an order of magnitude, but for $N_{401}$, the estimate of 0 is almost certainly wrong! If we were to use our formula from the Turing estimator for the probability of the class "ren" meaning people, which occurs 1918 times in our training set, our estimate would be 0, since $N_{1919} = 0$. This seems clearly problematic.
First, let's visualize what's going on here - our observed counts are on the left, on a log-log scale to deal with the power-law shape of the distribution (common in NLP applications):
Note the lower limit in the left plot, due to the fact that $N_r$ cannot go below 1. Ideally, we would like to correct this and "connect the dots," using the best-fit line to interpolate to intermediate values (especially where observed frequencies are 0).
Good (1953) introduced the notion of smoothing to fix this problem, which arises from the fact that for large $r$, the observed counts are only a noisy estimate. First, to fix the issue of $N_r$ for large $r$, we want to "average" each large nonzero $N_r$ with the surrounding $N_r$ which are 0. Formally, we order the nonzero $N_r$ by $r$ and let $q, r, t$ denote consecutive indices. Then, we replace every $N_r$ with $Z_r = \frac{N_r}{0.5(t-q)}$ (note that for small $r$, where there are no gaps between nonzero bins, the denominator is 1). This gets us the plot on the right, which has the desired shape.
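This averaging step can be sketched as follows. The function name is mine, and the endpoint handling (taking $q = 0$ for the first bin and mirroring the final gap for the last) is one reasonable convention, not the only option:

```python
def smooth_frequency_counts(nonzero):
    """Replace each nonzero N_r with Z_r = N_r / (0.5 * (t - q)),
    where q and t are the neighboring nonzero frequencies.
    `nonzero` is a dict {r: N_r} with every N_r > 0."""
    rs = sorted(nonzero)
    Z = {}
    for i, r in enumerate(rs):
        q = rs[i - 1] if i > 0 else 0                  # previous nonzero r (or 0)
        t = rs[i + 1] if i + 1 < len(rs) else 2 * r - q  # next nonzero r (mirror last gap)
        Z[r] = nonzero[r] / (0.5 * (t - q))
    return Z

# Where the nonzero bins are consecutive (small r), Z_r == N_r;
# an isolated bin gets spread over its surrounding zeros.
```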
Now, we can simply fit a smooth function to these points (we clearly don't want to use a value of 0 for intermediate values). The simplest fit is a line - this is called the Linear Good-Turing Estimate, which we get by fitting the line
$$\log(Z_r) = a + b \log(r)$$
(Note that we require the slope $b < -1$, or else the total probability mass implied by the smoothed counts will not converge.) Using the predicted value $a + b\log(r)$ rather than the observed $N_r$ gives the smoothed estimate.
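A minimal sketch of the LGT fit, using ordinary least squares on the log-log points (the helper names `fit_lgt` and `lgt_count` are mine, not from the paper):

```python
import math

def fit_lgt(Z):
    """Least-squares fit of log(Z_r) = a + b * log(r),
    given Z as a dict {r: Z_r}."""
    items = sorted(Z.items())
    xs = [math.log(r) for r, _ in items]
    ys = [math.log(z) for _, z in items]
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    a = ybar - b * xbar
    return a, b

def lgt_count(a, b, r):
    """Smoothed adjusted count r* = (r+1) * S(r+1) / S(r),
    where S(r) = exp(a + b * log(r)) replaces the noisy N_r."""
    S = lambda x: math.exp(a + b * math.log(x))
    return (r + 1) * S(r + 1) / S(r)
```

On data that exactly follows a power law, the fit recovers the exponent, and the smoothed counts are defined for every $r$, even where the observed $N_r$ was 0.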
Simple Good-Turing
We can finally put it all together! We initially introduced smoothing to deal with larger rr, where data is more sparse. Thus, for small rr, it makes more sense to directly use our Turing estimates, since these counts are relatively reliable.
When do we use Turing estimates vs. Linear Good-Turing (LGT) estimates? The proposal made in Simple Good-Turing is to use the Turing estimates so long as they are significantly different from the LGT estimates, and then once we switch, to continue to use the LGT estimates.
As the threshold for "significantly different," Gale and Sampson use 1.65 times the standard deviation of the Turing estimate, where the variance is estimated as
$$\mathrm{Var}(r^*_T) = (r+1)^2 \frac{N_{r+1}}{N_r^2}\left(1 + \frac{N_{r+1}}{N_r}\right)$$
This estimate is based on the assumptions that $N_r$ and $N_{r+1}$ are independent, and that $\mathrm{Var}(N_r) = N_r$, and then using the first-order Taylor approximation (as in this example).
As a final note, we need to renormalize these "probabilities," since they come from two different sources (Turing and LGT estimates), and as such there's no reason they would necessarily sum to 1.
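Putting the switching rule and the renormalization together, here is a rough end-to-end sketch. It assumes the LGT coefficients $(a, b)$ have already been fitted, and all names (`sgt_probs`, `r_star`, etc.) are mine; a real implementation would handle more edge cases:

```python
import math

def sgt_probs(N_r, a, b):
    """Simple Good-Turing sketch: use the Turing adjusted count for small r,
    switch to the LGT estimate once the two are no longer significantly
    different (1.65 standard deviations), then renormalize.
    N_r maps r -> frequency of frequency; (a, b) are LGT coefficients."""
    S = lambda r: math.exp(a + b * math.log(r))  # smoothed stand-in for N_r
    r_star, switched = {}, False
    for r in sorted(N_r):
        lgt = (r + 1) * S(r + 1) / S(r)
        n_next = N_r.get(r + 1, 0)
        if not switched and n_next > 0:
            turing = (r + 1) * n_next / N_r[r]
            sd = math.sqrt((r + 1) ** 2 * (n_next / N_r[r] ** 2)
                           * (1 + n_next / N_r[r]))
            if abs(turing - lgt) > 1.65 * sd:
                r_star[r] = turing  # significantly different: keep Turing
                continue
        switched = True  # once we switch to LGT, never switch back
        r_star[r] = lgt
    # Renormalize: reserve p0 = N_1 / N for novel events, and scale the
    # observed classes so their probabilities sum to 1 - p0.
    N = sum(r * n for r, n in N_r.items())
    p0 = N_r.get(1, 0) / N
    total = sum(n * r_star[r] for r, n in N_r.items())
    probs = {r: (1 - p0) * r_star[r] / total for r in N_r}  # per-class prob
    return probs, p0
```

After renormalization, the per-class probabilities of the observed classes plus the novel-event mass sum to exactly 1.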
That's everything - we've introduced smoothing, why it's important and an intuitive approach to the problem, Turing estimators as a method for moving probability mass from observed classes to unobserved classes, Good-Turing estimators for smoothing these estimates, and Simple Good-Turing as a method for combining the two estimates.
Other resources: Simple Good-Turing is introduced in Gale and Sampson's original paper. I found Dan Jurafsky's lecture as part of his Stanford NLP Coursera class and Jason Eisner's slides from his course at Johns Hopkins to be useful sources of intuition as well. If you're interested in applications, I was originally prompted to learn about Good-Turing smoothing while working on punctuation prediction, but applications throughout NLP are common.
This post ended up being a lot longer than I planned and wanted. Any feedback or questions on what is unclear are welcome!