node99.org | cv

error correction with hmms.

Hidden Markov Models (HMMs) are probabilistic functions of Markov Chains, where the state of the system is observed indirectly by a random variable. This observed variable follows different distributions according to the hidden state of the system, meaning some amount of the hidden state is revealed via observation.

There are 3 main questions asked of HMMs: (1) what is the probability that a particular model produced a particular observation, (2) what are the model parameters that maximize the probability of producing a particular observation, and (3) what is sequence of hidden state that is most likely to produce a particular observation.

There are countless mathematical references on the subject, one of the the most accessible being Rabiner's tutorial providing general background and applications to speech recognition. However, modern code-centric examples are rare, as lamented here. This page attempts to bridge this gap by providing a simple, comprehensive example of using HMMs to correct corrupted text in a typed document.

We'll be using the Python language, and a matrix-based numerically stable HMM implementation using Numpy. If you don't know Python, you should hopefully be able to follow along as the syntax is simple and I'll be discussing the code. The HMM functions are similar to the API of the Matlab Statistics Toolbox.

Installation

You'll need the free git version control system to fetch a copy of the code:

git clone git://github.com/shwhalen/hmm.git
Here's a breakdown of the repository contents:

To use the code, you can either add the directory to your PYTHONPATH or run your scripts from the repository directory as there is no installer.

Usage

The three HMM problems described earlier correspond to particular algorithms: (1) the forward algorithm, (2) the Baum-Welch algorithm, and (3) the Viterbi algorithm. You may also want to (4) generate new observations from an existing HMM. Following the Matlab Statistics Toolbox API, the four corresponding functions are (1) hmmforward, (2) hmmtrain, (3) hmmviterbi, and (4) hmmgenerate. An important point of departure with the Matlab API is that these functions require the initial state distribution \pi as a parameter, instead of assuming the model starts in state 1.

Let's take a look at one of the plotting examples from hmmplot.py:

def ploteven(length=200, epochs=40):
    tm = [[0, 0, 1], [0.5, 0.5, 0], [0.5, 0.5, 0]]
    em = [[0, 1], [1, 0], [0, 1]]
    pi = [1/3., 1/3., 1/3.]
    obs, states = hmmgenerate(length, tm, em, pi)
    
    tm2, em2, pi2, likelihoods = hmmtrain(obs, tm, em, pi, epochs)
    plot(likelihoods, label='Log Likelihood')
    
    tm2, em2, pi2, likelihoods = hmmtrain(obs, stochasticmatrix(3, 3), em, pi, epochs)
    plot(likelihoods, label='Log Likelihood (Random)')
    
    xlabel('Epoch')
    legend(loc='lower right')
    show()

We first create a transmission matrix tm, emission matrix em, and initial state distribution pi for the Even Process, a stochastic process generating binary strings where blocks of consecutive 1s are even in length. We then generate an observation of 200 symbols from the Even Process.

Next we train an HMM on our observation by passing the "correct" transmission matrix, emission matrix, and a uniform initial state distribution to hmmtrain. The Baum-Welch algorithm will update our "correct" parameters to maximize the probability of generating our string of finite data, which of course has non-asymptotic statistics. However, since our "correct" initial parameters are likely very close to maximizing the probability of the data already, Baum-Welch should be able to maximize the probability very quickly.

Usually when we want to train an HMM, we have an observation but don't know the initial parameters. To demonstrate this scenario, we'll use the same parameters but use a randomized transition matrix produced by stochasticmatrix(3, 3). We should expect the initial probability of the data given the model to be much lower. It may take several iterations (epochs) of Baum-Welch to find a local maximum that produces the data with a probability close to our previous scenario. This can be seen in the graph below:

Log likelihoods for training on the Even Process

Something to keep in mind is that the code accepts either lists or numpy arrays as input, storing either log or non-log probabilities, and converts them to numpy arrays storing log probabilities. This means, for example, if we print out the tm2 matrix returned by either call to hmmtrain, we may see something like this:

[[            -Inf             -Inf  -6.21724894e-15]
 [ -1.16992500e+00  -8.47996907e-01             -Inf]
 [ -1.02308361e+00  -9.77279923e-01             -Inf]]

We can view the non-log probabilities by calling exp2(tm2):

[[ 0.          0.          1.        ]
 [ 0.44444444  0.55555556  0.        ]
 [ 0.49206349  0.50793651  0.        ]]

Error Correction

Say we have some typed document, where the typist has a chance at every keystroke to hit the wrong key. This gives us a sequence of intended (hidden) letters, and a sequence of actual (observed) letters.

Using the Viterbi algorithm with an HMM trained on some set of error-free documents, we can correct errors in the corrupted document. Since the Viterbi algorithm computes the most likely hidden sequence of states (here our intended sequence of letters) for a particular observation, it can correct some of the errors incurred by the typist or by transmission.

We take any corpus and split it into two parts, using some majority of the letters for training and the rest for testing our accuracy. Bundled with the code is Holmes for the Holidays from Project Gutenberg. We name this file holmes.txt and produce a new corpus with 20% typing errors by running corruption.py. For simplicity, we convert all text to lowercase and strip punctuation other than letters and spaces. Here's an excerpt of the output:

s s
h h
e d
r e
l l
o o
c c
k k
_ _
h h
o p
l l
m m
e e
s s

We can build the trained transition matrix by counting the number of times a hidden state (column 0) transitions to each other hidden state, and dividing by the total number of transitions from that state. Similarly, we can build the trained emission matrix by counting the number of times each letter (column 1) is emitted by a state, and dividing by the total number of emissions for that state. This is performed below by the hmmfromdata function:

stateCount = 27
symbolCount = 27
tm, em, hidden, observed = hmmfromdata('holmes20.txt')
pi = stochasticmatrix(1, stateCount)
guess, p = hmmviterbi(observed, tm, em, pi)

We define our number of states as 27 because each letter corresponds to a state, as well as one state for the space character. Similarly, each of the 27 states can emit from an alphabet of 27 symbols. We obtain the trained transition and emission matrices from the hmmfromdata function, as well as the hidden and observed states for our training data. Finally, we feed the observed training data, our trained matrices, and a random initial state distribution into the Viterbi algorithm to compute the most likely intended sequence of letters. We can then measure the improvement by comparing states:

viterbiAccuracy = (guess == hidden).sum() / float(len(hidden))
sourceAccuracy = (observed == hidden).sum() / float(len(hidden))

print '%.2f corrected  ' % viterbiAccuracy
print '%.2f uncorrected' % sourceAccuracy
print '%.2f improvement' % (viterbiAccuracy - sourceAccuracy)

We compute the percentage of states that agree between the Viterbi path and the hidden training states, and between the observed training states and hidden training states. The difference between the accuracy of the corrected text and the uncorrected text tells us how well the algorithm performed. For a corpus with 20% corruption, the output might look something like this:

0.90 corrected  
0.84 uncorrected
0.06 improvement
I found this an intuitive problem for exploring HMM concepts, but a better approach to spell checking (also with Python code) can be found here.