
Models of Temporal Evolution

If we're lucky, we have static features or we can at least describe temporal behavior through velocities or spectra. To recognize such things as gestures or spoken words, we have to model more or less complex temporal processes.

A good first step is to go from continuously varying parameters to a string of symbols from a finite alphabet. This amounts to quantizing both time and value. Then we can apply typical symbolic computation methods. When forming symbols, it is imperative to retain the relevant part and helpful to forget irrelevant aspects.

Mouse Gesture Recognition and State Machines

See source code!

Every gesture pattern can be described through a set of symbol sequences. For instance, "up and down" can be identified with the set {SSSSNNNN, S0SSS00NNN0NNN0N, 00SSS00S0SSSS00NNNN00NN, ...}. In terms of theoretical computer science, this set is a formal language: a set of sequences over a fixed alphabet. To test whether a given input conforms to a gesture pattern, one has to check whether this input is an element of the set. This is easiest if there is a template that can be used for matching. Regular expressions are a standard way to accomplish this. An equivalent representation from theoretical computer science is the finite state machine (FSM). One can identify the symbols with states; possible transitions between the states determine which sequences are allowed.
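
As an illustration (this is not the original source code), here is a minimal Python sketch of both representations for the "up and down" pattern; the state names and the exact regular expression are assumptions chosen to match the example set above.

    import re

    # The gesture "up and down" once as a regular expression and once as a
    # finite state machine over the alphabet {N, S, 0}.
    UP_DOWN = re.compile(r"0*S[S0]*N[N0]*")

    def matches_regex(sequence: str) -> bool:
        return UP_DOWN.fullmatch(sequence) is not None

    # Equivalent FSM: for each (state, input symbol), the table gives the next
    # state; undefined combinations mean "reject".
    TRANSITIONS = {
        ("start",  "0"): "start",
        ("start",  "S"): "seen-S",
        ("seen-S", "S"): "seen-S",
        ("seen-S", "0"): "seen-S",
        ("seen-S", "N"): "seen-N",
        ("seen-N", "N"): "seen-N",
        ("seen-N", "0"): "seen-N",
    }
    ACCEPTING = {"seen-N"}

    def matches_fsm(sequence: str) -> bool:
        state = "start"
        for symbol in sequence:
            state = TRANSITIONS.get((state, symbol))
            if state is None:          # no transition defined: reject
                return False
        return state in ACCEPTING

    for s in ["SSSSNNNN", "S0SSS00NNN0NNN0N", "NNSS"]:
        print(s, matches_regex(s), matches_fsm(s))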

To use five regions in the (vx, vy)-plane is only one possibility. One could also use the xy positions of the mouse directly, starting with (x, y) = (0, 0) at the location of the first mouse click. One can even combine symbols that describe the xy position with symbols that describe the velocity. Another point to try is to normalize the overall amplitude of the signal (boost small signals, attenuate large ones). In audio processing this corresponds to dynamics compression.
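
As a sketch of such a symbol-forming step, the following Python function maps a velocity vector (vx, vy) to one of five symbols; the threshold value, the symbol names and the orientation of the axes are illustrative assumptions, not the original implementation.

    # Map a velocity vector (vx, vy) to one of five regions in the velocity plane.
    def velocity_symbol(vx: float, vy: float, threshold: float = 2.0) -> str:
        if vx * vx + vy * vy < threshold * threshold:
            return "0"                      # barely moving
        if abs(vy) >= abs(vx):              # dominant vertical movement
            return "N" if vy < 0 else "S"   # assuming screen y grows downward
        return "E" if vx > 0 else "W"       # dominant horizontal movement

A normalization of the overall amplitude, as mentioned above, would happen before this quantization step.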

Hidden Markov Models

Finite state machines can only answer "yes" or "no". They cannot capture probabilities and thus they can deal with noise and errors only in a limited fashion. A Markov chain can help: here, probabilities are associated with all transitions. The term "Markov" refers to the fact that the probabilities depend only on the current state, not on the history. Even more sophisticated are Hidden Markov Models (HMMs), where the underlying Markov chain is not directly observable; rather, each hidden state leads to some visible output that may vary stochastically. HMMs are standard in speech processing.
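
The following minimal Python sketch shows what an HMM for the "up and down" gesture might look like; the states, symbols and all probability values are made-up illustrative numbers.

    import random

    # Two hidden states, emissions over the alphabet {S, N, 0}.
    STATES  = ["phase-S", "phase-N"]
    SYMBOLS = ["S", "N", "0"]

    START      = [0.9, 0.1]                 # initial state probabilities
    TRANSITION = [[0.7, 0.3],               # P(next state | current state)
                  [0.0, 1.0]]
    EMISSION   = [[0.70, 0.05, 0.25],       # P(symbol | state)
                  [0.05, 0.70, 0.25]]

    def sample(length: int) -> str:
        """Generate a symbol sequence from the model; the states stay hidden."""
        state = random.choices(range(len(STATES)), weights=START)[0]
        output = []
        for _ in range(length):
            output.append(random.choices(SYMBOLS, weights=EMISSION[state])[0])
            state = random.choices(range(len(STATES)), weights=TRANSITION[state])[0]
        return "".join(output)

    print(sample(12))   # e.g. "SS0SS0N0NNN0"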

In pattern recognition, the template for every pattern is given by an HMM. Given an input symbol sequence and the collection of HMMs, one has to determine the HMM that fits best, that is: the one with the highest probability of generating the given input. This is far more complex than checking to which of a given number of FSMs an input sequence conforms. On top of that, it is hardly possible to construct the HMMs analytically; rather, they have to be learned from training data, which requires yet another advanced algorithm. However, both the evaluation algorithm (Viterbi) and the training algorithm (Baum-Welch) can be implemented rather effortlessly in cookbook style.
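
A cookbook-style Viterbi sketch in Python might look as follows; the model is passed in as plain lists (for instance the illustrative matrices from the previous sketch), and the hypothetical helper best_model shows how one would pick the best-fitting HMM from a collection.

    import math

    def viterbi(sequence, symbols, start, transition, emission):
        """Log probability of the best hidden-state path explaining `sequence`."""
        def log(p):
            return math.log(p) if p > 0 else float("-inf")
        n = len(start)
        obs = symbols.index(sequence[0])
        scores = [log(start[s]) + log(emission[s][obs]) for s in range(n)]
        for symbol in sequence[1:]:
            obs = symbols.index(symbol)
            scores = [max(scores[p] + log(transition[p][s]) for p in range(n))
                      + log(emission[s][obs])
                      for s in range(n)]
        return max(scores)

    def best_model(sequence, models):
        """models: name -> (symbols, start, transition, emission); return best name."""
        return max(models, key=lambda name: viterbi(sequence, *models[name]))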

Speech Processing

In speech processing, the objective is (in broad strokes) to determine the most plausible word for an input waveform, that is, to maximize the conditional probability P(word|wave) = P(word & wave) / P(wave). Since the wave is given, this amounts to maximizing P(word & wave) = P(wave|word) * P(word), which is an application of Bayes' principle. The latter expression is easier to compute than the initial one. The first factor P(wave|word) stems from an acoustic model, for instance based on an HMM. The second factor P(word) stems from a language model.
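
In code, this decision rule is just an argmax over the vocabulary. In the following Python sketch both score functions are placeholders; a real system would obtain the first from an HMM-based acoustic model and the second from a language model.

    # Pick the word that maximizes log P(wave|word) + log P(word).
    def recognize(wave, vocabulary, acoustic_log_prob, language_log_prob):
        return max(vocabulary,
                   key=lambda word: acoustic_log_prob(wave, word) + language_log_prob(word))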

Actually, one does not directly use waves in speech processing, but employs the short-time spectra of chunks of about 20 ms length. The vital part is played by the resonances ("formants") formed by the throat, mouth and nose. In particular, the vowels are easily classifiable through their distribution of formants.
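
Such short-time spectra can be computed, for instance, with a sliding window and an FFT. In the following Python sketch the 20 ms frame length, the hop size and the Hamming window are typical but illustrative choices.

    import numpy as np

    def short_time_spectra(signal, sample_rate, frame_ms=20, hop_ms=10):
        """signal: 1-D numpy array; returns one magnitude spectrum per frame."""
        frame = int(sample_rate * frame_ms / 1000)
        hop = int(sample_rate * hop_ms / 1000)
        window = np.hamming(frame)
        spectra = [np.abs(np.fft.rfft(signal[start:start + frame] * window))
                   for start in range(0, len(signal) - frame + 1, hop)]
        return np.array(spectra)   # one row per roughly 20 ms chunk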