Temporal Features

Often we're interested in the temporal behavior of a signal, for instance to distinguish music from speech in an audio signal or to distinguish running from walking using an accelerometer. Temporal features allow one to formulate the recognition of such patterns in the standard framework of clusters of points in a handful of dimensions. A large number of features have been introduced for audio signals. Most of the ideas underlying them can easily be carried over to other domains.

One large set of features is part of the MPEG-7 standard. MPEG-7 defines a way to store nearly every imaginable kind of metadata about a given media file. Its main purpose is to facilitate the retrieval of media in large collections. To support content-based search (Find a video with lots of blue sky! Find classical music!), MPEG-7 also incorporates features computed from the media data. Among the low-level audio features stored in MPEG-7 are values such as the centroid of the spectrum.

A highly sophisticated set of features is employed for tasks such as speech recognition and music genre detection: Mel-Frequency Cepstral Coefficients (MFCCs). The basic idea is to describe a sound through a simplified sketch of its spectrum. One forms the spectrum and organizes it according to human perception (frequency in mel, level in decibels). [A note aside: To let the computer do a job that is seemingly easily done by humans, it is a good idea to simulate human perception.] To describe the shape of this spectrum in broad strokes, one interprets the spectrum as a waveform and decomposes it into sinusoidal waves. The MFCCs are the coefficients of this decomposition. The term "cepstrum" is a play on the term "spectrum" (with the first four letters reversed), referring to the spectrum of the spectrum. Similarly, the frequencies of the waves that are used to form the spectrum are called "quefrencies". Typically, MFCCs are computed for "frames" of, for instance, 20 milliseconds (0.02 s) in length. Thus, a sound recording will give rise to thousands of MFCCs. Their statistical distribution describes the "sound". In basic applications, the temporal sequence of the MFCCs is ignored.

Easily the most basic temporal feature is the level. To compute the level in decibels, one divides the signal by the maximum possible value, squares the result at each instant of time, and averages these squared values over time, which yields a measure of energy; then one takes 10 times the decadic logarithm. Before feeding the signal into this computation, one should subtract the fixed offset (DC offset) that most cheap soundcards tend to produce.
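The level computation can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name level_db and the parameter full_scale are made up for this example, and the DC offset is assumed to be well estimated by the mean of the chunk.

```python
import math

def level_db(samples, full_scale=1.0):
    """Level in decibels relative to full scale (hypothetical helper).

    Assumption: the DC offset is estimated as the mean of the samples.
    """
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    dc = sum(samples) / n  # estimate of the fixed offset
    # Normalize by the maximum possible value, square, and average over time.
    mean_square = sum(((x - dc) / full_scale) ** 2 for x in samples) / n
    if mean_square == 0.0:
        return float("-inf")  # digital silence has no finite level
    return 10.0 * math.log10(mean_square)

# A full-scale sine wave sitting on a DC offset of 0.1.
sine = [0.1 + math.sin(2.0 * math.pi * k / 100.0) for k in range(1000)]
print(round(level_db(sine), 2))
```

A full-scale sine wave has a mean square of 0.5, so its level comes out at about -3 dB, while a square wave that alternates between the two extreme values reaches 0 dB.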

An only slightly more complex feature is the zero crossing rate (ZCR): the number of times a signal changes its sign from + to - or back within a certain timeframe. The ZCR is a simple and not very robust measure of frequency, but it can help to distinguish tuned sounds from noise. The ZCR value is easily ruined by low-frequency noise. Thus, it may be a good idea to first suppress low frequencies. To this end, apply a sliding average to form a signal that only contains the low frequencies. Subtract this signal from the original.
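The two steps described above (suppress low frequencies, then count sign changes) might look like this. The function names and the window width are illustrative choices, not part of any standard; the sketch also assumes that samples which are exactly zero are skipped rather than counted as a sign.

```python
def sliding_average(samples, width):
    """Centered moving average; the window is truncated at the edges."""
    half = width // 2
    out = []
    for i in range(len(samples)):
        lo = max(0, i - half)
        hi = min(len(samples), i + half + 1)
        out.append(sum(samples[lo:hi]) / (hi - lo))
    return out

def remove_low_frequencies(samples, width):
    """Subtract the sliding average (the low frequencies) from the signal."""
    low = sliding_average(samples, width)
    return [x - l for x, l in zip(samples, low)]

def zero_crossings(samples):
    """Count sign changes; samples that are exactly zero are skipped."""
    count = 0
    prev = 0
    for x in samples:
        sign = (x > 0) - (x < 0)
        if sign != 0:
            if prev != 0 and sign != prev:
                count += 1
            prev = sign
    return count

# A fast alternation riding on a large constant offset.
raw = [5.0 + (1.0 if k % 2 == 0 else -1.0) for k in range(100)]
print(zero_crossings(raw))                             # offset hides all crossings
print(zero_crossings(remove_low_frequencies(raw, 5)))  # crossings are visible again
```

Dividing the count by the duration of the timeframe in seconds gives a rate. The width of the sliding average determines which frequencies count as "low" and has to be tuned to the signal at hand.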

To form features related to the spectrum, one can make use of libraries that efficiently compute the Fourier transform. The most popular of them is FFTW. This library is available for a number of processors. Before actually doing a computation with FFTW, one defines a "plan", which the library uses to work out a fast way of computing transforms of the given size on the machine at hand.

Given, for instance, 1024 samples recorded at a sampling frequency of, say, 44,100 Hz, the Fourier transform returns 513 frequency components from 0 Hz (= constant offset = DC) to 22,050 Hz in equal-sized steps of about 43 Hz. Each of these components is described by a complex number a + ib. This number describes both the power (= squared amplitude) of the corresponding sinusoidal wave and its phase. Most of the time, one is interested in the power alone, which is the square a² + b² of the magnitude of the complex number a + ib.
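To make the numbers concrete, here is a small sketch that computes the power a² + b² of each component with a naive discrete Fourier transform. It is meant for illustration only: a real application would use an FFT library, the function names are invented for this example, and the result is left unnormalized.

```python
import cmath
import math

def power_spectrum(samples):
    """Naive DFT of a real signal; returns N/2 + 1 unnormalized powers."""
    n = len(samples)
    powers = []
    for k in range(n // 2 + 1):
        c = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(samples))
        powers.append(c.real ** 2 + c.imag ** 2)  # a^2 + b^2
    return powers

def bin_frequencies(n, sample_rate):
    """Frequency in Hz of each of the N/2 + 1 components."""
    return [k * sample_rate / n for k in range(n // 2 + 1)]

freqs = bin_frequencies(1024, 44100)
print(len(freqs), freqs[0], freqs[1], freqs[-1])

# A sine wave with exactly 4 cycles in 64 samples shows up in component 4.
tone = [math.sin(2.0 * math.pi * 4 * t / 64) for t in range(64)]
spectrum = power_spectrum(tone)
print(spectrum.index(max(spectrum)))
```

The naive transform needs on the order of N² operations, which is why libraries built on the fast Fourier transform are used in practice.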

When computing the spectrum of a chunk of 1024 samples, the Fourier transform behaves as though this chunk were repeated indefinitely. This leads to jumps in the signal between one repetition and the next, which show up as spurious frequencies in the spectrum. To suppress this effect (it cannot be avoided completely), one fades the signal in over the first 512 samples and fades it out over the remaining 512 samples. This is accomplished through multiplication with a "window function", such as Hamming's. The data at the beginning and at the end of the chunk are largely lost in this process. To compensate for that, it is customary to start the next chunk of 1024 samples at sample 512 of the current chunk ("50 % overlap").
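Windowing with 50 % overlap can be sketched as follows. The function names are hypothetical; the coefficients 0.54 and 0.46 are the usual textbook values for the Hamming window. Trailing samples that do not fill a whole chunk are simply dropped here, which is an assumption of this sketch.

```python
import math

def hamming(n):
    """Hamming window of length n (n >= 2) with the usual 0.54/0.46 form."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1))
            for i in range(n)]

def windowed_frames(samples, size=1024, hop=512):
    """Cut the signal into overlapping chunks and apply the window to each.

    With hop = size / 2 this gives 50 % overlap. Samples at the end that
    do not fill a complete chunk are ignored.
    """
    window = hamming(size)
    frames = []
    for start in range(0, len(samples) - size + 1, hop):
        chunk = samples[start:start + size]
        frames.append([x * w for x, w in zip(chunk, window)])
    return frames

frames = windowed_frames([1.0] * 2048)
print(len(frames))              # chunks start at samples 0, 512 and 1024
print(round(frames[0][0], 2))   # weight at the very edge of a chunk
```

Note that the Hamming window does not fade all the way to zero: the weight at the edges is 0.08, so a small jump between repetitions remains.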