Linear Predictive Coding (LPC)

Otolith provides On Line Tutorials in fields of current interest: audio, signal processing, speech, hearing, and music.

Introduction

Linear Predictive Coding (LPC) is one of the most powerful speech analysis techniques, and one of the most useful methods for encoding good quality speech at a low bit rate. It provides extremely accurate estimates of speech parameters, and is relatively efficient for computation. This document describes the basic ideas behind linear prediction, and discusses some of the issues involved in its use.

Basic Principles

LPC starts with the assumption that the speech signal is produced by a buzzer at the end of a tube. The glottis (the space between the vocal cords) produces the buzz, which is characterized by its intensity (loudness) and frequency (pitch). The vocal tract (the throat and mouth) forms the tube, which is characterized by its resonances, which are called formants. For more information about speech production, see the Speech Production OLT.

LPC analyzes the speech signal by estimating the formants, removing their effects from the speech signal, and estimating the intensity and frequency of the remaining buzz. The process of removing the formants is called inverse filtering, and the remaining signal is called the residue.

The numbers which describe the formants and the residue can be stored or transmitted somewhere else. LPC synthesizes the speech signal by reversing the process: use the residue to create a source signal, use the formants to create a filter (which represents the tube), and run the source through the filter, resulting in speech.
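To make the synthesis step concrete, here is a minimal sketch in Python (the function name `lpc_synthesize` and the use of NumPy are our own choices, not part of any standard). It runs a source signal through the all-pole filter defined by the prediction coefficients, computing each output sample from the source plus a weighted sum of previous output samples:

```python
import numpy as np

def lpc_synthesize(coeffs, source):
    """Run a source (excitation) signal through an all-pole filter:
    s[n] = e[n] + sum_k coeffs[k-1] * s[n-k]."""
    p = len(coeffs)
    out = np.zeros(len(source))
    for n in range(len(source)):
        acc = source[n]
        for k in range(1, p + 1):
            if n - k >= 0:
                acc += coeffs[k - 1] * out[n - k]
        out[n] = acc
    return out
```

For example, a single coefficient of 0.5 driven by a unit impulse produces a decaying exponential, the impulse response of a one-pole resonator.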

Because speech signals vary with time, this process is done on short chunks of the speech signal, which are called frames. Usually 30 to 50 frames per second give intelligible speech with good compression.
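A bare sketch of the framing step, under the assumption of an 8 kHz sampling rate and non-overlapping frames (real coders usually overlap and window their frames; the function name is hypothetical):

```python
def split_into_frames(signal, sample_rate=8000, frames_per_second=40):
    """Chop a signal into equal, non-overlapping frames.
    At 8 kHz, 40 frames per second gives 200-sample frames."""
    frame_len = sample_rate // frames_per_second
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, frame_len)]
```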

Estimating the Formants

The basic problem of the LPC system is to determine the formants from the speech signal. The basic solution is a difference equation, which expresses each sample of the signal as a linear combination of previous samples. Such an equation is called a linear predictor, which is why this is called Linear Predictive Coding.

The coefficients of the difference equation (the prediction coefficients) characterize the formants, so the LPC system needs to estimate these coefficients. The estimate is done by minimizing the mean-square error between the predicted signal and the actual signal.
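In symbols (using a conventional notation, since the original does not spell one out): an order-p predictor estimates each sample from the p preceding samples, and the coefficients are chosen to minimize the total squared prediction error over the frame,

```latex
\hat{s}[n] = \sum_{k=1}^{p} a_k \, s[n-k],
\qquad
E = \sum_{n} \bigl( s[n] - \hat{s}[n] \bigr)^2 .
```

Setting the derivatives of E with respect to each a_k to zero yields a set of p linear equations in the p unknown coefficients.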

This is a straightforward problem, in principle. In practice, it involves (1) computing a matrix of coefficient values, and (2) solving a set of linear equations. Several methods (autocorrelation, covariance, recursive lattice formulation) can be used to ensure an efficient computation that converges to a unique solution.
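As a sketch of one of these methods, the following Python function implements the autocorrelation method: it builds the autocorrelation values of the frame and then solves the resulting linear equations with the Levinson-Durbin recursion. The function name and interface are our own; this is a textbook simplification, not production coder code:

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Autocorrelation method: compute autocorrelations r[0..order],
    then solve the normal equations by the Levinson-Durbin recursion.
    Returns a[0..order-1] such that s[n] is predicted as
    sum_k a[k-1] * s[n-k]."""
    frame = np.asarray(frame, dtype=float)
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(order + 1)])
    a = np.zeros(order)
    err = r[0]
    for i in range(order):
        # reflection coefficient for the next predictor order
        k = (r[i + 1] - np.dot(a[:i], r[i:0:-1])) / err
        a[:i] -= k * a[i - 1::-1][:i]
        a[i] = k
        err *= 1.0 - k * k
    return a
```

On a signal that really is a decaying exponential (a one-pole source), the recursion recovers the pole: a 0.9-per-sample decay yields a first coefficient of about 0.9.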

Problem: the tube isn't just a tube

It may seem surprising that the signal can be characterized by such a simple linear predictor. It turns out that, in order for this to work, the tube must not have any side branches. (In mathematical terms, side branches introduce zeros, which require much more complex equations.)

For ordinary vowels, the vocal tract is well represented by a single tube. However, for nasal sounds, the nose cavity forms a side branch. Theoretically, therefore, nasal sounds require a different and more complicated algorithm. In practice, this difference is partly ignored and partly dealt with during the encoding of the residue (see below).

Encoding the Source

If the predictor coefficients are accurate, and everything else works right, the speech signal can be inverse filtered by the predictor, and the result will be the pure source (buzz). For such a signal, it's fairly easy to extract the frequency and amplitude and encode them.
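One common way to extract those two numbers, sketched here in Python (the search range and function name are our own assumptions), is to take the intensity as the root-mean-square of the residue and find the pitch as the lag of the largest autocorrelation peak within a plausible voice range:

```python
import numpy as np

def estimate_pitch_and_gain(residue, sample_rate=8000, fmin=60, fmax=400):
    """Estimate pitch (Hz) from the residue's autocorrelation peak,
    and gain as the residue's RMS amplitude."""
    gain = np.sqrt(np.mean(residue ** 2))
    lo, hi = sample_rate // fmax, sample_rate // fmin
    ac = [np.dot(residue[:-lag], residue[lag:]) for lag in range(lo, hi + 1)]
    lag = lo + int(np.argmax(ac))
    return sample_rate / lag, gain
```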

However, some consonants are produced with turbulent airflow, resulting in a hissy sound (fricatives and stop consonants). Fortunately, the predictor equation doesn't care if the sound source is periodic (buzz) or chaotic (hiss).

This means that for each frame, the LPC encoder must decide if the sound source is buzz or hiss; if buzz, estimate the frequency; in either case, estimate the intensity; and encode the information so that the decoder can undo all these steps. This is how LPC-10e, the algorithm described in Federal Standard 1015, works: it uses one number to represent the frequency of the buzz, and the value 0 is understood to represent hiss. LPC-10e provides intelligible speech transmission at 2400 bits per second.
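The decoder's side of this convention can be sketched as follows. A positive pitch code means buzz, so we rebuild an impulse train at the pitch period; a pitch code of 0 means hiss, so we rebuild noise. (The impulse-train and white-noise source shapes are a common textbook simplification, not the literal standard's excitation.)

```python
import numpy as np

def make_excitation(pitch_code, gain, frame_len, sample_rate=8000, seed=0):
    """Rebuild one frame's source from its encoded description:
    pitch_code > 0 -> buzz (impulse train at the pitch period),
    pitch_code == 0 -> hiss (white noise)."""
    if pitch_code > 0:
        period = sample_rate // pitch_code
        excitation = np.zeros(frame_len)
        excitation[::period] = 1.0      # one impulse per pitch period
    else:
        excitation = np.random.default_rng(seed).standard_normal(frame_len)
    return gain * excitation
```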

Here is a sample of LPC-10e encoded speech. Sound files are in Sun/NeXT 8 bit u-law format, and should be playable on all browsers.
Original
LPC-10e encoded

Problem: the buzz isn't just buzz

Unfortunately, things are not so simple. One reason is that there are speech sounds which are made with a combination of buzz and hiss sources (for example, the initial consonants in "this zoo" and the middle consonant in "azure"). Speech sounds like this will not be reproduced accurately by a simple LPC encoder.

Another problem is that, inevitably, any inaccuracy in the estimation of the formants means that more speech information gets left in the residue. The aspects of nasal sounds that don't match the LPC model (as discussed above), for example, will end up in the residue. There are other aspects of the speech sound that don't match the LPC model; side branches introduced by the tongue positions of some consonants, and tracheal (lung) resonances are some examples.

Therefore, the residue contains important information about how the speech should sound, and LPC synthesis without this information will result in poor quality speech. For the best quality results, we could just send the residue signal, and the LPC synthesis would sound great. Unfortunately, the whole idea of this technique is to compress the speech signal, and the residue signal takes just as many bits as the original speech signal, so this would not provide any compression.

Encoding the Residue

Various attempts have been made to encode the residue signal in an efficient way, providing better quality speech than LPC-10e without increasing the bit rate too much. The most successful methods use a codebook, a table of typical residue signals, which is set up by the system designers. In operation, the analyzer compares the residue to all the entries in the codebook, chooses the entry which is the closest match, and just sends the code for that entry. The synthesizer receives this code, retrieves the corresponding residue from the codebook, and uses that to excite the formant filter. Schemes of this kind are called Code Excited Linear Prediction (CELP).
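The core of the analyzer's search can be sketched as follows. For each codebook entry we fit an optimal gain by least squares and keep the index with the smallest matching error. (Real CELP searches through the synthesis filter in a perceptually weighted domain; this bare waveform match, and the function name, are simplifications of our own.)

```python
import numpy as np

def best_codebook_entry(residue, codebook):
    """Exhaustive codebook search: fit a least-squares gain per entry,
    return the index of the entry with the smallest residual error."""
    best_index, best_err = 0, np.inf
    for i, entry in enumerate(codebook):
        gain = np.dot(residue, entry) / np.dot(entry, entry)
        err = np.sum((residue - gain * entry) ** 2)
        if err < best_err:
            best_index, best_err = i, err
    return best_index
```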

For CELP to work well, the codebook must be big enough to include all the various kinds of residues. But if the codebook is too big, it will be time consuming to search through, and it will require large codes to specify the desired residue. The biggest problem is that such a system would require a different code for every frequency of the source (pitch of the voice), which would make the codebook extremely large.

This problem can be solved by using two small codebooks instead of one very large one. One codebook is fixed by the designers, and contains just enough codes to represent one pitch period of residue. The other codebook is adaptive; it starts out empty, and is filled in during operation, with copies of the previous residue delayed by various amounts. Thus, the adaptive codebook acts like a variable shift register, and the amount of delay provides the pitch.
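The idea behind the adaptive codebook can be sketched like this: each "entry" is simply the previous excitation delayed by some lag, extended periodically when the lag is shorter than the frame, so the lag that best matches the current residue directly gives the pitch period. (The function name and interface are hypothetical.)

```python
import numpy as np

def adaptive_codebook_entry(past_excitation, lag, frame_len):
    """One adaptive-codebook entry: the previous excitation delayed by
    `lag` samples, repeated periodically across the frame."""
    past = np.asarray(past_excitation, dtype=float)
    out = np.empty(frame_len)
    for n in range(frame_len):
        out[n] = past[len(past) - lag + (n % lag)]
    return out
```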

This is the CELP algorithm described in Federal Standard 1016. It provides good quality, natural sounding speech at 4800 bits per second.

Here is a sample of CELP encoded speech. Sound files are in Sun/NeXT 8 bit u-law format, and should be playable on all browsers.
Original
CELP encoded

Summary

Linear Predictive Coding is a powerful speech analysis technique for representing speech for low bit rate transmission or storage. We hope this tutorial has been informative and helpful. For more information, see the texts listed in the References section.

References

Federal Standard 1015 was published in 1984.
Federal Standard 1016 was published in 1991.

Source code for both LPC10e and CELP is available at ftp://ftp.super.org/pub/speech/.

See the comp.speech FAQ for more pointers to articles on the algorithms and federal standards.

Digital Processing of Speech Signals.
L. R. Rabiner and R. W. Schafer.
Prentice-Hall (Signal Processing Series), 1978.

"The Government Standard Linear Predictive Coding Algorithm: LPC-10"
Thomas E. Tremain.
Speech Technology Magazine, April 1982, p. 40-49.

"The Proposed Federal Standard 1016 4800 bps Voice Coder: CELP"
Joseph P. Campbell Jr., Thomas E. Tremain and Vanoy C. Welch.
Speech Technology Magazine, April/May 1990, p. 58-64.

This tutorial is provided by Otolith. If you have any comments, suggestions, or questions, please contact us.


This page maintained by Wil Howitt
Last updated 17 October 95