Sub-Band Coding

Otolith provides On Line Tutorials in fields of current interest in audio, signal processing, speech, hearing and music.

Introduction

Sub-Band Coding (SBC) is a powerful and general method of encoding audio signals efficiently. Unlike source-specific methods (like LPC, which works only on speech), SBC can encode any audio signal from any source, making it ideal for music recordings, movie soundtracks, and the like. MPEG Audio is the most popular example of SBC. This document describes the basic ideas behind SBC and discusses some of the issues involved in its use.

Basic Principles

SBC depends on a phenomenon of the human hearing system called masking. Normal human ears are sensitive to a wide range of frequencies. However, when a lot of signal energy is present at one frequency, the ear cannot hear lower energy at nearby frequencies. We say that the louder frequency masks the softer frequencies. The louder frequency is called the masker.

(Strictly speaking, what we're describing here is really called simultaneous masking (masking across frequency). There are also nonsimultaneous masking (masking across time) phenomena, as well as many other phenomena of human hearing, which we're not concerned with here. For more information about auditory perception, see the upcoming Auditory Perception OLT.)

The basic idea of SBC is to save signal bandwidth by throwing away information about frequencies which are masked. The result won't be the same as the original signal, but if the computation is done right, human ears can't hear the difference.

Encoding audio signals

The simplest way to encode audio signals is Pulse Code Modulation (PCM), which is used on music CDs, DAT recordings, and so on. Like all digitization, PCM adds noise to the signal, which is generally undesirable. The fewer bits used in digitization, the more noise gets added. The way to keep this noise from being a problem is to use enough bits to ensure that the noise is always low enough to be masked either by the signal or by other sources of noise. This produces a high quality signal, but at a high bit rate (over 700k bps for one channel of CD audio). A lot of those bits are encoding masked portions of the signal, and are being wasted.
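To make these numbers concrete, here is a small Python sketch of the arithmetic. The function names are ours, and the signal-to-noise formula is the usual textbook approximation for a full-scale sine wave under uniform quantization, so treat it as a rule of thumb rather than a property of any particular converter.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample, channels=1):
    """Raw PCM bit rate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


def quantization_snr_db(bits_per_sample):
    """Rule-of-thumb SNR for a full-scale sine with uniform quantization.

    Each extra bit buys roughly 6 dB less quantization noise.
    """
    return 6.02 * bits_per_sample + 1.76


# One channel of CD audio: 44.1 kHz sampling, 16 bits per sample.
print(pcm_bit_rate(44100, 16))            # 705600 bits per second
print(round(quantization_snr_db(16), 2))  # about 98 dB
print(round(quantization_snr_db(8), 2))   # about 50 dB with only 8 bits
```

Halving the word length halves the bit rate, but it also raises the noise floor by nearly 50 dB, which is why simply dropping bits is not a workable compression method.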

There are cleverer ways of digitizing an audio signal, which can save some of that wasted bandwidth. A classic method is nonlinear PCM, such as mu-law encoding (named after the parameter mu that sets the amount of compression in its encoding curve). This is like PCM on a logarithmic scale, and the effect is to add noise that is roughly proportional to the signal strength. Sun's .au format for sound files is a popular example of mu-law encoding. Using 8-bit mu-law encoding would cut our one channel of CD audio down to about 350k bps, which is better but still pretty high, and is often audibly poorer in quality than the original (this scheme doesn't really model masking effects).
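The logarithmic idea can be sketched in a few lines of Python. This uses the continuous mu-law curve with mu = 255 and a plain 8-bit uniform quantizer applied to the compressed value. It is an illustration only: real telephone and .au codecs use a segmented approximation of this curve, and the function names here are our own.

```python
import math

MU = 255.0  # the usual value of mu for 8-bit encoding


def mu_law_compress(x, mu=MU):
    """Map x in [-1, 1] onto [-1, 1] along a logarithmic curve."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y, mu=MU):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def encode_8bit(x):
    """Compress, then quantize uniformly to one of 256 codes."""
    y = mu_law_compress(x)
    return int(round((y + 1.0) / 2.0 * 255))


def decode_8bit(code):
    """Undo the uniform quantization, then expand."""
    y = code / 255.0 * 2.0 - 1.0
    return mu_law_expand(y)


# The absolute error grows with the signal level, so the noise
# stays roughly proportional to the signal.
for x in (0.005, 0.05, 0.5):
    err = abs(decode_8bit(encode_8bit(x)) - x)
    print(x, err)
```

Loud samples are reconstructed with a larger absolute error than quiet ones, which is the "noise proportional to signal strength" behaviour described above.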

A basic SBC scheme


Most SBC encoders use a structure like this. First, a time-frequency mapping (a filter bank, or FFT, or something else) decomposes the input signal into subbands. The psychoacoustic model looks at these subbands as well as the original signal, and determines masking thresholds using psychoacoustic information. Using these masking thresholds, each of the subband samples is quantized and encoded so as to keep the quantization noise below the masking threshold. The final step is to assemble all these quantized samples into frames, so that the decoder can find where each block of data begins and knows how to interpret it.

Decoding is easier, since there is no need for a psychoacoustic model. The frames are unpacked, subband samples are decoded, and a frequency-time mapping turns them back into a single output audio signal.
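The encode and decode paths described above can be illustrated with a toy Python coder. Everything here is a simplification of our own: a 32-point DCT stands in for the time-frequency mapping, the masking thresholds are simply handed in as a list (a real coder would compute them with a psychoacoustic model), and no frame packing is done.

```python
import math

N = 32  # block length, and so the number of toy "subbands"


def dct(block):
    """Orthonormal DCT-II, used here as a stand-in time-frequency mapping."""
    n = len(block)
    out = []
    for k in range(n):
        s = sum(block[i] * math.cos(math.pi * (i + 0.5) * k / n)
                for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def idct(coeffs):
    """Inverse of dct(): the frequency-time mapping used by the decoder."""
    n = len(coeffs)
    out = []
    for i in range(n):
        s = coeffs[0] * math.sqrt(1.0 / n)
        s += sum(coeffs[k] * math.sqrt(2.0 / n)
                 * math.cos(math.pi * (i + 0.5) * k / n)
                 for k in range(1, n))
        out.append(s)
    return out


def encode_block(block, thresholds):
    """Quantize each coefficient with a step tied to its threshold.

    A uniform quantizer with step 2*t has error at most t, so the
    quantization error in each band stays at or below its threshold.
    """
    coeffs = dct(block)
    return [int(round(c / (2.0 * t))) for c, t in zip(coeffs, thresholds)]


def decode_block(codes, thresholds):
    """Rescale the integer codes and map back to the time domain."""
    coeffs = [q * 2.0 * t for q, t in zip(codes, thresholds)]
    return idct(coeffs)


# Hypothetical thresholds: fine quantization in the low bands,
# coarse quantization in the high bands.
thresholds = [0.001] * 8 + [0.05] * 24
block = [math.sin(2 * math.pi * 3 * i / N) for i in range(N)]
codes = encode_block(block, thresholds)
decoded = decode_block(codes, thresholds)
print(sum(1 for q in codes if q == 0), "of", N, "bands quantize to zero")
```

Bands whose content falls below the threshold quantize to zero and cost almost nothing to transmit, while the decoder needs only the codes and the step sizes, not the psychoacoustic model.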

This is a basic, generic sketch of how SBC works. Notice that we haven't looked at how much computation it takes to do this. For practical systems that need to run in real time, computation is a major issue, and is usually the main constraint on what can be done.

Over the last five to ten years, SBC systems have been developed by many of the key companies and laboratories in the audio industry. Beginning in the late 1980s, a standardization body of the ISO called the Moving Picture Experts Group (MPEG) developed generic standards for coding of both audio and video. Let's look at MPEG Audio as a specific example of a practical SBC system.

MPEG-1 Audio: a practical SBC system

MPEG-1 Audio is really a group of three different SBC schemes, called layers. Each layer is a self-contained SBC coder with its own time-frequency mapping, psychoacoustic model, and quantizer, as shown in the diagram above. Layer 1 is the simplest, but gives the poorest compression. Layer 3 is the most complicated and difficult to compute, but gives the best compression.

The idea is that an application of MPEG-1 Audio can use whichever layer gives the best tradeoff between computational burden and compression performance. Audio can be encoded in any one layer. A standard MPEG decoder for any layer is also able to decode lower (simpler) layers of encoded audio.

MPEG-1 Audio is intended to take a PCM audio signal sampled at a rate of 32, 44.1 or 48 kHz, and encode it at a bit rate of 32 to 192 kbps per audio channel (depending on layer).

MPEG-1 Audio Layer 1

The Layer 1 time-frequency mapping is a polyphase filter bank with 32 subbands. Polyphase filters combine low computational complexity with flexible design and implementation options. However, the subbands are equally spaced in frequency (unlike critical bands).
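A quick Python calculation shows what "equally spaced" means in practice. The function name is ours. The 32 subbands split the range from 0 Hz to half the sampling rate into equal slices.

```python
def subband_edges(sample_rate_hz, num_subbands=32):
    """Return (low, high) edges in Hz for equally spaced subbands."""
    width = sample_rate_hz / 2 / num_subbands
    return [(k * width, (k + 1) * width) for k in range(num_subbands)]


for rate in (32000, 44100, 48000):
    low, high = subband_edges(rate)[0]
    print(rate, "Hz sampling: each subband is", high - low, "Hz wide")
```

At 48 kHz every subband is 750 Hz wide. Critical bands are only roughly 100 Hz wide at low frequencies, so the lowest subband covers several critical bands at once, which is the mismatch noted above.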

The Layer 1 psychoacoustic model uses a 512-point FFT to get detailed spectral information about the signal. The output of the FFT is used to find both tonal (sinusoidal) and nontonal (noise) maskers in the signal. Each masker produces a masking threshold depending on its frequency, intensity, and tonality. For each subband, the individual masking thresholds are combined to form a global masking threshold. The masking threshold is compared to the maximum signal level for the subband, producing a signal-to-masker ratio (SMR) which is the input to the quantizer.
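The last step can be sketched as follows. We assume here that individual thresholds are combined by adding their powers, and that all levels are given in decibels. The real model also includes the threshold of hearing in quiet and other details that this sketch leaves out, and the function names are our own.

```python
import math


def combine_thresholds_db(thresholds_db):
    """Combine individual masking thresholds (dB) by summing their powers."""
    total_power = sum(10 ** (t / 10) for t in thresholds_db)
    return 10 * math.log10(total_power)


def smr_db(signal_level_db, thresholds_db):
    """Signal-to-masker ratio for one subband, in dB."""
    return signal_level_db - combine_thresholds_db(thresholds_db)


# Two maskers each contribute a 40 dB threshold in this subband;
# together they give a global threshold about 3 dB higher.
print(round(combine_thresholds_db([40, 40]), 2))  # about 43 dB
print(round(smr_db(60, [40, 40]), 2))             # about 17 dB
```

A large positive SMR means the signal stands well above the masking threshold and needs fine quantization. A negative SMR means the subband is fully masked.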

The Layer 1 quantizer/encoder first examines each subband's samples, finds the maximum absolute value of these samples, and quantizes it to 6 bits. This is called the scale factor for the subband. Then it determines the bit allocation for each subband by minimizing the total noise-to-mask ratio with respect to the bits allocated to each subband. (It's possible for heavily masked subbands to end up with zero bits, so that no samples are encoded.) Finally, the subband samples are linearly quantized to the bit allocation for that subband.
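One common way to carry out this kind of minimization is a greedy loop: keep giving one more bit to whichever subband currently has the worst noise-to-mask ratio until the bits run out. The Python sketch below assumes that each bit improves the signal-to-noise ratio by about 6 dB and that each subband holds 12 samples per frame (384 samples divided over 32 subbands). The standard uses tabulated values and a restricted set of word lengths, so this shows the idea, not the normative procedure.

```python
def allocate_bits(smr_db, bit_budget, samples_per_band=12, max_bits=15):
    """Greedy bit allocation: serve the worst noise-to-mask ratio first.

    smr_db     -- signal-to-masker ratio of each subband, in dB
    bit_budget -- bits available for subband samples in this frame
    Returns the number of bits per sample given to each subband.
    """
    bits = [0] * len(smr_db)
    remaining = bit_budget
    while remaining >= samples_per_band:
        # Noise-to-mask ratio: how far the quantization noise still
        # sits above the masking threshold (about 6.02 dB per bit).
        nmr = [s - 6.02 * b if b < max_bits else float("-inf")
               for s, b in zip(smr_db, bits)]
        worst = max(range(len(nmr)), key=nmr.__getitem__)
        if nmr[worst] <= 0:
            break  # all remaining noise is already masked
        bits[worst] += 1
        remaining -= samples_per_band
    return bits


# Four subbands; the third is heavily masked (negative SMR).
print(allocate_bits([30, 10, -5, 20], bit_budget=120))  # [5, 2, 0, 3]
```

The masked subband ends up with zero bits, and the loop stops early if the budget is generous enough to push all noise below the thresholds.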

The Layer 1 frame packer has a fairly easy job. Each frame starts with header information for synchronization and bookkeeping, followed by a 16-bit cyclic redundancy check (CRC) for error detection. Each of the 32 subbands gets 4 bits to describe bit allocation and 6 bits for the scale factor. The remaining bits in the frame are used for subband samples, with an optional trailer for extra information.
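The frame layout lets us estimate how many bits are left for the subband samples themselves. The sketch below covers a single channel and assumes a 32-bit header (a figure not given above), no padding, and no trailer. The function and parameter names are ours.

```python
HEADER_BITS = 32       # assumed header size for this sketch
CRC_BITS = 16
SUBBANDS = 32
ALLOCATION_BITS = 4    # per subband
SCALE_FACTOR_BITS = 6  # per subband that carries samples
SAMPLES_PER_FRAME = 384


def layer1_sample_bits(bit_rate_bps, sample_rate_hz,
                       active_subbands=SUBBANDS, use_crc=True):
    """Bits left for subband samples in one single-channel frame."""
    frame_bits = bit_rate_bps * SAMPLES_PER_FRAME // sample_rate_hz
    side_info = (HEADER_BITS
                 + (CRC_BITS if use_crc else 0)
                 + SUBBANDS * ALLOCATION_BITS
                 + active_subbands * SCALE_FACTOR_BITS)
    return frame_bits - side_info


# 192k bps at 48 kHz: 1536 bits per frame, 368 of them side information.
print(layer1_sample_bits(192000, 48000))  # 1168
```

In this example about a quarter of each frame goes to bookkeeping, which is one reason the simplest layer needs a comparatively high bit rate.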

Layer 1 processes the input signal in frames of 384 PCM samples. At 48 kHz, each frame carries 8 ms of sound. The MPEG specification doesn't specify the encoded bit rate, allowing implementation flexibility. Highest quality is achieved with a bit rate of 384k bps. Typical applications of Layer 1 include digital recording on tapes, hard disks, or magneto-optical disks, which can tolerate the high bit rate.

MPEG-1 Audio Layer 2

The Layer 2 time-frequency mapping is the same as in Layer 1: a polyphase filter bank with 32 subbands.

The Layer 2 psychoacoustic model is similar to the Layer 1 model, but it uses a 1024-point FFT for greater frequency resolution. It uses the same procedure as the Layer 1 model to produce signal-to-masker ratios for each of the 32 subbands.

The Layer 2 quantizer/encoder is similar to that used in Layer 1, generating 6-bit scale factors for each subband. However, Layer 2 frames are three times as long as Layer 1 frames, so Layer 2 allows each subband a sequence of three successive scale factors, and the encoder uses one, two, or all three, depending on how much they differ from each other. This gives, on average, a factor of 2 reduction in bit rate for the scale factors compared to Layer 1. Bit allocations are computed in a similar way to Layer 1.
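The scale factor sharing can be illustrated with a simplified Python sketch. The tolerance rule below is our own stand-in for the standard's selection tables, and the scale factors are treated as plain magnitudes where larger means louder. Only the bit counting (a 2-bit selection code plus 6 bits per transmitted scale factor) follows the description of the frame format.

```python
def choose_scale_factors(sf, tolerance=1):
    """Decide how many of three successive scale factors to transmit.

    sf is a sequence of three scale factors for one subband.
    Returns (transmitted scale factors, bits used including the
    2-bit selection code). Similar neighbours share the larger
    value so that no sample overflows its scale.
    """
    a, b, c = sf
    if max(sf) - min(sf) <= tolerance:
        sent = [max(sf)]              # one scale factor covers all three
    elif abs(a - b) <= tolerance:
        sent = [max(a, b), c]         # first two share
    elif abs(b - c) <= tolerance:
        sent = [a, max(b, c)]         # last two share
    else:
        sent = [a, b, c]              # all three differ
    return sent, 2 + 6 * len(sent)


print(choose_scale_factors([10, 10, 11]))  # ([11], 8)
print(choose_scale_factors([10, 11, 20]))  # ([11, 20], 14)
print(choose_scale_factors([10, 20, 30]))  # ([10, 20, 30], 20)
```

Covering the same span of audio with Layer 1 frames would cost 18 bits per subband for scale factors. A steady signal costs only 8 bits here, while a rapidly changing one costs slightly more than before, so the saving is an average, not a guarantee.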

The Layer 2 frame packer uses the same header and CRC structure as Layer 1. The number of bits used to describe bit allocations varies with subband: 4 bits for the low subbands, 3 bits for the middle subbands, and 2 bits for the high subbands (this follows critical bandwidths). The scale factors (one, two or three depending on the data) are encoded along with a 2-bit code describing which combination of scale factors is being used. The subband samples are quantized according to bit allocation, and then combined into groups of three (called granules). Each granule is encoded with one code word. This allows Layer 2 to capture much more redundant signal information than Layer 1.
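The saving from grouping is easiest to see when the quantizer has a number of levels that is not a power of two. The Python sketch below packs three quantized samples into one number by treating them as digits in base "levels". As we understand it, grouping is applied to quantizers with 3, 5 and 9 levels, but the exact code word layout in the standard may differ from this illustration, and the function names are ours.

```python
def pack_triplet(samples, levels):
    """Combine three quantized samples (each 0..levels-1) into one code."""
    a, b, c = samples
    if not all(0 <= s < levels for s in samples):
        raise ValueError("sample out of range")
    return a + levels * b + levels * levels * c


def unpack_triplet(code, levels):
    """Recover the three samples from a packed code."""
    code, a = divmod(code, levels)
    c, b = divmod(code, levels)
    return (a, b, c)


def bits_grouped(levels):
    """Bits needed for one code word covering three samples."""
    return (levels ** 3 - 1).bit_length()


def bits_separate(levels):
    """Bits needed when each of the three samples is coded alone."""
    return 3 * (levels - 1).bit_length()


for levels in (3, 5, 9):
    print(levels, "levels:", bits_separate(levels), "bits separately,",
          bits_grouped(levels), "bits grouped")
```

With a 3-level quantizer, three samples coded separately need 6 bits, but the 27 possible combinations fit into a single 5-bit code word.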

Layer 2 processes the input signal in frames of 1152 PCM samples. At 48 kHz, each frame carries 24 ms of sound. Highest quality is achieved with a bit rate of 256k bps, but quality is often good down to 64k bps. Typical applications of Layer 2 include audio broadcasting, television, consumer and professional recording, and multimedia.
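The frame figures follow directly from the frame length and the sampling rate, as this short Python sketch shows. The function names are ours, and padding is ignored, so the bit counts are averages when the division does not come out even.

```python
def frame_duration_ms(samples_per_frame, sample_rate_hz):
    """How much sound one frame carries, in milliseconds."""
    return samples_per_frame * 1000 / sample_rate_hz


def frame_bits(bit_rate_bps, samples_per_frame, sample_rate_hz):
    """Average number of bits in one encoded frame."""
    return bit_rate_bps * samples_per_frame / sample_rate_hz


print(frame_duration_ms(384, 48000))    # Layer 1: 8.0 ms
print(frame_duration_ms(1152, 48000))   # Layer 2: 24.0 ms
print(frame_bits(256000, 1152, 48000))  # 6144.0 bits per frame at 256k bps
print(frame_bits(64000, 1152, 48000))   # 1536.0 bits per frame at 64k bps
```

A longer frame spreads the fixed header cost over more audio, at the price of more delay in the encoder and decoder.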

Audio files on the World Wide Web with the extension .mp2 are typically encoded with MPEG-1 Layer 2.

MPEG-1 Audio Layer 3

Layer 3 is substantially more complicated than Layer 2, and we will not describe it in detail. It uses a hybrid filter bank (a polyphase filter bank followed by a modified discrete cosine transform), a polynomial prediction psychoacoustic model, and sophisticated quantization and encoding schemes allowing variable length frames. The frame packer includes a bit reservoir which allows more bits to be used for portions of the signal that need them.

Layer 3 is intended for applications where a critical need for low bit rate justifies the expensive and sophisticated encoding system. It allows high quality results at bit rates as low as 64k bps. Typical applications are in telecommunication and professional audio, such as commercially published music and video.

Other options and MPEG-2 Audio

The MPEG-1 Audio standard includes a lot more than this discussion has covered. For one thing, it allows two channels, which can be encoded separately or as joint stereo, which encodes both channels of stereo together for further savings.

MPEG-2 Audio is a recent extension to the standard, which allows up to five channels (for movies which have left, right, center, and two surround channels) plus a subwoofer channel.

Summary

Sub-Band Coding is a powerful and flexible technique for encoding high quality audio at low bit rates. Its applications are sure to grow as computer hardware increases in power and decreases in price. We hope this tutorial has been informative and helpful. For more information, click on one of the pointers below, or see the texts listed in the References section.

References

MPEG FAQ

Open MPEG Consortium

MPEG-FAQ 4.0

"ISO-MPEG-1 Audio: a generic standard for coding of high-quality digital audio"
Karlheinz Brandenburg and Gerhard Stoll.
Journal of the Audio Engineering Society 42(10):780-792, October 1994.

For more information on audio file formats, see the Usenet FAQ on audio file formats.

This tutorial is provided by Otolith. This tutorial is currently under construction. If you have any comments, suggestions, or questions, please contact us at the address below.


This page maintained by Wil Howitt
Last updated 18 October 95