(Strictly speaking, what we're describing here is really called simultaneous masking (masking across frequency). There are also nonsimultaneous masking (masking across time) phenomena, as well as many other phenomena of human hearing, which we're not concerned with here. For more information about auditory perception, see the upcoming Auditory Perception OLT.)
The basic idea of SBC is to save signal bandwidth by throwing away information about frequencies which are masked. The result won't be the same as the original signal, but if the computation is done right, human ears can't hear the difference.
There are more clever ways of digitizing an audio signal, which can save some of that wasted bandwidth. A classic method is nonlinear PCM, such as mu-law encoding (named after a perceptual curve in auditory perception research). This is like PCM on a logarithmic scale, and the effect is to add noise that is proportional to the signal strength. Sun's .au format for sound files is a popular example of mu-law encoding. Using 8-bit mu-law encoding would cut our one channel of CD audio down to about 350k bps, which is better but still pretty high, and is often audibly poorer quality than the original (this scheme doesn't really model masking effects).
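As a rough illustration of the logarithmic scale, here is a minimal sketch of mu-law companding in Python. It assumes the continuous mu-law formula with mu = 255 and samples normalized to the range -1 to 1. The function names are ours, and practical codecs generally use a piecewise-linear approximation rather than this exact formula.

```python
import math

MU = 255  # assumption: the usual value for 8-bit mu-law

def mulaw_compress(x, mu=MU):
    """Map a sample in [-1, 1] onto a logarithmic scale, still in [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mulaw_expand(y, mu=MU):
    """Inverse of mulaw_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

def encode_8bit(x):
    """Compress, then quantize uniformly to an integer code 0..255."""
    y = mulaw_compress(x)
    return min(255, int((y + 1.0) / 2.0 * 256))

def decode_8bit(code):
    """Map a code back to the centre of its quantization step, then expand."""
    y = (code + 0.5) / 256 * 2.0 - 1.0
    return mulaw_expand(y)

# Bit rate of one CD channel at 8 bits per sample
cd_rate_hz = 44100
bits_per_second = cd_rate_hz * 8  # 352800, i.e. about 350k bps

for x in (0.01, 0.1, 0.5):
    error = abs(decode_8bit(encode_8bit(x)) - x)
    print(x, error)  # the error grows with the signal level
```

Because the quantization steps are uniform on the compressed scale, they are wider for loud samples than for quiet ones, which is where the "noise proportional to signal strength" behaviour comes from.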
Decoding is easier, since there is no need for a psychoacoustic model. The frames are unpacked, subband samples are decoded, and a frequency-time mapping turns them back into a single output audio signal.
This is a basic, generic sketch of how SBC works. Notice that we haven't looked at how much computation it takes to do this. For practical systems that need to run in real time, computation is a major issue, and is usually the main constraint on what can be done.
Over the last five to ten years, SBC systems have been developed by many of the key companies and laboratories in the audio industry. Beginning in the late 1980s, a standardization body of the ISO called the Moving Picture Experts Group (MPEG) developed generic standards for coding of both audio and video. Let's look at MPEG Audio as a specific example of a practical SBC system.
The idea is that an application of MPEG-1 Audio can use whichever layer gives the best tradeoff between computational burden and compression performance. Audio can be encoded in any one layer. A standard MPEG decoder for any layer is also able to decode lower (simpler) layers of encoded audio.
MPEG-1 Audio is intended to take a PCM audio signal sampled at a rate of 32, 44.1 or 48 kHz, and encode it at a bit rate of 32 to 192 kbps per audio channel (depending on layer).
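To put these numbers in perspective, the following sketch computes the compression ratio relative to uncompressed PCM. The 16-bit sample size is an assumption borrowed from CD audio, and the function name is ours.

```python
def compression_ratio(sample_rate_hz, coded_bps, bits_per_sample=16):
    """Ratio of the PCM bit rate of one channel to the coded bit rate."""
    pcm_bps = sample_rate_hz * bits_per_sample
    return pcm_bps / coded_bps

# Both ends of the 32 to 192 kbps range, at each supported sample rate
for rate in (32000, 44100, 48000):
    for coded in (192000, 32000):
        print(rate, coded, round(compression_ratio(rate, coded), 2))
```

At 48 kHz, for instance, 192 kbps corresponds to a ratio of 4 to 1 and 32 kbps to 24 to 1.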
The Layer 1 psychoacoustic model uses a 512-point FFT to get detailed spectral information about the signal. The output of the FFT is used to find both tonal (sinusoidal) and nontonal (noise) maskers in the signal. Each masker produces a masking threshold depending on its frequency, intensity, and tonality. For each subband, the individual masking thresholds are combined to form a global masking threshold. The masking threshold is compared to the maximum signal level for the subband, producing a signal-to-masker ratio (SMR) which is the input to the quantizer.
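The last two steps of this model can be sketched as follows. This is a simplified illustration rather than the model from the standard: it assumes the individual masking thresholds for one subband are already known in dB, combines them by adding their powers, and takes the SMR as a difference in dB. The function names and the example values are made up.

```python
import math

def global_threshold_db(thresholds_db):
    """Combine individual masking thresholds (in dB) by summing their powers."""
    total_power = sum(10 ** (t / 10) for t in thresholds_db)
    return 10 * math.log10(total_power)

def signal_to_mask_ratio_db(max_signal_db, thresholds_db):
    """SMR for one subband: maximum signal level minus global threshold."""
    return max_signal_db - global_threshold_db(thresholds_db)

# Hypothetical subband: two maskers each contribute a 40 dB threshold,
# and the loudest signal component in the subband is at 60 dB.
print(global_threshold_db([40, 40]))          # about 43 dB
print(signal_to_mask_ratio_db(60, [40, 40]))  # about 17 dB
```

A large positive SMR means the signal stands well above the masking threshold, so the quantizer has to spend more bits on that subband to keep its noise inaudible.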
The Layer 1 quantizer/encoder first examines each subband's samples, finds the maximum absolute value of these samples, and quantizes it to 6 bits. This is called the scale factor for the subband. Then it determines the bit allocation for each subband by minimizing the total noise-to-mask ratio with respect to the bits allocated to each subband. (It's possible for heavily masked subbands to end up with zero bits, so that no samples are encoded.) Finally, the subband samples are linearly quantized to the bit allocation for that subband.
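A minimal sketch of these steps is shown below. It assumes a greedy loop that repeatedly gives one more bit to the subband with the worst noise-to-mask ratio, and the rule of thumb that each bit of linear quantization buys about 6 dB of signal-to-noise ratio. Both are simplifications. The function names, the SMR values, and the bit budget are hypothetical, and the scale factor is kept as a plain number rather than a 6-bit index.

```python
DB_PER_BIT = 6.02  # assumption: SNR gain per bit of a linear quantizer

def scale_factor(samples):
    """Largest absolute sample value in the subband."""
    return max(abs(s) for s in samples)

def allocate_bits(smr_db, bit_budget, samples_per_band=12, max_bits=15):
    """Greedy allocation: improve the worst noise-to-mask ratio first.

    samples_per_band is 384 / 32 = 12 for a Layer 1 frame.
    max_bits is an assumed upper limit per sample.
    """
    bits = [0] * len(smr_db)
    remaining = bit_budget
    while remaining >= samples_per_band:
        # NMR = SMR - SNR; a positive value means audible noise
        nmr = [smr - DB_PER_BIT * b if b < max_bits else float("-inf")
               for smr, b in zip(smr_db, bits)]
        worst = max(range(len(nmr)), key=nmr.__getitem__)
        if nmr[worst] <= 0:
            break  # all quantization noise is already masked
        bits[worst] += 1
        remaining -= samples_per_band
    return bits

def quantize(samples, n_bits):
    """Linear quantization of samples normalized by the scale factor."""
    if n_bits == 0:
        return []  # heavily masked subband: no samples are encoded
    sf = scale_factor(samples) or 1.0
    levels = 2 ** n_bits
    return [round((s / sf + 1) / 2 * (levels - 1)) for s in samples]

print(allocate_bits([20, 5, -3], bit_budget=1000))
print(quantize([0.5, -1.0, 1.0], 2))
```

In the first call, the third subband has a negative SMR, so it is fully masked and ends up with zero bits.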
The Layer 1 frame packer has a fairly easy job. Each frame starts with header information for synchronization and bookkeeping, and a 16-bit cyclic redundancy check (CRC) for error detection. Each of the 32 subbands gets 4 bits to describe bit allocation and 6 bits for the scale factor. The remaining bits in the frame are used for subband samples, with an optional trailer for extra information.
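As a back-of-the-envelope check on this layout, the sketch below counts the bits in one Layer 1 frame for a single channel. The 32-bit header size and the example bit rate are assumptions not stated above, and 384 is the number of PCM samples in a Layer 1 frame. The optional trailer is ignored.

```python
HEADER_BITS = 32        # assumption: size of the frame header
CRC_BITS = 16
SUBBANDS = 32
ALLOC_BITS = 4          # per subband
SCALE_FACTOR_BITS = 6   # per subband
FRAME_SAMPLES = 384     # PCM samples per Layer 1 frame

def sample_bits_available(bit_rate_bps, sample_rate_hz):
    """Bits left for subband samples after header, CRC and side information."""
    frame_bits = bit_rate_bps * FRAME_SAMPLES // sample_rate_hz
    side_bits = SUBBANDS * (ALLOC_BITS + SCALE_FACTOR_BITS)
    return frame_bits - HEADER_BITS - CRC_BITS - side_bits

# Example: 192k bps at 48 kHz gives 1536 bits per frame,
# of which 32 + 16 + 320 = 368 are overhead.
print(sample_bits_available(192000, 48000))  # 1168
```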
Layer 1 processes the input signal in frames of 384 PCM samples. At 48 kHz, each frame carries 8 ms of sound. The MPEG specification doesn't specify the encoded bit rate, allowing implementation flexibility. Highest quality is achieved with a bit rate of 384k bps. Typical applications of Layer 1 include digital recording on tapes, hard disks, or magneto-optical disks, which can tolerate the high bit rate.
The Layer 2 psychoacoustic model is similar to the Layer 1 model, but it uses a 1024-point FFT for greater frequency resolution. It uses the same procedure as the Layer 1 model to produce signal-to-masker ratios for each of the 32 subbands.
The Layer 2 quantizer/encoder is similar to that used in Layer 1, generating 6-bit scale factors for each subband. However, Layer 2 frames are three times as long as Layer 1 frames, so Layer 2 allows each subband a sequence of three successive scale factors, and the encoder uses one, two, or all three, depending on how much they differ from each other. This gives, on average, a factor of 2 reduction in bit rate for the scale factors compared to Layer 1. Bit allocations are computed in a similar way to Layer 1.
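The choice between one, two, or three scale factors might be sketched like this. The closeness test and its tolerance are hypothetical stand-ins for the decision rule in the standard, and the scale factors are treated here as plain magnitudes (larger means louder) rather than coded indices.

```python
def choose_scale_factors(sf, tolerance=2):
    """Pick which of three successive scale factors to transmit.

    sf holds three scale factor magnitudes. tolerance is a made-up
    closeness threshold, not a value from the standard.
    """
    a, b, c = sf
    close_ab = abs(a - b) <= tolerance
    close_bc = abs(b - c) <= tolerance
    if close_ab and close_bc:
        return [max(a, b, c)]   # one value covers all three parts
    if close_ab:
        return [max(a, b), c]   # the first two parts share a value
    if close_bc:
        return [a, max(b, c)]   # the last two parts share a value
    return [a, b, c]            # all three are sent

def scale_factor_bits(sent):
    """6 bits per transmitted scale factor plus the 2-bit selection code."""
    return 2 + 6 * len(sent)

for sf in ([10, 11, 10], [10, 11, 30], [10, 30, 31], [10, 30, 50]):
    sent = choose_scale_factors(sf)
    print(sf, sent, scale_factor_bits(sent))
```

Taking the largest of the values being merged keeps the shared scale factor from being smaller than any sample it has to cover. A steady signal costs 8 bits here instead of the 18 bits that three separate 6-bit scale factors would need.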
The Layer 2 frame packer uses the same header and CRC structure as Layer 1. The number of bits used to describe bit allocations varies with subband: 4 bits for the low subbands, 3 bits for the middle subbands, and 2 bits for the high subbands (this follows critical bandwidths). The scale factors (one, two or three depending on the data) are encoded along with a 2-bit code describing which combination of scale factors is being used. The subband samples are quantized according to bit allocation, and then combined into groups of three (called granules). Each granule is encoded with one code word. This allows Layer 2 to exploit much more of the redundancy in the signal than Layer 1.
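To see why coding three samples with one code word saves bits, consider a quantizer with only three levels. Each sample on its own needs 2 bits, or 6 bits for three samples, but there are only 27 possible triples, which fit in 5 bits. The sketch below packs a triple as a base-3 number. The function names are ours, and the exact packing order used by the standard is not claimed here.

```python
def pack_triple(x, y, z, levels=3):
    """Combine three quantized samples (each 0..levels-1) into one code word."""
    return x + levels * y + levels * levels * z

def unpack_triple(code, levels=3):
    """Recover the three samples from a code word."""
    x = code % levels
    y = (code // levels) % levels
    z = code // (levels * levels)
    return x, y, z

def bits_needed(n_values):
    """Number of bits required to distinguish n_values different codes."""
    return (n_values - 1).bit_length()

separate = 3 * bits_needed(3)   # 6 bits for three samples coded one by one
grouped = bits_needed(3 ** 3)   # 5 bits for one code word
print(separate, grouped)
print(unpack_triple(pack_triple(2, 0, 1)))
```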
Layer 2 processes the input signal in frames of 1152 PCM samples. At 48 kHz, each frame carries 24 ms of sound. Highest quality is achieved with a bit rate of 256k bps, but quality is often good down to 64k bps. Typical applications of Layer 2 include audio broadcasting, television, consumer and professional recording, and multimedia.
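The frame durations follow directly from the frame length and the sample rate. A short sketch, with names of our own choosing, tabulates them for the three input sample rates:

```python
FRAME_SAMPLES = {"Layer 1": 384, "Layer 2": 1152}
SAMPLE_RATES_HZ = (32000, 44100, 48000)

def frame_duration_ms(frame_samples, sample_rate_hz):
    """Duration of one frame in milliseconds."""
    return 1000.0 * frame_samples / sample_rate_hz

for layer, n in FRAME_SAMPLES.items():
    for rate in SAMPLE_RATES_HZ:
        print(f"{layer} at {rate} Hz: {frame_duration_ms(n, rate):.2f} ms")
```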
Audio files on the World Wide Web with the extension .mpeg2 or .mp2 are typically encoded with MPEG-1 Layer 2.
Layer 3 is intended for applications where a critical need for low bit rate justifies the expensive and sophisticated encoding system. It allows high quality results at bit rates as low as 64k bps. Typical applications are in telecommunication and professional audio, such as commercially published music and video.
MPEG-2 Audio is a recent extension to the standard, which allows up to five channels (for movies which have left, right, center, and two surround channels) plus a subwoofer channel.
"ISO-MPEG-1 Audio: a generic standard for coding of
high-quality digital audio"
For more information on audio file formats, see the
Usenet FAQ on audio file formats.
This tutorial is provided by
Otolith.
Karlheinz Brandenburg and Gerhard Stoll.
Journal of the Audio Engineering Society
42(10):780-792, October 1994.
This page maintained by
Wil Howitt
Last updated 18 October 95