Beyond Parametric Encoding: Content Analysis

Any parameterized audio coding format can be considered as a combination of: a sound-understanding algorithm that sets parameters according to some model by analyzing the audio; the transmission of these parameters; and a sound-synthesis algorithm that maps from the transmitted parameters to a new sound [Scheirer, 2001].

The understanding step is usually called encoding and the synthesis step decoding, but if the parameters are obtained directly from a sound, the process may also be termed analysis/synthesis. If we want to communicate the sound description over a channel, we need an encoder and a decoder, both with a priori knowledge of the model being used. It is impossible, however, to devise an encoding method that always gives the optimal coding for any input sound. In other words, no single model can fit all kinds of input signals. For this reason we may combine several specialized models and add some extra bits to the stream to indicate which model has been chosen.
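
The idea of combining specialized models and signalling the choice in the stream can be sketched as follows. This is only a minimal illustration: the two toy models (`encode_mean`, `encode_line`), the squared-error selection criterion and the integer model identifier are assumptions made for the example and are not taken from any actual coder.

```python
# Hypothetical toy "models": each maps a block of samples to a few
# parameters and back. They stand in for real specialized coders.

def encode_mean(block):
    """Model 0: describe the block by its mean value only."""
    return [sum(block) / len(block)]

def decode_mean(params, n):
    return [params[0]] * n

def encode_line(block):
    """Model 1: describe the block by its first and last sample."""
    return [block[0], block[-1]]

def decode_line(params, n):
    start, end = params
    if n == 1:
        return [start]
    step = (end - start) / (n - 1)
    return [start + i * step for i in range(n)]

# Encoder and decoder share this table a priori.
MODELS = [(encode_mean, decode_mean), (encode_line, decode_line)]

def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def encode(block):
    """Try every model and keep the one with the smallest reconstruction
    error. The returned model_id is the extra side information that
    tells the decoder which model was chosen."""
    best = None
    for model_id, (enc, dec) in enumerate(MODELS):
        params = enc(block)
        err = squared_error(block, dec(params, len(block)))
        if best is None or err < best[0]:
            best = (err, model_id, params)
    return best[1], best[2]

def decode(model_id, params, n):
    """Look up the decoder named by model_id and rebuild n samples."""
    return MODELS[model_id][1](params, n)
```

With these toy models, a constant block is assigned to model 0 and a linear ramp to model 1; in both cases the decoder only needs the identifier and the parameters to rebuild the block.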

Parametric encoding has been used extensively in speech transmission, where a general model of the input signal has existed for a long time. The most commonly used parametric models for speech are simple variations of the classical Linear Predictive Coding (LPC) scheme [Makhoul, 1975], which exploits knowledge of the vocal production mechanism.
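
As an illustration of the analysis step in linear prediction, where each sample is approximated by a linear combination of the preceding samples, the following sketch estimates predictor coefficients with the textbook autocorrelation method and the Levinson-Durbin recursion. It is a simplified example, not the procedure of any particular speech codec; the function names and the absence of windowing and quantization are assumptions made for brevity.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of the finite sequence x."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the normal equations for predictor coefficients a[1..order]
    such that x[n] is approximated by sum_k a[k] * x[n - k].
    Returns (coefficients, residual_energy)."""
    a = [0.0] * (order + 1)          # a[0] is unused
    err = r[0]
    if err == 0:                     # all-zero input: nothing to predict
        return a[1:], 0.0
    for i in range(1, order + 1):
        # Reflection coefficient for stage i.
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)         # prediction error shrinks each stage
    return a[1:], err

def lpc(x, order):
    """Estimate 'order' linear-prediction coefficients for the frame x."""
    return levinson_durbin(autocorrelation(x, order), order)
```

In a complete coder the coefficients (the parameters) would be quantized and transmitted together with a description of the excitation, and the decoder would use them to drive a synthesis filter.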

Parametric coding of non-speech signals is much more complex to implement, as few assumptions can be made about the source characteristics or its production mechanisms. MPEG-4 standardized several parametric encoding schemes; the one recommended for non-speech or musical signals is HILN, or Harmonic and Individual Lines plus Noise (see [Purnhagen and Meine, 2000]). This parametric scheme is in fact a variation of the SMS, or Spectral Modelling Synthesis, technique (see Annex B for further explanation).

The Object-Oriented Content Transmission metamodel represents a step beyond parametric encoding. The basic scheme is the same: the sound is analyzed, some parameters are extracted and transmitted, and the sound is then resynthesized at the receiver.

The main difference, however, lies in the kind of parametric model used in each case. In traditional parametric encoding, the model makes assumptions about the low-level features of the signal. For instance, the input signal may be assumed to be harmonic and therefore modeled as a set of time-varying sinusoids (see Annex B for a more in-depth explanation of one of these models).
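
A low-level harmonic model of this kind can be sketched on the synthesis side as follows. The sketch assumes a hypothetical frame-based parameter stream in which each frame carries a fundamental frequency and one amplitude per harmonic, held constant within the frame; real sinusoidal models interpolate the parameters between frames and are considerably more elaborate.

```python
import math

def synthesize(frames, frame_len, sample_rate):
    """Rebuild a signal from per-frame harmonic parameters.

    frames: list of (f0_hz, [amp_1, amp_2, ...]) tuples, where amp_h is
            the amplitude of the h-th harmonic (hypothetical layout).
    frame_len: number of output samples generated per frame.
    """
    n_partials = max((len(amps) for _, amps in frames), default=0)
    # One running phase per harmonic keeps each sinusoid continuous
    # across frame boundaries when f0 changes.
    phases = [0.0] * n_partials
    out = []
    for f0, amps in frames:
        for _ in range(frame_len):
            sample = 0.0
            for h, amp in enumerate(amps):
                sample += amp * math.sin(phases[h])
                phases[h] += 2.0 * math.pi * f0 * (h + 1) / sample_rate
            out.append(sample)
    return out
```

Note that the parameters here (frequencies and amplitudes) describe the waveform itself and say nothing about what the sound represents.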

In the Object-Oriented Content Transmission metamodel we make assumptions about the semantics of the input data and the way it is organized in the real world. We may, for instance, assume that the input signal consists of monophonic musical phrases and then apply a model in which a musical phrase is modeled as a sequence of notes.

The OOCTM encodes data as Sound Objects. This is, as already mentioned in the previous section, a particular case of parametric encoding. Parameters extracted from the signal become attributes of concrete objects, so every parameter has a particular meaning and contributes to the content description. The final encoded signal also has a clear structure, given by the resulting object-oriented model.
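
To make the notion of parameters as object attributes more concrete, the following sketch represents a monophonic phrase as a sequence of note objects. The class names and attributes (`Note`, `Phrase`, `pitch_midi`, and so on) are hypothetical and chosen only for illustration; they are not the actual OOCTM class hierarchy.

```python
from dataclasses import dataclass, field, asdict
from typing import List

@dataclass
class Note:
    """One sound object; every attribute has a musical meaning."""
    pitch_midi: int       # pitch as a MIDI note number (assumed unit)
    onset_s: float        # start time in seconds
    duration_s: float     # length in seconds
    loudness_db: float    # perceived level, illustrative attribute

@dataclass
class Phrase:
    """A monophonic musical phrase modeled as a sequence of notes."""
    instrument: str
    tempo_bpm: float
    notes: List[Note] = field(default_factory=list)

    def duration_s(self) -> float:
        """End time of the last-sounding note, or 0.0 for an empty phrase."""
        return max((n.onset_s + n.duration_s for n in self.notes),
                   default=0.0)

def to_description(phrase: Phrase) -> dict:
    """Flatten the object structure into a nested, readable description
    that could be serialized for transmission."""
    return asdict(phrase)
```

A description produced this way is a nested structure of named fields, so both its organization and the meaning of each value are explicit.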

As a consequence, the result of an OOCTM encoding is generally understandable, whereas the result of a parametric encoding usually is not.

Finally, in a parametric encoding scheme the synthesis capabilities of the receiver are largely limited to a simple signal model. In the OOCTM scheme the synthesizer can not only apply different models but also infer or abstract a sound from an incomplete description (see 5.2.2).

2004-10-18