Structured Audio and the Object-Oriented Content Transmission Metamodel

The preceding description should have made clear that Structured Audio is closely related to our Object-Oriented Content Transmission Metamodel. We will now highlight the main differences between the two approaches.

Structured Audio's focus is on structure: it is designed to exploit structural redundancy. The focus of our OOCTM is on content and its meaning: we aim at understanding the signal and representing it accordingly. Although in some particular applications the final results might not differ much, the difference in focus is clear: SA is a syntactic metamodel while the OOCTM is a semantic metamodel.

According to its authors, Structured Audio can surpass the Shannon and Weaver theoretical channel limit unless the encoded signal is completely random. Our metamodel can surpass the limit even if the signal is completely random. As a matter of fact, the OOCTM performs extremely well in such situations. If the data source is completely random, it has no meaning; it can therefore be transmitted as a simple ``play random signals'' sound object that will be rendered at the receiver. The result will obviously not resemble the original signal, but we insist that we are not interested in mathematical or perceptual accuracy, only in semantic accuracy. SA performs well on highly structured signals; the OOCTM performs well on both highly meaningful and meaningless signals, as long as a meaningful signal is not classified as meaningless.
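The idea can be sketched as follows. This is a minimal illustration, not part of any standard: the class name and parameters are ours. A second of noise is some 44,100 samples at the source, yet the transmitted description reduces to a handful of parameters, because only the meaning (``random signal'') is preserved.

```python
import random

class RandomSignalObject:
    """Hypothetical sound object meaning ``play random signals''.

    The transmitted description is just these few parameters; the
    receiver renders a signal that is semantically, not perceptually,
    equivalent to the original random source.
    """

    def __init__(self, duration_s, sample_rate=44100, seed=None):
        self.duration_s = duration_s
        self.sample_rate = sample_rate
        self.seed = seed  # optional; without it the render need not match

    def render(self):
        rng = random.Random(self.seed)
        n = int(self.duration_s * self.sample_rate)
        return [rng.uniform(-1.0, 1.0) for _ in range(n)]

# Transmit three parameters instead of 44,100 samples.
obj = RandomSignalObject(duration_s=1.0)
signal = obj.render()
```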

As already noted at the beginning of this chapter, the idea of synthesizability in SA is different from ours. While in SA a given representation is said to be synthesizable if the original sound can be obtained from it, in the OOCTM a representation is synthesizable simply if it can render a sound. Whether the result is meaningful or not relates to other measures of the description, such as its meaningfulness, but not to its synthesizability.

SA encodes data sources parametrically, based on signal models, but it does not impose any limitation or conceptual metamodel on these models. The OOCTM encodes data sources as Sound Objects and forces these object-oriented data models to comply with the DSPOOM metamodel. Object-oriented data models can be interpreted as a subset of parametric models, so the OOCTM is more restrictive in that sense. But it is only because of this restriction that we can ensure that all models instantiated from the metamodel will carry some semantic information.
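The distinction can be sketched as follows; the class names are illustrative stand-ins, not part of DSPOOM or CLAM. A purely parametric view sees only an anonymous parameter vector such as (0.5, 440.0); a sound object names those parameters through a common interface, and it is this restriction to named objects that carries the semantics.

```python
import math

class SoundObject:
    """Minimal stand-in for a metamodel-compliant sound object:
    every object exposes named, semantically meaningful parameters
    and a common rendering interface."""

    def __init__(self, **params):
        self.params = params

    def render(self, n, sample_rate=44100):
        raise NotImplementedError

class SineObject(SoundObject):
    """``A sine at a given frequency and amplitude'' -- the meaning is
    in the object type and its named parameters, not in the samples."""

    def render(self, n, sample_rate=44100):
        f = self.params["frequency_hz"]
        a = self.params["amplitude"]
        return [a * math.sin(2 * math.pi * f * i / sample_rate)
                for i in range(n)]

sine = SineObject(frequency_hz=440.0, amplitude=0.5)
block = sine.render(64)
```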

In SA the message always includes the transmission of a particular model to be used by the receiver. Although model transmission can also be provided and used in the OOCTM, it is not compulsory. In many situations the model may be known beforehand, as all components share the same metamodel and may be able to deduce it. In other situations, the degree of abstraction might be so high that no model is necessary at the receiver apart from ``real-world'' knowledge.

Structured Audio needs to standardize languages with precise semantics, such as SAOL or SASL. In the Object-Oriented Content Transmission Metamodel no languages need to be standardized. XML is used as a general-purpose content-description language, but any other similar general-purpose language could be used. Different particular instances of XML can be used (see MetriXML), but they are not part of the metamodel. In our metamodel even the language description can be transmitted dynamically in a schema.
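A sketch of the idea, using Python's standard XML library: the element and attribute names below are ours for illustration, not MetriXML's. The point is that any general-purpose markup can carry a sound-object description, so no dedicated language needs standardizing.

```python
import xml.etree.ElementTree as ET

def describe(object_name, params):
    """Serialize a sound-object description (name + named parameters)
    as a small XML document. The vocabulary here is illustrative only."""
    root = ET.Element("SoundObject", name=object_name)
    for key, value in params.items():
        p = ET.SubElement(root, "Param", name=key)
        p.text = str(value)
    return ET.tostring(root, encoding="unicode")

xml_desc = describe("Sine", {"frequency_hz": 440.0, "amplitude": 0.5})
```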

A key issue in the OOCTM is that the same language is consistently used throughout the metamodel and in any of its components. The same cannot be said of SA. SAOL and SASL are basically synthesis languages; they are not suitable for encoding the result of a general signal analysis, even less so if the analysis addresses the content level. This fact was explicitly recognized by the MPEG working group when it constructed a completely different standard, MPEG-7 (see section 1.4.2). MPEG-7's content description and MPEG-4's Structured Audio tools are not even compatible, and efforts to bring both worlds together, if ever started, are, in our opinion, not going to succeed.

Our OOCTM is a particular instance of the Digital Signal Processing Object-Oriented Metamodel. And DSPOOM does not stand on any specific language but is rather instantiated in a framework such as CLAM, implemented in a general-purpose programming language.

It is interesting to note that Eric Scheirer, the main creator of MPEG-4's Structured Audio, points out that a general-purpose computer programming language could be used instead of a language like SAOL, specifically designed for structured audio description (see [Scheirer, 2001]). According to him, the approach of having a specific language has a number of advantages, namely:

A software framework such as CLAM provides most of the aforementioned advantages. It accepts audio blocks and can run in real time; it can satisfy system-level requirements thanks to its operating-system abstraction; and it provides even more primitives than Structured Audio. The only point it does not satisfy is that it is not written with custom hardware in mind, but we see this as a disadvantage of SAOL rather than an advantage.

He also states that the process of decoding the bitstream header and reconfiguring the synthesizer is similar to parsing and compiling a computer language. According to [Scheirer, 1998b], implementing a SAOL system requires skills in software engineering and computer science similar to those needed for implementing a compiler. Why, then, waste all those resources when there are many general-purpose programming language compilers that can do the job?