Marsyas

Marsyas [Tzanetakis and Cook, 1999, Tzanetakis and Cook, 2002, Tzanetakis and Cook, 2000, Tzanetakis, 2002], or MusicAl Research SYStem for Analysis and Synthesis, is a framework for experimenting with, evaluating and integrating techniques for audio content analysis. Although the name includes the word Synthesis, Marsyas' focus is clearly on sound analysis tools and information retrieval techniques. The framework allows these tools to be integrated using a semi-automatic approach and a graphical interface. Marsyas is released under the GPL license and is therefore Free Software.

In order to come up with a valid model for Marsyas, different algorithms and techniques were studied and their common behavior and features were abstracted. Object-oriented programming techniques are used to implement abstract classes that provide a common API for the building blocks of the system, and inheritance is used to factor out common operations.

The environment is able to combine traditional bottom-up processing (from signal to metadata) with top-down processing (according to the author, prediction-driven processing, for instance, has proven to be interesting). Although the objects form a natural bottom-up hierarchy, a top-down flow of information can be expressed in the framework (e.g. a silence feature can be used by a music/speech iterator to avoid calculating features on silent frames).
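
The silence example can be sketched as follows. This is a minimal illustration in Python rather than Marsyas code (the Marsyas server is written in C++); the function names `rms`, `zero_crossings`, `costly_features` and `analyze`, as well as the threshold value, are assumptions introduced here and are not part of the framework.

```python
import math

SILENCE_THRESHOLD = 0.01  # assumed RMS level below which a frame counts as silent


def rms(frame):
    """Root-mean-square energy of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def zero_crossings(frame):
    """Number of sign changes between consecutive samples."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))


def costly_features(frame, energy):
    """Hypothetical stand-in for the more expensive feature computations."""
    return [energy, zero_crossings(frame)]


def analyze(frames, threshold=SILENCE_THRESHOLD):
    """Return one entry per frame: None for silent frames, features otherwise."""
    results = []
    for frame in frames:
        energy = rms(frame)
        if energy < threshold:
            # Top-down decision: the silence result suppresses further work.
            results.append(None)
            continue
        results.append(costly_features(frame, energy))
    return results
```

The cheap energy measure is computed for every frame, while the remaining features are computed only where the silence decision allows it.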

The framework design is based on a client-server architecture. The server is written in C++ and contains all the signal processing and pattern recognition algorithms, optimized for performance. The client is written in Java, contains only the graphical interface and communicates with the server using sockets. Both the server and the client run on Solaris, SGI, Linux and Windows.

The main classes of the system can roughly be divided into process-like and data-structure-like.

The process-like classes can be divided into the following categories:

Transformations are low-level signal processing units used by the system. They take as input a frame of sound samples and output a transformation of that frame (e.g. power spectral density, cepstrum, windowing).
Features process a frame of sound samples and output a vector which, unlike the output of a transformation, is significantly reduced in dimensionality. More than one "physical" feature can be combined in a single vector.
Memories are circular buffers that hold previously calculated features for a limited time. They are used to compute means and variances of features over large windows.
Iterators break up a sound stream into frames. For each frame they use memories and features to compute a feature vector. The time series of feature vectors is called the feature map. Typically there is a different iterator for each classification scheme (e.g. the silence/non-silence iterator uses only energy as a feature and no memories, whereas the music/speech iterator uses 9 features and 2 memories of different sizes).
Classifiers take as input a feature vector and output its estimated class. They are trained using labeled feature maps.
Segmentators take as input feature maps and output a signal with peaks corresponding to segmentation boundaries.
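
How features, memories and iterators cooperate to produce a feature map can be sketched as below. The sketch is in Python and is only an illustration under stated assumptions: the class names `Memory` and `FrameIterator`, the `energy` feature, and the choice of non-overlapping frames are hypothetical and do not reproduce the actual Marsyas C++ classes.

```python
from collections import deque


class Memory:
    """Circular buffer holding the most recent feature values."""

    def __init__(self, size):
        # A deque with maxlen drops the oldest value automatically.
        self.values = deque(maxlen=size)

    def push(self, value):
        self.values.append(value)

    def mean(self):
        return sum(self.values) / len(self.values)

    def variance(self):
        m = self.mean()
        return sum((v - m) ** 2 for v in self.values) / len(self.values)


def energy(frame):
    """Example feature: mean squared amplitude of a frame."""
    return sum(s * s for s in frame) / len(frame)


class FrameIterator:
    """Breaks a sample stream into frames and builds a feature map."""

    def __init__(self, frame_size, feature, memory):
        self.frame_size = frame_size
        self.feature = feature
        self.memory = memory

    def run(self, samples):
        feature_map = []
        # Assumption: non-overlapping frames; a trailing partial frame is dropped.
        last_start = len(samples) - self.frame_size
        for start in range(0, last_start + 1, self.frame_size):
            frame = samples[start:start + self.frame_size]
            value = self.feature(frame)
            self.memory.push(value)
            # Feature vector: current value plus statistics over the memory.
            feature_map.append(
                [value, self.memory.mean(), self.memory.variance()]
            )
        return feature_map
```

For instance, `FrameIterator(2, energy, Memory(2)).run(samples)` yields one three-element feature vector per frame, with the mean and variance taken over the two most recent frames.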

Data-structure-like classes can in turn be categorized into:

Vectors are the basic data components of the system. They are float arrays tagged with sizes. Operator overloading is used for vector operations to avoid writing many nested loops for signal processing code. The operators are inlined and optimized and the resulting code is easy to read without compromising performance.
Sound data contain samples of audio as vectors with header information such as sample rate or channels.
Feature maps are time-series of feature vectors. They can be class labeled for evaluation and training.
Time regions are time intervals tagged with annotation information.
Time lines are lists of time regions.
Time trees are arbitrary trees of time regions. They represent a hierarchical decomposition of audio into successively smaller segments.
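
A rough Python sketch of some of these data structures follows. It is meant only to make the descriptions concrete: the class and method names (`Vector`, `TimeRegion`, `TimeTree`, `leaves`) are assumptions chosen for this illustration, and the real Marsyas classes are implemented in C++.

```python
from dataclasses import dataclass, field


class Vector:
    """Float array with overloaded operators, so no explicit loops are needed."""

    def __init__(self, values):
        self.values = [float(v) for v in values]

    def __len__(self):
        return len(self.values)

    def __eq__(self, other):
        return isinstance(other, Vector) and self.values == other.values

    def __add__(self, other):
        if len(self) != len(other):
            raise ValueError("vector sizes differ")
        return Vector(a + b for a, b in zip(self.values, other.values))

    def __mul__(self, scalar):
        return Vector(a * scalar for a in self.values)


@dataclass
class TimeRegion:
    """Time interval tagged with annotation information."""
    start: float
    end: float
    label: str = ""


@dataclass
class TimeTree:
    """Tree of time regions; children subdivide the parent's region."""
    region: TimeRegion
    children: list = field(default_factory=list)

    def leaves(self):
        """Return the finest-grained regions, in order, as a time line."""
        if not self.children:
            return [self.region]
        result = []
        for child in self.children:
            result.extend(child.leaves())
        return result
```

In this sketch a time line is simply a Python list of `TimeRegion` objects, which is what `leaves()` returns for the lowest level of a time tree.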

All objects contain methods to read/write them to file and transport them using the socket interface.

Implemented features in the framework include spectral centroid, spectral moments, spectral flux, pitch, harmonicity, mel-frequency cepstral coefficients (MFCC), linear prediction (LPC) reflection coefficients, zero crossings, RMS, and spectral rolloff. For all of them means, variances and higher-order statistics can be computed using memories. New features can be easily added by just writing the code for computing the feature over a frame of samples.
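
To illustrate what "the code for computing the feature over a frame of samples" might look like, here is a hedged Python sketch of two of the listed features. It uses one common definition of each (magnitude-weighted centroid in units of frequency bins, and an 85% rolloff point); whether Marsyas uses exactly these definitions is not stated in the text above, and the function names are assumptions. A naive DFT is used to keep the example self-contained.

```python
import cmath


def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2, computed with a naive O(N^2) DFT."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)
        )
        spectrum.append(abs(coeff))
    return spectrum


def spectral_centroid(frame):
    """Magnitude-weighted mean bin index (the spectrum's centre of gravity)."""
    mags = magnitude_spectrum(frame)
    total = sum(mags)
    if total == 0:
        return 0.0  # convention chosen here for an all-zero frame
    return sum(k * m for k, m in enumerate(mags)) / total


def spectral_rolloff(frame, fraction=0.85):
    """Lowest bin at which the cumulative magnitude reaches `fraction` of the total."""
    mags = magnitude_spectrum(frame)
    target = fraction * sum(mags)
    running = 0.0
    for k, m in enumerate(mags):
        running += m
        if running >= target:
            return k
    return len(mags) - 1
```

Each function maps a frame to a single number, so either could be plugged into an iterator as a feature and have its mean and variance tracked by a memory.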

Two classifiers have been implemented: the Gaussian (MAP) classifier and the K-Nearest Neighbor (KNN).
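
Both classifier types can be sketched in a few lines of Python. This is an illustration under explicit assumptions, not the Marsyas implementation: the Gaussian classifier below assumes a diagonal covariance matrix and class priors estimated from label frequencies, the KNN classifier uses Euclidean distance with majority voting, and all names (`knn_classify`, `GaussianMAP`) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def knn_classify(train, query, k=3):
    """train is a list of (feature_vector, label); returns the majority label
    among the k training vectors closest to query (Euclidean distance)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


class GaussianMAP:
    """Gaussian classifier with a maximum a posteriori decision rule.
    Assumption: features are treated as independent (diagonal covariance)."""

    def fit(self, train):
        by_class = defaultdict(list)
        for vec, label in train:
            by_class[label].append(vec)
        self.params = {}
        for label, vecs in by_class.items():
            n = len(vecs)
            dims = range(len(vecs[0]))
            mean = [sum(v[d] for v in vecs) / n for d in dims]
            # Small floor keeps the variance strictly positive.
            var = [
                max(sum((v[d] - mean[d]) ** 2 for v in vecs) / n, 1e-6)
                for d in dims
            ]
            prior = n / len(train)
            self.params[label] = (mean, var, prior)
        return self

    def predict(self, vec):
        def log_posterior(label):
            mean, var, prior = self.params[label]
            log_likelihood = sum(
                -0.5 * math.log(2 * math.pi * s) - (x - m) ** 2 / (2 * s)
                for x, m, s in zip(vec, mean, var)
            )
            return log_likelihood + math.log(prior)

        # MAP rule: choose the class with the highest (log) posterior.
        return max(self.params, key=log_posterior)
```

Both take a labeled feature map as training data, mirroring the description of classifiers given earlier: a feature vector goes in and an estimated class comes out.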

Different applications, such as a music/speech discriminator, have been implemented in order to test the architecture.

The user interface looks like a typical tape-recorder-style wave editor, but in addition it allows skipping either by user-defined blocks of fixed duration or along time lines containing regions of different durations.

At the time of this writing, Marsyas is undergoing an overall rewrite towards a 0.2 version of the framework.

2004-10-18