Synthetic and SNHC Audio in MPEG-4

Eric D. Scheirer

Machine Listening Group, MIT Media Laboratory
E15-401D, Cambridge MA 02139-4307 USA
Tel: +1 617 253 0112 Fax: +1 617 258 6264
eds@media.mit.edu

Youngjik Lee and Jae-Woo Yang

Switching and Transmission Technology Laboratories, ETRI

Abstract

In addition to its sophisticated audio-compression capabilities, MPEG-4 contains extensive functions supporting synthetic sound and the synthetic/natural hybrid coding of sound. We present an overview of the Structured Audio format, which allows efficient transmission and client-side synthesis of music and sound effects. We also provide an overview of the Text-to-Speech Interface, which standardizes a single format for communication with speech synthesizers. Finally, we present an overview of the AudioBIFS portion of the Binary Format for Scene Description, which allows the description of hybrid soundtracks, 3-D audio environments, and interactive audio programming. The tools provided for advanced audio functionality in MPEG-4 are a new and important addition to the world of audio standards.


Introduction

This article describes the parts of MPEG-4 that govern the compression, representation, and transmission of synthetic sound and the combination of synthetic and natural sound into hybrid soundtracks. Through these tools, MPEG-4 provides advanced capabilities for ultra-low-bitrate sound transmission, interactive sound scenes, and flexible, repurposable delivery of sound content.

We will discuss three MPEG-4 audio tools. The first, MPEG-4 Structured Audio, standardizes precise, efficient delivery of synthetic music and sound effects. The second, MPEG-4 Text-to-Speech Interface, standardizes a representation protocol for synthesized speech, an interface to text-to-speech synthesizers, and the automatic synchronization of synthetic speech and "talking head" animated face graphics [24]. The third, MPEG-4 AudioBIFS--part of the main BIFS framework--standardizes terminal-side mixing and post-production of audio soundtracks [22]. AudioBIFS enables interactive soundtracks and 3-D sound presentation for virtual-reality applications. In MPEG-4, the capability to mix and synchronize real sound with synthetic is termed Synthetic/Natural Hybrid Coding of Audio, or SNHC Audio.

The organization of the present paper is as follows. First, we provide a general overview of the objectives for synthetic and SNHC audio in MPEG-4. This section also introduces concepts from speech and music synthesis to readers whose primary expertise may not be in the field of audio. Next, a detailed description of the synthetic-audio codecs in MPEG-4 is provided. Finally, we describe AudioBIFS and its use in the creation of SNHC audio soundtracks.


Synthetic Audio in MPEG-4: Concepts and Requirements

In this section, we introduce speech synthesis and music synthesis. Then we discuss the inclusion of these technologies in MPEG-4, focusing on the capabilities provided by synthetic audio and the types of applications that are better addressed with synthetic audio coding than with natural audio coding.

Relationship between natural and synthetic coding

Modern standards for natural audio coding [1, 2] use perceptual models to compress natural sound. In coding synthetic sound, perceptual models are not used; rather, very specific parametric models are used to transmit sound descriptions. The descriptions are received at the decoding terminal and converted into sound through real-time sound synthesis. The parametric model for the Text-to-Speech Interface is fixed in the standard; in the Structured Audio toolset, the model itself is transmitted as part of the bitstream and interpreted by a reconfigurable decoder.

Natural and synthetic audio are not unrelated methods for transmitting sound. Especially as sound models in perceptual coding grow more sophisticated, the boundary between "decompression" and "synthesis" becomes somewhat blurred. Vercoe, Gardner, and Scheirer [28] have discussed the relationships among parametric models of sound, digital sound creation and transmission, perceptual coding, parametric compression, and various techniques for algorithmic synthesis.

Concepts in speech synthesis

Text-to-speech (TTS) systems generate speech sound according to given text. This technology enables the translation of text information into speech so that the text can be transferred through speech channels such as telephone lines. Today, TTS systems are used for many applications, including automatic voice-response systems (the "telephone menu" systems that have become popular recently), e-mail reading, and information services for the visually handicapped [9, 10].

TTS systems typically consist of multiple processing modules as shown in Figure 1. Such a system accepts text as input and generates a corresponding phoneme sequence. Phonemes are the smallest units of human language; each phoneme corresponds to one sound used in speech. A surprisingly small set of phonemes, about 120, is sufficient to describe all human languages.

Figure 1: Block diagram of a text-to-speech system, showing the interaction between text-to-phoneme conversion, text understanding, and prosody generation and application

The phoneme sequence is used in turn to generate a basic speech sequence without prosody, that is, without pitch, duration, and amplitude variations. In parallel, a text-understanding module analyzes the input for phrase structure and inflections. Using the result of this processing, a prosody generation module creates the proper prosody for the text. Finally, a prosody control module changes the prosody parameters of the basic speech sequence according to the results of the text-understanding module, yielding synthesized speech.

One of the first successful TTS systems was the DecTalk English speech synthesizer developed in 1983 [11]. This system produces very intelligible speech and supports eight different speaking voices. However, developing speech synthesizers of this sort is a difficult process, since it is necessary to extract all the acoustic parameters for synthesis. It is a painstaking process to analyze enough data to accumulate the parameters that are used for all kinds of speech.

In 1992, CNET in France developed the pitch-synchronous overlap-and-add (PSOLA) method to control the pitch and phoneme duration of synthesized speech [25]. Using this technique, it is easy to control the prosody of synthesized speech. Thus synthesized speech using PSOLA sounds more natural; it can also use human speech as a guide to control the prosody of the synthesis, in an analysis-synthesis process that can also modify the tone and duration. However, if the tone is changed too much, the resulting speech is easily recognized as artificial.

In 1996, ATR in Japan developed the CHATR speech synthesizer [10]. This method relies on short samples of human speech without modifying any characteristics; it locates and sequences phonemes, words, or phrases from a database. A large database of human speech is necessary to develop a TTS system using this method. Automatic tools may be used to label each phoneme of the human speech to reduce the development time; typically, hidden Markov models (HMMs) are used to align the best phoneme candidates to the target speech. The synthesized speech is very intelligible and natural; however, this method of TTS requires large amounts of memory and processing power.

The applications of TTS are expanding in telecommunications, personal computing, and the Internet. Current research in TTS includes voice conversion (synthesizing the sound of a particular speaker’s voice), multi-language TTS, and enhancing the naturalness of speech through more sophisticated voice models and prosody generators.

Applications for speech synthesis in MPEG-4

The synthetic speech system in MPEG-4 was designed to support interactive applications using text as the basic content type. Some of these applications include on-demand storytelling, motion picture dubbing, and "talking head" synthetic videoconferencing.

In the story-telling on demand (STOD) application, the user can select a story from a huge database stored on fixed media. The STOD system reads the story aloud, using the MPEG-4 Text-to-Speech Interface (henceforth, TTSI) with the MPEG-4 facial animation tool or with appropriately selected images. The user can stop and resume the speech at any moment through the user interface of the local machine (for example, mouse or keyboard). The user can also select the gender, age, and speech rate of the electronic story-teller.

In a motion-picture-dubbing application, synchronization between the MPEG-4 TTSI decoder and the encoded moving picture is the essential feature. The architecture of the MPEG-4 TTS decoder provides several levels of synchronization granularity. By aligning the composition time of each sentence, coarse granularity of synchronization can be easily achieved. To get more finely-tuned synchronization, information about the speaker's lip shape can be used. The finest granularity of synchronization can be achieved by using detailed prosody transmission and video-related information such as sentence duration and offset time in the sentence. With this synchronization capability, the MPEG-4 TTSI can be used for motion picture dubbing by following the lip shape and the corresponding time in the sentence.

To enable synthetic video-teleconferencing, the TTSI decoder can be used to drive the facial-animation decoder in synchronization. Bookmarks in the TTSI bitstream control an animated face by using facial animation parameters (FAP); in addition, the animation of the mouth can be derived directly from the speech phonemes. Other applications of the MPEG-4 TTSI include speech synthesis for avatars in virtual reality (VR) applications, voice newspapers, and low-bitrate Internet voice tools.

Concepts in music synthesis

The field of music synthesis is too large and varied to give a complete overview here. An artistic history by Chadabe [4] and a technical overview by Roads [16] are sources that provide more background on the concepts developed over the last 35 years.

The techniques used in MPEG-4 for synthetic music transmission were originally developed by Mathews [13, 14], who demonstrated the first digital synthesis programs. The so-called unit-generator model of synthesis he developed has proven to be a robust and practical tool for musicians interested in the precise control of sound. This paradigm has been refined by many others, particularly Vercoe [26], whose language "Csound" is very popular today with composers.

In the unit-generator model (also called the Music-N model after Mathews’ languages Music-III, Music-IV, and Music-V and Vercoe’s languages Music-11 and Music-360), algorithms for sound synthesis are described as the interaction of a number of basic primitives, such as oscillators and envelope functions. Modern languages, such as Csound and the MPEG-4 language SAOL (described below) provide a rich set of built-in functions that musicians can use to create synthesizers. The notion of transmitting sound by sending algorithms in a synthesis-description language was suggested as early as 1991 by J. O. Smith [23], but apparently not attempted in practice until the "NetSound" experiment by Casey and Smaragdis [3], which used Csound to code and transmit sound.

Contemporaneously with the development of advanced software-synthesis technology, which occurred primarily in the academic world, the MIDI (Musical Instrument Digital Interface) protocol [15] was standardized by the music-synthesizer industry and became popular. MIDI is a protocol for communication between controllers (such as keyboards) and synthesizer modules; the protocol allows the specification of which note to play, but not which algorithms should be used for synthesis. The algorithms for synthesis are implementation-dependent in a MIDI synthesizer.

In recent years, inexpensive soundcards for personal computers have become available. These devices typically provide limited sound quality and synthesis features (each sound card usually provides only a single algorithm for synthesis) as well as direct audio output. It is somewhat ironic that although general-purpose software synthesis was the first method developed for sound creation on digital computers, by the time digital computers became popular, much simpler and less satisfying technology had emerged as a de facto standard.

Requirements and applications for audio synthesis in MPEG-4

The goal in the development of MPEG-4 Structured Audio--the toolset providing audio synthesis capability in MPEG-4--was to reclaim a general-purpose software synthesis model for use by a broad spectrum of musicians and sound designers. By incorporating this technology in an international standard, the development of compatible tools and implementations is encouraged, and such capabilities will become available as a part of the everyday multimedia sound hardware.

Including high-quality audio synthesis in MPEG-4 also serves a number of important goals within the standard itself. It allows the standard to provide capabilities that would not be possible through natural sound, or through simpler MIDI-driven parametric synthesis. We list some of these capabilities below.

The Structured Audio specification allows sound to be transmitted at very low bitrates. Many useful soundtracks and compositions can be coded in Structured Audio at bitrates from 0.1 to 1 kbps; as content developers become more practiced in low-bitrate coding with such tools, the bitrates can undoubtedly be pushed even lower. In contrast to perceptual coding, there is no necessary tradeoff in algorithmic coding between audio quality and bitrate. Low-bitrate compressed streams can still decode into full-bandwidth, full-quality stereo output. Using synthetic coding, the tradeoff is more accurately described as one of flexibility (in the quality and variety of sound models) versus bitrate [19].

Interactive accompaniment, dynamic scoring, synthetic performance [27], and other new-media music applications can be made more functional and sophisticated by using synthetic music rather than natural music. In any application requiring dynamic control over the music content itself, a structured representation of music is more appropriate than a perceptually-coded one.

Unlike existing music-synthesis standards such as the MIDI protocol, structured coding with downloaded synthesis algorithms allows accurate sound description and tight control over the sound produced. Allowing any method of synthesis to be used, not only those included in a low-cost MIDI device, provides composers with a broader range of options for sound creation.

There is an attractive unification in the MPEG-4 standard between the capabilities for synthesis and those used for effects processing. By carefully specifying the capabilities of the Structured Audio synthesis tool, the AudioBIFS tool for audio scene description (see Section 5) is much simplified and the standard as a whole is cleaner.

Finally, Structured Audio is an example of a new concept in coding technology--that of the flexible or downloadable decoder. This idea, considered but abandoned for MPEG-4 video coding, is a powerful one whose implications have yet to be fully explored. The Structured Audio toolset is computationally complete in that it is capable of simulating a Turing machine [7], and thus of executing any computable sound algorithm. It is possible to download new audio decoders--even new perceptual coders--into the MPEG-4 terminal as Structured Audio bitstreams; the requirements and applications for such a capability remain as a question for future research.

The Structured Audio tools can be used to describe algorithms of arbitrary complexity. To enable guaranteed decodability, a complexity-measurement tool for Structured Audio bitstreams is included in the Conformance part of the MPEG-4 standard. The capability required of a conforming decoder at a certain Level is described in reference to this tool. A content author can measure his bitstreams by simulating them with this complexity-analysis tool in order to guarantee that they are decodable at the desired Level of decoder performance.


The MPEG-4 Text-to-Speech Interface

Text--that is, a sequence of words written in some human language--is a widely-used representation for speech data in stand-alone applications. However, it is difficult with existing technology to use text as a speech representation in multimedia bitstreams for transmission. The MPEG-4 Text-to-Speech Interface (TTSI) is defined so that speech can be transmitted as a bitstream containing text. It also provides interoperability among text-to-speech (TTS) synthesizers by standardizing a single bitstream format for this purpose.

Synthetic speech is becoming a rather common media type; it plays an important role in various multimedia application areas. For instance, by using TTS functionality, multimedia content with narration can be easily created without recording natural speech. Before MPEG-4, however, there was no way for a multimedia content provider to easily give instructions to an unknown TTS system. In MPEG-4, a single common interface for TTS systems is standardized; this interface allows speech information to be transmitted in the International Phonetic Alphabet (IPA), or in a textual (written) form of any language.

The MPEG-4 TTSI tool is a hybrid/multi-level scalable TTS interface that can be considered a superset of the conventional TTS framework. This extended TTSI can utilize prosodic information taken from natural speech in addition to input text and can thus generate much higher-quality synthetic speech. The interface and its bitstream format are strongly scalable in terms of this added information; for example, if some parameters of prosodic information are not available, a decoder can generate the missing parameters by rule. Algorithms for speech synthesis and text-to-phoneme translation are not normative in MPEG-4, but to meet the goal that underlies the MPEG-4 TTSI, a decoder should have the capability to utilize all the information provided in the TTSI bitstream.

As well as an interface to text-to-speech synthesis systems, MPEG-4 specifies a joint coding method for phonemic information and facial animation (FA) parameters. Using this technique, a single bitstream may be used to control both the TTS interface and the facial animation visual object decoder. The functionality of this extended TTSI thus ranges from conventional TTS to natural speech coding, and its application areas range from simple TTS to audiovisual presentation with TTS and moving-picture dubbing with TTS.

The next section describes the functionality of the MPEG-4 TTSI and its decoding process.

MPEG-4 TTSI functionality

The MPEG-4 TTSI has important functionalities both as an individual codec and in synchronization with the facial animation techniques described by Tekalp et al. [24]. As a standalone codec, the bitstream format provides hooks to control the language being transmitted, the gender and age of the speaker, the speaking rate, and the prosody (pitch contour) of the speech. Speech can be paused at no cost in bandwidth by transmitting a "silence sentence" that contains only a silence duration. A "trick mode" allows operations such as start, stop, rewind, and fast forward to be applied to the synthesized speech.

The basic TTSI format is extremely low bitrate. In the most compact method, one can send a bitstream that contains only the text to be spoken and its length. In this case, the bitrate is 200 bits per second. The synthesizer will add predefined or rule-generated prosody to the synthesized speech (in a nonnormative fashion). The synthesized speech with predefined prosody will deliver emotional content to the listener.

On the other hand, one can send a bitstream that contains text as well as the detailed prosody of the original speech, that is, phoneme sequence, duration of each phoneme, base frequency (pitch) of each phoneme, and energy of each phoneme. The synthesized speech in this case will be very similar to the original speech since it employs the original prosody. Thus, one can send speech with subtle nuances without any loss of intonation using MPEG-4 TTSI.
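The scalability between these two extremes can be pictured with a small data-structure sketch in Python. It is only an illustration: the class `Phoneme`, the helper `fill_by_rule`, the units (milliseconds, hertz), and the placeholder default values are all assumptions, not part of the standard, and a real synthesizer would derive missing prosody from text analysis rather than from constants.

```python
from dataclasses import dataclass, replace
from typing import List, Optional

@dataclass
class Phoneme:
    """One phoneme with optional prosody (hypothetical field names and units)."""
    symbol: str                          # phoneme symbol, e.g. an IPA character
    duration_ms: Optional[float] = None  # None means "not transmitted"
    pitch_hz: Optional[float] = None
    energy: Optional[float] = None

# Placeholder "rule" values, assumed for illustration only.
DEFAULTS = {"duration_ms": 80.0, "pitch_hz": 120.0, "energy": 1.0}

def fill_by_rule(phonemes: List[Phoneme]) -> List[Phoneme]:
    """Keep transmitted prosody and fill in only the missing parameters."""
    filled = []
    for p in phonemes:
        missing = {name: value for name, value in DEFAULTS.items()
                   if getattr(p, name) is None}
        filled.append(replace(p, **missing))
    return filled

# First phoneme carries full prosody; the second carries none.
full = fill_by_rule([
    Phoneme("h", duration_ms=60.0, pitch_hz=110.0, energy=0.8),
    Phoneme("i"),
])
```

Transmitted values pass through untouched, so the same decoder path serves both the text-only and the full-prosody bitstreams.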

One of the important features of the MPEG-4 TTSI is the ability to synchronize synthetic speech with the lip movements of a computer-generated avatar or "talking head". In this technique, the TTS synthesizer generates phoneme sequences and their durations, and communicates them to the facial animation visual object decoder so that it can control the lip movement. With this feature, one can not only hear the synthetic speech but also see the synchronized lip movement of the avatar.

The MPEG-4 TTSI has the additional capability to send facial expression bookmarks through the text. The bookmark is identified by '<FAP', and lasts until the closing bracket '>'. In this case, the TTS synthesizer transfers the bookmark directly to the face decoder so that it can control the facial animation visual object accordingly. The facial animation parameter (FAP) of the bookmark is applied to the face until another bookmark resets the FAP. For content to play correctly even under trick-mode manipulation, the bookmarks in effect must be repeated at the beginning of each sentence. These bookmarks initialize the face to the state defined by the previous sentence. In such a case, some mismatch of synchronization can occur at the beginning of a sentence; however, the system recovers when the new bookmark is processed.
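As a rough illustration of the bookmark mechanism, the following Python sketch separates bookmarks from the text to be spoken. Only the '<FAP' opener and the closing '>' are taken from the description above; the function name `split_bookmarks`, the payloads "smile" and "neutral", and the example sentence are hypothetical, and the payload is kept as an opaque string because the normative bookmark syntax is not reproduced here.

```python
import re

# A bookmark starts with '<FAP' and runs to the next '>'.
# The payload between them is treated as an opaque string in this sketch.
BOOKMARK = re.compile(r"<FAP([^>]*)>")

def split_bookmarks(text):
    """Return (spoken_text, bookmarks).

    bookmarks is a list of (offset, payload) pairs, where offset is the
    character position in spoken_text at which the bookmark takes effect.
    """
    spoken = []
    bookmarks = []
    pos = 0        # read position in the input text
    out_len = 0    # length of the spoken text emitted so far
    for m in BOOKMARK.finditer(text):
        chunk = text[pos:m.start()]
        spoken.append(chunk)
        out_len += len(chunk)
        bookmarks.append((out_len, m.group(1).strip()))
        pos = m.end()
    spoken.append(text[pos:])
    return "".join(spoken), bookmarks

# Hypothetical input: the payloads are illustrative, not real FAP fields.
spoken, marks = split_bookmarks("<FAP smile>Hello there. <FAP neutral>Goodbye.")
```

In this sketch the spoken text would go to the synthesizer and the (offset, payload) pairs to the face decoder, mirroring the split described above.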

Through the MPEG-4 elementary stream synchronization capabilities [6], the MPEG-4 TTSI can perform synthetic motion picture dubbing. The MPEG-4 TTSI decoder can use the system clock to select an adequate speech location in a sentence, and communicates this to the TTS synthesizer, which assigns appropriate duration for each phoneme. Using this method, synthetic speech can be synchronized with the lip shape of the moving image.
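One simple way to picture the duration assignment is proportional scaling, sketched below in Python. This is an assumption made for illustration: the standard leaves the assignment to the synthesizer, and the function name `fit_durations` and the millisecond unit are hypothetical.

```python
def fit_durations(durations_ms, target_ms):
    """Scale rule-generated phoneme durations so they sum to target_ms.

    Proportional scaling is an illustrative choice, not the normative method.
    """
    total = sum(durations_ms)
    if total <= 0:
        raise ValueError("phoneme durations must sum to a positive value")
    scale = target_ms / total
    return [d * scale for d in durations_ms]

# Three phonemes totalling 400 ms, stretched to fill an 800 ms slot in the video.
fitted = fit_durations([100, 200, 100], 800)
```

A practical synthesizer would probably stretch vowels more than consonants; uniform scaling is used here only to keep the sketch short.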

MPEG-4 TTSI decoding process

Figure 2 shows a schematic of the MPEG-4 TTSI decoder. The architecture of the decoder can be described as a collection of interfaces. The normative behavior of the MPEG-4 TTSI is described in terms of these interfaces, not the sound and/or animated faces that are produced.

Figure 2: Overview of the MPEG-4 TTSI decoding process, showing the interaction between the syntax parser, the TTS synthesizer, and the face animation decoder. The shaded blocks are not normatively described, and operate in a terminal-dependent manner.

In particular, the TTSI standard specifies:

    1. the interface between the demux and the syntactic decoder. Upon receiving a multiplexed MPEG-4 bitstream, the demux passes coded MPEG-4 TTSI elementary streams to the syntactic decoder. Other elementary streams are passed to other decoders.
    2. the interface between the syntactic decoder and the TTS synthesizer. Receiving a coded MPEG-4 TTSI bitstream, the syntactic decoder passes a number of different pieces of data to the TTS synthesizer. The input type specifies whether TTS is being used as a standalone function, or in synchronization with facial animation or motion-picture dubbing. The control commands sequence specifies the language, gender, age, and speech rate of the speaking voice. The input text specifies the character string for the text to be synthesized. Auxiliary information such as IPA phoneme symbols (which allow text in a language foreign to the decoder to be synthesized), lip shape patterns, and trick-mode commands are also passed along this interface.
    3. the interface from the TTS synthesizer to the compositor. Using the parameters described in the previous paragraph, the synthesizer constructs a speech sound and delivers it to the audio composition system (described in Section 5).
    4. the interface from the compositor to the TTS synthesizer. This interface allows the local control of the synthesized speech by users. Using this interface and an appropriate interactive scene, the user can start, stop, rewind, and fast forward the TTS system. Controls can also allow the user to change the speech rate, pitch range, gender, and age of the synthesized speech.
    5. the interface between the TTS synthesizer and the phoneme/bookmark-to-FAP converter. In the MPEG-4 framework, the TTS synthesizer and the face animation can be driven synchronously, by the same input control stream, which is the text input to the MPEG-4 TTSI. From this input stream, the TTS synthesizer generates synthetic speech, and at the same time, phoneme symbols, phoneme durations, word boundaries, stress parameters, and bookmarks. The phonemic information is passed to the phoneme/bookmark-to-FAP converter, which generates relevant facial animation accordingly. Through this mechanism, the synthesized speech and facial animation are synchronized when they enter the scene composition framework.

MPEG-4 Structured Audio

The tool that provides audio synthesis capability in MPEG-4 is termed the Structured Audio coder. This name originates in the Vercoe et al. [28] comparison of different methods of parameterized sound generation--it refers to the fact that this tool provides general access to any method of structuring sound. While the music synthesis technology is typically conceived mainly as a tool for composers and musicians, MPEG-4 Structured Audio is, finally, a codec like the other audio tools in MPEG-4. That is, the standard specifies a bitstream format and a method of decoding it into sound. While the techniques used in decoding the bitstream are those taken from the practice of general-purpose digital synthesis, and the bitstream format is somewhat unusual, the overall paradigm is identical to that of the natural audio codecs in MPEG-4.

This section will describe the organization of the Structured Audio standard, focusing first on the bitstream format and then on the decoding process. There is a second, simpler, tool for using parameterized wavetable synthesis with downloaded sounds; we will discuss this tool at the end of the section, and then conclude with a short discussion on encoding Structured Audio bitstreams.

Structured Audio bitstream format

The Structured Audio bitstream format makes use of the new coding paradigm known as algorithmic structured audio, described by Vercoe et al. [28] and Scheirer [18]. In this framework, a sound transmission is decomposed into two pieces: a set of synthesis algorithms that describe how to create sound, and a sequence of synthesis controls that specify which sounds to create. The synthesis model is not fixed in the MPEG-4 terminal; rather, the standard specifies a framework for reconfigurable software synthesis. Any current or future method of digital sound synthesis can be used in this framework.

Like the other MPEG-4 media types, a Structured Audio bitstream consists of a decoder configuration header that tells the decoder how to begin the decoding process, and then a stream of bitstream access units that contain the compressed data. In Structured Audio, the decoder configuration header contains the synthesis algorithms and auxiliary data, and the bitstream access units contain the synthesis control instructions.

Decoder configuration header and SAOL

The decoder configuration header specifies the synthesis algorithms using a new unit-generator language called SAOL (pronounced "sail"), which stands for Structured Audio Orchestra Language. The syntax and semantics of SAOL are specified precisely in the standard--MPEG-4 contains the formal specification of SAOL as a language. The similarities and differences between SAOL and other popular music languages have been discussed elsewhere [21]. While space does not provide for a full tutorial on SAOL in the present article, we give a short example so that the reader may understand the flavor of the language.

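global {
  srate 32000;    // audio sampling rate, in Hz
  krate 1000;     // control rate, in Hz
}

instr beep(pitch, amp) {
  asig out;       // audio-rate signal
  ksig env;       // control-rate signal
  table sound(harm,2048,1,0.5,0,0.2);

  env = kline(0,0.1,amp,dur-0.1,0);
  out = oscil(sound,pitch) * env * amp;
  output(out);
}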

Figure 3: A SAOL orchestra, containing one instrument that makes a ramped complex tone. See text for in-depth discussion of the orchestra code.

Figure 3 shows the textual representation of a complete SAOL synthesizer or orchestra. This synthesizer defines one instrument (called "beep") for use in a Structured Audio session. Each bitstream begins with a SAOL orchestra that provides the instruments needed in that session.

The synthesizer description as shown in Figure 3 begins with a global header that specifies the sampling rate (in this case, 32 kHz) and control rate (in this case, 1 kHz) for this orchestra. SAOL is a two-rate signal language--every variable represents either an audio signal that varies at the sampling rate or a control signal that varies at the control rate. The sampling rate of the orchestra limits the maximum audio frequencies that may be present in the sound, and the control rate limits the speed with which parameters may vary. Higher values for these parameters lead to better sound quality but require more computation. This tradeoff between quality and complexity is left to the decision of the content author (who should respect the Level definitions discussed in Section 2) and can differ from bitstream to bitstream.
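The two-rate structure can be sketched as a nested loop in Python. This is a minimal illustration, not how a conforming decoder must be organized: the function `render` and its callback arguments are hypothetical, and only the two rate values come from Figure 3.

```python
SRATE = 32000   # audio sampling rate from the global block of Figure 3
KRATE = 1000    # control rate from the global block of Figure 3

def render(seconds, control_fn, audio_fn):
    """Run a two-rate loop.

    control_fn(t) is evaluated once per control period;
    audio_fn(t, c) is evaluated once per audio sample, where c is the
    control value held constant for the whole period.
    """
    block = SRATE // KRATE                 # audio samples per control period (32 here)
    n_periods = int(round(seconds * KRATE))
    out = []
    for k in range(n_periods):
        c = control_fn(k / KRATE)          # control-rate update
        for i in range(block):
            t = (k * block + i) / SRATE    # time of this audio sample
            out.append(audio_fn(t, c))     # audio-rate update
    return out

# 10 ms of output whose value is simply the current control value.
out = render(0.01, lambda t: t, lambda t, c: c)
```

With these rates each control value is held for 32 audio samples, which is why raising the control rate lets parameters vary faster at the cost of more control-rate computation.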

After the global header comes the specification for the instrument beep. This instrument depends on two parameter fields (p-fields) named pitch and amp. The number, names, and semantics of p-fields for each instrument are not fixed in the standard but decided by the content author. The values for the p-fields are set in the score, which is described in Section 4.1.2. The instrument defines two signal variables: out, which is an audio signal, and env, which is a control signal. It also defines a stored-function table called sound.

Stored-function tables, also called wavetables, are crucial to general-purpose software synthesis. Many synthesis algorithms can be realized as the interaction of a number of oscillators creating appropriate signals; wavetables are used to store the periodic functions needed for this purpose. A stored-function table in SAOL is created by using one of several wavetable generators (in this case, harm) that allocate space and fill the table with data values. The harm wavetable generator creates one cycle of a periodic function by summing a set of zero-phase harmonically related sinusoids; the function placed in the table called sound consists of the sum of three sine waves at frequencies 1, 2, and 4 with amplitudes 1, 0.5, and 0.2 respectively. This function is sampled at 2048 points per cycle to create the wavetable.
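A minimal Python sketch of what such a generator computes is given below. It follows the description above (one cycle, zero-phase sines, one amplitude per harmonic) but is an assumption-laden approximation: the Python function `harm` merely borrows the generator's name, and any normalization or other details of the normative generator are not modelled.

```python
import math

def harm(size, *amps):
    """One cycle of a sum of zero-phase sinusoids.

    amps[k] is the amplitude of harmonic k + 1, so a zero skips that harmonic.
    """
    table = []
    for n in range(size):
        phase = 2.0 * math.pi * n / size
        table.append(sum(a * math.sin((k + 1) * phase)
                         for k, a in enumerate(amps)))
    return table

# Same arguments as the table in Figure 3: harmonics 1, 2, and 4 are present,
# and the zero in third position leaves out harmonic 3.
sound = harm(2048, 1, 0.5, 0, 0.2)
```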

To create sound, the beep instrument uses an interaction of two unit generators, kline and oscil. There is a set of about 100 unit generators specified in the standard, and content authors can also design and deliver their own. The kline unit generator generates a control-rate envelope signal; in the example instrument it is assigned to the control-rate signal variable env. The kline unit generator interpolates a straight-line function between several (time, val) control points; in this example, a line-segment function is specified which goes from 0 to the value of the amp parameter in 0.1 seconds, and then back down to 0 in dur-0.1 seconds. dur is a standard name in SAOL that always contains the duration of the note as specified in the score.
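The envelope shape can be checked with a short Python sketch of a piecewise-linear generator. The argument layout (value, duration, value, duration, value) follows the call in Figure 3; holding the final value after the last segment is an assumption of this sketch, and the Python function `kline` only borrows the unit generator's name.

```python
def kline(t, *args):
    """Value of a piecewise-linear envelope at time t (seconds).

    args alternate value, segment duration, value, ..., value.
    After the last segment the final value is held (an assumption here).
    """
    values = args[0::2]
    durs = args[1::2]
    if t <= 0:
        return values[0]
    start = 0.0
    for i, d in enumerate(durs):
        if t < start + d:
            frac = (t - start) / d          # position within this segment
            return values[i] + frac * (values[i + 1] - values[i])
        start += d
    return values[-1]

# The envelope of Figure 3 for amp = 0.5 and dur = 1.0,
# evaluated once per control period at a 1 kHz control rate.
amp, dur = 0.5, 1.0
env = [kline(k / 1000, 0, 0.1, amp, dur - 0.1, 0) for k in range(1000)]
```

The values rise linearly to amp over the first 100 control periods and then fall back toward zero over the remaining 900.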

In the next line of the instrument, the oscil unit generator converts the wavetable sound into a periodic audio signal, by oscillating over this table at a rate of pitch cycles per second. Not every point in the table is used (unless the frequency is very low); rather, the oscil unit generator knows how to select and interpolate samples from the table in order to create one full cycle every 1/pitch seconds. The sound that results is multiplied by the control-rate signal env and the overall sound amplitude amp. The result is assigned to the audio signal variable out. The last line of the instrument contains the output statement, which specifies that the sound output of the instrument is contained in the signal variable out.
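Table-lookup oscillation of this kind can be sketched in Python as a phase accumulator with interpolation. Linear interpolation is an assumption chosen for brevity, the Python function `oscil` only borrows the unit generator's name, and a plain sine table stands in for the wavetable of Figure 3.

```python
import math

def oscil(table, freq, srate, n_samples):
    """Read a one-cycle table at freq cycles per second.

    The phase is kept in cycles, in [0, 1); samples that fall between
    table points are linearly interpolated (an illustrative choice).
    """
    size = len(table)
    out = []
    phase = 0.0
    for _ in range(n_samples):
        pos = phase * size
        i = int(pos)
        frac = pos - i
        a = table[i % size]
        b = table[(i + 1) % size]           # wraps around at the end of the cycle
        out.append(a + frac * (b - a))
        phase = (phase + freq / srate) % 1.0
    return out

# A 2048-point sine table read at 1000 Hz with a 32 kHz sampling rate:
# the phase advances 64 table points per sample, one cycle per 32 samples.
sine = [math.sin(2.0 * math.pi * n / 2048) for n in range(2048)]
tone = oscil(sine, 1000, 32000, 64)
```

Only every 64th table point is touched in this example, which illustrates why not every point in the table is used unless the frequency is very low.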

When the SAOL orchestra is transmitted in the bitstream header, the plain-text format is not used. Rather, an efficient tokenized format is standardized for this purpose. The Structured Audio specification contains a description of this tokenization procedure. The decoder configuration header may also contain auxiliary data to be used in synthesis. For example, a type of synthesis popular today is "wavetable" or "sampling" synthesis, in which short clips of sound are pitch-shifted and added together to create sound. The sound samples for use in this process are not included directly in the orchestra (although this is allowed if the samples are short), but placed in a different segment of the bitstream header.

Score data, which normally resides in the bitstream access units as described below, may also be included in the header. By including in the header score instructions that are known when the session starts, the synthesis process may be able to allocate resources more efficiently. Also, real-time tempo control over the music is only possible when the notes to be played are known beforehand. For applications in which it is useful to reconstruct a human-readable orchestra from the bitstream, a symbol table may also be included in the bitstream header. This element is not required and has no effect on the decoding process, but allows the compressed bitstream representation to be converted back into a human-readable form.

Bitstream Access Units and SASL

The streaming access units of the Structured Audio bitstream contain instructions that specify how the instruments that were described in the header should be used to create sound. These instructions are specified in another new language called SASL, for "Structured Audio Score Language." An example set of such instructions, or score, is given in Figure 4.

0.0	beep 1.0 440 0.5
1.0	beep 2.0 220 0.2
2.0	beep 1.0 264 0.5
3.0	beep 1.0 440 0.5
4.0	end

Figure 4: A SASL score, which uses the orchestra in Figure 3 to play four notes. In an MPEG-4 Structured Audio bitstream, each score line is compressed and transmitted as an Access Unit.

Each line in this score corresponds to one note of synthesis. That is, for each line in the score, a different note is played using one of the synthesizers defined in the orchestra header. Each line contains, in order: a timestamp indicating the time at which the note should be triggered, the name of the instrument that should perform the synthesis, the duration of the note, and the parameters required for synthesis. The semantics of the parameters are not fixed in the standard, but depend on the definition of the instrument. In this case, the first parameter corresponds to the cps field in the instrument definition in Figure 3, and the second parameter in each line to the amp field. Thus, the score in Figure 4 includes four notes that correspond to the musical notation shown in Figure 5.
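The field order just described can be made concrete with a short Python sketch that parses the plain-text form of a score line. The names NoteEvent and parse_score_line are hypothetical, and only note lines and the end line are handled; in an actual bitstream the score lines are transmitted in compressed binary form rather than as text.

```python
from typing import List, NamedTuple, Optional

class NoteEvent(NamedTuple):
    time: float          # when the note is triggered
    instrument: str      # which instrument in the orchestra performs it
    duration: float      # note duration (available as dur in the instrument)
    params: List[float]  # instrument-defined parameters, e.g. cps and amp

def parse_score_line(line: str) -> Optional[NoteEvent]:
    """Split 'time instrument duration p1 p2 ...'; return None for 'end'."""
    fields = line.split()
    if fields[1] == "end":
        return None
    return NoteEvent(float(fields[0]), fields[1], float(fields[2]),
                     [float(p) for p in fields[3:]])

note = parse_score_line("1.0 beep 2.0 220 0.2")
# For the beep instrument, note.params[0] is cps and note.params[1] is amp.
```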

Figure 5: The musical notation corresponding to the SASL score in Figure 4.

In the streaming bitstream, each line of the score is packaged as an access unit. The multiplexing of the access units with those in other streams, and the actual insertion of the access units into a bitstream for transport, is performed according to the MPEG-4 multiplex specification [6].

There are many other sophisticated instructions in the orchestra and score formats; space does not permit a full review, but more details can be found in the standard and in other references on this topic [21]. In the SAOL orchestra language, there are built-in functions corresponding to many useful types of synthesis; in the SASL score language, tables of data can be included for use in synthesis, and the synthesis process can be continuously manipulated with customizable parametric controllers. In addition, timestamps can be removed from the score lines, allowing a real-time mode of operation such as the transmission of live performances.

Decoding Process

The decoding process for Structured Audio bitstreams is somewhat different from the decoding process for natural audio bitstreams. The streaming data does not typically consist of "frames" of data that are decompressed to give buffers of audio samples; rather, it consists of parameters that are fed into a synthesizer. The synthesizer creates the audio buffers according to the specification given in the header. A schematic of the Structured Audio decoding process is given in Figure 6.

Figure 6: Overview of the MPEG-4 Structured Audio decoding process. See text for details.

The first step in decoding the bitstream is processing and understanding the SAOL instructions in the header. This stage of the bitstream processing is similar to compiling or interpreting a high-level language. The MPEG-4 standard specifies the semantics of SAOL--the sound that a given instrument declaration is supposed to produce--exactly, but does not specify the exact manner of implementation. Software, hardware, or dual software/hardware solutions are all possible for Structured Audio implementation; however, programmability is required, and thus fixed-hardware (ASIC) implementations are difficult to realize. The SAOL pre-processing stage results in a set of instrument definitions that are used to configure the reconfigurable synthesis engine. The capabilities and proper functioning of this engine are described fully in the standard.

After the header is received and processed, synthesis from the streaming access units begins. Each access unit contains a score line that directs some aspect of the synthesis process. As each score line is received by the terminal, it is parsed and registered with the Structured Audio scheduler as an event. A time-sequenced list of events is maintained, and the scheduler triggers each at the appropriate time.
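One simple way to maintain such a time-sequenced event list is a priority queue. The following Python sketch is only an illustration of the idea; the class and method names are hypothetical and the standard does not prescribe this data structure.

```python
import heapq

class Scheduler:
    """Keeps registered events ordered by timestamp."""

    def __init__(self):
        self._queue = []
        self._count = 0  # tie-breaker: equal timestamps fire in arrival order

    def register(self, time, event):
        heapq.heappush(self._queue, (time, self._count, event))
        self._count += 1

    def due(self, now):
        """Pop and return every event whose timestamp is <= now."""
        fired = []
        while self._queue and self._queue[0][0] <= now:
            fired.append(heapq.heappop(self._queue)[2])
        return fired

sched = Scheduler()
sched.register(1.0, "second note")
sched.register(0.0, "first note")
sched.register(2.0, "third note")
ready = sched.due(1.5)  # events at 0.0 and 1.0 are triggered, in time order
```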

When an event is triggered to turn on a note, a note object or instrument instantiation is created. A pool of active notes is always maintained; this pool contains all of the notes that are currently active (or "on"). As the decoder executes, it examines each instrument instantiation in the pool in turn, performing the next small amount of synthesis that the SAOL code describing that instrument specifies. This processing generates one frame of data (the length of the frame depends on the control rate specified by the content author) for each active note event. The frames are summed together for all notes to produce the overall decoder output.
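The per-frame loop over the pool of active notes can be sketched in Python as below. ConstantNote is a hypothetical stand-in for an instrument instantiation, and the convention that a finished note returns None is an assumption of this sketch, not something taken from the standard.

```python
class ConstantNote:
    """Hypothetical stand-in for an instrument instantiation."""

    def __init__(self, level, frames):
        self.level, self.frames_left = level, frames

    def next_frame(self, frame_len):
        if self.frames_left == 0:
            return None  # the note has ended
        self.frames_left -= 1
        return [self.level] * frame_len

def render_frame(active_notes, frame_len):
    """Sum one frame from every active note; drop notes that have finished."""
    mix = [0.0] * frame_len
    for note in list(active_notes):  # iterate over a copy while removing
        frame = note.next_frame(frame_len)
        if frame is None:
            active_notes.remove(note)
            continue
        for i, sample in enumerate(frame):
            mix[i] += sample
    return mix

pool = [ConstantNote(0.25, frames=1), ConstantNote(0.5, frames=2)]
first = render_frame(pool, frame_len=4)   # both notes sound
second = render_frame(pool, frame_len=4)  # only the longer note remains
```

In this sketch the frame length is passed in directly; in a decoder it would follow from the sampling rate and the control rate chosen by the content author.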

Since SAOL is a very powerful format for the description of synthesis, it is not possible to generally characterize the specific algorithms which are executed in each note event. The content author has complete control over the methods used for creating sound and the resulting sound quality. Although the specification is flexible, it is still strictly normative (specified in the standard); this guarantees that a bitstream produces the same sound when played on any conforming decoder.

Wavetable synthesis in MPEG-4

A simpler format for music synthesis is also provided in MPEG-4 Structured Audio for applications that require low-complexity operation and do not require sophisticated or interactive music content--karaoke systems are the primary example. A format for representing banks of wavetables, the Structured Audio Sample Bank Format or SASBF, was created in collaboration with the MIDI Manufacturers Association for this purpose.

Using SASBF, wavetable synthesizers can be downloaded to the terminal and controlled with MIDI sequences. This type of synthesis processing is readily available today; thus, a terminal using this format may be manufactured very cheaply. Such a terminal still allows synthetic music to be synchronized and mixed with recorded vocals or other natural sounds. Scheirer and Ray [19] have presented a comparison of algorithmic synthesis and wavetable synthesis capabilities in MPEG-4, describing the relative advantages of each as well as situations in which they are profitably used together.

Encoding Structured Audio bitstreams

As with all MPEG standards, only the bitstream format and decoding process are standardized for the Structured Audio tools. The method of encoding a legal bitstream is outside the scope of the standard. However, the natural audio coders in MPEG-4 [2], like those in previous MPEG Audio standards, at least have well-known starting points for automatic encoding. Many tools have been constructed that allow an existing recording (or live performance) to be automatically turned into legal bitstreams for a given perceptual coder.

This is not yet possible for Structured Audio bitstreams; the techniques required to do this fully automatically are still in a basic research stage, where they are known as polyphonic transcription [12] or computational auditory scene analysis [17]. Thus, for the foreseeable future, human intervention is required to produce Structured Audio bitstreams. Since the tools required for this are very similar to other tools used in a professional music studio today--such as sequencers and multitrack recording equipment--this is not an impediment to the utility of the standard.


MPEG-4 Audio/Systems interface and AudioBIFS

This section describes the relation between the MPEG-4 audio decoders and the MPEG-4 Systems functions of elementary stream management and composition. By including sophisticated capabilities for mixing and post-producing multiple audio sources, MPEG-4 enables a great number of advanced applications such as virtual-reality sound, interactive music experiences, and adaptive soundtracks.

Companion papers in this special issue provide detailed introductions to elementary stream management in MPEG-4 [6] and to the MPEG-4 Binary Format for Scenes (BIFS) [22]. The part of BIFS controlling the composition of a sound scene is called AudioBIFS. AudioBIFS provides a unified framework for sound scenes that use streaming audio, interactive presentation, 3-D spatialization, and dynamic download of custom signal-processing effects. Scheirer, Väänänen, and Huopaniemi [20] have presented a more in-depth discussion of the AudioBIFS tools.

AudioBIFS requirements

Many of the main BIFS concepts originate from the Virtual Reality Modeling Language (VRML) standard [8], but the audio toolset is built from a different philosophy. AudioBIFS contains significant advances in quality and flexibility over VRML audio. There are two main modes of operation that AudioBIFS is intended to support. We term them virtual-reality and abstract-effects compositing.

In virtual-reality compositing, the goal is to recreate a particular acoustic environment as accurately as possible. Sound should be presented spatially according to its location relative to the listener in a realistic manner; moving sounds should have a Doppler shift; distant sounds should be attenuated and low-pass filtered to simulate the absorptive properties of air; and sound sources should radiate sound unevenly, with a specific frequency-dependent directivity pattern. This type of scene composition is most suitable for "virtual world" applications and video-games, where the application goal is to immerse the user in a synthetic environment. The VRML sound model embraces this philosophy, with fairly lenient requirements on how various sound properties must be realized in an implementation.

In abstract-effects compositing, the goal is to provide content authors with a rich suite of tools from which artistic considerations can be used to choose the right effect for a given situation. As Scheirer [18] discusses in depth, the goal of sound designers for traditional media such as films, radio, and television is not to recreate a virtual acoustic environment (although this would be well within the capability of today’s film studios), but to apply a body of knowledge regarding "what a film should sound like." Spatial effects are sometimes used, but often in a non-physically-realistic way; the same is true for the filters, reverberations, and other sound-processing techniques used to create various artistic effects that are more compelling than strict realism would be. This model of content production demands stricter normativity in playback than does the virtual-reality model.

MPEG realized in the early development of the MPEG-4 sound compositing toolset that if the tools were to be useful to the traditional content community--always the primary audience of MPEG technology--then the abstract-effects composition model would need to be embraced in the final MPEG-4 standard. However, game developers and virtual-world designers demand high-quality sonification tools as well, so the VRML model should also be available.

MPEG-4 AudioBIFS therefore integrates these two components into a single standard. Sound in MPEG-4 may be post-processed with arbitrary downloaded filters, reverberators, and other digital-audio effects; it may also be spatially positioned according to the simulated parameters of a virtual world. These two types of post-production may be freely combined in MPEG-4 audio scenes.

The MPEG-4 Audio System

A schematic diagram of the overall audio system in MPEG-4 is shown in Figure 7 as a reference for the discussion to follow.

Figure 7: The MPEG-4 Audio system, showing the demux, decode, AudioBIFS, and BIFS layers. This schematic shows the interaction between the frames of audio data in the bitstream, the decoders, and the scene composition process. See text for details.

Sound is conveyed in the MPEG-4 bitstream as several elementary streams which contain coded audio in the formats described earlier. There are four elementary streams in the sound scene in Figure 7. Each of these elementary streams contains a primitive media object, which in the case of audio is a single-channel or multichannel sound that will be composited into the overall scene. In Figure 7, the stream coded with the GA coder (MPEG-4 General Audio [2], used for wideband music) decodes into a stereo sound, and the other streams into monophonic sounds. The different primitive audio objects may each make use of a different audio decoder, and decoders may be used multiple times in the same scene.

The multiple elementary streams are conveyed together in a multiplexed representation. Multiple multiplexed streams may be transmitted from multiple servers to a single MPEG-4 receiver, or terminal. There are two multiplexed MPEG-4 bitstreams shown in Figure 7; each originates from a different server. Encoded video content can also be multiplexed into the same MPEG-4 bitstreams. As they are received in the MPEG-4 terminal, the MPEG-4 bitstreams are demultiplexed, and each primitive media object is decoded. The resulting sounds are not played directly, but rather made available for scene compositing using AudioBIFS.

AudioBIFS nodes

Also transmitted in the multiplexed MPEG-4 bitstream is the BIFS scene graph itself. BIFS--and AudioBIFS--are simply parts of the content like the media objects themselves; there is nothing "hardwired" about the scene graph in MPEG-4. Content developers have wide flexibility to use BIFS in a variety of ways. In Figure 7, the BIFS part and the AudioBIFS part of the scene graph are separated because it is convenient to imagine them this way, but there is no technical difference between them (AudioBIFS is just a subset of BIFS).

Like the rest of the BIFS capabilities [22], AudioBIFS consists of a number of nodes which are interlinked in a scene graph. However, the concept of the AudioBIFS scene graph is somewhat different; it is termed an audio subgraph.

Whereas the main (visual) scene graph represents the spatiotemporal position of visual objects in presentation space, and their properties such as color, texture, and layering, an audio subgraph represents a signal-flow graph describing digital-signal-processing manipulations. Sounds flow in from MPEG-4 audio decoders at the bottom of the scene graph; each "child" node presents its results from processing to one or more "parent" nodes. Through this chain of processing, sound streams eventually arrive at the top of the audio subgraph. The "intermediate results" in the middle of the manipulation process are not sounds to be played to the user; only the result of the processing at the top of each audio subgraph is presented.

The AudioBIFS nodes are summarized in Table 1 and discussed in more detail in the following paragraphs.

Table 1: The AudioBIFS nodes

Node Name        Function

AudioSource      Connect decoder to scene graph
Sound            Connect audio subgraph to visual scene
AudioMix         Mix multiple channels of sound together
AudioSwitch      Select a subset of a set of channels of sound
AudioDelay       Delay a set of audio channels
AudioFX          Perform audio effects-processing
AudioBuffer      Buffer sound for interactive playback
ListeningPoint   Control position of virtual listener
TermCap          Query resources of terminal

The AudioSource node is the point of connection between real-time streaming audio and the AudioBIFS scene. The AudioSource node attaches an audio decoder, of one of the types specified in the MPEG-4 audio standard, to the scene graph; audio flows out of this node.

The Sound node is used to attach sound into audiovisual scenes, either as 3-D directional sound, or as non-spatial ambient sound. All of the spatial and non-spatial sounds produced by Sound nodes in the scene are summed and presented to the listener. The semantics of the Sound node in MPEG-4 are similar to those of the VRML standard, i.e., the sound attenuation region and spatial characteristics are defined in the same way as in the VRML standard to create a simple model of attenuation. In contrast to VRML, where the Sound node accepts raw sound samples directly and no intermediate processing is done, in MPEG-4 any of the AudioBIFS nodes may be attached. Thus, if an AudioSource node is the child node of the Sound node, the sound as transmitted in the bitstream is added to the sound scene; however, if a more complex audio scene graph is beneath the Sound node, the mixed or effects-processed sound is presented.

The AudioMix node allows M channels of input sound to be mixed into N channels of output sound through the use of a mixing matrix.
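The mixing-matrix operation can be written out in a few lines of Python. The orientation of the matrix (one row of M gains per output channel) and the function name are assumptions of this sketch; the node's actual field layout is given in the MPEG-4 Systems standard.

```python
def audio_mix(inputs, matrix):
    """Mix M input channels into N output channels.

    inputs: list of M channels, each an equal-length list of samples.
    matrix: N rows of M gains; output[n][t] = sum over m of
            matrix[n][m] * inputs[m][t].
    """
    length = len(inputs[0])
    return [[sum(gain * ch[t] for gain, ch in zip(row, inputs))
             for t in range(length)]
            for row in matrix]

# Fold a stereo pair (M = 2) down to mono (N = 1) with equal gains.
stereo = [[1.0, 2.0], [3.0, 4.0]]
mono = audio_mix(stereo, [[0.5, 0.5]])
```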

The AudioSwitch node allows N channels of output to be taken as a subset of M channels of input, where N ≤ M. It is equivalent to, but easier to compute than, an AudioMix node where N ≤ M and all matrix values are 0 or 1. This node allows efficient selection of certain channels, perhaps on a language-dependent basis.
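Channel selection and its equivalent 0/1 mixing matrix can be sketched in Python as follows. The representation of the selection as a list of 0/1 flags, one per input channel, is an assumption made for illustration, as are the function names.

```python
def audio_switch(inputs, which):
    """Pass through only the input channels whose flag in `which` is 1."""
    return [list(ch) for ch, keep in zip(inputs, which) if keep]

def switch_as_matrix(which):
    """Build the 0/1 mixing matrix (N rows, M columns) with the same effect."""
    selected = [m for m, keep in enumerate(which) if keep]
    return [[1 if m == s else 0 for m in range(len(which))]
            for s in selected]

# Three input channels (say, dialogue in three languages); keep the second.
channels = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
chosen = audio_switch(channels, [0, 1, 0])
```

The switch copies channels without any multiplications, whereas applying the matrix would cost M multiply-adds per output sample, which is why the selection form is cheaper to compute.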

The AudioDelay node allows several channels of audio to be delayed by a specified amount of time, enabling small shifts in timing for media synchronization.
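As a minimal illustration, a delay of this kind amounts to prepending silence to each channel. The Python sketch below works on a whole block of samples at once and rounds the delay to the nearest sample; both simplifications, and the function name, are assumptions rather than the behavior specified for the node.

```python
def audio_delay(channels, delay_s, srate):
    """Delay all channels by delay_s seconds, keeping each channel's length."""
    n = round(delay_s * srate)  # delay expressed in whole samples
    return [([0.0] * n + list(ch))[:len(ch)] for ch in channels]

# A 2 ms delay at a 1000 Hz sampling rate shifts the signal by two samples.
delayed = audio_delay([[1.0, 2.0, 3.0, 4.0]], delay_s=0.002, srate=1000)
```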

The AudioFX node allows the dynamic download of custom signal-processing effects to apply to several channels of input sound. Arbitrary effects-processing algorithms may be written in SAOL and transmitted as part of the scene graph. The use of SAOL to transmit audio effects means that MPEG does not have to standardize the "best" artificial reverberation algorithm (for example), but also that content developers do not have to rely on terminal implementors and trust in the quality of the algorithms present in an unknown terminal. Since the execution method of SAOL algorithms is precisely specified, the sound designer has control over exactly which reverberation algorithm (for example) is used in a scene. If a reverb with particular properties [5] is desired, the content author transmits it in the bitstream; its use is then guaranteed. The complexity of processing in the AudioFX node is defined and restricted through levels, in a manner similar to the complexity of the Structured Audio decoder.

The position of the Sound node in the overall scene, as well as the position of the listener, are also made available to the AudioFX node, so that effects-processing may depend on the spatial locations (relative or absolute) of the listener and sources.

The AudioBuffer node allows a segment of audio to be excerpted from a stream, and then triggered and played back interactively. Unlike the VRML node AudioClip, the AudioBuffer node does not itself contain any sound data. Instead, it records the first n seconds of sound produced by its children. It captures this sound into an internal buffer. Then, it may later be triggered interactively (see the section on interaction below) to play that sound back.
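The capture-then-replay behavior can be sketched in Python. The class below is a hypothetical model, not the node's actual interface: it stores at most the first length_s seconds of whatever its child produces and returns a copy of that segment each time it is triggered.

```python
class AudioBuffer:
    """Records the first length_s seconds of input for later playback."""

    def __init__(self, length_s, srate):
        self.capacity = int(length_s * srate)  # buffer size in samples
        self.samples = []

    def capture(self, frame):
        """Store incoming samples until the buffer is full, then ignore them."""
        room = self.capacity - len(self.samples)
        self.samples.extend(frame[:room])

    def trigger(self):
        """Return the stored segment, e.g. in response to a button press."""
        return list(self.samples)

buf = AudioBuffer(length_s=0.5, srate=8)  # room for 4 samples
buf.capture([1.0, 2.0, 3.0])
buf.capture([4.0, 5.0, 6.0])              # only the first sample still fits
icon = buf.trigger()
```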

This function is most useful for "auditory icons" such as feedback to button-presses. It is impossible to make streaming audio provide this sort of audio feedback, since the stream is (at least from moment to moment) independent of user interaction. The backchannel capabilities of MPEG-4 are not intended to allow the rapid response required for audio feedback. There is a special function of AudioBuffer which allows it to cache samples for sampling synthesis in the Structured Audio decoder. This technique allows perceptual compression to be applied to sound samples, which can greatly reduce the size of bitstreams using sampling synthesis.

The ListeningPoint node allows the user to set the listening point in a scene. The listening point is the position relative to which the spatial positions of sources are calculated. By default (if no ListeningPoint node is used), the listening point is the same as the visual viewpoint.

The TermCap node is not an AudioBIFS node specifically, but provides capabilities which are useful in creating terminal-adaptive scenes. The TermCap node allows the scene graph to query the terminal on which it is running, to discover hardware and performance properties of that terminal. For example, in the audio case, TermCap may be used to determine the ambient signal-to-noise ratio of the environment. The result can be used to control "switching" between different parts of the scene graph, so that (for example) a compressor is applied in a noisy environment such as an automobile, but not in a quiet environment such as a listening room. Other audio-pertinent resources that may be queried with the TermCap node include: the number and configuration of loudspeakers, the maximum output sampling rate of the terminal, and the level of sophistication of 3-D audio functionality available.

The MPEG-4 Systems standard contains specifications for the resampling, buffering, and synchronization of sound in AudioBIFS. Although we will not discuss these aspects in detail, for each of the AudioBIFS nodes there are precise instructions in the standard for the associated resampling and buffering requirements. These aspects of MPEG-4 are normative. This makes the behavior of an MPEG-4 terminal highly predictable to content developers.


Summary

We have described the tools for synthetic and SNHC audio in MPEG-4. By using these tools, content developers can create high-quality, interactive content and transmit it at extremely low bitrates over digital broadcast channels or the Internet. The Structured Audio toolset provides a single standard to unify the world of algorithmic music synthesis and to drive forward the capabilities of the PC audio platform; the Text-to-Speech Interface provides a greatly needed measure of interoperability between content and text-to-speech systems.


References

[1] K. Brandenburg, "Perceptual coding of high quality digital audio", in: M. Kahrs and K. Brandenburg, eds., Applications of Digital Signal Processing to Audio and Acoustics, Kluwer Academic, New York, 1998, pp. 39-83.

[2] K. Brandenburg, "Natural audio and speech coding in MPEG-4", Signal Processing: Image Communication, 1999 (this issue).

[3] M. A. Casey and P. J. Smaragdis, "Netsound: Real-time audio from semantic descriptions", Proc. Int. Computer Music Conf., Hong Kong, 1996, pp. 143.

[4] J. Chadabe, Electric Sound: The Past and Promise of Electronic Music, Prentice-Hall, Upper Saddle River NJ, 1997.

[5] W. G. Gardner, "Reverberation algorithms", in: M. Kahrs and K. Brandenburg, eds., Applications of Digital Signal Processing to Audio and Acoustics, Kluwer Academic, New York, 1998, pp. 85-132.

[6] C. Herpel, "Elementary Stream Management in MPEG-4", Signal Processing: Image Communication, 1999 (this issue).

[7] J. E. Hopcroft and J. D. Ullman, Introduction to Automata Theory, Languages, and Computation, Addison-Wesley, Reading, MA, 1979.

[8] International Organisation for Standardisation, Virtual reality modeling language (VRML), International Standard ISO/IEC 14772-1:1997, ISO, 1997.

[9] D. Johnston, C. Sorin, C. Gagnoulet, et al., "Current and experimental applications of speech technology for telecom services in Europe", Speech Communication, Vol. 23, No 1-2, 1997, pp. 5-16.

[10] M. Kitai, K. Hakoda, S. Sagayama, et al., "ASR and TTS telecommunications applications in Japan", Speech Communication, Vol. 23, No 1-2, 1997, pp. 17-30.

[11] D. H. Klatt, "Review of text-to-speech conversion for English", J. Acoust. Soc. Am., Vol. 82, No 3, 1987, pp. 737-793.

[12] K. D. Martin, "Automatic transcription of simple polyphonic music: Robust front-end processing", Technical Report #399, MIT Media Laboratory Perceptual Computing, 1996.

[13] M. V. Mathews, "An acoustic compiler for music and psychological stimuli", Bell Systems Technical Journal, Vol. 40, 1961, pp. 677-694.

[14] M. V. Mathews, The Technology of Computer Music, MIT Press, Cambridge, MA, 1969.

[15] MIDI Manufacturers Association, The Complete MIDI 1.0 Detailed Specification, Protocol specification, MIDI Manufacturers Association, 1996.

[16] C. Roads, The Computer Music Tutorial, MIT Press, Cambridge, MA, 1996.

[17] D. F. Rosenthal and H. G. Okuno, eds., Computational Auditory Scene Analysis, Lawrence Erlbaum, Location, 1998.

[18] E. D. Scheirer, "Structured audio and effects processing in the MPEG-4 multimedia standard", Multimedia Systems, Vol. 7, No 1, 1999, pp. 11-22.

[19] E. D. Scheirer and L. Ray, "Algorithmic and wavetable synthesis in the MPEG-4 multimedia standard", Proc. 105th Convention of the Audio Engineering Society (reprint #4811), San Francisco, 1998.

[20] E. D. Scheirer, R. Väänänen and J. Huopaniemi, "AudioBIFS: Describing audio scenes in the MPEG-4 multimedia standard", IEEE Trans. Multimedia, in press.

[21] E. D. Scheirer and B. L. Vercoe, "SAOL: The MPEG-4 Structured Audio Orchestra Language", Comp. Mus. J., Vol. 23, No 2, 1999, pp. 31-51.

[22] J. Signes, "MPEG-4 scene description", Signal Processing: Image Communication, 1999 (this issue).

[23] J. O. Smith, "Viewpoints on the history of digital synthesis", Proc. Int. Computer Music Conf., Montreal, 1991, pp. 1-10.

[24] M. Tekalp, "Coding of synthetic visual data in MPEG-4", Signal Processing: Image Communication, 1999 (this issue).

[25] H. Valbret, E. Moulines and J. P. Tubach, "Voice transformation using PSOLA technique", Speech Communication, Vol. 11, No 2-3, 1992, pp. 175-187.

[26] B. L. Vercoe, Csound: A Manual for the Audio-Processing System (rev. 1996), Program reference manual, MIT Media Lab, 1985.

[27] B. L. Vercoe, "The synthetic performer in the context of live performance", Proc. International Computer Music Conference, Paris, 1984, pp. 199-200.

[28] B. L. Vercoe, W. G. Gardner and E. D. Scheirer, "Structured audio: The creation, transmission, and rendering of parametric sound representations", Proc. IEEE, Vol. 86, No 5, 1998, pp. 922-940.


ABOUT THE AUTHORS

Eric D. Scheirer is a Ph.D. candidate in the Machine Listening Group at the MIT Media Laboratory, where his research focuses on the construction of music-understanding computer systems. He received B.S. degrees in computer science and linguistics from Cornell University, Ithaca, NY, in 1993, and the M.S. degree from the Media Lab in 1995. He was an Editor of the MPEG-4 audio standard and was the primary developer of the Structured Audio and AudioBIFS components of MPEG-4.

Eric has published articles on a range of audio topics, including music analysis, structured coding, advanced psychoacoustic models, audio pattern recognition, and sound synthesis. He is an active speaker and writer for non-technical audiences, with a particular interest in the application of audio technology to multimedia systems design. He is also an accomplished and award-winning jazz trombonist.

Youngjik Lee received the B.S. degree in electronics engineering from Seoul National University, Seoul, Korea in 1979, the M.S. degree in electrical engineering from Korea Advanced Institute of Science and Technology, Seoul, Korea in 1981 and the Ph.D. degree in electrical engineering from the Polytechnic University, Brooklyn, New York, USA.

From 1981 to 1985 he was with Samsung Electronics Company, where he developed video display terminals. From 1985 to 1988 he worked on sensor array signal processing. Since 1989, he has been with Electronics and Telecommunications Research Institute pursuing research in multimodal interfaces, speech recognition, speech synthesis and speech translation, neural networks, pattern recognition, and digital signal processing.

Jae-Woo Yang received the B.S. degree in electrical engineering and the M.S. degree in control engineering from Seoul National University in 1978 and 1982, respectively, and received the Ph.D. degree in computer science from Korea Advanced Institute of Science and Technology in 1997. He worked at Samsung Electronics Company from 1978 to 1979. He joined ETRI in 1980. Since then he has developed operation support systems, artificial intelligence systems, spoken language processing systems and multimedia systems. He is now director of the Telecommunication Terminal Technology department. He also works as a board member of the Acoustics Society of Korea. His current research areas are human interfaces, multimedia communication, Internet terminal equipment and spoken-language processing.