INTERNATIONAL ORGANISATION FOR STANDARDISATION
ORGANISATION INTERNATIONALE DE NORMALISATION
ISO/IEC JTC1/SC29/WG11
CODING OF MOVING PICTURES AND AUDIO

 

ISO/IEC JTC1/SC29/WG11
MPEG2006/N8641
October 2006, Hangzhou, CN

 

 

Source: Audio Subgroup
Status: Proposal
Title: Audio Bifs version 3
Editors: Jürgen Schmidt (THOMSON), Johannes Boehm (THOMSON)

Introduction

The MPEG-4 standard provides a powerful scene description language for 2D and 3D scenes, together with the complete infrastructure for the transmission and state-of-the-art (de)coding of audiovisual content. Key to this ability is the scene description language BIFS (Binary Format for Scenes) [1]. The BIFS scene description is based on a tree concept similar to the one used in the VRML standard [2]. All functional blocks of a scene can be visualized as nodes of a tree, and the structure of the tree reflects the interdependency of the functional blocks.

Figure 1: Simplified MPEG-4 audio player diagram

 

Scene information is transmitted in a composite data stream that consists of audio data in elementary streams and structural information in the scene stream, binary encoded in BIFS. The scene stream carries the tree information as well as the description of the nodes themselves. The number of elementary audio streams is application-dependent and can vary over time. Different audio coding algorithms exist for coding the elementary streams, e.g. MPEG-4 AAC.

Scene information and elementary streams are combined and processed in the composition unit, or compositor. This unit receives and decodes the scene stream and is responsible for handling the nodes, the scene graph (tree), and their connections with the decoded elementary streams. The audio nodes themselves are responsible for any signal processing (rendering), which is initiated by the renderer. The output of the renderer is fed to the presenter, which in turn is responsible for mapping the audio signals to the available loudspeaker array. The principle is shown in Figure 1.
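The renderer and presenter stages described above can be illustrated with a minimal sketch. It is not taken from the standard: the `Signal` type, the functions `render` and `present`, and the per-loudspeaker gain mapping are hypothetical names and simplifications chosen only to show the order of processing.

```python
from typing import Callable, Dict, List

# Hypothetical representation: one mono block of decoded samples.
Signal = List[float]


def render(node_chain: List[Callable[[Signal], Signal]], decoded: Signal) -> Signal:
    """Renderer stage: pass a decoded elementary-stream block through each audio node in turn."""
    out = decoded
    for process in node_chain:
        out = process(out)
    return out


def present(signal: Signal, speaker_gains: Dict[str, float]) -> Dict[str, Signal]:
    """Presenter stage: map the rendered signal onto the available loudspeakers.

    A plain gain per loudspeaker is an assumption made for illustration only.
    """
    return {name: [gain * s for s in signal] for name, gain in speaker_gains.items()}
```

In this sketch each audio node is modelled as a function from one signal block to another, so that the renderer only has to trigger the nodes in order and hand the result to the presenter.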

Audio Bifs

A simplified view of an audiovisual scene is given in Figure 2, showing control flow (scene control, field access), data flow (audio) and interactivity flow. The tree is logically divided into a visual part and an audio part [3]. The visual part contains all visual or interactive nodes. The nodes in the audio tree are responsible for the handling of audio content. The scene stream is decoded by the BIFS decoder (not shown) that instantiates the nodes and fills their property fields with properties arriving with the control flow.

Figure 2: Simplified scene graph diagram for an MPEG-4 scene

The audio nodes of the audio sub-tree determine the processing of the audio signals. Each node can be connected with any other node in the tree; crossings and multiple connections are allowed. The top node of the audio sub-tree is always a sound node. Different types of sound nodes exist, for example the Sound or the Sound2D node. The top node interfaces to the presenter and can have only one child. The bottom node is always an AudioSource node or, for backward compatibility, an AudioClip node, whose usage should be avoided in new applications. The AudioSource node always interfaces to the decoded elementary-stream data.
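The structural rules stated above (sound node at the top with a single child, source node at the bottom, shared connections allowed) can be expressed as a small consistency check. The following is a minimal sketch under stated assumptions: the `Node` class and `validate_audio_subtree` are hypothetical helpers, not part of any BIFS API, and `AudioMix` is used only as an example name for an intermediate node.

```python
from dataclasses import dataclass, field
from typing import List

# Node type names as used in the text; the sets themselves are illustrative.
SOUND_NODE_TYPES = {"Sound", "Sound2D"}
SOURCE_NODE_TYPES = {"AudioSource", "AudioClip"}  # AudioClip: backward compatibility only


@dataclass
class Node:
    """Hypothetical stand-in for a scene-graph node."""
    node_type: str
    children: List["Node"] = field(default_factory=list)


def validate_audio_subtree(top: Node) -> List[str]:
    """Return the list of rule violations; an empty list means the sub-tree is consistent."""
    errors = []
    if top.node_type not in SOUND_NODE_TYPES:
        errors.append(f"top node must be a sound node, got {top.node_type}")
    if len(top.children) > 1:
        errors.append(f"top node can have only one child, got {len(top.children)}")

    # Visit every node once; a node reached over several connections is checked only once.
    stack, seen = [top], set()
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        if node.children:
            stack.extend(node.children)
        elif node.node_type not in SOURCE_NODE_TYPES:
            errors.append(f"bottom node must be AudioSource or AudioClip, got {node.node_type}")
    return errors


# Usage: one AudioSource connected twice to an intermediate node.
source = Node("AudioSource")
scene = Node("Sound2D", [Node("AudioMix", [source, source])])
problems = validate_audio_subtree(scene)
```

Tracking visited nodes by identity keeps the check valid when a node is reached over multiple connections, which the text explicitly allows.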

Audio Bifs version 3

Audio Bifs version 3 adds new functionality to Audio Bifs. A brief overview by version follows:

Audio Bifs

Advanced Audio Bifs (version 2, AABIFS)

The DirectiveSound node is a top node similar to the Sound or Sound2D node with extended properties for the spatial presentation of sound sources with the support of the AcousticScene and AcousticMaterial nodes. Details can be found in [6, 7].

Audio Bifs version 3

A more comprehensive review of Audio Bifs version 3 can be found in [8].

Target applications

Target applications range from multimedia terminals to gaming applications. The technology of Audio Bifs composition and rendering supports sound reproduction in a wide range of formats. These formats range from the well-known 2-channel stereo format to multi-channel formats with up to 16 channels. They also cover Ambisonics, the binaural transport format and object-oriented virtual reality formats, as depicted in Figure 3.

Figure 3: Abstract view of an audio system with various audio formats

References

[1]        ISO/IEC 14496-11:2005, Information technology - Coding of audio-visual objects - Part 11: Scene description and application engine

[2]        ISO/IEC 14772-1:1997, The Virtual Reality Modeling Language (VRML97), www.web3d.org/specifications/VRML97/

[3]        Scheirer, E. D.; Väänänen, R.; Huopaniemi, J.: AudioBIFS: Describing audio scenes with the MPEG-4 multimedia standard. IEEE Transactions on Multimedia, vol. 1, no. 3, pp. 237–250, September 1999.

[4]        Scheirer, E. D.; Vercoe, B. L.: SAOL: The MPEG-4 Structured Audio Orchestra Language. Computer Music Journal, vol. 23, no. 2, pp. 31–51, 1999.

[5]        Vercoe, B. L.; Gardner, W. G.; Scheirer, E. D.: Structured Audio: Creation, transmission, and rendering of parametric sound representations. Proc. IEEE, vol. 86, no. 5, pp. 922–940, 1998.

[6]        Dantele, A.; Reiter, U.; Schuldt, M.; Drumm, H.; Baum, O.: Implementation of MPEG-4 Audio Nodes in an Interactive Virtual 3D Environment. 114th AES Convention, Amsterdam, March 2003, preprint No. 5820

[7]        Väänänen, Riitta: User Interaction and Authoring of 3D Sound Scenes in the Carrouso EU project. 114th AES Convention, Amsterdam, March 2003, preprint No. 5764

[8]        Schmidt, J.; Schröder, E. F.: New and Advanced Features for Audio Presentation in the MPEG-4 Standard. 116th AES Convention, Berlin, May 2004, preprint No. 6058