MPEG-4 Systems: Binary Format for Scenes (BIFS)

1. What is BIFS?
2. Why BIFS?
3. What is the MPEG-4 model of an audio-visual object?
4. Why is scene description information separate from audio-visual objects?
5. Can the scene description be changed?
6. What is the difference between BIFS and VRML?
7. The Scene Description looks similar to VRML. Is it?
8. How is interactivity handled in MPEG-4?
9. So, the scene description is streamed?
10. Can there be multiple scene description streams in an MPEG-4 presentation?

1. What is BIFS?

BIFS is an abbreviation for "BInary Format for Scenes". BIFS provides a complete framework for the presentation engine of MPEG-4 terminals. It makes it possible to mix various MPEG-4 media with 2D and 3D graphics, to handle interactivity, and to deal with local or remote changes of the scene over time. BIFS was designed as an extension of the VRML 2.0 specification, in binary form.

BIFS is composed of four elements:

  • The operational elements of the scene, consisting of nodes and routes. These represent in particular:
    • Audio-visual objects and their attributes (which define their audio-visual properties);
    • Composition operations;
    • Animation of the content;
    • Interactive behavior of individual objects, obtained by linking event source fields to event sink fields of different nodes.
  • The binary syntax for compressing the node tree as well as the associated routes.
  • The BIFS-Command protocol, in order to stream scene changes, insert new scenes or objects, delete objects, etc.
  • The BIFS-Anim protocol, in order to stream animations of node parameters. It is a very low-overhead mechanism for animating audio-visual objects.
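
To make the first element (nodes and routes) more concrete, here is a minimal Python sketch of a scene graph. It is a hypothetical in-memory model, not the BIFS binary syntax: the classes `Node`, `Route` and `Scene` and the DEF names `BOX_POS` and `DRAG` are invented for illustration, and the node and field names are borrowed from VRML/BIFS only for readability.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Node:
    """Hypothetical scene graph node: type, field values, children, optional DEF name."""
    node_type: str
    fields: Dict[str, Any] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)
    def_name: Optional[str] = None  # VRML-style name, used as a ROUTE endpoint


@dataclass
class Route:
    """Links an event source field of one node to an event sink field of another."""
    src_node: str
    src_field: str
    dst_node: str
    dst_field: str


@dataclass
class Scene:
    root: Node
    routes: List[Route] = field(default_factory=list)

    def find(self, def_name: str) -> Optional[Node]:
        """Depth-first search for the node carrying the given DEF name."""
        stack = [self.root]
        while stack:
            node = stack.pop()
            if node.def_name == def_name:
                return node
            stack.extend(node.children)
        return None

    def dangling_routes(self) -> List[Route]:
        """Routes whose source or sink node is missing from the tree."""
        return [r for r in self.routes
                if self.find(r.src_node) is None or self.find(r.dst_node) is None]


# A draggable rectangle: dragging the sensor moves the enclosing transform.
scene = Scene(
    root=Node("OrderedGroup", children=[
        Node("Transform2D", {"translation": (0.0, 0.0)}, def_name="BOX_POS", children=[
            Node("Shape", children=[Node("Rectangle", {"size": (20.0, 10.0)})]),
            Node("PlaneSensor2D", def_name="DRAG"),
        ]),
    ]),
    routes=[Route("DRAG", "translation_changed", "BOX_POS", "translation")],
)
print(scene.find("BOX_POS").node_type)  # Transform2D
print(scene.dangling_routes())          # []
```

Keeping routes in a separate list, rather than inside the nodes, mirrors the text's description of the scene as "nodes and routes".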

2. Why BIFS?

A central concept in the MPEG-4 design is the transmission of, and interaction with, audio-visual objects of synthetic or natural origin. The Audio and Visual parts of the standard provide the encoding algorithms for individual audio-visual objects. To combine these media into complete presentations, a scene description capability is needed.
BIFS provides the input data to the presentation layer of the MPEG-4 terminal. No other scene format covers all the requirements of the MPEG-4 presentation engine. The main concepts driving the design of the BIFS specification are the following:

  • Integration of 2D and 3D synthetic media in a single format. Because there is no need to mix multiple media formats, content creators can design complete multimedia content without the hassle of dealing with many different formats, and end users benefit from a lighter terminal with state-of-the-art media capabilities;
  • Streaming environment: all existing scene description formats are designed so that a complete scene has to be downloaded before anything can be viewed on the terminal. In MPEG-4, the terminal is linked to one or several MPEG-4 servers, and the scene description, like any other media, has to be streamed to the client. Allowing the scene and its animation parameters to be "cut into pieces" and streamed to the client provides a more efficient transmission model that matches the usual MPEG-4 requirements. In communication applications, these streaming features are also needed to send new data to the user during the communication;
  • Compression: most existing scene representations are in text format, which makes them editable but very inefficient in terms of data size. Scenes are often much smaller than other media; however, complex scenes can be large data sets, sometimes several megabytes. Even for smaller scenes, reducing the data size can bring significant improvements in transmission time, especially at low bit rates or in broadcast environments where the scene has to be transmitted repeatedly. Moreover, animation data can also be streamed in MPEG-4, and compressing it efficiently can significantly reduce the bit rate. As a typical example, a BIFS-Anim stream that consumes 10 kbit/s would consume more than 120 kbit/s uncompressed.
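
The compression figure above, and the cost of repeatedly sending a scene in a broadcast, come down to simple arithmetic. The following Python sketch computes both; the function names are hypothetical, and the 500-kbyte scene repeated every 10 seconds is an illustrative value, not a figure from the text.

```python
def compression_ratio(uncompressed_kbps: float, compressed_kbps: float) -> float:
    """How many times smaller the compressed stream is."""
    return uncompressed_kbps / compressed_kbps


def carousel_bitrate_kbps(scene_kbytes: float, repeat_period_s: float) -> float:
    """Bit rate needed to re-send a whole scene every repeat_period_s seconds.

    1 kbyte = 8 kbit, so the cost is the scene size in kbit divided by the period.
    """
    return scene_kbytes * 8 / repeat_period_s


# Figures from the text: BIFS-Anim at 10 kbit/s vs. more than 120 kbit/s uncompressed.
print(compression_ratio(120, 10))      # 12.0, i.e. at least 12:1
# Hypothetical broadcast case: a 500-kbyte scene repeated every 10 seconds.
print(carousel_bitrate_kbps(500, 10))  # 400.0 kbit/s spent only on repeating the scene
```

Because the carousel cost scales linearly with scene size, any reduction in the encoded scene size lowers the broadcast overhead by the same factor.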

3. What is the MPEG-4 model of an audio-visual object?

In the MPEG-4 model, audio-visual (AV) objects have both a spatial and a temporal extent. Temporally, all AV objects have a single dimension. Each AV object has a local coordinate system in which the object has a fixed spatio-temporal location and scale. AV objects are positioned in a scene by specifying one or more coordinate transformations from the object's local coordinate system into a common, global coordinate system, also called the scene coordinate system. In a BIFS scene, an audio-visual object is usually represented by one BIFS node or by a sub-tree of the BIFS scene graph.
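
The chain of coordinate transformations from an object's local system into the scene system can be sketched in Python with 3x3 homogeneous matrices. This is a simplified model that assumes a VRML-style 2D transform which scales, then rotates, then translates, and it ignores details such as a rotation center; the helpers `transform_2d`, `apply` and `local_to_scene` are hypothetical names.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def transform_2d(tx: float, ty: float, angle: float, sx: float, sy: float) -> Matrix:
    """Hypothetical helper: scale, then rotate (radians), then translate."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c * sx, -s * sy, tx],
            [s * sx, c * sy, ty],
            [0.0, 0.0, 1.0]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def apply(m: Matrix, point: Tuple[float, float]) -> Tuple[float, float]:
    x, y = point
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])


def local_to_scene(chain: List[Matrix]) -> Matrix:
    """Compose transforms listed from the scene root down to the object itself."""
    result = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    for m in chain:
        result = matmul(result, m)
    return result


# Parent group moves its children by (100, 50); the object is scaled by 2 and rotated 90 degrees.
parent = transform_2d(100, 50, 0.0, 1, 1)
obj = transform_2d(0, 0, math.pi / 2, 2, 2)
print(apply(local_to_scene([parent, obj]), (1.0, 0.0)))  # approximately (100.0, 52.0)
```

The local point (1, 0) becomes (2, 0) after scaling, (0, 2) after rotation, and (100, 52) after the parent's translation, which is the product of the matrices along the path from the root.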

4. Why is scene description information separate from audio-visual objects?

Scene description information is a property of the scene's structure rather than of individual AV objects. Consequently, it is transmitted as a separate stream. This is an important feature for bitstream editing and one of the essential content-based functionalities of MPEG-4: the composition of AV objects can be changed without decoding their bitstreams or altering their content. If the position of an object were part of its bitstream, this would be very difficult.

5. Can the scene description be changed?

The scene description can be dynamically changed at any time. An initial scene description is provided at the beginning of an MPEG-4 stream. It can be as simple as a single node, or as complex as one wants (within the limits established to ensure conformance). BIFS-Commands are used to modify a set of properties of the scene at a given time. It is possible to insert, delete and replace nodes, fields and ROUTEs, as well as to replace the entire scene. For continuous changes of scene parameters, BIFS-Anim can be used; it specifically addresses the continuous update of the fields of a particular node. BIFS-Anim is used to integrate different kinds of animation, including the ability to animate face models as well as meshes, 2D and 3D positions, rotations, scale factors, and color attributes. The BIFS-Anim information is conveyed in its own elementary stream.
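
The command semantics described above (insert, delete, replace, replace the whole scene) can be illustrated with a small Python sketch. It is a hypothetical in-memory model, not the BIFS-Command binary syntax: nodes live in a flat table keyed by invented IDs, commands are plain dictionaries, where a node is placed in the tree is ignored, and operation names such as `InsertNode` and `ReplaceScene` are descriptive labels chosen for the sketch.

```python
from copy import deepcopy
from typing import Any, Dict

# Hypothetical scene: {"nodes": {id: {"type": str, "fields": dict}}, "routes": [tuple]}
Scene = Dict[str, Any]


def apply_command(scene: Scene, cmd: Dict[str, Any]) -> Scene:
    """Apply one command and return a new scene, leaving the old one untouched."""
    op = cmd["op"]
    if op == "ReplaceScene":
        return deepcopy(cmd["scene"])
    new = deepcopy(scene)
    nodes, routes = new["nodes"], new["routes"]
    if op == "InsertNode":
        nodes[cmd["id"]] = {"type": cmd["type"], "fields": dict(cmd.get("fields", {}))}
    elif op == "DeleteNode":
        del nodes[cmd["id"]]
        # In this sketch, routes that pointed at the deleted node are dropped too.
        new["routes"] = [r for r in routes if cmd["id"] not in (r[0], r[2])]
    elif op == "ReplaceField":
        nodes[cmd["id"]]["fields"][cmd["field"]] = cmd["value"]
    elif op == "InsertRoute":
        routes.append(tuple(cmd["route"]))
    elif op == "DeleteRoute":
        routes.remove(tuple(cmd["route"]))
    else:
        raise ValueError(f"unknown command: {op}")
    return new


scene: Scene = {"nodes": {"ROOT": {"type": "OrderedGroup", "fields": {}}}, "routes": []}
scene = apply_command(scene, {"op": "InsertNode", "id": "LOGO", "type": "Transform2D",
                              "fields": {"translation": (0, 0)}})
scene = apply_command(scene, {"op": "ReplaceField", "id": "LOGO",
                              "field": "translation", "value": (40, 10)})
print(scene["nodes"]["LOGO"]["fields"]["translation"])  # (40, 10)
scene = apply_command(scene, {"op": "DeleteNode", "id": "LOGO"})
print(sorted(scene["nodes"]))  # ['ROOT']
```

Returning a new scene for each command keeps earlier states intact, which makes it easy to compare the scene before and after an update.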

6. What is the difference between BIFS and VRML?

BIFS was designed as an extension of the VRML 2.0 specification. In Version 2 of MPEG-4 Systems, all VRML nodes are supported. BIFS extends the base VRML specification in several respects:

  • New media capabilities in the scene:
    • 2D nodes containing 2D graphics and 2D scene graph description;
    • mixing of 2D and 3D graphics;
    • new audio nodes supporting advanced audio features:
      • Mixing of sources,
      • Streaming audio interface and
      • Creation of synthetic audio content.
    • face- and body-specific nodes that link to dedicated Face and Body animation streams;
    • specific nodes linked to the streaming client/server environment, such as media time sensors and back channel messages.
  • A binary encoding of the scene, so that an efficient transmission of the scene can be performed.
  • Specific protocols to stream scene and animation data:
    • The BIFS-Command protocol, to send synchronized modifications of the scene within a stream;
    • The BIFS-Anim protocol in order to stream continuous animation of the scene.

7. The Scene Description looks similar to VRML. Is it?

The scene description has several similarities to VRML, as the set of nodes defined by VRML was used as the initial set of composition nodes for MPEG-4. The environment that MPEG-4 addresses, however, is quite different from that of VRML, because a key requirement is support for high-quality real-time audio-visual content. In addition, rather than using a static scene description, MPEG-4 defines a dynamic one in which objects can be added, changed, or removed at any point in time. MPEG collaborates closely with the body responsible for VRML to ensure alignment and to maximize the synergy between the work of the two international bodies.

8. How is interactivity handled in MPEG-4?

Interactivity in MPEG-4 Systems is separated into two major categories: client-side and server-side. The former is available locally at an MPEG-4 terminal, while the latter requires communication between the terminal and the sender. Client-side interactivity can be further divided into simple object manipulation (repositioning, hiding, changing attributes, etc.), which does not require normative support from the standard, and more general types of events (hyperlinking, triggers, etc.), which do. Server-side interactivity also requires normative support. Client-side interactivity is handled via VRML's ROUTE mechanism, which links event source fields to event sink fields in the BIFS node tree. Server-side interactivity is provided via a Version 2 BIFS node called ServerCommand. Additional interactivity can be provided by an application that translates application events into local scene description updates. Sophisticated interactive applications can be created using the programmatic features of MPEG-4 Systems (ECMAScript as well as Java).
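
Client-side ROUTE handling can be pictured as event propagation along directed links between node fields. The Python sketch below is a simplified, hypothetical model: the `EventGraph` class is invented here and keeps event sources and sinks in a single field table per node. As a further simplification, each route fires at most once per event cascade, which stops routing loops from running forever.

```python
from collections import deque
from typing import Any, Dict, List, Tuple

# (source node, source field, sink node, sink field)
RouteT = Tuple[str, str, str, str]


class EventGraph:
    """Hypothetical sketch of ROUTE-style propagation between node fields."""

    def __init__(self) -> None:
        self.fields: Dict[str, Dict[str, Any]] = {}
        self.routes: List[RouteT] = []

    def add_node(self, name: str, **fields: Any) -> None:
        self.fields[name] = dict(fields)

    def add_route(self, src: str, src_field: str, dst: str, dst_field: str) -> None:
        self.routes.append((src, src_field, dst, dst_field))

    def send(self, node: str, field_name: str, value: Any) -> None:
        """Set a field, then push the value along every route leaving that field."""
        queue = deque([(node, field_name, value)])
        fired = set()  # simplification: each route fires at most once per cascade
        while queue:
            n, f, v = queue.popleft()
            self.fields[n][f] = v
            for route in self.routes:
                if route[0] == n and route[1] == f and route not in fired:
                    fired.add(route)
                    queue.append((route[2], route[3], v))


# Classic VRML-style pattern: a click on a TouchSensor starts a TimeSensor.
g = EventGraph()
g.add_node("TOUCH", touchTime=0.0)
g.add_node("TIMER", startTime=0.0)
g.add_route("TOUCH", "touchTime", "TIMER", "startTime")
g.send("TOUCH", "touchTime", 12.5)  # user clicks at t = 12.5 s
print(g.fields["TIMER"]["startTime"])  # 12.5
```

A breadth-first queue is used so that all sinks of one event are updated before events they trigger in turn.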

9. So, the scene description is streamed?

Yes. There are elementary streams carrying BIFS commands and elementary streams conveying BIFS animation data, just like any visual or audio stream. This makes it possible to attach time stamps to such information, just as time stamps are attached to, e.g., an audio frame.
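
Because scene updates travel in a time-stamped elementary stream, a terminal can buffer them and apply each one when its time stamp is reached. The Python sketch below is a hypothetical illustration of that scheduling only: `CommandStream` is invented for this example, commands are plain strings, and time stamps are plain seconds.

```python
import heapq
from typing import List, Tuple

# (time stamp in seconds, arrival order, command)
AccessUnit = Tuple[float, int, str]


class CommandStream:
    """Hypothetical buffer of time-stamped scene updates."""

    def __init__(self) -> None:
        self._pending: List[AccessUnit] = []
        self._seq = 0

    def receive(self, ts: float, command: str) -> None:
        # Updates may arrive ahead of time; keep them ordered by time stamp,
        # breaking ties by arrival order.
        heapq.heappush(self._pending, (ts, self._seq, command))
        self._seq += 1

    def due(self, now: float) -> List[str]:
        """Pop every command whose time stamp has been reached, earliest first."""
        ready = []
        while self._pending and self._pending[0][0] <= now:
            ready.append(heapq.heappop(self._pending)[2])
        return ready


stream = CommandStream()
stream.receive(2.0, "ReplaceField LOGO.translation")
stream.receive(0.0, "ReplaceScene")
stream.receive(5.0, "DeleteNode LOGO")
print(stream.due(1.0))  # ['ReplaceScene']
print(stream.due(4.0))  # ['ReplaceField LOGO.translation']
```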

10. Can there be multiple scene description streams in an MPEG-4 presentation?

Yes. For example, a BIFS scene may be composed of multiple sub-scenes that are inlined into a main scene (via Inline nodes). In that case, each sub-scene would have its own scene description stream.
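
How several scene description streams combine can be sketched as a recursive expansion of Inline nodes. In this hypothetical Python model, each stream is identified by an invented string ID, plain strings stand in for whole sub-trees, and an `Inline` placeholder names the stream that supplies its content; how a real terminal locates that stream is not modeled.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass(frozen=True)
class Inline:
    """Hypothetical placeholder whose content comes from another stream."""
    stream_id: str


Item = Union[str, Inline]


def resolve(stream_id: str, streams: Dict[str, List[Item]],
            active: Tuple[str, ...] = ()) -> List[str]:
    """Expand Inline placeholders recursively into one flat list of sub-trees."""
    if stream_id in active:
        raise ValueError(f"recursive Inline of stream {stream_id!r}")
    out: List[str] = []
    for item in streams[stream_id]:
        if isinstance(item, Inline):
            out.extend(resolve(item.stream_id, streams, active + (stream_id,)))
        else:
            out.append(item)
    return out


# Hypothetical presentation: a main scene inlining two sub-scenes, one of them nested.
streams: Dict[str, List[Item]] = {
    "main": ["Background", Inline("news"), Inline("weather")],
    "news": ["Headline", "VideoWindow"],
    "weather": ["Map", Inline("ticker")],
    "ticker": ["ScrollingText"],
}
print(resolve("main", streams))
# ['Background', 'Headline', 'VideoWindow', 'Map', 'ScrollingText']
```

Tracking the chain of active streams lets the sketch reject a sub-scene that directly or indirectly inlines itself.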
