Riding the Media Bits  chiariglione.org

Inside MPEG-4 - Part A



 Last update: 2003/10/25

 

An overview of the technical content of MPEG-4 Systems.

 

Imagine that I am a publisher of technical courses. I hire professional presenters and I make videos of them while they present their slides. Then I copy the recorded lessons onto VHS cassettes and distribute them. If anything changes, for example if I want to make a version of a successful course in another language, I hire another presenter who speaks that language, I get the slides translated, and there I have another version of the course.

Of late the world has been changing. People talk about multimedia and the Internet, and this looks great because I can reach a wider audience while cutting my distribution costs. However, what I have seen so far is just a transposition of my old business of VHS cassette distribution to a newer business of file or stream distribution. Is this a sufficient improvement to motivate me to change my way of doing business? I am not sure, because when my courses cease to be carried by a cassette and leave my shop in the form of bits, I do not know what happens to them. But one day I hear of this great new thing called MPEG-4 and I decide to try it.

Tens of papers and books start out explaining MPEG-4 by using the figure below, originally contributed by Phil Chou, then with Xerox PARC. This page will be no different :-).

An MPEG-4 scene

I will use this figure to describe my example of a publisher of technical courses. The standing lady is the teacher, giving her lecture using a multimedia presentation next to a desk with a globe on it. In the example, I hire a professional presenter and I make video clips of her while she is talking, but this time I make the video while she has a blue screen as background. The blue screen is useful because I can extract just the shape of the presenter using "chroma key", a well-known technique used in television to effect composition. I record the voice of the presenter in such a way that it is easy for me to dub her voice and translate the audio-visual support material in case I want to make a multilingual edition of the course, without changing the video.

Having the teacher as a separate video sprite, I can use a professional designer to create a virtual set made up of a synthetic room with some synthetic furniture and the frame of a blackboard that I can use to display the audio-visual support material. Using the MPEG-4 object model I now have the following objects in the figure above:

  1. the teacher (video sprite)

  2. the teacher (speech)

  3. the multimedia presentation

  4. the desk

  5. the globe

  6. the background

What I have created is a typical MPEG-4 scene composed of audio-visual objects: static "objects" (e.g. the desk that stays unchanged for the duration of the lesson) and dynamic "objects" (e.g. the sprite and the accompanying voice, and the sequences of slides).

I now need an authoring tool with which I can place the sprite of the teacher anywhere I want, e.g. near the blackboard. I then store all the objects and the scene description in the MP4 File Format. The presentation may be 'local' to the system containing it, or may be delivered via a network or other stream delivery mechanism (a TransMux). The file format is designed to be independent of any particular delivery protocol while still enabling efficient support for delivery in general.

The content, possibly recorded on a DVD-ROM, is the MPEG-4 equivalent of the content on my traditional VHS cassette. Interestingly, I can still distribute the DVD-ROM using my old distribution network. In this sense MPEG-4 is just a tool that helps me produce my content more efficiently, because at any time I can change or add as many objects as I think will improve the value of my course, including new languages, new slides or an improved classroom.

Now I would like to start offering my courses to my clients via the Internet. It would be nice if I could use the web server that has so far only been used to advertise my company's courses. Web servers use HTTP, a good protocol because content transported with it can cross firewalls, but HTTP is just a file download protocol. This means that my customers would have to download the entire scene before they could start seeing anything. Fortunately there are technology providers who can supply a computer program that creates the additional information the web server requires to stream using so-called "progressive HTTP". The good thing is that, the day I decide to switch to a new transport protocol, say the Real-time Transport Protocol (RTP), I do not have to do anything to my content. I will just buy another piece of software that creates the additional information. Ditto if one day I have to broadcast my courses using MPEG-2 TS. The content itself does not require any change.

Let us now see what happens at the end user side. One of my subscribers to the course, after having completed some form of payment or authentication (something that is obviously outside of the MPEG-4 standard), can access the course. To do this, the first thing that needs to be done is to set up a session between the client and the server. This is done using DMIF (Delivery Multimedia Integration Framework), the MPEG-4 session protocol for the management of multimedia streaming. When the session with the remote side is set up, the streams that are needed for the particular lesson are selected and the DMIF client sends a request to stream them. The DMIF server returns the pointers to the connections where the streams can be found, and finally the connections are established. Then each audio-visual object is streamed using a virtual channel called Elementary Stream (ES) through the Elementary Stream Interface (ESI). The functionality provided by DMIF is expressed by the DAI (DMIF-Application Interface), as in the figure below, and translated into protocol messages. In general different networks use different protocol messages, but the DAI allows the DMIF user to specify the Quality of Service (QoS) requirements for the desired streams.

The 3 layers in the MPEG-4 stack
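
The session set-up steps described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the class `ToyDmifServer`, its methods `attach` and `add_channels`, the service name and the channel numbers are all made up for illustration and are not the DMIF protocol messages or the DAI primitives. The stream numbers follow the object list given earlier on this page.

```python
from typing import Dict, List


class ToyDmifServer:
    """Hypothetical server side; real DMIF signalling is far richer than this."""

    def __init__(self, services: Dict[str, List[int]]) -> None:
        self._services = services      # service name -> ES ids it offers
        self._next_channel = 100       # arbitrary starting channel number

    def attach(self, service: str) -> List[int]:
        """Step 1: set up a session and report which streams the service offers."""
        return list(self._services[service])

    def add_channels(self, es_ids: List[int]) -> Dict[int, int]:
        """Step 2: answer a stream request with one channel handle per ES."""
        handles = {}
        for es_id in es_ids:
            handles[es_id] = self._next_channel
            self._next_channel += 1
        return handles


server = ToyDmifServer({"lesson-1": [1, 2, 3, 4, 5, 6]})
offered = server.attach("lesson-1")
# The client selects only the streams it needs: sprite, speech and slides.
wanted = [es_id for es_id in offered if es_id in (1, 2, 3)]
channels = server.add_channels(wanted)
print(channels)
```

The point of the sketch is the order of events: first a session, then a selection of streams, then handles to the connections on which those streams will arrive.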

The "TransMux" (Transport Multiplexing) layer offers transport services matching the requested QoS. However, only the interface to this layer is specified, because the specific choice of the TransMux is left to the user. The specification of the TransMux itself is left to the bodies responsible for the relevant transport, with the obvious exception of MPEG-2 TS, for which the responsible body is MPEG itself. The second multiplexing layer is the FlexMux (this name is being changed), which allows grouping of ESs with a low multiplexing overhead. This is particularly useful when there are many ESs with similar QoS requirements, each possibly with a low bitrate. In this case it is possible to reduce the number of network connections, the transmission overhead and the end-to-end delay.
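
To make the benefit of grouping concrete, here is a minimal Python sketch that groups elementary streams with similar QoS requirements so that each group can share one connection. It does not implement the FlexMux syntax. The `StreamInfo` type, the `qos_class` labels and the bitrates are assumptions invented for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class StreamInfo:
    es_id: int
    bitrate_kbps: int    # made-up figures, for illustration only
    qos_class: str       # hypothetical label standing in for real QoS parameters


def group_for_multiplexing(streams: List[StreamInfo]) -> Dict[str, List[int]]:
    """Group ES ids by QoS class; each group would share one connection."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for stream in streams:
        groups[stream.qos_class].append(stream.es_id)
    return dict(groups)


streams = [
    StreamInfo(1, 384, "real-time"),   # teacher sprite
    StreamInfo(2, 24, "real-time"),    # teacher speech
    StreamInfo(3, 16, "reliable"),     # slides
    StreamInfo(4, 2, "reliable"),      # desk
    StreamInfo(5, 2, "reliable"),      # globe
    StreamInfo(6, 4, "reliable"),      # background
]
groups = group_for_multiplexing(streams)
print(len(streams), "streams ->", len(groups), "connections")
```

With these assumed values, six streams need only two connections instead of six.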

One special ES, the one containing the Scene Description, plays an important role. The Scene Description is a graph represented by a tree, as in the figure below, which refers to the scene used at the top of this page.

An MPEG-4 scene graph

With reference to the specific example, at the top of the graph we have the full scene with four branches: the person, the background, the furniture and the audio-visual presentation. The second branch is actually a "leaf" because there is no further subdivision, but the others are further subdivided. The object "person" is composed of two media objects: a visual object and an audio object (her voice). The object "furniture" is composed of two visual objects, the desk and the globe. The audio-visual presentation may itself be another scene. The ESs carry the information corresponding to the individual "leaves"; these are decompressed by the appropriate decoders and composed in a 3D space using information provided by the scene description.
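
The tree just described can be written down as a small data structure. The Python sketch below is an illustration only: `SceneNode` and its fields are invented names, not the actual scene description syntax, and the `es_id` numbers simply reuse the numbering of the object list given earlier on this page.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SceneNode:
    """One node of a toy scene tree; names are illustrative only."""
    name: str
    es_id: Optional[int] = None          # hypothetical ES id, set on leaves only
    children: List["SceneNode"] = field(default_factory=list)

    def leaves(self) -> List["SceneNode"]:
        """Return the media objects (leaves) below this node, left to right."""
        if not self.children:
            return [self]
        found = []
        for child in self.children:
            found.extend(child.leaves())
        return found


scene = SceneNode("scene", children=[
    SceneNode("person", children=[
        SceneNode("teacher sprite", es_id=1),
        SceneNode("teacher voice", es_id=2),
    ]),
    SceneNode("background", es_id=6),
    SceneNode("furniture", children=[
        SceneNode("desk", es_id=4),
        SceneNode("globe", es_id=5),
    ]),
    SceneNode("presentation", es_id=3),   # could itself be another scene
])

leaf_names = [leaf.name for leaf in scene.leaves()]
print(leaf_names)
```

Walking the tree yields one leaf per elementary stream, which is what a terminal needs in order to hand each stream to the appropriate decoder.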

The other important feature of the DAI is to provide a uniform interface to access multimedia contents on different delivery technologies. This means that the part of the MPEG-4 player sitting on top of the DAI is independent of the actual type of delivery: interactive networks, broadcast and local storage. This can be seen from the figure below. In the case of a remote connection via the network there is a real DMIF peer at the server, while in the local disk and broadcast access cases there is a simulated DMIF peer at the client.

DMIF independence from delivery
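
The idea of a player that does not care how the content reaches it can be sketched with an abstract interface and two interchangeable back ends. Everything named here (`Delivery`, `LocalFileDelivery`, `RemoteDelivery`, `received_bytes`) is hypothetical and much simpler than the DAI itself. The "network" is simulated with an in-memory dictionary.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List


class Delivery(ABC):
    """Hypothetical interface standing in for what the player sees through the DAI."""

    @abstractmethod
    def open_stream(self, es_id: int) -> Iterator[bytes]:
        """Return the chunks of one elementary stream, in order."""


class LocalFileDelivery(Delivery):
    """Simulated peer on the client: data already sits in local storage."""

    def __init__(self, stored: Dict[int, List[bytes]]) -> None:
        self._stored = stored

    def open_stream(self, es_id: int) -> Iterator[bytes]:
        return iter(self._stored[es_id])


class RemoteDelivery(Delivery):
    """Stand-in for a real remote peer; the 'network' is an in-memory dict."""

    def __init__(self, server_side: Dict[int, List[bytes]]) -> None:
        self._server_side = server_side

    def open_stream(self, es_id: int) -> Iterator[bytes]:
        for chunk in self._server_side[es_id]:
            yield chunk            # a real client would receive this from a socket


def received_bytes(delivery: Delivery, es_id: int) -> int:
    """Player-side code: identical whatever the delivery technology is."""
    return sum(len(chunk) for chunk in delivery.open_stream(es_id))


data = {1: [b"abc", b"de"]}
local_total = received_bytes(LocalFileDelivery(data), 1)
remote_total = received_bytes(RemoteDelivery(data), 1)
print(local_total, remote_total)
```

The function `received_bytes` plays the role of the part of the player sitting on top of the interface: it is written once and works unchanged with either back end.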

In the same way as MPEG-1 and MPEG-2 describe the behaviour of an idealised decoding device along with the bitstream syntax and semantics, MPEG-4 defines a System Decoder Model (SDM). The purpose is to define precisely the operation of the terminal without unnecessary assumptions about implementation details that may depend on the specific environment. As an example there may be devices receiving MPEG-4 streams over isochronous networks, while others will use non-isochronous means (e.g. the Internet). The specification of a buffer and timing model is essential to design encoding devices that may be unaware of what the terminal device is or how it will receive the encoded stream. Each stream carrying media objects is characterised by a set of descriptors for configuration information, e.g. to determine the precision of encoded timing information. The descriptors may carry "hints" to the QoS required for transmission (e.g. maximum bit rate, bit error rate, priority, etc.). 

ESs are subdivided into Access Units (AU). Each AU is time stamped for the purpose of ES synchronisation. The synchronisation layer manages the identification of such AUs and the time stamping. ESs coming from the demultiplexing function are stored in Decoding Buffers (DB) and the individual Media Object Decoders (MOD) read the data from there. The Elementary Stream Interface (ESI) is located between DBs and MODs. See the figure below.

The MPEG-4 decoder model
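
A rough Python sketch of the role of time-stamped Access Units and Decoding Buffers follows. It assumes one buffer per elementary stream and a single time stamp per AU expressed in milliseconds. The class names, the `pop_due` method and the time values are illustrative assumptions, not the normative buffer and timing model.

```python
import heapq
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(order=True)
class AccessUnit:
    timestamp_ms: int                                  # the only field used for ordering
    es_id: int = field(compare=False)
    payload: bytes = field(compare=False, default=b"")


class DecodingBuffer:
    """Toy buffer for one ES; hands AUs to the decoder in time stamp order."""

    def __init__(self) -> None:
        self._heap: List[AccessUnit] = []

    def push(self, au: AccessUnit) -> None:
        heapq.heappush(self._heap, au)

    def pop_due(self, now_ms: int) -> List[AccessUnit]:
        """Remove and return every AU whose time stamp is not later than now_ms."""
        due = []
        while self._heap and self._heap[0].timestamp_ms <= now_ms:
            due.append(heapq.heappop(self._heap))
        return due


buffers: Dict[int, DecodingBuffer] = {1: DecodingBuffer(), 2: DecodingBuffer()}
arrivals = [AccessUnit(40, 1), AccessUnit(0, 1), AccessUnit(0, 2), AccessUnit(20, 2)]
for au in arrivals:                                    # AUs may arrive out of order
    buffers[au.es_id].push(au)

due = {es_id: [au.timestamp_ms for au in buf.pop_due(now_ms=20)]
       for es_id, buf in buffers.items()}
print(due)
```

Because both streams are read against the same clock value, the decoders for the sprite and for the voice stay in step even if the network delivered their AUs in a different order.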

The functions of an MPEG-4 decoder are represented in the figure below.

Functions of an MPEG-4 decoder model

Depending on the viewpoint selected by the user, the 3D space generated by the MPEG-4 decoder is projected onto a 2D plane and rendered: the visual part of the scene is displayed on the screen and the audio part is played through the loudspeakers. The user can hear the lesson and view the presentation in the language of his choice by interacting with the content. This interaction can be separated into two major categories: client-side interaction and server-side interaction. Client-side interaction involves locally handled content manipulation, and can take several forms. In particular, the modification of an attribute of a scene description node, e.g. changing the position of an object, making it visible or invisible, changing the font size of a synthetic text node, etc., can be implemented by translating user events, such as mouse clicks or keyboard commands, to scene description updates. The MPEG-4 terminal can process the commands in exactly the same way as if they had been embedded in the content. Other interactions require sending commands to the source of information using the upstream data channel.
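
Client-side interaction as described above can be sketched as follows. The scene is reduced to a dictionary of node attributes, and the event names, the `SceneUpdate` type and the functions are all invented for the example. What the sketch tries to show is only that updates embedded in the content and updates derived from user events pass through the same code path.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

Scene = Dict[str, Dict[str, Any]]


@dataclass(frozen=True)
class SceneUpdate:
    node: str
    attribute: str
    value: Any


def apply_update(scene: Scene, update: SceneUpdate) -> None:
    """Single entry point, used for embedded and user-generated updates alike."""
    scene[update.node][update.attribute] = update.value


def translate_event(event: str) -> Optional[SceneUpdate]:
    """Map a (hypothetical) user event name to a scene description update."""
    bindings = {
        "key:H": SceneUpdate("globe", "visible", False),
        "key:+": SceneUpdate("slide_text", "font_size", 18),
    }
    return bindings.get(event)


scene: Scene = {
    "globe": {"visible": True},
    "slide_text": {"font_size": 12},
}

# An update that arrived embedded in the content.
apply_update(scene, SceneUpdate("slide_text", "font_size", 14))

# An update produced locally from a keyboard command.
update = translate_event("key:H")
if update is not None:
    apply_update(scene, update)
print(scene)
```

An event with no binding, such as an unassigned key, simply produces no update and leaves the scene untouched.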

Let me once more play the publisher of courses. I have succeeded in entering the business of selling content on the web, but one day I discover that my content can be found on the web for people to enjoy without getting it from me. I had heard of MP3, but I did not expect that MP4 would mean that my content would be similarly exposed to the dangers of piracy. I turn to my technology provider of choice and I learn that MPEG-4 can provide solutions for protection of my Intellectual Property Rights (IPR).

A first level of content management is achieved by adding the Intellectual Property Identification (IPI) data set to the coded media objects. This carries information about the content, type of content and (pointers to) rights holders, e.g. myself or other people from whom I may have acquired the right to use content. The mechanism provides a registration number similar to the well established International Standard Recording Code (ISRC) used in CD Audio. For some parts of my content I am quite happy to let users freely exchange information, provided it is known that I am the rights holder, but for other parts of my content, the information has great value to me so that I need higher-grade technology for management and protection.

Fortunately MPEG-4 has standardised the MPEG-4 IPMP interface allowing the design and use of domain-specific IPMP Systems (IPMP-S). This interface consists of IPMP-Descriptors (IPMP-D) and IPMP-Elementary Streams (IPMP-ES) that provide a communication mechanism between IPMP-Ss and the MPEG-4 terminal. When MPEG-4 objects require management and protection, they have IPMP-Ds associated with them to indicate which IPMP-Ss are to be used and provide information about how to manage and protect the content. It is to be noted that, unlike MPEG-2 where a single IPMP system is used at a time, in MPEG-4 different streams may require different IPMP-Ss. The figure below represents these concepts.

The MPEG-4 IPMP model
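
The association between streams, descriptors and protection systems can be pictured with a small Python sketch. It is a toy under loudly stated assumptions: `IpmpDescriptor`, the system identifiers `toy-xor` and `toy-reverse`, and the two "protection" functions are invented for illustration, and neither function offers any real security. The only point is that each stream's descriptor names the system to use, and that different streams may name different systems.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class IpmpDescriptor:
    system_id: str       # which IPMP System handles the stream (made-up ids)
    key: bytes           # opaque data passed to that system; a toy key here


def xor_system(payload: bytes, key: bytes) -> bytes:
    """Toy 'protection' for illustration only; XOR is not real security."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(payload))


def reverse_system(payload: bytes, key: bytes) -> bytes:
    """A second toy system, to show that streams can use different ones."""
    return payload[::-1]


IPMP_SYSTEMS: Dict[str, Callable[[bytes, bytes], bytes]] = {
    "toy-xor": xor_system,
    "toy-reverse": reverse_system,
}


def unprotect(payload: bytes, descriptor: Optional[IpmpDescriptor]) -> bytes:
    """Terminal side: route the payload to the system named by its descriptor."""
    if descriptor is None:
        return payload                     # stream is not protected
    system = IPMP_SYSTEMS[descriptor.system_id]
    return system(payload, descriptor.key)


descriptors = {
    1: IpmpDescriptor("toy-xor", b"\x2a"),
    2: IpmpDescriptor("toy-reverse", b""),
}
protected = {1: xor_system(b"video", b"\x2a"), 2: b"oidua", 4: b"desk"}
clear = {es_id: unprotect(data, descriptors.get(es_id))
         for es_id, data in protected.items()}
print(clear)
```

Stream 4 (the desk) has no descriptor in this example and is passed through unchanged, which corresponds to the parts of the content that I am happy to leave unprotected.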

MPEG-4 IPMP is a powerful mechanism. As an example, it allows me to "buy" from a third party the right to use certain content that is already in protected form.

Another feature that I find useful to make my content more interesting is to add programmatic content to the scene. The technology used is called MPEG-J, a programmatic system (as opposed to the purely declarative system that I have used so far) that specifies APIs to enable Java code to manage the operation of the MPEG-4 player. By combining MPEG-4 media and executable code, I can now achieve functionalities that would be cumbersome to achieve just with the declarative part of the standard (see figure below). 

MPEG-J model

The lower half of this drawing represents the parametric MPEG-4 Systems player, also referred to as the Presentation Engine. The MPEG-J subsystem controlling the Presentation Engine, also referred to as the Application Engine, is depicted in the upper half of the figure. The Java application is delivered as a separate elementary stream to the MPEG-J run time environment of the MPEG-4 terminal, from where the MPEG-J program has access to the various components and data of the MPEG-4 player.

 

 


 

Copyright © 2003 chiariglione.org