Riding the Media Bits  chiariglione.org

Inside MPEG-4 - Part B



 Last update: 2003/10/25

 

An overview of the technical content of the other MPEG-4 components.

 

MPEG-4 Visual provides a natural video coding algorithm that operates from 5 kbit/s at QCIF spatial resolution (144x176 pixels) up to bitrates of a few Mbit/s for ITU-R 601 resolution pictures (288x720@50 Hz and 240x720@59.94 Hz). The Studio Profile extends the operating range beyond 1 Gbit/s. MPEG-4 Visual is ITU-T H.263 compatible in the sense that a basic H.263 bitstream is correctly decoded by an MPEG-4 Video decoder. In addition, MPEG-4 has a so-called Fine Granularity Scalability (FGS) mode that allows transmission of the same video content at different bitrates.
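One way to picture fine-grained scalability is bit-plane coding of an enhancement layer: the most significant bits of each residual value are sent first, so the stream can be cut after any plane and still be decoded at reduced precision. The Python sketch below only illustrates that idea. The function names and the sample residual values are hypothetical, and it does not reproduce the normative FGS syntax.

```python
def bitplanes(residuals, num_planes):
    """Split non-negative integer residuals into bit-planes, most significant first."""
    return [[(r >> p) & 1 for r in residuals]
            for p in range(num_planes - 1, -1, -1)]


def reconstruct(planes, num_planes, length):
    """Rebuild residuals from however many leading bit-planes were received."""
    values = [0] * length
    for i, plane in enumerate(planes):
        shift = num_planes - 1 - i  # weight of this plane
        values = [v | (bit << shift) for v, bit in zip(values, plane)]
    return values


residuals = [13, 6, 0, 9]  # hypothetical enhancement-layer magnitudes
planes = bitplanes(residuals, 4)

# Truncating the stream after 1, 2, 3 or 4 planes refines the result step by step.
for received in range(1, 5):
    print(received, reconstruct(planes[:received], 4, len(residuals)))
# 1 [8, 0, 0, 8]
# 2 [12, 4, 0, 8]
# 3 [12, 6, 0, 8]
# 4 [13, 6, 0, 9]
```

Each extra plane received halves the worst-case error, which is why the same coded content can be served at many different bitrates.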

As mentioned before, MPEG-4 Video supports conventional rectangular images and video (upper portion of figure below) as well as images and video of arbitrary shape (lower portion of figure below).

The MPEG-4 Video Core and the Generic MPEG-4 Coder

The coding of conventional images and video is similar to conventional MPEG-1/2 coding: it involves motion prediction/compensation followed by texture coding. For content-based functionalities, where the input image sequence may be of arbitrary shape and location, shape and transparency information is encoded as well. Shape may be represented either by an 8-bit transparency component, which allows the description of transparency when one VO is composed with other objects, or by a binary mask.
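The two shape representations can be illustrated with a small composition sketch in Python. It assumes single-channel pixel values in the range 0 to 255 stored as lists of rows. The function names are hypothetical, and the blending formula is the usual alpha blend rather than anything mandated by the standard.

```python
def compose(foreground, alpha, background):
    """Blend a video object over a background using an 8-bit alpha plane.

    alpha = 255 means fully opaque object, alpha = 0 means fully transparent.
    """
    out = []
    for fg_row, a_row, bg_row in zip(foreground, alpha, background):
        out.append([(a * fg + (255 - a) * bg + 127) // 255  # rounded blend
                    for fg, a, bg in zip(fg_row, a_row, bg_row)])
    return out


def binary_to_alpha(mask):
    """A binary mask is the special case where alpha is either 0 or 255."""
    return [[255 if m else 0 for m in row] for row in mask]


fg = [[200, 200, 200]]
bg = [[100, 100, 100]]

print(compose(fg, [[255, 128, 0]], bg))              # [[200, 150, 100]]
print(compose(fg, binary_to_alpha([[1, 0, 1]]), bg))  # [[200, 100, 200]]
```

With the 8-bit component a pixel can be partially transparent, while the binary mask only says whether a pixel belongs to the object.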

The basic coding structure is represented in the figure below. This involves shape coding (for arbitrarily shaped VOs) and motion compensation as well as DCT-based texture coding (using standard 8x8 DCT or shape adaptive DCT).
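As an illustration of the texture-coding step, the following Python sketch computes a plain 8x8 two-dimensional DCT (type II, orthonormal scaling) of one block. It covers the transform alone: quantisation, entropy coding and the shape-adaptive variant are left out, and the function name is hypothetical.

```python
import math

N = 8  # block size


def dct_2d(block):
    """Direct (unoptimised) 8x8 DCT-II with orthonormal scaling."""
    def scale(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    coeffs = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            coeffs[u][v] = scale(u) * scale(v) * total
    return coeffs


# A flat block has all of its energy in the DC coefficient.
flat = [[128] * N for _ in range(N)]
coeffs = dct_2d(flat)
print(round(coeffs[0][0], 6))  # 1024.0
```

For smooth image areas most coefficients other than the first few are close to zero, which is what makes the subsequent coding stages efficient.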

The MPEG-4 Video coding scheme

MPEG-4 Video can offer unexpectedly high compression ratios if it is possible to exploit a-priori knowledge of the scene. In the figure below, coding the top left picture as it is would require a considerable amount of information but, if the background and the sprite (top right) can be separated, the picture at the bottom can be coded with relatively few bit/s.

Background and sprites in MPEG-4 Video
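The saving comes from sending the large, static part of the scene once and then describing each picture with very little data. The Python sketch below assumes a hypothetical setup in which a wide background is already available at the decoder and each picture is described only by a viewing-window position plus a small, separately coded object with a binary mask. The names and the data layout are illustrative and are not taken from the standard.

```python
def render_frame(background, window_x, width, obj, obj_mask, obj_x, obj_y):
    """Crop a view from a wide background and paste a masked object onto it."""
    # The per-frame background needs only the window position, not new pixels.
    frame = [row[window_x:window_x + width] for row in background]
    for dy, (o_row, m_row) in enumerate(zip(obj, obj_mask)):
        for dx, (pixel, opaque) in enumerate(zip(o_row, m_row)):
            if opaque:
                frame[obj_y + dy][obj_x + dx] = pixel
    return frame


background = [[0, 1, 2, 3, 4, 5],
              [10, 11, 12, 13, 14, 15]]

frame = render_frame(background, window_x=2, width=3,
                     obj=[[99]], obj_mask=[[1]], obj_x=1, obj_y=0)
print(frame)  # [[2, 99, 4], [12, 13, 14]]
```

Moving the window or the object between pictures changes only a few numbers, whereas coding every picture from scratch would resend the whole background each time.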

The ‘facial animation object’ can be used to render an animated face. The face object contains a generic face with a neutral expression, which can be rendered as such. The shape, texture and expressions of the face are controlled by Facial Definition Parameters (FDP) and/or Facial Animation Parameters (FAP).

Face Definition Parameters

Upon receiving the animation parameters from the bitstream, the face can be animated (expressions, speech, etc.). FDPs can be sent to change the appearance of the face from something generic to a particular face with its own shape and texture. If so desired, a complete face model can be downloaded via the FDP set. Face models themselves are not mandated by the standard. It is also possible to specify particular configurations of the lips and the mood of the speaker.

The Body is an object capable of producing virtual body models and animations in the form of a set of 3D polygonal meshes ready for rendering. Two sets of parameters are defined for the body: the Body Definition Parameter (BDP) set and the Body Animation Parameter (BAP) set. The BDP set defines the parameters that transform the default body into a customised body with its own body surface, body dimensions and (optionally) texture. The BAPs produce reasonably similar high-level results in terms of body posture and animation on different body models.

MPEG-4 provides standard technology for efficient coding of generic 3D polygonal meshes. The scheme offers features that are useful in applications: incremental representation, which lets a decoder reconstruct a number of faces in a mesh proportional to the number of bits of the bitstream processed so far; error resilience, which lets a decoder partially recover a mesh when subsets of the bitstream are missing and/or corrupted; and Level Of Detail (LOD) scalability, which lets a decoder reconstruct, from a subset of the bitstream, a simplified version of the original mesh containing a reduced number of vertices.
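The incremental property can be pictured with a toy decoder in Python that builds as much of a mesh as the received part of a stream allows. The record format used here (a list of "v" vertex records and "f" face records) is purely hypothetical and much simpler than the actual MPEG-4 mesh bitstream. It only shows how the number of reconstructed faces grows with the amount of data processed.

```python
def decode_prefix(records, count):
    """Reconstruct the vertices and faces available after `count` records."""
    vertices, faces = [], []
    for kind, data in records[:count]:
        if kind == "v":
            vertices.append(data)
        elif kind == "f" and all(i < len(vertices) for i in data):
            # Keep a face only if every vertex it refers to has arrived.
            faces.append(data)
    return vertices, faces


# Hypothetical stream describing two triangles that share an edge.
stream = [
    ("v", (0.0, 0.0, 0.0)),
    ("v", (1.0, 0.0, 0.0)),
    ("v", (1.0, 1.0, 0.0)),
    ("f", (0, 1, 2)),
    ("v", (0.0, 1.0, 0.0)),
    ("f", (0, 2, 3)),
]

for received in (3, 4, 6):
    vertices, faces = decode_prefix(stream, received)
    print(received, len(vertices), len(faces))
# 3 3 0
# 4 3 1
# 6 4 2
```

A decoder that stops after four records can already render one triangle, and the full mesh appears once the whole stream has been processed.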

MPEG-4 Audio provides complete coverage of the bitrate range from 2 to 64 kbit/s. Good-quality coded speech is already obtained at 2 kbit/s, and transparent quality of monophonic music sampled at 48 kHz and 16 bits/sample is obtained at 64 kbit/s. Three classes of algorithms are used in the standard. The first covers the low bitrate range and is designed to encode speech. The second can be used in the midrange to encode both speech and music. The third covers the high bitrate range and can be used for any audio signal.

In the area of synthetic audio two important technologies are available. The first is a Text To Speech (TTS) Interface (TTSI), i.e. a standard way to represent prosodic parameters, such as pitch contour, phoneme duration, and so on. Typically these can be used in a proprietary TTS system to improve the synthesised speech quality and to create, with the synthetic face, a complete audio-visual talking face. The TTS can also be synchronised with the facial expressions of an animated talking head as in the figure below. 

TTS-driven Face Animation

The second technology provides a rich toolset for creating synthetic sounds and music, called Structured Audio (SA). Using newly developed formats to specify synthesis algorithms and their control, any current or future sound-synthesis technique can be used to create and process sound in MPEG-4. The sound quality is guaranteed to be exactly the same on every MPEG-4 decoder. 

 

 


 

Copyright © 2003 chiariglione.org