INTERNATIONAL ORGANISATION FOR STANDARDISATION
ORGANISATION INTERNATIONALE DE NORMALISATION
ISO/IEC JTC1/SC29/WG11
CODING OF MOVING PICTURES AND AUDIO
ISO/IEC JTC1/SC29/WG11 N7297
Poznań, Poland, July 2005
Title: Introduction to MPEG-4 Video (rectangular)
Source: Video Subgroup
Editor: Jens-Rainer Ohm
Status: Approved
ISO/IEC 14496-2 specifies a video codec which allows efficient compression of rectangular (frame-based) video. Support is given for manifold applications, ranging from extremely low rates and resolutions as required by mobile video transmission up to high rates, resolutions and fidelity as applicable in the field of professional production. Additional functionalities such as scalability and error resilience are supported as well.
The video coding algorithm used for frame-based video is based on proven technology from previous standards (e.g. block-based motion compensation, DCT); tools for functionalities such as scalability and error resilience are specified as well. MPEG-4 defines encoding of video objects, which in this case have rectangular shape. Several objects can be combined in a scene, e.g. for picture-in-picture overlays. A video object relates to the syntax and semantics of a Video Object Plane (VOP), which in this case corresponds to a rectangular frame area. Motion-compensated prediction processing is defined for B-VOPs (bi-directionally predicted VOPs) and P-VOPs (unidirectionally predicted VOPs). Within an I-VOP (intraframe-encoded VOP), no reference is made to other VOPs; instead of interframe prediction, an I-VOP uses prediction within the same VOP of the DC transform coefficients and of the first row or column of AC transform coefficients.
The following formats and bit rates are supported by MPEG-4 Visual:
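The intra prediction of DC coefficients mentioned above can be illustrated with a minimal sketch. The direction rule used here (compare the gradients among the left, above-left and above neighbouring blocks and predict along the smaller one) is the commonly described rule and is stated as an assumption, not quoted from this text. The function names are hypothetical, and quantiser scaling as well as the handling of unavailable neighbours are omitted.

```python
def predict_dc(left, above_left, above):
    """Return (predictor, direction) for the DC coefficient of the current block.

    left, above_left and above are the already decoded DC values of the
    neighbouring blocks. Assumed rule: if the values change little between
    above_left and left, predict from the block above; otherwise predict
    from the block to the left.
    """
    if abs(left - above_left) < abs(above_left - above):
        return above, "vertical"      # predictor taken from the block above
    return left, "horizontal"         # predictor taken from the block to the left


def dc_residual(current_dc, left, above_left, above):
    """Difference an encoder would code instead of the raw DC value."""
    predictor, direction = predict_dc(left, above_left, above)
    return current_dc - predictor, direction


# Hypothetical neighbourhood: small change down the left column, so the
# block above is chosen as predictor.
print(dc_residual(58, left=100, above_left=100, above=60))
```

In this sketch only the small residual is passed on to entropy coding. The same chosen direction would also decide whether the first row or the first column of AC coefficients is predicted.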
· Bit rates: typically between 5 kbit/s and more than 1 Gbit/s
· Progressive as well as interlaced video
· Different color sampling formats (including 4:2:0, 4:2:2 and 4:4:4)
· Resolutions: from sub-QCIF to 'Studio' resolutions (4k x 4k pixels)
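To give a feeling for the compression required at the low end of this range, the uncompressed bit rate of a small format can be estimated. The following sketch assumes QCIF (176x144 luma samples), 8 bits per sample, 15 frames per second and a target of 64 kbit/s; the frame rate and the target rate are illustrative choices, not values mandated by the standard.

```python
# Average number of samples per pixel for each colour sampling format
# (one luma sample plus the subsampled share of two chroma samples).
SAMPLES_PER_PIXEL = {"4:2:0": 1.5, "4:2:2": 2.0, "4:4:4": 3.0}


def raw_bit_rate(width, height, fps, chroma="4:2:0", bit_depth=8):
    """Uncompressed bit rate in bit/s for the given video format."""
    return width * height * SAMPLES_PER_PIXEL[chroma] * bit_depth * fps


qcif_raw = raw_bit_rate(176, 144, fps=15)   # assumed: QCIF, 15 frames/s
target = 64_000                             # assumed target rate in bit/s
print(f"raw: {qcif_raw / 1e6:.2f} Mbit/s, ratio: {qcif_raw / target:.1f}:1")
```

Under these assumptions the raw rate is about 4.56 Mbit/s, so a 64 kbit/s channel requires a compression ratio of roughly 71:1.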
New compression tools are defined to improve the compression efficiency over the previous standards MPEG-1 and MPEG-2, and to support high compression performance for all addressed bit rates. This includes the compact coding of textures with a quality adjustable from "acceptable" at very high compression ratios up to "near lossless". The basic compression algorithm is hybrid coding (a combination of motion-compensated prediction and scalar-quantized DCT coefficient coding). Specific tools include:
- Quarter-pixel accuracy and variable block size (8x8 or 16x16) can be used in motion compensation;
- Global motion compensation, which makes it possible to express e.g. the effect of camera motion using only a small number of parameters;
- Different VLC tables can be selected where the codes are designed for more efficient encoding at ranges of lower or higher rates; the choice is controlled by the encoder and can depend on the target rate;
- In the short header mode, bitstream-level compatibility with the H.263 baseline syntax is realized;
- The direct mode can determine the motion vectors within B-VOPs by inference from the co-located P-VOP motion vectors without rate overhead;
- For high-quality studio storage and inter-studio transmission applications, a different method of DCT coefficient encoding is introduced[1]. This is based on grouping of DCT coefficients by similar amplitude values instead of the conventional zig-zag run-length and level combination entropy-coding scheme. In this feature, a recursive selection of VLC tables is applied to groups of coefficients, where the selection function relies on previously coded groups. Coded data are the group indicator and a fixed-length code determining the actual coded value.
- For high-quality studio storage, a lossless coding method based on switching between DPCM and PCM is defined.
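The conventional zig-zag run-length and level scheme referred to above can be sketched as follows. This is a simplified illustration, assuming plain (run, level) pairs on an already quantized block: trailing zeros are simply dropped, end-of-block signalling and the actual VLC tables are omitted, and the function names are hypothetical.

```python
def zigzag_order(n=8):
    """Return the (row, column) positions of an n x n block in zig-zag order."""
    order = []
    for d in range(2 * n - 1):                     # d = row + column (anti-diagonal)
        rows = range(max(0, d - n + 1), min(d, n - 1) + 1)
        if d % 2 == 0:                             # even diagonals run bottom-left to top-right
            rows = reversed(rows)
        order.extend((r, d - r) for r in rows)
    return order


def run_level_pairs(block):
    """Convert a quantized block into (run of zeros, non-zero level) pairs."""
    pairs, run = [], 0
    for r, c in zigzag_order(len(block)):
        level = block[r][c]
        if level == 0:
            run += 1
        else:
            pairs.append((run, level))
            run = 0
    return pairs


# Hypothetical 4x4 block of quantized coefficients (a real block is 8x8).
block = [
    [12, 3, 0, 0],
    [-2, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
print(run_level_pairs(block))   # [(0, 12), (0, 3), (0, -2), (5, 1)]
```

Because the scan visits low-frequency positions first, the non-zero levels cluster at the start and the long zero runs at the end, which is what makes the subsequent variable-length coding efficient.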
Complexity scalability in the encoder allows encoders of different complexity to generate valid and meaningful bitstreams for a given texture, image or video. Complexity scalability in the decoder allows a given texture, image or video bitstream to be decoded by decoders of different levels of complexity. The reconstructed quality, in general, is related to the complexity of the decoder used. This may entail that less powerful decoders decode only a part of the bitstream. More specific scalability tools defined in the video codec are as follows:
- Spatial scalability allows decoders to decode a subset of the total bitstream generated by the encoder to reconstruct and display textures, images and video objects at reduced spatial resolution.
- Temporal scalability allows decoders to decode a subset of the total bitstream generated by the encoder to reconstruct and display video at reduced temporal resolution. A maximum of three levels are supported.
- Fidelity scalability (also called SNR scalability) allows a bitstream to be parsed into a number of bitstream layers of different bit rate such that the combination of a subset of the layers can still be decoded into a meaningful signal with the same spatial and temporal resolution but lower fidelity. The bitstream parsing can occur either during transmission or in the decoder. The reconstructed quality, in general, is related to the number of layers used for decoding and reconstruction.
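The principle of fidelity scalability can be illustrated with a two-layer sketch: a coarsely quantized base layer plus an enhancement layer that carries the finely quantized residual. This is an assumption-laden toy model (hypothetical function names and step sizes, no entropy coding or prediction), not the layering scheme defined by the standard.

```python
def quantize(values, step):
    """Uniform scalar quantization to integer indices."""
    return [round(v / step) for v in values]


def dequantize(indices, step):
    """Reconstruct values from quantization indices."""
    return [i * step for i in indices]


def encode_layers(coeffs, base_step=16, enh_step=4):
    """Split coefficients into a coarse base layer and a refinement layer."""
    base = quantize(coeffs, base_step)
    base_rec = dequantize(base, base_step)
    residual = [c - b for c, b in zip(coeffs, base_rec)]
    return base, quantize(residual, enh_step)


def decode_layers(base, enh=None, base_step=16, enh_step=4):
    """Decode the base layer alone, or base plus enhancement if available."""
    rec = dequantize(base, base_step)
    if enh is not None:
        rec = [r + e for r, e in zip(rec, dequantize(enh, enh_step))]
    return rec


coeffs = [100, -37, 9, 0]                    # hypothetical transform coefficients
base, enh = encode_layers(coeffs)
print(decode_layers(base))                   # base layer only: [96, -32, 16, 0]
print(decode_layers(base, enh))              # both layers:     [100, -36, 8, 0]
```

A decoder that receives only the base layer still obtains a meaningful signal at the same resolution; adding the enhancement layer reduces the reconstruction error.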
Error resilience allows accessing images and video over a wide range of storage and transmission media. This includes the useful operation of image and video compression algorithms in error-prone environments at low bit rates (i.e., less than 64 kbit/s). There are tools that address both the band-limited nature and the error resilience aspects of access over wireless networks. Specific tools for error resilience include:
- Resync markers can be embedded at different points of the bitstream, down to the level of a video packet (also called a slice). A video packet is a unit containing a variable, encoder-defined number of macroblocks, each of which covers a 16x16 picture region. The slice header then contains the position information necessary to recover after data losses and to restart the decoding process at the correct position in the decoded picture.
- Data partitioning allows separating the motion data (for which a loss would be quite critical, as the reconstructed image may otherwise appear severely geometrically distorted) from the less important texture data (DCT coefficients).
- Reversible variable-length codes allow reconstructing the DCT coefficient information backwards from a resynchronization point in the case of data errors or losses.
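The idea behind reversible variable-length codes can be shown with a toy code table. The table below is hypothetical and is not one of the tables defined by the standard; it is chosen so that every codeword is a palindrome and no codeword is a prefix or a suffix of another, which allows the same table to be used in both directions.

```python
# Hypothetical symmetric code: prefix-free and suffix-free, all palindromes.
CODE = {"a": "0", "b": "11", "c": "101", "d": "1001"}
DECODE = {bits: symbol for symbol, bits in CODE.items()}


def encode(symbols):
    """Concatenate the codewords of the given symbols."""
    return "".join(CODE[s] for s in symbols)


def decode_forward(bits):
    """Decode from the start of the bit string."""
    out, word = [], ""
    for b in bits:
        word += b
        if word in DECODE:
            out.append(DECODE[word])
            word = ""
    return out


def decode_backward(bits):
    """Decode from the end of the bit string (e.g. from the next resync point)."""
    # Reversing the stream reverses each codeword; palindromes stay unchanged,
    # so the forward decoder can be reused and its output order flipped back.
    return decode_forward(bits[::-1])[::-1]


bits = encode("abcdab")
print(decode_forward(bits))        # ['a', 'b', 'c', 'd', 'a', 'b']
print(decode_backward(bits[-3:]))  # tail recovered without the damaged start: ['a', 'b']
```

If the beginning of a packet is corrupted, a decoder of this kind can still start at the following resynchronization point and recover symbols from the end of the packet towards the damaged region.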
The concept of profiles and levels is implemented in MPEG-4 to define conformance points of decoder configurations. Below the level of profiles, MPEG-4 defines object types. These are combinations of tools (basic coding methods such as B-VOPs, interlaced coding etc.) necessary to support a selected set of applications.
One element of a profile specification is the object type, referring to a syntactical structure supporting a specific set of tools. Often the profiles and object types within them are specified in a hierarchical fashion, providing a well-defined structure for differing varieties of capability within relevant application environments. Object types related to rectangular video are as follows:
- Simple and Simple Scalable: Support for only rectangular VOPs, no B-VOPs, half-pixel accuracy of motion compensation, tools for error resilience. Simple Scalable further allows spatial and temporal scalability (including B-VOPs). The Simple Profile is particularly suitable for applications on mobile networks, such as UMTS and IMT2000. Its scalable extension is useful for applications which provide services at more than one level of quality due to bit-rate or decoder resource limitations, such as Internet use and software decoding.
- Advanced Simple: A superset of the Simple object type, allowing B-VOPs, quarter-pixel accuracy of motion compensation, global motion compensation, and interlaced video coding tools. It is used in applications where higher compression performance is required than that provided by the Simple object type.
- Simple Studio and Core Studio: These are defined specifically for high resolution and high quality in applications of studio production and materials exchange. Studio-typical color sampling formats such as 4:4:4 and amplitude resolutions of up to 12 bits are supported. Additional tools useful for production are included, such as lossless coding, sprite coding and multiple alpha channels for auxiliary data.
- Advanced Real-Time Simple: Invokes additional error-resilience functionality, such as encoder/decoder re-synchronization in case of transmission errors, and resolution reduction. It is suitable for real-time coding applications such as videophone, teleconferencing and remote observation.
- Error Resilient Simple Scalable: A superset of the Simple Scalable object type with additional error resilience tools, in particular resynchronization mechanisms for the enhancement layer.
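The hierarchical relation between some of these object types can be modelled as sets of tools. The sketch below is only an illustration of the descriptions above: the tool labels and function names are hypothetical, the tool lists are not exhaustive, and only object types whose superset relations are stated in the text are included.

```python
# Tool sets per object type, following the textual descriptions above.
SIMPLE = frozenset({"rectangular_vop", "half_pel_mc", "error_resilience"})
SIMPLE_SCALABLE = SIMPLE | {"spatial_scalability", "temporal_scalability", "b_vop"}
ADVANCED_SIMPLE = SIMPLE | {"b_vop", "quarter_pel_mc", "global_mc", "interlace"}
ER_SIMPLE_SCALABLE = SIMPLE_SCALABLE | {"enhancement_layer_resync"}

OBJECT_TYPES = {
    "Simple": SIMPLE,
    "Simple Scalable": SIMPLE_SCALABLE,
    "Advanced Simple": ADVANCED_SIMPLE,
    "Error Resilient Simple Scalable": ER_SIMPLE_SCALABLE,
}


def supports(object_type, required_tools):
    """True if a decoder for object_type offers every tool the stream needs."""
    return set(required_tools) <= OBJECT_TYPES[object_type]


def is_superset(a, b):
    """True if object type a includes all tools of object type b."""
    return OBJECT_TYPES[a] >= OBJECT_TYPES[b]


print(is_superset("Advanced Simple", "Simple"))             # True
print(supports("Simple", {"b_vop"}))                        # False
print(supports("Advanced Simple", {"b_vop", "global_mc"}))  # True
```

In such a model, a stream that uses only Simple tools is decodable by any of the listed supersets, whereas a stream using B-VOPs is not decodable by a Simple decoder.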
Subsequent to the third edition of the standard text, which was published in 2004, the following corrigenda and amendments are an integral part of the MPEG-4 Visual specification:
- ISO/IEC 14496-2:2004/Cor.1:2004
- ISO/IEC 14496-2:2004/Amd.1:2004 (Error resilient simple scalable profile)
- ISO/IEC 14496-2:2004/Amd.2:2005 (New levels in simple profile)
- ISO/IEC 14496-2:2004/Cor.2:200X (in preparation)
Frame-based MPEG-4 Video is a format that is used for efficient storage of video content and for video streaming over the Internet and mobile networks, as well as for professional applications such as video storage in studios.