INTERNATIONAL ORGANISATION FOR STANDARDISATION
ORGANISATION INTERNATIONALE DE NORMALISATION
ISO/IEC JTC1/SC29/WG11
CODING OF MOVING PICTURES AND AUDIO
ISO/IEC JTC1/SC29/WG11 N7298
Poznań, Poland, July 2005
Title: Introduction to MPEG-4 Video (arbitrary shape)
Source: Video Subgroup
Editor: Jens-Rainer Ohm
Status: Approved
ISO/IEC 14496-2 specifies the coded representation of picture information in the form of natural or synthetic visual objects such as video sequences of rectangular or arbitrarily shaped pictures, moving 2D meshes, animated 3D face and body models, and texture for synthetic objects. The coded representation allows for content-based access for digital storage media, digital video communication, and other applications. The representation supports constant bit rate transmission, variable bit rate transmission, robust transmission, content-based random access (including normal random access), object-based scalable decoding (including normal scalable decoding), object-based bitstream editing, as well as special functions such as fast forward playback, fast reverse playback, slow motion, pause, and still pictures. The following description concentrates on the support for arbitrary-shaped video objects.
The video coding algorithm is partially based on proven technology from previous standards (e.g. block-based motion compensation, DCT), but it also specifies tools for new functionalities such as content-based coding. While MPEG-1 and MPEG-2 can only encode rectangular video frames, MPEG-4 extends coding to video objects, which can have arbitrary shape. Video scenes can be composed of several objects that may change in position, appearance, size, etc., independently of each other. To define both rectangular and arbitrarily shaped video objects uniformly in the bitstream syntax, the concept of a Video Object Plane (VOP) is introduced, which can represent either a rectangular frame or an arbitrarily shaped object plane. Motion-compensated prediction is defined for B-VOPs (bi-directionally predicted VOPs) and P-VOPs (uni-directionally predicted VOPs). An I-VOP (intra-coded VOP) makes no reference to other VOPs; instead of interframe prediction, it uses prediction within the same VOP of the DC transform coefficients and of the first row or column of AC transform coefficients.
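The prediction of DC coefficients within an I-VOP can be illustrated with a small sketch. It assumes the gradient-based direction rule commonly described for MPEG-4 Visual, in which the predictor is taken from the block above or the block to the left depending on which direction shows the smaller change among the neighbouring DC values. The function names are hypothetical, and quantiser scaling and the handling of unavailable neighbours are deliberately left out; ISO/IEC 14496-2 remains the normative description.

```python
def predict_dc(dc_left, dc_above_left, dc_above):
    """Choose a DC predictor from the three neighbouring blocks.

    Assumed rule: if the DC value changes less between the left and
    above-left blocks than between the above-left and above blocks,
    predict from the block above; otherwise predict from the left block.
    """
    if abs(dc_left - dc_above_left) < abs(dc_above_left - dc_above):
        return "above", dc_above
    return "left", dc_left


def dc_residual(dc_current, dc_left, dc_above_left, dc_above):
    """Return the prediction direction and the residual that would be coded."""
    direction, predictor = predict_dc(dc_left, dc_above_left, dc_above)
    return direction, dc_current - predictor


if __name__ == "__main__":
    # Neighbouring DC values: left = 100, above-left = 100, above = 60.
    print(dc_residual(58, 100, 100, 60))
```

Only the residual needs to be transmitted, because the decoder can repeat the same direction decision from blocks it has already reconstructed.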
Content-based coding of images and video allows separate decoding and reconstruction of arbitrarily shaped video objects. Extended manipulation of content in video sequences allows functionalities such as the warping of synthetic or natural text, textures, and image and video overlays onto reconstructed video content. An example is the mapping of text in front of a moving video object, where the text moves coherently with the object. As a specific type of video-related object, static sprites are defined. These are mosaic-like images that can be geometrically aligned by 2D global warping, such that reverse mapping into the frames of a video sequence can be performed.
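The reverse mapping from a static sprite into a frame can be sketched as follows. This is a minimal illustration that assumes an affine warp given as six parameters and nearest-neighbour sampling; the parameter layout and the function names are hypothetical, and the standard defines its own warping parameterisation and interpolation.

```python
import math


def affine_warp(x, y, params):
    """Map frame coordinates (x, y) to sprite coordinates (u, v)."""
    a, b, c, d, e, f = params  # assumed layout: u = a*x + b*y + c, v = d*x + e*y + f
    return a * x + b * y + c, d * x + e * y + f


def reconstruct_frame(sprite, width, height, params, fill=0):
    """Build a width x height frame by sampling the sprite at warped positions."""
    sprite_h, sprite_w = len(sprite), len(sprite[0])
    frame = []
    for y in range(height):
        row = []
        for x in range(width):
            u, v = affine_warp(x, y, params)
            # Nearest-neighbour sampling: round to the closest sprite pixel.
            ui, vi = math.floor(u + 0.5), math.floor(v + 0.5)
            inside = 0 <= ui < sprite_w and 0 <= vi < sprite_h
            row.append(sprite[vi][ui] if inside else fill)
        frame.append(row)
    return frame


if __name__ == "__main__":
    sprite = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
    # Pure translation by one pixel in each direction.
    print(reconstruct_frame(sprite, 2, 2, (1, 0, 1, 0, 1, 1)))
```

Changing the six parameters from frame to frame is what lets a single sprite image follow global camera motion such as panning or zooming.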
Shape coding assists the description and composition of conventional images and video as well as arbitrarily shaped video objects. A binary alpha map defines whether or not a pixel belongs to an object: each pixel is either ‘on’ or ‘off’. ‘Gray-scale’ or ‘alpha’ shape coding defines the transparency of an object, which is not necessarily uniform; it can vary over the object so that, e.g., edges are more transparent (a technique sometimes called feathering). Multi-level alpha maps can be used to blend different layers of image sequences. Applications that benefit from binary alpha maps associated with images include content-based image representations for image databases, interactive games, surveillance, and animation.
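The role of the alpha map in scene composition can be shown with a short sketch. It assumes 8-bit alpha values (0 = fully transparent, 255 = fully opaque) and a plain linear blend with rounding; the function name and the list-of-rows image layout are illustrative choices, not taken from the standard.

```python
def composite(foreground, background, alpha, max_alpha=255):
    """Blend a foreground object over a background using a per-pixel alpha map.

    A binary alpha map is the special case in which every alpha value is
    either 0 or max_alpha.
    """
    out = []
    for fg_row, bg_row, a_row in zip(foreground, background, alpha):
        out.append([
            # Integer linear blend, rounded to the nearest value.
            (a * fg + (max_alpha - a) * bg + max_alpha // 2) // max_alpha
            for fg, bg, a in zip(fg_row, bg_row, a_row)
        ])
    return out


if __name__ == "__main__":
    fg = [[200, 200, 200]]
    bg = [[100, 100, 100]]
    alpha = [[0, 128, 255]]  # transparent, half transparent, opaque
    print(composite(fg, bg, alpha))
```

With a feathered alpha map, the intermediate values near the object edge produce a gradual transition between object and background instead of a hard contour.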
The binary shape mask is compressed by Context-based Arithmetic Encoding (CAE). Binary shape coding can also exploit motion information: motion-compensated samples from the reference frame can be used within the CAE context. As the basic coding concept is block-based (DCT, motion compensation), the shape is aligned with a block grid, in which fully covered blocks of rectangular shape and boundary blocks of non-rectangular shape co-exist. For gray-scale shape, the encoding is performed by the same motion-compensated DCT algorithm that is used for the texture information.
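The alignment of the shape with a block grid can be illustrated by classifying each block of a binary mask. This is a sketch under stated assumptions: the mask is a list of rows of 0/1 values, the default block size of 16 matches the macroblock size, and the labels and function name are hypothetical. It does not implement CAE itself.

```python
def classify_blocks(mask, block_size=16):
    """Label each block of a binary mask as transparent, opaque or boundary."""
    height, width = len(mask), len(mask[0])
    labels = []
    for by in range(0, height, block_size):
        row = []
        for bx in range(0, width, block_size):
            pixels = [
                mask[y][x]
                for y in range(by, min(by + block_size, height))
                for x in range(bx, min(bx + block_size, width))
            ]
            if all(pixels):
                row.append("opaque")       # entirely inside the object
            elif not any(pixels):
                row.append("transparent")  # entirely outside the object
            else:
                row.append("boundary")     # crossed by the object contour
        labels.append(row)
    return labels


if __name__ == "__main__":
    mask = [[1, 1, 1, 0, 0, 0],
            [1, 1, 0, 0, 0, 0]]
    print(classify_blocks(mask, block_size=2))
```

Only boundary blocks need their shape coded pixel by pixel; for the other two classes a single label per block is sufficient.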
Padded DCT blocks or shape-adaptive DCT can be used to encode the texture within boundary blocks of non-rectangular shape.
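The idea of padding a boundary block before applying a regular block DCT can be sketched as follows. The example fills pixels outside the object with the mean of the pixels inside it, which is a simplified stand-in chosen for illustration; the padding procedures actually specified in ISO/IEC 14496-2 are more elaborate. The function name is hypothetical.

```python
def pad_boundary_block(texture, mask):
    """Fill pixels outside the object with the mean of the pixels inside it.

    texture and mask are lists of rows of equal size; mask holds 0/1 values.
    """
    inside = [
        t
        for t_row, m_row in zip(texture, mask)
        for t, m in zip(t_row, m_row)
        if m
    ]
    if not inside:
        raise ValueError("block contains no object pixels")
    mean = (sum(inside) + len(inside) // 2) // len(inside)  # rounded integer mean
    return [
        [t if m else mean for t, m in zip(t_row, m_row)]
        for t_row, m_row in zip(texture, mask)
    ]


if __name__ == "__main__":
    texture = [[10, 99], [20, 99]]
    mask = [[1, 0], [1, 0]]
    print(pad_boundary_block(texture, mask))
```

Because the decoder discards the padded pixels using the decoded shape, the encoder is free to choose fill values that keep the transform coefficients small.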
The concept of profiles and levels is implemented in MPEG-4 to define conformance points for decoder configurations. Below the level of profiles, MPEG-4 defines object types. These are combinations of tools (basic coding methods such as B-VOPs, interlaced coding, etc.) necessary to support a selected set of applications.
One element of a profile specification is the object type, referring to a syntactical structure supporting a specific set of tools. Often the profiles and object types within them are specified in a hierarchical fashion, providing a well-defined structure for differing varieties of capability within relevant application environments. Object types related to arbitrary-shape video are as follows:
- Core and Core Scalable: Supersets of the Simple and Simple Scalable object types, respectively. These allow arbitrary binary-shape video objects, B-VOPs, and different quantization methods. The Core types are useful for applications such as those providing relatively simple content-interactivity (Internet multimedia applications).
- Advanced Coding Efficiency: A superset of the Advanced Simple object type, allowing arbitrarily shaped video objects with binary or gray-scale shape. It is suitable for applications such as mobile broadcast reception, the acquisition of image sequences (camcorders), and other applications where high coding efficiency is required and a small footprint is not the prime concern.
- Main: A superset of the Core object type, invoking most of the available MPEG-4 video tools such as sprites, gray-scale shape, and interlaced coding.
Subsequent to the third edition of the standard text, which was published in 2004, the following corrigenda and amendments are an integral part of the MPEG-4 Visual specification:
- ISO/IEC 14496-2:2004/Cor.1:2004
- ISO/IEC 14496-2:2004/Amd.1:2004 (Error resilient simple scalable profile)
- ISO/IEC 14496-2:2004/Amd.2:2005 (New levels in simple profile)
- ISO/IEC 14496-2:2004/Cor.2:200X (in preparation)
Arbitrary-shape MPEG-4 Video is a format that can be used for a wide range of interactive and content-related applications, such as interactive movies and games with user-selected insertion and replacement of scene parts, and the insertion of segmented video objects into graphics and multimedia presentations. The different options for scene composition are also useful in the editing and production of video and multimedia content.