INTERNATIONAL ORGANISATION FOR STANDARDISATION
ORGANISATION INTERNATIONALE DE NORMALISATION
ISO/IEC JTC1/SC29/WG11
CODING OF MOVING PICTURES AND AUDIO
ISO/IEC JTC1/SC29/WG11 N7314
Poznań, Poland, July 2005
1 Introduction
The demand for ever-increasing compression performance has driven the definition
of a new part of the MPEG-4 standard, ISO/IEC 14496-10: 'Coding of Audiovisual
Objects – Part 10: Advanced Video Coding' (AVC), which is technically identical
to ITU-T Rec. H.264. AVC was developed by the Joint
Video Team (JVT), which consists of members of both MPEG and the ITU-T Video
Coding Experts Group.
2 Technical Solution
The basic approach of AVC is block-based hybrid video coding (block motion-compensated
(MC) prediction + 2D block transform). The most relevant tools and elements that go
beyond other video compression standards are as follows:
- Motion compensation using variable block
sizes of 16x16, 16x8, 8x16, 8x8, 8x4, 4x8, or 4x4, with motion
vectors encoded by hierarchical prediction starting at the 16x16 macroblock
level;
- Motion compensation of the luma component
sample array is performed with quarter-sample accuracy, using high-quality interpolation
filters;
- Use of an integer transform of block
size 4x4 or 8x8. The transform is not exactly a DCT, but can be interpreted
as an integer approximation of it. The entire building block of transform
and quantization can be implemented in 16-bit integer arithmetic, for both
encoding and decoding. In contrast to previous standards
based on the DCT, there is no dependency on a floating-point implementation,
so no drift between encoder and decoder picture representations can
occur in normal (error-free) operation.
- Intra-picture coding is performed
by first predicting the entire block from boundary samples of adjacent blocks.
Prediction is possible for 4x4, 8x8 and 16x16 blocks, where for the 16x16
and 8x8 cases only horizontal, vertical, DC, and planar prediction is allowed.
In the 4x4 block case, nine prediction types are supported (DC and eight directional
spatial prediction modes).
- An adaptive de-blocking filter
is applied in the prediction loop. The adaptation process of the filter is
non-linear, with the lowpass strength of the filter steered by the quantization
parameter (step size) and by syntax under the control of the encoder. Further
parameters considered in the filter selection are the difference between motion
vectors at the respective block edges, the coding mode used (e.g. stronger
filtering is applied for intra mode), the presence of coded coefficients, and
the differences between reconstruction values across the block boundaries.
- Multiple reference picture prediction
allows any macroblock to be predicted from one of up
to F previously decoded pictures; the number F itself depends
on the profile/level definition, which specifies the maximum amount of frame
memory available in a decoder. Values around F=5 are typical when using
the maximum picture size supported in the profile/level definition.
- Instead of B-type, P-type,
and I-type pictures, the types are defined slice-wise, where
a slice may cover at most an entire picture.
- New types of switching slices
(S-type slices, with SP and SI sub-types) allow a controlled
transition of the decoder memory state when switching between streams.
- The B-type slices are
generalized compared to previous standards and are called bi-predictive
rather than bi-directional. In particular, this allows individual regions
to be predicted from two previous or two subsequent
pictures, provided that a causal processing order is observed. Furthermore,
prediction of B-type slices from other B-type slices is possible,
which allows implementation of a B-frame pyramid. Different weighting
factors can be used for the reference frames in the B-prediction.
- Two different entropy coding
mechanisms are defined: Context-adaptive VLC (CAVLC) and
Context-adaptive Binary Arithmetic Coding (CABAC). Both are
universally applicable to all elements of the code syntax, which is based
on a systematic construction of variable-length code tables. By proper definition
of the contexts, it is possible to exploit non-linear dependencies between
the different elements to be encoded. CABAC is a coding method for binary
signals, so multi-level values such as transform coefficients
or motion vectors must be binarized before it can be applied; the methods
available are unary codes or truncated unary codes (VLCs
consisting of '1' bits with a terminating zero), Exp-Golomb codes, or fixed-length
codes. Four different basic context models are defined, where the usage depends
on the specific values to be encoded.
- Additional error resilience
mechanisms are defined: Flexible Macroblock Ordering (FMO,
which allows macroblock interleaving), Arbitrary Slice Ordering (ASO),
data partitioning of motion vectors and other prediction information, and
encoding of redundant pictures, which allows, for example, duplicate sending
or re-transmission of important information.
- Other methods known from previous
standards are also implemented, such as frame/field adaptive coding of interlaced
material, direct mode for B-slice motion vector prediction, and predictive coding
of motion vectors at macroblock level.
- A Network Abstraction Layer
(NAL) is defined for the purpose of simple interfacing of the video stream
with different network transport mechanisms, e.g. for access unit definition,
error control etc.
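The integer-transform item in the list above can be illustrated with a short sketch. The matrix below is the commonly cited 4x4 forward core transform of H.264/AVC; it is not given in this document, and the scaling and quantization stages that normally follow it are omitted here. The helper names are hypothetical.

```python
# Commonly cited 4x4 forward core transform matrix of H.264/AVC
# (assumption: not stated in this document; scaling/quantization omitted).
CF = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]


def matmul(a, b):
    """Multiply two 4x4 integer matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(row) for row in zip(*m)]


def forward_core_transform(block):
    """Compute CF * block * CF^T using integer arithmetic only."""
    return matmul(matmul(CF, block), transpose(CF))


# A flat block has all its energy in the DC coefficient.
flat = [[10] * 4 for _ in range(4)]
coeffs = forward_core_transform(flat)
# coeffs[0][0] is 160; every other coefficient is 0.
```

Because every operation is an integer addition or multiplication, an encoder and a decoder built this way compute bit-identical results, which is the property behind the drift-free operation described above. The rows of the matrix are mutually orthogonal but not of equal norm, which is why a separate scaling step is needed in a complete implementation.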
To achieve the highest possible compression
performance and other goals of the project, it was necessary to sacrifice strict
forward or backward compatibility with prior MPEG and ITU-T video coding standards.
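The binarization methods named in the entropy coding item of the list above can be sketched as follows. This is a minimal illustration of the four code families in their generic form, with hypothetical function names; the exact binarization assigned to each syntax element in the standard is not reproduced here.

```python
def unary(value):
    """Unary code: `value` one-bits followed by a terminating zero."""
    if value < 0:
        raise ValueError("value must be non-negative")
    return "1" * value + "0"


def truncated_unary(value, c_max):
    """Unary code that drops the terminating zero when value == c_max."""
    if not 0 <= value <= c_max:
        raise ValueError("value must lie in [0, c_max]")
    return "1" * value if value == c_max else unary(value)


def exp_golomb(value):
    """Order-0 Exp-Golomb code for an unsigned integer."""
    if value < 0:
        raise ValueError("value must be non-negative")
    code = bin(value + 1)[2:]              # binary representation of value + 1
    return "0" * (len(code) - 1) + code    # prefix of len-1 zeros


def fixed_length(value, n_bits):
    """Plain binary representation padded to n_bits."""
    if not 0 <= value < (1 << n_bits):
        raise ValueError("value does not fit in n_bits")
    return format(value, "0{}b".format(n_bits))


bins = {
    "unary(3)": unary(3),                            # '1110'
    "truncated_unary(3, 3)": truncated_unary(3, 3),  # '111'
    "exp_golomb(3)": exp_golomb(3),                  # '00100'
    "fixed_length(5, 4)": fixed_length(5, 4),        # '0101'
}
```

Each resulting bit (bin) would then be passed to the binary arithmetic coder together with a context model; that stage is outside the scope of this sketch.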
The key improvements over previous standards are made in the area of motion
compensation, but in proper combination with the other elements. The loop filter
provides a significant gain in subjective quality at low and very low data rates.
State-of-the-art context-based entropy coding drives compression to the limits.
The various degrees of freedom in mode selection, reference-frame selection,
motion block-size selection, context initialization etc. provide significant
improvement of compression performance only when appropriate optimization decisions,
in particular based on rate-distortion criteria, are made. Such elements
have been included in the reference encoder software.
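A rate-distortion based decision of the kind mentioned above is often formulated as minimizing a Lagrangian cost J = D + lambda * R over the candidate modes. The sketch below assumes that formulation; the candidate names, distortion values, and bit counts are made up for illustration and are not taken from the reference encoder software.

```python
def rd_cost(distortion, rate_bits, lam):
    """Lagrangian cost J = D + lambda * R."""
    return distortion + lam * rate_bits


def choose_mode(candidates, lam):
    """Return the name of the candidate with the lowest Lagrangian cost.

    candidates: iterable of (name, distortion, rate_bits) tuples.
    """
    best = min(candidates, key=lambda c: rd_cost(c[1], c[2], lam))
    return best[0]


# Hypothetical measurements for one macroblock: smaller partitions lower
# the distortion but cost more bits for motion vectors and mode signalling.
candidates = [
    ("16x16", 1200.0, 40),
    ("8x8", 700.0, 95),
    ("4x4", 520.0, 210),
]

low_lambda_choice = choose_mode(candidates, lam=1.0)    # "4x4"
mid_lambda_choice = choose_mode(candidates, lam=5.0)    # "8x8"
high_lambda_choice = choose_mode(candidates, lam=20.0)  # "16x16"
```

A larger lambda weights rate more heavily, so the decision shifts toward cheaper, coarser modes; in practice lambda is usually tied to the quantization parameter.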
The combination of all the methods listed has led to a significant increase
in compression performance compared to previous standard solutions. Bit-rate
reductions of up to 50%, and in some cases more, at the same quality level have been
reported in comparison with prior standards such as MPEG-2, H.263, MPEG-4
Part 2 Simple Profile, and MPEG-4 Part 2 Advanced Simple Profile.
The concept of profile and level definitions for decoder conformance points is also
implemented in the AVC standard. Presently, the following profiles are defined:
- Baseline profile:
Restricted to I- and P-type slices, with no weighted prediction,
no interlace coding tools, no CABAC, no slice data partitioning, and some more
specific constraints on the number of slice groups and levels to be used with
this profile.
- Extended profile:
No CABAC, all error resilience tools used (including SP and SI
slices), and some more specific constraints imposed on the direct mode, the number
of slice groups, and the levels to be used with this profile.
- Main profile:
Only I-, P- and B-type slices; enhanced error resilience
tools such as slice data partitioning, arbitrary slice order, and multiple slice
groups per picture are disabled, while more basic error resilience features
such as slice resynchronization, NAL parameter set robustness, and constrained
intra prediction are supported; some more specific constraints are made on
levels to be used with this profile.
- High profile:
Extending Main profile, supporting integer transform of block size 8x8 (switchable),
supporting 8x8 (filtered) intra prediction modes, encoder-customized frequency-specific
inverse quantization scaling, and level definitions adjusted such that better
alignment with typical HD picture formats is achieved.
- High 10 profile:
Extending High profile, supporting
an amplitude resolution of up to 10 bits.
- High 4:2:2 profile:
Extending High 10 profile, extending color
sampling format support to 4:2:2.
- High 4:4:4 profile:
Extending High 4:2:2 profile, extending color
sampling format support to 4:4:4, supporting an amplitude resolution of up to
12 bits, supporting a residual color transform in the decoding process,
and defining a transform bypass mode which allows efficient lossless coding.
A total of 5 major levels and 15 levels overall (including sub-levels) are defined.
Level restrictions relate to the maximum number of macroblocks per second, maximum
number of macroblocks per picture, maximum decoded picture buffer size (imposing
constraints on multiframe prediction), maximum bit rate, maximum coded picture
buffer size, and vertical motion vector ranges. These parameters can be mapped
to a model of a Hypothetical Reference Decoder (HRD), which relates
to buffer models and their timing behavior.
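As an illustration of how the first two level restrictions apply, the sketch below counts 16x16 macroblocks for a given picture size and frame rate and compares them against a pair of limits. The limit values used in the example are illustrative assumptions only; the normative values are in the level tables of the standard, and the function names are hypothetical.

```python
def macroblocks_per_picture(width, height):
    """Number of 16x16 macroblocks needed to cover a width x height picture."""
    mbs_wide = (width + 15) // 16   # round up to whole macroblocks
    mbs_high = (height + 15) // 16
    return mbs_wide * mbs_high


def fits_level(width, height, fps, max_mbs_per_picture, max_mbs_per_second):
    """Check the picture-size and throughput limits of a level.

    Only two of the restrictions listed above are checked; buffer sizes,
    bit rate, and motion vector range are not modelled.
    """
    mbs = macroblocks_per_picture(width, height)
    return mbs <= max_mbs_per_picture and mbs * fps <= max_mbs_per_second


# Illustrative limits (assumed for this example, not taken from this document).
MAX_MBS_PER_PICTURE = 8192
MAX_MBS_PER_SECOND = 245760

hd_mbs = macroblocks_per_picture(1920, 1080)   # 120 * 68 = 8160
ok_30 = fits_level(1920, 1080, 30, MAX_MBS_PER_PICTURE, MAX_MBS_PER_SECOND)  # True
ok_60 = fits_level(1920, 1080, 60, MAX_MBS_PER_PICTURE, MAX_MBS_PER_SECOND)  # False
```

Note that a 1080-line picture needs 68 macroblock rows (1088 lines), so the coded picture is slightly larger than the displayed one.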
The text of the MPEG-4 AVC standard is common with ITU-T Rec. H.264. Currently,
the third edition of the standard text is being prepared for publication; it will
contain the full set of specifications as described above.
3 Application areas
MPEG-4 AVC is expected to be used in a wide range of applications such as
high-resolution video broadcast and storage, mobile video streaming (Internet
and broadcast), and professional applications such as cinema content storage
and transmission.