INTERNATIONAL ORGANISATION FOR STANDARDISATION
ORGANISATION INTERNATIONALE DE NORMALISATION
ISO/IEC JTC1/SC29/WG11
CODING OF MOVING PICTURES AND AUDIO
MPEG2008/N9792
April 2008, Archamps, FR
Title: Overview of Scalable Video Coding
Author: Jens-Rainer Ohm
Scalable Video Coding (SVC) was defined as an amendment to MPEG-4 AVC, providing an efficient scalable representation of video through flexible multi-dimensional resolution adaptation. This scalable representation greatly simplifies the interrelationship and adaptation between transmission/storage and compression technology: it supports a variety of network and terminal capabilities, and it significantly increases error robustness because a stream can be adapted by very simple truncation. Unlike previous solutions, SVC provides a high degree of flexibility in terms of scalability dimensions (supporting various temporal/spatial resolutions, SNR/fidelity levels and global/local ROI access), while the penalty in compression performance, as compared to single-layer coding, is almost negligible. Extensive results on subjective viewing have been presented in [1].
SVC is based on a layered representation with multiple dependencies. To achieve temporal scalability, the construction of frame hierarchies is essential: frames that are not used as references for the prediction of any layer still present can be skipped. An example of such a hierarchical prediction structure is given in Figure 1. The pictures marked as “B3” form the set that would be removed to reduce the frame rate by a factor of 2; removing “B2” as well would reduce the frame rate by a further factor of 2, and so on.

Figure 1. Example of hierarchical B prediction structure for temporal scalability.
The hierarchical prediction structure shown in Figure 1 is useful not only for temporal scalability. Because it establishes finite prediction dependencies between the frames, encoder/decoder drift is significantly reduced in cases where bitstream scaling causes the two sides to use different information for prediction.
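The temporal layering described above can be illustrated with a minimal sketch. It assumes a dyadic hierarchy with a power-of-two GOP size and assumes that the index in labels such as “B3” denotes the temporal level; the function names `temporal_level` and `frames_kept` are hypothetical and not part of the SVC specification.

```python
def temporal_level(frame_idx: int, gop_size: int) -> int:
    """Temporal level of a frame in a dyadic hierarchy (0 = key pictures).

    Assumption: gop_size is a power of two and key pictures sit at
    multiples of gop_size in display order.
    """
    if gop_size < 1 or gop_size & (gop_size - 1):
        raise ValueError("gop_size must be a power of two")
    max_level = gop_size.bit_length() - 1
    pos = frame_idx % gop_size
    if pos == 0:
        return 0
    # More trailing zero bits in the position means a coarser (lower) level.
    trailing_zeros = (pos & -pos).bit_length() - 1
    return max_level - trailing_zeros


def frames_kept(num_frames: int, gop_size: int, max_level: int) -> list[int]:
    """Indices of frames that remain when all levels above max_level are dropped."""
    return [i for i in range(num_frames)
            if temporal_level(i, gop_size) <= max_level]


if __name__ == "__main__":
    print([temporal_level(i, 8) for i in range(9)])
    print(frames_kept(17, 8, 2))  # highest level removed: half the frame rate
    print(frames_kept(17, 8, 1))  # two levels removed: a quarter of the frame rate
```

Each level that is dropped halves the frame rate, and no remaining frame refers to a dropped one as long as every frame is predicted only from frames of a lower level.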

Figure 2. Hierarchical layer structure for spatial scalability
For the purpose of spatial scalability, the video is first downsampled to the required spatial resolution(s). The ratio between the frame heights/widths of the respective resolutions does not need to be dyadic (a factor of two). For example, configurations where the higher layer is 1080p and the lower layer is 720p (a ratio of 1.5) are easily supported.
Encoding as well as decoding starts at the lowest resolution, where an AVC compatible “base layer” bitstream will typically be used. For the respective next-higher “enhancement layer”, three decoded component types are used for inter-layer prediction from the lower layer:
- Up-sampled intra-coded macroblocks;
- Motion and mode information (aligned/stretched according to image size ratios);
- Up-sampled residual signal in case of inter-coded macroblocks.
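As an illustration of the second item in the list above, the following sketch scales a base-layer motion vector to the enhancement-layer resolution according to the image size ratio. It is a simplification under stated assumptions: the function name `scale_motion_vector` is hypothetical, and the rounding used here is not the integer arithmetic that the standard itself defines.

```python
from fractions import Fraction


def scale_motion_vector(mv, base_size, enh_size):
    """Scale a base-layer motion vector (x, y) to enhancement-layer resolution.

    base_size and enh_size are (width, height) tuples. Exact rational
    arithmetic is used so that non-dyadic ratios such as 720p -> 1080p
    (factor 3/2) are handled without floating-point error.
    Assumption: Python's built-in rounding stands in for the normative rule.
    """
    scale_x = Fraction(enh_size[0], base_size[0])
    scale_y = Fraction(enh_size[1], base_size[1])
    return (round(mv[0] * scale_x), round(mv[1] * scale_y))


if __name__ == "__main__":
    # Non-dyadic case: 720p base layer, 1080p enhancement layer.
    print(scale_motion_vector((4, -6), (1280, 720), (1920, 1080)))
    # Dyadic case: 540p base layer, 1080p enhancement layer.
    print(scale_motion_vector((5, 3), (960, 540), (1920, 1080)))
```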
The prediction from the lower layer is an additional mode which need not always be used. In the extreme case, each of the spatial layers could still be encoded completely independently, e.g. when the predictions from past or future frames of the higher-resolution layer are better than the up-sampled result from the lower-resolution layer. The different possibilities of prediction dependencies for the case of two spatial resolutions are illustrated in Figure 3.

Figure 3. Example of prediction dependencies for the case of two spatial layers.
Quality scalability in SVC (also known as “SNR scalability”) can be seen as a special case of spatial scalability, where the prediction dependencies are applied between pictures of the same resolution but different qualities. Typically, the next higher quality layer is obtained by decreasing the AVC quantization parameter (QP) by 6, which corresponds to halving the quantizer step size.
Due to the nature of the information that is conveyed between the layers, it is in fact not necessary to run predictive decoder loops for the lower layers. Only information that is directly decodable, such as motion, mode, residual or intra information, is conveyed to the next-higher layer. Therefore, the decoding process of SVC can be designated as single-loop decoding, which is in fact not significantly more complex than conventional AVC single-layer decoding.
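The relation between QP and quantizer step size can be made concrete with a short sketch. The six base values are those commonly quoted for H.264/AVC and are not taken from this document; the function name `qstep` is hypothetical.

```python
# Step sizes for QP 0..5 as commonly quoted for H.264/AVC (assumption:
# not stated in this overview). The step size doubles for every increase of 6.
_BASE_QSTEP = (0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125)


def qstep(qp: int) -> float:
    """Quantizer step size for an AVC QP in the range 0..51."""
    if not 0 <= qp <= 51:
        raise ValueError("QP must be in the range 0..51")
    return _BASE_QSTEP[qp % 6] * (1 << (qp // 6))


if __name__ == "__main__":
    base_layer_qp = 28
    enhancement_qp = base_layer_qp - 6  # next higher quality layer
    print(qstep(base_layer_qp), qstep(enhancement_qp))
```

Lowering the QP by 6 for the next quality layer therefore halves the step size regardless of the starting QP.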
At the bitstream, packetization and network interfacing level, full compatibility is retained. SVC enhancement layer information is conveyed in a new NAL unit type which would be skipped by an existing AVC decoder, such that the base layer would still be decodable by such devices. The SVC NAL unit header conveys important information about the respective packet, such as the spatial, temporal and quality layer to which it belongs. This information can easily be extracted by media-aware network elements to decide whether the respective packet should be dropped.
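The drop decision of a media-aware network element can be sketched as follows. The sketch assumes the three-byte header extension layout of H.264 Annex G (also reproduced in RFC 6190), with NAL unit types 14 and 20 carrying the fields `dependency_id`, `quality_id` and `temporal_id`; none of these details is stated in this document. The names `parse_layer_id` and `should_forward` are hypothetical, and the simple threshold test is a simplification of real sub-stream extraction.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: type 14 = prefix NAL unit, type 20 = coded slice in scalable extension.
SVC_NAL_TYPES = {14, 20}


@dataclass(frozen=True)
class SvcLayerId:
    dependency_id: int  # spatial (dependency) layer
    quality_id: int     # quality (SNR) layer
    temporal_id: int    # temporal layer


def parse_layer_id(nal: bytes) -> Optional[SvcLayerId]:
    """Return the layer identifiers of an SVC NAL unit, or None for other types."""
    if len(nal) < 1:
        raise ValueError("empty NAL unit")
    nal_unit_type = nal[0] & 0x1F
    if nal_unit_type not in SVC_NAL_TYPES:
        return None  # e.g. an AVC base-layer NAL unit
    if len(nal) < 4:
        raise ValueError("truncated SVC NAL unit header")
    # nal[1] holds a flag bit, idr_flag and priority_id and is not needed here.
    second, third = nal[2], nal[3]
    return SvcLayerId(
        dependency_id=(second >> 4) & 0x07,
        quality_id=second & 0x0F,
        temporal_id=(third >> 5) & 0x07,
    )


def should_forward(nal: bytes, max_dependency: int, max_quality: int,
                   max_temporal: int) -> bool:
    """Forward non-SVC units always; forward SVC units at or below the target point."""
    layer = parse_layer_id(nal)
    if layer is None:
        return True
    return (layer.dependency_id <= max_dependency
            and layer.quality_id <= max_quality
            and layer.temporal_id <= max_temporal)


if __name__ == "__main__":
    # Type 20 unit with dependency_id=1, quality_id=2, temporal_id=3.
    packet = bytes([0x74, 0x80, 0x12, 0x67])
    print(parse_layer_id(packet))
    print(should_forward(packet, 1, 2, 2))
```

Only the first four bytes of each packet are inspected, so no entropy decoding is needed to make the decision.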
The compatibility with existing devices is also retained by the profile structure defined for SVC. The definitions are as follows:
- Scalable baseline profile, which builds on top of a baseline-profile base layer bitstream;
- Scalable high profile, which builds on top of a high-profile base layer bitstream;
- Scalable high intra profile, which builds on top of a high-profile base layer bitstream, but restricts the enhancement layer to intra-frame coding.
[1] ISO/IEC JTC 1/SC 29/WG 11 N9577: SVC Verification Test Report, Antalya, Jan. 2008 (public document)