MPEG-4: Why, What, How and When?
Fernando Pereira
Instituto Superior Técnico - Instituto de Telecomunicações
Av. Rovisco Pais, 1049-001 Lisboa Codex, Portugal
E-mail: Fernando.Pereira@lx.it.pt
Abstract
The MPEG-4 Version 1 standard has been recently finalized. Since MPEG-4
adopted an object-based audiovisual representation model with hyperlinking and
interaction capabilities and supports both natural and synthetic content, it
is expected that this standard will become the information coding playground
for future multimedia applications.
This paper intends to give an overview on the MPEG-4 motivations, objectives,
achievements, process and workplan, providing a stimulating starting point for
more detailed reading.
1. Why: The Context
"What does it mean, to see? The plain man’s answer (and Aristotle’s, too) would be, to know what is where by looking. In other words, vision is the process of discovering from images what is present in the world, and where it is." [1]. The image coding standards nowadays available, and the underlying image data models, mainly address this process by providing an image representation in the form of a sequence of rectangular 2D frames which give the users "a window to the real world": the Television paradigm. However, the process of vision is just a part of the task at hand since typically the human being needs and wants to see, to take actions after, interacting with the objects that compose the world being seen. A similar reasoning can be made regarding the process of hearing and the corresponding audio representation models.
Although the television paradigm dominated audiovisual communications for many years, the situation is nowadays evolving very quickly in terms of the ways audiovisual content is produced, delivered and consumed [2]. Moreover hardware and software are getting more and more powerful, opening new frontiers to the technologies used and to the functionalities provided.
Producing content is becoming ever easier. Digital still cameras that store images directly in JPEG format have hit the mass market. Together with the first digital video cameras recording directly in MPEG-1 format, this represents a major step towards the acceptance, in the consumer market, of digital audiovisual acquisition technology. This step transforms every one of us into a potential content producer, capable of creating content that can be easily distributed and published using the Internet. Moreover, more and more content is being synthetically produced (computer generated) and integrated with natural material into truly hybrid audiovisual content. The various pieces of content, digitally encoded, can be re-used repeatedly without the quality losses typical of the previous analog processes.
While audiovisual information, notably the visual part, was until recently only carried over very few networks, the trend is now towards the generalization of visual information in every single network. Moreover the increasing mobility in telecommunications is a major trend. Mobile connections will not be limited to voice, but other types of data, including real-time media, will be next. Because mobile telephones are replaced every two to three years, new mobile devices can finally make the decade-long promise of audiovisual communications turn into reality. The need for visual communication is much more apparent when you are not at home, and have something to show besides your living room that does not really change over time.
The explosion of the Web and the acceptance of its interactive mode of operation have clearly shown, in the last five years, that the traditional television paradigm would no longer suffice for audiovisual services. Users will want to have access to audio and video like they now have access to text and graphics. This requires moving pictures and audio of acceptable quality at low bitrates on the Web, and Web-type interactivity with live content. It should be possible to activate relationships between entities (in a potentially virtual world) through hyperlinking—the Web paradigm—, and to experience interactive immersion in natural and virtual environments — the Games paradigm.
Since many of the emerging audiovisual applications demand interworking, the need to develop an open and timely international standard became evident. In 1993, MPEG (Moving Picture Experts Group) [3] launched the MPEG-4 work item, now officially called "Coding of audiovisual objects", to address, among others, the requirements mentioned above [4]. MPEG is Working Group 11 of Subcommittee 29 of the ISO/IEC Joint Technical Committee 1. The group meets 3 to 5 times a year, with around 300 experts attending each meeting.
MPEG has been responsible for the successful MPEG-1 and MPEG-2 standards that have given rise to widely adopted commercial products and services, such as Video-CD, DVD, digital television, digital audio broadcasting and MP3 codecs (MPEG-1 audio layer 3). The MPEG-4 standard, MPEG’s most recent achievement, aims to define an audiovisual coding standard that addresses the emerging needs of the communication, interactive and broadcasting service models, as well as of the mixed service models resulting from their technological convergence. The convergence of the three traditionally separate application areas (communications, computing and TV/film/entertainment) was evident in their mutual cross-fertilization, with functionalities characteristic of each area emerging more and more in the others.
Following the previous successes, MPEG is already working on the next audiovisual
representation standard, this time addressing the problem of describing audiovisual
information to allow the quick and efficient searching, processing and filtering
of various types of multimedia material of interest to the user: MPEG-7, officially
called "Multimedia Content Description Interface" [5].
2. What: The Objectives and Achievements
The three major trends mentioned above (the mounting importance of audiovisual media on all networks, increasing mobility and growing interactivity) have driven, and still drive, the development of the MPEG-4 standard [2].
To address the identified needs and requirements [6], a standard was needed that could:
Figure 1 – The MPEG-4 object-based architecture
A major difference from previous audiovisual standards, and the basis of the new functionalities, is the object-based audiovisual representation model that underpins MPEG-4 (see Figure 1). An object-based scene is built from individual objects that have relationships in space and time, which offers a number of advantages. First, different object types may have different suitable coded representations: a synthetic moving head is clearly best represented using animation parameters, while video benefits from a smart representation of pixel values. Second, it allows harmonious integration of different types of data into one scene: an animated cartoon character in a real world, or a real person in a virtual studio set. Third, interacting with the objects and hyperlinking from them becomes feasible. There are further advantages, such as selective spending of bits, easy re-use of content without transcoding, and sophisticated schemes for scalable content on the Internet.
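The object-based model described above can be pictured with a small data structure. The following Python sketch is only an illustration of the idea (objects with a type, a position in space, a lifetime in time and an optional hyperlink); all class names, field names and the example URL are hypothetical and do not correspond to the MPEG-4 scene description syntax.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AVObject:
    """One audiovisual object in the scene (all field names are hypothetical)."""
    name: str
    kind: str                   # e.g. "natural_video", "synthetic_background", "audio"
    position: tuple             # (x, y) placement in the scene
    start: float                # time at which the object appears, in seconds
    end: float                  # time at which the object disappears, in seconds
    link: Optional[str] = None  # optional hyperlink activated when the object is picked


@dataclass
class Scene:
    objects: list = field(default_factory=list)

    def active_at(self, t):
        """Return the objects present in the scene at time t."""
        return [o for o in self.objects if o.start <= t < o.end]

    def pick(self, name, t):
        """Simulate user interaction: picking a visible object returns its link."""
        for o in self.active_at(t):
            if o.name == name:
                return o.link
        return None


# A real person placed in a virtual studio set, plus a short audio jingle.
scene = Scene([
    AVObject("presenter", "natural_video", (40, 20), 0.0, 30.0,
             link="http://example.org/presenter"),
    AVObject("studio", "synthetic_background", (0, 0), 0.0, 60.0),
    AVObject("jingle", "audio", (0, 0), 0.0, 5.0),
])

print([o.name for o in scene.active_at(10.0)])
print(scene.pick("presenter", 10.0))
```

Because each object carries its own type, a player built along these lines could hand each object to a different decoder, which is the first advantage listed above.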
The applications that benefit from what MPEG-4 brings are found in many, and very different, environments [7]. Therefore, MPEG-4 is constructed as a tool-box rather than a monolithic standard, using profiles that provide solutions for these different settings (see the paper on MPEG-4 profiling in this issue). This means that although MPEG-4 is a rather big standard, it is structured so that solutions can be tailored to the needs at hand. It is the task of each implementer to extract from the MPEG-4 standard the technological solutions adequate to his needs, which are very likely a small sub-set of the standardized tools.
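The tool-box idea can be sketched as a set-inclusion check: a profile names a sub-set of tools, and an implementer picks the smallest profile that covers what the application needs. The profile and tool names below are placeholders chosen for illustration; they are not the actual MPEG-4 profile definitions.

```python
# Hypothetical profiles: each one names the sub-set of tools a decoder implements.
PROFILES = {
    "profile_a": {"rectangular_video", "error_resilience"},
    "profile_b": {"rectangular_video", "error_resilience",
                  "shape_coding", "temporal_scalability"},
}


def can_decode(profile, required_tools):
    """A profile is sufficient when it contains every tool the content uses."""
    return set(required_tools) <= PROFILES[profile]


def smallest_sufficient_profile(required_tools):
    """Pick the sufficient profile with the fewest tools, or None if none fits."""
    candidates = [p for p in PROFILES if can_decode(p, required_tools)]
    if not candidates:
        return None
    return min(candidates, key=lambda p: len(PROFILES[p]))


print(smallest_sufficient_profile({"rectangular_video"}))
print(smallest_sufficient_profile({"rectangular_video", "shape_coding"}))
print(smallest_sufficient_profile({"face_animation"}))
```

Choosing the smallest sufficient sub-set mirrors the point made above: most implementations need only a small part of the standardized tools.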
MPEG-4 can be used to deploy complete new applications or to improve existing ones. Unlike MPEG-2 (digital television), MPEG-4 does not target a major "killer application" but it rather opens many new frontiers. Playing with audiovisual scenes, creating, re-using, accessing and consuming audiovisual content will become easier. New and richer applications can be developed in, e.g., enhanced broadcasting, remote surveillance, personal communications, games, mobile multimedia, virtual environments, etc. It allows services with combinations of the traditionally different service models: "broadcast", "(on-line) interaction" and "communication". As such, MPEG-4 addresses "convergence" defined as the proliferation of multimedia in all kinds of services and on all types of (access) networks.
Since a standard is always a constraint on freedom, it is important to make it as minimally constraining as possible [8]. To MPEG this means that a standard must offer the maximum of advantages by specifying the minimum necessary, allowing for competition and for evolution of technology in the so-called "non-normative" areas. The normative tools included in the standard are those whose specification is essential for interoperability. For example, while video segmentation and rate control are non-normative tools, the decoding process needs to be normative. The strategy of "specifying the minimum for maximum usability" ensures that good use can be made of the continuous improvements in the relevant technical areas. The consequence is that better non-normative tools can always be used, even after the standard is finalized, and that it is possible to rely on competition for obtaining ever better results. In fact, it is precisely through the non-normative tools that products will distinguish themselves, which only reinforces their importance.
The MPEG-4 requirements have been addressed by the 6 parts of the recently finalized MPEG-4 Version 1 standard, notably:
MPEG-4 was developed over the past 5 years by hundreds of experts from tens
of companies and universities spread globally, who believed that the MPEG-4
technology can power the next generation of multimedia products and services.
MPEG-4 Version 1 was available at the end of 1998 [15]. MPEG-4 Version 2 will
extend the capabilities of the standard in a backward compatible way, and will
be ready by the end of 1999. Participants in MPEG-4 represent broadcasters,
equipment and software manufacturers, digital content creators and managers,
telecommunication service providers, publishers and intellectual property rights
managers, as well as university researchers.
3. How: The Process
Since the technological landscape changed from analog to digital, with all the associated implications, it was essential that standard makers also acknowledged this change by modifying the way in which standards are created. Standards must offer interoperability across countries, services and applications, rather than follow a "system driven approach" in which the value of a standard is limited to a specific, vertically integrated system. This brings us to the tool-kit approach, in which a standard must provide a minimum set of relevant tools which, once assembled according to industry needs, provide the maximum interoperability at minimum complexity and, very likely, minimum cost [8]. The success of MPEG standards is mainly based on this tool-kit approach, bounded by the "one functionality, one tool" principle. In conclusion, MPEG wants to offer users interoperability and flexibility at the lowest complexity and cost.
In order to fulfill these objectives, MPEG follows a development process with some major steps [8]:
The period up to the evaluation of the proposals submitted in answer to the call for proposals is designated the "competitive phase", while the period after the evaluation corresponds to the "collaborative phase". During the collaborative phase all the MPEG members collectively improve and complete the most promising tools identified at the evaluation. The collaborative phase is the major strength of the MPEG process, since hundreds of the best experts in the world, from tens of companies and universities, work together towards a common goal. In this context, it comes as no surprise that this super-team traditionally achieves excellent technical results, which justifies the need for most companies to at least follow the process if direct involvement is not possible.
As stated above, two working tools play a major role in the collaborative development phase that follows the initial competitive phase: the Working Model and Core Experiments (CE) [16]. In MPEG-1 the (video) working model was called Simulation Model (SM), in MPEG-2 the (video) working model was called Test Model (TM), and in MPEG-4 the various working models were called Verification Models (VM). In MPEG-4 there were independent VMs for the video, audio, SNHC (Synthetic and Natural Hybrid Coding) and systems developments. Regarding the MPEG-4 Verification Models and Core Experiments it is important to highlight:
A) Verification Models
A Verification Model is a complete framework such that an experiment performed
by multiple independent parties will produce essentially identical results.
The VM enabled the checking of the relative performance of different tools,
as well as improving the performance of selected tools. The MPEG-4 VMs were
built after screening the proposals answering the call for proposals. The
first VM (for each technical area) was not the best proposal but a combination
of the best tools, independently of the proposal that they belonged to. Each
VM included normative and non-normative tools to create the "common framework"
that allows performing adequate evaluation and comparison of tools targeting
the continuous improvement of the technology included in the VM. After the
first VMs were established, new tools were brought to MPEG-4 and were evaluated
inside the VMs following a core experiment procedure. The VMs evolved through
versions as core experiments verified the inclusion of new techniques, or
proved that included techniques should be substituted. At each VM version,
only the best performing tools were part of the VM. If any part of a proposal
was selected for inclusion in the VM, the proposer had to provide the corresponding
source code for integration into the VM software in the conditions specified
by MPEG.
B) Core Experiments
The improvement of the VMs started with a first set of core experiments defined
at the conclusion of the evaluation of the proposals. The core experiments
process allowed for multiple, independent, directly comparable experiments
to be performed to determine whether or not a proposed tool had merit. Proposed
tools targeted the substitution of a tool in the VM or the direct inclusion
in the VM to provide a new relevant functionality. Improvements and additions
to the VMs were decided based on the results of core experiments.
A core experiment has to be completely and uniquely defined, so that the results
are unambiguous. In addition to the specification of the tool to be evaluated,
a core experiment also specifies the conditions to be used, again so the results
can be compared. A core experiment is proposed by one or more MPEG experts
and is accepted by consensus, providing that two or more independent experts
agree to perform the experiment.
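The decision logic behind a core experiment can be sketched as follows. This is a simplified illustration, not the MPEG procedure itself: the single numeric score per party, the agreement tolerance and the use of the mean are assumptions introduced here; only the requirement of two or more independent parties and the rule that the better performing tool enters the VM come from the description above.

```python
from statistics import mean


def evaluate_core_experiment(vm_score, results, tolerance=0.5, min_parties=2):
    """Decide the outcome of a core experiment (illustrative sketch only).

    vm_score  -- score of the tool currently in the VM (higher is better)
    results   -- {party_name: score} obtained independently for the proposed tool
    tolerance -- hypothetical bound on how far independent results may differ
    """
    if len(results) < min_parties:
        return "not valid: too few independent parties"
    scores = list(results.values())
    # Independent parties must obtain essentially identical results.
    if max(scores) - min(scores) > tolerance:
        return "inconclusive: results do not agree"
    # Only the better performing tool becomes part of the next VM version.
    if mean(scores) > vm_score:
        return "adopt tool into next VM version"
    return "keep current VM tool"


print(evaluate_core_experiment(30.0, {"party_1": 31.0, "party_2": 31.2}))
print(evaluate_core_experiment(30.0, {"party_1": 31.0}))
print(evaluate_core_experiment(30.0, {"party_1": 31.0, "party_2": 29.0}))
```

The agreement check is what the fully specified experimental conditions make possible: without them, differing results could not be attributed to the tool under test.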
4. When: The Workplan
For MPEG-4 the process highlighted above translated to the workplan presented
in Table 1. As you may notice in the table, MPEG-4 Version 2 is formally seen
as amendments to the various parts of Version 1.
| Call for MPEG-4 proposals; Final version of the MPEG-4 Evaluation Document |
| Subjective evaluation of video proposals |
| Subjective evaluation of audio proposals |
| Experts evaluation of video proposals |
| First version of the MPEG-4 Video Verification Model |
| Version 1 Working Draft (WD) – parts 1,2,3,5,6 |
| Version 1 Committee Draft (CD) – parts 1,2,3,5,6 |
| Version 1 Final Committee Draft (FCD) after ballot with comments – parts 1,2,3,5,6 |
| Version 1 Final Draft International Standard (FDIS) after ballot with comments – parts 1,2,3,6 |
| Version 1 Committee Draft (CD) – part 4 |
| Version 2 Proposed Draft Amendment (PDAM) – parts 1,2,3,6 |
| Version 1 International Standard (IS) after yes/no ballot – parts 1,2,3,6 |
| Version 1 Final Draft International Standard (FDIS) after ballot with comments – part 5; Version 1 Final Committee Draft (FCD) after ballot with comments – part 4; Version 2 Final Proposed Draft Amendment (FPDAM) after ballot with comments – parts 1,2,3,6; Version 2 Proposed Draft Amendment (PDAM) – part 5 |
| Version 1 International Standard (IS) after yes/no ballot – part 5 |
| Version 1 Final Draft International Standard (FDIS) after ballot with comments – part 4; Version 2 Final Draft Amendment (FDAM) after ballot with comments – parts 1,2,3,6; Version 2 Proposed Draft Amendment (PDAM) – part 4; Version 2 Final Proposed Draft Amendment (FPDAM) – part 5 |
| Version 1 International Standard (IS) after yes/no ballot – part 4; Version 2 Amendment (AMD) after yes/no ballot – parts 1,2,3,6 |
| Version 2 Final Draft Amendment (FDAM) – part 5 |
| Version 2 Amendment (AMD) – part 5 |
| Version 2 Final Proposed Draft Amendment (FPDAM) – part 4 |
| Version 2 Final Draft Amendment (FDAM) – part 4 |
| Version 2 Amendment (AMD) – part 4 |
Table 1 – MPEG-4 time schedule
Although discussions about MPEG-4 started as early as May 1991, in Paris, it was not until September 1993 that the MPEG AOE (Applications and Operational Environments) group, chaired by Cliff Reader, met for the first time. The main task of this group was to identify the applications and requirements relevant for the far-term very low bitrate coding solution to be developed by ISO/MPEG as stated in the very initial MPEG-4 project description [17]. At the same time, the near-term hybrid coding solution being developed within the ITU-T LBC (Low Bitrate Coding) group started producing the first results (later the ITU-T H.263 standard). It was then quite generally felt that those results were close to the best performance that could be obtained by block-based hybrid DCT/motion compensation video coding schemes.
In July 1994, the Grimstad MPEG meeting marked a major change in the direction of MPEG-4. Until that meeting, the main goal of MPEG-4 had been to obtain a significantly better compression ratio than could be achieved by conventional techniques. Only very few people, however, believed that it was possible, within the following 5 years, to get enough improvements over the LBC standard (H.263 and H.263+) to justify a new standard. So the AOE group was faced with the need to broaden the objectives of MPEG-4, believing that "pure compression" would not be enough. The group then started an in-depth analysis of the audiovisual world trends, based on the convergence of the TV/film/entertainment, computing and telecommunications worlds. The conclusion was that the emerging MPEG-4 coding standard should support new ways, notably content-based, for communication, access and manipulation of digital audiovisual data.
Following this change of direction, the vision behind the MPEG-4 standard was explained through the eight "new or improved functionalities" described in the MPEG-4 Proposal Package Description (PPD) [18]. These eight functionalities came from an assessment of the functionalities that would be useful in future applications, but were not supported or not well supported by the available coding standards. The eight "new or improved" MPEG-4 functionalities were clustered in three classes related to the aforementioned three worlds, the convergence of which MPEG-4 wanted to address [18]:
The video subjective tests were performed in November 1995 at the premises of Hughes Aircraft Co., in Los Angeles, while the audio subjective tests were performed in December 1995 at CCETT, Mitsubishi, NTT and Sony. The video expert panels evaluation was performed in October 1995 and January 1996.
After the evaluation of the technology received [20], choices were made and the collaborative phase started with the most promising tools. In the course of developing the standard, additional calls were issued when not enough technology was available within MPEG to meet the requirements, e.g., for synthetic coding tools in March 1996. This is a typical solution when MPEG is missing some technology and there are good indications that the technology does indeed exist outside MPEG.
At the MPEG January 96 meeting in Munich, a single MPEG-4 Video Verification Model (VM) was defined. In this VM, a video scene was represented as a composition of "Video Object Planes" (VOPs) [21]. The first MPEG-4 Video VM used ITU-T H.263 coding tools together with shape coding, following the results of the November 1995 MPEG-4 video subjective tests.
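The notion of a scene composed of VOPs with shape information can be illustrated by a toy compositor. The sketch below assumes that each VOP carries a small pixel array, a binary shape mask of the same size and an origin in the frame; the dictionary keys are hypothetical, and neither the actual MPEG-4 shape coding nor the bitstream syntax is modelled.

```python
def compose(width, height, vops, background=0):
    """Paint VOPs onto a frame, using each VOP's binary shape mask.

    Each VOP is a dict with hypothetical keys:
      "origin" -- (x, y) of the VOP's top-left corner in the frame
      "pixels" -- 2D list of sample values
      "shape"  -- 2D list of 0/1 flags; 1 means the sample belongs to the object
    """
    frame = [[background] * width for _ in range(height)]
    for vop in vops:  # later VOPs are drawn on top of earlier ones
        ox, oy = vop["origin"]
        for r, (pixel_row, mask_row) in enumerate(zip(vop["pixels"], vop["shape"])):
            for c, (value, inside) in enumerate(zip(pixel_row, mask_row)):
                y, x = oy + r, ox + c
                if inside and 0 <= y < height and 0 <= x < width:
                    frame[y][x] = value
    return frame


background_vop = {
    "origin": (0, 0),
    "pixels": [[1] * 4 for _ in range(3)],
    "shape": [[1] * 4 for _ in range(3)],   # rectangular: every sample is inside
}
foreground_vop = {
    "origin": (1, 1),
    "pixels": [[9, 9], [9, 9]],
    "shape": [[1, 0], [1, 1]],              # arbitrary shape: one sample is outside
}

for row in compose(4, 3, [background_vop, foreground_vop]):
    print(row)
# [1, 1, 1, 1]
# [1, 9, 1, 1]
# [1, 9, 9, 1]
```

The sample left out of the foreground mask shows the background through it, which is what distinguishes an arbitrarily shaped object from a rectangular frame.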
A process similar to the one used for video was followed for audio, although with some initial delay due to the involvement of many audio experts in the AAC (Advanced Audio Coding) MPEG-2 work.
Following this initial phase, the several MPEG-4 VMs evolved by using the core experiment process, as described before. A new version of each of the MPEG-4 VMs has been issued at each MPEG meeting, e.g., the Video VM was in version 13 at the Seoul meeting in March 1999 [22].
As highlighted in the previous section, the last step of the MPEG process is
the verification of the technology in the standard, which aims to check the performance
of the available tools and to demonstrate their potential. For MPEG-4, the
verification step has been performed through a set of verification tests addressing
various parts of the standard. Until now verification tests have been performed
for narrowband audio broadcasting, speech codecs, audio on Internet, video error
resilience, video content-based coding and video temporal scalability in the
simple scalable profile [3]. Tests on video temporal scalability in the core
profile and video coding efficiency were going on in April 1999.
5. Final Remarks
MPEG-1 and MPEG-2 have been successful standards that have given rise to widely adopted commercial products, such as CD-interactive, digital audio broadcasting, and digital television. However, these standards are severely limited in terms of the functionalities provided by the data representation models used.
The recent MPEG-4 standard opens new frontiers in the way users will play with, create, re-use, access and consume audiovisual content. The MPEG-4 object-based representation approach where a scene is modeled as a composition of objects, both natural and synthetic, with which the user may interact, is at the heart of the MPEG-4 technology.
Let us now hope that the MPEG-4 vision will reach and convert many application
developers and that the MPEG-4 standard will become the audiovisual playground
of the future.
Acknowledgments
The author would like to thank all the MPEG members for the interesting and
fruitful discussions in meetings and by e-mail, which substantially enriched
his technical knowledge. Special thanks go to Rob Koenen for his friendship
and patience.
Moreover the author acknowledges the support of his work on the MPEG-4 standard
by the European Commission under the ACTS program, project AC098 MoMuSys and
by PRAXIS XXI under the project "Normalização de métodos
avançados de representação de vídeo".
References
Fernando Pereira
Fernando Pereira was born in Vermelha, Portugal, in October 1962. He graduated in
Electrical and Computers Engineering from Instituto Superior Técnico (IST),
Universidade Técnica de Lisboa, Portugal, in 1985. He received the M.Sc.
and Ph.D. degrees in Electrical and Computers Engineering from IST, in 1988
and 1991, respectively. He is currently Professor at the Electrical and Computers
Engineering Department of IST. He is responsible for the participation of IST
in many national and international research projects. He is a member of the
Editorial Board of the Signal Processing: Image Communication Journal and an
Associate Editor of IEEE Transactions on Circuits and Systems for Video Technology.
He is a member of the Scientific Committee of several international conferences.
He has contributed more than ninety papers. He has been participating in the
work of ISO/MPEG for many years, notably as the head of the Portuguese delegation,
and chairing many Ad Hoc Groups related to the MPEG-4 and MPEG-7 standards.
His current areas of interest are video analysis, processing, coding and description,
and multimedia interactive services.