MPEG and multimedia communications
Leonardo Chiariglione
CSELT - Torino - Italy
Digital Television is a reality today but Multimedia Communications, after years of hype, is still a catchword. Lack of suitable multi-industry standards supporting it is one reason for the unfulfilled promise.
The MPEG committee, which originated the MPEG-1 and MPEG-2 standards that made Digital Television possible, completed the development of MPEG-4 version 1 in October 1998 and version 2 in December 1999. MPEG is currently developing some minor extensions to these two versions. MPEG is also developing MPEG-7, a standard for multimedia information search, filtering, management and processing. MPEG-7 will become Committee Draft in October 2000 and Draft International Standard in July 2001.
This paper describes how the MPEG-4 standard, with its application-level features and its combination of network awareness and network independence, is poised to become the enabling technology for multimedia communications and will therefore help to solve the problems that are hindering Multimedia Communications.
Ten years after the word "multimedia" entered the techno-vocabulary, and after 5 years of "convergence" hype and 3 years of digital television, we are still struggling to make multimedia communications happen. The reasons for this stalemate are manifold. Here are some of them:
A solution to the problem is offered by MPEG with its MPEG-4 standard under development.
MPEG-4, the current standardisation project of MPEG, combines some of the typical features of other MPEG standards with new ones coming from existing or anticipated manifestations of multimedia:
MPEG can also claim to have found the way to make different industries talk to one another while developing common technologies. The MPEG-1 and MPEG-2 standards have been collaboratively developed by multiple industries and successfully adopted for such different purposes as digital audio and television broadcasting, interactive video on computers and movies on compact discs. Such MPEG-established standardisation principles as "not systems but tools", "one functionality - one tool", "relocatability of tools", "specify the minimum", "a priori standardisation", "stick to the deadline" etc., if not adopted in practice by other standards bodies, are at least becoming widely known and discussed, and their positive implications are gradually being appreciated in the standards world. The profile/level approach that complements the principles above reconciles the need to specify generic technologies with the need to accommodate the application-specific requirements of different industries.
MPEG-4 will become Draft International Standard in December 1998 and International Standard 2 months later. It can thus be expected that MPEG-4 will become the enabling technology for multimedia communications, just as MPEG-1 has become the enabling technology for digital video and MPEG-2 for digital television.
Chapter 2 of this paper will try to clear the ground of the convergence hype and will identify which parts of the industries are candidates for convergence, provided that common multi-industry communication standards exist. Chapter 3 will assess the difficulty of producing such common standards in view of the different approaches to standardisation of the different industries. Chapter 4 will describe what MPEG is and how it operates. Chapter 5 describes the principles of MPEG standardisation. Chapter 6 will briefly outline the technical content of the MPEG-1 and MPEG-2 standards. Chapter 7 will clarify the scope of applicability of MPEG-1 and MPEG-2, i.e. digital video and digital television, respectively, and Chapter 8 will identify the need for a new standard - MPEG-4 - to satisfy the requirements coming from new information interaction/consumption paradigms. Chapter 9 goes into some of the technical details of the MPEG-4 standard and para. 9 lists the additional features that MPEG-4 will need to avoid some of the problems encountered with the customisation of MPEG-2 made by specific industries.
After years of multimedia hype, there is no sign that multimedia communications will happen the way media gurus had anticipated, i.e. by convergence of telecommunications, entertainment and computers, all adopting digital technology. This is not happening as much as the professions of barber, butcher and cobbler have not moved a single inch to a convergence point through the millennia in spite of all sharing the common "knife" technology. What is happening is movie makers buying broadcasting companies, telcos buying CATV companies, consumer electronics companies buying movie makers, software companies buying photograph collections etc. To do this you do not have to wait for digital technology convergence, you need fat wallets, complacent boards of directors and patient shareholders.
Digital technology brings a range of benefits, but there are two main reasons for its unstoppable adoption by industry since the invention of the transistor and the integrated circuit:
Some examples are offered by the music record, speech telephony and satellite broadcasting:
Now, try and ask the layman to tell you the difference between the analogue and the digital version! People buy benefits, even if only perceived, not features.
I am not an unbeliever in convergence, which is part of life as much as divergence, but certainly not of the mentioned businesses. The first thing we must do if we want to have any chance of understanding, forecasting and, hopefully, shaping the future, is to acknowledge that the three industries of entertainment, telecommunications and computers do not provide the right dimensions to study the phenomenon.
Let us consider the Entertainment case. Most instances of the Entertainment industry represent vertical businesses, more or less extended through all layers depending on the cases:
"Telecommunications" is another vertical business spanning all communication layers. It even used to own the terminal equipment in subscribers' homes, the telephone set! "Computer" is an inextricable mixture of hardware and software, an underlying technology that is used everywhere, in information servers, telecommunication systems as well as in user devices.
Better axes to analyse the Multimedia phenomenon are provided by the categories of "Content", "Transport" and "Equipment". "Content" - the message - is what matters to the user who foots the entire bill and therefore justifies the existence of the entire system, "Transport" is what is needed to deliver "Content" to people who want it and "Equipment" - the user device - is what is needed to enable the human user to interact with the system and to convert "Content" into human-consumable form. There are different types of "Content": movies, TV programs, news, telephone calls and many ways to pack content in a way that makes users more prone to consume; different types of "Transport": radio channel, cable, twisted pair, at the physical level and emerging ones as middleware; and an almost infinite variety of "Equipment".
What each of the Entertainment, Telecommunication and Computer industries has always tried, and is still trying, to achieve is complete control of the Content - Transport - Equipment chain.
Table 1 below gives some examples, possibly different from one environment to another, of how the different industries (first column) integrate, or do not integrate, the content, transport and equipment components within themselves.
Tab. 1 - Examples of integration of industries
| Industry | Content | Transport | Equipment |
| Terrestrial TV | X | X | |
| CATV | X | X | |
| Satellite TV | X | X | |
| Telecommunications | X | X | |
| Movies | X | X | |
| Consumer Electronics | X | X | |
| Video games | X | X | |
The convergence case can be made, even though I do not personally think the businesses will converge, nor that there is a cogent need for it. But this will not happen because the industries decide to abandon the technologies proper to their businesses and convert to digital technologies, something they have been doing for a long time in search of rationalising their way of doing business. The telecommunication industry started using PCM in the network 30 years ago, the Consumer Electronics industry introduced the Compact Disc 15 years ago and the Computer industry, well, it has been digital all along. What digital brings is a drive towards changing vertically integrated systems into layered systems, something that has clearly happened in the Computer domain, the most advanced because of its intrinsic digital nature. But just as the Computer industry has become layered, with substantial homogeneity of standards at the application and transport layers, so convergence requires an alignment of the technical specifications of the layers across the different industries. In other words the communication standards of one industry must be compatible with those of the other industries.
And this is a monumental task, seeing the diverging attitudes of the different industries vis-à-vis standardisation that are described in the next paragraph.
In spite of the long history of standardisation there is a great deal of confusion around this word. To address it properly one must "go back to basics".
If you take the Webster's under the entry "standard" you will find the following definition:
A pole or spear bearing some conspicuous object (as a banner) at the top formerly used in an army or fleet to mark a rallying point, to signal or to serve as an emblem
This refers to the original meaning of the word and is indeed a good definition. A standard is a model, a reference, something to be followed if one takes it as a guide. Unfortunately one finds in the Webster's, immediately after:
Something that is established by authority, custom or general consent as a model or example to be followed.
This would be OK were it not for the word "authority". It so happens that public authorities sometimes feel they should impose the use of certain standards by force of law. The metre was adopted in the first years of the French Revolution as a way to rationalise, in line with the spirit of the times, the many different linear, surface and volume measures in use in the different parts of France. So far so good; however, its use was later made compulsory by Napoleon and others.
A major problem of Multimedia is created by the legal nature of some of the standards used by some of the industries that are candidates for convergence. The broadcasting industry has traditionally been regulated in all countries and its technical standards have been given legal status. Something similar happened to the Telecommunication industry, while the Consumer Electronics and Computer industries have been largely unregulated. This is reflected in the fact that ITU, encompassing its two branches ITU-R (broadcasting standards) and ITU-T (telecommunication standards), is a Treaty Organisation, i.e. governments are represented in it, while ISO and IEC have the status of private companies established according to the Swiss Civil Code.
The first clash between the different industries in need of common technical specifications comes from the different legal nature of broadcasting and telecommunication standards. Why solve the technical and legal problems at the same time? Good engineering sense suggests that it is better to solve "one problem at a time" if the problems are uncorrelated, at least as a first approximation. Once the first, technical, problem of developing standards is solved, one can attack the second, understandably more difficult, problem of converting technical standards into law, if that is still needed.
The above to introduce my definition of standardisation:
the process by which individuals of a group recognise the advantage of all doing certain things in an agreed way and codify that agreement.
Forget about law. An agreement is a compromise between total satisfaction of one side and total dissatisfaction of the other. If one accepts a compromise it is only because the perceived advantages earned by entering into an agreement exceed the perceived disadvantages caused by the accepted limitation of freedom. Standards are like contracts regulated by the Civil Code, for horse trading as well as anything else: if you like them you sign them, if you don't like them you shun them.
This definition of standardisation is not particularly revolutionary and is perfectly in line with the original meaning of the word standard. Even those very organisations that participate in international standards setting follow it. An example? Most countries in the world use A4-size papers. A4 is an element of an ISO standard (ISO 216:1975). ANSI is the ISO National Body for the United States, still ANSI does not use A4-size paper. Inconsistent? Not at all. Simply for ANSI, as an organisation, the perceived advantages earned by the international (in this case) agreement of using A4-size papers fall below the perceived disadvantages caused by the limitation of freedom of keeping on using letter size paper. (I would rather like to cede part of my freedom and in exchange get rid of the problem of my English version of Windows 95 and of all application programs that assume that all documents have letter size, forcing me to reset the printer every time I print).
The International Organisation for Standardisation (ISO) was established having this nature of standards in mind. ISO Technical Committee 1 (TC 1) "Screw threads" sets standards for threadings of nuts and bolts. It is obvious that having only a finite set of threads is very convenient for those who use them. But if you are yourself a manufacturer and you use nuts and bolts by the millions there is no reason why you could not make them with different threads, as long as they suit your needs. When, however, users of a machine equipped with these special nuts and bolts have to find replacements, a manufacturer's perfectly reasonable choice becomes a public nuisance. As for A4-size papers, this is another example of how standards, agreed by an original group of "contractors", may lose their meaning when their use extends beyond the original group. Just think of multimedia.
Particularly important standards, and more to the point of what we discuss here, are those that enable people to talk to people, which I call "communication" standards. Using the definition above one could define a language as "the agreement by members of a group that codifies the correspondence between certain sounds and certain objects or concepts". The standard called "language" is again a matter of convenience: in spite of efforts to keep it unchanged language evolves because the needs of people using it evolve and so the underlying agreement needs to change. Writing is another communication standard. For languages like Chinese, writing can be defined as the agreement by members of a group that some graphic symbols, isolated or in groups, correspond to particular objects or concepts. For languages like English, writing can be defined as the agreement by members of a group that some graphic symbols, in certain combinations and subject to certain dependencies, correspond to certain basic sounds that can be assembled into compound sounds and traced back to particular objects or concepts.
As we said one of the problems of Multimedia is created by the penchant of Public Authorities to convert some of these free agreements regarding communication standards into laws and regulations. Some examples are given by the adoption of Latin and Cyrillic alphabets in Middle and Eastern Europe that created the deep cultural barriers that still haunt Europe, the adoption of Chinese characters in Japan, Korea and Vietnam and the introduction of the hangul alphabet in Korea as a replacement of Chinese characters.
Public authorities were slow to recognise the impact that the little gadgets technology started offering a little more than a century and a half ago would make on their orderly world: the Morse alphabet was universally adopted in spite of being clearly biased towards English in the selection of the variable-length codes and in spite of the universally used "... --- ..." (SOS) representing the initials of three English words.
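The bias mentioned above is easy to see in the code itself: Morse is a variable-length code in which letters that are frequent in English receive short codewords. The short Python sketch below holds a handful of International Morse codewords; the `encode` helper is a hypothetical illustration and not part of any standard.

```python
# A few International Morse codewords: letters frequent in English are short.
MORSE = {
    "E": ".", "T": "-", "A": ".-", "O": "---", "S": "...",
    "Q": "--.-", "Z": "--..", "J": ".---",
}


def encode(text: str) -> str:
    """Encode text letter by letter, separating codewords with spaces."""
    return " ".join(MORSE[ch] for ch in text.upper())


print(encode("SOS"))   # ... --- ...
print(encode("tea"))   # - . .-

# Letters ordered by codeword length, shortest first.
by_length = sorted(MORSE, key=lambda ch: len(MORSE[ch]))
print(by_length)
```

The common English letters E and T take a single symbol each, while rarer letters such as Q, Z and J take four.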
Communications require standards that define the syntax and semantics of the information when it reaches the destination. Starting from the Morse alphabet communication standards have become increasingly sophisticated and the different industries that have been created in the process have developed considerably diverging attitudes.
Over the years, however, the different industries described above have considerably evolved.
The conclusion that can be drawn from these evolutions is a simple one: standards have been created by agreement between manufacturers, service providers, regulators etc., not users, but the market has vindicated the party that was neglected in the agreement. And this is not a minor party, it is the party for which all this content, hardware and software has been produced: the end users. And what users want is to be able to communicate and not to be shown the message "sorry, I cannot understand this format". End-to-end interoperability is what users want and is what the market was forced to deliver, its traditional dominating forces notwithstanding. Better late than never, but at what cost!
It is against this background that MPEG has operated. The first-generation MPEG-1 and MPEG-2 audio-visual communication standards have been produced by a collaborative effort involving all stakeholders. No surprise that standards have been accepted by all.
The next chapter will describe what MPEG is and how it operates.
4. About MPEG
There are three main bodies engaged in international standardisation.
In 1987 a decision was made to merge ISO Technical Committee 97 "Data Processing" and IEC TC 46 "Microprocessors". JTC1 (Joint Technical Committee 1), the resulting technical committee, was given the title "Information Technology". Within JTC1 operate a number of "Subcommittees", among these SC2 on "Character Coding", SC6 on "Telecommunications", SC7 on "Software Engineering", SC24 on "Computer Graphics", SC29 on "Coding of Picture, Audio, Multimedia and Hypermedia" etc.
MPEG is the nickname of WG11 (working group 11) of SC29, with the title "Coding of Moving Pictures and Audio".
Established in 1988 (the first meeting was held on 10-12 May 1988 in Ottawa, ON, CA), MPEG has grown to form an unusually large committee. Some 300 experts take part in MPEG meetings, and the number of people working on MPEG-related matters without attending meetings is even larger.
The wide scope of technologies considered by MPEG and the large expertise available require an appropriate organization. Currently MPEG has the following subgroups:
| 1. Requirements | develops requirements for the standards under development (currently, MPEG-4 and MPEG-7). |
| 2. DSM | develops standards for interfaces between Digital Storage Media (DSM), servers and clients for the purpose of managing DSM resources and controlling the delivery of MPEG bitstreams and associated data. |
| 3. Delivery | develops standards for interfaces between MPEG-4 applications and peers or broadcast media, for the purpose of managing transport resources. |
| 4. Systems | develops standards for the coding of the combination of individually coded audio, moving images and related information so that the combination can be used by any application. |
| 5. Video | develops standards for coded representation of moving pictures of natural origin. |
| 6. Audio | develops standards for coded representation of audio of natural origin. |
| 7. SNHC | Synthetic-Natural Hybrid Coding: develops standards for coded representation of audio and moving pictures of natural and synthetic origin. SNHC concentrates on the coding of synthetic data. |
| 8. Test | develops methods for and executes subjective evaluation tests of the quality of coded audio and moving pictures, both individually and combined, to test the quality of moving pictures and audio produced by MPEG standards. |
| 9. Implementation | evaluates coding techniques so as to provide guidelines to other groups upon realistic boundaries of implementation parameters. |
| 10. Liaison | handles relations with bodies external to MPEG. |
| 11. HoD | (Heads of Delegations): acts in advisory capacity on matters of general nature. |
MPEG work takes place in two different instances. A large part of the technical work is done at MPEG meetings. Members submit contributions to the MPEG FTP site in electronic form (several hundred of them at every meeting) one week in advance. Delegates are then able to come to meetings well prepared, without having to spend precious meeting time studying other delegates' contributions.
A typical MPEG meeting lasts for one week. It starts on Monday at 9:00 with a Plenary session. Purpose of the session is:
The session usually lasts all morning. Starting from Monday afternoon each of the MPEG groups begins its own activities.
On Monday 18:00-20:00 the Convenor meets with the Heads of National Delegations (HoD). This meeting addresses matters of general interest, considers documents originating from National Bodies that cannot be addressed by the technical groups, proposes meeting schedules etc.
Work by the technical groups continues during all Tuesday. On that day 18:00-19:00 the Liaison group meets.
On Wednesday morning a two-hour plenary session starts at 9:00. Purpose of this plenary is
Work in the group resumes at 11:00 and continues until Friday at 14:00. At that time the last plenary session starts. Purpose of this plenary is:
The Friday plenary usually lasts until 20:00.
About 100 output documents are produced at every meeting; these capture the agreements reached. Of particular importance are:
Output documents, too, are stored on the MPEG FTP site. Access to input and output documents is restricted to MPEG members. At each meeting, however, some output documents are released for public use and are posted on the MPEG home page.
Equally, if not more, important is the work that is done by the ad-hoc groups in between two MPEG meetings. They work by e-mail under the guidance of a Chairman appointed at the Friday plenary meeting. In some exceptional cases, when reasons of urgency or other so require, they are authorized to hold physical meetings. Ad-hoc groups produce recommendations that are reported at the first plenary of the MPEG week and become valuable inputs to speed up deliberation during the meeting.
With MPEG-1 and MPEG-2, MPEG has produced common audio-visual coding standards that are used by all industries mentioned in Para. 3. This has enabled them to accelerate digital audio-visual technology development, to share development costs and, more fundamentally for users, to let content flow unrestricted by built-in technical barriers.
It is worth looking at the way in which MPEG has operated in its over 9 years of activity and try to rationalise what has been a successful approach to standardisation serving the needs of multiple industries to see its applicability to the general multimedia communication case.
No business can survive if work is done living day by day. This is, unfortunately, the practice of some standards committees. They are in charge of producing something (the something being often itself loosely defined) without a date attached for delivering an output (the standard) or with a date that is just a reference. It would be as if a company promised its customers to deliver something sometime.
Standards are the goods that standards committees sell their customers. As for a company the goods have of course to be of high quality, have to be according to the specification issued by the customers but, foremost, they have to be delivered by the agreed date.
Standards are not novels, standards are the technology that enables companies to make products (those sold to end users). If a company makes a plan to go to the market by a certain date with a certain product that requires a certain technology, and makes the necessary investments for it, the company - the buyer vis-à-vis the standards committee - is not going to be happy if the standards committee - the supplier vis-à-vis the company - at the due date reports that they are "behind schedule".
MPEG has a strict workplan that specifies, for all parts of a given standard, the time when the levels of Working Draft, Committee Draft, Draft International Standards and International Standards are reached. So far there have been occasional minor slips at intermediate steps but no delay in reaching International Standard status compared to the planned dates.
Many agree that standards should be issued by standards bodies for which making standards is the raison d'être. However, the inability of many standards committees to deliver on time has forced companies to take shortcuts, the so-called "industry standards". These private specifications, possibly endorsed by a few other companies, are often submitted to a standards committee for ratification.
The main problem with such an approach is that the standards committee then becomes a place where discussions cease to be technical, i.e. definition of a technology, to become commercial. The issues discussed are no longer those aimed at making a good technical standard but definition of terms of exploitation, fitness of the technology to the current plans of companies etc. Of course there is nothing wrong with technology deals between companies, or with close alignment of standard to products within companies but this is wrong if it is done in a standards committee. MPEG instead takes a very clear attitude:
So far MPEG has successfully applied this principle. Standardisation items have been identified well in advance so that it can be claimed that no MPEG standard has endorsed an "industry standard". It must be borne in mind, however, that MPEG standards do not specify complete systems. It is therefore possible that "industry standards" are needed alongside with MPEG standards to make complete products.
The principles described above, applicable to standardisation in general, require further ingenuity when they are to be applied to the production of standards that serve multiple industries.
Industries making end-user products, by definition, need vertically integrated specifications in order to make products that satisfy some needs. Audio-visual decoding may well be a piece of technology that can be shared with other communities but in the event industries need to sell a Video CD player or a digital satellite receiver and these require integrated standards. But if different industries need the same standard, quite likely they will have different end systems in mind. Therefore only the components of a standard, the "tools", as they are called in MPEG, can be specified in a joint effort.
The implementation of this principle requires the change of the nature of standards from "system" standards to "component" standards. Industries will assemble the tool specifications from the standards body(ies) and build their own product specification.
If "tools" are the object of standardisation, a new process must be devised to produce meaningful standards. The following sequence of steps has been found to be practically implementable and to produce the desired result:
Still industry needs some guidance. It is therefore advisable that certain major combinations of tools be specified as normative, making sure that these are not application-specific, but functionality-specific. These standardised sets of tools have been called "profiles" in MPEG-2 Video.
What constitutes a tool, however, is not always obvious. Single channel and multichannel audio or conventional television and HDTV are components needed in many systems. Defining a single "tool" that does the job of coding both single channel and multichannel audio or conventional television and HDTV may be impractical because the technology has to be designed and manufactured to do things to an extent that in some cases are not needed. The "profile/level" philosophy successfully implemented by MPEG provides a solution: within a single tool one may define different "grades", called "levels" in MPEG.
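The profile/level idea can be sketched as a small data structure: a profile names a set of tools, and a level bounds the parameters within which those tools must operate. The Python below is a hypothetical illustration only; the class names, tool names and numeric bounds are assumptions chosen for the example and are not taken from any MPEG specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    """A 'grade' of a tool set: upper bounds on stream parameters."""
    name: str
    max_width: int
    max_height: int
    max_bitrate_mbit_s: float


@dataclass(frozen=True)
class Profile:
    """A named, functionality-specific combination of tools."""
    name: str
    tools: frozenset


@dataclass(frozen=True)
class Stream:
    """What a hypothetical bitstream declares about itself."""
    tools_used: frozenset
    width: int
    height: int
    bitrate_mbit_s: float


def decoder_can_handle(profile: Profile, level: Level, stream: Stream) -> bool:
    """A decoder built for (profile, level) accepts a stream only if the stream
    uses no tool outside the profile and stays within the level's bounds."""
    return (
        stream.tools_used <= profile.tools
        and stream.width <= level.max_width
        and stream.height <= level.max_height
        and stream.bitrate_mbit_s <= level.max_bitrate_mbit_s
    )


# Illustrative values only (assumptions, not figures from a standard).
BASIC = Profile("basic", frozenset({"intra_coding", "motion_compensation"}))
CONVENTIONAL_TV = Level("conventional", 720, 576, 15.0)
HDTV = Level("hdtv", 1920, 1152, 80.0)

tv_stream = Stream(frozenset({"intra_coding", "motion_compensation"}), 720, 576, 6.0)
hd_stream = Stream(frozenset({"intra_coding"}), 1920, 1080, 20.0)

print(decoder_can_handle(BASIC, CONVENTIONAL_TV, tv_stream))  # True
print(decoder_can_handle(BASIC, CONVENTIONAL_TV, hd_stream))  # False
print(decoder_can_handle(BASIC, HDTV, hd_stream))             # True
```

The point of the design is that the same tool set serves both conventional television and HDTV; only the level, and hence the cost of the decoder, changes.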
In some environments it is very convenient to add to a standard those nice little things that bring a standard nearer to a product specification. This is, for instance, the case of industry standards or when standards are used to enforce the concept of "guaranteed quality" so dear to broadcasters and telecommunication operators because of their "public service" nature.
This practice must be abandoned when a standard is to be used by multiple industries.
Only the minimum that is necessary for interoperability can be specified. Going beyond this border line requires a separate agreement involving all participating industries.
 
5.5 One functionality - one tool
As said before, a standard is an agreement to do certain things in a definite way, and in abstract terms everybody agrees that tools should be unique. Unfortunately, when people working for a company sit in a standards committee, that determination dwindles if they see technologies competing with their company's winning the favour of the committee. The usual outcome of a dialectic battle lasting anywhere from an hour to ten years is a compromise on the intellectually accepted principle of one functionality - one tool and, voilà, "options" come in. Because of too many signalling options it took 10 years for European ISDN to achieve a decent level of interoperability between different telecommunications operators and, within the same operator, between equipment of different manufacturers. Because of too many options many standards were stillborn, since the critical mass that would have justified the necessary investments by the industry could not be reached.
When a standard is defined by a single industry there is generally agreement on where a certain functionality resides in the system. In a multi-industry environment this is usually not the case. Take the case of encryption. Depending on your role in the audio-visual distribution chain, you will want the encryption function located where it best serves your place in the chain, because encryption is an important value-added function. If the standard endorses your business model you will adopt the standard; if it does not, you will oppose it.
Not only must the technology be defined in a generic way, but also in such a way that the technology can be located at different points in the system.
Once the work is nearing completion it is important to make sure that the work done does indeed satisfy the requirements ("product specification") originally set. MPEG does that through a process called "Verification Tests" with the scope of ascertaining how well the standard produced meets the specification. This is obviously also an important promotional tool for the acceptance of the standard in the market place.
Let us see in some detail the technical content of MPEG-1 and MPEG-2.
The first standard developed by the group, nicknamed MPEG-1, was the coding of the combined audio-visual signal at a bitrate of around 1.5 Mbit/s. This was motivated by the prospect, which was becoming apparent in 1988, that it was possible to store video signals on a compact disc with a quality comparable to that of VHS cassettes.
Coding of video at such low bitrates had become possible thanks to more than two decades of intense research in video coding algorithms. These algorithms, however, could not effectively be applied to full-resolution pictures but had to be applied to subsampled pictures - a single field from a frame and only ½ of the samples in a line - to show their effectiveness. Coding of audio, as distinct from speech, could rely on R&D work that had allowed the PCM bitrate to be reduced to about 1/6, typically 256 kbit/s for a stereo source, with virtual transparency.
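The figures in the paragraph above can be checked with a few lines of arithmetic. The sketch below assumes a 720x576 full-resolution picture and CD-style PCM audio (44.1 kHz, 16 bits, 2 channels); these source parameters are assumptions made for illustration, not values given in the text.

```python
# Back-of-the-envelope check of the subsampling and audio figures.
# Assumed inputs (not stated in the text): a 720x576 full-resolution
# picture and CD-style PCM audio.

FULL_WIDTH, FULL_HEIGHT = 720, 576      # assumed full-resolution picture
PCM_SAMPLE_RATE_HZ = 44_100             # assumed audio sampling rate
PCM_BITS_PER_SAMPLE = 16
PCM_CHANNELS = 2
CODED_AUDIO_KBIT_S = 256                # figure quoted in the text


def subsampled_size(width: int, height: int) -> tuple:
    """Keep one field per frame (half the lines) and half the samples per line."""
    return width // 2, height // 2


def pcm_bitrate_kbit_s(rate_hz: int, bits: int, channels: int) -> float:
    """Uncompressed PCM bitrate in kbit/s."""
    return rate_hz * bits * channels / 1000


sub_w, sub_h = subsampled_size(FULL_WIDTH, FULL_HEIGHT)
sample_fraction = (sub_w * sub_h) / (FULL_WIDTH * FULL_HEIGHT)
pcm_kbit_s = pcm_bitrate_kbit_s(PCM_SAMPLE_RATE_HZ, PCM_BITS_PER_SAMPLE, PCM_CHANNELS)
compression_ratio = pcm_kbit_s / CODED_AUDIO_KBIT_S

print(f"subsampled picture: {sub_w}x{sub_h} ({sample_fraction:.0%} of the samples)")
print(f"PCM stereo bitrate: {pcm_kbit_s:.1f} kbit/s")
print(f"ratio to {CODED_AUDIO_KBIT_S} kbit/s: {compression_ratio:.1f}:1")
```

Under these assumptions the subsampled picture carries a quarter of the original samples, and 256 kbit/s corresponds to roughly a 5.5:1 reduction of the PCM rate, which is in line with the "about 1/6" quoted above.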
MPEG-1 was a very innovative standard. For the first time a single audio-visual standard had been produced and all care had been taken so that all pieces of the standard fit together. And the very success of the standard triggered many re-organisations in R&D labs where audio and video had traditionally been kept separate. But MPEG-1 can claim more "firsts":
It is to be noted that, in keeping with the "specify the minimum" principle, the standard only specifies the decoder and not the encoder.
MPEG-1 is a standard providing normative statements that allow implementors to realise the traditional communication paradigm (see figure below). Audio-visual information is generated in real time from a natural scene or is stored on a server. In both cases a multiplexed bitstream reaches a decoder via a delivery medium (a telecommunication network, a broadcasting channel etc.). In the case of a local disk the delivery part of the scheme disappears but the model still remains valid. Encoded audio and video streams, constrained to have a common time base and combined into a single stream by the MPEG system layer, are extracted and handed over to the appropriate audio and video decoders that produce the intended sequences of PCM samples representing audio and video information.

Fig. 1 - General reference model of MPEG-1
MPEG-1, formally known as ISO/IEC 11172, is a standard in 5 parts. The first three parts are Systems, Video and Audio, in that order. Two more parts complete the suite of MPEG-1 standards: Conformance Testing, which specifies the methodology for verifying claims of conformance to the standard by manufacturers of equipment and producers of bitstreams, and Software Simulation, a full C-language implementation of the MPEG-1 standard (encoder and decoder).
Manifold have been the implementations of the MPEG-1 standard: from software implementations running on a consumer-grade PC of today in real time, to single boards for PCs, to the so-called Video CD etc. The last product has become a market success in some countries: in China alone millions of Video CD decoders have already been sold. MPEG-1 content is used for such services as DAB (Digital Audio Broadcasting) and is the standard format on the Internet for quality video.
By keeping its promise of delivering VHS-quality video and transparent CD Audio quality at a total bitrate of about 1.5 Mbit/s, MPEG-1 made "digital video" possible. Even though the MPEG-1 syntax had been shown to operate successfully on higher-resolution pictures and at a higher bitrate, a large number of industries had an interest in the much wider field of "digital television" but were uncomfortable with a basic limitation of MPEG-1 Video: the constraint of progressively scanned pictures. A gain, eventually estimated to be around 20%, was expected to be achieved by exploiting the correlation of interlaced pictures. This was the main motivation to develop MPEG-2, titled "Generic coding of moving pictures and associated audio". Work on this standard could commence as early as July 1990 because:
The figure below provides a general reference model for the MPEG-2 standard. MPEG-2 encoded Audio-Visual information is stored on a server and reaches an MPEG-2 decoder via a delivery medium (disk, network, broadcasting channel etc.). The functionalities of the MPEG-2 decoder are similar to MPEG-1's, but the important technological component is the support of client-server interaction by means of a standard communication protocol.

Fig. 2 - General reference model of MPEG-2
Unlike MPEG-1, basically a standard to store moving pictures on a disk at low bitrates, the much larger number of applications of the MPEG-2 standard forced MPEG to develop and implement the "toolkit approach" described above. Different coding "tools" serving different purposes were developed and standardised. Different assemblies of tools - called "profiles" - were also standardised and could be used to serve different needs. Each profile had in general different "levels" for some parameters (e.g. picture size). Tab. 2 below gives the current situation of MPEG-2 Video Profiles and Levels.
Tab. 2 - MPEG-2 Profile/Level table
| | Simple Profile | Main Profile | SNR Scalable Profile | Spatially Scalable Profile | High Profile | 4:2:2 Profile | MVP Profile |
| High Level | | X | | | X | | |
| High-1440 Level | | X | | X | X | | |
| Main Level | X | X | X | | X | X | X |
| Low Level | | X | X | | | | |
MPEG-2 Audio is an extension of MPEG-1 Audio to the multichannel case. This means that an MPEG-1 Audio decoder can decode two channels of the MPEG-2 stream and an MPEG-2 Audio decoder can decode an MPEG-1 Audio stream as if it were an MPEG-1 Audio decoder.
As for MPEG-1 the systems part of the MPEG-2 standard addresses the combination of one or more elementary streams of video and audio as well as other data into single or multiple streams which are suitable for storage or transmission. Two such combinations are specified: Program Stream and Transport Stream.
The Program Stream is analogous to MPEG-1 Systems Multiplex. It results from combining one or more Packetised Elementary Streams (PES), which have a common time base, into a single stream. The Program Stream is designed for use in relatively error-free environments and is suitable for applications which may involve software processing.
The Transport Stream combines one or more PESs with one or more independent time bases into a single stream. Elementary streams sharing a common timebase form a program. The Transport Stream is designed for use in environments where errors are likely, such as storage or transmission in lossy or noisy media.
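To make the Transport Stream more concrete, here is a minimal sketch of reading the fixed 4-byte header of one Transport Stream packet. The 188-byte packet length, the 0x47 sync byte and the 13-bit packet identifier (PID) come from MPEG-2 Systems rather than from the text above; the function name and the example packet are invented for illustration, and the adaptation field and payload are ignored.

```python
TS_PACKET_SIZE = 188   # bytes, fixed by MPEG-2 Systems
SYNC_BYTE = 0x47


def parse_ts_header(packet: bytes) -> dict:
    """Decode a few fields of the 4-byte header of one Transport Stream packet."""
    if len(packet) != TS_PACKET_SIZE:
        raise ValueError("a Transport Stream packet is 188 bytes long")
    if packet[0] != SYNC_BYTE:
        raise ValueError("sync byte 0x47 not found")
    return {
        "transport_error": bool(packet[1] & 0x80),
        "payload_unit_start": bool(packet[1] & 0x40),
        "pid": ((packet[1] & 0x1F) << 8) | packet[2],
        "continuity_counter": packet[3] & 0x0F,
    }


# Hypothetical packet: PID 0x0100, start of a PES packet, counter 5.
example = bytes([0x47, 0x41, 0x00, 0x15]) + bytes(184)
print(parse_ts_header(example))
```

The short fixed-size packets and the per-PID continuity counter are what let a receiver resynchronise and detect lost packets on a noisy channel.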
MPEG-2, formally known as ISO/IEC 13818, is also a multi-part standard. The first 5 parts have the same function as the corresponding parts of MPEG-1. ITU-T has collaborated with MPEG in the development of MPEG-2 Systems and Video which have become ITU-T Recommendations for the purpose of broadband visual communications. This means that the same physical documents ISO/IEC 13818-1 and 13818-2 have the value of ISO standards and ITU-T Recommendations.
MPEG-2 has been a very successful standard: pieces of equipment that claim conformance to it have been manufactured and sold by the millions, receivers for digital satellite broadcasting being the most popular. More application domains are anticipated, such as digital receivers for CATV or DVD, a new generation of compact disc capable of playing back MPEG-2 bitstreams at a higher and variable bitrate and for a longer time than standard CDs.
MPEG-2 supports a number of technical features, the most important of which are content addressing, encryption and copyright identification.
MPEG-2, with its parts 1, 2 and 3, with International Standard status since November 1994, provided the timely solution for those industries, such as satellite broadcasting, who felt the need to move from traditional analogue broadcasting to digital, because of the advantage of sending 5 to 10 times more programs on the same channel. Other industries, however, had in mind to provide interactive services such as in Video on Demand and Home Shopping. These require, in addition to a standard for audio-visual coding, a standardised terminal-to-server protocol.
Part 6 of MPEG-2, titled "Digital Storage Media Command and Control (DSM-CC)", an International Standard since July 1996, is the specification of a set of protocols which provide the control functions and operations specific to managing MPEG bitstreams. These protocols may be used to support applications in both stand-alone and heterogeneous network environments. In the DSM-CC model, a stream is sourced by a Server and delivered to a Client, both considered to be Users of the DSM-CC network. DSM-CC defines a logical entity called the Session and Resource Manager (SRM) which provides a logically centralised management of the DSM-CC Sessions and Resources.

Fig. 3 - DSM-CC model
Part 7 of MPEG-2 is the so-called Non-Backwards Compatible Audio Coding standard. The need for such a standard arose from the consideration that the backwards compatibility built into MPEG-2 Audio is an important service feature for many applications, such as television broadcasting. This compatibility, however, entails a degree of quality penalty that other applications need not pay. Work in this area has produced the so-called Advanced Audio Coding standard (AAC), for which International Standard status was reached in April 1997. AAC exploits the latest developments of audio coding technology to provide approximately the same quality as MPEG Audio at one half the bitrate.
Part 8 of MPEG-2 was originally planned to be coding of video when input samples are quantised with 10 bits, to provide room for post-processing. Work on this part was discontinued when the professional video industry that had requested the standard eventually shifted its interests to other domains.
Part 9 of MPEG-2, titled Real-time Interface (RTI), an International Standard since July 1996, provides a specification for a real-time interface to Transport Stream decoders which may be utilised for adaptation to all appropriate networks carrying Transport Streams. RTI can be used to achieve equipment-level interoperability in consumer electronic, computer, and other domains because it enables the building of network adaptation layers which are guaranteed to provide the required performance, and simultaneously it enables the building of decoders which are guaranteed to have appropriate behaviour of buffers and timing recovery mechanisms.
Part 10 of MPEG-2 is the Conformance Testing of DSM-CC and is still under development.
Other MPEG activities concern the definition of other MPEG-2 Video Profiles. The 4:2:2 Profile, completed in January 1996, provides a response to users of professional video equipment and services who were keen to exploit the existing consumer-electronics MPEG-2 Video technology for professional applications. The Multiview Profile, completed in October 1996, uses existing video coding tools for the purpose of providing an efficient way to encode two slightly different pictures such as those obtained from two slightly separated cameras shooting the same scene.
6.5 Other standards complementing MPEG-2
6.5.1 MHEG-5
As the next chapter will discuss in more detail, very often a video image is just a component, albeit an important one, of a scene: you may need to add a photo here, a logo there, an explanatory text somewhere and, if you think of interactive video, you will need menus and push buttons, as in the figure below.

Fig. 4 - A composite scene
Support to these functionalities is provided by a standard, called MHEG-5, produced by a group parallel to MPEG called MHEG.
MHEG-5 defines a coded representation of a scene, i.e. the syntax and the associated semantics allowing an author to compose a 2D scene with the following features
1. output components are audio streams and rectangular images such as
2. input components are
3. behaviour of the scene is based on events that trigger actions applied to output and input components.
Therefore the evolution of the scene is programmed by the author and can be modified by the user within the constraints imposed by the author.
MHEG-5 defines a file format made up of
The different types of data can either be included in the file or they can be referenced in a defined name space. DSM-CC can be used to retrieve the necessary data from the given name space.
MHEG-5 allows a substantial extension of the still very traditional communication paradigm of Fig. 2 above.

Fig. 5 - An MHEG-enabled MPEG-2 reference model
6.5.2 Service Information
MPEG-2 is a packet-based system designed to transport digital data having the meaning of "TV programs". A TV program can be made up of one or more video streams (just think of a news program with an inset showing the face of the announcer) and one or more audio streams (audio channel 1 is English, audio channel 2 is French etc.). There is an obvious need to build a table of where the different data packets making up a program can be found. This important function is provided by MPEG-2 Systems.
Knowing where a program is is important for a decoder. For a user it is important to know what a program is.
This functionality has not been developed by MPEG for MPEG-2, but solutions have been provided by two organisations: DVB (Digital Video Broadcasting, a European consortium developing specifications for digital television broadcasting systems) and ATSC (Advanced Television Systems Committee, a US organisation establishing voluntary standards for advanced television systems). In the following, the DVB solution, called DVB SI (Service Information), will be explained.
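The table of packet locations can be pictured as a two-level lookup. In MPEG-2 Systems this role is played by the Program Specific Information tables (the Program Association Table and the Program Map Tables); the sketch below models them as plain Python dictionaries with made-up program numbers and packet identifiers (PIDs), and leaves out the actual table syntax.

```python
# Hypothetical tables: all program numbers and PIDs are invented.
program_association = {1: 0x0020, 2: 0x0030}   # program number -> PID of its map table

program_maps = {
    0x0020: {"video": [0x0100], "audio": {"eng": 0x0101, "fra": 0x0102}},
    0x0030: {"video": [0x0200, 0x0201], "audio": {"eng": 0x0202}},
}


def pids_for_program(program_number, language):
    """Return the packet identifiers a decoder must filter for one program."""
    entry = program_maps[program_association[program_number]]
    return entry["video"] + [entry["audio"][language]]


print(pids_for_program(1, "fra"))   # [256, 258]
```

Program 2 in this invented example carries two video streams, as in the case of a news program with an inset.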
SI provides information on "events", i.e. clips (of any duration) of audio-visual material, specifically
For instance content is classified in 16 categories,
| 0 | Undefined Content | 
| 1 | Movie | 
| 2 | News/Current affairs | 
| 3 | Show/Game show | 
| 4 | Sports | 
| 5 | Children's/Youth programmes | 
| 6 | Music/Ballet/Dance | 
| 7 | Arts/Culture (without music) | 
| 8 | Social/Political issues/Economics | 
| 9 | Education/Science/Factual topics | 
| A | Leisure hobbies | 
| B-E | Reserved for future use | 
| F | User defined | 
Each of the categories, e.g. leisure hobbies, can be further divided in 16 sub-categories
| 0 | leisure hobbies (general) | 
| 1 | tourism/travel | 
| 2 | handicraft | 
| 3 | motoring | 
| 4 | fitness & health | 
| 5 | cooking | 
| 6 | advertisement/shopping | 
| 7-E | reserved for future use | 
| F | user defined | 
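Sixteen categories with sixteen sub-categories each fit into two 4-bit fields, that is, one byte. The sketch below assumes the category sits in the upper four bits and the sub-category in the lower four, and uses only the two tables reproduced above; the function and dictionary names are illustrative and are not taken from the DVB specification.

```python
CATEGORIES = {
    0x0: "Undefined Content",
    0x1: "Movie",
    0x2: "News/Current affairs",
    0x3: "Show/Game show",
    0x4: "Sports",
    0x5: "Children's/Youth programmes",
    0x6: "Music/Ballet/Dance",
    0x7: "Arts/Culture (without music)",
    0x8: "Social/Political issues/Economics",
    0x9: "Education/Science/Factual topics",
    0xA: "Leisure hobbies",
    0xF: "User defined",
}

LEISURE_SUBCATEGORIES = {
    0x0: "leisure hobbies (general)",
    0x1: "tourism/travel",
    0x2: "handicraft",
    0x3: "motoring",
    0x4: "fitness & health",
    0x5: "cooking",
    0x6: "advertisement/shopping",
    0xF: "user defined",
}


def decode_content_byte(value):
    """Split one byte into (category, sub-category) names."""
    level_1 = (value >> 4) & 0x0F
    level_2 = value & 0x0F
    category = CATEGORIES.get(level_1, "Reserved for future use")
    if level_1 == 0xA:
        sub = LEISURE_SUBCATEGORIES.get(level_2, "reserved for future use")
    else:
        sub = None  # sub-category tables of the other categories are not listed here
    return category, sub


print(decode_content_byte(0xA5))   # ('Leisure hobbies', 'cooking')
```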
By means of appropriate software on a set top box, one can indeed use SI to realise a complete "TV guide" by electronic means.
A broadcast TV program is sometimes a simple piece of linear audio and video, basically the output of a microphone and a video camera shooting a scene, but sometimes it is much more than that. Imagine your favourite evening TV news, what you may see is
What is the difference between this evening news program and an interactive multimedia application from your PC or a Web page? In terms of richness of multimedia presentation the TV program is by and large superior. However, you cannot do the most natural thing you do when you are shown the table of content of a Web page: choose the item you want. In a TV program you have to watch and listen to the first item and, if you find something of interest, you have no chance of being shown a button with the indication "click here for more detail", nor can you click on a part of the screen that is sensitive to your pointing device.
Does this matter? Depending on whom you are talking to you will be told that
MPEG-2 by itself cannot provide interactivity to a broadcast environment, but it is not a big deal to exploit some "hooks" and enhance the "multimedia look" that would provide some elements that can be perceived as interactivity:
There are many technical ways to do the things listed above. Therefore a standard is needed if all TV receivers are expected to render the different media correctly.
Now we are at a bifurcation point, exactly where we heard the two differing views on interactivity. Do we want to define this "multimedia embellishment" specification for a broadcast-only environment, or do we want to define it in such a way that the broadcast-only environment is the zero-return channel case of a more general interactivity? In other words, do we want convergence of broadcasting, telecommunication and computer technologies or not?
If we do not want convergence then the multimedia look can be implemented by exploiting some simple hooks in MPEG-2 Systems that let you multiplex "other data" such as characters, graphic files etc. with their time and space information along with audio and video. The approach is to treat all additional information sources as supplementary to the audio-visual information, e.g.
It is clear that no representative of the computer world, the Web obviously included, is ever going to take this solution and use it down to the non-zero return channel case. They have been working for years, and still are, starting from the other end of the spectrum, building multimedia inch by inch as text + graphics + still pictures, with the intention of eventually including audio and video as the last step to full multimedia.
In the mind of the author, who launched and implemented the first step of the idea of the Digital Audio-Visual Council (DAVIC) at the beginning of 1994, DAVIC should have provided the neutral interactive television solution acceptable to the broadcasting, telecommunications and computer worlds. Regrettably he failed. It was not possible to convince DAVIC people to address the problem in a rational way. Those of broadcast background were unable or unwilling to think of it without making reference to "subtitling". They further aggravated the problem with their inability to agree on a compression scheme for 2-D arrays of pixels representing graphic information, because they kept on saying that none of the one hundred or so graphic formats developed by the computer world suited their needs. They invented a new text coding scheme when a very small subset of HTML, whose pages worldwide number in the tens of millions, would have amply sufficed. Those of computer background stuck to their "application downloading" paradigm, an impossible marriage with broadcasting. DAVIC people of telecommunication background gathered around the MHEG standard because it fitted their idea of multimedia information representation, which would obviously require a standardised coded representation. In the event DAVIC settled on a double solution, one that contradicts the "one functionality - one tool" principle: use of MPEG-2 hooks and MHEG. This is convergence of entertainment, telecommunications and computers at work!
Having so created a dividing line between the technology needed to make interactive and non interactive multimedia communication, DAVIC has automatically raised the entry threshold for the former. Interactive multimedia communication will happen, but it will not be via an extension of the current information consumption paradigm that is supported by broadcasting. Interactive multimedia communication will have to wait for a new approach that overcomes the current broadcasting/interactive antithesis.
The key technology breakthrough of MPEG-4 that makes it fundamentally different has been the ability to encode visual objects of arbitrary shape. Because MPEG-2 visual objects are constrained to be of rectangular shape, the composition of a scene is forced to be little more than a juxtaposition of loosely related objects. With MPEG-4 it is possible to compose scenes where different "people" (2D images) can be made to sit together around a table on the same screen and, because 2.5D is not that difficult an extension, to place the different 2D images in a 3D space. Presentation of the scene now becomes a function to be explicitly singled out, because a user is no longer presented with a scene conceived as 2D and displayed as 2D, but with one conceived as 2.5D and displayed as 2D, in a way that may differ from what the author had in mind because the viewpoint is different.
The following figure gives a general reference model for an MPEG-4 receiver. This is an extension of the MHEG-enabled MPEG-2 model with the capability for a user to present the scene depending on his or her viewing/listening point.

Fig. 6 - General reference model of MPEG-4
But MPEG-4 has made further advances in the area of coding of synthetic audio-visual content and aims at providing a unified representation of audio and visual information that is natural or synthetic or both.
MPEG-4 started in July 1993; it reached Working Draft level in November 1996, will reach Committee Draft level in October 1997 and Draft International Standard level in December 1998. A thorough analysis of requirements has led the group to target a standard with the following characteristics
A communication with these features can be defined as Multimedia communication.
Even though the MPEG-4 project predates the Internet frenzy, the motivations at the basis of the project bear a high degree of similarity with some of the topics that make headlines today.
The MPEG-4 standard provides a set of technologies to support
The technical specifications of the standard will be contained in 6 parts
| Part 1 | Systems | 
| Part 2 | Visual | 
| Part 3 | Audio | 
| Part 4 | Conformance Testing | 
| Part 5 | Reference Software | 
| Part 6 | DMIF | 
The purpose of this chapter is to give some technical elements of MPEG-4.
The figure below can be taken as a typical example of the kind of MPEG-4 AVOs that populate an MPEG-4 audiovisual scene.

Fig. 7 - An example of an MPEG-4 audio-visual scene
Audiovisual scenes are composed of several AVOs, organized in a hierarchical fashion. At the leaves of the hierarchy, we find primitive AVOs, such as:
MPEG standardizes a number of types of such primitive AVOs, capable of representing both natural and synthetic content types, which can be either 2- or 3-dimensional. In addition to the AVOs mentioned above and shown in Figure 7, MPEG-4 defines the coded representation of objects such as:
In their coded form, these objects are represented as efficiently as possible. This means that the bits used for coding these objects are no more than necessary for supporting the desired functionalities. Examples of such functionalities are error robustness, allowing extraction and editing of an object, or having an object available in a scalable form. It is important to note that in their coded form, objects (aural or visual) can be represented independently of their surroundings or background.
9.1.1 MPEG-4 Audio
MPEG-4 coding of audio objects provides tools for representing natural sounds (such as speech and music) and for synthesizing sounds based on structured descriptions. The representations provide compression and other functionalities, such as scalability or playing back at different speeds. The representation for synthesized sound can be formed by text or instrument descriptions and by coding parameters to provide effects such as reverberation and spatialization.
The AAC standard (part 7 of MPEG-2) has brought virtual transparency for single-channel music down to 64 kbit/s, from the 128 kbit/s that MPEG-1 Audio had set, and MPEG-4 Audio will deliver interesting performance at bitrates even lower than 64 kbit/s. AAC is therefore already providing part of the MPEG-4 Audio standard.
MPEG-4 standardises natural audio coding at bitrates ranging from 2 kbit/s up to 64 kbit/s. For this range, the MPEG-4 standard normalises the bitstream syntax and decoding processes in terms of a set of tools. In order to achieve the highest audio quality within the full range of bitrates and at the same time provide the extra functionalities, three types of coder have been defined. The lowest bitrate range, between about 2 and 6 kbit/s, mostly used for speech coding at 8 kHz sampling frequency, is covered by parametric coding techniques. Coding at the medium bitrates, between about 6 and 24 kbit/s, uses Code Excited Linear Predictive (CELP) coding techniques. In this region, two sampling rates, 8 and 16 kHz, are used to support a broader range of audio signals (other than speech). For the higher bitrates, typically starting at about 16 kbit/s, time to frequency (T/F) coding techniques, namely VQ and AAC codecs, are applied. The audio signals in this region typically have bandwidths starting at 8 kHz.
To allow for smooth transitions between the bitrates and to allow for bitrate and bandwidth scalability, a general framework has been defined. This is illustrated in the figure below.
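The three bitrate regions overlap, so more than one coder family may be applicable at a given bitrate. The sketch below is only an illustration of the ranges quoted above: it treats the approximate boundaries (2, 6, 16, 24 and 64 kbit/s) as hard limits, and the function name is hypothetical.

```python
def candidate_coders(bitrate_kbps):
    """List the MPEG-4 natural audio coder families whose range covers a bitrate."""
    if not 2 <= bitrate_kbps <= 64:
        raise ValueError("outside the 2-64 kbit/s range discussed here")
    coders = []
    if bitrate_kbps <= 6:
        coders.append("parametric")
    if 6 <= bitrate_kbps <= 24:
        coders.append("CELP")
    if bitrate_kbps >= 16:
        coders.append("T/F (VQ, AAC)")
    return coders


for rate in (4, 6, 20, 48):
    print(rate, candidate_coders(rate))
```

At 20 kbit/s, for example, both CELP and T/F coding fall within their quoted ranges, which is where a scalable combination of techniques becomes relevant.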

Fig. 8 - MPEG-4 Audio coding techniques covering the 2-64 kbit/s range
Starting with a coder operating at a low bitrate, by adding enhancements both the coding quality as well as the audio bandwidth can be improved. These enhancements are realized within a single coder or, alternatively, by combining different techniques.
Additional functionalities are realized both within individual coders, and by means of additional tools around the coders. An example of a functionality within an individual coder is pitch change within the parametric coder.
Decoders are also available for generating sound based on structured inputs. Text input is converted to speech in the Text-To-Speech (TTS) decoder, while more general sounds including music are synthesised in accordance with a score description, which may include MIDI as the musical analog of text in TTS synthesis.
Text To Speech. Hybrid Scalable TTS allows a text or a text with prosodic parameters (pitch contour, phoneme duration, and so on) as its inputs to generate intelligible synthetic speech. It includes the following functionalities.
Score Driven Synthesis.
The Structured Audio/Effects Decoder decodes input data and produces output sounds. The sound synthesizers are defined by "instruments", downloaded in the bitstream, which create and process audio signals under the control of structured input data. An instrument is a small network of signal processing primitives that might emulate some specific sounds such as those of a natural acoustic instrument. The signal-processing network may be implemented in hardware or software and include both generation and processing of sounds and manipulation of prestored sounds. A script or score is a time-sequenced set of commands that invokes various instruments at specific times to contribute their output to an overall music performance or generation of sound effects. MIDI, an established protocol for transmitting score information, is a musical analog of the text input in the TTS synthesizer described above. The score description is more general than MIDI, which is a specific example, as the score can include additional control information to allow the composer finer control over the final synthesized sound. This, in conjunction with customized instrument definition, allows the generation of sounds ranging from simple audio effects such as footsteps or door closures, to the simulation of natural sounds such as rainfall or music played on conventional instruments to fully synthetic sounds for complex audio effects or futuristic music.
The Structured Audio/Effects Decoder processes decoded audio data to provide an output data stream that has been manipulated for special effects with timing accuracy consistent with the effect and the audio sampling rate. Hence this audio decoder is somewhat specialized in that it also allows input of a number of decoded audio channels as well as the parameters needed to control the effect. The effects are essentially specialized "instrument" descriptions serving the function of effects processors on the input streams. The effects processing includes reverberators, spatializers, mixers, limiters, dynamic range control, filters, flanging, chorus or any hybrid of these effects.
If the instrument definitions in a particular Structured Audio/Effects Decoder bitstream include sound generation instruments as described above, then it can both realize a music composition and also organize any other kind of audio, such as speech, sound effects and general ambiance. Likewise, the audio sources can themselves be natural sounds, perhaps emanating from an audio channel decoder or stored wave-table, thus enabling synthetic and natural sources to be merged with complete timing accuracy before being composited with visual objects at the system layer.
9.1.2 MPEG-4 Visual
Visual objects can be either of natural or of synthetic origin. First, the objects of natural origin are described.
The tools for representing natural video in the MPEG-4 visual standard aim at providing standardized core technologies allowing efficient storage, transmission and manipulation of textures, images and video data for multimedia environments. These tools will allow the decoding and representation of atomic units of image and video content, called "video objects" (VOs). An example of a VO could be a talking person (without background) which can then be composed with other AVOs (audio-visual objects) to create a scene. Conventional rectangular imagery is handled as a special case of such objects.
In order to achieve this broad goal rather than a solution for a narrow set of applications, functionalities common to several applications are clustered. Therefore, the visual part of the MPEG-4 standard provides solutions in the form of tools and algorithms for:
The visual part of the MPEG-4 standard will provide a toolbox containing tools and algorithms bringing solutions to the above mentioned functionalities and more.
The MPEG-4 image and video coding algorithms will give an efficient representation of visual objects of arbitrary shape, with the goal to support so-called content-based functionalities. Next to this, it will support most functionalities already provided by MPEG-1 and MPEG-2, including the provision to efficiently compress standard rectangular sized image sequences at varying levels of input formats, frame rates, bit-rates, and various levels of spatial, temporal and quality scalability.
A basic classification of the bit rates and functionalities currently provided by the MPEG-4 Visual standard for natural images and video is depicted in Figure 9 below, with the attempt to cluster bit-rate levels versus sets of functionalities.

Figure 9 - Classification of the MPEG-4 Image and Video Coding Algorithms and Tools
At the bottom end a "VLBV Core" (VLBV: Very Low Bit-rate Video) provides algorithms and tools for applications operating at bit-rates typically between 5 and 64 kbit/s, supporting image sequences with low spatial resolution (typically up to CIF resolution) and low frame rates (typically up to 15 Hz). The basic application-specific functionalities supported by the VLBV Core include:
a) VLBV coding of conventional rectangular size image sequences with high coding efficiency and high error robustness/resilience, low latency and low complexity for real-time multimedia communications applications, and
b) provisions for "random access" and "fast forward" and "fast reverse" operations for VLB multimedia data-base storage and access applications.
The same basic functionalities outlined above are also supported at higher bit-rates with a higher range of spatial and temporal input parameters, up to ITU-R Rec. 601 resolutions, employing algorithms and tools identical or similar to those of the VLBV Core. The bit-rates envisioned range typically from 64 kbit/s up to 4 Mbit/s and applications envisioned include broadcast or interactive retrieval of signals with a quality comparable to digital TV. For these applications at higher bit-rates, tools for coding interlaced signals are specified in MPEG-4.
Content-based functionalities support the separate encoding and decoding of content (i.e. physical objects in a scene, VOs). This MPEG-4 feature provides the most elementary mechanism for interactivity, flexible representation and manipulation with/of VO content of images or video in the compressed domain, without the need for further segmentation or transcoding at the receiver.
For the hybrid coding of natural as well as synthetic visual data (e.g. for virtual presence or virtual environments) the content-based coding functionality allows mixing a number of VOs from different sources with synthetic objects, such as a virtual background.
The extended MPEG-4 algorithms and tools for content-based functionalities can be seen as a superset of the VLBV core and high bit-rate tools - meaning that the tools provided by the VLBV and HBV Cores are complemented by additional elements.
The MPEG-4 Video standard will support the decoding of conventional rectangular images and video as well as the decoding of images and video of arbitrary shape. This concept is illustrated in Figure 10 below.

Fig. 10 - the VLBV Core and the Generic MPEG-4 Coder
The coding of conventional images and video is achieved in a way similar to conventional MPEG-1/2 coding and involves motion prediction/compensation followed by texture coding. For the content-based functionalities, where the image sequence input may be of arbitrary shape and location, this approach is extended by also coding shape and transparency information. Shape may be represented either by an 8 bit transparency component - which allows the description of transparency if one VO is composed with other objects - or by a binary mask.
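As a rough illustration of how an 8 bit transparency component can drive the composition of one VO with other objects, the sketch below blends a VO sample over a background sample. The function names and the rounding blend formula are assumptions made for this example; they are not taken from the MPEG-4 specification.

```python
def compose_pixel(fg, bg, alpha):
    """Blend one 8-bit VO sample over a background sample.

    alpha is the 8-bit transparency value: 0 = fully transparent,
    255 = fully opaque. The +127 term rounds to the nearest integer.
    """
    return (alpha * fg + (255 - alpha) * bg + 127) // 255


def compose_row(fg_row, bg_row, alpha_row):
    """Blend a row of VO samples over a row of background samples."""
    return [compose_pixel(f, b, a) for f, b, a in zip(fg_row, bg_row, alpha_row)]


def binary_mask_to_alpha(mask_row):
    """A binary shape mask is the special case where alpha is 0 or 255."""
    return [255 if inside else 0 for inside in mask_row]
```

With a binary mask the VO either replaces the background sample or leaves it untouched; intermediate alpha values give a weighted mix of the two.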
The extended MPEG-4 content-based approach can be seen as a logical extension of the conventional MPEG-4 VLBV Core or high bit-rate tools towards input of arbitrary shape.
Figure 11 below outlines the basic approach of the MPEG-4 video algorithms to encode rectangular as well as arbitrarily shaped input image sequences.

Figure 11 - Basic block diagram of MPEG-4 Video Coder
The basic coding structure involves shape coding (for arbitrarily shaped VOs) and motion compensation as well as DCT-based texture coding (using standard 8x8 DCT or shape adaptive DCT).
An important advantage of the content-based coding approach taken by MPEG-4 is that the compression efficiency can be significantly improved for some video sequences by using appropriate and dedicated object-based motion prediction "tools" for each object in a scene. A number of motion prediction techniques can be used to allow efficient coding and flexible presentation of the objects:
Efficient Coding of textures and still images is readily supported by the VLBV Core and the high bit-rate and content-based tools, with very low to very high bit compression ratios. In addition, a texture and image coding algorithm is supported which is based on a zero tree wavelet decomposition - to allow a good quality at very high compression ratios together with the ability to efficiently allow up to 32 levels of scalability.
MPEG-4 supports the coding of images and video objects with spatial and temporal scalability, both with conventional rectangular as well as with arbitrary shape. Scalability refers to the ability to only decode a part of a bit stream and reconstruct images or image sequences with:
This functionality is desired for progressive coding of images and video over heterogeneous networks, as well as for applications where the receiver is not willing or capable of displaying the full resolution or full quality images or video sequences. This could for instance happen when processing power or display resolution is limited.
For decoding of still images, the MPEG-4 standard will provide spatial or quality scalability with up to 32 levels of granularity. For video sequences a maximum of 3 levels of granularity will be supported.
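The idea of decoding only part of a scalable bit stream can be sketched as a receiver that keeps the base layer and as many enhancement layers as its resources allow. The function name, the per-layer bit-rate list and the selection rule are hypothetical and only illustrate the principle.

```python
def select_layers(layer_bitrates, available_bitrate, max_layers=None):
    """Return the indices of the layers a receiver would decode.

    layer_bitrates[0] is the base layer; each later entry is an
    enhancement layer that depends on all the layers before it.
    Layers are taken in order until the bit-rate budget or the
    receiver's layer limit is reached.
    """
    chosen, total = [], 0
    for index, rate in enumerate(layer_bitrates):
        if max_layers is not None and index >= max_layers:
            break
        if total + rate > available_bitrate:
            break
        chosen.append(index)
        total += rate
    return chosen
```

A terminal with limited processing power would pass a small `max_layers`, while a terminal on a slow link would pass a small `available_bitrate`; in both cases the remaining layers are simply not decoded.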
MPEG-4 provides error robustness and resilience to allow accessing image or video information over a wide range of storage and transmission media. In particular, due to the rapid growth of mobile communications, it is extremely important that access is available to audio and video information via wireless networks. This implies a need for useful operation of audio and video compression algorithms in error-prone environments at low bit-rates (i.e., less than 64 Kbps).
The error resilience tools developed for MPEG-4 can be divided into three major areas. These areas or categories include resynchronization, data recovery, and error concealment. It should be noted that these categories are not unique to MPEG-4, but instead have been used by many researchers working in the area of error resilience for video. It is, however, the tools contained in these categories that are of interest, and where MPEG-4 makes its contribution to the problem of error resilience.
Resynchronisation
Resynchronization tools, as the name implies, attempt to enable resynchronization between the decoder and the bitstream after a residual error or errors have been detected. Generally, the data between the synchronization point prior to the error and the first point where synchronization is reestablished, is discarded. If the resynchronization approach is effective at localizing the amount of data discarded by the decoder, then the ability of other types of tools which recover data and/or conceal the effects of errors is greatly enhanced.
The resynchronization approach adopted by MPEG-4, referred to as a packet approach, is similar to the Group of Blocks (GOBs) structure utilized by the ITU-T standards of H.261 and H.263. In these standards a GOB is defined as one or more rows of macroblocks (MBs). At the start of a new GOB, information called a GOB header is placed within the bitstream. This header information contains a GOB start code, which is different from a picture start code, and allows the decoder to locate this GOB. Furthermore, the GOB header contains information which allows the decoding process to be restarted (i.e., resynchronize the decoder to the bitstream and reset all predictively coded data).
The GOB approach to resynchronization is based on spatial resynchronization. That is, once a particular macroblock location is reached in the encoding process, a resynchronization marker is inserted into the bitstream. A potential problem with this approach is that since the encoding process is variable rate, these resynchronization markers will most likely be unevenly spaced throughout the bitstream. Therefore, certain portions of the scene, such as high motion areas, will be more susceptible to errors, which will also be more difficult to conceal.
The video packet approach adopted by MPEG-4 is based on providing periodic resynchronization markers throughout the bitstream. In other words, the length of a video packet is not based on the number of macroblocks, but instead on the number of bits contained in that packet. If the number of bits contained in the current video packet exceeds a predetermined threshold, then a new video packet is created at the start of the next macroblock.
A resynchronization marker is used to distinguish the start of a new video packet. This marker is distinguishable from all possible VLC codewords as well as the VOP start code. Header information is also provided at the start of a video packet. Contained in this header is the information necessary to restart the decoding process and includes: the macroblock number of the first macroblock contained in this packet and the quantization parameter necessary to decode that first macroblock. The macroblock number provides the necessary spatial resynchronization while the quantization parameter allows the differential decoding process to be resynchronized. It should be noted that when utilizing the error resilience tools within MPEG-4, some of the compression efficiency tools are modified. For example, all predictively encoded information must be confined within a video packet so as to prevent the propagation of errors.
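The packetization rule described above can be sketched as follows. The `VideoPacket` class, the `packetize` function and the sample numbers are hypothetical; only the rule itself (start a new packet at the next macroblock once the bit count exceeds the threshold, and record the first macroblock number and its quantization parameter in the header) comes from the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VideoPacket:
    first_mb: int                     # macroblock number stored in the header
    quant: int                        # quantization parameter of that macroblock
    mb_bits: List[int] = field(default_factory=list)

    @property
    def bits(self) -> int:
        return sum(self.mb_bits)


def packetize(mb_bits, mb_quant, threshold):
    """Group coded macroblocks into video packets by bit count.

    mb_bits[i] is the coded size of macroblock i in bits and
    mb_quant[i] its quantization parameter.
    """
    packets = []
    current = None
    for mb, (bits, quant) in enumerate(zip(mb_bits, mb_quant)):
        # A new packet starts at the first macroblock, or at the next
        # macroblock after the running bit count exceeded the threshold.
        if current is None or current.bits > threshold:
            current = VideoPacket(first_mb=mb, quant=quant)
            packets.append(current)
        current.mb_bits.append(bits)
    return packets
```

Because packets are cut by bit count, a region that costs many bits (for example a high motion area) is split into more packets than a cheap region, so markers are spaced roughly evenly in the bitstream rather than in the picture.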
In conjunction with the video packet approach to resynchronization, a second method called fixed interval synchronization has also been adopted by MPEG-4. This method requires that VOP start codes and resynchronization markers (i.e., the start of a video packet) appear only at legal fixed interval locations in the bitstream. This helps to avoid the problems associated with start code emulation. That is, when errors are present in a bitstream it is possible for these errors to emulate a VOP start code. In this case, when fixed interval synchronization is utilized the decoder is only required to search for a VOP start code at the beginning of each fixed interval. The fixed interval synchronization method extends this approach to be any predetermined interval.
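A minimal sketch of why fixed interval synchronization helps against start code emulation: if the decoder only looks for a marker at multiples of the interval, a marker pattern that appears elsewhere is never mistaken for a real one. The bit pattern `00001` and the 16-bit interval are invented for the example and are not the MPEG-4 start codes.

```python
def find_markers(bits, marker, interval=None):
    """Return the bit positions at which `marker` is found.

    bits and marker are strings of '0' and '1'. Without an interval
    every position is searched; with an interval only positions that
    are multiples of it are considered legal.
    """
    step = interval if interval else 1
    last_start = len(bits) - len(marker)
    return [pos for pos in range(0, last_start + 1, step)
            if bits.startswith(marker, pos)]


# Real markers sit at bit 0 and bit 16; the pattern at bit 8 stands
# for an emulated marker produced by transmission errors.
stream = "00001" + "110" + "00001" + "111" + "00001" + "1" * 11
unrestricted = find_markers(stream, "00001")
restricted = find_markers(stream, "00001", interval=16)
```

The unrestricted search reports the emulated marker together with the real ones, whereas the restricted search only reports the markers located at legal positions.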
Data Recovery
After synchronization has been reestablished, data recovery tools attempt to recover data that in general would be lost. These tools are not simply error correcting codes, but instead techniques which encode the data in an error resilient manner. For instance, one particular tool that has been endorsed by the Video Group is Reversible Variable Length Codes (RVLC). In this approach, the variable length codewords are designed such that they can be read both in the forward as well as the reverse direction. Examples of such codewords are 111, 101, 010. Codewords such as 100 would not be used. Obviously, this approach reduces the compression efficiency achievable by the entropy encoder. However, the improvement in error resiliency is substantial.
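The reversibility property can be checked mechanically. The sketch below assumes that a code is reversible when no codeword is a prefix of another and no codeword is a suffix of another; symmetric codewords such as those quoted above are one way to obtain this. The three-entry table is a toy example, not the RVLC table of MPEG-4.

```python
def is_prefix_free(codewords):
    """True if no codeword is a proper prefix of a different codeword."""
    return not any(a != b and b.startswith(a)
                   for a in codewords for b in codewords)


def is_reversible(codewords):
    """True if the code can be parsed from either end of the bitstream."""
    backwards = [w[::-1] for w in codewords]
    return is_prefix_free(codewords) and is_prefix_free(backwards)


def decode_forward(bits, table):
    """Decode a string of '0'/'1' from left to right."""
    symbols, word = [], ""
    for bit in bits:
        word += bit
        if word in table:
            symbols.append(table[word])
            word = ""
    return symbols


def decode_backward(bits, table):
    """Decode the same string starting from its last bit."""
    reversed_table = {w[::-1]: s for w, s in table.items()}
    return decode_forward(bits[::-1], reversed_table)[::-1]


toy_table = {"0": "a", "11": "b", "101": "c"}   # hypothetical symmetric code
```

When an error is detected in the middle of a packet, a decoder can decode forward up to the error and backward from the next resynchronization marker, so that only the data around the error is lost.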
Error Concealment
Error concealment is an extremely important component of any error robust video codec. Similar to the error resilience tools discussed above, the effectiveness of an error concealment strategy is highly dependent on the performance of the resynchronization scheme. Basically, if the resynchronization method can effectively localize the error then the error concealment problem becomes much more tractable. For low bitrate, low delay applications the current resynchronization scheme provides very acceptable results with a simple concealment strategy, such as copying blocks from the previous frame.
In recognizing the need to provide enhanced concealment capabilities, the Video Group has developed an additional error resilient mode that further improves the ability of the decoder to localize an error.
Specifically, this approach utilizes data partitioning by separating the motion and the texture. This approach requires that a second resynchronization marker be inserted between motion and texture information. If the texture information is lost, this approach utilizes the motion information to conceal these errors. That is, due to the errors the texture information is discarded, while the motion is used to motion compensate the previous decoded VOP.
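Both concealment strategies mentioned here, copying a block from the previous frame and motion compensating the previous decoded VOP when only the texture is lost, can be sketched with one function. The frame layout (a list of rows of samples), the function name and the motion vector convention (an offset into the previous frame) are assumptions made for the example.

```python
def conceal_block(prev_frame, x, y, size, motion=(0, 0)):
    """Rebuild a lost size x size block from the previous frame.

    (x, y) is the top-left corner of the lost block. With the default
    motion of (0, 0) the co-located block is copied; when the motion
    vector (dx, dy) survived, the block it points to in the previous
    frame is used instead. Coordinates are clamped to the frame.
    """
    height, width = len(prev_frame), len(prev_frame[0])
    dx, dy = motion
    block = []
    for row in range(size):
        src_y = min(max(y + row + dy, 0), height - 1)
        block.append([
            prev_frame[src_y][min(max(x + col + dx, 0), width - 1)]
            for col in range(size)
        ])
    return block
```

The second marker between motion and texture data is what makes the motion vector available in the second case: an error confined to the texture part leaves the motion part decodable.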
9.1.2.8 Synthetic visual
In VRML it is relatively easy to create models of things that are not live, like a table or a chair. However, it is virtually impossible to create a good model of a human face and body. For the next generation of multimedia communications this is a very important piece missing from VRML.
MPEG-4 is first working on developing the capability to create representations and models for human faces and bodies. It is developing the standardised set of parameters required to model a human face and also to synchronise the facial expression and lip movements with audio. Used in addition to VRML or a VRML-like language, this makes it possible to create realistic scenes.
The Face is an object capable of producing faces in the form of 3D polygon meshes ready for rendering. The shape, texture and expressions of the face are generally controlled by the bitstream containing instances of Facial Definition Parameter (FDP) sets and/or Facial Animation Parameter (FAP) sets. Upon construction, the Face object contains a generic face with a neutral expression. This face can already be rendered. It is also immediately capable of receiving the FAPs from the bitstream, which will produce animation of the face: expressions, speech etc. If FDPs are received, they are used to transform the generic face into a particular face determined by its shape and (optionally) texture.
The Body is an AV object capable of producing virtual human body models and animations in the form of a set of 3D polygon meshes ready for rendering. Two sets of parameters are defined for the body: the Body Definition Parameter (BDP) set and the Body Animation Parameter (BAP) set. The BDP set defines the parameters that transform the default body into a customised body with its body surface, body dimensions, and (optionally) texture. The Body Animation Parameters (BAPs), if correctly interpreted, will produce reasonably similar high level results in terms of body posture and animation on different body models, without the need to initialise or calibrate the model.
Figure 7 gives an example that highlights the way in which an audiovisual scene in MPEG-4 is composed of individual objects. The figure contains compound AVOs that group elementary AVOs together. As an example: the visual object corresponding to the talking person and the corresponding voice are tied together to form a new compound AVO, containing both the aural and visual components of a talking person.
Such grouping allows authors to construct complex scenes, and enables consumers to manipulate meaningful (sets of) objects.
More generally, MPEG-4 provides a standardized way to compose a scene, allowing for example to:
The scene composition borrows several concepts from VRML in terms of both its structure and the functionality of object composition nodes.
In order to facilitate the development of authoring, manipulation and interaction tools, scene descriptions are coded independently from streams related to primitive AV objects. Special care is devoted to the identification of the parameters belonging to the scene description. This is done by differentiating parameters that are used to improve the coding efficiency of an object (e.g., motion vectors in video coding algorithms), and the ones that are used as modifiers of an object (e.g., the position of the object in the scene). Since MPEG-4 should allow the modification of this latter set of parameters without having to decode the primitive AVOs themselves, these parameters are placed in the scene description and not in primitive AV objects.
The following list gives some examples of the information described in a scene description.
How objects are grouped together: An MPEG-4 scene follows a hierarchical structure which can be represented as a directed acyclic graph. Each node of the graph is an AV object, as illustrated in Figure 12 (note that this tree refers back to Figure 7). The tree structure is not necessarily static; node attributes (e.g., positioning parameters) can be changed while nodes can be added, replaced, or removed.

Figure 12 - Logical structure of a scene
How objects are positioned in space and time: In the MPEG-4 model, audiovisual objects have both a spatial and a temporal extent. Each AV object has a local coordinate system. A local coordinate system for an object is one in which the object has a fixed spatio-temporal location and scale. The local coordinate system serves as a handle for manipulating the AV object in space and time. AV objects are positioned in a scene by specifying a coordinate transformation from the object's local coordinate system into a global coordinate system defined by one or more parent scene description nodes in the tree.
Attribute Value Selection: Individual AV objects and scene description nodes expose a set of parameters to the composition layer through which part of their behavior can be controlled. Examples include the pitch of a sound, the color for a synthetic object, activation or deactivation of enhancement information for scalable coding, etc.
Other transforms on AVOs: The scene description structure and node semantics are heavily influenced by VRML, including its event model. This provides MPEG-4 with a very rich set of scene construction operators, including graphics primitives, that can be used to construct sophisticated scenes.
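The hierarchical structure and the local-to-global coordinate transformation described in this list can be sketched with a small tree of nodes. The `SceneNode` class is hypothetical and deliberately reduced to a 2D translation and a uniform scale; it is not the node set of the MPEG-4 scene description.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SceneNode:
    name: str
    offset: Tuple[float, float] = (0.0, 0.0)   # position in the parent's coordinates
    scale: float = 1.0                         # scale applied to the children
    children: List["SceneNode"] = field(default_factory=list)

    def add(self, child: "SceneNode") -> "SceneNode":
        self.children.append(child)
        return child

    def remove(self, name: str) -> None:
        self.children = [c for c in self.children if c.name != name]


def global_positions(node, origin=(0.0, 0.0), scale=1.0):
    """Map every node name to its position in the global coordinate system."""
    gx = origin[0] + scale * node.offset[0]
    gy = origin[1] + scale * node.offset[1]
    positions = {node.name: (gx, gy)}
    for child in node.children:
        positions.update(global_positions(child, (gx, gy), scale * node.scale))
    return positions
```

Changing the `offset` of a compound node moves all of its children at once, and nodes can be added or removed without touching the coded data of the objects themselves.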
AV object data is conveyed in one or more Elementary Streams. The streams are characterized by the Quality of Service (QoS) they request for transmission (e.g., maximum bit rate, bit error rate, etc.), as well as other parameters, including stream type information to determine the required decoder resources and the precision for encoding timing information. How such streaming information is transported in a synchronized manner from source to destination, exploiting different QoS as available from the network, is specified in terms of an Access Unit Layer and a conceptual two-layer multiplexer, as depicted in Figure 13 below.

Fig. 13 - The MPEG-4 System Layer Model
The Access Unit Layer makes it possible to identify Access Units (e.g., video or audio frames, scene description commands) in Elementary Streams, recover the time base of the AV objects or scene descriptions, and enable synchronization among them. The Access Unit header can be configured in a large number of ways, allowing it to be used in a broad spectrum of systems.
The "FlexMux" (Flexible Multiplexing) Layer is fully specified by MPEG. It contains a multiplexing tool which allows Elementary Streams (ESs) to be grouped together with a low multiplexing overhead. This may be used, for example, to group ESs with similar QoS requirements.
The "TransMux" (Transport Multiplexing) layer in Figure 13 models the layer that offers transport services matching the requested QoS. Only the interface to this layer is specified by MPEG-4. Any suitable existing transport protocol stack such as (RTP)/UDP/IP, (AAL5)/ATM, or MPEG-2's Transport Stream over a suitable link layer may become a specific TransMux instance. The choice is left to the end user/service provider, and allows MPEG-4 to be used in a wide variety of operation environments.
Use of the FlexMux multiplexing tool is optional and this layer may be bypassed if the underlying TransMux instance provides equivalent functionality. The Access Unit Layer, however, is always present.
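To give a feeling for what grouping Elementary Streams with a low multiplexing overhead can look like, the sketch below interleaves packets from several streams into one byte string with a two-byte header per packet. The framing (one byte for a channel index, one byte for the payload length) is an assumption made for the example and should not be read as the normative FlexMux syntax.

```python
def mux(packets):
    """Interleave (channel, payload) pairs into a single byte string."""
    out = bytearray()
    for channel, payload in packets:
        if not 0 <= channel <= 255 or len(payload) > 255:
            raise ValueError("channel and payload length must fit in one byte")
        out += bytes([channel, len(payload)]) + payload
    return bytes(out)


def demux(data):
    """Split a multiplexed byte string back into per-channel payload lists."""
    streams = {}
    pos = 0
    while pos < len(data):
        channel, length = data[pos], data[pos + 1]
        streams.setdefault(channel, []).append(data[pos + 2:pos + 2 + length])
        pos += 2 + length
    return streams
```

In this sketch each packet costs two bytes of overhead, and the receiver recovers every stream without knowing anything about the payload format.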
With regard to Figure 13, it will be possible to:
Part of the control functionalities will be available only in conjunction with a transport control entity like the DMIF framework.
DMIF (Delivery Multimedia Integration Framework) is a functionality located between the MPEG-4 application and the transport network as shown in Figure 14 below.

Fig. 14 - The DMIF Architecture
To the application DMIF presents a transparent interface, irrespective of whether MPEG-4 is reached by interacting with a remote interactive peer over networks and/or by interacting with broadcast or storage media.
An MPEG-4 application through the DMIF interface can establish a multiple peer application session. Each peer is identified by a unique address. A peer may be a remote interactive peer over a network or can be pre-cast (over broadcast or storage media). An interactive peer, irrespective of whether it initiated the session, can select a service, obtain a scene description and request specific streams for AVOs from the scene to be transmitted with the appropriate QoS.
The MPEG-4 application can request from DMIF the establishment of channels with specific QoSs and bandwidths for each elementary stream. DMIF ensures the timely establishment of the channels with the specified bandwidths while preserving the QoSs over a variety of intervening networks between the interactive peers. DMIF allows each peer to maintain its own view of the network, thus reducing the number of stacks supported at each terminal. Control of DMIF spans both the FlexMux and the TransMux layers shown in Figure 13 above. It uses an open interface which accommodates existing and future networks through templates called connection resource descriptors.
In a typical operation an end-user may access AVOs distributed over a number of remote interactive peers, broadcast and storage systems. The initial network connection to an interactive peer may consist of a best effort connection over a ubiquitous network. If the content warrants it, the end-user may seamlessly scale up the quality by adding enhanced AVO streams over connection resources with guaranteed QoS. DMIF provides a globally unique network session identifier which can be used to tag the resources and log their usage for subsequent billing.
MPEG-4 allows for user interaction with the presented content. This interaction can be separated into two major categories: client-side interaction and server-side interaction. Client-side interaction involves content manipulation which is handled locally at the end-user's terminal, and can take several forms. In particular, the modification of an attribute of a scene description node, e.g., changing the position of an object, making it visible or invisible, changing the font size of a synthetic text node, etc., can be implemented by translating user events (e.g., mouse clicks or keyboard commands) to scene description updates. The commands can be processed by the MPEG-4 terminal in exactly the same way as if they originated from the original content source. As a result, this type of interaction does not require standardization.
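The translation of user events into scene description updates can be sketched as follows. The event dictionaries, the update format and the flat scene dictionary are all hypothetical; the point of the example is only that a locally generated update goes through the same `apply_update` path as one received from the content source.

```python
def event_to_update(event, scene):
    """Translate a (hypothetical) user event into a scene update command."""
    node = event["target"]
    if event["type"] == "drag":
        return {"node": node, "field": "position", "value": event["to"]}
    if event["type"] == "click":
        # A click toggles the visibility of the targeted node.
        return {"node": node, "field": "visible",
                "value": not scene[node]["visible"]}
    raise ValueError(f"unsupported event type: {event['type']!r}")


def apply_update(scene, update):
    """Apply an update command, whatever its origin, to the scene."""
    scene[update["node"]][update["field"]] = update["value"]
```

Since the terminal cannot tell a local update from a remote one, nothing beyond the update mechanism itself has to be agreed between terminals for this kind of interaction.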
Other forms of client-side interaction require support from the scene description syntax, and are specified by the standard. The use of the VRML event structure provides a rich model on which content developers can create compelling interactive content.
Server-side interaction involves content manipulation that occurs at the transmitting end, initiated by a user action. This, of course, requires that a back-channel is available.
In MPEG-1 and MPEG-2 extensive use was already made of simulation programs written in the C language to implement the Simulation Models of MPEG-1 and the Test Models of MPEG-2. Part 5 of each standard gives a software implementation of both encoder and decoder.
The environment in which MPEG-4 is being developed is one where general purpose CPUs, possibly assisted by specialised accelerators, can implement a fully-fledged MPEG-4 decoder. The number of such software implementations can easily reach high numbers and become a de facto MPEG-4 standard specification. Because of these considerations MPEG-4 has already been labelled as a "software standard", to be treated with the same mechanisms as a software product.
MPEG has therefore substantially innovated the existing process with the definition of a Reference Implementation of the MPEG-4 Systems, Audio, Video and DMIF standard, written in C or C++. A large number of companies have donated the copyright of some of their software implementing parts of both encoder and decoder for all parts of the standard to ISO. This will become the bulk of part 5 of MPEG-4, called "Reference Software". Such software may be used for implementations conforming to the MPEG-4 standard, including commercial applications. Of course the rights to patents that are necessary to implement the standard have still to be acquired.
In parallel to this activity MPEG is actively developing a complete software "MPEG-4 player". The intention is to make the player freeware for people to use and understand the features offered by the standard.
In this context falls a recent decision made by MPEG to manage the evolution of MPEG-4 in versions. Version 1.0 has been developed according to the following schedule:
Tab. 3 - MPEG-4 Workplan
| Part | Title | WD | CD | FCD | DIS | IS | 
| 1 | Systems | 97/10 | 98/07 | 98/12 | 99/02 | |
| 2 | Visual | 97/10 | 98/07 | 98/12 | 99/02 | |
| 3 | Audio | 97/10 | 98/07 | 98/12 | 99/02 | |
| 4 | Conformance Testing | 97/11 | 98/10 | 99/07 | 99/12 | 00/02 | 
| 5 | Reference Software | 97/10 | 98/07 | 98/12 | 99/02 | |
| 6 | Delivery Multimedia Integration Framework (DMIF) | 97/07 | 97/10 | 98/07 | 98/12 | 99/02 | 
This version will not accommodate all expected features of the MPEG-4 standard. Those that will not be supported in the first version will be moved to version 2.
Whatever new consumption paradigm is brought about by the Internet, and the WWW in particular, the role of content as the engine that drives authors to produce it and users to consume it will remain intact. The nature and management of the relevant IPR, however, are not necessarily going to be the same as today.
In MPEG-1 no IPR management and protection was added to the standard. The reason was that the target applications were interactive video on CDs and digital audio broadcasting. At that time neither application needed IPR management and protection.
In MPEG-2 some form of IPR management was added to the standard. This is provided by "content identification", i.e. the possibility to identify the agency managing the rights of a given piece of audio or video or audio-visual content. The fact that content is digital and can be "stamped" with the copyright descriptor gives the advantage that more effective IPR management by automatic processing in the delivery chain becomes possible. There is also support for IPR protection, so that pay television services become possible.
MPEG-4 provides a mechanism to identify assets via the Intellectual Property Identification Data Set (IPI Data Set). The IPI Data Set identifies content either by means of internationally standardised numbering systems (e.g. ISRC, ISAN, ISWC-T/L, ISBN, DOI, etc.) or by privately generated key/value pairs (e.g. »Composer«/»Lohn Jennon«). The IPI Data Set can be used by IPMP systems as input to the management and protection process. For example, this can be used to generate audit trails that track content use.
More than its predecessors, MPEG-4 will serve as the basis for a variety of products and services in different domains. The intellectual property management and protection methods required are as diverse as these applications because the level and type of protection required depends on the content's value, complexity, and the sophistication of the associated business models. The MPEG-4 IPMP framework provides application builders with the ability to construct the most appropriate domain-specific IPMP solution.
This approach allows the design of domain-specific IPMP systems (IPMP-S). While MPEG-4 does not standardize IPMP systems, it does standardize the MPEG-4 IPMP interface. This interface consists of IPMP-Descriptors (IPMP-Ds) and IPMP-Elementary Streams (IPMP-ES). IPMP Elementary Streams are like any other MPEG-4 elementary stream and IPMP Descriptors are extensions to MPEG-4 object descriptors.
IPMP-Ds and IPMP-ESs provide a communication mechanism between IPMP systems and the MPEG-4 terminal. Note that an application may require multiple IPMP systems. When MPEG-4 objects require management and protection, they have IPMP-Ds associated with them. These IPMP-Ds indicate which IPMP systems are to be used and provide information to these systems about how to manage and protect the content.
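The division of labour between the standardised interface and the non-standardised IPMP systems can be sketched as a terminal that looks at the descriptors attached to an object and hands the data to whichever IPMP system they name. The class, the system identifier and the XOR "protection" are invented for the example; real IPMP systems and descriptor contents are outside the scope of this sketch.

```python
class IpmpTerminal:
    """Hypothetical terminal that dispatches to registered IPMP systems."""

    def __init__(self):
        self.systems = {}   # system id -> callable(payload, data) -> payload

    def register(self, system_id, handler):
        self.systems[system_id] = handler

    def process(self, payload, descriptors):
        """Run the payload through every IPMP system named by its descriptors.

        descriptors is a list of (system_id, data) pairs; an object
        without descriptors is passed through unchanged.
        """
        for system_id, data in descriptors:
            if system_id not in self.systems:
                raise LookupError(f"no IPMP system {system_id!r} available")
            payload = self.systems[system_id](payload, data)
        return payload


def xor_system(payload, key):
    """Toy stand-in for a proprietary IPMP system (not real protection)."""
    return bytes(b ^ key for b in payload)
```

The terminal only needs to understand the descriptor format; what a given IPMP system does with the data it receives remains private to that system.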
Fig. 15 indicates a variety of points in the MPEG-4 terminal at which one might need IPMP control, e.g. between Demux and the elementary stream decoders or after stream decoding. For example, retrieval of watermarks introduced prior to content encoding can only be done after content decoding. Applying control to post-decode BIFS streams and individual elements might also be desired and is a fundamentally different kind of operation. In general, the IPMP control points involve different kinds of mechanisms ranging from rule processing to decryption to watermarking. The actual processing of this control occurs in the IPMP System.

Figure 15 - IPMP Framework in the ISO/IEC 14496 Terminal Architecture
For traditional hardware device implementations consumers pay a one-time fixed amount of money for the device. This includes a portion that corresponds to patent rights. Consumers then pay for each piece of content that is consumed. This model can be implemented directly using MPEG-4. However, it is expected that there will be software implementations of MPEG-4 players running on general purpose, programmable CPUs. In this case, it is not practical to adopt the traditional hardware implementation approach. Users may use players to render content that never exercises certain patents embedded in the software implementation. The question then arises as to whether or not users should pay royalties for patents that are never actually used.
The MPEG-4 IPMP framework allows implementers of the standard and service providers to deploy a wide variety of business models. These include the possibility to charge for the consumption of content-related IP as well as for technology-related IP. This is possible because the IPMP framework provides a mechanism to include information in the bit stream that enables non-normative IPMP systems to manage and protect any kind of MPEG-4 encoded content.
One business model that will be considered in the following example is the so-called pay-per-use model. Fig. 16 illustrates this example.

Fig. 16 - A Patent IP protection example
A mechanism for detecting, auditing and controlling the use of the different features of a player is needed. Such a mechanism should be part of the non-normative IPMP system and should process IPMP messages contained in the IPMP elementary stream or descriptors that describe not only how content should be managed, but also how the player should audit its use of the particular rendering engines to process the content. The players themselves might undergo some kind of registration process to control player distribution. Each player might implement an IPMP policy that is based on credentials and rules that describe the auditing processes required for the processing of different kinds of content. Clearly such functionality must be implemented in a tamper-resistant manner.
MPEG-1 and MPEG-2 have been designed and are widely used to encode content that has a clear identity such as a movie, a documentary etc. In the current usage of MPEG-2 the so-called "Service Information" describes each piece of content according to well-identified categories, so as to enable search by a user.
This solution serves well the purpose for which it has been designed: to find information of interest in a large but still manageable number of programs. It would be awkward, as an example, to extend the solution for use in content search in the Web. This is, however, the paradigm, if not exactly the environment, in which MPEG-4 will mostly be used.
The lack of suitable search technologies is one of the reasons why, in spite of the explosive growth of the Web, many are questioning its business value. The problem is exacerbated by the fact that HTML was just designed as a language to encode text and links without any consideration for the information searching function. Searching information is, however, not possible for audio-visual content, as no generally recognised description of this material exists. In general, it is not possible to efficiently search the web for, say, a picture of 'the Motorbike from Terminator II'. In specific cases, solutions do exist. Multimedia databases on the market today allow searching for pictures using characteristics like colour, texture and information about the shape of objects in the picture.
The need to avoid this limitation in MPEG-4 has been clearly identified, and a new project has started with the title "Multimedia Content Description Interface", nicknamed MPEG-7, that will extend the limited search capabilities of today to include more information types. In other words: MPEG-7 will standardise a way to describe various types of multimedia information. This description will be associated with the content itself, to allow fast and efficient searching for material that a user may be interested in.
These types of information include: still pictures, graphics, audio, moving video, and information about how these elements are combined in a multimedia presentation ('scenarios', composition information). Special cases of these general formats include: facial expression, personal characteristics etc.
The description can be attached to any kind of multimedia material, no matter what the format of the representation is. Stored material that has this information attached to it can be indexed and searched for. Even though the MPEG-7 description does not depend on the (coded) representation of the material, the standard in a way builds on the MPEG-4 standard, which provides the means to encode audio-visual material as a number of objects having certain relations in space (on the screen) and time.
The standardised description of the different types of information can exist at a number of semantic levels. To take the example of visual material: a lower abstraction level could be a description of e.g. shape, size, texture, colour, and composition ('where in the scene can the object be found?'). The highest level gives semantic information: 'This is a scene with a brown dog on the left and a blue ball that falls down on the right' - coded in an efficient form. Intermediate levels of representation can also exist.
Next to having a description of the content, it may also be required to include other forms of information about the multimedia data:
To fully exploit the possibilities of such a description, an automatic feature extraction will be extremely useful. Such a feature extraction algorithm will, however, be outside of the scope of the standard. Also the search engines themselves will not be specified within the scope of MPEG-7. The figure below indicates what will be the specific field covered by the MPEG-7 standard.

Fig. 17 - Scope of MPEG-7
This paper has addressed the decade-old problem of multimedia communications, recognising the unfulfilled promises of this new communication domain and clearly separating the technical issues from the "convergence" hype of the early nineties.
The multi-industry nature of multimedia communications calls for cross-industry standards. The difficulty of dealing with industries having such different approaches to standardisation has been recognised, but the successful recipe adopted by MPEG in its MPEG-1 and MPEG-2 standards can be applied again to the new standardisation project MPEG-4, which promises to become the enabling technology for multimedia communications.
12. Acknowledgements
The Author would like to thank Olivier Avaro (FT/CNET), Phil Chou (Xerox), Touradj Ebrahimi (EPFL), Paul Fellows (ST), Ajay Luthra (GI), Geoff Morrison (BTLabs), Pentti Haikonen (Nokia), Rob Koenen (KPN), Kevin O'Connel (Motorola), Sakae Okubo (GCL), Pete Schirling (IBM), Ali Tabatabai (Tektronix) for their careful reading of the manuscript and their valuable comments.
13. References
1. Sakae Okubo, Ken McCann, Andrew Lippman, "MPEG-2 requirements, profiles and performance verification - Framework for developing a generic video coding standard", Signal Processing: Image Communication, Vol. 7, pp. 201-209, 1995.