MPEG: from the conception of the idea to its effects

Leonardo Chiariglione - CSELT, Italy



1         Introduction

A group of experts, established 10 years ago in a largely unknown ISO working group, has changed the world of digital audio and video standardisation.

This paper describes the context in which the idea of MPEG was born, the motivation that drove its establishment and the organisation that the group gave itself. The good standardisation rules that the group follows are illustrated; these are considered sufficiently general to apply to standardisation at large. The typical development process of an MPEG standard is also presented. A summary description of the 3 standards completed so far (MPEG-1, 2 and 4) is given, along with a presentation of the plans for MPEG-7, the latest element of the MPEG family.


2         The '80s: digital technologies reach the user

The digital signal processing theory started by Nyquist in 1928 bore its first fruits in the '60s, when the CCITT (now ITU-T) published the famous recommendations for the digitisation of telephone speech, defining a sampling frequency of 8 kHz and two quantisation methods: μ law with 7 bits/sample and A law with 8 bits/sample. Telecommunication operators started installing transmission multiplexers capable of carrying 24 or 30 telephone signals at a bitrate of 1544 or 2048 kbit/s, respectively. This "network digitisation" had, however, no practical effect on the user, because the telephone set was still the direct offspring of A. G. Bell's telephone.

At the beginning of the '80s, with the publication of its Recommendation H.120, the CCITT for the first time brought a 1.5/2 Mbit/s videoconference stream to the home of a subscriber, albeit a rather special one.

In the same years the consumer electronics industry, and particularly Philips, RCA and Sony, began to put on the market the first equipment that carried bits, with the meaning of musical audio, into consumers' homes. After a brief battle the Compact Disc (CD) as defined by Philips and Sony prevailed over RCA's format and succeeded in bringing studio-quality stereophonic audio to millions of homes, using a physical medium capable of producing a stream of bits at the rate of 1410 kbit/s, very similar to the rate used for CCITT H.120.
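The multiplex and CD bitrates quoted in the last two paragraphs follow directly from the underlying PCM parameters. A minimal sketch of the arithmetic (the function and variable names are illustrative):

```python
def pcm_rate_kbps(channels, sample_rate_hz, bits_per_sample):
    """Raw PCM bitrate in kbit/s."""
    return channels * sample_rate_hz * bits_per_sample / 1000

# One telephone channel: 8 kHz sampling, 8 bits/sample (A law) -> 64 kbit/s
voice = pcm_rate_kbps(1, 8000, 8)

# E1 multiplex: 30 voice channels plus 2 timeslots for framing and signalling
e1 = 32 * voice          # 2048.0 kbit/s

# T1 multiplex: 24 voice channels plus 8 kbit/s of framing bits
t1 = 24 * voice + 8      # 1544.0 kbit/s

# Compact Disc: stereo, 44.1 kHz, 16 bits/sample (about the 1410 kbit/s quoted)
cd = pcm_rate_kbps(2, 44100, 16)   # 1411.2 kbit/s
```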

The mid '80s saw the beginning of the standardisation activity that would give rise to CCITT Recommendations H.221 (multiplex) and H.261 (video coding at px64 kbit/s), together with other CCITT Recommendations for coding speech sampled at 8 and 16 kHz at bitrates less than or equal to 64 kbit/s. These activities were synergistic with the huge CCITT standardisation project known as ISDN (Integrated Services Digital Network), which would bring 144 kbit/s to subscribers over the existing line. Lastly, at the end of the '80s, the first results of the ADSL (Asymmetric Digital Subscriber Line) technique, enabling the transmission of 1.5 Mbit/s downstream and a few tens of kbit/s upstream, were shown.

In the broadcasting domain several laboratories, especially in Europe, carried out studies to develop modulation methods that would enable the transmission of digitised audio and television signals in the VHF and UHF bands used today for broadcasting on the terrestrial network. They were coming to the conclusion that, in typical conditions, such frequencies could carry between 1 and 4 bit/s per Hz.

In the consumer electronics domain several laboratories were studying methods to code audio-visual signals to record them in digital form on magnetic tapes, while Philips and RCA studied methods to code video signals at bitrates of 1.4 Mbit/s in order to enable storage on compact discs for interactive applications.

Laboratories of broadcasting companies and related industries were also active in the field of audio and video coding for broadcasting purposes. It is worth recalling here the pioneering efforts of RAI and Telettra in the development of HDTV codecs for satellite broadcasting and General Instrument's Digicipher system for terrestrial HDTV broadcasting.

The short list of examples above shows how, at the end of the '80s, the telecommunication and consumer electronics industries had already embarked on implementations of industrial value that were based on digital technologies and provided end users with services and applications capable of consolidating or extending their own businesses. The same industries, together with those engaged in broadcasting, were actively working on new generations of digital equipment that were likely to have a further impact on the business of the different industries.

Beyond the superficial commonality of technological solutions, there were fundamental differences of traditions, strategies and regulatory concerns among the different industries and, within each industry, among the different countries or regions of the world. The telecommunication industry placed great value on the existence of standard solutions but was typically weak in end-user equipment: operators were forced, on the one hand, to adhere to standards and, on the other, were left to the goodwill of a manufacturing industry with scarce inclination to invest, because of the expectation of guaranteed orders from operators. The interest in digital solutions, setting aside the long-term plan to bring fibres of virtually unlimited bandwidth to end users, was tied to the prospects of the digitisation of the telephone network: basic-access ISDN (144 kbit/s) for the general user and at most primary-access ISDN (1.5/2 Mbit/s) for the professional user.

The consumer electronics industry did not feel particularly constrained by the existence of standards (suffice it to mention the adoption of 44.1 kHz as the sampling frequency of compact disc audio, dictated by the use of video recorders in the early encoding equipment) but had the culture and the financial muscle to develop user equipment, including sophisticated integrated circuits. That industry, however, was weak when the equipment had to be brought to market, because of the existence of solutions from other manufacturers that were functionally similar but technologically incompatible (suffice it to mention V2000, Betamax and VHS, the three formats of home video cassette recorders).

Even more complex was the attitude of the entertainment industry. This was rigidly regulated in Europe and Japan and less visibly, but equally rigidly, regulated in the USA. In Europe the European Commission had laid down the policy of evolution of television towards HDTV going through D2-MAC (another format of 625-line television) and HD-MAC (a format with twice the number of lines), both via satellite. In the USA and Japan the policy was one of evolution from analogue NTSC to analogue HDTV. In Japan this was expected to happen via satellite while in the USA this was expected to happen as an evolution of the terrestrial network.

The attentive reader has probably noticed that all the canonical "converging" industries have been mentioned except the computer industry. The reason is that, even though the computer industry had been the first to make massive use of data processing techniques on users' premises, at the end of the '80s the computing machines within reach of end users - mostly Macintosh and IBM-compatible Personal Computers, but also Commodore 64 and Amiga - still needed at least an order of magnitude more processing power to supply their users with moving images of acceptable size and sound of acceptable quality.


3         MPEG - an idea that comes from farther away

During his previous incarnation as a researcher in the video coding field, the author was no stranger to attempts at unification. Rather than intending to promote the "convergence" of industries, he wanted to achieve coding architectures common to different industries, so as to enable the sharing of the development cost of the integrated circuits needed to build user devices at a sufficiently low price, on an equal footing for all industries. Among the various attempts, only the study of a universal video codec [1] based on DCT technology will be mentioned here.

The CSELT-led IVICO (Integrated Video Codec) project of the RACE (Research & Development of Advanced Communication for Europe) programme brought together telecommunication operators, broadcasting companies, and integrated circuit and equipment manufacturers, all sharing the idea of defining and developing a minimum number of common integrated circuits (motion estimation and compensation, DCT, memory control etc.) and using them for a wide spectrum of applications. The project was terminated after one year for several reasons, not least the hostility from certain quarters caused by the prospect of being offered, within a few years, integrated circuits usable in digital television applications, in conflict with the official policy of the European Commission that digital technologies would be applied to television only in the first decade of the third millennium.

The demonstrated impossibility of putting into practice, in the European environment, a project for the definition and development of a generic audio-visual coding technology for use by multiple industries prompted the need to operate on a world-wide scale, so as to be sheltered from influences and pressures of a non-technical nature. On the one hand this was a scaling down of the original ambition of actually developing the technology; on the other, operating on a world-wide scale could provide a truly global solution.

Chance (or providence) provided the solution. During the Globecom '86 conference in Houston in December 1986 the author was invited by Hiroshi Yasuda, an alumnus of the University of Tokyo in the same years (1968-70) in which the author had been a foreign Ph.D. student there, to take part in the JPEG activity carried out by a group inside working group 8 (WG 8) of the then TC 97/SC 2 of ISO. SC 2 was a subcommittee of TC 97 "Data processing", the technical committee that became, two years later, the joint ISO/IEC technical committee ISO/IEC JTC 1 "Information Technology". SC 2's charter was the development of standards for "character sets", i.e. the way to represent characters. WG 8 "Coding of audio and image information" had been established because in those years the introduction of pictorial information in various teletext and videotex systems, already based on ISO standards for characters, was being considered. The JPEG group (Joint ISO-CCITT Picture Coding Experts Group) was in the process of defining a standard for the coded representation of photographic images.

At the March 1987 meeting the author was favourably impressed by the heterogeneous nature of the group. Unlike the various groups of CEPT (Conférence Européenne des Postes et Télécommunications) in which the author had operated, JPEG was populated by representatives of a wide range of companies: telecommunication operators (BT, DT, KDD, NTT), broadcasting companies (CCETT, IBA), computer manufacturers (IBM, Digital), terminal equipment manufacturers (Mitsubishi), integrated circuit makers (Zoran) etc. Within a few months (January 1988, Copenhagen) the author had convinced Yasuda to establish a group parallel to JPEG, called MPEG (Moving Picture Coding Experts Group), with the mandate of developing standards for the coded representation of moving pictures. The first project concerned coding at a bitrate of about 1.5 Mbit/s for storage and retrieval applications "on digital storage media".

A few months after its establishment the group's mandate was extended with the addition of "and associated audio". The extension was motivated by the consideration that, while for historical reasons audio and video had always been treated, in companies as well as in standards bodies, by organisationally unrelated groups, users demand complete audio-visual solutions. The further addition of "and their combination" is to be interpreted in a similar way: it signals that the coding of the individual audio and video signals must be accompanied by the definition of an infrastructure supporting the two signals and their temporal relations. Later on, the original restriction of coding "for digital storage media", which in its generality technically also covered transmission, was simply removed.


4         The organisation of the MPEG group

The first MPEG meeting, held in Ottawa (10-12 May 1988), was attended by 15 people, most of them onlookers from the parallel JPEG meeting; in its 10 years and 47 meetings the group has grown more than 20-fold, while keeping its working group status.

MPEG participants come from more than 20 National Standards Bodies and represent more than 200 companies covering all industries with a stake in digital audio and video.

Every year 4 to 5 meetings lasting one week are held. An intense rhythm of work pervades the week: often an entire night is spent by some dedicated members getting the documents that record the consensus of the day ready for approval the following morning. This frenetic activity does not stop at the end of the meeting, because several tens of "ad-hoc groups" are established, each with a precise mandate and a chairman. These groups usually work by correspondence only, and some of them can exchange several hundred emails in a few weeks.

As a result of a natural evolution process the organisation is of matrix type, as follows:

Tab. 1 - The organisation of the MPEG group

  Standard-producing groups: Systems, Video, Audio, SNHC, Delivery
  Advisory groups: Requirements, Implementation studies, Test, Liaison
  Outside the technical structure: Head of delegation
There are 5 groups with the task of producing standards and 4 groups with a consultancy function. The Video group produces standards for the coded representation of visual information of natural origin, the Audio group for audio information of natural origin and the SNHC (Synthetic Natural Hybrid Coding) group for audio and video information of synthetic origin.

The Systems group develops standards that create an infrastructure where the coded audio-visual information, of both natural and synthetic origin, can be hosted and used by an application. The Delivery group produces standards that adapt the audio-visual flows created by the other groups to different transport systems.

Before the 5 operational groups can start their work, the Requirements group must define the applications addressed by the standard and derive from them the requirements that the standard must satisfy. In practice, however, the group remains active throughout the development of the standard, because the technical solutions being developed are constantly compared with the requirements. A similar function is carried out by the "Implementation studies" group. This has the task of verifying the implementation complexity, in both hardware and software contexts, of the algorithms being considered and of expressing implementation preferences in case it becomes necessary to make choices. Lastly, the "Test" group organises and carries out quality tests to measure in a quantitative fashion the performance of the standardised algorithms. The Liaison group has been established to manage the intense flow of input and output documents exchanged with other standardisation groups.

The "Head of delegation" group operates outside of the technical structure. Its task is to resolve general organisational matters.


5         The good rules of standardisation

During the development of its standards MPEG has developed a philosophy of its own, whose value cannot be demonstrated a priori but can only be assessed from the value of the results achieved.

MPEG has implemented a radical change of mentality: from a passive attitude of "being in charge of a technology area" to an entrepreneurial one in which standards are seen as the goods that the "MPEG company" sells to its clients, i.e. the industries that consider MPEG their supplier of standardised technology. The development of MPEG standards is therefore governed by the same management tools that companies employ for the development of their products. As for any other company, the goods must be of high quality and conform to the agreed specifications but, foremost, must be delivered by the agreed date. From this the first precept follows: keep the delivery deadline.

Another principle adopted by MPEG, in order to keep its role of technology supplier instead of becoming a mere registration office for technologies developed by others, is a-priori standardisation. By this precept we mean the ability to identify the need for a standard before the industries have a concrete need for it in their services or equipment. The decision is a critical one: a standard developed too far in advance risks being based on technologies that are immature and likely to be replaced by more powerful ones, while a standard developed too late risks arriving when industries have already made commitments in their own developments and are therefore little inclined to abandon their investments.

MPEG's position is to be a supplier of standardised technologies for a multiplicity of industries. For this reason MPEG standards cannot be at the same level as those of most other standards bodies: every industry has in mind products that may use the same technology components but are in competition with one another. MPEG standards, therefore, cannot be at the level of product specifications but at the level of specifications of technology components, called "tools" by MPEG. Tools are then assembled by users according to their needs. From this comes the third precept: "not systems but tools".

A fourth precept, "specify the minimum" (necessary for interoperability), should be the basis of standardisation in general. Instead, the sectorial nature of standards bodies makes their standards look more like product specifications. This is the case of the ITU, whose approach to standardisation seems to still retain the "public service" philosophy of telephony and broadcasting, where "guaranteed quality" used to be a requirement. Even today ITU-T specifies both the encoder and the decoder for audio. The fact that in practice such codecs are used in tandem with other coding systems - and there are several tens of them around the world - and that quality can therefore no longer be guaranteed does not seem to matter.

Options in standards have always been one of the reasons why approved standards were never adopted or encountered difficulties in their adoption. It took years for European ISDN to achieve interoperability because of the diverse options selected by the different operators (and, within the domain of individual operators, the different options implemented by the different suppliers). It took barely 4 months for the newly-established ATM Forum to produce its first specification, simply by slashing the forest of options it had found in the ITU-T recommendations. From these considerations stems the fifth precept: "one functionality - one tool".

"Tool relocatability", the sixth precept, is required by the multi-industry nature of MPEG. Often an industry derives its "raison d'être" from the fact that a certain tool resides at a certain position in the system. This may be the case of encryption in audio-visual systems that use conditional access. If a tool specification enables the use of a tool only at a given point of the system, the industries that operate at other points will automatically become hostile to the standard.

The seventh precept, "verification of the standard", is a direct consequence of a competitive environment in which being "standard" is not necessarily a prerequisite. To be called MPEG may be a quality assurance for knowledgeable customers, but sometimes the salesman's pitch finds naive ears ready to believe that a proprietary product is "twice better than MPEG" (why 2 and not 2.25, and measured how?). The quality of MPEG standards has been carefully verified with long campaigns of subjective tests carried out by highly competent members. Whoever claims that his product is "better than MPEG" has the opportunity to substantiate the claim by providing results obtained with comparable professionalism.


6         The development of an MPEG standard

The process starts from the moment the need for a standard is identified. The group discusses the need for it and the time it will take to develop it. At this point a request is made to ISO to approve a new standard project. The request undergoes two levels of approval by vote of National Bodies: one within the subcommittee (SC 29) and the other within the technical committee (JTC 1).

While the approval formalities progress (they may take a full year to complete), discussions are held within the Requirements group to identify the technical requirements for the new standard and, possibly with the help of the other technical groups, the technologies needed for the standard are identified. At the end of this phase a "Call for Proposals" for technologies is issued. The proposals received are assessed, e.g. for their audio/video quality or for the complexity of the algorithms proposed.

From the analysis of the algorithms retained as promising, the group comes to define a common reference model, a platform describing both the encoder and the decoder. The reference model is used to assess the relative merits of the different technologies being considered. Assessments are made using "core experiments": at least two participants from two different companies have to carry out an experiment under agreed conditions and provide comparable results. The chosen technology is then integrated into the reference model and a new iteration starts. Exchanging bitstreams between encoders and decoders of different members is an important part of the process of consolidating the reference model.

When the reference model has become sufficiently mature, its decoding part is used to draft the "Working Draft". This document is sent, at the planned date and with the title of "Committee Draft", to the SC 29 Secretariat so that it can be balloted. A second ballot follows under the name of "Final Committee Draft" and lastly there is a ballot under the name of "Final Draft International Standard".

At each ballot, with the exception of the last, the group mobilises to consider the comments that National Bodies have made on the draft. In some cases the comments may run to hundreds of pages, which shows the level of commitment of some National Bodies to making the standard technically outstanding.

From this process it is clear that MPEG operates with a unique blend of research effort and standardisation process. In its 10 years of activity several thousand man-years of effort have revolved around it. This is the reason why its standards, in addition to being widely supported by multiple industries, are also technically excellent.


7         Something about MPEG-1, 2, 4 and 7

In the following, some notes on the MPEG standards are given.

7.1        MPEG-1

MPEG-1 "Coding of moving pictures and associated audio for digital storage media at up to about 1,5 Mbit/s" was the first standard produced by MPEG. It is a standard in 5 "parts". Part 1 specifies how to put together multiple audio and video streams encoded according to what is specified in part 2 (Video) and part 3 (Audio). Part 4 specifies procedures to verify that a decoder, or a bitstream at the output of an encoder, does indeed satisfy the requirements of the first 3 parts. Part 5 is a complete software implementation in the C language of an encoder and a decoder.

MPEG-1 was a collection of "firsts":

  1. It was the first integrated audio-visual standard. This was a great achievement if one considers that at that time, in all standardisation bodies and most research institutions, the audio and video competences were scattered and unable to achieve coordination. Typical is the example of the ISDN videophone: the recommendations addressed the multiplexing and the video coding but left the audio coding undetermined.
  2. It was the first standard that defined the receiver and not the transmitter. This looks like an obvious thing to do, because interoperability is assessed from the ability of the receiving party to "understand" the message, not from a prescription of how the message is to be formed by the transmitting party, but this was - and is - not what some industries think it should be. If the way information is encoded is left undefined (within the constraints of the defined syntax, of course), a healthy competition is created among manufacturers to provide better and better encoding equipment. This prolongs the life span of the standard, which becomes obsolete only when a new, much more powerful standard is developed.
  3. It was the first standard able to code the video signal in a native mode independently of the video format (NTSC/PAL/SECAM). This is achieved without any particular effort if one considers that the digital versions of these formats share the same sampling rate and can therefore be produced by the same decoding device.
  4. It was the first standard that was developed jointly by all industries with a stake in the audio-visual business.
  5. It was the first standard that was developed entirely in software and that contains a software implementation of the standard.
  6. It was the first standard for which quality performance was assessed (for the audio part). Tests have been carried out for MPEG-1 Audio Layer II and Layer III, two different complexities of the algorithm, the latter backwards compatible with the former. It has been shown that subjective transparency is achieved at 256 and 192 kbit/s, respectively, when two audio channels are sampled at 48 kHz with 16 bits/sample (compression ratios of 6 and 8, respectively).
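The compression ratios quoted in point 6 follow from the raw PCM bitrate of the source material. A minimal sketch of the arithmetic (the helper name is illustrative):

```python
def compression_ratio(channels, fs_hz, bits, coded_kbps):
    """Ratio of the raw PCM bitrate to the coded bitrate."""
    raw_kbps = channels * fs_hz * bits / 1000
    return raw_kbps / coded_kbps

# Two channels at 48 kHz, 16 bits/sample -> 1536 kbit/s of raw PCM
print(compression_ratio(2, 48000, 16, 256))  # Layer II transparency point -> 6.0
print(compression_ratio(2, 48000, 16, 192))  # Layer III transparency point -> 8.0
```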

Since its approval in November 1992, MPEG-1 has enjoyed a string of successes:

  1. The Video CD stores movies with VHS-quality video and transparent audio. More than 20 million players have already been sold;
  2. MPEG-1 is "the" audio and video format for the Personal Computer. Since Windows 95, all Windows versions have included an MPEG-1 software decoder;
  3. MPEG-1 Audio is widely used on the Web, especially in its MP3 version (formally MPEG-1 Audio Layer III), and for DAB (Digital Audio Broadcasting), adopted in Europe and Canada;
  4. Portable MPEG-1 cameras weighing a few hundred grams are sold.

The only industry not to draw immediate benefits from MPEG-1 was the telecommunication industry. In spite of several telecommunication-related laboratories having working prototypes of 1.5/2 Mbit/s ADSL transceivers, only a few field trials were carried out to test Video on Demand services. One reason was the fact that MPEG was already working on something new and better, although at the cost of four times more bit/s.


7.2        MPEG-2

MPEG-1 was an audio-visual coding standard for a rather restricted application environment. Whatever the definition of digital television, it is clear that more was needed to enable industry to offer digital television services.

  1. Technically, MPEG-1 Video is capable of coding only non-interlaced (so-called "progressive") pictures. But the television programs that billions of people watch every day on their television monitors are interlaced, produced by thousands of television studios and production houses all operating on interlaced 625/50 or 525/59.94 television pictures. No matter how unpleasant a legacy interlaced television is, it is clear that support had to be provided for an effective digital representation of interlaced television.
  2. MPEG-1 Systems provided a way to interface multiplexed digital audio and video with digital physical systems - mostly, but not exclusively, storage - with ideal, i.e. error-free, performance. But the delivery systems used for broadcasting television, once digitised, lacked the ideal performance characteristics assumed by MPEG-1 Systems.
  3. Digital television needed to bring a more complete experience in terms of audio. This meant that multichannel audio support had to be provided.
  4. Part of the MPEG constituency was interested in support for television with some degree of interactivity. Thus an interaction protocol was needed.
  5. A major differentiation of digital television from its analogue counterpart was quality: since every digital copy is perfect, a digital television standard had to support methods to protect content from unauthorised users.

MPEG-2 "Generic Coding of Moving Pictures and associated Audio" is the name of the standard that accommodates all these requirements. Started in July 1990, it was completed in 4.5 years, in November 1994, even though some small enhancements were produced afterwards. The structure of the standard is similar to MPEG-1's, but with new parts added.

Part 1 defines two types of multiplex. The first, called "Program Stream", is similar to the MPEG-1 Systems multiplex. The second, called "Transport Stream", defines a transport layer that interfaces with the delivery systems obtained by digitising analogue ones (cable, satellite, terrestrial network etc.). It also supplies an infrastructure onto which proprietary access control systems can be plugged.
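To give a flavour of the Transport Stream layer, here is a minimal sketch that parses the fixed 4-byte header of a 188-byte Transport Stream packet. The field layout follows the published Systems specification; the function name and the returned structure are purely illustrative:

```python
def parse_ts_header(packet: bytes) -> dict:
    """Parse the 4-byte header of a 188-byte MPEG-2 Transport Stream packet."""
    if len(packet) != 188 or packet[0] != 0x47:   # 0x47 is the sync byte
        raise ValueError("not a valid TS packet")
    b1, b2, b3 = packet[1], packet[2], packet[3]
    return {
        "transport_error":    bool(b1 & 0x80),
        "payload_unit_start": bool(b1 & 0x40),
        "transport_priority": bool(b1 & 0x20),
        "pid":                ((b1 & 0x1F) << 8) | b2,  # 13-bit packet identifier
        "scrambling_control": (b3 >> 6) & 0x03,         # hook for access control
        "adaptation_field":   (b3 >> 4) & 0x03,
        "continuity_counter": b3 & 0x0F,
    }

# A null packet (PID 0x1FFF, payload only) padded to 188 bytes:
pkt = bytes([0x47, 0x1F, 0xFF, 0x10]) + bytes(184)
hdr = parse_ts_header(pkt)
# hdr["pid"] == 0x1FFF, hdr["continuity_counter"] == 0
```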

Part 2 extends MPEG-1 Video by supplying the tools to encode interlaced video. Part 3 extends MPEG-1 Audio from the stereo to the multichannel case while preserving backwards compatibility (i.e. an MPEG-1 Audio decoder is able to extract the stereo part from an MPEG-2 Audio bitstream).

Parts 4 and 5 correspond to those of MPEG-1. Part 6 has the title "Digital Storage Media Command and Control (DSM-CC)" and supplies protocols to control audio-visual streams, to establish audio-visual sessions on heterogeneous networks and to support broadcast carousels.

Part 7 has the title "Advanced Audio Coding (AAC)" and supplies an alternative, non-MPEG-1-compatible way to encode stereo and multichannel audio. AAC yields the same quality as MPEG-1 Audio Layer II at half the bitrate.

Part 9 has the title "Real Time Interface for System Decoders (RTI)" and defines the jitter level that a system decoder can tolerate.

MPEG-2 is a recognised success. In the first 4 years since its approval the audio-visual landscape has been completely changed:

  1. About 20 million decoders (set top boxes) for satellite and cable have been sold;
  2. About 2 million DVD players have been sold;
  3. MPEG-2 terrestrial services have started in the US (HDTV) and UK (conventional TV);
  4. The 4:2:2 profile is used in television studios for high-quality editing.

Profiles were an important innovation introduced by MPEG-2 Video. The standard is actually composed of "tools", i.e. basic technologies each achieving a certain purpose, and a profile is a defined subset of these tools. Several profiles exist: Simple (without frame interpolation), Main (the one currently used) and 3 scalable profiles. Later the 4:2:2 profile was added for high-quality applications. Complementary to profiles are "levels", roughly equivalent to picture resolution (TV/HDTV).
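The profile/level mechanism can be pictured as a table of constraints against which a stream is checked. The sketch below uses the commonly quoted Main Profile level limits; the numbers should be taken as indicative and the function name as illustrative:

```python
# Indicative Main Profile level limits (commonly quoted figures; treat as illustrative).
MAIN_PROFILE_LEVELS = {
    "Low":       {"width": 352,  "height": 288,  "max_kbps": 4_000},
    "Main":      {"width": 720,  "height": 576,  "max_kbps": 15_000},
    "High-1440": {"width": 1440, "height": 1152, "max_kbps": 60_000},
    "High":      {"width": 1920, "height": 1152, "max_kbps": 80_000},
}

def smallest_level(width, height, kbps):
    """Return the least constrained level that accommodates the stream, or None."""
    for name, lim in MAIN_PROFILE_LEVELS.items():
        if width <= lim["width"] and height <= lim["height"] and kbps <= lim["max_kbps"]:
            return name
    return None

# A 720x576 stream at 6 Mbit/s fits Main Profile @ Main Level:
# smallest_level(720, 576, 6000) == "Main"
```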

The acceptance of MPEG-2 by industry is also due to two other reasons. The first is the high quality achieved by MPEG-2 Video: thorough tests have shown that at 6 Mbit/s a quality equivalent to that of composite video (NTSC/PAL/SECAM) in the studio is achieved, and at 9 Mbit/s a quality equivalent to that of component video (Y, R-Y, B-Y) in the studio. The second is the large number of client industries of the standard: no single industry had to shoulder the investment in MPEG-2 technology on its own, as this was made autonomously by the microelectronics industry. What the other industries had to do was develop the "set top box", the box that converts MPEG-2 bitstreams into analogue television signals to feed a monitor.

Digital television provides sharp pictures that are particularly suited to such environments as satellite and cable broadcasting. Because a satellite transponder that used to carry a single television program can now carry, say, six high-quality programs, it has been found that the advertisement-based model of today's television does not really make economic sense. The only meaningful way to exploit MPEG-2 is by offering pay TV services. Therefore MPEG-2 Systems provides two special types of message, Entitlement Control Messages (ECM) and Entitlement Management Messages (EMM), which provide an infrastructure for sending the keys that remote decoders need to descramble the audio-visual payload.

The payload is scrambled with a control word. This is sent, scrambled with a service key, via an ECM message. The control word is changed with a period of the order of 1 second. The service key is scrambled using a key drawn from the users' database and sent via an EMM message. As all users must receive the scrambled service key, this information reaches the intended user with a period of the order of 1 hour. At the receiver the scrambled service key is extracted from the EMM message and converted to clear text using the key stored in a smart card. The service key in turn descrambles the control word, which can eventually be used to descramble the audio-visual payload. This is where MPEG-2 Systems stops: nothing is said about the nature of the keys, the scrambling algorithms etc.
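The three-tier hierarchy described above can be sketched in a few lines. Since MPEG-2 Systems deliberately leaves the scrambling algorithms and key formats unspecified, the sketch below uses a toy XOR cipher purely as a placeholder, and all key names and sizes are illustrative:

```python
import os

def toy_scramble(data: bytes, key: bytes) -> bytes:
    """Placeholder cipher: XOR with a repeating key. The real
    algorithms are outside the scope of MPEG-2 Systems."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

toy_descramble = toy_scramble  # XOR is its own inverse

# --- Head end -------------------------------------------------------
user_key     = os.urandom(8)  # per-subscriber key, e.g. in a smart card
service_key  = os.urandom(8)  # changed with a period of the order of an hour
control_word = os.urandom(8)  # changed with a period of the order of a second

payload = b"audio-visual payload"
scrambled_payload = toy_scramble(payload, control_word)
ecm = toy_scramble(control_word, service_key)  # ECM: control word under service key
emm = toy_scramble(service_key, user_key)      # EMM: service key under user key

# --- Receiver -------------------------------------------------------
sk = toy_descramble(emm, user_key)  # smart card recovers the service key
cw = toy_descramble(ecm, sk)        # which in turn recovers the control word
clear = toy_descramble(scrambled_payload, cw)
print(clear == payload)             # True
```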

7.3        MPEG-4

The last few years have shown that people are keen to have

  1. moving pictures and audio (the television paradigm);
  2. hyperlinks, i.e. activating a relationship between two entities in a virtual world (the web paradigm);
  3. interactions in virtual worlds (the games and chats paradigms).

Indeed, it would be interesting to get additional information about a person who is talking or playing, or to have an artificial character impersonate oneself in a virtual space.

Providing standardised technology offering this sort of integration is the goal that MPEG set for itself in July 1993, when it started MPEG-4 "Coding of audio-visual objects", its third work item.

The design parameters of the MPEG-4 standard can be summarised as follows:

  1. to code video signals very efficiently, particularly at low bitrates, as these will be a basic limitation for quite some time (digitised telephone lines for Internet applications, or mobile channels);
  2. to use an algorithm that codes the video signals in an upward-scalable format, so as to preserve continuity of the coding format when higher bitrates become available, say for stand-alone or broadcasting applications or when broadband reaches the telephone customer;
  3. to code the audio signal - speech and music - over a wide bitrate range, from total transparency for music at bitrates lower than those obtained with AAC, down to very low bitrates for speech;
  4. to provide error resilience embedded in the coding algorithm, particularly for mobile applications, even though it is often possible to decouple the coding method from the delivery system;
  5. to code, particularly for the video signals, the objects - a talking person, a playing champion, a running car etc. - as a separate entity from the rest of a video image;
  6. to code efficiently time-changing 3D objects of a general shape;
  7. to have an efficient representation of some specific time varying 3D objects, such as human face and body;
  8. to accompany a human face with some synthetically generated sound;
  9. to represent realistic synthetic music, beyond what is possible today with MIDI;
  10. to represent audio 3D spaces;
  11. to describe individual objects;
  12. to multiplex individual objects efficiently so that they can be carried by an external transport protocol (e.g. UDP/IP, MPEG-2 TS, H.223, AAL5/ATM, DAB etc.);
  13. to compose the different audio and visual objects (synthetic and natural) in a 3D space for presentation in a 2D (visual) and 3D (audio) space;
  14. to extend such composition methods with programmatic content;
  15. to provide a protocol for content interaction;
  16. to represent a complete application using a file format so that content can be exchanged between an author and a server or client;
  17. to extend the URL format of the Web to encompass real-time audio-visual media and different real-time delivery protocols;
  18. to represent content in a format that is independent of the delivery mechanism;
  19. to define a protocol to specify a level of performance from the delivery mechanism;
  20. to protect content so that only authorised users can have access to it.

This impressive list of requirements has been converted into explicit standardised technologies in 5.5 years of work by some 300 experts from 200 companies in 20 countries, representing the complete range of industries interested in technology convergence. The resulting standard, MPEG-4 or ISO/IEC 14496, is structured in 6 parts whose content closely parallels the parts of MPEG-2.

Seeing this impressive range of technologies one may be tempted to ask: "Very good, but what is the MPEG-4 killer application?" This question is best answered with another question: "What was the last killer application?" If one tries to answer, one will see that Compact Disc Interactive is long dead; what remains is Video CD, simply an alternative to VHS cassettes for distributing video. Digital television is nothing more, in terms of services, than analogue television. Interactive television never took off. The WWW is not an application but a set of technologies that people have variously exploited to provide successful applications. So the answer to the question is: "MPEG-4 is a set of technologies that have been standardised and that people can pick up to implement new applications or refurbish old ones". And there are many candidates.

7.4        MPEG-7

Information that is known to exist but cannot be located is valueless. The establishment of repositories of information of potential interest to people has been one of the tasks of public authorities since early times. In more recent years telephone companies or their affiliates moved from listing their subscribers in alphabetical order ("white pages") to listing some of them sorted according to their business ("yellow pages").

On the web the sheer amount of information available triggered, from the early days, the establishment of "information search facilities". The primitive description capabilities of HTML, however, do not help information indexing, so most web search services limit themselves to sending robots over the web to scan for all words. The words are then archived together with the URL where they appear. A request for the word "computer" on the AltaVista search engine provides the following answer: "AltaVista found about 25,417,987 Web pages for you". Web search services provide different ways to overcome this limitation. In the AltaVista example above, further categories for refined searches are provided: computer games, computer science, computer virus, computer graphics etc. These are automatically extracted from the words that most often accompany the word "computer". Of course more than one word can be used for the search, and if the exact sequence of words appearing in a document is known, the number of target documents is likely to be much reduced.
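The word-archiving approach described above amounts to an inverted index: each word maps to the set of URLs containing it, and a multi-word query intersects those sets, narrowing the result. A minimal sketch; the pages and their contents are invented examples:

```python
from collections import defaultdict

index = defaultdict(set)  # word -> set of URLs containing it

def crawl(url, text):
    """What the 'robot' does: archive every word with its URL."""
    for word in text.lower().split():
        index[word].add(url)

def search(query):
    """Intersect the URL sets of all query words."""
    sets = [index.get(w, set()) for w in query.lower().split()]
    return set.intersection(*sets) if sets else set()

crawl("http://a.example", "computer science department")
crawl("http://b.example", "computer games reviews")
crawl("http://c.example", "cooking recipes")

print(sorted(search("computer")))          # two pages match
print(sorted(search("computer science")))  # an extra word narrows the result
```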

The digital television world has taken a different approach. The Service Information (SI) specification of DVB provides information on “events” that are broadcast on an MPEG-2 multiplex:

Content is classified according to 16 categories, and each category can be further expanded into 16 subcategories.
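Assuming the usual DVB SI encoding, such a 16x16 classification fits in a single byte: a 4-bit first-level nibble for the category and a 4-bit second-level nibble for the subcategory. A sketch of the packing; the example values are arbitrary and do not reproduce the actual DVB content tables:

```python
def pack_content(category: int, subcategory: int) -> int:
    """Pack a category and a subcategory (each 0-15) into one byte:
    category in the high nibble, subcategory in the low nibble."""
    assert 0 <= category <= 15 and 0 <= subcategory <= 15
    return (category << 4) | subcategory

def unpack_content(byte: int):
    """Split the byte back into its two 4-bit nibbles."""
    return byte >> 4, byte & 0x0F

b = pack_content(4, 3)   # e.g. category 4, subcategory 3
print(hex(b))            # 0x43
print(unpack_content(b)) # (4, 3)
```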

MPEG-4 uses "Object Content Information (OCI) descriptors" to convey descriptive information about audio-visual objects. The main content descriptors are: content classification descriptors, keyword descriptors, rating descriptors, language descriptors, textual descriptors, and descriptors about the creation of the content. OCI descriptors can be included directly in the related object descriptor or elementary stream descriptor or, if the information is time-variant, carried in an elementary stream of its own. An OCI stream is organised as a sequence of small, synchronised entities called events that contain a set of OCI descriptors.

The standards and projects described so far only deal with "textual" descriptions of content, but the world of audio and video is more complex. How can I formulate a query which efficiently expresses that I want a picture of "the motorbike from Terminator II", or a sequence where "King Lear congratulates his assistants on the night after the battle", or "twenty minutes of video according to my preferences of today", or that I want a song of which I can only whistle the melody?

This is the goal that MPEG has set for itself with its latest work item, nicknamed MPEG-7, "Multimedia Content Description Interface". This standard is again about information representation, but quite different from previous MPEG standards. One can classify MPEG-1 and MPEG-2 as picture- and audio-sample-based standards, because their goal is simply to provide a bit-efficient representation of the information. It is true that from the coded representation of the audio and video information one can get hints at the content (motion vectors can give information about visual objects moving in the scene, and the frequency content of audio subbands can give information about the type of audio information) but these are by-products, not the objectives of the standards. Similarly one can classify MPEG-4 as an object-based standard, because the systems layer describes a scene in which individual objects have their own spatio-temporal location. About the objects themselves, however, MPEG-4 does not say much more than MPEG-1 and MPEG-2, as the algorithms are again pixel-based.

MPEG-7 can be described as a "semantic-based representation". MPEG-7 will specify a standard set of descriptors that can be used to describe various types of multimedia information. It will also standardise ways to define other descriptors, structures (Description Schemes) for the descriptors and their relationships, and a language to specify description schemes, i.e. a Description Definition Language (DDL). Audio-visual material such as still pictures, graphics, 3D models, audio, speech and video, as well as information about how these elements are combined in a multimedia presentation ("scenarios", composition information etc.), can then be described by means of MPEG-7 and thus indexed and searched.
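As a purely hypothetical illustration of the descriptor idea (MPEG-7 itself will define the actual descriptors and the DDL), even a simple low-level visual descriptor such as a colour histogram already supports similarity queries that plain text cannot express; the archive, file names and bin values below are invented:

```python
from dataclasses import dataclass

@dataclass
class ColourHistogram:
    """Hypothetical low-level descriptor: a normalised 4-bin colour histogram."""
    bins: tuple

    def distance(self, other):
        # L1 distance between histograms: 0 means identical colour content
        return sum(abs(a - b) for a, b in zip(self.bins, other.bins))

# Descriptors stored alongside the archived pictures
archive = {
    "sunset.jpg":    ColourHistogram((0.7, 0.2, 0.05, 0.05)),
    "forest.jpg":    ColourHistogram((0.05, 0.1, 0.8, 0.05)),
    "fireworks.jpg": ColourHistogram((0.5, 0.3, 0.15, 0.05)),
}

# "Find me pictures that look like this one"
query = ColourHistogram((0.65, 0.2, 0.1, 0.05))
best = min(archive, key=lambda name: query.distance(archive[name]))
print(best)  # the sunset-like picture ranks closest
```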


8         Conclusion

MPEG is a success story. It has taken the know-how developed by the different components of the multimedia industry and integrated it into powerful standards that have responded to the needs of the times. MPEG could do this because it was able to convert business requirements into technical specifications for the work, while putting a firm dividing line between the two. It then set about developing its standards by exploiting the enthusiasm of hundreds of researchers from all parts of the world, unencumbered by immediate business considerations.

Today the landscape has changed substantially. What was a laboratory technology yesterday has become products and services deployed by the millions. In response, MPEG has shifted its role to new areas where it can continue to apply its unique recipe for converting research results into standards for the benefit of all its constituent industries.

