Vision, history and future of media convergence with MPEG

Leonardo Chiariglione – Telecom Italia Lab, Italy

Talking of vision, once upon a time there were companies whose people had vision. There is a case, highly relevant to the subject of this talk, concerning a company that, ¾ of a century ago, was funding research for something that it would deploy more than half a century later. If you want to know which company I am talking about, it is the one that, after having spent tens of billions of dollars to buy all sorts of cable TV networks, eventually decided to sell them at a fraction of the price it had paid for them. I suspect that it was less costly and more productive for that company to pay for its researchers' dreams 75 years ago than for its CEO's dreams 5 years ago.

But let's talk about the vision thing. The purpose of the research was to find out whether it was possible to represent time-continuous but bandwidth-limited waveforms with just samples of the waveforms. The conclusion was that it was indeed possible to obtain a fully equivalent representation if the samples were taken at a frequency at least twice the bandwidth of the signal.
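That sampling result can be checked numerically. The sketch below is a toy illustration of my own, not anything from the original research: it samples a band-limited signal well above twice its bandwidth and rebuilds values between the samples with Whittaker-Shannon (sinc) interpolation.

```python
import math

def sinc(x):
    # normalised sinc function used by Whittaker-Shannon interpolation
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def reconstruct(samples, fs, t):
    # rebuild the continuous-time signal value at time t from its samples
    return sum(s * sinc(t * fs - k) for k, s in enumerate(samples))

# a 5 Hz sine: bandwidth 5 Hz, so any sampling rate above 10 Hz suffices
f = lambda t: math.sin(2 * math.pi * 5 * t)
fs = 100                                   # well above the Nyquist rate of 10 Hz
samples = [f(k / fs) for k in range(400)]  # 4 seconds of samples
# reconstruction between sample instants; only a small truncation error
# remains because we use finitely many samples
print(abs(reconstruct(samples, fs, 2.013) - f(2.013)))
```

With an infinite run of samples the reconstruction would be exact; the residual error here comes only from truncating the sinc series to 400 terms.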

This result would have remained a curiosity if a few other things had not happened, such as the use of electronic circuits to make electronic computers to process numbers. Because the vacuum tubes used in the first computers were so unwieldy, the same company invented a new technology to make smaller devices with the same functionality. Research also continued to assess the error introduced by quantising those waveform samples.

Replacing bandwidth-limited analogue waveforms with their digital equivalents, however, did not make much sense at first because the digital equivalent would require something like 4 times the bandwidth (the exact number depends on the type of modulation used to carry the bits). This started the not-so-philosophical competition between those who were only too happy to have more information to carry "because the capacity of the future pipe will be virtually infinite" and those who had to struggle to make ends meet "because my customer wants the bits delivered now".

Both groups had their own share of truth. The first group was right if they meant the core network, the second group was right if they meant the access network. But while the reasons of the first group sparked the long-distance transmission capacity glut that we all know, the reasons of the second group sparked a worldwide research movement. It took about 30 years, but in the end it became possible to reduce the bitrate needed to represent video information by about two orders of magnitude compared with the pure digital representation, and audio (music) information by about one order of magnitude. These are somewhat fuzzy numbers that the simple minds of decision makers in companies like the one we are talking about do not understand, but never mind.
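To give those fuzzy numbers a concrete shape, here is a back-of-the-envelope calculation; the figures below are typical textbook values that I am assuming for illustration, not official ones.

```python
# Rough orders-of-magnitude check (illustrative figures, not exact)
sd_uncompressed = 720 * 576 * 25 * 8 * 2   # ITU-R 601 4:2:2 SDTV: ~166 Mbit/s
mpeg2_video = 4_000_000                    # typical broadcast SDTV rate, bit/s
cd_audio = 44_100 * 16 * 2                 # uncompressed stereo: ~1.41 Mbit/s
mp3_audio = 128_000                        # common MP3 rate, bit/s

print(sd_uncompressed / mpeg2_video)       # roughly 40x for video
print(cd_audio / mp3_audio)                # roughly 11x for audio
```

The video ratio approaches two orders of magnitude, the audio ratio sits at about one, which is consistent with the fuzzy numbers above.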

The company that kick-started all this and the companies that operated in the same business worldwide were major contributors to this "audio and video compression" research movement, but they did not get much benefit from it. Indeed, their bread-and-butter signal was voice, and for this signal the "nearly-infinite bandwidth" group had the upper hand because at that time it was less expensive to put bigger pipes in the long distance than to make the ends of the cables more complex with signal processing devices. In the access portion of the network, however, good analogue technology continued to be more economical and the introduction of digital stalled.

These companies kept on having dreams, though not of the same quality as the first dream recounted in this paper. One of them was to repeat with images the same incredible success they had had 100 years before with speech. If people were ready to pay that much to speak and hear, the argument went, how much would they pay to see? The place where these dreams could first be given shape was the CCITT (now ITU-T), because to these companies communication without standards was meaningless. Recommendations H.120 and H.261 are the names of two standards for audio-visual communication that the CCITT developed and industry implemented. The sad side of the story – and to know how sad you have to believe one of the actors in this play – is that neither went anywhere in the marketplace.

Audio and video was not the prerogative of the telcos, though. Broadcasters were in that business with much higher stakes than the telcos. The technologists in those companies loved digital technologies, but the heads of those companies and the politicians they were dealing with had widely differing opinions. This situation caused a political stalemate in the CCIR (now ITU-R), the standardisation body for these matters.

Digital audio and video was also highly relevant for package media. This was a purely industrial environment accustomed to the introduction of new products. But even package media are a form of communication requiring standards. These, however, were usually developed and adopted in a "hit and run" fashion. In the 1960s Bosch had introduced an audiocassette but Philips’ Compact Cassette (CC) had soon overcome it. In the 1980s RCA had introduced a digital laser disc for audio but Philips-Sony’s Compact Disc (CD) had soon overcome it. This pattern was not continued in the Video Cassette Recorder (VCR) case as Philips’ V2000 was soon overcome by Sony’s Betamax while the latter waged a long-drawn war with JVC’s VHS.

In the second half of the 1980s, I came to the conclusion that digital video, which was fast approaching the asymptote in compression, was ready for exploitation. Having worked for 15 years without prospects of exploiting my digital video communication research, I was not targeting videotelephony but such mass-market applications as storing audio and video on CD for interactive applications, which industry at that time was very excited about. A challenging part of that project was the development of a good video compression algorithm in the form of a standard whose chips the VLSI industry could easily design and manufacture. Strangely, audio was not part of the project. To be frank, at the bottom of my mind there was still the hope that the same VLSI technology could be used to make cheap videophones and thereby kick-start the business of interest to my company.

MPEG-1 was such a standard. It started with the goal of compressing video at 1.5 Mbit/s, roughly the throughput of a CD at 1x (at that time the only speed a CD player could run at), while adding compression of audio (music) with a quality subjectively equivalent to the original uncompressed stereo audio. To make the picture complete, MPEG-1 also had to develop a multiplex so that it was possible to store a multimedia file, composed of multiple video and audio streams, with all the information necessary for its playback.
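The ~1.5 Mbit/s budget can be sanity-checked with simple arithmetic; the video and audio rates below are typical values that I am assuming for illustration.

```python
# Illustrative arithmetic for the ~1.5 Mbit/s MPEG-1 budget (figures approximate)
cd_audio_rate = 44_100 * 16 * 2      # uncompressed CD stereo: 1,411,200 bit/s
cd_1x_data    = 75 * 2048 * 8        # 1x CD-ROM Mode 1 payload: 1,228,800 bit/s
video = 1_150_000                    # typical MPEG-1 constrained-parameters video
audio = 224_000                      # typical MPEG-1 Layer II stereo
print(video + audio)                 # leaves headroom for the multiplex overhead
```

Compressed video plus compressed stereo audio plus multiplex overhead fits in roughly what an uncompressed stereo audio stream alone occupies on a CD.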

There are lights and shadows in MPEG-1. Sure, MPEG-1 triggered the birth of a VLSI industry dedicated to signal processing chips. But the flagship application "interactive video on CD" went nowhere. The second mass market application "storage of audio on CC" also went nowhere. The third mass market application Digital Audio Broadcasting (DAB) is still ongoing thanks to the persistence of some public authorities, but is having a wretched life. Still, MPEG-1 can hardly be described as a failure: it has been used for years as the universal format for digital video on PC, is widely used in Video CD (VCD) and also widely used for audio in digital television.

The lesson to be learnt is that, when developing a standard, one should listen to the (industrial) users of the standard but keep an independent mind. No case illustrates the wisdom of this recipe better than the MP3 case. At the time MPEG-1 was approved, many considered MPEG-1 Audio Layer III (MP3) – the result of a difficult effort to combine two competing technologies, subband coding and the Discrete Cosine Transform (DCT) – an underdog that would never make it because of its complexity. But this is not how things evolved. MP3 has become what it is because of the concurrence of several elements, not all of them technical: the first available solution to a long-debated problem, superior performance compared to other solutions, availability of encoder and decoder reference software, and licensing terms that many consider smart because they encourage the use of the standard. Those who think they know all the elements that make the success of a technology should think twice or, better, thrice.

The second MPEG project – MPEG-2 – was driven by the desire of some, against the background of the silent opposition of others, to have a standard for digital television. This was the case in which multiple industries from many countries, with different backgrounds and conflicting agendas, had to put up with the fact that MPEG, with the momentum it had acquired with MPEG-1, would be the place where the technology for the hot stuff called digital television was developed.

It was by no means an easy ride. Oversimplifying the case to make it manageable, one could say:

  1. In the USA, terrestrial broadcasters needed a standard for filling 6 MHz of VHF/UHF spectrum with HDTV and a technical solution to carry MPEG-2 streams on digitised channels;
  2. In Europe, terrestrial broadcasters, blessed by a wider 7-8 MHz channel spacing, were prepared to accept a standard that could accommodate scalability, i.e. to allow for HDTV from which standard definition TV (SDTV) could be extracted;
  3. In the USA CATV operators needed a standard for filling 6 MHz of cable spectrum with as many digital SDTV programs as possible with the ability to set up channels across multiple networks and a technical solution to carry MPEG-2 streams on digitised channels;
  4. A new generation of global satellite pay-TV operators needed a standard for filling a satellite transponder with as many digital SDTV programs as possible with support to Conditional Access (CA);
  5. Telcos were on the sidelines thinking that MPEG-2 could be what they needed to justify the plans of some of them to bring fibers to the home, but would have liked support for ATM transport and interactivity;
  6. Consumer Electronics (CE) companies feigned lack of interest in storage applications but nonetheless took care that a disc-friendly multiplex was defined.

The result was definitely not bad. Eventually it turned out that MPEG-2 Video offered component quality at 9 Mbit/s and composite quality at 6 Mbit/s, and the results are much better than that today. Even a scalable solution was provided, with the performance granted by the state of the technology of that time, to make European broadcasters happy. MPEG-2 Systems offered a complete transport solution called Transport Stream (TS), satisfying the captured requirements of the affected industries, and a storage-friendly solution called Program Stream (PS). Lastly, DSM-CC provided the signalling and session set-up functionalities for heterogeneous networks and additional services.

There are also a few regrets worth mentioning. MPEG-2 decided to target multichannel audio that provided backward compatibility with MPEG-1 Audio. This was a reasonable decision because many broadcasters planned to start with stereo sound, achievable with MPEG-1 Audio. To support backward compatibility, however, a significant quality cost compared to state-of-the-art non-compatible solutions had to be paid. Eventually an unconstrained solution was developed as part 7 of MPEG-2 under the name "Advanced Audio Coding" (AAC). But in spite of the excellent performance of the new standard, adoption of AAC remained spotty as most operators opted for MPEG-1 Audio or other solutions.

Another regret was the failure to bring together all the industries that had an interest in a "transport solution for real-time media". Those in need of a TS solution and those in need of a PS solution each went away with their part of the booty. Granted, reconciling the conflicting requirements would have been challenging, but now the two solutions are basically incompatible. The same can be said of the telcos, who looked at the TS/PS debate from a distance without even trying to join the discussion, lost as they were in their AAL1/AAL2 vs. AAL5 dispute. The really sad thing is that MPEG had enlightened people who could have provided a unifying solution, one that could have forestalled the truly awkward solution that is used today for real-time media on the network.

The lessons to be learnt from the MPEG-2 story are manifold. One is that going to digital TV (actually, even HDTV) from the low end was the right thing to do. MPEG-1 was generally considered to have a business value in itself – and indeed it turned out to be exactly that, even though not in the way it had been imagined – but most people assumed that MPEG-2 would be the real thing. So MPEG-1 provided the opportunity to tune the machine both technically and organisationally for the big challenge. The second lesson was that MPEG provided a couple of cases of clear economic advantage from the use of digital media technologies. These advantages provided the hard core of adopters of the standard, and the benefits of an inexpensive standard solution propagated to other businesses as well. The third lesson concerns the successful development of licensing. MPEG LA succeeded in bringing together most of the many patent rights holders and provided an almost single-stop licensing point with conditions that were familiar to the industrial users of the standard at that time.

The motivations for the third project – MPEG-4 – were not as straightforward as for the first two. Basically, the MPEG-1 and MPEG-2 constituencies had obtained all they needed for their business. Lower bitrates were being dealt with by the ITU-T (the former CCITT), which was working on an improvement of its H.261 recommendation called H.263. Why another standard, then?

It is easy to answer this question now because today we see around us all sorts of interactive audio and video applications on the Web and on mobile and personal devices, which some call "rich media". These are examples of the goals of the MPEG-4 project that the doubters of that time kept asking about.

MPEG-4 was not an easy project. On the one hand there was the traditional MPEG constituency, which feared that a new compression standard would undermine the business it was building with MPEG-2. To assuage these fears, the original work plan, which targeted completion of the standard in 1997, was stretched by a year. Then there was the telco community (or at least part of it), which feared that the MPEG machine would steamroll its H.263 recommendation. Fortunately MPEG could still boast its biggest asset, its membership, which was being reinforced by a new generation of participants. With their enthusiasm, competence and dedication it was possible to develop what can rightly be called the multimedia standard, with rich features, the most important of which are:

  1. Very low bitrate audio and video compression. "Very low bitrate" was the original target of the standard, but the bitrate was later extended to cover the high bitrate range, so that video coding is possible starting from 5 kbit/s and going up to as much as 1 Gbit/s. Audio coding (audio in MPEG-4 means both speech and music) is possible starting from 2 kbit/s and going up to 64 kbit/s per channel.
  2. Support to new features like object coding. The visual part of the standard can code an image that is not a rectangular array of pixels and provides a fine granularity scalability (FGS) solution. The audio part of the standard supports such features as error robustness, low delay, fine granularity scalability etc.
  3. Compression of synthetic objects. MPEG-4 provides very efficient means to reduce the size of 2D and 3D synthetic data, both static and time varying.
  4. Animation of special classes of synthetic objects. MPEG-4 provides a solution to animate synthetic faces and bodies that requires very few kbit/s.
  5. Structured Audio. MPEG-4 defines the means to generate sound based on several kinds of ‘structured’ inputs.
  6. Means to compose objects in a space. MPEG-4 defines the means to compose a scene made up of time-varying objects using a technology called Binary Format for Scenes (BIFS), an extension of the Virtual Reality Modeling Language (VRML). Additionally, the Extensible MPEG-4 Textual (XMT) format provides a bridge between the powerful BIFS composition technology and other simpler formats.
  7. Possibility to interface with any network and protocol. MPEG-4 gives the means to abstract from the type of network (local, broadcast and interactive) and underlying transport protocol (HTTP, RTP, MPEG-2 TS etc.)
  8. General file format. The MPEG-4 File Format provides a powerful tool to process multimedia files for the purpose of editing, file exchange and streaming.
  9. Tools interaction protocol for Intellectual Property Management and Protection (IPMP). IPMP-X (X for extension) lets the tools used to protect a piece of content communicate with a terminal that needs to process (e.g. decrypt, decode, present) the content.

Many have ventured to make their opinions known about MPEG-4, and this paper would miss part of its target if it did not provide a view on some of the arguments that have been made. The first argument is that MPEG-4 was not driven by industry. If I wanted to make a formal point, I could say that people just have to look at the size of industry representation during the years MPEG-4 was developed to see how hollow the claim is. The substance of this comment, however, needs consideration. Making MPEG-2 a success was kind of easy – I say this hoping that the people who worked on it are not upset by my words – because the target of the standard was to provide clearly identified economic benefits by replacing the internals of existing analogue television with digital technology while leaving the externals unchanged. On the other hand, making MPEG-1 a success was not at all easy, and its three major applications, targeted with so much industry support – viz. interactive video from CD, music on CC and DAB – turned out, as recalled above, to be disappointments at best. Industry can consider itself lucky that this "non industry-driven standard" was developed, or today it would find itself without any standard solutions for the manifold application areas where MPEG-4 is currently being used.

The second remark that people like to make is that the interactivity made possible by shape coding and BIFS is not widely used. This remark is not baseless, but I would like to ask: what successful examples of media interactivity are there? VRML is going nowhere and MHP is nowhere to be seen. Interactivity on DVD is an example of successful deployment of an interaction technology, but I would be curious to know how many people actually use it beyond choosing the language and pressing the play button. The problem is that interactivity is a difficult beast, and until the value of interactivity becomes part of the user mindset, not much is going to happen. These observations notwithstanding, there are nice examples of commercially available BIFS-enabled platforms. I think these should be carefully looked at, not so much because they use MPEG-4 technology, but because they provide the opportunity to experiment and to answer some of the basic questions about media interactivity.

I said before that the availability of MPEG-2 licensing was a major reason for its market success, and as early as the beginning of 1997 some industrial participants started discussing how the MPEG-2 success could be repeated through the timely availability of licensing for the MPEG-4 standard. Six years later it is time to have a look at the process that led to the licensing terms recently published for Systems and Visual, and indeed I happen to have some comments to make. The first is that working out licensing terms for a standard that could be used in such diverse cases as mobile and CE devices, for personal and streaming applications, in hardware- and software-based solutions was a daunting task, and commending words must not be spared for those who worked to make a licensing scheme possible. The actual terms of the license, however, are another story. I am not sure the licensing terms have the right balance that some people think the MP3 license has. A superficial observation is that charging for both receivers and content makes both sides unhappy. Another is the perception that the drive for "licensing fees now and from anything" prevailed over the creation of a business out of which much bigger revenues for the licensors could have been obtained.

With the full range of bitrates covered by the three standards it could have been expected that MPEG would follow the advice of an interested party: to take a sabbatical or, better, to simply disband. I am saying "interested" because there are too many cases of people claiming: "my algorithm performs 3 times better than MPEG" (typical claim of a conference paper) or "my algorithm performs 5 times better than MPEG" (typical claim of a naïve salesman) or "my algorithm performs 10 times better than MPEG" (typical claim of a confessed liar). As a corollary, I could myself have followed another piece of advice from the same interested party and looked after my vineyard at the foot of the Alps. But it was not so.

The fourth MPEG project – MPEG-7 – was driven by the idea that, with the MPEG-1, MPEG-2 and MPEG-4 standards, industry had all the tools it needed to make business out of content. But the more content becomes plentiful and interactive, the more the naïve approach to content provisioning and fruition exemplified by the sentence "what's on TV tonight?" becomes unworkable.

The solution provided by MPEG-7 is based on two types of "Description Tools" (DTs). Descriptors (Ds) are designed primarily to describe low-level audio or visual features such as color, texture, motion, audio energy etc., as well as attributes of AV content such as location, time, quality etc. Description Schemes (DSs) are designed primarily to describe higher-level AV features such as regions, segments, objects and events, and other static metadata related to creation, production, usage etc. DSs can produce more complex descriptions by integrating multiple Ds and DSs, and by declaring relationships among the description components.

MPEG-7 also comprises the Description Definition Language (DDL), a language used for the syntactic definition of DTs and for allowing their extension. There is also the Binary Format for MPEG-7 (BiM), which provides an efficient coding of descriptions – an important tool because DTs, which are represented in very verbose XML, can be represented much more compactly using BiM.
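The benefit of a compact binary form over verbose XML can be illustrated with a toy experiment. The element names below are invented for illustration (they are not the actual MPEG-7 schema), and generic zlib compression merely stands in for BiM's schema-aware encoding, which works quite differently.

```python
import xml.etree.ElementTree as ET
import zlib

# Toy MPEG-7-style description; element names are illustrative only
root = ET.Element("Mpeg7")
for i in range(20):
    seg = ET.SubElement(root, "VideoSegment", id=f"seg{i}")
    ET.SubElement(seg, "MediaTime", start=f"PT{i * 10}S", duration="PT10S")
    ET.SubElement(seg, "DominantColor", value="128 64 32")

textual = ET.tostring(root)          # the verbose textual (XML) form
binary = zlib.compress(textual)      # generic compression, a stand-in for BiM
print(len(textual), len(binary))     # the binary form is far more compact
```

The highly repetitive tag structure is exactly what a schema-aware encoder like BiM exploits: once the schema is known, most of the textual redundancy need not be transmitted at all.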

New MPEG standards have always presented cultural challenges. MPEG-1 provided the first opportunity for people from different industry backgrounds to come together and develop an audio-visual coding standard. MPEG-2 saw a massive intake of people from the television world, both from broadcasting and CE, who successfully worked together. At the time of MPEG-4 more people from new industry segments, particularly with a computer science background, came to MPEG to develop the multimedia standard.

No challenge, however, was as big as the one presented by MPEG-7. Indeed the gradual shift of the membership from "engineering" to "computer science", already witnessed in MPEG-4 times, accelerated. In the new "MPEG-7 community" that gradually built up, established MPEG concepts were called into question, e.g. the difference between what is needed in an algorithm, in a standard and in an implementation, the nature of the interfaces the MPEG-7 standard makes reference to, what an MPEG-7 "encoder" is, etc. Even today the role of the textual versus the binary representation in MPEG is not fully resolved.

MPEG-7 is an important standard because it provides truly generic metadata. Unfortunately it has to contend with tens of other metadata standards, all developed by specific communities with specific target applications in mind. But convergence is the negation of industry-specific metadata, and indeed some good results are being achieved. TV Anytime is making use of MPEG-7 technologies, and 3GPP is adopting some simple static MPEG-7 metadata for addition to the MP4 file format. Harmonisation efforts with other metadata standards, such as MXF, are also under way.

For two years, starting in 1999, I was involved in an initiative that was a great – although difficult – personal experience: the Secure Digital Music Initiative (SDMI). That experience taught me a lot and made me realise the extent of the complexities that are created when media become digital and established roles in the value chain are subverted.

The result has been the ISO standardisation project called MPEG-21. The goal of the MPEG-21 project is to enable the electronic commerce of Digital Items (DIs), these being the units of transaction between Users. An example of a DI is a music compilation, complete with MP3 files, metadata, all sorts of related links etc. By "User" I mean all the entities that act on the value network, i.e. creators, market players, regulators and consumers.

Even more than for other standards, MPEG-21 is a toolkit standard. The first tool is a standard way of defining DIs in terms of components and structure (i.e. resources and metadata). This standard is called "Digital Item Declaration" (DID).

For each transaction we need a means to identify the object of the transaction. That is why we need a standard to uniquely identify DIs. Called "Digital Item Identification" (DII), this standard plays very much the same role as ISBN does for books and ISSN for periodicals.

Getting an identifier for a DI is important, but how are we going to put a "sticker" on it? This is where Persistent Association Technologies come in. SDMI struggled with the selection of very advanced "Phase I" and "Phase II" screening technologies, and its task was made harder by the fact that no established methods exist today to assess the performance of these technologies. That is why we are developing another part of the standard called "Evaluation Methods for Persistent Association Technologies". This is not meant to be a "prescriptive" (normative) standard but rather a "best practice" for those who need to assess the performance of watermarking and similar technologies.
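A feel for what "persistent association" means, and for what an evaluation method has to measure, can be given with the crudest possible watermark: hiding bits in the least significant bits of audio samples. This toy is my own, for illustration only; real persistent association technologies are vastly more robust to processing and attacks, which is precisely why assessing them is hard.

```python
def embed(samples, bits):
    # toy LSB watermark: hide one bit in the least significant bit of each sample
    return [(s & ~1) | b for s, b in zip(samples, bits)]

def extract(samples, n):
    # recover the first n hidden bits
    return [s & 1 for s in samples[:n]]

audio = [104, 7, 250, 33, 90, 181, 64, 12]   # pretend 8-bit PCM samples
mark = [1, 0, 1, 1, 0, 1, 0, 0]              # the identifier bits to associate
marked = embed(audio, mark)
print(extract(marked, len(mark)))            # the embedded bits come back out
```

An evaluation method of the kind discussed above would measure exactly the two quantities this toy makes visible: how little the carrier signal is perturbed (here, at most one quantisation step per sample) and whether the mark survives subsequent processing (here, it would not survive even mild re-encoding, which is why real techniques are far more sophisticated).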

The next element is the reference architecture of Intellectual Property Management and Protection (IPMP) to manage and protect DIs. This part is still under development.

Already in the physical world we seldom have absolute rights to an object. In the virtual world, where the disembodiment of content from carriage augments the flexibility with which business can be conducted, this trend is likely to continue. That is why we have a "Rights Expression Language" (REL) so that rights about a digital item can be expressed in a way that can be interpreted by a computer.

A right exists to perform actions on something. Today we use such verbs as "display", "print", "copy" or "store" and, in a given context, we humans know what we mean. But computers do not know and must be taught the meaning. That is why we need a "Rights Data Dictionary" (RDD) that gives the precise semantics of all the verbs used in the REL.

Information and Communication Technologies (ICT) let people do more than just find new ways of doing old business. Content and service providers used to know their customers very well. They used to know – even control – the means through which their content was delivered. Consumers used to know the meaning of well-classified services such as television, movies and music. Today we have fewer and fewer such certainties: end users are less and less predictable, and the same piece of content can reach them through a variety of delivery systems and be enjoyed on a plethora of widely differing consumption devices. How can we cope with this unpredictability of end-user features, delivery systems and consumption devices? This is where "Digital Item Adaptation" (DIA) comes to help, providing the means to describe how a DI should be adapted (i.e. transformed) so that it best matches the specific features of the User, the Network and the Device.

I will complete the current list of basic technologies by mentioning "Event Reporting" (ER), whose purpose is to provide metrics and interfaces for reporting all reportable events, "File Format" (FF), which provides a standard way to store and transmit DIs, and "Digital Item Processing" (DIP), whose purpose is to provide the means to "play" a DI.

I am – slowly – coming to the conclusion of my talk. What remains for me to talk about is MPEG's plans for the future. It is clear that, after 15 years of intense work, digital audio and video is an industry deeply rooted in the market, if not outright mature. An analysis that I have carried out with the help of the Chairs and Heads of Delegation in MPEG leads to the conclusion that existing MPEG standards are likely to provide standard solutions for most of the technology needs of the industry.

Should we then disband MPEG or, maybe downsize it just with the task of managing the maintenance of its standards? There are a number of reasons why I think this should not happen for the foreseeable future:

  1. The first reason is, as I said, the large body of standards that need maintenance, either in the form of corrigenda (very few indeed, if one considers the number of standards) or in the form of amendments to keep up with the demands from users of the standards;
  2. The second reason is the need to bring to conclusion the huge MPEG-21 project;
  3. The third reason is given by the recent cases of the MPEG-4 Advanced Video Coding (AVC) standard, the AAC Bandwidth Extension standard and, probably, the new scalable video coding project. These are proof that the technologies underpinning the digital media world are far from having exhausted their capacity for innovation. Even though I tend to be more conservative than most when it comes to the business value of "improvements" in the standards, it is clear that it is important to have a standards group with a clear brand that people can make reference to for their digital media related needs.
  4. The fourth reason is the demonstrated inability of some industrial fora to overcome their internal differences and provide solutions for their constituencies. This may be the case of the Multimedia Middleware API standard that MPEG has recently started.
  5. Lastly there are some fields like 3D Audio and Video that are a natural extension of the traditional MPEG area of competence for which technology may be ready to provide standards that industry may need.

The conclusion of my talk is that MPEG is the result of the work of literally thousands of people who have donated their companies' time, and very often their own personal time, to create the multimedia world that we all now enjoy. They have created MPEG; I have had the honour of presenting the result of their work.