Riding the Media Bits  chiariglione.org

Inside MPEG-7



 Last update: 2005/03/08

 

An overview of the technical content of MPEG-7.

 

Before starting, the reader should be warned that MPEG-7 is more abstract than previous MPEG standards, which may make this page harder to read than most of the others. With this warning, let's start from some definitions that will help understanding.

Element: Definition
Data: The audio-visual information to be described using MPEG-7.
Examples: MPEG-4 elementary streams, Audio CDs containing music, hard disks containing MP3 files, synthetically generated pictures or drawings on a piece of paper.
Feature: A distinctive characteristic of a Data item that means something to somebody.
Examples: the colour of a picture, the particular rhythm of a piece of music, the camera movement in a video or the cast of a movie.
Descriptor: A representation of a Feature. It defines the syntax and semantics of the representation of the Feature. Different Descriptors may very well represent the same Feature.
Examples: the Feature "colour" can be represented as a histogram or as a frequency spectrum.
Descriptor Value: An instantiation of a Descriptor for a given Data set. Descriptor Values are combined through the Description Scheme mechanism to form a Description.
Description Scheme: The specification of the structure and semantics of the relationships among its components. These can be Descriptors or, recursively, Description Schemes. The distinction between a DS and a D is that a D contains only basic data types and does not make reference to any other D (or, obviously, to any DS).
Examples: a movie that is temporally structured in scenes, with textual descriptions at scene level and some audio descriptors of dialogues and background music.

In the following, Ds and DSs are collectively called Description Tools (DT).
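The distinction between a D and a DS given in the definitions above can be pictured as a small data model. This is only a sketch: the class names `Descriptor` and `DescriptionScheme` and the sample movie are illustrative assumptions, not structures defined by MPEG-7.

```python
from dataclasses import dataclass, field
from typing import List, Union

# Hypothetical names: none of these classes come from the MPEG-7 standard.
BasicValue = Union[int, float, str, List[float]]


@dataclass
class Descriptor:
    """A D: represents one Feature using only basic data types."""
    feature: str
    value: BasicValue


@dataclass
class DescriptionScheme:
    """A DS: structures Ds and, recursively, other DSs."""
    name: str
    components: List[Union[Descriptor, "DescriptionScheme"]] = field(default_factory=list)

    def descriptors(self):
        """Yield every D reachable from this DS, depth first."""
        for component in self.components:
            if isinstance(component, Descriptor):
                yield component
            else:
                yield from component.descriptors()


# A movie structured in scenes, as in the Description Scheme example.
scene = DescriptionScheme("Scene 1", [
    Descriptor("text annotation", "Opening dialogue"),
    Descriptor("background music tempo (bpm)", 96),
])
movie = DescriptionScheme("Movie", [Descriptor("title", "Example"), scene])
features = [d.feature for d in movie.descriptors()]
```

A D holds a value and nothing else, while a DS holds a list whose items may again be DSs, which is the recursion mentioned in the definition.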

The figure below represents the main elements making up the MPEG-7 standard.

The main MPEG-7 elements

MPEG-7 provides a wide range of low-level descriptors. MPEG-7 Visual DTs consist of basic structures and Ds that cover the following basic visual features: Color, Texture, Shape, Motion and Localisation.

The Color feature has multiple Ds. Some of them are: 

  • Color Quantization, a Descriptor to express colour histograms keeping the flexibility of linear and non-linear quantisation and look-up tables; 

  • Dominant Color(s), a D to represent features where a small number of colours suffice to characterize the color information in the region of interest; 

  • Scalable Color, a Descriptor that is useful for image-to-image matching and retrieval based on colour feature; 

  • Color Structure, a D that captures both colour content (similar to a color histogram) and information about the structure of this content. Its main functionality is image-to-image matching, so its intended use is still-image retrieval; 

  • Color Layout, a D that specifies the spatial distribution of colors that can be used for image-to-image matching and video-clip-to-video-clip matching or for layout-based retrieval for color, such as sketch-to-image matching. 
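To give a feeling for what image-to-image matching on a colour feature involves, here is a minimal sketch of a quantised colour histogram compared with an L1 distance. It is not the normative extraction or matching procedure of any of the Ds listed above; the function names, the uniform RGB quantisation and the tiny "images" (plain lists of RGB tuples) are all assumptions made for illustration.

```python
def colour_histogram(pixels, levels=4):
    """Normalised histogram over levels**3 uniformly quantised RGB bins."""
    hist = [0.0] * (levels ** 3)
    for r, g, b in pixels:
        # Map each 8-bit channel to one of `levels` uniform bins.
        qr, qg, qb = (c * levels // 256 for c in (r, g, b))
        hist[(qr * levels + qg) * levels + qb] += 1
    total = len(pixels)
    return [h / total for h in hist]


def l1_distance(h1, h2):
    """Sum of absolute bin differences; 0 means identical histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


# Three toy "images" given as flat lists of (R, G, B) pixels.
red_image = [(250, 10, 10)] * 90 + [(10, 10, 250)] * 10
reddish_image = [(240, 20, 5)] * 80 + [(10, 10, 250)] * 20
blue_image = [(5, 5, 240)] * 100

query = colour_histogram(red_image)
d_reddish = l1_distance(query, colour_histogram(reddish_image))
d_blue = l1_distance(query, colour_histogram(blue_image))
```

With these inputs the mostly red query is much closer to the reddish image than to the blue one, which is the behaviour a retrieval application would rely on.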

The Texture feature has 3 Ds. 

  • Homogeneous texture is a D that is used for searching and browsing through large collections of similar looking patterns. An image can be considered as a mosaic of homogeneous textures so that these texture features associated with the regions can be used to index the image data. Agricultural areas and vegetation patches are examples of homogeneous textures commonly found in aerial and satellite imagery. 

  • Texture Browsing is a D that provides a perceptual characterization of texture, similar to a human characterization, in terms of regularity, coarseness and directionality. The computation of this descriptor proceeds similarly to the Homogeneous Texture D. First, the image is filtered with a bank of special filters. From the filtered outputs, two dominant texture orientations are identified. Then the regularity and coarseness are determined by analysing the filtered image projections along the dominant orientations. 

  • Edge histogram is a D that represents the spatial distribution of five types of edges: four directional edges and one non-directional edge. Edge histogram can retrieve images with similar semantic meaning, since edges play an important role for image perception. 
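The five edge types of the Edge histogram D can be illustrated by classifying small image blocks. The sketch below works on 2x2 blocks of mean intensities and counts edge types for a single region. The filter coefficients are those usually quoted for this descriptor, reproduced here from memory, and the threshold and the naming of the two diagonals are assumptions; check all of them against the standard before relying on them.

```python
import math

# 2x2 filter coefficients as commonly quoted for this descriptor (assumption).
# Order of entries: top-left, top-right, bottom-left, bottom-right.
S2 = math.sqrt(2)
EDGE_FILTERS = {
    "vertical":        (1, -1, 1, -1),
    "horizontal":      (1, 1, -1, -1),
    "diagonal_45":     (S2, 0, 0, -S2),
    "diagonal_135":    (0, S2, -S2, 0),
    "non_directional": (2, -2, -2, 2),
}


def classify_block(block, threshold=11):
    """Return the edge type of a 2x2 block of mean intensities, or None."""
    strengths = {
        name: abs(sum(c * v for c, v in zip(coeffs, block)))
        for name, coeffs in EDGE_FILTERS.items()
    }
    best = max(strengths, key=strengths.get)
    # Blocks whose strongest response is below the threshold carry no edge.
    return best if strengths[best] >= threshold else None


def edge_histogram(blocks):
    """Count how many blocks fall into each of the five edge types."""
    counts = dict.fromkeys(EDGE_FILTERS, 0)
    for block in blocks:
        kind = classify_block(block)
        if kind is not None:
            counts[kind] += 1
    return counts


blocks = [
    (0, 100, 0, 100),    # dark left, bright right
    (100, 100, 0, 0),    # bright top, dark bottom
    (50, 50, 50, 50),    # flat block, no edge
    (0, 100, 0, 100),
]
histogram = edge_histogram(blocks)
```

The sketch leaves out the spatial part of the descriptor, that is, how the image is partitioned into regions that each get their own histogram.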

The Shape feature has 4 Ds. Region-Based Shape is a D that can describe any shape. This is a complex task because the shape of an object may consist of a single region or of several disjoint regions, and may contain holes. 

The Motion feature has 4 Ds: camera motion, object motion trajectory, parametric object motion, and motion activity. 

  • Camera Motion is a D that characterises motion parameters of a camera in a 3-D space. This motion parameter information can be extracted automatically or generated by capture devices. 

  • Motion Trajectory is a D of the motion trajectory of an object, defined as the localisation, in time and space, of one representative point of this object. In surveillance, alarms can be triggered if some object has a trajectory identified as dangerous (e.g. passing through a forbidden area, being unusually quick, etc.). In sports, specific actions (e.g. tennis rallies taking place at the net) can be recognized. 
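The surveillance example for Motion Trajectory can be sketched as follows. The trajectory is assumed to be a list of (t, x, y) key points with strictly increasing times, and positions between key points are linearly interpolated; this representation, the function names and the rectangular "forbidden area" are illustrative assumptions, not the syntax of the D.

```python
import math


def position_at(trajectory, t):
    """Linearly interpolate (x, y) at time t from sorted (t, x, y) key points."""
    for (t0, x0, y0), (t1, x1, y1) in zip(trajectory, trajectory[1:]):
        if t0 <= t <= t1:
            a = (t - t0) / (t1 - t0)
            return (x0 + a * (x1 - x0), y0 + a * (y1 - y0))
    raise ValueError("t lies outside the trajectory")


def max_speed(trajectory):
    """Largest average speed over any segment between two key points."""
    return max(
        math.hypot(x1 - x0, y1 - y0) / (t1 - t0)
        for (t0, x0, y0), (t1, x1, y1) in zip(trajectory, trajectory[1:])
    )


def enters_area(trajectory, area, step=0.1):
    """True if any sampled position falls inside area = (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = area
    t, t_end = trajectory[0][0], trajectory[-1][0]
    while t <= t_end:
        x, y = position_at(trajectory, t)
        if xmin <= x <= xmax and ymin <= y <= ymax:
            return True
        t += step
    return False


trajectory = [(0.0, 0.0, 0.0), (2.0, 4.0, 0.0), (3.0, 4.0, 6.0)]
forbidden = (3.5, 2.0, 5.0, 4.0)
alarm = enters_area(trajectory, forbidden) or max_speed(trajectory) > 10.0
```

Because the area test samples the trajectory at fixed time steps, a very small area or a very fast object could be missed; a real application would intersect the segments with the area instead.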

MPEG-7 Audio Tools. There are seventeen low-level Audio temporal and spectral Ds that may be used in a variety of applications. While low-level audio Ds in general can serve many conceivable applications, the Spectral Flatness D specifically supports the functionality of robust matching of audio signals. Applications include audio fingerprinting, identification of audio based on a database of known works and, thus, locating metadata for legacy audio content without metadata annotation. 
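Spectral flatness is commonly defined as the ratio between the geometric mean and the arithmetic mean of the power spectrum coefficients, so that a noise-like spectrum gives a value near 1 and a tone-like spectrum a value near 0. The sketch below computes that ratio for one list of power values. How the standard splits the spectrum into bands and frames is not reproduced here, and the function name and sample spectra are assumptions.

```python
import math


def spectral_flatness(power):
    """Geometric mean over arithmetic mean of power-spectrum coefficients."""
    if not power or any(p <= 0 for p in power):
        raise ValueError("expects a non-empty list of positive power values")
    # The geometric mean is computed in the log domain to avoid overflow.
    log_mean = sum(math.log(p) for p in power) / len(power)
    arithmetic_mean = sum(power) / len(power)
    return math.exp(log_mean) / arithmetic_mean


noise_like = [1.0] * 8            # equal power in every coefficient
tone_like = [100.0] + [0.01] * 7  # almost all power in one coefficient

flat_noise = spectral_flatness(noise_like)
flat_tone = spectral_flatness(tone_like)
```

A fingerprinting application would typically compare sequences of such values, computed over successive frames, rather than a single number.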

Four sets of audio DTs - roughly representing application areas - are integrated in the standard: sound recognition, musical instrument timbre, spoken content, and melodic contour. Timbre is defined as the perceptual features that make two sounds with the same pitch and loudness sound different.

  • Musical Instrument Timbre is a D that aims at describing perceptual features of instrument sounds. The aim of the Timbre D is to describe the perceptual features with a reduced set of Ds that relate to notions such as "attack", "brightness" or "richness" of a sound. 

  • Sound Recognition is a set of Ds for indexing and categorisation of general sounds, with immediate application to sound effects. 

  • Spoken Content is a D that allows detailed description of words spoken within an audio stream. This trades compactness for robustness of search, because current Automatic Speech Recognition (ASR) technologies have their limits, and one will always encounter out-of-vocabulary utterances. To accomplish this, the tools represent the output and what might normally be seen as intermediate ASR results. The tools can be used for two broad classes of retrieval scenario: indexing into and retrieval of an audio stream, and indexing of multimedia objects annotated with speech. 

One can easily see, from this cursory presentation, the range of tools that are being offered to content owners and to application developers. The MPEG-7 Ds are designed for describing a wide range of information types: low-level audio-visual features such as color, texture, motion, audio energy, and so forth, as illustrated above; high-level features of semantic objects, events and abstract concepts; content management processes; information about the storage media; and so forth. It is expected that most Ds corresponding to low-level features will be extracted automatically, whereas more human intervention is likely to be required for producing higher-level Ds. 

The MPEG-7 Multimedia Description Schemes part of the standard defines a set of DTs dealing with generic as well as multimedia entities. Generic entities are features which are used in audio, visual, and text descriptions, and are therefore "generic" to all media. These are, for instance, "vector", "time", etc. More complex DTs are also standardised. They are used whenever more than one medium needs to be described (e.g. audio and video). These DTs can be grouped into six classes according to their functionality, as in the following figure.

The MPEG-7 Multimedia Description Schemes 

 
Elements: Definition
Basic Elements: facilitate the creation and packaging of descriptions
Content description: representation of perceivable information
Content management: information about the media features, the creation and the usage of the AV content
Content organization: representation of the analysis and classification of several AV contents
Navigation and access: specification of summaries and variations of the AV content
User Interaction: describes user preferences and usage history

Basic Elements define a number of Schema Tools that facilitate the creation and packaging of MPEG-7 descriptions, a number of basic data types and mathematical structures, such as vectors and matrices, which are important for audio-visual content description. There are also constructs for linking media files and localising segments, regions, and so forth. Many of the basic elements address specific needs of audio-visual content description, such as the description of time, places, persons, individuals, groups, organisations, and other textual annotation. 

Content Description describes the Structure (regions, video frames, and audio segments) and Semantics (objects, events, abstract notions). Structural aspects describe the audio-visual content from the viewpoint of its structure. Conceptual aspects describe the audio-visual content from the viewpoint of real-world semantics and conceptual notions. 

The Content Management DTs allow the description of the life cycle of the content, from creation to consumption, including media coding, storage and file formats, and content usage.

  • Creation Information describes the creation and classification of the audio-visual content and other material that is related to the audio-visual content: Title (which may itself be textual or another piece of audio-visual content), Textual Annotation, and Creation Information such as creators, creation locations and dates. 

  • Classification Information describes how the audio-visual material is classified into categories such as genre, subject, purpose, language, and so forth. It also provides review and guidance information such as age classification, subjective review, parental guidance, and so forth. 

  • Related Material Information describes whether other audio-visual material exists that is related to the content being described. 

  • Usage Information describes the usage information related to the audio-visual content such as usage rights (through links to the rights holders and other information related to rights management and protection), availability, usage record, and financial information. 

  • Media Description describes the storage media such as the compression, coding and storage format of the audio-visual data. 

Content Organisation organises and models collections of audio-visual content and of descriptions. 

Navigation and Access facilitates browsing and retrieval of audio-visual content by defining summaries, partitions, decompositions, and variations of the audio-visual material. 

  • Summaries provide compact summaries of the audio-visual content to enable discovery, browsing, navigation, visualization and sonification of audio-visual content. 

  • Partitions and Decompositions describe different decompositions of the audio-visual signals in space, time and frequency. 

  • Variations provide information about different variations of audio-visual programs, such as summaries and abstracts; scaled, compressed and low-resolution versions; and versions with different languages and modalities - audio, video, image, text, and so forth. 

User Interaction describes user preferences and usage history pertaining to the consumption of the multimedia material. This allows, for example, matching between user preferences and MPEG-7 content descriptions in order to facilitate personalization of audio-visual content access, presentation and consumption. 
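Matching user preferences against content descriptions can be as simple as scoring each description by the preferred values it contains. The sketch below assumes that both preferences and descriptions have already been reduced to plain dictionaries; the attribute names, weights and the `preference_score` function are hypothetical and do not reflect the actual User Interaction DTs.

```python
def preference_score(preferences, description):
    """Sum the weights of preferred values found in a content description."""
    score = 0.0
    for attribute, weighted_values in preferences.items():
        value = description.get(attribute)
        # Values the user has not expressed a preference for contribute 0.
        score += weighted_values.get(value, 0.0)
    return score


# Hypothetical user preferences: attribute -> {value: weight}.
preferences = {
    "genre": {"documentary": 1.0, "sport": 0.5},
    "language": {"en": 0.8},
}

# Hypothetical content descriptions, one per item in a catalogue.
catalogue = {
    "clip-a": {"genre": "sport", "language": "it"},
    "clip-b": {"genre": "documentary", "language": "en"},
    "clip-c": {"genre": "drama", "language": "en"},
}

ranked = sorted(
    catalogue,
    key=lambda name: preference_score(preferences, catalogue[name]),
    reverse=True,
)
```

The ranked list could then drive which items are presented first to the user.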

The main tools used to implement MPEG-7 descriptions are DDL, DSs, and Ds. Ds bind a feature to a set of values. DSs are models of the multimedia objects and of the universes that they represent. They specify the types of the Ds that can be used in a given description, and the relationships between these Ds or between other DSs. The DDL provides the descriptive foundation by which users can create their own DSs and Ds and defines the syntactic rules to express and combine DSs and Ds. 

The Description Definition Language satisfies the requirement of being able to express spatial, temporal, structural, and conceptual relationships between the elements of a DS, and between DSs. It provides a rich model for links and references between one or more descriptions and the data that they describe. The DDL Parser is also capable of validating Description Schemes (content and structure) and D data types, both primitive (integer, text, date, time) and composite (histograms, enumerated types). MPEG-7 adopted XML Schema Language as the DDL but added certain extensions in order to satisfy all requirements. The DDL can be broken down into the following logical normative components: the XML Schema structural language components; the XML Schema datatype language components; and the MPEG-7 specific extensions. 
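The idea of checking a textual description against declared data types can be shown with a toy example. The XML element names below are hypothetical and simplified, not taken from the MPEG-7 schema, and the `EXPECTED_TYPES` table is only a stand-in for a real XML Schema; a genuine DDL parser would validate against the schema itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified element names; not taken from the MPEG-7 schema.
DESCRIPTION = """
<Description>
  <Title>Example clip</Title>
  <DurationSeconds>95</DurationSeconds>
  <DominantColor>
    <Red>200</Red><Green>30</Green><Blue>30</Blue>
  </DominantColor>
</Description>
"""

# Toy stand-in for a schema: expected primitive type per element path.
EXPECTED_TYPES = {
    "Title": str,
    "DurationSeconds": int,
    "DominantColor/Red": int,
    "DominantColor/Green": int,
    "DominantColor/Blue": int,
}


def check(xml_text, expected):
    """Return a list of problems; an empty list means the checks passed."""
    root = ET.fromstring(xml_text.strip())
    problems = []
    for path, kind in expected.items():
        node = root.find(path)
        if node is None or node.text is None:
            problems.append(f"missing {path}")
            continue
        try:
            kind(node.text.strip())
        except ValueError:
            problems.append(f"{path} is not a valid {kind.__name__}")
    return problems


ok_problems = check(DESCRIPTION, EXPECTED_TYPES)
bad_problems = check(DESCRIPTION.replace(">95<", ">long<"), EXPECTED_TYPES)
```

The first call reports no problems, while the second flags the duration element whose content is no longer an integer.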

The information representation specified in the MPEG-7 standard provides the means to represent coded multimedia content description information. The entity that makes use of such coded representation of the multimedia content is an "MPEG-7 terminal". This may be a standalone application or be part of an application system. The architecture of such a terminal is depicted in the figure below. 

Model of an MPEG-7 terminal

The Delivery layer, placed at the bottom of the figure, provides MPEG-7 elementary streams to the Systems layer. MPEG-7 elementary streams consist of consecutive individually accessible portions of data named Access Units. An access unit (AU) is the smallest data entity to which timing information can be attributed. MPEG-7 elementary streams contain Schema information, which defines the structure of the MPEG-7 description, and Description information. The latter can be either the complete description of the multimedia content or fragments of the description. 

MPEG-7 data can be represented in textual format, in binary format or in a mixture of the two, depending on application requirements. The standard defines a unique mapping between the binary format and the textual format. A bi-directional lossless mapping between the textual representation and the binary representation is possible, but it need not always be used: some applications may not want to transmit all the information contained in the textual representation and may prefer a more bit-efficient, lossy binary transmission. The syntax of the textual format is defined by the DDL, and the syntax of the binary format, called Binary format for MPEG-7 data (BiM), is defined in Part 1 (Systems) of the standard. 

At the compression layer, the flow of AUs (either textual or binary) is parsed, and the content description is reconstructed. The MPEG-7 binary stream can be either parsed by the BiM parser, transformed into textual format and then transmitted in textual format for further reconstruction processing, or the binary stream can be parsed by the BiM parser and then transmitted in proprietary format for further processing. 

AUs are further structured as commands encapsulating the schema or the description information. Commands allow a description to be delivered in a single chunk or to be fragmented in small pieces. They allow basic operations such as updating a D, deleting part of the description or adding a new DDL structure. The reconstruction stage of the compression layer updates the description information and associated schema information by consuming these commands. Further structure of the schema or description is out of the scope of the MPEG-7 standard in its current form.
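The way a terminal rebuilds a description by consuming commands can be sketched with a nested dictionary standing in for the description tree. The operation names "add", "update" and "delete", the tuple layout of a command and the path notation are assumptions chosen for this illustration; they are not the command syntax of the standard.

```python
def apply_command(description, command):
    """Apply one (operation, path, payload) command to a nested dict in place."""
    op, path, payload = command
    *parents, leaf = path
    node = description
    for key in parents:
        # "add" may create missing intermediate nodes; other operations may not.
        node = node.setdefault(key, {}) if op == "add" else node[key]
    if op in ("add", "update"):
        if op == "update" and leaf not in node:
            raise KeyError(leaf)
        node[leaf] = payload
    elif op == "delete":
        del node[leaf]
    else:
        raise ValueError(f"unknown operation: {op}")


def reconstruct(access_units):
    """Consume access units in order and return the resulting description."""
    description = {}
    for unit in access_units:
        for command in unit:
            apply_command(description, command)
    return description


# Each access unit is a list of commands; the description arrives in pieces.
access_units = [
    [("add", ("video", "title"), "Example clip"),
     ("add", ("video", "scene1", "annotation"), "Opening")],
    [("update", ("video", "title"), "Example clip (final)"),
     ("add", ("video", "scene2", "annotation"), "Chase")],
    [("delete", ("video", "scene1"), None)],
]
description = reconstruct(access_units)
```

Delivering the same description in a single chunk would simply mean one access unit carrying one "add" command for the whole tree.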

 

 


 

Copyright © 2003 chiariglione.org