Riding the Media Bits | chiariglione.org

Inside MPEG-7

Last update: 2005/03/08

An overview of the technical content of MPEG-7.

Before starting this page, the reader should be warned that MPEG-7 is more abstract than previous MPEG standards, which may make this page harder to read than most of the others. With this warning, let's start from some definitions that will assist understanding.
In the following, Ds (Descriptors) and DSs (Description Schemes) are collectively called Description Tools (DTs). The figure below represents the main elements making up the MPEG-7 standard.
The main MPEG-7 elements

MPEG-7 provides a wide range of low-level descriptors. MPEG-7 Visual DTs consist of basic structures and Ds that cover the following basic visual features: Color, Texture, Shape, Motion and Localisation. The Color feature has multiple Ds. Some of them are:
The Texture feature has 3 Ds.
The Shape feature has 4 Ds. Region-Based Shape is a D that can describe any shape. This is a complex task because the shape of an object may consist of a single region or of several disjoint regions, and may also contain holes.

The Motion feature has 4 Ds: camera motion, object motion trajectory, parametric object motion, and motion activity.
MPEG-7 Audio Tools. There are seventeen low-level Audio temporal and spectral Ds that may be used in a variety of applications. While low-level audio Ds in general can serve many conceivable applications, the Spectral Flatness D specifically supports the functionality of robust matching of audio signals. Applications include audio fingerprinting, identification of audio based on a database of known works and, thus, locating metadata for legacy audio content without metadata annotation. Four sets of audio DTs - roughly representing application areas - are integrated in the standard: sound recognition, musical instrument timbre, spoken content, and melodic contour. Timbre is defined as the perceptual features that make two sounds with the same pitch and loudness sound different.
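As a rough illustration of what a low-level spectral D measures, the sketch below computes a spectral flatness value, taken here as the ratio of the geometric mean to the arithmetic mean of the power spectrum coefficients. It is a simplified sketch under stated assumptions, not the procedure of the standard: the band splitting and framing parameters used by MPEG-7 Audio are not reproduced, and the function names `power_spectrum` and `spectral_flatness` are hypothetical.

```python
import cmath
import math


def power_spectrum(frame):
    """Return |X[k]|^2 for k = 0..N/2 using a naive DFT (fine for short frames)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def spectral_flatness(power, eps=1e-12):
    """Geometric mean divided by arithmetic mean of the power coefficients."""
    # eps avoids log(0) for empty bins; it is an assumption of this sketch.
    values = [p + eps for p in power]
    log_mean = sum(math.log(v) for v in values) / len(values)
    arithmetic_mean = sum(values) / len(values)
    return math.exp(log_mean) / arithmetic_mean


# A pure tone concentrates its energy in one bin (flatness close to 0);
# an impulse spreads its energy evenly over all bins (flatness close to 1).
N = 64
tone = [math.sin(2 * math.pi * 8 * t / N) for t in range(N)]
impulse = [1.0] + [0.0] * (N - 1)

tone_flatness = spectral_flatness(power_spectrum(tone))
impulse_flatness = spectral_flatness(power_spectrum(impulse))
```

Because the value depends on the shape of the spectrum rather than on its overall level, two copies of the same recording at different volumes give similar flatness values, which is one reason a measure of this kind is useful for robust matching.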
One can easily see, from this cursory presentation, the range of tools that are being offered to content owners and to application developers. The MPEG-7 Ds are designed for describing a wide range of information types: low-level audio-visual features such as color, texture, motion, audio energy, and so forth, as illustrated above; high-level features of semantic objects, events and abstract concepts; content management processes; information about the storage media; and so forth. It is expected that most Ds corresponding to low-level features will be extracted automatically, whereas more human intervention is likely to be required for producing higher-level Ds.

The MPEG-7 Multimedia Description Schemes part of the standard defines a set of DTs dealing with generic as well as multimedia entities. Generic entities are features which are used in audio, visual, and text descriptions, and are therefore "generic" to all media. These are, for instance, "vector", "time", etc. More complex DTs are also standardised. They are used whenever more than one medium needs to be described (e.g. audio and video). These DTs can be grouped into 6 different classes according to their functionality, as in the following figure.
The MPEG-7 Multimedia Description Schemes
Basic Elements define a number of Schema Tools that facilitate the creation and packaging of MPEG-7 descriptions, as well as a number of basic data types and mathematical structures, such as vectors and matrices, which are important for audio-visual content description. There are also constructs for linking media files and localising segments, regions, and so forth. Many of the basic elements address specific needs of audio-visual content description, such as the description of time, places, persons, individuals, groups, organisations, and other textual annotation.

Content Description describes the Structure (regions, video frames, and audio segments) and Semantics (objects, events, abstract notions). Structural aspects describe the audio-visual content from the viewpoint of its structure. Conceptual aspects describe the audio-visual content from the viewpoint of real-world semantics and conceptual notions.

The Content Management DTs allow the description of the life cycle of the content, from creation to consumption, including media coding, storage, file formats and content usage.
Content Organisation organises and models collections of audio-visual content and of descriptions.

Navigation and Access facilitates browsing and retrieval of audio-visual content by defining summaries, partitions, decompositions, and variations of the audio-visual material.
User Interaction describes user preferences and usage history pertaining to the consumption of the multimedia material. This allows, for example, matching between user preferences and MPEG-7 content descriptions in order to facilitate personalisation of audio-visual content access, presentation and consumption.

The main tools used to implement MPEG-7 descriptions are the DDL, DSs, and Ds. Ds bind a feature to a set of values. DSs are models of the multimedia objects and of the universes that they represent. They specify the types of the Ds that can be used in a given description, and the relationships between these Ds or between other DSs. The DDL provides the descriptive foundation by which users can create their own DSs and Ds, and defines the syntactic rules to express and combine DSs and Ds.

The Description Definition Language satisfies the requirement of being able to express spatial, temporal, structural, and conceptual relationships between the elements of a DS, and between DSs. It provides a rich model for links and references between one or more descriptions and the data that they describe. The DDL Parser is also capable of validating Description Schemes (content and structure) and D data types, both primitive (integer, text, date, time) and composite (histograms, enumerated types). MPEG-7 adopted the XML Schema Language as the DDL but added certain extensions in order to satisfy all requirements. The DDL can be broken down into the following logical normative components: the XML Schema structural language components, the XML Schema datatype language components, and the MPEG-7 specific extensions.

The information representation specified in the MPEG-7 standard provides the means to represent coded multimedia content description information. The entity that makes use of such a coded representation of the multimedia content is an "MPEG-7 terminal". This may be a standalone application or be part of an application system.
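Because the DDL is based on XML Schema, an MPEG-7 description in textual form is an XML document in which DSs appear as nested elements and Ds carry the values. The sketch below parses a small hand-written fragment with Python's standard `xml.etree.ElementTree`. The fragment is illustrative only: it is heavily simplified, has not been validated against the MPEG-7 schema, and the element names, namespace string and title value should be treated as assumptions chosen for readability.

```python
import xml.etree.ElementTree as ET

# Namespace strings assumed for this sketch; check them against the schema in use.
MPEG7_NS = "urn:mpeg:mpeg7:schema:2001"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

DESCRIPTION = f"""
<Mpeg7 xmlns="{MPEG7_NS}" xmlns:xsi="{XSI_NS}">
  <Description xsi:type="ContentEntityType">
    <MultimediaContent xsi:type="VideoType">
      <Video>
        <CreationInformation>
          <Creation>
            <Title>Example clip</Title>
          </Creation>
        </CreationInformation>
      </Video>
    </MultimediaContent>
  </Description>
</Mpeg7>
"""

root = ET.fromstring(DESCRIPTION.strip())
ns = {"mpeg7": MPEG7_NS}

# Locate a value nested inside the description structure.
title = root.find(".//mpeg7:Title", ns).text

# xsi:type names the schema type that this element instantiates.
content = root.find(".//mpeg7:MultimediaContent", ns)
content_type = content.get(f"{{{XSI_NS}}}type")
```

A plain XML parser such as this one only checks well-formedness. Validating the description against the schema, as the DDL Parser mentioned above does, requires a schema-aware processor.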
The architecture of an MPEG-7 terminal is depicted in the figure below.
Model of an MPEG-7 terminal

The Delivery layer, placed at the bottom of the figure, provides MPEG-7 elementary streams to the Systems layer. MPEG-7 elementary streams consist of consecutive, individually accessible portions of data named Access Units. An access unit (AU) is the smallest data entity to which timing information can be attributed. MPEG-7 elementary streams contain Schema information, which defines the structure of the MPEG-7 description, and Description information. The latter can be either the complete description of the multimedia content or fragments of the description.

MPEG-7 data can be represented in textual format, in binary format, or in a mixture of the two, depending on application requirements. A unique mapping between the binary format and the textual format is defined by the standard. A bi-directional lossless mapping between the textual representation and the binary representation is possible, but it need not always be used. Some applications may not want to transmit all the information contained in the textual representation and may prefer a more bit-efficient, lossy binary transmission. The syntax of the textual format is defined by the DDL, and the syntax of the binary format, called Binary format for MPEG-7 data (BiM), is defined in Part 1 (Systems) of the standard.

At the compression layer, the flow of AUs (either textual or binary) is parsed, and the content description is reconstructed. The MPEG-7 binary stream can either be parsed by the BiM parser, transformed into textual format and then transmitted in textual format for further reconstruction processing, or be parsed by the BiM parser and then transmitted in a proprietary format for further processing.

AUs are further structured as commands encapsulating the schema or the description information. Commands allow a description to be delivered in a single chunk or to be fragmented into small pieces. They allow basic operations such as updating a D, deleting part of the description or adding a new DDL structure. The reconstruction stage of the compression layer updates the description information and associated schema information by consuming these commands. Further structure of the schema or description is out of the scope of the MPEG-7 standard in its current form.
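The command mechanism can be pictured as a sequence of small edits that the terminal applies to the description it holds. The sketch below is a toy model under stated assumptions: the command kinds (`add`, `replace`, `delete`), the path-based addressing, the nested-dictionary description and the `apply_command` helper are all hypothetical and do not reproduce the command syntax defined in Part 1 of the standard.

```python
def apply_command(description, command):
    """Apply one command to a nested-dict description, in place."""
    kind, path = command["kind"], command["path"]
    if kind not in ("add", "replace", "delete"):
        raise ValueError(f"unknown command kind: {kind}")

    # Walk to the parent of the addressed node, creating branches when adding.
    node = description
    for key in path[:-1]:
        node = node.setdefault(key, {}) if kind == "add" else node[key]

    leaf = path[-1]
    if kind == "delete":
        del node[leaf]
    else:
        node[leaf] = command["value"]
    return description


# Each access unit carries one or more commands (hypothetical layout).
access_units = [
    [{"kind": "add", "path": ["Video", "Title"], "value": "Example clip"}],
    [{"kind": "add", "path": ["Video", "DominantColor"], "value": [200, 30, 30]},
     {"kind": "replace", "path": ["Video", "Title"], "value": "Example clip (final)"}],
    [{"kind": "delete", "path": ["Video", "DominantColor"]}],
]

# The reconstruction stage consumes the commands in order.
description = {}
for au in access_units:
    for command in au:
        apply_command(description, command)
```

Delivering the description as fragments in this way lets a terminal start working with a partial description and keep it up to date as later access units arrive, instead of waiting for one large document.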
Copyright © 2003 chiariglione.org