Overview of MPEG-7 Audio


Introduction

The MPEG-1, MPEG-2, and MPEG-4 standards specify the compression and coding of multimedia content itself. The MPEG-7 standard builds on them but can also be used independently of them: it provides standardized description information about multimedia content, not the content itself; in other words, "data about data". MPEG-7 is not intended to replace these standards, but to provide a standard description representation alongside them. MPEG-7 was in fact proposed as early as 1997, and its proposers anticipated the networked world we face today remarkably well. Every Internet user now faces enormous volumes of audio and video online, yet there is no unified method for searching for the audiovisual content one wants. MPEG-7 arose to fill this gap: it defines a universal, standardized structure for exchanging data that describes multimedia content, thereby supporting content search and management. Accordingly, the official name of MPEG-7 is the Multimedia Content Description Interface, a name that also indicates its scope of application.

Although there are many possible ways to describe a given piece of multimedia content, standardizing the description format gives descriptions usability, uniformity, and interoperability. Being a description standard, MPEG-7 specifies only the description format, that is, its syntax and semantics.

Although the multimedia content description interface was originally proposed to help people cope with the overwhelming amount of multimedia information on the Internet, the applications of MPEG-7 are by no means limited to search. It supports many other kinds of applications, such as real-time monitoring, broadcast filtering, semi-automatic editing, and automatic playlist generation.

This article discusses the basics of MPEG-7, focusing on the audio part.

1. MPEG-7 Basics

1. Definition

The basic description entity of MPEG-7 is the descriptor, which represents a property, characteristic, or attribute of specific content by defining its syntax and semantics. In the audio domain, for example, a descriptor can describe the spectral envelope of an audio signal.

Description schemes combine and structure description components to meet application requirements. A description scheme contains a series of descriptors and other description schemes within the same framework.

Descriptors and description schemes are defined by the Description Definition Language (DDL), which is extensible. The MPEG-7 DDL is based on XML because XML provides a textual representation of content descriptions and allows the description tools to be extended.

2. Description Definition Language - DDL

The Description Definition Language is an XML-based (i.e., textual) language. XML was adopted partly because it is a simplified subset of SGML, and its wide adoption ensures the extensibility of the description tools. MPEG-7 also adopted XML because it is well suited to creating the data structures needed for multimedia content description.

However, the MPEG-7 DDL is not a verbatim copy of the XML Schema specification: it makes a few changes, such as adding support for certain specific data types and removing redundant features. In particular, the DDL introduces new constructs for defining arrays and matrices, extending the capabilities of the XML Schema architecture.
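
To make the textual, XML-based representation concrete, here is a minimal Python sketch that assembles a hypothetical MPEG-7-style description with the standard library's ElementTree. The element names and values only approximate the MPEG-7 style; they are not copied from the normative DDL schema.

```python
# Illustrative sketch only: element and attribute names approximate the
# MPEG-7 style but are NOT taken from the normative schema.
import xml.etree.ElementTree as ET

root = ET.Element("Mpeg7")
desc = ET.SubElement(root, "Description")
audio = ET.SubElement(desc, "AudioSegment")
ET.SubElement(audio, "MediaTime").text = "PT0S/PT30S"  # hypothetical 30 s segment

# A descriptor instance: a made-up spectral-centroid value for the segment
centroid = ET.SubElement(audio, "AudioSpectrumCentroid")
centroid.text = "812.5"  # Hz, sample value for illustration

print(ET.tostring(root, encoding="unicode"))
```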

3. Multimedia Description Scheme (MDS)

The multimedia description schemes were created for description tools that apply neither to video nor to audio alone, but to multimedia content in general. The MDS in the MPEG-7 standard therefore provides two levels of tools relevant to MPEG-7 Audio. First, there is a set of low-level tools, which includes extended data types such as the basic segment types for audio and video data; second, there is a set of high-level tools that go beyond pure audiovisual description and allow, for example, semantic description of high-level content. To avoid redundant functionality within the MPEG-7 standard, the audio part relies on MDS for the tools it requires rather than duplicating them.

2. MPEG-7 Audio

The MPEG-7 Audio standard can be divided into two parts: a general audio description framework and application-specific tools. The audio description framework is the basic compatibility layer (a toolbox) on which both general descriptions and specific applications are built; it comprises the scalable series, the low-level descriptors (LLDs), and the silence segment. The application-oriented tools, on the other hand, are the sound recognition tools, instrument timbre description tools, spoken content description tools, melody description tools, and robust audio matching tools; each focuses on its own application field, so its descriptive power is correspondingly stronger.

1. Description structures

MPEG-7 Audio relies on two basic structures: the segment and the scalable series.

The segment data type is inherited from the MPEG-7 multimedia description schemes and was adopted by the audio description from the beginning. Audio segments and segment decomposition work together to decompose an audio stream successively: the audio is divided into "segments" on the basis of at least one characteristic, whether conceptual or mathematical. An audio stream can be split at any desired resolution and to any depth. Provided the temporal extent of each sub-segment lies entirely within that of its parent segment (whose own characteristics constrain those of its sub-segments), sub-segments may have gaps, overlaps, both, or neither, and a given period of the audio stream can be described by any number of segments.
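
The parent/child containment rule is the one hard constraint in this decomposition. A minimal Python sketch, assuming a simple time-interval representation (the class name and fields are invented for illustration):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AudioSegment:
    """Hypothetical stand-in for an MPEG-7 audio segment: a time interval
    plus any number of sub-segments (which may overlap or leave gaps)."""
    start: float  # seconds
    end: float    # seconds
    label: str = ""
    children: List["AudioSegment"] = field(default_factory=list)

    def decompose(self, child: "AudioSegment") -> None:
        # Enforce the one hard rule: a sub-segment's extent must lie
        # entirely within its parent's extent.
        if not (self.start <= child.start and child.end <= self.end):
            raise ValueError("sub-segment must lie within parent segment")
        self.children.append(child)

song = AudioSegment(0.0, 240.0, "whole track")
song.decompose(AudioSegment(0.0, 15.0, "intro"))
song.decompose(AudioSegment(10.0, 60.0, "verse 1"))  # overlap is allowed
```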

The core of a scalable series is a sequence of sampled values of some descriptor. Most commonly this is a time series, but sampling in the frequency domain is also supported. A scalable series can also store various summary values, such as the minimum, maximum, and variance of the descriptor values.
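
As an illustration of the rescaling idea, the following sketch collapses a sampled descriptor series by an integer ratio while keeping per-window summary values; the field names and choice of statistics are assumptions, not the normative scalable-series format.

```python
# Minimal sketch of the scalable-series idea: rescale a sampled descriptor
# by a ratio, keeping summary values per window.
def rescale(values, ratio):
    """Collapse every `ratio` consecutive samples into summary statistics."""
    out = []
    for i in range(0, len(values), ratio):
        window = values[i:i + ratio]
        mean = sum(window) / len(window)
        out.append({
            "mean": mean,
            "min": min(window),
            "max": max(window),
            "var": sum((v - mean) ** 2 for v in window) / len(window),
        })
    return out

power = [0.1, 0.4, 0.3, 0.9, 0.8, 0.2, 0.1, 0.05]  # made-up power samples
print(rescale(power, 4))  # two summary points instead of eight samples
```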

2. Low-level audio descriptors

Generally speaking, the low-level descriptors applicable to most audio signals fall into the following groups; the scope of each group is given below, and a sketch of the basic spectral measures follows the list:

Basic: the instantaneous waveform and power values.

Basic spectral: the log-frequency power spectrum and spectral features, including the spectral centroid, spectral spread, and spectral flatness.

Signal parameters: the fundamental frequency and harmonicity of quasi-periodic signals.

Timbre temporal: the log attack time and temporal centroid of monophonic sounds, used together with temporal segmentation.

Timbre spectral: spectral features in a linear frequency space, such as the spectral centroid of monophonic audio, plus features of the harmonic portion of the signal: the harmonic spectral centroid, deviation, spread, and variation.

Spectral basis: compact, low-dimensional spectral representations, mainly used as features for sound recognition.
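
The sketch promised above: the three basic spectral measures computed from a single frame with NumPy. Windowing and the standard's exact log-frequency band layout are omitted, so this only approximates the MPEG-7 definitions.

```python
import numpy as np

def basic_spectral(frame, sample_rate):
    spectrum = np.abs(np.fft.rfft(frame)) ** 2            # power spectrum
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sample_rate)
    total = spectrum.sum() + 1e-12
    # Centroid: the spectrum's "center of mass" along the frequency axis
    centroid = (freqs * spectrum).sum() / total
    # Spread: RMS deviation of frequency around the centroid
    spread = np.sqrt(((freqs - centroid) ** 2 * spectrum).sum() / total)
    # Flatness: geometric / arithmetic mean (near 1 = noise-like, near 0 = tonal)
    flatness = np.exp(np.mean(np.log(spectrum + 1e-12))) / (spectrum.mean() + 1e-12)
    return centroid, spread, flatness

sr = 16000
t = np.arange(1024) / sr
c, s, f = basic_spectral(np.sin(2 * np.pi * 440 * t), sr)  # a pure 440 Hz tone
print(c, s, f)  # centroid near 440 Hz, very low flatness
```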

In 2003, MPEG released MPEG-7 Audio Version 2, adding several low-level descriptors, including audio signal quality descriptors and a musical tempo (BPM) descriptor. The audio signal quality descriptors cover background noise, cross-channel correlation, relative delay, balance, DC offset, bandwidth, and transmission technology, as well as recording errors. In addition, the original MPEG-7 Audio descriptions were extended to cover stereo/surround sound and spoken content.

3. Application-oriented audio tools

(1) Sound recognition tools


The sound recognition descriptors and description schemes are a set of tools for indexing and categorizing sound effects. They support automated sound-effect recognition and indexing, and include tools for specifying sound classification schemes and sound recognizers. These tools can be used for automatic indexing and for segmenting soundtracks.

(2) Spoken content description tools

The spoken content description tools were created in recognition of the fact that current speech recognition systems are imperfect. They do not store a simple transcript (although they can accommodate one); instead, the description scheme combines word and phoneme lattices for each speaker in the audio stream. The phoneme lattice largely solves the out-of-vocabulary problem: even if the original decoding was wrong, or a word lies outside the recognition engine's vocabulary, retrieval remains possible. The tools serve two broad classes of retrieval: indexing and retrieval of audio streams, and indexing of multimedia objects annotated with speech.
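
A toy sketch of the combined word/phoneme lattice idea follows; the data layout and matching rule are invented for illustration and are far simpler than the real description scheme.

```python
# Each time slot keeps several hypotheses; a query can match either the
# word or the phoneme representation.
lattice = [
    # (start_s, end_s, word hypotheses, phoneme hypotheses)
    (0.0, 0.4, ["the"], ["dh ah"]),
    (0.4, 1.1, ["beatles", "beetles"], ["b iy t ah l z"]),
]

def find(query_word, query_phonemes):
    hits = []
    for start, end, words, phones in lattice:
        if query_word in words or query_phonemes in phones:
            hits.append((start, end))
    return hits

# An out-of-vocabulary (misspelled) query can still hit via the phonemes:
print(find("beatls", "b iy t ah l z"))   # -> [(0.4, 1.1)]
```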

(3) Instrument timbre description tools

Timbre description aims to capture the perceptual features of instrument sounds. Timbre is defined in the literature as the perceptual attribute that makes two sounds of the same pitch and loudness sound different. The timbre description tools describe these perceptual features with a reduced set of descriptors, covering notions such as the attack time, brightness, or richness of a sound.

There are four classes of musical instrument sounds: harmonic, coherent, sustained sounds; percussive, non-sustained sounds; non-harmonic, coherent, sustained sounds; and non-coherent, sustained sounds. Of these four, the first two are covered in detail by the MPEG-7 standard and are still being refined. The other two are considered lower priority because they are relatively rare, but the standard still describes them. The timbre description tools make extensive use of the temporal and spectral timbre low-level descriptors discussed above.
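
As a small illustration of the temporal timbre descriptors mentioned here and in the LLD list, the sketch below estimates a log attack time and temporal centroid from an amplitude envelope; the 2% and 90% thresholds are common conventions, not values quoted from the standard.

```python
import numpy as np

def log_attack_time(envelope, sample_rate):
    """Log10 of the duration between the attack's start and its peak region."""
    peak = envelope.max()
    start = np.argmax(envelope > 0.02 * peak)   # attack begins (assumed 2 %)
    end = np.argmax(envelope > 0.90 * peak)     # attack essentially over (90 %)
    return np.log10(max(end - start, 1) / sample_rate)

def temporal_centroid(envelope, sample_rate):
    """Energy-weighted average time of the envelope."""
    t = np.arange(len(envelope)) / sample_rate
    return (t * envelope).sum() / (envelope.sum() + 1e-12)

# A made-up fast-attack, slow-decay envelope at 1 kHz envelope rate:
env = np.concatenate([np.linspace(0, 1, 10), np.linspace(1, 0, 490)])
print(log_attack_time(env, 1000), temporal_centroid(env, 1000))
```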

(4) Melody description tools

There are two ways to describe melodic properties, depending on the conciseness and accuracy required. The melody contour description scheme describes melodic information concisely and allows efficient, robust melodic similarity matching, for example in query-by-humming. It uses a 5-level contour, representing the interval between adjacent notes quantized into five steps. The melody contour description scheme can also represent rhythmic information by storing the number of the nearest beat for each note, which can significantly improve the accuracy of matches retrieved from a database.
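
A minimal sketch of the contour idea: each transition between adjacent notes is quantized to one of five steps. The quantization boundaries used here are illustrative, not the normative ones.

```python
def contour(midi_notes):
    """Map each note-to-note transition onto a 5-level contour."""
    def quantize(semitones):
        if semitones <= -3: return -2   # big step down
        if semitones < 0:   return -1   # small step down
        if semitones == 0:  return 0    # repeated note
        if semitones < 3:   return +1   # small step up
        return +2                       # big step up
    return [quantize(b - a) for a, b in zip(midi_notes, midi_notes[1:])]

# C4, D4, D4, G4, E4 as MIDI note numbers:
print(contour([60, 62, 62, 67, 64]))   # -> [1, 0, 2, -2]
```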

For applications that require better descriptive accuracy or reconstruction of a given melody, the melody description scheme supports an expanded descriptor set and high-precision interval encoding: the exact pitch interval between notes is recorded (to cent precision or better) rather than quantized to five levels. Precise rhythm information is obtained analogously to the pitch intervals, by encoding the logarithmic ratio of the differences between note onset times. Around these core descriptors is a series of optional supporting descriptors, such as lyrics, key, meter, and starting note, to meet application needs.
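
The two precise encodings are simple to state mathematically; the sketch below computes a pitch interval in cents and a tempo-invariant rhythm code as log ratios of successive inter-onset intervals.

```python
import math

def interval_cents(f1_hz, f2_hz):
    """Pitch interval in cents (100 cents = one equal-tempered semitone)."""
    return 1200.0 * math.log2(f2_hz / f1_hz)

def rhythm_ratios(onsets_s):
    """Log2 ratios of successive inter-onset intervals (tempo-invariant)."""
    iois = [b - a for a, b in zip(onsets_s, onsets_s[1:])]
    return [math.log2(b / a) for a, b in zip(iois, iois[1:])]

print(round(interval_cents(440.0, 466.16), 1))  # A4 -> A#4: ~100 cents
print(rhythm_ratios([0.0, 0.5, 1.5, 2.0]))      # -> [1.0, -1.0]
```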

(5) Robust audio matching tools

The robust audio matching tools can match the identity of audio signals robustly and efficiently; that is, they can tell whether two audio signals are essentially the same, even after linear or nonlinear distortion. An unknown audio signal can be matched robustly and efficiently against entries in a database of reference material, which makes it possible to identify audio automatically, imitating the human ability to recognize sounds from memory. More importantly, the MPEG-7 standard establishes methods for finding content description data (e.g., a song title or singer name) for a given block of audio content in existing traditional audio formats: CDs, for example, provide no link to corresponding description database entries. Although robust audio matching could in principle be built on several features, it is achieved very effectively using the MPEG-7 spectral flatness descriptor.
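
A toy sketch of the matching idea, using per-band spectral-flatness vectors as fingerprints; the database entries, distance measure, and threshold are invented for illustration.

```python
import numpy as np

database = {
    "song A": np.array([0.12, 0.55, 0.31, 0.08]),  # made-up flatness per band
    "song B": np.array([0.40, 0.10, 0.22, 0.70]),
}

def identify(fingerprint, threshold=0.05):
    """Return the nearest database title, or None if nothing is close enough."""
    best, best_dist = None, float("inf")
    for title, ref in database.items():
        dist = np.mean((fingerprint - ref) ** 2)
        if dist < best_dist:
            best, best_dist = title, dist
    return best if best_dist < threshold else None

# A slightly distorted copy of "song A" should still match:
print(identify(np.array([0.13, 0.53, 0.30, 0.09])))  # -> "song A"
```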

3. Applications of MPEG-7 Audio

1. Search for spoken content

Speech is the most important means of human communication and is closely bound to our lives, so being able to search spoken content is very important. Today's automatic speech recognizers are word- and phoneme-based.

The MPEG-7 method of describing spoken content was outlined above: it stores word and phoneme lattices as description data (rather than plain text) and then uses a query/matching approach that tolerates fuzzy matches and can retrieve unknown words. This application lets you retrieve particular spoken content and also annotate content by voice.

2. General sound recognition and indexing

Faced with a wide variety of sounds, even complex mixtures, how do you identify one of them? And how do you tell two similar sounds apart? They can be distinguished well by projecting spectra onto independent-component bases and classifying the result with hidden Markov models.
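
A hedged sketch of the classification side of this idea: train one Gaussian HMM per sound class on low-dimensional feature frames, then label an unknown sound with the best-scoring model. It assumes the third-party hmmlearn package, and the random features here merely stand in for real spectral-basis projections.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM  # third-party package assumption

rng = np.random.default_rng(0)
classes = {
    "dog bark": rng.normal(0.0, 1.0, (200, 4)),   # stand-in feature frames
    "door slam": rng.normal(3.0, 1.0, (200, 4)),
}

# Train one HMM per sound class on its (projected) feature frames.
models = {}
for name, feats in classes.items():
    m = GaussianHMM(n_components=3, covariance_type="diag", n_iter=20)
    m.fit(feats)
    models[name] = m

# Classify an unknown sound by maximum log-likelihood across the models.
query = rng.normal(3.0, 1.0, (50, 4))
print(max(models, key=lambda n: models[n].score(query)))  # -> "door slam"
```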

3. Archive and restore

This application addresses the preservation of the audio part of our cultural heritage, which can be archived for future use. We often want to keep an original recording so that it can be post-processed or restored later. For archiving and restoring audio files, MPEG-7 provides coded descriptions of sound quality, recording the general sound quality, the technical recording parameters, and the location and type of defects.

4. Search for instrument sounds

Human perception of sound involves not only pitch, loudness, and duration but also timbre. MPEG-7 describes monophonic instrument sounds with perceptually relevant features so that sounds can be compared, covering sustained harmonic sounds as well as percussive instruments.

5. Melody search

Melody search requires a tool that is efficient yet tolerant of slight inaccuracies in pitch and timing. MPEG-7's approach is melody description coding, which encodes pitch, rhythm, and timing information for search.

6. Audio recognition/fingerprinting

To identify a recording, MPEG-7 stores the MPEG-7 signature/fingerprint of the original file in a database and then identifies unknown audio material by matching it against the database signatures. This audio fingerprinting approach, which identifies audio content automatically by extracting a unique signature from the audio signal, has attracted widespread interest. MPEG-7 can also be used for speaker recognition.

4. Conclusion

After publishing several generations of audio and video compression standards, the MPEG standards group extended its work to the description of multimedia content. Because demand for efficient search and retrieval of audio and video content keeps growing, the MPEG-7 standard attempts to provide a standard way to describe audiovisual content. Notably, many of the standard's description tools represent the internal structure and characteristics of the content itself, rather than describing it purely through annotation as other metadata standards do. MPEG-7 Audio provides both common concepts and application-oriented tools, supporting functions such as query-by-humming, sound-effect recognition, instrument timbre description, spoken content annotation, and robust matching of audio signals.
