What are the lossless compression algorithms?


Definition of Lossless Compression

A lossless compression format, as the name implies, is an audio format that compresses the sound signal without any loss. Common formats such as MP3 and WMA are lossy: compared with the original WAV file, they discard a considerable amount of signal, which is precisely why they can shrink a file to roughly 10% of its original size. A lossless format, by contrast, works much like compressing the audio with software such as Zip or RAR: when the compressed file is restored to WAV, it is exactly the same as the source WAV file. However, a WAV file compressed with Zip or RAR must be decompressed before it can be played, whereas a lossless compression format can be played directly in real time by the playback software, just as conveniently as lossy formats such as MP3. In short, a lossless compression format reduces the size of a WAV file without sacrificing any of the audio signal.


Commonly used lossless compression algorithms include Shannon-Fano coding, Huffman coding, run-length coding, LZW (Lempel-Ziv-Welch) coding and arithmetic coding.


Huffman Coding

This method constructs prefix codewords (no codeword is the prefix of another) with the shortest average length, based entirely on the probability of occurrence of each character. It is sometimes called optimal coding and is generally known as Huffman coding. For a statistically independent source it achieves the minimum average code length, so its coding efficiency is high.

Basic principle:

Codes are constructed according to the probability of occurrence of the source symbols: symbols with a higher probability are given shorter codewords and symbols with a lower probability are given longer codewords, so that the average codeword length is the shortest (a small code sketch is given after the steps below).

Coding steps:

(1) Initialization: sort the symbols in descending order according to their probability.

(2) Combine the two symbols (or nodes) with the smallest probabilities into a new node whose probability is their sum.

(3) Repeat step (2), treating the new node as a symbol, until only one node, the root, remains.

(4) From the root node down to the "leaf" corresponding to each symbol, label each branch "0" (upper branch) or "1" (lower branch). Which branch is labeled "1" and which "0" does not matter; the assigned codewords differ, but the average code length is the same.

(5) Start from the root node and follow the branches to each leaf to write the code for each symbol.
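As a rough illustration of these steps, here is a minimal Huffman coding sketch in Python. The input string "ABRACADABRA" and the tie-breaking order are assumptions of this example, not part of the original text, so the exact codewords may differ from a hand-built tree even though the average code length is the same.

import heapq
from collections import Counter

def huffman_code(freqs):
    """Build a Huffman code table from a {symbol: count} mapping."""
    # Heap entries are (count, tie-breaker, node); a node is either a
    # symbol (leaf) or a (left, right) tuple (internal node).
    heap = [(count, i, sym) for i, (sym, count) in enumerate(freqs.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)    # the two least frequent nodes
        c2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (c1 + c2, next_id, (left, right)))
        next_id += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node: label branches 0 / 1
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf: record the codeword
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

text = "ABRACADABRA"
codes = huffman_code(Counter(text))
print(codes)                                 # e.g. {'A': '0', 'B': '110', ...}
print("".join(codes[ch] for ch in text))     # the encoded bit string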


Notes on Huffman coding:

Huffman coding has no error protection function. If there is an error in the code, it may cause a series of subsequent decoding errors.

Huffman coding is a variable-length coding, so it is difficult to arbitrarily search or call the file content.

Huffman coding relies on the statistical properties of the source, and each Huffman codeword is an integer number of bits; therefore, in practice, the average code length rarely reaches the information entropy of the source (a quick numerical check is given after these notes).

Huffman encoding and decoding require a code table. If the number of source messages is large, the code table that must be stored is also large, which affects the system's storage requirements and its encoding and decoding speed.
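To make the remark about entropy concrete, here is a quick numerical check in Python with an assumed two-symbol source (the probabilities 0.9 and 0.1 are an invented example): because every Huffman codeword is a whole number of bits, both symbols get 1-bit codes, while the entropy is well under one bit per symbol.

from math import log2

p = {"a": 0.9, "b": 0.1}                          # assumed probabilities
entropy = -sum(q * log2(q) for q in p.values())   # about 0.469 bits/symbol
avg_len = sum(q * 1 for q in p.values())          # Huffman: 1 bit for each symbol
print(f"entropy        = {entropy:.3f} bits/symbol")
print(f"average length = {avg_len:.3f} bits/symbol")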

Arithmetic Coding


Arithmetic coding represents an entire source message as an interval between 0 and 1 on the real number line. Each symbol of the message is used to narrow this interval: the more symbols are encoded, the smaller the interval becomes, and the smaller the interval, the more digits are needed to represent it. This is the principle of using an interval as the code. Arithmetic coding first assumes a probability model of the source and then uses these probabilities to narrow the interval that represents the message.

Arithmetic coding uses two basic parameters:

The probability of a symbol and its coding interval.

The probability of each source symbol determines the efficiency of the compression and also determines the interval assigned to that symbol during encoding; all of these intervals lie within [0, 1).
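The interval-narrowing principle can be sketched in a few lines of Python. The three-symbol probability model below is assumed purely for illustration, and the sketch uses exact fractions while ignoring the precision and scaling issues discussed next.

from fractions import Fraction

# Assumed fixed model: symbol -> (cumulative low, cumulative high) within [0, 1)
MODEL = {"a": (Fraction(0), Fraction(6, 10)),
         "b": (Fraction(6, 10), Fraction(9, 10)),
         "c": (Fraction(9, 10), Fraction(1))}

def arithmetic_encode(message):
    low, high = Fraction(0), Fraction(1)     # start with the whole interval [0, 1)
    for sym in message:
        span = high - low
        sym_low, sym_high = MODEL[sym]
        high = low + span * sym_high         # shrink the interval to the
        low = low + span * sym_low           # sub-range assigned to this symbol
    return (low + high) / 2                  # any number inside the final interval

print(arithmetic_encode("aba"))              # one fraction encodes the whole message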

Several issues need to be noted in arithmetic coding:

1. Since the precision of an actual computer cannot be infinite, overflow during the calculation is an obvious problem. However, most machines offer 16-bit, 32-bit, or 64-bit precision, so this problem can be handled with a scaling method.

2. The arithmetic encoder generates only one codeword for the entire message, a real number in the interval [0, 1), so the decoder cannot begin decoding until it has received all of the bits representing this real number.

3. Arithmetic coding is also very sensitive to errors: if one bit is wrong, the entire message may be decoded incorrectly.

Run-length encoding

RLE (Run-Length Encoding) is a compression scheme for data that contains runs of consecutive repeated values. The principle is to replace each run of repeated values with a single value plus a count; the run length is the number of consecutive repetitions. To recover the original data, the encoding simply has to be expanded.


Comparing the number of codes before and after RLE encoding in a typical example: before encoding, 73 codes are used to represent one row of data, while after encoding only 10 codes are needed to represent the same row. The ratio of the amount of data before and after compression is therefore about 7.3:1, that is, a compression ratio of roughly 7:1. This shows that RLE is indeed an effective and practical compression technique.

The performance of RLE depends mainly on the characteristics of the image itself. RLE compression is particularly suitable for computer-generated images and is very effective at reducing the storage space of image files. However, in colorful natural images there are often few consecutive pixels of the same color on a line, let alone several consecutive lines with identical color values; if RLE is still applied to such images, not only will the data not be compressed, the image data may actually become larger.

Decoding follows the same rules as encoding, and the data obtained after restoration is exactly the same as the data before compression.

Therefore, RLE is a lossless compression technique. It is used in BMP, JPEG/MPEG, TIFF, and PDF encoding, and is also used in fax machines.
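A minimal encode/decode sketch of this scheme in Python is shown below; the sample scan line is an assumed example, not the figure data discussed above.

def rle_encode(data):
    """Replace each run of repeated values with a (count, value) pair."""
    runs = []
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1                            # extend the current run
        runs.append((j - i, data[i]))         # run length, then the repeated value
        i = j
    return runs

def rle_decode(runs):
    """Expand (count, value) pairs back into the original sequence."""
    out = []
    for count, value in runs:
        out.extend([value] * count)
    return out

row = [0] * 8 + [255] * 4 + [0] * 6           # assumed sample scan line
encoded = rle_encode(row)
assert rle_decode(encoded) == row             # lossless round trip
print(encoded)                                # [(8, 0), (4, 255), (6, 0)]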

LZW encoding


LZW achieves compression by building a string table and using shorter codes to represent longer strings. The correspondence between strings and codes is generated dynamically during compression and is implicit in the compressed data; during decompression, the table is rebuilt and used to restore the data. It is a lossless compression algorithm whose full name is Lempel-Ziv-Welch coding, abbreviated LZW.

Rationale

Extract the distinct characters from the original data, build a string table starting from these characters, and then replace characters and strings in the original data with their indices in the table, thereby reducing the size of the data. This looks similar in principle to a palette image, but note that the table here is not created in advance; it is built dynamically from the original data, and during decoding the same table must be reconstructed from the encoded data.

The specific execution steps of the LZW encoding algorithm are as follows (a small code sketch is given after the steps):

1. Initially, the dictionary contains all possible roots (single characters), and the current prefix P is empty;

2. Current character C := the next character in the character stream;

3. Determine whether the string P+C is in the dictionary:

(1) If "yes": P := P + C (extend P with C);

(2) If "no":

① Output the codeword representing the current prefix P to the codeword stream;

② Add the string P+C to the dictionary;

③ Let P := C (now P contains only the single character C);

4. Determine whether there are more characters in the character stream:

(1) If "yes", return to step 2;

(2) If "no":

① Output the codeword representing the current prefix P to the codeword stream;

② End.
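Putting these steps together, here is a minimal LZW encoder in Python, with a matching decoder for a round-trip check. The initial dictionary of single bytes and the sample string are assumptions of this sketch, not part of the original text.

def lzw_encode(data: bytes):
    dictionary = {bytes([i]): i for i in range(256)}   # all possible roots
    p = b""                                            # current prefix P
    out = []
    for byte in data:
        c = bytes([byte])                              # current character C
        if p + c in dictionary:                        # P+C already in the dictionary:
            p = p + c                                  #   extend P with C
        else:
            out.append(dictionary[p])                  # output the codeword for P
            dictionary[p + c] = len(dictionary)        # add P+C to the dictionary
            p = c                                      # P now holds only C
    if p:
        out.append(dictionary[p])                      # flush the final prefix
    return out

def lzw_decode(codes):
    dictionary = {i: bytes([i]) for i in range(256)}
    prev = dictionary[codes[0]]
    out = [prev]
    for code in codes[1:]:
        # A code may refer to the entry being created in this very step.
        entry = dictionary.get(code, prev + prev[:1])
        out.append(entry)
        dictionary[len(dictionary)] = prev + entry[:1] # rebuild the table on the fly
        prev = entry
    return b"".join(out)

sample = b"TOBEORNOTTOBEORTOBEORNOT"
codes = lzw_encode(sample)
assert lzw_decode(codes) == sample                     # lossless round trip
print(codes)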

