DefData Compression
Data compression is the process if replacing a message with a smaller message which can be reliably converted back to the origin.
Formalising Coding
Definition
DefSource Code
We denote the set of all finite binary strings by . Given an ensemble , a function is a source code for . The number of symbols in is the length of the codeword for .
Definition
DefExtension of Source Codes
The extension of a source code is defined by where is the set of all finite sequences of symbols from .
e.g. The code names outcomes from by , , . The length of the codeword for each outcome is . The extension of gives .
Definition
DefUniform Code
Let be an ensemble, and be a source code. We say is a uniform code if is the same for all . We say is a variable-length code otherwise.
Lossless Code
Let be an ensemble, and be a source code. We say is a lossless code if is injective. Otherwise, we say is a lossy code.
e.g. Let . Then , , , is uniform lossless; while , , , is uniform lossy.
Formalising Compression
Definition
DefRaw Bit Content
If is an ensemble with outcome set then its raw bit content is where is the cardinality of .
Comment
There is an inherent trade off between the number of bits required in a uniform lossy code and the probability of being able to code an outcome.
Definition
DefSmallest -sufficient Subset
Let be an ensemble and for , define the smallest -sufficient subset to be the smallest subset of such that e.g.
Algorithm
AlgorithmFind The Smallest -sufficient Subset
Intuitively, construct by removing elements of in ascending order of probability, till we have reached the threshold.
Definition
DefEssential Bit Content
Let be an ensemble the for the essential bit content of is where is the smallest -sufficient subset.
Typical Sets and AEP
Extended Ensemble
The extended ensemble of blocks of size is denoted . Outcomes from are denoted . The probability of is defined to be .
e.g. Let be an ensemble with outcomes with and . Then (3 flips of the coin) has outcomes with and .
Definition
DefTypical Set
For closeness the typical set for an ensemble is
Remark
The name “typical” is used since � will have roughly occurrences of symbol , occurrences of symbol , and so on, where are the probabilities of the symbols in .
ThrmSource Coding Theorem
Let be an ensemble with entropy bits. Given and , there exists a positive integer such that for all we have Proof Recall that for the typical set , we have By AEP,
Remark
The source coding theorem shows that given a tiny probability of error, the average bits per outcome can be made as close to as required. Even if we allow a large probability of error, we cannot compress more than bits per outcome for large sequences.