Candidate: Ahmad Sajedi
Title: Coding for Data Analytics: New Information Distances
Date: August 21, 2020
Time: 2:00 PM
Place: Remote attendance
Supervisor(s): Yang, En-Hui
Distance plays a vital role in many applications of data analytics. For instance, in image retrieval, organization, and management, one needs a proper similarity distance to measure the perceptual similarity between any two images X and Y. Once such a similarity distance is defined, it can be used to retrieve images from a database that are perceptually similar to a query image, and to organize images into groups according to their mutual similarity. Likewise, in data clustering and bioinformatics, the notion of distance plays a dominant role.
The concept of distance between any two data objects X and Y (continuous or discrete) is addressed from the perspective of Shannon information theory. Consider a coding paradigm where X and Y are encoded into a sequence of coded bits specifying a codeword (or method) which would, in turn, convert Y into X̂ and X into Ŷ such that both the distortion between X and X̂ and the distortion between Y and Ŷ are less than or equal to a prescribed threshold D. To achieve a degree of universality, we consider a class C of coding schemes within this paradigm. Given a class C, the information distance RC(X; Y; D) between X and Y at distortion level D is defined as the smallest number of coded bits afforded by coding schemes from C. RC(X; Y; D) is shown to be indeed a pseudo distance in some sense; it is further characterized or bounded. The new information distance is defined for all stationary sources with discrete alphabets, and for independent and identically distributed (IID) sources with continuous alphabets when the distortion level is small. For example, a pseudo distance for memoryless jointly Gaussian sources is presented when the distortion level is small, or less than or equal to a threshold determined by statistical properties of the sources, such as their variances.
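Written out, the definition in the paragraph above takes roughly the following form (the notation is illustrative: R(c) denotes the number of coded bits afforded by a code c, and d(·,·) the distortion measure):

\[
R_C(X; Y; D) \;=\; \min_{c \,\in\, C} \bigl\{\, R(c) \;:\; d(X, \hat{X}) \le D \ \text{and}\ d(Y, \hat{Y}) \le D \,\bigr\},
\]

where X̂ and Ŷ are the reconstructions of X and Y produced by the code c.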
When C is the class of so-called separately precoded broadcast codes, it is shown that for any discrete memoryless sources (DMSs) X and Y, RC(X; Y; D) equals the maximum of the Wyner-Ziv coding rate of X with Y as side information and the Wyner-Ziv coding rate of Y with X as side information. In the general case where C consists of all codes within the coding paradigm, upper and lower bounds on RC(X; Y; D) are established, and are further shown to be tight when X and Y are memoryless jointly Gaussian or form a doubly symmetric binary source (DSBS). The distance RC(X; Y; D) generalizes the notion of information distance defined within the framework of Kolmogorov complexity. In contrast to other information distances in the literature, this information distance is applicable to both discrete and continuous-valued data.
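For the jointly Gaussian case, the max-of-Wyner-Ziv-rates characterization can be sketched numerically. The sketch below relies on the known fact that Wyner-Ziv coding of a Gaussian source with Gaussian side information incurs no rate loss, so each rate equals the conditional rate-distortion function (1/2) log2(σ²_cond / D). The function names and the specific parameterization are illustrative, not taken from the thesis:

```python
import math

def gaussian_wz_rate(var_cond, D):
    """Wyner-Ziv rate (bits/sample) for a Gaussian source whose
    conditional variance given the side information is var_cond.
    For jointly Gaussian sources there is no rate loss, so this
    equals the conditional rate-distortion function."""
    if D >= var_cond:
        return 0.0  # side information alone meets the distortion target
    return 0.5 * math.log2(var_cond / D)

def gaussian_info_distance(var_x, var_y, rho, D):
    """Illustrative pseudo distance for memoryless jointly Gaussian
    X, Y with correlation coefficient rho: the maximum of the two
    Wyner-Ziv rates, valid for small distortion levels D."""
    var_x_given_y = var_x * (1.0 - rho ** 2)  # Var(X | Y)
    var_y_given_x = var_y * (1.0 - rho ** 2)  # Var(Y | X)
    return max(gaussian_wz_rate(var_x_given_y, D),
               gaussian_wz_rate(var_y_given_x, D))
```

Taking the maximum of the two directional rates makes the quantity symmetric in X and Y, which is one of the properties a pseudo distance must satisfy.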