Candidate: Ahmad Sajedi
Title: Coding for Data Analytics: New Information Distances
Date: August 21, 2020
Time: 2:00 PM
Place: Remote attendance
Supervisor(s): Yang, En-Hui
Abstract:
Distance plays a vital role in many applications of data analytics. For instance, in image retrieval, organization, and management, one needs a proper similarity distance to measure the perceptual similarity between any two images X and Y. Once such a similarity distance is defined for any two images, it can be used in image retrieval to find images in a database that are perceptually similar to a query image, and in image management to organize images into groups according to their mutual similarity. Likewise, in data clustering and bioinformatics, the notion of distance plays a dominant role.
The concept of distance between any two data objects X and Y (continuous or discrete) is addressed from the perspective of Shannon information theory. Consider a coding paradigm in which X and Y are encoded into a sequence of coded bits specifying a codeword (or method) which would, in turn, convert Y into X̂ and X into Ŷ such that both the distortion between X and X̂ and the distortion between Y and Ŷ are less than or equal to a prescribed threshold D. To achieve a degree of universality, we consider a class C of coding schemes within this paradigm. Given a class C, the information distance R_C(X; Y; D) between X and Y at distortion level D is defined as the smallest number of coded bits afforded by coding schemes from C.
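
Read as a rate-distortion-style definition, the quantity just described might be sketched as follows; the distortion measure d, the expectation-based constraints, and the rate functional R(c) are illustrative assumptions, not notation taken from the thesis:

```latex
R_C(X; Y; D) \;=\; \min_{c \,\in\, C}
\Big\{\, R(c) \;:\;
\mathbb{E}\, d\big(X, \hat{X}\big) \le D
\ \text{ and } \
\mathbb{E}\, d\big(Y, \hat{Y}\big) \le D \,\Big\},
```

where R(c) denotes the number of coded bits used by scheme c, and X̂, Ŷ are the reconstructions produced from Y and X, respectively, under c.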
R_C(X; Y; D) is shown to be a pseudo-distance in a certain sense; it is further characterized or bounded. A new information distance is defined for all stationary sources with discrete alphabets, and for independent and identically distributed (IID) sources with continuous alphabets, when the distortion level is small. For example, a pseudo-distance for memoryless jointly Gaussian sources is presented when the distortion level is small or, more generally, when it is less than or equal to a threshold that is a function of statistical properties of the sources, such as their variances.
When C is the class of so-called separately precoded broadcast codes, it is shown that for any discrete memoryless sources (DMS) X and Y, R_C(X; Y; D) is equal to the maximum of the Wyner-Ziv coding rate of X with Y as side information and the Wyner-Ziv coding rate of Y with X as side information.
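
As a purely illustrative sketch (not taken from the thesis, whose result above is stated for discrete memoryless sources), the max-of-Wyner-Ziv-rates quantity can be computed in closed form for zero-mean jointly Gaussian sources, for which the Wyner-Ziv rate is known to coincide with the conditional rate-distortion function (1/2) log2(var(X|Y)/D). The function names and parameters below are assumptions chosen for this example:

```python
import math

def wyner_ziv_rate_gaussian(var_x: float, rho: float, d: float) -> float:
    """Wyner-Ziv rate (bits/symbol) for a zero-mean Gaussian source with
    variance var_x, with jointly Gaussian side information of correlation
    rho at the decoder, under mean-squared-error distortion d.  For Gaussian
    sources this equals (1/2) * log2(var(X|Y) / d), clipped at zero."""
    cond_var = var_x * (1.0 - rho ** 2)  # var(X | Y) for jointly Gaussian X, Y
    if d >= cond_var:
        return 0.0  # side information alone already meets the distortion target
    return 0.5 * math.log2(cond_var / d)

def max_wz_distance(var_x: float, var_y: float, rho: float, d: float) -> float:
    """Illustrative symmetric quantity: the maximum of the two Wyner-Ziv
    rates, mirroring the characterization of R_C(X; Y; D) quoted in the
    abstract for the separately precoded broadcast-code class."""
    r_x = wyner_ziv_rate_gaussian(var_x, rho, d)  # encode X, Y as side info
    r_y = wyner_ziv_rate_gaussian(var_y, rho, d)  # encode Y, X as side info
    return max(r_x, r_y)

if __name__ == "__main__":
    # Example: unequal variances, correlation 0.9, small distortion level.
    print(max_wz_distance(var_x=1.0, var_y=4.0, rho=0.9, d=0.05))
```

The maximum is symmetric in X and Y, which is what makes the quantity behave like a distance between the two sources rather than a one-directional coding rate.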
In the general case where C consists of all codes within the coding paradigm, upper and lower bounds on R_C(X; Y; D) are established; these bounds are further shown to be tight when X and Y are memoryless jointly Gaussian sources or memoryless doubly symmetric binary sources (DSBS).
The distance R_C(X; Y; D) generalizes the notion of information distance defined within the framework of Kolmogorov complexity. In contrast to other information distances in the literature, this information distance is also applicable to both discrete and continuous-valued data.