Statistics and Biostatistics seminar series
En-Hui Yang
University of Waterloo, Department of Electrical and Computer Engineering
Room: M3 3127
Information Theory Inspired Deep Learning
While deep learning-based artificial intelligence (AI) is advancing our information age to new heights, the current approach of scaling up, which relies on massive deep neural networks (DNNs) and vast amounts of data, has many limitations, including enormous demands for power, computing resources, and high-throughput networking, as well as challenges with AI interpretability and security. Its high cost also limits its accessibility to only a few big players, reducing its potential for creating jobs for society. It is therefore crucial to explore and develop alternative AI paradigms that are more efficient, interpretable, secure, and inclusive.
Drawing an analogy between human students and DNNs, in this talk we first abstract a classification DNN as a high-dimensional nonlinear function that maps inputs to probability distributions. We then introduce an information geometry perspective to evaluate the structural performance of this mapping, using new metrics distinct from the traditional error probability. Specifically, we use conditional mutual information (CMI) from information theory (IT) to measure intra-class concentration in the DNN's output probability space, and a novel information quantity called inter-class cross entropy to quantify inter-class separation.
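(The precise definitions are those given in the talk; as a sketch of the intra-class concentration reading, write \hat{Y} for the DNN's soft output and Y for the true label, and assume the Markov chain Y - X - \hat{Y}, i.e., the output depends on the label only through the input. Then a standard identity gives

I(X;\hat{Y}\mid Y) \;=\; \mathbb{E}_{(X,Y)}\!\left[ D_{\mathrm{KL}}\!\big( P_{\hat{Y}\mid X} \,\big\|\, P_{\hat{Y}\mid Y} \big) \right],

so a small CMI means that each sample's output distribution lies close, in KL divergence, to the average output distribution of its class, while the inter-class cross entropy plays the complementary role of pushing the class-level distributions apart.)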
Based on these structural performance metrics, we illustrate how to apply optimization techniques from IT (such as those used to compute the rate-distortion function and channel capacity) to minimize or maximize CMI and the inter-class cross entropy during DNN training. This approach yields several new deep learning (DL) paradigms: CMI-constrained DL (CMIC-DL), knowledge distillation (KD)-resistant DL, and KD-amplifying DL. Extensive experimental results show that DNNs trained within these IT-inspired DL paradigms outperform state-of-the-art models trained with standard DL and other loss functions in the literature, in terms of both prediction error rate and robustness, while also providing improved interpretability.
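(To make the idea of constraining CMI during training concrete, the following is a minimal PyTorch sketch of a cross-entropy loss augmented with a per-batch estimate of the CMI penalty. It is an illustration under the editor's own assumptions, not the speaker's algorithm; the function name cmi_penalized_loss, the batch centroid estimator, and the weight lam are hypothetical.

import torch
import torch.nn.functional as F

def cmi_penalized_loss(logits, labels, lam=0.1):
    # Hypothetical helper: standard cross entropy plus a batch estimate
    # of the conditional mutual information I(X; Yhat | Y).
    ce = F.cross_entropy(logits, labels)

    probs = F.softmax(logits, dim=1)        # sample-wise distributions P_{Yhat|X=x}
    log_probs = F.log_softmax(logits, dim=1)

    kl_terms = []
    for c in labels.unique():
        mask = labels == c
        # class "centroid": batch estimate of P_{Yhat|Y=c}
        centroid = probs[mask].mean(dim=0)
        # KL( P_{Yhat|X=x} || P_{Yhat|Y=c} ) for each sample x of class c
        kl = (probs[mask] * (log_probs[mask] - centroid.clamp_min(1e-12).log())).sum(dim=1)
        kl_terms.append(kl)

    cmi_estimate = torch.cat(kl_terms).mean()    # average over the batch
    return ce + lam * cmi_estimate               # lam trades accuracy vs. concentration

# Example usage with a hypothetical model and batch:
# loss = cmi_penalized_loss(model(images), targets)

In this sketch, the per-class centroid stands in for P_{Yhat|Y}, and lam controls how strongly intra-class concentration is enforced relative to classification accuracy; the IT optimization techniques mentioned above, of the kind used to compute the rate-distortion function and channel capacity, would offer a more principled way to handle this trade-off.)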