PhD Seminar: Image and Video Compression: Human And Machine Vision Perspectives

Wednesday, April 8, 2020 11:00 am - 11:00 am EDT (GMT -04:00)

Candidate: Hossam Amer

Title: Image and Video Compression: Human And Machine Vision Perspectives

Date: April 8, 2020

Time: 11:00AM

Place: REMOTE PARTICIPATION

Supervisor(s): Yang, En-Hui

Abstract:

As we start a new decade, image and video compression should further improve to satisfy both human and machine vision.

Human and machine vision have different perspectives on images and videos, which are compressed because of bandwidth and storage requirements. From a human vision (HV) perspective, one key aspect for human satisfaction is the perceived quality of these compressed images and videos. From a machine vision (MV) perspective, especially in image classification, one crucial aspect for machine satisfaction is the ability to accurately recognize patterns or objects in these compressed images and videos. This thesis addresses a variety of image/video compression problems to serve both the human and the machine vision perspectives. For HV, our goal is to improve video compression in terms of the trade-off between compression rate, compression distortion, and time complexity. For MV, our goal is to show that compression, if used in the right manner, helps improve the classification accuracy of deep neural networks (DNNs) while reducing the size in bits of the input image.

Towards the HV perspective, we first introduced a global rate distortion optimization (RDO) model. The existing RDO in the state-of-the-art video codec, High Efficiency Video Coding (HEVC), is traditionally performed within each frame with fixed quantization parameters (QPs), without fully considering the coding dependencies between the current frame and future frames within a temporal propagation chain. To further improve the coding efficiency of HEVC, it is desirable to perform a global RDO among consecutive frames while maintaining a similar coding complexity. To address this problem, temporal dependencies are first measured via a model for the energy of prediction residuals, which enables the formulation of the global RDO in low-delay (LD) HEVC. Second, we introduce the notion of propagation length, defined as the number of future frames affected by the current frame. This length is estimated via offline experiments and used to propose two novel methods that predict the impact of the coding distortion of the current frame on future frames from previous frames with similar coding properties. Third, we apply these two methods to adaptively determine the Lagrangian multiplier and its corresponding QP for each frame in the LD configuration of HEVC. Experimental results show that, in comparison to the default LD HEVC, the first method achieves average BD-rate savings of 5.0% and 4.9% in the low-delay-P (LDP) and low-delay-B (LDB) configurations, respectively, and the second achieves average BD-rate savings of 4.9% and 4.9% in the LDP and LDB configurations, respectively, all with only a 1% increase in encoding time.
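
The role of the Lagrangian multiplier in the paragraph above can be illustrated with the textbook RD cost J = D + lambda * R. The following Python sketch is only an illustration under stated assumptions: the candidate numbers are made up, and `scaled_lambda` with its `propagation_weight` argument is a hypothetical stand-in for a temporal-dependency adjustment, not the model proposed in the thesis.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    qp: int            # quantization parameter tried for the frame
    rate: float        # bits spent (illustrative value)
    distortion: float  # e.g. sum of squared errors (illustrative value)


def rd_cost(c: Candidate, lam: float) -> float:
    """Textbook Lagrangian cost J = D + lambda * R."""
    return c.distortion + lam * c.rate


def best_candidate(cands: List[Candidate], lam: float) -> Candidate:
    """Pick the candidate with the smallest RD cost for a given lambda."""
    return min(cands, key=lambda c: rd_cost(c, lam))


def scaled_lambda(base_lam: float, propagation_weight: float) -> float:
    """Hypothetical adjustment: shrink lambda for a frame whose distortion
    is assumed to propagate to future frames (propagation_weight >= 0)."""
    return base_lam / (1.0 + propagation_weight)


candidates = [
    Candidate(qp=22, rate=1000.0, distortion=50.0),
    Candidate(qp=27, rate=600.0, distortion=120.0),
    Candidate(qp=32, rate=350.0, distortion=260.0),
]

# With the base lambda, the middle QP has the lowest cost.
print(best_candidate(candidates, 0.5).qp)
# With a smaller lambda, distortion weighs more and a lower QP wins.
print(best_candidate(candidates, scaled_lambda(0.5, 3.0)).qp)
```

In this toy setting, a frame that is assumed to influence more future frames receives a smaller multiplier and therefore a lower QP, which is the general intuition behind frame-adaptive Lagrangian multipliers.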

Also within the HV perspective, despite the rate distortion performance improvement that HEVC offers, it is computationally expensive due to the adoption of a large variety of coding unit (CU) sizes in its RDO. Thus, we investigated the application of fully connected neural networks (NNs) to this time-sensitive application to improve its time complexity while controlling the resulting bitrate loss. Specifically, four NNs are introduced, one for each depth of the coding tree unit. These NNs either split the current CU or terminate the CU search algorithm. Because training NNs is time-consuming and requires large training data, we further propose a novel training strategy in which offline training and online adaptation work together to overcome this limitation. Our features are extracted from original frames based on the Laplacian Transparent Composite Model (LPTCM). Experiments carried out on the all-intra configuration of HEVC reveal that our method is among the best NN methods, with an average time saving of 32% and an average controlled bitrate loss of 1.6%, compared to the original HEVC. In our CU partition algorithm, a fully connected NN machine 'saw' extracted LPTCM features to help reduce the computational intensity of compression at a controlled trade-off between compression rate and compression distortion.
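
The split-or-terminate decision described above can be sketched as a recursive quadtree walk with one predictor per depth. This is a minimal Python sketch, not the thesis implementation: each `SplitPredictor` is a hypothetical stand-in for a trained NN operating on LPTCM features, and the 64x64 tree unit and 8x8 minimum CU size are assumptions made for the example.

```python
from typing import Callable, List, Tuple

Block = Tuple[int, int, int]              # (x, y, size) of a square CU
SplitPredictor = Callable[[Block], bool]  # True = split, False = terminate


def partition_ctu(x: int, y: int, size: int, depth: int,
                  predictors: List[SplitPredictor],
                  min_size: int = 8) -> List[Block]:
    """Return the leaf CUs chosen by per-depth split predictors.

    The predictor for the current depth is consulted only when the block
    can still be split; otherwise the search terminates at this block.
    """
    can_split = size > min_size and depth < len(predictors)
    if not can_split or not predictors[depth]((x, y, size)):
        return [(x, y, size)]
    half = size // 2
    leaves: List[Block] = []
    for dy in (0, half):
        for dx in (0, half):
            leaves += partition_ctu(x + dx, y + dy, half, depth + 1,
                                    predictors, min_size)
    return leaves


# Toy predictors, one per depth (placeholders for the four NNs).
toy_predictors: List[SplitPredictor] = [
    lambda b: True,                      # depth 0: always split
    lambda b: b[0] == 0 and b[1] == 0,   # depth 1: split top-left block only
    lambda b: False,                     # depth 2: never split
    lambda b: False,                     # depth 3: never split
]

leaves = partition_ctu(0, 0, 64, 0, toy_predictors)
print(leaves)
```

Terminating early at a given depth skips the RD search over all smaller CU sizes inside that block, which is where the time saving would come from in this kind of scheme.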

Turning to the MV perspective, where DNNs typically 'see' the input as a JPEG image, we revisited the impact of JPEG compression on deep learning (DL) in image classification. Given an underlying DNN pre-trained with pristine ImageNet images, we demonstrated that if, for any original image, one can select a suitable version among its many JPEG compressed versions (including the original) as the input to the underlying DNN, then the classification accuracy of the underlying DNN can be improved significantly while the size in bits of the selected input is, on average, reduced dramatically in comparison with the original image. This is in contrast to the conventional understanding that JPEG compression generally degrades the classification accuracy of DL. Specifically, for each original image, consider its 10 JPEG compressed versions with quality factor (QF) values from {100, 90, 80, 70, 60, 50, 40, 30, 20, 10}. Under the assumption that the ground truth label of the original image is known at the time of selecting an input, but unknown to the underlying DNN, we presented a selector called the Highest Rank Selector (HRS). It is shown that HRS is optimal in the sense of achieving the highest top-k accuracy on any set of images for any k among all possible selectors. When the underlying DNN is Inception V3 or ResNet-50 V2, HRS improves, on average, the top-1 classification accuracy and top-5 classification accuracy on the whole ImageNet validation dataset by 5.6% and 1.9%, respectively, while reducing the input size in bits dramatically: the compression ratio (CR) between the size of the original images and the size of the input images selected by HRS is 8 for the whole ImageNet validation dataset.
When the ground truth label of the original image is unknown at the time of selection, we further propose selectors that maintain the top-1 accuracy, the top-5 accuracy, or both the top-1 and top-5 accuracy of the underlying DNN, while achieving CRs of 8.8, 3.3, and 3.1, respectively.
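
The selection rule behind HRS, as described above, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the DNN scores are assumed to be precomputed for every candidate version, the function names are hypothetical, and breaking ties in favor of the smaller file is an assumption of this sketch rather than something stated in the abstract.

```python
from typing import Sequence

import numpy as np


def label_rank(scores: np.ndarray, label: int) -> int:
    """Rank of `label` in a 1-D score vector (1 = highest score)."""
    return int(np.sum(scores > scores[label])) + 1


def highest_rank_select(scores_per_version: np.ndarray,
                        sizes_bits: Sequence[int],
                        label: int) -> int:
    """Return the index of the version whose DNN output ranks the ground
    truth label highest; ties go to the smaller file (an assumption)."""
    ranks = [label_rank(s, label) for s in scores_per_version]
    return min(range(len(ranks)), key=lambda i: (ranks[i], sizes_bits[i]))


# Toy example: 3 candidate versions of one image, 4 classes, true label 2.
scores = np.array([
    [0.5, 0.1, 0.3, 0.1],   # version 0: true label ranked 2nd
    [0.2, 0.1, 0.6, 0.1],   # version 1: true label ranked 1st
    [0.3, 0.1, 0.5, 0.1],   # version 2: true label ranked 1st
])
sizes = [8000, 3000, 1500]  # illustrative sizes in bits

chosen = highest_rank_select(scores, sizes, label=2)
print(chosen)
```

Because the chosen version always gives the ground truth label its best available rank, the label is in the top k for the selected input whenever it is in the top k for any candidate version, which matches the optimality property stated for HRS.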