Please note: This master’s thesis presentation will take place online.
Albert Ding, Master’s candidate
David R. Cheriton School of Computer Science
Supervisor: Professor Olga Veksler
In the field of computer vision, it can be useful to model the visually “salient” areas in a scene viewed by a human. One way to formulate this problem is to estimate, for each pixel, how likely it is to belong to an “object” (as opposed to the background); together, these estimates form a “saliency” mask. This task has become commonly known as Salient Object Detection (SOD).
Supervised methods for this task are given a set of images and their corresponding pixel-precise saliency mask targets (often hand-labeled) and aim to learn the relationship between them. Unsupervised SOD methods, in comparison, attempt to identify salient objects solely by examining an image. This gives unsupervised methods the advantage of not needing expensive labels, which are susceptible to human error. One good starting point for an unsupervised method is a technique called unsupervised feature learning. Interestingly, the Vision Transformer (ViT) [30], which struggled to find its place in the task of image classification, was able to produce features that are directly useful for SOD when trained in an unsupervised manner called self-DIstillation with NO labels (DINO) [11].
The authors of LOST [97], which investigates DINO features further, found that the “keys” from the last attention layer stand out as the most useful. Melas-Kyriazi et al. [80] and TokenCut [120] explore how applying spectral clustering methods to these features can be an effective way to perform computer vision tasks without supervision. In this thesis, we continue to investigate these features for use in SOD by developing a K-means clustering-based method. From our experiments with a multitude of methods for using DINO features, we observe that the clusters are rich in saliency information. This could mean that the features themselves are particularly useful for SOD when processed by clustering methods.
We first apply K-means clustering to DINO features, then select some clusters as salient according to various heuristics, and then upscale and post-process the resulting coarse saliency maps to obtain pseudo-ground truth. Finally, we train a SelfReformer [136] saliency model on this pseudo-ground truth. The most important step of our approach is the heuristic that decides which clusters are salient after K-means clustering. We extensively developed and evaluated many heuristics and, surprisingly, the simplest one, which marks the clusters not touching the image border as salient, works best: after SelfReformer training, it outperforms the more complex eigenvector-based method presented in [80].
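As a rough illustration of the border heuristic described above, the following minimal Python sketch assumes that K-means has already assigned a cluster index to every patch of the image, stored as a small 2D grid. The function name `border_heuristic_mask`, the grid representation, and the toy labels are hypothetical choices made for this example; they are not taken from the thesis, which may implement the step differently.

```python
def border_heuristic_mask(labels):
    """Mark clusters that never touch the image border as salient.

    labels: 2D list of ints, one (assumed) K-means cluster index per patch.
    Returns a 2D list of the same shape with 1 for salient, 0 for background.
    """
    height = len(labels)
    width = len(labels[0])

    # Collect every cluster index that appears on the outer border of the grid.
    border_clusters = set()
    for col in range(width):
        border_clusters.add(labels[0][col])
        border_clusters.add(labels[height - 1][col])
    for row in range(height):
        border_clusters.add(labels[row][0])
        border_clusters.add(labels[row][width - 1])

    # A patch is salient only if its cluster never touches the border.
    return [
        [0 if cluster in border_clusters else 1 for cluster in row]
        for row in labels
    ]


# Toy 4x5 patch grid with three clusters (illustrative values only).
example_labels = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 2, 0],
    [0, 1, 1, 2, 2],
    [0, 0, 0, 0, 0],
]
example_mask = border_heuristic_mask(example_labels)
```

In this toy grid, cluster 0 covers the border and cluster 2 reaches the right edge, so only cluster 1 is marked salient. In the pipeline described above, such a coarse patch-level mask would still need to be upscaled and post-processed before serving as pseudo-ground truth.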