Ahmed Elgohary Ghoneim
Scalable Embeddings for Data Clustering on MapReduce
Karray, Fakhreddine and Kamel, Mohamed S.
In today's era of big data, there is an increasing demand from businesses and industries to get an edge over competitors by making the best use of their data. Clustering is one of the powerful tools that data scientists can employ to discover natural groupings in data. The k-means algorithm is the most commonly-used data clustering method. It has gained popularity due to its effectiveness on many data sets as well as the ease of its implementation on different computing architectures. The k-means algorithm, however, assumes that data are available in an attribute-value format, and that all attributes can be turned into numeric values so that each data instance is represented as a point or vector in some space, where the algorithm can be applied. These assumptions are impractical for real data, and they hinder the utilization of complex data structures in real-world clustering problems. Examples include grouping users in a social network based on their friendship networks, clustering customers according to their behaviour, and grouping proteins based on their structure. Data scientists tend to simplify these complex structures to a vectorized format, and accordingly lose the richness of the data they have.
The kernel k-means is an effective method for data clustering which extends the k-means algorithm to work on a similarity matrix over complex data structures. The kernel k-means algorithm is, however, computationally very complex as it requires the complete kernel matrix to be calculated and stored. Further, the kernelized nature of the kernel k-means algorithm hinders the parallelization of its computations on modern infrastructures for distributed computing. This thesis defines a family of kernel-based low-dimensional embeddings that allows for scaling the kernel k-means on MapReduce via an efficient and unified parallelization strategy. Then, three practical methods for low-dimensional embedding that adhere to our definition of the embedding family are proposed. Combining the proposed parallelization strategy with any of the three embedding methods constitutes a complete scalable and efficient MapReduce algorithm for the kernel k-means. The efficiency and the scalability of the presented algorithms are demonstrated analytically and empirically.