Xiao Meng, Master's candidate
David R. Cheriton School of Computer Science
In many modern data management scenarios, we encounter data-intensive tasks, operations, or phases in which the sheer volume of data becomes a performance bottleneck.
For data-intensive tasks, the bottleneck is data loading: the cost of loading data into memory outweighs the cost of the actual computation. For data-intensive shuffling, the bottleneck is data transfer: intermediate data are scattered across nodes and must be shuffled for further processing.
This thesis addresses two data-intensive scheduling problems: (1) multi-processor scheduling for data-intensive tasks to reduce redundant data loading; (2) reducer scheduling for data-intensive shuffling to reduce redundant data communication.
For data-intensive tasks, we focus on workloads with precedence constraints arising from data dependencies, which are common in applications such as data analytics and ETL processing. These workloads are often known in advance, are presented as directed acyclic graphs (DAGs), and are data-intensive and sensitive to cache misses. We solve the problem of scheduling DAGs of data-intensive tasks on multiple processors or machines in order to minimize execution time.
To do so, we propose scheduling algorithms that take cache misses into account. Simulations and an experimental evaluation using a Spark cluster demonstrate the advantages of our solutions in terms of workload completion time.
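The abstract does not detail the proposed algorithms, but the following sketch illustrates the general idea of accounting for data loading when placing DAG tasks on processors: a greedy list scheduler that charges a load penalty for input datasets not yet cached on a processor. The Task class, the uniform load_cost parameter, and the greedy rule are illustrative assumptions, not the algorithms evaluated in the thesis.

```python
# Illustrative cache-aware list scheduler (hypothetical; not the thesis's algorithm).
# Tasks form a DAG; each task reads a set of input datasets. Loading a dataset on a
# processor that has not cached it adds a load penalty to the task's runtime.
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    compute: float                              # pure computation time
    inputs: set = field(default_factory=set)    # dataset ids the task reads
    deps: set = field(default_factory=set)      # names of predecessor tasks

def schedule_dag(tasks, num_procs, load_cost=1.0):
    """Greedy list scheduling: repeatedly pick the ready (task, processor) pair
    with the earliest finish time, counting a load penalty for uncached inputs."""
    done, finish = {}, {}                       # task name -> scheduled / finish time
    proc_free = [0.0] * num_procs               # next free time per processor
    caches = [set() for _ in range(num_procs)]  # datasets cached on each processor
    placement = {}

    while len(done) < len(tasks):
        ready = [t for t in tasks
                 if t.name not in done and t.deps <= done.keys()]
        best = None
        for t in ready:
            deps_ready = max((finish[d] for d in t.deps), default=0.0)
            for p in range(num_procs):
                misses = len(t.inputs - caches[p])          # uncached inputs
                start = max(proc_free[p], deps_ready)
                end = start + t.compute + misses * load_cost
                if best is None or end < best[0]:
                    best = (end, t, p)
        end, t, p = best
        done[t.name] = True
        finish[t.name] = end
        proc_free[p] = end
        caches[p] |= t.inputs
        placement[t.name] = p
    return placement, max(finish.values())

if __name__ == "__main__":
    dag = [Task("load",  2.0, {"A"}),
           Task("clean", 1.0, {"A"}, {"load"}),
           Task("join",  3.0, {"A", "B"}, {"clean"})]
    print(schedule_dag(dag, num_procs=2))
```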
For data-intensive shuffling, we focus on MapReduce-style processing, which incurs communication overhead in the Shuffle stage that sends intermediate results from mappers to reducers. We solve the following problem: given a collection of mapper outputs (intermediate key-value pairs) and a partitioning of this collection among the reducers, which node should each reducer run on to minimize data transfer? We reduce two natural formulations of this problem to optimization problems for which polynomial-time solutions exist. We show that our techniques can cut communication costs by 50 percent or more compared to Hadoop's default reducer placement, leading to lower network utilization and faster MapReduce job runtimes.
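As one way to see how such a placement question can become a polynomial-time optimization problem, the sketch below treats reducer placement as a minimum-cost assignment: the cost of running a reducer on a node is the portion of its input that is not already stored there. The one-reducer-per-node assumption, the byte-count cost model, and the use of SciPy's Hungarian solver (linear_sum_assignment) are illustrative choices, not necessarily the formulations studied in the thesis.

```python
# Illustrative reducer-placement sketch: minimum-cost assignment of reducers to nodes.
# Assumes each node hosts at most one reducer, so the problem reduces to an
# assignment problem solvable in polynomial time (Hungarian algorithm).
import numpy as np
from scipy.optimize import linear_sum_assignment

def place_reducers(local_bytes):
    """local_bytes[r][n]: bytes of reducer r's input already stored on node n.
    Returns a node for each reducer, minimizing total bytes shipped over the network."""
    local = np.asarray(local_bytes, dtype=float)
    total = local.sum(axis=1, keepdims=True)   # each reducer's total input size
    transfer = total - local                   # bytes that must move if r runs on n
    rows, cols = linear_sum_assignment(transfer)
    placement = dict(zip(rows.tolist(), cols.tolist()))
    return placement, transfer[rows, cols].sum()

if __name__ == "__main__":
    # 3 reducers, 3 nodes: entry [r][n] is reducer r's input produced on node n.
    local = [[40, 10, 10],
             [ 5, 30,  5],
             [20, 20, 50]]
    print(place_reducers(local))   # e.g. reducer 0 -> node 0, 1 -> node 1, 2 -> node 2
```

In this toy instance, the solver keeps each reducer on the node that already holds most of its input, shipping only the remainder; with more reducers than nodes, the cost matrix would need additional per-node slots, which this simple sketch does not model.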