Wednesday, November 5, 2014 2:30 pm - 2:30 pm EST (GMT -05:00)
Speaker: | Arian Baer, FTW Telecom Research Centre, Vienna |
Abstract: | Shared workload optimization is feasible if the set of tasks to be executed is known in advance, as is the case when updating a set of materialized views or executing an extract-transform-load workflow. In this talk, we consider data-intensive shared workloads with precedence constraints arising from data dependencies, i.e., before some task can execute, other tasks may have to run first and generate the data it needs. While there has been previous work on identifying common subexpressions in shared workloads and on task re-ordering to enable shared scans, we go a step further and solve the problem of scheduling shared data-intensive workloads in a cache-oblivious way. Our solution relies on a novel formulation of precedence-constrained scheduling with the additional constraint that once a data item is in the cache, all tasks that require this data item should execute as soon as possible thereafter. The intuition behind this formulation is that the longer a data item remains in the cache, the more likely it is to be evicted, regardless of the cache size. We give an optimal ordering algorithm using A* search over the space of possible orderings, and we propose efficient and effective heuristics that obtain nearly optimal results in much less time. We present experimental results on real-life data warehouse workloads and the TPC-DS benchmark to validate our claims. |
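To make the scheduling idea in the abstract concrete, here is a minimal Python sketch of an A* search over task orderings. It is an illustration under stated assumptions, not the speaker's algorithm: the function name `schedule`, the input layout, and the cost model (the sum, over data items, of the number of steps between an item's first and last reader) are all hypothetical stand-ins for "tasks sharing a data item should run close together".

```python
import heapq
from itertools import count


def schedule(tasks, deps):
    """A* search for a low-cost task order (illustrative cost model).

    tasks: dict mapping task name -> set of data items it reads (assumed layout).
    deps:  dict mapping task name -> set of tasks that must run before it.
    Returns (cost, order), or None if the precedence constraints are cyclic.
    """
    names = sorted(tasks)
    readers = {}  # data item -> set of tasks that read it
    for t in names:
        for item in tasks[t]:
            readers.setdefault(item, set()).add(t)

    def open_items(done):
        # Items already read by a finished task and still needed by a pending one.
        return sum(1 for r in readers.values() if r & done and r - done)

    def lower_bound(done):
        # An item with k pending readers stays cached for at least k - 1 more steps.
        return sum(max(len(r - done) - 1, 0) for r in readers.values())

    start, goal = frozenset(), frozenset(names)
    tie = count()  # tie-breaker so the heap never compares sets
    frontier = [(lower_bound(start), next(tie), 0, start, [])]
    best = {start: 0}
    while frontier:
        _, _, g, done, order = heapq.heappop(frontier)
        if done == goal:
            return g, order
        if g > best.get(done, float("inf")):
            continue  # stale heap entry
        for t in names:
            if t in done or not deps.get(t, set()) <= done:
                continue  # already scheduled, or a prerequisite is missing
            nxt = done | {t}
            g2 = g + open_items(nxt)
            if g2 < best.get(nxt, float("inf")):
                best[nxt] = g2
                heapq.heappush(
                    frontier, (g2 + lower_bound(nxt), next(tie), g2, nxt, order + [t])
                )
    return None


# Hypothetical workload: A and C share item "x", B and D share item "y".
reads = {"A": {"x"}, "B": {"y"}, "C": {"x"}, "D": {"y"}}
free_cost, free_order = schedule(reads, {})
tied_cost, tied_order = schedule(reads, {"C": {"B"}, "D": {"A"}})
```

In this toy workload the unconstrained schedule has cost 2, because each pair of readers can run back to back, while the precedence constraints force the pairs to interleave and raise the cost to 4. The search state is just the set of finished tasks, since under this cost model the set of items still held in the cache depends only on which tasks have run, not on their order.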