Please note: This PhD defence will take place in DC 3317.
Thierry Delisle, PhD candidate
David R. Cheriton School of Computer Science
Supervisor: Professor Peter Buhr
User-level threading (M:N) is gaining popularity over kernel-level threading (1:1) in many programming languages. The user-threading approach is often a better mechanism for expressing complex concurrent applications because it can efficiently run 10,000+ threads on multicore systems. Indeed, over-partitioning work into small units with user threading significantly eases load balancing, while simultaneously providing advanced synchronization and mutual-exclusion capabilities. To manage these high levels of concurrency, the underlying runtime must efficiently schedule many user threads across a few kernel threads, which raises the questions of how many kernel threads are needed and whether that number should be dynamically re-evaluated. Furthermore, scheduling must prevent kernel threads from blocking, otherwise user-thread parallelism drops. When user-thread parallelism does drop, how and when should idle kernel threads be put to sleep to avoid wasting CPU resources? Finally, the scheduling system must provide fairness to prevent a user thread from monopolizing a kernel thread; otherwise, other user threads can experience short- or long-term starvation, or kernel threads can deadlock waiting for events to occur on busy kernel threads.
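To make the M:N model concrete, the following is a minimal C∀ sketch, assuming the thread.hfa interface from the C∀ documentation (declaring a user-thread type, its main routine, and extra kernel threads via processor); the names Worker and procs are illustrative, not taken from the thesis.

    #include <thread.hfa>

    thread Worker {};                   // declare a user-thread type
    void main( Worker & this ) {        // thread body, started when a Worker is created
        // ... perform a small unit of work ...
    }

    int main() {
        processor procs[3];             // add 3 kernel threads (N), beside the initial one
        Worker workers[100];            // create 100 user threads (M), scheduled across them
    }   // scope exit joins the workers, then removes the extra processors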
This thesis analyses multiple scheduler systems, where each system attempts to fulfill the requirements for user-level threading. The predominant technique for managing high levels of concurrency is sharding the ready queue, with one queue per kernel thread, and using some form of work stealing/sharing to dynamically rebalance workload shifts. Kernel-thread blocking is prevented either by transforming kernel locks and I/O operations into user-level operations that do not block the kernel thread, or by spinning up new kernel threads to manage the blocking. Fairness is handled through preemption and/or ad-hoc solutions, which leads to coarse-grained fairness with some pathological cases.
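As an illustration of a sharded ready queue with work stealing (not the scheduler developed in the thesis; names are hypothetical, and real implementations typically use lock-free deques rather than a mutex per queue):

    /* One ready queue per kernel thread; a kernel thread pops locally first
     * and steals from a random victim when its own queue is empty. */
    #include <pthread.h>
    #include <stdlib.h>

    #define NPROCS 4                          /* number of kernel threads */

    typedef struct UThread UThread;           /* a ready user thread */
    struct UThread { UThread * next; };

    typedef struct {
        pthread_mutex_t lock;
        UThread * head;                       /* singly-linked ready list */
    } ReadyQueue;

    static ReadyQueue ready[NPROCS];

    static void ready_init( void ) {
        for ( int i = 0; i < NPROCS; i += 1 ) {
            pthread_mutex_init( &ready[i].lock, NULL );
            ready[i].head = NULL;
        }
    }

    static void push( ReadyQueue * q, UThread * t ) {
        pthread_mutex_lock( &q->lock );
        t->next = q->head;  q->head = t;
        pthread_mutex_unlock( &q->lock );
    }

    static UThread * pop( ReadyQueue * q ) {
        pthread_mutex_lock( &q->lock );
        UThread * t = q->head;
        if ( t ) q->head = t->next;
        pthread_mutex_unlock( &q->lock );
        return t;
    }

    /* Scheduling step for kernel thread `me`: local pop, then a few steal attempts. */
    static UThread * next_ready( int me ) {
        UThread * t = pop( &ready[me] );
        for ( int tries = 0; ! t && tries < NPROCS; tries += 1 ) {
            int victim = rand() % NPROCS;
            if ( victim != me ) t = pop( &ready[victim] );
        }
        return t;                             /* NULL => candidate for idle sleep */
    }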
After examining, selecting, and testing specific approaches to these scheduling issues, a complete implementation was created and tested in the C∀ (C-for-all) runtime system. C∀ is a modern extension of C that uses user-level threading as its fundamental threading model. One of C∀'s primary goals is to offer increased safety and productivity without sacrificing performance. The new scheduler achieves this goal by demonstrating performance equivalent to work-stealing schedulers while offering better fairness. The implementation uses several optimizations that successfully balance the cost of fairness against performance; some of these optimizations rely on interesting hardware optimizations present on modern CPUs. The new scheduler also includes support for implicit nonblocking I/O, allowing applications to have more user threads blocking on I/O operations than there are kernel threads. The I/O implementation is based on io_uring, a recent addition to the Linux kernel, and achieves the same performance and fairness as systems using select, epoll, etc. To complete the scheduler, an idle-sleep mechanism is implemented that significantly reduces wasted CPU cycles, which are then available outside the application.
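For context, the shape of the underlying io_uring interface (via liburing) looks roughly as follows; this is a minimal illustrative sketch, not code from the C∀ runtime.

    /* Submit a read without blocking the kernel thread in read(); build with -luring. */
    #include <liburing.h>
    #include <fcntl.h>
    #include <stdio.h>

    int main() {
        struct io_uring ring;
        if ( io_uring_queue_init( 32, &ring, 0 ) < 0 ) return 1;

        int fd = open( "/etc/hostname", O_RDONLY );
        char buf[256];

        /* Queue the read: in a user-level runtime, the requesting user thread
         * is parked here while the kernel thread keeps running other user threads. */
        struct io_uring_sqe * sqe = io_uring_get_sqe( &ring );
        io_uring_prep_read( sqe, fd, buf, sizeof(buf), 0 );
        io_uring_submit( &ring );

        /* Later, reap the completion and unpark the waiting user thread. */
        struct io_uring_cqe * cqe;
        io_uring_wait_cqe( &ring, &cqe );
        printf( "read %d bytes\n", cqe->res );
        io_uring_cqe_seen( &ring, cqe );

        io_uring_queue_exit( &ring );
        return 0;
    }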