Mohammed Alfatafta, Master’s candidate
David R. Cheriton School of Computer Science
We present a comprehensive study of failures in 12 popular systems caused by a peculiar type of network partitioning fault: the partial partition. A partial partition isolates a set of nodes from some, but not all, of the other nodes in the cluster. Our study reveals that the studied failures are catastrophic: they lead to data loss, complete system unavailability, or stale and dirty reads. Furthermore, once a partial partition occurs, most of the studied failures require little or no interaction between the user and the system to manifest, and most of them are deterministic.
We dissected the fault-tolerance techniques implemented in these systems and found that they either patch a specific mechanism or exacerbate the problem by turning a partial partition into a complete partition. The latter approach is generic, yet it needlessly lowers performance and reduces system availability.
Finally, we present NIFTY, a generic layer that leverages the capabilities of modern software-defined networking to monitor and recover the connectivity of the cluster in case of partial network partitions. We built NiftyDB, a database system atop NIFTY. NiftyDB implements a set of optimizations. Compared to current fault tolerance techniques, our evaluations show that NiftyDB tolerates a wide range of partial network partitions without imposing additional overheads.
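
To make the distinction between partial and complete partitions concrete, the following is a minimal Python sketch. It is illustrative only and is not NIFTY's implementation: the function names (`classify`, `relay_path`), the representation of working links as a set of unordered node pairs, and the idea of restoring connectivity by relaying through an intermediate node are all assumptions made for this example.

```python
from collections import deque


def link(a, b):
    """Hypothetical helper: an unordered, working link between two nodes."""
    return frozenset((a, b))


def relay_path(src, dst, nodes, links):
    """Breadth-first search for a shortest path from src to dst over working links.

    Returns the path as a list of nodes, or None if dst is unreachable.
    """
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in nodes:
            if v not in parent and link(u, v) in links:
                parent[v] = u
                queue.append(v)
    return None


def classify(nodes, links):
    """Classify cluster connectivity (simplified, assumed definitions).

    "healthy":  every pair of nodes has a direct working link.
    "partial":  some pair lacks a direct link, but every node can still
                reach every other node through intermediaries.
    "complete": at least one node cannot reach some other node at all.
    """
    nodes = list(nodes)
    pairs = [(a, b) for i, a in enumerate(nodes) for b in nodes[i + 1:]]
    if all(link(a, b) in links for a, b in pairs):
        return "healthy"
    if all(relay_path(nodes[0], n, nodes, links) for n in nodes[1:]):
        return "partial"
    return "complete"


# Example: A and B cannot talk directly, but both can still reach C.
nodes = ["A", "B", "C"]
links = {link("A", "C"), link("B", "C")}
state = classify(nodes, links)
route = relay_path("A", "B", nodes, links)
```

In this example `classify` reports a partial partition, and `relay_path` finds that traffic from A to B could in principle be relayed through C. Cutting the B-C link as well would leave B unreachable, which this sketch labels a complete partition.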