Tolerating Concurrency Bugs Using Transactions as Lifeguards

Jie Yu and Satish Narayanasamy

International Symposium on Microarchitecture (MICRO), Dec 2010.

Abstract:

Parallel programming is hard because it is impractical to test all possible thread interleavings. One promising approach to improving a multi-threaded program's reliability is to constrain a production run's thread interleavings so that untested interleavings are avoided as much as possible. Such an approach would avoid hard-to-test rare thread interleavings in production runs, and thereby improve correctness. However, a key challenge in realizing this goal is determining thread interleaving constraints from the tested correct interleavings and enforcing them efficiently in production runs.

In this paper, we propose a new method to determine thread interleaving constraints from the tested interleavings in the form of lifeguard transactions (LifeTxes). An untested code region is initially contained in a single LifeTx. As the code region is tested under more thread interleavings, its original LifeTx is automatically split into multiple smaller LifeTxes so that those tested interleavings are permitted in a production run. The LifeTx interleaving constraints can be enforced efficiently using hardware transactional memory support.

We show that 12 out of 15 real concurrency bugs in programs like Apache, MySQL and Mozilla could be avoided using the proposed approach. LifeTx can also help improve the testing process. Instead of blindly stress testing as many rare interleavings as possible, testers and tools can prioritize their efforts on exposing more frequent interleavings for code regions contained in the largest LifeTx.
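
The splitting and prioritization ideas described in the abstract can be illustrated with a toy model. The sketch below is not the paper's algorithm: it assumes a code region is a straight-line sequence of n instructions, and that each tested interleaving is reported as a single cut point k, meaning another thread was observed to interleave between instructions k-1 and k. The names Region, observe_tested_interleaving, lifetxes, and largest are hypothetical and chosen only for illustration.

```python
from bisect import insort


class Region:
    """Toy model (an assumption, not the paper's design): a code region
    of n instructions, indexed 0..n-1, covered by one or more LifeTxes."""

    def __init__(self, n):
        self.n = n
        # Sorted cut points; a cut at k separates instruction k-1 from k.
        # No cuts means the untested region is one single LifeTx.
        self.cuts = []

    def observe_tested_interleaving(self, k):
        """Record that a tested run interleaved between instructions k-1 and k,
        so the LifeTx covering that point is split there."""
        if not 0 < k < self.n:
            raise ValueError("cut point must fall strictly inside the region")
        if k not in self.cuts:
            insort(self.cuts, k)

    def lifetxes(self):
        """Return the current LifeTxes as half-open (start, end) ranges."""
        bounds = [0] + self.cuts + [self.n]
        return list(zip(bounds[:-1], bounds[1:]))

    def largest(self):
        """Return the largest LifeTx, a candidate for further testing."""
        return max(self.lifetxes(), key=lambda r: r[1] - r[0])
```

In this model, a fresh Region(10) reports one LifeTx covering the whole region. After interleavings are observed at cut points 4 and 7, the region is covered by three smaller LifeTxes, and largest() points at the range that testing has exercised least.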