Performance Analysis of Delays Caused by Synchronization in Parallel Computers

Analysis of Delays Caused by Local Synchronization

Julia Lipman Quentin F. Stout
Computer Science and Engineering, University of Michigan

Abstract: Synchronization is often necessary in parallel computing, but it can create delays whenever the receiving processor is idle, waiting for information to arrive. This is especially true for barrier, or global, synchronization, in which every processor must synchronize with every other processor. Nonetheless, barriers are the only form of synchronization explicitly supplied in MPI and OpenMP.

Many applications do not actually require global synchronization; local synchronization, in which a processor synchronizes only with those processors from which it has an incoming edge in some directed graph, is often adequate. However, the behavior of a system under local synchronization is more difficult to analyze since processors do not all start tasks at the same time.

We show that if the synchronization graph is a directed cycle and the task times are geometrically distributed with success probability p = 0.5, then the time it takes for a processor to complete a task, including synchronization time, approaches an exact limit of 2+√2 as the number of processors in the cycle approaches infinity. Under global synchronization, however, the time is unbounded, increasing logarithmically with the number of processors. Similar results also apply for p ≠ 0.5. We also present some of the combinatorial properties of the synchronization problem with geometrically distributed tasks on an undirected cycle.

Nondeterministic synchronization is also examined, where processors decide randomly at the beginning of each task which neighbors(s) to synchronize with. We show that the expected number of task dependencies for random synchronization on an undirected cycle is the same as for deterministic synchronization on a directed cycle.

Simulations are included to extend the analytic results. They show that more heavy-tailed distributions can actually create fewer delays than less heavy-tailed ones if the number of processors is small for some random-neighbor synchronization models. The results also show the rate of convergence to the steady state for various task distributions and synchronization graphs.

Keywords: synchronization delay, overhead, parallel computing, barrier, performance analysis, geometric distribution, heavy-tailed distribution, stochastic task times, idle time, MPI_BARRIER, MPI_SEND, blocking, OpenMP, operating system noise, jitter

Complete paper. This paper appears in Siam J. Computing 39 (2010), pp. 3860-3884. A preliminary version appeared in Proceedings of Symposium on Parallelism in Algorithms and Architectures (SPAA) 2006.

Related work: This research was motivated by earlier measurements of the performance of MPI collective communication on parallel computers.