Julia Lipman
Quentin F. Stout

Computer Science and Engineering, University of Michigan

**Abstract:**
Synchronization is often necessary in parallel
computing, but it can create delays whenever the receiving processor is
idle, waiting for information to arrive. This is especially true for
barrier, or global, synchronization, in which every processor must
synchronize with every other processor. Nonetheless, barriers are the
only form of synchronization explicitly supplied in MPI and OpenMP.

Many applications do not actually require global synchronization; local synchronization, in which a processor synchronizes only with those processors from which it has an incoming edge in some directed graph, is often adequate. However, the behavior of a system under local synchronization is more difficult to analyze since processors do not all start tasks at the same time.

We show that if the synchronization graph is a directed cycle and the task times are geometrically distributed with success probability p = 0.5, then the time it takes for a processor to complete a task, including synchronization time, approaches an exact limit of 2+√2 as the number of processors in the cycle approaches infinity. Under global synchronization, however, the time is unbounded, increasing logarithmically with the number of processors. Similar results also apply for p ≠ 0.5. We also present some of the combinatorial properties of the synchronization problem with geometrically distributed tasks on an undirected cycle.

Nondeterministic synchronization is also examined, where processors decide randomly at the beginning of each task which neighbors(s) to synchronize with. We show that the expected number of task dependencies for random synchronization on an undirected cycle is the same as for deterministic synchronization on a directed cycle.

Simulations are included to extend the analytic results. They show that more heavy-tailed distributions can actually create fewer delays than less heavy-tailed ones if the number of processors is small for some random-neighbor synchronization models. The results also show the rate of convergence to the steady state for various task distributions and synchronization graphs.

**Keywords:**
synchronization delay, overhead, parallel computing, barrier,
performance analysis, geometric distribution, heavy-tailed distribution,
stochastic task times, idle time, MPI_BARRIER, MPI_SEND, blocking,
OpenMP, operating system noise, jitter

**Complete paper**.
This paper appears in *Siam J. Computing* 39 (2010),
pp. 3860-3884.
A preliminary version appeared in
*Proceedings of Symposium on Parallelism in
Algorithms and Architectures (SPAA) * 2006.

Copyright © 2006-2016 Quentin F. Stout. |