Fault Tolerance of Allocation in Parallel Computers

In Proc. 2nd Symp. Frontiers of Massively Parallel Computation (1988), pp. 491-494.

Fault Tolerance of Allocation Schemes in Massively Parallel Computers

Marilynn Livingston
Dept. of Computer Science, Southern Illinois University at Edwardsville

Quentin F. Stout
EECS Department, University of Michigan

Abstract: This paper examines the problem of locating and allocating large fault-free subsystems in multiuser massively parallel computer systems. Since the allocation schemes used in such large systems cannot allocate all possible subsystems, a reduction in fault tolerance is experienced. We analyze the effect of different allocation methods, including the buddy and Gray-coded buddy schemes, for the allocation of subsystems in the hypercube and in the 2-dimensional mesh and torus. Both worst-case and expected-case performance are studied.

Generalizing the buddy and Gray-coded systems, we introduce a new family of allocation schemes which exhibits a significant improvement in fault tolerance over the existing schemes and which uses relatively few additional resources. For purposes of comparison, we study the behavior of the various schemes on the allocation of subsystems of 2¹⁸ processors in the hypercube, mesh, and torus consisting of 2²⁰ processors. Our methods involve a combination of analytical techniques and simulation.
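The loss of fault tolerance described in the abstract can be illustrated on a small hypercube. The sketch below is not taken from the paper: it assumes the common convention that a buddy scheme allocates only those d-dimensional subcubes whose nodes share their high-order n-d address bits, and the function names (all_subcubes, buddy_subcubes, has_fault_free) are hypothetical. It compares the subcubes a buddy allocator can see with the full set of subcubes, and shows a fault pattern that defeats the buddy scheme even though a fault-free subcube still exists.

```python
from itertools import combinations

def all_subcubes(n, d):
    """Yield every d-dimensional subcube of the n-cube as a frozenset of node labels."""
    for free in combinations(range(n), d):
        fixed = [b for b in range(n) if b not in free]
        # Enumerate every assignment of values to the fixed bit positions.
        for vals in range(1 << len(fixed)):
            base = 0
            for i, b in enumerate(fixed):
                if (vals >> i) & 1:
                    base |= 1 << b
            nodes = set()
            # Enumerate every assignment to the free bit positions.
            for f in range(1 << d):
                node = base
                for i, b in enumerate(free):
                    if (f >> i) & 1:
                        node |= 1 << b
                nodes.add(node)
            yield frozenset(nodes)

def buddy_subcubes(n, d):
    """Yield the subcubes visible to a buddy allocator (assumed convention:
    the high-order n-d bits are fixed, the low-order d bits are free)."""
    for prefix in range(1 << (n - d)):
        start = prefix << d
        yield frozenset(range(start, start + (1 << d)))

def has_fault_free(subcubes, faults):
    """Return True if at least one subcube contains no faulty node."""
    return any(sc.isdisjoint(faults) for sc in subcubes)

n, d = 4, 2
faults = {0b0000, 0b0100, 0b1000, 0b1100}  # one fault in each buddy block

print(len(list(buddy_subcubes(n, d))))               # 4
print(len(list(all_subcubes(n, d))))                 # 24
print(has_fault_free(buddy_subcubes(n, d), faults))  # False
print(has_fault_free(all_subcubes(n, d), faults))    # True
```

In this toy case four faults leave the buddy allocator with no usable 2-dimensional subcube, while the subcube {1, 5, 9, 13} (low-order bits fixed at 01) remains fault-free. Exhaustive enumeration like this is only practical for small n; it is meant to make the idea concrete, not to reproduce the paper's analysis or simulations.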

Keywords: fault tolerance, processor allocation, hypercube computer, mesh, torus, buddy system, parallel computing, supercomputing, graph theory, computer science, resource allocation, scheduling

Full paper (Postscript)
Full paper (PDF)
