Parallel Computing: Overview, Definitions, Examples and Explanations

What is Parallel Computing?
A Not Too Serious Explanation.

Before I explain parallel computing, it's important to understand that

You can run, but you can't hide. I'll come back to this later.

Suppose you have a lot of work to be done, and want to get it done much faster, so you hire 100 workers. If the work is 100 separate jobs that don't depend on each other, and they all take the same amount of time and can be easily parceled out to the workers, then you'll get it done about 100 times faster. This is so easy that it is called embarrassingly parallel. Just because it is embarrassing doesn't mean you shouldn't do it, and in fact it is probably exactly what you should do. A different option would be to parallelize each job, as discussed below, and then run these parallelizations one after another. However, as will be shown, this is probably a less efficient way of doing the work. Occasionally this isn't true because on computers, doing all of the job on one processor may require storing many results on disk, while the parallel job may spread the intermediate results in the RAM of the different processors. RAM is much faster than disks. However, if the program isn't spending a lot of time using the disk then embarrassingly parallel is the smart way to go. Assume this is what you should do unless you analyze the situation and determine that it isn't. Embarrassingly parallel is simple, and if you can get the workers do it for free then it is the cheapest solution as well.

What if the jobs take widely different amounts of time, but still have no interaction? If you have enough of them, just start handing them out until every worker has one, and then when workers are done they come back and request another. This will work fairly well, especially if you can put longer jobs first and shorter ones later. This form of parallelization is still pretty simple (unless it is teenagers that are doing the jobs - see "Do Teenagers Deserve Negative Wages?"). Many of the departmental computers that have more than one processor are run in this fashion, servicing jobs as they arrive. This is also how many airlines run their check-in queue: in queuing theory this is known as a "single queue multiple server" system, which is more efficient than the "multiple queue multiple server" systems common in checkout lines at grocery stores.

Suppose instead that this work you have is only a single job, but it takes a very long time. Now you have to do something to reorganize the job, somehow breaking it into pieces that can be done concurrently. For example, if the job is to build a house, it can be broken up into plumbing, electrical, etc. However, while many jobs can be done at the same time, some have specific orderings, such as putting in the foundation before the walls can go up. If all of the workers are there all of the time, then there will be periods when most of them are just waiting around for some task (such as the foundation) to be finished. Not very cost-effective, and you are not getting the job done 100 times faster. Such is the life of a parallel programmer.

Definition: Parallel computing is the use of two or more processors (cores, computers) in combination to solve a single problem.

The programmer has to figure out how to break the problem into pieces, and has to figure out how the pieces relate to each other. For example, a parallel program to play chess might look at all the possible first moves it could make. Each different first move could be explored by a different processor, to see how the game would continue from that point. At the end, these results have to be combined to figure out which is the best first move. Actually, the situation is even more complicated, because if the program is looking ahead several moves, then different starts can end up at the same board position. To be efficient, the program would have to keep track of this, so that if one processor had already evaluated that position, then others would not waste time duplicating the effort. This is how must parallel chess-playing systems work, including the famous IBM Deep Blue machine that beat Kasparov.

Is Parallel Computing Easier or Harder than Serial Computing?

(Standard computing is also known as "serial computing", so you've been doing serial computing for years!)

Easier:

Most of the universe is inherently parallel, with numerous things happening simultaneously. You may put your pants on one leg at a time, but millions of people are putting their pants on simultaneously. Many problems have an abundance of natural parallelism that a programmer can exploit. Sometimes people are forced to rethink the problem when they try to develop a parallel program, and they change their entire approach in order to directly utilize the inherent parallelism.

Harder:

Blatent poor performance: it is easier to see that you're not doing well. On a serial computer the processor may wait as it moves data from RAM to cache, but it basically appears busy all the time and it is only after you do a more detailed analysis that you determine that it is actually performing poorly (see "Using Cache Saves Cash"). On a parallel computer of 100 processors most people quickly notice that a program is not running 100 times faster. The serial program may not have been very good (see "Why Algorithms Experts Deserve Big Bucks") but it is not generally as obvious as the fact that many of the parallel processors are often idle.
Little experience: most programmers have little or no experience with parallel computing, and there are few parallel programs to use off-the shelf or even good examples to copy from. Hence people often have to reinvent the parallel wheel (see "Parallelism needs Classes for the Masses"). People developing parallel systems software are similarly behind on their learning curves. Often they reuse material developed for serial systems, even when it causes performance problems (see "Serial Sins in a Parallel World").
Poor portability: a program may work work on one machine, but when the program is ported to a new machine it may be so different that drastic changes need to be made just to permit the program to be run. Most serial computers have the same basic organization, but this is not so for parallel computers. Some parallel computers, such as the popular cluster systems, are essentially just a collection of computers linked together with Ethernet. They use simple commands, similar to read and write, to communicate among the processors. (The most common system for doing this communication is MPI.) Such parallel computers are known as message-passing systems, or distributed memory computers. Clusters are usually quite inexpensive because they are built from commodity parts, and they have become quite common in businesses and universities. Distributed memory computers with better performance have faster interconnections among the processors, and software better tuned to support parallelism, but this increases the cost because many of the parts are more non-standard.

Other parallel computers, such as some of the large systems from Cray or dual socket single boards from Intel and others, are shared-memory systems, where every processor can directly read and write data from every memory location. Think of this as being like everyone using a blackboard to do their calculations, where everyone has chalk and an eraser and can decide to write anywhere on the board. For these systems, there is a standard known as OpenMP, which is a collection of language extensions to Fortran and C/C++. Shared memory computers must use many non-commodity parts to support the sharing, and hence tend to be more expensive than distributed memory systems. However, it is often easier to write a program for shared memory systems than for distributed memory ones.

Yet other computers exploit graphics processor units, GPUs, to achieve parallelism. However, while a GPU may have 1000 cores, there are serious restrictions on their programability, and serious performance problems if there is much data movement. Further, the languages to use for GPUs are rapidly evolving, so it is unclear if you should use CUDA, OpenCL, or accelerator extensions in OpenMP and OpenACC, etc.
Amdahl's law: if the foundation takes 5% of the time, and only one worker can do it while everything else has to wait until the foundation is done, then you can never get the job done in less than 5% of the original time, no matter how many workers you have. Incidentally, this helps explain why it is much easier to be a successful parallel programmer on a small system. In such a situation, if everything else parallelizes perfectly, then 100 workers would take 5% (serial) + 95%/100 (perfect parallel) = 5.95% of the original time, compared to 100%/100= 1% of the original time if everything was perfectly parallel. 1/(100*0.0595) ≈ 0.17, so the efficiency is only about 17%, where efficiency is (time for 1 worker)/[(number of workers)*(the time they take)]. You can think of this as the wages if one worker does it, divided by the total wages when the group does it. 20 workers would work at about 51% efficiency, and 10 workers at about 69%. The fewer the workers, the higher the efficiency.
Weakest links: if a group of workers all depend on each other, then the group can go no faster than the slowest worker. If one worker is talking on the cellphone trying to arrange a date, everyone slows down, especially if they are trying to listen in. Other weak links can be the compiler, operating system, communication system, etc.

Why can't you hide?

In the past, when someone needed to solve bigger problems they could often wait a bit, and a faster computer would come out. This was mainly due to Moore's law, which people interpreted to mean that the speed of computers would double about every two years. However, this isn't true any longer: if you look at the GHz speed of the processors in a desktop computer 2 years ago, versus the speed now, it has barely, if any, increased. Basically this happened because they can't make reliable low-cost processors that run significantly faster.

What has doubled, however, is the number of processors on a single chip. Some smartphones now have 8 cores (processors), you can buy CPU chips with 12 cores, and this number will soon increase. Even the Apple watch now has 2 cores. These are called "multi-core" or "many-core" chips. There are also graphics processing units (GPUs) with over 1000 highly specialized processors. This is closer to what Moore was talking about, because what he really said was that the number of transistors would keep doubling. The rate of improvement is slowing down, but significant increases in the number of transistors should continue for at least a decade.

If there are only a few cores in your computer then it is pretty easy to find tasks that can be done simultaneously, such as waiting for keystrokes and running a browser, i.e., embarrassingly parallel. However, as the number of processors keeps increasing, the parallelization problems mentioned above become increasingly serious. There are now parallel computers that have > 1,000,000 cores, and people are planning for ones that will have > 100,000,000.

Overall, this means that there is a massive need to make use of the parallelism in multi-core chips for almost any problem, and to use many of these combined together for large problems. Job security for me! (and you, if you learn parallel computing)

Some Resources on Parallel Computing

If you want to learn more about parallel computing, there are some books available, though I don't like most of them. Many colleges and universities teach classes in this subject, and there are some tutorials available. For example, the author teaches a parallel computing class and a tutorial on parallel computing. These are aimed at larger problems, not ones that a single multi-core chip is sufficient for. However, my university (and many others) is developing courses addressing multi-core parallelism. Some of my colleagues have even developed a few teaching modules used in high school and elementary school classes, where the students become the parallel processors. If you attend my tutorial or class you too will become a processor in a parallel computer, and one of your tasks will be to flip a coin. You'll have to attend to see why. Don't forget to bring a coin.

There are also resources available via the web - here are some pointers to parallel computing resources such as manuals, software, parallel computers, etc.

Some addresses for me:

Quentin F. Stout
Computer Science and Engineering
University of Michigan
Ann Arbor, MI 48109-2121

734-763-1518 (office, messages) 734-763-8094 (fax)
qstout@umich·edu www.eecs.umich.edu/~qstout/

If you are close enough, "Hi" is an effective address.