By: Haytham ElFadeel, May 2009
Research and education in parallel computing technologies are more important than ever. Here I present a perspective on the past contributions, current status, and future direction of parallelism technologies.
While machine power will grow impressively, increased parallelism, rather than clock rate, will be the driving force in computing for the foreseeable future. This ongoing shift toward parallel architectural paradigms is one of the greatest challenges for the microprocessor and software industries. In 2005, Justin Rattner, chief technology officer of Intel Corporation, said ‘We are at the cusp of a transition to multi-core, multithreaded architectures, and we still have not demonstrated the ease of programming the move will require…’
A Little history:
Before we discuss the parallelism challenge with respect to today’s applications, it would be helpful to explore the history of parallelism. Even by the 1960s, when the world was still wet with the morning dew of the computer age, it was becoming clear that a single central processing unit executing a single instruction stream would result in unnecessarily limited system performance.
While computer designers experimented with different ideas to circumvent this limitation, it was the introduction of the Burroughs B5000 in 1961 that proffered the idea that ultimately proved to be the way forward: disjoint CPUs executing different instruction streams in parallel but sharing a common memory. In this regard (as in many) the B5000 was at least a decade ahead of its time. It was not until the 1980s that the need for multiprocessing became clear to a wider body of researchers, who over the course of the decade explored cache coherence protocols (e.g., the Xerox Dragon and DEC Firefly), prototyped parallel operating systems (e.g., multiprocessor Unix running on the AT&T 3B20A), and developed parallel databases (e.g., Gamma at the University of Wisconsin).
In the 1990s, the seeds planted by researchers in the 1980s bore the fruit of practical systems, with many computer companies (e.g., Sun, SGI, Sequent, Pyramid) placing big bets on symmetric multiprocessing. These bets on parallel hardware necessitated corresponding bets on concurrent software (if an operating system cannot execute in parallel, neither can much else in the system), and these companies independently came to the realization that their operating systems would need to be rewritten around the notion of concurrent execution. These rewrites took place early in the 1990s, and the resulting systems were polished over the decade; much of the resulting technology can today be seen in open source operating systems such as OpenSolaris, FreeBSD, and Linux. Just as several computer companies made big bets around multiprocessing, several database vendors made bets around highly parallel relational databases.
The rise of concurrent systems in the 1990s coincided with another trend: while CPU clock rate continued to increase, the speed of main memory was not keeping up. To cope with this relatively slower memory, microprocessor architects incorporated deeper (and more complicated) pipelines, caches, and prediction units. Even then, the clock rates themselves were quickly becoming something of a fib: while the CPU might be able to execute at the advertised rate, only a slim fraction of code could actually achieve (let alone surpass) the rate of one cycle per instruction; most code was mired spending three, four, five, or more cycles per instruction.
Many saw these two trends (the rise of concurrency and the futility of increasing clock rate) and came to the logical conclusion: instead of spending transistor budget on ‘faster’ CPUs that weren’t actually yielding much in terms of performance gains (and had terrible costs in terms of power, heat, and area), why not spend it on multiple simpler cores per die?
Parallelism Challenges:
We live in the era of multicore processors. The number of cores on processor chips is likely to double every two to three years. Therefore, by 2020, microprocessors are likely to have 64, 96, or 128 cores, heterogeneous and possibly specialized for different functionalities. Exploiting large-scale parallel hardware will be essential for improving an application’s performance or capabilities in terms of execution speed and power consumption. The challenge is how to enable the exploitation of the power of the target machine, including its parallelism, without undue programmer effort.
David Kuck, an Intel Fellow, emphasized the importance of research into parallel and related technologies for addressing the multicore challenge. He said that the challenge of optimal compilation lies in its combinatorial complexity. Languages expand as computer use reaches new application domains and new architectural features arise. Architectural complexity (uni- and multicore) grows to support performance and parallelism, and related technologies such as compilers, frameworks, and runtime services must bridge this widening gap.
Make parallel programming mainstream. Although research in parallel programming began more than 30 years ago, parallel programming is the norm in only a few application areas, such as HPC and computational sciences, databases, games, and, to a small extent, server-side applications. Bringing parallel programming to the mainstream is hard because it requires new tools, new abstractions, and programmers who can deal with such heterogeneous hardware.
Under the hood, Parallelism Challenges:
To use the extra cores available in today’s processors, programs must be parallelized. Multiple paths of execution have to work together to complete the tasks the program has to perform, and as much of that work as possible has to happen concurrently. Only then is it possible to speed up the program (i.e., reduce the total runtime). Amdahl’s law expresses this as:

\[ \text{Speedup} = \frac{1}{(1 - P) + \frac{P}{S}} \]

Where P is the fraction of the program that can be parallelized, and S is the number of execution units. This is the theory; making it a reality is another issue.
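As a minimal sketch of what Amdahl’s law implies in practice, the Python function below evaluates the usual form of the law, 1 / ((1 - P) + P / S), with P and S as defined above. The function name `amdahl_speedup` is an illustrative choice and not part of the original text.

```python
def amdahl_speedup(p: float, s: int) -> float:
    """Theoretical speedup for a parallelizable fraction p on s execution units."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if s < 1:
        raise ValueError("s must be at least 1")
    # The serial part (1 - p) is unaffected; only the parallel part is divided by s.
    return 1.0 / ((1.0 - p) + p / s)


if __name__ == "__main__":
    # With 90% of the work parallelized, the serial 10% bounds the speedup
    # below 10x no matter how many execution units are added.
    for units in (2, 8, 64, 1024):
        print(units, round(amdahl_speedup(0.9, units), 2))
```

The point of the loop is that adding execution units gives diminishing returns once the serial fraction dominates, which is why increasing P matters as much as increasing S.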
Synchronization problems:
Synchronization is fundamental in the parallel world. Unless the program consists of multiple independent pieces from the outset (in which case it should have been written as separate programs in the first place), the individual pieces have to collaborate. This usually takes the form of sharing data in memory or on secondary storage.
Write access to shared data therefore cannot happen in an uncontrolled fashion. Allowing a program to see an inconsistent, and hence unexpected, state must be avoided at all times. This is a problem if the state is represented by the content of multiple memory locations. Processors are not able to modify an arbitrary number (in most cases not even two) of independent memory locations atomically.
To deal with multiple memory locations, ‘traditional’ parallel programming has had to resort to synchronization. With the help of mutex (mutual exclusion) directives, a program can ensure that it is alone in executing an operation protected by the mutex object. If all read or write accesses to the protected state are performed while holding the mutex lock, it is guaranteed that the program will never see an inconsistent state.
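The idea of guarding multi-location state with a mutex can be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: the `Accounts` class, its two balances, and `run_transfer_demo` are hypothetical names invented for this example. The state consists of two values that must change together, and every read or write happens while the lock is held.

```python
import threading


class Accounts:
    """Two balances that must always be updated together (hypothetical example)."""

    def __init__(self, a: int, b: int) -> None:
        self._lock = threading.Lock()  # one mutex protects both memory locations
        self._a = a
        self._b = b

    def transfer_a_to_b(self, amount: int) -> None:
        # "with" releases the lock even if an exception is raised,
        # so the lock cannot be left held by accident.
        with self._lock:
            self._a -= amount
            self._b += amount

    def snapshot(self) -> tuple:
        # Readers take the same lock, so they never see a half-finished transfer.
        with self._lock:
            return self._a, self._b


def run_transfer_demo(n_threads: int = 4, transfers_per_thread: int = 1000) -> tuple:
    accounts = Accounts(10_000, 0)

    def worker() -> None:
        for _ in range(transfers_per_thread):
            accounts.transfer_a_to_b(1)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return accounts.snapshot()


if __name__ == "__main__":
    a, b = run_transfer_demo()
    print(a, b, a + b)
```

Because both balances are only touched under the same lock, the sum of the two stays constant from the point of view of any caller of `snapshot`.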
Synchronization is nonetheless hard to get right. It is easy to forget to release a mutex lock. Using a single program-wide mutex would, in most cases, dramatically hurt program performance by decreasing the portion of the program that can run in parallel (P in the formula). Using more mutexes increases P, but it also increases the overhead associated with locking and unlocking the mutexes. This is especially problematic if, as it should be, the critical regions are only lightly contended. Dealing with multiple mutexes also means the potential for deadlocks exists. Deadlocks happen if overlapping mutexes are locked by multiple threads in a different order. This is a mistake that happens all too easily.
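One common way to avoid the lock-ordering deadlock described above is to always acquire overlapping mutexes in a single global order. The sketch below assumes each object carries a unique, comparable identifier. `Account`, `account_id`, `transfer`, and `run_ordered_demo` are hypothetical names for this illustration, and other strategies (such as try-lock with back-off) exist.

```python
import threading


class Account:
    """An account with its own mutex (hypothetical example)."""

    def __init__(self, account_id: int, balance: int) -> None:
        self.account_id = account_id
        self.balance = balance
        self.lock = threading.Lock()


def transfer(src: Account, dst: Account, amount: int) -> None:
    if src is dst:
        return  # nothing to do, and locking the same mutex twice would block forever
    # Always lock the account with the smaller id first, regardless of the
    # direction of the transfer, so two threads can never wait on each other.
    first, second = sorted((src, dst), key=lambda acct: acct.account_id)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount


def run_ordered_demo(rounds: int = 1000) -> tuple:
    a = Account(1, 5000)
    b = Account(2, 5000)

    def a_to_b() -> None:
        for _ in range(rounds):
            transfer(a, b, 1)

    def b_to_a() -> None:
        for _ in range(rounds):
            transfer(b, a, 2)

    threads = [threading.Thread(target=a_to_b), threading.Thread(target=b_to_a)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return a.balance, b.balance


if __name__ == "__main__":
    print(run_ordered_demo())
```

If `transfer` instead locked `src` first and `dst` second, the two threads in the demo would take the locks in opposite orders and could block each other indefinitely.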
CAS problems:
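The programmer is usually caught between two problems: increasing the part of the program that can be executed in parallel, and containing the overhead and deadlock risk that finer-grained locking brings.
Even when they do manage to increase the part that can be executed in parallel, there are other problems, such as CAS problems.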
CAS stands for compare-and-swap, a term commonly used in the parallelism community. CAS operations are also known as ‘Interlocked’ operations. On x86, this set of operations includes instructions such as XCHG and CMPXCHG, as well as certain instructions prefixed with LOCK, such as INC and ADD.
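The way a compare-and-swap is typically used, as a read, compute, and retry loop, can be sketched as follows. Python does not expose a hardware CAS instruction, so `EmulatedAtomicInt` below is a lock-based stand-in written purely for illustration. It shows the semantics and the shape of the retry loop, not the performance of a real CMPXCHG. All names in the block are hypothetical.

```python
import threading


class EmulatedAtomicInt:
    """Lock-based stand-in for a hardware atomic integer (illustration only)."""

    def __init__(self, value: int = 0) -> None:
        self._value = value
        self._guard = threading.Lock()  # stands in for the atomicity the CPU provides

    def load(self) -> int:
        with self._guard:
            return self._value

    def compare_and_swap(self, expected: int, new: int) -> bool:
        """Store new only if the current value equals expected; report success."""
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False


def increment(counter: EmulatedAtomicInt) -> int:
    """CAS retry loop: read, compute, try to publish, retry on conflict."""
    retries = 0
    while True:
        current = counter.load()
        if counter.compare_and_swap(current, current + 1):
            return retries
        retries += 1  # another thread changed the value in between; try again


def run_cas_demo(n_threads: int = 4, increments_per_thread: int = 1000) -> int:
    counter = EmulatedAtomicInt()

    def worker() -> None:
        for _ in range(increments_per_thread):
            increment(counter)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.load()


if __name__ == "__main__":
    print(run_cas_demo())
```

The retry counter returned by `increment` hints at why contention matters: every failed attempt is work that has to be repeated.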
CAS gives us atomic operations built into the hardware, and it is often used in lock-free and wait-free algorithms. But CAS operations have critical performance and scalability issues, and we should think carefully before using them.
So it is clear that parallelism is hard, and that it requires experts to decide which parts to parallelize, and how!
The future of parallelism:
The main problem with parallelism is managing shared data, and this is driving many people and organizations to investigate shared-nothing architectures and functional programming.
In the foreseeable future, programmers will get more mature tools for coding such applications at a high level of abstraction, without explicit management of parallelism, locality, communication, load balancing, and other dimensions of parallel computing. Furthermore, as parallelism becomes ubiquitous, performance portability of programs across parallel platforms and processor generations will be essential for developing productive software.
For instance, Microsoft .NET Framework 4.0 contains good support for parallelism via the Parallel Extensions and an improved parallel debugger. Java 1.7 will also contain similar tools.
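To give a flavor of what programming at this higher level of abstraction looks like, here is a minimal sketch using Python’s standard `concurrent.futures` module rather than the .NET or Java tools named above. The helper names `sum_of_squares` and `parallel_sum_of_squares` are hypothetical. Each task is a pure function over its own chunk of data, so no locks or shared mutable state are needed, and the pool decides how the work is scheduled.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_of_squares(chunk: list) -> int:
    """Pure function: reads only its own chunk and shares no mutable state."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(values: list, n_chunks: int = 4) -> int:
    # Split the input into independent chunks (a strided split, for simplicity).
    chunks = [values[i::n_chunks] for i in range(n_chunks)]
    # The executor handles thread creation, scheduling, and joining.
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partial_sums = pool.map(sum_of_squares, chunks)
        return sum(partial_sums)


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(1000))))
```

In CPython, threads mainly help with I/O-bound work. `ProcessPoolExecutor` from the same module offers the same `map` interface and is the usual choice when the goal is to use several cores for CPU-bound work.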
In the future we will also see mature GPGPU frameworks.