
Parallel Programming

This is the first in a series of articles introducing the parallel programming techniques available in the C# programming language and version 4.0 of the .NET framework. This first part describes some of the key concepts of parallel programming.

Decomposition

Parallel programming relies on the principle of decomposition: the process of breaking a program, algorithm or data set into sections that can be processed independently. Some algorithms can be decomposed easily but others are naturally sequential and do not support parallelism. In those cases you may need to replace the algorithm entirely with one that decomposes better, because the sequential parts can eliminate the benefits of parallelism.

For example, if you have a routine that takes ten minutes to run and its algorithm decomposes easily, allocating 25% of the work to each of four processors could reduce the duration towards two and a half minutes. If, instead, 90% of the algorithm must run sequentially, the ten minute task may still be divided between four cores, but one core would spend nine minutes on the sequential portion whilst the remaining cores were idle.
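
The figures in this example follow a well-known formula, Amdahl's law, which says that only the parallel portion of the work is divided between the cores. A minimal sketch of the calculation, using the hypothetical ten minute, 90% sequential figures from above:

```csharp
using System;

class SpeedupEstimate
{
    static void Main()
    {
        // Hypothetical figures from the example above: a ten minute task
        // where 90% must run sequentially, spread across four cores.
        double totalMinutes = 10.0;
        double sequentialFraction = 0.9;
        int cores = 4;

        // Amdahl's law: only the parallel portion is divided between cores.
        double parallelFraction = 1.0 - sequentialFraction;
        double duration = totalMinutes * (sequentialFraction + parallelFraction / cores);

        Console.WriteLine("Estimated duration: {0} minutes", duration); // 9.25
    }
}
```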

There are two types of decomposition, both of which may be used in a single algorithm. These are data decomposition and task decomposition.

Data Decomposition

Data decomposition is usually applied to large data processing tasks. It is the process of splitting a large set of data into several smaller units. Each of those smaller units can then be processed by a separate CPU or core in parallel. At the end of the process, the results for the smaller units may be recombined into one larger set of results.

A common task that can benefit from data decomposition is image processing. A digital photograph may contain ten million pixels. If we apply a filter to such a photograph, we may need to perform the same calculation once for every pixel. Using a single thread, we must wait for the ten million calculations to execute sequentially. If we can spread the load over four processor cores, the processing time may be reduced by up to three quarters.
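
To make this concrete, the following is a minimal sketch of how such a per-pixel filter might be spread across the available cores using the Parallel.For method from .NET 4.0's Task Parallel Library. The pixel array and the Brighten operation are hypothetical stand-ins for real image data and a real filter:

```csharp
using System;
using System.Threading.Tasks;

class ImageFilter
{
    static void Main()
    {
        // Hypothetical image data: one brightness value per pixel.
        byte[] pixels = new byte[10000000];

        // Each pixel is independent, so the loop iterations can be
        // partitioned between the available processor cores.
        Parallel.For(0, pixels.Length, i =>
        {
            pixels[i] = Brighten(pixels[i]);
        });
    }

    // A hypothetical per-pixel filter operation.
    static byte Brighten(byte value)
    {
        return (byte)Math.Min(value + 10, 255);
    }
}
```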

Task Decomposition

Task decomposition is generally more complicated than data decomposition and harder to achieve. Instead of looking for large data sets to break up, we look at the algorithms being used and try to split them into smaller tasks that can run in parallel. In some cases algorithms are built from units of code that are tightly dependent upon each other, making it impossible to segregate the smaller tasks. These algorithms must be replaced entirely to gain a parallel processing advantage.

As an example, imagine that we have a method that retrieves several sets of data from various sources, including web services, databases and files. Each data set requires some pre-processing before they are all combined and processed together to generate a result. If the data gathering portions of the method are independent of each other, we could create tasks for each that run in parallel. Once all of the parallel tasks have completed, we could combine the data sets and generate the final result. It may even be that this final task could benefit from data decomposition.
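
A sketch of this shape using .NET 4.0 tasks might look as follows. The three Get methods and CombineAndProcess are hypothetical stand-ins for the data gathering and final processing steps described above:

```csharp
using System.Threading.Tasks;

class DataGatherer
{
    static void Main()
    {
        // Each independent data-gathering step becomes a task that can
        // run in parallel with the others.
        Task<string> fromService = Task.Factory.StartNew(() => GetFromWebService());
        Task<string> fromDatabase = Task.Factory.StartNew(() => GetFromDatabase());
        Task<string> fromFile = Task.Factory.StartNew(() => GetFromFile());

        // Synchronise: wait for every task to complete before combining.
        Task.WaitAll(fromService, fromDatabase, fromFile);

        CombineAndProcess(fromService.Result, fromDatabase.Result, fromFile.Result);
    }

    // Hypothetical data sources and final processing step.
    static string GetFromWebService() { return "service data"; }
    static string GetFromDatabase() { return "database data"; }
    static string GetFromFile() { return "file data"; }
    static void CombineAndProcess(string a, string b, string c) { /* ... */ }
}
```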

Common Parallel Programming Problems

The common problems faced when developing parallel code are the same as those seen when using multiple threads. Some of these problems are lessened by the parallelism classes of the .NET framework but they are not removed completely, so they are worth describing here. The terminology will be used again later in the tutorial.

Synchronisation

Synchronisation refers to a class of problem that you will encounter with parallel programming. When you start a number of tasks simultaneously, at some point in the future those tasks must "join up", perhaps to combine their results. One type of synchronisation controls this process, ensuring that the result of a task is not used until that task has completed.

Another type of synchronisation is used to prevent parallel tasks from interfering with each other. If you have code that is not thread-safe, it may be necessary to prevent two processes from executing that code simultaneously. Similarly, when working with mutable data, it can be necessary to ensure that two or more tasks cannot access that data at the same time. Ideally, synchronisation problems should be eliminated by choosing algorithms that avoid them. In reality, such algorithms do not exist for every problem, so locking mechanisms are used instead: code or data cannot be accessed until the thread holding the lock releases it. Locking can severely affect performance, especially when a large number of processors all contend for the same lock.
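
As an illustration of this second type of synchronisation, the sketch below uses C#'s lock statement to prevent two parallel tasks from modifying a shared list at the same time. The list and the work performed are hypothetical:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

class SharedListExample
{
    static readonly List<int> results = new List<int>();
    static readonly object listLock = new object();

    static void Main()
    {
        // Two parallel tasks both append to the shared list. List<T> is
        // not thread-safe, so access to it is serialised with a lock.
        Parallel.Invoke(
            () => AddResults(0, 1000),
            () => AddResults(1000, 2000));
    }

    static void AddResults(int from, int to)
    {
        for (int i = from; i < to; i++)
        {
            lock (listLock) // only one task may hold the lock at a time
            {
                results.Add(i);
            }
        }
    }
}
```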

Race Conditions

A race condition occurs when parallel tasks depend on shared data, generally when the synchronisation around that data is not implemented correctly. One process may perform operations using shared data that temporarily leave a value in an inconsistent state. If another process uses the inconsistent data, unpredictable behaviour can occur. Worse, errors may occur only rarely and be difficult to predict or recreate.

Let's consider a simple example. We may have a program that increments a counter variable twice. On a single processor machine this would happen sequentially. The process and the state generated are shown in the table below. The Counter column shows the value of the main counter variable after each operation and the tempCounter column shows the value of a temporary variable used in the process:

Operation                                   | tempCounter | Counter
Read the Counter value                      | 0           | 0
Add one to the temporary variable           | 1           | 0
Store the temporary variable in the Counter | 1           | 1
Read the Counter value                      | 1           | 1
Add one to the temporary variable           | 2           | 1
Store the temporary variable in the Counter | 2           | 2

The process gives the expected final result: we started with zero, added one twice and got the answer "two". However, if this program were decomposed so that each iteration of the three steps could be allocated to a different processor, we might see a race condition. The following table shows one possible interleaving. In this case there is a temporary counter for each processor and one increment operation happening on each processor. The processors are shown as #1 and #2.

CPU | Operation                                  | tempCounter #1 | tempCounter #2 | Counter
#1  | Read the Counter value                     | 0              | (not set)      | 0
#2  | Read the Counter value                     | 0              | 0              | 0
#1  | Add one to temporary variable #1           | 1              | 0              | 0
#2  | Add one to temporary variable #2           | 1              | 1              | 0
#1  | Store temporary variable #1 in the Counter | 1              | 1              | 1
#2  | Store temporary variable #2 in the Counter | 1              | 1              | 1

In this scenario we get an incorrect final result because the second task was permitted to read the Counter variable whilst it was in an inconsistent state. This is a subtle and potentially intermittent bug that could be very difficult to track down.
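
The sketch below reproduces this bug. The unsynchronised increment compiles to the same read, add and store sequence shown in the tables, so parallel increments can interleave and updates can be lost; Interlocked.Increment performs the three steps atomically and is one possible fix. The exact unsafe total varies between runs:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class RaceConditionExample
{
    static int unsafeCounter = 0;
    static int safeCounter = 0;

    static void Main()
    {
        // Many parallel increments make the rare interleaving likely.
        Parallel.For(0, 1000000, i =>
        {
            unsafeCounter++;                        // read, add, store: can interleave
            Interlocked.Increment(ref safeCounter); // atomic: always correct
        });

        Console.WriteLine("Unsafe counter: {0}", unsafeCounter); // often less than 1000000
        Console.WriteLine("Safe counter:   {0}", safeCounter);   // always 1000000
    }
}
```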

8 August 2011