Parallel Patterns Library (PPL) in Visual C++ 2010
Visual C++ 2010 comes with a brand new library called the Parallel Patterns Library, or PPL. It is a powerful library that makes writing parallel code easier, which is becoming more and more important with current and upcoming multicore CPUs. This article gives an overview of the PPL.
The PPL provides the following features:
- Parallel algorithms: generic algorithms that act on collections of data in parallel.
- Parallel containers and objects: generic container types that provide safe concurrent access to their elements.
- Task parallelism: a mechanism to execute several work items (tasks) in parallel.
Parallel Algorithms
The library comes with three parallel algorithms:
- parallel_for: can be used to parallelize a standard for loop. For example, suppose you have the following for loop:
for(int i=0; i<100; ++i) { /* ... */ }
This can be converted to use parallel_for as follows:
parallel_for(0, 100, 1, [&](int i) { /* ... */ });
Now the loop body will execute on several threads. The PPL automatically decides how many threads to use based on the number of cores in your machine.
- parallel_for_each: can be used to parallelize the STL for_each construct. The following code demonstrates how to run a function in parallel on each element of a vector:
vector<int> vec;
parallel_for_each(vec.begin(), vec.end(), [&](int& i) { /* ... */ });
- parallel_invoke: can be used to run several functions at once in parallel. For example, the quicksort algorithm divides the array into two parts and then quicksorts both parts. parallel_invoke can be used to launch the quicksort algorithm on both parts in parallel, as shown in the following code:
parallel_invoke( [&]{quicksort(a, left, i-1, comp);}, [&]{quicksort(a, i+1, right, comp);} );
Note that all the parallel algorithms use lambda expressions, which are also a new feature in Visual C++ 2010.
Parallel Containers and Objects
The Parallel Patterns Library comes with a parallel object called combinable. A combinable provides lock-free access to a shared resource. Normally it is used in a two-step approach:
- Perform fine-grained computations and store the results in the combinable container.
- After all fine-grained computation threads have finished, all the results in the combinable object are “combined” or “reduced” into the final results.
Suppose we want to calculate the sum of the elements of a vector of integers, and we want to perform this calculation in parallel using the parallel_for_each algorithm. The end result of this calculation is a single number, so the parallel_for_each would have to use a critical section to safely access the shared result. This is shown in the following code:
vector<int> values(10);
int i = 0;
generate(values.begin(), values.end(), [&]{ return ++i; });

// Calculate the sum of the elements
int result = 0;
critical_section cs;
parallel_for_each(values.begin(), values.end(), [&](int n) {
    cs.lock();
    result += n;
    cs.unlock();
});
To get safe access to the shared result variable, the parallel_for_each threads need to lock the critical section before adding a value to the result. Obviously, this implementation is not really scalable, since most threads will be waiting for the lock on the critical section.
Instead of using these locks, let's use a combinable object to get lock-free access to the shared resource. The following code demonstrates the two-step approach explained above.
vector<int> values(10);
int i = 0;
generate(values.begin(), values.end(), [&]{ return ++i; });

// 1. Calculate the sum of the elements of the vector in parallel.
combinable<int> sums;
parallel_for_each(values.begin(), values.end(), [&](int n) {
    sums.local() += n;
});

// 2. Combine/Reduce into the final result.
int result = sums.combine([](int left, int right) {
    return left + right;
});
As you can see, there are no locks in this code. The parallel_for_each algorithm splits the vector into a number of parts. The sum of the elements in each part is calculated in parallel on different threads, and the intermediate results are stored in the combinable object. Every thread has its very own sums.local() variable, so no locks are required. Once all threads have finished, the per-thread results in the combinable object are added together to form the final result.
Task Parallelism
The PPL contains the concepts of tasks and task groups. They allow you to create individual work items (tasks) and combine several tasks into a task group. The tasks in a group can then be executed in parallel on several threads. This post will not go into more detail about task parallelism.
For more information about tasks and task groups, see MSDN.
Demo Applications
Two demo applications are provided that show how powerful and versatile the parallel_for algorithm is.
Parallel Matrix Multiplication
This demo application contains two functions implementing a naive version of the matrix multiplication algorithm, one without parallel_for and one with parallel_for. The one without using parallel_for looks like this:
void MatrixMult(int iSize,
                const vector<vector<int> >& matrix1,
                const vector<vector<int> >& matrix2,
                vector<vector<int> >& matrixResult)
{
    for (int i = 0; i < iSize; ++i) {
        for (int j = 0; j < iSize; ++j) {
            matrixResult[i][j] = 0;
            for (int k = 0; k < iSize; ++k) {
                matrixResult[i][j] += matrix1[i][k] * matrix2[k][j];
            }
        }
    }
}
The implementation consists of three nested loops. To parallelize this algorithm, all we have to do is convert the outer loop to use parallel_for, as follows:
void MatrixMultPar(int iSize,
                   const vector<vector<int> >& matrix1,
                   const vector<vector<int> >& matrix2,
                   vector<vector<int> >& matrixResult)
{
    parallel_for(0, iSize, 1, [&](int i) {
        for (int j = 0; j < iSize; ++j) {
            matrixResult[i][j] = 0;
            for (int k = 0; k < iSize; ++k) {
                matrixResult[i][j] += matrix1[i][k] * matrix2[k][j];
            }
        }
    });
}
That is all that is required. Download the attached demo project and see the performance benefit of running it on a multicore machine. On my dual-core test laptop, multiplying matrices of size 1500 x 1500 takes 60 seconds with the non-parallel version and only 32 seconds with the parallel version. Not bad for a small and simple syntax change to use parallel_for 🙂
Parallel Image Blurring
The attached demo application demonstrates the use of the parallel_for algorithm to speed up a very naive implementation of image blurring. The most important code is in the following two functions:
- CParallelBlurDoc::DoImageOperationHelper(): performs the blur algorithm without using parallel_for.
- CParallelBlurDoc::DoImageOperationParallelHelper(): performs the blur algorithm with parallel_for.
When you run the application, you will see the following three buttons in the toolbar:
- R: Reset, show the original, non-blurred image.
- S: Run the non-parallel version of the blur algorithm.
- P: Run the parallel version of the blur algorithm.
Run the application, open any image, and then click the S or P button. After blurring is complete, the application displays how long it took.
Note that the demo application does not implement any border handling, so a border of a few pixels is left unblurred.
Jimpu Hukimahango said,
Wrote on December 11, 2009 @ 11:46 am
Nice. Will we also see a pipeline infrastructure like in Intel TBB?
Marc Gregoire said,
Wrote on December 11, 2009 @ 3:27 pm
I have never used the Intel TBB pipeline infrastructure, so I cannot really comment on that.
However, have a look at the Visual C++ 2010 Concurrency Runtime ( http://msdn.microsoft.com/en-us/library/dd504870(VS.100).aspx ).
Maybe something with Asynchronous Agents or the Task Scheduler?
G McIntyre said,
Wrote on March 5, 2012 @ 7:10 pm
The blur example seems to be missing some files
error C2653: ‘CMFCVisualManagerScenic’ : is not a class or namespace name
error C2065: ‘classCMFCVisualManagerScenic’ : undeclared identifier
Marc Gregoire said,
Wrote on March 5, 2012 @ 7:19 pm
McIntyre: Are you opening this in Visual C++ 2010? It will not work on older versions of Visual Studio.