Parallel Programming in Visual C++ 2010 CTP

7 Nov, 2008 C++, concurrency, parallel programming, Software Development, threading

The CTP build of Visual C++ 2010 includes a new library to help you write native parallel code. Writing parallel code is getting more and more important with the broad availability of quad-core CPU’s at this time and the many-core CPU’s that will appear in the coming years. I will only be talking about the new concurrency library for native code. Of course, writing parallel code was already possible for a long time. However, you had to create and manage all threads by yourself and this could often be a complex task. Because of this, it requires quite a bit of time to parallelize a simple loop over multiple threads. The new native concurrency library makes this much easier.

This article will explain the parallel_for construct which is part of the native concurrency library in more details and will briefly touch on a few other constructs. For the example I will parallelize a trivial implementation of a Mandelbrot fractal renderer.

A Simple Serial Mandelbrot Renderer

The code to render a Mandelbrot fractal looks like the following:

int iHeight = rcClient.Height();
int iHalfHeight = int(iHeight/2.0+0.5);
int iWidth = rcClient.Width();
int iHalfWidth = int(iWidth/2.0+0.5);
int maxiter = 1024;
// Position and size of our view on the imaginary plane
double dView_r = 0.001643721971153;
double dView_i = 0.822467633298876;
CDC memDC;
memDC.CreateCompatibleDC(&dc);
CBitmap bmp;
bmp.CreateCompatibleBitmap(&dc, iWidth, 1);
CBitmap* pOldBmp = memDC.SelectObject(&bmp);
for (int y=-iHalfHeight; y<iHalfHeight; ++y)
{
    // Formula: zi = z^2 + z0
    double dZ0_i = dView_i + y * m_dZoomLevel;
    for (int x=-iHalfWidth; x<iHalfWidth; ++x)
    {
        double dZ0_r = dView_r + x * m_dZoomLevel;
        double dZ_r = dZ0_r;
        double dZ_i = dZ0_i;
        double d = 0.0;
        int iter;
        for (iter=0; iter < maxiter; ++iter)
        {
            double dZ_rSquared = dZ_r * dZ_r;
            double dZ_iSquared = dZ_i * dZ_i;
            if (dZ_rSquared + dZ_iSquared > 4)
            {
                // We escaped
                d = iter+1-log(log(sqrt(dZ_rSquared + dZ_iSquared)))/log(2.0);
                break;
            }
            dZ_i = 2 * dZ_r * dZ_i + dZ0_i;
            dZ_r = dZ_rSquared - dZ_iSquared + dZ0_r;
        }

        memDC.SetPixel(x+iHalfWidth,0,RGB(d*50,d*50,d*50));
    }
    dc.BitBlt(0, y+iHalfHeight, iWidth, 1, &memDC, 0, 0, SRCCOPY);
}
memDC.SelectObject(pOldBmp);
bmp.DeleteObject();
memDC.DeleteDC();

This code first calculates the width and height of the window to which we will be rendering. It also sets up the position and size of our view on the imaginary plane. The renderer will render line by line in a memory device context, so we set up a memory device context and select a bitmap in it whose size is the width of the rendering window and whose height is just 1 pixel. Then we loop over each line. In each line we loop over each pixel and for each pixel we iterate a number of times to calculate the value of that pixel. When we escape from the Mandelbrot set we calculate the value “d” to get some kind of smooth grayscale coloring of our fractal. Once a row has been rendered it will be blitted to the screen using BitBlt so we can see the progress of the rendering.

When you would run this renderer it will render line by line from top to bottom. A screenshot of this can be seen below:

Single threaded Mandelbrot rendering

Parallelizing the Mandelbrot Renderer

Before we can use the new native concurrency library you need to include the ppl.h file. Also, these concurrency functions are inside the namespace Concurrency, so either use “using namespace Concurrency” or specify Concurrency in front of every use of something from the library.

#include <ppl.h>
using namespace Concurrency;

To parallelize the above Mandelbrot renderer using the new parallel_for construct from the Visual C++ 2010 CTP concurrency library we basically only need to change the outer for loop; the loop that is iterating over all the rows. A first version would look like the following:

int iHeight = rcClient.Height();
int iHalfHeight = int(iHeight/2.0+0.5);
int iWidth = rcClient.Width();
int iHalfWidth = int(iWidth/2.0+0.5);
int maxiter = 1024;
// Position and size of our view on the imaginary plane
double dView_r = 0.001643721971153;
double dView_i = 0.822467633298876;
parallel_for(-iHalfHeight, iHalfHeight,1,[&](int y){
    CDC memDC;
    CBitmap bmp;
    memDC.CreateCompatibleDC(&dc);
    bmp.CreateCompatibleBitmap(&dc, iWidth, 1);
    CBitmap* pOldBmp = memDC.SelectObject(&bmp);
    // Formula: zi = z^2 + z0
    double dZ0_i = dView_i + y * m_dZoomLevel;
    for (int x=-iHalfWidth; x<iHalfWidth; ++x)
    {
        double dZ0_r = dView_r + x * m_dZoomLevel;
        double dZ_r = dZ0_r;
        double dZ_i = dZ0_i;
        double d = 0.0;
        int iter;
        for (iter=0; iter < maxiter; ++iter)
        {
            double dZ_rSquared = dZ_r * dZ_r;
            double dZ_iSquared = dZ_i * dZ_i;
            if (dZ_rSquared + dZ_iSquared > 4)
            {
                // We escaped
                d = iter+1-log(log(sqrt(dZ_rSquared + dZ_iSquared)))/log(2.0);
                break;
            }
            dZ_i = 2 * dZ_r * dZ_i + dZ0_i;
            dZ_r = dZ_rSquared - dZ_iSquared + dZ0_r;
        }

        memDC.SetPixel(x+iHalfWidth,0,RGB(d*50,d*50,d*50));
    }
    dc.BitBlt(0, y+iHalfHeight, iWidth, 1, &memDC, 0, 0, SRCCOPY);
    memDC.SelectObject(pOldBmp);
    bmp.DeleteObject();
    memDC.DeleteDC();
});

The only things that have been changed in this code compared to the original code are the red parts, meaning the “for” has been replaced with the “parallel_for” and the creation of the memory DC and bitmap are moved inside the parallel_for since we need a separate memory DC for every thread that will be created. The parallel_for construct is using another new feature of C++ called Lambda expressions that allow you to create anonymous inline functions. Describing lambda expressions is outside the scope of this article.

Making the Renderer Thread Safe

When you would run the above code you would notice that some lines will be missing in the rendering result. This is because we are writing to the same device context from different threads without any synchronization. To fix this, we will use a critical section to secure access to the device context so that only 1 thread can draw on it at the same time. The changes look as follows:

int iHeight = rcClient.Height();
int iHalfHeight = int(iHeight/2.0+0.5);
int iWidth = rcClient.Width();
int iHalfWidth = int(iWidth/2.0+0.5);
int maxiter = 1024;
// Position and size of our view on the imaginary plane
double dView_r = 0.001643721971153;
double dView_i = 0.822467633298876;
// We need this critical section to have thread safe access to our device context
CRITICAL_SECTION cs;
InitializeCriticalSection(&cs);
parallel_for(-iHalfHeight, iHalfHeight,1,[&](int y){
    CDC memDC;
    CBitmap bmp;
    // We need to use a critical section here because we're accessing our dc.
    EnterCriticalSection(&cs);
    memDC.CreateCompatibleDC(&dc);
    bmp.CreateCompatibleBitmap(&dc, iWidth, 1);
    LeaveCriticalSection(&cs);
    CBitmap* pOldBmp = memDC.SelectObject(&bmp);
    // Formula: zi = z^2 + z0
    double dZ0_i = dView_i + y * m_dZoomLevel;
    for (int x=-iHalfWidth; x<iHalfWidth; ++x)
    {
        double dZ0_r = dView_r + x * m_dZoomLevel;
        double dZ_r = dZ0_r;
        double dZ_i = dZ0_i;
        double d = 0.0;
        int iter;
        for (iter=0; iter < maxiter; ++iter)
        {
            double dZ_rSquared = dZ_r * dZ_r;
            double dZ_iSquared = dZ_i * dZ_i;
            if (dZ_rSquared + dZ_iSquared > 4)
            {
                // We escaped
                d = iter+1-log(log(sqrt(dZ_rSquared + dZ_iSquared)))/log(2.0);
                break;
            }
            dZ_i = 2 * dZ_r * dZ_i + dZ0_i;
            dZ_r = dZ_rSquared - dZ_iSquared + dZ0_r;
        }

        memDC.SetPixel(x+iHalfWidth,0,RGB(d*50,d*50,d*50));
    }
    // We need to use a critical section here because we're accessing our dc.
    EnterCriticalSection(&cs);
    dc.BitBlt(0, y+iHalfHeight, iWidth, 1, &memDC, 0, 0, SRCCOPY);
    LeaveCriticalSection(&cs);
    memDC.SelectObject(pOldBmp);
    bmp.DeleteObject();
    memDC.DeleteDC();
});

Before starting the parallel_for loop we initialize a critical section. Inside the parallel_for loop we will wrap all usages of the device context “dc” inside EnterCriticalSection/LeaveCriticalSection constructs to make sure only 1 thread accesses that device context at the same time.

When executing this code, you will see that it is rendering in blocks as can be seen in the following screenshot. Each block is being handled by a different thread.

Multi threaded Mandelbrot rendering

MandelbrotPar_src.zip contains the above Mandelbrot example. In the toolbar of the application you will find a button with a P in it. When you toggle this button you will switch between serial and parallel rendering. The titlebar of the application shows the time it took to render the image in milliseconds. Please note that this is a very basic example application. The rendering is happening in the WM_PAINT handler directly, meaning it will redraw the entire fractal each time it needs to paint the window.

MandelbrotPar_demo.zip contains a precompiled binary of the test application.

Other Concurrency Constructs

The native concurrency library also has some other high level constructs, like parallel_for_each and parallel_invoke.

parallel_for_each

This construct can replace the std::for_each construct to iterate over a container in a parallel manner. An example could be a vector drawing application. The application might hold a vector of all objects that have to be rendered. When the application needs to render everything to the screen it simple iterates over the vector and calls the Draw() functions on each object. For example:

#include <algorithm>
#include <vector>

class CObj
{
public:
    virtual void Draw();
    // ...
}

class CObjRectangle : public CObj
{
public:
    virtual void Draw();
    // ...
}

int main()
{
    std::vector<CObj*> vec;
    // Add objects to this vector
    // ...
    // Draw all objects
    for_each(vec.begin(), vec.end(), [&](CObj* p) {
        if (p)
            p->Draw();
    });
    return 0;
}

The above defines a base class for renderable objects and a specific rectangle class as example. A vector is filled with renderable objects and then the for_each construct will call the Draw() method on each object. This is happening in a serial way. We can simply parallelize this as follows:

#include <ppl.h>
using namespace Concurrency;

// ...

int main()
{
    std::vector<CObj*> vec;
    // Add objects to this vector
    // ...
    // Draw all objects
    parallel_for_each(vec.begin(), vec.end(), [&](CObj* p) {
        if (p)
            p->Draw();
    });
    return 0;
}

By simply replacing for_each with parallel_for_each, the loop is now being executed in parallel in several threads. Note: this assumes that the Draw function will only modify a single object in each iteration.

parallel_invoke

parallel_invoke can be used to launch several functions, execute them in parallel and wait until they are all finished.
A simple example is shown below:

#include <windows.h>
#include <iostream>
#include <ppl.h>

using namespace std;
using namespace Concurrency;

int main()
{
    parallel_invoke(
        [=]{for (int i=0; i<100; ++i) {cout << i << endl; Sleep(100);}},
        [=]{for (int i=0; i<10; ++i) {cout << "   " << i << endl; Sleep(1000);}}
    );

    return 0;
}

This example will launch 2 functions in parallel if you have more than 1 core. The first function will print the numbers 0 to 100 and waits 100 milliseconds between 2 numbers. The second function will print the numbers 0 to 10 and will wait 1 second between 2 numbers. When you execute the above application on a system with more than 1 core you will see that the numbers will interleave each other. Note that I didn’t include any synchronization so some numbers might appear on the same line. The parallel_invoke function can accept from 2 up to 10 functions to run in parallel. parallel_for internally is using parallel_invoke to split its work.

Profiling Parallel Code

Visual Studio 2010 CTP contains a new profiling option to help you in profiling the concurrency of your application. It allows you to check CPU utilization, thread blocking and core execution. Explaining this profiler is outside the scope of this article. You can find some more information regarding it at http://msdn.microsoft.com/en-us/magazine/cc817396.aspx .

( This article is also posted on CodeGuru. )

1 Comment so far »

John Vaskus said,

Wrote on February 10, 2009 @ 7:04 pm

Excellent post!
I am currently working with Parallel Extensions and multithreading with C#, but I am moving to C++ because I have to refactor and optmize a C++ project.
I’ve read a book that helped me a lot to understand multicore programming: “C# 2008 and 2005 threaded programming”, Packt publishing, Hillar Gaston –
http://www.packtpub.com/beginners-guide-for-C-sharp-2008-and-2005-threaded-programming/book
recommended for C# developers.

Comment RSS · TrackBack URI

Marc Gregoire’s Blog