The Map Parallel pattern
A gentle introduction to parallel patterns
The following is a shorter version of the first post on my new blog.
Parallel patterns
According to Wikipedia, “a software design pattern is a general, reusable solution to a commonly occurring problem within a given context in software design”.
As a simple example, in the context of the structural design of an algorithm, repeated accesses of a vector/array item at a varying index can be expressed with a loop structure instead of writing out each access and updating the index manually. All commonly used programming languages have support for loops.
The context I am concerned with in this series of blog posts is parallel computing. Parallel computing is a very broad topic, so for now, let’s restrict ourselves to the kind of parallelism you get from multiple cores/CPUs sharing memory. This is the kind of parallelism you have in all processors you find today in your average computer. Where applicable, I will make references to additional kinds of parallelism, in particular vector parallelism (aka SIMD) and GPU-style parallelism (aka SIMT).
Illustrating patterns
Let’s begin with some simple code and its graphical representation which will be used to show-case parallel patterns later on.
F = f(A);
G = g(F);
B = h(G);
This piece of code consists of three statements. The first statement takes data structure A as input to function f and produces some output data F. Similarly, function g produces G and h produces B. Since F is used as input to function g, and G is used as input to function h, there is a data dependence between f and g, and between g and h, respectively.
This is a sequence pattern and can be illustrated with the figure to the left below. The green boxes with rounded corners represent data, a blue square represents a task (a piece of computation to be done on some data), and the arrows between data and/or tasks represent a dependence. Note here that the data structures F and G are not visible in the pattern, as they are not needed to express the data dependence. The code could have been written as B = h(g(f(A))) instead.
The Map Pattern
Above, the map pattern is shown. This pattern consists of applying (mapping) the same function (let’s call it foo(X)) to different parts of some input data. In essence, it’s the sequence pattern repeated horizontally. In order for this pattern to be meaningful, it’s important that each vertical strand operates on different data.
This kind of pattern is also called embarrassingly parallel, as each invocation of foo can happen in parallel with any other, given that there is no dependence between them.
Code
Even this simple pattern becomes very abstract without some examples, so let’s look at some code:
void map_for(vector<int> &arr, std::function<int(int)> const &foo) {
    for (size_t i = 0; i < arr.size(); ++i) {
        arr[i] = foo(arr[i]);
    }
}
This is a typical example of the map pattern implemented sequentially. The function map_for applies the same function foo to each element of the input array arr. Each invocation could happen in parallel with any or all others, as they are independent, but in this code it’s all sequential.
Ok, but how do we make it parallel? We will investigate OpenMP and TBB here. For more variants, visit the original blog post.
OpenMP
Our first variant here is using OpenMP which is a well known parallel programming API for C, Fortran and C++.
void map_openmp(vector<int> &arr, std::function<int(int)> const &foo) {
    #pragma omp parallel for shared(arr, foo)
    for (long i = 0; i < (long) arr.size(); ++i) {
        arr[i] = foo(arr[i]);
    } // implicit barrier
}
As you see, it’s exactly the same as the sequential version except for the line #pragma omp parallel for shared(arr, foo). OpenMP is a directive-based programming model (enhanced with a number of library functions). The parallel directive spawns a team of threads, and the code block after the directive will be executed by all threads. The for directive is a worksharing construct which divides the iterations of a for loop among the threads in the team. Most implementations (if not all) will divide them statically into contiguous chunks. Finally, the shared(arr, foo) clause tells the compiler that the vector and function references will be shared among the threads in the team. This is the default behaviour, but it’s always good to be explicit.
At the end of a worksharing construct, there is an implicit barrier which means that the threads will wait here for all other threads to have finished their execution of the for-loop.
Threading Building Blocks (TBB)
Threading Building Blocks from Intel is a popular parallel programming library for C++. It is really powerful and in general provides tremendous performance, and, as I will explore in a subsequent blog post, it allows for compositional parallelism, which I think is very important when you are building large systems.
Here is the TBB version:
void map_tbb(vector<int> &arr, std::function<int(int)> const &foo) {
    parallel_for((size_t) 0, arr.size(), [&](size_t i) {
        arr[i] = foo(arr[i]);
    });
}
The map idiom is such an important concept that it has its own function in TBB: parallel_for. It takes, in this case, three arguments: the starting index 0, the end index + 1 (arr.size()), and a function that is invoked once per index; inside it, we apply foo to the corresponding element.
As you can see, it becomes simpler and simpler, but wait, there’s more.
In my original post, you can also read about the map pattern in C++ STL Algorithms, using C++ naked threads and how they perform against each other.