Lab: Multi-core programming with Intel's Manycore Testing Lab
"Responding to Manycore" Project
Dick Brown, St. Olaf College, May 2010
Intel Corporation has set up a special remote system that allows faculty and students to work with computers with lots of cores, called the Manycore Testing Lab (MTL). In this lab, we will create a program that intentionally uses multi-core parallelism, upload and run it on the MTL, and explore the issues in parallelism and concurrency that arise.
Extra: Comments in this format indicate possible avenues of exploration for people seeking more challenge in this lab.
Carry out the steps in this section before the lab, if possible, on a laptop you can bring with you to the lab session.
We will use Link computers in RNS 202 or 203 in order to develop our multi-core program, but we need to connect to the Intel MTL system using a non-Link computer. This is because the Cisco VPN software for connecting to the MTL blocks all other network access, so we can't use it on the Link computers or it would interfere with all other uses of those computers, such as being part of the MistRider cluster.
The Cisco VPN client software is free, easy to install, easy to use, and does not interfere with your computer except while you are connected to the MTL. Here are download links (install before the lab if possible):
Download the installer from this link and run the executable to install the Cisco VPN client.
Note: For Windows, you also need ssh and scp capabilities. Putty and Cygwin are two ways to get these capabilities. For Putty, you'll need both putty.exe and pscp.exe. Ask for help if you need it.
Download from this link, which creates a pseudodisk on your desktop, then open that pseudodisk and follow the instructions. (You can delete the pseudodisk when you're done.)
You can use a UNIX terminal window to access ssh and scp.
For Linux and other versions of Windows and Macintosh, you can see the University of Ghent page that these downloads were obtained from. Choose among the "Cisco VPN clients without config file".
Once you have installed the Cisco VPN client (and have ssh and scp installed), configure the Cisco VPN client to access the Manycore Testing Lab, as follows.
Start up the Cisco VPN client. Note: This is installed as a software application. Don't look for it under system network connection options, like other VPN systems.
Create a new connection, with the following connection information:
Connection entry: Choose a name, perhaps "Intel MTL VPN"
Description: Optional
Host: 192.55.51.80
Authentication: Select Group Authentication (probably the default)
Name: VPN2
Password: (sent to you separately)
Save this connection to finish creating it.
Now try connecting on your new VPN connection entry.
If this succeeds, you will find that none of your usual network services work. For example, your browser won't be able to find any pages (thus, you'll have to google on a different machine while you're connected to the MTL).
If your new VPN connection fails, recheck the settings you entered, or seek help.
Finally, disconnect from your new VPN connection entry. This will give you your usual network capabilities back, etc.
Extra: If this lab is too slow for you, consider carrying it out in a different parallelism strategy, such as Intel's Threading Building Blocks (TBB).
We will use a standard system for parallel programming called OpenMP, which enables a C or C++ programmer to take advantage of multi-core parallelism primarily through preprocessor pragmas.
We will begin with a short C++ program, parallelize it using OpenMP, and improve the parallelized version. This development work can be carried out on a CS lab machine.
The following program computes a Calculus value: the trapezoidal approximation of the integral of sin(x) from 0 to pi, using 2^20 equal subdivisions. The exact answer for this computation should be 2.0.
    #include <iostream>
    #include <cmath>
    #include <cstdlib>
    using namespace std;

    /* Demo program for OpenMP: computes trapezoidal approximation to an integral */

    const double pi = 3.141592653589793238462643383079;

    int main(int argc, char** argv) {
      /* Variables */
      double a = 0.0, b = pi;  /* limits of integration */
      int n = 1048576;         /* number of subdivisions = 2^20 */
      double h = (b - a) / n;  /* width of subdivision */
      double integral;         /* accumulates answer */
      int threadct = 1;        /* number of threads to use */

      /* parse command-line arg for number of threads */
      if (argc > 1)
        threadct = atoi(argv[1]);

      double f(double x);

    #ifdef _OPENMP
      cout << "_OPENMP defined, threadct = " << threadct << endl;
    #else
      cout << "_OPENMP not defined" << endl;
    #endif

      integral = (f(a) + f(b))/2.0;
      int i;
      for (i = 1; i < n; i++) {
        integral += f(a+i*h);
      }
      integral = integral * h;

      cout << "With n = " << n << " trapezoids, our estimate of the integral"
           << " from " << a << " to " << b << " is " << integral << endl;
    }

    double f(double x) {
      return sin(x);
    }
Comments on this code:
If a command line argument is given, the code segment

    if (argc > 1)
      threadct = atoi(argv[1]);

converts that argument to an integer and assigns that value to the variable threadct, overriding the default value of 1. This uses the two arguments of the function main(), namely argc and argv. This demo program makes no attempt to check whether a first command line argument argv[1] is actually an integer, so make sure it is (or omit it).
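For readers who want that missing check, here is one possible sketch. The helper name parse_threadct is hypothetical (it is not part of the lab code); it uses strtol, which, unlike atoi, reports where parsing stopped, so bad input can fall back to the default.

```cpp
#include <cassert>
#include <cstdlib>
#include <iostream>

/* Hypothetical helper, not part of the lab program: parse a thread-count
   argument, falling back to a default when the argument is missing or is
   not a positive integer. */
int parse_threadct(int argc, char** argv, int default_ct) {
    if (argc < 2)
        return default_ct;            /* no argument given */
    char* end = 0;
    long val = std::strtol(argv[1], &end, 10);
    /* reject empty strings, trailing junk such as "4x", and counts < 1 */
    if (end == argv[1] || *end != '\0' || val < 1) {
        std::cerr << "ignoring bad thread count '" << argv[1] << "'\n";
        return default_ct;
    }
    return static_cast<int>(val);
}
```

A call such as parse_threadct(argc, argv, 1) could then replace the atoi line in the demo program.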
The variable threadct will be used later to control the number of threads to be used. Recall that a process is an execution of a program. A thread is an independent execution of (some of) a process's code that shares (some) memory and/or other resources with that process. We will modify this program to use multiple threads, which can be executed in parallel on a multi-core computer.
The preprocessor macro _OPENMP is defined for C++ compilations that include support for OpenMP. Thus, the code segment

    #ifdef _OPENMP
      cout << "_OPENMP defined, threadct = " << threadct << endl;
    #else
      cout << "_OPENMP not defined" << endl;
    #endif

provides a way to check whether OpenMP is in use.
The following lines contain the actual computation of the trapezoidal approximation:
    integral = (f(a) + f(b))/2.0;
    int i;
    for (i = 1; i < n; i++) {
      integral += f(a+i*h);
    }
    integral = integral * h;

Since n == 2^20, the for loop adds over 1 million values. Later in this lab, we will parallelize that loop, using multiple cores which will each perform part of this summation, and look for speedup in the program's performance.

Log into a CS lab machine, and create a file containing the program above, perhaps named trap-omp.C. To compile your file, you can enter the command
    % g++ -o trap-omp trap-omp.C -lm -fopenmp

(Here, % represents your shell prompt, which is usually a machine name followed by either % or $.)
First, try running the program without a command-line argument:

    % trap-omp

This should print a line "_OPENMP defined, threadct = 1", followed by a line indicating the computation with an answer of 2.0. Next, try

    % trap-omp 2

This should indicate a different thread count, but otherwise produce the same output. Finally, try recompiling your program omitting the -fopenmp flag. This should report _OPENMP not defined, but give the same answer 2.0.
Note that the program above actually uses only a single core, whether or not a command-line argument is given. It is an ordinary C++ program in every respect, and OpenMP does not magically change ordinary C++ programs; in particular, the variable threadct is just an ordinary local variable with no special computational meaning.
To request a parallel computation, add the following pragma preprocessor directive, just before the for loop.

    #pragma omp parallel for num_threads(threadct) \
      shared (a, n, h, integral) private(i)

The resulting code will have this format:

    int i;
    #pragma omp parallel for num_threads(threadct) \
      shared (a, n, h, integral) private(i)
    for (i = 1; i < n; i++) {
      integral += f(a+i*h);
    }

Here are some comments on this pragma:
Make sure no characters follow the backslash character before the end of the first line. This causes the two lines to be treated as a single pragma (useful to avoid long lines).
The strings omp parallel for indicate that this is an OpenMP pragma for parallelizing the for loop that follows immediately. The OpenMP system will divide the 2^20 iterations of that loop up into threadct segments, each of which can be executed in parallel on multiple cores.
The OpenMP clause num_threads(threadct) specifies the number of threads to use in the parallelization.

Extra: OpenMP also provides other ways to set the number of threads to use, namely the omp_set_num_threads() library function and the OMP_NUM_THREADS environment variable.
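For the curious, those two alternatives might be used along the following lines. This is a hedged sketch: the #ifdef guard keeps it compilable even without OpenMP support, in which case it simply reports one thread.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Sketch of the alternatives mentioned above.  When compiled with
   -fopenmp, omp_set_num_threads() overrides the default team size;
   without OpenMP, the serial branch reports a single thread. */
int report_max_threads() {
#ifdef _OPENMP
    omp_set_num_threads(4);        /* library-call form */
    return omp_get_max_threads();  /* threads the next parallel region may use */
#else
    return 1;                      /* serial build: one thread */
#endif
}
```

The environment-variable form needs no code change at all: running `OMP_NUM_THREADS=8 ./trap-omp` would set the default team size for that run (when the program does not override it with num_threads or omp_set_num_threads).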
The clauses in the second line indicate whether the variables that appear in the for loop should be shared with the other threads, or should be local private variables used only by a single thread. Here, four of those variables are globally shared by all the threads, and only the loop control variable i is local to each particular thread.
Extra: OpenMP provides several other clauses for managing variable locality, initialization, etc. Examples: default, firstprivate, lastprivate, copyprivate.
After inserting the parallel for pragma, observe that the program runs and produces the correct result 2.0 for threadct == 1 (e.g., no command-line argument), but that

    % trap-omp 2

which sets threadct == 2, produces an incorrect answer (perhaps about 1.5). What happens with repeated runs with that and other (positive) thread counts? Can you explain why?
Note: Try reasoning out why the computed answer is correct for one thread but incorrect for two or more threads. Hint: All of the values being added in this particular loop are positive values, and the erroneous answer is too low.
If you figure out the cause, think about how to fix the problem. You may use the OpenMP website or other resources to research your solution, if desired.
To see the cause and a remedy, read on.
A code segment is said to be thread-safe if it remains correct when executed by multiple independent threads. The body of this loop is not thread-safe.
Some libraries are identified as thread-safe, meaning that each function in that library is thread-safe. Of course, calling a thread-safe function doesn't ensure that the code containing that function call is thread-safe. For example, the function f() in our example is thread-safe, but the body of that loop is not thread-safe.
We can obtain the running time for a program using the time Linux program. For example,

    % /usr/bin/time -p trap-omp

might display the following output:

    _OPENMP defined, threadct = 1
    With n = 1048576 trapezoids, our estimate of the integral from 0 to 3.14159 is 2
    real 0.04
    user 0.04
    sys 0.00

Here, we use the full path /usr/bin/time to ensure that we are accessing the time program instead of a shell builtin command. The -p flag produces output in a format comparable to what we will see in the MTL.
The real time measures actual time elapsed during the running of your command trap-omp. user measures the amount of time executing user code, and sys measures the time executing in Linux kernel code.
Try the time command, and compare the results for different thread counts. You should find that real time decreases somewhat when changing from 1 thread to 2 threads; user time increases somewhat. Can you think of reasons that might produce these results?
Also, real time and user time increase considerably on lab machines when increasing from 2 to 3 or more threads. What might explain that?
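While exploring those questions, it can also help to time just the loop from inside the program, rather than the whole process as the shell's time command does. Here is a small sketch using std::chrono; note that std::chrono is C++11, so it needs a newer compiler mode (g++ -std=c++11) than the lab's other commands assume, and the function name timed_sum is made up for this example.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>

/* Hypothetical helper: run a sin-summing loop of n iterations and report
   the wall-clock seconds it took through *elapsed_sec.  steady_clock is
   used because it cannot jump backward during the measurement. */
double timed_sum(int n, double* elapsed_sec) {
    auto start = std::chrono::steady_clock::now();
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += std::sin(i * 1e-6);
    auto stop = std::chrono::steady_clock::now();
    *elapsed_sec = std::chrono::duration<double>(stop - start).count();
    return sum;
}
```

Timing only the loop excludes program startup and I/O, which is one reason in-program timings can differ from the shell's "real" time.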
Before accessing the MTL, copy your program trap-omp.C to your laptop. If this program is in a subdirectory mtl of your home directory, you can copy it using scp as follows:

    laptop% scp username@shelob.cs.stolaf.edu:mtl/trap-omp.C .

(Don't forget the final dot!)
Now, use Cisco VPN to connect your laptop to the MTL network.
You can now log in to the MTL computer, as follows:

    laptop% ssh sorab-s0N@36.81.203.1

Use one of the student account usernames sorab-s01, sorab-s02, sorab-s03, or sorab-s04, together with the password distributed in class. Create a subdirectory for yourself in the home directory for that student account sorab-s0N:

    36.81.203.1% mkdir username

Here, username can be your St. Olaf username.

Next, copy your program from your laptop to the MTL machine. One way to do this is to log out, then enter the following command:

    laptop% scp trap-omp.C sorab-s0N@36.81.203.1:username

After making this copy, log in to the MTL machine 36.81.203.1 again.
On the MTL machine, compile and test run your program.
    36.81.203.1% g++ -o trap-omp trap-omp.C -lm -fopenmp
    36.81.203.1% ./trap-omp
    36.81.203.1% ./trap-omp 2
    36.81.203.1% ./trap-omp 16

Note: Since the current directory . may not be in your default path, you probably need to use the path name ./trap-omp to invoke your program.
Now, try some time trials of your code on the MTL machine.
(The full pathname for time and the -p flag are unnecessary.) For example:

    36.81.203.1% time trap-omp
    36.81.203.1% time trap-omp 2
    36.81.203.1% time trap-omp 3
    36.81.203.1% time trap-omp 4
    36.81.203.1% time trap-omp 8
    36.81.203.1% time trap-omp 16
    36.81.203.1% time trap-omp 32
What patterns do you notice with the real and user times of various runs of trap-omp with various values of threadct?
OpenMP works well for adding parallelism to loops in working sequential code, and it's available for C, C++, and Fortran languages on many platforms (including Linux, Windows, and Macintosh OS X). Older versions of OpenMP did not readily support non-loop parallelism or programming with concurrent data structures, but OpenMP version 3.0 (released May 2008) provides a task feature for programming such computations.
Intel's Threading Building Blocks (TBB) provides an object-oriented approach to implementing parallel algorithms, for the C++ language (and any of the three platforms). Adding parallelism to existing code in TBB is somewhat more involved than in OpenMP, but is considerably less complicated than programming in a native threads package for a particular operating system. The forthcoming new standard for the C++ language is likely to include parallelism similar to TBB.
Enter the following TBB program into a file trap-tbb.cpp.
    #include <iostream>
    #include <cmath>
    using namespace std;

    #include "tbb/tbb.h"
    using namespace tbb;

    /* Demo program for TBB: computes trapezoidal approximation to an integral */

    const double pi = 3.141592653589793238462643383079;

    double f(double x);

    class SumHeights {
      double const my_a;
      double const my_h;
      double &my_int;

    public:
      void operator() (const blocked_range<size_t>& r) const {
        for (size_t i = r.begin(); i != r.end(); i++) {
          my_int += f(my_a+i*my_h);
        }
      }

      SumHeights(const double a, const double h, double &integral)
        : my_a(a), my_h(h), my_int(integral)
      {}
    };

    int main(int argc, char** argv) {
      /* Variables */
      double a = 0.0, b = pi;  /* limits of integration */
      int n = 1048576;         /* number of subdivisions = 2^20 */
      double h = (b - a) / n;  /* width of subdivision */
      double integral;         /* accumulates answer */

      integral = (f(a) + f(b))/2.0;

      parallel_for(blocked_range<size_t>(1, n), SumHeights(a, h, integral));

      integral = integral * h;

      cout << "With n = " << n << " trapezoids, our estimate of the integral"
           << " from " << a << " to " << b << " is " << integral << endl;
    }

    double f(double x) {
      return sin(x);
    }
Comments on this code:
This program does not use a command-line argument (and the cstdlib library is not needed). Unlike OpenMP, TBB does not provide a simple way to request a particular number of threads. Instead, the TBB system chooses a number of threads to use automatically. (OpenMP will also make such a selection for you if you do not specify the number of threads to use.)
The lines

    #include "tbb/tbb.h"
    using namespace tbb;

prepare for using TBB.
Recall that in the OpenMP code, we parallelized the loop

    for (i = 1; i < n; i++) {
      integral += f(a+i*h);
    }

by adding a pragma just before that for loop.
In order to program a comparable computation in TBB, we create a class SumHeights whose method operator() contains the following loop:

    for (size_t i = r.begin(); i != r.end(); i++) {
      my_int += f(my_a+i*my_h);
    }

then pass an instance of that class SumHeights to a call of parallel_for(). Observe that the forms of the two loops indicate the same iterative computation, if one matches 1 to r.begin(), n to r.end(), and the variables integral, a, and h to the SumHeights state variables my_int, my_a, and my_h. One way to describe this relationship is to say that the class SumHeights is a "wrapper" around its loop.
The class SumHeights defines operator(), which means that an object of type SumHeights can be called using function-call syntax. Since operator() is defined here with one argument, this means we can cause the for loop to execute using a call sh(range), where sh is an object of type SumHeights and range is an appropriate argument.
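This function-call syntax can be seen in a toy example. The class Scale below is made up for illustration (it is not part of the lab code): because it defines operator(), a Scale object can be called as if it were a function.

```cpp
#include <cassert>

/* Hypothetical functor class: defining operator() lets an object of this
   type be "called" with function-call syntax, just as SumHeights objects
   are called by parallel_for. */
class Scale {
    double factor;
public:
    Scale(double f) : factor(f) {}
    /* const method: may be called on const Scale objects */
    double operator() (double x) const { return factor * x; }
};
```

For example, after `Scale triple(3.0);` the expression `triple(2.0)` evaluates to 6.0; and because operator() is const, this works even when triple is declared const.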
Note that operator() is a const method (indicated by the const after the closing ) of the parameter list and before the opening {), which means that it is permitted to call sh(range) with a const object sh.
The argument r of operator() indicates the range of the loop, i.e., the starting and ending values for the loop control variable.
The loop control variable i has type int in the OpenMP implementation, but type size_t in the TBB implementation. size_t is an integer type, which may be equivalent to int, long, or another integer type depending on implementation.
The constructor SumHeights() makes local copies my_a, etc., of variables a, etc., in main(), enabling values in main() to be used within the class SumHeights.
The range r has the type blocked_range<size_t>. This is a templated type built over the size_t type. There could be blocked_range types built over other types as well, e.g., int or long.
The call to parallel_for in main() automatically subdivides (or chunks) the range r for multi-threaded parallel computation. parallel_for expects a range as its first argument, and, as its second argument, an object with a method operator() that takes one range argument.
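The chunking that parallel_for performs can be sketched in plain C++. The types Range and chunked_for below are hypothetical stand-ins for TBB's blocked_range and parallel_for, meant only to show the calling pattern; real TBB splits the range recursively and hands the chunks to worker threads, whereas this sketch applies the body to each chunk in order.

```cpp
#include <cassert>
#include <cstddef>

/* Stand-in for blocked_range<size_t>: a half-open range [lo, hi). */
struct Range {
    size_t lo, hi;
    Range(size_t l, size_t h) : lo(l), hi(h) {}
    size_t begin() const { return lo; }
    size_t end() const { return hi; }
};

/* Stand-in for parallel_for: carve the range into chunks and call the
   body once per chunk.  Each call sees only its own sub-range, which is
   exactly the shape of SumHeights::operator(). */
template <typename Body>
void chunked_for(const Range& r, const Body& body, size_t chunk = 1024) {
    for (size_t start = r.begin(); start < r.end(); start += chunk) {
        size_t stop = (start + chunk < r.end()) ? start + chunk : r.end();
        body(Range(start, stop));
    }
}
```

Summing a range through chunked_for gives the same total as one big loop, which is why the SumHeights wrapper expresses the same computation as the original for loop.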
The variable integral is passed by reference in the constructor SumHeights() in an effort to use that memory location integral as an accumulator during the parallelized computation.
The constructor initializes the state variables my_a, my_h, and my_int using colon initializers. In the constructor definition

    SumHeights(const double a, const double h, double &integral)
      : my_a(a), my_h(h), my_int(integral)
    {}

the expression my_a(a) located after the colon : and before the curly bracket { has the same effect as an assignment

    my_a = a;

would if it occurred between the curly brackets. Colon initialization is actually required for all three state variables here: for my_a and my_h because they are declared const, and for my_int because that state variable was defined using a reference type, and references must be bound when the object is created.
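A tiny example shows why a reference member forces colon initialization. The class Accum is made up for illustration; omitting target from the initializer list would be a compile error, because a reference member cannot be left unbound and then assigned in the constructor body.

```cpp
#include <cassert>

/* Hypothetical class with a reference member, mirroring SumHeights's
   my_int: the reference must be bound in the colon-initializer list. */
class Accum {
    double& target;   /* reference member: colon initialization required */
    double step;      /* plain member: colon initialization optional */
public:
    Accum(double& t, double s) : target(t), step(s) {}
    /* Each call adds step to the caller's variable through the reference. */
    void bump() { target += step; }
};
```

Because target refers to the caller's variable, calling bump() changes that variable directly, which is exactly the sharing behavior SumHeights attempted with main()'s integral.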
Note: Can you detect any problems in this code?
At this writing, TBB is not installed on St. Olaf CS machines. Therefore, we will run this TBB program on the MTL.
First, if necessary, copy the program file you created to a local machine for connecting to MTL, e.g., your laptop.
Now connect to the MTL's network using the Cisco VPN client software on that local machine. Use scp to copy your file to your subdirectory in the MTL system, then log in to the remote MTL system.
On the remote MTL system, execute the following command, which sets up environment variables for compiling with TBB:

    36.81.203.1% source /opt/intel/Compiler/11.1/056/tbb/bin/tbbvars.sh intel64

The intel64 command-line argument prepares for 64-bit compilation.

To compile your program that was copied in a prior step, issue this command:
    36.81.203.1% g++ -o trap-tbb trap-tbb.cpp -ltbb_debug

Note: You can use -ltbb instead of -ltbb_debug for a production version of the library instead of one with debugging hooks.

Now run your program (recall that there are no command-line arguments).
    36.81.203.1% ./trap-tbb

The result is significantly less than 2! Can you think of an explanation for the answer being so far off?
Also run several time tests of your program:

    36.81.203.1% time ./trap-tbb

What do you observe in these time tests? How do the times compare to timed runs of trap-omp for various thread counts?
The code above for a TBB trapezoidal computation produces an incorrect answer if there are multiple threads, because each thread attempts to update the shared variable integral without any mechanism to prevent one thread from overwriting the results produced by another thread. As before, we will solve this issue using a reduction, in which results will be computed in local variables for each thread, then those local results added together at the end.
To do the reduction in TBB, we will use the parallel_reduce call instead of the parallel_for call, and will use a modified class SumHeights2.
    #include <iostream>
    #include <cmath>
    using namespace std;

    #include "tbb/tbb.h"
    using namespace tbb;

    /* Demo program for TBB: computes trapezoidal approximation to an integral */

    const double pi = 3.141592653589793238462643383079;

    double f(double x);

    class SumHeights2 {
      double const my_a;
      double const my_h;

    public:
      double my_int;

      void operator() (const blocked_range<size_t>& r) {
        double a2 = my_a;
        double h2 = my_h;
        double int2 = my_int;
        size_t end = r.end();
        for (size_t i = r.begin(); i != end; i++) {
          int2 += f(a2+i*h2);
        }
        my_int = int2;
      }

      SumHeights2(const double a, const double h, const double integral)
        : my_a(a), my_h(h), my_int(integral)
      {}

      SumHeights2(SumHeights2 &x, split)
        : my_a(x.my_a), my_h(x.my_h), my_int(0)
      {}

      void join( const SumHeights2 &y) { my_int += y.my_int; }
    };

    int main(int argc, char** argv) {
      /* Variables */
      double a = 0.0, b = pi;  /* limits of integration */
      int n = 1048576;         /* number of subdivisions = 2^20 */
      double h = (b - a) / n;  /* width of subdivision */
      double integral;         /* accumulates answer */

      integral = (f(a) + f(b))/2.0;

      SumHeights2 sh2(a, h, integral);
      parallel_reduce(blocked_range<size_t>(1, n), sh2);
      integral += sh2.my_int;

      integral = integral * h;

      cout << "With n = " << n << " trapezoids, our estimate of the integral"
           << " from " << a << " to " << b << " is " << integral << endl;
    }

    double f(double x) {
      return sin(x);
    }
The class SumHeights2 handles the variable my_int differently than the class SumHeights does. Instead of SumHeights's misguided attempt to share main()'s memory location integral through reference types, the new class SumHeights2 allocates a new separate (state) variable location my_int for each object of type SumHeights2 (by avoiding reference types).
Also, my_int is a public state variable in SumHeights2, instead of having the default private visibility as in the prior class SumHeights. This makes it possible for a method of a SumHeights2 object to compute a value and store that value in my_int, then for another part of the code to access that computed value through that public state variable my_int. (Alternatively, we could have made my_int private like the other state variables, and added a "getter" method get_my_int() to retrieve that computed value.)
The operator() definitions in the two classes differ in several ways.
The code for the new class's operator SumHeights2::operator() begins by making local copies a2, h2, and int2 of the variables my_a, my_h, and my_int, and also storing the (unchanging) value of r.end() in another local variable. These local variable assignments are not necessary for the logical correctness of the code. Instead, they make it possible for the compiler to produce a more efficient computation. (HD veterans: With this help, the compiler can realize that it's safe to use registers to implement those variables instead of memory locations, which would lead to faster access to those values.)
The loop is rewritten to use these local variables, but otherwise represents the same computation as in the previous SumHeights::operator().
After the loop, the local variable int2 is assigned to the state variable my_int, in order to deliver the sum for this thread's subdivision (chunk) of the summation range.
SumHeights2::operator() is not a const method, because calling it changes the object: my_int is updated when operator() is called. This means const objects are not permitted to call this method.
The three-argument constructor for SumHeights is the same as the three-argument constructor for SumHeights2, except for the handling of the third argument integral (discussed above). However, the class SumHeights2 has an additional constructor and an additional method join().
The second constructor is called a split constructor. This constructor will be used to construct new instances of SumHeights2 for additional threads brought into the summation computation.
The method join() is used to add the partial sum from one thread's computation to a running sum -- i.e., to perform the reduction operation.
Here is an overview description of the parallel computation for this program.

An object sh2 is allocated, using the three-argument constructor for SumHeights2.

The call to parallel_reduce() in main() performs sh2's operator() over the range 1 to n by subdividing (i.e., chunking) that range and assigning a thread to perform the trapezoidal sum for each chunk.

Each of those threads creates its own SumHeights2 object using the splitting constructor. The thread first calls that splitting object's operator() with that thread's range chunk to compute a partial sum. Then, the thread calls sh2.join() with that splitting object as the argument, in order to add its partial sum to sh2's accumulator sh2.my_int.

After all range chunks have been processed, parallel_reduce() finishes, leaving the final answer in the public state variable sh2.my_int.
The splitting constructor for SumHeights2 has a dummy argument of type split (defined by the TBB library), because without that extra argument, there would be no way for a compiler to tell the difference between a call to that splitting constructor and a call to SumHeights2's copy constructor.
Enter the program above, using the filename trap-tbb2.cpp. You can enter it on a lab machine, but then you'd have to disconnect Cisco VPN on your local machine (e.g., your laptop), scp the new file to your local machine, reconnect Cisco VPN, and scp to the MTL machine in order to transfer it to the MTL system. Alternatively, you could enter the program on your local machine and scp it to the MTL machine. Another possibility is to enter it on the remote MTL system directly, using an editor such as emacs or vi.
Compile and test your trap-tbb2 program on the MTL. Does it now produce the correct answer of 2 for the trapezoidal approximation? Also time the performance of runs of this revised program, and compare to the time performance of runs of the prior program trap-tbb.
rab@stolaf.edu, May 21, 2010