Multi-core programming with Intel's Manycore Testing Lab

"Responding to Manycore" Project


Dick Brown, St. Olaf College, May 2010

Intel Corporation has set up a special remote system, called the Manycore Testing Lab (MTL), that allows faculty and students to work with computers that have many cores. In this lab, we will create a program that intentionally uses multi-core parallelism, upload and run it on the MTL, and explore the issues in parallelism and concurrency that arise.

Extra: Comments in this format indicate possible avenues of exploration for people seeking more challenge in this lab.

Preparing your machine to connect to the MTL

Carry out the steps in this section before the lab, if possible, on a laptop you can bring with you to the lab session.

We will use Link computers in RNS 202 or 203 to develop our multi-core program, but we need to connect to the Intel MTL system using a non-Link computer. This is because the Cisco VPN software for connecting to the MTL blocks all other network access; if we used it on the Link computers, it would interfere with all other uses of those computers, such as being part of the MistRider cluster.

The Cisco VPN client software is free, easy to install, easy to use, and does not interfere with your computer except while you are connected to the MTL. Here are download links (install before the lab if possible):

Once you have installed the Cisco VPN client (and have ssh and scp installed), configure the Cisco VPN client to access the multicore testing lab, as follows.

  1. Start up the Cisco VPN client. Note: This is installed as a software application; don't look for it under your system's network connection options, as you would for some other VPN systems.

  2. Create a new connection, with the following connection information:

    Save this connection to finish creating it.

  3. Now try connecting on your new VPN connection entry.

    If this succeeds, you will find that none of your usual network services work. For example, your browser won't be able to find any pages (thus, you'll have to google on a different machine while you're connected to the MTL).

    If your new VPN connection fails, recheck the settings you entered, or seek help.

  4. Finally, disconnect from your new VPN connection entry. This will restore your usual network capabilities.

Getting started with OpenMP

Extra: If this lab is too slow for you, consider carrying it out in a different parallelism strategy, such as Intel's Threading Building Blocks (TBB).

We will use a standard system for parallel programming called OpenMP, which enables a C or C++ programmer to take advantage of multi-core parallelism primarily through preprocessor pragmas.

We will begin with a short C++ program, parallelize it using OpenMP, and improve the parallelized version. This development work can be carried out on a CS lab machine.

  1. The following program computes a Calculus value: the trapezoidal approximation of the integral of sin(x) from x = 0 to pi, using 2^20 equal subdivisions. The exact value of this integral is 2, so the computed estimate should print as 2.0.

    #include <iostream>
    #include <cmath>
    #include <cstdlib>
    using namespace std;
    
    /* Demo program for OpenMP: computes trapezoidal approximation to an integral*/
    
    const double pi = 3.141592653589793238462643383079;
    
    int main(int argc, char** argv) {
       /* Variables */
       double a = 0.0, b = pi;  /* limits of integration */
       int n = 1048576; /* number of subdivisions = 2^20 */
       double h = (b - a) / n; /* width of subdivision */
       double integral; /* accumulates answer */
       int threadct = 1;  /* number of threads to use */
       
       /* parse command-line arg for number of threads */
       if (argc > 1)
         threadct = atoi(argv[1]);
    
       double f(double x);
         
    #ifdef _OPENMP
       cout << "OMP defined, threadct = " << threadct << endl;
    #else
       cout << "OMP not defined" << endl;
    #endif
    
       integral = (f(a) + f(b))/2.0;
       int i;
    
       for(i = 1; i < n; i++) {
         integral += f(a+i*h);
       }
       
       integral = integral * h;
       cout << "With n = " << n << " trapezoids, our estimate of the integral" <<
         " from " << a << " to " << b << " is " << integral << endl;
    }
        
    double f(double x) {
       return sin(x);
    }
    

    Comments on this code: the preprocessor symbol _OPENMP is defined automatically whenever the compiler's OpenMP mode is enabled (the -fopenmp flag for g++), so the #ifdef reports whether OpenMP is in effect; the declaration of the function f() appears inside main() (legal C++, though declarations are more commonly placed above main()); and the optional command-line argument sets threadct, which the program does not yet use.

  2. Log into a CS lab machine, and create a file containing the program above, perhaps named trap-omp.C. To compile your file, you can enter the command

      %  g++ -o trap-omp trap-omp.C -lm -fopenmp
    
    (Here, % represents your shell prompt, which is usually a machine name followed by either % or $.) First, try running the program without a command-line argument:
      %  trap-omp
    
    This should print a line "OMP defined, threadct = 1", followed by a line indicating the computation with an answer of 2.0. Next, try
      %  trap-omp 2
    
    This should indicate a different thread count, but otherwise produce the same output. Finally, try recompiling your program without the -fopenmp flag. This should print "OMP not defined", but give the same answer 2.0.

  3. Note that the program above actually uses only a single core, whether or not a command-line argument is given. It is an ordinary C++ program in every respect, and OpenMP does not magically change ordinary C++ programs; in particular, the variable threadct is just an ordinary local variable with no special computational meaning.

    To request a parallel computation, add the following pragma preprocessor directive, just before the for loop.

        #pragma omp parallel for num_threads(threadct) \
          shared (a, n, h, integral) private(i) 
    
    The resulting code will have this format:
           int i;
        
        #pragma omp parallel for num_threads(threadct) \
          shared (a, n, h, integral) private(i) 
           for(i = 1; i < n; i++) {
             integral += f(a+i*h);
           }
    
    Here are some comments on this pragma: the clause num_threads(threadct) requests a team of threadct threads to divide up the loop's iterations; shared(a, n, h, integral) asserts that all threads access the same copies of those variables, while private(i) gives each thread its own copy of the loop index; and the backslash at the end of the first line simply continues the directive onto a second line.

    Enter this change, compile, and test the resulting executable.
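    Extra: To verify how many threads actually execute a parallel region, you can query the OpenMP runtime. The following is a small self-contained sketch (not part of the lab code) using the runtime function omp_get_num_threads() from omp.h; compile it with g++ -fopenmp.

        #include <iostream>
        #include <omp.h>
        using namespace std;
        
        /* Sketch: report the actual team size of a parallel region */
        int main() {
        #pragma omp parallel num_threads(4)
           {
        #pragma omp single   /* exactly one thread prints */
              cout << "team size: " << omp_get_num_threads() << endl;
           }
           return 0;
        }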
  4. After inserting the parallel for pragma, observe that the program runs and produces the correct result 2.0 when threadct == 1 (e.g., when no command-line argument is given), but that

      % trap-omp 2
    
    which sets threadct == 2, produces an incorrect answer (perhaps about 1.5). What happens with repeated runs with that and other (positive) thread counts? Can you explain why?

  5. A program has a race condition if the correct behavior of that program depends on the timing of its execution. With 2 or more threads, the program trap-omp.C has a race condition concerning the shared variable integral, which is the accumulator for the summation performed by that program's for loop.
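    To see a race condition in isolation, consider the following minimal sketch (not part of the lab code). Compiled with -fopenmp, it runs with OpenMP's default number of threads (typically the number of cores); with more than one thread, the final count usually comes out less than one million, and it varies from run to run.

        #include <iostream>
        using namespace std;
        
        /* Sketch: a race condition on a shared counter */
        int main() {
           int count = 0;
        #pragma omp parallel for shared(count)
           for (int i = 0; i < 1000000; i++) {
              count++;   /* read-modify-write of a shared variable: not atomic */
           }
           cout << "count = " << count << " (expected 1000000)" << endl;
        }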

  6. One approach to avoiding this program's race condition is to give each thread its own local variable integral instead of one variable shared by all the threads. But simply declaring integral to be private instead of shared in the pragma only produces threadct partial sums in those per-thread variables named integral -- nothing adds those partial sums into the program's own variable integral. In fact, the values in those temporary per-thread variables are discarded when each thread finishes its work for the parallel for.

    Can you re-explain this situation in your own words?
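    This per-thread strategy can also be coded by hand. Here is a sketch of that approach for trap-omp.C (assuming the variables a, n, h, threadct and the function f() from that program): each thread accumulates into its own local variable partial, and a critical section makes the final additions to integral safe.

        integral = (f(a) + f(b))/2.0;
        #pragma omp parallel num_threads(threadct) shared(a, n, h, integral)
        {
           double partial = 0.0;       /* each thread's private accumulator */
        #pragma omp for
           for (int i = 1; i < n; i++)
              partial += f(a + i*h);
        #pragma omp critical           /* one thread at a time adds its sum */
           integral += partial;
        }
        integral = integral * h;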

  7. Fortunately, OpenMP provides a convenient and effective solution to this problem: a reduction clause. For this program's sum, the clause is

        reduction(+: integral)

    Add this clause to your OpenMP pragma, and remove the variable integral from the shared clause, then recompile and test your program. You should now see the correct answer 2.0 when computing with multiple threads -- a correct multi-core program!
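    For reference, here is one form the revised pragma and loop can take (a sketch consistent with the code above; behind the scenes, each thread accumulates into its own private copy of integral, and OpenMP adds those copies into the original variable when the loop ends):

        #pragma omp parallel for num_threads(threadct) \
          shared (a, n, h) private(i) reduction(+: integral)
           for(i = 1; i < n; i++) {
             integral += f(a+i*h);
           }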

  8. How does this reduction feature of OpenMP compare to the "reduce" stage of map-reduce computing? How does it relate to the reduce operation in MPI? Write a comparison of OpenMP's reduction to other "reduce" operations you have seen, if any.
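    If you have seen MPI, the closest analogue is MPI_Reduce, which combines a value from every process onto one root process. The fragment below is a sketch for comparison only (not part of this lab), assumed to appear inside an MPI program after MPI_Init; the variable names are illustrative.

        double partial = 0.0;  /* assume each process computed its share here */
        double total = 0.0;
        /* sum every process's partial into total on process (rank) 0 */
        MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);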

  9. A code segment is said to be thread-safe if it remains correct when executed by multiple independent threads. The body of this program's loop is not thread-safe, but adding the reduction clause to the OpenMP pragma led to a correct computation.

    Some libraries are identified as thread-safe, meaning that each function in that library is thread-safe. Of course, calling a thread-safe function doesn't ensure that the code containing that function call is thread-safe. For example, the function f() in our example is thread-safe, because the math library function sin() is thread-safe, but the body of that loop is not thread-safe, because of the race condition involving the variable integral.
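    Reduction is not the only way to repair this loop's race condition. As an alternative sketch, OpenMP's atomic directive serializes just the one unsafe update, making the loop body thread-safe, though usually more slowly than a reduction, since threads must take turns at the update:

        #pragma omp parallel for num_threads(threadct) \
          shared (a, n, h, integral) private(i)
           for(i = 1; i < n; i++) {
             double height = f(a+i*h);  /* thread-safe: no shared state */
        #pragma omp atomic              /* protect the one shared update */
             integral += height;
           }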

Timing performance

  1. We can obtain the running time for a program using the Linux time program. For example,

      % /usr/bin/time -p trap-omp 
    
    might display the following output:
    OMP defined, threadct = 1
    With n = 1048576 trapezoids, our estimate of the integral from 0 to 3.14159 is 2
    real 0.04
    user 0.04
    sys 0.00
    
    Here, we use the full path /usr/bin/time to ensure that we are accessing the time program instead of a shell builtin command. The -p flag produces output in a format comparable to what we will see in the MTL.

    The real time measures actual time elapsed during the running of your command trap-omp. user measures the amount of time executing user code, and sys measures the time executing in Linux kernel code.

  2. Try the time command, and compare the results for different thread counts. You should find that real time decreases somewhat when changing from 1 thread to 2 threads; user time increases somewhat. Can you think of reasons that might produce these results?

    Also, real time and user time increase considerably on lab machines when increasing from 2 to 3 or more threads. What might explain that?
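    Extra: You can also time just the loop from inside the program, using OpenMP's wall-clock timer omp_get_wtime() (declared in omp.h; requires -fopenmp). Here is a sketch of the idea, wrapped around the parallel for loop of trap-omp.C:

        #include <omp.h>   /* at the top of the file */
        ...
        double start = omp_get_wtime();
        /* ... the parallel for loop goes here ... */
        double elapsed = omp_get_wtime() - start;
        cout << "loop time: " << elapsed << " seconds" << endl;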

Using the MTL

  1. Before accessing the MTL, copy your program trap-omp.C to your laptop. If this program is in a subdirectory mtl of your home directory, you can copy this using scp as follows:

      laptop% scp username@shelob.cs.stolaf.edu:mtl/trap-omp.C .
    
    (Don't forget the final dot!)

  2. Now, use Cisco VPN to connect your laptop to the MTL network.

  3. You can now login to the MTL computer, as follows

      laptop% ssh sorab-s0N@36.81.203.1
    
    Use one of the student account usernames sorab-s01, sorab-s02, sorab-s03, or sorab-s04, together with the password distributed in class.

  4. Create a subdirectory for yourself in the home directory for that student account sorab-s0N:

      36.81.203.1% mkdir username
    
    Here, username can be your St. Olaf username.

  5. Next, copy your program from your laptop to the MTL machine. One way to do this is to logout, then enter the following command:

      laptop% scp trap-omp.C sorab-s0N@36.81.203.1:username
    
    After making this copy, log in to the MTL machine 36.81.203.1 again.

  6. On the MTL machine, compile and test run your program.

      36.81.203.1% g++ -o trap-omp trap-omp.C -lm -fopenmp
      36.81.203.1% ./trap-omp
      36.81.203.1% ./trap-omp 2
      36.81.203.1% ./trap-omp 16
    
    Note: Since the current directory . may not be in your default path, you probably need to use the path name ./trap-omp to invoke your program.

  7. Now, try some time trials of your code on the MTL machine. (The full pathname for time and the -p flag are unnecessary.) For example:

      36.81.203.1% time ./trap-omp
      36.81.203.1% time ./trap-omp 2
      36.81.203.1% time ./trap-omp 3
      36.81.203.1% time ./trap-omp 4
      36.81.203.1% time ./trap-omp 8
      36.81.203.1% time ./trap-omp 16
      36.81.203.1% time ./trap-omp 32
    

  8. What patterns do you notice with the real and user times of various runs of trap-omp with various values of threadct?

Intel's Threading Building Blocks (TBB)

OpenMP works well for adding parallelism to loops in working sequential code, and it's available for C, C++, and Fortran languages on many platforms (including Linux, Windows, and Macintosh OS X). Older versions of OpenMP did not readily support non-loop parallelism or programming with concurrent data structures, but OpenMP version 3.0 (released May 2008) provides a task feature for programming such computations.
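For example, a task can express recursive, non-loop parallelism. The following self-contained sketch (not part of this lab; it requires a compiler supporting OpenMP 3.0) sums an array by divide-and-conquer, spawning a task for one half of each range:

    #include <iostream>
    using namespace std;
    
    /* Sketch: recursive array sum using OpenMP 3.0 tasks */
    long sum(const long* data, int lo, int hi) {
       if (hi - lo < 1000) {            /* small range: just loop */
          long s = 0;
          for (int i = lo; i < hi; i++)
             s += data[i];
          return s;
       }
       int mid = lo + (hi - lo)/2;
       long left, right;
    #pragma omp task shared(left)       /* child task sums the left half */
       left = sum(data, lo, mid);
       right = sum(data, mid, hi);      /* this task sums the right half */
    #pragma omp taskwait                /* wait for the child task to finish */
       return left + right;
    }
    
    int main() {
       const int N = 1000000;
       long* data = new long[N];
       for (int i = 0; i < N; i++)
          data[i] = 1;
       long total = 0;
    #pragma omp parallel
    #pragma omp single                  /* one thread starts the recursion */
       total = sum(data, 0, N);
       cout << "total = " << total << " (expected " << N << ")" << endl;
       delete [] data;
    }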

Intel's Threading Building Blocks (TBB) provides an object-oriented approach to implementing parallel algorithms, for the C++ language (on any of the three platforms). Adding parallelism to existing code with TBB is somewhat more involved than with OpenMP, but is considerably less complicated than programming with a native threads package for a particular operating system. The forthcoming new standard for the C++ language is likely to include parallelism similar to TBB.

  1. Enter the following TBB program into a file trap-tbb.cpp.

    #include <iostream>
    #include <cmath>
    using namespace std;
    
    #include "tbb/tbb.h"
    using namespace tbb;
    
    /* Demo program for TBB: computes trapezoidal approximation to an integral*/
    
    const double pi = 3.141592653589793238462643383079;
    
    double f(double x);
         
    class SumHeights {
      double const my_a;
      double const my_h;
      double &my_int;
    
    public:
      void operator() (const blocked_range<size_t>& r) const {
        for(size_t i = r.begin(); i != r.end(); i++) {
          my_int += f(my_a+i*my_h);
        }
      }
      
      SumHeights(const double a, const double h, double &integral) : 
        my_a(a), my_h(h), my_int(integral) 
      {}
    };
    
    int main(int argc, char** argv) {
       /* Variables */
       double a = 0.0, b = pi;  /* limits of integration */
       int n = 1048576; /* number of subdivisions = 2^20 */
    
       double h = (b - a) / n; /* width of subdivision */
       double integral; /* accumulates answer */
       
       integral = (f(a) + f(b))/2.0;
    
       parallel_for(blocked_range<size_t>(1, n), SumHeights(a, h, integral));
       
       integral = integral * h;
       cout << "With n = " << n << " trapezoids, our estimate of the integral" <<
         " from " << a << " to " << b << " is " << integral << endl;
    }
        
    double f(double x) {
    
       return sin(x);
    }
    

    Comments on this code: the object blocked_range<size_t>(1, n) represents the index range from 1 to n-1, which parallel_for subdivides into subranges; SumHeights is a function object ("functor") whose operator() adds up the heights f(a+i*h) for one subrange; and my_int is a reference to main()'s variable integral, so all of the subranges update that one variable.

    Note: Can you detect any problems in this code?

  2. At this writing, TBB is not installed on St. Olaf CS machines. Therefore, we will run this TBB program on the MTL.

  3. The code above for a TBB trapezoidal computation produces an incorrect answer if there are multiple threads, because each thread attempts to update the shared variable integral without any mechanism to prevent one thread from overwriting the results produced by another thread. As before, we will solve this issue using a reduction, in which results are computed in local variables for each thread, then those local results are added together at the end.

    To do the reduction in TBB, we will use the parallel_reduce call instead of the parallel_for call, and will use a modified class SumHeights2.

    #include <iostream>
    #include <cmath>
    using namespace std;
    
    #include "tbb/tbb.h"
    using namespace tbb;
    
    /* Demo program for TBB: computes trapezoidal approximation to an integral*/
    
    const double pi = 3.141592653589793238462643383079;
    
    double f(double x);
         
    class SumHeights2 {
      double const my_a;
      double const my_h;
    
    public:
      double my_int;
    
      void operator() (const blocked_range<size_t>& r) {
        double a2 = my_a;
        double h2 = my_h;
        double int2 = my_int;
        size_t end = r.end();
        for(size_t i = r.begin(); i != end; i++) {
          int2 += f(a2+i*h2);
        }
        my_int = int2;
      }
      
      SumHeights2(const double a, const double h, const double integral) : 
        my_a(a), my_h(h), my_int(integral)
      {}
    
      SumHeights2(SumHeights2 &x, split) : 
        my_a(x.my_a), my_h(x.my_h), my_int(0)
      {}
    
      void join( const SumHeights2 &y) { my_int += y.my_int; }
    };
    
    int main(int argc, char** argv) {
       /* Variables */
       double a = 0.0, b = pi;  /* limits of integration */
       int n = 1048576; /* number of subdivisions = 2^20 */
    
       double h = (b - a) / n; /* width of subdivision */
       double integral; /* accumulates answer */
       
       integral = (f(a) + f(b))/2.0;
    
       SumHeights2 sh2(a, h, 0.0); /* partial sum starts at 0; the endpoint
                                      term (f(a)+f(b))/2 is already in integral */
       parallel_reduce(blocked_range<size_t>(1, n), sh2);
       integral += sh2.my_int;
       
       integral = integral * h;
       cout << "With n = " << n << " trapezoids, our estimate of the integral" <<
         " from " << a << " to " << b << " is " << integral << endl;
    }
        
    double f(double x) {
    
       return sin(x);
    }
    
    Comments: parallel_reduce subdivides the range just as parallel_for does, but it may also split the body object, using the second ("splitting") constructor to give a stolen subrange its own accumulator my_int starting at 0, and it calls join() to add a split-off body's partial sum back in; operator() accumulates into local copies of the member variables for speed, and it cannot be const because it updates my_int; and main() creates sh2 with its partial sum starting at 0, then adds the reduction's result sh2.my_int into integral after the parallel_reduce call.

  4. Enter the program above, using the filename trap-tbb2.cpp.

    You can enter it on a lab machine, but then, to transfer it to the MTL system, you would have to disconnect Cisco VPN on your local machine (e.g., your laptop), scp the new file from the lab machine to your local machine, reconnect Cisco VPN, and scp the file to the MTL machine.

    Alternatively, you could enter the program on your local machine and scp to the MTL machine. Another possibility is to enter it on the remote MTL system directly, using an editor such as emacs or vi.

  5. Compile and test your trap-tbb2 program on the MTL. Does it now produce the correct answer of 2 for the trapezoidal approximation?

  6. Also time the performance of runs of this revised program, and compare to the time performance of runs of the prior program trap-tbb.
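    Extra: Besides the shell's time command, TBB provides a portable timer, tbb::tick_count, which can measure just the parallel_reduce call. Here is a sketch, assuming the declarations from trap-tbb2.cpp:

        #include "tbb/tick_count.h"   /* at the top of the file */
        ...
        tbb::tick_count t0 = tbb::tick_count::now();
        parallel_reduce(blocked_range<size_t>(1, n), sh2);
        tbb::tick_count t1 = tbb::tick_count::now();
        cout << "parallel_reduce took " << (t1 - t0).seconds() << " seconds" << endl;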





rab@stolaf.edu, May 21, 2010