Basic Parallelization in C/C++: Serial vs OpenMP vs MPI vs CUDA

In this post I compare three different parallelization techniques. As an example I use vector addition with large vectors, repeated several times to generate load; this is basically a ‘Hello World’ for numerical code.

Full source code for all methods presented here, including a script to run a speed test, can be found on GitHub: Basic Parallelization

The applied methods are: serial (as a baseline), OpenMP, MPI, OpenMP + MPI, and CUDA.

Serial

The function add() adds two vectors, x and y, of length n.

full source here

void add(int n, float *x, float *y)
{
    for (int i = 0; i < n; i++)
    {
        y[i] = x[i] + y[i];
    }
}
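
For context, a minimal driver for the serial version might look like the following sketch, assuming add() from above is defined in the same file; the values of N and rep and the printed check are only examples, the full source uses its own setup and timing.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(void)
{
    int N   = 1 << 20;   // vector length (example value)
    int rep = 100;       // repetitions to generate load (example value)

    float *x = malloc(N * sizeof(*x));
    float *y = malloc(N * sizeof(*y));

    for (int i = 0; i < N; i++)
    {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    for (int r = 0; r < rep; r++)
        add(N, x, y);

    // after rep additions every element of y should be 2 + rep
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = fmaxf(maxError, fabsf(y[i] - (2.0f + rep)));
    printf("Max error: %f\n", maxError);

    free(x);
    free(y);
    return 0;
}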

OpenMP

OpenMP stands for Open Multi-Processing. It is used for shared-memory multiprocessing, that is, multiple threads within a single process on one node.

OpenMP spawns multiple threads for one process. To set the number of threads, set the environment variable OMP_NUM_THREADS; for 2 threads, for example, set OMP_NUM_THREADS=2.

The part of the code that should run in parallel is marked with a #pragma directive that defines the parallel region. Additionally, the code has to be compiled with the -fopenmp flag.

full source here

void add(int n, float *x, float *y)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
    {
        y[i] = x[i] + y[i];
    }
}
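
To try the OpenMP version, compiling and running could look like this (the file name is only an example):

gcc -fopenmp -O2 add_openmp.c -o add_openmp
export OMP_NUM_THREADS=2
./add_openmp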

MPI

The Message Passing Interface (MPI) does basically just that: it passes messages from one process to another to exchange data. The application runs as multiple processes, each one handling its own part of the vector. The code has to be compiled with the mpicc command, a wrapper around gcc, and run with the mpirun command, specifying the number of processes. The following changes are necessary in the code (see full source here).
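
For reference, compiling and launching with two processes might look like this (the file name is only an example):

mpicc add_mpi.c -o add_mpi
mpirun -np 2 ./add_mpi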

The add() function remains the same as in the serial example.

In the main function the MPI environment is initialized:

MPI_Init(NULL, NULL);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);        // rank (ID) of the calling process
MPI_Comm_size(MPI_COMM_WORLD, &rank_total);  // total number of processes

MPI_Comm_rank determines the rank of the current (calling) MPI process, and the main part of the main function is adjusted to that rank. In a simple example with 2 MPI processes, the first process handles the first half of the vector while the second one handles the second half. This includes the add() function, which only has to handle half of the operation and is therefore (theoretically) twice as fast.
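
The split itself can be done in different ways; the following is only a sketch of how each rank could pick a contiguous chunk of the vector, not necessarily how the full source does it:

// each rank handles a contiguous chunk of the full vector (sketch)
int chunk = N / rank_total;
int start = rank * chunk;
int count = (rank == rank_total - 1) ? N - start : chunk;  // last rank takes any remainder

add(count, &x[start], &y[start]);  // each process adds only its own part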

Each process only has access to its own data. To exchange data between the different processes, it has to be transferred explicitly through the MPI interface. In this example, rank 0 collects the timing and the maximum error from all other ranks:

    if (rank_total > 1)
    {
        if (rank == 0)
        {
            int cpu_time_used_local = 0;
            int maxError_local = 0;

            // receive timing and error data from all other processes
            for (int i = 1; i < rank_total; i++)
            {
                MPI_Recv(&cpu_time_used_local, 1, MPI_INT, i, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Recv(&maxError_local     , 1, MPI_INT, i, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                cpu_time_used += cpu_time_used_local;
                maxError = fmax(maxError_local, maxError);
            }
        }
        else
        {
            // send this rank's data to rank 0
            // MPI_Send(buf, count, datatype, dest, tag, comm)
            MPI_Send(&cpu_time_used, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            MPI_Send(&maxError,      1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
    }
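
As a side note, the same exchange could also be expressed with MPI collectives instead of the explicit send/receive loop; the post does not use this, but a sketch with MPI_Reduce would be:

int cpu_time_total  = 0;
int maxError_global = 0;
// sum the timings over all ranks and take the maximum error, results end up on rank 0
MPI_Reduce(&cpu_time_used, &cpu_time_total,  1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
MPI_Reduce(&maxError,      &maxError_global, 1, MPI_INT, MPI_MAX, 0, MPI_COMM_WORLD);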

Finally, the MPI environment has to be shut down at the end of the program:

MPI_Finalize();

OpenMP + MPI

OpenMP and MPI can be combined by taking the MPI example from above and using the OpenMP version of the add() function, including the #pragma. The code has to be compiled with mpicc and run with mpirun and the desired number of processes. Additionally, the environment variable OMP_NUM_THREADS is set to the number of threads per MPI process. See full source here.
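
A possible compile-and-run sequence (the file name is again only an example) is:

mpicc -fopenmp add_hybrid.c -o add_hybrid
export OMP_NUM_THREADS=4
mpirun -np 2 ./add_hybrid    # 2 MPI processes x 4 OpenMP threads = 8 parallel threads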

Why would you want to do that? The total number of parallel threads equals the number of MPI processes times the number of OpenMP threads per process. But more processes or more threads does not necessarily mean faster. The optimal combination of MPI processes and OpenMP threads depends on several factors: the type of compute node, CPU speed, the size of shared memory, the number of CPU sockets per node, the number of cores per CPU, and even the network between compute nodes. The optimal combination usually has to be found by systematic testing.


CUDA

CUDA is an API that allows applications to interface with the GPU. CUDA is developed and maintained by Nvidia and specifically designed for Nvidia graphics cards. To interface with AMD GPUs, an alternative such as OpenCL has to be used. In this example I use CUDA, simply because I have an Nvidia graphics card.

A GPU is basically a specialized processor with an instruction set optimized for floating-point operations. It has thousands of cores, each with only a small amount of memory, and it handles small jobs in a massively parallel fashion, which is exactly what is needed for graphics rendering.

full source here

First, add() is changed to a CUDA kernel function that can be run on the GPU.

__global__
void add(int n, float *x, float *y)
{
    int t = threadIdx.x;   // index of this thread within the block
    int T = blockDim.x;    // total number of threads in the block
    // each thread handles every T-th element, starting at its own index
    for (int i = t; i < n; i += T)
    {
        y[i] = x[i] + y[i];
    }
}

To make the data accessible from both the CPU and the GPU, memory is allocated with a CUDA-specific command that provides unified (managed) memory:

cudaMallocManaged(&x,N*sizeof(*x));
cudaMallocManaged(&y,N*sizeof(*y));

The most important part is the call that launches the kernel function on the GPU. In this example, a single block of 512 parallel threads is launched:

for (int i = 0; i<rep; i++)
{
    add<<<1, 512>>>(N,x,y);
    cudaDeviceSynchronize();
}

And finally memory has to be freed with a CUDA command as well:

cudaFree(x);
cudaFree(y);
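
The CUDA code is compiled with Nvidia's nvcc compiler; building and running might look like this (the file name is only an example):

nvcc add_cuda.cu -o add_cuda
./add_cuda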
