
Implementation of parallel reduction using Min, Max, Sum Operations.



Experiment No: HPC-01
TITLE: - Implement parallel reduction using Min, Max, Sum Operations.


Problem Statement: - To study and implement a directive-based parallel programming model.
Objective:
  • To study OpenMP as a parallel programming model.
Requirements (Hw/Sw): PC, gcc, OpenMP, Ubuntu system, C/C++.
Theory:-
OpenMP:
OpenMP is a set of C/C++ pragmas (with FORTRAN equivalents) which provide the programmer with a high-level front-end interface that gets translated into calls to threads (or other similar entities). The key phrase here is "high-level": the goal is to better enable the programmer to "think parallel," relieving him/her of the burden and distraction of setting up and coordinating threads.
OpenMP is an implementation of multithreading, a method of parallelizing whereby a master thread (a series of instructions executed consecutively) forks a specified number of slave threads and the system divides a task among them. The threads then run in parallel, with the runtime environment allocating threads to different processors.
The section of code that is meant to run in parallel is marked accordingly, with a compiler directive that will cause the threads to form before the section is executed.
Each thread has an id attached to it which can be obtained using a function (called omp_get_thread_num()). The thread id is an integer, and the master thread has an id of 0. After the execution of the parallelized code, the threads join back into the master thread, which continues onward to the end of the program.
By default, each thread executes the parallelized section of code independently. Work-sharing constructs can be used to divide a task among the threads so that each thread executes its allocated part of the code. Both task parallelism and data parallelism can be achieved using OpenMP in this way.
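As a brief illustration of the fork-join behaviour and thread ids described above, consider the following minimal sketch; each thread prints its id, and the thread reporting id 0 is the master thread:

#include <stdio.h>
#include <omp.h>
int main(void)
{
    /* fork a team of threads; each thread prints its own id */
    #pragma omp parallel
    {
        int id = omp_get_thread_num();   /* id of the calling thread */
        printf("Hello from thread %d\n", id);
        if (id == 0)
            printf("Master thread, total threads = %d\n", omp_get_num_threads());
    }
    /* implicit join: only the master thread continues from here */
    return 0;
}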

Core elements
The core elements of OpenMP are the constructs for thread creation, workload distribution (work sharing), data-environment management, thread synchronization, user-level runtime routines and environment variables.
Thread creation
The pragma omp parallel is used to fork additional threads to carry out the work enclosed in the construct in parallel. The original thread is denoted the master thread and has thread ID 0.
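Combining thread creation with work sharing, the sketch below (array contents are chosen arbitrarily for illustration) forks a team of threads and divides the loop iterations among them:

#include <stdio.h>
#include <omp.h>
int main(void)
{
    int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};   /* illustrative data */
    /* fork threads and share the loop iterations among them */
    #pragma omp parallel for
    for (int i = 0; i < 8; i++)
    {
        a[i] *= 2;   /* each iteration touches a distinct element, so no data race */
        printf("iteration %d handled by thread %d\n", i, omp_get_thread_num());
    }
    return 0;
}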


  • OpenMP & Reduction



A reduction combines many values into a single value. Therefore, we can argue that the following example is a reduction
sum = a[0] + a[1] + a[2] + a[3] + a[4] + a[5] + a[6] + a[7] + a[8] + a[9]
It simplifies the long expression
a[0] + a[1] + a[2] + a[3] + a[4] + a[5] + a[6] + a[7] + a[8] + a[9]
into something shorter – sum. A similar argument shows that the following example is also a reduction
product = a[0] * a[1] * a[2] * a[3] * a[4] * a[5] * a[6] * a[7] * a[8] * a[9]
It reduces the long expression into the simpler one.

Loops

In C++, we can write the previous examples with for loops:
sum = 0;
for (auto i = 0; i < 10; i++)
{
    sum += a[i];
}
and
product = 1;
for (auto i = 0; i < 10; i++)
{
    product *= a[i];
}
OpenMP and reduction
OpenMP specializes in the parallelization of for loops. But parallelizing the previous for loops is tricky: there is a shared variable (sum or product) which is modified in every iteration. If we are not careful when parallelizing such for loops, we might introduce data races.
For example, if we parallelized the previous for loops by simply adding
#pragma omp parallel for 
we would introduce a data race, because multiple threads could try to update the shared variable at the same time.
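For illustration, a sketch of such an incorrect parallelization is shown below (array contents are arbitrary); because all threads update the shared variable sum without synchronization, the final value is unpredictable:

#include <stdio.h>
#include <omp.h>
int main(void)
{
    int a[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};   /* illustrative data */
    int sum = 0;
    /* WRONG: sum is shared and updated by every thread without synchronization,
       so updates can be lost (a data race) */
    #pragma omp parallel for
    for (int i = 0; i < 10; i++)
    {
        sum += a[i];
    }
    printf("sum = %d (may differ from the correct value 55)\n", sum);
    return 0;
}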
But for loops which represent a reduction are quite common. Therefore, OpenMP has the special reduction clause which can express the reduction of a for loop.
In order to specify the reduction in OpenMP, we must provide
  • an operation (e.g. +, *, min, max)
  • and a reduction variable (e.g. sum or product). This variable holds the result of the computation.
For example, the following for loop
sum = 0;
for (auto i = 0; i < 10; i++)
{
    sum += a[i];
}
can be parallelized with the reduction clause where the operation is equal to + and the reduction variable is equal to sum:
sum = 0;
#pragma omp parallel for shared(a) reduction(+: sum)
for (auto i = 0; i < 10; i++)
{
    sum += a[i];
}

Implementation with Reduction Clause

How does OpenMP parallelize a for loop declared with a reduction clause? OpenMP creates a team of threads and then shares the iterations of the for loop between the threads. Each thread has its own local copy of the reduction variable. The thread modifies only the local copy of this variable; therefore, there is no data race. When the threads join together, all the local copies of the reduction variable are combined into the global shared variable.
For example, let us parallelize the following for loop
sum = 0;
#pragma omp parallel for shared(a) reduction(+: sum)
for (auto i = 0; i < 9; i++)
{
    sum += a[i];
}
and let there be three threads in the team of threads. Each thread has sumloc, which is a local copy of the reduction variable. The threads then perform the following computations
  • Thread 1
sumloc_1 = a[0] + a[1] + a[2]
  • Thread 2
sumloc_2 = a[3] + a[4] + a[5]
  • Thread 3
sumloc_3 = a[6] + a[7] + a[8]
In the end, when the threads join together, OpenMP combines the local copies into the shared reduction variable
sum = sumloc_1 + sumloc_2 + sumloc_3
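The effect of the reduction clause can also be imitated by hand. The sketch below (the names sumloc and a, and the array contents, are illustrative) gives each thread its own local accumulator and combines the local copies under a critical section when the threads join:

#include <stdio.h>
#include <omp.h>
int main(void)
{
    int a[9] = {1, 2, 3, 4, 5, 6, 7, 8, 9};   /* illustrative data */
    int sum = 0;
    #pragma omp parallel
    {
        int sumloc = 0;                 /* each thread's private copy */
        /* share the iterations; every thread adds into its own sumloc */
        #pragma omp for
        for (int i = 0; i < 9; i++)
            sumloc += a[i];
        /* combine the local copies into the shared variable, one thread at a time */
        #pragma omp critical
        sum += sumloc;
    }
    printf("sum = %d\n", sum);   /* prints 45 */
    return 0;
}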




Example (C program): Display "Hello, world." using multiple threads.
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    printf("Hello, world.\n");
    return 0;
}
Use flag -fopenmp to compile using GCC:
$ gcc -fopenmp hello.c -o hello
Output on a computer with two cores, and thus two threads:
Hello, world.
Hello, world.
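The number of threads can also be set explicitly through the OMP_NUM_THREADS environment variable before running the program, for example:
$ OMP_NUM_THREADS=4 ./hello
which would print the message four times.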



Fig: Multithreading, where the master thread forks off a number of threads which execute blocks of code in parallel.




Input :
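A minimal sketch of a program performing the Min, Max and Sum reductions of this experiment is given below; the array contents and size are assumptions chosen for illustration, and the min/max reduction operators require OpenMP 3.1 or newer:

#include <stdio.h>
#include <omp.h>
int main(void)
{
    int a[10] = {4, 9, 1, 7, 3, 8, 2, 6, 5, 10};   /* illustrative data */
    int sum = 0;
    int min_val = a[0];
    int max_val = a[0];

    /* Sum reduction: each thread keeps a local sum, combined on join */
    #pragma omp parallel for reduction(+: sum)
    for (int i = 0; i < 10; i++)
        sum += a[i];

    /* Min and Max reductions: local minima/maxima are combined when the threads join */
    #pragma omp parallel for reduction(min: min_val) reduction(max: max_val)
    for (int i = 0; i < 10; i++)
    {
        if (a[i] < min_val) min_val = a[i];
        if (a[i] > max_val) max_val = a[i];
    }

    printf("Sum = %d\n", sum);
    printf("Min = %d\n", min_val);
    printf("Max = %d\n", max_val);
    return 0;
}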



Output :


 PARALLEL REDUCTION





Conclusion: We have implemented parallel reduction using Min, Max and Sum operations.
