Optimize the MSE algorithm using OpenMP

I want to optimize the code below using OpenMP:

double val;
double m_y = 0.0f;
double m_u = 0.0f;
double m_v = 0.0f;

#define _MSE(m, t) \
val = refData[t] - calData[t];  \
m += val*val; 

#pragma omp parallel
{
    #pragma omp for
    for( i=0; i<(width*height)/2; i++ ) {  // yuv422: 2 pixels at a time
        _MSE(m_u, 0);
        _MSE(m_y, 1);
        _MSE(m_v, 2);
        _MSE(m_y, 3);

        #pragma omp reduction(+:refData) reduction(+:calData)
        refData += 4;
        calData += 4;
        // int id = omp_get_thread_num();
        // printf("Thread %d performed %d iterations of the loop\n", id, i);
    }
}

Any suggestions are welcome for optimizing the code above; currently I get wrong output.


I think the easiest thing you can do is split the work into 4 threads and calculate the U, Y, V, Y errors in each of them. Instead of making them separate variables, make them an array:

double sqError[4] = {0};
const int numBytes = width * height * 2;

#pragma omp parallel for
for( int elem = 0; elem < 4; elem++ ) {          // elem 0..3 -> U, Y, V, Y byte offsets
    for( int i = elem; i < numBytes; i += 4 ) {
        int val = refData[i] - calData[i];
        sqError[elem] += (double)(val*val);      // each thread touches only its own element
    }
}

This way, each thread operates exclusively on one thing and there is no contention.

Maybe it's not the most advanced use of OMP, but you should see a speedup.
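If you want the actual mean squared error rather than the raw sums, you still have to divide by the number of samples in each channel at the end. A minimal sketch of that step, assuming the U, Y, V, Y byte order your _MSE calls imply (and remember to compile with -fopenmp, otherwise the pragmas are silently ignored):

// Assumed UYVY layout: each 4-byte group holds one U, two Y and one V sample.
double mseU = sqError[0] / (double)(width * height / 2);            // U samples: one per 2 pixels
double mseV = sqError[2] / (double)(width * height / 2);            // V samples: one per 2 pixels
double mseY = (sqError[1] + sqError[3]) / (double)(width * height); // Y samples: one per pixel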


After your comment about the performance hit, I did some experiments and found that the performance was indeed worse. I suspect this may be due to cache misses.

You said:

performance hit this time with openMP : Time :0.040637 with serial Time :0.018670

So I reworked it, using a reduction on each variable and a single loop:

    double e0 = 0, e1 = 0, e2 = 0, e3 = 0;   // per-offset accumulators: U, Y, V, Y

    #pragma omp parallel for reduction(+:e0) reduction(+:e1) reduction(+:e2) reduction(+:e3)
    for( int i = 0; i < numBytes; i += 4 ) {
        int val = refData[i] - calData[i];
        e0 += (double)(val*val);
        val = refData[i+1] - calData[i+1];
        e1 += (double)(val*val);
        val = refData[i+2] - calData[i+2];
        e2 += (double)(val*val);
        val = refData[i+3] - calData[i+3];
        e3 += (double)(val*val);
    }

With my test case on a 4-core machine, I observed an over 4-fold improvement compared to the serial version:

serial:             2025 ms
omp with 2 loops:   6850 ms
omp with reduction: 455  ms
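
The reduction clause is what makes the difference: each thread accumulates into its own private copies of e0..e3, which are combined only once at the end of the loop. As a sketch I have not benchmarked: if your compiler supports OpenMP 4.5 array-section reductions, you can keep the array form from the first version and still get thread-private copies:

double sqError[4] = {0};

#pragma omp parallel for reduction(+:sqError[:4])
for( int i = 0; i < numBytes; i += 4 ) {
    for( int c = 0; c < 4; c++ ) {             // c = 0..3 -> U, Y, V, Y byte offsets
        int val = refData[i+c] - calData[i+c];
        sqError[c] += (double)(val*val);       // writes go to this thread's private copy
    }
}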


[Edit] On the subject of why the first piece of code performed worse than the non-parallel version, Hristo Iliev said:

Your first piece of code is a terrible example of what false sharing does in multithreaded codes. As sqError has only 4 elements of 8 bytes each, it fits in a single cache line (even in a half cache line on modern x86 CPUs). With 4 threads constantly writing to neighbouring elements, this would generate a massive amount of inter-core cache invalidation due to false sharing. One can get around this by using instead a structure like this struct _error { double val; double pad[7]; } sqError[4]; Now each sqError[i].val will be in a separate cache line, hence no false sharing.
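
To make the padding idea concrete, here is a sketch of the first loop with each accumulator padded out to its own cache line (the struct is taken verbatim from the comment above; the 64-byte cache line is my assumption):

struct _error { double val; double pad[7]; } sqError[4] = {{0}};   // 64 bytes per element

#pragma omp parallel for
for( int elem = 0; elem < 4; elem++ ) {
    for( int i = elem; i < numBytes; i += 4 ) {
        int val = refData[i] - calData[i];
        sqError[elem].val += (double)(val*val);   // each .val now lives in its own cache line
    }
}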

