Optimize the MSE algorithm using OpenMP

I want to optimize the code below using OpenMP:

double val;
double m_y = 0.0f;
double m_u = 0.0f;
double m_v = 0.0f;

#define _MSE(m, t) \
val = refData[t] - calData[t];  \
m += val*val; 

#pragma omp parallel
{
    #pragma omp for
    for( i=0; i<(width*height)/2; i++ ) {  // yuv422: 2 pixels at a time
        _MSE(m_u, 0);
        _MSE(m_y, 1);
        _MSE(m_v, 2);
        _MSE(m_y, 3);

        #pragma omp reduction(+:refData) reduction(+:calData)
        refData += 4;
        calData += 4;
        // int id = omp_get_thread_num();
        // printf("Thread %d performed %d iterations of the loop\n", id, i);
    }
}

Any suggestions are welcome for optimizing the code above; currently I get wrong output.


I think the easiest thing you can do is split the work into 4 threads and calculate the U, Y, V, Y errors in each of them. Instead of making them separate variables, make them an array:

double sqError[4] = {0};
const int numBytes = width * height * 2;

#pragma omp parallel for
for( int elem = 0; elem < 4; elem++ ) {          // elem 0..3 -> U, Y, V, Y byte offsets
    for( int i = elem; i < numBytes; i += 4 ) {
        int val = refData[i] - calData[i];
        sqError[elem] += (double)(val*val);      // each thread touches only its own element
    }
}

This way, each thread operates exclusively on one thing and there is no contention.

Maybe it's not the most advanced use of OMP, but you should see a speedup.
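If you want the actual mean squared error rather than the raw sums, you still have to divide by the number of samples in each channel at the end. A minimal sketch of that step, assuming the U, Y, V, Y byte order your _MSE calls imply (and remember to compile with -fopenmp, otherwise the pragmas are silently ignored):

// Assumed UYVY layout: each 4-byte group holds one U, two Y and one V sample.
double mseU = sqError[0] / (double)(width * height / 2);            // U samples: one per 2 pixels
double mseV = sqError[2] / (double)(width * height / 2);            // V samples: one per 2 pixels
double mseY = (sqError[1] + sqError[3]) / (double)(width * height); // Y samples: one per pixel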


After your comment about the performance hit, I did some experiments and found that the performance was indeed worse. I suspect this may be due to cache misses.

You said:

performance hit this time with openMP : Time :0.040637 with serial Time :0.018670

So I reworked it, using a reduction on each variable and a single loop:

    double e0 = 0, e1 = 0, e2 = 0, e3 = 0;   // per-offset accumulators: U, Y, V, Y

    #pragma omp parallel for reduction(+:e0) reduction(+:e1) reduction(+:e2) reduction(+:e3)
    for( int i = 0; i < numBytes; i += 4 ) {
        int val = refData[i] - calData[i];
        e0 += (double)(val*val);
        val = refData[i+1] - calData[i+1];
        e1 += (double)(val*val);
        val = refData[i+2] - calData[i+2];
        e2 += (double)(val*val);
        val = refData[i+3] - calData[i+3];
        e3 += (double)(val*val);
    }

With my test case on a 4-core machine, I observed an over 4-fold improvement compared to the serial version:

serial:             2025 ms
omp with 2 loops:   6850 ms
omp with reduction: 455  ms
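
The reduction clause is what makes the difference: each thread accumulates into its own private copies of e0..e3, which are combined only once at the end of the loop. As a sketch I have not benchmarked: if your compiler supports OpenMP 4.5 array-section reductions, you can keep the array form from the first version and still get thread-private copies:

double sqError[4] = {0};

#pragma omp parallel for reduction(+:sqError[:4])
for( int i = 0; i < numBytes; i += 4 ) {
    for( int c = 0; c < 4; c++ ) {             // c = 0..3 -> U, Y, V, Y byte offsets
        int val = refData[i+c] - calData[i+c];
        sqError[c] += (double)(val*val);       // writes go to this thread's private copy
    }
}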


[Edit] On the subject of why the first piece of code performed worse than the non-parallel version, Hristo Iliev said:

Your first piece of code is a terrible example of what false sharing does in multithreaded codes. As sqError has only 4 elements of 8 bytes each, it fits in a single cache line (even in a half cache line on modern x86 CPUs). With 4 threads constantly writing to neighbouring elements, this would generate a massive amount of inter-core cache invalidation due to false sharing. One can get around this by using instead a structure like this struct _error { double val; double pad[7]; } sqError[4]; Now each sqError[i].val will be in a separate cache line, hence no false sharing.
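
To make the padding idea concrete, here is a sketch of the first loop with each accumulator padded out to its own cache line (the struct is taken verbatim from the comment above; the 64-byte cache line is my assumption):

struct _error { double val; double pad[7]; } sqError[4] = {{0}};   // 64 bytes per element

#pragma omp parallel for
for( int elem = 0; elem < 4; elem++ ) {
    for( int i = elem; i < numBytes; i += 4 ) {
        int val = refData[i] - calData[i];
        sqError[elem].val += (double)(val*val);   // each .val now lives in its own cache line
    }
}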

