How does the order of the combination of numpy affect the speed of multiplication?


How does the memory order of a numpy array influence multiplication speed? And how can I choose the order automatically depending on the size of the matrices?

The question originally comes from code that uses cudamat:

import numpy as np
import time
import cudamat as cm

def test_mat():
    # cm.cublas_init()  # initialize cublas if needed

    n = 1024

    for i in xrange(1, 20):  # 2^15 max or python fails
        m = 2 ** i
        print i
        try:
            # time the product on the CPU
            t0 = time.time()
            # cpum1 = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='C')
            # cpum2 = np.array(np.random.rand(m, 1)*10, dtype=np.float32, order='C')
            # does CUDA need Fortran-ordered arrays for speed?
            cpum1 = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='F')
            cpum2 = np.array(np.random.rand(m, 1)*10, dtype=np.float32, order='F')
            c = np.dot(cpum2.T, cpum1.T)
            print (time.time() - t0)

            # time the same product on the GPU, including the transfers
            t0 = time.time()
            gpum1 = cm.CUDAMatrix(cpum1)
            gpum2 = cm.CUDAMatrix(cpum2)
            gm = cm.dot(gpum2.T, gpum1.T)
            gm.copy_to_host()
            print (time.time() - t0)
        except:
            pass

    # cm.cublas_shutdown()

    print 'done'

Here are some tests I have done, but I need a theoretical point of view.

def test_order(m, n):
    # default order (C)
    a = np.array(np.random.rand(m, n)*10, dtype=np.float32)
    b = np.array(np.random.rand(n, m)*10, dtype=np.float32)

    t0 = time.time()
    c = np.dot(a, b)
    print (time.time() - t0)

    # 1: C x C
    a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order='C')
    b = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='C')

    t0 = time.time()
    c = np.dot(a, b)
    print (time.time() - t0)

    # 2: C x F
    a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order='C')
    b = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='F')

    t0 = time.time()
    c = np.dot(a, b)
    print (time.time() - t0)

    # 3: F x C
    a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order='F')
    b = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='C')

    t0 = time.time()
    c = np.dot(a, b)
    print (time.time() - t0)

    # 4: F x F
    a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order='F')
    b = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='F')

    t0 = time.time()
    c = np.dot(a, b)
    print (time.time() - t0)

    print 'done'

m = 1024*10, n = 1024*1:
default: 7.125
#1 (C,C): 7.14100003242
#2 (C,F): 6.95299983025
#3 (F,C): 8.14100003242
#4 (F,F): 7.15600013733

m = 1024*1, n = 1024*10:
default: 0.718999862671
#1 (C,C): 0.734000205994
#2 (C,F): 0.641000032425
#3 (F,C): 0.656000137329
#4 (F,F): 0.655999898911
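Since the fastest combination evidently depends on the shapes (and, as the answer below explains, on the BLAS library), one pragmatic way to "auto choose" the order is to benchmark the four combinations for the shapes you actually use. A minimal sketch; best_order is a hypothetical helper, not part of the original code:

import time
import numpy as np

def best_order(m, n, repeats=3):
    # Hypothetical helper: time all four order combinations for an
    # (m, n) x (n, m) float32 product and return the fastest pair.
    timings = {}
    for oa in ('C', 'F'):
        for ob in ('C', 'F'):
            a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order=oa)
            b = np.array(np.random.rand(n, m)*10, dtype=np.float32, order=ob)
            best = float('inf')
            for _ in range(repeats):
                t0 = time.time()
                np.dot(a, b)
                best = min(best, time.time() - t0)
            timings[(oa, ob)] = best
    return min(timings, key=timings.get), timings

On a machine where the BLAS copies the operands anyway, the four timings should come out nearly identical, which is itself a useful diagnostic.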

Here is the code testing peak memory usage:

import numpy as np
import time
from memory_profiler import profile

@profile
def test_order_():

    m = 1024*1
    n = 1024*10

    # what np.dot(a, b) allocates for the result by default
    c = np.array(np.zeros((m, m)), dtype=np.float32, order='C')
    #c = np.array(np.zeros((m, m)), dtype=np.float32, order='F')

    # 1: C x C
    a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order='C')
    b = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='C')

    t0 = time.time()
    c[:] = np.dot(a, b)
    # np.dot(a, b, out=c)  # works only with a C-ordered output array!
    print (time.time() - t0)

    del a
    del b
    # del c

    # 2: C x F
    a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order='C')
    b = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='F')

    t0 = time.time()
    c[:] = np.dot(a, b)
    # np.dot(a, b, out=c)  # works only with a C-ordered output array!
    print (time.time() - t0)

    del a
    del b
    # del c

    # 3: F x C
    a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order='F')
    b = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='C')

    t0 = time.time()
    c[:] = np.dot(a, b)
    # np.dot(a, b, out=c)  # works only with a C-ordered output array!
    print (time.time() - t0)

    del a
    del b
    # del c

    # 4: F x F
    a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order='F')
    b = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='F')

    t0 = time.time()
    c[:] = np.dot(a, b)
    # np.dot(a, b, out=c)  # works only with a C-ordered output array!
    print (time.time() - t0)

    del a
    del b
    # del c

    print 'done'

if __name__ == '__main__':
    test_order_()
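The commented-out np.dot(a, b, out=c) lines above point at a way to skip allocating a fresh result array on every call: NumPy documents that out must be C-contiguous and have exactly the dtype that dot(a, b) would return. A small sketch of that constraint (the shapes are illustrative):

import numpy as np

m, n = 1024, 1024*10
a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order='F')
b = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='F')

# out must be C-contiguous and match the result dtype exactly
c = np.empty((m, m), dtype=np.float32, order='C')
np.dot(a, b, out=c)   # result is written in place, no new allocation

# an F-ordered output buffer is rejected
c_f = np.empty((m, m), dtype=np.float32, order='F')
try:
    np.dot(a, b, out=c_f)
except ValueError:
    print("np.dot needs a C-contiguous out array")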

I also found some info about the copies made by numpy.dot and about fast_dot:

The internal workings of dot are a little obscure, as it tries to use BLAS optimized routines, which sometimes require copies of arrays to be in Fortran order
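A practical consequence of that quote: an array's flags tell you whether it is already laid out the way BLAS wants it, and transposing flips those flags without copying any data, which is why dot(cpum2.T, cpum1.T) in the cudamat test computes the transposed product from the same memory. For example:

import numpy as np

a = np.array(np.random.rand(4, 5), dtype=np.float32, order='F')

print(a.flags['C_CONTIGUOUS'])    # False
print(a.flags['F_CONTIGUOUS'])    # True

# the transpose is a view: same data, flags swapped, no copy made
print(a.T.flags['C_CONTIGUOUS'])  # True
print(a.T.base is a)              # True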

There are also some performance tips. Strangely, I can't reproduce the results each time I run the example. (Maybe some data gets cached between reruns?)


The performance depends on the underlying linear algebra library that you have.

# ORDER C-C
In [6]: %timeit a.dot(b)
10 loops, best of 3: 87.6 ms per loop

# ORDER C-F
In [8]: %timeit a.dot(b)
10 loops, best of 3: 87.8 ms per loop

# ORDER F-C
In [10]: %timeit a.dot(b)
10 loops, best of 3: 90.1 ms per loop

# ORDER F-F
In [12]: %timeit a.dot(b)
10 loops, best of 3: 90 ms per loop

I am using ATLAS compiled on this machine with SSE3, as shown by np.show_config(). Re-running the computations shows that there is no statistically significant difference between the orderings. And indeed there is no difference, because the library makes a copy of the array before performing the product. That copy takes about 650 µs (including Python overhead), which is well below the times you measured. As the matrices grow, the dot product dominates and you don't see the effect of the copy; if you use smaller matrices, the Python overhead masks any effect.
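To put that copy cost in perspective, you can time an explicit layout conversion against the product itself; a minimal sketch (the shape is illustrative):

import time
import numpy as np

n = 2048
a = np.array(np.random.rand(n, n), dtype=np.float32, order='F')
b = np.array(np.random.rand(n, n), dtype=np.float32, order='F')

t0 = time.time()
a_c = np.ascontiguousarray(a)   # explicit F -> C copy, O(n^2)
copy_time = time.time() - t0

t0 = time.time()
np.dot(a, b)                    # the product itself, O(n^3)
dot_time = time.time() - t0

# for large n the copy is a vanishing fraction of the total
print(copy_time, dot_time)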

You can see the copies actually happening if you monitor the memory and use very large arrays.
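With the memory_profiler approach from the question, a transient allocation roughly the size of an operand in the line-by-line report is the copy; a minimal sketch (whether it appears depends on your BLAS wrapper):

import numpy as np
from memory_profiler import profile

@profile
def watch_dot_copy(n=4096):
    # F-ordered float32 operands; depending on the BLAS interface,
    # dot may make C-ordered copies before calling into the library
    a = np.array(np.random.rand(n, n), dtype=np.float32, order='F')
    b = np.array(np.random.rand(n, n), dtype=np.float32, order='F')
    c = np.dot(a, b)   # watch this line for a transient increment
    return c

if __name__ == '__main__':
    watch_dot_copy()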

