How does the order of the combination of numpy affect the speed of multiplication?


How does the memory order of a numpy array influence multiplication speed? And how can I choose the order automatically depending on the size of the matrices?

The question originally comes from code that uses cudamat:

import numpy as np
import time
import cudamat as cm

def test_mat():
    # cm.cublas_init()  # initialize cublas if needed

    n = 1024

    for i in xrange(1, 20):  # 2^15 max or python fails
        m = 2 ** i
        print i
        try:
            # time the product on the CPU
            t0 = time.time()
            # cpum1 = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='C')
            # cpum2 = np.array(np.random.rand(m, 1)*10, dtype=np.float32, order='C')
            # does CUDA need Fortran-ordered arrays for speed?
            cpum1 = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='F')
            cpum2 = np.array(np.random.rand(m, 1)*10, dtype=np.float32, order='F')
            c = np.dot(cpum2.T, cpum1.T)
            print (time.time() - t0)

            # time the same product on the GPU, including the transfers
            t0 = time.time()
            gpum1 = cm.CUDAMatrix(cpum1)
            gpum2 = cm.CUDAMatrix(cpum2)
            gm = cm.dot(gpum2.T, gpum1.T)
            gm.copy_to_host()
            print (time.time() - t0)
        except:
            pass

    # cm.cublas_shutdown()

    print 'done'

Here are some tests I have done, but I need a theoretical point of view.

def test_order(m, n):
    # default order (C)
    a = np.array(np.random.rand(m, n)*10, dtype=np.float32)
    b = np.array(np.random.rand(n, m)*10, dtype=np.float32)

    t0 = time.time()
    c = np.dot(a, b)
    print (time.time() - t0)

    # 1: C x C
    a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order='C')
    b = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='C')

    t0 = time.time()
    c = np.dot(a, b)
    print (time.time() - t0)

    # 2: C x F
    a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order='C')
    b = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='F')

    t0 = time.time()
    c = np.dot(a, b)
    print (time.time() - t0)

    # 3: F x C
    a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order='F')
    b = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='C')

    t0 = time.time()
    c = np.dot(a, b)
    print (time.time() - t0)

    # 4: F x F
    a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order='F')
    b = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='F')

    t0 = time.time()
    c = np.dot(a, b)
    print (time.time() - t0)

    print 'done'

m = 1024*10, n = 1024*1:
default: 7.125
#1 (C,C): 7.14100003242
#2 (C,F): 6.95299983025
#3 (F,C): 8.14100003242
#4 (F,F): 7.15600013733

m = 1024*1, n = 1024*10:
default: 0.718999862671
#1 (C,C): 0.734000205994
#2 (C,F): 0.641000032425
#3 (F,C): 0.656000137329
#4 (F,F): 0.655999898911
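Since the fastest combination evidently depends on the shapes (and, as the answer below explains, on the BLAS library), one pragmatic way to "auto choose" the order is to benchmark the four combinations for the shapes you actually use. A minimal sketch; best_order is a hypothetical helper, not part of the original code:

import time
import numpy as np

def best_order(m, n, repeats=3):
    # Hypothetical helper: time all four order combinations for an
    # (m, n) x (n, m) float32 product and return the fastest pair.
    timings = {}
    for oa in ('C', 'F'):
        for ob in ('C', 'F'):
            a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order=oa)
            b = np.array(np.random.rand(n, m)*10, dtype=np.float32, order=ob)
            best = float('inf')
            for _ in range(repeats):
                t0 = time.time()
                np.dot(a, b)
                best = min(best, time.time() - t0)
            timings[(oa, ob)] = best
    return min(timings, key=timings.get), timings

On a machine where the BLAS copies the operands anyway, the four timings should come out nearly identical, which is itself a useful diagnostic.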

Here is the code testing peak memory usage:

import numpy as np
import time
from memory_profiler import profile

@profile
def test_order_():

    m = 1024*1
    n = 1024*10

    # what np.dot(a, b) allocates for the result by default
    c = np.array(np.zeros((m, m)), dtype=np.float32, order='C')
    #c = np.array(np.zeros((m, m)), dtype=np.float32, order='F')

    # 1: C x C
    a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order='C')
    b = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='C')

    t0 = time.time()
    c[:] = np.dot(a, b)
    # np.dot(a, b, out=c)  # works only with a C-ordered output array!
    print (time.time() - t0)

    del a
    del b
    # del c

    # 2: C x F
    a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order='C')
    b = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='F')

    t0 = time.time()
    c[:] = np.dot(a, b)
    # np.dot(a, b, out=c)  # works only with a C-ordered output array!
    print (time.time() - t0)

    del a
    del b
    # del c

    # 3: F x C
    a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order='F')
    b = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='C')

    t0 = time.time()
    c[:] = np.dot(a, b)
    # np.dot(a, b, out=c)  # works only with a C-ordered output array!
    print (time.time() - t0)

    del a
    del b
    # del c

    # 4: F x F
    a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order='F')
    b = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='F')

    t0 = time.time()
    c[:] = np.dot(a, b)
    # np.dot(a, b, out=c)  # works only with a C-ordered output array!
    print (time.time() - t0)

    del a
    del b
    # del c

    print 'done'

if __name__ == '__main__':
    test_order_()
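The commented-out np.dot(a, b, out=c) lines above point at a way to skip allocating a fresh result array on every call: NumPy documents that out must be C-contiguous and have exactly the dtype that dot(a, b) would return. A small sketch of that constraint (the shapes are illustrative):

import numpy as np

m, n = 1024, 1024*10
a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order='F')
b = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='F')

# out must be C-contiguous and match the result dtype exactly
c = np.empty((m, m), dtype=np.float32, order='C')
np.dot(a, b, out=c)   # result is written in place, no new allocation

# an F-ordered output buffer is rejected
c_f = np.empty((m, m), dtype=np.float32, order='F')
try:
    np.dot(a, b, out=c_f)
except ValueError:
    print("np.dot needs a C-contiguous out array")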

I also found some info about the copies made by numpy.dot and about fast_dot:

The internal workings of dot are a little obscure, as it tries to use BLAS optimized routines, which sometimes require copies of arrays to be in Fortran order
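A practical consequence of that quote: an array's flags tell you whether it is already laid out the way BLAS wants it, and transposing flips those flags without copying any data, which is why dot(cpum2.T, cpum1.T) in the cudamat test computes the transposed product from the same memory. For example:

import numpy as np

a = np.array(np.random.rand(4, 5), dtype=np.float32, order='F')

print(a.flags['C_CONTIGUOUS'])    # False
print(a.flags['F_CONTIGUOUS'])    # True

# the transpose is a view: same data, flags swapped, no copy made
print(a.T.flags['C_CONTIGUOUS'])  # True
print(a.T.base is a)              # True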

There are also some performance tips. Strangely, I can't reproduce the results each time I run the example. (Maybe some data gets cached between reruns?)


The performance depends on the underlying linear algebra library that you have.

# ORDER C-C
In [6]: %timeit a.dot(b)
10 loops, best of 3: 87.6 ms per loop

# ORDER C-F
In [8]: %timeit a.dot(b)
10 loops, best of 3: 87.8 ms per loop

# ORDER F-C
In [10]: %timeit a.dot(b)
10 loops, best of 3: 90.1 ms per loop

# ORDER F-F
In [12]: %timeit a.dot(b)
10 loops, best of 3: 90 ms per loop

I am using ATLAS compiled on this machine with SSE3, as shown by np.show_config(). Re-running the computations shows that there is no statistically significant difference between the orderings. And indeed there is no difference, because the library makes a copy of the array before performing the product. That copy takes about 650 µs (including Python overhead), which is well below the times you measured. As the matrices grow, the dot product dominates and you don't see the effect of the copy; if you use smaller matrices, the Python overhead masks any effect.
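To put that copy cost in perspective, you can time an explicit layout conversion against the product itself; a minimal sketch (the shape is illustrative):

import time
import numpy as np

n = 2048
a = np.array(np.random.rand(n, n), dtype=np.float32, order='F')
b = np.array(np.random.rand(n, n), dtype=np.float32, order='F')

t0 = time.time()
a_c = np.ascontiguousarray(a)   # explicit F -> C copy, O(n^2)
copy_time = time.time() - t0

t0 = time.time()
np.dot(a, b)                    # the product itself, O(n^3)
dot_time = time.time() - t0

# for large n the copy is a vanishing fraction of the total
print(copy_time, dot_time)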

You can see the copies actually happening if you monitor the memory and use very large arrays.
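With the memory_profiler approach from the question, a transient allocation roughly the size of an operand in the line-by-line report is the copy; a minimal sketch (whether it appears depends on your BLAS wrapper):

import numpy as np
from memory_profiler import profile

@profile
def watch_dot_copy(n=4096):
    # F-ordered float32 operands; depending on the BLAS interface,
    # dot may make C-ordered copies before calling into the library
    a = np.array(np.random.rand(n, n), dtype=np.float32, order='F')
    b = np.array(np.random.rand(n, n), dtype=np.float32, order='F')
    c = np.dot(a, b)   # watch this line for a transient increment
    return c

if __name__ == '__main__':
    watch_dot_copy()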

