algorithm - A Cache Efficient Matrix Transpose Program? -
so obvious way transpose matrix use :
for( int = 0; < n; i++ ) for( int j = 0; j < n; j++ ) destination[j+i*n] = source[i+j*n];
but want take advantage of locality , cache blocking. looking , can't find code this, i'm told should simple modification original. ideas?
edit: have 2000x2000 matrix, , want know how can change code using 2 for
loops, splitting matrix blocks transpose individually, 2x2 blocks, or 40x40 blocks, , see block size efficient.
edit2: matrices stored in column major order, matrix
a1 a2 a3 a4
is stored a1 a3 a2 a4
.
you're going want 4 loops - 2 iterate on blocks, , 2 perform transpose-copy of single block. assuming simplicity block size divides size of matrix, think, although i'd want draw pictures on backs of envelopes sure:
for (int = 0; < n; += blocksize) { (int j = 0; j < n; j += blocksize) { // transpose block beginning @ [i,j] (int k = i; k < + blocksize; ++k) { (int l = j; l < j + blocksize; ++l) { dst[k + l*n] = src[l + k*n]; } } } }
an important further insight there's cache-oblivious algorithm (see http://en.wikipedia.org/wiki/cache-oblivious_algorithm, uses exact problem example). informal definition of "cache-oblivious" don't need experiment tweaking parameters (in case blocksize) in order hit good/optimal cache performance. solution in case transpose recursively dividing matrix in half, , transposing halves correct position in destination.
whatever cache size is, recursion takes advantage of it. expect there's bit of management overhead compared strategy, use performance experiments to, in effect, jump straight point in recursion @ cache kicks in, , go no further. on other hand, performance experiments might give answer works on machine not on customers' machines.
Comments
Post a Comment