CUDA reduction using registers
I need to calculate the mean values of n signals using reduction. The input is a 1D array of size m*n, where m is the length of each signal.
Originally I used additional shared memory: I first copied the data there, then did the reduction on each signal. However, the original data got corrupted.
My program tries to minimize shared memory use, so I am wondering how I can use registers to do the reduction sum on the n signals. I have n threads and a shared memory array (float) s_m[n*m], where elements 0...m-1 are the first signal, and so on.
Do I need n registers (or just one) to store the mean values of the n different signals? (I know how to do sequential addition using multi-threaded programming and 1 register.) As the next step, I want to subtract from every value in the input the mean of its corresponding signal.
Your problem is small (n = 32 and m < 128). However, here are some guidelines,
assuming you are reducing across the m values of each of the n signals:
- If n is large (more than tens of thousands), do the reductions over m sequentially in each thread.
- If n is less than tens of thousands, consider using 1 warp or 1 thread block to perform each of the n reductions.
- If n is small but m is large, consider using multiple thread blocks for each of the n reductions.
- If n is small and m is small (as your numbers are), only consider using the GPU for the reductions if the computations that generate and/or consume the input/output of the reductions are also running on the GPU.
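As a rough illustration of the "1 warp per reduction" guideline, here is a minimal sketch, not a definitive implementation. It assumes the n signals sit contiguously in global memory (signal s occupies elements s*m ... s*m + m-1) and that each block is launched with exactly 32 threads. The names subtractMeanPerWarp and maxResidualMean are hypothetical and chosen only for this example. Each thread keeps its partial sum in a single register, the warp combines those registers with shuffle instructions, and no shared memory is used, so the input is only overwritten in the final subtraction step.

```cuda
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Assumption: launched with one 32-thread block (one warp) per signal.
__global__ void subtractMeanPerWarp(float *data, int m)
{
    const unsigned fullMask = 0xffffffffu;
    float *signal = data + (size_t)blockIdx.x * m;  // block index == signal index
    const int lane = threadIdx.x;

    // Each lane accumulates a strided partial sum in one register.
    float sum = 0.0f;
    for (int i = lane; i < m; i += warpSize)
        sum += signal[i];

    // Tree reduction across the warp via register shuffles.
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        sum += __shfl_down_sync(fullMask, sum, offset);

    // Lane 0 now holds the total; broadcast the mean to every lane.
    const float mean = __shfl_sync(fullMask, sum, 0) / m;

    // Each lane rewrites only the elements it read itself.
    for (int i = lane; i < m; i += warpSize)
        signal[i] -= mean;
}

// Hypothetical host helper: fills n signals with test data, runs the kernel,
// and returns the largest absolute per-signal mean left afterwards
// (NAN if a CUDA call fails).
float maxResidualMean(int n, int m)
{
    assert(n > 0 && m > 0);
    std::vector<float> h((size_t)n * m);
    for (int s = 0; s < n; ++s)
        for (int i = 0; i < m; ++i)
            h[(size_t)s * m + i] = 0.5f * s + (float)(i % 7);

    float *d = nullptr;
    const size_t bytes = h.size() * sizeof(float);
    if (cudaMalloc(&d, bytes) != cudaSuccess) return NAN;
    cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);

    subtractMeanPerWarp<<<n, 32>>>(d, m);

    // This blocking copy also reports errors from the kernel launch.
    cudaError_t err = cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return NAN;

    double worst = 0.0;
    for (int s = 0; s < n; ++s) {
        double acc = 0.0;
        for (int i = 0; i < m; ++i)
            acc += h[(size_t)s * m + i];
        worst = std::max(worst, std::fabs(acc / m));
    }
    return (float)worst;
}
```

Calling maxResidualMean(32, 100) from a host program should return a value close to zero if the kernel ran correctly. With n = 32 and m < 128 this launches only 32 warps, so, as noted above, it is probably only worthwhile when the surrounding computation already runs on the GPU.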