CUDA reduction using registers


I need to calculate the mean values of n signals using a reduction. The input is a 1D array of size m*n, where m is the length of each signal.

Originally I had additional shared memory to first copy the data into and then do the reduction on each signal. However, the original data ended up corrupted.

My program tries to minimize shared memory use, so I was wondering how I can use registers to do a sum reduction over the n signals. I have n threads and a shared memory array (float) s_m[n*m], where elements 0 ... m-1 hold the first signal, and so on.

Do I need n registers (or just one) to store the mean values of the n different signals? (I know how to do sequential addition using multi-threaded programming and 1 register.) As a next step, I want to subtract the corresponding signal's mean from every value in the input.

Your problem is small (n = 32 and m < 128). However, here are some guidelines:

Assuming you are reducing across the m values of each of the n signals:

  • If n is very large (more than tens of thousands), just do the reduction over m sequentially in each thread, one thread per signal.
  • If n is less than tens of thousands, consider using 1 warp or 1 thread block to perform each of the n reductions.
  • If n is small but m is very large, consider using multiple thread blocks for each of the n reductions.
  • If both n and m are small (as your numbers are), only consider using the GPU for the reductions if the computations that generate and/or consume the input/output of the reductions are also running on the GPU.
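As a rough illustration of the "1 warp per reduction" guideline, here is a minimal sketch of how the sum could stay in registers. It is not taken from the question: the names `subtractMeanPerSignal` and `subtractMeansOnGpu` are hypothetical, the data is assumed to sit in global memory with signal s occupying elements s*m ... s*m + m - 1, and the warp shuffle intrinsics assume CUDA 9 or later. Each thread needs only one accumulator register, not n of them.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: one block of exactly one warp (32 threads) per signal.
// Assumes signal s occupies data[s*m .. s*m + m - 1].
__global__ void subtractMeanPerSignal(float *data, int n, int m)
{
    int signal = blockIdx.x;
    int lane   = threadIdx.x;            // 0..31
    if (signal >= n) return;             // uniform across the warp

    float *sig = data + (size_t)signal * m;

    // Each lane accumulates a strided partial sum in a single register.
    float sum = 0.0f;
    for (int i = lane; i < m; i += 32)
        sum += sig[i];

    // Tree reduction across the warp using register shuffles;
    // no shared memory is involved.
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffffu, sum, offset);

    // Lane 0 now holds the total; broadcast the mean to all lanes.
    float mean = __shfl_sync(0xffffffffu, sum, 0) / m;

    // Subtract the signal's mean from each of its values, in place.
    for (int i = lane; i < m; i += 32)
        sig[i] -= mean;
}

// Hypothetical host helper: copies the data to the device, runs the kernel,
// and copies the result back. Returns false on any CUDA error.
bool subtractMeansOnGpu(std::vector<float> &h, int n, int m)
{
    float *d = nullptr;
    size_t bytes = h.size() * sizeof(float);
    if (cudaMalloc(&d, bytes) != cudaSuccess) return false;

    bool ok = cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        subtractMeanPerSignal<<<n, 32>>>(d, n, m);
        ok = cudaGetLastError() == cudaSuccess;
    }
    // This copy also waits for the kernel to finish.
    ok = ok && cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d);
    return ok;
}
```

With n = 32 and m < 128, each lane reads at most 4 values before the shuffle step. If the signals already live in shared memory, the same pattern should apply with `sig` pointing into that array instead, though at this problem size the last guideline above still holds.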
