Friday 15 April 2011

performance - Is branch divergence really so bad?


I've seen many questions scattered across the Internet about branch divergence and how to avoid it. However, even after reading dozens of articles on how CUDA works, I can't see how avoiding branch divergence helps in most cases. Before anyone jumps on me with claws outstretched, allow me to describe what I consider "most cases" to be.

It seems to me that most instances of branch divergence involve a number of very distinct code blocks. For example, we have the following scenario:

  if (a):
      foo(a)
  else:
      bar(b)

If I have two threads that encounter this divergence, thread 1 will execute first, taking path A. Following this, thread 2 will take path B. In order to remove the divergence, we could rewrite the block above to read like this:

  foo(a)
  bar(b)

Assuming it is safe to call foo(a) on thread 2 and bar(b) on thread 1, one might expect a performance improvement. However, here is the way I see it:

In the first case, threads 1 and 2 execute in serial. Call this two clock cycles.

In the second case, threads 1 and 2 execute foo(a) in parallel, then execute bar(b) in parallel. That still looks to me like two clock cycles. The difference is that in the former case, if foo(a) involves a read from memory, I imagine thread 2 can begin execution during that latency, in which case, isn't the divergent code actually faster?
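To make the comparison concrete, here is a minimal CUDA sketch of the two versions being discussed. Everything in it is illustrative: foo and bar are hypothetical device functions, and the kernel and parameter names are assumptions, not code from the question.

  // Hypothetical device functions, used only for illustration.
  __device__ float foo(float a) { return a * 2.0f; }
  __device__ float bar(float b) { return b + 1.0f; }

  // Version 1: divergent branch. Threads in the same warp that take
  // different sides of the 'if' are serialized against each other.
  __global__ void divergent_kernel(const float *in, float *out, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i >= n) return;
      if (in[i] > 0.0f)
          out[i] = foo(in[i]);
      else
          out[i] = bar(in[i]);
  }

  // Version 2: every thread runs both calls (the question assumes this
  // is "safe"), so there is no divergence, but no thread does less work.
  __global__ void uniform_kernel(const float *in, float *out, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i >= n) return;
      float a = foo(in[i]);
      float b = bar(in[i]);
      out[i] = a + b;   // arbitrary way of combining the two results
  }

Whether version 2 is ever worth it depends entirely on what foo and bar cost, which is part of what the question is asking.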

You're assuming (at least it's the example you give and the only reference you make) that the only way to avoid branch divergence is to allow all threads to execute all the code.

In that case I agree there's not much difference.

But avoiding branch divergence probably has more to do with restructuring the algorithm at a higher level than with simply adding or removing some if statements and making the code "safe" to execute in all threads.

I'll offer one example. Suppose I know that odd threads need to handle the blue component of a pixel and even threads need to handle the green component:

  #define N 2        // number of pixel components
  #define BLUE 0
  #define GREEN 1
  // pixel order: px0BL px0GR px1BL px1GR ...

  if (threadIdx.x & 1)  foo(pixel(N*threadIdx.x + BLUE));
  else                  bar(pixel(N*threadIdx.x + GREEN));

This means that every alternate thread is taking a given path, whether it be foo or bar. So now my warp takes twice as long to execute.

However, if I rearrange my pixel data so that the color components are contiguous, perhaps in chunks of 32 pixels: BL0 BL1 BL2 ... GR0 GR1 GR2 ...
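As a purely illustrative sketch of that reordering, here is one way it could be done on the host; the repack_pixels name, the float component type, and the CHUNK constant are assumptions, not part of the original answer.

  #include <stddef.h>

  #define CHUNK 32   // pixels per color chunk, i.e. one warp's worth

  // Repack interleaved components (BL0 GR0 BL1 GR1 ...) into chunks of
  // 32 contiguous blue values followed by 32 contiguous green values:
  // BL0..BL31 GR0..GR31 BL32..BL63 GR32..GR63 ...
  // For simplicity this assumes npixels is a multiple of CHUNK.
  void repack_pixels(const float *interleaved, float *chunked, size_t npixels)
  {
      for (size_t p = 0; p < npixels; ++p) {
          size_t chunk = p / CHUNK;            // which 32-pixel chunk
          size_t lane  = p % CHUNK;            // position within the chunk
          size_t base  = chunk * 2 * CHUNK;    // start of this chunk's data
          chunked[base + lane]         = interleaved[2 * p + 0];   // blue
          chunked[base + CHUNK + lane] = interleaved[2 * p + 1];   // green
      }
  }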

I can write similar code:

  if (threadIdx.x & 32)  foo(pixel(threadIdx.x));
  else                   bar(pixel(threadIdx.x));

It still looks like I have the possibility of divergence, but since the divergence occurs on warp boundaries, a given warp executes either the if path or the else path, so no actual divergence occurs.
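For context, this is roughly how that branch might sit inside a full kernel; foo and bar are the same placeholder device functions as in the earlier sketch, and the kernel name and indexing are assumptions. Bit 5 of threadIdx.x is identical for all 32 threads of a warp, so every warp evaluates the condition uniformly and executes only one side of the branch.

  // Sketch of the warp-aligned version on the chunked layout.
  __global__ void process_chunked(float *pixel, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i >= n) return;
      // threadIdx.x & 32 is the same across each warp of 32 consecutive
      // threads, so a whole warp takes the 'if' side or the 'else' side.
      if (threadIdx.x & 32)
          pixel[i] = foo(pixel[i]);   // this warp handles one component
      else
          pixel[i] = bar(pixel[i]);   // the neighboring warp handles the other
  }

(For the global indexing to line up with the 32-pixel chunks, the block size would need to be a multiple of 64 here; that detail is an assumption of this sketch, not something stated in the original answer.)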

This is a trivial example, and probably a silly one, but it illustrates that there may be ways to work around warp divergence that don't involve running all the code of all the divergent paths.
