Excessive loop unrolling reduces performance in CUDA kernels primarily by increasing register pressure and degrading instruction cache efficiency. Register pressure occurs because every thread in a GPU warp requires its own set of hardware registers to store local variables; when you unroll a loop, the compiler must allocate more registers simultaneously to hold the data for all unrolled iterations, rather than reusing a smaller set of registe....
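One way to observe the register-pressure effect described above is to compile the same loop at several unroll factors and ask the runtime how many registers each variant uses. The following is a minimal sketch, not a benchmark: the kernel `strided_sum`, the helpers `report` and `run`, and the chosen unroll factors are all hypothetical names and values introduced for illustration. Whether the register count actually rises depends on the compiler version and target architecture, so no particular output is assumed here.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread sums `iters` elements spaced `stride` apart.
// UNROLL is a compile-time unroll factor; `#pragma unroll 1` disables unrolling.
template <int UNROLL>
__global__ void strided_sum(const float* in, float* out, int iters, int stride)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    #pragma unroll UNROLL
    for (int i = 0; i < iters; ++i) {
        acc += in[tid + i * stride];
    }
    out[tid] = acc;
}

// Print the per-thread register count and the number of resident blocks per SM
// that the runtime reports for one unroll factor (illustrative helper).
template <int UNROLL>
void report(int block_size)
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, strided_sum<UNROLL>);

    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, strided_sum<UNROLL>, block_size, 0);

    printf("unroll=%2d  registers/thread=%d  resident blocks/SM=%d\n",
           UNROLL, attr.numRegs, blocks_per_sm);
}

// Launch one variant on an input of all ones and return the per-thread sums,
// so the result can be checked regardless of the unroll factor (illustrative helper).
template <int UNROLL>
std::vector<float> run(int blocks, int block_size, int iters)
{
    int threads = blocks * block_size;
    std::vector<float> h_in(static_cast<size_t>(threads) * iters, 1.0f);
    std::vector<float> h_out(threads, 0.0f);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, h_in.size() * sizeof(float));
    cudaMalloc(&d_out, h_out.size() * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(float),
               cudaMemcpyHostToDevice);

    // stride == total thread count, so neighbouring threads read neighbouring floats.
    strided_sum<UNROLL><<<blocks, block_size>>>(d_in, d_out, iters, threads);

    // This copy runs after the kernel on the default stream and blocks the host.
    cudaMemcpy(h_out.data(), d_out, h_out.size() * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Calling `report<1>(256)`, `report<8>(256)`, and `report<32>(256)` from a host `main` prints the figures side by side, and `run<1>(4, 256, 64)` and `run<32>(4, 256, 64)` should return identical sums. Compiling with `nvcc -Xptxas -v` gives the same register counts at build time. If a higher unroll factor shows more registers per thread and fewer resident blocks per SM, that is the occupancy cost the answer refers to.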