Infinite loop inside a CUDA kernel

超神工作室 2015-05-30 11:19:19

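CUDA beginner asking for help:
Suppose several data blocks need to be processed in parallel, but there are dependencies among them. For example, the first and second blocks can run at the same time, the third must wait until the first has finished, the fourth waits for the second, and so on.
I want to start all the blocks at once and have each one spin in a while loop inside the kernel, leaving the loop once the block it depends on has finished and then doing the actual data processing. In theory this should work, right? When I actually tried it, the screen went black partway through and the program exited, so some kind of exception must have occurred. Help!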

7 replies

超神工作室 2015-06-09

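Thanks to Spidey212 and u014693181 for the pointers, I understand now. The idea is to use synchronization statements to control which if condition each thread enters, which amounts to running the threads in a given dependency order. I reached my goal with the method from reply #3. All blocks are launched concurrently, but at any point in time only the blocks that satisfy the condition do any computation; then everything synchronizes and moves on.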

YCMyTot 2015-06-08

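__syncthreads() is sometimes called barrier synchronization. For example:

if (row == 0 || col == 9) { d_C[index] = 1; }
__syncthreads();
if (row != 0 && col != 9 && (row == 1 || col == 8)) { d_C[index] = d_C[ref] + index; }
__syncthreads();

The second stage is written as a separate if that excludes the first stage, because an else cannot follow the __syncthreads() call. After the first if (row == 0 || col == 9) stage, the threads may finish at different times: thread 0 may be done while thread 10 is still running. If the next if stage started right away, a fast thread could read d_C[ref] before the slower thread responsible for that element had written it, and the result would be wrong. __syncthreads() provides the synchronization: every thread in the thread block waits at the barrier until all of them have arrived, and only then does execution continue. It costs some time, but the cost is worth it. The books cover the details.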

Spidey212 2015-06-05

Quoting reply #4 by u013177821:

[quote]
For example: there is a 10*10 grid of data blocks, each with a row number and a column number, and each block may start computing only after the block to its upper right has been processed. The code is roughly as follows. d_C holds the data blocks; the first row is initialized to 1, and each block sets its flag to 1 when it finishes.

__global__ void vecAdd(float* d_C,int* d_flag) 
{ 
	int index=threadIdx.x; 
	int row = index / 10;
	int col = index % 10;
	if(row == 0)
	{
		d_C[index] = 1;
		d_flag[index] = 1;
	}
	else
	{
		int ref = (row - 1) * 10 + col + 1;// index of the data block this one depends on
		//if(ref >= 0)
		//{
			while(d_flag[ref] != 1);  // wait for the dependency to finish
			//{
				d_C[index] = d_C[ref] + index;
				d_flag[index] = 1;
			//}
		//}
	}
}

After running, the screen goes black and NVIDIA reports that the display driver has stopped responding. The output data shows that none of the blocks performed any computation.
[/quote]

There is no thread synchronization, so some threads spin forever and the GPU crashes. While the first-row threads are still executing the if(row == 0) step, the later threads have already entered the test of the while loop. You can do it like this...

__global__ void vecAdd(float* d_C) 
{ 
	int index=threadIdx.x; 
	int row = index / 10;
	int col = index % 10;
	int ref = (row - 1) * 10 + col + 1;// index of the data block this one depends on

	// Stage at which this block runs: 0 for row 0 or column 9, 1 for row 1 or column 8, ...
	// An else-if chain cannot span the __syncthreads() calls, so each stage tests the level directly.
	int level = min(row, 9 - col);

	if(level == 0)
	{
		d_C[index] = 1;
	}
	__syncthreads();

	if(level == 1)
	{
		d_C[index] =  d_C[ref] + index;
	}
	__syncthreads();

	if(level == 2)
	{
		d_C[index] =  d_C[ref] + index;
	}
	__syncthreads();

	//-----TODO---
	//......
	//------------

	if(level == 9)
	{
		d_C[index] =  d_C[ref] + index;
	}
	__syncthreads();
}
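
The staged kernel above can also be written as a loop over dependency levels, which fills in the stages left as TODO. What follows is only a sketch under assumptions that are not from the thread: the names wavefrontKernel and runWavefront are made up for illustration, and the kernel is assumed to be launched as a single thread block of exactly 100 threads, because __syncthreads() only synchronizes threads within one thread block.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

constexpr int N = 10;  // N x N grid, one thread per data block

// Assumed launch: <<<1, N * N>>>. Cell (row, col) depends on (row - 1, col + 1).
__global__ void wavefrontKernel(float* d_C)
{
    int index = threadIdx.x;
    int row = index / N;
    int col = index % N;
    // Level 0 is row 0 and the last column; a cell's dependency is always
    // exactly one level lower than the cell itself.
    int level = min(row, N - 1 - col);

    for (int step = 0; step < N; ++step) {
        if (level == step) {
            if (step == 0) {
                d_C[index] = 1.0f;
            } else {
                int ref = (row - 1) * N + col + 1;
                d_C[index] = d_C[ref] + index;
            }
        }
        // The barrier sits outside the branch so that every thread reaches it.
        __syncthreads();
    }
}

// Hypothetical host helper: runs the kernel once and copies the grid back.
std::vector<float> runWavefront()
{
    std::vector<float> h_C(N * N, 0.0f);
    float* d_C = nullptr;
    cudaError_t err = cudaMalloc(&d_C, h_C.size() * sizeof(float));
    assert(err == cudaSuccess);
    wavefrontKernel<<<1, N * N>>>(d_C);
    err = cudaMemcpy(h_C.data(), d_C, h_C.size() * sizeof(float),
                     cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    (void)err;
    cudaFree(d_C);
    return h_C;
}
```

Calling runWavefront() returns the 100 values in row-major order. Every thread runs all ten loop iterations even when it has nothing to compute, which is what keeps the barrier count identical across the block.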

超神工作室 2015-06-04


__global__ void vecAdd(float* d_C,int* d_flag) 
{ 
	int index=threadIdx.x; 
	int row = index / 10;
	int col = index % 10;
	if(row == 0)
	{
		d_C[index] = 1;
		d_flag[index] = 1;
	}
	else
	{
		int ref = (row - 1) * 10 + col + 1;// index of the data block this one depends on
		//if(ref >= 0)
		//{
			while(d_flag[ref] != 1);  // wait for the dependency to finish
			//{
				d_C[index] = d_C[ref] + index;
				d_flag[index] = 1;
			//}
		//}
	}
}

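After running, the screen goes black and NVIDIA reports that the display driver has stopped responding. The output data shows that none of the blocks performed any computation.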
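
A hedged note on why a spin-wait like this is likely to hang: d_flag is read through a plain int pointer, so the compiler may load it once and then spin on a register, and threads of the same warp that diverge at the if/else are not guaranteed to make independent progress on GPUs of that generation. A kernel that never finishes then most likely trips the display driver's watchdog, which would match the black screen. If each data block is really meant to be handled by its own CUDA thread block, one alternative (a different technique from the one in the replies) is to launch one kernel per dependency level, since kernels issued to the same stream run in order. The sketch below assumes hypothetical names (processLevel, runLevelByLevel) and a made-up TILE of 32 elements per data block.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

constexpr int N = 10;     // N x N data blocks
constexpr int TILE = 32;  // hypothetical: elements per data block, one thread each

// One CUDA thread block per data block; only blocks on the requested level work.
__global__ void processLevel(float* d_C, int level)
{
    int cell = blockIdx.x;
    int row = cell / N;
    int col = cell % N;
    if (min(row, N - 1 - col) != level) return;  // the whole thread block exits together

    int t = threadIdx.x;
    if (level == 0) {
        d_C[cell * TILE + t] = 1.0f;
    } else {
        int ref = (row - 1) * N + col + 1;  // data block to the upper right
        d_C[cell * TILE + t] = d_C[ref * TILE + t] + cell;
    }
}

// Hypothetical host helper: one launch per level, all on the default stream.
std::vector<float> runLevelByLevel()
{
    std::vector<float> h_C(N * N * TILE, 0.0f);
    float* d_C = nullptr;
    cudaError_t err = cudaMalloc(&d_C, h_C.size() * sizeof(float));
    assert(err == cudaSuccess);
    for (int level = 0; level < N; ++level) {
        // Each launch starts only after the previous one has completed.
        processLevel<<<N * N, TILE>>>(d_C, level);
    }
    err = cudaMemcpy(h_C.data(), d_C, h_C.size() * sizeof(float),
                     cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    (void)err;
    cudaFree(d_C);
    return h_C;
}
```

Launching all N * N thread blocks for every level is wasteful, since most of them return immediately, but it keeps the sketch short; a real implementation would probably launch only the blocks that belong to the current level.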

Spidey212 2015-06-01

It would be best to post your code.... It is actually simple: you just control the order in which the threads execute. For example:

offset = THREAD_NUM / 2;
while(offset > 0) {
    if(tid < offset) {
        shared[tid] += shared[tid + offset];
    }
    offset >>= 1;
    __syncthreads();
}

With THREAD_NUM = 256, the unrolled loop looks like this:

if(tid < 128) { shared[tid] += shared[tid + 128]; }
__syncthreads();
if(tid < 64) { shared[tid] += shared[tid + 64]; }
__syncthreads();
if(tid < 32) { shared[tid] += shared[tid + 32]; }
__syncthreads();
if(tid < 16) { shared[tid] += shared[tid + 16]; }
__syncthreads();
if(tid < 8) { shared[tid] += shared[tid + 8]; }
__syncthreads();
if(tid < 4) { shared[tid] += shared[tid + 4]; }
__syncthreads();
if(tid < 2) { shared[tid] += shared[tid + 2]; }
__syncthreads();
if(tid < 1) { shared[tid] += shared[tid + 1]; }
__syncthreads();

The conditions on the thread's tid, together with the barriers between them, are what control the execution order.
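
For completeness, here is how the fragment above might sit inside a full kernel. This is a sketch under assumptions: the names blockSum and sumOnGpu are hypothetical, the input is assumed to hold exactly THREAD_NUM elements, and THREAD_NUM must be a power of two for the halving loop to cover every element.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

constexpr int THREAD_NUM = 256;  // must be a power of two

// Sums THREAD_NUM floats within a single thread block.
__global__ void blockSum(const float* d_in, float* d_out)
{
    __shared__ float shared[THREAD_NUM];
    int tid = threadIdx.x;

    shared[tid] = d_in[tid];
    __syncthreads();  // all loads finish before any thread reads another slot

    for (int offset = THREAD_NUM / 2; offset > 0; offset >>= 1) {
        if (tid < offset) {
            shared[tid] += shared[tid + offset];
        }
        __syncthreads();  // outside the branch, so every thread reaches it
    }

    if (tid == 0) {
        *d_out = shared[0];
    }
}

// Hypothetical host helper; expects exactly THREAD_NUM input values.
float sumOnGpu(const std::vector<float>& h_in)
{
    assert(h_in.size() == static_cast<size_t>(THREAD_NUM));
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, THREAD_NUM * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in.data(), THREAD_NUM * sizeof(float),
               cudaMemcpyHostToDevice);
    blockSum<<<1, THREAD_NUM>>>(d_in, d_out);
    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

The extra barrier after the load matters for the same reason as the ones in the loop: a thread must not add shared[tid + offset] before the thread that owns that slot has written it.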

超神工作室 2015-05-31