How do I correctly add nvcc's --default-stream per-thread compile option in VS2010?

free_lock 2015-07-27 09:32:19

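I am following http://devblogs.nvidia.com/parallelforall/gpu-pro-tip-cuda-7-streams-simplify-concurrency/
and want to use concurrent streams.
When I compile on the command line, the run time drops from 3 s to 1 s, but my project depends on many libraries, so I would like to know how to build it with VS2010.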

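I want to add the per-thread compile option to VS2010.
I have already tried adding --default-stream per-thread directly under "Additional Options" on the Command Line page, and that does not work.
The resulting streams do not run concurrently.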

According to the user manual, I can also do this by defining a macro:
For code that is compiled using the --default-stream per-thread compilation flag
(or that defines the CUDA_API_PER_THREAD_DEFAULT_STREAM macro before including
CUDA headers (cuda.h and cuda_runtime.h)), the default stream is a regular stream
and each host thread has its own default stream.
For code that is compiled using the --default-stream null compilation flag, the
default stream is a special stream called the NULL stream and each device has a single
NULL stream used for all host threads. The NULL stream is special as it causes implicit
synchronization as described in Implicit Synchronization.

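I tried that as well, and the streams still do not run concurrently. I don't know what else to change. Any pointers would be appreciated.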
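One possible way to sidestep the project-settings question is to name the per-thread default stream explicitly. CUDA 7.0 and later expose the handle cudaStreamPerThread, which can be passed to kernel launches and stream calls regardless of how --default-stream is set. Note also that, according to the blog post linked above, a #define CUDA_API_PER_THREAD_DEFAULT_STREAM written inside a .cu file has no effect under nvcc, because nvcc implicitly includes cuda_runtime.h at the top of the translation unit; the macro has to arrive from the command line instead (for example as -DCUDA_API_PER_THREAD_DEFAULT_STREAM). The following is a minimal sketch only: the kernel name scale, the launch configuration, and the buffer size are placeholders, not part of the original code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, for illustration only: multiplies each element by a.
__global__ void scale(float *x, int n, float a)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
        x[i] *= a;
    }
}

int main()
{
    const int n = 1 << 20;  // placeholder size
    float *d = NULL;

    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    // Every call names cudaStreamPerThread explicitly, so the work goes to
    // the calling host thread's own default stream even if the file was
    // compiled with the legacy default-stream setting.
    cudaMemsetAsync(d, 0, n * sizeof(float), cudaStreamPerThread);
    scale<<<64, 256, 0, cudaStreamPerThread>>>(d, n, 2.0f);

    // Wait only for this thread's stream, not for the whole device.
    cudaError_t err = cudaStreamSynchronize(cudaStreamPerThread);
    printf("stream sync: %s\n", cudaGetErrorString(err));

    cudaFree(d);
    return 0;
}
```

If this version shows separate streams in the profiler while the flag-based build does not, that would suggest the flag or macro is not reaching nvcc from the VS2010 project settings.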


qq_18668575 2017-12-18


Hello, was this problem ever solved? I am running into the same issue now.

skysbjdy 2016-04-16


Did you find the cause in the end? How did you solve it?

熊猫视觉 2015-07-29


OP, I have been thinking about this problem recently too. We could compare notes.

free_lock 2015-07-27


One more point: the 3 s to 1 s figure above refers to the test code that does not use pthread. If I use the following pthread code instead:



#include <pthread.h>
#include <stdio.h>

const int N = 1 << 20;

// Grid-stride loop: each thread fills every (blockDim.x * gridDim.x)-th element.
__global__ void kernel(float *x, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
        x[i] = sqrt(pow(3.14159,i));
    }
}

// Entry point for each host thread: allocate a buffer and launch one kernel
// on the default stream.
void *launch_kernel(void *dummy)
{
    float *data;
    cudaMalloc(&data, N * sizeof(float));

    kernel<<<1, 64>>>(data, N);

    // Stream 0 is the per-thread default stream only when the file is built
    // with --default-stream per-thread; otherwise it is the shared NULL stream.
    cudaStreamSynchronize(0);

    return NULL;
}

int main()
{
    const int num_threads = 8;

    pthread_t threads[num_threads];

    for (int i = 0; i < num_threads; i++) {
        if (pthread_create(&threads[i], NULL, launch_kernel, 0)) {
            fprintf(stderr, "Error creating thread\n");
            return 1;
        }
    }

    for (int i = 0; i < num_threads; i++) {
        if(pthread_join(threads[i], NULL)) {
            fprintf(stderr, "Error joining thread\n");
            return 2;
        }
    }

    cudaDeviceReset();

    return 0;
}

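then the result is the figure shown in the question.
I then changed the stream-synchronization call in the code from
cudaStreamSynchronize()
to
cudaDeviceSynchronize(),
and got a new timeline:

Some of the streams now run concurrently. What is the reason for this? Why can't I get all of them to run concurrently?
By the way, I would also like to ask about compiling on the command line with VS2010: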

nvcc --default-stream per-thread -I /includepath test.cu -l pthread.lib -o test
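(The first option is an uppercase I (i), the second is a lowercase l (L).)
What is wrong with this command, and why can't the functions from the pthread library be resolved?
Adding the library through the project's additional-libraries setting works without problems. Any guidance would be appreciated.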
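On the partial-concurrency question above, one variation that may help with diagnosis is to stop relying on the default stream altogether and give each host thread a stream it creates itself. The sketch below is an assumption-laden rework of the pthread test, not a confirmed fix: the function name worker and the choice to allocate all buffers in main before the threads start are illustrative. The buffers are allocated up front because the CUDA programming guide lists device memory allocation among the operations that can cause implicit synchronization. Even with separate streams, kernels overlap only when the device has free resources.

```cuda
#include <pthread.h>
#include <stdio.h>
#include <cuda_runtime.h>

const int N = 1 << 20;

__global__ void kernel(float *x, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
        x[i] = sqrtf(powf(3.14159f, (float)i));
    }
}

// Hypothetical replacement for launch_kernel: the buffer arrives through arg,
// and the kernel runs on a stream owned by this host thread.
void *worker(void *arg)
{
    float *data = (float *)arg;
    cudaStream_t stream;

    // cudaStreamNonBlocking: no implicit synchronization with the NULL stream.
    if (cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking) != cudaSuccess) {
        fprintf(stderr, "Error creating stream\n");
        return NULL;
    }

    kernel<<<1, 64, 0, stream>>>(data, N);
    cudaStreamSynchronize(stream);  // wait for this thread's work only
    cudaStreamDestroy(stream);
    return NULL;
}

int main()
{
    const int num_threads = 8;
    pthread_t threads[num_threads];
    float *buffers[num_threads];

    // Allocate everything before any kernel is launched.
    for (int i = 0; i < num_threads; i++) {
        if (cudaMalloc(&buffers[i], N * sizeof(float)) != cudaSuccess) {
            fprintf(stderr, "Error allocating buffer\n");
            return 3;
        }
    }

    for (int i = 0; i < num_threads; i++) {
        if (pthread_create(&threads[i], NULL, worker, buffers[i])) {
            fprintf(stderr, "Error creating thread\n");
            return 1;
        }
    }

    for (int i = 0; i < num_threads; i++) {
        if (pthread_join(threads[i], NULL)) {
            fprintf(stderr, "Error joining thread\n");
            return 2;
        }
    }

    for (int i = 0; i < num_threads; i++) {
        cudaFree(buffers[i]);
    }
    cudaDeviceReset();
    return 0;
}
```

Because the streams are created explicitly, this version behaves the same whether or not the --default-stream per-thread flag reaches nvcc, which makes it easier to separate build-configuration problems from genuine scheduling limits.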
