Serious performance problem with cudaMalloc() on a C1060 under 64-bit CentOS

kuangquansheng 2010-03-17 03:51:39

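While comparing my program across different GPUs and systems, I found that cudaMalloc() and cudaMallocPitch() have a serious performance problem on a C1060 under 64-bit CentOS. On the C1060 each call takes about 0.1 s, whereas on a GeForce 9600 GSO under 32-bit Ubuntu 9.04 each call takes no more than 0.03 s.
I wrote a test program. Please check the code for mistakes, post your own results, and help me test and analyze this.
Let's discuss it together.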



#include <stdio.h>
#include <stdlib.h>
#include <cutil_inline.h>

int main(int argc, char* argv[])
{
	// Pick the device with the most multiprocessors.
	int cudaDeviceCount;
	cudaGetDeviceCount(&cudaDeviceCount);
	int cudaDevice = 0;
	int maxSps = 0;
	struct cudaDeviceProp dp;
	for (int i = 0; i < cudaDeviceCount; i++) {
		cudaGetDeviceProperties(&dp, i);
		if (dp.multiProcessorCount > maxSps) {
			maxSps = dp.multiProcessorCount;
			cudaDevice = i;
		}
	}
	cudaGetDeviceProperties(&dp, cudaDevice);
	printf("Using cuda device %d: %s\n", cudaDevice, dp.name);
	cudaSetDevice(cudaDevice);

	int length = 1<<26;	// number of floats; 1<<26 floats = 256 MB

	// The timer is never reset, so each "Time N" printed below is the total
	// accumulated since the first cutStartTimer, not the cost of one step.
	unsigned int timer = 0;
	CUT_SAFE_CALL( cutCreateTimer(&timer) );
	CUT_SAFE_CALL( cutStartTimer(timer) );

	// Step 1: allocate the host buffer.
	float* h_Data;
	h_Data = (float*)malloc(sizeof(float)*length);
	if (h_Data == NULL) {
		fprintf(stderr, "host malloc failed\n");
		return 1;
	}

	CUT_SAFE_CALL( cutStopTimer(timer) );
	printf("Time 1: %f (ms)\n", cutGetTimerValue(timer));
	CUT_SAFE_CALL( cutStartTimer(timer) );

	// Step 2: fill the host buffer.
	for(int i=0; i<length; i++) {
		h_Data[i] = (float)i;
	}

	CUT_SAFE_CALL( cutStopTimer(timer) );
	printf("Time 2: %f (ms)\n", cutGetTimerValue(timer));
	CUT_SAFE_CALL( cutStartTimer(timer) );

	// Step 3: allocate device memory. Time 3 - Time 2 is the cudaMalloc() cost.
	float* d_Data;
	CUDA_SAFE_CALL(cudaMalloc(&d_Data, sizeof(float)*length));

	CUT_SAFE_CALL( cutStopTimer(timer) );
	printf("Time 3: %f (ms)\n", cutGetTimerValue(timer));
	CUT_SAFE_CALL( cutStartTimer(timer) );

	// Step 4: copy host to device and wait for completion.
	CUDA_SAFE_CALL(cudaMemcpy(d_Data, h_Data, sizeof(float)*length, cudaMemcpyHostToDevice));

	cutilSafeCall( cudaThreadSynchronize() );
	CUT_SAFE_CALL( cutStopTimer(timer) );
	printf("Time 4: %f (ms)\n", cutGetTimerValue(timer));
	CUT_SAFE_CALL( cutStartTimer(timer) );

	// Step 5: release device and host memory.
	cudaFree(d_Data);
	free(h_Data);

	CUT_SAFE_CALL( cutStopTimer(timer) );
	printf("Time 5: %f (ms)\n", cutGetTimerValue(timer));

	CUT_SAFE_CALL( cutDeleteTimer(timer));
	cudaThreadExit();
	return 0;
}

15 replies (newest first)


无心人_过过小日子 2010-03-19


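[Quote=Quoting reply #14 by kuangquansheng:]
Quoting reply #13 by gogdizzy:

Still, I think:
for the same memory allocation algorithm, if it does not simply take the first block it finds but instead looks for the best-fitting block, then the larger the memory, the more time is wasted searching!

Yes, that makes a lot of sense, thanks!
I ran the test again on SLES 10.2 64-bit, with driver version 190.53.

Using cuda device 0: GeForce 9600 GSO
Time 1: 0.01200……
[/Quote]

Try this one:
http://www.nvidia.com/object/linux_display_amd64_195.36.15.html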

kuangquansheng 2010-03-19


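[Quote=Quoting reply #13 by gogdizzy:]

Still, I think:
for the same memory allocation algorithm, if it does not simply take the first block it finds but instead looks for the best-fitting block, then the larger the memory, the more time is wasted searching!
[/Quote]

Yes, that makes a lot of sense, thanks!
I ran the test again on SLES 10.2 64-bit, with driver version 190.53.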

Using cuda device 0: GeForce 9600 GSO
Time 1: 0.012000 (ms)
Time 2: 360.464996 (ms)
Time 3: 391.526001 (ms)
Time 4: 552.729004 (ms)
Time 5: 587.117004 (ms)

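Here cudaMalloc() takes about as long as it does on the 32-bit operating system (Time 3 - Time 2 is about 31 ms), so the operating system does not appear to be the cause.
It looks like I need to upgrade the driver on the C1060 machine.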

天下第一好大人 2010-03-18


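Still, I think:
for the same memory allocation algorithm, if it does not simply take the first block it finds but instead looks for the best-fitting block, then the larger the memory, the more time is wasted searching!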

天下第一好大人 2010-03-18


I was wrong. The value is a point in time, measured from the timer's reset or creation.

My apologies to the original poster!

cuda2010 2010-03-18


cutGetTimerValue returns a point in time.

天下第一好大人 2010-03-18


As I understand it, the figure to use is not Time 3 - Time 2 but Time 3 itself.

Doesn't cutGetTimerValue return a time interval rather than a point in time?

天下第一好大人 2010-03-17


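A guess: the C1060 uses a more complex memory allocator to avoid creating more memory fragmentation. Judging from your 256 MB results the two cards are about the same, and only the 1 KB case shows a clear difference.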

kuangquansheng 2010-03-17


Thanks, everyone. I'll try a different driver first!

lizecn 2010-03-17


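Even Kaiyong has come by to watch, haha. ST from the group is the helpful one and ran the test for you. The driver is the first thing to suspect.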

robinking623623 2010-03-17


Using cuda device 0: Tesla C1060
Time 1: 0.006000 (ms)
Time 2: 140.281998 (ms)
Time 3: 206.727997 (ms)
Time 4: 293.645996 (ms)
Time 5: 315.059998 (ms)

Here are my results; I hope they are useful to you. Fedora 11 x64.

cuda2010 2010-03-17


Is there any application that needs to call malloc over and over? If it is a one-time cost, I don't think being a bit slow is a big problem.

OpenHero 2010-03-17


It has nothing to do with the card; it is a driver problem.

无心人_过过小日子 2010-03-17


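It is probably caused by the difference between a 32-bit OS (or driver) and a 64-bit OS (or driver), and likely has nothing to do with the card. If you can, try swapping the two cards and testing again.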

kuangquansheng 2010-03-17


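[Quote=Quoting reply #2 by gogdizzy:]
A guess: the C1060 uses a more complex memory allocator to avoid creating more memory fragmentation. Judging from your 256 MB results the two cards are about the same, and only the 1 KB case shows a clear difference.
[/Quote]

Thanks for your reply.
In my program, Time 3 - Time 2 is the execution time of cudaMalloc(). In the results, regardless of data size, the GeForce on the 32-bit OS takes about 0.03 s and the Tesla on the 64-bit OS takes about 0.1 s.
So if a program allocates device memory more than ten or so times and its kernels do not run for long, overall performance on the C1060 could be far worse than on a cheap GeForce card. That is what worries me.
You said the C1060 uses a more complex memory allocator. Can that be confirmed further?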

kuangquansheng 2010-03-17


First, my own results. The GeForce runs on a 32-bit operating system and the Tesla on a 64-bit one.

256 MB of data (length = 1<<26):

Using cuda device 0: GeForce 9600 GSO
Time 1: 0.016733 (ms)
Time 2: 266.729248 (ms)
Time 3: 290.055603 (ms)
Time 4: 448.621674 (ms)
Time 5: 474.362610 (ms)

Using cuda device 0: Tesla C1060
Time 1: 0.006000 (ms)
Time 2: 173.944000 (ms)
Time 3: 290.955994 (ms)
Time 4: 357.660004 (ms)
Time 5: 369.916992 (ms)

1 KB of data (length = 1<<8):

Using cuda device 0: GeForce 9600 GSO
Time 1: 0.001299 (ms)
Time 2: 0.003647 (ms)
Time 3: 23.524227 (ms)
Time 4: 23.967709 (ms)
Time 5: 23.981958 (ms)

Using cuda device 0: Tesla C1060
Time 1: 0.003000 (ms)
Time 2: 0.004000 (ms)
Time 3: 115.965996 (ms)
Time 4: 115.987000 (ms)
Time 5: 115.999001 (ms)

1 GB of data (length = 1<<28):

Using cuda device 0: GeForce 9600 GSO
Time 1: 0.032065 (ms)
Time 2: 1075.974731 (ms)
out of memory.

Using cuda device 0: Tesla C1060
Time 1: 0.006000 (ms)
Time 2: 665.909973 (ms)
Time 3: 786.293945 (ms)
Time 4: 1061.020020 (ms)
Time 5: 1105.677002 (ms)