Multithreaded optimization: my program

愚鬼 2007-01-14 01:18:01

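The program was compiled with '.net 2003' on Windows 2000 and runs correctly in my tests. I had no Intel tools available, because I could not download them.

You are welcome to recompile the source and run it yourselves. The results may be better.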

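The optimization targets the loops. First, the loop in updatePositions is handed over to threads, which gives a multi-core processor more work to do in parallel. Three identical threads are created, one per page (one row of r, that is, one dimension). Each thread works on its own page and is synchronized with the main thread through boolean flags. On a single processor the gain is not noticeable, because execution is still sequential and thread switching adds overhead.

Second, computePot is optimized in the same way: three identical threads, one per page, compute the pow terms (the squared coordinate differences). The completion of each row counts as one stage and is signalled through a boolean flag, so that the main thread and the worker threads run in parallel. On a single processor this is actually slower, because of the added thread-switching and memory-addressing overhead. On a multi-core processor it should do better.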

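Because I could not download the tools, I did not use any tuning tools. From the algorithm itself, however, I believe the threading scheme above should meet the requirements of parallel processing. The program uses no other minor tricks.

I have no tool that can report run-time data for the program.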

/* compute the potential energy of a collection of */
/* particles interacting via pairwise potential */

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <windows.h>
#include <process.h> /* _beginthread, _endthread */
#include <time.h>

#define NPARTS 1000
#define NITER 201
#define DIMS 3

int rand( void );
int computePot(void);
void initPositions(void);
void updatePositions(void);
void updatePositions_New(void);
int computePot_New(void);

double r[DIMS][NPARTS];
double pot;
double distx, disty, distz, dist;
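// flags and buffers shared between main and the worker threads
BOOL Update[DIMS];                          // TRUE: worker should update this dimension
BOOL Comput[DIMS];                          // TRUE: worker should compute this dimension
double ComputResult[DIMS][NPARTS][NPARTS];  // squared coordinate differences per dimension
BOOL Result[DIMS][NPARTS];                  // TRUE: row i of ComputResult is ready
BOOL ThreadQuit = FALSE;                    // TRUE: worker threads should exit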

DWORD WINAPI ThreadUpdatePos(void * Pointer);
DWORD WINAPI ThreadComputPos(void * Pointer);

int main() {
    int i, jLine[DIMS];
    clock_t start, stop;

    initPositions();
    // create one update thread and one compute thread per dimension
    for (i = 0; i < DIMS; i++) {
        Update[i] = FALSE;
        Comput[i] = FALSE;
        jLine[i] = i;
        CreateThread(NULL, 0, ThreadUpdatePos, (void *)&jLine[i], 0, NULL);
        CreateThread(NULL, 0, ThreadComputPos, (void *)&jLine[i], 0, NULL);
    }
    updatePositions_New();
    start = clock();
    for (i = 0; i < NITER; i++) {
        pot = 0.0;
        //computePot();
        computePot_New();
        if (i % 10 == 0) printf("%5d: Potential: %10.3f\n", i, pot);
        //updatePositions();
        updatePositions_New();
    }
    stop = clock();
    printf("Seconds = %10.9f\n", (double)(stop - start) / CLOCKS_PER_SEC);
    ThreadQuit = TRUE;  // ask the worker threads to exit
    getchar();
    return 0;
}

void initPositions() {
    int i, j;

    for( i=0; i<DIMS; i++ )
        for( j=0; j<NPARTS; j++ )
            r[i][j] = 0.5 + ( (double) rand() / (double) RAND_MAX );
}

void updatePositions(void) {
    int i, j;

    for( i=0; i<DIMS; i++ ){
        for( j=0; j<NPARTS; j++ )
            r[i][j] -= 0.5 + ( (double) rand() / (double) RAND_MAX );
    }
}

void updatePositions_New(void){
    int i;

    // signal every update thread
    for(i=0;i < DIMS;i++)
        Update[i] = TRUE;

    // wait until all three threads have cleared their flags
    while((Update[0]==TRUE) || (Update[1]==TRUE) || (Update[2]==TRUE))
        Sleep(0);
}

int computePot() {
    int i, j;

    for( i=0; i<NPARTS; i++ ) {
        for( j=0; j<i-1; j++ ) {
            distx = pow( (r[0][j] - r[0][i]), 2 );
            disty = pow( (r[1][j] - r[1][i]), 2 );
            distz = pow( (r[2][j] - r[2][i]), 2 );
            dist = sqrt( distx + disty + distz );
            pot += 1.0 / dist;
        }
    }
    return 0;
}

int computePot_New(void)
{
    int i,j;

    // signal every compute thread
    for(i=0;i < DIMS;i++)
        Comput[i] = TRUE;

    //while((Comput[0]==TRUE) || (Comput[1]==TRUE) || (Comput[2]==TRUE))
    //    Sleep(0);
    //
    for(i=0;i<NPARTS;i++ ) {
        // wait until all three threads have finished row i
        while((Result[0][i]==FALSE) || (Result[1][i]==FALSE) || (Result[2][i] == FALSE))
            Sleep(0);
        //
        Result[0][i]=FALSE;Result[1][i]=FALSE;Result[2][i]=FALSE;
        for(j=0;j<i-1;j++){
            dist = sqrt(ComputResult[0][i][j] + ComputResult[1][i][j] + ComputResult[2][i][j]);
            pot += 1.0 / dist;
        }
    }
    return 0;
}

DWORD WINAPI ThreadUpdatePos(void * Pointer)
{
    int PosLine,iCount;

    PosLine =*(int *)Pointer;   // dimension handled by this thread

    while(! ThreadQuit){
        if (Update[PosLine] == TRUE){
            for(iCount=0;iCount < NPARTS;iCount++){
                r[PosLine][iCount] -= 0.5 + ( (double) rand() / (double) RAND_MAX );
            }
            Update[PosLine]=FALSE;
        }else
            Sleep(0);
    }
    return 0;
}

DWORD WINAPI ThreadComputPos(void * Pointer)
{
    int PosLine,i,j;

    PosLine =*(int *)Pointer;   // dimension handled by this thread

    while(! ThreadQuit){
        if(Comput[PosLine] == TRUE){
            for( i=0; i<NPARTS; i++ ) {
                for( j=0; j<i-1; j++ ) {
                    //ComputResult[PosLine][i][j] = pow( (r[PosLine][j] - r[PosLine][i]), 2 );
                    ComputResult[PosLine][i][j] = (r[PosLine][j] - r[PosLine][i]) * (r[PosLine][j] - r[PosLine][i]);
                }
                Result[PosLine][i]=TRUE;   // row i is ready for the main thread
            }
            Comput[PosLine]=FALSE;
        }else
            Sleep(0);
    }
    return 0;
}

icansaymyabc 2007-01-16

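OP, my friend, your approach seems to be wrong.

I tested it for you

on a Centrino 1.66 GHz dual-core CPU.

Original program: 6.562 s, 50% CPU time. It is single-threaded, so it used only one CPU.

Your program: 6.671 s, 100% CPU time. And the result is wrong.

The correct result is 68477.591, and no optimization is allowed to change it.
To keep the result correct, you must not optimize the two functions initPositions and updatePositions, or change the order in which they are called.
Your program printed 158321.914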
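
The reason the call order matters is that every position depends on which value of the rand() sequence it receives. When three threads call rand() concurrently, each particle may get a different value than in the serial program. If the C runtime keeps the rand() state per thread, as Microsoft's multithreaded runtime does to my knowledge, every thread even starts its own copy of the sequence. One hedged way to keep the reference result while still splitting the update is sketched below: draw all random numbers serially in the original order, then apply them per dimension in any order. The names delta, drawDeltas and applyDeltas are hypothetical and do not appear in the original program.

```c
#include <assert.h>
#include <stdlib.h>

#define NPARTS 1000
#define DIMS 3

static double r[DIMS][NPARTS];      /* positions, as in the post */
static double delta[DIMS][NPARTS];  /* hypothetical buffer of pre-drawn steps */

/* Step 1, serial: consume rand() in exactly the order updatePositions does
   (all particles of dimension 0, then dimension 1, then dimension 2). */
void drawDeltas(void)
{
    int i, j;
    for (i = 0; i < DIMS; i++)
        for (j = 0; j < NPARTS; j++)
            delta[i][j] = 0.5 + ((double)rand() / (double)RAND_MAX);
}

/* Step 2: apply the pre-drawn steps to one dimension. Calls for different
   dimensions touch disjoint rows, so they may run in any order or on
   different threads once drawDeltas has finished. */
void applyDeltas(int dim)
{
    int j;
    assert(dim >= 0 && dim < DIMS);
    for (j = 0; j < NPARTS; j++)
        r[dim][j] -= delta[dim][j];
}
```

Given that the profile attributes well under 1% of the workload to ThreadUpdatePos, splitting the update this way is unlikely to pay off. The sketch only shows how the random sequence can be kept intact.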

Results produced by Intel Tuning Assistant:

Tuning Analysis for Sampling Results [AAA] - Tue Jan 16 14:17:58 2007
Workload Summary (all processes/modules)
Clockticks: 24,826,927,000

Time Statistics

Clockticks: 24,826,927,000 events

Processor Time: 14.93 sec

Raw Event Data

Instructions Retired: 9,919,795,000 events

Process/Module Summary (Process: aaa.exe, Module: aaa.exe, RVA: 0x1198-0x77c2)
Clockticks: 12,103,314,000

Time Statistics

Clockticks: 12,103,314,000 events

Processor Time: 7.28 sec

Accounts for 48.75% (workload)

Raw Event Data

Instructions Retired: 1,287,162,000 events

Event Ratios

Cycles per Retired Instruction - CPI: 9.4

main (RVA: 0x1024-0x138f, Module: aaa.exe, Process: aaa.exe)
Clockticks: 8,988,515,000

Time Statistics

Clockticks: 8,988,515,000 events

Processor Time: 5.41 sec

Accounts for 74.26% (process/module), 36.2% (workload)

Raw Event Data

Instructions Retired: 473,955,000 events

Event Ratios

Cycles per Retired Instruction - CPI: 18.97

unsigned long ThreadComputPos(void *) (RVA: 0x146c-0x1643, Module: aaa.exe, Process: aaa.exe)
Clockticks: 2,936,858,000

Time Statistics

Clockticks: 2,936,858,000 events

Processor Time: 1.77 sec

Accounts for 24.26% (process/module), 11.83% (workload)

Raw Event Data

Instructions Retired: 771,632,000 events

Event Ratios

Cycles per Retired Instruction - CPI: 3.81

unsigned long ThreadUpdatePos(void *) (RVA: 0x16bc-0x1785, Module: aaa.exe, Process: aaa.exe)
Clockticks: 151,333,000

Time Statistics

Clockticks: 151,333,000 events

Processor Time: 0.091 sec

Accounts for 1.25% (process/module), 0.61% (workload)

Raw Event Data

Instructions Retired: 16,630,000 events

Event Ratios

Cycles per Retired Instruction - CPI: 9.1

_getptd_noexit (RVA: 0x416b-0x41ed, Module: aaa.exe, Process: aaa.exe)
Clockticks: 11,641,000

Time Statistics

Clockticks: 11,641,000 events

Processor Time: 0.007 sec

Accounts for 0.096% (process/module), 0.047% (workload)

Raw Event Data

Instructions Retired: 14,967,000 events

Event Ratios

Cycles per Retired Instruction - CPI: 0.78

__set_flsgetvalue (RVA: 0x4037-0x4060, Module: aaa.exe, Process: aaa.exe)
Clockticks: 4,989,000

Time Statistics

Clockticks: 4,989,000 events

Processor Time: 0.003 sec

Accounts for 0.041% (process/module), 0.02% (workload)

Thread Profiler reports 7 worker threads, all running at full load.
Over Utilized time accounts for 97.34% of the run.

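Thread Checker reports data access conflicts at 101 locations in the code you wrote. Many of these locations conflict repeatedly, and the total count of data conflicts exceeds 100 million.
Here is a copied passage explaining data conflicts: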
Read -> Write Data Race
Read->write data races occur when one thread reads a shared memory location (address) while another thread concurrently writes the same memory location. The shared memory location may be referred to by (variable) name, pointer, or even a function such as memcpy().

The following example uses a variable name:

1st access by first thread

S1: privateA = sharedX

2nd access by second thread

S2: sharedX = privateB

If sharedX is a variable visible to all threads and privateA and privateB are local variables visible only to the thread where each was declared, concurrent execution of the above statements by multiple threads results in a “race” on the value to be read from sharedX.

Since the order of execution among threads is unpredictable, it is unknown what value will be read from sharedX and written into privateA. This results in non-deterministic software, or software prone to produce different results each time it is executed.

When you double-click the diagnostic for the example above, the statement S1 shows in the 1st Access source view window; and S2 shows in the 2nd Access source view window.

To correct this diagnostic, see Correction Strategies for Read -> Write Diagnostics
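
As an illustration of one correction strategy, the sketch below hands a value from one thread to another through an atomic flag with release and acquire ordering, so that the read of sharedX can no longer overlap the write. It assumes a C11 compiler with `<stdatomic.h>`, which is newer than the toolchain discussed in this thread; on Win32 the same idea would typically be expressed with events or the Interlocked functions. The names ready, publish and tryConsume are hypothetical, and the scheme only covers one writer thread and one reader thread.

```c
#include <assert.h>
#include <stdatomic.h>

static double sharedX;      /* payload, named as in the quoted example */
static atomic_int ready;    /* 0 = slot empty, 1 = value published */

/* Writer side. Precondition: the previous value has been consumed. */
void publish(double value)
{
    assert(atomic_load_explicit(&ready, memory_order_acquire) == 0);
    sharedX = value;                                          /* S2 */
    /* Release: the write to sharedX becomes visible before ready == 1. */
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Reader side. Returns 1 and stores the value in *out if one was
   published, otherwise returns 0 without touching sharedX. */
int tryConsume(double *out)
{
    /* Acquire: pairs with the release store in publish(). */
    if (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        return 0;
    *out = sharedX;                                           /* S1 */
    /* Hand the slot back to the writer. */
    atomic_store_explicit(&ready, 0, memory_order_release);
    return 1;
}
```

The flags Update, Comput and Result in the posted program play a similar signalling role, but they are plain BOOL variables, so neither the compiler nor the hardware is told to order the flag accesses relative to the data they guard.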

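Incidentally, a multithreaded program cannot be properly tested on a single CPU. On a single CPU it rarely happens that several threads truly access the same location at the same moment, and simple scheduling can mask many latent problems. On a multi-core CPU (essentially two CPUs soldered onto one base), or even on a Hyper-Threading CPU, these problems can no longer be hidden.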

celineshi 2007-01-15