Optimization Notes

茶禅如水 2007-01-20 01:35:13

1. My optimized source code. The program is fairly short, so only the modified parts are shown.
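#include <windows.h>  // GetCurrentProcess, SetPriorityClass
#include <math.h>     // pow
#include <mkl.h>      // vmlSetMode, vdInvSqrt
// double distx, disty, distz, dist;
int main()
{
    // Run the process at high priority.
    HANDLE hProcess = GetCurrentProcess();
    SetPriorityClass(hProcess, HIGH_PRIORITY_CLASS);
    // VML_LA selects the low-accuracy (faster) VML mode.
    vmlSetMode(VML_NUM_THREADS_OMP_FIXED | VML_LA);
    for()
    ...
}
int computePot() {
    int i, j;
    double temp[1000];
    #pragma omp parallel num_threads(2) shared(r) private(i,j,temp)
    {
        #pragma omp for reduction(+:pot) schedule(static,1)
        for( i=0; i<1000; i++ ) {
            // Squared distances from particle i to particles 0..i-2.
            for( j=0; j<i-1; j++ ) {
                temp[j] = pow( (r[0][j] - r[0][i]), 2 );
                temp[j] += pow( (r[1][j] - r[1][i]), 2 );
                temp[j] += pow( (r[2][j] - r[2][i]), 2 );
            }
            // j now holds the number of entries filled in temp.
            vdInvSqrt(j,temp,temp);
            for (j= 0; j < i-1; j++)
                pot += temp[j];
        }
    }
    return 0;
}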
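2. Compiler options used with icc
icl /Qopenmp /QaxP /Qipo pl.cpp mkl_c.lib
Because the VC 2005 Express edition does not support OpenMP, I used only its linker.
3. Notes: the program was optimized mainly with icc and MKL, with VTune used to profile it.
I'm not very good at assembly (I need to learn assembly properly from huanyun!), so I mostly relied on the tools.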

hz张三 2007-01-21

This person's approach is about the same as mine ^_^
-------------------------------------
Actually, this step,
for(int j=0; j<count; j++) pot += dist[j];
still has a lot of room for optimization...
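
The reply does not say which optimization it has in mind. One common possibility, shown here only as a sketch, is to break the single dependency chain on `pot` by summing into several independent accumulators, which tends to help the compiler pipeline or vectorize the additions. The function name `sum4` is hypothetical.

```c
/* Sum n doubles using four independent partial sums. */
static double sum4(const double *a, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int j = 0;
    /* Main loop: four additions per iteration with no dependency between them. */
    for (; j + 3 < n; j += 4) {
        s0 += a[j];
        s1 += a[j + 1];
        s2 += a[j + 2];
        s3 += a[j + 3];
    }
    double s = (s0 + s1) + (s2 + s3);
    /* Tail: the remaining 0 to 3 elements. */
    for (; j < n; j++)
        s += a[j];
    return s;
}
```

In the loop above it would be used as `pot += sum4(dist, count);`. Note that reordering floating-point additions can change the last bits of the result, so the contest's precision requirement would need to be rechecked.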

wh_esther 2007-01-21

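I think optimizing the program for this contest comes down to three key points: first, loop parallelization; second, automatic loop vectorization; third, the handling of the InvSqrt function. Intel's real intent was to use the contest to show more developers the strengths of Intel's development tools. The program it supplied fits the optimization techniques described in the Intel compiler documentation particularly well, it is fairly simple, and the time allowed was generous, so learning on the fly was entirely feasible. If the whole program were written in assembly, the executables produced by VC8 and by the Intel compiler should run at about the same speed. Still, I really admire the experts who thought of using Newton's iteration to improve the precision; I learned a lot from them.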

wh_esther 2007-01-21

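The real key is still how vdInvSqrt is handled. If it is written in assembly, for example by taking the function void myinvsqrt (double *start,double *end) written by flyingdog and replacing my vdInvSqrt call with myinvsqrt(dist, dist + count), the execution speed is on par with flyingdog's program. Still, I prefer well-structured programs: assembly can be used at the critical spots, but program structure matters too. A well-structured program lets the compiler's automatic optimizations do more, and it is also more readable.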

aero_boy 2007-01-20

This person's approach is about the same as mine ^_^

wh_esther 2007-01-20

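Mine looks like this:
#include <omp.h>  // omp_set_num_threads, omp_get_thread_num
#include <mkl.h>  // vdInvSqrt
#define THREADS_NUM 2
double dist[NPARTS];
#pragma omp threadprivate(dist)
double loop_vec(int i, int count, double pot);
int main() {
......
}
int computePot()
{
    omp_set_num_threads(THREADS_NUM);
    #pragma omp parallel reduction(+:pot)
    {
        // Declared inside the parallel region so that each thread has its own copy.
        int thread_id = omp_get_thread_num();
        // Thread t handles i = t, t + THREADS_NUM, ... (interleaved).
        for(int i=thread_id; i<NPARTS; i += THREADS_NUM)
            if( i-1 > 0 ) pot = loop_vec(i, i-1, pot);
    }
    return 0;
}
double loop_vec(int i, int count, double pot)
{
    double distx, disty, distz;

    // Squared distances from particle i to the first count particles.
    for(int j=0; j<count; j++) {
        distx = r[0][j] - r[0][i];
        disty = r[1][j] - r[1][i];
        distz = r[2][j] - r[2][i];

        distx *= distx;
        disty *= disty;
        distz *= distz;

        dist[j] = distx + disty + distz;
    }

    // In place: dist[j] = 1/sqrt(dist[j]).
    vdInvSqrt(count, dist, dist);

    for(int j=0; j<count; j++) pot += dist[j];
    return pot;
}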

茶禅如水 2007-01-20

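My impressions of the tools
1. The optimization relied mainly on icc, VTune, and MKL, and they really are very good.
2. Compiling with icc's /QaxX option does generate two different code paths. This is described in both Intel's Optimization Reference Manual and the icc help; the goal is to get both compatibility and efficiency. Choosing the /QxX option generates a single code path, but it is only upward compatible, and the program is likewise not allowed to run on Intel's older CPUs.
Although some CPUs are compatible at the instruction-set level, their implementations differ, so expecting the running speed to be optimized equally well on all of them is asking a bit much.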

flyingdog 2007-01-20

Bump.

茶禅如水 2007-01-20

guided and dynamic assign iterations at run time, which costs some speed, so I chose static.

茶禅如水 2007-01-20

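6. Two points about MKL: one is the use of InvSqrt, the other is setting the mode of MKL's vector functions
vmlSetMode(VML_NUM_THREADS_OMP_FIXED | VML_LA );
VML_LA tells the MKL functions to use the lower-accuracy algorithm. Even at low accuracy it still meets the precision the contest requires,
which shows how strong the MKL functions really are!
The thread count is fixed for the sake of the OpenMP directives.
7. For the scheduling of the OpenMP for loop, I used (static,1).
I found this in the icc help; after applying it, VTune showed that the 2 threads were basically balanced.
The details follow, quoted from the icc help documentation: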
Use the following general form on the parallel construct to instruct OpenMP how to schedule the loop:

Example
#pragma omp parallel for schedule(kind [, chunk size])

Four different loop scheduling types (kinds) can be provided to OpenMP, as shown in the following table. The optional parameter (chunk), when specified, must be a loop-invariant positive integer.

Kind Description
static
Divide the loop into equal-sized chunks or as equal as possible in the case where the number of loop iterations is not evenly divisible by the number of threads multiplied by the chunk size. By default, chunk size is loop count/number of threads.

Set chunk to 1 to interleave the iterations.

dynamic
Use the internal work queue to give a chunk-sized block of loop iterations to each thread. When a thread is finished, it retrieves the next block of loop iterations from the top of the work queue.

By default, the chunk size is 1. Be careful when using this scheduling hint because of the extra overhead requirement.

guided
Similar to dynamic scheduling, but the chunk size starts off large and shrinks in an effort to reduce the amount of time threads have to go to the work queue to get more work. The optional chunk parameter specifies the minimum size chunk to use.

By default the chunk size is approximately the loop count/number of threads.

runtime
Uses the OMP_SCHEDULE environment variable to specify which one of the three loop-scheduling types should be used.

OMP_SCHEDULE is a string formatted exactly the same as would appear on the parallel construct.
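
Why (static,1) balances this particular loop can be seen by counting inner-loop iterations per thread. The sketch below uses plain C with no OpenMP runtime; it only models which outer iterations each thread would receive. The function names are hypothetical, and `work_blocked` assumes the loop count is divisible by the number of threads.

```c
/* Inner-loop trip count of "for (j = 0; j < i-1; j++)" for outer index i. */
static long inner_trips(int i)
{
    return (i - 1 > 0) ? i - 1 : 0;
}

/* schedule(static,1): thread tid gets i = tid, tid + nthreads, ... */
static long work_interleaved(int n, int nthreads, int tid)
{
    long w = 0;
    for (int i = tid; i < n; i += nthreads)
        w += inner_trips(i);
    return w;
}

/* Default static schedule: thread tid gets one contiguous block of n/nthreads iterations. */
static long work_blocked(int n, int nthreads, int tid)
{
    int chunk = n / nthreads;
    long w = 0;
    for (int i = tid * chunk; i < (tid + 1) * chunk; i++)
        w += inner_trips(i);
    return w;
}
```

For 1000 outer iterations and 2 threads, the interleaved assignment gives 249001 and 249500 inner iterations, while the contiguous split gives 124251 and 374250. Because the inner loop grows with i, the second thread in a contiguous split does about three times the work of the first, which matches the observation that (static,1) left the two threads roughly balanced.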

茶禅如水 2007-01-20

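4. When compiling with icc, the /Zi option can be used together with /QxP, /O3, and so on, which lets VTune show the time spent on each statement. I used icc from the start, so I saw that most of the time went to a few statements in computePot() and ignored the rest.
5. Make full use of the compiler.
The command line in item 2 above is missing /O3.
Many icc options are compatible with VC. For CPU-specific optimization there is arch:SSEx (VC has it too), while the /QxX and /QaxX families are unique to icc. The 32-bit icc version I used supports up to SSE3. For example, compiling with arch:SSE3 and compiling with /QaxP made no big difference in the result (this is my estimate).
With /QaxP, the compiler automatically vectorized the pot += loop and the pow loop, which improved the speed considerably.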
