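How much speedup do the VML functions in MKL typically give?
How much faster are MKL functions such as vdCos, vdSin, and vdSqrt in general? I only started really using MKL because of this optimization contest, and I would like to know how powerful it actually is and whether it is useful in my own calculations. Below are some personal observations; please point out anything I got wrong. After that there is a piece of code I would like to optimize, and I would welcome any ideas.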
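Calling these functions on one value at a time does not seem efficient, while calling them on a batch of values gives some improvement. However, taking sqrt over an array as an example, I ran an experiment and found that overall efficiency does not keep increasing with the array length (for example, collecting the values from several nested loops into one array and then calling vdSqrt once). On the contrary, my tentative conclusion is that calling the VML function at the end of each loop level is more efficient.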
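The code below is part of my own analysis program. I would like to use VML functions to speed up the sqrt calls (without multithreading, because this code is one part of an MPI-parallel program), and I am not sure what the options are.
P.S. The key sizes are roughly: ppoints = 10, itime < 6000, nbeads = 10000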
for ( ipoint = 0; ipoint < ppoints; ipoint++ )
{
    t = (int)( pow(10.0, logstart+(ipoint+ myid*ppoints)*logstep)/steptime);
    for ( irecord = 0; irecord < nrecords; irecord++ )
    {
        gr0[irecord] = 0.0;
        gurt[irecord] = 0.0;
    }
    for ( itime = 0; itime < nsteps-t-1; itime++ )
    {
        for ( ibead = 0; ibead < nbeads - 1; ibead++ ) // ibead of the chain
        {
            pdata_i = data + itime * nbeads + ibead;
            pdata_i_t = data + ( itime + t ) * nbeads + ibead;
            dx = pdata_i_t->rx - pdata_i->rx;
            dy = pdata_i_t->ry - pdata_i->ry;
            dz = pdata_i_t->rz - pdata_i->rz;
            mui = dx*dx + dy*dy + dz*dz;
            mui = sqrt(mui);
            for( jbead = ibead + 1; jbead < nbeads; jbead++ )
            {
                pdata_j = data + itime * nbeads + jbead;
                dx = pdata_i->rx - pdata_j->rx;
                dy = pdata_i->ry - pdata_j->ry;
                dz = pdata_i->rz - pdata_j->rz;
                dx = dx - xprd * anint(dx/xprd);
                dy = dy - yprd * anint(dy/yprd);
                dz = dz - zprd * anint(dz/zprd);
                rijsq = dx * dx + dy * dy + dz * dz;
                if ( rijsq < rangeRDF * rangeRDF )
                {
                    rij = sqrt( rijsq );
                    bin = (int)( rij / deltaR );
                    gr0[bin] += 1.0;
                    pdata_j_t = data + (itime+t)*nbeads + jbead;
                    dx = pdata_j_t->rx - pdata_j->rx;
                    dy = pdata_j_t->ry - pdata_j->ry;
                    dz = pdata_j_t->rz - pdata_j->rz;
                    muj = dx*dx + dy*dy + dz*dz;
                    muj = sqrt(muj);
                    gurt[bin] += (mui * muj);
                }
            } // for jbead
        } // for ibead
    } // for itime
    musum = 0.0;
    musqsum = 0.0;
    for ( itime = 0; itime < nsteps-t-1; itime++ )
    {
        for ( ibead = 0; ibead < nbeads; ibead++ )
        {
            pdata_i = data + itime * nbeads + ibead;
            pdata_i_t = data + (itime+t)*nbeads + ibead;
            dx = pdata_i->rx - pdata_i_t->rx;
            dy = pdata_i->ry - pdata_i_t->ry;
            dz = pdata_i->rz - pdata_i_t->rz;
            rijsq = dx*dx + dy*dy + dz*dz;
            musum += sqrt(rijsq);
            musqsum += rijsq;
        }
    }
    musum /= ( nbeads*(nsteps-t-1) );
    musqsum /= ( nbeads*(nsteps-t-1) );
} // for ipoint