A brief look at performance analysis with VTune(TM) Performance Analyzer, and how it compares with Intel(R) Thread Profiler
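Intel(R) VTune(TM) Performance Analyzer provides powerful performance analysis features, including process, thread, module, and hotspot reports. You can drill down level by level until the cause of a performance problem is pinned to the source code. Unlike Intel(R) Thread Profiler, VTune does not report certain performance data, such as the program's degree of parallelism, or the overhead of synchronization objects and how they relate to the threads involved. Thread Profiler, in turn, lacks VTune's hotspot report and its source-level analysis of processor-architecture-related performance.
In some areas, however, the two tools arrive at the same result by different means.
Below is the code for four ways of computing Pi (serial, traditional WinThread parallel, OpenMP, and TBB):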
#include <windows.h>
#include <stdio.h>
#include <iostream>
#include <time.h>
#include <omp.h>
#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/spin_mutex.h"
#include "tbb/tick_count.h"
const int num_steps = 100000000;
const int num_threads = 4; // My laptop is T61
double step = 0.0, pi = 0.0;
static tbb::spin_mutex myMutex;
static CRITICAL_SECTION cs;
void Serial_Pi()
{
double x, sum = 0.0;
int i;
step = 1.0/(double) num_steps;
for (i=0; i< num_steps; i++){
x = (i+0.5)*step;
sum = sum + 4.0/(1.0 + x*x);
}
pi = step * sum;
}
DWORD WINAPI threadFunction(LPVOID pArg)
{
double partialSum = 0.0, x; // local to each thread
int myNum = *((int *)pArg);
for ( int i=myNum; i<num_steps; i+=num_threads ) // use every num_threads step
{
x = (i + 0.5) * step; // double arithmetic: (i + 0.5f) would round large i to float precision
partialSum += 4.0 / (1.0 + x*x); // compute partial sums at each thread
}
EnterCriticalSection(&cs);
pi += partialSum * step; // add partial to global final answer
LeaveCriticalSection(&cs);
return 0;
}
void WinThread_Pi()
{
HANDLE threadHandles[num_threads];
int tNum[num_threads];
InitializeCriticalSection(&cs);
step = 1.0 / num_steps;
for ( int i=0; i<num_threads; ++i )
{
tNum[i] = i;
threadHandles[i] = CreateThread( NULL, // Security attributes
0, // Stack size
threadFunction, // Thread function
(LPVOID)&tNum[i],// Data for thread func()
0, // Thread start mode
NULL); // Returned thread ID
}
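WaitForMultipleObjects(num_threads, threadHandles, TRUE, INFINITE);
for ( int i=0; i<num_threads; ++i )
CloseHandle(threadHandles[i]); // release each thread handle
DeleteCriticalSection(&cs); // release the critical section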
}
void OpenMP_Pi()
{
double x, sum=0.0;
int i;
step = 1.0 / (double)num_steps;
omp_set_num_threads(num_threads); // same thread count as the WinThread version
#pragma omp parallel for private (x) reduction(+:sum) //schedule(static,4)
for (i=0; i<num_steps; i++)
{
x = (i + 0.5)*step;
sum = sum + 4.0/(1. + x*x);
}
pi = sum*step;
}
class ParallelPi {
public:
void operator() (tbb::blocked_range<int>& range) const {
double x, sum = 0.0;
for (int i = range.begin(); i < range.end(); ++i) {
x = (i+0.5)*step;
sum = sum + 4.0/(1.0 + x*x);
}
tbb::spin_mutex::scoped_lock lock(myMutex);
pi += step * sum;
}
};
void TBB_Pi ()
{
step = 1.0/(double) num_steps;
tbb::parallel_for (tbb::blocked_range<int> (0, num_steps), ParallelPi(), tbb::auto_partitioner());
}
int main()
{
clock_t start, stop;
// Computing pi by using serial code
pi = 0.0;
start = clock();
Serial_Pi();
stop = clock();
printf ("Computed value of Pi by using serial code: %12.9f\n", pi);
printf ("Elapsed time: %.2f seconds\n", (double)(stop-start)/CLOCKS_PER_SEC);
// Computing pi by using Windows Threads
pi = 0.0;
start = clock();
WinThread_Pi();
stop = clock();
printf ("Computed value of Pi by using WinThreads: %12.9f\n", pi);
printf ("Elapsed time: %.2f seconds\n", (double)(stop-start)/CLOCKS_PER_SEC);
// Computing pi by using OpenMP
pi = 0.0;
start = clock();
OpenMP_Pi();
stop = clock();
printf ("Computed value of Pi by using OpenMP: %12.9f\n", pi);
printf ("Elapsed time: %.2f seconds\n", (double)(stop-start)/CLOCKS_PER_SEC);
// Computing pi by using TBB
pi = 0.0;
start = clock();
tbb::task_scheduler_init tbb_init;
TBB_Pi();
stop = clock();
printf ("Computed value of Pi by using TBB: %12.9f\n", pi);
printf ("Elapsed time: %.2f seconds\n", (double)(stop-start)/CLOCKS_PER_SEC);
return 0;
}
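For reference, all four routines in the listing evaluate the same midpoint-rule approximation of a standard integral. Writing N for num_steps (so that step = 1/N), the computation can be stated as follows; the symbols N and x_i are introduced here only for illustration:

```latex
\pi = \int_0^1 \frac{4}{1+x^2}\,dx
\;\approx\; \frac{1}{N}\sum_{i=0}^{N-1}\frac{4}{1+x_i^2},
\qquad x_i = \frac{i+0.5}{N}
```

The variants differ only in how the index range 0..N-1 is divided. The WinThread version gives thread k the indices k, k+num_threads, k+2*num_threads, and so on, while the OpenMP and TBB versions leave the partitioning to the runtime.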
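From VTune's process report we can see the overall performance of the "Pi" process; see Figure 1 in post #5.
When we look at the thread report for the "Pi" process, however, we cannot tell which thread did which piece of work; see Figure 2 in post #5.
What we can do is select all the threads and use the Sampling Over Time (SOT) view to generate a new report. The timeline then shows which threads correspond to which piece of work; see Figure 3 in post #5.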
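One way to make the timeline easier to read is to have the program print when each of the four phases starts and ends, so the offsets can be compared by eye with the SOT view. The following is only a minimal sketch in standard C++: PhaseLog and PhaseRecord are hypothetical helper names, not part of VTune or of the listing in this post, and it assumes the profiler's timeline starts at roughly the same moment as the process.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper (not a VTune API): one entry per benchmark phase.
struct PhaseRecord {
    std::string name;
    double begin_s;  // seconds since the log was created
    double end_s;    // seconds since the log was created
};

// Hypothetical helper: runs each phase and remembers when it started and ended.
class PhaseLog {
public:
    PhaseLog() : origin_(std::chrono::steady_clock::now()) {}

    // Runs `work` and records its start and end offsets under `name`.
    template <typename Work>
    void run(const std::string& name, Work work) {
        const double begin = elapsed();
        work();
        const double end = elapsed();
        assert(end >= begin);  // steady_clock never goes backwards
        records_.push_back({name, begin, end});
    }

    // Prints one line per phase, e.g. for comparison with a profiler timeline.
    void print() const {
        for (const PhaseRecord& r : records_)
            std::printf("%-12s %8.3f s -> %8.3f s\n",
                        r.name.c_str(), r.begin_s, r.end_s);
    }

    const std::vector<PhaseRecord>& records() const { return records_; }

private:
    // Seconds elapsed since this log was constructed.
    double elapsed() const {
        return std::chrono::duration<double>(
                   std::chrono::steady_clock::now() - origin_).count();
    }

    std::chrono::steady_clock::time_point origin_;
    std::vector<PhaseRecord> records_;
};
```

In main(), one would create a PhaseLog first thing, wrap each call, for example log.run("Serial", Serial_Pi) and log.run("WinThread", WinThread_Pi), and call log.print() at the end. The printed intervals are wall-clock offsets, so they only line up with the profiler's timeline approximately.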
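These results are consistent with what Thread Profiler shows; see the figure below. (Note: the OpenMP threads are shown shaded, and the TBB threads are shown as a single bar only.) See Figure 4 in post #5.
I won't go into the deeper source-level analysis here (in VTune, drilling down from thread to module to source code; in Thread Profiler, the Critical Path, Transition, and Source views).
For the images, please see post #5 below. Thanks.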