Computing only the diagonal of a matrix product

tho0o 2016-12-26 09:27:50
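I have two matrices, A and B, each with tens of thousands of rows and columns. I only need the diagonal of their product, but I don't want to compute the full product matrix and then read off its diagonal (that wastes a lot of time). I just want to compute the diagonal entries directly and output them.

I need to implement this in C++. Any experts out there?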
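For reference, diagonal entry i of C = A*B is the dot product of row i of A with column i of B, so the whole diagonal needs about N*N multiply-adds instead of the N*N*N required for the full product. A minimal sketch of that idea follows, assuming square matrices stored row-major in flat std::vector<double> buffers; the function name diag_of_product and the storage layout are illustrative assumptions, not something the question specifies.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the diagonal of A*B for n x n row-major
// matrices without forming the product matrix.
std::vector<double> diag_of_product(const std::vector<double>& a,
                                    const std::vector<double>& b,
                                    std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    std::vector<double> diag(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        double sum = 0.0;
        // Row i of A times column i of B.
        for (std::size_t k = 0; k < n; ++k)
            sum += a[i * n + k] * b[k * n + i];
        diag[i] = sum;
    }
    return diag;
}
```

Calling diag_of_product(a, b, 3) with a = {1,2,3, 4,5,6, 7,8,9} and b = {9,8,7, 6,5,4, 3,2,1} gives {30, 69, 90}. The column access into B is strided, so for very large matrices it may pay to store B transposed; that is a tuning choice rather than a requirement.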
30 replies
Junlys 2017-01-14
How do I earn points? How do I earn points? How do I earn points? How do I earn points? How do I earn points?
副组长 2017-01-02
Nice, this is good stuff. Thanks for sharing.
lkj2016 2016-12-31
Bowing down to the experts.
NoEdUl 2016-12-30
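Quoting reply #26 by lunat:
The OP asked about computing the diagonal; the thread has drifted off topic.
It was unintentional, but something useful came of it.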
ly_490380384 2016-12-30
666666666666666666
ljheee 2016-12-29
Please post your complete code.
paschen 2016-12-29
Quoting reply #20 by u012909435, which carries the chain of replies #5, #10, #11, #17 and #18:

Reply #5 (paschen): An efficient matrix multiplication routine (it looks complicated, but it is fast because it is blocked to make good use of the CPU cache):

#define CACHE_LINE_SIZE 64
// SM = number of doubles per cache line (8 when sizeof(double) == 8)
#define SM (CACHE_LINE_SIZE / sizeof(double))
	int i2, j2, k2;
	double* rres, *rm1, *rm2;
	// The outer three loops walk SM x SM tiles; the inner three multiply one pair of tiles.
	// Requires res3 to be zero-initialized and N to be a multiple of SM.
	for (int i = 0; i < N; i += SM)
		for (int j = 0; j < N; j += SM)
			for (int k = 0; k < N; k += SM)
				for (i2 = 0, rres = &res3[i][j], rm1 = &matrix1[i][k]; i2 < SM; ++i2, rres +=N, rm1 += N)
					for (k2 = 0, rm2 = &matrix2[k][j]; k2 < SM; ++k2, rm2 += N)
						for (j2 = 0; j2 < SM; ++j2)
							rres[j2] += rm1[k2] * rm2[j2];
The result is accumulated in res3; the matrices being multiplied are matrix1 and matrix2.

Reply #10 (u012947309): I couldn't really follow it... and when I ran it, the result was wrong.

Reply #11 (paschen): You probably set something up incorrectly. Please post your complete code.

Reply #17 (u012947309): [complete test program with 5x5 arrays, N = 5 and an uninitialized res3; code snipped]

Reply #18 (paschen): [corrected program with #define N 8 and res3 initialized to zero; code snipped] The matrix dimension must be at least 8 (the example uses N = 8). This optimization mainly reduces cache misses by using blocks sized to the largest cache line size. Also, res3 must be initialized to zero first. For large matrices this approach is noticeably faster: on my machine, a 1000*1000 matrix multiplication took 13.17 s with the ordinary method and only 1.49 s with this one.

Reply #20 (u012909435): Doesn't this require N to be divisible by SM? The exit conditions of the three inner loops would have to be changed to
i2 < SM && i + i2 < N
k2 < SM && k + k2 < N
j2 < SM && j + j2 < N
Would that handle the non-divisible case?

Yes, you hit the nail on the head.
paschen 2016-12-29
Quoting reply #22 by u012909435 (quoted kernel from reply #5 snipped):
[...] A 4096*4096 matrix multiplication should involve 64G multiplications plus 64G additions. So does this performance count as 16 Gflop/s, or 32 Gflop/s, or how should it be calculated?

I can't say for sure whether that is how it's counted either. You could use the LinX tool to measure FLOPS.
cattpon 2016-12-29
Let me see what this is.
xxiaoccen 2016-12-29
Quoting reply #5 by paschen:
An efficient matrix multiplication routine (it looks complicated, but it is fast because it is blocked to make good use of the CPU cache): [code snipped] The result is accumulated in res3; the matrices being multiplied are matrix1 and matrix2.

I tried it following your approach: OpenMP + cache blocking + loop interchange + vectorization, tested on a Xeon X5675 with a 4096*4096 matrix multiplication. The trivial implementation takes about 1171 s; the parallel optimized version takes 4 s. With 12 threads, that is a speedup of more than 200x. A 4096*4096 matrix multiplication should involve 64G multiplications plus 64G additions. So does this performance count as 16 Gflop/s, or 32 Gflop/s, or how should it be calculated?
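The arithmetic behind that question can be written out. Assuming the common convention that every multiplication and every addition counts as one floating-point operation (the post leaves the convention open), an N x N product that accumulates into a zeroed result performs 2*N*N*N operations. The helper names below are hypothetical and only illustrate the calculation.

```cpp
#include <cassert>
#include <cstdint>

// Operations in a dense n x n matrix product: n^3 multiplications plus
// n^3 additions (one add per multiply when accumulating into zeros).
std::uint64_t matmul_flops(std::uint64_t n) {
    return 2 * n * n * n;
}

// Sustained rate in flop/s for a given wall-clock time in seconds.
double flops_per_second(std::uint64_t n, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(matmul_flops(n)) / seconds;
}
```

For N = 4096 and 4 s this gives 2^37 / 4 = 2^35 operations per second, which is about 34.4 Gflop/s with a decimal giga, or 32 Gflop/s if "G" means 2^30 as in the post's "64G". Counting only multiplications (or fused multiply-adds) would halve the figure to 16.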
lunat 2016-12-29
The OP asked about computing the diagonal; this thread has drifted off topic.
GaryCV 2016-12-28
Impressive. Saving this for reference.
xxiaoccen 2016-12-28
Quoting reply #18 by paschen:
[earlier quote chain and both test programs snipped] The matrix dimension must be at least 8 (the example uses N = 8). This optimization mainly reduces cache misses by using blocks sized to the largest cache line size. Also, res3 must be initialized to zero first. For large matrices this approach is noticeably faster: on my machine, a 1000*1000 matrix multiplication took 13.17 s with the ordinary method and only 1.49 s with this one.

This requires N to be divisible by SM, doesn't it? The exit conditions of the three inner loops would have to be changed to
i2 < SM && i + i2 < N
k2 < SM && k + k2 < N
j2 < SM && j + j2 < N
Would that handle the non-divisible case?
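The bounds checks proposed here can be sketched as follows. This is an equivalent formulation rather than the code from the thread: it clamps each tile's end index with std::min instead of testing i + i2 < N inside the loop condition, and it assumes row-major matrices in flat std::vector<double> buffers on the heap (for matrices in the thousands of rows, stack arrays like those in the test programs would likely be too large). The name blocked_matmul and the tile parameter are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: C = A * B for n x n row-major matrices, processed in
// tile x tile blocks. std::min clamps the last, possibly partial, tile so
// n does not have to be a multiple of the tile size.
std::vector<double> blocked_matmul(const std::vector<double>& a,
                                   const std::vector<double>& b,
                                   std::size_t n, std::size_t tile = 8) {
    assert(tile > 0 && a.size() == n * n && b.size() == n * n);
    std::vector<double> c(n * n, 0.0);  // zero start: the kernel accumulates
    for (std::size_t i = 0; i < n; i += tile)
        for (std::size_t j = 0; j < n; j += tile)
            for (std::size_t k = 0; k < n; k += tile) {
                const std::size_t i_end = std::min(i + tile, n);
                const std::size_t j_end = std::min(j + tile, n);
                const std::size_t k_end = std::min(k + tile, n);
                for (std::size_t i2 = i; i2 < i_end; ++i2)
                    for (std::size_t k2 = k; k2 < k_end; ++k2) {
                        const double aik = a[i2 * n + k2];
                        // Innermost loop runs along a row of B and C.
                        for (std::size_t j2 = j; j2 < j_end; ++j2)
                            c[i2 * n + j2] += aik * b[k2 * n + j2];
                    }
            }
    return c;
}
```

With the thread's 5x5 test data, where both inputs hold i*j at position (i, j), every entry of the result should equal 30*i*j, which makes a convenient check for the non-divisible case.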
tho0o 2016-12-28
Quoting reply #17 by u012947309:
[quote chain from replies #5, #10 and #11 and the complete 5x5 test program snipped]

Thanks, I've learned a lot. Is there also a way to speed this up with SSE?
paschen 2016-12-28
Quoting reply #17 by u012947309:
[quote chain from replies #5, #10 and #11 and the complete 5x5 test program snipped]

#include <vector>
#include <iostream>
using namespace std;

#define CACHE_LINE_SIZE 64
#define SM (CACHE_LINE_SIZE / sizeof(double))
#define N 8

int main()
{
	double matrix1[N][N];
	double matrix2[N][N];
	double res3[N][N]={0};
	//initialize
	for(int i=0;i<N;i++)
	{
		for(int j=0;j<N;j++)
		{
			matrix1[i][j] = i*j;
			matrix2[i][j] = i*j;
		}
	}
	for(int i=0;i<N;i++)
	{
		for(int j=0;j<N;j++)
		{
			cout<<matrix1[i][j]<<" ";
		}
		cout<<endl;
	}
	cout<<endl;
	for(int i=0;i<N;i++)
	{
		for(int j=0;j<N;j++)
		{
			cout<<matrix2[i][j]<<" ";
		}
		cout<<endl;
	}
	cout<<endl;

	int i2, j2, k2;
	double* rres, *rm1, *rm2;
	for (int i = 0; i < N; i += SM)
		for (int j = 0; j < N; j += SM)
			for (int k = 0; k < N; k += SM)
				for (i2 = 0, rres = &res3[i][j], rm1 = &matrix1[i][k]; i2 < SM; ++i2, rres +=N, rm1 += N)
					for (k2 = 0, rm2 = &matrix2[k][j]; k2 < SM; ++k2, rm2 += N)
						for (j2 = 0; j2 < SM; ++j2)
							rres[j2] += rm1[k2] * rm2[j2];

	for(int i=0;i<N;i++)
	{
		for(int j=0;j<N;j++)
		{
			cout<<res3[i][j]<<" ";
		}
		cout<<endl;
	}

	return 0;
}
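The matrix dimension must be at least 8 (the example uses N = 8). This optimization mainly reduces cache misses by using blocks sized to the largest cache line size. Also, res3 must be initialized to zero first. For large matrices this approach is noticeably faster: on my machine, a 1000*1000 matrix multiplication took 13.17 s with the ordinary method and only 1.49 s with this one.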
NoEdUl 2016-12-28
Quoting reply #11 by paschen (quoted kernel from reply #5 snipped):
Reply #10 (u012947309): I couldn't really follow it... and when I ran it, the result was wrong.
Reply #11 (paschen): You probably set something up incorrectly. Please post your complete code.

#include <vector>
#include <iostream>
using namespace std;

#define CACHE_LINE_SIZE 64
#define SM (CACHE_LINE_SIZE / sizeof(double))

int main()
{
	double matrix1[5][5];
	double matrix2[5][5];
	double res3[5][5];
	//initialize
	for(int i=0;i<5;i++)
	{
		for(int j=0;j<5;j++)
		{
			matrix1[i][j] = i*j;
			matrix2[i][j] = i*j;
		}
	}
	for(int i=0;i<5;i++)
	{
		for(int j=0;j<5;j++)
		{
			cout<<matrix1[i][j]<<" ";
		}
		cout<<endl;
	}
	cout<<endl;
	for(int i=0;i<5;i++)
	{
		for(int j=0;j<5;j++)
		{
			cout<<matrix2[i][j]<<" ";
		}
		cout<<endl;
	}
	cout<<endl;

	int N=5;
	int i2, j2, k2;
	double* rres, *rm1, *rm2;
	for (int i = 0; i < N; i += SM)
		for (int j = 0; j < N; j += SM)
			for (int k = 0; k < N; k += SM)
				for (i2 = 0, rres = &res3[i][j], rm1 = &matrix1[i][k]; i2 < SM; ++i2, rres +=N, rm1 += N)
					for (k2 = 0, rm2 = &matrix2[k][j]; k2 < SM; ++k2, rm2 += N)
						for (j2 = 0; j2 < SM; ++j2)
							rres[j2] += rm1[k2] * rm2[j2];

	for(int i=0;i<5;i++)
	{
		for(int j=0;j<5;j++)
		{
			cout<<res3[i][j]<<" ";
		}
		cout<<endl;
	}

	return 0;
}
xxiaoccen 2016-12-28
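// Accumulate only the diagonal of a*b, one SM x SM tile at a time.
// Assumes diag[] starts at zero and nrows, ncols are multiples of SM.
for(int i = 0; i < nrows; i += SM) {
    for(int j = 0; j < ncols; j += SM) {
        for(int ii = 0; ii < SM; ii++) {
            for(int jj = 0; jj < SM; jj++) {
                // row (i + ii) of a times column (i + ii) of b
                diag[i + ii] += a[i + ii][j + jj] * b[j + jj][i + ii];
            }
        }
    }
}
Diagonal only.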
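The tiled loop above steps through full SM x SM tiles, so it relies on nrows and ncols being multiples of SM. A sketch that also handles partial tiles is given below, assuming A is n x m and B is m x n, both row-major in flat std::vector<double> buffers; blocked_diag and its tile parameter are hypothetical names chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: diagonal of A*B with A (n x m) and B (m x n),
// accumulated tile by tile; std::min clamps the partial tiles at the edges.
std::vector<double> blocked_diag(const std::vector<double>& a,
                                 const std::vector<double>& b,
                                 std::size_t n, std::size_t m,
                                 std::size_t tile = 8) {
    assert(tile > 0 && a.size() == n * m && b.size() == m * n);
    std::vector<double> diag(n, 0.0);
    for (std::size_t i = 0; i < n; i += tile) {
        const std::size_t i_end = std::min(i + tile, n);
        for (std::size_t j = 0; j < m; j += tile) {
            const std::size_t j_end = std::min(j + tile, m);
            for (std::size_t ii = i; ii < i_end; ++ii) {
                double sum = 0.0;
                // Partial dot product of row ii of A with column ii of B.
                for (std::size_t jj = j; jj < j_end; ++jj)
                    sum += a[ii * m + jj] * b[jj * n + ii];
                diag[ii] += sum;
            }
        }
    }
    return diag;
}
```

Tiling both indices keeps the strided reads of B's columns within a small block of rows, which is the point of the original loop; whether it beats the untiled double loop depends on the matrix size and the machine, so it is worth measuring.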
xxiaoccen 2016-12-28
Quoting reply #11 by paschen (quoted kernel from reply #5 snipped):
Reply #10 (u012947309): I couldn't really follow it... and when I ran it, the result was wrong.
Reply #11 (paschen): You probably set something up incorrectly. Please post your complete code.

Is this cache blocking?
NoEdUl 2016-12-27
Quoting reply #5 by paschen:
An efficient matrix multiplication routine (it looks complicated, but it is fast because it is blocked to make good use of the CPU cache): [code snipped] The result is accumulated in res3; the matrices being multiplied are matrix1 and matrix2.

I couldn't really follow it... and when I ran it, the result was wrong.
NoEdUl 2016-12-27
Quoting reply #5 by paschen:
An efficient matrix multiplication routine (it looks complicated, but it is fast because it is blocked to make good use of the CPU cache): [code snipped] The result is accumulated in res3; the matrices being multiplied are matrix1 and matrix2.

I happen to be working on large-matrix decomposition at the moment, so I'll use this as a reference.