High-Performance Matrix Multiplication (.ppt)

Matrix Multiplication
xiafei, 2026/2/27 (Friday)

Four challenges of parallel algorithm optimization compared with traditional object-oriented serial algorithms

- Synchronization: the process by which two or more threads coordinate their behavior.
- Communication: the bandwidth and latency issues associated with exchanging data between threads.
- Load balancing: how the work is distributed among threads; each thread (execution core) should receive an even share of the work.
- Scalability: a measure of whether software can make effective use of more threads when run on a more powerful system, observed by running the application on a higher-end platform (for example, linear growth from 4 cores to 8 cores).

Main decomposition patterns in multithreaded (multicore) design

- Task decomposition: decomposing a program according to the functions it performs.
- Data decomposition: decomposing an application according to the data each task processes, rather than the natural characteristics of the tasks.
- Data-flow decomposition: studying how data flows between tasks and decomposing the problem according to the data-flow relationships among them.

Pattern | Decomposition
Task-level parallelism | Task decomposition
Divide and Conquer | Task/data decomposition
Geometric decomposition | Data decomposition
Pipeline | Data-flow decomposition
Wavefront | Data-flow decomposition

Decomposition | Design | Notes
Task decomposition | Different program behaviors are implemented by different threads | Common in GUI applications
Data decomposition | Multiple threads perform the same operation on different blocks of data | Common in audio, image processing, and scientific computing applications
Data-flow decomposition | The output of one thread is the input of another | Take particular care to minimize startup and drain latency

Matrix multiplication algorithms

- In engineering and scientific computing, the matrix product is the most basic operation.
- The typical algorithm for multiplying dense square matrices of order $n$ has time complexity $O(n^3)$.
- Large matrix products are currently handled mainly with a divide-and-conquer approach that distributes the matrices across multiple nodes, but the small matrices on each node still require a cubic number of multiplications.
- Two divide-and-conquer partitioning strategies, striped partitioning and block (checkerboard) partitioning, give rise to 6 common distributed parallel matrix multiplication algorithms.

Matrix multiplication algorithms by partitioning strategy

1. Parallel matrix multiplication with striped partitioning
   - Row-striped and column-striped partitioning.
   - Pairwise combinations: row-column, row-row, column-column, column-row.
2. Parallel matrix multiplication with block partitioning, also called checkerboard partitioning.

Cannon

Description of the implementation of an MPI program that computes matrix-matrix multiplication using block checkerboard partitioning and Cannon's algorithm.

Objective: compute the matrix-matrix multiplication on an SMP system, using block checkerboard partitioning of the matrices and Cannon's algorithm.

Assumption: the number of processors is $p = q^2$, and the size $n$ of the square matrices A and B is evenly divisible by $q$. The number of blocks is assumed to equal the number of processors.

- Cannon's algorithm is based on a Cartesian virtual topology.
- A and B are square matrices of size $n$, and C is the output matrix.
- These matrices are divided into blocks (submatrices) so that matrix-matrix operations can be performed in parallel.
- An $n \times n$ matrix A can be regarded as a $q \times q$ array of blocks $A_{i,j}$ ($0 \le i < q$, $0 \le j < q$) such that each block is an $(n/q) \times (n/q)$ submatrix.
- We use $p$ processors to implement the block version of matrix multiplication in parallel by choosing $q$ as the square root of $p$ and computing a distinct block $C_{i,j}$ on each processor.

Traditional parallel algorithm

- The matrices A and B are partitioned into $p$ blocks $A_{i,j}$ and $B_{i,j}$ ($0 \le i < q$, $0 \le j < q$) of size $(n/q) \times (n/q)$, one pair per process.
- These blocks are mapped onto a $q \times q$ logical mesh of processes. The processes are labeled from $P_{0,0}$ to $P_{q-1,q-1}$.
- Process $P_{i,j}$ initially stores blocks $A_{i,j}$ and $B_{i,j}$ and computes block $C_{i,j}$ of the result matrix.
- To compute submatrix $C_{i,j}$, we need all submatrices $A_{i,k}$ and $B_{k,j}$ ($0 \le k < q$). To acquire the required blocks, an all-to-all broadcast of the $A_{i,j}$ blocks is performed in each row, and similarly of the $B_{i,j}$ blocks in each column. MPI collective communication is used to perform these operations.
- After $P_{i,j}$ acquires $A_{i,0}, A_{i,1}, A_{i,2}, \ldots, A_{i,q-1}$ and $B_{0,j}, B_{1,j}, B_{2,j}, \ldots, B_{q-1,j}$, it performs the serial block matrix-matrix multiplication and accumulates the partial block matrix $C_{i,j}$ of matrix C.
- To obtain the resulting product matrix C, the process with rank 0 gathers all the block matrices using the MPI_Gather collective communication operation.

Cannon

- The $p$ processors are arranged in a $q \times q$ square grid, and the input matrices A and B are distributed among the processes in checkerboard fashion.
- This yields $p$ block matrices of A and B. The algorithm uses only point-to-point communication to circularly shift blocks of matrix A and matrix B among the $p$ processes.

Cannon-initial

- The algorithm proceeds in $q$ stages.
- The first step is to perform the initial alignment of block matrix A and block matrix B.
- The blocks of matrix A are circularly shifted $i$ positions to the left within their row of the square process grid, where $i$ is the row number of the process in the mesh.
- The blocks of matrix B are circularly shifted $j$ positions upwards, where $j$ is the column number of the process in the mesh.

Cannon-running

The algorithm performs the following steps in each stage:

1. Multiply the local block of matrix A by the local block of matrix B and add the result to the block of matrix C, which is initially set to zero.
2. Circularly shift the blocks of matrix A to the left along the rows of the process grid and the blocks of matrix B upwards along the columns, in a wrap-around manner.

Cannon-bug (in the book)

MPI_Send and MPI_Recv are not used for point-to-point communication, because a deadlock may arise depending on the order in which the processes call MPI_Send and MPI_Recv.

How to fix?

- Allocate a buffer and use the nonblocking communication functions MPI_Irecv/MPI_Isend together with MPI_Wait.
- Use MPI_Sendrecv.

Cannon-bug: the deadlock problem

The problem comes from how the MPI functions are used in main_shift(). In the main_shift() module of the Cannon-mpi code, the algorithm in the literature uses MPI's blocking communication functions MPI_Send/MPI_Recv. As a result, when Cannon's algorithm performs the circular left shift and the circular upward shift, a circular-wait deadlock occurs once the matrix size exceeds the capacity of the shared buffer.

On the Dawning 4000 (曙光4000) cluster system, the smallest matrix size at which this algorithm deadlocks is a 200 x 200 floating-point matrix.

Cannon-bug: the original (blocking) main_shift module

    void main_shift()
    {
        /* shift block a left */
        MPI_Send(a, dl2, MPI_FLOAT, get_index(my_row, my_col-1, sp), 1, MPI_COMM_WORLD);
        MPI_Recv(a, dl2, MPI_FLOAT, get_index(my_row, my_col+1, sp), 1, MPI_COMM_WORLD, ...);
        /* shift block b up */
        MPI_Send(b, dl2, MPI_FLOAT, get_index(my_row-1, my_col, sp), 1, MPI_COMM_WORLD);
        MPI_Recv(b, dl2, MPI_FLOAT, get_index(my_row+1, my_col, sp), 1, MPI_COMM_WORLD, ...);
    }

Cannon-bug: the improved (nonblocking) main_shift module

    c[i*dl+j] += a[i*dl+k] * b[j*dl+k];   /* improved Cannon: row-wise access */

    /* shift block a left */
    MPI_Isend(a, dl2, MPI_FLOAT, get_index(my_row, my_col-1, sp), 1, MPI_COMM_WORLD, ...);
    MPI_Irecv(buf, dl2, MPI_FLOAT, get_index(my_row, my_col+1, sp), 1, MPI_COMM_WORLD, ...);
    MPI_Wait(...);
    MPI_Wait(...);
    memcpy(a, buf, sizeof(float)*dl2);

    /* shift block b up */
    MPI_Isend(b, dl2, MPI_FLOAT, get_index(my_row-1, my_col, sp), 1, MPI_COMM_WORLD, ...);
    MPI_Irecv(buf, dl2, MPI_FLOAT, get_index(my_row+1, my_col, sp), 1, MPI_COMM_WORLD, ...);
    MPI_Wait(...);
    MPI_Wait(...);
    memcpy(b, buf, sizeof(float)*dl2);

Cannon-bug: notes on the nonblocking calls

- MPI_Irecv only initiates the receive operation; the buffer must not be accessed until the matching MPI_Wait call returns.
- When MPI_Irecv returns, handle points to an MPI_Request object that represents a communication operation that has been initiated. The function does not return a pointer to an MPI_Status object, because the actual receive has not yet completed.
- MPI_Wait blocks until the operation associated with its handle argument completes. For a send, new values can then be written to the buffer. For a receive, the message can then be read from the buffer, and the MPI_Status object that status points to contains information about the received message.
- The new buf exists to prevent data from being received into a before a has been sent, which would corrupt the data. Only after MPI_Wait returns is memcpy called to write the contents of buf back into a, completing the update.

Main modules of the Cannon multiplication MPI code

    int get_index(int row, int col, int sp)   // convert processor logical-grid coordinates to a rank
    void random_A_B()                         // randomly generate matrices A and B
    void scatter_A_B()                        // the rank 0 processor distributes the relevant blocks of A and B
    void init_alignment()                     // initial alignment of matrices A and B
    void main_shift()                         // shift blocks left and up and compute block c;
                                              // this module is the focus of my changes to the algorithm
    void collect_c()                          // the rank 0 processor collects block matrix c from the other processors
    void print(float *m, char *str)           // print a matrix
    int main(int argc, char *argv[])          // main routine: Cannon's algorithm, matrix multiplication

Cannon-review

- Circular-shift alignment: left shift, upward shift.
- Divide-and-conquer parallel computing: task partitioning, data partitioning.
- Streamlined communication: All-to-All replaced by Point-to-Point.
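The Cannon stages described above (initial alignment, then q rounds of multiply and shift) can be sketched without MPI by simulating the q x q process grid inside one address space. This is a minimal illustration under stated assumptions, not the code from the slides: the names shift_row_left, shift_col_up, copy_blocks, cannon_multiply and cannon_max_error are hypothetical, each "process" is just a slot in an array of blocks, and the circular shifts are done with memcpy/memmove where a real MPI program would exchange messages.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Circularly shift one row of the block grid one position left (a grid row is contiguous). */
static void shift_row_left(float *blk, int q, size_t bsz, int row, float *tmp)
{
    float *base = blk + (size_t)row * q * bsz;
    memcpy(tmp, base, sizeof(float) * bsz);                  /* save block (row, 0) */
    memmove(base, base + bsz, sizeof(float) * bsz * (q - 1));
    memcpy(base + (q - 1) * bsz, tmp, sizeof(float) * bsz);  /* wrap around */
}

/* Circularly shift one column of the block grid one position upwards. */
static void shift_col_up(float *blk, int q, size_t bsz, int col, float *tmp)
{
    memcpy(tmp, blk + col * bsz, sizeof(float) * bsz);       /* save block (0, col) */
    for (int i = 0; i + 1 < q; i++)
        memcpy(blk + ((size_t)i * q + col) * bsz,
               blk + ((size_t)(i + 1) * q + col) * bsz, sizeof(float) * bsz);
    memcpy(blk + ((size_t)(q - 1) * q + col) * bsz, tmp, sizeof(float) * bsz);
}

/* Scatter the row-major matrix m into per-process blocks (to_blocks != 0) or gather back. */
static void copy_blocks(float *m, float *blk, int n, int q, int to_blocks)
{
    int bs = n / q;
    for (int i = 0; i < q; i++)
        for (int j = 0; j < q; j++)
            for (int r = 0; r < bs; r++) {
                float *in_m = m + ((size_t)i * bs + r) * n + (size_t)j * bs;
                float *in_b = blk + ((size_t)i * q + j) * bs * bs + (size_t)r * bs;
                if (to_blocks) memcpy(in_b, in_m, sizeof(float) * bs);
                else           memcpy(in_m, in_b, sizeof(float) * bs);
            }
}

/* C = A * B for n x n row-major matrices, simulating Cannon on a q x q grid (no MPI). */
void cannon_multiply(const float *a, const float *b, float *c, int n, int q)
{
    assert(q > 0 && n > 0 && n % q == 0);
    int bs = n / q;
    size_t bsz = (size_t)bs * bs, total = (size_t)n * n;
    float *ab = malloc(sizeof(float) * total), *bb = malloc(sizeof(float) * total);
    float *cb = calloc(total, sizeof(float));    /* C blocks start at zero */
    float *tmp = malloc(sizeof(float) * bsz);
    assert(ab && bb && cb && tmp);
    copy_blocks((float *)a, ab, n, q, 1);        /* a and b are only read here */
    copy_blocks((float *)b, bb, n, q, 1);
    /* Initial alignment: row i of A moves left i times, column j of B moves up j times. */
    for (int i = 0; i < q; i++)
        for (int s = 0; s < i; s++) {
            shift_row_left(ab, q, bsz, i, tmp);
            shift_col_up(bb, q, bsz, i, tmp);
        }

    /* q stages: every process multiplies its current blocks, then all blocks shift by one. */
    for (int stage = 0; stage < q; stage++) {
        for (int p = 0; p < q * q; p++) {
            const float *pa = ab + p * bsz, *pb = bb + p * bsz;
            float *pc = cb + p * bsz;
            for (int i = 0; i < bs; i++)
                for (int k = 0; k < bs; k++)
                    for (int j = 0; j < bs; j++)
                        pc[i * bs + j] += pa[i * bs + k] * pb[k * bs + j];
        }
        for (int i = 0; i < q; i++) {
            shift_row_left(ab, q, bsz, i, tmp);
            shift_col_up(bb, q, bsz, i, tmp);
        }
    }
    copy_blocks(c, cb, n, q, 0);                 /* gather the C blocks into c */
    free(ab); free(bb); free(cb); free(tmp);
}

/* Largest absolute difference from the naive triple loop on fixed test data. */
float cannon_max_error(int n, int q)
{
    size_t total = (size_t)n * n;
    float *a = malloc(sizeof(float) * total), *b = malloc(sizeof(float) * total);
    float *c = malloc(sizeof(float) * total), *ref = calloc(total, sizeof(float));
    assert(a && b && c && ref);
    for (size_t i = 0; i < total; i++) {
        a[i] = (float)(i % 7) - 3.0f;  b[i] = (float)(i % 5) * 0.5f;
    }
    cannon_multiply(a, b, c, n, q);
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                ref[i * n + j] += a[i * n + k] * b[k * n + j];
    float err = 0.0f;
    for (size_t i = 0; i < total; i++) {
        float d = c[i] > ref[i] ? c[i] - ref[i] : ref[i] - c[i];
        if (d > err) err = d;
    }
    free(a); free(b); free(c); free(ref);
    return err;
}
```

For example, cannon_max_error(6, 3) runs the simulation on a 3 x 3 grid of 2 x 2 blocks and compares the result with the plain triple loop. In an MPI version, each shift would become a message exchange between neighboring processes, for instance the MPI_Isend/MPI_Irecv/MPI_Wait sequence or the MPI_Sendrecv call named in the slides.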
