1、AgendaIntroductionWho Cares?DefinitionLoop Dependence and RemovalDependency Identification LabSummaryIntroductionLoops must meet certain criteriaIteration IndependenceIteration IndependenceMemory DisambiguationMemory DisambiguationHigh Loop CountHigh Loop CountEtcEtcWho Cares实现真正的并行实现真正的并行:OpenMPOpe
2、nMP Auto ParallelizationAuto Parallelization显式的指令级并行显式的指令级并行 ILP(Instruction Level ILP(Instruction Level Parallelism)Parallelism)Streaming SIMD(MMX,SSE,SSE2,)Streaming SIMD(MMX,SSE,SSE2,)Software Pipelining on Intel Itanium ProcessorSoftware Pipelining on Intel Itanium Processor Remove Dependencies
3、for the Out-of-Order CoreRemove Dependencies for the Out-of-Order Core More Instructions run in parallel on Intel Itanium-ProcessorMore Instructions run in parallel on Intel Itanium-Processor自动编译器并行自动编译器并行 High Level OptimizationsHigh Level OptimizationsDefinitionint aMAX;for(J=0;JMAX;J+)aJ=bJ;Loop
4、Independence:Loop Independence:Iteration Y of a loop is Iteration Y of a loop is independent of when independent of when or whether iteration X or whether iteration X happenshappens图例图例OpenMP:True ParallelismSIMD:VectorizationSWP:Software PipeliningOOO:Out-of-Order Core ILP:Instruction Level Paralle
5、lism Green:Benefits from concept Yellow:Some Benefits from Concept Red:No Benefit from Concept AgendaDefinitionWho Cares?Loop Dependence and RemovalData DependenciesData DependenciesRemoving DependenciesRemoving DependenciesData Ambiguity and the CompilerData Ambiguity and the CompilerDependency Rem
6、oval LabSummaryFlow DependencyRead After WriteCross-Iteration Flow Dependence:Variables written then read in different iterationsfor(J=1;JMAX;J+)AJ=AJ-1;A1=A0;A2=A1;Anti-DependencyWrite After ReadCross-Iteration Anti-Dependence:Variables written then read in different iterationsfor(J=1;JMAX;J+)AJ=AJ
7、+1;A1=A2;A2=A3;Output DependencyWrite After WriteCross-Iteration Output Dependence:Variables written then written again in a different iterationfor(J=1;JMAX;J+)AJ=BJ;AJ+1=CJ;A1=B1;A2=C1;A2=B1;A3=C1;IntraIteration DependencyDependency within an iterationHurts ILPMay be automatically removed by compil
8、erK=1;for(J=1;JMAX;J+)AJ=AJ+1;BK=AK+1;K=K+2;A1=A1+1;B1=A1+1;for(J=1;JMAX;J+)AJ=A0+J;Remove DependenciesBest ChoiceBest ChoiceRequirement for Requirement for true Parallelismtrue ParallelismNot all Not all dependencies can dependencies can be removedbe removedfor(J=1;JMAX;J+)AJ=AJ-1+1;for(J=1;JMAX;J+
9、=2)AJ=AJ-1+BJ;AJ+1=AJ-1+(BJ+BJ+1);Increasing ILP,without removing dependencies Good:Unroll LoopGood:Unroll Loop Make sure the compiler Make sure the compiler cant or didnt do this for cant or didnt do this for youyou Compiler should not Compiler should not apply common sub-apply common sub-expressio
10、n elimination expression elimination Also notice that if this is Also notice that if this is floating point data-floating point data-precision could be precision could be alteredalteredfor(J=1;JMAX;J+)AJ=AJ-1+BJ;Induction VariablesInduction variables are Induction variables are incremented on each i
11、ncremented on each trip through the looptrip through the loopFix by replacing Fix by replacing increment expressions increment expressions with pure function of with pure function of loop indexloop indexi1=0;i2=0;for(J=0,JMAX,J+)i1=i1+1;B(i1)=i2=i2+J;A(i2)=for(J=0,JMAX,J+)B(J)=.A(J*2+J)/2)=.Reductio
12、ns Reductions collapse Reductions collapse array data to scalar data array data to scalar data via associative via associative operations:operations:Take advantage of Take advantage of associativity and associativity and compute partial compute partial sums or local sums or local maximum in private
13、maximum in private storagestorageNext,combine partial Next,combine partial results into shared results into shared result,taking care to result,taking care to synchronize accesssynchronize accessfor(J=0;JMAX;J+)sum=sum+cJ;Data Ambiguity and the Compilervoid func(int*a,int*b)for(J=0;JMAX;J+)aJ=bJ;Are
14、 the loop Are the loop iterations iterations independent?independent?The C+compiler has The C+compiler has no ideano ideaNo chance for No chance for optimization-In order optimization-In order to run error free the to run error free the compiler assumes compiler assumes that a and b overlapthat a an
15、d b overlapFunction Callsfor(J=0;J=0;J-)compute(J,.)A Simple Test1.1.Reverse the loop order Reverse the loop order and rerun in serialand rerun in serial2.2.If results are If results are unchanged,the loop is unchanged,the loop is Independent*Independent*for(J=0;JMAX;J+)compute(J,.)*Exception:Loops
16、with induction variablesReverseSummaryLoop Independence:Loop Iterations are Loop Independence:Loop Iterations are independent of each other.independent of each other.Explained its importanceExplained its importance ILP and ParallelismILP and ParallelismIdentified common causes of loop dependenceIdentified common causes of loop dependence Flow Dependency,Anti-Dependency,Output Flow Dependency,Anti-Dependency,Output DependencyDependencyTaught some methods of fixing loop Taught some methods of fixing loop dependencedependenceReinforced concepts through labReinforced concepts through lab