Parallel Computer Architecture (并行计算机体系结构)
Lecture 16 (Chp. 7), Wu, Spring 04, USTC

Outline
- Review of Lecture 14: directory-based cache coherence protocols; relaxed memory consistency models
- Parallel file systems
File Systems on Workstation Clusters
- Parallel applications typically process very large data sets, so the I/O system must support cooperative operations among the processes of a parallel application.
- A high-performance file system is therefore needed that simplifies cooperation among processes, uses all resources efficiently, and remains transparent to the user.
- It should exploit the two basic characteristics of a cluster:
  - Abundant resources, such as disks and memory: access multiple disks in parallel to raise transfer bandwidth; use the memory spread across the cluster to build a large file-system buffer cache and improve performance.
  - A high-speed interconnect, which lets the system rely on remote nodes for some tasks. For example, some current systems use the memory of remote nodes to hold cache blocks that do not fit on the local node.

Software RAID
- Software (logical) RAID applies the RAID idea to a cluster: data are distributed over the cluster's many disks.
- Software RAID behaves like RAID 5 and shares RAID's advantages and disadvantages; the difference is that the file system is responsible for distributing the data and maintaining the fault-tolerance level.

Stripe Groups
- Organizing all the disks of the cluster into a single logical RAID has drawbacks:
  - Writes large enough to span all disks are very hard to produce, so most writes are small, and small writes perform poorly in RAID 5; the system therefore cannot exploit the aggregate write bandwidth of all disks.
  - A node's network link has limited bandwidth and cannot read/write all disks at once, so only part of the disks' performance is usable.
  - The probability of failure grows with the number of disks, and a single parity mechanism is not enough: several disks may fail at the same time.
- The solution is to stripe the data over a subset of the disks, called a stripe group:
  - The number of small writes the system must perform drops sharply.
  - The network link bandwidth matches the aggregate bandwidth of a stripe group, so resources are fully used.
  - Several disks may fail, as long as they do not belong to the same stripe group.
  - Cost: reduced storage capacity and effective bandwidth, because every stripe group must devote one disk's worth of space to parity blocks, whereas the original organization needs only one parity disk for the whole system.
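The small-write penalty that motivates stripe groups comes from RAID-5 parity maintenance. A minimal C sketch of the arithmetic, assuming toy block and group sizes and in-memory arrays standing in for disks (all names here are invented for illustration):

    /* Stripe-group parity: parity = XOR of the data blocks. */
    #include <stdio.h>
    #include <string.h>

    #define BLOCK 8            /* bytes per block (toy size)   */
    #define GROUP 4            /* data disks per stripe group  */

    /* Recompute the parity block from a full stripe. */
    void compute_parity(unsigned char data[GROUP][BLOCK],
                        unsigned char parity[BLOCK]) {
        memset(parity, 0, BLOCK);
        for (int d = 0; d < GROUP; d++)
            for (int i = 0; i < BLOCK; i++)
                parity[i] ^= data[d][i];
    }

    /* Small write: update one block and patch parity without
       rereading the stripe (the RAID-5 read-modify-write). */
    void small_write(unsigned char data[GROUP][BLOCK],
                     unsigned char parity[BLOCK],
                     int d, const unsigned char newblk[BLOCK]) {
        for (int i = 0; i < BLOCK; i++) {
            parity[i] ^= data[d][i] ^ newblk[i];  /* old ^ new */
            data[d][i] = newblk[i];
        }
    }

    int main(void) {
        unsigned char data[GROUP][BLOCK] = {{1,2,3},{4,5,6},{7},{8}};
        unsigned char parity[BLOCK], blk[BLOCK] = {9,9,9};
        unsigned char check[BLOCK];
        compute_parity(data, parity);
        small_write(data, parity, 2, blk);
        compute_parity(data, check);        /* verify the patch  */
        printf("parity %s\n", memcmp(parity, check, BLOCK) ? "BAD" : "ok");
        return 0;
    }

Every small write costs a read of the old data and parity plus two writes, which is why limiting the stripe width to a group matters.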
Log-Structured File System
- A log-structured file system improves disk write speed.
- Basic assumption: caches satisfy a high fraction of reads, so disk traffic is dominated by writes. If write handling can be improved by performing all writes sequentially, seek and rotational delays are avoided and disk performance rises dramatically.
- Basic idea: make most writes sequential. The entire file system is implemented as a log. Whenever a block is written to a file, the data block is appended to the end of the log and the previously written copy is invalidated. Every file is thus written sequentially, regardless of the order in which its blocks are written, giving much faster writes.
- The high write performance is bought with lower read performance and added complexity:
  - Because blocks are allocated in write order, a file's blocks end up scattered randomly across the disk.
  - A separate garbage cleaner must scan the file system and remove invalidated blocks.
  - A sophisticated caching/lookup mechanism is needed for efficient lookups, and the block locations of every file must be kept.
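A minimal sketch of the append-and-invalidate idea, assuming a toy in-memory log and a flat (file, block) index; all names and sizes are hypothetical:

    #include <stdio.h>

    #define LOG_CAP 16
    #define FILES 4
    #define BLOCKS 4

    typedef struct { int file, blk, data, live; } Seg;

    Seg log_[LOG_CAP];            /* the log: strictly append-only      */
    int tail = 0;                 /* next free log slot                 */
    int index_[FILES][BLOCKS];    /* (file, blk) -> log slot, -1 = none */

    /* Write (file, blk): append at the tail, invalidate the old copy. */
    void lfs_write(int file, int blk, int data) {
        int old = index_[file][blk];
        if (old >= 0) log_[old].live = 0;   /* dead block: cleaner's job */
        log_[tail] = (Seg){file, blk, data, 1};
        index_[file][blk] = tail++;         /* reads go via the index    */
    }

    int lfs_read(int file, int blk) {
        int s = index_[file][blk];
        return s < 0 ? -1 : log_[s].data;
    }

    int main(void) {
        for (int f = 0; f < FILES; f++)
            for (int b = 0; b < BLOCKS; b++) index_[f][b] = -1;
        lfs_write(0, 0, 42);
        lfs_write(0, 0, 43);                /* old slot goes dead        */
        printf("read %d, log used %d\n", lfs_read(0, 0), tail);  /* 43, 2 */
        return 0;
    }

The dead slot left by the overwrite is exactly what the garbage cleaner must later reclaim.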
Caching
- Exploits the principle of locality.
- Multi-level caching: caching can be applied at several levels (server- or client-side disk controller, operating system, I/O library, user program).
- Cache consistency:
  - Relaxed file-sharing semantics (session semantics), which put more burden on the programmer.
  - Consistency algorithms that implement Unix semantics, e.g., by not caching writes.
  - Tokens: a token must be acquired before writing; token revocation; leases.
  - Granularity: whole file, file block, or user-defined.
- Cooperative caching:
  - If the caches do not cooperate, the total cache space cannot be fully used; a file block a node needs may already be cached at another node, and reading it from that cache improves system performance.
  - The first system to implement cooperative file caching was xFS.
  - Basic idea: every node in the cluster dedicates part of its main memory as file cache; the cooperative caching algorithm uses all these memories to build one large, cluster-wide file cache. When a client misses in its local file cache, it turns to a remote client's memory for the data.
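A minimal sketch of the lookup path under cooperative caching, assuming a hypothetical central map from block to caching node (real systems such as xFS distribute this information; all names are invented):

    #include <stdio.h>

    #define NODES 4
    #define BLOCKS 8

    int local_cache[NODES][BLOCKS];  /* 1 if the node caches the block */
    int cached_at[BLOCKS];           /* some node holding it, or -1    */

    /* Return where the block is found, cheapest source first. */
    const char *lookup(int node, int blk) {
        if (local_cache[node][blk]) return "local cache";
        if (cached_at[blk] >= 0)    return "remote client cache"; /* one net hop */
        return "server disk";                                     /* slowest     */
    }

    int main(void) {
        for (int b = 0; b < BLOCKS; b++) cached_at[b] = -1;
        local_cache[0][1] = 1; cached_at[1] = 0;  /* node 0 caches block 1 */
        /* node 2 misses locally on block 1, hits node 0's cache */
        printf("block 1 from %s\n", lookup(2, 1));
        return 0;
    }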
Data Prefetching
- Prefetching: read a data block into memory before it is actually accessed.
- Parallel prefetching: every node prefetches independently, one-block-ahead or strided.
- Transparent informed prefetching: the user gives the I/O system hints about how it will access its files; with this information the system can prefetch more effectively.
- Aggressive prefetching: prefetch as soon as the disk is free, replacing the in-memory block whose next use lies furthest in the future (see the sketch after the table).

Table 6.6  Prefetch schedule produced by the aggressive prefetching algorithm (as recovered)

    Time          T1  T2  T3  T4  T5  T6  T7  T8  T9  T10 T11 T12
    Served block  F1  A1  B2  C1  D2  E1  F1
    Block 1       F1  F1  F1  D2  D2  D2  D2  D2  D2  F1  F1  F1
    Block 2       B2  B2  B2  B2  B2  B2  B2  E1  E1  E1  E1  E1
    Block 3       A1  A1  A1  C1  C1  C1  C1  C1  C1  C1
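A minimal sketch of the replacement rule used by aggressive prefetching: evict the resident block whose next reference is furthest in the future. This assumes the future reference string is known, as it is for hinted or offline schedules like the one above; the reference string and names below are invented:

    #include <stdio.h>

    /* Position of the next use of 'blk' in refs[pos..n-1];
       "infinite" (n) if it is never used again. */
    int next_use(int blk, const int refs[], int pos, int n) {
        for (int i = pos; i < n; i++)
            if (refs[i] == blk) return i;
        return n;
    }

    /* Victim slot: the resident block used furthest in the future. */
    int pick_victim(const int buf[], int slots,
                    const int refs[], int pos, int n) {
        int victim = 0, far = -1;
        for (int s = 0; s < slots; s++) {
            int d = next_use(buf[s], refs, pos, n);
            if (d > far) { far = d; victim = s; }
        }
        return victim;
    }

    int main(void) {
        int refs[] = {2, 0, 1, 0, 2};   /* future references       */
        int buf[3] = {0, 1, 2};         /* resident blocks         */
        int v = pick_victim(buf, 3, refs, 0, 5);
        printf("evict slot %d (block %d)\n", v, buf[v]); /* block 1 */
        return 0;
    }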
I/O Interfaces
- Traditional I/O interfaces cannot express concepts such as data parallelism and collective operations; a new I/O interface is needed to convey this semantic information.
- Shared file pointers: globally shared file pointers; distributed shared file pointers.
- Strided access patterns: simple strided access operations; nested strided operations.
Berkeley NOW
- Active Messages: an asynchronous communication mechanism that achieves low-overhead communication. The control information at the head of a message carries the address of a user-level subroutine, called the message handler. When the message head arrives at the destination node, the handler is invoked to extract the remaining data from the network and integrate it into the ongoing computation (a minimal sketch follows this subsection).
- GLUnix (Global Layer Unix): a software layer running on top of the workstations' standard Unix that supports availability and a single system image; designed for ease of implementation, portability, efficiency, and robustness.
- xFS: a serverless file system; the file-service functionality is distributed over all nodes of the cluster: software RAID, cooperative file caching, distributed management.

IBM SP2
- Cluster architecture: standard environment, standard programming model, system availability, a selected single system image.
- High Performance Switch (HPS): a multistage network.
- Wide nodes, thin nodes, and thin nodes 2; network interface; system software.
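A minimal sketch of the Active Message idea described above, simulating "the network" with a struct whose header names the handler; all names are hypothetical (the real Berkeley AM API differs):

    #include <stdio.h>

    /* An active message: the header carries a user-level handler
       address; the payload is interpreted by that handler. */
    typedef struct {
        void (*handler)(const int *payload, int n);
        int payload[4];
        int n;
    } ActiveMsg;

    /* A user-level handler: integrates the arriving data into the
       ongoing computation (here, a running sum). */
    static long sum = 0;
    void add_handler(const int *payload, int n) {
        for (int i = 0; i < n; i++) sum += payload[i];
    }

    /* "Delivery": on arrival the node simply calls the handler named
       in the header; no receive-side buffering or rendezvous. */
    void deliver(const ActiveMsg *m) { m->handler(m->payload, m->n); }

    int main(void) {
        ActiveMsg m = { add_handler, {1, 2, 3, 4}, 4 };
        deliver(&m);                    /* simulate message arrival */
        printf("sum = %ld\n", sum);     /* 10 */
        return 0;
    }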
Distributed Shared Memory (DSM) Systems
- The shared memory is distributed among the nodes, which are connected by a highly scalable interconnection network.
- A shared-memory model is implemented logically on top of physically distributed memory; the remote communication mechanism is hidden from the programmer, preserving convenience and portability.
- The distributed memory underlying a DSM system is scalable and cost-effective: distributed memories and a scalable interconnect increase memory bandwidth, but lead to a non-uniform memory access structure.

Architectures of Shared-Memory Systems
- No caches: Cray X-MP and Y-MP/C90 vector machines, mainframes, early distributed shared-memory machines.
- Shared bus: SMP/UMA, small commercial servers.
- CC-NUMA; COMA; NCC-NUMA; shared virtual memory (SVM).

CC-NUMA
- Cache-coherent non-uniform memory access systems: the shared memory is distributed among the nodes, which are connected by a scalable interconnect, and every processor may cache shared locations.
- Cache coherence is usually maintained with a directory-based protocol. Maintaining coherence is the key issue in these systems and determines their scalability.
- Examples: Stanford's DASH and FLASH, MIT's Alewife, and SGI's Origin 2000.
COMA
- Cache-only memory architecture: shared-memory addresses are active, i.e., storage locations are decoupled from physical addresses, and data migrate and replicate among the nodes' memories according to the access pattern.
- Each node's memory acts as a large cache, and data consistency is maintained at this level.
- Advantage: a high hit rate in the local shared memory. Drawback: when a processor's access misses in its own node, a mechanism is needed to find the current location of the accessed datum (because addresses are active), so the latency is high.
- Systems using the cache-only architecture include Kendall Square Research's KSR1 and the Swedish Institute of Computer Science's DDM. The COMA idea is also commonly used in shared virtual memory, SVM (Shared Virtual Memory), systems.

Shared Virtual Memory (SVM)
- SVM (Shared Virtual Memory) systems, also called software DSM systems, run on message-passing MPPs or clusters and use software to organize the independently addressed memories of the nodes into one uniformly addressed shared space.
- Advantage: a shared-memory programming interface on top of a message-passing system. The main problem is achieving satisfactory performance:
  - Compared with hardware shared memory, the large communication and sharing granularity of SVM (usually a memory page) causes false sharing and extra communication.
  - In cluster-based SVM systems the communication overhead is high, and SVM parallel programs usually communicate more than equivalent message-passing programs.
- SVM implementations:
  - In the operating system: Ivy, Mermaid, Mirage, Clouds.
  - In the runtime system: CMU Midway, Rice Munin, Rice TreadMarks, Utah Quarks, DIKU CarlOS, Maryland CVM, and JIAJIA.
  - At the language level: MIT CRL, Linda, and Orca.
- Hybrid distributed shared-memory systems combine the advantages of software- and hardware-implemented DSM.

Overview
- On the thesis defense and the exam.
- Review of Lecture 14: directory-based cache coherence protocols; relaxed memory consistency models.

Approaches to the Cache Coherence Problem
- No hardware support for cache coherence (NCC-NUMA):
  - To avoid the coherence problem, shared data are marked uncacheable; only private data may be cached.
  - Benefit: very little hardware support suffices.
  - Drawbacks:
    - Compiler mechanisms for transparent software cache coherence are very limited, so compiler-based software coherence is not very practical.
    - Without cache coherence, the system loses the advantage of fetching and using the several words of a cache line at the same cost as a remote access to a single word; when each remote memory access returns only one word, the spatial-locality benefit of shared memory is gone.
    - Latency-tolerance techniques such as prefetching are more effective when several words (e.g., a whole cache line) can be handled at once.

Context for Scalable Cache Coherence
- Programming models are realized through network transactions and protocols: an efficient node-to-network interface that interprets transactions.
- Caches naturally replicate data; on a bus, coherence and consistency come from snooping protocols.
- Scalable networks carry many simultaneous transactions, over scalable, distributed memory.
- We need cache coherence protocols that scale: no broadcast and no single point of ordering.

Solution: Directory Protocols
- Keep explicit state: a state vector associated with each memory block records the block's state.
- On a miss, communicate with the directory to:
  - determine the locations of the cached copies,
  - determine what action to take,
  - carry out the protocol so as to maintain coherence.
A Cache Coherence System Must
- Provide a set of states, a state transition diagram, and actions.
- Manage the coherence protocol:
  (0) Determine when to invoke the protocol.
  (a) Find the information about the block's state in other caches to decide what action to take: is communication with the other cached copies required?
  (b) Locate those other copies.
  (c) Communicate with the copies (invalidate/update).
- (0) is done the same way in all systems: the state of the block is kept in the cache, and the protocol is invoked on a miss.
- The different approaches are distinguished by how they handle (a) through (c).

Bus-Based Coherence
- (a), (b), and (c) are all done by broadcast on the bus: the faulting processor sends out a "search", and the others respond and take the necessary action.
- The same could be done on networks of any scale: broadcast to all processors and let them respond.
- Conceptually simple, but broadcast doesn't scale with p: on a bus, bus bandwidth doesn't scale; on a scalable network, every fault leads to at least p network transactions.
- Scalable coherence can keep the same cache states and state transition diagram, with different mechanisms to manage the protocol.
19、ol,One Approach:Hierarchical Snooping,Extend snooping approach:hierarchy of broadcast media,tree of buses or rings(KSR-1),processors are in the bus-or ring-based multiprocessors at the leaves,parents and children connected by two-way snoopy interfaces,snoop both buses and propagate relevant transact
20、ions,main memory may be centralized at root or distributed among leaves,Issues(a)-(c)handled similarly to bus,but not full broadcast,faulting processor sends out“search”bus transaction on its bus,propagates up and down hiearchy based on snoop results,Problems:,high latency:multiple levels,and snoop/
21、lookup at every level,bandwidth bottleneck at root,Not popular today,Scalable Approach:Directories,Every memory block has associated directory information,keeps track of copies of cached blocks and their states,on a miss,find directory entry,look it up,and communicate only with the nodes that have c
Scalable Approach: Directories
- Every memory block has associated directory information that keeps track of the copies of cached blocks and their states.
- On a miss, find the directory entry, look it up, and communicate only with the nodes that have copies, if necessary.
- In scalable networks, communication with the directory and the copies is through network transactions.
- There are many alternatives for organizing the directory information.

Basic Directory Transactions; Basic Operation of the Directory
- With k processors: each memory block has k presence bits and 1 dirty bit; each cache block in a cache has 1 valid bit and 1 dirty (owner) bit.
- Read from main memory by processor i:
  - If the dirty bit is OFF: read from main memory; turn presence bit p[i] ON.
  - If the dirty bit is ON: recall the line from the dirty processor (its cache state goes to shared); update memory; turn the dirty bit OFF; turn p[i] ON; supply the recalled data to i.
- Write to main memory by processor i:
  - If the dirty bit is OFF: supply the data to i; send invalidations to all caches that have the block; clear their presence bits in the directory entry; turn the dirty bit ON; turn p[i] ON.
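A minimal executable sketch of this presence-bit/dirty-bit bookkeeping, assuming a single directory entry and ignoring the network. The dirty-ON write case, elided on the slide, is noted with the usual recall-and-invalidate path; all names are invented:

    #include <stdio.h>

    #define K 4                          /* processors */

    typedef struct { unsigned presence; int dirty; } DirEntry;

    /* Read by processor i. */
    void dir_read(DirEntry *e, int i) {
        if (e->dirty) {
            /* Recall the line from the single owner; its cache state
               becomes shared and memory is updated. */
            e->dirty = 0;
        }
        e->presence |= 1u << i;          /* i now has a shared copy */
    }

    /* Write by processor i. */
    void dir_write(DirEntry *e, int i) {
        if (e->dirty) {
            /* Dirty case (elided on the slide): recall the line from
               the single owner named by the presence bit, invalidate
               it, and hand ownership to i. */
        } else {
            for (int j = 0; j < K; j++)  /* invalidate all sharers */
                if (((e->presence >> j) & 1) && j != i)
                    printf("invalidate cache %d\n", j);
        }
        e->presence = 1u << i;           /* only the writer remains */
        e->dirty = 1;
    }

    int main(void) {
        DirEntry e = {0, 0};
        dir_read(&e, 0); dir_read(&e, 2);
        printf("presence=%x dirty=%d\n", e.presence, e.dirty);  /* 5, 0 */
        dir_write(&e, 1);                /* invalidates caches 0 and 2 */
        printf("presence=%x dirty=%d\n", e.presence, e.dirty);  /* 2, 1 */
        return 0;
    }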
A Popular Middle Ground
- A two-level "hierarchy": the individual nodes are multiprocessors, connected non-hierarchically, e.g., a mesh of SMPs.
- Coherence across nodes is directory-based; the directory keeps track of nodes, not individual processors.
- Coherence within a node is snooping or directory-based: orthogonal to the inter-node scheme, but the two need a good functional interface.
- Examples: Convex Exemplar (directory-directory); Sequent, Data General, HAL (directory-snoopy). SMP on a chip?

Example Two-Level Hierarchies (figure)

Advantages of Multiprocessor Nodes
- Potential for cost and performance advantages:
  - Amortization of per-node fixed costs over multiple processors (applies even if the processors are simply packaged together and not coherent); commodity SMPs can be used.
  - Fewer nodes for the directory to keep track of.
  - Much communication may be contained within a node (cheaper); nodes prefetch data for each other (fewer "remote" misses); requests can be combined (like a hierarchy, but only two levels); caches can even be shared (overlapping working sets).
- The benefits depend on the sharing pattern (and the mapping): good for widely read-shared data (e.g., the tree data in Barnes-Hut) and for nearest-neighbor sharing, if properly mapped; not so good for all-to-all communication.

Sharing Patterns Summary
- Generally there are few sharers at a write, and their number scales slowly with P.
- This implies directories are very useful for containing traffic: if organized properly, traffic and latency shouldn't scale too badly.
- It also suggests techniques to reduce the storage overhead.
Organizing Directories
- Centralized vs. distributed; for distributed directories, hierarchical vs. flat; for flat, memory-based vs. cache-based.
- Two questions for any directory scheme: how to find the source of the directory information, and how to locate the copies.

How to Find the Directory Information
- Centralized memory and directory: easy, just go to it; but not scalable.
- Distributed memory and directory:
  - Flat schemes: the directory is distributed with the memory, at the home; the location is derived from the address (hashing), so a network transaction is sent directly to the home.
  - Hierarchical schemes: a hierarchy of caches that guarantees the inclusion property; each parent keeps track of exactly which of its immediate children have a copy of the block. This costs latency and network transactions; not so popular.
How Hierarchical Directories Work
- The directory is a hierarchical data structure: the leaves are processing nodes, and the internal nodes hold only directory information.
- The hierarchy is logical, not necessarily physical (it can be embedded in a general network).
- A node's directory entry for a block records whether each of its subtrees caches the block. To find directory information, send a "search" message up to the parent; it routes itself through directory lookups.
- Like hierarchical snooping, but with point-to-point messages between children and parents.

How Is the Location of Copies Stored?
- Hierarchical schemes: through the hierarchy; each directory entry has presence bits for its child subtrees plus a dirty bit.
- Flat schemes vary a lot, with different storage overheads and performance characteristics:
  - Memory-based schemes: the information about all copies is stored at the home, with the memory block (DASH, Alewife, SGI Origin, FLASH).
  - Cache-based schemes: the information about copies is distributed among the copies themselves, each copy pointing to the next (Scalable Coherent Interface, SCI, an IEEE standard).
Flat, Memory-Based Schemes
- The information about copies is co-located with the block at the home: just like the centralized scheme, except distributed.
- Performance scaling: traffic on a write is proportional to the number of sharers; for write latency, the invalidations to the sharers can be issued in parallel.
- Storage overhead of the simplest representation, a full bit vector (one presence bit per node), doesn't scale well with P. With a 64-byte line:
  - 64 nodes: 12.7% overhead; 256 nodes: 50%; 1024 nodes: 200%.
- For M memory blocks in memory, the directory storage is proportional to P*M (the bit-vector directory is P bits wide and M entries tall).
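The quoted percentages follow from counting the directory bits against a 512-bit (64-byte) line; they match exactly if the dirty bit is counted along with the P presence bits, which is an assumption, since the slide does not say:

    #include <stdio.h>

    int main(void) {
        int line_bits = 64 * 8;                 /* 64-byte line */
        int nodes[] = {64, 256, 1024};
        for (int i = 0; i < 3; i++) {
            double ovhd = 100.0 * (nodes[i] + 1) / line_bits;
            printf("%4d nodes: %.1f%% overhead\n", nodes[i], ovhd);
        }
        return 0;                               /* 12.7, 50.2, 200.2 */
    }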
37、chemes,increase cache block size(reduces storage overhead proportionally),use multiprocessor nodes(bit per mp node,not per processor),still scales as P*M,but reasonable for all but very large machines,256-procs,4 per cluster,128B line:6.25%ovhd.,Reducing“width”,addressing the P term?,Reducing“height
38、addressing the M term?,Storage Reductions,Width observation:,most blocks cached by only few nodes,dont have a bit per node,but entry contains a few pointers to sharing nodes,P=1024=10 bit ptrs,can use 100 pointers and still save space,sharing patterns indicate a few pointers should suffice(five or
39、 so),need an overflow strategy when there are more sharers,Height observation:,number of memory blocks number of cache blocks,most directory entries are useless at any given time,organize directory as a cache,rather than having one entry per memory block,Flat,Cache-based Schemes,How they work:,home
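A minimal sketch of a limited-pointer entry with a broadcast-on-overflow fallback, which is just one of several possible overflow strategies; all names and sizes here are invented:

    #include <stdio.h>

    #define PTRS 5                    /* "five or so" pointers        */

    typedef struct {
        int sharer[PTRS];             /* node ids of the sharers      */
        int count;                    /* valid pointers               */
        int overflow;                 /* too many sharers: fall back  */
    } DirEntryLP;                     /* to broadcast invalidation    */

    void add_sharer(DirEntryLP *e, int node) {
        for (int i = 0; i < e->count; i++)
            if (e->sharer[i] == node) return;   /* already recorded   */
        if (e->count < PTRS) e->sharer[e->count++] = node;
        else e->overflow = 1;         /* pointer set exhausted        */
    }

    void invalidate_all(const DirEntryLP *e, int nnodes) {
        if (e->overflow)
            printf("broadcast invalidate to all %d nodes\n", nnodes);
        else
            for (int i = 0; i < e->count; i++)
                printf("invalidate node %d\n", e->sharer[i]);
    }

    int main(void) {
        DirEntryLP e = {{0}, 0, 0};
        for (int n = 0; n < 7; n++) add_sharer(&e, n);  /* 7 > PTRS  */
        invalidate_all(&e, 64);       /* overflow, so broadcast      */
        return 0;
    }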
Flat, Cache-Based Schemes
- How they work: the home only holds a pointer to the rest of the directory information; a distributed linked list of copies weaves through the caches, and each cache tag has a pointer to the next cache with a copy.
- On a read, add yourself to the head of the list (communication is needed); on a write, propagate a chain of invalidations down the list.
- The Scalable Coherent Interface (SCI, an IEEE standard) uses a doubly linked list (a singly linked sketch follows).

Scaling Properties (Cache-Based)
- Traffic on a write: proportional to the number of sharers.
- Latency on a write: also proportional to the number of sharers! The identity of the next sharer is not known until the current one is reached, and assist processing is needed at each node along the way (even reads involve more than one other assist: the home and the first sharer on the list).
- Storage overhead: scales quite well along both axes; there is only one head pointer per memory block, and the rest is proportional to cache size.
- Very complex!
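A minimal sketch of the distributed sharing list, singly linked for brevity; SCI itself is doubly linked so a cache can also unlink itself on a replacement. Everything here is hypothetical:

    #include <stdio.h>

    #define NONE (-1)

    int head = NONE;                  /* home: pointer to first sharer */
    int next_ptr[8];                  /* per-cache tag: next sharer    */

    /* Read miss by cache c: prepend c to the sharing list. */
    void list_read(int c) {
        next_ptr[c] = head;           /* needs a message to the home   */
        head = c;
    }

    /* Write by cache c: walk the list, invalidating every other copy;
       latency grows with the number of sharers. */
    void list_write(int c) {
        for (int s = head; s != NONE; s = next_ptr[s])
            if (s != c) printf("invalidate cache %d\n", s);
        head = c;                     /* writer is now the only copy   */
        next_ptr[c] = NONE;
    }

    int main(void) {
        list_read(3); list_read(1); list_read(5);  /* list: 5 -> 1 -> 3 */
        list_write(5);                /* invalidates 1, then 3, in turn */
        return 0;
    }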
Summary of Directory Organizations
- Flat schemes:
  - Issue (a), finding the source of the directory data: go to the home, based on the address.
  - Issue (b), finding out where the copies are: memory-based: all the information is in the directory at the home; cache-based: the home has a pointer to the first element of a distributed linked list.
  - Issue (c), communicating with those copies: memory-based: point-to-point messages (perhaps coarser on overflow), which can be multicast or overlapped; cache-based: part of the point-to-point linked-list traversal used to find them, and serialized.
- Hierarchical schemes: all three issues are handled by sending messages up and down the tree; there is no single explicit list of sharers, and the only direct communication is between parents and children.

Summary of Directory Approaches
- Directories offer scalable coherence on general networks: no broadcast medium is needed.
- There are many possibilities for organizing the directory and managing the protocols.
- Hierarchical directories are not used much: high latency, many network transactions, and a bandwidth bottleneck at the root.
46、e,for memory-based,full bit vector suffices for moderate scale,measured in nodes visible to directory protocol,not processors,放松的存储一致性模型,顺序一致性模型非常严格,程序序的要求,存储执行的原子性,写操作的全局执行后才能执行下一个存储操作,读操作的执行完成必须等待产生值的写操作完成,并非所有的存储操作的重叠进行都会引起错误,因此可以对某些执行次序要求放松。,处理器一致性模型:,w,r,弱一致性模型,:,同步操作与普通操作,释放一致性模型,急切释放一致性模型,懒惰释
- Processor consistency: relaxes the w -> r ordering.
- Weak consistency: distinguishes synchronization operations from ordinary operations.
- Release consistency; eager release consistency; lazy release consistency.
- Scope consistency; entry consistency.

Release Consistency (RC)
- A contract is established between the hardware and the programmer: the programmer takes on part of the responsibility for maintaining data consistency, which lets the hardware relax its restrictions on the order of memory events.
- Operations are divided into synchronization operations and ordinary memory operations; synchronization operations are further divided into acquires and releases. An acquire obtains exclusive access rights to certain shared locations; a release gives those rights up.
- RC places the following restrictions on the order of memory events:
  - Before any ordinary memory operation is allowed to perform, all acquires that precede it on the same processor must have completed.
  - Before any release is allowed to perform, all ordinary memory operations that precede it on the same processor must have completed.
  - The synchronization operations themselves satisfy sequential consistency.

Eager Release Consistency
- In SVM systems the cost of communication and data exchange is high, so the number of communications and exchanges must be reduced.
- Under eager release consistency, the stores performed inside a critical section are not propagated immediately; they are propagated together just before the release (i.e., on leaving the critical section). Merging many stores into one batched propagation reduces the number of data communications, which is essential for software-implemented shared-memory systems.
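A minimal sketch of how a program conveys this structure, assuming lock and unlock map onto RC's acquire and release, as in typical SVM systems such as TreadMarks; the variable names are invented:

    #include <pthread.h>
    #include <stdio.h>

    int shared_x, shared_y;             /* ordinary shared data         */
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void *producer(void *arg) {
        (void)arg;
        pthread_mutex_lock(&lock);      /* acquire: gain access rights  */
        shared_x = 1;                   /* ordinary writes: RC lets     */
        shared_y = 2;                   /* these be buffered/reordered  */
        pthread_mutex_unlock(&lock);    /* release: all prior writes    */
        return NULL;                    /* must have completed first    */
    }

    void *consumer(void *arg) {
        (void)arg;
        pthread_mutex_lock(&lock);      /* acquire: obtains the         */
        printf("%d %d\n", shared_x, shared_y);  /* producer's updates   */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);          /* producer first, so the demo  */
        pthread_create(&c, NULL, consumer, NULL);  /* is deterministic  */
        pthread_join(c, NULL);
        return 0;
    }

Under eager RC the two updates would be pushed out at the unlock; under lazy RC they would instead be pulled in at the consumer's lock.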
Lazy Release Consistency
- In this model, a processor's stores to a location are not proactively propagated to all the other processors sharing that location; instead, when another processor needs the data this processor wrote (i.e., when the other processor performs an acquire), it requests the latest copy of the location from the writer. This reduces the communication volume further.

Cache Coherence Protocols and Memory Consistency Models
- The cache coherence problem is to keep the multiple copies of data in the caches and in main memory consistent.
- Every cache coherence protocol is designed to implement some memory consistency model: the consistency model states what kind of "consistency" the coherence protocol must provide.
- For example, under release consistency (RC) a new value written to a shared variable by one processor becomes visible to the other processors only after that processor releases the lock, whereas under sequential consistency (SC) a written value propagates to all processors immediately. Since SC and RC embody different notions of "consistency", a coherence protocol implementing SC differs from one implementing RC.
- A cache coherence protocol usually has to settle the following questions:
  - How to propagate a new value: write-invalidate or write-update?
  - Who produces the new value: a single-writer or a multiple-writer protocol?
  - When to propagate the new value: immediately or with delay?
  - To whom to propagate it: a snooping or a directory protocol?

Stanford DASH Multiprocessor
- DASH (Directory Architecture for Shared Memory) is a CC-NUMA multiprocessor system built at Stanford University under the leadership of John Hennessy, using distributed coherent caches and distributed memory.