Chapter 1
Fundamentals of Computer Design

Outline
- Why Such Change in 20 years?
- The End of the Uniprocessor Era
- Sea Change in Chip Design
- New Project in Berkeley
- New Trends in Computer Design
- What Computer Architecture brings to Table?
  1) Taking Advantage of Parallelism
  2) The Principle of Locality
  3) Focus on the Common Case
  4) Amdahl's Law
  5) Processor performance equation

Why Such Change in 20 years?
- Performance
  - Technology advances: CMOS VLSI dominates older technologies (TTL, ECL) in cost AND performance
  - Computer architecture advances improve the performance of low-end systems: RISC, superscalar, RAID
- Price: lower costs due to
  - Simpler development. CMOS VLSI: smaller systems, fewer components (higher integration, more capability)
  - Higher volumes. CMOS VLSI: same device cost 10,000 vs. 10,000,000 units
- Function
  - Rise of networking/local interconnection technology

Technology Trends: Microprocessor Capacity (transistor count)
- CMOS improvements:
  - Die size: 2X every 3 yrs
  - Line width: halve / 7 yrs
- Transistor counts:
  - Alpha 21264: 15 million
  - Alpha 21164: 9.3 million
  - PowerPC 620: 6.9 million
  - Pentium Pro: 5.5 million
  - Sparc Ultra: 5.2 million
- 1971: the first processor, the 4004 (a 4-bit microprocessor), had only 2,300 transistors
- The P processor contains more than 20 million transistors

Memory Capacity (Single Chip DRAM)
Year | Size (Mb) | Cycle time
1980 | 0.0625 | 250 ns
1983 | 0.25 | 220 ns
1986 | 1 | 190 ns
1989 | 4 | 165 ns
1992 | 16 | 145 ns
1996 | 64 | 120 ns
2000 | 256 | 100 ns

Technology Trends (Summary)
      | Capacity | Speed (latency)
Logic | 2x in 3 years | 2x in 3 years
DRAM | 4x in 3-4 years | 2x in 10 years
Disk | 4x in 2-3 years | 2x in 10 years

Processor Performance Trends
[Figure: processor performance by year]

Gates per clock
A typical pipeline has a fixed amount of work that is required to decode and execute an instruction. This work is performed by individual logical operations called gates. Gates per clock is how many gates in a pipeline may change state in a single clock cycle.
If we increase clock speed faster than gate speed improves, we can just reduce the gates per clock and add more pipeline
stages.
Gates per clock can be reduced by inserting latches into the data path: when the number of gates between latches is reduced, a higher clock rate is possible.

Processor Performance (1.35X before, 1.55X in the 1990s)
- 1.55X/yr
- Before the mid-1980s, growth was technology-driven: circuit technology.
- Since then, gains have come from advanced architectural ideas: pipelining, out-of-order execution, superscalar, multi-level cache

Processor Performance Trends (Summary)
- Workstation performance (measured in SPEC Marks) improves roughly 50% per year (2X every 18 months)
- Improvement in cost performance estimated at 70% per year
- Supplement: SPEC benchmark programs
- The serious challenge limiting microprocessor design and implementation is not manufacturing capability but power density

Crossroads: Uniprocessor Performance
- VAX: 25%/year, 1978 to 1986
- RISC + x86: 52%/year, 1986 to 2002
- RISC + x86: 20%/year, 2002 to present
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, October 2006

Recent Intel Processors
"We are dedicating all of our future product development to multicore designs. We believe this is a key inflection point for the industry." (Intel President Paul Otellini, IDF 2005)

Processor | Year | Fabrication (nm) | Clock (GHz) | Power (W)
Pentium 4 | 2000 | 180 | 1.80-4.00 | 35-115
Pentium M | 2003 | 90/130 | 1.00-2.26 | 5-27
Core 2 Duo | 2006 | 65 | 2.60-2.90 | 10-65
Core 2 Quad | 2006 | 65 | 2.60-2.90 | 45-105
Core i7 (Quad) | 2008 | 45 | 2.93-3.60 | 95-130
Core i5 (Quad) | 2009 | 45 | 3.20-3.60 | 73-95
Pentium Dual-Core | 2010 | 45 | 2.80-3.33 | 65-130
Core i3 (Duo) | 2010 | 32 | 2.93-3.33 | 18-73
2nd Gen i3 (Duo) | 2011 | 32 | 2.50-3.40 | 35-65
2nd Gen i5 (Quad) | 2011 | 32 | 3.10-3.80 | 45-95
2nd Gen i7 (Quad/Hexa) | 2011 | 32 | 3.80-3.90 | 65-130
3rd Gen i3 (Duo) | 2012 | 22/32 | 2.80-3.40 | 35-55
3rd Gen i5 (Quad) | 2012 | 22/32 | 3.20-3.80 | 35-77
3rd Gen i7 (Quad/Hexa) | 2012 | 22/32 | 3.70-3.90 | 45-77
Xeon E5 (8 cores) | 2013 | 22 | 1.80-2.90 | 60-130
Xeon Phi (60 cores) | 2013 | 22 | 1.10 | 300

The End of the Uniprocessor Era
Single biggest change in the history of computing systems
- Old Conventional Wisdom (CW): Power is free, transistors expensive
- New CW: "Power wall": power expensive, transistors free (can put more on chip than can afford to turn on)
- Old
CW: Sufficiently increasing Instruction-Level Parallelism (ILP) via compilers and innovation (pipelining, superscalar, out-of-order, speculation, VLIW, ...)
- New CW: "ILP wall": law of diminishing returns on more HW for ILP
- Old CW: Multiplies are slow, memory access is fast
- New CW: "Memory wall": memory slow, multiplies fast (200 clock cycles to DRAM memory, 4 clocks for multiply)
- Old CW: Uniprocessor performance 2X / 1.5 yrs
- New CW: Power Wall + ILP Wall + Memory Wall = Brick Wall
  - Uniprocessor performance now 2X / 5(?) yrs
  - Sea change in chip design: multiple "cores" (2X processors per chip / 2 years)
  - More, simpler processors are more power efficient

(Conventional Wisdom in Computer Architecture)

DLP important for conventional CPUs too
- Prediction for x86 processors, from Hennessy & Patterson, 5th edition (Note: educated guess, not Intel product plans!)
  - TLP: 2+ cores / 2 years
  - DLP: 2x width / 4 years
- DLP will account for more mainstream parallelism growth than TLP in the next decade.
- No longer get faster, just wider: future computer hardware will not get faster, but it will get "wider"
- A new Moore's Law?

Sea Change in Chip Design
- Intel 4004 (1971): 4-bit processor, 2312 transistors, 0.4 MHz, 10 micron PMOS, 11 mm² chip
- RISC II (1983): 32-bit, 5 stage pipeline, 40,760 transistors, 3 MHz, 3 micron NMOS, 60 mm² chip
- 125 mm² chip, 0.065 micron CMOS = 2312 RISC II + FPU + Icache + Dcache
  - RISC II shrinks to 0.02 mm² at 65 nm
- Processor is the new transistor?

Problems with Sea Change
- Algorithms, Programming Languages, Compilers, Operating Systems, Architectures, Libraries, ... are not ready to supply Thread-Level Parallelism or Data-Level Parallelism for 1000 CPUs/chip
- Architectures are not ready for 1000 CPUs/chip
  - Unlike Instruction-Level Parallelism, this cannot be solved by computer architects and compiler writers alone, but it also cannot be solved without the participation of architects
- Need a reworking of all the abstraction layers in the computing system stack

Characteristics of Ideal Academic CS Research Supercomputer?
- Scale: hard problems at 1000 CPUs
- Cheap: 2006 funding of academic research
- Cheap to
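operate, small, low power: $ again
- Community: share SW, training, ideas
- Simplifies debugging, high SW churn rate
- Reconfigurable: test many parameters, imitate many ISAs, many organizations
- Credible: results translate to real computers
- Performance: run real OS and full apps, results overnight

New Project in Berkeley

FPGAs as New Research Platform
- As 25 CPUs can fit in a Field Programmable Gate Array (FPGA), a 1000-CPU system from 40 FPGAs?
  - 16 32-bit simple "soft core" RISC at 150 MHz in 2004 (Virtex-II)
  - FPGA generations every 1.5 yrs; 2X CPUs, 1.2X clock rate
- HW research community does logic design ("gate shareware") to create an out-of-the-box Massively Parallel Processor that runs standard binaries of OS and apps
  - Gateware: processors, caches, coherency, Ethernet interfaces, switches, routers, ... (some free from open source hardware) (www.openhw.org/)
  - E.g., 1000-processor, standard-ISA (IBM POWER) binary-compatible, 64-bit, cache-coherent supercomputer at 200 MHz/CPU in 2007

Advantages of FPGAs
- Over the past 20 years, CPU speed followed Moore's Law and parallelism in computers received little attention. Since 2005, power consumption and heat dissipation have brought single-CPU speed growth nearly to a halt, multicore chips have started to ship, and parallel computer system design has become a hot research topic. Because FPGAs keep improving in speed, price, and integration density in line with Moore's Law, parallel computer systems built from FPGAs will soon become practical.
- As semiconductor processes move into the deep submicron range, products grow more complex and technology costs keep rising. At the same time, the market window for a product keeps shrinking, from 3-5 years down to 1 year, so "programmability" has become a necessary feature for shortening time to market.
- The advantages of FPGAs are increasingly clear: the price per FPGA logic cell falls 25% per year.
- The FPGA market is growing twice as fast as ASSPs (application-specific standard products) and 3 times as fast as ASICs (application-specific integrated circuits).

Views from Several Sides
- Richard Sevcik, Executive Vice President, Xilinx: Will the development of multi-platform FPGAs end the ASIC era?
- Dave Bursky, Digital IC/DSP Editor, Electronic Design: Advances in FPGA technology have paved the way to true SoC solutions.
- Justin R. Rattner, Intel CTO: One area I am interested in is reconfigurable hardware. A more general design is to integrate an FPGA into the processor; we will add such a research project sometime this year and run large-scale experiments.

RAMP
- Since the goal is to ramp up research in multiprocessing, the project is called Research Accelerator for Multiple Processors
- To learn more, read "RAMP: Research Accelerator for Multiple Processors - A Community Vision for a Shared Experimental Parallel HW/SW Platform," Technical Report UCB/CSD-05-1412, Sept 2005

Source: IEEE Micro, V27, I2 (March 2007), Pages 46-57
Authors:
- John Wawrzynek, University of California, Berkeley
- David Patterson, University of California, Berkeley
- Mark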
Oskin, University of Washington
- Shih-Lien Lu, Intel
- Christoforos Kozyrakis, Stanford University
- James C. Hoe, Carnegie Mellon University
- Derek Chiou, University of Texas at Austin
- Krste Asanovic, Massachusetts Institute of Technology
Web page: ramp.eecs.berkeley.edu

RAMP: Research Accelerator for Multiple Processors
ABSTRACT
The RAMP project's goal is to enable the intensive, multidisciplinary innovation that the computing industry will need to tackle the problems of parallel processing. RAMP itself is an open-source, community-developed, FPGA-based emulator of parallel architectures. Its design framework lets a large, collaborative community develop and contribute reusable, composable design modules. Three complete designs (for transactional memory, distributed systems, and distributed-shared memory) demonstrate the platform's potential.

RAMP: the stone soup of architecture research platforms
- I/O: Patterson
- Monitoring: Kozyrakis
- Net Switch: Oskin
- Coherence: Hoe
- Cache: Asanovic
- PPC: Arvind
- x86: Lu
- Glue-support: Chiou
- Hardware: Wawrzynek

RAMP uses (internal)
- Internet-in-a-Box: Patterson
- TCC: Kozyrakis
- Dataflow: Oskin
- Reliable MP: Hoe
- 1M-way MT: Asanovic
- BlueSpec: Arvind
- x86: Lu
- Net-uP: Chiou
- BEE: Wawrzynek

Why RAMP Good for Research?
                                 | SMP | Cluster | Simulate | RAMP
Cost (1000 CPUs) | F ($40M) | C ($2M) | A+ ($0M) | A ($0.1M)
Cost of ownership | A | D | A | A
Scalability | C | A | A | A
Power/Space (kilowatts, racks) | D (120 kW, 12 racks) | D (120 kW, 12 racks) | A+ (0.1 kW, 0.1 racks) | A (1.5 kW, 0.3 racks)
Community | D | A | A | A
Observability | D | C | A+ | A+
Reproducibility | B | D | A+ | A+
Flexibility | D | C | A+ | A+
Credibility | A+ | A+ | F | A
Performance (clock) | A (2 GHz) | A (3 GHz) | F (0 GHz) | C (0.2 GHz)
GPA (grade point average) | C | B- | B | A-

RAMP 1 Hardware
- Completed Dec. 2004 (14x17 inch 22-layer PCB)
- Module: FPGAs, memory, 10GigE conn., Compact Flash
- Administration/maintenance ports: 10/100 Enet, HDMI/DVI, USB
- 4K/module w/o FPGAs or DRAM
- Called "BEE2" for Berkeley Emulation Engine 2

Multiple Module RAMP 1 Systems
- 8 compute modules (plus power supplies) in 8U rack mount chassis
- 500-1000 emulated
processors
- Many topologies possible
- 2U single module tray for developers
- Disk storage: disk emulator + Network Attached Storage
- Multi-gigabit transceivers (MGT)

RAMP's ISA
- Got it: Power 405 (32b), SPARC v8 (32b), Xilinx Microblaze (32b)
- Very likely: SPARC v9 (64b)
- Likely: IBM Power 64b
- Probably (haven't asked): MIPS32, MIPS64
- Not likely: x86
- Even less likely: x86-64
- We'll sue you: ARM

Vision: Multiprocessing Watering Hole
- RAMP attracts many communities to a shared artifact
  - Cross-disciplinary interactions
  - Accelerate innovation in multiprocessing
- RAMP as next Standard Research Platform? (e.g., VAX/BSD Unix in 1980s, x86/Linux in 1990s)
- RAMP: parallel file system, thread scheduling, multiprocessor switch design, fault insertion to check dependability, data center in a box, Internet in a box, dataflow language/computer, security enhancements, router design, compile to FPGA, parallel languages

Papers and Technical Reports of RAMP Project
ramp.eecs.berkeley.edu/index.php?publications

Berkeley's New Focus
- A View of the Parallel Computing Landscape (Communications of the ACM, Oct. 2009, vol. 52, no. 10)
  - The paper reviews the issues and, as an example, describes an integrated approach we're developing at the Parallel Computing Laboratory to tackle the parallel challenge.
  - A key research objective is to enable programmers to easily write programs that run as efficiently on many-core systems as on sequential ones.
  - 12 computational patterns in 7 general application areas and 5 Par Lab applications
- Two long technical reports
  - Asanovic, K. et al., The Parallel Computing Laboratory at U.C. Berkeley: A Research Agenda Based on the Berkeley View. UCB/EECS-2008-23, University of California, Berkeley, Mar. 21, 2008.
  - Asanovic, K. et al., The Landscape of Parallel Computing Research: A View from Berkeley. UCB/EECS-2006-183, University of California, Berkeley, Dec. 18, 2006.
- Parallel Computing Research at Illinois: The UPCRC Agenda

New Trends in Computer Design

Top
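500 is an organization that publishes statistics on high-performance computers, aimed mainly at manufacturers, users, and potential users of such systems. Since 1993, Top500 has benchmarked high-performance computers with the Linpack program and published a list of the 500 best-performing systems on the Top500 website.
Linpack: a computer program that solves a system of linear equations (100 equations in the original benchmark), used to benchmark high-performance computers.
www.top500.org/

Top 10 Supercomputers - 06/2012
1. Sequoia - BlueGene/Q, Power BQC 16C 1.60 GHz, Custom
2. K computer, SPARC64 VIIIfx 2.0GHz, Tofu interconnect
3. Mira - BlueGene/Q, Power BQC 16C 1.60GHz, Custom
4. SuperMUC - iDataPlex DX360M4, Xeon E5-2680 8C 2.70GHz, Infiniband FDR
5. Tianhe-1A - NUDT YH MPP, Xeon X5670 6C 2.93 GHz, NVIDIA 2050
6. Jaguar - Cray XK6, Opteron 6274 16C 2.200GHz, Cray Gemini interconnect, NVIDIA 2090
7. Fermi - BlueGene/Q, Power BQC 16C 1.60GHz, Custom
8. JuQUEEN - BlueGene/Q, Power BQC 16C 1.60GHz, Custom
9. Curie thin nodes - Bullx B510, Xeon E5-2680 8C 2.700GHz, Infiniband QDR
10. Nebulae - Dawning TC3600 Blade System, Xeon X5650 6C 2.66GHz, Infiniband QDR, NVIDIA 2050

Graphic Processing Unit (GPU)
- A dedicated graphics display device for personal computers, workstations, and game consoles
- Ships as a graphics card or integrated on the motherboard
- NVIDIA and ATI (now AMD) are the main manufacturers

Differences between GPU and CPU
- GPUs target compute-intensive, massively data-parallel computation; most of their transistors go to compute units
- General-purpose CPUs target general computing; most of their transistors go to cache and control logic
[Figure: CPU and GPU block diagrams showing Control, ALU, Cache, and DRAM]

GPU vs. CPU Peak Speed Comparison
Based on slide 7 of S. Green, "GPU Physics," SIGGRAPH 2007 GPGPU Course. www.gpgpu.org/s2007/slides/15-GPGPU-physics.pdf

[Figure: Fermi GF100/GF110 core architecture diagram]
[Figure: Kepler GK104 core architecture diagram]

The Explosive Growth in GPU (CUDA) Core Counts
- GeForce GTX 690: Shader Processors: 2x 1536

Comparison of CPU, GPU, and FPGA Implementations
Hardware | CPU | GPU | FPGA
Measured AES-128 decryption speed (GByte/s) | 0.119 (single core of a Core2 E6700) | 1.78 (FX9800GTX+) | 1.02 (highest single-chip FPGA figure found in Internet sources)
Development difficulty | Low | Fairly low | High
Adding features | Easy | Easy | Hard
Hardware upgrade | No code changes needed | No code changes needed | Code changes needed
Communication with the host | Not needed | Over PCI-E, actual speed typically about 3G; done through an API, fairly simple | Requires writing an extra driver for the FPGA; implementing the communication protocol takes extra hardware resources
Performance/cost | High | Low | High
Off-chip memory | Main memory: large capacity, low speed | Video memory: fairly large capacity, high speed | On-board FPGA memory, typically DDR2: low speed
Development cycle | Short | Short | Long

APU: Merging the CPU and GPU into One
- APU is short for "Accelerated Processing Units", a product of AMD's Fusion concept; for the first time it puts the processor and a discrete-class graphics core on a single die.
- The APU microarchitecture fuses five major parts: CPU, GPU, northbridge, memory controller, and I/O controller.

Intel's Many-Core and Multi-Core
- Intel 80-core TeraScale Processor (Vangal et al. 2008): developed a solver (single precision) for this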
chip that ran at 1 TFLOP with only 97 Watts
Source: Tim Mattson, Intel Labs

Trends are putting all onto one chip
- The future belongs to heterogeneous, many-core SOC as the standard building block of computing
- SOC = system on a chip
Source: Tim Mattson, Intel Labs

Xeon Phi
- Xeon Phi is the first 60-core processor, officially launched by Intel on November 12, 2012.
- Xeon Phi is not a CPU in the traditional sense; it is more like a GPU that works alongside the CPU. It is based on Larrabee, Intel's consumer GPU technology, although that project was cancelled in 2009.
- Intel needs the Larrabee technology to compete with Nvidia in the supercomputer market, because simpler, more specialized GPU-style processors handle certain supercomputing tasks more efficiently, raising performance and cutting energy consumption.

TOP 10 List June 2013
Rank | System | Cores | Rmax (TFlop/s) | Rpeak (TFlop/s) | Power (kW)
1 | NUDT Tianhe-2 (MilkyWay-2) - TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.200GHz, TH Express-2, Intel Xeon Phi 31S1P | 3,120,000 | 33,862.7 | 54,902.4 | 17,808
2 | Cray Inc. Titan - Cray XK7, Opteron 6274 16C 2.200GHz, Cray Gemini interconnect, NVIDIA K20x | 560,640 | 17,590.0 | 27,112.5 | 8,209
3 | IBM Sequoia - BlueGene/Q, Power BQC 16C 1.60 GHz, Custom | 1,572,864 | 17,173.2 | 20,132.7 | 7,890
4 | Fujitsu K computer, SPARC64 VIIIfx 2.0GHz, Tofu interconnect | 705,024 | 10,510.0 | 11,280.4 | 12,660
5 | IBM Mira - BlueGene/Q, Power BQC 16C 1.60GHz, Custom | 786,432 | 8,586.6 | 10,066.3 | 3,945
6 | Dell Stampede - PowerEdge C8220, Xeon E5-2680 8C 2.7
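The Rmax, Rpeak, and Power columns of the June 2013 TOP 10 list can be combined into two derived figures that make the rows easier to compare: the fraction of peak actually achieved on Linpack (Rmax / Rpeak) and the delivered performance per watt. The following is a minimal sketch, not part of the original slides. The five complete rows are copied from the list, while the function names and the choice of these two metrics are illustrative assumptions.

```python
# Rows copied from the June 2013 TOP 10 list:
# (system, Rmax in TFlop/s, Rpeak in TFlop/s, power in kW).
# Only the five rows with complete figures are included.
SYSTEMS = [
    ("Tianhe-2", 33862.7, 54902.4, 17808),
    ("Titan", 17590.0, 27112.5, 8209),
    ("Sequoia", 17173.2, 20132.7, 7890),
    ("K computer", 10510.0, 11280.4, 12660),
    ("Mira", 8586.6, 10066.3, 3945),
]


def linpack_efficiency(rmax_tflops, rpeak_tflops):
    """Fraction of theoretical peak achieved on Linpack (hypothetical helper)."""
    return rmax_tflops / rpeak_tflops


def gflops_per_watt(rmax_tflops, power_kw):
    """Delivered GFlop/s per watt (hypothetical helper).

    TFlop/s divided by kW equals GFlop/s divided by W,
    because both units are scaled by the same factor of 1000.
    """
    return rmax_tflops / power_kw


def summarize(systems):
    """Return one dict per system with both derived metrics."""
    return [
        {
            "system": name,
            "efficiency": linpack_efficiency(rmax, rpeak),
            "gflops_per_watt": gflops_per_watt(rmax, power),
        }
        for name, rmax, rpeak, power in systems
    ]


if __name__ == "__main__":
    for row in summarize(SYSTEMS):
        print(f"{row['system']:<12} "
              f"{row['efficiency']:6.1%} of peak  "
              f"{row['gflops_per_watt']:.2f} GFlop/s per W")
```

On these five rows, the two accelerator-based systems (Tianhe-2 with Xeon Phi, Titan with NVIDIA K20x) reach a smaller fraction of their peak than the CPU-only systems, which is one way to read the heterogeneous-computing trend described in the slides.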