Computer Architecture: Exercises and Solutions

1.1 Three enhancements with the following speedups are proposed for a new architecture: Speedup1 = 30, Speedup2 = 20, Speedup3 = 15. Only one enhancement is usable at a time.
(1) If enhancements 1 and 2 are each usable for 25% of the time, what fraction of the time must enhancement 3 be used to achieve an overall speedup of 10?
(2) Assume the enhancements can be used 25%, 35%, and 10% of the time for enhancements 1, 2, and 3, respectively. For what fraction of the reduced execution time is no enhancement in use?
(3) Assume, for some benchmark, the possible fraction of use is 15% for each of enhancements 1 and 2 and 70% for enhancement 3. We want to maximize performance. If only one enhancement can be implemented, which should it be? If two enhancements can be implemented, which should be chosen?

Answer:
(1) Let x be the fraction of the time enhancement 3 must be used to achieve an overall speedup of 10. By Amdahl's law,
Speedup_overall = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced/Speedup_enhanced)
10 = 1 / ((1 - 25% - 25% - x) + 25%/30 + 25%/20 + x/15)
Solving gives x ≈ 45%.

(2) Let Time_before be the total execution time before the three enhancements are applied, and Time_no the time during which no enhancement is in use.
Time_no = (1 - 25% - 35% - 10%) Time_before = 0.30 Time_before
The total execution time after the three enhancements are applied is
Time_after = Time_no + (25%/30) Time_before + (35%/20) Time_before + (10%/15) Time_before = 0.3325 Time_before
So Time_no / Time_after = 0.30 / 0.3325 ≈ 90.2%.

(3) Using Speedup_overall = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced/Speedup_enhanced):
If only one enhancement can be implemented:
Speedup_overall_1 = 1 / ((1 - 15%) + 15%/30) = 1.17
Speedup_overall_2 = 1 / ((1 - 15%) + 15%/20) = 1.166
Speedup_overall_3 = 1 / ((1 - 70%) + 70%/15) = 2.88
So enhancement 3 alone should be chosen.
If two enhancements can be implemented:
Speedup_overall_12 = 1 / ((1 - 15% - 15%) + 15%/30 + 15%/20) = 1.40
Speedup_overall_13 = 1 / ((1 - 15% - 70%) + 15%/30 + 70%/15) = 4.96
Speedup_overall_23 = 1 / ((1 - 15% - 70%) + 15%/20 + 70%/15) = 4.90
So enhancements 1 and 3 should be chosen to maximize performance.
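The arithmetic in parts (1) and (3) can be checked with a few lines of code. This is a minimal sketch, assuming Python; the helper name amdahl_multi and its argument layout are illustrative, not from the textbook.

# Hypothetical helper: overall speedup when several enhancements each cover
# a disjoint fraction of the original execution time (Amdahl's law).
def amdahl_multi(pairs):
    unenhanced = 1.0 - sum(f for f, _ in pairs)
    return 1.0 / (unenhanced + sum(f / s for f, s in pairs))

# Part (1): solve 10 = 1/((1 - 0.25 - 0.25 - x) + 0.25/30 + 0.25/20 + x/15) for x.
base = 0.25 / 30 + 0.25 / 20            # contribution of enhancements 1 and 2
x = (0.5 + base - 0.1) / (1 - 1 / 15)   # rearranged closed form
print(round(x, 3))                      # ~0.451, i.e. about 45%

# Part (3): the best single enhancement and the best pair.
print(amdahl_multi([(0.70, 15)]))               # ~2.88 (enhancement 3 alone)
print(amdahl_multi([(0.15, 30), (0.70, 15)]))   # ~4.96 (enhancements 1 and 3)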
1.2 Suppose there is a graphics operation that accounts for 10% of execution time in an application, and by adding special hardware we can speed this operation up by a factor of 18. Furthermore, we could use twice as much hardware and make the graphics operation run 36 times faster. Explain whether it is worth exploring such a further architectural change.

Answer: By Amdahl's law, Speedup_overall = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced/Speedup_enhanced):
Speedup_overall_1 = 1 / ((1 - 10%) + 10%/18) = 1 / (0.9 + 0.00556) = 1.104
Speedup_overall_2 = 1 / ((1 - 10%) + 10%/36) = 1 / (0.9 + 0.00278) = 1.108
Doubling the graphics hardware raises the overall speedup only from 1.104 to 1.108, so it is not worth exploring such a further architectural change.

1.3 In many practical applications that demand a real-time response, the computational workload W is often fixed. As the number of processors increases in a parallel computer, the fixed workload is distributed to more processors for parallel execution. Assume 20 percent of W must be executed sequentially and 80 percent can be executed by 4 nodes simultaneously. What is the fixed-load speedup?

Answer:
Speedup = W / (20% W + (80% W)/4) = 1 / (0.2 + 0.2) = 2.5
So the fixed-load speedup is 2.5.
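The same one-liner style checks 1.2 and 1.3; a sketch, assuming Python:

# 1.2: overall speedup with an 18x, then a 36x, graphics unit covering 10% of the time.
for s in (18, 36):
    print(round(1 / (0.9 + 0.1 / s), 3))   # 1.104 and 1.108, a marginal gain

# 1.3: fixed-load speedup, 20% sequential plus 80% spread over 4 nodes.
print(1 / (0.2 + 0.8 / 4))                 # 2.5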
2.1 There is a model machine with nine instructions, whose frequencies are ADD(0.3), SUB(0.24), JOM(0.06), STO(0.07), JMP(0.07), SHR(0.02), CIL(0.03), CLA(0.2), and STP(0.01), respectively. There are several GPRs in the machine. Memory is byte addressable, with accessed addresses aligned, and the memory word width is 16 bits. Suppose the nine instructions have the following characteristics:
- Two-operand instructions
- Two kinds of instruction length
- Extended (expanding opcode) encoding
- Shorter instruction operand format: R(register)-R(register)
- Longer instruction operand format: R(register)-M(memory)
- Displacement memory addressing mode
A. Encode the nine instructions with Huffman coding, and give the average code length.
B. Design the practical instruction codes, and give the average code length.
C. Write the two instruction word formats in detail.
D. What is the maximum offset for accessing a memory address?

Answer:
A. Huffman coding, built from the Huffman tree:
ADD 30%: 01
SUB 24%: 11
CLA 20%: 10
JOM 6%: 0001
STO 7%: 0011
JMP 7%: 0010
SHR 2%: 000001
CIL 3%: 00001
STP 1%: 000000
The average code length is the frequency-weighted sum of the code lengths, sum(p_i × l_i) for i = 1..9, which equals 2.61 bits.

B. Two instruction lengths with extended (expanding opcode) encoding. The three most frequent instructions get 2-bit opcodes, and the remaining 2-bit pattern is expanded into 5-bit opcodes for the other six instructions. (The expanded opcodes must not share their leading 2 bits with any short opcode, so the pattern 00 is reserved as the escape.)
ADD 30%: 01
SUB 24%: 11
CLA 20%: 10
JOM 6%: 00000
STO 7%: 00001
JMP 7%: 00010
SHR 2%: 00011
CIL 3%: 00100
STP 1%: 00101
The average code length is 2 × (0.30 + 0.24 + 0.20) + 5 × (0.06 + 0.07 + 0.07 + 0.02 + 0.03 + 0.01) = 2.78 bits.

C. Shorter instruction format (8 bits):
Opcode (2 bits) | Register (3 bits) | Register (3 bits)
Longer instruction format (16 bits):
Opcode (5 bits) | Register (3 bits) | Register (3 bits) | Offset (5 bits)

D. The offset field is 5 bits, so the displacement addressing range is 2^5 = 32 bytes; the maximum offset for accessing a memory address is 32 bytes (byte offsets 0 through 31).
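The averages in parts A and B can be verified mechanically. Below is a minimal sketch, assuming Python and the standard heapq module; Huffman tie-breaking may assign different individual code words than the table above, but the expected code length is the same for any optimal tree.

import heapq

freqs = {"ADD": .30, "SUB": .24, "CLA": .20, "STO": .07, "JMP": .07,
         "JOM": .06, "CIL": .03, "SHR": .02, "STP": .01}

# Huffman: repeatedly merge the two least frequent nodes; a node carries
# (probability, tiebreak, {mnemonic: code length so far}).
heap = [(p, i, {op: 0}) for i, (op, p) in enumerate(freqs.items())]
heapq.heapify(heap)
tiebreak = len(heap)
while len(heap) > 1:
    p1, _, a = heapq.heappop(heap)
    p2, _, b = heapq.heappop(heap)
    merged = {op: depth + 1 for op, depth in {**a, **b}.items()}  # one level deeper
    heapq.heappush(heap, (p1 + p2, tiebreak, merged))
    tiebreak += 1
lengths = heap[0][2]
print(sum(freqs[op] * lengths[op] for op in freqs))           # 2.61 bits (part A)

# Part B: 2-bit opcodes for the three most frequent, 5-bit opcodes for the rest.
print(2 * (.30 + .24 + .20) + 5 * (1 - (.30 + .24 + .20)))    # 2.78 bits (part B)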
3.1 Identify all of the data dependences in the following code. Which dependences are data hazards that will be resolved via forwarding?
ADD R2,R5,R4
ADD R4,R2,R5
SW  R5,100(R2)
ADD R3,R2,R4

Answer: The data dependences are:
- RAW on R2: instruction 1 writes R2, which is read by instructions 2, 3, and 4.
- RAW on R4: instruction 2 writes R4, which is read by instruction 4.
- WAR on R4: instruction 1 reads R4 before instruction 2 writes it (a name dependence, not a hazard in this in-order pipeline).
The RAW dependences on R2 from instruction 1 to instructions 2 and 3, and on R4 from instruction 2 to instruction 4, are the data hazards resolved via forwarding; the R2 dependence from instruction 1 to instruction 4 is far enough apart to be satisfied through the register file (write in the first half of the cycle, read in the second half). (A small dependence-checking sketch follows the answer to 3.2.)

3.2 How could we modify the following code to make use of a delayed branch slot?
Loop: LW   R2,100(R3)
      ADDI R3,R3,#4
      BEQ  R3,R4,Loop

Answer: Peel the first load out of the loop and move the load into the branch delay slot:
      LW   R2,100(R3)
Loop: ADDI R3,R3,#4
      BEQ  R3,R4,Loop
      LW   R2,100(R3)   ; delayed branch slot
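For 3.1, the classification can also be derived with a short pairwise scan. A sketch, assuming Python; the (destination, sources) tuples below are a hand-written encoding of the four instructions, and the scan is naive (it ignores intervening redefinitions, which do not occur here).

# (destination, [source registers]) for the four instructions in 3.1.
insts = [("R2", ["R5", "R4"]),   # I1: ADD R2,R5,R4
         ("R4", ["R2", "R5"]),   # I2: ADD R4,R2,R5
         (None, ["R5", "R2"]),   # I3: SW  R5,100(R2)  (no register result)
         ("R3", ["R2", "R4"])]   # I4: ADD R3,R2,R4

for i, (di, si) in enumerate(insts):
    for j in range(i + 1, len(insts)):
        dj, sj = insts[j]
        if di and di in sj:
            print(f"RAW on {di}: I{i+1} -> I{j+1}")   # true data dependence
        if dj and dj in si:
            print(f"WAR on {dj}: I{i+1} -> I{j+1}")   # antidependence
        if di and di == dj:
            print(f"WAW on {di}: I{i+1} -> I{j+1}")   # output dependence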
3.3 Consider the following reservation table for a four-stage pipeline with a clock cycle t = 20 ns.
A. What are the forbidden latencies and the initial collision vector?
B. Draw the state transition diagram for scheduling the pipeline.
C. Determine the MAL associated with the shortest greedy cycle.
D. Determine the pipeline maximum throughput corresponding to the MAL and the given t.
[Reservation table: stages S1 to S4 down the rows, clock cycles 1 to 6 across the columns; the stage-usage marks did not survive this transcription.]

Answer:
A. The forbidden latencies are F = {1, 2, 5}, and the initial collision vector is C = (10011).
B. The state transition diagram is built by, for each permissible latency, shifting the current collision vector right by that latency and ORing in the initial vector. [Diagram not reproduced here.]
C. MAL (minimal average latency) = 3 clock cycles.
D. The pipeline maximum throughput corresponding to the MAL and the given t is
Hk = 1 / (3 × 20 ns) = 1 / 60 ns ≈ 16.7 million tasks per second.
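Because the reservation table's marks were lost in this copy, the search below starts from the stated forbidden latencies rather than from the table itself. A minimal sketch, assuming Python, that follows the greedy (smallest permitted) latency until a state repeats:

# Greedy-cycle search for exercise 3.3, starting from forbidden latencies {1, 2, 5}.
forbidden = {1, 2, 5}
n = max(forbidden)                           # number of bits in the collision vector
init = sum(1 << (l - 1) for l in forbidden)  # C = (10011) as an integer

def greedy_path(state, visited=()):
    """Take the smallest permitted latency from each state until a state repeats."""
    if state in visited:
        return []
    lat = next(l for l in range(1, n + 2) if not (state >> (l - 1)) & 1)
    nxt = ((state >> lat) | init) & ((1 << n) - 1)
    return [lat] + greedy_path(nxt, visited + (state,))

path = greedy_path(init)
print(path, sum(path) / len(path))   # [3] 3.0  -> greedy cycle (3), MAL = 3 cycles
print(1 / (3 * 20e-9))               # ~1.67e7 tasks per second at t = 20 ns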
3.4 Using the following code fragment:
Loop: LW   R1,0(R2)   ; load R1 from address 0+R2
      ADDI R1,R1,#1   ; R1 = R1 + 1
      SW   0(R2),R1   ; store R1 at address 0+R2
      ADDI R2,R2,#4   ; R2 = R2 + 4
      SUB  R4,R3,R2   ; R4 = R3 - R2
      BNEZ R4,Loop    ; branch to Loop if R4 != 0
Assume that the initial value of R3 is R2+396. Throughout this exercise use the classic RISC five-stage integer pipeline and assume all memory accesses take 1 clock cycle.
A. Show the timing of this instruction sequence for the RISC pipeline without any forwarding or bypassing hardware, but assuming a register read and a write in the same clock cycle "forwards" through the register file. Assume that the branch is handled by flushing the pipeline. If all memory references take 1 cycle, how many cycles does this loop take to execute?
B. Show the timing of this instruction sequence for the RISC pipeline with normal forwarding and bypassing hardware. Assume that the branch is handled by predicting it as not taken. If all memory references take 1 cycle, how many cycles does this loop take to execute?
C. Assume the RISC pipeline with a single-cycle delayed branch and normal forwarding and bypassing hardware. Schedule the instructions in the loop including the branch delay slot. You may reorder instructions and modify the individual instruction operands, but do not undertake other loop transformations that change the number or opcode of the instructions in the loop. Show a pipeline timing diagram and compute the number of cycles needed to execute the entire loop.

Answer:
A. The loop iterates 396/4 = 99 times. Go through one complete iteration of the loop and the first instruction in the next iteration. Total length = the length of iterations 0 through 97 (the first 98 iterations all have the same length) + the length of the last iteration. We assume the version of DLX described in Figure 3.21 (Page 97) in the book, which resolves branches in MEM. From that figure, the second iteration begins 17 clocks after the first, and the last iteration takes 18 cycles to complete.
Total length = 17 × 98 + 18 = 1684 clock cycles

B. With normal forwarding and bypassing hardware, the second iteration begins 10 clocks after the first, and the last iteration takes 11 cycles to complete.
Total length = 10 × 98 + 11 = 991 clock cycles
Per-iteration issue slots (the stalls come from the load-use hazard, the SUB-to-BNEZ hazard, and the taken branch):
Loop: LW   R1,0(R2)
      stall
      ADDI R1,R1,#1
      SW   0(R2),R1
      ADDI R2,R2,#4
      SUB  R4,R3,R2
      stall
      BNEZ R4,Loop
      stall

C. Reorder the instructions (adjusting the SW offset because R2 is now incremented before the store) to:
Loop: LW   R1,0(R2)    ; load R1 from address 0+R2
      ADDI R2,R2,#4    ; R2 = R2 + 4
      SUB  R4,R3,R2    ; R4 = R3 - R2
      ADDI R1,R1,#1    ; R1 = R1 + 1
      BNEZ R4,Loop     ; branch to Loop if R4 != 0
      SW   -4(R2),R1   ; delay slot: store R1 at address 0 + old R2
From the timing diagram, the second iteration begins 6 clocks after the first, and the last iteration takes 10 cycles to complete.
Total length = 6 × 98 + 10 = 598 clock cycles
The reordering removes the stalls step by step (a stall shown in parentheses is one eliminated by that step):
First, moving ADDI R2,R2,#4 up removes the load-use stall:
Loop: LW   R1,0(R2)
      (stall)
      ADDI R2,R2,#4
      ADDI R1,R1,#1
      SW   -4(R2),R1
      SUB  R4,R3,R2
      stall
      BNEZ R4,Loop
      stall
Then, moving SUB up and placing SW in the delay slot removes the remaining stalls:
Loop: LW   R1,0(R2)
      (stall)
      ADDI R2,R2,#4
      SUB  R4,R3,R2
      (stall)
      ADDI R1,R1,#1
      BNEZ R4,Loop
      (stall)
      SW   -4(R2),R1
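The cycle-count bookkeeping in parts A to C (98 steady-state iterations plus a longer final one) reduces to one line per case; this is only the arithmetic, not a pipeline simulator, and the per-iteration figures are the ones quoted above.

# Total cycles = 98 identical iterations + the final iteration, which drains the pipe.
cases = {"A (no forwarding, branch flush)":   (17, 18),
         "B (forwarding, predict not taken)": (10, 11),
         "C (delayed branch, rescheduled)":   (6, 10)}
for name, (per_iter, last) in cases.items():
    print(name, 98 * per_iter + last)         # 1684, 991, 598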
3.5 Consider the following reservation table for a four-stage pipeline.
A. What are the forbidden latencies and the initial collision vector?
B. Draw the state transition diagram for scheduling the pipeline.
C. Determine the MAL associated with the shortest greedy cycle.
D. Determine the pipeline maximum throughput corresponding to the MAL.
E. According to the shortest greedy cycle, put six tasks into the pipeline and determine the pipeline's actual throughput.
[Reservation table: stages S1 to S4 down the rows, clock cycles 1 to 7 across the columns; the stage-usage marks did not survive this transcription.]

Answer:
A. The forbidden latencies are {2, 4, 6}, and the initial collision vector is C = (101010).
B. [State transition diagram not reproduced here.]
C. The MAL associated with the shortest greedy cycle is 4 clock cycles. The candidate latency cycles and their average latencies:

Cycle     Average latency
(1,7)     4
(3,5)     4
(5,3)     4
(5)       5
(3,7)     5
(5,7)     6
(7)       7

D. The pipeline maximum throughput corresponding to the MAL is Hk = 1/(4 clock cycles).
E. Putting six tasks into the pipeline, the total time is the five initiation latencies plus 7 cycles for the last task to flow through the seven-column table. The best scheduling is the greedy cycle (1,7), because:
- (1,7) scheduling: actual throughput Hk = 6/(1+7+1+7+1+7) = 6/(24 cycles)
- (3,5) scheduling: actual throughput Hk = 6/(3+5+3+5+3+7) = 6/(26 cycles)
- (5,3) scheduling: actual throughput Hk = 6/(5+3+5+3+5+7) = 6/(28 cycles)
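The part E comparison is just the sum of the five initiation latencies for six tasks plus the 7-cycle drain of the last task. A sketch, assuming Python:

from itertools import cycle, islice

def total_time(latencies, tasks=6, drain=7):
    """Time for `tasks` initiations: (tasks-1) latencies plus the last task's drain."""
    return sum(islice(cycle(latencies), tasks - 1)) + drain

for lats in [(1, 7), (3, 5), (5, 3)]:
    print(lats, "6 /", total_time(lats), "cycles")   # 24, 26, and 28 cycles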
4.1 The following C program is run (with no optimizations) on a machine with a cache that has four-word (16-byte) blocks and holds 256 bytes of data:

int i, j, c, stride, array[256];
for (i = 0; i < 10000; i++)
    for (j = 0; j < 256; j = j + stride)
        c = array[j] + 5;

If we consider only the cache activity generated by references to the array and we assume that integers are words, what is the expected miss rate when the cache is direct-mapped and stride = 132? How about if stride = 131? Would either of these change if the cache were two-way set associative?

Answer:
If stride = 132 and the cache is direct-mapped (Pages 201, 211):
The cache has 256/16 = 16 blocks.
The block address of array[0] is 0/16 = 0, so array[0] maps to cache block 0 mod 16 = 0.
The block address of array[132] is 132 × 4 / 16 = 33, so array[132] maps to cache block 33 mod 16 = 1.
Only array[0] and array[132] are referenced, they map to different blocks, and only their first accesses miss:
miss rate = 2 / (2 × 10000) = 1/10000

If stride = 131 and the cache is direct-mapped (Pages 201, 211):
The cache has 256/16 = 16 blocks.
The block address of array[0] is 0/16 = 0, so array[0] maps to cache block 0 mod 16 = 0.
The block address of array[131] is floor(131 × 4 / 16) = 32, so array[131] maps to cache block 32 mod 16 = 0.
The two referenced elements conflict in the same block and evict each other on every access:
miss rate = (2 × 10000) / (2 × 10000) = 1

If stride = 132 and the cache is two-way set associative (Pages 224-227, 211):
The cache has 256/16 = 16 blocks, organized as 16/2 = 8 sets.
array[0] (block address 0) maps to set 0 mod 8 = 0.
array[132] (block address 33) maps to set 33 mod 8 = 1.
miss rate = 2 / (2 × 10000) = 1/10000

If stride = 131 and the cache is two-way set associative (Pages 224-227, 211):
The cache has 256/16 = 16 blocks, organized as 16/2 = 8 sets.
array[0] (block address 0) maps to set 0 mod 8 = 0.
array[131] (block address 32) maps to set 32 mod 8 = 0.
Both elements fall in set 0, but a two-way set holds two blocks, so they coexist and only their first accesses miss:
miss rate = 2 / (2 × 10000) = 1/10000

So only the stride = 131 case changes with two-way set associativity: its miss rate drops from 1 to 1/10000.
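All four miss rates can be reproduced with a tiny cache simulation. This is a minimal sketch, assuming Python; the cache geometry and access pattern follow the problem statement, and the LRU detail barely matters because at most two blocks ever compete.

def miss_rate(stride, ways, cache_bytes=256, block_bytes=16, elem_bytes=4):
    """Simulate the j-loop's array accesses and return the overall miss rate."""
    sets = cache_bytes // block_bytes // ways
    cache = [[] for _ in range(sets)]          # each set holds up to `ways` block tags
    accesses = misses = 0
    for _ in range(10000):                     # outer i-loop
        for j in range(0, 256, stride):        # inner j-loop touches array[j]
            block = (j * elem_bytes) // block_bytes
            s, tag = block % sets, block // sets
            accesses += 1
            if tag in cache[s]:
                cache[s].remove(tag)           # hit: refresh LRU position
            else:
                misses += 1
                if len(cache[s]) == ways:
                    cache[s].pop(0)            # evict least recently used
            cache[s].append(tag)
    return misses / accesses

for stride in (132, 131):
    print(stride, miss_rate(stride, ways=1), miss_rate(stride, ways=2))
    # 132: 0.0001 and 0.0001; 131: 1.0 and 0.0001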
4.2 Consider a virtual memory system with the following properties:
- 40-bit virtual byte address
- 16-KB pages
- 36-bit physical byte address
(1) What is the total size of the page table for each process on this machine, assuming that the valid, protection, dirty, and use bits take a total of 4 bits and that all the virtual pages are in use? (Assume that disk addresses are not stored in the page table.)
(2) Assume that the virtual memory system is implemented with a two-way set-associative TLB with a total of 256 TLB entries. Show the virtual-to-physical mapping with a figure. Make sure to label the width of all fields and signals.

Answer:
(1) A 16-KB page gives a 14-bit page offset, so there are 2^(40-14) = 2^26 virtual pages, and each page table entry holds a (36-14) = 22-bit physical page number plus the 4 status bits. The total size of the page table for each process is
2^(40-14) × (4 + (36-14)) bits = 2^26 × 26 bits = 208 MB
(2) [Figure not reproduced here.] The field widths in the mapping are: virtual page number = 40 - 14 = 26 bits and page offset = 14 bits; with 256 TLB entries arranged two-way set associative there are 128 sets, so the TLB index is 7 bits and the TLB tag is 26 - 7 = 19 bits; each TLB entry supplies a 22-bit physical page number, which is concatenated with the 14-bit offset to form the 36-bit physical address.
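The sizes and field widths follow directly from the bit arithmetic; a short sketch, assuming Python, that recomputes them:

import math

virt_bits, phys_bits, page_bytes = 40, 36, 16 * 1024
offset = int(math.log2(page_bytes))                  # 14-bit page offset
vpn, ppn = virt_bits - offset, phys_bits - offset    # 26-bit VPN, 22-bit PPN

pte_bits = ppn + 4                                   # physical page number + 4 status bits
table_bytes = (2 ** vpn) * pte_bits / 8
print(table_bytes / 2**20, "MB")                     # 208.0 MB per process

entries, ways = 256, 2
index = int(math.log2(entries // ways))              # 7-bit TLB index (128 sets)
print(vpn - index, "bit TLB tag")                    # 19-bit tag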