Computer Architecture: Exercises and Solutions

1.1 Three enhancements with the following speedups are proposed for a new architecture: Speedup1 = 30, Speedup2 = 20, Speedup3 = 15. Only one enhancement is usable at a time.
(1) If enhancements 1 and 2 are each usable for 25% of the time, what fraction of the time must enhancement 3 be used to achieve an overall speedup of 10?
(2) Assume the enhancements can be used 25%, 35%, and 10% of the time for enhancements 1, 2, and 3, respectively. For what fraction of the reduced execution time is no enhancement in use?
(3) Assume, for some benchmark, the possible fraction of use is 15% for each of enhancements 1 and 2 and 70% for enhancement 3. We want to maximize performance. If only one enhancement can be implemented, which should it be? If two enhancements can be implemented, which should be chosen?

Answer:
(1) Let x be the fraction of the time enhancement 3 must be used to achieve an overall speedup of 10. By Amdahl's law,
Speedup_overall = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced/Speedup_enhanced)
10 = 1 / ((1 - 25% - 25% - x) + 25%/30 + 25%/20 + x/15)
Solving gives x ≈ 45%.

(2) Let Time_before be the total execution time before the three enhancements are applied, and Time_no the time during which no enhancement is in use.
Time_no = (1 - 25% - 35% - 10%) Time_before = 0.30 Time_before
The total execution time after the three enhancements are applied is
Time_after = Time_no + (25%/30) Time_before + (35%/20) Time_before + (10%/15) Time_before = 0.3325 Time_before
So Time_no / Time_after = 0.30 / 0.3325 ≈ 90.2%.

(3) Using Speedup_overall = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced/Speedup_enhanced):
If only one enhancement can be implemented:
Speedup_overall_1 = 1 / ((1 - 15%) + 15%/30) = 1.17
Speedup_overall_2 = 1 / ((1 - 15%) + 15%/20) = 1.166
Speedup_overall_3 = 1 / ((1 - 70%) + 70%/15) = 2.88
So enhancement 3 alone should be chosen.
If two enhancements can be implemented:
Speedup_overall_12 = 1 / ((1 - 15% - 15%) + 15%/30 + 15%/20) = 1.40
Speedup_overall_13 = 1 / ((1 - 15% - 70%) + 15%/30 + 70%/15) = 4.96
Speedup_overall_23 = 1 / ((1 - 15% - 70%) + 15%/20 + 70%/15) = 4.90
So enhancements 1 and 3 should be chosen to maximize performance.
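The arithmetic in parts (1) and (3) can be checked with a few lines of code. This is a minimal sketch, assuming Python; the helper name amdahl_multi and its argument layout are illustrative, not from the textbook.

# Hypothetical helper: overall speedup when several enhancements each cover
# a disjoint fraction of the original execution time (Amdahl's law).
def amdahl_multi(pairs):
    unenhanced = 1.0 - sum(f for f, _ in pairs)
    return 1.0 / (unenhanced + sum(f / s for f, s in pairs))

# Part (1): solve 10 = 1/((1 - 0.25 - 0.25 - x) + 0.25/30 + 0.25/20 + x/15) for x.
base = 0.25 / 30 + 0.25 / 20            # contribution of enhancements 1 and 2
x = (0.5 + base - 0.1) / (1 - 1 / 15)   # rearranged closed form
print(round(x, 3))                      # ~0.451, i.e. about 45%

# Part (3): the best single enhancement and the best pair.
print(amdahl_multi([(0.70, 15)]))               # ~2.88 (enhancement 3 alone)
print(amdahl_multi([(0.15, 30), (0.70, 15)]))   # ~4.96 (enhancements 1 and 3)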
1.2 Suppose there is a graphics operation that accounts for 10% of execution time in an application, and by adding special hardware we can speed this operation up by a factor of 18. Furthermore, we could use twice as much hardware and make the graphics operation run 36 times faster. Explain whether it is worth exploring such a further architectural change.

Answer: By Amdahl's law, Speedup_overall = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced/Speedup_enhanced):
Speedup_overall_1 = 1 / ((1 - 10%) + 10%/18) = 1 / (0.9 + 0.00556) = 1.104
Speedup_overall_2 = 1 / ((1 - 10%) + 10%/36) = 1 / (0.9 + 0.00278) = 1.108
Doubling the graphics hardware raises the overall speedup only from 1.104 to 1.108, so it is not worth exploring such a further architectural change.

1.3 In many practical applications that demand a real-time response, the computational workload W is often fixed. As the number of processors increases in a parallel computer, the fixed workload is distributed to more processors for parallel execution. Assume 20 percent of W must be executed sequentially and 80 percent can be executed by 4 nodes simultaneously. What is the fixed-load speedup?

Answer:
Speedup = W / (20% W + (80% W)/4) = 1 / (0.2 + 0.2) = 2.5
So the fixed-load speedup is 2.5.
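The same one-liner style checks 1.2 and 1.3; a sketch, assuming Python:

# 1.2: overall speedup with an 18x, then a 36x, graphics unit covering 10% of the time.
for s in (18, 36):
    print(round(1 / (0.9 + 0.1 / s), 3))   # 1.104 and 1.108, a marginal gain

# 1.3: fixed-load speedup, 20% sequential plus 80% spread over 4 nodes.
print(1 / (0.2 + 0.8 / 4))                 # 2.5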
2.1 There is a model machine with nine instructions, whose frequencies are ADD(0.3), SUB(0.24), JOM(0.06), STO(0.07), JMP(0.07), SHR(0.02), CIL(0.03), CLA(0.2), and STP(0.01), respectively. There are several GPRs in the machine. Memory is byte addressable, with accessed addresses aligned, and the memory word width is 16 bits. Suppose the nine instructions have the following characteristics:
- Two-operand instructions
- Two kinds of instruction length
- Extended (expanding opcode) encoding
- Shorter instruction operand format: R(register)-R(register)
- Longer instruction operand format: R(register)-M(memory)
- Displacement memory addressing mode
A. Encode the nine instructions with Huffman coding, and give the average code length.
B. Design the practical instruction codes, and give the average code length.
C. Write the two instruction word formats in detail.
D. What is the maximum offset for accessing a memory address?

Answer:
A. Huffman coding, built from the Huffman tree:
ADD 30%: 01
SUB 24%: 11
CLA 20%: 10
JOM 6%: 0001
STO 7%: 0011
JMP 7%: 0010
SHR 2%: 000001
CIL 3%: 00001
STP 1%: 000000
The average code length is the frequency-weighted sum of the code lengths, sum(p_i × l_i) for i = 1..9, which equals 2.61 bits.

B. Two instruction lengths with extended (expanding opcode) encoding. The three most frequent instructions get 2-bit opcodes, and the remaining 2-bit pattern is expanded into 5-bit opcodes for the other six instructions. (The expanded opcodes must not share their leading 2 bits with any short opcode, so the pattern 00 is reserved as the escape.)
ADD 30%: 01
SUB 24%: 11
CLA 20%: 10
JOM 6%: 00000
STO 7%: 00001
JMP 7%: 00010
SHR 2%: 00011
CIL 3%: 00100
STP 1%: 00101
The average code length is 2 × (0.30 + 0.24 + 0.20) + 5 × (0.06 + 0.07 + 0.07 + 0.02 + 0.03 + 0.01) = 2.78 bits.

C. Shorter instruction format (8 bits):
Opcode (2 bits) | Register (3 bits) | Register (3 bits)
Longer instruction format (16 bits):
Opcode (5 bits) | Register (3 bits) | Register (3 bits) | Offset (5 bits)

D. The offset field is 5 bits, so the displacement addressing range is 2^5 = 32 bytes; the maximum offset for accessing a memory address is 32 bytes (byte offsets 0 through 31).
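The averages in parts A and B can be verified mechanically. Below is a minimal sketch, assuming Python and the standard heapq module; Huffman tie-breaking may assign different individual code words than the table above, but the expected code length is the same for any optimal tree.

import heapq

freqs = {"ADD": .30, "SUB": .24, "CLA": .20, "STO": .07, "JMP": .07,
         "JOM": .06, "CIL": .03, "SHR": .02, "STP": .01}

# Huffman: repeatedly merge the two least frequent nodes; a node carries
# (probability, tiebreak, {mnemonic: code length so far}).
heap = [(p, i, {op: 0}) for i, (op, p) in enumerate(freqs.items())]
heapq.heapify(heap)
tiebreak = len(heap)
while len(heap) > 1:
    p1, _, a = heapq.heappop(heap)
    p2, _, b = heapq.heappop(heap)
    merged = {op: depth + 1 for op, depth in {**a, **b}.items()}  # one level deeper
    heapq.heappush(heap, (p1 + p2, tiebreak, merged))
    tiebreak += 1
lengths = heap[0][2]
print(sum(freqs[op] * lengths[op] for op in freqs))           # 2.61 bits (part A)

# Part B: 2-bit opcodes for the three most frequent, 5-bit opcodes for the rest.
print(2 * (.30 + .24 + .20) + 5 * (1 - (.30 + .24 + .20)))    # 2.78 bits (part B)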
3.1 Identify all of the data dependences in the following code. Which dependences are data hazards that will be resolved via forwarding?
ADD R2,R5,R4
ADD R4,R2,R5
SW  R5,100(R2)
ADD R3,R2,R4

Answer: The data dependences are:
- RAW on R2: instruction 1 writes R2, which is read by instructions 2, 3, and 4.
- RAW on R4: instruction 2 writes R4, which is read by instruction 4.
- WAR on R4: instruction 1 reads R4 before instruction 2 writes it (a name dependence, not a hazard in this in-order pipeline).
The RAW dependences on R2 from instruction 1 to instructions 2 and 3, and on R4 from instruction 2 to instruction 4, are the data hazards resolved via forwarding; the R2 dependence from instruction 1 to instruction 4 is far enough apart to be satisfied through the register file (write in the first half of the cycle, read in the second half). (A small dependence-checking sketch follows the answer to 3.2.)

3.2 How could we modify the following code to make use of a delayed branch slot?
Loop: LW   R2,100(R3)
      ADDI R3,R3,#4
      BEQ  R3,R4,Loop

Answer: Peel the first load out of the loop and move the load into the branch delay slot:
      LW   R2,100(R3)
Loop: ADDI R3,R3,#4
      BEQ  R3,R4,Loop
      LW   R2,100(R3)   ; delayed branch slot
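For 3.1, the classification can also be derived with a short pairwise scan. A sketch, assuming Python; the (destination, sources) tuples below are a hand-written encoding of the four instructions, and the scan is naive (it ignores intervening redefinitions, which do not occur here).

# (destination, [source registers]) for the four instructions in 3.1.
insts = [("R2", ["R5", "R4"]),   # I1: ADD R2,R5,R4
         ("R4", ["R2", "R5"]),   # I2: ADD R4,R2,R5
         (None, ["R5", "R2"]),   # I3: SW  R5,100(R2)  (no register result)
         ("R3", ["R2", "R4"])]   # I4: ADD R3,R2,R4

for i, (di, si) in enumerate(insts):
    for j in range(i + 1, len(insts)):
        dj, sj = insts[j]
        if di and di in sj:
            print(f"RAW on {di}: I{i+1} -> I{j+1}")   # true data dependence
        if dj and dj in si:
            print(f"WAR on {dj}: I{i+1} -> I{j+1}")   # antidependence
        if di and di == dj:
            print(f"WAW on {di}: I{i+1} -> I{j+1}")   # output dependence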
3.3 Consider the following reservation table for a four-stage pipeline with a clock cycle t = 20 ns.
A. What are the forbidden latencies and the initial collision vector?
B. Draw the state transition diagram for scheduling the pipeline.
C. Determine the MAL associated with the shortest greedy cycle.
D. Determine the pipeline maximum throughput corresponding to the MAL and the given t.
[Reservation table: stages S1 to S4 down the rows, clock cycles 1 to 6 across the columns; the stage-usage marks did not survive this transcription.]

Answer:
A. The forbidden latencies are F = {1, 2, 5}, and the initial collision vector is C = (10011).
B. The state transition diagram is built by, for each permissible latency, shifting the current collision vector right by that latency and ORing in the initial vector. [Diagram not reproduced here.]
C. MAL (minimal average latency) = 3 clock cycles.
D. The pipeline maximum throughput corresponding to the MAL and the given t is
Hk = 1 / (3 × 20 ns) = 1 / 60 ns ≈ 16.7 million tasks per second.
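Because the reservation table's marks were lost in this copy, the search below starts from the stated forbidden latencies rather than from the table itself. A minimal sketch, assuming Python, that follows the greedy (smallest permitted) latency until a state repeats:

# Greedy-cycle search for exercise 3.3, starting from forbidden latencies {1, 2, 5}.
forbidden = {1, 2, 5}
n = max(forbidden)                           # number of bits in the collision vector
init = sum(1 << (l - 1) for l in forbidden)  # C = (10011) as an integer

def greedy_path(state, visited=()):
    """Take the smallest permitted latency from each state until a state repeats."""
    if state in visited:
        return []
    lat = next(l for l in range(1, n + 2) if not (state >> (l - 1)) & 1)
    nxt = ((state >> lat) | init) & ((1 << n) - 1)
    return [lat] + greedy_path(nxt, visited + (state,))

path = greedy_path(init)
print(path, sum(path) / len(path))   # [3] 3.0  -> greedy cycle (3), MAL = 3 cycles
print(1 / (3 * 20e-9))               # ~1.67e7 tasks per second at t = 20 ns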
3.4 Using the following code fragment:
Loop: LW   R1,0(R2)   ; load R1 from address 0+R2
      ADDI R1,R1,#1   ; R1 = R1 + 1
      SW   0(R2),R1   ; store R1 at address 0+R2
      ADDI R2,R2,#4   ; R2 = R2 + 4
      SUB  R4,R3,R2   ; R4 = R3 - R2
      BNEZ R4,Loop    ; branch to Loop if R4 != 0
Assume that the initial value of R3 is R2+396. Throughout this exercise use the classic RISC five-stage integer pipeline and assume all memory accesses take 1 clock cycle.
A. Show the timing of this instruction sequence for the RISC pipeline without any forwarding or bypassing hardware, but assuming a register read and a write in the same clock cycle "forwards" through the register file. Assume that the branch is handled by flushing the pipeline. If all memory references take 1 cycle, how many cycles does this loop take to execute?
B. Show the timing of this instruction sequence for the RISC pipeline with normal forwarding and bypassing hardware. Assume that the branch is handled by predicting it as not taken. If all memory references take 1 cycle, how many cycles does this loop take to execute?
C. Assume the RISC pipeline with a single-cycle delayed branch and normal forwarding and bypassing hardware. Schedule the instructions in the loop including the branch delay slot. You may reorder instructions and modify the individual instruction operands, but do not undertake other loop transformations that change the number or opcode of the instructions in the loop. Show a pipeline timing diagram and compute the number of cycles needed to execute the entire loop.

Answer:
A. The loop iterates 396/4 = 99 times. Go through one complete iteration of the loop and the first instruction in the next iteration. Total length = the length of iterations 0 through 97 (the first 98 iterations all have the same length) + the length of the last iteration. We assume the version of DLX described in Figure 3.21 (Page 97) in the book, which resolves branches in MEM. From that figure, the second iteration begins 17 clocks after the first, and the last iteration takes 18 cycles to complete.
Total length = 17 × 98 + 18 = 1684 clock cycles

B. With normal forwarding and bypassing hardware, the second iteration begins 10 clocks after the first, and the last iteration takes 11 cycles to complete.
Total length = 10 × 98 + 11 = 991 clock cycles
Per-iteration issue slots (the stalls come from the load-use hazard, the SUB-to-BNEZ hazard, and the taken branch):
Loop: LW   R1,0(R2)
      stall
      ADDI R1,R1,#1
      SW   0(R2),R1
      ADDI R2,R2,#4
      SUB  R4,R3,R2
      stall
      BNEZ R4,Loop
      stall

C. Reorder the instructions (adjusting the SW offset because R2 is now incremented before the store) to:
Loop: LW   R1,0(R2)    ; load R1 from address 0+R2
      ADDI R2,R2,#4    ; R2 = R2 + 4
      SUB  R4,R3,R2    ; R4 = R3 - R2
      ADDI R1,R1,#1    ; R1 = R1 + 1
      BNEZ R4,Loop     ; branch to Loop if R4 != 0
      SW   -4(R2),R1   ; delay slot: store R1 at address 0 + old R2
From the timing diagram, the second iteration begins 6 clocks after the first, and the last iteration takes 10 cycles to complete.
Total length = 6 × 98 + 10 = 598 clock cycles
The reordering removes the stalls step by step (a stall shown in parentheses is one eliminated by that step):
First, moving ADDI R2,R2,#4 up removes the load-use stall:
Loop: LW   R1,0(R2)
      (stall)
      ADDI R2,R2,#4
      ADDI R1,R1,#1
      SW   -4(R2),R1
      SUB  R4,R3,R2
      stall
      BNEZ R4,Loop
      stall
Then, moving SUB up and placing SW in the delay slot removes the remaining stalls:
Loop: LW   R1,0(R2)
      (stall)
      ADDI R2,R2,#4
      SUB  R4,R3,R2
      (stall)
      ADDI R1,R1,#1
      BNEZ R4,Loop
      (stall)
      SW   -4(R2),R1
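The cycle-count bookkeeping in parts A to C (98 steady-state iterations plus a longer final one) reduces to one line per case; this is only the arithmetic, not a pipeline simulator, and the per-iteration figures are the ones quoted above.

# Total cycles = 98 identical iterations + the final iteration, which drains the pipe.
cases = {"A (no forwarding, branch flush)":   (17, 18),
         "B (forwarding, predict not taken)": (10, 11),
         "C (delayed branch, rescheduled)":   (6, 10)}
for name, (per_iter, last) in cases.items():
    print(name, 98 * per_iter + last)         # 1684, 991, 598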
3.5 Consider the following reservation table for a four-stage pipeline.
A. What are the forbidden latencies and the initial collision vector?
B. Draw the state transition diagram for scheduling the pipeline.
C. Determine the MAL associated with the shortest greedy cycle.
D. Determine the pipeline maximum throughput corresponding to the MAL.
E. According to the shortest greedy cycle, put six tasks into the pipeline and determine the pipeline's actual throughput.
[Reservation table: stages S1 to S4 down the rows, clock cycles 1 to 7 across the columns; the stage-usage marks did not survive this transcription.]

Answer:
A. The forbidden latencies are {2, 4, 6}, and the initial collision vector is C = (101010).
B. [State transition diagram not reproduced here.]
C. The MAL associated with the shortest greedy cycle is 4 clock cycles. The candidate latency cycles and their average latencies:

Cycle     Average latency
(1,7)     4
(3,5)     4
(5,3)     4
(5)       5
(3,7)     5
(5,7)     6
(7)       7

D. The pipeline maximum throughput corresponding to the MAL is Hk = 1/(4 clock cycles).
E. Putting six tasks into the pipeline, the total time is the five initiation latencies plus 7 cycles for the last task to flow through the seven-column table. The best scheduling is the greedy cycle (1,7), because:
- (1,7) scheduling: actual throughput Hk = 6/(1+7+1+7+1+7) = 6/(24 cycles)
- (3,5) scheduling: actual throughput Hk = 6/(3+5+3+5+3+7) = 6/(26 cycles)
- (5,3) scheduling: actual throughput Hk = 6/(5+3+5+3+5+7) = 6/(28 cycles)
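The part E comparison is just the sum of the five initiation latencies for six tasks plus the 7-cycle drain of the last task. A sketch, assuming Python:

from itertools import cycle, islice

def total_time(latencies, tasks=6, drain=7):
    """Time for `tasks` initiations: (tasks-1) latencies plus the last task's drain."""
    return sum(islice(cycle(latencies), tasks - 1)) + drain

for lats in [(1, 7), (3, 5), (5, 3)]:
    print(lats, "6 /", total_time(lats), "cycles")   # 24, 26, and 28 cycles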
4.1 The following C program is run (with no optimizations) on a machine with a cache that has four-word (16-byte) blocks and holds 256 bytes of data:

int i, j, c, stride, array[256];
for (i = 0; i < 10000; i++)
    for (j = 0; j < 256; j = j + stride)
        c = array[j] + 5;

If we consider only the cache activity generated by references to the array and we assume that integers are words, what is the expected miss rate when the cache is direct-mapped and stride = 132? How about if stride = 131? Would either of these change if the cache were two-way set associative?

Answer:
If stride = 132 and the cache is direct-mapped (Pages 201, 211):
The cache has 256/16 = 16 blocks.
The block address of array[0] is 0/16 = 0, so array[0] maps to cache block 0 mod 16 = 0.
The block address of array[132] is 132 × 4 / 16 = 33, so array[132] maps to cache block 33 mod 16 = 1.
Only array[0] and array[132] are referenced, they map to different blocks, and only their first accesses miss:
miss rate = 2 / (2 × 10000) = 1/10000

If stride = 131 and the cache is direct-mapped (Pages 201, 211):
The cache has 256/16 = 16 blocks.
The block address of array[0] is 0/16 = 0, so array[0] maps to cache block 0 mod 16 = 0.
The block address of array[131] is floor(131 × 4 / 16) = 32, so array[131] maps to cache block 32 mod 16 = 0.
The two referenced elements conflict in the same block and evict each other on every access:
miss rate = (2 × 10000) / (2 × 10000) = 1

If stride = 132 and the cache is two-way set associative (Pages 224-227, 211):
The cache has 256/16 = 16 blocks, organized as 16/2 = 8 sets.
array[0] (block address 0) maps to set 0 mod 8 = 0.
array[132] (block address 33) maps to set 33 mod 8 = 1.
miss rate = 2 / (2 × 10000) = 1/10000

If stride = 131 and the cache is two-way set associative (Pages 224-227, 211):
The cache has 256/16 = 16 blocks, organized as 16/2 = 8 sets.
array[0] (block address 0) maps to set 0 mod 8 = 0.
array[131] (block address 32) maps to set 32 mod 8 = 0.
Both elements fall in set 0, but a two-way set holds two blocks, so they coexist and only their first accesses miss:
miss rate = 2 / (2 × 10000) = 1/10000

So only the stride = 131 case changes with two-way set associativity: its miss rate drops from 1 to 1/10000.
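All four miss rates can be reproduced with a tiny cache simulation. This is a minimal sketch, assuming Python; the cache geometry and access pattern follow the problem statement, and the LRU detail barely matters because at most two blocks ever compete.

def miss_rate(stride, ways, cache_bytes=256, block_bytes=16, elem_bytes=4):
    """Simulate the j-loop's array accesses and return the overall miss rate."""
    sets = cache_bytes // block_bytes // ways
    cache = [[] for _ in range(sets)]          # each set holds up to `ways` block tags
    accesses = misses = 0
    for _ in range(10000):                     # outer i-loop
        for j in range(0, 256, stride):        # inner j-loop touches array[j]
            block = (j * elem_bytes) // block_bytes
            s, tag = block % sets, block // sets
            accesses += 1
            if tag in cache[s]:
                cache[s].remove(tag)           # hit: refresh LRU position
            else:
                misses += 1
                if len(cache[s]) == ways:
                    cache[s].pop(0)            # evict least recently used
            cache[s].append(tag)
    return misses / accesses

for stride in (132, 131):
    print(stride, miss_rate(stride, ways=1), miss_rate(stride, ways=2))
    # 132: 0.0001 and 0.0001; 131: 1.0 and 0.0001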
4.2 Consider a virtual memory system with the following properties:
- 40-bit virtual byte address
- 16-KB pages
- 36-bit physical byte address
(1) What is the total size of the page table for each process on this machine, assuming that the valid, protection, dirty, and use bits take a total of 4 bits and that all the virtual pages are in use? (Assume that disk addresses are not stored in the page table.)
(2) Assume that the virtual memory system is implemented with a two-way set-associative TLB with a total of 256 TLB entries. Show the virtual-to-physical mapping with a figure. Make sure to label the width of all fields and signals.

Answer:
(1) A 16-KB page gives a 14-bit page offset, so there are 2^(40-14) = 2^26 virtual pages, and each page table entry holds a (36-14) = 22-bit physical page number plus the 4 status bits. The total size of the page table for each process is
2^(40-14) × (4 + (36-14)) bits = 2^26 × 26 bits = 208 MB
(2) [Figure not reproduced here.] The field widths in the mapping are: virtual page number = 40 - 14 = 26 bits and page offset = 14 bits; with 256 TLB entries arranged two-way set associative there are 128 sets, so the TLB index is 7 bits and the TLB tag is 26 - 7 = 19 bits; each TLB entry supplies a 22-bit physical page number, which is concatenated with the 14-bit offset to form the 36-bit physical address.
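The sizes and field widths follow directly from the bit arithmetic; a short sketch, assuming Python, that recomputes them:

import math

virt_bits, phys_bits, page_bytes = 40, 36, 16 * 1024
offset = int(math.log2(page_bytes))                  # 14-bit page offset
vpn, ppn = virt_bits - offset, phys_bits - offset    # 26-bit VPN, 22-bit PPN

pte_bits = ppn + 4                                   # physical page number + 4 status bits
table_bytes = (2 ** vpn) * pte_bits / 8
print(table_bytes / 2**20, "MB")                     # 208.0 MB per process

entries, ways = 256, 2
index = int(math.log2(entries // ways))              # 7-bit TLB index (128 sets)
print(vpn - index, "bit TLB tag")                    # 19-bit tag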