1、Click to edit Master text styles,Second level,Third level,Fourth level,Fifth level,*,Silberschatz,Korth and Sudarshan,17.,*,Click to edit Master title style,Database System Concepts,Click to edit Master text styles,Second level,Third level,Fourth level,Fifth level,Click to edit Master title style,*
2、第,9,章 恢复系统,本章内容参考,:,数据库概念,(,第四版,)by A.Silberschatz,Chapter 17,Recovery System,补充内容,本章内容特色,:,数据库恢复,的,“,道理,”,较容易理解,但,编程,实现,较难,算法较多,本章要解决的关键问题:,如何预防和应付各种,”,故障”,保证,ACID,性质,主要内容,故障分类,存储器结构,恢复与原子性,基于日志的恢复,Shadow Paging,并发事务的恢复,缓冲管理,非易失性存储器丢失信息的故障,Advanced Recovery Techniques,ARIES Recovery Algorithm,故障分类
3、事务故障,:,逻辑错误,:,因为某些内部错误条件导致事务不能完成,系统错误,:,因为某种错误条件,(,如死锁,),导致数据库系统终止一个活跃事务,系统崩溃,:,停电故障或者其他软硬件故障导致系统崩溃,故障,-,停止假设,:,假设非易失性存储器的内容不会因系统崩溃而破坏,数据库系统可通过许多完整性检查来防止磁盘数据被破坏,介质故障,:,磁头损坏或类似的磁盘故障可能破坏全部或部分磁盘存储器,假设损坏是可以检测到的,:,磁盘驱动器使用,校验和,来检测故障,恢复算法,恢复算法是指即使发生故障也能确保数据库一致性和事务原子性及持久性的技术,恢复算法两个部分,在正常事务处理过程中采取动作来确保有足够的信
4、息用于从故障恢复,(,即事前预防措施,保证有尽量多的信息在故障中保存下来,),在故障发生后采取行动作将数据库内容恢复到一个确保原子性,一致性和持久性的状态,(,即事后的恢复措施,保证原子性,一致性,耐久性,),存储器结构,易失性(,Volatile),存储器,不能在系统崩溃后保存下来,例如,:,主存,高速缓存,非易失性(,Nonvolatile),存储器,可以在系统崩溃后保存下来,例如,:,磁盘,磁带,闪存,非易失性,RAM(,电池供电,),稳定(,Stable),存储器,虚构的能够经受任何故障的存储器,可用多个非易失性介质存储相同的副本来近似,稳定存储器的实现,在不同磁盘上维持多个副本,副本
5、可以在远程站点上,以防止象火灾或洪水之类的灾难,在数据传输过程中的故障仍然可导致不一致的副本,块传输的结果可能是,成功完成,部分失败,:,目标块有错误信息,完全失败,:,目标块根本没有更新,在数据传输过程中防止存储介质出故障的解决方法,输出操作如下执行,(,假设每个块有两个副本,):,将信息写到第一个物理块,当第一个写操作成功完成,再将相同信息写到第二个物理块,仅当第二个写操作成功完成之后输出才算完成,数据存取,事务使用下列操作来在系统缓冲块和它的私有工作区之间传送数据项,:,read,(,X,),将数据项,X,的值赋给局部变量,x,i,.,write,(,X,),将局部变量,x,i,的值赋给
6、缓冲块中的数据项,X,这两条命令在赋值前都可能要求发出,input(B,X,),指令,如果,X,所在的块,B,X,当前不在主存中的话,数据存取,事务,第一次访问,X,时执行,read(,X,),以后的访问是对局部副本进行的,最后一次访问之后,执行,write(,X,),output(,B,X,),不必紧随,write(,X,),立即执行,系统可在它认为适当的时候执行,output,操作,数据存取例,x,Y,A,B,x,1,y,1,缓冲区,缓冲块,A,缓冲块,B,input(A),output(B),read(X),write(Y),磁盘,T,1,的工作区,T,2,的工作区,内存,x,2,恢复与
7、原子性,问题,一个较长事务,进行,若干查插删改,操作,设计时,假定,整个事务作为原子是保持一致性,但,执行可能,半途而废(意外原因,停电等),如何恢复?,恢复与原子性,为了在故障的情况下仍确保原子性,应首先输出描述修改操作的信息到稳定存储中,(,先记账,),,描述将作的操作,然后动作,(,恢复有据),下面介绍两种方法,:,基于日志的恢复,影子分页,shadow-paging,先假设事务是串行执行的,基于日志的恢复,日志,(log),保存在稳定存储器上,日志是,日志记录,的序列,用来记录对数据库的更新活动,当事务,T,i,启动,写入一条登记自己的日志记录,T,i,执行,write(,X,),之前
8、写入日志记录,各分量依次,是,:,事务,变量,前值,后值,当,T,i,完成了最后一条语句,写入日志记录,基于日志的恢复,目前假设日志记录直接写入稳定存储器,(,即不经过缓冲,),两种使用日志的方法,延迟更新数据库,(,先记账,后补工,保守,稳妥,),立即更新数据库,(,边记,边做,积极,冒险,),延迟更新数据库,延迟更新数据库,方案在日志中记录所有更新,但推迟,write,直至部分提交之后,假设事务串行执行,事务启动时向日志写入,记录,write(,X,),操作导致写入日志记录,其中,V,是,X,的新值,注,:,这种方案不需要保留旧值,这时并未执行对,X,的写操作,而是延迟了,当,T,i,部
9、分提交时,向日志写入,最后,读取日志记录并用来实际执行前面延迟的写操作,(,把被延迟的写动作(欠工账)补作完成,),生活中,一个“拖”字或“研究研究”就是这种延迟策略,延迟更新数据库,故障后恢复过程中,,事务需要重做,(redo),,当且仅当日志中存在,和,(,这样恢复有据,),重做事务,T,i,(redo,T,i,),将事务更新过的所有数据项的值置为新值,(,即有底子,有工账,在底子上,重来一次,),但,正,在恢复时出错有怎么办,?,例如事务,T,0,和,T,1,(,T,0,在,T,1,之前执行,):,T,0,:,read,(,A,),T,1,:,read,(,C,),A:=A-50,C:=
10、C-100,Write,(,A,),write,(,C,),read,(,B,),B:=B+50,write,(,B,),延迟更新数据库,下面给出了日志在三个时刻的情形,如果故障发生时稳定存储器上的日志处于情形,:,(a),不需重做,(,因为工账并未实施,),(b),必须执行,redo(,T,0,),因为存在,(c),必须先执行,redo(,T,0,),再执行,redo(,T,1,),因为存在,和,立即更新数据库,立即更新数据库,方案允许一个未提交事务在发出写操作时更新数据库,由于可能需要取消操作,日志记录必须包含旧值和新值,对更新的日志记录必须在数据库数据修改之前写入稳定存储器,假设日志记录
11、直接输出到稳定存储器,可以扩展成推迟日志记录的输出,只要满足:在执行数据块,B,的,output(,B,),操作之前,所有对应于,B,的日志记录都已写入稳定存储器,被更新块的输出可能发生在事务提交之前或之后的任何时刻,块输出的次序可以与修改块的次序不同,立即更新数据库,恢复过程需要两种操作,:,undo(,T,i,),复原所有被,T,i,更新了的数据项的,旧值,从,T,i,的最后一条日志记录向前进行(后退),redo(,T,i,),将所有被,T,i,更新了的数据项的值置为,新值,从,T,i,的第一条日志记录向后进行(朝前),两种操作都必须是,幂等的,亦即,即使该操作被执行多次,效果与只执行一次
12、是一样的,由于在恢复过程中可能重复执行这两个操作,所以需要此性质,立即更新数据库,故障后恢复,如果日志中包含,但不包含,则事务,T,i,需要取消,(undo)(,即日志中有头无尾时要复旧(急性子将有的麻烦),),如果日志中包含,和,则事务,T,i,需要重做,(redo)(,即日志中有头有尾时,需要重作,),取消,(undo),操作先做,重做,(redo),操作后做,立即更新数据库的例子,检查点,前述恢复过程的问题,搜索整个日志很费时,可能不必要地重做事务(已经将更新输出到数据库的事务),通过周期性地执行,检查点,来精简恢复过程,将当前驻留在主存中的所有日志记录输出到稳定存储器,将所有更新过的缓
13、冲块输出到磁盘,将一条,日志记录写入稳定存储器,检查点,恢复时仅需考虑在检查点之前启动的最近的事务,T,i,以及在,T,i,之后启动的事务,从日志末尾反向扫描,找到最近的,记录,继续反向扫描直到找到一个,记录,仅需考虑日志中在上面这个启动记录之后的部分,之前的部分在恢复时可以忽略,如果愿意可以随时删除,对所有没有,的事务,(,T,i,或更晚,),执行,undo,(T,i,),(,仅当立即更新数据库时才如此,),正向扫描日志,对所有有,的事务,(,T,i,或更晚,),执行,redo,(T,i,),检查点的例子,T,1,可以忽略,(,由于检查点,其更新已经输出到磁盘,),T,2,和,T,3,重做,
14、T,4,取消,T,c,T,f,T,1,T,2,T,3,T,4,检查点,系统故障,影子页面(,Shadow Paging,),Idea,保持真身和影子,影子在非易逝内存,事务前留影,事务中影子不被修改,事务成功,则影子与时俱进,事务失败,则恢复到影子,技巧,影子不是数据页面的副本,只要页面指针表的副本,Sample Page Table,Example of Shadow Paging,Shadow and current page tables after write to page 4,Shadow Paging,To commit a transaction:,事务提交时,1.,Flush
15、变动的值块到磁盘,2.,Output current page table to disk,当前表写盘,3.Make the current page table the new shadow page table,as follows:,keep a pointer to the shadow page table at a fixed(known)location on disk,to make the current page table the new shadow page table,simply update the pointer to point to current pa
16、ge table on disk,Shadow Paging,Once pointer to shadow page table has been written,transaction is committed,No recovery is needed after a crash new transactions can start right away,using the shadow page table,出错后复旧,pCurrPag=pShadow,Pages not pointed to from current/shadow page table should be freed(
17、garbage collected),释放不用的指针,Show Paging(11/4),Advantages,优点,no overhead of writing log records(,快,),recovery is trivial(,易恢复,),Show Paging,Disadvantages:,缺点,Copying the entire page table is very expensive,复制原指针表费时,Can be reduced by using a page table structured like a B,+,-tree,No need to copy entire
18、 tree,only need to copy paths in the tree that lead to updated leaf nodes,可用,B,树解决,只复制路径,Commit overhead is high even with above extension,提交花销大,Need to flush every updated page,and page table,Data gets fragmented related pages get separated on disk,磁盘上数据不连续,After every transaction completion,the da
19、tabase pages containing old versions of modified data need to be garbage collected,需要回收内存空间,Hard to extend algorithm to allow transactions to run concurrently,并发程序难写,并发事务的恢复,对,基于日志的,恢复方案,扩展,,以允许多个事务并发执行,简化模型的,3,个假定:,所有事务共享单个磁盘缓冲区和单个日志,缓冲块可以具有被一个或多个事务更新的数据项,假设使用严格两阶段锁的并发控制,即未提交事务所做的更新对其他事务是不可见的,否则如果,
20、T1,更新,A,然后,T2,又更新,A,并提交,最后,T1,必须夭折时,该怎样执行,undo?,日志登记与前面描述的相同,不同事务的日志记录在日志中可能,交错分布,检查点技术及恢复时的操作必须改变,并发,多检查点,导致,复杂,并发事务的恢复,检查点的执行基本同前,除了检查点日志记录现在形如,其中,L,是检查点时,活跃事务,列表,假设当执行检查点时没有正在进行的更新操作,(,以后对此可放宽,),当系统从崩溃恢复时,首先做下列动作,:,初始化,undo-list,和,redo-list,为空,从末尾开始反向扫描日志,直至发现第一条,记录时停止,.,对在反向扫描过程中发现的每条记录,:,如果是,将,
21、T,i,加入,redo-list,如果是,并且,T,i,不在,redo-list,中,则将,T,i,加入,undo-list,对,L,中的每个,T,i,如果,T,i,不在,redo-list,中,则将,T,i,加入,undo-list,并发事务的恢复,至此,undo-list,包含必须取消操作的未完成事务,而,redo-list,包含必须重做的已完成事务,现在恢复可以如下继续,:,从最近的记录开始反向扫描日志,对,undo-list,中的每个,T,i,当遇到,记录时停止,扫描过程中,对属于,undo-list,中事务的每条日志记录执行,undo,定位于最近的,记录,从,记录开始正向扫描日志,直
22、至日志末尾,扫描过程中,对属于,redo-list,中事务的每条日志记录执行,redo,恢复例,对下列日志按照恢复算法的步骤走一遍,:,日志记录缓冲,模型约定,日志记录缓冲,:,在主存中缓冲日志记录,而不是直接输出到稳定存储器,当缓冲区中的日志记录块满时,或者当执行日志强制(,log force),操作时,将日志记录输出到稳定存储器,日志强制用来提交事务,方法是将它的日志记录(包括提交记录)强行输出到稳定存储器,由于一次输出操作可以输出多条日志记录,从而减少,I/O,代价,日志记录缓冲,日志缓存的规则,日志记录按其创建次序输出到稳定存储器,仅当,日志记录已经输出到稳定存储器,事务,T,i,才进
23、入提交状态,在主存中的数据块输出到数据库之前,所有属于该块中数据的日志记录必须已经输出到稳定存储器,这条规则称为,先写日志(,write-ahead logging),或,WAL,规则,严格地说,,WAL,只需要输出,undo,信息,数据库缓冲,数据库维护一个数据块在内存中的缓冲区,当需要新块时,如果缓冲区已满,则需要将一个现有块移出缓冲区,如果所选的移出块被更新过,(“,脏块,”),则必须将它输出到磁盘,由于先写日志规则,如果要将一个带有未提交更新的块输出到磁盘,必须先将记有那些更新的取消,(undo),信息的日志记录输出到稳定存储器上的日志中,数据库缓冲,当一个块输出到磁盘时,其上不能有正
24、在进行的更新,可以用下列方法确保这一点,写数据项之前,事务在包含数据项的块上获得,排它锁,一旦完成写,锁即可释放,这种持有时间很短的锁称为,latch,(,插销,),在一个块输出到磁盘之前,系统在该块上获得排他,插销,确保该块上没有更新正在进行,数据库缓冲,数据库缓冲区的实现可以是,在实际,主存,中为数据库保留的一个区域,在操作系统,虚拟内存,中,在保留的主存中实现缓冲区有下列缺点,:,预先划分内存给数据库缓冲区,限制了灵活性,需求可能改变,尽管操作系统最清楚在什么时刻应该如何分配内存,它却不能改变内存分配,数据库缓冲,数据库缓冲区可以在操作系统的虚拟内存中实现,尽管有下列缺点,:,当操作系统
25、需要移出一个被更新过的页以便为另一页腾出空间时,该页被写到磁盘上的交换区(,swap space),当数据库决定将缓冲页写到磁盘时,缓冲页可能在交换区中,于是可能必须从磁盘上的交换区读入该页再输出到磁盘上的数据库中,导致一次多余的,I/O!,称为,dual paging,问题,当换出数据库缓冲页时,理想的情况是操作系统将控制交给数据库,由数据库输出该页到数据库中而不是交换区中,(,首先确保输出日志记录,),因此,Dual paging,被避免,但通常的操作系统不支持此功能,缺点:间接管理,有,extra I/O!,非易失性存储器丢失信息的故障,到目前为止我们都假定非易失性存储器不会丢失信息,使
26、用类似检查点的技术来处理非易失性存储器的丢失信息,周期性地,转储(,dump),数据库的全部内容到稳定存储器,转储过程中不能有活跃事务:必须有一个类似于检查点的过程,输出当前在主存中的所有日志记录到稳定存储器,输出所有缓冲块到磁盘,复制数据库内容到稳定存储器,输出,记录到稳定存储器的日志中,从磁盘故障恢复时,从最近的转储恢复数据库,查看日志并重做所有在转储后提交的事务,可以扩展以允许转储过程中有活跃事务,:,称为模糊转储(,fuzzy dump,),或联机转储(,online dump,),Advanced Recovery Algorithm,Advanced Recovery Techni
27、ques,Support high-concurrency locking techniques,such as those used for B+-tree concurrency control,支持并行,B-,树,Operations like B+-tree insertions and deletions release locks early,They cannot be undone by restoring old values(,physical undo,),since once a lock is released,other transactions may have
28、updated the B,+,-tree.,简单的物理恢复不能复旧,因为其他事务可能干扰,Instead,insertions(resp.deletions)are undone by executing a deletion(resp.insertion)operation(known as,logical undo,),因此,采用,逻辑复旧,Advanced Recovery Techniques,For such operations,undo log records should contain the undo operation to be executed,called,log
29、ical undo logging(,逻辑撤消日志,),in contrast to,physical undo logging,(,物理,撤消日志,),Advanced Recovery Techniques,Redo information is logged,physically,(that is,new value for each write)even for such operations,Logical redo is very complicated,,,since database state on disk may not be“operation consistent”,
30、由于磁盘上的数据库状态可能不是,”,操作一致的,”,,故逻辑重做比物理重做复杂的多,Advanced Recovery Techniques,Operation logging is done as follows:,When operation starts,log,,,Here,O,j,is a unique identifier of the operation instance,While operation is executing,normal log records with,physical redo,and,physical undo,information are logg
31、ed,When operation completes,is logged,where,U,contains information needed to perform a,logical undo,U,中包含完成逻辑撤消所需要的信息,Advanced Recovery Techniques,If crash/rollback occurs after the operation completes:,the operation-end log record is found,and in this case,logical undo,is performed using undo infor
32、mation,U,physical undo,information for the operation is ignored(,忽略,),Redo of operation(after crash)still uses physical redo information,Advanced Recovery Techniques,Rollback of transaction,T,i,is done as follows:,Scan the log,backwards,If a log record is found,perform the undo and log a special,red
33、o-only log record,.,If a record is found,Rollback the operation,logically,using the undo information,U,Updates performed during rollback are logged just like during normal operation execution,At the end of the operation rollback,instead of logging an operation-end record,generate a record,Skip all p
34、receding log records for,T,i,until the record is found,Advanced Recovery Techniques,If a,redo-only record,is found ignore it,If a record is found:,skip all preceding log records for,T,i,until the record is found,Stop the scan when the record is found,Add a record to the log,Advanced Recovery Techniq
35、ues,Some points to note:,Cases 3 and 4 above can occur only if the database crashes while a transaction is being rolled back,当事务回滚期间发生系统崩溃导致,Cases 3 and 4,出现,Skipping of log records as in case 4 is important to prevent multiple rollback of the same operation,避免同一个操作被多次撤消,Advanced Recovery Techniques
36、The following actions are taken when recovering from system crash,Scan log forward from last record,Repeat history,by physically redoing all updates of all transactions,Create an undo-list during the scan as follows,undo-list,is set to,L,initially,Whenever is found,T,i,is added to,undo-list,Wheneve
37、r or is found,T,i,is deleted from,undo-list,This brings database to state as of crash,with committed as well as uncommitted transactions having been redone,Now,undo-list,contains transactions that are,incomplete,that is,have neither committed nor been fully rolled back,Advanced Recovery Techniques,S
38、can log,backwards,performing undo on log records of transactions found in,undo-list,Transactions are rolled back as described earlier,When is found for a transaction,T,i,in,undo-list,write a log record,Stop scan when records have been found for all,T,i,in,undo-list,This undoes the effects of incompl
39、ete transactions(those with neither commit nor abort log records),Recovery is now complete,Advanced Recovery Techniques,Checkpointing is done as follows:,Output all log records in memory to stable storage,Output all modified buffer blocks to stable storage,Output a record to log on stable storage,Tr
40、ansactions are not allowed to perform any actions while checkpointing is in progress,Fuzzy checkpointing allows transactions to progress while the most time consuming parts of checkpointing are in progress,Performed as described on next slide,Advanced Recovery Techniques,Fuzzy checkpointing,is done
41、as follows:,Temporarily stop all updates by transactions,Write a log record and force log to stable storage,Note list,M,of modified buffer blocks,Now permit transactions to proceed with their actions,Output all modified buffer blocks in list,M,to disk,blocks should not be updated while being output,
42、Follow WAL:all log records pertaining to a block must be output before the block is output,Store a pointer to the checkpoint record in a fixed position last_checkpoint on disk,Advanced Recovery Techniques,When recovering using a fuzzy checkpoint,start scan from the checkpoint record pointed to by la
43、st_checkpoint,Log records before last_checkpoint have their updates reflected in database on disk,and need not be redone,Incomplete checkpoints,where system had crashed while performing checkpoint,are handled safely,ARIES Recovery Algorithm,Introduction,Name,Algorithm for Recovery and Isolation Expl
44、oiting Semantics,(,ARIES),参考文献,:,ARIES:A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging,C.MOHAN IBM Almaden Research Center,DON HADERLE IBM Santa Teresa Laboratory,BRUCE LINDSAY,HAMID PIRAHESH and PETER SCHWARZ,IBM Almaden Research Cen
45、ter,Introduction,Introduction,ARIES is a state of the art(,最新成果,)recovery method,Incorporates numerous optimizations to reduce overheads during normal processing and to speed up recovery,核心思想,:,融合多种优化技术,减少开销,即尽量不作多余的事,The“advanced recovery algorithm”we studied earlier is modeled after ARIES,but grea
46、tly simplified by removing optimizations,Introduction,Supports,Page-level,logical,logging,Updates in place,就地更新,(as opposed to multi-versioning),Record-level locking,行级锁,Partial and total rollbacks,部分及完全回滚,Media recovery,介质故障恢复,Nested transactions,嵌套事务,Introduction,Advantages:,优点,(1)Allows fine-grai
47、n concurrency,(2)Uses only one type of log to record page-level actions,(3)Compared to physical logging,produces small log files,(4)Allows flexible buffer management(no-force,steal),(5)Fast checkpointing with minimal delay of transactions,(6)Allows relatively fast restart,(7)No locking or deadlocks
48、during restart or rollbacks,(8)Avoids undoing the same log record multiple times,(9)Incurs,招致,only a small overhead per page(page LSN),(10)Allows flexible structures with good storage reclamation(,回收,),Introduction,Disadvantage:,缺点,Very tricky implementation,实现非常有技巧,比较难理解,ARIES:Assumptions,Concurren
49、cy control is in effect,Strict 2PL,in particular,Updates are happening“in place”(,就地更新,),i.e.data is overwritten on the disk,Handling the Buffer Pool,Buffer pool is finite,so,Issue,:How do we guarantee durability of committed data?,Solution,:Policy on what happens when a transaction completes,what t
50、ransactions can do to get more pages,Handling the Buffer Pool,Force every write to disk?,优点:,Provides durability(,保证持久性,),缺点:,But Poor response time(,响应时间差,),Steal(,偷窃,)buffer-pool frames from uncommited Xacts?,If not,poor throughput(,如果不允许,则吞吐率差,),If so,how can we ensure atomicity?(,如果允许,如何保证原子性,),






