1、单击此处编辑母版标题样式,单击此处编辑母版文本样式,第二级,第三级,第四级,第五级,2020/2/19,史忠植 关联规则,#,2024/12/6 周五,史忠植 关联规则,1,内容提要,引言,Apriori,算法,FP-growth,算法,并行关联规则挖掘,多维关联规则挖掘,相关规则,关联规则改进,2024/12/6 周五,史忠植 关联规则,2,关联规则,关联规则反映一个事物与其他事物之间的相互依存性和关联性。如果两个或者多个事物之间存在一定的关联关系,那么,其中一个事物就能够通过其他事物预测到。,关联规则表示了项之间的关系。,示例,:,cereal,milk,fruit,“,买谷类食品和牛奶的
2、人也会买水果,.”,商店可以把牛奶和谷类食品作特价品以使人们买更多的水果,.,2024/12/6 周五,史忠植 关联规则,3,市场购物篮分析,分析事务数据库表,我们是否可假定,?,Chips=Salsa Lettuce=Spinach,Person,Basket,A,Chips,Salsa,Cookies,Crackers,Coke,Beer,B,Lettuce,Spinach,Oranges,Celery,Apples,Grapes,C,Chips,Salsa,Frozen Pizza,Frozen Cake,D,Lettuce,Spinach,Milk,Butter,2024/12/6 周
3、五,史忠植 关联规则,4,基本概念,通常,数据包含,:,TID,Basket,事务,ID,项的子集,2024/12/6 周五,史忠植 关联规则,5,关联规则挖掘,在事务数据库,关系数据库和其它信息库中的项或对象的集合之间,发现频繁模式,关联,相关,或因果关系的结构,.,频繁模式,:,数据库中出现频繁的模式,(,项集,序列,等等,),2024/12/6 周五,史忠植 关联规则,6,基本概念,项集,事务,关联规则,-,事务数据集,(,例如右图,),事务标识,TID,每一个事务关联着一个标识,称作,TID.,Transaction-id,Items bought,10,A,B,C,20,A,C,30
4、,A,D,40,B,E,F,2024/12/6 周五,史忠植 关联规则,7,关联规则的度量,支持度,s,D,中包含,A,和,B,的事务数与总的事务数的比值,规则,A,B,在数据集,D,中的支持度为,s,其中,s,表示,D,中包含,A,B,(,即同时包含,A,和,B),的事务的百分率,.,2024/12/6 周五,史忠植 关联规则,8,关联规则的度量,可信度,c,D,中同时包含,A,和,B,的事务数与只包含,A,的事务数的比值,规则,A,B,在数据集,D,中的,可信,度为,c,其中,c,表示,D,中包含,A,的事务中也包含,B,的百分率,.,即可用条件概率,P(B|A),表示,.,confide
5、nce(,A,B)=P(B|A),条件概率,P(B|A),表示,A,发生的条件下,B,也发生的概率,.,2024/12/6 周五,史忠植 关联规则,9,关联规则的度量,关联规则根据以下两个标准,(,包含或排除,):,最小支持度,表示规则中的所有项在事务中出现的频度,最小可信度,-,表示规则中左边的项,(,集,),的出现暗示着右边的项,(,集,),出现的频度,2024/12/6 周五,史忠植 关联规则,10,市场购物篮分析,I,是什么,?,事务,ID B,的,T,是什么,?,s(Chips=Salsa),是什么,?,c(Chips=Salsa),是什么,?,事务,ID,购物篮,A,Chips,S
6、alsa,Cookies,Crackers,Coke,Beer,B,Lettuce,Spinach,Oranges,Celery,Apples,Grapes,C,Chips,Salsa,Frozen Pizza,Frozen Cake,D,Lettuce,Spinach,Milk,Butter,Chips,2024/12/6 周五,史忠植 关联规则,11,频繁项集,项集,任意项的集合,k-,项集,包含,k,个项的项集,频繁,(,或大,),项集,满足最小支持度的项集,若,I,包含,m,个项,那么可以产生多少个项集,?,2024/12/6 周五,史忠植 关联规则,12,强关联规则,给定一个项集,容
7、易生成关联规则,.,项集,:Chips,Salsa,Beer,Beer,Chips=Salsa,Beer,Salsa=Chips,Chips,Salsa=Beer,强规则是有趣的,强规则通常定义为那些满足最小支持度和最小可信度的规则,.,2024/12/6 周五,史忠植 关联规则,13,关联规则挖掘,两个基本步骤,找出所有的频繁项集,满足最小支持度,找出所有的强关联规则,由频繁项集生成关联规则,保留满足最小可信度的规则,2024/12/6 周五,史忠植 关联规则,14,内容提要,引言,Apriori,算法,FP-growth,算法,并行关联规则挖掘,多维关联规则挖掘,相关规则,关联规则改进,总
8、结,2024/12/6 周五,史忠植 关联规则,15,Apriori,算法,IBM,公司,Almaden,研究中心的,R.Agrawal,等人在1993年提出的,AIS,和,SETM,。,在1994年提出,Apriori,和,AprioriTid,。,Apriori,和,AprioriTid,算法利用前次过程中的数据项目集来生成新的候选数据项目集,减少了中间不必要的数据项目集的生成,提高了效率,2024/12/6 周五,史忠植 关联规则,16,生成频繁项集,Na,ve algorithm,n-|D|,for,each subset s of I,do,l-0,for,each transact
9、ion T in D,do,if,s is a subset of T,then,l-l+1,if,minimum support=l/n,then,add s to frequent subsets,2024/12/6 周五,史忠植 关联规则,17,生成频繁项集,na,ve algorithm,的分析,I,的子集,:O(2,m,),为每一个子集扫描,n,个事务,测试,s,为,T,的子集,:O(2,m,n),随着项的个数呈指数级增长,!,我们能否做的更好,?,2024/12/6 周五,史忠植 关联规则,18,Apriori,性质,定理,(Apriori,性质,):,若,A,是一个频繁项集,则,
10、A,的每一个子集都是一个频繁项集,.,证明,:,设,n,为事务数,.,假设,A,是,l,个事务的子集,若,A,A,则,A,为,l,(l,l),个事务的子集,.,因此,l/n s(,最小支持度,),l,/n s,也成立,.,2024/12/6 周五,史忠植 关联规则,19,Apriori,算法,Apriori,算法是一种经典的生成布尔型关联规则的频繁项集挖掘算法,.,算法名字是缘于算法使用了频繁项集的性质这一先验知识,.,思想,:Apriori,使用了一种称作,level-wise,搜索的迭代方法,其中,k-,项集被用作寻找,(k+1)-,项集,.,首先,找出频繁,1-,项集,以,L1,表示,.
11、L1,用来寻找,L2,即频繁,2-,项集的集合,.L2,用来寻找,L3,以此类推,直至没有新的频繁,k-,项集被发现,.,每个,Lk,都要求对数据库作一次完全扫描,.,2024/12/6 周五,史忠植 关联规则,20,生成频繁项集,中心思想,:,由频繁,(k-1)-,项集构建候选,k-,项集,方法,找到所有的频繁,1-,项集,扩展频繁,(k-1)-,项集得到候选,k-,项集,剪除不满足最小支持度的候选项集,2024/12/6 周五,史忠植 关联规则,21,Apriori:,一种候选项集生成,-,测试方法,Apriori,剪枝原理,:,若任一项集是不频繁的,则其超集不应该被生成,/,测试,!,方
12、法,:,由频繁,k-,项集生成候选,(k+1)-,项集,并且,在,DB,中测试候选项集,性能研究显示了,Apriori,算法是有效的和可伸缩,(scalablility),的,.,2024/12/6 周五,史忠植 关联规则,22,The Apriori,算法,一个示例,Database TDB,1,st,scan,C,1,L,1,L,2,C,2,C,2,2,nd,scan,C,3,L,3,3,rd,scan,Tid,Items,10,A,C,D,20,B,C,E,30,A,B,C,E,40,B,E,Itemset,sup,A,2,B,3,C,3,D,1,E,3,Itemset,sup,A,2,
13、B,3,C,3,E,3,Itemset,A,B,A,C,A,E,B,C,B,E,C,E,Itemset,sup,A,B,1,A,C,2,A,E,1,B,C,2,B,E,3,C,E,2,Itemset,sup,A,C,2,B,C,2,B,E,3,C,E,2,Itemset,B,C,E,Itemset,sup,B,C,E,2,2024/12/6 周五,史忠植 关联规则,23,Apriori,算法,Algorithm:Apriori,输入,:Database,D,of transactions;minimum support threshold,min_sup.,输出,:L,freuqent ite
14、msets in D.,过程,:,C,k,:Candidate itemset of size k,L,k,:frequent itemset of size k,L,1,=find_frequent_1_itemsets(D);,for,(,k,=2;,L,k+1,!=,;,k,+),do begin,C,k,=,apriori_gen,(,L,k-1,min_sup);,for each,transaction,t,in database D do/scan D for counts,C,t,=subset(,C,k,t);/get the subsets of t that are ca
15、ndidates For each candidate c,C,t,c.count+;,L,k,=candidates in,C,k,with min_support,end,return,L=,k,L,k,;,2024/12/6 周五,史忠植 关联规则,24,Apriori,算法,Procedure apriori_gen(L,k-1,:frequent(k-1)-itemsets;min_sup:minimum support threshold,),for,each itemset l,1,L,k-1,for each,itemset,l,2,L,k-1,if,(l,1,1=l,2,1)
16、,(l,1,2=l,2,2),(l,1,k-1=l,2,k-1),Then,c=,join(l,1,l,2,),/join step:generate candidates,if,has_infrequent_subset(c,L,k-1,),then,delete c;/prune step:remove unfruitful candidate,else add c to,C,k,return,C,k,2024/12/6 周五,史忠植 关联规则,25,Apriori,算法,Join is generate candidates set of itemsets C,k,from 2 item
17、sets in L,k-1,Procedure,join(p,q),insert into,C,k,select p.item,1,p.item,2,.,p.item,k-1,q.item,k-1,from,L,k-1,p,L,k-1,q,where p.item,1,=q.item,1,.,p.item,k-2,=q.item,k-2,p.item,k-1,=q.item,k-1,2024/12/6 周五,史忠植 关联规则,26,Apriori,算法,Procedure,has_infrequent_subset(c:candidate k-itemset;L,k-1,:frequent(k
18、-1)-itemsets,;,)/use prior knowledge,for,each(k-1)-subset s of c,if s,L,k-1,then,return,TRUE;,return,FALSE.,2024/12/6 周五,史忠植 关联规则,27,Apriori,算法,如何生成候选项集,?,步骤,1:,自连接,L,k,步骤,2:,剪枝,如何计算候选项集的支持度,?,候选项庥生成的示例,L,3,=,abc,abd,acd,ace,bcd,自连接,:,L,3,*L,3,由,abc,和,abd,连接得到,abcd,由,acd,和,ace,连接得到,acde,剪枝,:,因为,ade,
19、不在,L,3,中,acde,被剪除,C,4,=,abcd,2024/12/6 周五,史忠植 关联规则,28,如何生成候选项集,?,假定,L,k-1,中的项以一定顺序排列,步骤,1:,自连接,L,k-1,insert into,C,k,select,p.item,1,p.item,2,p.item,k-1,q.item,k-1,from,L,k-1,p,L,k-1,q,where,p.item,1,=q.item,1,p.item,k-2,=q.item,k-2,p.item,k-1,1,:,P,i,generates the complete C,k,using the complete L,
20、k-1,created at the end of pass(k-1).,Each processor has the identical L,k-1,thus generates identical C,k,and puts its count values in a common order into a count array,P,i,makes a pass over data partition D,i,and develop local support counts for candidates in C,k,P,i,exchanges local C,k,counts with
21、all other processors to develop global C,k,counts.All processors must synchronize.,P,i,computes L,k,from C,k,P,i,independently decide to terminate or continue to the next pass,计数分布,CD,2024/12/6 周五,史忠植 关联规则,67,计数分布,CD,2024/12/6 周五,史忠植 关联规则,68,Disadvantages,:,CD does not exploit the aggregate memory o
22、f the system,Must synchronize and develop global count at the end of each pass,M over C,k,N*|M|,N,M over C,k,|M|,1,Usage of memory per processor,Total amount of memory,Number of processor,计数分布,CD,2024/12/6 周五,史忠植 关联规则,69,Objective:utilize aggregate main memory of the system effectively,Technique:,Pa
23、rtitions the candidates into disjoint sets,which are assigned to different processors.Each processor works with the entire dataset but only portion of the candidate set.,Each processor counts mutually exclusive candidates.On a N-processor configuration,DD can count in a single pass candidate set tha
24、t would require,N,pass in CD,数据分布,DD,2024/12/6 周五,史忠植 关联规则,70,Basic Idea,C,k,C,k,1,C,k,2,L,k,1,L,k,2,L,k,Processor 1,Processor 2,Example:2 processors,Data Distribution only processes a subset of C,k,to utilize the aggregate memory,Exchange data to develop global counts for,C,k,i,data,数据分布,DD,2024/12
25、/6 周五,史忠植 关联规则,71,Algorithm:,Pass 1:Same as the CD algorithm,Pass,k1,:,P,i,generates C,k,from L,k-1,.It retains only 1/,N,of the itemsets forming C,i,k,P,i,develops support counts for itemsets in C,i,k,for ALL transactions(using local data pages and data pages received from other processors),At the
26、end of the data pass,P,i,calculates L,i,k,using local C,i,k,Processors exchange L,i,k,so that every processor has the complete L,k,for generating C,k+1,for the next pass(requires processors to synchronize),P,i,can independently decide whether to terminate or continue on to the next pass,数据分布,DD,2024
27、/12/6 周五,史忠植 关联规则,72,L,i,k,L,i,k,L,i,k,L,k,数据分布,DD,2024/12/6 周五,史忠植 关联规则,73,Disadvantages:,heavy communication,Each processor must broadcast their,local data,and,frequent itemsets,to all other processors and,synchronize,in every pass.,数据分布,DD,2024/12/6 周五,史忠植 关联规则,74,Problem:,CD and DD require proce
28、ssors to synchronize at the end of each pass,Basic Idea:Remove dependence among processors,Data dependence,Complete transactions are required to compute support count(in CD,),Frequent itemsets dependency,A global itemset,L,k,is needed during the pruning step of Apiori candidate generation algorithm(
29、in DD),候选分布,CaD,2024/12/6 周五,史忠植 关联规则,75,Remove Data Dependency,Each processor,P,i,works on,C,k,i,a disjoint subset of,C,k,P,i,derives global support counts for,C,k,i,from local data.,Replicate data amongst processors in order to achieve the above,Reduce Frequent itemset dependency,Does not wait for
30、 the complete pruning information to arrive from other processors.,Prune the candidate set as much as possible,Late arriving pruning information is used in subsequent passes.,候选分布,CaD,2024/12/6 周五,史忠植 关联规则,76,Algorithm:,Pass,kl,:,P,i,collects all frequent itemsets sent by other processors,P,i,genera
31、tes C,i,k,using local L,i,k-,take care of pruning(L,j,l-1,),P,i,passes over D R,i,and counts C,i,k,P,i,computes L,i,k,from C,i,k,and asynchronously send it to the other N-1 processo,rs,候选分布,CaD,2024/12/6 周五,史忠植 关联规则,77,Count Distribution,attempts to minimize communication by replicating the candidat
32、e sets in each processor,s memory,Data Distribution,maximizes the use of aggregate memory by allowing each processor works with the entire dataset but only portion of the candidate set,Candidate Distribution,eliminates the synchronization costs at the end of every pass,maximizes the use of aggregate
33、 memory while limiting heavy communication to a single redistribution pass,并行关联算法比较,2024/12/6 周五,史忠植 关联规则,78,Exchange data to repartition database(once only).,Exchange,L,k,i,(require no synchronization).,Partition,L,k,Generate partial,C,k,i,Generate partial,L,k,i,Candidate Distribution,Exchange data
34、 to compute support counts for,C,k,i,Exchange,L,k,i,Synchronize at each pass,Generate complete,C,k,Partition,C,k,Generate partial,L,k,i,Data Distribution,Exchange support counts for,C,k,i,Synchronize at each pass,Generate complete,C,k,Generate complete,L,k,Count Distribution,Communication,Non-Commun
35、ication,并行关联算法比较,2024/12/6 周五,史忠植 关联规则,79,Advantages,Disadvantages,Count Distribution(CD),Low communication cost,Only exchange counts,not data,Doesnt exploit aggregate memory,Must synchronize at the end of every pass,When the candidate itemsets doesnt fit the memory of each node?,Data Distribution(D
36、D),Better utilization of aggregate system memory,Need to broadcast its local data to every other processor at every pass,Need to synchronize,Candidate Distribution,(CaD),Processors can proceed independently without synchronizing at the end of every pass,Communication of entire dataset needed for a s
37、ingle redistribution pass.,The communication cost is higher than the synchronization cost savings,并行关联算法比较,2024/12/6 周五,史忠植 关联规则,80,2024/12/6 周五,史忠植 关联规则,81,关联规则可视化,:,方格图,(Pane Graph),2024/12/6 周五,史忠植 关联规则,82,关联规则可视化,:,规则图,(Rule Graph),2024/12/6 周五,史忠植 关联规则,83,内容提要,引言,Apriori,算法,FP-growth,算法,并行关联规则挖
38、掘,多维关联规则挖掘,相关规则,关联规则改进,总结,2024/12/6 周五,史忠植 关联规则,84,挖掘多种规则或规律,多层,(Multi-level),量化,(quantitative),关联规则,相关,(correlation),和因果,(causality),比率,(ratio),规则,序列,(sequential),模式,浮现,(emerging),模式,temporal associations,局部周期,(partial periodicity),分类,(classification),聚类,(clustering),冰山立方体,(iceberg cubes),等等,.,2024
39、/12/6 周五,史忠植 关联规则,85,多层关联规则,项常常构成层次,可伸缩的,(,flexible),支持度设置,:,在较低层的项预期有较低的支持度,.,事务数据库可以基于维度和层次编码,探寻共享多层挖掘,统一支持度,Milk,support=10%,2%,Milk,support=6%,Skim Milk,support=4%,Level 1,min_sup=5%,Level 2,min_sup=5%,Level 1,min_sup=5%,Level 2,min_sup=3%,减少的支持度,2024/12/6 周五,史忠植 关联规则,86,可伸缩的支持度约束的多层,/,多维,(ML/MD
40、),关联规则,为什么设置,可伸缩,的支持度,?,实际生活中项的出现频率变化巨大,在一个商店购物篮中的钻石,手表,钢笔,统一的支持度未必是一个有趣的模型,一个,可伸缩,模型,较低层的,较多维的组合以及长模式通常具有较小的支持度,总体规则应该要容易说明和理解,特殊的项和特殊的项的组合可以特别设定,(,最小支持度,),以及拥有更高的优先级,2024/12/6 周五,史忠植 关联规则,87,多维关联规则,单维规则,:,buys(X,“,milk,”,),buys(X,“,bread,”,),多维规则,:,2,个维度或谓词,(predicates),跨维度,(Inter-dimension),关联规则,
41、(,无重复谓词,),age(X,”,19-25,”,),occupation(X,“,student,”,),buys(X,“,coke,”,),混合维度,(hybrid-dimension),关联规则,(,重复谓词,),age(X,”,19-25,”,),buys(X,“,popcorn,”,),buys(X,“,coke,”,),分类,(Categorical),属性,有限的几个可能值,值之间不可排序,数量,(Quantitative),属性,数值的,值之间有固有的排序,2024/12/6 周五,史忠植 关联规则,88,多层关联规则,:,冗余滤除,根据项之间的,”,先辈,”,(ancest
42、or),关系,一些规则可能是冗余的,.,示例,milk,wheat bread,support=8%,confidence=70%,2%milk wheat bread,support=2%,confidence=72%,我们说第,1,个规则是第,2,个规则的先辈,.,一个规则是冗余的,当其支持度接近基于先辈规则的,”,预期,”,(expected),值,.,2024/12/6 周五,史忠植 关联规则,89,多层关联规则,:,逐步深化,(Progressive Deepening),一个自上而下的,逐步深化的方法,:,首先挖掘高层的频繁项,:,milk(15%),bread(10%),然后挖掘
43、它们的较低层,”,较弱,”,(weaker),频繁项,:,2%milk(5%),wheat bread(4%),多层之间不同的最小支持度阈值导致了不同的算法,:,如果在多个层次间采用了相同的最小支持度,若,t,的任何一个先辈都是非频繁的则扔弃,(toss)t.,如果在较低层采用了减少的最小支持度,则只检验那些先辈的支持度是频繁的不可忽略的派生(,descendents,)即可,2024/12/6 周五,史忠植 关联规则,90,多维关联规则挖掘的技术,搜索频繁,k,-,谓词集,(predicate set):,示例,:,age,occupation,buys,是一个,3-,谓词集,以,age,处
44、理,的方式,技术可以如下分类,1.,利用数量属性的统计离散,(static discretization),方法,利用预先确定的概念层次对数量属性进行统计离散化,2.,量化关联规则,基于数据的分布,数量属性被动态地离散化到不同的容器空间,(bins),3.,基于距离,(Distance-based),的关联规则,这是一个动态离散化的过程,该过程考虑数据点之间的距离,2024/12/6 周五,史忠植 关联规则,91,数量属性的统计离散化,挖掘之前利用概念层次离散化,数值被范围,(ranges),替代,.,关系数据库中,找出所有的频繁,k-,谓词,(predicate),集要求,k,或,k,+1,
45、次表扫描,.,数据立方体,(data cube),非常适合数据挖掘,.,N-,维立方体的,cells,与谓词集,(,predicate sets),相对应,.,通过数据立方体挖掘会非常快速,.,(,income),(,age),(),(,buys),(,age,income),(,age,buys),(,income,buys),(,age,income,buys),2024/12/6 周五,史忠植 关联规则,92,量化关联规则,age(X,”30-34”),income(X,”24K-48K”),buys(X,”high resolution TV”),数值属性动态离散化,这样挖掘的规则的可
46、信度或紧密度最大化,2-,维 量化关联规则,:A,quan1,A,quan2,A,cat,示例,2024/12/6 周五,史忠植 关联规则,93,Mining Distance-based Association Rules,Binning methods do not capture the semantics of interval data,Distance-based partitioning,more meaningful discretization considering:,density/number of points in an interval,“,closeness,”
47、,of points in an interval,2024/12/6 周五,史忠植 关联规则,94,Interestingness Measure:Correlations(Lift),play basketball,eat cereal,40%,66.7%is misleading,The overall percentage of students eating cereal is 75%which is higher than 66.7%.,play basketball,not eat cereal,20%,33.3%is more accurate,although with lo
48、wer support and confidence,Measure of dependent/correlated events:,lift,Basketball,Not basketball,Sum(row),Cereal,2000,1750,3750,Not cereal,1000,250,1250,Sum(col.),3000,2000,5000,2024/12/6 周五,史忠植 关联规则,95,内容提要,引言,Apriori,算法,FP-growth,算法,并行关联规则挖掘,多维关联规则挖掘,相关规则,关联规则改进,总结,2024/12/6 周五,史忠植 关联规则,96,相关规则,(
49、Correlation Rules),“,Beyond Market Baskets,”,Brin et al.,假设执行关联规则挖掘,c,c,row,t,20,5,25,t,70,5,75,col,90,10,100,tea=coffee,20%support,80%confidence,but 90%of the people buy coffee anyway!,2024/12/6 周五,史忠植 关联规则,97,相关规则,一种度量是计算相关性,若两个随机变量,A,和,B,是统计独立的,对,tea,和,coffee:,2024/12/6 周五,史忠植 关联规则,98,相关规则,利用,2,统
50、计检验来测试独立性,设,n,为购物篮的总数,设,k,为考虑的项的总数,设,r,为一个包含项,(,i,j,i,j,),的规则,设,O(r),表示包含规则,r,的购物篮的数量,(,即频率,),对单个项,i,j,设,Ei,j,=O(i,j,)(,反过来即为,n-Ei,j,),Er=n*Er,1,/n*,*Er,k,/n,2024/12/6 周五,史忠植 关联规则,99,相关规则,2,统计量定义为,Look up for significance value in a statistical textbook,There are k-1 degrees of freedom,If test fails
©2010-2025 宁波自信网络信息技术有限公司 版权所有
客服电话:4008-655-100 投诉/维权电话:4009-655-100