资源描述
,Click to edit Master title style,Click to edit Master text styles,Second level,Third level,Fourth level,Fifth level,*,Data Mining:Concepts and Techniques,*,Chapter 4:Mining Frequent Patterns,Association and Correlations,Basic concepts and a road map,Scalable frequent,itemset,mining methods,Mining various kinds of association rules,Constraint-based association mining,From association to correlation analysis,Mining colossal patterns,Summary,2026/1/12 周一,1,Data Mining:Concepts and Techniques,What Is Frequent Pattern Analysis?,Frequent pattern,:a pattern(a set of items,subsequences,substructures,etc.)that occurs frequently in a data set,First proposed by Agrawal,Imielinski,and Swami AIS93 in the context of,frequent,itemsets,and,association rule mining,Motivation:Finding inherent regularities in data,What products were often purchased together?Beer and diapers?!,What are the subsequent purchases after buying a PC?,What kinds of DNA are sensitive to this new drug?,Can we automatically classify web documents?,Applications,Basket data analysis,cross-marketing,catalog design,sale campaign analysis,Web log(click stream)analysis,and DNA sequence analysis.,2026/1/12 周一,2,Data Mining:Concepts and Techniques,关联规则挖掘,关联规则挖掘的典型案例:,购物篮问题,在商场中拥有大量的商品(项目),如:牛奶、面包等,客户将所购买的商品放入到自己的购物篮中。,通过发现顾客放入购物篮中的不同商品之间的联系,分析顾客的购买习惯,哪些物品经常被顾客购买?,同一次购买中,哪些商品经常会被一起购买?,一般用户的购买过程中是否存在一定的购买时间序列?,具体应用:利润最大化,商品货架设计:更加适合客户的购物路径,货存安排:实现超市的零库存管理,用户分类:提供个性化的服务,2026/1/12 周一,3,Data Mining:Concepts and Techniques,关联规则挖掘,简单的说,,关联规则挖掘就是发现大量数据中项集之间有趣的关联,在交易数据、关系数据或其他信息载体中,查找存在于项目集合或对象集合之间的频繁模式、关联、相关性、或因果结构。,应用,购物篮分析、交叉销售、产品目录设计、聚集、分类等,两种策略:,1,。商品放近,增加销量,2,。,商品放远,增加其他商品的销量,2026/1/12 周一,4,Data Mining:Concepts and Techniques,Why Is Freq.Pattern Mining Important?,Freq.pattern:An intrinsic and important property of datasets,Foundation for many essential data mining tasks,Association,correlation,and causality analysis,Sequential,structural(e.g.,sub-graph)patterns,Pattern analysis in spatiotemporal,multimedia,time-series,and stream data,Classification:discriminative,frequent pattern analysis,Cluster analysis:frequent pattern-based clustering,Data warehousing:iceberg cube and cube-gradient,Semantic data compression:fascicles,Broad applications,2026/1/12 周一,5,Data Mining:Concepts and Techniques,关联规则挖掘形式化定义,给定:,设,I=i,1,i,2,i,m,是项,(item),的,集合。若干项的集合,称为,项集,(,Item Sets,),记,D,为交易,(transaction)T(,或,事务,),的集合,这里交易,T,是项的集合,并且,T,I,。,对应每一个交易有唯一的标识,如交易号,记作,TID,。设,X,是一个,I,中项的集合,如果,X,T,,,那么称交易,T,包含,X,。,寻找:有趣的关联规则,(,强规则,).,2026/1/12 周一,6,Data Mining:Concepts and Techniques,关联规则,所有形如,X,Y,蕴涵式的称为关联规则,这里,X,I,Y,I,,,并且,X,Y=,。,关联规则是有趣的,如果它满足最小支持度阈值与最小置信度阈值,并称之为强规则,2026/1/12 周一,7,Data Mining:Concepts and Techniques,confidence and support,Itemset,X=i,1,i,k,Find all the rules,X,Y,with min confidence and support,Customer,buys diaper,Customer,buys both,Customer,buys beer,support,s,probability,that a transaction contains XY,support(XY)=,同时包含项目集,X,和,Y,的交易数,/,总交易数,用于描述有用性,.,confidence,c,conditional probability,that a transaction having X also contains,Y,.,confidence(XY)=,同时购买商品,X,和,Y,的交易数,/,购买商品,X,的交易数,用于描述确定性,即,”,值得信赖的程度,”,可靠性,”,2026/1/12 周一,8,Data Mining:Concepts and Techniques,Mining Association Rules,an,Example,For rule,A,C,:,support=support(,A,C,)=50%,confidence=support(,A,C,)/support(,A,)=66.6%,Min.support 50%,Min.confidence 50%,Transaction-id,Items bought,10,A,B,C,20,A,C,30,A,D,40,B,E,F,Frequent pattern,Support,A,75%,B,50%,C,50%,A,C,50%,2026/1/12 周一,9,Data Mining:Concepts and Techniques,关联规则的基本形式,关联规则的基本形式:,前提条件结论,支持度,置信度,buys(x,“,diapers,”,),buys(x,“,beers,”,)0.5%,60%,major(x,“,CS,”,)takes(x,“,DB,”,),grade(x,“,A,”,)1%,75%,包含,k,个项目的集合,称为,k-,项集,项集的出现频率是包含项集的事务个数,称为项集的频率、支持计数或者计数,2026/1/12 周一,10,Data Mining:Concepts and Techniques,Basic Concepts:Frequent Patterns,itemset,:A set of one or more items,k-itemset,X=x,1,x,k,(absolute)support,or,support count,of X:Frequency or occurrence of an,itemset,X,(relative),support,s,is the fraction of transactions that contains X(i.e.,the,probability,that a transaction contains X),An,itemset,X is,frequent,if Xs support is no less than a,minsup,threshold,Customer,buys diaper,Customer,buys both,Customer,buys beer,Tid,Items bought,10,Beer,Nuts,Diaper,20,Beer,Coffee,Diaper,30,Beer,Diaper,Eggs,40,Nuts,Eggs,Milk,50,Nuts,Coffee,Diaper,Eggs,Milk,2026/1/12 周一,11,Data Mining:Concepts and Techniques,Basic Concepts:Association Rules,Find all the rules,X,Y,with minimum support and confidence,support,s,probability,that a transaction contains X Y,confidence,c,conditional probability,that a transaction having X also contains,Y,Let,minsup,=50%,minconf,=50%,Freq.Pat.:,Beer:3,Nuts:3,Diaper:4,Eggs:3,Beer,Diaper:3,Customer,buys diaper,Customer,buys both,Customer,buys beer,Nuts,Eggs,Milk,40,Nuts,Coffee,Diaper,Eggs,Milk,50,Beer,Diaper,Eggs,30,Beer,Coffee,Diaper,20,Beer,Nuts,Diaper,10,Items bought,Tid,Association rules:(many more!),Beer,Diaper,(60%,100%),Diaper,Beer,(60%,75%),2026/1/12 周一,12,Data Mining:Concepts and Techniques,Closed Patterns and Max-Patterns,A long pattern contains a combinatorial number of sub-patterns,e.g.,a,1,a,100,contains,(,100,1,)+(,100,2,)+(,1,1,0,0,0,0,)=2,100,1=1.27*10,30,sub-patterns!,Solution:,Mine,closed patterns,and,max-patterns,instead,An,itemset,X,is,closed,if X is,frequent,and there exists,no super-pattern,Y,X,with the same support,as X(proposed by,Pasquier,et al.ICDT99),An,itemset,X is a,max-pattern,if X is frequent and there exists no frequent super-pattern Y,X(proposed by,Bayardo,SIGMOD98),Closed pattern is a lossless compression of freq.patterns,Reducing the#of patterns and rules,2026/1/12 周一,13,Data Mining:Concepts and Techniques,Closed Patterns and Max-Patterns,Exercise.DB=,Min_sup=1.,What is the set of,closed,itemset,?,:1,:2,What is the set of,max-pattern?,:1,What is the set of,all patterns,?,!,2026/1/12 周一,14,Data Mining:Concepts and Techniques,Computational Complexity of Frequent,Itemset,Mining,How many itemsets are potentially to be generated in the worst case?,The number of frequent itemsets to be generated is senstive to the minsup threshold,When minsup is low,there exist potentially an exponential number of frequent itemsets,The worst case:M,N,where M:#distinct items,and N:max length of transactions,The worst case complexty vs.the expected probability,Ex.Suppose Walmart has 10,4,kinds of products,The chance to pick up one product 10,-4,The chance to pick up a particular set of 10 products:10,-40,What is the chance this particular set of 10 products to be frequent 10,3,times in 10,9,transactions?,2026/1/12 周一,15,Data Mining:Concepts and Techniques,Chapter 4:Mining Frequent Patterns,Association and Correlations,Basic concepts and a road map,Scalable frequent,itemset,mining methods,Mining various kinds of association rules,Constraint-based association mining,From association to correlation analysis,Mining colossal patterns,Summary,2026/1/12 周一,16,Data Mining:Concepts and Techniques,使用,Apriori,方法挖掘关联规则,频繁项集的定义,如果项集满足最小支持度,则称之为频繁项集,(高频项集),频繁项集的基本特征,任何频繁项集的非空子集均为频繁项集。例如:,ABC,是频繁项集,则,AB,、,AC,、,BC,均为频繁项集。,反之:如,AB,不是,频繁项集,则,ABC,不可能,是频繁项集,2026/1/12 周一,17,Data Mining:Concepts and Techniques,Apriori,方法,是一种称作逐层搜索的迭代方法。,用,k-,项集探求(,k+1,),-,项集。,具体地:首先找出频繁,1-,项集,该集合记为,L,1,;,用,L,1,找出频繁,2-,项集的集合,L,2,;,如此继续下去,直到找到最大频繁项集,该方法,主要有连接和剪枝两步构成。,2026/1/12 周一,18,Data Mining:Concepts and Techniques,The,Apriori,Algorithm,An Example,Database TDB,1,st,scan,C,1,L,1,L,2,C,2,C,2,2,nd,scan,C,3,L,3,3,rd,scan,Tid,Items,10,A,C,D,20,B,C,E,30,A,B,C,E,40,B,E,Itemset,sup,A,2,B,3,C,3,D,1,E,3,Itemset,sup,A,2,B,3,C,3,E,3,Itemset,A,B,A,C,A,E,B,C,B,E,C,E,Itemset,sup,A,B,1,A,C,2,A,E,1,B,C,2,B,E,3,C,E,2,Itemset,sup,A,C,2,B,C,2,B,E,3,C,E,2,Itemset,B,C,E,Itemset,sup,B,C,E,2,2026/1/12 周一,19,Data Mining:Concepts and Techniques,由频繁项集,产生关联规则,两步,对每一个,频繁项集,u,产生,u,的所有非空真子集,对,u,的,每一个,非空真子集,s,,,若,support_count(u)/support_count(s)=min_conf,则输出:,s,(u-s),例如:频繁项集,u=B,,,C,E,产生所有非空真子集,6,个,对应有,6,个可能的规则,分别计算每一条规则的,confidence,2026/1/12 周一,20,Data Mining:Concepts and Techniques,Apriori:A Candidate Generation&Test Approach,Apriori,pruning principle,:,If there is,any,itemset which is infrequent,its superset should not be generated/tested!(,Agrawal&Srikant VLDB94,Mannila,et al.KDD 94),Method:,Initially,scan DB once to get frequent 1-itemset,Generate,length(k+1),candidate,itemsets,from length k,frequent,itemsets,Test,the candidates against DB,Terminate when no frequent or candidate set can be generated,2026/1/12 周一,21,Data Mining:Concepts and Techniques,The Apriori Algorithm,An Example,Database TDB,1,st,scan,C,1,L,1,L,2,C,2,C,2,2,nd,scan,C,3,L,3,3,rd,scan,Tid,Items,10,A,C,D,20,B,C,E,30,A,B,C,E,40,B,E,Itemset,sup,A,2,B,3,C,3,D,1,E,3,Itemset,sup,A,2,B,3,C,3,E,3,Itemset,A,B,A,C,A,E,B,C,B,E,C,E,Itemset,sup,A,B,1,A,C,2,A,E,1,B,C,2,B,E,3,C,E,2,Itemset,sup,A,C,2,B,C,2,B,E,3,C,E,2,Itemset,B,C,E,Itemset,sup,B,C,E,2,Sup,min,=2,2026/1/12 周一,22,Data Mining:Concepts and Techniques,The Apriori Algorithm(,Pseudo-Code,),C,k,:Candidate itemset of size k,L,k,:frequent itemset of size k,L,1,=frequent items;,for,(,k,=1;,L,k,!=,;,k,+),do begin,C,k+1,=candidates generated from,L,k,;,for each,transaction,t,in database do,increment the count of all candidates in,C,k+1,that are contained in,t,L,k+1,=candidates in,C,k+1,with min_support,end,return,k,L,k,;,2026/1/12 周一,23,Data Mining:Concepts and Techniques,Implementation of Apriori,How to generate candidates?,Step 1:self-joining,L,k,Step 2:pruning,Example of Candidate-generation,L,3,=,abc,abd,acd,ace,bcd,Self-joining:,L,3,*L,3,abcd,from,abc,and,abd,acde,from,acd,and,ace,Pruning:,acde,is removed because,ade,is not in,L,3,C,4,=,abcd,2026/1/12 周一,24,Data Mining:Concepts and Techniques,Candidate Generation:An SQL Implementation,SQL Implementation of candidate generation,Suppose the items in,L,k-1,are listed in an order,Step 1:self-joining,L,k-1,insert into,C,k,select,p.item,1,p.item,2,p.item,k-1,q.item,k-1,from,L,k-1,p,L,k-1,q,where,p.item,1,=q.item,1,p.item,k-2,=q.item,k-2,p.item,k-1,Wage:mean=$7/hr(overall mean=$9),2026/1/12 周一,57,Data Mining:Concepts and Techniques,Static,Discretization,of Quantitative Attributes,Discretized,prior to mining using concept hierarchy.,Numeric values are replaced by ranges,In relational database,finding all frequent k-predicate sets will require,k,or,k,+1 table scans,Data cube is well suited for mining,The cells of an n-dimensional,cuboid,correspond to the,predicate sets,Mining from data cubescan be much faster,(income),(age),(),(buys),(age,income),(age,buys),(income,buys),(age,income,buys),2026/1/12 周一,58,Data Mining:Concepts and Techniques,Quantitative Association Rules,Proposed by Lent,Swami and,Widom,ICDE97,Numeric attributes are,dynamically,discretized,Such that the confidence or compactness of the rules mined is maximized,2-D quantitative association rules:A,quan1,A,quan2,A,cat,Cluster,adjacent,association rules to form general rules using a 2-D grid,Example,age(X,“34-35”),income(X,“30-50K”),buys(X,“high resolution TV”),2026/1/12 周一,59,Data Mining:Concepts and Techniques,Mining Other Interesting Patterns,Flexible support constraints(Wang,et al.VLDB02),Some items(e.g.,diamond)may occur rarely but are valuable,Customized,sup,min,specification and application,Top-K closed frequent patterns(Han,et al.ICDM02),Hard to specify,sup,min,but top-k,with,length,min,is more desirable,Dynamically raise,sup,min,in FP-tree construction and mining,and select most promising path to mine,2026/1/12 周一,60,Data Mining:Concepts and Techniques,Chapter 4:Mining Frequent Patterns,Association and Correlations,Basic concepts and a road map,Efficient and scalable frequent,itemset,mining methods,Mining various kinds of association rules,From association mining to correlation analysis,Constraint-based association mining,Mining colossal patterns,Summary,2026/1/12 周一,61,Data Mining:Concepts and Techniques,兴趣度的度量,客观度量,两个最为流行的度量,:,支持度和置信度(,support and confidence,),主观度量,(,Silberschatz,&,Tuzhilin,KDD95),一个规则(模式)是感兴趣的,如果,没有想到的,(,用户感到惊讶的,);,可操作的,(,用户在得到结果后,可以在此之上做些什么,),2026/1/12 周一,62,Data Mining:Concepts and Techniques,支持度,-,置信度方法的不足,Example 1:,(,Aggarwal,&Yu,PODS98),5000,个学生中,3000,喜欢打篮球,3750,喜欢吃米饭,2000,同时喜欢打篮球和吃米饭,关联规则:,play basketball,eat cereal 40%,66.7%,该规则具有欺骗性,因为从整个学生情况来看,有,75%,的学生喜欢吃米饭,大大高于,66.7%,。,关联规则:,play basketball,not eat cereal 20%,33.3%,该规则虽然拥有较低的支持度和置信度,但是比较精确。,Basketball,Not basketball,Sum(row),Cereal,2000,1750,3750,Not cereal,1000,250,1250,Sum(col.),3000,2000,5000,2026/1/12 周一,63,Data Mining:Concepts and Techniques,提升:一种兴趣度的度量,设,A,B,是两个项集,,P(A,B)=P(B)*P(A),A,和,B,是独立事件,A,B,的,相关性度量,取值小于,1,,,A and B,负相关,取值大于,1,,,A and B,正相关,如:上例,corr,=0.89,P(B|A)/P(B),称为规则,A=B,的,“,提升,”,挖掘相关性关联规则,,X2,检验是否有统计意义,2026/1/12 周一,64,Data Mining:Concepts and Techniques,Interestingness Measure:Correlations(Lift),play basketball,eat cereal,40%,66.7%is misleading,The overall%of students eating cereal is 75%66.7%.,play basketball,not eat cereal,20%,33.3%is more accurate,although with lower support and confidence,Measure of dependent/correlated events:,lift,Basketball,Not basketball,Sum(row),Cereal,2000,1750,3750,Not cereal,1000,250,1250,Sum(col.),3000,2000,5000,2026/1/12 周一,65,Data Mining:Concepts and Techniques,Chapter 4:Mining Frequent Patterns,Association and Correlations,Basic concepts and a road map,Efficient and scalable frequent,itemset,mining methods,Mining various kinds of association rules,From association mining to correlation analysis,Constraint-based association mining,Mining colossal patterns,Summary,2026/1/12 周一,66,Data Mining:Concepts and Techniques,基于约束的挖掘,使用约束的必要性,产生的多数规则是用户不感兴趣的,应在用户提供的各种约束的指导下进行挖掘,在数据挖掘中常使用的几种约束:,知识类型约束:,指定要挖掘的知识类型,如关联知识,数据约束:,指定与任务相关的数据集,Find product pairs sold together in,Vancouver,in,Dec.98,.,维,/,层次约束,:,指定所用的维或概念结构中的层,in relevance to,region,price,brand,customer category,.,规则约束:,指定要挖掘的规则形式,(,如规则模板,),单价,(price$200).,兴趣度约束:,指定规则兴趣度阈值或统计度量,如,(min_support,3%,min_confidence,60%).,2026/1/12 周一,67,Data Mining:Concepts and Techniques,Constraint-based(Query-Directed)Mining,Finding,all,the patterns in a database,autonomously,?unrealistic!,The patterns could be too many but not focused!,Data mining should be an,interactive,process,User directs what to be mined using a,data mining query language,(or a graphical user interface),Constraint-based mining,User flexibility:provides,constraints,on what to be mined,System optimization:explores such constraints for efficient mining,constraint-based mining:,constraint-pushing,similar to push selection first in DB query processing,Note:still find all the answers satisfying constraints,not finding some answers in“heuristic search”,2026/1/12 周一,68,Data Mining:Concepts and Techniques,Constraints in Data Mining,Knowledge type constraint,:,classification,association,etc.,Data constraint,using,SQL-like queries,find product pairs sold together in stores in,Chicago,in,Dec.02,Dimension/level constraint,in relevance to,region,price,brand,customer category,Rule(or pattern)constraint,small sales(price$200),Interestingness constraint,strong rules:min_support,3%,min_confidence,60%,2026/1/12 周一,69,Data Mining:Concepts and Techniques,Constraint-Based Frequ
展开阅读全文