《数据仓库与数据挖掘》第9章.pptx (Data Warehouse and Data Mining, Chapter 9)

Lecture slides covering Chapter 7, "Classification and Prediction", of Data Mining: Concepts and Techniques.

Chapter 7: Classification and Prediction

- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian Classification
- Classification by Neural Networks
- Classification by Support Vector Machines (SVM)
- Classification based on concepts from association rule mining
- Other Classification Methods
- Prediction
- Classification accuracy
- Summary

Classification vs. Prediction

- Classification:
  - predicts categorical class labels (discrete or nominal)
  - classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, then uses the model to classify new data
- Prediction:
  - models continuous-valued functions, i.e., predicts unknown or missing values
- Typical applications: credit approval, target marketing, medical diagnosis, treatment effectiveness analysis

Classification: A Two-Step Process

- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: classifying future or unknown objects
  - Estimate the accuracy of the model
    - The known label of each test sample is compared with the model's classification result
    - The accuracy rate is the percentage of test-set samples that are correctly classified by the model
    - The test set must be independent of the training set, otherwise over-fitting will occur
  - If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known

Classification Process (1): Model Construction

- Training data is fed to a classification algorithm, which outputs the classifier (model)
- Example of a learned rule: IF rank = "professor" OR years > 6 THEN tenured = "yes"

Classification Process (2): Use the Model in Prediction

- The classifier is first evaluated on testing data, then applied to unseen data
- Example of an unseen tuple: (Jeff, Professor, 4). Tenured? (See the sketch below.)
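To make the two-step process concrete, here is a minimal Python sketch. The rule is the one from the slide; the test tuples are hypothetical stand-ins for the deck's testing-data figure, so the printed accuracy is illustrative only.

```python
# Minimal sketch of the two-step classification process.
# The "model" is the rule from the slide (hand-coded here, not induced),
# and the test tuples are hypothetical.

def classify(rank, years):
    """The model: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "professor" or years > 6 else "no"

# Step 2a: estimate accuracy on a test set independent of the training set.
test_set = [
    ("Tom",     "assistant prof", 2, "no"),
    ("Merlisa", "associate prof", 7, "no"),
    ("George",  "professor",      5, "yes"),
    ("Joseph",  "assistant prof", 7, "yes"),
]
correct = sum(classify(rank, years) == label for _, rank, years, label in test_set)
print(f"accuracy = {correct}/{len(test_set)}")

# Step 2b: if the accuracy is acceptable, classify tuples with unknown labels.
print("Jeff tenured?", classify("professor", 4))   # -> yes (rank is professor)
```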

Supervised vs. Unsupervised Learning

- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation
  - New data is classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Issues Regarding Classification and Prediction (1): Data Preparation

- Data cleaning: preprocess data in order to reduce noise and handle missing values
- Relevance analysis (feature selection): remove irrelevant or redundant attributes
- Data transformation: generalize and/or normalize data (a normalization sketch follows after the next slide)

Issues Regarding Classification and Prediction (2): Evaluating Classification Methods

- Predictive accuracy
- Speed and scalability: time to construct the model; time to use the model
- Robustness: handling noise and missing values
- Scalability: efficiency in disk-resident databases
- Interpretability: understanding and insight provided by the model
- Goodness of rules: decision tree size; compactness of classification rules
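A minimal sketch of the normalization step mentioned under data transformation, assuming simple min-max scaling to [0, 1] on a hypothetical numeric attribute (z-score normalization would work similarly):

```python
# Min-max normalization of one numeric attribute to [0, 1]. Values are hypothetical.

def min_max_normalize(values):
    lo, hi = min(values), max(values)
    if hi == lo:                        # degenerate column: all values equal
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

incomes = [23_000, 41_000, 58_000, 110_000]
print(min_max_normalize(incomes))       # [0.0, 0.207..., 0.402..., 1.0]
```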

Training Dataset

This follows an example from Quinlan's ID3: the standard 14-tuple buys_computer table.

age      income  student  credit_rating  buys_computer
<=30     high    no       fair           no
<=30     high    no       excellent      no
31...40  high    no       fair           yes
>40      medium  no       fair           yes
>40      low     yes      fair           yes
>40      low     yes      excellent      no
31...40  low     yes      excellent      yes
<=30     medium  no       fair           no
<=30     low     yes      fair           yes
>40      medium  yes      fair           yes
<=30     medium  yes      excellent      yes
31...40  medium  no       excellent      yes
31...40  high    yes      fair           yes
>40      medium  no       excellent      no

Output: A Decision Tree for "buys_computer"

- The induced tree tests age at the root:
  - age <= 30: test student (no -> buys_computer = no; yes -> buys_computer = yes)
  - age 31...40: buys_computer = yes
  - age > 40: test credit_rating (excellent -> buys_computer = no; fair -> buys_computer = yes)

Algorithm for Decision Tree Induction

- Basic algorithm (a greedy algorithm); a runnable sketch follows below
  - The tree is constructed in a top-down, recursive, divide-and-conquer manner
  - At the start, all the training examples are at the root
  - Attributes are categorical (if continuous-valued, they are discretized in advance)
  - Examples are partitioned recursively based on selected attributes
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
- Conditions for stopping partitioning
  - All samples for a given node belong to the same class
  - There are no remaining attributes for further partitioning (majority voting is employed to classify the leaf)
  - There are no samples left
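The following is a compact Python sketch of the basic greedy algorithm, assuming categorical attributes and information gain as the selection measure (both per the slides); rows are dicts keyed by attribute name, and the helper names are mine, not from the deck.

```python
import math
from collections import Counter

def entropy(labels):
    """I(s1,...,sm) = -sum (si/s) log2 (si/s) over the class counts."""
    s = len(labels)
    return -sum((c / s) * math.log2(c / s) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Gain(A) = I(S) - E(A); E(A) is the size-weighted entropy of the partitions."""
    parts = {}
    for row, label in zip(rows, labels):
        parts.setdefault(row[attr], []).append(label)
    e_a = sum(len(p) / len(labels) * entropy(p) for p in parts.values())
    return entropy(labels) - e_a

def build_tree(rows, labels, attrs):
    """Greedy, top-down, recursive divide-and-conquer induction (ID3 style)."""
    if len(set(labels)) == 1:              # stop: all samples in one class
        return labels[0]
    if not attrs:                          # stop: no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    rest = [a for a in attrs if a != best]
    node = {best: {}}
    for value in {row[best] for row in rows}:   # partition on the selected attribute
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        node[best][value] = build_tree([r for r, _ in pairs],
                                       [l for _, l in pairs], rest)
    return node
```

Applied to the 14-tuple table above, the sketch selects age at the root, matching the tree shown. A full ID3 would also handle the "no samples left" case by falling back to the parent's majority class; this sketch only partitions on attribute values that actually occur.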

Attribute Selection Measure: Information Gain (ID3/C4.5)

- Select the attribute with the highest information gain
- Let S contain s_i tuples of class C_i for i = 1, ..., m
- Information (the expected information needed to classify an arbitrary tuple):
  I(s_1, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}
- Entropy of attribute A with values {a_1, a_2, ..., a_v}:
  E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} \, I(s_{1j}, \ldots, s_{mj})
- Information gained by branching on attribute A:
  Gain(A) = I(s_1, \ldots, s_m) - E(A)

Attribute Selection by Information Gain Computation

- Class P: buys_computer = "yes" (9 tuples); class N: buys_computer = "no" (5 tuples)
- I(p, n) = I(9, 5) = 0.940
- Compute the entropy for age:
  E(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
  where (5/14) I(2,3) means "age <= 30" has 5 of the 14 samples, with 2 yes'es and 3 no's. Hence
  Gain(age) = I(9,5) - E(age) = 0.246
- Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048; age is therefore selected as the root split. (These figures are checked in the snippet below.)
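The numbers on this slide can be checked directly with a self-contained snippet:

```python
import math

def I(*counts):
    """Expected information I(s1,...,sm) for a class-count distribution."""
    s = sum(counts)
    return -sum(c / s * math.log2(c / s) for c in counts if c)

# Entropy of age: three partitions of the 14 samples with (yes, no) counts
# (2,3), (4,0), (3,2), exactly as on the slide.
E_age = (5/14) * I(2, 3) + (4/14) * I(4, 0) + (5/14) * I(3, 2)

print(round(I(9, 5), 3))           # 0.940
print(round(E_age, 3))             # 0.694
print(round(I(9, 5) - E_age, 3))   # ~0.247; the slide's 0.246 comes from 0.940 - 0.694
```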

Other Attribute Selection Measures

- Gini index (CART, IBM IntelligentMiner)
  - All attributes are assumed continuous-valued
  - Assume there exist several possible split values for each attribute
  - May need other tools, such as clustering, to get the possible split values
  - Can be modified for categorical attributes

Gini Index (IBM IntelligentMiner)

- If a data set T contains examples from n classes, the gini index gini(T) is defined as
  gini(T) = 1 - \sum_{j=1}^{n} p_j^2
  where p_j is the relative frequency of class j in T.
- If T is split into two subsets T_1 and T_2 with sizes N_1 and N_2 respectively, the gini index of the split data is defined as
  gini_{split}(T) = \frac{N_1}{N} gini(T_1) + \frac{N_2}{N} gini(T_2)
- The attribute that provides the smallest gini_split(T) is chosen to split the node (this requires enumerating all possible splitting points for each attribute). A short sketch of both formulas follows below.
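A minimal sketch of gini and gini_split, with a hypothetical binary split of the 14-tuple class distribution; among candidate splits, the one minimizing gini_split would be chosen:

```python
from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum_j p_j^2, with p_j the relative frequency of class j."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """Size-weighted gini of a binary split into T1 and T2."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical split of 14 labels (9 yes / 5 no) into two subsets:
left  = ["yes"] * 2 + ["no"] * 3      # e.g. tuples below a candidate split point
right = ["yes"] * 7 + ["no"] * 2
print(round(gini(left + right), 3))       # 0.459 for the unsplit node
print(round(gini_split(left, right), 3))  # lower is better
```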

Extracting Classification Rules from Trees

- Represent the knowledge in the form of IF-THEN rules
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction
- The leaf node holds the class prediction
- Rules are easier for humans to understand
- Example (one rule per leaf of the buys_computer tree above; a sketch of the extraction follows below):
  IF age = "<=30" AND student = "no" THEN buys_computer = "no"
  IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
  IF age = "31...40" THEN buys_computer = "yes"
  IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
  IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
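A sketch of the path-enumeration idea, using a nested-dict tree of the same shape as the one induced earlier (the dict encoding is mine):

```python
def extract_rules(tree, path=()):
    """One IF-THEN rule per root-to-leaf path; attribute-value pairs conjoin."""
    if not isinstance(tree, dict):                 # leaf: holds the class prediction
        conds = " AND ".join(f'{a} = "{v}"' for a, v in path)
        return [f'IF {conds} THEN buys_computer = "{tree}"']
    (attr, branches), = tree.items()
    rules = []
    for value, subtree in branches.items():
        rules += extract_rules(subtree, path + ((attr, value),))
    return rules

# The buys_computer tree from the earlier slide, as a nested dict:
tree = {"age": {
    "<=30":    {"student": {"no": "no", "yes": "yes"}},
    "31...40": "yes",
    ">40":     {"credit_rating": {"excellent": "no", "fair": "yes"}},
}}
for rule in extract_rules(tree):
    print(rule)
```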

Avoid Overfitting in Classification

- Overfitting: an induced tree may overfit the training data
  - Too many branches, some of which may reflect anomalies due to noise or outliers
  - Poor accuracy for unseen samples
- Two approaches to avoid overfitting
  - Prepruning: halt tree construction early; do not split a node if this would drop the goodness measure below a threshold
    - Difficult to choose an appropriate threshold
  - Postpruning: remove branches from a "fully grown" tree, producing a sequence of progressively pruned trees
    - Use a set of data different from the training data to decide which is the "best pruned tree"

Approaches to Determine the Final Tree Size

- Separate training (2/3) and testing (1/3) sets
- Use cross validation, e.g., 10-fold cross validation (see the sketch below)
- Use all the data for training, but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution
- Use the minimum description length (MDL) principle: halt growth of the tree when the encoding is minimized
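A minimal sketch of k-fold cross validation, assuming a generic train function that returns a classify function; both names are placeholders, not from the deck:

```python
def k_fold_accuracy(rows, labels, train, k=10):
    """Average accuracy over k folds; each fold is held out once for testing."""
    n = len(rows)
    accuracies = []
    for i in range(k):
        test_idx = set(range(i * n // k, (i + 1) * n // k))
        train_rows   = [r for j, r in enumerate(rows) if j not in test_idx]
        train_labels = [l for j, l in enumerate(labels) if j not in test_idx]
        model = train(train_rows, train_labels)   # e.g. decision tree induction
        hits = sum(model(rows[j]) == labels[j] for j in test_idx)
        accuracies.append(hits / max(len(test_idx), 1))
    return sum(accuracies) / k
```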

Enhancements to Basic Decision Tree Induction

- Allow for continuous-valued attributes
  - Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
- Handle missing attribute values
  - Assign the most common value of the attribute
  - Assign a probability to each of the possible values
- Attribute construction
  - Create new attributes based on existing ones that are sparsely represented
  - This reduces fragmentation, repetition, and replication

Classification in Large Databases

- Classification is a classical problem extensively studied by statisticians and machine learning researchers
- Scalability: classifying data sets with millions of examples and hundreds of attributes at reasonable speed
- Why decision tree induction in data mining?
  - relatively fast learning speed (compared with other classification methods)
  - convertible to simple and easy-to-understand classification rules
  - can use SQL queries for accessing databases
  - classification accuracy comparable with other methods

Scalable Decision Tree Induction Methods in Data Mining Studies

- SLIQ (EDBT'96, Mehta et al.): builds an index for each attribute; only the class list and the current attribute list reside in memory
- SPRINT (VLDB'96, J. Shafer et al.): constructs an attribute-list data structure
- PUBLIC (VLDB'98, Rastogi & Shim): integrates tree splitting and tree pruning: stops growing the tree earlier
- RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti): separates the scalability aspects from the criteria that determine the quality of the tree; builds an AVC-list (attribute, value, class label)

Data Cube-Based Decision-Tree Induction

- Integration of generalization with decision-tree induction (Kamber et al. '97)
- Classification at primitive concept levels
  - E.g., precise temperature, humidity, outlook, etc.
  - Low-level concepts, scattered classes, bushy classification trees
  - Semantic interpretation problems
- Cube-based multi-level classification
  - Relevance analysis at multiple levels
  - Information-gain analysis with dimension + level

Presentation of Classification Results
[Screenshot slide; image not captured]

Visualization of a Decision Tree in SGI/MineSet 3.0
[Screenshot slide; image not captured]

Interactive Visual Mining by Perception-Based Classification (PBC)
[Screenshot slide; image not captured]

Bayesian Classification: Why?

- Probabilistic learning: calculate explicit probabilities for a hypothesis; among the most practical approaches to certain types of learning problems
- Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
- Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities
- Standard: even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured

Bayesian Theorem: Basics

- Let X be a data sample whose class label is unknown
- Let H be the hypothesis that X belongs to class C
- For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X
- P(H): prior probability of hypothesis H (the initial probability before we observe any data; reflects background knowledge)
- P(X): probability that the sample data is observed
- P(X|H): probability of observing sample X, given that the hypothesis holds

Bayesian Theorem

- Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:
  P(H|X) = \frac{P(X|H) \, P(H)}{P(X)}
- Informally: posterior = likelihood x prior / evidence
- MAP (maximum a posteriori) hypothesis:
  h_{MAP} = \arg\max_{h \in H} P(h|D) = \arg\max_{h \in H} P(D|h) \, P(h)
- Practical difficulty: requires initial knowledge of many probabilities, at significant computational cost
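A minimal numeric check of the theorem, using the class prior 9/14 and the two likelihood values that will reappear in the naive Bayes example below:

```python
# posterior = likelihood * prior / evidence
p_h = 9 / 14                   # prior P(H), e.g. P(buys_computer = yes)
p_x_given_h = 0.044            # likelihood P(X|H)
p_x_given_not_h = 0.019        # P(X|not H)

# Evidence P(X) by the law of total probability:
p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)
posterior = p_x_given_h * p_h / p_x
print(round(posterior, 3))     # P(H|X) ~ 0.807
```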

Naive Bayes Classifier

- A simplifying assumption: attributes are conditionally independent given the class:
  P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i)
- The probability of observing, say, two elements y_1 and y_2 given class C is the product of the probabilities of each element taken separately, given the same class: P(y_1, y_2|C) = P(y_1|C) * P(y_2|C)
- No dependence relation between attributes is modeled
- Greatly reduces the computation cost: only the class distribution has to be counted
- Once P(X|C_i) is known, assign X to the class with the maximum P(X|C_i) * P(C_i)
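A compact sketch of training and scoring a naive Bayes classifier by counting, assuming categorical attributes (rows as dicts); the function names are mine:

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count class priors P(Ci) and conditionals P(x_k|Ci) from categorical data."""
    n = len(labels)
    prior = {c: k / n for c, k in Counter(labels).items()}
    cond = defaultdict(Counter)          # (class, attribute) -> value counts
    for row, c in zip(rows, labels):
        for attr, value in row.items():
            cond[(c, attr)][value] += 1

    def p_x_given_c(row, c):
        p = 1.0
        for attr, value in row.items():
            counts = cond[(c, attr)]
            p *= counts[value] / sum(counts.values())   # no smoothing, as on the slide
        return p

    def classify(row):
        # argmax over classes of P(X|Ci) * P(Ci)
        return max(prior, key=lambda c: p_x_given_c(row, c) * prior[c])

    return classify
```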

Training Dataset (for the Naive Bayes Example)

- Classes: C1: buys_computer = "yes"; C2: buys_computer = "no" (the 14-tuple table above)
- Data sample to classify: X = (age = "<=30", income = "medium", student = "yes", credit_rating = "fair")

Naive Bayesian Classifier: Example

- Compute P(X|Ci) for each class:
  - P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
  - P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
  - P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
  - P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
  - P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
  - P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
  - P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
  - P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
- Multiply under the independence assumption:
  P(X | buys_computer = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
  P(X | buys_computer = "no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
- Weight by the class priors:
  P(X | "yes") P("yes") = 0.044 x 9/14 = 0.028
  P(X | "no") P("no") = 0.019 x 5/14 = 0.007
- Therefore X is classified as buys_computer = "yes" (checked in the snippet below)
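Feeding the 14-tuple table through the train_nb sketch above reproduces this decision; a direct arithmetic check is even shorter:

```python
# Direct check of the slide's arithmetic.
p_x_yes = (2/9) * (4/9) * (6/9) * (6/9)     # P(X|yes) ~ 0.0439
p_x_no  = (3/5) * (2/5) * (1/5) * (2/5)     # P(X|no)  = 0.0192
score_yes = p_x_yes * 9/14                  # ~ 0.0282
score_no  = p_x_no  * 5/14                  # ~ 0.0069
print("yes" if score_yes > score_no else "no")   # -> yes
```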

[Intervening slides not captured in the extraction.]

Classification

- Typical applications map attribute values to a class label: credit approval (Yes/No); (Temp, Humidity) -> Rain (Yes/No)
- Mathematically, classification learns a function f: X -> Y from labeled examples

Linear Classification

- Binary classification problem
- [Figure: two point clouds separated by a red line, class 'x' above and class 'o' below]
- The data above the red line belong to class x; the data below the red line belong to class o
- Examples: SVM, perceptron, probabilistic classifiers (a perceptron sketch follows below)

Discriminative Classifiers

- Advantages
  - prediction accuracy is generally high (as compared with Bayesian methods in general)
  - robust: works when training examples contain errors
- [The extraction ends here, mid-slide.]
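To make the linear-classification picture concrete, here is a minimal perceptron sketch (one of the examples the slide lists); the 2-D points are made up:

```python
# Minimal perceptron for a linearly separable binary problem (labels +1 / -1).
# The training points are made up; w and b define the separating line.

def train_perceptron(points, labels, epochs=100, lr=0.1):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        updated = False
        for (x1, x2), y in zip(points, labels):
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:   # misclassified: nudge the line
                w[0] += lr * y * x1
                w[1] += lr * y * x2
                b += lr * y
                updated = True
        if not updated:                                # converged: all points correct
            break
    return w, b

points = [(1, 3), (2, 4), (3, 5), (3, 1), (4, 2), (5, 3)]
labels = [+1, +1, +1, -1, -1, -1]       # class 'x' above the line, 'o' below
w, b = train_perceptron(points, labels)
pred = lambda p: 1 if w[0] * p[0] + w[1] * p[1] + b > 0 else -1
print(all(pred(p) == y for p, y in zip(points, labels)))   # True once separated
```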
