Data Mining: Practical Machine Learning Techniques and Their Java Implementation
Original English edition: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations
By Ian H. Witten and Eibe Frank (New Zealand)

Weka
- An open-source machine learning and data mining workbench implemented in Java, developed at the University of Waikato in New Zealand.
- www.cs.waikato.ac.nz/ml/weka/
- www.mkp.com/datamining/

Concepts: KDD, ML, OLAP, and DM
- KDD (Knowledge Discovery in Databases) is a multi-step process of discovering knowledge.
- ML (Machine Learning): knowledge discovery that is not limited to data held in databases.
  - Process: mine the data -> represent the patterns -> validate -> predict
- OLAP (Online Analytical Processing) is the online analysis of data held in a database.
- Data mining (DM) is only one important component of KDD/ML.
- DM is used to generate hypotheses, whereas OLAP is used to verify them.

Concepts: DM and DB
- Data preparation accounts for about 70% of the work in the data mining process.
- Database + data mining = a database that can talk

Concepts: Data Mining
- Definition: data mining is the process of extracting latent, valuable knowledge (models or rules) from large amounts of data.
- Key characteristics of data mining:
  - Large amounts of data
  - Discovering previously unknown, hidden information
  - Extracting valuable information
  - Making important business decisions using the information

Key points about DM/ML
- The data is stored electronically and the search is automated by computer.
- It is about solving problems by analyzing data already present in databases.
- It is defined as the process of discovering patterns in data.
- This book is about techniques for finding and describing structural patterns in data.
- Representations of structural patterns: tables, trees, rules

Concepts: Machine Learning
- To learn:
  - to get knowledge of by study, experience, or being taught;
  - to become aware by information or from observation;
  - to commit to memory;
  - to be informed of, ascertain;
  - to receive instruction.
- Shortcomings when it comes to talking about computers:
  - It is virtually impossible to test whether learning has been achieved or not.
- This ties learning to performance rather than knowledge.

A simple example: the weather problem*
- Weather data: weather.nominal.arff
- Run Weka, load the data, and select the id3 algorithm.
- Prediction (decision tree):
    outlook = rainy
    |  windy = TRUE: no
    |  windy = FALSE: yes
- Test method: 10-fold cross-validation
- Test results: confusion matrix (p. 138) and accuracy
    [confusion matrix output; only the fragment "a b ... no" survives]

[...]
- Ordinal: distances cannot be measured, e.g. hot, mild, cool
- Interval: distances can be measured, e.g. integers
- Ratio: e.g. 58.1%

Input: Preparing the input*
- Gathering the data together
- The data must be assembled, integrated, and cleaned up (data warehousing).
- Selecting the right type and level of aggregation is usually critical for success.
- Attribute types:
  - ARFF file format (note: see weather.nominal.arff)
  - Two basic types are supported, nominal and numeric; prefer nominal where possible.
- Attribute values:
  - Missing values: drop the instance, substitute a value, or mark the field with "?".
  - Inaccurate values: one bad value can spoil the whole dataset, and spotting it takes domain knowledge!
- Getting to know your data!
- Data cleaning is time-consuming and laborious, yet very important: garbage in, garbage out!

Output: Knowledge representation
- Decision tables
- Decision trees
- Classification rules: If a and b then x
- Association rules (can have multiple consequents): If ... then outlook = sunny and humidity = high
- Rules with exceptions (p. 66): If ... then ... except ... else ... except ...
- Trees for numeric prediction
- Instance-based representation
- Clusters

Algorithms: The basic methods
- Simplicity first: simple ideas often work very well.
  - Very simple classification rules perform well on most commonly used datasets (Holte 1993).
- Inferring rudimentary rules (algorithm: 1R, "1-rule")
- Statistical modeling (algorithm: Naïve Bayes)
  - Uses all attributes, assuming they are independent of one another and equally important.
- Divide and conquer: constructing decision trees
  - Repeatedly select an attribute on which to split the instances (algorithms: ID3, C4.5).
- Covering algorithms: constructing rules (algorithm: Prism)
  - Take each class in turn and seek a way of covering all instances in it, at the same time excluding instances not in the class.
  - The covering approach yields a set of rules rather than a decision tree.

Algorithms: The basic methods (continued)
- Mining association rules
  - Parameters: coverage (support), accuracy (confidence)
- Linear models (see the cpu.arff example)
  - Mainly used for numeric prediction and classification (linear regression)
- Instance-based learning
  - Algorithms: nearest neighbor, k-nearest neighbor

Assessing credibility*
- Three datasets:
  - Training data: used to derive the model; the more data, the better the model.
  - Validation data: used to tune the model's parameters.
  - Test data: used to compute the error rate of the final model; the more data, the more accurate the estimate.
- Principle: test data must never be used in any way to train the model.
- Problem: how should the data be split when there are few instances?
- Methods:
  - N-fold cross-validation (n = 3, 10)
  - Leave-one-out cross-validation
  - Bootstrap (the 0.632 bootstrap): best for very small datasets
- Counting the cost: lift charts (respondents / sample size), ROC curves (p. 141)
- The MDL (Minimum Description Length) principle
  - Occam's Razor: other things being equal, simple theories are preferable to complex ones.
  - Einstein: "Everything should be made as simple as possible, but no simpler."

Implementations: Real machine learning schemes (omitted)
- Further reading:
  - Ch. 6.1 Decision trees
  - Ch. 6.2 Classification rules
  - Ch. 6.3 Extending linear classification: support vector machines
  - Ch. 6.4 Instance-based learning
  - Ch. 6.5 Numeric prediction
  - Ch. 6.6 Clustering

Improvements: Engineering the input and output
- Data engineering:
  - Attribute selection
  - Discretizing numeric attributes
  - Automatic data cleaning
- Combining multiple models:
  - Bagging
  - Boosting
  - Stacking
  - Error-correcting output codes

The future: Looking forward
- Large datasets
- Visualization: input, output
- Incorporating domain knowledge
  - Metadata often involves relations among attributes.
- Text mining
- Mining the Web

Review: Contents
- DM as an interdisciplinary field
- Functional categories of DM
- Applications of DM
- Steps of DM
- Theory, techniques, and algorithms of DM
- Common DM analysis tools

Review: DM as an interdisciplinary field
- Database systems, data warehouses, OLAP
- Machine learning
- Statistical and data analysis methods
- Visualization
- Mathematical programming
- High performance computing

Review: Functional categories of DM
- Scheme 1:
  - Classification
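  - Estimation
  - Prediction
  - Affinity grouping
  - Clustering
- Scheme 2:
  - Classification
  - Regression
  - Time-series forecasting
  - Clustering
  - Association
  - Sequence discovery

Review: Applications of DM
- Marketing: market basket analysis
- Customer relationship management
- Finding prospective customers
- Increasing customer lifetime value
- Maintaining customer loyalty
- Marketing campaign planning
- Predicting the direction of financial markets
- Insurance fraud detection
- Customer credit risk rating
- Detecting telephone fraud (stolen calls)
- Analyzing the strengths and weaknesses of NBA players
- Early warning of likely credit card bad debts
- Classifying stars and other celestial objects

Review: Steps of DM*
- One way of dividing the steps:
  - Understand the data and the work to be done
  - Acquire the relevant knowledge and techniques (Acquisition)
  - Integrate and check the data (Integration and checking)
  - Remove erroneous and inconsistent data (Data cleaning)
  - Develop models and hypotheses (Model and hypothesis development)
  - Carry out the actual data mining
  - Test and verify the analyzed data (Testing and verification)
  - Interpret and use the results (Interpretation and use)
- Another way of dividing the steps (see the notes for this slide!)
- Whichever way is used, up-front data processing accounts for a large share of the work.

Review: Theory, techniques, and algorithms of DM
- Statistical methods
- Decision trees
- Neural networks
- Rule induction
- Genetic algorithms
- Commonly used DM analysis tools

Review: Common DM analysis tools
- Case-based reasoning
- Data visualization
- Fuzzy query and analysis
- Knowledge discovery
- Neural networks

Case study: Safeway (UK)
- Company profile:
  - Safeway UK has annual sales of more than US$10 billion and nearly 70,000 employees. It is the third-largest supermarket chain in the UK and offers 34 kinds of services.
- Problem:
  - Competing in the UK market with traditional tactics, such as lower prices, more stores, and a wider range of products, had become increasingly difficult.
- Problem definition:
  - The business must be customer-oriented rather than product- or store-oriented.
  - It must understand every transaction made by its six million customers and how those transactions relate to one another.
  - Safeway UK wanted to know which kinds of customers bought which kinds of products, and how often, in order to build individually targeted marketing.

Case study: Safeway (UK)
- Data source:
  - The company began issuing credit cards to customers, who receive various discounts when they pay with the card. The card became the net with which the company gathered data on six million customers across 500 stores.
- Tools used: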
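  - IBM Intelligent Miner was used to extract business knowledge from the database.
  - Customers were divided into 150 segments according to their data. Association techniques were then used to compare these segments and produce a list of how attractive each product is.

Case study: Safeway (UK)
- Patterns found: thanks to data mining, associations were found that lie beyond what people would think to look for.
  - One cheese product ranked only 209th in sales, yet 25% of the highest-spending customers bought it regularly.
  - Of 28 brands of orange juice, 8 were especially popular. The company could therefore rearrange its shelves to maximize orange juice sales.
  - Once it is known which products a customer buys on each trip, the Sequence Discovery function of data mining can detect long-term, recurring purchasing behavior.
  - By combining this data with the demographic data in the main database, Safeway's marketing department can send out mailings based on each household's "weak spots", that is, its tendency to buy particular products in particular seasons.

Summary: Comparison of DM functions, algorithms, and applications

Summary: Overall comparison of common DM methods*

Summary: What DM cannot do
- DM cannot tell you the actual value of a model to your business. DM is a tool: it only helps business people analyze data more deeply and more easily, and any model obtained through DM must be validated in the real world.
- DM does not discover models automatically without guidance. The data analyst must know how the chosen DM tool works and the principles behind the algorithms it uses.
- DM will never replace the role of experienced business analysts or managers; it is only a powerful tool.

Conclusion
- More people, and more different kinds of people, are working in data mining.
- When data mining technology is introduced into a business, the focus is not the database itself but the business domain.
- Used properly, data mining technology can improve a company's competitive advantage.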
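
Supplementary sketch for the weather example: ID3 picks, at each node, the attribute with the highest information gain. The code below is not Weka's implementation. It is a minimal plain-Python illustration, the 14 instances are typed in as an assumption about what weather.nominal.arff contains, and the function names are invented for this example.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]

# Assumed contents of weather.nominal.arff; the last column is the class "play".
WEATHER = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]


def entropy(labels):
    """Entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attr_index):
    """Class entropy minus the weighted class entropy after splitting on one attribute."""
    labels = [row[-1] for row in rows]
    subsets = {}
    for row in rows:
        subsets.setdefault(row[attr_index], []).append(row[-1])
    remainder = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


def best_attribute(rows):
    """Return the name of the attribute with the highest gain, plus all gains."""
    gains = {name: information_gain(rows, i) for i, name in enumerate(ATTRIBUTES)}
    return max(gains, key=gains.get), gains


if __name__ == "__main__":
    best, gains = best_attribute(WEATHER)
    for name, gain in gains.items():
        print(f"{name}: {gain:.3f}")
    print("split on:", best)
```

On this data the first split falls on outlook. Applying the same function to only the rainy instances selects windy, which matches the subtree shown in the weather example.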

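
Supplementary sketch for the evaluation slides: N-fold cross-validation tests every instance exactly once, and the confusion matrix tabulates actual against predicted classes. The code below is a simplified illustration under stated assumptions: the function names are invented, the folds are not stratified by class (evaluation tools often do stratify), and the labels in the usage lines are made up.

```python
import random


def k_fold_splits(n_instances, k, seed=0):
    """Yield (train_indices, test_indices) pairs; each instance is tested exactly once."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds, like dealing cards.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def confusion_matrix(actual, predicted, classes):
    """matrix[a][p] counts instances of actual class a that were predicted as p."""
    matrix = {a: {p: 0 for p in classes} for a in classes}
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


def accuracy(matrix):
    """Fraction of instances on the diagonal of the confusion matrix."""
    correct = sum(matrix[c][c] for c in matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total


if __name__ == "__main__":
    # Hypothetical labels, only to show the shape of the output.
    actual = ["yes", "yes", "no", "no", "yes"]
    predicted = ["yes", "no", "no", "yes", "yes"]
    matrix = confusion_matrix(actual, predicted, ["yes", "no"])
    print(matrix, accuracy(matrix))
```

In a full evaluation, a model would be trained on each `train` list and its predictions on the matching `test` list would be pooled into one confusion matrix.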

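
Supplementary sketch for the bootstrap bullet: the figure 0.632 arises because sampling n instances with replacement leaves each instance out with probability (1 - 1/n)^n, which approaches 1/e, about 0.368. The training sample therefore holds roughly 63.2% of the distinct instances, and the rest serve as test data. The function names below are invented for illustration, and the weighted error formula is the usual 0.632 bootstrap estimate rather than something stated on the slides.

```python
import random


def bootstrap_split(n_instances, seed=0):
    """Sample n indices with replacement for training; unsampled indices form the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(n_instances) for _ in range(n_instances)]
    chosen = set(train)
    test = [i for i in range(n_instances) if i not in chosen]
    return train, test


def expected_distinct_fraction(n_instances):
    """Expected share of distinct instances that land in the training sample."""
    return 1 - (1 - 1 / n_instances) ** n_instances


def bootstrap_error(test_error, training_error):
    """0.632 bootstrap estimate: weight the test error by 0.632 and the training error by 0.368."""
    return 0.632 * test_error + 0.368 * training_error


if __name__ == "__main__":
    train, test = bootstrap_split(1000)
    print("distinct training instances:", len(set(train)))
    print("expected fraction:", round(expected_distinct_fraction(1000), 4))
```

The procedure is normally repeated several times with different samples and the resulting error estimates are averaged.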

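
Supplementary sketch for the association-rule parameters: coverage (support) counts the instances a rule predicts correctly, and accuracy (confidence) divides that count by the number of instances the rule applies to. The baskets below are entirely hypothetical and only echo the product names from the Safeway case; they are not data from it, and the function names are invented.

```python
def support_count(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t))


def confidence(transactions, antecedent, consequent):
    """Share of transactions matching the antecedent that also contain the consequent."""
    covered = support_count(transactions, antecedent)
    if covered == 0:
        return 0.0
    both = support_count(transactions, set(antecedent) | set(consequent))
    return both / covered


# Hypothetical baskets, made up for illustration only.
BASKETS = [
    {"cheese", "orange juice", "bread"},
    {"cheese", "orange juice"},
    {"cheese", "bread"},
    {"orange juice", "milk"},
    {"bread", "milk"},
]

if __name__ == "__main__":
    rule_support = support_count(BASKETS, {"cheese", "orange juice"})
    rule_confidence = confidence(BASKETS, {"cheese"}, {"orange juice"})
    print("support:", rule_support, "confidence:", round(rule_confidence, 3))
```

An association miner typically keeps only the rules whose support and confidence both exceed user-chosen minimums.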



