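Syllabus: Data Mining and Management Decision

Course code: 20157
Course title (English): Data Mining and Management Decision
Course category: core major course (bilingual)
Prerequisites: Statistics, Linear Algebra, Management
Follow-up course: Enterprise Resource Planning
Credits: 3
Hours: 51
Textbook: Data Mining: Introductory and Advanced Topics (reprint edition). Margaret H. Dunham. Tsinghua University Press, October 2003.

Course overview:
Data mining is a data processing and analysis technology that has developed in recent years alongside the large-scale deployment of database systems and the widespread use of the World Wide Web. It is an emerging technology formed at the intersection of three fields: databases, machine learning, and statistics. This course systematically introduces the basic concepts, methods, and algorithms of data mining, and combines software demonstrations with management decision case studies so that students learn both data mining and its application. The course consists of four parts. Part 1 is an introduction that covers the background of data mining, related concepts, and the main techniques it uses. Part 2 covers the core data mining algorithms, describing in depth the common algorithms for classification, clustering, and association rules. Part 3 covers advanced topics in data mining, mainly Web mining, spatial data mining, and temporal and sequence data mining. Data mining techniques uncover useful information hidden in data and, from it, previously undiscovered knowledge. This information and knowledge support business competition, enterprise production and management, government decision making, and scientific exploration, and are of significant value in helping managers make sound decisions.

Chapter 7: Web Mining
Hours: 6
Teaching requirements:
Through this chapter, students become familiar with advanced data mining techniques and methods, including Web content mining (crawlers, the Harvest system, virtual Web views), Web structure mining (PageRank, Clever), and Web usage mining (preprocessing, data structures, pattern discovery, pattern analysis).
Content:
7.1 Introduction
7.2 Web Content Mining
7.3 Web Structure Mining
7.4 Web Usage Mining
Exercises:
1. Construct the trie for the string [...].
2. The use of a Web server through a proxy (such as an ISP) complicates the collection of frequent sequence statistics. Suppose that two users use one proxy and have the following sessions:
User 1: [...]
When these are viewed together by the Web server (taking into account the time stamps), one large session is generated: [...]
Identify the maximal frequent sequences assuming a minimum support of 2. What are the maximal frequent sequences if the two users could be separated?
3. Perform a literature survey concerning current research into solutions to the proxy problem identified in Exercise 6.

Chapter 8: Spatial Mining
Hours: 6
Teaching requirements:
Through this chapter, students become familiar with the basic concepts of spatial data (spatial queries, spatial data structures, thematic maps, and image databases); spatial data mining primitives; generalization and specialization (progressive refinement, generalization, nearest neighbor, STING); spatial rules (spatial association rules, spatial classification algorithms, the ID3 extension, spatial decision trees); and spatial clustering algorithms (the CLARANS extensions, SD(CLARANS), DBCLASD, BANG, WaveCluster, and approximation).
Content:
8.1 Introduction
8.2 Spatial Data Overview
8.3 Spatial Data Mining Primitives
8.4 Generalization and Specialization
8.5 Spatial Rules
8.6 Spatial Classification Algorithm
8.7 Spatial Clustering Algorithms
Exercises:
1. Compare the R-tree to the R*-tree.
2. Another commonly used spatial index is the grid file. Define a grid file. Compare it to a k-D tree and a quad tree. Show the grid file that would be used to index the data found in Figure 8.5.

Chapter 9: Temporal Mining
Hours: 6
Teaching requirements:
Through this chapter, students become familiar with modeling temporal events; time series (time series analysis, trend analysis, transformation, similarity, prediction); pattern detection; sequences (AprioriAll, SPADE, feature extraction); and temporal association rules (inter-transaction rules, episode rules, trend dependencies, sequence association rules, calendric association rules). The emphasis is on explaining data analysis methods through management cases.
Content:
9.1 Introduction
9.2 Modeling Temporal Events
9.3 Time Series
9.4 Pattern Detection
9.5 Sequences
9.6 Temporal Association Rules
Exercises:
1. Assume that you are given the following temperature values, Z_t, taken at 5-minute time intervals: (50, 52, 55, 58, 60, 57, 66, 62, 60). Plot both [...] and Z_t. Does there appear to be an autocorrelation? Calculate the correlation coefficient.
2. Plot the following time series values as well as the moving average found by replacing a given value with the average of it and the ones preceding and following it: (5, 15, 7, 20, 13, 5, 8, 10, 12, 11, 9, 15). For the first and last values, use only the two values available to calculate the average.
3. Investigate and describe two techniques which have been used to predict future stock prices.

Appendix: Reference books
1. Introduction to Data Mining (complete edition). Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Translated by Fan Ming, Fan Hongjian, et al. Posts & Telecom Press, 2016.
2. Data Mining: Concepts and Techniques. Jiawei Han, Micheline Kamber. Translated by Fan Ming, Meng Xiaofeng, et al. China Machine Press, 2007.
3. SPSS Modeler Data Mining Methods and Applications (2nd edition). Xue Wei, Chen Huange. Publishing House of Electronics Industry, 2014.

Teaching objectives:
After more than a decade of development, data mining technology has produced a number of important results, and its basic concepts, principles, and algorithms in particular have become increasingly well defined. The basic technical conditions for offering this course are therefore in place. The course focuses on basic concepts and basic algorithms. As a course on advanced data processing and analysis technology, it aims to help students understand where information processing technology is heading, together with the concepts, principles, and methods of data mining itself. Teaching draws on management decision cases and is supplemented by discussion and exploration of frontier problems, giving students a knowledge base for future research and study and preparing them for the management needs of the big data era.

Teaching methods:
Classroom teaching relies mainly on multimedia lectures, supplemented by case teaching, class discussion, and software applications.

Teaching requirements and key points by chapter

Chapter 1: Introduction
Hours: 3
Teaching requirements:
Through this chapter, students become familiar with the basic concepts of data mining; data mining techniques, including classification, regression, time series analysis, prediction, clustering, association rules, and sequence discovery; the relationship between data mining and knowledge discovery in databases; and the impact of data mining on future management decisions and social development.
Content:
1.1 Basic Data Mining Tasks
1.2 Data Mining Versus Knowledge Discovery in Databases
1.3 Data Mining Issues
1.4 Data Mining Metrics
1.5 Social Implications of Data Mining
1.6 Data Mining from a Database Perspective
1.7 The Future
Exercises:
1. Identify and describe the phases in the KDD process. How does KDD differ from data mining?
2. Find at least three examples of data mining applications that have appeared in the business section of your local publication. Describe the data mining application involved in each.

Chapter 2: Related Concepts
Hours: 4
Teaching requirements:
Through this chapter, students become familiar with concepts related to data processing and master the concepts behind the following methods and their applications: database/OLTP systems, fuzzy sets and fuzzy logic, information retrieval, decision support systems, dimensional modeling, multidimensional schemas, indexing, data warehousing, Web search engines, machine learning, and pattern matching.
Content:
2.1 Database/OLTP Systems
2.2 Fuzzy Sets and Fuzzy Logic
2.3 Information Retrieval
2.4 Decision Support Systems
2.5 Dimensional Modeling
2.6 Indexing
2.7 Data Warehousing
2.8 OLAP
2.9 Web Search Engines
2.10 Statistics
2.11 Machine Learning
Exercises:
1. Compare and contrast database, information retrieval, and data mining queries. What metrics are used to measure the performance of each type of query?
2. Data warehouses are often viewed as containing relatively static data. Investigate techniques that have been proposed to provide updates to this data from the operational data. How often should these updates occur?

Chapter 3: Data Mining Techniques
Hours: 4
Teaching requirements:
Through this chapter, students become familiar with the basic formulas and calculation steps of data mining techniques, including statistical methods, Bayes' theorem, regression and correlation, decision trees, similarity, neural networks, activation functions, and genetic algorithms.
Content:
3.1 Introduction
3.2 A Statistical Perspective on Data Mining
3.3 Similarity Measures
3.4 Decision Trees
3.5 Neural Networks
3.6 Genetic Algorithms
Exercises:
1. Given the following set of values (1, 3, 9, 15, 20), determine the jackknife estimate for both the mean and the standard deviation of the mean.
2. Find the similarity between [...] and [...] using the Dice, Jaccard, and Cosine similarity measures.
3. Given the decision tree in Figure 3.5, classify each of the following students: [...].

Chapter 4: Classification
Hours: 8
Teaching requirements:
Students become familiar with the issues in classification and with data analysis methods, including statistical-based algorithms (such as regression and Bayesian classification), distance-based algorithms (K nearest neighbors), decision tree-based algorithms, neural networks, rule-based algorithms, and other combining techniques.
Content:
4.1 Introduction
4.2 Statistical-Based Algorithms
4.3 Distance-Based Algorithms
4.4 Decision Tree-Based Algorithms
4.5 Neural Network-Based Algorithms
4.6 Rule-Based Algorithms
4.7 Combining Techniques
Exercises:
1. Apply the method of least squares to determine the division between medium and tall persons using the training data in Table 4.1 and the classification shown in the Output1 column (see Example 4.3). You may use either the division technique or the prediction technique.
2. Explain the difference between P(t_i | C_j) and P(C_j | t_i).
3. Compare at least three different guidelines that have been proposed for determining the optimal number of hidden nodes in an NN.
4. Various classification algorithms can be found online. Apply these programs to the height example in Table 4.1 using the training classification shown in the Output2 column.

Chapter 5: Clustering
Hours: 6
Teaching requirements:
Students master similarity and distance measures, outliers, hierarchical algorithms, partitional algorithms (minimum spanning tree, squared error clustering, K-means clustering, the nearest neighbor algorithm, and others), clustering of large databases (the BIRCH, DBSCAN, and CURE algorithms), and clustering with categorical attributes.
Content:
5.1 Introduction
5.2 Similarity and Distance Measures
5.3 Outliers
5.4 Hierarchical Algorithms
5.5 Partitional Algorithms
5.6 Clustering Large Databases
5.7 Clustering with Categorical Attributes
5.8 Comparison
Exercises:
1. Show the dendrogram created by the single, complete, and average link clustering algorithms using the following adjacency matrix.

Item  A  B  C  D
A     0  1  4  5
B     1  0  2  6
C     4  3  0  3
D     5  6  3  0

2. A major problem with the single link algorithm is that clusters consisting of long chains may be created. Describe and illustrate this concept.
3. Trace the use of the nearest neighbor algorithm on the data of Exercise 1 assuming a threshold of 3.
4. Perform a survey of recently proposed clustering algorithms. Identify where they fit in the classification tree in Figure 5.2. Try to describe their approach and performance.

Chapter 6: Association Rules
Hours: 8
Teaching requirements:
Through this chapter, students become familiar with large itemsets, the basic algorithms (the Apriori algorithm, the sampling algorithm, partitioning), parallel and distributed algorithms, comparison of approaches, incremental rules, advanced association rule techniques and correlation rules, and how to measure the quality of rules. The chapter also applies these methods to real cases.
Content:
6.1 Introduction
6.2 Large Itemsets
6.3 Basic Algorithms
6.4 Parallel and Distributed Algorithms
6.5 Comparing Approaches
6.6 Incremental Rules
6.7 Advanced Association Rule Techniques
6.8 Measuring the Quality of Rules
Exercises:
1. Trace the results of using the Apriori algorithm on the grocery store example with s = 20% and α = 40%. Be sure to show the candidate and large itemsets for each database scan. Also indicate the association rules that will be generated.
2. Trace the results of using the sampling algorithm on the clothing store example with s = 20% and α = 40%. Be sure to show the use of the negative border function as well as the candidate and large itemsets for each database scan.
3. Calculate the lift and conviction for the rules shown in Table 6.3. Compare these to the support and confidence shown.
4. Perform a survey of recent research examining techniques to generate rules incrementally.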
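
As a starting point for Exercise 3 of Chapter 6, the following minimal sketch computes support, confidence, lift, and conviction for a rule X => Y from transaction counts. It assumes the usual definitions, lift = P(X,Y) / (P(X) P(Y)) and conviction = P(X) P(not Y) / P(X, not Y). The function name `rule_metrics` and the counts in the example are hypothetical and are not taken from Table 6.3.

```python
import math


def rule_metrics(n, count_x, count_y, count_xy):
    """Quality measures for the rule X => Y, computed from raw counts.

    n        : total number of transactions
    count_x  : transactions containing X
    count_y  : transactions containing Y
    count_xy : transactions containing both X and Y
    """
    support = count_xy / n
    confidence = count_xy / count_x
    # lift = P(X,Y) / (P(X) * P(Y)), written with counts to avoid rounding
    lift = (count_xy * n) / (count_x * count_y)
    # conviction = P(X) * P(not Y) / P(X, not Y)
    count_x_not_y = count_x - count_xy
    if count_x_not_y == 0:
        conviction = math.inf  # the rule is never violated
    else:
        conviction = (count_x * (n - count_y)) / (n * count_x_not_y)
    return {
        "support": support,
        "confidence": confidence,
        "lift": lift,
        "conviction": conviction,
    }


# Hypothetical counts: 10 transactions, X in 4, Y in 5, both in 3.
metrics = rule_metrics(n=10, count_x=4, count_y=5, count_xy=3)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

A lift above 1 suggests that X and Y occur together more often than they would if independent, and a conviction above 1 suggests that the rule is violated less often than independence would predict. Comparing these values with support and confidence is the point of the exercise.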