Data Mining and Management Decision: Course Syllabus.docx


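Data Mining and Management Decision: Course Syllabus

Course code: 20157
English title: Data Mining and Management Decision
Course category: Core major course (bilingual)
Prerequisites: Statistics, Linear Algebra, Management
Follow-up course: Enterprise Resource Planning
Credits: 3
Class hours: 51
Textbook: Margaret H. Dunham, Data Mining: Introductory and Advanced Topics (reprint edition), Tsinghua University Press, October 2003.

Course overview: Data mining is a data processing and analysis technology that has developed in recent years alongside the large-scale deployment of database systems and the widespread use of the World Wide Web. It is an emerging technology formed at the intersection of three fields: databases, machine learning, and statistics. This course systematically introduces the basic concepts, methods, and algorithms of data mining, and combines software demonstrations with management decision case analysis so that students learn both data mining and its application. The course has three parts. Part one is an introduction that covers the background of data mining, related concepts, and the main techniques data mining uses. Part two covers the core algorithms of data mining and describes in depth the common algorithms for classification, clustering, and association rules. Part three covers advanced topics: Web mining, spatial data mining, and temporal and sequence data mining. Data mining techniques uncover useful information hidden in data and, from it, previously undiscovered knowledge. This provides information and knowledge for business competition, enterprise production and management, government decision-making, and scientific exploration, and is of great value in helping managers make sound decisions.

Chapter 7 Web Mining
Class hours: 6
Teaching requirements: Through this chapter, students become familiar with advanced data mining techniques and methods: Web content mining (crawlers, the Harvest system, virtual Web views), Web structure mining (PageRank, Clever), and Web usage mining (preprocessing, data structures, pattern discovery, pattern analysis).
Teaching content:
7.1 Introduction; Web Content Mining; Web Structure Mining; Web Usage Mining
Exercises:
1. Construct the trie for the string <A B A C>.
2. The use of a Web server through a proxy (such as an ISP) complicates the collection of frequent sequence statistics. Suppose that two users use one proxy and have the following sessions:
User 1: <1,3,1,3,4,3,6,8,2,3,6>
User 2: <2,3,4,3,6,8,6,3,1>
When these are viewed together by the Web server (taking into account the time stamps), one large session is generated:
<1,2,3,3,4,1,3,6,3,8,4,3,6,3,6,1,8,2,3,6>
Identify the maximal frequent sequences assuming a minimum support of 2. What are the maximal frequent sequences if the two users could be separated?
3. Perform a literature survey concerning current research into solutions to the proxy problem identified in Exercise 2.

Chapter 8 Spatial Mining
Class hours: 6
Teaching requirements: Through this chapter, students become familiar with basic spatial data concepts (spatial queries, spatial data structures, thematic maps, and image databases); spatial data mining primitives; generalization and specialization (progressive refinement, generalization, nearest neighbor, STING); spatial rules (spatial association rules, spatial classification algorithms, extensions to ID3, spatial decision trees); and spatial clustering algorithms (extensions to CLARANS, SD(CLARANS), DBCLASD, BANG, WaveCluster, and approximation).
Teaching content:
8.1 Introduction
8.2 Spatial Data Overview
8.3 Spatial Data Mining Primitives
8.4 Generalization and Specialization
8.5 Spatial Rules
8.6 Spatial Classification Algorithm
8.7 Spatial Clustering Algorithms
Exercises:
1. Compare the R-tree to the R*-tree.
2. Another commonly used spatial index is the grid file. Define a grid file. Compare it to a k-D tree and a quad tree. Show the grid file that would be used to index the data found in Figure 8.5.

Chapter 9 Temporal Mining
Class hours: 6
Teaching requirements: Through this chapter, students become familiar with modeling temporal events, time series (time series analysis, trend analysis, transformation, similarity, prediction), pattern detection, sequences (AprioriAll, SPADE, feature extraction), and temporal association rules (intertransaction rules, episode rules, trend dependencies, sequence association rules, calendric association rules). The emphasis is on explaining data analysis methods through management cases.
Teaching content:
9.1 Introduction; Modeling Temporal Events; Time Series; Pattern Detection
9.2 Sequences; Temporal Association Rules
Exercises:
1. Assume that you are given the following temperature values, Z_t, taken at 5-minute time intervals: {50, 52, 55, 58, 60, 57, 66, 62, 60}. Plot both the lagged values of Z_t and Z_t itself. Does there appear to be an autocorrelation? Calculate the correlation coefficient.
2. Plot the following time series values as well as the moving average found by replacing a given value with the average of it and the ones preceding and following it: {5, 15, 7, 20, 13, 5, 8, 10, 12, 11, 9, 15}. For the first and last values, use only the two values available to calculate the average.
3. Investigate and describe two techniques which have been used to predict future stock prices.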
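The smoothing described in Exercise 2 of Chapter 9 can be sketched in Python as follows. This is only an illustrative sketch: the function name `smooth` is an assumption made here, not something taken from the textbook, and plotting is left out so that the block needs no third-party libraries.

```python
def smooth(values):
    """Centered 3-point moving average.

    Each value is replaced by the mean of itself and its two neighbors.
    The first and last values have only one neighbor, so they are
    averaged over the two values that are available.
    """
    smoothed = []
    for i in range(len(values)):
        # Slicing clips automatically at the end of the list;
        # max(0, ...) clips at the start.
        window = values[max(0, i - 1): i + 2]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Series from Exercise 2 of Chapter 9.
series = [5, 15, 7, 20, 13, 5, 8, 10, 12, 11, 9, 15]
averaged = smooth(series)

for original, average in zip(series, averaged):
    print(f"{original:>3}  {average:6.2f}")
```

Plotting `series` and `averaged` on the same axes should show how the moving average damps short-term swings such as the jump from 7 to 20.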
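Appendix: References
1. Pang-Ning Tan, Michael Steinbach, Vipin Kumar, 《数据挖掘导论（完整版）》 (Introduction to Data Mining, Complete Edition), translated by Fan Ming, Fan Hongjian, et al., Posts & Telecom Press, 2016.
2. Jiawei Han, Micheline Kamber, 《数据挖掘概念与技术》 (Data Mining: Concepts and Techniques), translated by Fan Ming, Meng Xiaofeng, et al., China Machine Press, 2007.
3. Xue Wei, Chen Huange, 《SPSS Modeler数据挖掘方法及应用（第2版）》 (SPSS Modeler Data Mining Methods and Applications, 2nd edition), Publishing House of Electronics Industry, 2014.

Teaching objectives: After more than a decade of development, data mining has produced a number of important results, and its basic concepts, principles, and algorithms in particular have become increasingly well defined. The technical foundations for offering this course are therefore in place. The course focuses on basic concepts and basic algorithms. Treating data mining as an advanced data processing and analysis technology, it aims to give students an understanding of where information processing technology is heading and of the concepts, principles, and methods of data mining itself. Teaching also draws on management decision cases, supplemented by discussion and exploration of frontier issues, to give students a knowledge base for future research and study and to prepare them for the management needs of the big data era.

Teaching methods: Classroom teaching mainly uses multimedia lectures, supported by case teaching, class discussion, and software application.

Teaching requirements and key points by chapter

Chapter 1 Introduction
Class hours: 3
Teaching requirements: Through this chapter, students become familiar with basic data mining concepts and data mining techniques, including classification, regression, time series analysis, prediction, clustering, association rules, and sequence discovery, as well as data mining and knowledge discovery in databases (KDD) and the impact of data mining on future management decisions and social development.
Teaching content:
1.1 Basic Data Mining Tasks; Data Mining Versus Knowledge Discovery in Databases
1.2 Data Mining Issues; Data Mining Metrics; Social Implications of Data Mining; Data Mining from a Database Perspective
1.3 The Future
Exercises:
1. Identify and describe the phases in the KDD process. How does KDD differ from data mining?
2. Find at least three examples of data mining applications that have appeared in the business section of your local publication. Describe the data mining applications involved.

Chapter 2 Related Concepts
Class hours: 4
Teaching requirements: Through this chapter, students understand concepts related to data processing and master the concepts behind the following methods and their applications: database/OLTP systems, fuzzy sets and fuzzy logic, information retrieval, decision support systems, dimensional modeling, multidimensional schemas, indexing, data warehousing, Web search engines, machine learning, and pattern matching.
Teaching content:
2.1 Database/OLTP Systems; Fuzzy Sets and Fuzzy Logic
2.2 Information Retrieval; Decision Support Systems; Dimensional Modeling; Indexing
2.3 Data Warehousing; OLAP; Web Search Engines; Statistics
2.4 Machine Learning
Exercises:
1. Compare and contrast database, information retrieval, and data mining queries. What metrics are used to measure the performance of each type of query?
2. Data warehouses are often viewed as containing relatively static data. Investigate techniques that have been proposed to provide updates to this data from the operational data. How often should these updates occur?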
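For Exercise 1 of Chapter 2, precision and recall are two metrics commonly used to measure the performance of information retrieval queries. The sketch below shows how they are computed. The document identifiers and the helper name `precision_recall` are hypothetical and chosen only for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    An empty denominator yields 0.0 rather than raising an error.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: four documents returned, three truly relevant.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d2", "d4", "d5"]

precision, recall = precision_recall(retrieved_docs, relevant_docs)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were retrieved, which illustrates why the two metrics usually have to be reported together.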
Chapter 3 Data Mining Techniques
Class hours: 4
Teaching requirements: Through this chapter, students become familiar with the basic formulas and computational steps of data mining techniques, including statistical methods, Bayes' theorem, regression and correlation, decision trees, similarity measures, neural networks, activation functions, and genetic algorithms.
Teaching content:
3.1 Introduction; A Statistical Perspective on Data Mining; Similarity Measures; Decision Trees
3.2 Neural Networks; Genetic Algorithms
Exercises:
1. Given the following set of values {1, 3, 9, 15, 20}, determine the jackknife estimate for both the mean and standard deviation of the mean.
2. Find the similarity between <0 1 0.5 0.3 1> and <1 0 0.5 0 0> using the Dice, Jaccard, and Cosine similarity measures.
3. Given the decision tree in Fig. 3.5, classify each of the following students: <Mary, 20, F, 2 m, Senior, Math>, <Dave, 19, M, 1.7 m, Sophomore, Computer Science>, and <Martha, 18, F, 1.2 m, Freshman, English>.

Chapter 4 Classification
Class hours: 8
Teaching requirements: Students understand the issues in classification and the associated data analysis methods, including statistical-based algorithms (such as regression and Bayesian classification), distance-based algorithms (K nearest neighbors), decision tree-based algorithms, neural networks, rule-based algorithms, and other combining techniques.
Teaching content:
4.1 Introduction; Statistical-Based Algorithms; Distance-Based Algorithms; Decision Tree-Based Algorithms
4.2 Neural Network-Based Algorithms; Rule-Based Algorithms; Combining Techniques
Exercises:
1. Apply the method of least squares to determine the division between medium and tall persons using the training data in Table 4.1 and the classification shown in the Output1 column (see Example 4.3). You may use either the division technique or the prediction technique.
2. Explain the difference between P(t_i | C_j) and P(C_j | t_i).
3. Compare at least three different guidelines that have been proposed for determining the optimal number of hidden nodes in an NN.
4. Various classification algorithms can be found online. Apply these programs to the height example in Table 4.1 using the training classification shown in the Output2 column.

Chapter 5 Clustering
Class hours: 6
Teaching requirements: Students master similarity and distance measures, outliers, hierarchical algorithms, partitional algorithms (minimum spanning tree, squared error clustering, K-means clustering, the nearest neighbor algorithm, etc.), clustering of large databases (the BIRCH, DBSCAN, and CURE algorithms), and clustering with categorical attributes.
Teaching content:
5.1 Introduction; Similarity and Distance Measures; Outliers; Hierarchical Algorithms
5.2 Partitional Algorithms; Clustering Large Databases; Clustering with Categorical Attributes; Comparison
Exercises:
1. Show the dendrogram created by the single, complete, and average link clustering algorithms using the following adjacency matrix.
Item  A  B  C  D
A     0  1  4  5
B     1  0  2  6
C     4  3  0  3
D     5  6  3  0
2. A major problem with the single link algorithm is that clusters consisting of long chains may be created. Describe and illustrate this concept.
3. Trace the use of the nearest neighbor algorithm on the data of Exercise 1 assuming a threshold of 3.
4. Perform a survey of recently proposed clustering algorithms. Identify where they fit in the classification tree in Figure 5.2. Try to describe their approach and performance.

Chapter 6 Association Rules
Class hours: 8
Teaching requirements: Through this chapter, students become familiar with large itemsets, basic algorithms (the Apriori algorithm, the sampling algorithm, partitioning), parallel and distributed algorithms, comparison of approaches, incremental rules, advanced association rule techniques, and how to measure the quality of rules, and apply them in analyses of real cases.
Teaching content:
6.1 Introduction; Large Itemsets; Basic Algorithms; Parallel and Distributed Algorithms
6.2 Comparing Approaches; Incremental Rules; Advanced Association Rule Techniques; Measuring the Quality of Rules
Exercises:
1. Trace the results of using the Apriori algorithm on the grocery store example with s = 20% and α = 40%. Be sure to show the candidate and large itemsets for each database scan. Also indicate the association rules that will be generated.
2. Trace the results of using the sampling algorithm on the clothing store example with s = 20% and α = 40%. Be sure to show the use of the negative border function as well as the candidate and large itemsets for each database scan.
3. Calculate the lift and conviction for the rules shown in Table 6.3. Compare these to the shown support and confidence.
4. Perform a survey of recent research examining techniques to generate rules incrementally.
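Exercise 3 of Chapter 6 asks for lift and conviction alongside support and confidence. The sketch below computes all four for a single rule A ⇒ B using the usual definitions: confidence = supp(A ∪ B) / supp(A), lift = confidence / supp(B), and conviction = (1 − supp(B)) / (1 − confidence). The transactions are hypothetical and are not the textbook's Table 6.3; the function names are likewise assumptions made for this illustration.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    matching = sum(1 for t in transactions if itemset <= t)
    return matching / len(transactions)


def rule_metrics(antecedent, consequent, transactions):
    """Support, confidence, lift, and conviction for antecedent => consequent.

    Assumes the antecedent and the consequent each occur in at least
    one transaction, so the divisions below are well defined.
    """
    supp_a = support(antecedent, transactions)
    supp_b = support(consequent, transactions)
    supp_ab = support(set(antecedent) | set(consequent), transactions)
    confidence = supp_ab / supp_a
    lift = confidence / supp_b
    # Conviction is unbounded when the rule never fails (confidence == 1).
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - supp_b) / (1 - confidence)
    return {
        "support": supp_ab,
        "confidence": confidence,
        "lift": lift,
        "conviction": conviction,
    }


# Hypothetical market-basket data, for illustration only.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
    {"bread", "butter", "jam"},
]

metrics = rule_metrics({"bread"}, {"butter"}, transactions)
for name, value in metrics.items():
    print(f"{name:>10}: {value:.2f}")
```

A lift above 1 suggests that the antecedent and consequent occur together more often than they would if independent, which is information that support and confidence alone do not convey.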
