South China University of Technology, School of Computer Science and Engineering
Final Examination, First Semester, Academic Year 2005-2006
Data Warehouse and Data Mining Technology: Examination Paper

Major: Bilingual class    Year: 2002    Name:    Student ID:

Notes:
1. This paper has four parts, for a total of 100 points; the examination time is 120 minutes.
2. Write all answers directly on the examination paper.

Part     I    II    III    IV    Total
Score

I. Fill in the following blanks. (1 point per blank, the total: 20 points)

1. A data warehouse is a ____, ____, ____ and ____ collection of data in support of management's decision-making process.
2. The most popular data model for a data warehouse is a multidimensional model. Such a model can exist in the form of a ____ schema, a ____ schema, or a ____ schema.
3. List four OLAP operations: ____, ____, ____, and ____.
4. Measures can be organized into the following three categories, based on the kind of aggregate functions used: ____, ____, and ____.
5. For interestingness measures of a pattern, there are four objective measures: ____, ____, ____ and novelty.
6. List three knowledge types to be mined: ____, ____, and ____.

II. Miscellaneous questions. (8 points per question, the
total: 40 points)

1. Suppose that the data for analysis include the attribute age. The age values for the data tuples are: 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
(a). Use min-max normalization to transform the value 35 for age onto the range [0.0, 1.0].
(b). Use z-score normalization to transform the value 35 for age, where the standard deviation of age is 12.94 years.
(c). Use normalization by decimal scaling to transform the value 35 for age.

2. Consider Association Rule (1) below, which was mined from a university database:
    major(X, "science") ⇒ status(X, "undergraduate").
Suppose that the number of students at the university (that is, the number of task-relevant data tuples) is 5000, that 56% of undergraduates at the university major in science, that 64% of the students are registered in programs leading to undergraduate degrees, and that 70% of the students are majoring in science.
(a). Compute the confidence and support of rule (1).
(b). Consider Rule (2) below:
    major(X, "biology") ⇒ status(X, "undergraduate") [17%, 80%].
Suppose that 30% of science students are majoring in biology. Would you consider rule (2) to be novel with respect to rule (1)? Explain.

3. Given the following table (Table
1):

Table 1
Location \ Item    DVD    TV     Computer
Guangzhou          280    180    340
Beijing            260    220    320
Shanghai           360    200    240

(1). Map the class Beijing (target class) into a (bi-directional) quantitative descriptive rule. For example, ∀X, Guangzhou(X) ⇔ TV(X) [t: x%, d: y%].
(2). Map the class Computer (target class) into a (bi-directional) quantitative descriptive rule.

4. A partitioning variation of Apriori subdivides the transactions of a database D into n nonoverlapping partitions. Prove that any itemset that is frequent in D must be frequent in at least one partition of D.

5. Prove that the constraint sum(S) ≥ v (∀a ∈ S, a ≥ 0) is monotone, and that sum(S) ≤ v (∀a ∈ S, a ≥ 0) is antimonotone.

III. Problems. (The total: 30 points)

1. Given the following transaction database (Table 2), the minimum support is 60% and the minimum confidence is 80%.
(1). Find all frequent patterns using the Apriori algorithm, and generate strong association rules from L3 (i.e. the frequent 3-pattern). Assume the support count is 2 and the confidence is 80%. (12 points)
(2). Draw the frequent pattern tree. (6 points)

Table 2
T1    I1 I2 I6
T2    I1 I3 I5 I6
T3    I1 I2 I6
T4    I1 I3 I4
T5    I1 I2 I4 I6

2. Table 3 presents a training set of data tuples about whether to play basketball. Given a tuple (Outlook=sunny, Temperature=cool, Humidity=high, Wind=strong), decide whether the target class Playbasketball is yes or no using the naive Bayesian classifier. (18 points)

Table 3
No.   Outlook     Temperature   Humidity   Wind      Playbasketball
1     Overcast    Hot           High       Weak      Yes
2     Sunny       Hot           High       Weak      No
3     Sunny       Hot           High       Strong    No
4     Overcast    Hot           Normal     Weak      Yes
5     Rain        Mild          High       Weak      Yes
6     Sunny       Cool          Normal     Weak      Yes
7     Rain        Cool          Normal     Weak      Yes
8     Rain        Mild          Normal     Weak      Yes
9     Rain        Cool          Normal     Strong    No
10    Overcast    Cool          Normal     Strong    Yes
11    Sunny       Mild          High       Weak      No
12    Overcast    Mild          High       Strong    Yes

3. Table 4 presents distances between any two objects, e.g. the distance between objects 1 and 2 is 2.5. Assume the distance between two clusters d(C1, C2) is defined as follows:
    d(C1, C2) = Max{dij | i ∈ C1, j ∈ C2},
where C1, C2 are two clusters, dij is the distance between objects i and j, and Max is used to compute the minimum value of a set. Cluster the objects using the agglomerative hierarchical clustering method and draw the dendrogram (i.e. show how the clusters are merged hierarchically). (10 points)

Table 4
     1     2     3     4     5
1    0
2    9     0
3    4     6     0
4    8     5     2     0
5    10    7     3     5     0


Answers

(I) Omitted.

(II).1
Sum: 541; mean: 541/20 = 27.05.
Sum of squared deviations:
(13-27.05)^2 + (15-27.05)^2 + (16-27.05)^2 + (16-27.05)^2 + (19-27.05)^2 + (20-27.05)^2 + (20-27.05)^2 + (21-27.05)^2 + (22-27.05)^2 + (25-27.05)^2 + (25-27.05)^2 + (30-27.05)^2 + (33-27.05)^2 + (33-27.05)^2 + (35-27.05)^2 + (35-27.05)^2 + (35-27.05)^2 + (36-27.05)^2 + (40-27.05)^2 + (52-27.05)^2 = 1960.95
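The sum, mean, and sum of squared deviations above can be checked with a short sketch. It assumes the 20 values that appear in the squared-deviation terms of this answer; the variable names are illustrative only.

```python
# The 20 age values that appear in the squared-deviation terms of the answer.
values = [13, 15, 16, 16, 19, 20, 20, 21, 22, 25,
          25, 30, 33, 33, 35, 35, 35, 36, 40, 52]

total = sum(values)                  # sum of all values
mean = total / len(values)           # arithmetic mean
# Sum of squared deviations from the mean.
sum_sq_dev = sum((v - mean) ** 2 for v in values)

print(total, round(mean, 2), round(sum_sq_dev, 2))
```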
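Standard deviation: (1960.95/(20-1))^(1/2) = 10.16
(a) 0 + (30-13)/(52-13) × (1.0-0) = 0.44
(b) (30-27.05)/10.16 = 0.29
(c) 30/100 = 0.3

(II).2
(a) Confidence: (55% × 10000)/(70% × 10000) = 55/70 = 78.57%
    Support: (55% × 10000)/10000 = 55%
(b) Because 55% × 33% = 18.15%, R2 is not very meaningful.

(II).3
(a) ∀X, Guangzhou(X) ⇔ DVD(X) [t = 280/1000, d = 280/900] ∨ TV(X) [t = 380/1000, d = 380/1100] ∨ Computer(X) [t = 340/1000, d = 340/1200].
(b) ∀X, Computer(X) ⇔ Guangzhou(X) [t = 340/1200, d = 340/1000] ∨ Beijing(X) [t = 320/1200, d = 320/800] ∨ Shanghai(X) [t = 540/1200, d = 540/1400].

(III).1
Item counts: I1:5  I2:3  I3:2  I4:2  I5:1  I6:4
2-patterns: I1I2:3  I1I6:4  I2I6:3
3-pattern: I1I2I6:3

(IV)
P(Outlook=sunny|yes) = 1/7        P(Outlook=sunny|no) = 3/5
P(Temperature=cool|yes) = 3/7     P(Temperature=cool|no) = 1/5
P(Humidity=high|yes) = 2/7        P(Humidity=high|no) = 4/5
P(Wind=strong|yes) = 2/7          P(Wind=strong|no) = 3/5
P(yes) = 7/12                     P(no) = 5/12
P(X|yes) P(yes) = 1/7 × 3/7 × 2/7 × 2/7 × 7/12 = 0.00292
P(X|no) P(no) = 3/5 × 1/5 × 4/5 × 3/5 × 5/12 = 0.024
Since 0.024 > 0.00292, the tuple is classified as Playbasketball = no.

(V)
FP-tree path: Null, I1:5, I6:4, I2:3

Initial distance matrix:
     1     2     3     4     5
1    0
2    3     0
3    4     9     0
4    8     5     2     0
5    10    7     6     5     0

Step 1: merge 3 and 4 to obtain {3,4}.
Distance between {3,4} and 1: min{4,8} = 4
Distance between {3,4} and 2: min{5,9} = 5
Distance between {3,4} and 5: min{5,6} = 5

Distance matrix after step 1:
         1     2     {3,4}   5
1        0
2        3     0
{3,4}    4     5     0
5        10    7     5       0

Step 2: merge 1 and 2 to obtain {1,2}.
Distance between {1,2} and {3,4}: min{4,5} = 4
Distance between {1,2} and 5: min{10,7} = 7

Distance matrix after step 2:
         {1,2}   {3,4}   5
{1,2}    0
{3,4}    4       0
5        7       5       0

Step 3: merge {1,2} and {3,4} to obtain {1,2,3,4}.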
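
The merge steps above can be reproduced with a minimal sketch. It assumes the initial distance matrix used in this answer and the minimum (single-linkage) cluster distance that the steps apply; the names `DIST`, `linkage`, and `agglomerate` are illustrative and do not come from the exam.

```python
from itertools import combinations

# Pairwise distances from the initial matrix used in the answer steps.
DIST = {
    (1, 2): 3, (1, 3): 4, (1, 4): 8, (1, 5): 10,
    (2, 3): 9, (2, 4): 5, (2, 5): 7,
    (3, 4): 2, (3, 5): 6,
    (4, 5): 5,
}

def d(i, j):
    """Distance between two objects, looked up symmetrically."""
    return DIST[(min(i, j), max(i, j))]

def linkage(c1, c2):
    """Assumed cluster distance: smallest object distance across the two clusters."""
    return min(d(i, j) for i in c1 for j in c2)

def agglomerate(objects):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [frozenset([o]) for o in objects]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append((sorted(a), sorted(b), linkage(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges

merges = agglomerate([1, 2, 3, 4, 5])
for left, right, dist in merges:
    print(left, "+", right, "at distance", dist)
```

The merge history lists the clusters joined at each step together with the distance at which they join, which is the information needed to draw the dendrogram. The last merge, which joins object 5 to {1,2,3,4}, follows from the same rule even though the written steps stop at step 3.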