I. Consider the training examples for a binary classification problem shown in the table.
1. What is the entropy of the entire training set with respect to the class attribute?
2. What are the information gains of a1 and a2 relative to the training set?
3. For the continuous attribute a3, compute the information gain of every possible split.
4. According to information gain, which of a1, a2, and a3 is the best split?
5. According to classification error rate, which of a1 and a2 is better?
6. According to the Gini index, which of a1 and a2 is better?
Answer 1:
P(+) = 4/9 and P(−) = 5/9
−4/9 log2(4/9) − 5/9 log2(5/9) = 0.9911.
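The entropy calculation above can be sketched in a few lines; the counts (4 positive, 5 negative) are taken from the answer itself:

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a class distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# 4 positive and 5 negative training examples
print(round(entropy([4, 5]), 4))  # 0.9911
```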
Answer 2:
(Probably will not be examined.)
Answer 3:
Answer 4: According to information gain, a1 produces the best split.
Answer 5:
For attribute a1: error rate = 2/9.
For attribute a2: error rate = 4/9.
Therefore, according to error rate, a1 produces the best split.
Answer 6:
II. Consider the following data set for a binary classification problem.
1. Compute the information gain of attributes a and b. Which attribute would the decision tree induction algorithm choose?
2. Compute the Gini index of attributes a and b. Which attribute would decision tree induction choose?
(The given answer to this part is correct.)
3. Figure 4-13 shows that both entropy and the Gini index increase monotonically on [0, 0.5] and decrease monotonically on [0.5, 1]. Is it possible for information gain and the Gini-based gain to favor different attributes? Explain your reasoning.
Yes. Even though these measures have similar ranges and monotonic
behavior, their respective gains, Δ, which are scaled differences of the
measures, do not necessarily behave in the same way, as illustrated by
the results in parts (a) and (b).
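A sketch of how the two gains are computed as scaled differences of the impurity measures. The parent counts and candidate split below are illustrative placeholders, not the table from the exercise:

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gain(parent, children, impurity):
    """Delta = impurity(parent) - weighted average impurity of the children."""
    n = sum(parent)
    return impurity(parent) - sum(sum(ch) / n * impurity(ch) for ch in children)

# Hypothetical parent node and one candidate binary split
parent = [10, 10]
children = [[7, 3], [3, 7]]
print(round(gain(parent, children, entropy), 4))  # 0.1187
print(round(gain(parent, children, gini), 4))
```

Because the two impurity curves are scaled differently, ranking candidate splits by the entropy-based Δ and by the Gini-based Δ can produce different winners.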
Bayesian Classification
1. P(A = 1|−) = 2/5 = 0.4, P(B = 1|−) = 2/5 = 0.4,
P(C = 1|−) = 1, P(A = 0|−) = 3/5 = 0.6,
P(B = 0|−) = 3/5 = 0.6, P(C = 0|−) = 0; P(A = 1|+) = 3/5 = 0.6,
P(B = 1|+) = 1/5 = 0.2, P(C = 1|+) = 2/5 = 0.4,
P(A = 0|+) = 2/5 = 0.4, P(B = 0|+) = 4/5 = 0.8,
P(C = 0|+) = 3/5 = 0.6.
2.
3. P(A = 0|+) = (2 + 2)/(5 + 4) = 4/9,
P(A = 0|−) = (3+2)/(5 + 4) = 5/9,
P(B = 1|+) = (1 + 2)/(5 + 4) = 3/9,
P(B = 1|−) = (2+2)/(5 + 4) = 4/9,
P(C = 0|+) = (3 + 2)/(5 + 4) = 5/9,
P(C = 0|−) = (0+2)/(5 + 4) = 2/9.
4. Let P(A = 0,B = 1, C = 0) = K
5. When one of the conditional probabilities is zero, estimating the conditional probabilities with the m-estimate method is better, because we do not want the entire product to become zero.
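The smoothed estimates in answer 3 follow the m-estimate formula with m = 4 and prior p = 1/2 (hence the +2 in each numerator and +4 in each denominator). A minimal sketch:

```python
def m_estimate(n_c, n, m=4, p=0.5):
    """m-estimate of a conditional probability: (n_c + m*p) / (n + m).

    n_c: records in the class with the given attribute value
    n:   total records in the class
    m:   equivalent sample size; p: prior estimate (here m=4, p=1/2)
    """
    return (n_c + m * p) / (n + m)

# P(A = 0 | +): 2 of the 5 positive records have A = 0
print(m_estimate(2, 5))  # 4/9
# P(C = 0 | -): 0 of the 5 negative records, yet the estimate stays nonzero
print(m_estimate(0, 5))  # 2/9
```

Keeping every factor nonzero is exactly what prevents the whole naive Bayes product from collapsing to zero.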
1. P(A = 1|+) = 0.6, P(B = 1|+) = 0.4, P(C = 1|+) = 0.8,
P(A = 1|−) = 0.4, P(B = 1|−) = 0.4, and P(C = 1|−) = 0.2.
2.
Let R : (A = 1, B = 1, C = 1) be the test record. To determine its
class, we need to compute P(+|R) and P(−|R). Using Bayes' theorem, P(+|R) = P(R|+)P(+)/P(R) and P(−|R) = P(R|−)P(−)/P(R).
Since P(+) = P(−) = 0.5 and P(R) is constant, R can be classified by
comparing P(+|R) and P(−|R).
For this question,
P(R|+) = P(A = 1|+) × P(B = 1|+) × P(C = 1|+) = 0.192
P(R|−) = P(A = 1|−) × P(B = 1|−) × P(C = 1|−) = 0.032
Since P(R|+) is larger, the record is assigned to the (+) class.
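The classification step above can be sketched directly, using the class-conditional probabilities from part 1:

```python
# Class-conditional probabilities P(attr = 1 | class) from part 1
p_pos = {"A": 0.6, "B": 0.4, "C": 0.8}
p_neg = {"A": 0.4, "B": 0.4, "C": 0.2}

def likelihood(cond_probs):
    """Naive Bayes likelihood of R = (A=1, B=1, C=1): product of conditionals."""
    result = 1.0
    for prob in cond_probs.values():
        result *= prob
    return result

p_r_pos = likelihood(p_pos)  # 0.192
p_r_neg = likelihood(p_neg)  # 0.032
label = "+" if p_r_pos > p_r_neg else "-"
print(label)  # +
```

With equal priors, comparing the two likelihoods is equivalent to comparing the posteriors, so R is assigned to (+).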
3.
P(A = 1) = 0.5, P(B = 1) = 0.4, and P(A = 1, B = 1) = P(A = 1) × P(B = 1) = 0.2. Therefore, A and B are independent.
4.
P(A = 1) = 0.5, P(B = 0) = 0.6, and P(A = 1, B = 0) = P(A = 1) × P(B = 0) = 0.3. A and B are still independent.
5.
Compare P(A = 1, B = 1|+) = 0.2 against P(A = 1|+) = 0.6 and
P(B = 1|+) = 0.4. Since the product of P(A = 1|+) and P(B = 1|+)
is 0.24, which is not equal to P(A = 1, B = 1|+), A and B are
not conditionally independent given the class.
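The check above, written out as a sketch (the three probabilities are the values read off the exercise's table):

```python
p_a_pos = 0.6    # P(A = 1 | +)
p_b_pos = 0.4    # P(B = 1 | +)
p_ab_pos = 0.2   # P(A = 1, B = 1 | +)

# Conditional independence would require P(A, B | +) == P(A | +) * P(B | +)
product = p_a_pos * p_b_pos  # 0.24
independent = abs(product - p_ab_pos) < 1e-9
print(independent)  # False
```

Note the contrast with parts 3 and 4: A and B are marginally independent yet not conditionally independent given the class, which is exactly the situation where the naive Bayes assumption is violated.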
III. Use the similarity matrix in the table below to perform single-link and complete-link hierarchical clustering. Draw dendrograms to show the results; each dendrogram should clearly show the order in which the merges occur.
There are no apparent relationships between s1, s2, c1, and c2、
A2: Percentage of frequent itemsets = 16/32 = 50.0% (including the null set).
A4: The false alarm rate is the ratio of I to the total number of itemsets. Since the count of I is 5, the false alarm rate is 5/32 = 15.6%.