Cluster-Analysis(丛聚分析).doc


Week 13 Lecture Notes: Cluster Analysis (叢聚分析)

Suppose there are n individuals, and p measurements are taken on each.
˙Find the factors that drive the variation in the p measurements and group the measurements by common factor (factor analysis).
˙If the n individuals already have a clearly established biological classification (discriminant analysis or classification analysis).
˙Define a "distance" (or similarity matrix) among the n individuals from the p measurements, or by some other method (such as percentage DNA difference), and let the n individuals be grouped on that basis alone (cluster analysis).

There are two types of cluster analysis:
(1) Disjoint clusters: individuals in different clusters are mutually exclusive.
(2) Hierarchical clusters.
[Figure: hierarchical clustering diagram of individuals 1 to 5]

Applications
˙In psychology, classifying individuals into personality types
˙Evolutionary trees (genealogies) in biology
˙Classifying customers in market surveys
˙Classifying cities by their urban indicators

How many groupings are there?
The number of possible groupings depends on the number of clusters and on whether a cluster is allowed to contain zero individuals. It can be described with an occupancy model from applied probability.
Individuals: the balls of probability theory (all different, i.e. distinguishable)
Clusters: the boxes of probability theory (distinguishable or indistinguishable)

The counts for n balls are as follows:

                               All cases                    Empty boxes not allowed
  m distinguishable boxes      $m^n$                        $m!\,S(n,m)$
  m indistinguishable boxes    $\sum_{k=1}^{m} S(n,k)$      $S(n,m)$

where $S(n,m)$ is the Stirling number of the second kind.

Examples:
  n=3, m=2: (12,3),(13,2),(23,1), so $S(3,2)=3$
  n=4, m=2: (1,234),(2,134),(3,124),(4,123),(12,34),(13,24),(14,23), so $S(4,2)=7$
  n=4, m=3: (1,2,34),(1,3,24),(1,4,23),(2,3,14),(3,4,12),(2,4,13), so $S(4,3)=6$

Genetic distances among human races
Hartman et al. (1994, Am. J. Hum. Genet.) examined the diversity of races in the US based on VNTR RFLP at 4 loci. Nei's genetic distances are given in the following table. Use UPGMA to cluster the 7 races into an evolutionary tree. (Note that these data are based on 4 loci only and are therefore subject to some bias.)

       Chinese  Japanese  Korean  Vietnamese  Black   White   Hispanic
  C    ___      0.024     0.004   0.021       0.040   0.047   0.032
  J    ***      ___       0.012   0.024       0.020   0.037   0.019
  K    ***      ***       ___     0.015       0.026   0.027   0.023
  V    ***      ***       ***     ___         0.034   0.043   0.018
  B    ***      ***       ***     ***         ___     0.023   0.021
  W    ***      ***       ***     ***         ***     ___     0.010
  H    ***      ***       ***     ***         ***     ***     ___

    options nodate nonotes ps=60;
    /* Full symmetric distance matrix, one row per race */
    data genmtx (type=distance);
      input C J K V B W H races $;
      cards;
    0.000 0.024 0.004 0.021 0.040 0.047 0.032 C
    0.024 0.000 0.012 0.024 0.020 0.037 0.019 J
    0.004 0.012 0.000 0.015 0.026 0.027 0.023 K
    0.021 0.024 0.015 0.000 0.034 0.043 0.018 V
    0.040 0.020 0.026 0.034 0.000 0.023 0.021 B
    0.047 0.037 0.027 0.043 0.023 0.000 0.010 W
    0.032 0.019 0.023 0.018 0.021 0.010 0.000 H
    ;
    /* Average linkage (UPGMA) on the unnormalized, unsquared distances */
    proc cluster method=average nonorm nosquare;
      id races;
    proc tree;
    run;

The CLUSTER Procedure
Average Linkage Cluster Analysis
Cluster History

  NCL  --Clusters Joined---  FREQ  Aver Dist  Tie
    6  C      K                 2  0.004
    5  W      H                 2  0.01
    4  CL6    J                 3  0.018      T
    3  CL4    V                 4  0.02
    2  B      CL5               3  0.022
    1  CL3    CL2               7  0.0305

Since the NOSQUARE option was specified, the data are assumed to be SQUARED Euclidean distances for computing R-squared and related statistics defined in a Euclidean coordinate system.

Distance matrix used to form CL6 (6 clusters):

       C    J      K      V      B      W      H
  C    0    0.024  0.004  0.021  0.040  0.047  0.032
  J         0      0.012  0.024  0.020  0.037  0.019
  K                0      0.015  0.026  0.027  0.023
  V                       0      0.034  0.043  0.018
  B                              0      0.023  0.021
  W                                     0      0.010
  H                                            0

Average distance matrix used to form CL5 (5 clusters):

        (CK)  J      V      B      W      H
  (CK)  0     0.018  0.018  0.033  0.037  0.0275
  J           0      0.024  0.020  0.037  0.019
  V                  0      0.034  0.043  0.018
  B                         0      0.023  0.021
  W                                0      0.010
  H                                       0

For example, the distance between (CK) and J is (CJ+KJ)/2 = (0.024+0.012)/2 = 0.018.

Average distance matrix used to form CL4 (4 clusters):

        (CK)  J      V      B      (WH)
  (CK)  0     0.018  0.018  0.033  0.03225
  J           0      0.024  0.020  0.028
  V                  0      0.034  0.0305
  B                         0      0.022
  (WH)                             0

Distance between individuals
$x_{ik}$: the k-th measurement of the i-th individual
˙Euclidean distance: $d_{ij} = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2}$. This distance changes with the scale of measurement, so the variables must be standardized first.
˙Mahalanobis distance: $d_{ij} = \sqrt{(x_i - x_j)' S^{-1} (x_i - x_j)}$, where S is the pooled variance-covariance matrix.

Distance from an individual to a cluster, or between clusters
˙Average linkage (average distance): the mean of all point-to-point distances (method=average)
˙Complete linkage (maximum distance): the maximum of all point-to-point distances (method=complete)
˙Single linkage (minimum distance): the minimum of all point-to-point distances (method=single)
˙Ward's minimum variance method (method=ward)

Ward's Minimum Variance Clustering Method
Using the following example (five individuals, two measurements each):

  Individual  1st measurement  2nd measurement
  1           10               5
  2           20               20
  3           30               10
  4           30               15
  5           5                10

E is the total within-cluster sum of squared deviations from the cluster means. At each step the partition with the smallest E (marked *) is chosen.

  Step  Possible partitions   E
  1     (12) 3 4 5            162.5
        (13) 2 4 5            212.5
        (14) 2 3 5            250.0
        (15) 2 3 4            25.0
        (23) 1 4 5            100.0
        (24) 1 3 5            62.5
        (25) 1 3 4            162.5
        (34) 1 2 5            12.5*
        (35) 1 2 4            312.5
        (45) 1 2 3            325.0
  2     (34) (12) 5           175.0
        (34) (15) 2           37.5*
        (34) (25) 1           175.0
        (134) 2 5             316.7
        (234) 1 5             116.7
        (345) 1 2             433.3
  3     (234) (15)            141.7*
        (125) (34)            245.8
        (1345) 2              568.8
  4     (12345)               650.0
_____________________________________________________________
Ward's Minimum Variance Cluster Analysis
Root-Mean-Square Total-Sample Standard Deviation = 9.013878

  NCL  Clusters Joined  FREQ  SPRSQ     RSQ      BSS
    4  3      4            2  0.019231  0.98077  12.50000
    3  1      5            2  0.038462  0.94231  25.00000
    2  2      CL4          3  0.160256  0.78205  104.1667
    1  CL3    CL2          5  0.782051  0.00000  508.3333

Step 0. Consider all possible partitions.
Step 1. Merge clusters 3 and 4, giving 1, 2, (34) and 5 at E=12.5.
Step 2. Merge clusters 1 and 5, giving 2, (34) and (15) at E=37.5.
Step 3. Merge clusters 2 and (34), giving (15) and (234) at E=141.7.
Step 4. Merge (15) and (234), giving (12345) at E=650.

Algorithm for disjoint clusters: FASTCLUS
˙Uses nearest centroid sorting (with a fixed number of clusters k)
(1) Select k seeds as the centers of the k clusters. SAS takes the first nonmissing observation as the first seed, then successively selects further points whose distance from the existing seeds exceeds a given constant (Radius).
(2) Assign each remaining point to the cluster of its nearest seed, forming temporary clusters.
(3) Take the mean of each temporary cluster as a new seed, then reassign every observation to the cluster of its nearest new seed.
(4) Repeat until nothing changes.
For missing values an adjusted distance is used, so that nearby points are still assigned to the same cluster.

    /* Cluster Analysis */
    data cluster1;
      input name $ x1 x2 @@;
      cards;
    A 10 14 E 11 16 H 9 15 J 11 15
    B 20 21 D 21 24 G 19 23
    C 11 30 F 12 27 I 13 31
    ;
    /* Scatter plot of the raw data, labeled by name */
    proc plot;
      plot x2*x1=name/vpos=20;
    /* Hierarchical clustering on standardized variables, three linkage methods */
    proc cluster data=cluster1 std method=single nonorm;
      var x1 x2;
      id name;
    proc tree;
    proc cluster data=cluster1 std method=average nonorm noeigen;
      var x1 x2;
      id name;
    proc cluster data=cluster1 std method=ward nonorm noeigen;
      var x1 x2;
      id name;
    /* Disjoint clustering into at most 3 clusters; assignments saved in cluster2 */
    proc fastclus data=cluster1 out=cluster2 maxclusters=3;
      var x1 x2;
    proc plot;
      plot x2*x1=cluster/vpos=20;
    proc sort;
      by cluster distance;
    proc print;
      by cluster;
    run;

Plot of x2*x1. Symbol is value of name.
[Printer plot: x2 on the vertical axis (14 to 31) against x1 on the horizontal axis (9 to 21). The points form three groups: A, E, H, J at the lower left; B, D, G at the right; C, F, I at the upper left.]

Single Linkage Cluster Analysis
Eigenvalues of the Correlation Matrix

       Eigenvalue  Difference  Proportion  Cumulative
  1    1.28723793  0.57447586  0.6436      0.6436
  2    0.71276207  .           0.3564      1.0000

The data have been standardized to mean 0 and variance 1
Root-Mean-Square Total-Sample Standard Deviation = 1

  NCL  --Clusters Joined---  FREQ  Min Dist  Tie
    9  E      J                 2  0.1555
    8  A      CL9               3  0.2713    T
    7  CL8    H                 4  0.2713
    6  B      G                 2  0.3822
    5  CL6    D                 3  0.471     T
    4  C      I                 2  0.471
    3  CL4    F                 3  0.5167
    2  CL5    CL3               6  1.6758
    1  CL7    CL2              10  1.7244

Average Linkage Cluster Analysis
The data have been standardized to mean 0 and variance 1
Root-Mean-Square Total-Sample Standard Deviation = 1

  NCL  --Clusters Joined---  FREQ  RMS Dist  Tie
    9  E      J                 2  0.1555
    8  A      H                 2  0.2713
    7  B      G                 2  0.3822
    6  CL8    CL9               4  0.3998
    5  C      I                 2  0.471
    4  CL7    D                 3  0.4944
    3  CL5    F                 3  0.5929
    2  CL4    CL3               6  2.1001
    1  CL6    CL2              10  2.398

Ward's Minimum Variance Cluster Analysis
The data have been standardized to mean 0 and variance 1
Root-Mean-Square Total-Sample Standard Deviation = 1

  NCL  --Clusters Joined---  FREQ  SPRSQ   RSQ   BSS     Tie
    9  E      J                 2  0.0007  .999  0.0121
    8  A      H                 2  0.0020  .997  0.0368
    7  B      G                 2  0.0041  .993  0.073
    6  C      I                 2  0.0062  .987  0.1109
    5  CL8    CL9               4  0.0075  .980  0.1354
    4  CL7    D                 3  0.0077  .972  0.1386
    3  CL6    F                 3  0.0110  .961  0.1974
    2  CL4    CL3               6  0.3531  .608  6.3558
    1  CL5    CL2              10  0.6078  .000  10.94

The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=3 Maxiter=1

Initial Seeds
  Cluster  x1           x2
  1        10.00000000  14.00000000
  2        11.00000000  30.00000000
  3        20.00000000  21.00000000

Criterion Based on Final Seeds = 1.0508

Cluster Summary
                                 Maximum Distance                        Distance Between
                      RMS Std    from Seed         Radius    Nearest    Cluster
  Cluster  Frequency  Deviation  to Observation    Exceeded  Cluster    Centroids
  1        4          0.8898     1.2500                      3          12.4032
  2        3          1.6330     2.3333                      3          10.4137
  3        3          1.2910     1.6667                      2          10.4137

Statistics for Variables
  Variable  Total STD  Within STD  R-Square  RSQ/(1-RSQ)
  x1        4.49815    0.98198     0.962932  25.977778
  x2        6.43256    1.48003     0.958826  23.286957
  OVER-ALL  5.55028    1.25594     0.960174  24.109434

Pseudo F Statistic = 84.38
Approximate Expected Over-All R-Squared = .
Cubic Clustering Criterion = .
WARNING: The two values above are invalid for correlated variables.

Cluster Means
  Cluster  x1           x2
  1        10.25000000  15.00000000
  2        12.00000000  29.33333333
  3        20.00000000  22.66666667

Cluster Standard Deviations
  Cluster  x1           x2
  1        0.957427108  0.816496581
  2        1.000000000  2.081665999
  3        1.000000000  1.527525232

------------------------------------------Cluster=1-----------------------------------
  Obs  name  x1  x2  DISTANCE
  1    J     11  15  0.75000
  2    A     10  14  1.03078
  3    E     11  16  1.25000
  4    H     9   15
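
Returning to the five-point example used for Ward's minimum variance method: the E values in the partition table can be recomputed directly, which makes the greedy nature of the procedure easier to see. The following is a minimal Python sketch, not the SAS implementation. The names `points`, `within_ss`, `total_e` and `ward` are illustrative, and E is assumed to be the total within-cluster sum of squared deviations from the cluster means.

```python
from itertools import combinations

# Coordinates of the five individuals in the Ward example.
points = {1: (10, 5), 2: (20, 20), 3: (30, 10), 4: (30, 15), 5: (5, 10)}

def within_ss(members):
    """Sum of squared deviations of the members from their centroid."""
    total = 0.0
    for axis in range(2):
        values = [points[i][axis] for i in members]
        mean = sum(values) / len(values)
        total += sum((v - mean) ** 2 for v in values)
    return total

def total_e(partition):
    """E for a partition: within-cluster sums of squares added up."""
    return sum(within_ss(cluster) for cluster in partition)

def ward(partition):
    """Greedy merging: at each step keep the merge with the smallest E."""
    partition = [tuple(c) for c in partition]
    history = []
    while len(partition) > 1:
        best = None
        for a, b in combinations(partition, 2):
            rest = [c for c in partition if c not in (a, b)]
            candidate = rest + [tuple(sorted(a + b))]
            e = total_e(candidate)
            if best is None or e < best[0]:
                best = (e, a, b, candidate)
        e, a, b, partition = best
        history.append((a, b, round(e, 1)))
    return history

history = ward([(i,) for i in points])
for a, b, e in history:
    print(a, "+", b, "-> E =", e)
```

Under these assumptions the merge order is (3,4), then (1,5), then 2 with (34), then everything, at E = 12.5, 37.5, 141.7 and 650, the same sequence as Steps 1 to 4.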

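The average-linkage (UPGMA) cluster history for the genetic distance data can be cross-checked in the same way. The sketch below is a plain Python illustration rather than a reproduction of PROC CLUSTER. The names `rows`, `dist`, `average_distance` and `upgma` are illustrative, distances are held as exact fractions so that the tie at 0.018 is a true tie, and breaking that tie by iteration order is an assumption of this sketch.

```python
from fractions import Fraction
from itertools import combinations

labels = ["C", "J", "K", "V", "B", "W", "H"]

# Upper triangle of the Nei distance table, in thousandths.
rows = [
    [24, 4, 21, 40, 47, 32],  # C vs J K V B W H
    [12, 24, 20, 37, 19],     # J vs K V B W H
    [15, 26, 27, 23],         # K vs V B W H
    [34, 43, 18],             # V vs B W H
    [23, 21],                 # B vs W H
    [10],                     # W vs H
]

dist = {}
for i, row in enumerate(rows):
    for offset, value in enumerate(row):
        pair = frozenset((labels[i], labels[i + 1 + offset]))
        dist[pair] = Fraction(value, 1000)

def average_distance(a, b):
    """Mean of all pairwise distances between members of clusters a and b."""
    total = sum(dist[frozenset((x, y))] for x in a for y in b)
    return total / (len(a) * len(b))

def upgma(items):
    """Repeatedly join the two clusters with the smallest average distance."""
    clusters = [(x,) for x in items]
    history = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_distance(*pair))
        height = average_distance(a, b)
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        history.append(("".join(a), "".join(b), float(height)))
    return history

history = upgma(labels)
for a, b, height in history:
    print(a, "+", b, "at", height)
```

With this tie-breaking rule the joining distances come out as 0.004, 0.01, 0.018, 0.02, 0.022 and 0.0305, in agreement with the Aver Dist column of the cluster history. The same function also gives 0.028 for J against (WH), the value implied by (0.037+0.019)/2.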