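Week 13 Lecture Notes
Cluster Analysis
Suppose there are n individuals, and p measurements are taken on each.
˙Find the factors that drive the variation in the p measurements and group the measurements by common factor (factor analysis)
˙If the n individuals already belong to clearly established biological groups (discriminant analysis or classification analysis)
˙Define a "distance" (or similarity matrix) among the n individuals from the p measurements, or from some other method such as percentage DNA difference, and let the data sort the n individuals into groups (cluster analysis)
There are two types of cluster analysis:
(1) Disjoint clusters
The clusters are mutually exclusive: no individual belongs to more than one cluster.
(2) Hierarchical clusters
[Figure: hierarchical clusters over individuals 1 2 3 4 5]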
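Applications
˙In psychology, grouping individuals into personality types
˙Evolutionary trees (genealogies) in biology
˙In market research, segmenting customers
˙Grouping cities by their urban indicators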
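How many possible groupings are there?
The number of possible groupings depends on the number of clusters and on whether a cluster is allowed to contain zero individuals. It can be described with the balls-in-boxes model from applied probability.
Individuals: the balls (treated as distinct, i.e. all distinguishable)
Clusters: the boxes (distinguishable or indistinguishable)
The counts for n balls are:
                            All cases                      No empty boxes
m distinguishable boxes     m^n                            m! S(n,m)
m indistinguishable boxes   S(n,1)+S(n,2)+...+S(n,m)       S(n,m)
S(n,m): Stirling number of the second kind
Examples:
S(3,2)=3: (12,3),(13,2),(23,1)
S(4,2)=7: (1,234),(2,134),(3,124),(4,123),(12,34),(13,24),(14,23)
S(4,3)=6: (1,2,34),(1,3,24),(1,4,23),(2,3,14),(3,4,12),(2,4,13)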
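The counts in the balls-in-boxes table can be checked by brute force. The following is a minimal Python sketch, not part of the original notes: the function names `stirling2`, `partitions` and `count_groupings` are made up for illustration, and the four formulas are the standard balls-in-boxes results assumed for the table.

```python
from math import factorial

def stirling2(n, m):
    """S(n, m): ways to split n distinct items into m non-empty, unlabeled groups."""
    if n == m:
        return 1
    if m == 0 or m > n:
        return 0
    # The n-th item either forms a new group by itself or joins one of m existing groups.
    return stirling2(n - 1, m - 1) + m * stirling2(n - 1, m)

def partitions(items, m):
    """Yield every split of `items` into exactly m non-empty, unlabeled groups."""
    if not items:
        if m == 0:
            yield []
        return
    first, rest = items[0], items[1:]
    for p in partitions(rest, m - 1):      # `first` forms a group by itself
        yield [[first]] + p
    for p in partitions(rest, m):          # `first` joins an existing group
        for i in range(len(p)):
            yield p[:i] + [[first] + p[i]] + p[i + 1:]

def count_groupings(n, m, boxes_distinguishable, allow_empty):
    """Counts for the four cells of the table (assumes n >= 1 and m >= 1)."""
    if boxes_distinguishable:
        return m ** n if allow_empty else factorial(m) * stirling2(n, m)
    if allow_empty:
        return sum(stirling2(n, k) for k in range(1, m + 1))
    return stirling2(n, m)

print(stirling2(3, 2), stirling2(4, 2), stirling2(4, 3))  # 3 7 6
for p in partitions([1, 2, 3, 4], 2):
    print(p)  # the seven ways to split 4 individuals into 2 clusters
```

The enumeration grows very quickly with n, which is why cluster analysis relies on stepwise algorithms rather than on comparing every possible grouping.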
Genetic distances among human populations:
Hartman et al. (1994, Am. J. Hum. Genet.) examined the diversity of races in the US based on VNTR RFLP at 4 loci. Nei's genetic distances are given in the following table. Use UPGMA to cluster the 7 races in the table and reconstruct their evolutionary tree. (Note that these data are based on only 4 loci, so they are subject to some bias.)
C = Chinese, J = Japanese, K = Korean, V = Vietnamese, B = Black, W = White, H = Hispanic
      C       J       K       V       B       W       H
C     ___     0.024   0.004   0.021   0.040   0.047   0.032
J     ***     ___     0.012   0.024   0.020   0.037   0.019
K     ***     ***     ___     0.015   0.026   0.027   0.023
V     ***     ***     ***     ___     0.034   0.043   0.018
B     ***     ***     ***     ***     ___     0.023   0.021
W     ***     ***     ***     ***     ***     ___     0.010
H     ***     ***     ***     ***     ***     ***     ___
options nodate nonotes ps=60;
/* Nei's genetic distances, entered as a full symmetric distance matrix */
data genmtx (type=distance);
input C J K V B W H races $;
cards;
0.000 0.024 0.004 0.021 0.040 0.047 0.032 C
0.024 0.000 0.012 0.024 0.020 0.037 0.019 J
0.004 0.012 0.000 0.015 0.026 0.027 0.023 K
0.021 0.024 0.015 0.000 0.034 0.043 0.018 V
0.040 0.020 0.026 0.034 0.000 0.023 0.021 B
0.047 0.037 0.027 0.043 0.023 0.000 0.010 W
0.032 0.019 0.023 0.018 0.021 0.010 0.000 H
;
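/* average linkage (UPGMA); nonorm leaves the distances unscaled, */
/* nosquare averages the distances themselves instead of their squares */
proc cluster method=average nonorm nosquare;
id races;
/* draw the tree diagram (dendrogram) */
proc tree;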
run;
The CLUSTER Procedure
Average Linkage Cluster Analysis
Cluster History
NCL  --Clusters Joined---  FREQ  Aver Dist  Tie
  6  C          K             2  0.004
  5  W          H             2  0.01
  4  CL6        J             3  0.018      T
  3  CL4        V             4  0.02
  2  B          CL5           3  0.022
  1  CL3        CL2           7  0.0305
Since the NOSQUARE option was specified, the data are assumed to be SQUARED
Euclidean distances for computing R-squared and related statistics defined in
a Euclidean coordinate system.
Distance matrix used to form CL6 (6 clusters)
      C       J       K       V       B       W       H
C     0       0.024   0.004   0.021   0.040   0.047   0.032
J             0       0.012   0.024   0.020   0.037   0.019
K                     0       0.015   0.026   0.027   0.023
V                             0       0.034   0.043   0.018
B                                     0       0.023   0.021
W                                             0       0.010
H                                                     0
Average-distance matrix used to form CL5 (5 clusters)
      (CK)    J       V       B       W       H
(CK)  0       0.018   0.018   0.033   0.037   0.0275
J             0       0.024   0.020   0.037   0.019
V                     0       0.034   0.043   0.018
B                             0       0.023   0.021
W                                     0       0.010
H                                             0
For example:
the distance between (CK) and J is (CJ+KJ)/2=(0.024+0.012)/2=0.018
Average-distance matrix used to form CL4 (4 clusters)
      (CK)    J       V       B       (WH)
(CK)  0       0.018   0.018   0.033   0.03225
J             0       0.024   0.020   0.028
V                     0       0.034   0.0305
B                             0       0.022
(WH)                                  0
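The hand-computed matrices can be replayed with a short script. This is a minimal Python sketch of average linkage (UPGMA), not the SAS procedure itself: the names `avg_dist` and `upgma` are illustrative, and the distances are held as exact fractions (in thousandths) so that the tie at 0.018 is settled by the order of the pairs rather than by floating-point rounding.

```python
from fractions import Fraction
from itertools import combinations

labels = ["C", "J", "K", "V", "B", "W", "H"]
# Nei's distances from the table, in thousandths
rows = [
    [0, 24, 4, 21, 40, 47, 32],
    [24, 0, 12, 24, 20, 37, 19],
    [4, 12, 0, 15, 26, 27, 23],
    [21, 24, 15, 0, 34, 43, 18],
    [40, 20, 26, 34, 0, 23, 21],
    [47, 37, 27, 43, 23, 0, 10],
    [32, 19, 23, 18, 21, 10, 0],
]
D = {frozenset((labels[i], labels[j])): Fraction(rows[i][j], 1000)
     for i, j in combinations(range(len(labels)), 2)}

def avg_dist(g1, g2):
    """Average linkage: mean of all pairwise distances between two groups."""
    return sum(D[frozenset((a, b))] for a in g1 for b in g2) / (len(g1) * len(g2))

def upgma(names):
    """Merge the closest pair of clusters until one cluster is left."""
    clusters = [(x,) for x in names]
    history = []
    while len(clusters) > 1:
        # closest pair; on a tie, min() keeps the first pair it meets
        g1, g2 = min(combinations(clusters, 2), key=lambda pair: avg_dist(*pair))
        history.append((g1 + g2, avg_dist(g1, g2)))
        clusters = [c for c in clusters if c not in (g1, g2)] + [g1 + g2]
    return history

history = upgma(labels)
for members, d in history:
    print("".join(members), float(d))
```

Under these assumptions the merge distances come out as 0.004, 0.01, 0.018, 0.02, 0.022 and 0.0305, the same sequence as in the SAS Cluster History. A different tie-breaking rule could merge V before J at 0.018 without changing those distances.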
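Distance between individuals
$x_{ik}$: the k-th measurement of the i-th individual
˙Euclidean distance
$d_{ij} = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2}$
This distance changes with the measurement scale, so the variables must be standardized first.
˙Mahalanobis distance
$d_{ij} = \sqrt{(x_i - x_j)' S^{-1} (x_i - x_j)}$, where $x_i = (x_{i1}, \dots, x_{ip})'$
S: pooled variance-covariance matrix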
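The effect of standardizing can be seen on a small data set. The sketch below is a hedged Python illustration that uses the ten two-variable points from the SAS example later in these notes. The helper names are made up, and `mahalanobis_2d` only handles two variables with a covariance matrix S that the caller supplies (how S is pooled is left out).

```python
from math import sqrt
from statistics import mean, stdev

points = {"A": (10, 14), "E": (11, 16), "H": (9, 15), "J": (11, 15), "B": (20, 21),
          "D": (21, 24), "G": (19, 23), "C": (11, 30), "F": (12, 27), "I": (13, 31)}

def euclidean(a, b):
    """Square root of the summed squared differences over all measurements."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def standardize(data):
    """Rescale every variable to mean 0 and (sample) standard deviation 1."""
    cols = list(zip(*data.values()))
    mu = [mean(c) for c in cols]
    sd = [stdev(c) for c in cols]
    return {name: tuple((v - m) / s for v, m, s in zip(p, mu, sd))
            for name, p in data.items()}

def mahalanobis_2d(a, b, S):
    """sqrt((a-b)' S^{-1} (a-b)) for two variables; S is a 2x2 covariance matrix."""
    (s11, s12), (s21, s22) = S
    det = s11 * s22 - s12 * s21
    dx, dy = a[0] - b[0], a[1] - b[1]
    q = (s22 * dx * dx - (s12 + s21) * dx * dy + s11 * dy * dy) / det
    return sqrt(q)

z = standardize(points)
print(euclidean(points["E"], points["J"]))      # raw scale
print(round(euclidean(z["E"], z["J"]), 4))      # standardized scale
```

With a diagonal S that holds only the two variances, the Mahalanobis distance reduces to the standardized Euclidean distance; the off-diagonal covariance is what additionally corrects for correlation between the variables.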
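Distance from an individual to a cluster, or between clusters
˙Average linkage (average distance): the mean of all point-to-point distances
(method=average)
˙Complete linkage (maximum distance): the largest of all point-to-point distances
(method=complete)
˙Single linkage (minimum distance): the smallest of all point-to-point distances
(method=single)
˙Ward's minimum variance method
(method=ward)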
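The three distance-based rules differ only in how they summarize the point-to-point distances between two clusters. A minimal Python sketch follows, with the hypothetical helper `linkage_distance` and three of Nei's distances from the genetic example as input (Ward's method works on sums of squares instead and is not covered here).

```python
def linkage_distance(g1, g2, dist, method):
    """Distance between clusters g1 and g2 from all point-to-point distances."""
    pairwise = [dist(a, b) for a in g1 for b in g2]
    if method == "average":
        return sum(pairwise) / len(pairwise)
    if method == "complete":
        return max(pairwise)
    if method == "single":
        return min(pairwise)
    raise ValueError(f"unknown method: {method}")

# Nei's distances among C, K and J, taken from the table of genetic distances
nei = {frozenset("CJ"): 0.024, frozenset("KJ"): 0.012, frozenset("CK"): 0.004}

def d(a, b):
    return nei[frozenset((a, b))]

for method in ("average", "complete", "single"):
    print(method, linkage_distance(["C", "K"], ["J"], d, method))
```

For the cluster (CK) and the single individual J, the average rule gives about 0.018, the complete rule 0.024 and the single rule 0.012.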
Ward's Minimum Variance Clustering Method
Using the following example:
individual  x1  x2
1           10   5
2           20  20
3           30  10
4           30  15
5            5  10
(E: total within-cluster sum of squared deviations from the cluster means; * marks the smallest E at each step)
step  possible partitions  E
1     (12) 3 4 5           162.5
      (13) 2 4 5           212.5
      (14) 2 3 5           250.0
      (15) 2 3 4           25.0
      (23) 1 4 5           100.0
      (24) 1 3 5           62.5
      (25) 1 3 4           162.5
      (34) 1 2 5           12.5*
      (35) 1 2 4           312.5
      (45) 1 2 3           325.0
2     (34) (12) 5          175.0
      (34) (15) 2          37.5*
      (34) (25) 1          175.0
      (134) 2 5            316.7
      (234) 1 5            116.7
      (345) 1 2            433.3
3     (234) (15)           141.7*
      (125) (34)           245.8
      (1345) 2             568.8
4     (12345)              650.0
_____________________________________________________________
Ward's Minimum Variance Cluster Analysis
Root-Mean-Square Total-Sample Standard Deviation = 9.013878
NCL  Clusters Joined  FREQ  SPRSQ     RSQ      BSS
4    3      4         2     0.019231  0.98077  12.50000
3    1      5         2     0.038462  0.94231  25.00000
2    2      CL4       3     0.160256  0.78205  104.1667
1    CL3    CL2       5     0.782051  0.00000  508.3333
Step 0. Consider all possible partitions.
Step 1. Merge clusters 3 and 4, giving 1, 2, (34) and 5 at the value of E=12.5
Step 2. Merge clusters 1 and 5, giving 2, (34) and (15) at the value of E=37.5
Step 3. Merge clusters 2 and (34), giving (15) and (234) at the value of E=141.7
Step 4. Merge (15) and (234), giving (12345) at the value of E=650
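The E values in the table can be recomputed directly. The following is a minimal Python sketch of the stepwise search, assuming E is the total within-cluster sum of squared deviations from the cluster means; the names `ess` and `ward` are illustrative and this is not how PROC CLUSTER is implemented.

```python
from itertools import combinations

points = {1: (10, 5), 2: (20, 20), 3: (30, 10), 4: (30, 15), 5: (5, 10)}

def ess(cluster):
    """Sum of squared deviations of a cluster's points from its centroid."""
    coords = [points[i] for i in cluster]
    total = 0.0
    for dim in zip(*coords):
        centre = sum(dim) / len(dim)
        total += sum((v - centre) ** 2 for v in dim)
    return total

def ward(ids):
    """At each step, apply the merge that gives the smallest total E."""
    clusters = [(i,) for i in ids]
    history = []
    while len(clusters) > 1:
        best = None
        for g1, g2 in combinations(clusters, 2):
            trial = [c for c in clusters if c not in (g1, g2)] + [tuple(sorted(g1 + g2))]
            e = sum(ess(c) for c in trial)
            if best is None or e < best[0]:
                best = (e, trial)
        clusters = best[1]
        history.append((round(best[0], 1), list(clusters)))
    return history

for e, clusters in ward(points):
    print(e, clusters)
```

Under this definition the four steps give E = 12.5, 37.5, 141.7 and 650.0, and the differences between successive values (12.5, 25, 104.17, 508.33) are the BSS column of the SAS output.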
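Algorithm for disjoint clustering
FASTCLUS
˙Uses nearest centroid sorting (with a fixed number of clusters k)
(1) Choose k seeds as the centers of the k clusters.
SAS takes the first observation with no missing values as the first seed, then successively picks further observations whose distance from the seeds already chosen exceeds a given constant (Radius).
(2) Assign each remaining point to the cluster of its nearest seed, forming temporary clusters.
(3) Take the mean of each temporary cluster as the new seed, then reassign every observation to the cluster of its nearest new seed.
(4) Repeat until nothing changes.
For missing values an adjusted distance is used, so that nearby points are still assigned to the same cluster.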
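Steps (2) to (4) can be sketched in a few lines. This Python sketch is an illustration under stated assumptions rather than a reimplementation of FASTCLUS: the seed-selection rule of step (1) and the missing-value adjustment are omitted, and the three seeds are simply copied from the "Initial Seeds" table of the SAS output in these notes. The function name `nearest_centroid_sort` is made up.

```python
from math import dist

data = {"A": (10, 14), "E": (11, 16), "H": (9, 15), "J": (11, 15), "B": (20, 21),
        "D": (21, 24), "G": (19, 23), "C": (11, 30), "F": (12, 27), "I": (13, 31)}
seeds = [(10, 14), (11, 30), (20, 21)]   # assumed given, see the lead-in

def nearest_centroid_sort(data, seeds, max_iter=100):
    """Assign points to the nearest center, recompute centers, repeat until stable."""
    centers = list(seeds)
    assignment = None
    for _ in range(max_iter):
        # steps (2)/(3): every observation goes to the cluster of its nearest center
        new_assignment = {
            name: min(range(len(centers)), key=lambda k: dist(p, centers[k]))
            for name, p in data.items()
        }
        if new_assignment == assignment:   # step (4): nothing changed
            break
        assignment = new_assignment
        # step (3): each center becomes the mean of its temporary cluster
        for k in range(len(centers)):
            members = [data[n] for n, c in assignment.items() if c == k]
            if members:
                centers[k] = tuple(sum(v) / len(v) for v in zip(*members))
    return assignment, centers

assignment, centers = nearest_centroid_sort(data, seeds)
for k, c in enumerate(centers, start=1):
    print(k, sorted(n for n, a in assignment.items() if a == k - 1), c)
```

With these seeds the sketch ends with the clusters {A, E, H, J}, {C, F, I} and {B, D, G}, whose means agree with the "Cluster Means" table of the FASTCLUS output.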
/* Cluster Analysis */
data cluster1;
input name $ x1 x2 @@;
cards;
A 10 14 E 11 16 H 9 15 J 11 15 B 20 21
D 21 24 G 19 23 C 11 30 F 12 27 I 13 31
;
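/* scatter plot of the raw data, each point labelled by name */
proc plot;
plot x2*x1=name/vpos=20;
/* hierarchical clustering on standardized variables: single linkage */
proc cluster data=cluster1 std method=single nonorm;
var x1 x2;
id name;
proc tree;
/* same data, average linkage */
proc cluster data=cluster1 std method=average nonorm noeigen;
var x1 x2;
id name;
/* same data, Ward's minimum variance method */
proc cluster data=cluster1 std method=ward nonorm noeigen;
var x1 x2;
id name;
/* disjoint clustering into at most 3 clusters; cluster2 holds CLUSTER and DISTANCE */
proc fastclus data=cluster1 out=cluster2 maxclusters=3;
var x1 x2;
proc plot;
plot x2*x1=cluster/vpos=20;
/* list the observations of each cluster, ordered by distance to the cluster seed */
proc sort;
by cluster distance;
proc print;
by cluster;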
run;
Plot of x2*x1. Symbol is value of name.
 x2 |
 31 +             I
 30 +       C
 29 +
 28 +
 27 +          F
 26 +
 25 +
 24 +                                     D
 23 +                               G
 22 +
 21 +                                  B
 20 +
 19 +
 18 +
 17 +
 16 +       E
 15 + H     J
 14 +    A
    +-+--+--+--+--+--+--+--+--+--+--+--+--+--
      9 10 11 12 13 14 15 16 17 18 19 20 21
                       x1
Single Linkage Cluster Analysis
Eigenvalues of the Correlation Matrix
Eigenvalue Difference Proportion Cumulative
1 1.28723793 0.57447586 0.6436 0.6436
2 0.71276207 . 0.3564 1.0000
The data have been standardized to mean 0 and variance 1
Root-Mean-Square Total-Sample Standard Deviation = 1
NCL --Clusters Joined--- FREQ Min Dist Tie
9 E J 2 0.1555
8 A CL9 3 0.2713 T
7 CL8 H 4 0.2713
6 B G 2 0.3822
5 CL6 D 3 0.471 T
4 C I 2 0.471
3 CL4 F 3 0.5167
2 CL5 CL3 6 1.6758
1 CL7 CL2 10 1.7244
Average Linkage Cluster Analysis
The data have been standardized to mean 0 and variance 1
Root-Mean-Square Total-Sample Standard Deviation = 1
NCL --Clusters Joined--- FREQ RMS Dist Tie
9 E J 2 0.1555
8 A H 2 0.2713
7 B G 2 0.3822
6 CL8 CL9 4 0.3998
5 C I 2 0.471
4 CL7 D 3 0.4944
3 CL5 F 3 0.5929
2 CL4 CL3 6 2.1001
1 CL6 CL2 10 2.398
Ward's Minimum Variance Cluster Analysis
The data have been standardized to mean 0 and variance 1
Root-Mean-Square Total-Sample Standard Deviation = 1
NCL --Clusters Joined--- FREQ SPRSQ RSQ BSS Tie
9 E J 2 0.0007 .999 0.0121
8 A H 2 0.0020 .997 0.0368
7 B G 2 0.0041 .993 0.073
6 C I 2 0.0062 .987 0.1109
5 CL8 CL9 4 0.0075 .980 0.1354
4 CL7 D 3 0.0077 .972 0.1386
3 CL6 F 3 0.0110 .961 0.1974
2 CL4 CL3 6 0.3531 .608 6.3558
1 CL5 CL2 10 0.6078 .000 10.94
The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=3 Maxiter=1
Initial Seeds
Cluster x1 x2
-------------------------------------------
1 10.00000000 14.00000000
2 11.00000000 30.00000000
3 20.00000000 21.00000000
Criterion Based on Final Seeds = 1.0508
Cluster Summary
                                Maximum Distance
                    RMS Std        from Seed       Radius    Nearest  Distance Between
Cluster  Frequency  Deviation    to Observation   Exceeded   Cluster  Cluster Centroids
--------------------------------------------------------------------------------------
1        4          0.8898       1.2500                      3        12.4032
2        3          1.6330       2.3333                      3        10.4137
3        3          1.2910       1.6667                      2        10.4137
Statistics for Variables
Variable Total STD Within STD R-Square RSQ/(1-RSQ)
------------------------------------------------------------------
x1 4.49815 0.98198 0.962932 25.977778
x2 6.43256 1.48003 0.958826 23.286957
OVER-ALL 5.55028 1.25594 0.960174 24.109434
Pseudo F Statistic = 84.38
Approximate Expected Over-All R-Squared = .
Cubic Clustering Criterion = .
WARNING: The two values above are invalid for correlated variables.
Cluster Means
Cluster x1 x2
-------------------------------------------
1 10.25000000 15.00000000
2 12.00000000 29.33333333
3 20.00000000 22.66666667
Cluster Standard Deviations
Cluster x1 x2
-------------------------------------------
1 0.957427108 0.816496581
2 1.000000000 2.081665999
3 1.000000000 1.527525232
------------------------------------------Cluster=1-----------------------------------
Obs name x1 x2 DISTANCE
1 J 11 15 0.75000
2 A 10 14 1.03078
3 E 11 16 1.25000
4 H 9 15