生存分析-随机森林实验与代码.doc

资源描述

随机森林模型在生存分析中得应用【摘要】目得：本文探讨随机森林方法用于高维度、强相关、小样本得生存资料分析时，可以起到变量筛选得作用。方法：以乳腺癌数据集构建乳腺癌转移风险评估模型为实例进行实证分析，使用随机森林模型进行变量选择，然后拟合cox回归模型。结果：随机森林模型通过对变量得选择，有效得解决数据维度高且强相关得情况，得到了较高得AUC值。一、数据说明该乳腺癌数据集来自于NCBI，有77个观测值以及22286个基因变量。通过筛选选取454个基因变量。将数据随机分为训练集合测试集，其中2/3为训练集，1/3为测试集。绘制K-M曲线图：二、随机森林模型随机森林由许多得决策树组成，因为这些决策树得形成采用了随机得方法，因此也叫做随机决策树。随机森林中得树之间就是没有关联得。当测试数据进入随机森林时，其实就就是让每一颗决策树进行分类，最后取所有决策树中分类结果最多得那类为最终得结果。因此随机森林就是一个包含多个决策树得分类器，并且其输出得类别就是由个别树输出得类别得众数而定。使用 randomForestSRC包得到得随机森林模型具有以下性质： Number of deaths: 27 Number of trees: 800 Minimum terminal node size: 3 Average no、 of terminal nodes: 14、4275 No、 of variables tried at each split: 3 Total no、 of variables: 452 Analysis: RSF Family: surv Splitting rule: logrank Error rate: 19、87% 发现直接使用随机森林得到得模型，预测误差很大，达到了19、8%,进一步考虑使用随机森林模型进行变量选择，结果如下： > our、rf$rfsrc、refit、obj Sample size: 52 Number of deaths: 19 Number of trees: 500 Minimum terminal node size: 2 Average no、 of terminal nodes: 11、554 No、 of variables tried at each split: 3 Total no、 of variables: 9 Analysis: RSF Family: surv Splitting rule: logrank *random* Number of random split points: 10 Error rate: 11、4% > our、rf$topvars [1] "213821_s_at" "219778_at" "204690_at" "220788_s_at" "202202_s_at" [6] "211603_s_at" "213055_at" "219336_s_at" "37892_at" 一共选取了9个变量，同时误差只有11、4% 接下来，使用这些变量做cox回归，剔除模型中不显著（>0、01）得变量，最终参与模型建立得变量共有4个。模型结果如下： exp(coef) exp(-coef) lower 、95 upper 、95 `218150_at` 1、6541 0、6046 0、11086 24、6800 `200914_x_at` 0、9915 1、0086 0、34094 2、8833 `220788_s_at` 0、2649 3、7750 0、05944 1、1805 `201398_s_at` 1、7457 0、5729 0、33109 9、2038 `201719_s_at` 2、4708 0、4047 0、93808 6、5081 `202945_at` 0、4118 2、4284 0、03990 4、2499 `203261_at` 3、1502 0、3174 0、33641 29、4983 `203757_s_at` 0、7861 1、2720 0、61656 1、0024 `205068_s_at` 0、1073 9、3180 0、02223 0、5181 最后选取六个变量拟合生存模型，绘制生存曲线如下：下面绘制ROC曲线，分别在训练集与测试集上绘制ROC曲线，结果如下：训练集：测试集：由于测试集上得样本过少，所以得到得AUC值波动大，考虑使用bootstrap多次计算训练集上得AUC值并求平均来测试模型得效果： AUC at 1 year：0、8039456 AUC at 3 year：0、6956907 AUC at 5 year：0、7024846 由此可以瞧到，随机森林通过删除贡献较低得变量，完成变量选择得工作，在测试集上具有较高得AUC值，但就是比lasso-cox模型得到得AUC略低。附录： load("~/R/brea、rda") library(survival) set、seed(10) i<-sample(1:77,52) train<-dat[i,] test<-dat[-i,] library(randomForestSRC) disease、rf<-rfsrc(Surv(time,status)~、,data = train, ntree = 800,mtry = 3, nodesize = 3,splitrule = "logrank") disease、rf our、rf<- var、select(object=disease、rf, vdv, method = "vh、vimp", nrep = 50) our、rf$rfsrc、refit、obj our、rf$topvars index<-numeric(var、rf$modelsize) for(i in 1:var、rf$modelsize){ index[i]<-which(names(dat)==var、rf$topvars[i]) } data<-dat[,c(1,2,index)] i<-sample(1:77,52) train<-data[i,] test<-data[-i,] mod、brea<-coxph(Surv(time,status)~、,data=train) train_data<-train[,c(1,2,which(summary(mod、brea)$coefficients[,5]<=0、1)+2)] tset_data<-test[,c(1,2,which(summary(mod、brea)$coefficients[,5]<=0、1)+2)] mod、brea1<-coxph(Surv(time,status)~、,data=train_data) summary(mod、brea1) names(coef(mod、brea1)) plot(survfit(mod、brea1),xlab="Time",ylab = "Proportion",main="Cox Model",conf、int=TRUE,col=c("black","red","red"),ylim=c(0、6,1)) index0<-numeric(length(coef(mod、brea1))) coefficients<-coef(mod、brea1) name<-gsub("\`","",names(coefficients)) for(j in 1:length(index0)){ index0[j]<-which(names(dat)==name[j]) } library(survivalROC) riskscore<-as、matrix(dat[i,index0])%*% as、matrix(coefficients) y1<-survivalROC(Stime=train$time,status=train$status,marker=riskscore,predict、time=1,span = 0、25*(nrow(train))^(-0、20)) y3<-survivalROC(Stime=train$time,status=train$status,marker=riskscore,predict、time=3,span = 0、25*(nrow(train))^(-0、20)) y5<-survivalROC(Stime=train$time,status=train$status,marker=riskscore,predict、time=5,span = 0、25*(nrow(train))^(-0、20)) a<-matrix(data=c("y1","y3","y5",y1$AUC,y3$AUC,y5$AUC),nrow=3,ncol=2);a plot(y1$FP,y1$TP,type="l",xlab="False Positive Rate",ylab = "True Positive Rate",main="Time-dependent ROC curve",col="green") lines(y3$FP,y3$TP,col="red",lty=2) lines(y5$FP,y5$TP,col="blue",lty=3) legend("bottomright",bty="n",legend = c("AUC at 1 year:0、9271","AUC at 3 years:0、8621","AUC at 5 years:0、8263"),col=c("green","red","blue"),lty=c(1,2,3),cex=0、9) abline(0,1) riskscore<-as、matrix(dat[-i,index0])%*% as、matrix(coefficients) y1<-survivalROC(Stime=test$time,status=test$status,marker=riskscore,predict、time=1,span = 0、25*(nrow(train))^(-0、20)) y3<-survivalROC(Stime=test$time,status=test$status,marker=riskscore,predict、time=3,span = 0、25*(nrow(train))^(-0、20)) y5<-survivalROC(Stime=test$time,status=test$status,marker=riskscore,predict、time=5,span = 0、25*(nrow(train))^(-0、20)) a<-matrix(data=c("y1","y3","y5",y1$AUC,y3$AUC,y5$AUC),nrow=3,ncol=2);a plot(y1$FP,y1$TP,type="l",xlab="False Positive Rate",ylab = "True Positive Rate",main="Time-dependent ROC curve",col="green") lines(y3$FP,y3$TP,col="red",lty=2) lines(y5$FP,y5$TP,col="blue",lty=3) legend("bottomright",bty="n",legend = c("AUC at 1 year:0、8761","AUC at 3 years:0、7611","AUC at 5 years:0、7611"),col=c("green","red","blue"),lty=c(1,2,3),cex=0、9) abline(0,1) a<-matrix(0,30,3) for (c in 1:30){ i<-sample(1:77,52) train<-data[i,] test<-data[-i,] mod、brea<-coxph(Surv(time,status)~、,data=train) train_data<-train[,c(1,2,which(summary(mod、brea)$coefficients[,5]<=0、1)+2)] tset_data<-test[,c(1,2,which(summary(mod、brea)$coefficients[,5]<=0、1)+2)] mod、brea1<-coxph(Surv(time,status)~、,data=train_data) names(coef(mod、brea1)) index0<-numeric(length(coef(mod、brea1))) coefficients<-coef(mod、brea1) name<-gsub("\`","",names(coefficients)) for(j in 1:length(index0)){ index0[j]<-which(names(dat)==name[j]) } riskscore<-as、matrix(dat[-i,index0])%*% as、matrix(coefficients) y1<-survivalROC(Stime=test$time,status=test$status,marker=riskscore,predict、time=1,span = 0、25*(nrow(train))^(-0、20)) y3<-survivalROC(Stime=test$time,status=test$status,marker=riskscore,predict、time=3,span = 0、25*(nrow(train))^(-0、20)) y5<-survivalROC(Stime=test$time,status=test$status,marker=riskscore,predict、time=5,span = 0、25*(nrow(train))^(-0、20)) a[c,]<-c(y1$AUC,y3$AUC,y5$AUC) }

展开阅读全文