Organizing and Describing Statistical Data (statistics courseware, PDF)

Uploaded by: 曲**** | Document ID: 229882 | Uploaded: 2023-03-20 | Format: PDF | Pages: 97 | Size: 5.28 MB

Chapter 2. Organizing and Describing Statistical Data

Department of Epidemiology & Biostatistics, School of Public Health, Nanjing Medical University

The Main Contents of the Statistical Analysis

Main contents
- Individual variation
- Frequency distribution: compiling a frequency table; types of data distribution; uses of a frequency table
- Statistics of quantitative data
- Summary

Individual variation
- Individual variation is the difference observed among homogeneous subjects.
- Variation is the combined response of an organism to one or more uncontrollable factors, known or unknown.
- For any single observation unit, the variation of the measured indicator is random.
- For the population as a whole, individual variation follows regular patterns.

Individual variation is the precondition for applying statistics.

2.1 Individual variation
- Variation in living organisms is universal. It is an objective fact and cannot be predicted exactly.
- This variation follows regular patterns and can be understood.

SNPs: common DNA sequence variations among individuals
Figure: aligned DNA sequences from three individuals that differ at a single position (for example CTCTATG versus CTCTATC).

A raw data set
The heights (cm) of 120 boys aged 12 in a certain city in 1997 are listed below (one value is illegible in the extracted text and is shown as 138.?).

  142.3  156.6  142.7  145.7  138.2  141.6  142.5  134.4  148.8  137.9
  151.3  140.8  149.8  145.2  150.3  133.1  142.7  141.9  140.7  141.2
  143.5  139.2  144.7  138.?  140.2  137.4  142.9  134.9  143.6  140.9
  141.4  160.9  134.7  138.5  138.9  141.2  148.9  154.0  135.5  144.4
  143.4  140.2  145.4  142.4  143.9  151.1  141.5  148.8  139.3  141.9
  145.1  145.8  142.3  125.9  154.2  137.9  137.7  138.5  147.7  152.3
  137.4  143.6  148.9  146.7  144.0  145.4  140.1  150.6  147.8  140.5
  147.9  150.8  132.7  152.9  139.9  149.7  139.6  143.5  146.6  132.1
  150.0  143.3  139.2  139.6  130.5  134.5  148.8  141.8  146.8  135.1
  146.2  143.3  156.3  139.5  146.4  143.8  138.9  134.7  147.3  144.5
  137.1  147.1  147.9  141.8  141.4  147.5  136.9  148.1  142.9  129.4
  142.5  145.9  146.7  144.0  146.5  149.0  142.1  142.4  138.7  139.9

2.2 Frequency distribution
- Current situation: raw data are often voluminous and disorderly.

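- Cause: because of individual variation, the observed result is not constant from one individual to the next.
- Feature: the data look chaotic on the surface, but their distribution follows a definite pattern.
- Solution: frequency distribution tables and frequency distribution plots.

2.2 Frequency distribution
(I) Compiling a frequency table
- Find the range R: R = 160.9 - 125.9 = 35.
- Divide the data into class intervals: decide the number of classes, the class limits, and the class width.
- Count the frequency in each class.

Frequency distribution table for quantitative data

  Class (lower limit, cm)   Frequency   Relative frequency
  124                            1           0.0083
  128                            2           0.0167
  132                           10           0.0833
  136                           22           0.1834
  140                           37           0.3083
  144                           26           0.2167
  148                           15           0.1250
  152                            4           0.0333
  156                            2           0.0167
  160                            1           0.0083
  Total                        120           1.0000

Figure: frequency distribution plot for quantitative data.

Frequency distribution of qualitative and ordinal data
- Qualitative data: classify the observations by the natural attribute of the indicator and count the frequencies.
- Ordinal data: classify the observations by grade of the indicator and count the frequencies.

  Blood type   Frequency   Relative frequency (%)
  O               205          40.43
  A               112          22.09
  B               150          29.59
  AB               40           7.89
  Total           507         100.00

2.2 Frequency distribution: summary
- For quantitative data, the range of values is divided by the analyst into a number of contiguous intervals and the frequency in each is counted.
- A frequency distribution expresses the distribution pattern of an indicator.
- Distribution pattern: the pattern of variation.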
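The three steps for compiling a frequency table (range, class intervals, counts) can be sketched in Python as follows. This is a minimal illustration, not the courseware's own procedure: the function name `frequency_table` is hypothetical, the classes are assumed to be closed on the left and open on the right, and only the first ten heights from the raw data are used to keep the example short.

```python
import math
from collections import Counter

def frequency_table(values, lower, width):
    """Count values per class [lower + k*width, lower + (k+1)*width)."""
    counts = Counter()
    for v in values:
        k = math.floor((v - lower) / width)  # index of the class containing v
        counts[lower + k * width] += 1
    n = len(values)
    # One row per non-empty class: (lower limit, frequency, relative frequency)
    return [(lo, f, f / n) for lo, f in sorted(counts.items())]

# First ten heights (cm) from the raw data, used only as a small demo.
heights = [142.3, 156.6, 142.7, 145.7, 138.2, 141.6, 142.5, 134.4, 148.8, 137.9]

data_range = max(heights) - min(heights)               # step 1: range R
table = frequency_table(heights, lower=124, width=4)   # steps 2 and 3

for lo, f, rel in table:
    print(f"{lo:>5}  {f:>3}  {rel:.4f}")
```

Starting the first class at 124 with a width of 4 mirrors the class limits in the table; with all 120 values the same function would produce the full set of ten classes.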

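2.2 Frequency distribution
(II) Types of frequency distribution
- By symmetry: symmetric distribution; skewed distribution.
- By number of peaks: single-peak (unimodal) distribution; bimodal or multi-peak distribution.

Figure: frequency distribution plot; horizontal axis: height (cm).

Skewness
- "Skewness means the lack of symmetry in a probability distribution." (The Cambridge Dictionary of Statistics in the Medical Sciences)
- "An asymmetric distribution is called skew." (Armitage, Statistical Methods in Medical Research)
- An asymmetric distribution is said to show skewness and is commonly called a skewed distribution.
- "Skewed" here means deviating: a few observations lie far from the mean. It does not mean that the central location is shifted.
- "A distribution that is not symmetric is called a skewed distribution. Skewed distributions are divided into positively skewed and negatively skewed. A positively skewed distribution has its long tail to the right of the peak and is also called right-skewed; a negatively skewed distribution has its long tail to the left of the peak and is also called left-skewed."

Figure (a): frequency distribution for 239 people [...].
Figure: frequency distribution of self-rated quality-of-life scores among 892 elderly residents of a city.
Figure (b): frequency distribution of survival time in 102 melanoma patients.
Figure: distribution of age at death among men in a certain area, 1990-1992.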
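The direction of skew described above can also be checked numerically. The slides give no formula for this, so the sketch below assumes one common choice, the moment coefficient of skewness g1 = m3 / m2^1.5; the function names and the three small data sets are hypothetical illustrations.

```python
def moment_skewness(values):
    """Moment coefficient of skewness g1 = m3 / m2**1.5 (an assumed definition)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

def describe_skew(values, tol=1e-12):
    """Label the direction of the long tail from the sign of g1."""
    g1 = moment_skewness(values)
    if g1 > tol:
        return "positive (right) skew"
    if g1 < -tol:
        return "negative (left) skew"
    return "symmetric"

right_tailed = [1, 2, 2, 3, 10]            # hypothetical: long tail to the right
left_tailed = [-v for v in right_tailed]   # mirror image: long tail to the left
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right_tailed", right_tailed),
                   ("left_tailed", left_tailed),
                   ("symmetric", symmetric)]:
    print(name, describe_skew(data))
```

A positive coefficient corresponds to the long tail lying to the right of the peak, and a negative one to the long tail lying to the left.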

2.2 Frequency distribution
(III) Uses of a frequency table
- Check for suspicious values.
- Examine the features of the distribution.
- Examine the type of the distribution.
- Make further calculation easier.

The Importance of Graphs!

Figure: frequency distributions of numerical variables (panels: tc, tg, hdl_c, ldl_c).

In 1973 the statistician F. J. Anscombe constructed four peculiar data sets.

Anscombe's Quartet

          I                II               III               IV
      x       y        x       y        x       y        x       y
    10.0    8.04     10.0    9.14     10.0    7.46      8.0    6.58
     8.0    6.95      8.0    8.14      8.0    6.77      8.0    5.76
    13.0    7.58     13.0    8.74     13.0   12.74      8.0    7.71
     9.0    8.81      9.0    8.77      9.0    7.11      8.0    8.84
    11.0    8.33     11.0    9.26     11.0    7.81      8.0    8.47
    14.0    9.96     14.0    8.10     14.0    8.84      8.0    7.04
     6.0    7.24      6.0    6.13      6.0    6.08      8.0    5.25
     4.0    4.26      4.0    3.10      4.0    5.39     19.0   12.50
    12.0   10.84     12.0    9.13     12.0    8.15      8.0    5.56
     7.0    4.82      7.0    7.26      7.0    6.42      8.0    7.91
     5.0    5.68      5.0    4.74      5.0    5.73      8.0    6.89

  Property                                        Value
  Mean of x in each case                          9.0
  Variance of x in each case                      11.0
  Mean of y in each case                          7.5
  Variance of y in each case                      4.12
  Correlation between x and y in each case        0.816
  Linear regression line in each case             y = 3 + 0.5x

What is peculiar: judged by these summary statistics alone, the four data sets appear to describe almost the same situation.
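The summary properties listed for Anscombe's quartet can be recomputed directly from the four data sets. The sketch below uses only the Python standard library; the helper name `summarize` is hypothetical, and the regression line is an ordinary least-squares fit computed from sums of squares and cross-products.

```python
from statistics import mean, variance

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared x for sets I to III
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(x, y):
    """Means, sample variances, Pearson correlation and least-squares line."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mx, "var_x": variance(x),
        "mean_y": my, "var_y": variance(y),
        "r": sxy / (sxx * syy) ** 0.5,
        "slope": slope, "intercept": my - slope * mx,
    }

summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, s in summaries.items():
    print(name, {k: round(v, 3) for k, v in s.items()})
```

The four printed rows agree to about two or three decimal places, which is exactly why plotting the data is needed to tell the sets apart.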

In fact, the four data sets are worlds apart!

Figure: "Causes of Mortality in the Army in the East, April 1854 to March 1855" (legend: Non-Battle, Battle). From: F. Nightingale, "Notes on Matters Affecting the Health, Efficiency and Hospital Administration of the British Army", 1858.

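The best statistical graphic in history
Figure: map of Napoleon's march in the war with Russia, 1812-18XX (C. J. Minard, 1869).

Statistics of quantitative data: describing central tendency; describing dispersion; correct application.

2.3 Describing quantitative data
- Graphical description: frequency distribution plot; trend plot.
- Numerical description:
  - Central location: arithmetic mean, geometric mean, median, percentile.
  - Dispersion: range, standard deviation, variance, interquartile range.

(I) Describing central tendency (average)
- Arithmetic mean (mean)
- Geometric mean
- Median
- Percentile

Arithmetic mean

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Weighted mean (with weights $w_i$ that sum to 1)

$$\bar{x}_w = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$$

The arithmetic mean is a special case of the weighted mean in which every weight equals $1/n$:

$$\bar{x} = \frac{1}{n} x_1 + \frac{1}{n} x_2 + \cdots + \frac{1}{n} x_n$$

Use of the mean
1. It is best suited to symmetric data, especially normally distributed data.
2. It is the balance point of a set of data.
3. It is easily affected by extreme values. For skewed data the mean does not reflect the central tendency well.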
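The relation between the arithmetic mean and the weighted mean can be illustrated with a short Python sketch. It assumes the class midpoints of the height frequency table (classes of width 4 starting at 124, so midpoints 126, 130, and so on) and uses the class frequencies as weights; the function names are hypothetical.

```python
# Class midpoints (cm) and frequencies taken from the height frequency table.
midpoints = [126, 130, 134, 138, 142, 146, 150, 154, 158, 162]
freqs = [1, 2, 10, 22, 37, 26, 15, 4, 2, 1]

def arithmetic_mean(xs):
    """Sum of the values divided by their number."""
    return sum(xs) / len(xs)

def weighted_mean(xs, ws):
    """Weighted mean; the weights are normalised so that they sum to 1."""
    total = sum(ws)
    return sum(x * w / total for x, w in zip(xs, ws))

# Frequencies as weights give the mean of the grouped heights.
grouped_mean = weighted_mean(midpoints, freqs)

# Equal weights reduce the weighted mean to the arithmetic mean.
sample = [8, 3, 5, 12, 4, 10]
equal_weight_mean = weighted_mean(sample, [1] * len(sample))

print(round(grouped_mean, 2), equal_weight_mean, arithmetic_mean(sample))
```

Normalising the weights inside the function is a design choice that lets raw frequencies be passed directly; passing weights that already sum to 1 gives the same result.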

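Zhang Village has one Zhang worth ten million; next door live nine paupers.
Work out the average, and everyone is a Zhang worth a million.
What does this tell us?

Geometric mean

Direct method:

$$G = \sqrt[n]{X_1 X_2 \cdots X_n}$$

$$\overline{\ln X} = \frac{\ln X_1 + \ln X_2 + \cdots + \ln X_n}{n}, \qquad G = \exp\left(\overline{\ln X}\right)$$

Weighted method:

$$G = \lg^{-1}\left(\frac{\sum f \lg x}{\sum f}\right)$$

Example: 1:10, 1:20, 1:40, 1:80, 1:160

$$G = \sqrt[5]{10 \times 20 \times 40 \times 80 \times 160} = 40$$

$$\overline{\ln X} = \frac{\ln 10 + \ln 20 + \ln 40 + \ln 80 + \ln 160}{5} = 3.6889, \qquad G = e^{3.6889} = 40$$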
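The direct, logarithmic and weighted forms of the geometric mean can be compared on the worked example. This is a sketch with hypothetical function names; it rejects non-positive values because their logarithm is undefined.

```python
import math

def _check_positive(xs):
    # Logarithms are undefined for zero or negative values.
    if min(xs) <= 0:
        raise ValueError("geometric mean requires strictly positive values")

def geometric_mean_direct(xs):
    """n-th root of the product of the values."""
    _check_positive(xs)
    return math.prod(xs) ** (1 / len(xs))

def geometric_mean_log(xs):
    """exp of the mean of the natural logarithms."""
    _check_positive(xs)
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def geometric_mean_weighted(xs, fs):
    """Antilog (base 10) of sum(f * lg x) / sum(f) for values xs with frequencies fs."""
    _check_positive(xs)
    return 10 ** (sum(f * math.log10(x) for x, f in zip(xs, fs)) / sum(fs))

reciprocal_dilutions = [10, 20, 40, 80, 160]
mean_log = sum(math.log(x) for x in reciprocal_dilutions) / len(reciprocal_dilutions)

print(round(mean_log, 4))
print(round(geometric_mean_direct(reciprocal_dilutions), 6))
print(round(geometric_mean_log(reciprocal_dilutions), 6))
print(round(geometric_mean_weighted(reciprocal_dilutions, [1] * 5), 6))
```

With every frequency equal to 1 the weighted form reduces to the direct form, so all three functions agree on this example.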

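Use of the geometric mean
1. Data that form a geometric series, such as mean antibody titres.
2. Log-normally distributed data.

Cautions when using the geometric mean
1) No observation may be 0.
2) The observations may not include both positive and negative values. If all values are negative, drop the minus signs for the calculation and restore the sign in the result.

Median
Arrange the data in ascending order; the value in the middle position is the median. It reflects the average level of a set of observations in terms of rank: 50% of the observations lie below it and 50% above it.

$$M = \begin{cases} X_{(n+1)/2}, & n \text{ odd} \\ \left(X_{n/2} + X_{n/2+1}\right)/2, & n \text{ even} \end{cases}$$

Example: hair mercury values of 10 healthy people: 1.1, 1.8, 3.5, 4.2, 4.8, 5.6, 5.9, 7.1, 10.5, 16.
M = (4.8 + 5.6)/2 = 5.2
With only the first nine values (n odd), M = 4.8.

Use of the median
1. It is not easily affected by extreme values.
2. It can be used for data with any distribution. It is commonly used for large-sample skewed data, data containing undetermined values, and data whose distribution is unknown.
3. For a symmetric distribution, the median and the mean are theoretically identical.

Percentile
The percentile $P_X$ divides the ordered data so that X% of the observations lie below it and (100 - X)% lie above it. The 50th percentile is the median.

Summary of measures of central tendency

  Measure          Suitable data                                        Data used          Effect of extreme values
  Mean             Unimodal symmetric distributions                     All data           Sensitive
  Geometric mean   Geometric-series data; log-normal distributions      All data           Sensitive; positive and negative values cannot be mixed
  Median           Any distribution; skewed data; undetermined values   Middle data        Insensitive
  Percentile       Any distribution; skewed data; undetermined values   Part of the data   Insensitive

The drawback of describing data with an average alone
"It has been said that a fellow with one leg frozen in ice and the other leg in boiling water is comfortable. ON AVERAGE!"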
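The median rule and the percentile idea can be tried on the hair mercury example, and the Zhang Village rhyme shows why the median resists extreme values. This is a sketch: `median_by_rule` is a hypothetical helper, and the quartiles rely on the "inclusive" interpolation rule of Python's `statistics.quantiles`, which is an assumption because the slides do not fix a percentile convention.

```python
from statistics import mean, median, quantiles

hair_mercury = [1.1, 1.8, 3.5, 4.2, 4.8, 5.6, 5.9, 7.1, 10.5, 16]

def median_by_rule(values):
    """Middle value for odd n; mean of the two middle values for even n."""
    xs = sorted(values)
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 == 1 else (xs[mid - 1] + xs[mid]) / 2

m_even = median_by_rule(hair_mercury)      # n = 10: average of 5th and 6th values
m_odd = median_by_rule(hair_mercury[:9])   # n = 9: the 5th value

# Quartiles P25, P50, P75; the interpolation rule is an assumed convention.
p25, p50, p75 = quantiles(hair_mercury, n=4, method="inclusive")

# Zhang Village: one household worth ten million and nine worth nothing.
wealth = [10_000_000] + [0] * 9
village_mean = mean(wealth)      # pulled up by the single extreme value
village_median = median(wealth)  # unaffected by it

print(m_even, m_odd, p25, p50, p75)
print(village_mean, village_median)
```

The 50th percentile coincides with the median, while the village mean says little about the nine households that own nothing.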

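(II) Describing dispersion

Example: body weights (kg) of three groups of children of the same age and sex:

  Group A   26  28  30  32  34
  Group B   24  27  30  33  36
  Group C   26  29  30  31  34

Figure: dot plot of the three groups on a common axis from 24 to 36 kg.

Measures of dispersion
- Range
- Interquartile range
- Variance
- Standard deviation
- Coefficient of variation

Range
The range, also called the full range, is denoted R and describes the span over which the data are distributed. A large range indicates that the data are widely spread.

Features:
- The method is simple and clear.
- It is insensitive: apart from the maximum and minimum, it reflects nothing about the variation of the other values in the group.
- It is unstable: in larger samples, very large and very small values are more likely to be drawn, so the sample range tends to be larger. When sample sizes differ greatly, the range should not be used to compare dispersion.

For the three groups above:
  Group A: R = 34 - 26 = 8
  Group B: R = 36 - 24 = 12
  Group C: R = 34 - 26 = 8
The data in groups A and C are more concentrated than those in group B. But are groups A and C equally dispersed?

Interquartile range
(1) Quartiles (Q): the lower quartile is the 25th percentile, usually denoted $Q_L$; the upper quartile is the 75th percentile, usually denoted $Q_U$.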

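(2) The interquartile range is the distance between the upper and lower quartiles, that is, the difference between $Q_U$ and $Q_L$. It is the interval that contains the middle half of the data after sorting in ascending order.

  25% | 25% | 25% | 25%
      Q_L         Q_U

(III) Variance and standard deviation

Deviations about the Mean
Data: 8, 3, 5, 12, 4, 10 (n = 6)
Figure: the six values plotted on a number line from 2 to 12.

1. Taking the average difference won't work:

$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0 \quad\Rightarrow\quad \text{average difference} = 0$$

2. Why not take the modulus of the difference?

$$\sum_{i=1}^{n} \left| x_i - \bar{x} \right|$$

   - Does not have nice mathematical properties.

3. Alternatively we could square the differences:

$$\sum_{i=1}^{n} (x_i - \bar{x})^2$$

   + This has nice mathematical properties.
   - For samples of data, averaging this sum over n systematically underestimates the population variance.

The (sample) Variance
5. The variance for a sample of observations $x_1, x_2, \ldots, x_n$ is defined to be

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

   - A small problem with the sample variance is that its units are the squares of the observations' units.

The (sample) Standard Deviation
6. To get round this we take square roots to obtain the sample standard deviation.

Calculating the standard deviation

Direct method

Population standard deviation:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$

Sample standard deviation:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$

Weighted method

$$s = \sqrt{\frac{\sum f_i x_i^2 - \left(\sum f_i x_i\right)^2 / \sum f_i}{\sum f_i - 1}}$$

where $x_i$ is the midpoint of each class interval and $f_i$ is the corresponding frequency.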
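The argument about deviations, and the direct and weighted formulas for the standard deviation, can be checked numerically. The sketch below uses the six-value data set from the slide and, for the weighted method, assumes the class midpoints and frequencies of the height frequency table; `grouped_sd` is a hypothetical name.

```python
import math
from statistics import stdev

data = [8, 3, 5, 12, 4, 10]
n = len(data)
xbar = sum(data) / n

deviations = [x - xbar for x in data]
sum_dev = sum(deviations)                   # the deviations cancel out
sum_abs = sum(abs(d) for d in deviations)   # modulus of the deviations
sum_sq = sum(d * d for d in deviations)     # squared deviations

var_n = sum_sq / n              # dividing by n
var_sample = sum_sq / (n - 1)   # sample variance, dividing by n - 1
sd_sample = math.sqrt(var_sample)

def grouped_sd(midpoints, freqs):
    """Weighted-method SD from class midpoints and their frequencies."""
    total = sum(freqs)
    sfx = sum(f * x for x, f in zip(midpoints, freqs))
    sfx2 = sum(f * x * x for x, f in zip(midpoints, freqs))
    return math.sqrt((sfx2 - sfx ** 2 / total) / (total - 1))

# Midpoints assume classes of width 4 starting at 124 (height table).
midpoints = [126, 130, 134, 138, 142, 146, 150, 154, 158, 162]
freqs = [1, 2, 10, 22, 37, 26, 15, 4, 2, 1]
height_sd = grouped_sd(midpoints, freqs)

print(xbar, sum_dev, sum_abs, sum_sq)
print(round(var_n, 3), round(var_sample, 3), round(sd_sample, 3))
print(round(height_sd, 2))
```

The weighted formula is algebraically the same as applying the direct sample formula to a list in which each midpoint is repeated as many times as its frequency.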

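(III) Variance and standard deviation

  Group   Data                  Range   Variance   Standard deviation
  A       26  28  30  32  34      8       10.0        3.16
  B       24  27  30  33  36     12       22.5        4.74
  C       26  29  30  31  34      8        8.5        2.92

(IV) Coefficient of variation (CV)

$$CV = \frac{s}{\bar{x}} \times 100\%$$

Example: for 100 men aged 20 in a certain area, the mean height is 166.06 cm with a standard deviation of 4.98 cm, and the mean weight is 53.72 kg with a standard deviation of 4.96 kg. Which varies more, height or weight?

Because the units of measurement differ, the two standard deviations cannot be compared directly; the coefficients of variation should be compared instead:

Height: $CV = \frac{4.98}{166.06} \times 100\% = 3.00\%$
Weight: $CV = \frac{4.96}{53.72} \times 100\% = 9.23\%$

So among 20-year-old men in this area, weight varies more than height.

(II) Describing dispersion
Summary of measures of dispersion

  Measure                        Suitable data                                        Data used             Effect of extreme values
  Range                          Any distribution                                     The two extremes      Sensitive
  Interquartile range            Commonly used for skewed distributions               Middle data           Insensitive
  Variance, standard deviation   Normal distribution                                  All data              Sensitive
  Coefficient of variation       Different units, or means that differ greatly        All data              Sensitive

Summary (1)
- Averages should be computed separately for data that are not homogeneous.
- Each measure has its own range of application.
- The median and percentiles are unstable when the sample is small, and the closer to either end of the distribution, the less stable they are.
- The median resists extreme values better than the mean, but it is less precise than the mean.
- Therefore, when the data are suitable for a mean or a geometric mean, the median should not be used to express the average level.

Summary (2)
- The standard deviation is built on deviations from the mean: it shows the distance between a set of values and their mean, so it describes the dispersion of the values directly, in summary, and on average.
- Given homogeneous data, a large standard deviation means that the values are widely dispersed, that is, scattered, uneven and fluctuating strongly; a small standard deviation means that the values are concentrated, even and fluctuating little.
- The coefficient of variation is derived from the standard deviation. Its value lies in removing the influence of the average level and eliminating the unit of measurement.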
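The dispersion measures for the three weight groups and the two coefficients of variation from the height and weight example can be recomputed as a check. This is a sketch: the function names are hypothetical, and the quartiles use the "inclusive" rule of Python's `statistics.quantiles`, an assumed convention that the slides do not specify.

```python
from statistics import quantiles, stdev, variance

groups = {
    "A": [26, 28, 30, 32, 34],
    "B": [24, 27, 30, 33, 36],
    "C": [26, 29, 30, 31, 34],
}

def dispersion(values):
    """Range, interquartile range, sample variance and sample SD."""
    q_l, _, q_u = quantiles(values, n=4, method="inclusive")  # assumed quartile rule
    return {
        "range": max(values) - min(values),
        "iqr": q_u - q_l,
        "variance": variance(values),
        "sd": stdev(values),
    }

def cv_percent(sd, mean_value):
    """Coefficient of variation as a percentage."""
    return sd / mean_value * 100

summary = {name: dispersion(values) for name, values in groups.items()}
for name, s in summary.items():
    print(name, {k: round(v, 2) for k, v in s.items()})

cv_height = cv_percent(4.98, 166.06)   # height: SD 4.98 cm, mean 166.06 cm
cv_weight = cv_percent(4.96, 53.72)    # weight: SD 4.96 kg, mean 53.72 kg
print(round(cv_height, 2), round(cv_weight, 2))
```

Groups A and C share the same range, yet their interquartile ranges and variances differ, which is the reason the range alone is not enough to compare dispersion.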

Summary (3): average and variation
- Report the mean ± standard deviation (min, max), or the median ± interquartile range (min, max).
- When the variation is small, the mean is representative.
- When the variation is large, the data are scattered and the mean is poorly representative.
- The central tendency expressed by the average and the dispersion expressed by the variation describe quantitative data from two different angles.

Summary (4)
- Statistical analysis starts from the raw data and ends with uncovering the inherent regularity of objective phenomena.
- Statistical description aims to find the magnitude of an indicator and the pattern of its distribution.
- Statistical description is the foundation of all statistics; statistical inference is the main content of modern statistics.

Summary
- Every observed indicator has its own pattern of variation.
- Describing variation: graphical description; description by statistics.
- Averages: mean, geometric mean, median.
- Variation: standard deviation (variance), interquartile range, coefficient of variation, range.
- Indicators with different distributions are described with different statistics; an average and a measure of variation are used together.

Air Quality and Diabetes Prevalence in United States 20XX-20XX: a Time Series Cross-section Analysis
Honggang Yi (1, 2), Wei Yang (1)
1 Nevada Center for Health Statistics and Informatics, School of Community Health Sciences, University of Nevada, Reno, NV, USA.
2 Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu, P.R. China

The relationship between PM2.5 and DM in the United States, 20XX-20XX
Objective: To study the association between diabetes mellitus in the general population and annual mean levels of the particulate matter (PM2.5) air quality index (AQI). This paper presents a series of analyses of time-series cross-sectional data from 56 metropolitan areas of the U.S. covering different periods between 20XX and 20XX.
Results: A statistical association was found between the weighted prevalence of diabetes mellitus and the annual mean of PM2.5 AQI [...]. Our study suggested that exposure to relatively higher levels of average annual PM2.5 AQI may increase the likelihood of diabetes mellitus.

UNIVERSITY OF NEVADA, RENO
Air Quality and Diabetes Prevalence in United States 2002-2006: a Time Series Cross-section Analysis
Statistical Analysis Report
Hongang Yi, 3/25/2010

YOU MAY COPY THE RESULT, THE TABLE, THE FIGURE, THE CONCLUSION TO YOUR ARTICLE. PLEASE DO NOT COPY THE WORDS; WRITE WITH YOUR OWN WORDS!

Contents of the report
1. Introduction
   1.1 Data Arrangement
   1.2 Statistical description of data
       1.2.1 Summary statistics for key study variables
       1.2.2 Summary descriptions for dependent variable
       1.2.3 Summary descriptions for independent variables
       1.2.4 Summary descriptions for socioeconomic, demographic variables
4. One-way Fixed Effect Models: Units Effects
   4.1 The Pooled OLS Regression Model
   4.2 LSDV1 without a Dummy
   4.3 LSDV2 without the Intercept
   4.4 LSDV3 with Restrictions
   4.5 Within Group Effect Model
       4.5.1 Manually Estimating the Within Effect Model
       4.5.2 Within Group Effect Model Using SAS
       4.5.3 Within Group Effect Model Using Stata
   4.6 Between Group Effect Model: Group Mean Regression
   4.7 Testing Fixed Group Effects (F-test)
5. One-way Fixed Effect Models: Time Effects
   5.1 Least Squares Dummy Variable Models (Time Effects)
       5.1.1 Least Squares Dummy Variable Models (Time Effects) Using SAS
       5.1.2 LSDV2 without the Intercept (Time Effects)
       5.1.3 LSDV3 with a Restriction (Time Effects)
   5.2 Within Time Effect Model Using SAS
   5.3 Within Time Effect Model Using Stata
   5.4 Between Time Effect Model
   5.5 Testing Fixed Time Effects
6. Two-way Fixed Effect Models
   6.1 LSDV1 without Two Dummies
   6.2 LSDV1+LSDV2: Drop a Dummy and Suppress the Intercept
   6.3 LSDV1+LSDV3: Drop a Dummy and Impose a Restriction
   6.4 LSDV2+LSDV3: Suppress the Intercept and Impose a Restriction
   6.5 LSDV3 with Two Restrictions
   6.6 Two-way Within Effect Model
   6.7 Two-way Within Effect Model Using Stata
   6.8 Testing Two-way Fixed Effects
7. Random Effect Models
   7.1 One-way Random Group Effect Model
   7.2 One-way Random Time Effect Model
   7.3 Two-way Random Effect Model in SAS
   7.4 Testing Random Effect Models
   7.5 Fixed Effects versus Random Effects
8. Poolability Test
9. Conclusion
Appendix
References

Data Structure:

  Obs   Metropolitan area   Year   DM prevalence (%)   Annual mean of PM2.5 AQI   Hispanic (%)   Black (%)   High-school graduates (%)   High per capita income (%)
    1        200            2002        5.0882               43.65                  39.3208        2.1175          88.9342                    17.8700
    2        200            2003        4.5689               40.53                  37.8947        1.6784          88.6372                    18.6399
    3        200            2004        6.3196               27.41                  38.4373        2.0892          89.3085                    21.0366
    4        200            2005        5.8424               25.96                  37.6839        1.9430          89.8401                    21.5617
    5        200            2006        6.5109               28.03                  40.8615        2.6307          90.6443                    21.9658
    6        520            2002        4.7830               52.16                   4.6341       29.4731          91.4733                    26.1048
    7        520            2003        6.2684               53.40                   2.4718       28.1676          91.8060                    27.0975
    8        520            2004        5.9758               53.70                   4.6457       30.2998          89.5073                    27.5723
    9        520            2005        7.3699               52.48                   3.4320       23.9121          91.6825                    33.0946
   10        520            2006        7.4887               51.84                   5.4376       30.1751          94.2618                    34.9013
   11        720            2002        6.2360               53.60                   3.3282       24.2245          90.9566                    27.2284
   12        720            2003        7.0115               52.18                   3.2677       17.3663          92.5411                    29.9011
   13        720            2004        7.8068               50.37                   2.9168       25.0116          93.2597                    33.2016
   14        720            2005        7.7352               50.83                   2.0402       21.2081          92.4624                    32.1921
   15        720            2006        8.3075               46.87                   3.0562       24.7281          91.7282                    36.0285
   16        875            2002        5.7239               43.63                  11.0587        7.3655          91.5038                    30.7248
   17        875            2003        6.7910               53.70                  16.6691        6.3601          90.1746                    31.7317
   18        875            2004        6.0752               38.88                  20.0993        8.5908          87.8147                    34.3427
   19        875            2005        5.4305               45.13                   6.3021        2.1788          94.1704                    40.2917
   20        875            2006        6.6428               38.26                  19.9493        8.2263          89.5446                    35.6685
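The data structure is a panel: one row per metropolitan area per year. A short sketch shows how such rows can be summarised by area or by year as mean and standard deviation. It uses only the first ten rows of the table (areas 200 and 520) and two of the columns; the function name `summarize_by` is hypothetical, and this is not the report's own code, which was written for SAS and Stata.

```python
from collections import defaultdict
from statistics import mean, stdev

# (area, year, DM prevalence %, annual mean PM2.5 AQI): first ten table rows.
rows = [
    (200, 2002, 5.0882, 43.65), (200, 2003, 4.5689, 40.53),
    (200, 2004, 6.3196, 27.41), (200, 2005, 5.8424, 25.96),
    (200, 2006, 6.5109, 28.03),
    (520, 2002, 4.7830, 52.16), (520, 2003, 6.2684, 53.40),
    (520, 2004, 5.9758, 53.70), (520, 2005, 7.3699, 52.48),
    (520, 2006, 7.4887, 51.84),
]

def summarize_by(rows, key_index, value_index):
    """Group rows by one column and return (mean, SD) of another column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row[value_index])
    return {k: (mean(v), stdev(v)) for k, v in sorted(groups.items())}

dm_by_area = summarize_by(rows, key_index=0, value_index=2)  # cross-section view
dm_by_year = summarize_by(rows, key_index=1, value_index=2)  # time-series view

for area, (m, s) in dm_by_area.items():
    print(f"area {area}: DM prevalence {m:.2f} ± {s:.2f}")
for year, (m, s) in dm_by_year.items():
    print(f"year {year}: DM prevalence {m:.2f} ± {s:.2f}")
```

With all 56 areas, grouping by year in this way would give year-by-year means and standard deviations of the kind reported in a summary table.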

1.2.1 Summary statistics for key study variables

Table 1. Summary Characteristics of the 56 Metropolitan Areas of the U.S. Analyzed*

                                                        Mean values
  Variable                          2002            2003            2004            2005            2006            Total
  DM prevalence (%)†                6.48 ± 1.37     6.71 ± 1.35     6.74 ± 1.22     7.13 ± 1.33     7.62 ± 1.63     6.94 ± 1.43
  AQI of PM2.5                      44.21 ± 10.80   44.72 ± 10.88   40.23 ± 9.38    42.37 ± 10.03   39.04 ± 9.58    42.11 ± 10.32
  High per capita income (%)†‡      20.91 ± 5.98    22.40 ± 6.02    23.42 ± 6.44    26.29 ± 7.08    28.47 ± 6.91    24.30 ± 7.01
  High-school graduates (%)†        90.24 ± 3.66    90.37 ± 3.35    90.04 ± 3.78    90.81 ± 3.24    90.51 ± 4.29    90.39 ± 3.67
  Black population (%)†             11.81 ± 9.93    11.43 ± 9.45    11.99 ± 10.09   10.25 ± 9.38    11.29 ± 9.47    11.35 ± 9.62
  Hispanic population (%)†          8.11 ± 8.58     8.10 ± 8.66     9.09 ± 9.76     7.51 ± 7.55     9.47 ± 10.05    8.46 ± 8.93

* Plus-minus values are means ± SD. DM denotes diabetes. PM2.5 denotes particulate matter with an aerodynamic diameter less than or equal to 2.5 μm. AQI denotes the Air Quality Index, which is used to report daily air quality based on levels of the criteria pollutants.
† Proportions of the population are estimated by the Metropolitan Area-Level Weighting Methodology on U.S. Behavioral Risk Factor Surveillance System (BRFSS) survey data.
‡ High per capita income refers to the proportion of the population with an annual household income of $75,000 or more from all sources. Data on race and ethnic group were self-reported.

1.2.2 Summary descriptions for dependent variable
(1) The frequency distribution of the dependent variable is shown in Figure 1; the P value of the skewness/kurtosis tests for normality is 0.0015.

Figure 1. The Frequency Distribution of Dependent Variable
