Original text for foreign-language translation
Title: Fundamentals_of_Statistics
Measures of Central Tendency and Location: mean, median, mode, percentiles, quartiles and deciles.
x     sorted x
53    53
55    53
70    53
58    55
64    57
57    57
53    58
69    64
57    68
68    69
53    70
The Measures of Central Tendency are Mean, Median and Mode
Mean → x-bar (x̄). For a given variable, it is the sum of the values divided by the number of values (Σxi/n). In this case, we have n = 11, so we add all of the values together and divide by 11. Σx = 657, x̄ = 657/11 = 59.73
Median → the number in a distribution of a variable's responses where one half of the values are above and one half of the values are below. To find the median, we first need to put our data in ascending order (smallest to largest). Then we can determine the median: if the value of n is odd, it is simply the middle observation, but if the value of n is even, it is the average of the two middle observations.
In this case, n is odd, so the median will be the middle observation of our sorted values (the 6th value)...57
Mode → the value that occurs most frequently. If there are two different values most frequently occurring, the data are said to be bi-modal. If there are more than two modes, the distribution is said to be multi-modal. In this case, the value that occurs most often is 53. So, the mode is 53.
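As a quick check of the three measures above, here is a minimal Python sketch using only the standard library. The variable names are illustrative and not part of the original notes.

```python
import statistics

# The 11 observations from the table above
x = [53, 55, 70, 58, 64, 57, 53, 69, 57, 68, 53]

mean = sum(x) / len(x)            # sum of the values divided by n
median = statistics.median(x)     # middle observation of the sorted data (n is odd)
modes = statistics.multimode(x)   # every value tied for most frequent

print(round(mean, 2), median, modes)
```

`multimode` returns a list, so bi-modal or multi-modal data would show up as a list with more than one entry.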
The measures of location are Percentile, Quartile and Decile
Percentile → the pth percentile is a value such that at least p percent of the observations are less than or equal to this value and at least (100 - p) percent of the observations are greater than or equal to this value. To calculate percentiles, we use an index (i).
i = (p/100) n for p1, p2, p3,…p99
If i is a whole number (an integer), the percentile is the average of the values in positions i and i + 1, that is, positions (p/100)n and 1 + (p/100)n.
If i is not a whole number, we ALWAYS round up. The position of the percentile is the next whole number (integer) greater than the computed index.
For example:
i(p50) = (50/100)11 = 5.5...this rounds up to 6
So, we would count from the lowest value of the sorted data to the index number (6). Since the calculated i was not a whole number, we had to round up to find the value where at least 50% of the values are equal to or lower than this value and at least 50% are equal to or higher than this value. In this case, the value of the 50th percentile is the 6th value...57. Does this look familiar? → The 50th percentile is the same thing as the median.
What does it tell us? In this distribution, AT LEAST 50% of the observations are LESS THAN OR EQUAL TO 57 AND AT LEAST 50% of the observations are GREATER THAN OR EQUAL TO 57.
i(p80) = (80/100)11 = 8.8...this rounds up to 9. The 9th value is 68.
Again, since the index number is not a whole number, we round up. So, we would count from the lowest value of the sorted data to the index number (9). In this case, the value of the 80th percentile is 68.
Since this dataset has 11 observations, we won’t have any instances where our calculated index number is a whole number. However, if we just remove our value of 70 and create a new distribution, we will be able to see an example...
53 53 53 55 57 57 58 64 68 69
i(p30) = (30/100)10 = 3...this is a whole number, so we must take the 3rd and 4th values and average them to find the 30th percentile. (53 + 55)/2 = 54
So, the value of the 30th percentile is 54.
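The index rule used in these examples can be sketched in Python as follows. The function name `percentile` and its integer-only `p` argument are assumptions made for illustration. Library routines such as `statistics.quantiles` use different interpolation conventions and may return slightly different values.

```python
def percentile(values, p):
    """p-th percentile (integer p from 1 to 99) using the index rule i = (p/100) n."""
    data = sorted(values)
    n = len(data)
    # Keep i as a whole part and a remainder so no floating-point rounding is involved
    whole, remainder = divmod(p * n, 100)
    if remainder == 0:
        # i is a whole number: average the values in positions i and i + 1
        return (data[whole - 1] + data[whole]) / 2
    # i is not a whole number: round up to position whole + 1 (zero-based index `whole`)
    return data[whole]

x = [53, 55, 70, 58, 64, 57, 53, 69, 57, 68, 53]
x_without_70 = [v for v in x if v != 70]   # the 10-value distribution

p50 = percentile(x, 50)
p80 = percentile(x, 80)
p30 = percentile(x_without_70, 30)
print(p50, p80, p30)
```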
Return to our original data distribution ...
Quartiles are special cases of percentiles: Q1 = P25, Q2 = P50, Q3 = P75.
These three values divide the distribution into 4 equal quarters.
i(Q1) = (25/100)11 = 2.75...this rounds up to 3, so Q1 is the 3rd value...53
i(Q2) = (50/100)11 = 5.5...this rounds up to 6, so Q2 is the 6th value...57
i(Q3) = (75/100)11 = 8.25...this rounds up to 9, so Q3 is the 9th value...68
Measures of Dispersion or Variability: Range, interquartile range (IQR), variance, standard deviation and coefficient of variation.
Range = This tells us how wide the span is from the maximum value to the minimum value. (Max - Min) = Range. In this instance, the range is 70 - 53 = 17.
Interquartile Range (IQR) = This tells us how wide the span is in the middle 50% of the data. (Q3 - Q1) = IQR. In this case ... 68 - 53 = 15
We will use IQR in later processes, so we will want to keep this value.
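The quartiles, range and IQR can be computed directly from the data with the same round-up index rule. This is a sketch only; the helper `percentile` is a hypothetical name and is repeated here so the block runs on its own.

```python
def percentile(values, p):
    """Index-rule percentile: average two positions if i is whole, otherwise round up."""
    data = sorted(values)
    whole, remainder = divmod(p * len(data), 100)
    if remainder == 0:
        return (data[whole - 1] + data[whole]) / 2
    return data[whole]

x = [53, 55, 70, 58, 64, 57, 53, 69, 57, 68, 53]

q1, q2, q3 = (percentile(x, p) for p in (25, 50, 75))
data_range = max(x) - min(x)   # Max - Min
iqr = q3 - q1                  # width of the middle 50% of the data

print(q1, q2, q3, data_range, iqr)
```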
Sample variance: this is built from the sum of the squared deviations from the mean. A large variance indicates that the values are widely spread, and a small variance indicates that they are close together. We square the deviations so that we don't end up with zero. Let's see how this works.
x      (x - xbar)   (x - xbar)^2
53     -6.73        45.29
53     -6.73        45.29
53     -6.73        45.29
55     -4.73        22.37
57     -2.73        7.45
57     -2.73        7.45
58     -1.73        2.99
64     4.27         18.23
68     8.27         68.39
69     9.27         85.93
70     10.27        105.47
Sum:   657    -0.03 (rounding)    454.18
Mean: 657/11 = 59.73
Sample variance: 454.18/10 ≈ 45.42
We use the formula: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
The sum of squared deviations for these data is 454.18, so the variance is 454.18/10 ≈ 45.42. For our purposes here, the computation of variance is just a step towards the computation of the standard deviation.
Sample standard deviation (s) is the positive square root of the variance.
So the formula for sample standard deviation is $s = \sqrt{s^2} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
For these data, s = √45.42 ≈ 6.74.
Population Variance (σ²) → uses the same sum of squared deviations in the numerator (taken around the population mean μ), but N instead of n - 1 in the denominator: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$. Since we rarely have information about the entire population, we almost always use the formula for sample variance, s².
Population Standard Deviation: $\sigma = \sqrt{\sigma^2}$. Since we rarely have information from the entire population, we use the formula for sample standard deviation, s.
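The variance and standard deviation steps can be reproduced with a short Python sketch. Variable names are illustrative. The comparison with `statistics.variance` is included only as a cross-check of the n - 1 formula.

```python
import math
import statistics

x = [53, 55, 70, 58, 64, 57, 53, 69, 57, 68, 53]
n = len(x)
xbar = sum(x) / n

# Sum of squared deviations from the mean
sum_sq = sum((v - xbar) ** 2 for v in x)

sample_variance = sum_sq / (n - 1)        # s^2, divides by n - 1
sample_sd = math.sqrt(sample_variance)    # s, positive square root
population_variance = sum_sq / n          # sigma^2, only if x were the whole population

print(round(sum_sq, 2), round(sample_variance, 2), round(sample_sd, 2))
```

Dividing by n instead of n - 1 always gives the smaller of the two values, which is why the choice of denominator matters most for small samples like this one.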
Coefficient of Variation: tells us what percent the sample standard deviation is of the sample mean: $CV = \frac{s}{\bar{x}} \times 100\%$
This number is “relative” and is only of use in comparing the distribution of two or more variables.
Suppose I have two samples, and I want to know which sample has more variability…
If both samples have the same mean, the one with the higher standard deviation will have the greater variability. However, if they have different means, I need to calculate the coefficient of variation to determine which one has the most variability. For example, compare xbar = 458, s = 112 versus xbar = 687, s = 192: the coefficients of variation are 112/458 ≈ 24.5% and 192/687 ≈ 27.9%, so the second sample has more relative variability.
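A minimal sketch of the comparison in Python, assuming the two (mean, standard deviation) pairs quoted above. The function name is hypothetical.

```python
def coefficient_of_variation(mean, sd):
    """Standard deviation expressed as a percent of the mean."""
    return sd / mean * 100

cv_first = coefficient_of_variation(458, 112)
cv_second = coefficient_of_variation(687, 192)

# The sample with the larger CV has more variability relative to its mean
more_variable = "second" if cv_second > cv_first else "first"
print(round(cv_first, 2), round(cv_second, 2), more_variable)
```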
Standardized Data and Detecting Outliers
Z-score: $z = \frac{x - \bar{x}}{s}$
The z-score tells us how many standard deviations a value is from the mean. We can look at a picture of what a z-score tells us. In the Normal Curve…the mean is at the highest point and the curve tails off symmetrically in both directions.
The sign of the z-score tells us which direction the value is from the mean on the Normal Curve. Negative values will be to the left, and positive values will be to the right.
Standardizing Scores:
Standard Normal Curve…the mean is zero, and the standard deviation is 1. The distribution is bell-shaped and symmetrical. The area under the curve is 1, and the tails of the curve extend out infinitely. They never actually touch the horizontal axis. The highest point on the curve is at the mean
Return to our data …let’s calculate the z-scores for each of the values…
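The z-scores for the 11 observations can be computed as in the sketch below, which uses the sample mean and the sample standard deviation (n - 1 denominator). Variable names are illustrative.

```python
import statistics

x = sorted([53, 55, 70, 58, 64, 57, 53, 69, 57, 68, 53])

xbar = statistics.mean(x)
s = statistics.stdev(x)   # sample standard deviation

# z = (x - xbar) / s for every observation
z_scores = [(v - xbar) / s for v in x]

for v, z in zip(x, z_scores):
    print(f"{v:>3}  {z:6.2f}")
```

Negative z-scores mark values below the mean and positive z-scores mark values above it.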
Empirical Rule → used when the distribution is assumed or known to be approximately normal.
→ Approximately 68% of the values will fall within 1 sd of the mean
→ Approximately 95% of the values will fall within 2 sd of the mean
→ Approximately 99.7% of the values will fall within 3 sd of the mean
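These percentages can be checked against the standard normal distribution. The sketch below relies on the identity P(|Z| < k) = erf(k/√2) and Python's `math.erf`. The function name is an assumption for illustration.

```python
import math

def share_within_sd(k):
    """Proportion of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(share_within_sd(k), 4))
```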
Chebyshev's Theorem → doesn't require that the data have a normal distribution
Says that at least (1 - 1/z²) of the values will fall within z standard deviations of the mean.
1 - 1/1² = 0, 1 - 1/2² = .75, 1 - 1/3² = .88889, 1 - 1/4² = .9375, 1 - 1/5² = .96
→ We can't make any assumptions about the percent of values that are within 1 sd of the mean
But…
→ At least 75% of the values will fall within 2 sd of the mean
→ At least 88.9% of the values will fall within 3 sd of the mean
We use Chebyshev's Theorem to estimate the variation in a distribution when
→ n < 30, or
→ the shape of the distribution is unknown, or
→ the distribution is assumed to be non-normal.
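A short sketch of the Chebyshev bound, compared with the share of the 11 observations that actually fall within z standard deviations of the mean. Both function names are hypothetical.

```python
import statistics

def chebyshev_bound(z):
    """Minimum proportion of values within z standard deviations of the mean."""
    return 1 - 1 / z ** 2

def share_within(values, z):
    """Observed proportion of values within z sample standard deviations of the mean."""
    xbar = statistics.mean(values)
    s = statistics.stdev(values)
    inside = [v for v in values if abs(v - xbar) <= z * s]
    return len(inside) / len(values)

x = [53, 55, 70, 58, 64, 57, 53, 69, 57, 68, 53]

for z in (2, 3):
    print(z, round(chebyshev_bound(z), 4), share_within(x, z))
```

The bound is a guaranteed minimum, so the observed share can be much higher, as it is for these data.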
Outliers: suspect or extreme values of data that must be identified and scrutinized. If they are instances of incorrectly entered data, they should be corrected. If the value was entered correctly and it is a valid number, it should remain in the dataset as part of the initial analysis.
When we use the z-score method for identifying outliers, we assume that any value that has a z-score with an absolute value greater than 3.0 (that is less than -3.0 or greater than +3.0) is an outlier. Before we proceed with data analysis, we need to examine all outliers for accuracy. If we determine that the value is valid, we often run two sets of analysis. One with the outlier, and one without.
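The z-score screen can be sketched as below. The function name and the extra value 150 are hypothetical: 150 is a made-up observation added purely to show a flagged case and is not part of the original data.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose z-score has an absolute value above the threshold."""
    xbar = statistics.mean(values)
    s = statistics.stdev(values)
    return [v for v in values if abs((v - xbar) / s) > threshold]

x = [53, 55, 70, 58, 64, 57, 53, 69, 57, 68, 53]
flagged = zscore_outliers(x)

# Hypothetical illustration only: append an invented extreme value
flagged_with_extra = zscore_outliers(x + [150])

print(flagged, flagged_with_extra)
```

Any flagged value would then be checked for accuracy before deciding whether to run the analysis both with and without it.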
Another way to identify outliers…
Related to IQR is the Five number summary…minimum, Q1, Q2, Q3, & maximum. These values feed into upper and lower limits, and we graph them in a box plot.
Five Number Summary
Minimum   53
Q1        53
Q2        57
Q3        68
Maximum   70
→ Use the box plot. The advantage of the box plot is that it is not influenced by outliers or extreme values as z-scores are.
Box Plots: whiskers show the range of data within the inner fences.
Lower outer fence = Q1 - 3(IQR)
Lower inner fence = Q1 - 1.5(IQR)
Box = Q1, Median, Q3 (the width of the box is the IQR)
Upper inner fence = Q3 + 1.5(IQR)
Upper outer fence = Q3 + 3(IQR)
Any values between the inner and outer fences are "unusual," and any values out beyond the outer fences are "outliers."
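The fences can be computed from the data as in the sketch below. It derives the quartiles with the round-up index rule used earlier in these notes. The helper names are hypothetical, and treating a value exactly on a fence as inside it is an assumption.

```python
def percentile(values, p):
    """Index-rule percentile: average two positions if i is whole, otherwise round up."""
    data = sorted(values)
    whole, remainder = divmod(p * len(data), 100)
    if remainder == 0:
        return (data[whole - 1] + data[whole]) / 2
    return data[whole]

x = [53, 55, 70, 58, 64, 57, 53, 69, 57, 68, 53]
q1, q3 = percentile(x, 25), percentile(x, 75)
iqr = q3 - q1

lower_inner, upper_inner = q1 - 1.5 * iqr, q3 + 1.5 * iqr
lower_outer, upper_outer = q1 - 3 * iqr, q3 + 3 * iqr

def classify(v):
    """Label a value by where it sits relative to the fences."""
    if v < lower_outer or v > upper_outer:
        return "outlier"
    if v < lower_inner or v > upper_inner:
        return "unusual"
    return "within inner fences"

labels = {v: classify(v) for v in x}
print(lower_outer, lower_inner, upper_inner, upper_outer)
print(labels)
```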
Advantage of using the box plot method as well as the z-score method: the box plot method is not influenced by extreme values in the same way that the mean and the standard deviation are. It is said to be a more conservative method of evaluating outliers.