|
|
|
Probability Distribution |
|
Normal Distribution |
|
Student-t Distribution |
|
Chi Square Distribution |
|
F Distribution |
|
Significance Tests |
|
|
|
|
|
|
Parameters: Numbers that describe a population.
For example, the population mean (μ) and standard deviation (σ). |
|
Statistics: Numbers that are calculated from a
sample. |
|
A given population has only one value of a
particular parameter, but a particular statistic calculated from different
samples of the population has values that are generally different, both
from each other and from the parameter that the statistic is designed to estimate. |
|
The science of statistics is concerned with how
to draw reliable, quantifiable inferences from statistics about parameters. |
|
|
|
|
Random Variable: A variable whose values occur
at random, following a probability distribution. |
|
Observation: When the random variable actually
attains a value, that value is called an observation (of the variable). |
|
Sample: A collection of several observations is
called a sample. If the observations are generated in a random fashion with
no bias, that sample is known as a random sample. |
|
|
|
|
The pattern of probabilities for a set of events
is called a probability distribution. |
|
The probability of each event or combination of
events must range from 0 to 1. |
|
The sum of the probabilities of all possible
events must be equal to 1. |
|
|
|
|
If you
throw a die, there are six possible outcomes: the numbers 1, 2, 3, 4, 5 or
6. This is an example of a random variable (the die value), a variable
whose possible values occur at random. |
|
When the
random variable actually attains a value (such as when the die is actually
thrown), that value is called an observation. |
|
If you
throw the die 10 times, then you have a random sample which consists of 10
observations. |
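The die example can be sketched in a few lines of Python. This is only an illustration: it assumes a fair die (each face has probability 1/6, which the slide does not state explicitly), and the fixed random seed is there only to make the run repeatable.

```python
import random

random.seed(0)  # fixed seed so the illustration is repeatable

faces = [1, 2, 3, 4, 5, 6]

# Probability distribution of a fair die (assumption): each face has probability 1/6.
distribution = {face: 1 / 6 for face in faces}

# One observation: the random variable actually attains a value.
observation = random.choice(faces)

# A random sample: ten independent observations.
sample = [random.choice(faces) for _ in range(10)]

print("sum of probabilities:", round(sum(distribution.values()), 12))
print("observation:", observation)
print("random sample:", sample)
```

Every probability lies between 0 and 1, and together they sum to 1, as required of a probability distribution.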
|
|
|
|
P(a ≤ X ≤ b) = the probability that a randomly selected
value of a variable X falls between a and b. |
|
f(x)
= the probability density function. |
|
The probability density function has to be integrated
over definite limits to obtain a probability: $P(a \le X \le b) = \int_a^b f(x)\,dx$. |
|
The probability for X to have a particular value
is ZERO. |
|
Two important properties of the probability
density function: |
|
(1) f(x) ≥ 0 for all x within the domain of f. |
|
(2) $\int_{-\infty}^{\infty} f(x)\,dx = 1$ |
|
|
|
|
The cumulative distribution function F(x) is
defined as the probability that a variable assumes a value less than or equal to x. |
|
The cumulative distribution function is often
used to assist in calculating probabilities (shown later). |
|
The following relation between F and P is
essential for probability calculation: |
|
$P(a \le X \le b) = F(b) - F(a)$ |
|
|
|
|
|
|
|
|
$f(x) = \dfrac{1}{\sigma \sqrt{2\pi}} \exp\left[ -\dfrac{(x - \mu)^2}{2 \sigma^2} \right]$ |
|
f: probability density function |
|
μ: mean of the population |
|
σ: standard deviation of the population |
|
The normal distribution is one of the most
important distributions in geophysics. Most geophysical variables (such as
wind, temperature, pressure, etc.) are distributed normally about their
means. |
|
|
|
|
|
|
The standard normal distribution has a mean of 0
and a standard deviation of 1. |
|
This probability distribution is particularly
useful as it can represent any normal distribution, whatever its mean and
standard deviation. |
|
Using the following transformation, a normal
distribution of variable X can be converted to the standard normal
distribution of variable Z: |
|
Z = ( X - μ ) / σ |
|
|
|
|
It can be shown that any frequency function can
be transformed into a frequency function of a given form by a suitable
transformation or functional relationship. |
|
For example, if the original data follow some
complicated skewed distribution, we may want to transform this distribution
into a known distribution (such as the normal distribution) whose theory
and properties are well known. |
|
Most geoscience variables are distributed
normally about their mean or can be transformed in such a way that they
become normally distributed. |
|
The normal distribution is, therefore, one of
the most important distributions in geoscience data analysis. |
|
|
|
|
Example 1: What is the probability that a value
of Z is greater than 0.75? |
|
Answer: P(Z ≥ 0.75) = 1 − P(Z ≤ 0.75) = 1 − 0.7734 = 0.2266 |
|
|
|
|
Example 2: What is the probability that Z lies
between the limits Z1=-0.60 and Z2=0.75? |
|
Answer: |
|
P(Z ≤ −0.60) = 1 − P(Z ≤ 0.60) = 1 − 0.7257 = 0.2743 |
|
P(−0.60 ≤ Z ≤ 0.75) = P(Z ≤ 0.75) − P(Z ≤ −0.60) |
|
= 0.7734 − 0.2743 |
|
= 0.4991 |
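The two table look-ups above can be checked numerically. The sketch below evaluates the standard normal cumulative distribution function through the error function in Python's `math` module; the helper name `standard_normal_cdf` is introduced here for illustration only.

```python
import math


def standard_normal_cdf(z):
    """F(z) = P(Z <= z) for the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Example 1: P(Z >= 0.75) = 1 - F(0.75)
p_example1 = 1.0 - standard_normal_cdf(0.75)

# Example 2: P(-0.60 <= Z <= 0.75) = F(0.75) - F(-0.60)
p_example2 = standard_normal_cdf(0.75) - standard_normal_cdf(-0.60)

print(round(p_example1, 4))  # 0.2266
print(round(p_example2, 4))  # 0.4991
```

Both results agree with the values read from the standard normal table.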
|
|
|
|
|
|
|
|
|
|
|
|
This figure shows average temperatures for
January 1989 over the US, expressed as quantiles of the local Normal
distributions. |
|
Different values of μ and σ have been estimated
for each location. |
|
|
|
|
In order to use the normal distribution, we need
to know the mean and standard deviation of the population. |
|
But they are impossible to know in most
geoscience applications, because
most geoscience populations are infinite. |
|
We have to estimate μ and σ from
samples. |
|
|
|
|
Is the sample mean close to the population mean? |
|
To answer this question, we need to know the
probability distribution of the sample mean. |
|
We can obtain the distribution by repeatedly
drawing samples from a population and finding the frequency distribution of their means. |
|
|
|
|
If we repeatedly draw N observations from a
standard normal distribution (μ = 0 and σ = 1) and calculate the sample mean, what is
the distribution of the sample mean? |
|
|
|
|
What did we learn from this example? |
|
|
|
⇒ If a sample is composed of N random
observations from a normal distribution with mean μ and standard deviation σ, then the
distribution of the sample average will be a normal distribution with mean μ but standard
deviation σ/sqrt(N). |
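A small simulation can illustrate this result. The sketch below repeatedly draws N observations from a standard normal distribution and looks at the spread of the resulting sample means; the values of `N` and `n_samples` and the random seed are arbitrary choices made for this illustration.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the illustration is repeatable

N = 25             # observations per sample (illustrative choice)
n_samples = 20000  # number of repeated samples (illustrative choice)

sample_means = []
for _ in range(n_samples):
    sample = [random.gauss(0.0, 1.0) for _ in range(N)]  # mu = 0, sigma = 1
    sample_means.append(statistics.fmean(sample))

mean_of_means = statistics.fmean(sample_means)
std_of_means = statistics.stdev(sample_means)
expected_std = 1.0 / math.sqrt(N)  # sigma / sqrt(N)

print("mean of the sample means:", round(mean_of_means, 3))
print("std of the sample means: ", round(std_of_means, 3))
print("sigma / sqrt(N):         ", round(expected_std, 3))
```

The mean of the sample means should be close to 0, and their standard deviation close to 1/sqrt(25) = 0.2.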
|
|
|
|
The standard deviation of the probability
distribution of the sample mean (x̄) is also referred to as the “standard
error” of x̄. |
|
|
|
|
One example of how the normal distribution can
be used for data that are “non-normally” distributed is to determine the
distribution of sample means. |
|
|
|
|
If we repeatedly draw a sample of N values from
a population and calculate the mean of that sample, the distribution of
sample means tends toward a normal distribution. |
|
|
|
|
In the limit, as the sample size becomes large,
the sum (or the mean) of a set of independent measurements will have a
normal distribution, irrespective of the distribution of the raw data. |
|
If a population has a mean of μ and a
standard deviation of σ, then the distribution of its sample means
will have: |
|
a mean $\mu_{\bar{x}} = \mu$, and |
|
a standard deviation $\sigma_{\bar{x}} = \sigma / \sqrt{N}$, the
standard error of the mean |
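The theorem can be illustrated with raw data that are clearly not normal. The sketch below uses an exponential distribution with mean 1 and standard deviation 1 as the population; that choice, the sample size `N`, the number of repetitions, and the seed are all assumptions made for this example.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the illustration is repeatable

N = 50             # sample size (illustrative choice)
n_samples = 10000  # number of repeated samples (illustrative choice)
mu, sigma = 1.0, 1.0  # mean and std of an exponential population with rate 1

means = []
for _ in range(n_samples):
    sample = [random.expovariate(1.0) for _ in range(N)]  # strongly skewed raw data
    means.append(statistics.fmean(sample))

mean_of_means = statistics.fmean(means)
std_of_means = statistics.stdev(means)
standard_error = sigma / math.sqrt(N)

# For a normal distribution about 68% of values lie within one standard deviation.
fraction_within = sum(abs(m - mu) <= standard_error for m in means) / n_samples

print("mean of sample means:", round(mean_of_means, 3), "(population mean 1.0)")
print("std of sample means: ", round(std_of_means, 3), "(sigma/sqrt(N) =", round(standard_error, 3), ")")
print("fraction within one standard error:", round(fraction_within, 3))
```

Even though the raw data are skewed, the sample means cluster around μ with a spread near σ/sqrt(N), and roughly 68% of them fall within one standard error, as expected for a normal distribution.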
|
|
|
|
The bigger the sample size N, the closer
the estimated mean and standard deviation are to the true values. |
|
How big should the sample size be for the
estimated mean and standard deviation to be reasonable? |
|
⇒ N > 30 for s to approach σ (μ is approached even with small N) |
|
|
|
|
The distribution of sample means can be
transformed into the standard normal distribution with the following
transformation: |
|
$Z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{N}}$ |
|
|
|
|
|
|
|
The confidence level of the sample mean can
therefore be estimated based on the standard normal distribution and the
transformation. |
|
|
|
|
With this theorem, statisticians can make
reasonable inferences about the sample mean without having to know the
underlying probability distribution. |
|
|
|
|
When the sample size is smaller than about 30
(N < 30), we cannot use the normal distribution to describe the
variability in the sample means. |
|
With small sample sizes, we cannot assume that
the sample-estimated standard deviation (s) is a good approximation to the
true standard deviation of the population (σ). |
|
The quantity $(\bar{x} - \mu)/(s/\sqrt{N})$ no longer follows the standard normal distribution. |
|
Instead, the probability distribution of the
small sample means follows the “Student’s t distribution”. |
|
|
|
|
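For the variable t |
|
$t = \dfrac{\bar{x} - \mu}{s / \sqrt{N}}$ |
|
The probability density function for the t-distribution
is: |
|
$f(t) = K(\nu) \left( 1 + \dfrac{t^2}{\nu} \right)^{-(\nu + 1)/2}$ |
|
Here K(ν) is a constant which depends on the number
of degrees of freedom (ν = N − 1). K(ν) is chosen so that: |
|
$\int_{-\infty}^{\infty} f(t)\,dt = 1$ |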
|
|
|
|
|
|
|
|
|
|
Since the Student’s t distribution approaches
the normal distribution for large N, there is no reason to use the normal distribution in preference to
Student’s t. |
|
The Student’s t distribution is the most
commonly used distribution in meteorology, and perhaps in all of applied statistics. |
|
|
|
|
|
|
The number of degrees of freedom |
|
= (the number of original observations) − (the number of parameters estimated
from the observations) |
|
|
|
For example, the number of degrees of freedom for
a variance is N − 1, where N is the number of observations. |
|
|
|
|
|
|
|
It is because we need to estimate the mean (a
parameter) from the observations, in order to calculate the variance. |
|
|
|
|
We have four observations: [4, 1, 8, 3] |
|
|
|
To calculate the variance, we first get the mean
= (4+1+8+3)/4=4 |
|
|
|
We then calculate the deviation of each
observation from the mean and get d = [4-4, 1-4, 8-4, 3-4] = [0, -3, 4,
-1]. |
|
|
|
Now these four d’s are constrained by the
requirement: |
|
d1 + d2 + d3 + d4 = 0 (a result related to the mean). |
|
|
|
The variance we get (variance =
(d1**2+d2**2+d3**2+d4**2)/3) has only 3 degrees of freedom (4 observations
minus one estimate of the parameter). |
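The same four observations can be worked through in code. The sketch below divides the sum of squared deviations by the number of degrees of freedom (N − 1), which is the usual sample-variance convention and the one assumed here; it then compares the result with `statistics.variance` from the Python standard library.

```python
import statistics

observations = [4, 1, 8, 3]

mean = sum(observations) / len(observations)    # (4 + 1 + 8 + 3) / 4
deviations = [x - mean for x in observations]   # d1 .. d4

# The deviations are constrained: they always sum to zero,
# so only N - 1 of them are free to vary.
sum_of_deviations = sum(deviations)

degrees_of_freedom = len(observations) - 1      # one parameter (the mean) was estimated
variance = sum(d ** 2 for d in deviations) / degrees_of_freedom

print("mean:", mean)
print("deviations:", deviations)
print("sum of deviations:", sum_of_deviations)
print("degrees of freedom:", degrees_of_freedom)
print("variance:", round(variance, 4))
print("statistics.variance:", round(statistics.variance(observations), 4))
```

Knowing any three of the deviations fixes the fourth, which is why the variance has 3 degrees of freedom rather than 4.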
|
|
|
|
Two of the main tools of statistical inference
are: |
|
Confidence Intervals |
|
Within what interval (or limits) does X% of the population lie in
the distribution? |
|
Hypothesis Tests |
|
You
formulate a theory (hypothesis) about the phenomenon you are studying and
examine whether the theory is supported by the statistical evidence. |
|
|
|
|
Instead of asking “what is the probability that
Z falls within limits a and b in the normal distribution”, it is more
important to ask “Within what interval or limits does X% of the population
lie in the normal distribution”. |
|
The X% is referred to as the “confidence level”. |
|
The interval corresponding to this level is called
the “confidence interval” or “confidence limits”. |
|
|
|
|
|
|
|
|
Choose a confidence level. |
|
Determine the Z-values from the table of the standard
normal distribution. (Note that Z1 = −Z2.) |
|
Transform the Z values back to X values using Z = (X − μ) / σ. |
|
The confidence intervals
are determined. |
|
|
|
|
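The probability that a sample mean, $\bar{x}$, lies
between $\mu - \sigma_{\bar{x}}$ and $\mu + \sigma_{\bar{x}}$ may be written: |
|
$P(\mu - \sigma_{\bar{x}} \le \bar{x} \le \mu + \sigma_{\bar{x}}) = 0.68$ |
|
However, it is more important to find the
confidence interval for the population mean μ. The previous relation
can be written as: |
|
$P(\bar{x} - \sigma_{\bar{x}} \le \mu \le \bar{x} + \sigma_{\bar{x}}) = 0.68$ |
|
But we don’t know the true value of σ (the
standard deviation of the population). If the sample size is large enough
(N ≥ 30), we can use s (the standard deviation estimated from the sample) to
approximate σ: |
|
$\sigma_{\bar{x}} = \sigma / \sqrt{N} \approx s / \sqrt{N}$ |
|
The 68% confidence interval for μ is: |
|
$\bar{x} - s / \sqrt{N} \le \mu \le \bar{x} + s / \sqrt{N}$ |
|
To generalize the relation to any confidence
level X%, the previous relation can be rewritten as: |
|
$\bar{x} - Z_{X\%}\, s / \sqrt{N} \le \mu \le \bar{x} + Z_{X\%}\, s / \sqrt{N}$ |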
|
|
|
|
|
|
|
|
|
|
In an experiment, forty measurements of air
temperature were made. The mean and standard deviation of the sample are: |
|
x̄ = 18.41°C and
s = 0.6283°C. |
|
|
|
Question 1: |
|
Calculate the 95% confidence interval for the population mean for
these data. |
|
|
|
Answer: |
|
From the previous slide, we know the interval is: x̄ ± Z95% × s / √N |
|
Z95% = 1.96 |
|
Z95% × s / √N = 1.96 × 0.6283°C / √40 = 0.1947°C |
|
The 95% confidence interval for the population mean: 18.22°C ~ 18.60°C. |
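The arithmetic in this answer can be reproduced directly. The variable names below are chosen for this sketch; the numbers are the ones given in the example.

```python
import math

n = 40          # number of measurements
x_bar = 18.41   # sample mean, degrees C
s = 0.6283      # sample standard deviation, degrees C
z_95 = 1.96     # Z value for the 95% confidence level

half_width = z_95 * s / math.sqrt(n)
lower = x_bar - half_width
upper = x_bar + half_width

print("half width:", round(half_width, 4))                    # 0.1947
print("95% interval:", round(lower, 2), "to", round(upper, 2))  # 18.22 to 18.6
```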
|
|
|
|
In an experiment, forty measurements of air
temperature were made. The mean and standard deviation of the sample are: |
|
x̄ = 18.41°C and
s = 0.6283°C. |
|
|
|
Question 2: |
|
How many measurements would be required to reduce the 95% confidence interval
for the population mean to (x̄ − 0.1)°C to (x̄ + 0.1)°C? |
|
|
|
Answer: |
|
We want to have Z95% × s / √N = 0.1°C |
|
We already know Z95% = 1.96 and s = 0.6283°C |
|
⇒ N = (1.96 × 0.6283 / 0.1)² = 152 |
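Solving the half-width relation for N can be sketched as follows. Rounding up with `math.ceil` is a choice made here, on the reasoning that N must be a whole number large enough to meet the target.

```python
import math

z_95 = 1.96
s = 0.6283               # sample standard deviation, degrees C
target_half_width = 0.1  # desired half width of the interval, degrees C

# From z * s / sqrt(N) = target, N = (z * s / target)^2
n_exact = (z_95 * s / target_half_width) ** 2
n_required = math.ceil(n_exact)

print("exact N:", round(n_exact, 2))
print("measurements required:", n_required)  # 152
```

Because the half width shrinks only as 1/sqrt(N), roughly halving it from 0.1947°C to 0.1°C takes almost four times as many measurements.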
|
|
|
|
|
|
A typical use of confidence intervals is to
construct error bars around plotted sample statistics in a graphical
display. |
|
|
|
|
The term “95% confident” means that we are
confident our procedure will capture the value of μ 95% of the times it
(the procedure) is used. |
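This interpretation can be illustrated by simulation: apply the interval procedure to many independent samples and count how often the interval contains μ. The population values, sample size, number of trials, and seed below are hypothetical choices made only for this sketch.

```python
import math
import random
import statistics

random.seed(3)  # fixed seed so the illustration is repeatable

mu, sigma = 10.0, 2.0  # hypothetical population parameters
N = 40                 # sample size (large enough to use s in place of sigma)
trials = 5000          # number of times the procedure is repeated
z_95 = 1.96

hits = 0
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(N)]
    x_bar = statistics.fmean(sample)
    s = statistics.stdev(sample)
    half_width = z_95 * s / math.sqrt(N)
    if x_bar - half_width <= mu <= x_bar + half_width:
        hits += 1  # this interval captured the true mean

coverage = hits / trials
print("fraction of intervals containing mu:", round(coverage, 3))
```

The fraction should come out close to 0.95. Any single interval either contains μ or does not; the 95% describes how often the procedure succeeds over many uses.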
|
|
|
|
Hypothesis testing involves comparing a
hypothesized population parameter with the corresponding number (or
statistic) determined from sample data. |
|
Hypothesis testing is used to construct
confidence intervals around sample statistics. |
|
The above table lists the population parameters
that are often tested between the population and samples. |
|
|
|
|
Parametric Tests: conducted in situations
where one knows or assumes that a particular theoretical distribution
(e.g., Normal distribution) is an appropriate representation for the data
and/or the test statistic. |
|
⇒ In these tests, inferences are made about
the particular distribution parameters. |
|
|
|
Nonparametric Tests: conducted without the
need to assume what theoretical distribution pertains to the
data. |
|
|
|
|
Five Basic Steps: |
|
1. State the null hypothesis and its alternative |
|
2. State the statistic used |
|
3. State the significance level |
|
4. State the critical region |
|
(i.e., identify the sampling distribution of the test statistic) |
|
5. Evaluate the statistic and state the conclusion |
|
|
|
|
Usually, the null hypothesis and its alternative
are mutually exclusive. For example: |
|
|
|
H0: The means of two samples are equal. |
|
H1: The means of two samples are not equal. |
|
|
|
H0: The variance at a period of 5 days is less than or equal to C. |
|
H1: The variance at a period of 5 days is greater than C. |
|
|
|
|
|
|
|
|
|
|
A test statistic is the quantity computed from
the sample that will be the subject of the test. |
|
In parametric tests, the test statistic will
often be the sample estimate of a parameter of a relevant theoretical
distribution. |
|
|
|
|
To examine the significance of the difference
between means obtained from two different samples, we use the “two-sample t-statistic”: |
|
|
|
|
|
|
|
|
|
|
Confidence Level: X% |
|
Significance Level: α |
|
|
|
|
At the 5% significance level, there is a one in
twenty chance of rejecting the hypothesis wrongly (i.e., you reject the
hypothesis but it is true). |
|
|
|
|
A statistic is calculated from a batch of data. |
|
The value of the statistic varies from batch to
batch. |
|
A sampling distribution for a statistic is the
probability distribution describing batch-to-batch variations of that
statistic. |
|
The random variations of sample statistics can
be described using probability distributions just as the random variations
of the underlying data can be described using probability distributions. |
|
Sample statistics can be viewed as having been
drawn from probability distributions. |
|
|
|
|
|
|
The p value is the probability, according to the null
distribution (the probability distribution based on the null hypothesis), that
the test statistic takes a value at least as extreme as the one observed. |
|
|
|
If the observed test statistic falls within the critical region
(equivalently, the p value is no larger than the significance level) |
|
⇒ Reject
the null hypothesis. |
|
|
|
The p value depends on the alternative
hypothesis. |
|
|
|
|
H0: μ = 50 |
|
H1: μ ≠ 50 |
|
|
|
If a sample of N = 25 has a sample mean of 45
and a sample standard deviation s of 15, then |
|
|
|
Z = (45 − 50) / (15/5) = −1.67 |
|
P(Z < −1.67) = 0.0478 |
|
p value = 2 × 0.0478 = 0.0956 |
|
|
|
H0: μ = 50 |
|
H1: μ < 50 |
|
|
|
Z = (45 − 50) / (15/5) = −1.67 |
|
P(Z < −1.67) = 0.0478 |
|
p value = 0.0478 |
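Both p values can be computed from the same Z statistic. In the sketch below the helper `standard_normal_cdf` is a name introduced for illustration; it evaluates the standard normal cumulative distribution with `math.erf` instead of a printed table.

```python
import math


def standard_normal_cdf(z):
    """P(Z <= z) for the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


n = 25         # sample size
x_bar = 45.0   # sample mean
s = 15.0       # sample standard deviation
mu_0 = 50.0    # mean under the null hypothesis

z = (x_bar - mu_0) / (s / math.sqrt(n))

# H1: mu < 50 (one-sided): only the lower tail counts as "extreme".
p_one_sided = standard_normal_cdf(z)

# H1: mu != 50 (two-sided): both tails count, so the tail area is doubled.
p_two_sided = 2.0 * standard_normal_cdf(-abs(z))

print("Z =", round(z, 2))                            # -1.67
print("one-sided p value:", round(p_one_sided, 4))  # 0.0478
print("two-sided p value:", round(p_two_sided, 4))  # 0.0956
```

At the 5% significance level the same data reject H0 against the one-sided alternative but not against the two-sided one, which is why the p value depends on the alternative hypothesis.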
|
|
|
|
|
|
Type I Error: You reject a null hypothesis that
is true. |
|
Type II Error: You fail to reject a null
hypothesis that is false. |
|
By choosing a significance level of, say, α = 0.05, we
limit the probability of a type I error to 0.05. |
|
|
|
|
Two-Sided Tests |
|
H0: mean temperature = 20°C |
|
H1: mean temperature ≠ 20°C |
|
|
|
|
|
|
|
|
We want to compare the averages from two
independent samples to determine whether a significant difference exists between the
samples. |
|
|
|
For Example: |
|
* One sample contains the cholesterol data on patients taking a standard
drug, while the second sample contains cholesterol data on patients taking an
experimental drug. You would test to see whether there is a statistically
significant difference between the two sample averages. |
|
|
|
* Compare the average July temperature at a location produced in a climate
model under a doubling versus no doubling of CO2 concentration. |
|
|
|
* Compare the average winter 500-mb height when one or the other of two
synoptic weather regimes had prevailed. |
|
|
|
|
For samples that come from distributions with
different standard deviations, having values of σ1 and σ2,
we use the “unpooled two-sample t-statistic”: |
|
|
|
|
|
|
|
|
|
If both distributions have the same standard
deviation, then we can “pool” the estimates of the standard deviation from
the two samples into a single estimate: |
|
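The formulas on the original slides were not preserved, so the sketch below uses common textbook forms of the two statistics, which may differ in detail from the ones the lecture intended. It assumes the unpooled statistic divides the difference of means by sqrt(s1²/N1 + s2²/N2), and the pooled statistic uses the variance estimate ((N1−1)s1² + (N2−1)s2²)/(N1+N2−2). The function names and the two temperature samples are hypothetical.

```python
import math
import statistics


def unpooled_t(a, b):
    """Two-sample t statistic without assuming equal standard deviations."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.fmean(a) - statistics.fmean(b)) / standard_error


def pooled_t(a, b):
    """Two-sample t statistic assuming both populations share one standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_variance = (
        (n_a - 1) * statistics.variance(a) + (n_b - 1) * statistics.variance(b)
    ) / (n_a + n_b - 2)
    standard_error = math.sqrt(pooled_variance * (1.0 / n_a + 1.0 / n_b))
    return (statistics.fmean(a) - statistics.fmean(b)) / standard_error


# Hypothetical temperature samples (degrees C), made up for this illustration.
sample_1 = [20.1, 19.8, 21.3, 20.7, 19.5, 20.9]
sample_2 = [18.9, 19.4, 18.2, 19.9, 18.7, 19.1, 18.5, 19.6]

t_unpooled = unpooled_t(sample_1, sample_2)
t_pooled = pooled_t(sample_1, sample_2)

print("unpooled t:", round(t_unpooled, 3))
print("pooled t:  ", round(t_pooled, 3))
```

With these forms the two statistics coincide when both samples have the same size; they differ when the sizes and the sample variances differ.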
|
|
|
Paired Data: where observations come in natural
pairs. |
|
For example: |
|
A doctor might measure the effect of a drug by measuring
the physiological state of patients before and after applying
the drug. Each patient in this study has two observations, and
the observations are paired with each other. To determine the
drug’s effectiveness, the doctor looks at the difference
between the before and the after readings. |
|
|
|
The paired-data test is a one-sample t-test |
|
To test whether or not there is a significant difference, we use a
one-sample t-test, because we are essentially looking at
one sample of data: the sample of paired differences. |
|
|
|
|
|
|
There is a data set that contains the percentage
of women in the work force in 1968 and 1972 from a sample of 19 cities.
There are two observations from each city, and the observations constitute
paired data. You are asked to determine whether this sample demonstrates a
statistically significant increase in the percentage of women in the work
force. |
|
Answer: |
|
1. H0: μ = 0 (There is no change in the percentage) |
|
H1: μ ≠ 0 (There is some change, but we are not assuming the
direction of change) |
|
2. 95% t-test with N = 19 ⇒ We can determine t0.025 and t0.975 |
|
3. S = 0.05974 ⇒ σ = S/√19 = 0.01371 |
|
4. t statistic t = (0.0337 − 0) / σ = 2.458 > t0.975 |
|
5. We reject the null hypothesis ⇒ There has been a
significant change
in women’s participation in the work force in those 4 years. |
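The steps of this answer can be reproduced from the summary numbers. The mean difference (0.0337), S, and N come from the example; the critical value 2.101 for 18 degrees of freedom is taken from a standard t table and is not stated on the slide, so treat it as an assumption of this sketch.

```python
import math

n = 19                    # number of cities (paired differences)
mean_difference = 0.0337  # mean of the 1972 minus 1968 differences
s_difference = 0.05974    # standard deviation of the differences
mu_0 = 0.0                # difference under the null hypothesis

standard_error = s_difference / math.sqrt(n)
t_statistic = (mean_difference - mu_0) / standard_error

degrees_of_freedom = n - 1
t_critical = 2.101  # t_0.975 for 18 degrees of freedom, from a t table (assumption)

reject_null = abs(t_statistic) > t_critical

print("standard error:", round(standard_error, 5))  # 0.01371
print("t statistic:", round(t_statistic, 2))         # 2.46
print("reject H0 at the 5% level:", reject_null)     # True
```

The small difference from the 2.458 quoted above comes from rounding the standard error to 0.01371 before dividing.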
|
|
|
|
|
|
|
|
Standard deviation is another important
population parameter to be tested. |
|
The sample variances have a Chi-Squared (χ²)
distribution, with the following definition: |
|
$\chi^2 = \dfrac{(N - 1)\, S^2}{\sigma^2}$ |
|
Here N: sample size |
|
S: the sample standard deviation |
|
σ: the true standard deviation |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
We want to know if the standard deviations
estimated from two samples are significantly different. |
|
The statistical test used to compare the
variability of two samples is the F-test: |
|
$F = S_1^2 / S_2^2$ |
|
S1²: the estimate of the population variance from sample 1 |
|
S2²: the estimate of the population variance from sample 2. |
|
|
|
|
Here: |
|
X = S1²/S2² |
|
ν1 = degrees of freedom of sample 1 |
|
ν2 = degrees of freedom of sample 2 |
|
K(ν1, ν2) = a constant that makes the
total area under the
f(x) curve equal to 1 |
|
|
|
|
The F-distribution is non-symmetric. |
|
There are two critical F values of unequal
magnitude. |
|
Whether to use the upper or the lower critical
value depends on the relative magnitudes of S1 and S2: |
|
If S1 > S2, then use the upper limit FU to reject the
null hypothesis |
|
If S1 < S2, then use the lower limit FL to reject
the null hypothesis |
|
Or you can always arrange for S1 > S2
when F is defined. In this case, you can always use the upper limit to
examine the hypothesis. |
|
|
|
|
|
|
|
|
There are two samples, each with six measurements
of wind speed. The first sample has a wind speed variance of 10
(m/sec)². The second sample has a variance of 6 (m/sec)².
Are these two variances significantly different? |
|
|
|
Answer: (1) Select a 95% confidence level (α = 0.05) |
|
(2) H0: σ1 = σ2 |
|
H1: σ1 ≠ σ2 |
|
(3) Use the F-test |
|
F = 10/6 = 1.67 |
|
(4) This is a two-tailed test, so choose the 0.025 significance
areas. |
|
For ν1 = ν2 = 6 − 1 = 5, F0.975 = 7.15. |
|
(5) Since F < F0.975, the null hypothesis cannot be
rejected. |
|
(6) The variances from these two samples are not significantly different. |
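The decision in this example can be sketched in code. The critical value 7.15 is the table value quoted in the answer; placing the larger variance in the numerator follows the convention described earlier, so that only the upper critical value is needed.

```python
variance_1 = 10.0  # (m/sec)^2, first sample
variance_2 = 6.0   # (m/sec)^2, second sample
n_1 = n_2 = 6      # measurements per sample

# Put the larger variance on top so that F >= 1 and only the upper limit is needed.
f_statistic = max(variance_1, variance_2) / min(variance_1, variance_2)

dof_1 = n_1 - 1
dof_2 = n_2 - 1
f_critical = 7.15  # F_0.975 for (5, 5) degrees of freedom, as quoted in the example

reject_null = f_statistic > f_critical

print("F =", round(f_statistic, 2))               # 1.67
print("degrees of freedom:", dof_1, dof_2)        # 5 5
print("reject H0 at the 5% level:", reject_null)  # False
```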
|