Part 1: Probability Distributions

Probability Distribution
Normal Distribution
Student-t Distribution
F Distribution
Significance Tests
Probability Density Function

P(a ≤ X ≤ b) = ∫ f(x) dx   (integrated from a to b)

P = the probability that a randomly selected value of a variable X falls between a and b.

f(x) = the probability density function.

The probability density function has to be integrated between distinct limits to obtain a probability.

The probability that X takes on any one particular value is ZERO.

Two important properties of the probability density function:

(1) f(x) ≥ 0 for all x within the domain of f.

(2) ∫ f(x) dx = 1   (integrated over the whole domain of f)
Cumulative Distribution Function

The cumulative distribution function F(x) is defined as the probability that a variable assumes a value less than x.

The cumulative distribution function is often used to assist in calculating probability (shown later).

The following relation between F and P is essential for probability calculations:

P(a ≤ X ≤ b) = F(b) - F(a)
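As a numerical illustration (not part of the original slides), the following Python sketch integrates a density between two limits and compares the result with the difference of the cumulative distribution function, F(b) - F(a); the standard normal density from scipy is used purely as an example.

    from scipy.stats import norm
    from scipy.integrate import quad

    a, b = -0.60, 0.75

    # Probability by integrating the density f(x) between the limits a and b
    p_integral, _ = quad(norm.pdf, a, b)

    # The same probability from the cumulative distribution function: F(b) - F(a)
    p_cdf = norm.cdf(b) - norm.cdf(a)

    print(p_integral, p_cdf)   # both are ~0.499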
Normal Distribution

f(x) = [1 / (σ √(2π))] exp[ -(x - μ)² / (2σ²) ]

f(x): probability density function

μ: mean of the population

σ: standard deviation of the population

The normal distribution is one of the most important distributions in geophysics. Most geophysical variables (such as wind, temperature, pressure, etc.) are distributed normally about their means.
Standard Normal Distribution

The standard normal distribution has a mean of 0 and a standard deviation of 1.

This probability distribution is particularly useful as it can represent any normal distribution, whatever its mean and standard deviation.

Using the following transformation, a normal distribution of the variable X can be converted to the standard normal distribution of the variable Z:

Z = (X - μ) / σ
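A one-line illustration of this transformation, assuming a hypothetical variable with μ = 20 and σ = 5:

    mu, sigma = 20.0, 5.0     # hypothetical population mean and standard deviation
    x = 23.5                  # a hypothetical observed value of X
    z = (x - mu) / sigma      # the corresponding value of Z on the standard normal scale
    print(z)                  # 0.7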
How to Use the Standard Normal Distribution

Example 1: What is the probability that a value of Z is greater than 0.75?

Answer: P(Z ≥ 0.75) = 1 - P(Z ≤ 0.75) = 1 - 0.7734 = 0.2266
Another Example

Example 2: What is the probability that Z lies between the limits Z1 = -0.60 and Z2 = 0.75?

Answer:

P(Z ≤ -|Z1|) = 1 - P(Z ≤ |Z1|)
→ P(Z ≤ -0.60) = 1 - 0.7257 = 0.2743

P(-0.60 ≤ Z ≤ 0.75) = P(Z ≤ 0.75) - P(Z ≤ -0.60)
                    = 0.7734 - 0.2743
                    = 0.4991
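Both examples can be verified with the cumulative distribution function of the standard normal (a quick Python check, not part of the original slides):

    from scipy.stats import norm

    # Example 1: P(Z >= 0.75) = 1 - P(Z <= 0.75)
    print(1 - norm.cdf(0.75))                  # ~0.2266

    # Example 2: P(-0.60 <= Z <= 0.75) = P(Z <= 0.75) - P(Z <= -0.60)
    print(norm.cdf(0.75) - norm.cdf(-0.60))    # ~0.4991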
Probability of Normal Distribution

Confidence Intervals (or Limits)

Instead of asking "what is the probability that Z falls between the limits a and b in the normal distribution", it is often more important to ask "within what interval or limits does X% of the population lie in the normal distribution".

The X% is referred to as the "confidence level".

The interval corresponding to this level is called the "confidence interval" or "confidence limits".
How to Determine Confidence Intervals

1. Choose a confidence level.

2. Determine the Z values from the table of the standard normal distribution. (Note that Z1 = -Z2.)

3. Transform the Z values back to X values using Z = (X - μ) / σ.

4. The confidence intervals are determined. (A short sketch of these steps follows below.)
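A Python sketch of these four steps, assuming a hypothetical normal variable with μ = 15, σ = 3, and a 95% confidence level (norm.ppf plays the role of the standard normal table):

    from scipy.stats import norm

    conf = 0.95                     # step 1: choose a confidence level
    mu, sigma = 15.0, 3.0           # hypothetical population mean and standard deviation

    z2 = norm.ppf(0.5 + conf / 2)   # step 2: Z2 from the standard normal distribution
    z1 = -z2                        #         (note that Z1 = -Z2)

    x1 = mu + z1 * sigma            # step 3: transform Z back to X using Z = (X - mu) / sigma
    x2 = mu + z2 * sigma

    print(x1, x2)                   # step 4: the 95% confidence interval, ~9.12 to ~20.88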
How to Estimate μ and σ

In order to use the normal distribution, we need to know the mean and standard deviation of the population.

But they are impossible to know in most geoscience applications, because most geoscience populations are infinite.

We have to estimate μ and σ from samples.
Sample Size

The larger the sample size N, the closer the estimated mean and standard deviation are to the true values.

How big should the sample size be for the estimated mean and standard deviation to be reasonable?

→ N > 30 for s to approach σ (the sample mean approaches μ even with a small N).
Transformations

It can be shown that any frequency function can be transformed into a frequency function of a given form by a suitable transformation or functional relationship.

For example, if the original data follow some complicated skewed distribution, we may want to transform this distribution into a known distribution (such as the normal distribution) whose theory and properties are well known.

Most geoscience variables are distributed normally about their mean or can be transformed in such a way that they become normally distributed.

The normal distribution is, therefore, one of the most important distributions in geoscience data analysis.
Distribution of Sample Means

One example of how the normal distribution can be used for data that are "non-normally" distributed is to determine the distribution of sample means.

Is the sample mean close to the population mean?

To answer this question, we need to know the probability distribution of the sample mean.

We can obtain this distribution by repeatedly drawing samples from a population and finding the frequency distribution of the resulting sample means.
An Example of Normal Distribution

If we repeatedly draw a sample of N values from a population and calculate the mean of each sample, the distribution of the sample means tends toward a normal distribution.
Central Limit Theorem

In the limit, as the sample size becomes large, the sum (or the mean) of a set of independent measurements will have a normal distribution, irrespective of the distribution of the raw data.

If a population has a mean of μ and a standard deviation of σ, then the distribution of its sample means will have:

a mean μ_x̄ = μ, and

a standard deviation σ_x̄ = σ / N^(1/2) = the standard error of the mean
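A brief numerical illustration of the theorem (hypothetical setup, not from the slides): sample means drawn from a skewed exponential population are nearly normally distributed, with a standard deviation close to σ / N^(1/2).

    import numpy as np

    rng = np.random.default_rng(0)
    N, trials = 50, 10000

    # Draw many samples of size N from a skewed (exponential) population with mean 1, std 1
    samples = rng.exponential(scale=1.0, size=(trials, N))
    means = samples.mean(axis=1)

    print(means.mean())   # ~1.0, the population mean
    print(means.std())    # ~0.14, close to sigma / N**0.5 = 1 / 50**0.5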
Normal Distribution of Sample Means

The distribution of sample means can be transformed into the standard normal distribution with the following transformation:

Z = (x̄ - μ) / (σ / N^(1/2))

The confidence level of the sample mean can therefore be estimated based on the standard normal distribution and this transformation.
Confidence Level for Sample Means

The probability that a sample mean, x̄, lies between μ - σ_x̄ and μ + σ_x̄ may be written:

P( μ - σ_x̄ ≤ x̄ ≤ μ + σ_x̄ ) = 0.68

However, it is more important to find out the confidence interval for the population mean μ. The previous relation can be rewritten as:

P( x̄ - σ_x̄ ≤ μ ≤ x̄ + σ_x̄ ) = 0.68

But we do not know the true value of σ (the standard deviation of the population). If the sample size is large enough (N ≥ 30), we can use s (the standard deviation estimated from the sample) to approximate σ:

σ_x̄ ≈ s / N^(1/2)
Confidence Level for Sample Means

The 68% confidence interval for μ is:

x̄ - s / N^(1/2)  ≤  μ  ≤  x̄ + s / N^(1/2)

To generalize the relation to any confidence level X%, the previous relation can be rewritten as:

x̄ - Z_X% * s / N^(1/2)  ≤  μ  ≤  x̄ + Z_X% * s / N^(1/2)
An Example

In an experiment, forty measurements of air temperature were made. The mean and standard deviation of the sample are:

x̄ = 18.41°C and s = 0.6283°C.

Question 1:

Calculate the 95% confidence interval for the population mean for these data.

Answer:

From the previous slide, we know the interval is: x̄ ± Z95% * s / N^0.5

Z95% = 1.96

Z95% * s / N^0.5 = 1.96 * 0.6283°C / 40^0.5 = 0.1947°C

The 95% confidence interval for the population mean: 18.22°C ~ 18.60°C.
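This calculation can be checked numerically; a minimal sketch:

    import numpy as np
    from scipy.stats import norm

    xbar, s, N = 18.41, 0.6283, 40
    z95 = norm.ppf(0.975)                          # ~1.96
    half_width = z95 * s / np.sqrt(N)              # ~0.1947 degC
    print(xbar - half_width, xbar + half_width)    # ~18.22 to ~18.60 degC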
An Example – cont.

In an experiment, forty measurements of air temperature were made. The mean and standard deviation of the sample are:

x̄ = 18.41°C and s = 0.6283°C.

Question 2:

How many measurements would be required to reduce the 95% confidence interval for the population mean to (x̄ - 0.1)°C to (x̄ + 0.1)°C?

Answer:

We want to have Z95% * s / N^0.5 = 0.1°C

We already know Z95% = 1.96 and s = 0.6283°C

→ N = (1.96 × 0.6283 / 0.1)² ≈ 152
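The same arithmetic in a short sketch:

    import math

    z95, s, target = 1.96, 0.6283, 0.1
    N = (z95 * s / target) ** 2
    print(math.ceil(N))    # 152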
Small Sampling Theory

When the sample size is smaller than about 30 (N < 30), we cannot use the normal distribution to describe the variability of the sample means.

With small sample sizes, we cannot assume that the sample-estimated standard deviation (s) is a good approximation to the true standard deviation of the population (σ).

The quantity (x̄ - μ) / (s / N^(1/2)) no longer follows the standard normal distribution.

Instead, the probability distribution of the small-sample means follows the "Student's t distribution".
Student's t Distribution

For the variable t:

t = (x̄ - μ) / (s / N^(1/2))

The probability density function for the t-distribution is:

f(t) = K(ν) * (1 + t²/ν)^(-(ν+1)/2)

Here K(ν) is a constant which depends on the number of degrees of freedom (ν = N - 1). K(ν) is chosen so that:

∫ f(t) dt = 1   (integrated from -∞ to +∞)
How to Use t-Distribution
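A minimal sketch of one common use, assuming a hypothetical sample of N = 10 temperature measurements: the 95% confidence interval for μ is built from the t-distribution with ν = N - 1 (scipy.stats.t supplies the critical value that a t-table would give).

    import numpy as np
    from scipy.stats import t

    # Hypothetical small sample of N = 10 temperature measurements (degC)
    x = np.array([18.2, 18.9, 17.8, 18.5, 18.3, 19.1, 18.0, 18.6, 18.4, 18.7])
    N = len(x)
    xbar, s = x.mean(), x.std(ddof=1)

    t_crit = t.ppf(0.975, df=N - 1)                # two-sided 95% critical value, nu = N - 1
    half_width = t_crit * s / np.sqrt(N)
    print(xbar - half_width, xbar + half_width)    # 95% confidence interval for mu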
Hypothesis (Significance) Testing

Hypothesis testing involves comparing a hypothesized population parameter with the corresponding number (or statistic) determined from sample data.

Hypothesis testing is also used to construct confidence intervals around sample statistics.

The accompanying table lists the population parameters that are often tested between the population and samples.
Significance Tests

Five Basic Steps:

1. State the significance level
2. State the null hypothesis and its alternative
3. State the statistics used
4. State the critical region
5. Evaluate the statistics and state the conclusion
An Example
Null Hypothesis

Usually, the null hypothesis and its alternative are mutually exclusive. For example:

H0: The means of two samples are equal.
H1: The means of two samples are not equal.

H0: The variance at a period of 5 days is less than or equal to C.
H1: The variance at a period of 5 days is greater than C.
Confidence and Significance

Confidence Level: X%

Significance Level: α

(The two are related by X% = (1 - α) × 100%.)
What Does 5% Significant Mean?

At the 5% significance level, there is a one in twenty chance of rejecting the null hypothesis wrongly (i.e., you reject the hypothesis even though it is true).
One-Sided and Two-Sided Tests

Two-Sided Tests

H0: mean temperature = 20°C
H1: mean temperature ≠ 20°C

One-Sided Tests

H0: mean temperature ≤ 20°C
H1: mean temperature > 20°C
Type-I and Type-II Errors

Type I Error: You reject a null hypothesis that is true.

Type II Error: You fail to reject a null hypothesis that is false.

By choosing a significance level of, say, α = 0.05, we limit the probability of a Type I error to 0.05.
Significance Test of the Difference of Means

If we draw two samples from one population and obtain two different means, how do we estimate the significance of the difference between these two sample means?

If we repeat this pair-sampling many times, the probability distribution of the mean difference will be a normal distribution with a mean of zero and a standard deviation of:

σ_(x̄1 - x̄2) = ( σ1²/N1 + σ2²/N2 )^(1/2)

To test the null hypothesis that the samples come from the same population (μ1 = μ2 as well as σ1 = σ2), use the t-distribution with

ν = N1 + N2 - 2
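A hedged sketch of such a test with hypothetical data; scipy.stats.ttest_ind implements the pooled two-sample t-test (ν = N1 + N2 - 2):

    import numpy as np
    from scipy.stats import ttest_ind

    # Hypothetical samples assumed to come from the same population
    sample1 = np.array([20.1, 19.8, 20.5, 20.0, 19.9, 20.3])
    sample2 = np.array([20.4, 20.2, 19.7, 20.6, 20.1, 19.9])

    t_stat, p_value = ttest_ind(sample1, sample2, equal_var=True)   # pooled two-sample t-test
    print(t_stat, p_value)
    # If p_value > 0.05, the difference of the means is not significant at the 5% level.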
Significance Test of Variance

Standard deviation is another important population parameter to be tested.

We want to know whether the standard deviations estimated from two samples are significantly different.

The statistical test used to compare the variability of two samples is the F-test:

F = S1² / S2²

S1²: the estimate of the population variance from sample 1

S2²: the estimate of the population variance from sample 2
F-Distribution

f(x) = K(ν1, ν2) * x^((ν1 - 2)/2) * (1 + ν1·x/ν2)^(-(ν1 + ν2)/2)

Here:

X = S1² / S2²

ν1 = degrees of freedom of sample 1

ν2 = degrees of freedom of sample 2

K(ν1, ν2) = a constant chosen to make the total area under the f(x) curve equal to 1
F-Test

The F-distribution is non-symmetric.

There are two critical F values of unequal magnitude.

Whether to use the upper or the lower critical value depends on the relative magnitudes of S1 and S2:

If S1 > S2, then use the upper limit FU to reject the null hypothesis.

If S1 < S2, then use the lower limit FL to reject the null hypothesis.

Alternatively, you can always put the larger variance in the numerator (S1 > S2) when F is defined. In this case, you can always use the upper limit to examine the hypothesis.
An Example on Test of Variance

There are two samples, each with six measurements of wind speed. The first sample measures a wind speed variance of 10 (m/sec)². The second sample measures a variance of 6 (m/sec)². Are these two variances significantly different?

Answer:

(1) Select a significance level of α = 0.05 (a 95% confidence level).

(2) H0: σ1 = σ2
    H1: σ1 ≠ σ2

(3) Use the F-test:
    F = 10/6 = 1.67

(4) This is a two-tailed test, so use the 0.025 significance areas in each tail.
    For ν1 = ν2 = 6 - 1 = 5, F0.975 = 7.15.

(5) Since F < F0.975, the null hypothesis cannot be rejected.

(6) The variances from these two samples are not significantly different.
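A quick numerical check of this example; scipy.stats.f.ppf supplies the upper critical value used in step (4):

    from scipy.stats import f

    F = 10 / 6                         # ratio of the two sample variances
    nu1 = nu2 = 6 - 1                  # degrees of freedom of each sample

    F_crit = f.ppf(0.975, nu1, nu2)    # upper 2.5% critical value, ~7.15
    print(F, F_crit, F < F_crit)       # F < F_crit -> the null hypothesis cannot be rejected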
Summary

Z-Statistic (the Standard Normal Distribution): test of the sample mean for large samples (N ≥ 30)

Z = (x̄ - μ) / (s / N^(1/2))

t-Statistic: test of the sample mean for small samples

t = (x̄ - μ) / (s / N^(1/2)),  with ν = N - 1

F-Statistic: test of sample variances (or power spectra)

F = S1² / S2²