Part 1: Probability Distributions
Probability Distribution
Normal Distribution
Student-t Distribution
F Distribution
Significance Tests

Probability Density Function
P = the probability that a randomly selected value of a variable X falls between a and b.
     f(x) = the probability density function.
The probability density function has to be integrated between distinct limits to obtain a probability:
     P(a ≤ X ≤ b) = ∫ f(x) dx (integrated from x = a to x = b)
The probability that X takes any one particular value is ZERO.
Two important properties of the probability density function:
     (1) f(x) ≥ 0 for all x within the domain of f.
     (2) ∫ f(x) dx = 1 when integrated over the entire domain of f.

Cumulative Distribution Function
The cumulative distribution function F(x) is defined as the probability that the variable assumes a value less than or equal to x:
     F(x) = P(X ≤ x)
The cumulative distribution function is often used to assist in calculating probabilities (as shown later).
The following relation between F and P is essential for probability calculations:
     P(a ≤ X ≤ b) = F(b) - F(a)

Normal Distribution
     f(x) = [1 / (σ√(2π))] exp[-(x - μ)² / (2σ²)] : the probability density function
     μ: mean of the population
     σ: standard deviation of the population
The normal distribution is one of the most important distributions in geophysics. Most geophysical variables (such as wind, temperature, and pressure) are distributed approximately normally about their means.

Standard Normal Distribution
The standard normal distribution has a mean of 0 and a standard deviation of 1.
This probability distribution is particularly useful as it can represent any normal distribution, whatever its mean and standard deviation.
Using the following transformation, a normal distribution of variable X can be converted to the standard normal distribution of variable Z:
                                         Z = ( X - μ ) / σ

How to Use Standard Normal Distribution
Example 1: What is the probability that a value of Z is greater than 0.75?
      Answer: P(Z ≥ 0.75) = 1 - P(Z ≤ 0.75) = 1 - 0.7734 = 0.2266
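The table value can be checked numerically; a minimal sketch using Python with scipy.stats (not part of the original slides) is:
    from scipy.stats import norm

    # P(Z >= 0.75) = 1 - CDF(0.75) for the standard normal distribution
    p_upper = 1 - norm.cdf(0.75)
    print(round(p_upper, 4))   # 0.2266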

Another Example
Example 2: What is the probability that Z lies between the limits Z1=-0.60 and Z2=0.75?
Answer:
    P(Z ≤ -|Z1|) = 1 - P(Z ≤ |Z1|) → P(Z ≤ -0.60) = 1 - P(Z ≤ 0.60) = 1 - 0.7257 = 0.2743
    P(-0.60 ≤ Z ≤ 0.75) = P(Z ≤ 0.75) - P(Z ≤ -0.60)
                        = 0.7734 - 0.2743
                        = 0.4991
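The same kind of check works for the interval probability (again a sketch assuming scipy.stats):
    from scipy.stats import norm

    # P(-0.60 <= Z <= 0.75) = CDF(0.75) - CDF(-0.60)
    p_interval = norm.cdf(0.75) - norm.cdf(-0.60)
    print(round(p_interval, 4))   # 0.4991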

Probability of Normal Distribution

Confidence Intervals (or Limits)
Instead of asking “what is the probability that Z falls within the limits a and b in the normal distribution”, it is often more important to ask “within what interval or limits does X% of the population lie in the normal distribution?”
The X% is referred to as the “confidence level”.
The interval corresponding to this level is called the “confidence interval” or the “confidence limits”.

How to Determine Confidence Intervals
Choose a confidence level.
Determine the Z values from the table of the standard normal distribution. (Note that Z1 = -Z2.)
Transform the Z values back to X values using Z = (X - μ) / σ.
The confidence intervals are then determined (see the sketch below).
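As a concrete illustration of these steps (a sketch, not from the original slides; the population mean, standard deviation, and confidence level are assumed example values, and scipy.stats is assumed available):
    from scipy.stats import norm

    mu, sigma = 20.0, 3.0        # example population mean and standard deviation (assumed)
    conf = 0.95                  # chosen confidence level

    # Step 2: Z values that bound the central 95% of the standard normal (Z1 = -Z2)
    z2 = norm.ppf(0.5 + conf / 2)    # about 1.96
    z1 = -z2

    # Step 3: transform back to X using X = mu + Z * sigma
    x_low, x_high = mu + z1 * sigma, mu + z2 * sigma
    print(z2, x_low, x_high)         # about 1.96, 14.12, 25.88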

How to Estimate μ and σ
In order to use the normal distribution, we need to know the mean and standard deviation of the population.
But they are impossible to know in most geoscience applications, because most geoscience populations are infinite.
We have to estimate μ and σ from samples.

Sample Size
The bigger the sample size N, the closer the estimated mean and standard deviation are to the true values.
How large should the sample size be for the estimated mean and standard deviation to be reasonable?
      → N > 30 for s to approach σ (the sample mean approaches μ even with small N)

Transformations
It can be shown that any frequency function can be transformed into a frequency function of a given form by a suitable transformation or functional relationship.
For example, if the original data follow some complicated skewed distribution, we may want to transform this distribution into a known distribution (such as the normal distribution) whose theory and properties are well known.
Most geoscience variables are distributed normally about their means or can be transformed in such a way that they become normally distributed.
The normal distribution is, therefore, one of the most important distributions in geoscience data analysis.

Distribution of Sample Means
One example of how the normal distribution can be used for data that are “non-normally” distributed is to determine the distribution of sample means.
Is the sample mean close to the population mean?
To answer this question, we need to know the probability distribution of the sample mean.
We can obtain this distribution by repeatedly drawing samples from a population and finding the frequency distribution of the resulting sample means.

An Example of Normal Distribution
If we repeatedly draw samples of N values from a population and calculate the mean of each sample, the distribution of the sample means tends toward a normal distribution.

Central Limit Theorem
In the limit, as the sample size becomes large, the sum (or the mean) of a set of independent measurements will have a normal distribution, irrespective of the distribution of the raw data.
If a population has a mean of μ and a standard deviation of σ, then the distribution of its sample means will have:
 a mean μ_x̄ = μ, and
 a standard deviation σ_x̄ = σ / √N = the standard error of the mean (illustrated in the sketch below)
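A small simulation (a sketch assuming numpy; the exponential population is an arbitrary example of a skewed distribution) illustrates both results:
    import numpy as np

    rng = np.random.default_rng(0)
    N, trials = 50, 10000

    # Skewed (exponential) population with mean 1 and standard deviation 1
    samples = rng.exponential(scale=1.0, size=(trials, N))
    means = samples.mean(axis=1)

    print(means.mean())   # close to the population mean (1.0)
    print(means.std())    # close to sigma / sqrt(N) = 1 / sqrt(50), about 0.141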

Normal Distribution of Sample Means
The distribution of sample means can be transformed into the standard normal distribution with the following transformation:
     Z = (x̄ - μ) / σ_x̄ = (x̄ - μ) / (σ / √N)
The confidence level of the sample mean can therefore be estimated based on the standard normal distribution and the transformation.

Confidence Level for Sample Means
The probability that a sample mean x̄ lies between μ - σ_x̄ and μ + σ_x̄ may be written:
     P(μ - σ_x̄ ≤ x̄ ≤ μ + σ_x̄) ≈ 68%
However, it is more important to find the confidence interval for the population mean μ. The previous relation can be rewritten as:
     P(x̄ - σ_x̄ ≤ μ ≤ x̄ + σ_x̄) ≈ 68%
But we do not know the true value of σ (the standard deviation of the population). If the sample size is large enough (N ≥ 30), we can use s (the standard deviation estimated from the sample) to approximate σ:
     σ_x̄ ≈ s / √N

Confidence Level for Sample Means
The 68% confidence interval for μ is:
     x̄ - s/√N ≤ μ ≤ x̄ + s/√N
To generalize the relation to any confidence level X%, the previous relation can be rewritten as:
     x̄ - Z_X% · s/√N ≤ μ ≤ x̄ + Z_X% · s/√N

An Example
In an experiment, forty measurements of air temperature were made. The mean and standard deviation of the sample are:
                              x̄ = 18.41°C    and   s = 0.6283°C.
Question 1:
     Calculate the 95% confidence interval for the population mean for these data.
Answer:
     From the previous slide, we know the interval is: x̄ ± Z_95% · s / √N
     Z_95% = 1.96
     Z_95% · s / √N = 1.96 × 0.6283°C / √40 = 0.1947°C
     The 95% confidence interval for the population mean: 18.22°C to 18.60°C.
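These numbers can be reproduced with a short script (a sketch assuming numpy and scipy.stats):
    import numpy as np
    from scipy.stats import norm

    xbar, s, N = 18.41, 0.6283, 40
    z95 = norm.ppf(0.975)                       # about 1.96 for a two-tailed 95% interval
    half_width = z95 * s / np.sqrt(N)           # about 0.1947 degC
    print(xbar - half_width, xbar + half_width) # about 18.22 to 18.60 degC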

An Example – cont.
In an experiment, forty measurements of air temperature were made. The mean and standard deviation of the sample are:
                              x̄ = 18.41°C    and   s = 0.6283°C.
Question 2:
     How many measurements would be required to reduce the 95% confidence interval for the population mean to (x̄ - 0.1)°C to (x̄ + 0.1)°C?
Answer:
      We want to have Z_95% · s / √N = 0.1°C
      We already know Z_95% = 1.96 and s = 0.6283°C
      → N = (1.96 × 0.6283 / 0.1)² ≈ 152
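The required sample size can be checked the same way (a sketch assuming numpy):
    import numpy as np

    z95, s, target_half_width = 1.96, 0.6283, 0.1
    N_required = (z95 * s / target_half_width) ** 2
    print(int(np.ceil(N_required)))    # 152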

Small Sampling Theory
When the sample size is smaller than about 30 (N < 30), we cannot use the normal distribution to describe the variability of the sample means.
With small sample sizes, we cannot assume that the sample-estimated standard deviation (s) is a good approximation to the true standard deviation of the population (σ).
The quantity t = (x̄ - μ) / (s / √N) no longer follows the standard normal distribution.
Instead, the probability distribution of the small-sample means follows the “Student’s t distribution”.

Student’s t Distribution
For the variable t = (x̄ - μ) / (s / √N):
The probability density function for the t-distribution is:
     f(t) = K(ν) · (1 + t²/ν)^(-(ν+1)/2)
Here K(ν) is a constant which depends on the number of degrees of freedom (ν = N - 1). K(ν) is chosen so that:
     ∫ f(t) dt = 1 (integrated from t = -∞ to +∞)

How to Use t-Distribution
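Critical t values are usually read from a table for the appropriate number of degrees of freedom; the sketch below (assuming scipy.stats, with a hypothetical small sample of N = 10 reusing the earlier temperature statistics) shows how they translate into a small-sample confidence interval:
    import numpy as np
    from scipy.stats import t

    xbar, s, N = 18.41, 0.6283, 10      # hypothetical small sample (values assumed)
    nu = N - 1                          # degrees of freedom

    t95 = t.ppf(0.975, df=nu)           # two-tailed 95% critical value, about 2.262
    half_width = t95 * s / np.sqrt(N)
    print(xbar - half_width, xbar + half_width)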

Hypothesis (Significance) Testing
Hypothesis testing involves comparing a hypothesized population parameter with the corresponding number (or statistic) determined from sample data.
Hypothesis testing is also used to construct confidence intervals around sample statistics.
The population parameters most often tested in this way are the mean and the variance (using the statistics described below).

Significance Tests
Five Basic Steps:
    1. State the significance level
    2. State the null hypothesis and its alternative
    3. State the statistics used
    4. State the critical region
    5. Evaluate the statistics and state the conclusion

Null Hypothesis
Usually, the null hypothesis and its alternative are mutually exclusive. For example:
     H0: The means of two samples are equal.
     H1: The means of two samples are not equal.
     H0: The variance at a period of 5 days is less than or equal to C.
     H1: The variance at a period of 5 days is greater than C.

Confidence and Significance
Confidence Level: X%
Significance Level: α, where α = 100% - X% (e.g., a 95% confidence level corresponds to α = 0.05)

What Does a 5% Significance Level Mean?
At the 5% significance level, there is a one-in-twenty chance of rejecting the null hypothesis wrongly (i.e., rejecting the hypothesis when it is actually true).

One-Sided and Two-Sided Tests
Two-Sided Test
      H0: mean temperature = 20°C
      H1: mean temperature ≠ 20°C
One-Sided Test
      H0: mean temperature ≤ 20°C
      H1: mean temperature > 20°C

Type-I and Type-II Errors
Type I Error: You reject a null hypothesis that is true.
Type II Error: You fail to reject a null hypothesis that is false.
By choosing a significance level of, say, α = 0.05, we limit the probability of a type I error to 0.05.

Significance Test of the Difference of Means
If we draw two samples from one population and obtain two different means, how do we estimate the significance of the difference between these two sample means?
If we repeat this pair-sampling many times, the probability distribution of the difference between the two sample means will be a normal distribution with a mean of zero and a standard deviation of:
     σ_d = σ · √(1/N1 + 1/N2)
To test the null hypothesis that the samples come from the same population (μ1 = μ2 as well as σ1 = σ2), use the t-distribution with
     ν = N1 + N2 - 2 degrees of freedom
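A minimal sketch of such a difference-of-means test (hypothetical data; assumes numpy and scipy.stats, whose ttest_ind pools the two sample variances by default):
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(1)
    sample1 = rng.normal(loc=20.0, scale=2.0, size=15)   # hypothetical measurements
    sample2 = rng.normal(loc=21.0, scale=2.0, size=12)

    # Two-sided test of H0: mu1 = mu2, assuming equal population variances
    t_stat, p_value = ttest_ind(sample1, sample2)
    print(t_stat, p_value)    # reject H0 at the 5% level if p_value < 0.05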

Significance Test of Variance
Standard deviation is another important population parameter to be tested.
We want to know whether the standard deviations estimated from two samples are significantly different.
The statistical test used to compare the variability of two samples is the F-test:
     F = S1² / S2²
     S1²: the estimate of the population variance from sample 1
     S2²: the estimate of the population variance from sample 2

F-Distribution
The probability density function for the F-distribution is:
      f(x) = K(ν1, ν2) · x^((ν1 - 2)/2) · (1 + ν1·x/ν2)^(-(ν1 + ν2)/2)
Here:
      X = S1² / S2²
      ν1 = degrees of freedom of sample 1
      ν2 = degrees of freedom of sample 2
      K(ν1, ν2) = a constant chosen so that the total area under the f(x) curve is 1

F-Test
The F-distribution is not symmetric.
There are two critical F values of unequal magnitude.
Whether to use the upper or the lower critical value depends on the relative magnitudes of S1 and S2:
     If S1 > S2, compare F with the upper critical value F_U to decide whether to reject the null hypothesis.
     If S1 < S2, compare F with the lower critical value F_L to decide whether to reject the null hypothesis.
Alternatively, always define F with the larger variance in the numerator (S1 > S2); then the upper critical value can always be used to examine the hypothesis.

An Example on Test of Variance
There are two samples, each with six measurements of wind speed. The first sample has a wind speed variance of 10 (m/sec)². The second sample has a variance of 6 (m/sec)². Are these two variances significantly different?
Answer: (1) Select the 5% significance level (α = 0.05)
               (2) H0: σ1 = σ2
                     H1: σ1 ≠ σ2
               (3) Use the F-test
                     F = 10/6 = 1.67
               (4) This is a two-tailed test, so use 0.025 in each tail.
                     For ν1 = ν2 = 6 - 1 = 5, F0.975 = 7.15.
               (5) Since F < F0.975, the null hypothesis cannot be rejected.
               (6) The variances of these two samples are not significantly different.
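The critical value and an equivalent p-value check can be reproduced with scipy.stats (a sketch, not part of the original slides):
    from scipy.stats import f

    F_stat, nu1, nu2 = 10.0 / 6.0, 5, 5
    F_upper = f.ppf(0.975, dfn=nu1, dfd=nu2)   # about 7.15, upper critical value for a two-tailed 5% test
    print(F_upper, F_stat < F_upper)           # True: cannot reject H0

    # Equivalent two-tailed p-value
    p_value = 2 * min(f.cdf(F_stat, nu1, nu2), f.sf(F_stat, nu1, nu2))
    print(p_value)                             # well above 0.05, so not significantly different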

Summary
Z-Statistic (the Standard Normal Distribution)
t-Statistic: test sample mean
F-Statistic: test sample variance (or power spectra)