Lecture notes

Statistics I

Chapter 1: Data and Decisions

What are Data?

Statistics is about data and decisions
Statistics are quantities calculated from data
Statistics- a toolbox and a way of thinking
Its purpose is to gather data that is relevant to the problem
Statistics: a collection of tools and the associated reasoning to model, summarize and understand data.
Data: it’s not just information but also values that are measured and observed, values with their context.
Data warehouses: vast digital repositories where data is recorded and stored
Big Data: data sets so large that traditional methods of storage and analysis are inadequate
The “Five W’s”
Who: who is the subject or case
Who is being measured
What: what are the variables (information)
What you have
Who and what are part of data and are essential
Where: where are these values recorded
When: when are these values recorded
Why: why are these values recorded
How: how are these values recorded
Why, when, where, how are part of the metadata
Context of data we are collecting
Organize the information gathered on tables or spreadsheets (collections of data)

Rows of the table answer for who, each row is a case
Columns of data tables answer for what, each column is a variable

Variable types

Categorical variables: the values are names of categories
Nominal: just a label without a particular order (no intrinsic order)
Ordinal: labels with a given order
Some nominal variables are used as identifiers
Identifiers: purpose is to assign a unique identifier code to each individual
special case of nominal categorical variables
Variables identify a category for each case
Answers such as yes or no which can’t be added
Qualitative: labels, titles
Quantitative variables: the values are numerical quantities with units
Must be a number quantity and have units
Variables record measurements, amounts or something else but they must have units
Cross-sectional data: when you record something at a given point in time for different units
Time-series data: collects information for 1 subject over different points in time
If there is displacing, then there is no time-series

Chapter 2: Displaying and Describing Categorical Data

Displaying and describing Categorical data

Descriptive statistics: summarizes data and displays it
In statistics, instead of measuring the whole population, it is usually best to take a sample from the population and make inferences based on that sample
Inferential statistics: making inferences about a population based on a sample
Samples are chosen randomly
Display and summarize data to see:
Patterns
Relationships
Exceptions in values and observations

Summarizing a Categorical variable

IP Number | Time | Source
245.240.221.71 | 1 / Feb / 2013 13:15:08 | Google
196.345.281.51 | 1 / Feb / 2013 14:56:23 | Direct
Variable types:
IP number: categorical, nominal (an identifier)
Time: quantitative
Source: categorical, nominal
Variable source: how their customers found their way to the website
Source is a categorical variable
If the list of categories is too long to fit on a page, group the smallest categories into a new category called “Others”
Frequency table: records how often each category of a variable occurs

Source | Count of visits | Visits as a percentage (%)
Google | 130,158 | 57.36
Direct | 52,969 | 23.34
E-mail | 16,084 | 7.09
Bing | 9,581 | 4.22
Others | 6,740 | 2.97
Total | 226,925 | XXX

Express each frequency as a percentage
Compute the proportion
Percentage = proportion x 100%
Make and interpret a frequency table for a categorical variable
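
As a minimal sketch of the step "percentage = proportion x 100%", the snippet below recomputes the percentage column from the counts and the total listed in the notes. The names `counts`, `total_visits` and `relative_frequencies` are hypothetical and only serve this illustration.

```python
# Counts of visits per source, copied from the frequency table in the notes.
counts = {
    "Google": 130_158,
    "Direct": 52_969,
    "E-mail": 16_084,
    "Bing": 9_581,
    "Others": 6_740,
}
total_visits = 226_925  # total number of visits reported in the notes


def relative_frequencies(counts, total):
    """Return each count as a percentage of the total (proportion x 100)."""
    return {source: count / total * 100 for source, count in counts.items()}


percentages = relative_frequencies(counts, total_visits)
for source, pct in percentages.items():
    print(f"{source:8s} {pct:6.2f}%")
```

Dividing by the reported total rather than by the sum of the listed counts keeps the result consistent with the percentages shown in the table.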

Frequency table | Count of the number of cases in each category | Count
Relative frequency table | Count of the number of cases in each category divided by the total number of cases | Percentage

Cases = people answering the questions

Displaying and describing categorical data

Display the percentage as a bar chart
Keep it proportional
Categorical bar chart must have gaps to indicate different categories
Make and interpret a bar chart or pie chart
Pie chart gives less information
Horizontal axis = categories
Vertical axis = counts
Residual category is always placed last in a bar chart regardless of the count
Area principle: the area of each bar is proportional to the quantity it represents
Pie chart: shows how the whole group breaks into several categories and shows all the cases as a circle sliced into pieces whose areas are proportional to the fraction of cases in each category
Keep order from biggest to smallest (residual goes last)
Include the percentages
Bar charts are better than pie charts to represent frequency tables

Exploring Two Categorical Variables: Contingency Tables

Example: survey of 5039 people in 5 countries (do you use social networks?)

Respondent ID | Social Networking | Country
0001 | Yes | Egypt
0002 | No access | Egypt
… | … | …
1000 | No access | Egypt
1001 | Yes | …
1002 | Yes | …
1003 | … | …
… | … | …
5039 | Yes | U.S.

Frequency table for variable social networking

Social Networking | Count | Relative frequency (%)
No | 1,249 | 24.8
Yes | 2,175 | 43.2
No access | 1,615 | 32.1
Total | 5,039 | 100.1

Counts must add up to the total number of cases (rows)
Does social networking differ from country to country?

Contingency tables

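Social Networking | … | Egypt (EG) | … | … | … | Total
No | 336 | … | 460 | … | 293 | 1,249
Yes | 529 | 300 | 340 | 500 | 506 | 2,175
No Access | 153 | 630 | 200 | 420 | 212 | 1,615
Total | 1,018 | 1,000 | … | 1,010 | 1,011 | 5,039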

Marginal distribution: the frequency distribution of either one of the variables
The 300 in the Egypt (EG) column means that there are 300 respondents from Egypt that use social networking
There are 300 respondents out of 5,039 respondents who are from Egypt and who answered yes to the question, meaning they use social networking

For every cell we can compute a percentage in three different ways

Total percentage

\[ \text{Total percentage} = \frac{300}{5039} \times 100\% \approx 5.95\% \]

Approximately 5.95% of the total number of respondents are from Egypt and answered yes
Row percentage

\[ \text{Row percentage} = \frac{300}{2175} \times 100\% \approx 13.8\% \]

Approximately 13.8% of the respondents who answered yes are from Egypt
Column Percentage

\[ \text{Column percentage} = \frac{300}{1000} \times 100\% = 30\% \]

30% of the respondents from Egypt answered yes to the survey question
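To compare social networking from country to country
Use column percentages
Conditional distribution: the distribution of one variable restricted to the cases that satisfy a condition on the other variable (for example, social networking use among respondents from Egypt only)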
Make and interpret contingency tables
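
The three percentages for the Egypt / Yes cell can be sketched in a few lines. The function name `cell_percentages` and its parameter names are hypothetical; the numbers are the cell count and totals used in the notes.

```python
def cell_percentages(cell, row_total, column_total, grand_total):
    """Express one contingency-table cell as total, row and column percentages."""
    return {
        "total": cell / grand_total * 100,    # share of all respondents
        "row": cell / row_total * 100,        # share of the row (all "Yes" answers)
        "column": cell / column_total * 100,  # share of the column (all Egypt respondents)
    }


egypt_yes = cell_percentages(cell=300, row_total=2175, column_total=1000, grand_total=5039)
for kind, value in egypt_yes.items():
    print(f"{kind:6s} percentage: {value:.2f}%")
```

Only the denominator changes between the three versions, which is why it matters to state which total a percentage refers to.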

Contingency table | Shows how individuals are distributed along each variable depending on the value of the other variable | Count
Marginal distribution | The frequency distribution of either one of the variables | Percentage

Chapter 4: Correlation

Scatterplots

Scatterplots: plots two quantitative variables together showing the relationship (if any) between them
Study if a relationship exists
If we only want to see whether a relationship exists, it doesn’t matter which variable goes on the x-axis and which on the y-axis
Analysis of the relationships between the quantitative variables

Direction | Positive, negative or neither
Form | Straight, curved, something exotic or no pattern
Strength | How clustered are the data? How close to the line of best fit?
Outliers | Are there any unusual observations or subgroups?

Assign roles to the variables

Variables can be defined as explanatory or as response variables
Explanatory variable: x-axis
Response variable: y-axis

Measuring correlation

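Case | Sales people (x) | Sales ($) (y) | zx | zy | zxzy
1 | 2 | $10,000 | -1.49 | -1.42 | 2.12
2 | 3 | $11,000 | -1.31 | -1.24 | 1.62
3 | 7 | $13,000 | -0.60 | -0.86 | 0.52
4 | 9 | $14,000 | -0.25 | -0.67 | 0.17
5 | 10 | $18,000 | -0.07 | 0.07 | -0.01
6 | 10 | $20,000 | -0.07 | 0.45 | -0.03
7 | 12 | $20,000 | 0.28 | 0.45 | 0.13
8 | 15 | $22,000 | … | 0.82 | 0.67
9 | 16 | $22,000 | 0.99 | 0.82 | …
10 | 20 | $26,000 | 1.70 | 1.57 | 2.68
Total | 104 | $176,000 | 0.00 | … | 8.69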

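\( \bar{x} \) (mean) = 10.4 people

\( s_x \) (SD(x)) ≈ 5.6 people

\[ z_x = \frac{x - \bar{x}}{s_x} \]

\( \bar{y} \) (mean) = $17,600

\( s_y \) (SD(y)) ≈ $5,337

\[ z_y = \frac{y - \bar{y}}{s_y} \]

\[ r = \frac{\sum z_x z_y}{n - 1} \]

\[ r = \frac{8.69}{10 - 1} \approx 0.97 \]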

x: explanatory (or predictor) variable (independent) (cause)
y: response variable (dependent) (effect)
Sum of all z scores is always 0
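
The calculation of r from z-scores can be sketched as follows. The sales figures are the ones listed in the notes; the numbers of sales people are an assumption, reconstructed from the z-scores, the mean of 10.4 and the total of 104. All function names are hypothetical helpers for this illustration.

```python
import math

# Sales in dollars (y) as listed in the notes.
sales = [10_000, 11_000, 13_000, 14_000, 18_000, 20_000, 20_000, 22_000, 22_000, 26_000]
# Number of sales people (x): an assumption, reconstructed from the z-scores,
# the mean (10.4) and the total (104) given in the notes.
sales_people = [2, 3, 7, 9, 10, 10, 12, 15, 16, 20]


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def z_scores(values):
    """Standardize each value: subtract the mean, divide by the SD."""
    m, s = mean(values), sample_sd(values)
    return [(v - m) / s for v in values]


def correlation(xs, ys):
    """r = sum of the products of paired z-scores, divided by n - 1."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)


r = correlation(sales_people, sales)
print(f"r = {r:.2f}")
```

Because r is built from z-scores, rescaling either variable (for example, sales in thousands of dollars) leaves the result unchanged.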

Correlation Conditions

Only for quantitative variables
Only meaningful if the relationship is approximately linear
Straight lines not curves
Beware of outliers

Measuring Correlation

In a positive linear relationship, most points are in:
Lower left quadrant: x < \( \bar{x} \) and y < \( \bar{y} \), so zx < 0 and zy < 0
zxzy > 0
Upper right quadrant: x > \( \bar{x} \) and y > \( \bar{y} \), so zx > 0 and zy > 0
zxzy > 0
The mean of zxzy (the correlation coefficient r) is > 0
In a negative linear relationship, most points are in:
Upper left quadrant: x < \( \bar{x} \) and y > \( \bar{y} \), so zx < 0 and zy > 0
zxzy < 0
Lower right quadrant: x > \( \bar{x} \) and y < \( \bar{y} \), so zx > 0 and zy < 0
zxzy < 0
The mean of zxzy (the correlation coefficient r) is < 0

Case | Sales people | Sales ($) | zx | zy | zxzy
1 | 2 | $10,000 | -1.5 | -1.42 | 2.12

Case 1:
zx = -1.5
zy ≈ -1.4240
zxzy ≈ 2.14

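\[ z_x = \frac{2 \text{ people} - 10.4 \text{ people}}{5.6 \text{ people}} = -1.5 \]

That is, 1.5 standard deviations below the mean

\[ z_y = \frac{\$10{,}000 - \$17{,}600}{\$5{,}337} \approx -1.4 \]

That is, 1.4 standard deviations below the mean

zxzy ≈ 2.1

Sum of zxzy

\[ r = \frac{\sum z_x z_y}{n - 1} \approx \frac{8.69}{9} \approx 0.97 \]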

Understanding Correlation

The sign of the correlation gives the direction of the relationship
Positive or negative
Boundaries of correlation
A correlation of 1 or -1 is a perfect linear relationship
A correlation of 0 is a lack of linear relationship
Correlation has no units, so shifting or scaling the data, standardizing, or even swapping the variables has no effect on the numerical value
A large correlation is not a sign of a causal relationship
For r to be meaningful, the relationship must be linear and there must be no outliers

Properties of r
Positive r indicates positive linear relationships and Negative r indicates negative linear relationships
-1 ≤ r ≤ 1
r has no units
r isn’t affected by changes in center or scale of x or y since it’s computed with z-scores
r measures strength of linear relationship

r ≈ 0 doesn’t mean there is no relationship

r is sensitive to outliers
Non-linear correlations

r is not a valid measure when the relationship is not linear
Options
Transform the variables using square roots or base-10 logarithms (the relationship becomes more linear)
Correlation of ranks

Lurking Variables

Lurking variables: an unobserved variable that is affecting both of the variables being observed
Examples of lurking variables
Positive relationship between number of doctors and life expectancy
Lurking variable: better living standards?
Positive relationship between consumption of diet soda drinks and traffic accidents
Lurking variable: higher population?
Positive relationship between the consumption of ice cream and number of drownings
Lurking variable: hot weather
The relationship is not between our observed variables, but between each of the observed variables and the lurking variable

What can go wrong

Don’t say correlation when you mean association
To say correlation, first calculate r
Don’t calculate correlation of categorical variables
Make sure the variables are associated in a linear way
Be careful to see if there are any outliers
Don’t confuse correlation with causation
Lurking variables, spurious correlation

Chapter 6: Random variables and Probability Models

Random Variables

Random variable: a variable whose value is based on the outcome of a random event
Types of random variables
Discrete random variables: take values from a countable list, such as 1, 2, 3, …
Continuous random variables: can take any value in an interval, such as 0.1, 0.11, 0.112, …
For both discrete and continuous variables, the set of all possible values and their associated probabilities is called the probability model
Example: Insurance company

Outcome | Payout x | Probability P(X = x)
Death | $100,000 | 1 / 1000
Disability | $50,000 | 2 / 1000
Neither | $0 | 997 / 1000
Sum | --- | 1000 / 1000

Expected Value of a Random Variable

When the probability model is known, the expected value can be calculated
The long-term expectation of the model

Expected value = sum over all outcomes of (outcome x probability of that outcome)

\[ E(X) = \sum x \cdot P(X = x) \]

E(x) can be written as µ but not as \( \bar{x} \)
For discrete random variables, probability models assign a probability to each possible outcome
Example with previous table
Which payout per policy can an insurance company expect?
Assume they have 1000 clients

\[ E(X) = \frac{(\$100{,}000 \times 1) + (\$50{,}000 \times 2) + (\$0 \times 997)}{1000} = \$200 \]

\[ E(X) = \$100{,}000 \times \frac{1}{1000} + \$50{,}000 \times \frac{2}{1000} + \$0 \times \frac{997}{1000} = \$200 \]
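
A minimal sketch of the same expected-value calculation, using exact fractions so that no rounding error creeps in. The names `payout_model` and `expected_value` are hypothetical.

```python
from fractions import Fraction

# Probability model of the insurance example: (payout in dollars, probability).
payout_model = [
    (100_000, Fraction(1, 1000)),  # death
    (50_000, Fraction(2, 1000)),   # disability
    (0, Fraction(997, 1000)),      # neither
]


def expected_value(model):
    """E(X): each outcome times its probability, summed over all outcomes."""
    return sum(x * p for x, p in model)


# A valid probability model must have probabilities that sum to 1.
assert sum(p for _, p in payout_model) == 1

expected_payout = expected_value(payout_model)
print(f"E(X) = ${expected_payout}")
```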

Properties of Expected values

If X and Y are random variables and c is a constant:

The expectation of a constant c is the constant: \( E(c) = c \)

Adding a constant c increases the expected value by c: \( E(X \pm c) = E(X) \pm c \)

Multiplying by c multiplies the expected value by c: \( E(cX) = c \cdot E(X) \)

The expected value of the sum (or difference) of two random variables is the sum (or difference) of their expected values: \( E(X \pm Y) = E(X) \pm E(Y) \)

Standard Deviation of a Random Variable

Variance: is the expected value of the squared deviations

\[ \mathrm{Var}(X) = \sum (x - \mu)^2 \, P(x) \]

var(x) can be written as \( \sigma^2 \) but not as \( s^2 \)
Standard deviation: indicator of variability of our random variables
SD(x) can be written as σ but not as s

Outcome | Payout x | Probability P(X = x) | Deviation x – E(x)
Death | $100,000 | 1 / 1000 | ($100,000 – $200) = $99,800
Disability | $50,000 | 2 / 1000 | ($50,000 – $200) = $49,800
Neither | $0 | 997 / 1000 | ($0 – $200) = -$200
Sum | --- | 1000 / 1000 |

E(x) instead of mean
E(x) = $200

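Square the deviations: \( (\$99{,}800)^2 \), \( (\$49{,}800)^2 \), \( (-\$200)^2 \)
Find the expected value of the squared deviations

\[ \mathrm{Var}(X) = 99{,}800^2 \times \frac{1}{1000} + 49{,}800^2 \times \frac{2}{1000} + (-200)^2 \times \frac{997}{1000} = 14{,}960{,}000 \quad (\text{in } \$^2) \]

Take the square root

\[ SD(X) = \sqrt{14{,}960{,}000} \approx \$3{,}868 \]

\[ \mathrm{Var}(X) = \sigma^2 = \sum (x - \mu)^2 \, P(x) \]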
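
The variance and standard deviation of the payout can be checked with a short sketch. It repeats the probability model so that it runs on its own; `payout_model`, `expected_value` and `variance` are hypothetical names.

```python
import math
from fractions import Fraction

# Probability model of the insurance example: (payout in dollars, probability).
payout_model = [
    (100_000, Fraction(1, 1000)),  # death
    (50_000, Fraction(2, 1000)),   # disability
    (0, Fraction(997, 1000)),      # neither
]


def expected_value(model):
    return sum(x * p for x, p in model)


def variance(model):
    """Var(X): expected value of the squared deviations from E(X)."""
    mu = expected_value(model)
    return sum((x - mu) ** 2 * p for x, p in model)


var_payout = variance(payout_model)   # in dollars squared
sd_payout = math.sqrt(var_payout)     # back in dollars
print(f"Var(X) = {var_payout} dollars squared, SD(X) = ${sd_payout:,.0f}")
```

The variance is expressed in squared dollars, which is why the standard deviation is the more interpretable measure of risk.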

Properties of Standard deviations

If all values of X are the same constant c, the standard deviation is zero: \( SD(c) = 0 \)

Adding a constant c does not change the standard deviation: \( SD(X \pm c) = SD(X) \)

Multiplying by c multiplies the standard deviation by |c|: \( SD(cX) = |c| \cdot SD(X) \)

The SD of the sum of X and Y is equal to or less than the sum of the standard deviations of X and Y. If X and Y are independent: \( SD(X + Y) = \sqrt{\mathrm{Var}(X) + \mathrm{Var}(Y)} \)

Properties and Examples

Chebyshev’s inequality: also holds for random variables
For data with mean \( \bar{x} \) and standard deviation s, most values are between \( \bar{x} - 3s \) and \( \bar{x} + 3s \)
In repeated random experiments, most values of X will fall between

E(x) – 3 x SD(x) and E(x) + 3 x SD(x)

E(x) = $200 ; SD(x) ≈ $3,868

-$11,404 and $11,804

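What if the insurance company increases payouts by $100 to:
$100,100 (Death); $50,100 (Disability); $100 (Neither)

\[ E(X + \$100) = \$100{,}100 \times \frac{1}{1000} + \$50{,}100 \times \frac{2}{1000} + \$100 \times \frac{997}{1000} \]

\[ E(X + \$100) = \left( \$100{,}000 \times \frac{1}{1000} + \$50{,}000 \times \frac{2}{1000} + \$0 \times \frac{997}{1000} \right) + \$100 \times \left( \frac{1}{1000} + \frac{2}{1000} + \frac{997}{1000} \right) \]

\[ E(X + \$100) = \$200 + \$100 = E(X) + c \]

\[ E(X + \$100) = \$300 \]

\[ E(X \pm c) = E(X) \pm c \]

What about SD(X + $100)?

Deviations

\[ (X + \$100) - E(X + \$100) = X + \$100 - E(X) - \$100 = X - E(X) \]

SD is unchanged

\[ SD(X \pm c) = SD(X) \]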

Insurance company
X = annual payout per policy
E(x) = $200; SD(x) ≈ $3,868
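What if the insurance company doubles payouts to
$200,000 (Death); $100,000 (Disability); $0 (Neither)
X = payouts; 2X = double payouts

\[ E(2X) = \$200{,}000 \times \frac{1}{1000} + \$100{,}000 \times \frac{2}{1000} + \$0 \times \frac{997}{1000} \]

\[ E(2X) = 2 \times \left( \$100{,}000 \times \frac{1}{1000} + \$50{,}000 \times \frac{2}{1000} + \$0 \times \frac{997}{1000} \right) \]

\[ E(2X) = 2 \times \$200 = \$400 \]

\[ E(2X) = 2 \times E(X) \]

\[ E(aX) = a \, E(X) \]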

What about SD(2X)?
When a constant is added, SD is unchanged, but when X is doubled, SD doubles

Deviations

\[ 2X - E(2X) = 2X - 2E(X) = 2 (X - E(X)) = 2 (X - \mu) \]

SD is doubled

\[ SD(cX) = |c| \cdot SD(X) \]

\[ \mathrm{Var}(2X) = E[\text{deviations}^2] = E[2^2 (X - \mu)^2] = 2^2 \, E[(X - \mu)^2] = 2^2 \, \mathrm{Var}(X) \]

\[ \mathrm{Var}(aX) = a^2 \, \mathrm{Var}(X) \]

\[ SD(aX) = |a| \cdot SD(X) \]

Variance is not a linear operator, it’s a quadratic operator
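
The shifting and scaling rules can be verified numerically on the insurance model. This is only a sketch; the helpers `shift` and `scale` are hypothetical names for "add a constant to every payout" and "multiply every payout by a constant".

```python
import math
from fractions import Fraction

# Probability model of the insurance example: (payout in dollars, probability).
payout_model = [
    (100_000, Fraction(1, 1000)),
    (50_000, Fraction(2, 1000)),
    (0, Fraction(997, 1000)),
]


def expected_value(model):
    return sum(x * p for x, p in model)


def variance(model):
    mu = expected_value(model)
    return sum((x - mu) ** 2 * p for x, p in model)


def sd(model):
    return math.sqrt(variance(model))


def shift(model, c):
    """Model of X + c: every payout increases by c, probabilities unchanged."""
    return [(x + c, p) for x, p in model]


def scale(model, a):
    """Model of aX: every payout is multiplied by a, probabilities unchanged."""
    return [(a * x, p) for x, p in model]


shifted = shift(payout_model, 100)
doubled = scale(payout_model, 2)

print("E(X + 100) =", expected_value(shifted), " SD(X + 100) =", round(sd(shifted)))
print("E(2X) =", expected_value(doubled), " SD(2X) =", round(sd(doubled)))
print("Var(2X) / Var(X) =", variance(doubled) / variance(payout_model))
```

The last line shows the quadratic behaviour of the variance: doubling X multiplies the variance by four while the SD only doubles.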
Insurance company
X = annual payout per policy
E(x) = $200; SD(x) ≈ $3,868
Suppose they are independent
Payout to Mr. Ecks = X
Payout to Ms. Wye = Y

E(X + Y) = E(X) + E(Y)

Addition rule for expected value

SD (X +Y):

Intuitively: SD(X + Y) ≤ SD(X) + SD(Y)

Only when they are independent

SD(X) + SD(Y) = risk of insuring the same person twice
If X and Y are independent:

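\[ \mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) = 14{,}960{,}000 + 14{,}960{,}000 = 29{,}920{,}000 \quad (\text{in } \$^2) \]

\[ SD(X + Y) = \sqrt{29{,}920{,}000} \approx \$5{,}470 \]

The risk of insuring two independent people is lower than the risk of insuring the same person twice
It is best to have as many clients as possible who are independent of each other, to reduce risk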
By “pooling risk” you can reduce the overall risk
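
The effect of pooling can be put side by side in a short sketch. It assumes the variance of one policy computed earlier in the notes (14,960,000 dollars squared); the variable names are hypothetical.

```python
import math

var_one_policy = 14_960_000  # Var(X) in dollars squared, from the insurance example

# Two independent clients: variances add, then take the square root.
sd_two_independent = math.sqrt(var_one_policy + var_one_policy)

# The same person insured twice: SD(2X) = 2 * SD(X).
sd_same_person_twice = 2 * math.sqrt(var_one_policy)

print(f"SD(X + Y), independent: ${sd_two_independent:,.0f}")
print(f"SD(2X), same person:    ${sd_same_person_twice:,.0f}")
```

Both portfolios have the same expected payout of $400, so the lower SD for independent clients is a pure reduction in risk.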

Covariance

Covariance: measures how two random variables X and Y vary together
(co = together)

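\( E(X) = \mu \)

\( E(Y) = \nu \)

\[ \mathrm{Cov}(X, Y) = E[(X - \mu)(Y - \nu)] \]

What if X and Y are not independent?

\[ \mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2 \, \mathrm{Cov}(X, Y) \]

Variance is a special case of covariance:

\[ \mathrm{Cov}(X, Y) = E[(X - E(X))(Y - E(Y))] \]

\[ \mathrm{Var}(X) = E[(X - E(X))^2] = E[(X - E(X))(X - E(X))] = \mathrm{Cov}(X, X) \]

Cov(X,Y) can be negative, zero or positive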

Bernoulli Trials

There are only two outcomes per trial, called success or failure
The probability of success p is the same on every trial, the probability of failure 1 – p is often called q
The trials must be independent, but we can use the 10% condition
If the number of trials (the sample size) is less than 10% of the population size, we can treat the trials as independent
Bernoulli trials:
Two outcomes: success or failure
p is the same for all trials / draws as we draw with replacement
The draws/ trials must be independent (as tickets have no memory)

Discrete probability models

The uniform model: probability of each outcome is always exactly the same
If X is a random variable with possible outcomes 1, 2, 3, 4….n and all outcomes have exactly the same probability then X has a discrete uniform distribution
Binomial model:
If X is a random variable whose outcomes are the number of successes in a series of Bernoulli trials, X follows a binomial model
Two parameters are necessary to define a binomial probability model, the number of trials n and the probability of success p
We denote it all as Binom(n, p)

Uniform model

Suppose you roll a fair die
Probability distribution: a list of all possible outcomes + corresponding probability

x | 1 | 2 | 3 | 4 | 5 | 6 | Sum
P(X = x) | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 | 1

Uniform discrete probability distribution because all the probabilities are the same

Binomial model

Random draw 1 ticket from the box (with replacement)
P(“success”) = 0.10 = p
Bernoulli trials:
Two outcomes success or failure
p is the same for all draws/ trials as we draw with replacement
The draws/ trials must be independent (tickets have no memory)
Draw 5 tickets with replacement
P( exactly 2 success in 5 trials) = ?

S S F F F

F S S F F

F F S S F

F F F S S

S F S F F

F S F S F

F F S F S

S F F S F

F S F F S

S F F F S

It can happen in 10 ways
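Number of ways to have k successes in n trials:

\[ \binom{n}{k} = \frac{n!}{k! \, (n - k)!} \]

Number of ways to have 2 successes in 5 trials:

\[ \binom{5}{2} = \frac{5!}{2! \, 3!} = \frac{5 \times 4 \times 3 \times 2 \times 1}{(2 \times 1) \times (3 \times 2 \times 1)} = \frac{5 \times 4}{2 \times 1} = 10 \text{ ways} \]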

P(exactly 2 successes in 5 trials)
P( S S F F F or S F S F F or … or F F F S S )
Mutually exclusive, because if one sequence happens, the other sequences can’t happen (they are disjoint events)
Addition rule
P(SSFFF) + P(SFSFF) + P(SFFSF) + … + P(FFFSS)
= P(1st is S and 2nd is S and 3rd is F and 4th is F and 5th is F) + …
Multiplication rule

P(A and B) = P(A) x P(B | A), where P(B | A) = P(B) because the trials are independent

= P(1st is S) x P(2nd is S) x P(3rd is F) x P(4th is F) x P(5th is F) + …

= 0.10 x 0.10 x 0.90 x 0.90 x 0.90 + …

= \( (0.10)^2 \times (0.90)^3 \) + …

= \( 10 \times (0.10)^2 \times (0.90)^3 \)

= 0.0729

AND = multiplication rule
Mutually exclusive ≡ disjoint
10 because it’s the number of ways
0.10 = p, 2 = k, 0.90 = 1 – p, 3 = n – k, … = 9 more such terms
Binomial probability model for Bernoulli trials:
n= number of trials
k= number of success
p= probability of success
Must be a Bernoulli Trial

\[ P(\text{exactly } k \text{ successes in } n \text{ trials}) = \frac{n!}{k! \, (n - k)!} \, p^k (1 - p)^{n - k} \]

k = 0, 1, 2, 3, 4, …, n

\[ E(X) = n \times p \]

\[ SD(X) = \sqrt{n \, p \, (1 - p)} \]

Sampling: drawing without replacement
Draws are often without replacement
Strictly speaking, the binomial model does not apply
But the binomial model is a good approximation if n < 10% of N

n= number of trials (sample size)

N= size of population

TI-84 : DISTR binompdf (n, p, x)

TI-84 : DISTR binompdf (5, 0.10, 2)
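
For readers without a TI-84, the same binomial probability can be sketched in Python. The function `binom_pmf` is a hypothetical helper that mirrors the argument order of `binompdf(n, p, x)`; it relies on `math.comb` for the number of ways.

```python
from math import comb, sqrt


def binom_pmf(n, p, k):
    """P(exactly k successes in n Bernoulli trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Ticket example from the notes: 5 draws with replacement, p = 0.10, exactly 2 successes.
prob_two_successes = binom_pmf(5, 0.10, 2)
print(f"P(exactly 2 successes in 5 trials) = {prob_two_successes:.4f}")

# Mean and standard deviation of Binom(5, 0.10).
n, p = 5, 0.10
print(f"E(X) = {n * p:.2f}, SD(X) = {sqrt(n * p * (1 - p)):.3f}")
```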

To make a binomial model focus on 1 thing and label it as success and the rest as failure
Sample size = number of trials
Example Exercise 51 pg. 235
Game with joystick (right handed)
13% of the population are left handed
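P(there is at least one left-handed person in a sample of 5) = ?
= 1 – P(there are no left-handed people in a sample of 5)
Bernoulli trials?
Left handed = success
Independence: strictly no, because the draw is not with replacement
We are not sure the condition is met, but a sample of 5 is far below 10% of the population, so the binomial model is a good approximation
= 1 – binompdf(5, 0.13, 0) ≈ 1 – 0.4984 ≈ 0.5016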
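
The complement calculation in this exercise can be sketched directly, assuming a sample of 5 and 13% left-handed people as stated in the notes. The variable names are hypothetical.

```python
p_left_handed = 0.13
sample_size = 5

# P(no left-handed person in the sample): every one of the 5 draws is a "failure".
p_none = (1 - p_left_handed) ** sample_size

# Complement rule: P(at least one) = 1 - P(none).
p_at_least_one = 1 - p_none

print(f"P(none) = {p_none:.4f}")
print(f"P(at least one) = {p_at_least_one:.4f}")
```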

What can go wrong

Random variables have a probability model, actual sample data has a distribution
Be aware of the notation you use: E(X) is used in a different context than \( \bar{x} \)
Be very cautious with the properties of the expected value and the standard deviation (use formula sheet)
Ensure the conditions for Bernoulli trials are met

Chapter 7: The Normal and Other Continuous Distribution

Standard Deviation as a ruler

We use z-scores as a way to compare values
Standardizing a variable: \( z = \frac{x - \bar{x}}{s} \)
That is, we use the standard deviation as a ruler, asking how many standard deviations a value is from the mean
How? Through the 68 – 95 – 99.7 Rule

The 68 – 95 – 99.7 Rule

In a unimodal symmetric distribution, about 68% of the values fall within one standard deviation of the mean, about 95% of the values fall within two standard deviations of the mean and about 99.7% of the values fall within three standard deviations away from the mean

Between -3σ and -2σ: very rarely low
Between -2σ and -1σ: unusually low
Between 1σ and 2σ: unusually high
Between 2σ and 3σ: very rarely high

68 – 95 – 99.7 rule: for unimodal, symmetric distribution
About 68% of all values fall within 1 SD(x) of the mean
About 95% of all values fall within 2 SD(x) of the mean
About 99.7% of all values fall within 3 SD(x) of the mean
Follows from the Normal Distribution

Use the 68-95-99.7 Rule to identify outliers

Negative value: below the mean
Positive value: above the mean
Is P/E of 40.4 (Feb 2010) exceptional?
Standardize:

\[ z = \frac{x - \bar{x}}{s} \approx \frac{40.4 - 17.9}{6.8} \approx 3.3 \]

The price/earnings ratio of February 2010 is 3.3 standard deviations above the mean
40.4 is more than 3 SD(x) above the mean
Fewer than 0.3/2 = 0.15% of values are this large
If the data follow a unimodal, symmetric distribution, values for which z < -3 or z > 3 are considered outliers

The Normal Distribution

Normal Distribution: a theoretical probability model used for continuous random variables
Properties of the standard Normal model:
1. -∞ < z < +∞
2. Probability density function is bell-shaped and unimodal
3. µ = 0
4. σ = 1
Area under pdf = 1
Pdf = probability density function
Check if histogram is:
Symmetric
Bell-shaped Normal model
Unimodal
To characterize normal model, we need to know the mean (µ) and the standard deviation(σ)
N(µ, σ)
Standard Normal Model: N(0, 1)
But only if the data is standardized

Areas Under the Normal pdf

P( 0 ≤ z ≤ 1)

About 68% of the values lie between -1 and 1, and the distribution is symmetric, so about 68% / 2 = 34% lie between 0 and 1
≈ 0.3413
P( 0 ≤ z ≤ 1 ) = area under Normal pdf between 0 and 1
Finding the area
TI – 84: DISTR normalcdf (0,1)
P( 0 ≤ z ≤ 1 ) ≈ 0.3413
P( -2 ≤ z ≤ 2 ) ≈ 0.9545
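
The areas that `normalcdf` returns on the TI-84 can be reproduced with Python's standard library, as a rough equivalent. `statistics.NormalDist` provides the cumulative distribution function; the helper `area_between` is a hypothetical name.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def area_between(dist, lower, upper):
    """Area under the Normal pdf between lower and upper: cdf(upper) - cdf(lower)."""
    return dist.cdf(upper) - dist.cdf(lower)


p_0_to_1 = area_between(standard_normal, 0, 1)
p_minus2_to_2 = area_between(standard_normal, -2, 2)

print(f"P(0 <= z <= 1)  = {p_0_to_1:.4f}")
print(f"P(-2 <= z <= 2) = {p_minus2_to_2:.4f}")
```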

How it works

Normal model:
N(500,100)
µ = 500
σ = 100
Standard Normal Model:
N(0,1)

µ = 0
σ = 1
Goal: to find the shaded area in %
What percentage of the Normal distribution is shaded (before (2))
Or the reverse: knowing the shaded area in %, find the value z (critical value)
Standardizing does not change the shape of the distribution

Normal distribution as an approximation of data

P/E data = not so good approximation
GMAT scores approximately follow the Normal Distribution
With µ = 500 and σ = 100

GMAT scores ~ N(500, 100)

X ~ N(500, 100)

~ = “follow”

Properties of the Normal Distribution

Property 1
If X and Y are independent Normal random variables, then X + Y and X – Y are also Normal random variables
Property 2:
If the number of trials n is sufficiently large (*), the Normal pdf is a good approximation of the binomial model
(*) np ≥ 10 and n (1-p) ≥ 10

Percentiles of the Normal pdf

Lower quartile (Q1)
= 25% of all values are ≤ Q1
Q1 is also called the 25th percentile
So Q3 is the 75th percentile
Median is the 50th percentile
90th percentile: the value such that 90% of all values are ≤ that value
Find the 25th percentile of GMAT scores:
TI -84: DISTR invNorm (0.25, 500, 100)
25th percentile ≈ 432.5510
Find the 90th percentile invNorm (0.90, 500, 100)
90th percentile ≈ 628.1552
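
As a rough Python counterpart of `invNorm`, the sketch below uses `statistics.NormalDist.inv_cdf`, which returns the value below which a given proportion of the distribution lies. The model N(500, 100) is the one stated in the notes; the variable names are hypothetical.

```python
from statistics import NormalDist

# Normal model for the test scores in the notes: mean 500, standard deviation 100.
scores = NormalDist(mu=500, sigma=100)

percentile_25 = scores.inv_cdf(0.25)  # lower quartile Q1
percentile_90 = scores.inv_cdf(0.90)

print(f"25th percentile = {percentile_25:.4f}")
print(f"90th percentile = {percentile_90:.4f}")
```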

Normal Approximation for Binomial

Chapter 6 Binomial model Binom(n,p)
Now we can use the normal model if:
n x p ≥ 10
n x (1 – p) ≥ 10
this determines if it’s safe to use the Normal approximation (check with the above rules)
N (µ, σ) where:
µ = n x p
σ = \( \sqrt{n \, p \, (1 - p)} \)
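
The check and the resulting Normal model can be sketched as a small helper. The function name `normal_approximation` and the example values n = 100, p = 0.5 are hypothetical and not taken from the notes.

```python
import math
from statistics import NormalDist


def normal_approximation(n, p):
    """Return N(np, sqrt(np(1-p))) if np >= 10 and n(1-p) >= 10, otherwise None."""
    if n * p < 10 or n * (1 - p) < 10:
        return None  # success/failure condition not met: do not approximate
    return NormalDist(mu=n * p, sigma=math.sqrt(n * p * (1 - p)))


model = normal_approximation(100, 0.5)   # hypothetical example
print(model)

too_small = normal_approximation(5, 0.10)  # ticket example: n is far too small
print(too_small)
```

Returning `None` when the condition fails makes it harder to use the approximation by accident with a small n.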

Normal Quantile – Quantile – plot (Q-Q plot)

Q-Q plot: method to assess whether the normal model is a good approximation of the data
Ex. P/E data (n = 1780)
1. Sort the values in ascending order
2. Find for every value the theoretical quantile
Ex. 1st of the 1780 values ( y1 ≈ 4.78; z scores ≈ -1.79)
If the variable were normally distributed; the 1st of 1780 values would have a z score ≈ -3.43
1st point of Q-Q plot (-3.43; 4.78)
Repeat for all n values
3. If the points are approximately on a straight line, the normal distribution is a good approximation of the data
Alternative version, with z scores of data on vertical axis
Point of reference is 45° line
If the points are approximately on the 45° line, the normal distribution is a good approximation of the data

What can go wrong

Probability models are still just models. Question probabilities as you would question data
Don’t assume everything is Normal; check normality with a histogram or a Normal Q-Q plot
Don’t use the Normal approximation with a small n

Chapter 11: Sampling Distribution and confidence interval mean

Population
µ = population mean (unknown)
Sample
\( \bar{y} \) = sample mean
\( \bar{y} \) serves as an estimate for µ
95% confidence interval (CI): \( \bar{y} \pm 2 \, SE(\bar{y}) \)
2 = z*
Only works if n is large (rule of thumb: n ≥ 50), if the sample has a unimodal, symmetric distribution

The distribution of a sample mean

For quantitative variables

Full population | Sample 1 | Sample 2 | Sample 3
µ = 20 | \( \bar{y} \) = 22 | \( \bar{y} \) = 21 | \( \bar{y} \) = 19
σ = 1.2 | s = 0.98 | s = 1.3 | s = 2.03

Central limit theorem

Central limit theorem (CLT): the sample mean (\( \bar{y} \)) of a random sample has a probability distribution that can be approximated by the normal distribution as the sample size (n) grows
\( \bar{y} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right) \) (approx.)
How good is the approximation?
The larger the sample (n) the better the approximation
Regardless of the shape of the population
Population = box with 1.2 million tickets: six groups of 200,000 tickets each

Draw 2 tickets without replacement (n = 2) and find the mean

\( \bar{y} = 7 / 2 = 3.5 \)

Draw 5 tickets without replacement (n = 5) and find the mean
Draw 20 tickets without replacement (n = 20) and find the mean
\( \bar{y} = 3.3 \)
Repeat this 10,000 times – density histogram
Distribution gets taller
Tails get thinner
Unimodal
Symmetric
CLT: the sample mean of a random sample has a probability distribution that can be approximated by the normal distribution
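\( \bar{y} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right) \) (approx.)
The larger the sample, the better the approximation
Sample means are approximately distributed like a normal curve

CLT: \( \bar{y} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right) \) (approx.)

95% CI: \( \bar{y} \pm 2 \, SE(\bar{y}) \)

Property: \( SD(\bar{y}) = \frac{\sigma}{\sqrt{n}} \)

(no proof)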
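
The ticket experiment can be imitated with a small simulation. This is a sketch under two assumptions that are not stated in the notes: the six groups of tickets carry the values 1 to 6, and drawing with replacement is used as a stand-in for drawing without replacement, which is reasonable because each sample is tiny compared with 1.2 million tickets. The function name `sample_means` is hypothetical.

```python
import random
import statistics

TICKET_VALUES = [1, 2, 3, 4, 5, 6]  # assumed values, equally frequent in the box


def sample_means(n, repetitions, rng):
    """Draw `repetitions` samples of size n and return the mean of each sample."""
    return [statistics.fmean(rng.choices(TICKET_VALUES, k=n)) for _ in range(repetitions)]


rng = random.Random(0)  # fixed seed so the run is reproducible
spreads = {}
for n in (2, 5, 20):
    means = sample_means(n, 10_000, rng)
    spreads[n] = statistics.stdev(means)
    print(f"n = {n:2d}: mean of sample means = {statistics.fmean(means):.2f}, "
          f"SD of sample means = {spreads[n]:.2f}")
```

The sample means stay centered near the population mean while their spread shrinks as n grows, which is the pattern the histograms in the lecture show.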

Confidence Interval of the mean

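Very often the “true” mean (µ) and the “true” standard deviation of the sample mean (\( \sigma / \sqrt{n} \)) are unknown
Solution: we estimate \( SD(\bar{y}) \) as \( SE(\bar{y}) = s / \sqrt{n} \)
Where \( SE(\bar{y}) \) is the standard error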

n is large | n is not large
Confidence interval: \( \bar{y} \pm z^* \times SE(\bar{y}) \) | Confidence interval: \( \bar{y} \pm t^* \times SE(\bar{y}) \)
95% confidence interval: \( \bar{y} \pm 2 \times SE(\bar{y}) \) |

What can we say about the population mean (µ)?
We are 95% confident that the population mean (ex. age) is between

\( \bar{y} - 2 \times SE(\bar{y}) \) and \( \bar{y} + 2 \times SE(\bar{y}) \)

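Ex. Research on toxic residues in farm
y = concentration (in ppm) of the pesticide Mirex
n = 150; \( \bar{y} \) = 0.093 ppm; s = 0.0495 ppm
CLT: \( \bar{y} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right) \) (approx.)
But \( E(\bar{y}) = \mu \) and \( SD(\bar{y}) = \sigma / \sqrt{n} \) are unknown to the researcher
Researcher estimates \( SD(\bar{y}) \) as: \( SE(\bar{y}) = s / \sqrt{n} \)
Confidence interval for a mean:

\( \bar{y} \pm z^* \times SE(\bar{y}) \)

Margin of error: \( z^* \times SE(\bar{y}) \)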

Only if n is large can we use z*; otherwise use t*
(Student t distribution (STATS II))
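
The Mirex example can be worked through numerically as a sketch. It uses the sample values from the notes and the rule-of-thumb critical value z* = 2 for a 95% interval; the variable names are hypothetical.

```python
import math

n = 150          # sample size
y_bar = 0.093    # sample mean, in ppm
s = 0.0495       # sample standard deviation, in ppm
z_star = 2       # rule-of-thumb critical value for a 95% confidence interval

standard_error = s / math.sqrt(n)          # SE(y_bar) = s / sqrt(n)
margin_of_error = z_star * standard_error
interval = (y_bar - margin_of_error, y_bar + margin_of_error)

print(f"SE = {standard_error:.4f} ppm")
print(f"95% CI: {interval[0]:.4f} ppm to {interval[1]:.4f} ppm")
```

With n = 150 the sample is large enough for z* under the rule of thumb given in the notes; for a small sample the same code would need t* in place of `z_star`.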

Assumptions and conditions to use N (apply normal dist.)

Independence assumption
The sampled values must be independent of each other
Sample size assumption
The sample size, n, must be large enough
Randomization condition
Sample must be a simple random sample of the population
10% condition
n smaller than 10% of the population (when sampling without replacement)
No success / Failure condition

What can go wrong

Don’t confuse proportions and means. Use Normal models with proportions (and when N is large). Use student t methods with means
Don’t confuse s (standard deviation of sample) with σ (standard deviation of population)
Beware of multimodality. If you see this, try to separate the data into groups
Beware of skewed data
Investigate outliers. If they are clearly in error, remove them. If they can’t be removed, you might run the analysis both with and without the outlier
Make sure data are independent. Consider whether there are likely violations of independence in the data collection methods
The Central Limit Theorem doesn’t talk about the distribution of the data from the sample. It talks about the sample means and sample proportions of many different random samples drawn from the same population
