Lecture notes
Statistics I
Chapter 1: Data and Decisions
What are Data?
Statistics is about data and decisions
Statistics are quantities calculated from data
Statistics: a toolbox and a way of thinking
Its purpose is to gather data that is relevant to the problem
Statistics: a collection of tools and the associated reasoning to model, summarize and understand data.
Data: it’s not just information but also values that are measured and observed, values with their context.
Data warehouses: vast digital repositories where data is recorded and stored
Big Data: data sets so large that traditional methods of storage and analysis are inadequate
The “Five W’s”
Who: who is the subject or case
Who is being measured
What: what are the variables (information)
What you have
Who and what are part of data and are essential
Where: where are these values recorded
When: when are these values recorded
Why: why are these values recorded
How: how are these values recorded
Why, when, where, how are part of the metadata
Context of data we are collecting
Organize the information gathered on tables or spreadsheets (collections of data)

Rows of the table answer for who, each row is a case
Columns of data tables answer for what, each column is a variable
Variable types
Categorical variables: the values are names of categories
Nominal: just a label without a particular order (no intrinsic order)
Ordinal: labels with a given order
Some nominal variables are used as identifiers
Identifiers: purpose is to assign a unique identifier code to each individual
special case of nominal categorical variables
Variables identify a category for each case
Answers such as yes or no which can’t be added
Qualitative: labels, titles
Quantitative variables: the values are numerical quantities with units
Must be a number quantity and have units
Variables record measurements, amounts or something else but they must have units
Cross-sectional data: when you record something at a given point in time for different units
Time-series data: collects information for 1 subject over different points in time
If there is displacing, then there is no time-series
Chapter 2: Displaying and Describing Categorical Data
Displaying and describing Categorical data
Descriptive statistics: summarizes data and displays it
In statistics, instead of measuring the whole population, it is usually best to take a sample from the population and make inferences based on that sample
Inferential statistics: making inferences about the population based on a sample
Samples are chosen randomly
Display and summarize data to see:
Patterns
Relationships
Exceptions in values and observations
Summarizing a Categorical variable
IP Number
Time
Source
245.240.221.71
1 / Feb / 2013 13:15:08
196.345.281.51
1 / Feb / 2013 14:56:23
Direct
Variable types:
IP number: categorical, nominal (an identifier)
Time: quantitative
Source: categorical, nominal
Variable source: how their customers found their way to the website
Source is a categorical variable
If the list is too long and does not fit in a page, create a new category called “others”
Frequency table: how often a variable occurs
Source
Count of visits
Visits as a percentage %
Direct
Bing
Others
130,158
52,969
16,084
9,581
6,740
57.36
23.34
7.09
4.22
2.97
Total
226,925
XXX
Express each frequency as a percentage
Compute the proportion
Percentage = proportion x 100%
Make and interpret a frequency table for a categorical variable
Frequency table
Count of the number of cases in each category
Count
Relative Frequency Table
Count of the number of cases in each category divided by the total number of cases
Percentage
Cases = people answering the questions
Displaying and describing categorical data
Display the percentage as a bar chart
Keep it proportional
Categorical bar chart must have gaps to indicate different categories
Make and interpret a bar chart or pie chart
Pie chart gives less information
Horizontal axis = categories
Vertical axis = counts
Residual category is always placed last in a bar chart regardless of the count
Area principle: the area of each bar is proportional to the amount it represents
Pie chart: shows how the whole group breaks into several categories and shows all the cases as a circle sliced into pieces whose areas are proportional to the fraction of cases in each category
Keep order from biggest to smallest (residual goes last)
Include the percentages
Bar charts are better than pie charts to represent frequency tables
Exploring Two Categorical Variables: Contingency Tables
Example: survey of 5039 people in 5 countries (do you use social networks?)
Respondent ID | Social Networking | Country
0001 | Yes | Egypt
0002 | No access | Egypt
… | … | …
1000 | No access | Egypt
1001 | Yes | UK
1002 | Yes | UK
1003 | No | UK
… | … | …
5039 | Yes | U.S.
Frequency table for variable social networking
Social Networking | Count | Relative frequency (%)
No | 1,249 | 24.8
Yes | 2,175 | 43.2
No access | 1,615 | 32.1
Total | 5,039 | 100.1 (not exactly 100 because of rounding)
Counts must add up to the total number of cases (rows)
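A minimal Python sketch of how the relative frequencies in this table can be computed from the counts. The variable names are illustrative choices and do not come from the lecture.

```python
# Counts copied from the social networking frequency table.
counts = {"No": 1249, "Yes": 2175, "No access": 1615}

# The counts must add up to the number of cases (rows of the data table).
total = sum(counts.values())

# Relative frequency in percent = count / total * 100, rounded to one decimal.
relative = {category: round(100 * n / total, 1) for category, n in counts.items()}

for category, n in counts.items():
    print(f"{category:<10} {n:>6,} {relative[category]:>6.1f}%")
print(f"{'Total':<10} {total:>6,}")
```

Because each percentage is rounded separately, the rounded percentages need not sum to exactly 100.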
Does social networking differ from country to country?
Contingency tables
Answer | UK | EG | DE | RU | US | Total
No | 336 | 70 | 460 | 90 | 293 | 1,249
Yes | 529 | 300 | 340 | 500 | 506 | 2,175
No access | 153 | 630 | 200 | 420 | 212 | 1,615
Total | 1,018 | 1,000 | 1,000 | 1,010 | 1,011 | 5,039
Marginal distribution: the frequency distribution of either one of the variables
The 300 in the Egypt (EG) column means that there are 300 respondents from Egypt that use social networking
There are 300 respondents out of 5,039 respondents who are from Egypt and who answered yes to the question, meaning they use social networking
For every cell we can compute a percentage in three different ways
Total percentage
\( \text{Total percentage} = \frac{300}{5039} \times 100\% \approx 6.0\% \)
Approximately 6.0% of the total number of respondents are from Egypt and answered yes
Row percentage
\( \text{Row percentage} = \frac{300}{2175} \times 100\% \approx 13.8\% \)
Approximately 13.8% of the respondents who answered yes are from Egypt
Column percentage
\( \text{Column percentage} = \frac{300}{1000} \times 100\% = 30\% \)
30% of the respondents from Egypt answered yes to the survey question
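The three percentages for a single cell can be sketched in Python as follows. The dictionary layout and the helper name `cell_percentages` are assumptions made for illustration; only the counts come from the contingency table.

```python
# Counts from the contingency table: answer -> country -> count.
table = {
    "No":        {"UK": 336, "EG": 70,  "DE": 460, "RU": 90,  "US": 293},
    "Yes":       {"UK": 529, "EG": 300, "DE": 340, "RU": 500, "US": 506},
    "No access": {"UK": 153, "EG": 630, "DE": 200, "RU": 420, "US": 212},
}

# Marginal totals for rows (answers) and columns (countries).
row_totals = {answer: sum(by_country.values()) for answer, by_country in table.items()}
col_totals = {c: sum(table[a][c] for a in table) for c in table["Yes"]}
grand_total = sum(row_totals.values())

def cell_percentages(answer, country):
    """Return the total, row and column percentage of one cell."""
    count = table[answer][country]
    return {
        "total": 100 * count / grand_total,
        "row": 100 * count / row_totals[answer],
        "column": 100 * count / col_totals[country],
    }

egypt_yes = cell_percentages("Yes", "EG")
print({name: round(value, 1) for name, value in egypt_yes.items()})
```

To compare countries with different sample sizes, the column percentage is the one to read, since it conditions on the country.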
Compare social networking country to country
Use column percentages
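Conditional distribution: the distribution of one variable restricted to the cases that satisfy a condition on the other variable (“what we conditioned on”)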
Make and interpret contingency tables
Contingency Table
Shows how individuals are distributed along each variable, depending on the value of the other variable
Count
Marginal Distribution
The frequency distribution of either one of the variables
Percentage
Chapter 4: Correlation
Scatterplots
Scatterplots: plots two quantitative variables together showing the relationship (if any) between them
Study if a relationship exists
Only in scatterplots it doesn’t matter which variable we choose for the x and y axis
Analysis of the relationships between the quantitative variables
Direction: positive, negative or neither
Form: straight, curved, something exotic or no pattern
Strength: how clustered are the data? How close to the line of best fit?
Outliers: are there any unusual observations or subgroups?
Assign roles to the variables
Variable can be defined as explanatory or as response variables
Explanatory variables x-axis
Response variables y-axis
Measuring correlation
Case | Sales people (x) | Sales (y) | z_x | z_y | z_x z_y
1 | 2 | $10,000 | -1.49 | -1.42 | 2.12
2 | 3 | $11,000 | -1.31 | -1.24 | 1.62
3 | 7 | $13,000 | -0.60 | -0.86 | 0.52
4 | 9 | $14,000 | -0.25 | -0.67 | 0.17
5 | 10 | $18,000 | -0.07 | 0.07 | -0.01
6 | 10 | $20,000 | -0.07 | 0.45 | -0.03
7 | 12 | $20,000 | 0.28 | 0.45 | 0.13
8 | 15 | $22,000 | 0.82 | 0.82 | 0.67
9 | 16 | $22,000 | 0.99 | 0.82 | 0.82
10 | 20 | $26,000 | 1.70 | 1.57 | 2.68
Total | 104 | $176,000 | 0.00 | 0.00 | 8.69
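\( \bar{x} \) (mean) = 10.4 people
\( s_x \) (SD(x)) ≈ 5.6 people
\( z_x = \frac{x - \bar{x}}{s_x} \)
\( \bar{y} \) (mean) = $17,600
\( s_y \) (SD(y)) ≈ $5,337
\( z_y = \frac{y - \bar{y}}{s_y} \)
In general: \( z = \frac{x - \bar{x}}{s} \)
\( r = \frac{\sum z_x z_y}{n - 1} \)
\( r = \frac{8.69}{10 - 1} \approx 0.97 \)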
x: explanatory (or predictor) variable (independent) (cause)
y: response variable (dependent) (effect)
Sum of all z scores is always 0
Correlation Conditions
Only for quantitative variables
Only meaningful if the relationship is approximately linear
Straight lines not curves
Beware of outliers
Measuring Correlation

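In a positive linear relationship, most points are in:
Lower left quadrant: \( x < \bar{x} \) and \( y < \bar{y} \), so \( z_x < 0 \) and \( z_y < 0 \), hence \( z_x z_y > 0 \)
Upper right quadrant: \( x > \bar{x} \) and \( y > \bar{y} \), so \( z_x > 0 \) and \( z_y > 0 \), hence \( z_x z_y > 0 \)
The mean of \( z_x z_y \) is therefore positive; this mean is the correlation coefficient r
In a negative linear relationship, most points are in:
Upper left quadrant: \( x < \bar{x} \) and \( y > \bar{y} \), so \( z_x < 0 \) and \( z_y > 0 \), hence \( z_x z_y < 0 \)
Lower right quadrant: \( x > \bar{x} \) and \( y < \bar{y} \), so \( z_x > 0 \) and \( z_y < 0 \), hence \( z_x z_y < 0 \)
The mean of \( z_x z_y \) is therefore negative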
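Worked example for case 1:
Case | Sales people (x) | Sales (y) | z_x | z_y | z_x z_y
1 | 2 | $10,000 | -1.5 | -1.42 | 2.12
\( z_x = \frac{2 - 10.4}{5.6} = -1.5 \), i.e. 1.5 standard deviations below the mean
\( z_y = \frac{\$10{,}000 - \$17{,}600}{\$5{,}337} \approx -1.424 \), i.e. about 1.4 standard deviations below the mean
\( z_x z_y \approx 2.1 \) (the rounded z-scores give 2.14; the table value 2.12 uses the unrounded z-scores)
Sum of \( z_x z_y \) over all 10 cases: 8.69
\( r = \frac{\sum z_x z_y}{n - 1} \approx \frac{8.69}{9} \approx 0.97 \)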
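The z-score calculation above can be reproduced with a short Python sketch. It assumes the ten (x, y) pairs from the sales table and uses the sample standard deviation (divisor n - 1), which matches the values of 5.6 and 5,337 quoted in the notes; the function name `z_scores` is an illustrative choice.

```python
from statistics import mean, stdev

# Data from the sales table: number of sales people (x) and sales in dollars (y).
sales_people = [2, 3, 7, 9, 10, 10, 12, 15, 16, 20]
sales = [10000, 11000, 13000, 14000, 18000, 20000, 20000, 22000, 22000, 26000]

def z_scores(values):
    """Standardize values using the sample mean and sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

zx = z_scores(sales_people)
zy = z_scores(sales)
n = len(sales_people)

# r is the sum of the products of paired z-scores divided by n - 1.
r = sum(a * b for a, b in zip(zx, zy)) / (n - 1)
print(round(r, 2))
```

Swapping the roles of x and y leaves r unchanged, because the products of z-scores are the same in either order.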
Understanding Correlation
The sign of the correlation gives the direction of the relationship
Positive or negative
Boundaries of correlation
A correlation of 1 or -1 is a perfect linear relationship
A correlation of 0 is a lack of linear relationship
Correlation has no units, so shifting or scaling the data, standardizing, or even swapping the variables has no effect on the numerical value
A large correlation is not a sign of a causal relationship
For r to be meaningful, the relationship must be linear and there must be no outliers
Properties of r
Positive r indicates positive linear relationships and Negative r indicates negative linear relationships
-1 ≤ r ≤ 1
r has no units
r isn’t affected by changes in center or scale of x or y since it’s computed with z-scores
r measures strength of linear relationship
r ≈ 0 doesn’t mean there is no relationship
r is sensitive to outliers
Non-linear correlations
r is not a valid measure when the relationship is not linear
Options
Transform the variables using square roots or base-10 logarithms (the relationship becomes more linear)
Correlation of ranks
Lurking Variables
Lurking variables: an unobserved variable that is affecting both of the variables being observed
Examples of lurking variables
Positive relationship between number of doctors and life expectancy
Lurking variable: better living standards?
Positive relationship between consumption of diet soda drinks and traffic accidents
Lurking variable: higher population?
Positive relationship between the consumption of ice cream and number of drownings
Lurking variable: hot weather
The relationship is not between the observed variables themselves, but between each observed variable and the lurking variable
What can go wrong
Don’t say correlation when you mean association
To say correlation, first calculate r
Don’t calculate correlation of categorical variables
Make sure the variables are associated in a linear way
Be careful to see if there are any outliers
Don’t confuse correlation with causation
Lurking variables, spurious correlation
Chapter 6: Random variables and Probability Models
Random Variables
Random variable: a variable whose value is based on the outcome of a random event
Types of random variables
Discrete random variables: can take only separate, countable values such as 1, 2, 3, …
Continuous random variables: can take any value within an interval, such as 0.1, 0.11, 0.12, …
For both discrete and continuous variables, the set of all possible values and their associated probabilities is called the probability model
Example: Insurance company
Outcome | Payout x | Probability P(X = x)
Death | $100,000 | 1/1000
Disability | $50,000 | 2/1000
Neither | $0 | 997/1000
Sum | --- | 1000/1000
Expected Value of a Random Variable
When the probability model is known, the expected value can be calculated
The long-term expectation of the model
Expected value = sum over all outcomes of (outcome x probability of that outcome): \( E(X) = \sum x \, P(X = x) \)
E(X) can be written as µ but not as \( \bar{x} \)
For discrete random variables, probability models assign a probability to each possible outcome
Example with previous table
Which payout per policy can an insurance company expect?
Assume they have 1000 clients
\( E(X) = \frac{(\$100{,}000 \times 1) + (\$50{,}000 \times 2) + (\$0 \times 997)}{1000} = \$200 \)
\( E(X) = \$100{,}000 \times \frac{1}{1000} + \$50{,}000 \times \frac{2}{1000} + \$0 \times \frac{997}{1000} = \$200 \)
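As a quick check of the arithmetic, the expected payout can be computed in Python. This is a sketch using exact fractions; the dictionary `model` is just one possible way to store the probability model.

```python
from fractions import Fraction

# Probability model from the insurance table: payout -> probability.
model = {
    100_000: Fraction(1, 1000),   # death
    50_000: Fraction(2, 1000),    # disability
    0: Fraction(997, 1000),       # neither
}

# A valid probability model must have probabilities that sum to 1.
assert sum(model.values()) == 1

# E(X) = sum of (outcome * probability of that outcome).
expected_value = sum(x * p for x, p in model.items())
print(expected_value)
```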
Properties of Expected values
If X and Y are random variables and c is a constant:
The expectation of a constant c is the constant: E(c) = c
Adding a constant c increases the expected value by c: E(X + c) = E(X) + c
Multiplying by c multiplies the expected value by c: E(cX) = c x E(X)
The expected value of the sum (or difference) of two random variables is the sum (or difference) of their expected values: E(X ± Y) = E(X) ± E(Y)
Standard Deviation of a Random Variable
Variance: the expected value of the squared deviations
\( Var(X) = \sum (x - \mu)^2 \, P(x) \)
Var(X) can be written as \( \sigma^2 \) but not as \( s^2 \)
Standard deviation: indicator of the variability of a random variable
SD(X) can be written as σ but not as s
Outcome | Payout x | Probability P(X = x) | Deviation x – E(X)
Death | $100,000 | 1/1000 | ($100,000 – $200) = $99,800
Disability | $50,000 | 2/1000 | ($50,000 – $200) = $49,800
Neither | $0 | 997/1000 | ($0 – $200) = -$200
Sum | --- | 1000/1000 |
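Use E(X) in place of the mean: E(X) = $200
Square the deviations: \( (\$99{,}800)^2 \), \( (\$49{,}800)^2 \), \( (-\$200)^2 \)
Find the expected value of the squared deviations
\( Var(X) = \sigma^2 = \sum (x - \mu)^2 \, P(x) = 99{,}800^2 \times \frac{1}{1000} + 49{,}800^2 \times \frac{2}{1000} + (-200)^2 \times \frac{997}{1000} = 14{,}960{,}000 \) (in dollars squared, \( \$^2 \))
Take the square root: \( SD(X) = \sigma = \sqrt{14{,}960{,}000} \approx \$3{,}868 \)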
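The variance and standard deviation of the payout can be checked with a few lines of Python. This is a sketch under the probability model from the table; the list `model` and the variable names are illustrative.

```python
from math import sqrt

# Probability model: (payout in dollars, probability).
model = [(100_000, 0.001), (50_000, 0.002), (0, 0.997)]

# Expected value: sum of outcome * probability.
mu = sum(x * p for x, p in model)

# Variance: expected value of the squared deviations from mu (units: dollars squared).
variance = sum((x - mu) ** 2 * p for x, p in model)

# Standard deviation: square root of the variance (units: dollars).
sd = sqrt(variance)

print(round(mu), round(variance), round(sd))
```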
Properties of Standard deviations
If all values of X are the same (X is a constant c), the standard deviation is zero: SD(c) = 0
Adding a constant c does not change the standard deviation: SD(X + c) = SD(X)
Multiplying by c multiplies the standard deviation by |c|: SD(cX) = |c| x SD(X)
The SD of the sum of X and Y is equal to or less than the sum of the standard deviations of X and Y: SD(X + Y) ≤ SD(X) + SD(Y)
If X and Y are independent: \( SD(X + Y) = \sqrt{Var(X) + Var(Y)} \)
Properties and Examples
Chebyshev’s inequality: also holds for random variables
For data with mean \( \bar{x} \) and standard deviation s, most values are between \( \bar{x} - 3s \) and \( \bar{x} + 3s \)
In repeated random experiments, most values of X will fall between
E(X) – 3 x SD(X) and E(X) + 3 x SD(X)
E(X) = $200; SD(X) ≈ $3,868
So most payouts fall between -$11,404 and $11,804
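What if the insurance company increases payouts by $100 to:
$100,100 (Death); $50,100 (Disability); $100 (Neither)
\( E(X + \$100) = \$100{,}100 \times \frac{1}{1000} + \$50{,}100 \times \frac{2}{1000} + \$100 \times \frac{997}{1000} \)
\( E(X + \$100) = (\$100{,}000 + \$100) \times \frac{1}{1000} + (\$50{,}000 + \$100) \times \frac{2}{1000} + (\$0 + \$100) \times \frac{997}{1000} \)
\( E(X + \$100) = \$200 + \$100 = E(X) + c \) (the probabilities sum to 1, so the added $100 contributes exactly $100)
\( E(X + \$100) = \$300 \)
In general: \( E(X \pm c) = E(X) \pm c \)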
What about SD(X + $100)?
Deviations
= (X + $100) – E(X + $100)
= X + $100 – E(X) – $100
= X – E(X)
The deviations do not change, so the SD is unchanged
In general: SD(X ± c) = SD(X)
Insurance company
X = annual payout per policy
E(X) = $200; SD(X) ≈ $3,868
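What if the insurance company doubles payouts to
$200,000 (Death); $100,000 (Disability); $0 (Neither)
X = payouts; 2X = doubled payouts
\( E(2X) = \$200{,}000 \times \frac{1}{1000} + \$100{,}000 \times \frac{2}{1000} + \$0 \times \frac{997}{1000} \)
\( E(2X) = 2 \times \left( \$100{,}000 \times \frac{1}{1000} + \$50{,}000 \times \frac{2}{1000} + \$0 \times \frac{997}{1000} \right) \)
\( E(2X) = 2 \times \$200 = \$400 \)
\( E(2X) = 2 \times E(X) \)
In general: \( E(aX) = a \, E(X) \)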
What about SD(2X)?
When a constant is added the SD is unchanged, but when X is doubled, the SD doubles
Deviations
= 2X – E(2X)
= 2X – 2E(X)
= 2 x (X – E(X))
= 2 x (X – µ)
The deviations are doubled, so the SD is doubled
In general: SD(cX) = |c| x SD(X)
\( Var(2X) = E[\text{deviations}^2] = E[2^2 (X - \mu)^2] = 2^2 \, E[(X - \mu)^2] = 2^2 \, Var(X) \)
In general: \( Var(aX) = a^2 \, Var(X) \)
and \( SD(aX) = |a| \, SD(X) \)
Variance is not a linear operator, it’s a quadratic operator
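The shift and scale rules can be verified numerically. The sketch below applies them to the insurance payout model; the helper names `expected` and `sd` are assumptions introduced only for this illustration.

```python
from math import sqrt

def expected(model):
    """E(X) for a list of (outcome, probability) pairs."""
    return sum(x * p for x, p in model)

def sd(model):
    """SD(X): square root of the expected squared deviation."""
    mu = expected(model)
    return sqrt(sum((x - mu) ** 2 * p for x, p in model))

base = [(100_000, 0.001), (50_000, 0.002), (0, 0.997)]
shifted = [(x + 100, p) for x, p in base]   # every payout increased by $100
doubled = [(2 * x, p) for x, p in base]     # every payout doubled

print(round(expected(base)), round(sd(base)))        # original model
print(round(expected(shifted)), round(sd(shifted)))  # mean shifts, SD unchanged
print(round(expected(doubled)), round(sd(doubled)))  # mean and SD both double
```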
Insurance company
X = annual payout per policy
E(x) = $200; SD(x) ≈ $3,868
Suppose they are independent
Payout to Mr. Ecks = X
Payout to Ms. Wye = Y
E(X + Y) = E(X) + E(Y)
Addition rule for expected value
SD(X + Y):
Intuitively: SD(X + Y) ≤ SD(X) + SD(Y)
SD(X) + SD(Y) = risk of insuring the same person twice
If X and Y are independent (and only then), the variances add:
\( Var(X + Y) = Var(X) + Var(Y) = 14{,}960{,}000 + 14{,}960{,}000 = 29{,}920{,}000 \) (in \( \$^2 \))
\( SD(X + Y) = \sqrt{29{,}920{,}000} \approx \$5{,}470 \)
The risk of insuring two independent people is lower than the risk of insuring the same person twice
It is best to have as many clients as possible who are independent of each other, to reduce risk
By “pooling risk” you can reduce the overall risk
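A small numerical sketch of the pooling argument, assuming the variance of 14,960,000 (dollars squared) from the payout model. It compares two independent policies with one policy counted twice; the variable names are illustrative.

```python
from math import sqrt

var_x = 14_960_000          # Var(X) for one policy, in dollars squared
sd_x = sqrt(var_x)          # SD(X) for one policy, in dollars

# Two independent policies: variances add, then take the square root.
sd_two_independent = sqrt(var_x + var_x)

# The same person insured twice is 2X, so the SD doubles.
sd_same_person_twice = 2 * sd_x

print(round(sd_two_independent), round(sd_same_person_twice))
```

The first number is smaller than the second, which is the sense in which independent clients reduce the insurer's overall risk.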
Covariance
Covariance: measures how two random variables X and Y vary together
(co = together)
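E(X) = µ
E(Y) = ν
\( Cov(X, Y) = E[(X - \mu)(Y - \nu)] \)
What if X and Y are not independent? Then the covariance enters the variance of the sum: \( Var(X + Y) = Var(X) + Var(Y) + 2\,Cov(X, Y) \)
\( Cov(X, Y) = E[(X - E(X))(Y - E(Y))] \)
\( Var(X) = E[(X - E(X))^2] = E[(X - E(X))(X - E(X))] = Cov(X, X) \)
The covariance can be negative, zero or positive: Cov(X,Y) ≤ 0 or Cov(X,Y) ≥ 0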
Bernoulli Trials
There are only two outcomes per trial, called success or failure
The probability of success p is the same on every trial, the probability of failure 1 – p is often called q
The trials must be independent, but we can use the 10% condition
If the number of trials, or sample size is less than 10% of the population size, we can assume that the trials are independent
Bernoulli trials:
Two outcomes: success or failure
p is the same for all trials / draws as we draw with replacement
The draws/ trials must be independent (as tickets have no memory)
Discrete probability models
The uniform model: probability of each outcome is always exactly the same
If X is a random variable with possible outcomes 1, 2, 3, 4….n and all outcomes have exactly the same probability then X has a discrete uniform distribution
Binomial model:
If X is a random variable whose outcomes are the number of successes in a series of Bernoulli trials, then X follows a binomial model
Two parameters are necessary to define a binomial probability model, the number of trials n and the probability of success p
We denote it all as Binom(n, p)
Uniform model
Suppose you roll a fair die
Probability distribution: a list of all possible outcomes + corresponding probability
x | 1 | 2 | 3 | 4 | 5 | 6 | Sum
P(X = x) | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 | 1
Uniform discrete probability distribution because all the probabilities are the same
Binomial model
Box with 10 tickets: 1 ticket marked S (success) and 9 tickets marked F (failure)
Random draw 1 ticket from the box (with replacement)
P(“success”) = 0.10 = p
Bernoulli trials:
Two outcomes success or failure
p is the same for all draws/ trials as we draw with replacement
The draws/ trials must be independent (tickets have no memory)
Draw 5 tickets with replacement
P( exactly 2 success in 5 trials) = ?
S S F F F
F S S F F
F F S S F
F F F S S
S F S F F
F S F S F
F F S F S
S F F S F
F S F F S
S F F F S
It can happen in 10 ways
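Number of ways to have k successes in n trials: \( \frac{n!}{k!\,(n - k)!} \)
Number of ways to have 2 successes in 5 trials:
\( \frac{5!}{2!\,3!} = \frac{5 \times 4 \times 3 \times 2 \times 1}{(2 \times 1) \times (3 \times 2 \times 1)} = \frac{5 \times 4}{2 \times 1} = \frac{20}{2} = 10 \) ways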
P(exactly 2 successes in 5 trials)
= P(SSFFF or SFSFF or … or FFFSS)
The sequences are mutually exclusive (≡ disjoint): if one happens, the others can’t happen, so by the addition rule we add their probabilities
= P(SSFFF) + P(SFSFF) + P(SFFSF) + … + P(FFFSS)
= P(1st is S and 2nd is S and 3rd is F and 4th is F and 5th is F) + …
Multiplication rule (AND): P(A and B) = P(A) x P(B | A), which equals P(A) x P(B) for independent trials
= P(1st is S) x P(2nd is S) x P(3rd is F) x P(4th is F) x P(5th is F) + …
= 0.10 x 0.10 x 0.90 x 0.90 x 0.90 + …
\( = (0.10)^2 (0.90)^3 + \dots \) (9 more such terms)
\( = 10 \times (0.10)^2 \times (0.90)^3 \)
= 0.0729
Here 10 = number of ways, 0.10 = p, 2 = k, 0.90 = 1 – p, 3 = n – k
Binomial probability model for Bernoulli trials:
n = number of trials
k = number of successes
p = probability of success
The trials must be Bernoulli trials
\( P(\text{exactly } k \text{ successes in } n \text{ trials}) = \frac{n!}{k!\,(n - k)!} \, p^k (1 - p)^{n - k}, \quad k = 0, 1, 2, \dots, n \)
\( E(X) = n p \)
\( SD(X) = \sqrt{n p (1 - p)} \)
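The binomial formula can be sketched in Python with the standard library. The function name `binom_pmf` is an illustrative choice; it plays the same role as `binompdf` on the TI-84 and is applied here to the ticket example with n = 5 and p = 0.10.

```python
from math import comb, sqrt

def binom_pmf(k, n, p):
    """P(exactly k successes in n Bernoulli trials with success probability p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 5, 0.10

ways = comb(n, 2)                 # number of ways to place 2 successes in 5 trials
prob_two = binom_pmf(2, n, p)     # probability of exactly 2 successes

mean_x = n * p                    # E(X) = n * p
sd_x = sqrt(n * p * (1 - p))      # SD(X) = sqrt(n * p * (1 - p))

print(ways, round(prob_two, 4), mean_x, round(sd_x, 3))
```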
Sampling: drawing without replacement
Draws are often without replacement
Strictly speaking, the binomial model then does not apply
But the binomial model is a good approximation if n < 10% of N
n = number of trials (sample size)
N = size of population
TI-84 : DISTR binompdf (n, p, x)
TI-84 : DISTR binompdf (5, 0.10, 2)
To make a binomial model focus on 1 thing and label it as success and the rest as failure
Sample size = number of trials
Example Exercise 51 pg. 235
Game with joystick (right handed)
13% of the population are left handed
P(there is at least one left-handed person in a sample of 5) = ?
= 1 – P(there are no left-handed people in a sample of 5)
Bernoulli trials?
Left-handed = success
The draws are not with replacement, so we are not sure that independence is met
It is acceptable if the 10% condition holds (sample smaller than 10% of the population)
= 1 – binompdf(5, 0.13, 0) ≈ 1 – 0.4984 ≈ 0.5016
What can go wrong
Random variables have a probability model, actual sample data has a distribution
Be aware of the notation you use: E(X) is used in a different context than \( \bar{x} \)
Be very cautious with the properties of the expected value and the standard deviation (use formula sheet)
Ensure the conditions for Bernoulli trials are met
Chapter 7: The Normal and Other Continuous Distribution
Standard Deviation as a ruler
We use z-scores as a way to compare values
Standardizing a variable: \( z = \frac{x - \bar{x}}{s} \)
Therefore, we use the standard deviation as a ruler, asking how many standard deviations a value is from the mean
How? Through the 68 – 95 – 99.7 Rule
The 68 – 95 – 99.7 Rule
In a unimodal symmetric distribution, about 68% of the values fall within one standard deviation of the mean, about 95% of the values fall within two standard deviations of the mean and about 99.7% of the values fall within three standard deviations away from the mean

Between -3σ and -2σ: very rarely low
Between -2σ and -1σ: unusually low
Between 1σ and 2σ: unusually high
Between 2σ and 3σ: very rarely high

68 – 95 – 99.7 rule: for unimodal, symmetric distribution
About 68% of all values fall within 1 SD(x) of the mean
About 95% of all values fall within 2 SD(x) of the mean
About 99.7% of all values fall within 3 SD(x) of the mean
Follows from the Normal Distribution
Use the 68-95-99.7 Rule to identify outliers
Negative value: below the mean
Positive value: above the mean
Is P/E of 40.4 (Feb 2010) exceptional?
Standardize:
\( z = \frac{x - \bar{x}}{s} \approx \frac{40.4 - 17.9}{6.8} \approx 3.3 \)
The price/earnings ratio of February 2010 is 3.3 standard deviations above the mean
40.4 is more than 3 SD(x) above the mean
Fewer than 0.3/2 = 0.15% of values are this large
If the data follow a unimodal, symmetric distribution, values for which z < -3 or z > 3 are considered outliers
The Normal Distribution
Normal Distribution: a theoretical probability model used for continuous random variables
Properties:
1. -∞ < z < +∞
2. Probability density function is bell-shaped and unimodal
3. µ = 0
4. σ = 1
Area under pdf = 1
Pdf = probability density function
Check if the histogram is symmetric, bell-shaped and unimodal; if so, a Normal model is appropriate
To characterize normal model, we need to know the mean (µ) and the standard deviation(σ)
N(µ, σ)
Standard Normal Model: N(0, 1)
But only if the data is standardized
Areas Under the Normal pdf
P( 0 ≤ z ≤ 1)

Because about 68% of the values lie within 1 SD of the mean (between -1 and 1), by symmetry about half of that, 34%, lies between 0 and 1
≈ 0.3413
P( 0 ≤ z ≤ 1 ) = area under Normal pdf between 0 and 1
Finding the area
TI-84: DISTR normalcdf(0, 1)
P( 0 ≤ z ≤ 1 ) ≈ 0.3413
P( -2 ≤ z ≤ 2 ) ≈ 0.9545
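For readers without a TI-84, the same areas can be obtained in Python. This is a sketch that uses `statistics.NormalDist` from the standard library (Python 3.8 or later) in place of `normalcdf`; the variable names are illustrative.

```python
from statistics import NormalDist

# Standard Normal model N(0, 1).
z = NormalDist(mu=0, sigma=1)

# Area between two z values = cdf(upper) - cdf(lower).
area_0_to_1 = z.cdf(1) - z.cdf(0)
area_minus2_to_2 = z.cdf(2) - z.cdf(-2)

print(round(area_0_to_1, 4), round(area_minus2_to_2, 4))
```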
How it works
Normal model:
N(500,100)
µ = 500
σ = 100
Standard Normal Model:
N(0,1)

µ = 0
σ = 1
Goal: to find the shaded area in %
What percentage of the normal distribution is shaded (before (2))
Or, knowing the shaded area in %, to find the corresponding value of z (critical value)
Standardizing does not change the shape of the distribution
Normal distribution as an approximation of data
P/E data: not such a good approximation
GMAT scores approximately follow the Normal distribution
With µ = 500 and σ = 100
GMAT scores ~ N(500, 100)
X ~ N(500, 100)
~ means “follows”
Properties of the Normal Distribution
Property 1
If X and Y are independent Normal random variables, then X + Y is also a Normal random variable
Property 2:
If the number of trials n is sufficiently large (*), the Normal pdf is a good approximation of the binomial model
(*) np ≥ 10 and n (1-p) ≥ 10
Percentiles of the Normal pdf
Lower quartile (Q1)
= 25% of all values is ≤ Q1
Q1 is also called: the 25th percentile
So Q3 is the 75th percentile
Median is the 50th percentile
90th percentile: that value so that 90% of all values ≤ 90th percentile
Find the 25th percentile of GMAT scores:
TI -84: DISTR invNorm (0.25, 500, 100)
25th percentile ≈ 432.5510
Find the 90th percentile invNorm (0.90, 500, 100)
90th percentile ≈ 628.1552
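The `invNorm` calculations can be reproduced in Python as a cross-check. The sketch below assumes the N(500, 100) model from the notes and uses `NormalDist.inv_cdf` from the standard library; the name `scores` is illustrative.

```python
from statistics import NormalDist

# Normal model for the scores: mean 500, standard deviation 100.
scores = NormalDist(mu=500, sigma=100)

# inv_cdf(p) returns the value below which a proportion p of the distribution lies.
percentile_25 = scores.inv_cdf(0.25)   # lower quartile Q1
percentile_90 = scores.inv_cdf(0.90)

print(round(percentile_25, 4), round(percentile_90, 4))
```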
Normal Approximation for Binomial
Chapter 6 Binomial model Binom(n,p)
Now we can use the normal model if:
n x p ≥ 10
n x (1 – p) ≥ 10
This determines whether it is safe to use the Normal approximation (check with the rules above)
N(µ, σ) where:
µ = n x p
\( \sigma = \sqrt{n p (1 - p)} \)
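A small sketch of the check in Python. The function name `normal_approximation` and the example values n = 100, p = 0.2 are hypothetical and chosen only for illustration; the second call reuses n = 5, p = 0.10 from the ticket example in Chapter 6.

```python
from math import sqrt

def normal_approximation(n, p):
    """Return (conditions_met, mu, sigma) for approximating Binom(n, p) by N(mu, sigma)."""
    conditions_met = n * p >= 10 and n * (1 - p) >= 10
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    return conditions_met, mu, sigma

# Hypothetical example: 100 trials with success probability 0.2.
ok_large, mu_large, sigma_large = normal_approximation(100, 0.2)

# The ticket example (n = 5, p = 0.10) is too small for the approximation.
ok_small, mu_small, sigma_small = normal_approximation(5, 0.10)

print(ok_large, mu_large, sigma_large)
print(ok_small)
```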
Normal Quantile – Quantile – plot (Q-Q plot)
Q-Q plot: method to assess whether the normal model is a good approximation of the data
Ex. P/E data (n = 1780)
1. Sort the values in ascending order
2. Find for every value the theoretical quantile
Ex. 1st of the 1780 values ( y1 ≈ 4.78; z scores ≈ -1.79)
If the variable were normally distributed; the 1st of 1780 values would have a z score ≈ -3.43
1st point of Q-Q plot (-3.43; 4.78)
Repeat for all n values
3. If the points are approximately on a straight line, the normal distribution is a good approximation of the data
Alternative version, with z scores of data on vertical axis
Point of reference is 45° line
If the points are approximately on the 45° line, the normal distribution is a good approximation of the data
What can go wrong
Probability models are still just models, Question probability as you would question data
Don’t assume everything is Normal; check normality with a histogram or a Normal probability (Q-Q) plot
Don’t use the Normal approximation when n is small
Chapter 11: Sampling Distribution and confidence interval mean
Population
µ = population mean (unknown)
Sample
\( \bar{x} \) = sample mean
\( \bar{x} \) serves as an estimate for µ
95% confidence interval (CI): \( \bar{x} \pm 2\,SE(\bar{x}) \)
Here 2 ≈ z*
Only work if n is large (rule of thumb: n ≥ 50) if sample has unimodal symmetric distribution
The distribution of a sample mean
For quantitative variables
Measure | Full population | Sample 1 | Sample 2 | Sample 3
Mean | µ = 20 | \( \bar{x} \) = 22 | \( \bar{x} \) = 21 | \( \bar{x} \) = 19
Standard deviation | σ = 1.2 | s = 0.98 | s = 1.3 | s = 2.03
Central limit theorem
Central limit theorem (CLT): the sample mean (\( \bar{x} \)) of a random sample has a probability distribution that can be approximated by the normal distribution as the sample size (n) grows
\( \bar{x} \sim N(\mu, \sigma / \sqrt{n}) \) (approx.)
How good is the approximation?
The larger the sample (n) the better the approximation
Regardless of the shape of the population
Population = box with 1.2 million tickets
Ticket value | 1 | 2 | 3 | 4 | 5 | 6
Number of tickets | 200,000 | 200,000 | 200,000 | 200,000 | 200,000 | 200,000
Draw 2 tickets without replacement (n = 2) and find the mean
Example: tickets 5 and 2 give \( \bar{x} = (5 + 2) / 2 = 3.5 \)
Draw 5 tickets without replacement (n = 5) and find the mean
Draw 20 tickets without replacement (n = 20) and find the mean
Example: \( \bar{x} = 3.3 \)
Repeat this 10,000 times – density histogram
Distribution gets taller
Tails get thinner
Unimodal
Symmetric
CLT: the sample mean of a random sample has a probability distribution that can be approximated by the normal distribution
\( \bar{x} \sim N(\mu, \sigma / \sqrt{n}) \) (approx.)
The larger the sample the better the approximation
Sample means are approximately distributed like a normal curve
95% CI: \( \bar{x} \pm 2\,SE(\bar{x}) \)
Property: \( SE(\bar{x}) = \frac{s}{\sqrt{n}} \) (no proof)
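The ticket experiment can be imitated with a short simulation. This is a sketch under one simplifying assumption: tickets are drawn with replacement from the values 1 to 6, which is practically the same as drawing a handful of tickets without replacement from a box of 1.2 million. The seed, the constant names and the function `sample_mean` are illustrative choices.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(1)  # fixed seed so the sketch is repeatable

TICKET_VALUES = [1, 2, 3, 4, 5, 6]  # each value is equally common in the box
REPETITIONS = 10_000
POPULATION_SD = sqrt(35 / 12)       # SD of a fair six-valued ticket, about 1.71

def sample_mean(n):
    """Draw n tickets (assumption: with replacement) and return their mean."""
    draws = random.choices(TICKET_VALUES, k=n)
    return sum(draws) / n

results = {}
for n in (2, 5, 20):
    means = [sample_mean(n) for _ in range(REPETITIONS)]
    results[n] = (mean(means), stdev(means))
    print(n, round(results[n][0], 2), round(results[n][1], 2),
          "theory:", round(POPULATION_SD / sqrt(n), 2))
```

The simulated sample means stay centred near 3.5 while their spread shrinks roughly like σ/√n, which is why the histogram gets taller and its tails thinner as n grows.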
Confidence Interval of the mean
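Very often the “true” mean (µ) and the “true” standard deviation of the sample mean (\( \sigma / \sqrt{n} \)) are unknown
Solution: we estimate \( SD(\bar{x}) \) as \( SE(\bar{x}) = \frac{s}{\sqrt{n}} \)
Where \( SE(\bar{x}) \) is the standard error
If n is large: confidence interval \( \bar{x} \pm z^* \times SE(\bar{x}) \); 95% confidence interval \( \bar{x} \pm 2 \times SE(\bar{x}) \)
If n is not large: confidence interval \( \bar{x} \pm t^* \times SE(\bar{x}) \)
What can we say about the population mean (µ)?
We are 95% confident that the population mean (e.g. age) is between
\( \bar{x} - 2 \times SE(\bar{x}) \) and \( \bar{x} + 2 \times SE(\bar{x}) \)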
Ex. Research on toxic residuals in farm
y = concentration (in ppm) of Pesticide Mirex
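n = 150; \( \bar{y} \) = 0.093 ppm; s = 0.0495 ppm
CLT: \( \bar{y} \sim N(\mu, \sigma / \sqrt{n}) \) (approx.)
But \( E(\bar{y}) = \mu \) and \( SD(\bar{y}) = \sigma / \sqrt{n} \) are unknown to the researcher
Researcher estimates \( SD(\bar{y}) \) as: \( SE(\bar{y}) = \frac{s}{\sqrt{n}} = \frac{0.0495}{\sqrt{150}} \approx 0.0040 \) ppm
Confidence interval for a mean:
\( \bar{y} \pm z^* \times SE(\bar{y}) \)
Margin of error: \( z^* \times SE(\bar{y}) \)
Only if n is large can we use z*; otherwise use t*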
(Student t distribution (STATS II))
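The interval for the Mirex example can be sketched in Python. It assumes z* = 2, the rounded 95% value used earlier in this chapter; the variable names are illustrative.

```python
from math import sqrt

n = 150          # sample size
y_bar = 0.093    # sample mean concentration in ppm
s = 0.0495       # sample standard deviation in ppm
z_star = 2       # assumption: rounded critical value for 95% confidence

standard_error = s / sqrt(n)            # SE = s / sqrt(n)
margin_of_error = z_star * standard_error
interval = (y_bar - margin_of_error, y_bar + margin_of_error)

print(round(standard_error, 4), tuple(round(bound, 4) for bound in interval))
```

Under these assumptions the interval runs from roughly 0.085 ppm to roughly 0.101 ppm.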
Assumptions and conditions to use N (apply normal dist.)
Independence assumption
The sampled values must be independent of each other
Sample size assumption
The sample size, n, must be large enough
Randomization condition
Sample must be a simple random sample of the population
10% condition
n is smaller than 10% of the population (when sampling without replacement)
No success / Failure condition
What can go wrong
Don’t confuse proportions and means. Use Normal models with proportions (and when N is large). Use student t methods with means
Don’t confuse s (standard deviation of sample) with σ (standard deviation of population)
Beware of multimodality. If you see this, try to separate the data into groups
Beware of skewed data
Investigate outliers. If they are clearly in error remove them. If they can’t be removed, you might run the analysis with or without the outlier
Make sure data are independent. Consider whether there are likely violations of independence in the data collection methods
The Central Limit Theorem doesn’t talk about the distribution of the data from the sample. It talks about the sample means and sample proportions of many different random samples drawn from the same population