Lecture notes

Statistics I

Chapter 1: Data and Decisions

  1. What are Data?

  • Statistics is about data and decisions

  • Statistics are quantities calculated from data

  • Statistics: a toolbox and a way of thinking

  • Its purpose is to gather data that is relevant to the problem

  • Statistics: a collection of tools and the associated reasoning to model, summarize and understand data.

  • Data: it’s not just information but also values that are measured and observed, values with their context.

  • Data warehouses: vast digital repositories where data is recorded and stored

  • Big Data: data sets so large that traditional methods of storage and analysis are inadequate

  • The “Five W’s”

  • Who: who is the subject or case

  • Who is being measured

  • What: what are the variables (information)

  • What you have

  • Who and what are part of data and are essential

  • Where: where are these values recorded

  • When: when are these values recorded

  • Why: why are these values recorded

  • How: how are these values recorded

  • Why, when, where, how are part of the metadata

  • Context of data we are collecting

  • Organize the information gathered on tables or spreadsheets (collections of data)

  • The rows of the table answer the "who": each row is a case

  • The columns of the data table answer the "what": each column is a variable

  1. Variable types

  • Categorical variables: the values are names of categories

  • Nominal: just a label without a particular order (no intrinsic order)

  • Ordinal: labels with a given order

  • Some nominal variables are used as identifiers

  • Identifiers: purpose is to assign a unique identifier code to each individual

  • special case of nominal categorical variables

  • Variables identify a category for each case

  • Values such as yes or no, which can't be added

  • Qualitative: labels, titles

  • Quantitative variables: the values the variable takes are numerical quantities with units

  • Must be a number quantity and have units

  • Variables record measurements, amounts or something else but they must have units

  • Cross-sectional data: when you record something at a given point in time for different units

  • Time-series data: collects information for 1 subject over different points in time

  • If there is displacing, then there is no time-series

Chapter 2: Displaying and Describing Categorical Data

  1. Displaying and describing Categorical data

  • Descriptive statistics: summarizes data and displays it

  • In statistics, instead of measuring the whole population, it is usually best to take a sample from the population and make inferences from that sample

  • Inferential statistics: drawing conclusions about the population from a sample

  • Samples are chosen randomly

  • Display and summarize data to see:

  • Patterns

  • Relationships

  • Exceptions in values and observations

  1. Summarizing a Categorical variable

IP Number      | Time                    | Source
245.240.221.71 | 1 / Feb / 2013 13:15:08 | Google
196.345.281.51 | 1 / Feb / 2013 14:56:23 | Direct

  • Variable source:

  • IP number: categorical, nominal (identifier)

  • Time: quantitative

  • Source: categorical, nominal

  • Variable source: how their customers found their way to the website

  • Source is a categorical variable

  • If the list is too long and does not fit in a page, create a new category called “others”

  • Frequency table: shows how often each value of the variable occurs

Source | Count of visits | Visits as a percentage (%)
Google | 130,158         | 57.36
Direct | 52,969          | 23.34
E-mail | 16,084          | 7.09
Bing   | 9,581           | 4.22
Others | 6,740           | 2.97
Total  | 226,925         | XXX

  • Express each frequency as a percentage

  • Compute the proportion

  • Percentage = proportion x 100%

  • Make and interpret a frequency table for a categorical variable

Frequency table          | Count of the number of cases in each category                                       | Count
Relative frequency table | Count of the number of cases in each category divided by the total number of cases | Percentage

  • Cases = people answering the questions
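  • A minimal Python sketch (not part of the lecture; pandas assumed) of how the frequency and relative frequency table above could be built; the counts and the total of 226,925 visits are the ones from the table:

    import pandas as pd

    # Counts of visits per source, taken from the table above
    counts = pd.Series(
        {"Google": 130_158, "Direct": 52_969, "E-mail": 16_084, "Bing": 9_581, "Others": 6_740},
        name="Count of visits",
    )

    freq_table = counts.to_frame()
    # Relative frequency: proportion x 100%; divide by the total of ALL visits (226,925),
    # which also includes smaller sources not shown in the table
    freq_table["Visits (%)"] = (counts / 226_925 * 100).round(2)
    print(freq_table)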

  1. Displaying and describing categorical data

  • Display the percentage as a bar chart

  • Keep it proportional

  • Categorical bar chart must have gaps to indicate different categories

  • Make and interpret a bar chart or pie chart

  • Pie chart gives less information

  • Horizontal axis = categories

  • Vertical axis = counts

  • Residual category is always placed last in a bar chart regardless of the count

  • Area principle: the areas of the bars must be proportional to the quantities they represent

  • Pie chart: shows how the whole group breaks into several categories and shows all the cases as a circle sliced into pieces whose areas are proportional to the fraction of cases in each category

  • Keep order from biggest to smallest (residual goes last)

  • Include the percentages

  • Bar charts are better than pie charts to represent frequency tables

  1. Exploring Two Categorical Variables: Contingency Tables

  • Example: survey of 5039 people in 5 countries (do you use social networks?)

Respondent ID | Social Networking | Country
0001          | Yes               | Egypt
0002          | No access         | Egypt
…             | …                 | …
1000          | No access         | Egypt
1001          | Yes               | UK
1002          | Yes               | UK
1003          | No                | UK
…             | …                 | …
5039          | Yes               | US

  • Frequency table for variable social networking

Social Networking | Count | Relative frequency (%)
No                | 1,249 | 24.8
Yes               | 2,175 | 43.2
No access         | 1,615 | 32.1
Total             | 5,039 | 100.1

  • Counts must add up to the total number of cases (rows)

  • Does social networking differ from country to country?

  1. Contingency tables

          | UK    | EG    | DE    | RU    | US    | Total
No        | 336   | 70    | 460   | 90    | 293   | 1,249
Yes       | 529   | 300   | 340   | 500   | 506   | 2,175
No access | 153   | 630   | 200   | 420   | 212   | 1,615
Total     | 1,018 | 1,000 | 1,000 | 1,010 | 1,011 | 5,039

  • Marginal distribution: the frequency distribution of either one of the variables

  • The 300 in the Egypt (EG) column means that there are 300 respondents from Egypt that use social networking

  • There are 300 respondents out of 5,039 respondents who are from Egypt and who answered yes to the question, meaning they use social networking

  1. For every cell we can compute a percentage in three different ways

  • Total percentage

Total percentage = (300 / 5,039) x 100% ≈ 5.9%

  • Approximately 5.9% of all respondents are from Egypt and answered yes

  • Row percentage

Row percentage = (300 / 2,175) x 100% ≈ 13.8%

  • Approximately 13.8% of the respondents who answered yes are from Egypt

  • Column percentage

Column percentage = (300 / 1,000) x 100% ≈ 30%

  • 30% of the respondents from Egypt answered yes to the survey question

  • Compare social networking country to country

  • Use column percentages

  • Conditional distribution: the distribution of one variable restricted to the cases with a given value of the other variable ("what we conditioned on")

  • Make and interpret contingency tables

Contingency table     | Shows how the individuals are distributed over one variable depending on the value of the other variable | Count
Marginal distribution | The frequency distribution of either one of the variables                                               | Percentage
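  • A small Python sketch (illustrative; pandas assumed, with hypothetical column names) of how a contingency table and its column percentages could be computed from one-row-per-respondent data like the survey table above:

    import pandas as pd

    # A few illustrative rows; the real survey has 5,039 respondents
    df = pd.DataFrame({
        "Country": ["Egypt", "Egypt", "UK", "UK", "UK", "US"],
        "Social Networking": ["Yes", "No access", "Yes", "Yes", "No", "Yes"],
    })

    # Counts with row and column totals (the marginal distributions)
    table = pd.crosstab(df["Social Networking"], df["Country"], margins=True)

    # Column percentages: the conditional distribution of the answer within each country
    col_pct = pd.crosstab(df["Social Networking"], df["Country"], normalize="columns") * 100

    print(table)
    print(col_pct.round(1))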

Chapter 4: Correlation

  1. Scatterplots

  • Scatterplots: plots two quantitative variables together showing the relationship (if any) between them

  • Study if a relationship exists

  • In a scatterplot used only to explore whether a relationship exists, it doesn't matter which variable we choose for the x- and y-axis

  • Analysis of the relationships between the quantitative variables

Direction                      | Form                                               | Strength                                             | Outliers
Positive, negative, or neither | Straight, curved, something exotic, or no pattern | How clustered is the data around the line of best fit? | Are there any unusual observations or subgroups?

  1. Assign roles to the variables

  • Variable can be defined as explanatory or as response variables

  • Explanatory variable: x-axis

  • Response variable: y-axis

  1. Measuring correlation

Case  | Sales people x | Sales ($) y | zx    | zy    | zx·zy
1     | 2              | $10,000     | -1.49 | -1.42 | 2.12
2     | 3              | $11,000     | -1.31 | -1.24 | 1.62
3     | 7              | $13,000     | -0.60 | -0.86 | 0.52
4     | 9              | $14,000     | -0.25 | -0.67 | 0.17
5     | 10             | $18,000     | -0.07 | 0.07  | -0.01
6     | 10             | $20,000     | -0.07 | 0.45  | -0.03
7     | 12             | $20,000     | 0.28  | 0.45  | 0.13
8     | 15             | $22,000     | 0.82  | 0.82  | 0.67
9     | 16             | $22,000     | 0.99  | 0.82  | 0.82
10    | 20             | $26,000     | 1.70  | 1.57  | 2.68
Total | 104            | $176,000    | 0.00  | 0.00  | 8.69

x̄ (mean of x) = 10.4 people;  sx = SD(x) ≈ 5.6 people;  zx = (x - x̄) / sx

ȳ (mean of y) = $17,600;  sy = SD(y) ≈ $5,337;  zy = (y - ȳ) / sy

In general: z = (x - x̄) / s

r = Σ zx·zy / (n - 1) = 8.69 / (10 - 1) ≈ 0.97

  • x: explanatory (or predictor) variable (independent) (cause)

  • y: response variable (dependent) (effect)

  • Sum of all z scores is always 0
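  • A minimal Python sketch (NumPy assumed; not part of the lecture) reproducing the correlation computation above from the salespeople/sales data:

    import numpy as np

    x = np.array([2, 3, 7, 9, 10, 10, 12, 15, 16, 20])             # salespeople
    y = np.array([10, 11, 13, 14, 18, 20, 20, 22, 22, 26]) * 1000  # sales ($)

    n = len(x)
    zx = (x - x.mean()) / x.std(ddof=1)   # z-scores using the sample SD (divide by n - 1)
    zy = (y - y.mean()) / y.std(ddof=1)

    r = np.sum(zx * zy) / (n - 1)
    print(round(r, 2))                        # ≈ 0.97
    print(round(np.corrcoef(x, y)[0, 1], 2))  # same value computed directly by NumPy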

  1. Correlation Conditions

  • Only for quantitative variables

  • Only meaningful if the relationship is approximately linear

  • Straight lines not curves

  • Beware of outliers

  1. Measuring Correlation

  • In a positive linear relationship, most points are in:

  • the lower left quadrant: x < x̄ and y < ȳ, so zx < 0 and zy < 0

  • zx·zy > 0

  • the upper right quadrant: x > x̄ and y > ȳ, so zx > 0 and zy > 0

  • zx·zy > 0

  • The mean of the products zx·zy is > 0; this mean (computed with n - 1) is the correlation coefficient r

  • In a negative linear relationship, most points are in:

  • the upper left quadrant: x < x̄ and y > ȳ, so zx < 0 and zy > 0

  • zx·zy < 0

  • the lower right quadrant: x > x̄ and y < ȳ, so zx > 0 and zy < 0

  • zx·zy < 0

  • The mean of the products zx·zy is < 0, so r < 0

Case | Sales people x | Sales ($) y | zx   | zy    | zx·zy
1    | 2              | $10,000     | -1.5 | -1.42 | 2.12

  • Case 1:

  • zx = -1.5

  • zy ≈ -1.4240

  • zxzy ≈ 2.14

zx = (2 people - 10.4 people) / 5.6 people = -1.5, i.e. 1.5 standard deviations below the mean

zy = ($10,000 - $17,600) / $5,337 ≈ -1.4, i.e. 1.4 standard deviations below the mean

zx·zy ≈ 2.1

Sum all the products zx·zy:

r = Σ zx·zy / (n - 1) = 8.69 / 9 ≈ 0.97

  1. Understanding Correlation

  • The sign of the correlation gives the direction of the relationship

  • Positive or negative

  • Boundaries of correlation

  • A correlation of 1 or -1 is a perfect linear relationship

  • A correlation of 0 is a lack of linear relationship

  • Correlation has no units, so shifting or scaling the data, standardizing, or even swapping the variables has no effect on the numerical value

  • A large correlation is not a sign of a causal relationship

  • Correlation is only meaningful when the relationship is linear and there are no outliers

  1. Properties of r

  1. Positive r indicates a positive linear relationship and negative r indicates a negative linear relationship

  2. -1 ≤ r ≤ 1

  3. r has no units

  4. r isn't affected by changes in center or scale of x or y, since it's computed with z-scores

  5. r measures the strength of the linear relationship

  • r ≈ 0 doesn't mean there is no relationship, only that there is no linear relationship

  6. r is sensitive to outliers

  1. Non-linear correlations

  • r is not a valid measure when the relationship is not linear

  • Options

  • Transform the variables (e.g., using square roots or base-10 logarithms) so that the relationship becomes more linear

  • Correlation of ranks

  1. Lurking Variables

  • Lurking variables: an unobserved variable that is affecting both of the variables being observed

  • Examples of lurking variables

  • Positive relationship between number of doctors and life expectancy

  • Lurking variable: better living standards?

  • Positive relationship between consumption of diet soda drinks and traffic accidents

  • Lurking variable: higher population?

  • Positive relationship between the consumption of ice cream and number of drownings

  • Lurking variable: hot weather

  • The real relationship is not between the two observed variables, but between each of the observed variables and the lurking variable

  1. What can go wrong

  • Don’t say correlation when you mean association

  • To say correlation, first calculate r

  • Don’t calculate correlation of categorical variables

  • Make sure the variables are associated in a linear way

  • Be careful to see if there are any outliers

  • Don’t confuse correlation with causation

  • Lurking variables, spurious correlation

Chapter 6: Random variables and Probability Models

  1. Random Variables

  • Random variable: a variable whose value is based on the outcome of a random event

  • Types of random variables

  • Discrete random variables: can take only a countable list of values, such as 1, 2, 3, …

  • Continuous random variables: can take any value in an interval, e.g. 0.1, 0.11, 0.12, …

  • For both discrete and continuous variables, the set of all possible values and their associated probabilities is called the probability model

  • Example: Insurance company

Outcome    | Payout x | Probability P(X = x)
Death      | $100,000 | 1 / 1000
Disability | $50,000  | 2 / 1000
Neither    | $0       | 997 / 1000
Sum        | ---      | 1000 / 1000

  1. Expected Value of a Random Variable

  • When the probability value is known, then the expected value can be calculated

  • The long-term expectation of the model

Expected value = Σ (outcome x probability of that outcome)

  • E(X) can be written as µ, but not as ȳ (ȳ is the mean of sample data)

  • For discrete random variables, probability models assign a probability to each possible outcome

  • Example with previous table

  • Which payout per policy can an insurance company expect?

  • Assume they have 1000 clients

E(X) = [($100,000 x 1) + ($50,000 x 2) + ($0 x 997)] / 1000 = $200

E(X) = $100,000 x (1 / 1000) + $50,000 x (2 / 1000) + $0 x (997 / 1000) = $200
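  • The same expected-value calculation as a short Python sketch (illustrative):

    # Insurance example: payouts and their probabilities
    payouts = [100_000, 50_000, 0]
    probs = [1 / 1000, 2 / 1000, 997 / 1000]

    expected_value = sum(x * p for x, p in zip(payouts, probs))
    print(expected_value)  # 200.0  ->  E(X) = $200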

  1. Properties of Expected values

  • If X and Y are random variables and c is a constant:

The expectation of a constant c is the constant: E(c) = c

Adding a constant c shifts the expected value by c: E(X ± c) = E(X) ± c

Multiplying by a constant c multiplies the expected value by c: E(cX) = c x E(X)

The expected value of a sum (or difference) of two random variables: E(X ± Y) = E(X) ± E(Y)

  1. Standard Deviation of a Random Variable

  • Variance: the expected value of the squared deviations from the mean

var(X) = Σ (x - µ)² x P(x)

  • var(X) can be written as σ² but not as s²

  • Standard deviation: an indicator of the variability of the random variable

  • SD(X) can be written as σ but not as s

Outcome    | Payout x | Probability P(X = x) | Deviation x - E(X)
Death      | $100,000 | 1 / 1000             | $100,000 - $200 = $99,800
Disability | $50,000  | 2 / 1000             | $50,000 - $200 = $49,800
Neither    | $0       | 997 / 1000           | $0 - $200 = -$200
Sum        | ---      | 1000 / 1000          |

  • E(x) instead of mean

  • E(x) = $200

  1. Square the deviations: ($99,800)², ($49,800)², (-$200)²

  2. Find the expected value of the squared deviation

var(X) = ($99,800)² x (1 / 1000) + ($49,800)² x (2 / 1000) + (-$200)² x (997 / 1000) = 14,960,000 dollars²

  3. Take the square root: SD(X) = σ = √14,960,000 ≈ $3,868

In general: var(X) = σ² = Σ (x - µ)² x P(x)
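  • A short Python sketch (illustrative) of the variance and standard deviation calculation above:

    payouts = [100_000, 50_000, 0]
    probs = [1 / 1000, 2 / 1000, 997 / 1000]

    mu = sum(x * p for x, p in zip(payouts, probs))               # E(X) = 200
    var = sum((x - mu) ** 2 * p for x, p in zip(payouts, probs))  # 14,960,000 dollars²
    sd = var ** 0.5                                               # ≈ 3,868
    print(mu, var, round(sd))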

  1. Properties of Standard deviations

If all values of X are the same (a constant c), the standard deviation is zero: SD(c) = 0

Adding a constant c does not change the standard deviation: SD(X ± c) = SD(X)

Multiplying by a constant c multiplies the standard deviation by |c|: SD(cX) = |c| x SD(X)

The SD of the sum of X and Y is at most the sum of their standard deviations: SD(X + Y) ≤ SD(X) + SD(Y)

  1. Properties and Examples

  • Chebyshev’s inequality: also holds for random variables

  • For data (ȳ, s), most values are between ȳ - 3s and ȳ + 3s

  • In repeated random experiments, most values of X will fall in the corresponding range

E(X) - 3 x SD(X) and E(X) + 3 x SD(X)

E(X) = $200 ; SD(X) ≈ $3,868

so most payouts fall between -$11,404 and $11,804

  • What if insurance company increases payouts by $100 to:

  • $100,100 (Death); $50,100 (Disability); $100 (Neither)

E(X + $100) = $100,100 x (1 / 1000) + $50,100 x (2 / 1000) + $100 x (997 / 1000)

E(X + $100) = [$100,000 x (1 / 1000) + $50,000 x (2 / 1000) + $0 x (997 / 1000)] + $100 x (1 / 1000 + 2 / 1000 + 997 / 1000)

E(X + $100) = $200 + $100 = $300, i.e. E(X) + c

In general: E(X ± c) = E(X) ± c

  • What if SD (x + $100)

Deviations = (x + $100) - E(X + $100) = x + $100 - E(X) - $100 = x - E(X)

The SD is unchanged: SD(X ± c) = SD(X)

  • Insurance company

  • X = annual payout per policy

  • E(X) = $200; SD(X) ≈ $3,868

  • What if insurance company doubles payouts to

  • $200,000 (Death); $100,000 (Disability); $0 (Neither)

  • X = payouts; 2x = double payouts

E(2X) = $200,000 x (1 / 1000) + $100,000 x (2 / 1000) + $0 x (997 / 1000)

E(2X) = 2 x [$100,000 x (1 / 1000) + $50,000 x (2 / 1000) + $0 x (997 / 1000)]

E(2X) = 2 x $200 = $400

E(2X) = 2 x E(X)

In general: E(aX) = a x E(X)

  • What about SD(2X) ?

  • When a constant is added SD is unchanged, but when it’s doubled, SD doubles

Deviations = 2X - E(2X) = 2X - 2E(X) = 2 x (X - E(X)) = 2 x (X - µ)

The SD is doubled: SD(cX) = |c| x SD(X)

var(2X) = E[ deviations² ] = E[ 2² x (X - µ)² ] = 2² x E[ (X - µ)² ] = 2² x var(X)

In general: var(aX) = a² x var(X) and SD(aX) = |a| x SD(X)

  • Variance is not a linear operator, it’s a quadratic operator

  • Insurance company

  • X = annual payout per policy

  • E(x) = $200; SD(x) ≈ $3,868

  • Suppose they are independent

  • Payout to Mr. Ecks = X

  • Payout to Ms. Wye = Y

E(X + Y) = E(X) + E(Y)

Addition rule for expected value

  • SD(X + Y):

Intuitively, SD(X + Y) ≤ SD(X) + SD(Y)

The variances simply add, var(X + Y) = var(X) + var(Y), only when X and Y are independent

  • SD(X) + SD(Y) corresponds to the risk of insuring the same person twice

  • If X and Y are independent:

var(X + Y) = var(X) + var(Y) = 14,960,000 dollars² + 14,960,000 dollars² = 29,920,000 dollars²

SD(X + Y) = √29,920,000 ≈ $5,470

  • The risk of insuring two people is lower than insuring the same person twice

  • It is best to have as many clients as possible who are independent of each other, to reduce risk

  • By “pooling risk” you can reduce the overall risk
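  • A small simulation sketch (illustrative; NumPy assumed) of why pooling independent risks helps: the spread of two independent policies is smaller than the spread of doubling one policy:

    import numpy as np

    rng = np.random.default_rng(1)
    payouts = np.array([100_000, 50_000, 0])
    probs = np.array([0.001, 0.002, 0.997])
    n_sim = 1_000_000

    x = rng.choice(payouts, size=n_sim, p=probs)  # payouts to Mr. Ecks
    y = rng.choice(payouts, size=n_sim, p=probs)  # independent payouts to Ms. Wye

    print(round(float((x + y).std())))  # ≈ 5,470 -> SD(X + Y), two independent policies
    print(round(float((2 * x).std())))  # ≈ 7,736 -> SD(2X), same person insured twice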

  1. Covariance

  • Covariance: measures how two random variables X and Y vary together

  • (co = together)

E(X) = µ and E(Y) = ν

cov(X, Y) = E[ (X - µ) x (Y - ν) ]

  • What if X and Y are not independent? Then the covariance enters: var(X + Y) = var(X) + var(Y) + 2 cov(X, Y)

var(X) = E[ (X - E(X))² ] = E[ (X - E(X)) x (X - E(X)) ] = cov(X, X)

cov(X, Y) = E[ (X - E(X)) x (Y - E(Y)) ]

cov(X, Y) can be negative or positive: cov(X, Y) ≤ 0 or cov(X, Y) ≥ 0

  1. Bernoulli Trials

  • There are only two outcomes per trial, called success or failure

  • The probability of success p is the same on every trial, the probability of failure 1 – p is often called q

  • The trials must be independent, but we can use the 10% condition

  • If the number of trials, or sample size is less than 10% of the population size, we can assume that the trials are independent

  • Bernoulli trials:

  • Two outcomes: success or failure

  • p is the same for all trials / draws as we draw with replacement

  • The draws/ trials must be independent (as tickets have no memory)

  1. Discrete probability models

  • The uniform model: probability of each outcome is always exactly the same

  • If X is a random variable with possible outcomes 1, 2, 3, 4….n and all outcomes have exactly the same probability then X has a discrete uniform distribution

  • Binomial model:

  • If X is a random variable whose outcomes are the number of successes in a series of n Bernoulli trials

  • Two parameters are necessary to define a binomial probability model, the number of trials n and the probability of success p

  • We denote it all as Binom(n, p)

  1. Uniform model

  • Suppose you roll a fair die

  • Probability distribution: a list of all possible outcomes + corresponding probability

x        | 1   | 2   | 3   | 4   | 5   | 6   | Sum
P(X = x) | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 | 1

  • Uniform discrete probability distribution because all the probabilities are the same

  1. Binomial model

Box of 10 tickets:  S  F  F  F  F  F  F  F  F  F

  • Random draw 1 ticket from the box (with replacement)

  • P(“success”) = 0.10 = p

  • Bernoulli trials:

  • Two outcomes success or failure

  • p is the same for all draws/ trials as we draw with replacement

  • The draws/ trials must be independent (tickets have no memory)

  • Draw 5 tickets with replacement

  • P( exactly 2 success in 5 trials) = ?

S S F F F     F S S F F     F F S S F     F F F S S     S F S F F

F S F S F     F F S F S     S F F S F     F S F F S     S F F F S

  • It can happen in 10 ways

  • Number of ways to have k successes in n trials: n! / (k! (n - k)!)

  • Number of ways to have 2 successes in 5 trials:

5! / (2! 3!) = (5 x 4 x 3 x 2 x 1) / ((2 x 1) x (3 x 2 x 1)) = (5 x 4) / (2 x 1) = 20 / 2 = 10 ways

  • P(exactly 2 successes in 5 trials)

  • P( S S F F F or S F S F F or … or F F F S S )

  • Mutually exclusive: if one of these events happens, the others can't happen (they are disjoint events)

  • So we can add their probabilities (addition rule for disjoint events)

  • P (SSFFF) + P(SFSFF) + P(SFFSF) + … + P(FFFSS)

  • = P(1st is S and 2nd is S and 3rd is F and 4th is F and 5th is F) + ….

  • Multiplication rule

P(A and B) = P(A) x P(B | A)

= P(1st is S) x P(2nd is S) x P(3rd is F) x P(4th is F) x P(5th is F) + …

= 0.10 x 0.10 x 0.90 x 0.90 x 0.90 + …

= (0.10)² x (0.90)³ + …

= 10 x (0.10)² x (0.90)³

= 0.0729

  • AND = multiplication rule

  • Mutually exclusive ≡ disjoint

  • 10 because it’s the number of ways

  • 0.10 = p, 2 = k, 0.90 = 1 - p, 3 = n - k; the "…" stands for 9 more such terms

  • Binomial probability model for Bernoulli trials:

  • n= number of trials

  • k= number of success

  • p= probability of success

  • Must be a Bernoulli Trial

P(exactly k successes in n trials) = [n! / (k! (n - k)!)] x p^k x (1 - p)^(n - k)

k = 0, 1, 2, 3, 4, …, n

E(X) = n x p

SD(X) = √(n x p x (1 - p))

  • Sampling: drawing without replacement

  • Draws are often without replacement

  • Binomial model does not apply

  • Binomial model is a good approximation

If n < 10% of N

n= number of trials (sample size)

N= size of population

TI-84 : DISTR binompdf (n, p, x)

TI-84 : DISTR binompdf (5, 0.10, 2)

  • To make a binomial model focus on 1 thing and label it as success and the rest as failure

  • Sample size = number of trials

  • Example Exercise 51 pg. 235

  • Game with joystick (right handed)

  • 13% of the population are left handed

  • P(at least one left-handed person in a sample of 5) = ?

  • 1 – P (there are no left-handed people in a sample of 5)

  • Check the Bernoulli trial conditions:

  • Left-handed = success, right-handed = failure (two outcomes)

  • The draw is not with replacement, so p is not exactly constant, but a sample of 5 is far less than 10% of the population (10% condition)

  • Independence is not guaranteed, but it is reasonable to assume here

  • = 1 - Binompdf (5, 0.13, 0) ≈ 1 - 0.4984 ≈ 0.5016
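  • The two binomial calculations above as a Python sketch (illustrative; scipy assumed), instead of the TI-84 binompdf:

    from scipy.stats import binom

    # P(exactly 2 successes in 5 Bernoulli trials with p = 0.10)
    print(round(binom.pmf(2, 5, 0.10), 4))      # 0.0729

    # Exercise: P(at least one left-handed person in a sample of 5), p = 0.13
    print(round(1 - binom.pmf(0, 5, 0.13), 4))  # ≈ 0.5016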

  1. What can go wrong

  • Random variables have a probability model, actual sample data has a distribution

  • Be aware of the notation you use: E(X) refers to a probability model, while ȳ refers to the mean of actual sample data

  • Be very cautious with the properties of the expected value and the standard deviation (use formula sheet)

  • Ensure the Bernoulli trial conditions are met

Chapter 7: The Normal and Other Continuous Distributions

  1. Standard Deviation as a ruler

  • We use z-scores as a way to compare value

  • Standardizing a variable: z = (x - x̄) / s

  • Therefore, we use the standard deviation as a ruler, asking how many standard deviations a value is away from the mean

  • How? Through the 68 – 95 – 99.7 Rule

  1. The 68 – 95 – 99.7 Rule

  • In a unimodal, symmetric distribution, about 68% of the values fall within one standard deviation of the mean, about 95% fall within two standard deviations of the mean, and about 99.7% fall within three standard deviations of the mean

  • Between -3σ and -2σ: very rarely low

  • Between -2σ and -1σ: unusually low

  • Between 1σ and 2σ: unusually high

  • Between 2σ and 3σ: very rarely high

  • 68 – 95 – 99.7 rule: for unimodal, symmetric distribution

  • About 68% of all values fall within 1 SD(x) of the mean

  • About 95% of all values fall within 2 SD(x) of the mean

  • About 99.7% of all values fall within 3 SD(x) of the mean

  • Follows from the Normal Distribution

  1. Use the 68-95-99.7 Rule to identify outliers

  • Negative value: below the mean

  • Positive value: above the mean

  • Is a P/E of 40.4 (Feb 2010) exceptional?

  • Standardize:

z = (x - x̄) / s = (40.4 - 17.9) / 6.8 ≈ 3.3

  • The price/earnings ratio of February 2010 is about 3.3 standard deviations above the mean

  • 40.4 is more than 3 SD(x) above the mean

  • Fewer than 0.3/2 = 0.15% of values are this large

  • If the data follow a unimodal, symmetric distribution, values for which z < -3 or z > 3 are considered outliers

  1. The Normal Distribution

  • Normal Distribution: a theoretical probability model used for continuous random variables

  • Properties (in terms of the standardized value z):

  • 1. -∞ < z < +∞

  • 2. The probability density function is bell-shaped and unimodal

  • 3. µ = 0

  • 4. σ = 1

  • Area under pdf = 1

  • Pdf = probability density function

  • Check if the histogram is:

  • Symmetric

  • Unimodal

  • Bell-shaped → if so, a Normal model is appropriate

  • To characterize normal model, we need to know the mean (µ) and the standard deviation(σ)

  • N(µ, σ)

  • Standard Normal Model: N(0, 1)

  • But only if the data is standardized

  1. Areas Under the Normal pdf

  • P( 0 ≤ z ≤ 1)

  • About 68% of the area lies within 1 SD of the mean (between -1 and 1), so the area between 0 and 1 is 68% / 2 = 34%

  • ≈ 0.3413

  • P( 0 ≤ z ≤ 1 ) = area under Normal pdf between 0 and 1

  • Finding the area

  • TI – 84: DISTR normalcdf (0,1)

  • P( 0 ≤ z ≤ 1 ) ≈ 0.3413

  • P( -2 ≤ z ≤ 2 ) ≈ 0.9545
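  • The same areas computed with a Python sketch (illustrative; scipy assumed) instead of the TI-84 normalcdf:

    from scipy.stats import norm

    print(round(norm.cdf(1) - norm.cdf(0), 4))   # P(0 ≤ z ≤ 1)  ≈ 0.3413
    print(round(norm.cdf(2) - norm.cdf(-2), 4))  # P(-2 ≤ z ≤ 2) ≈ 0.9545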

  1. How it works

  • Normal model:

  • N(500,100)

  • µ = 500

  • σ = 100

  • Standard Normal Model:

  • N(0,1)

  • µ = 0

  • σ = 1

  • Goal 1: given a value, find the shaded area under the curve (as a percentage)

  • Goal 2: given the shaded area in %, find the corresponding value z (the critical value)

  • Standardizing does not change the shape of the distribution

  1. Normal distribution as an approximation data

  • P/E data = not so good approximation

  • GMAT scores approximately follow the Normal distribution

  • With µ = 500 and σ = 100

GMAT scores ~ N(500, 100)

X ~ N(500, 100)

the symbol "~" means "follows"

  1. Properties of the Normal Distribution

  • Property 1

  • If X and Y are Normal random variables, their sum X + Y also follows a Normal distribution

  • Property 2:

  • If the number of trials n is sufficiently large (*), the Normal pdf is a good approximation to the Binomial

  • (*) np ≥ 10 and n (1-p) ≥ 10

  1. Percentiles of the Normal pdf

  • Lower quartile (Q1)

  • = 25% of all values is ≤ Q1

  • Q1 is also called: the 25th percentile

  • So Q3 is the 75th percentile

  • Median is the 50th percentile

  • 90th percentile: that value so that 90% of all values ≤ 90th percentile

  • Find the 25th percentile of GMAT scores:

  • TI -84: DISTR invNorm (0.25, 500, 100)

  • 25th percentile ≈ 432.5510

  • Find the 90th percentile invNorm (0.90, 500, 100)

  • 90th percentile ≈ 628.1552
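  • The same percentiles via a Python sketch (illustrative; scipy assumed), mirroring the TI-84 invNorm:

    from scipy.stats import norm

    print(norm.ppf(0.25, loc=500, scale=100))  # 25th percentile ≈ 432.55
    print(norm.ppf(0.90, loc=500, scale=100))  # 90th percentile ≈ 628.16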

  1. Normal Approximation for Binomial

  • Chapter 6 Binomial model Binom(n,p)

  • Now we can use the normal model if:

  • n x p ≥ 10

  • n x (1 – p) ≥ 10

  • These conditions determine whether it is safe to use the Normal approximation (check with the rules above)

  • N(µ, σ) where:

  • µ = n x p

  • σ = √(n x p x (1 - p))
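  • A small Python sketch (illustrative; scipy assumed) comparing an exact Binomial probability with its Normal approximation; the values n = 200 and p = 0.10 are made up for the example and satisfy np ≥ 10 and n(1 - p) ≥ 10:

    from math import sqrt
    from scipy.stats import binom, norm

    n, p = 200, 0.10
    mu, sigma = n * p, sqrt(n * p * (1 - p))  # µ = 20, σ ≈ 4.24

    print(round(binom.cdf(25, n, p), 4))                # exact P(X ≤ 25)
    print(round(norm.cdf(25, loc=mu, scale=sigma), 4))  # Normal approximation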

  1. Normal Quantile – Quantile – plot (Q-Q plot)

  • Q-Q plot: method to assess whether the normal model is a good approximation of the data

  • Ex. P/E data (n = 1780)

  • 1. Sort the values in ascending order

  • 2. Find for every value the theoretical quantile

  • Ex. the 1st of the 1,780 values: y1 ≈ 4.78, with a z-score of about -1.79 in the data

  • If the variable were Normally distributed, the 1st of 1,780 values would have a z-score ≈ -3.43

  • 1st point of the Q-Q plot: (-3.43, 4.78)

  • Repeat for all n values

  • 3. If the points are approximately on a straight line, the normal distribution is a good approximation of the data

  • Alternative version, with z scores of data on vertical axis

  • Point of reference is 45° line

  • If the points are approximately on the 45° line, the normal distribution is a good approximation of the data
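  • A Python sketch (illustrative; NumPy, scipy and matplotlib assumed) of the Q-Q plot recipe above; the data here are simulated and skewed, so the points should bend away from a straight line:

    import numpy as np
    from scipy import stats
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    data = rng.lognormal(mean=3, sigma=0.5, size=1780)  # skewed stand-in for the P/E data

    data_sorted = np.sort(data)                         # step 1: sort in ascending order
    n = len(data_sorted)
    # Step 2: theoretical Normal quantile for each rank (one common plotting position)
    theoretical_q = stats.norm.ppf((np.arange(1, n + 1) - 0.5) / n)

    plt.scatter(theoretical_q, data_sorted, s=5)        # step 3: roughly straight => Normal is ok
    plt.xlabel("Theoretical Normal quantiles (z)")
    plt.ylabel("Sorted data values")
    plt.show()

    # A ready-made alternative: stats.probplot(data, dist="norm", plot=plt)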

  1. What can go wrong

  • Probability models are still just models; question probabilities as you would question data

  • Don't assume everything is Normal; check a histogram or a Normal probability (Q-Q) plot

  • Don't use the Normal approximation when n is too small

Chapter 11: Sampling Distributions and the Confidence Interval for the Mean

  • Population

  • µ = population mean (unknown)

  • Sample:

  • ȳ = sample mean

  • Use ȳ as an estimate for µ

  • 95% confidence interval (CI): ȳ ± 2 SE(ȳ)

  • The 2 is the critical value z*

  • This only works if n is large (rule of thumb: n ≥ 50) or if the sample has a unimodal, symmetric distribution

  1. The distribution of a sample mean

  • For quantitative variables

     | Full population | Sample 1 | Sample 2 | Sample 3
Mean | µ = 20          | ȳ = 22   | ȳ = 21   | ȳ = 19
SD   | σ = 1.2         | s = 0.98 | s = 1.3  | s = 2.03

  1. Central limit theorem

  • Central limit theorem (CLT): the sample mean ȳ of a random sample has a probability distribution that can be approximated by the Normal distribution as the sample size n grows

ȳ ≈ N(µ, σ/√n) (approximately)

  • How good is the approximation?

  • The larger the sample (n) the better the approximation

  • Regardless of the shape of the population

  • Population = box with 1.2 million tickets

Value on ticket   | 1       | 2       | 3       | 4       | 5       | 6
Number of tickets | 200,000 | 200,000 | 200,000 | 200,000 | 200,000 | 200,000

  • Draw 2 tickets without replacement (n= 2) and find the mean

for example tickets 5 and 2: ȳ = (5 + 2) / 2 = 3.5

  • Draw 5 tickets without replacement (n= 5) and find the mean

  • Draw 20 tickets without replacement (n= 20) and find the mean

  • ȳ = 3.3

  • Repeat this 10,000 times and draw a density histogram of the sample means

  • Distribution gets taller

  • Tails get thinner

  • Unimodal

  • Symmetric

  • CLT: the sample mean of a random sample has a probability distribution that can be approximated by the normal distribution

  • ȳ ≈ N(µ, σ/√n) (approximately)

  • The larger the sample the better the approximation

  • Sample means are approximately distributed like a Normal curve

CLT: ȳ ≈ N(µ, σ/√n) (approximately)

95% CI: ȳ ± 2 SE(ȳ)

Property: SE(ȳ) = s / √n (no proof)
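  • A simulation sketch (illustrative; NumPy assumed) of the CLT with the box of tickets above, drawing with replacement for simplicity: as n grows, the spread of the sample means shrinks roughly like σ/√n:

    import numpy as np

    rng = np.random.default_rng(0)
    box = np.arange(1, 7)  # ticket values 1..6, equally likely

    for n in (2, 5, 20):
        sample_means = rng.choice(box, size=(10_000, n)).mean(axis=1)
        print(n, round(sample_means.mean(), 2), round(sample_means.std(), 3))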

  1. Confidence Interval of the mean

  • Very often the "true" mean µ and the "true" standard deviation SD(ȳ) = σ/√n are unknown

  • Solution: we estimate SD(ȳ) by SE(ȳ) = s/√n

  • where SE(ȳ) is the standard error

n is large:     confidence interval: ȳ ± z* x SE(ȳ);  95% confidence interval: ȳ ± 2 x SE(ȳ)

n is not large: confidence interval: ȳ ± t* x SE(ȳ)

  • What can we say about the population mean (µ)?

  • We are 95% confident that the population mean (e.g. mean age) is between

ȳ - 2 x SE(ȳ) and ȳ + 2 x SE(ȳ)

  • Ex. Research on toxic residuals in farm

  • y = concentration (in ppm) of Pesticide Mirex

  • n = 150; ȳ = 0.093 ppm; s = 0.0495 ppm

  • CLT: ȳ ≈ N(µ, σ/√n) (approximately)

  • But E(ȳ) = µ and SD(ȳ) = σ/√n are unknown to the researcher

  • The researcher estimates SD(ȳ) by the standard error SE(ȳ) = s/√n

  • Confidence interval for a mean:

ȳ ± z* x SE(ȳ), where z* x SE(ȳ) is the margin of error

  • Only if n is large can we use z*; otherwise use t*

  • (Student's t distribution → Statistics II)
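  • The Mirex confidence interval as a Python sketch (illustrative), using z* = 2 as in the notes:

    from math import sqrt

    n, ybar, s = 150, 0.093, 0.0495  # sample size, sample mean (ppm), sample SD (ppm)
    se = s / sqrt(n)                 # standard error of the mean ≈ 0.0040 ppm
    margin = 2 * se                  # margin of error

    print(round(ybar - margin, 4), round(ybar + margin, 4))  # ≈ 0.0849 ppm to 0.1011 ppm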

  1. Assumptions and conditions to use N (apply normal dist.)

  • Independence assumption

  • The sampled values must be independent of each other

  • Sample size assumption

  • The sample size, n, must be large enough

  • Randomization condition

  • Sample must be a simple random sample of the population

  • 10% condition

  • n smaller than 10% of the population (when sampling without replacement)

  • No Success/Failure condition (that condition applies to proportions, not means)

  1. What can go wrong

  • Don't confuse proportions and means. Use Normal models for proportions (when n is large); use Student's t methods for means

  • Don’t confuse s (standard deviation of sample) with σ (standard deviation of population)

  • Beware of multimodality. If you see this, try to separate the data into groups

  • Beware of skewed data

  • Investigate outliers. If they are clearly in error remove them. If they can’t be removed, you might run the analysis with or without the outlier

  • Make sure data are independent. Consider whether they are likely violations of independence in the data collection methods

  • The Central Limit Theorem doesn’t talk about the distribution of the data from the sample. It talks about the sample means and sample proportions of many different random samples drawn from the same population
