Introduction to Statistics

In Data Science, we run experiments on raw data and extract useful insights from it. To drive any business on the right path, data is vital; we can even say that "Data is the fuel". At the very least, it can provide actionable insights that help to:

• Strategize current campaigns,

• Easily organize the launch of new products, or

• Try out different experiments.

In all of the above, the one common driving component is data. We are entering a digital era in which we produce a huge amount of data every day.

For example, a company like Flipkart produces more than 2 TB of data on a daily basis.

Because data is so important to us, it becomes crucial to store and process it properly, without error. While dealing with datasets, the type or category of the data plays an important role in answering questions such as:

• Which preprocessing strategy would work for a particular dataset to get the right results, or

• Which type of statistical analysis should be applied for the best results.

Introduction to Data Types in Statistics

In statistics, data types play a crucial role: they need to be understood in order to apply statistical measurements correctly to your data and to draw correct conclusions about it.

Similarly, we need to know which type of data we are working with in order to select the correct analysis technique, since data types are an approach to classifying different kinds of variables.

While doing Exploratory Data Analysis (EDA) in a general data science project, it becomes crucial to have a good understanding of the different data types since we can use certain statistical measurements only for specific data types.

This classification of data types is also known as the Measurement Scale.

While dealing with any of the data types, we also need to know which visualization method fits the particular data type. We can think of data types as a way to categorize different types of variables.

Qualitative data

Qualitative data is information that cannot be counted, measured or easily expressed using numbers. It is collected from text, audio and images and shared through data visualization tools, such as word clouds, concept maps, graph databases, timelines and infographics.

  1. Nominal data

1.1. This data type is used just for labeling variables, without having any quantitative value. Here, the term ‘nominal’ comes from the Latin word “nomen” which means ‘name’.

1.2. It just names a thing without implying any particular order. Nominal data is sometimes referred to as "labels".

Examples of Nominal Data:

• Hair color (Blonde, Brown, Brunette, Red, etc.)

• Marital status (Married, Single, Widowed)

As you can observe from the examples, there is no intrinsic ordering of the variables.

Eye color is a nominal variable with a few levels or categories, such as Blue, Green, Brown, etc., and there is no way to order these categories in a rank-wise manner, i.e., from highest to lowest or vice versa.

  2. Ordinal data

2.1. The crucial difference from nominal data is that ordinal data shows where a value stands in a particular order.

2.2. This type of data is placed into some kind of order by its position on a scale. Ordinal data may indicate superiority.

2.3. We cannot do arithmetic operations with ordinal data because the values only show a sequence.

2.4. Ordinal variables are considered "in-between" qualitative and quantitative variables.

2.5. In simple words, ordinal data is qualitative data for which the values are ordered.

2.6. Nominal data, by contrast, is qualitative data for which the values cannot be placed in any order.

2.7. Based on relative position, we can also assign numbers to ordinal data, for example "first, second, third, etc.", but we cannot do math with those numbers.

Examples of Ordinal Data:

• Ranking of users in a competition: first, second, third, etc.

• Rating of a product taken by the company on a scale of 1-10.

• Economic status: low, medium, and high.
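As a small illustration of the difference, nominal and ordinal variables can be encoded differently in code. The sketch below (using pandas, one common choice among several; the category values are taken from the examples above) marks the ordinal variable as ordered so that ranking comparisons become valid:

    import pandas as pd

    # nominal: categories with no intrinsic order
    hair = pd.Categorical(["Blonde", "Brown", "Red", "Brown"])

    # ordinal: categories with an explicit low-to-high order
    status = pd.Categorical(
        ["low", "high", "medium", "low"],
        categories=["low", "medium", "high"],
        ordered=True,
    )

    print(hair.ordered)                # False: hair colors cannot be ranked
    print(status.min(), status.max())  # ordering is meaningful: low high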

Quantitative data

  1. Discrete data

1. It represents counts that involve only integers; the discrete values cannot be subdivided into parts.

For example, the number of students in a class is discrete data, since we can count whole individuals but cannot have 2.5 or 3.75 students.

2. In simple words, discrete data can take only certain values and the data variables cannot be divided into smaller parts.

3. It has a limited number of possible values, e.g., the days of the month.

Examples of discrete data:

• The number of students in a class.

• The number of workers in a company.

• The number of test questions you answered correctly.

  2. Continuous data

1. It represents information that can be meaningfully divided into finer levels. It can be measured on a scale or continuum and can take almost any numeric value. For example, we can measure our height at very precise scales in different units, such as meters, centimeters, and millimeters.

  2. The key difference between continuous and discrete data is that continuous data can be recorded at many different levels of precision, for measurements such as width, temperature, time, etc.

  3. Continuous variables can take any value between two numbers. For example, between 60 and 82 inches there are millions of possible heights, such as 62.04762 inches or 79.948376 inches.

  4. A good rule for deciding whether data is continuous or discrete: if the unit of measurement can be cut in half and still make sense, the data is continuous.

Examples of continuous data:

• The amount of time required to complete a project.

Introduction to Statistics and its Types

Statistics is the field of mathematics that deals with the collection, tabulation, and interpretation of numerical data. It is a form of mathematical analysis that uses quantitative models to describe experimental data or studies of real life. It is an area of applied mathematics concerned with data collection, analysis, interpretation, and presentation, and with how data can be used to solve complex problems. Some people consider statistics a distinct mathematical science rather than a branch of mathematics.

Statistics makes work easier and simpler and provides a clear picture of the work you do on a regular basis.

Basic terminology of Statistics :

  • Population – It is a collection or set of individuals, objects, or events whose properties are to be analyzed.

  • Sample – It is the subset of a population.

1. Descriptive Statistics : Descriptive statistics describes a population through numerical calculations, graphs, or tables. It provides a graphical summary of data and is used simply for summarizing it. It has two categories, as follows.

(a) Measure of central tendency – A measure of central tendency, also known as a summary statistic, is used to represent the center point or typical value of a data set or sample set. In statistics, there are three common measures of central tendency, as shown below:

  • (i) Mean : It is the average of all values in a sample set. For example, the mean of the sample set {2, 3, 3, 5, 7, 10} is (2 + 3 + 3 + 5 + 7 + 10) / 6 = 5.

  • (ii) Median : It is the central value of a sample set. The data set is ordered from lowest to highest and the exact middle is taken. For example, in the ordered set {2, 3, 3, 5, 7, 10}, the median is the average of the two middle values, (3 + 5) / 2 = 4.

  • (iii) Mode : It is the value that appears most frequently in a sample set. For example, in {2, 3, 3, 5, 7, 10}, the mode is 3, since it is repeated most often.
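A quick sketch of all three measures using Python's built-in statistics module (the data values are made up for illustration):

    from statistics import mean, median, mode

    data = [2, 3, 3, 5, 7, 10]

    print(mean(data))    # (2 + 3 + 3 + 5 + 7 + 10) / 6 = 5
    print(median(data))  # average of the two middle values (3 and 5) = 4.0
    print(mode(data))    # 3 appears most often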

(b) Measure of Variability – A measure of variability, also known as a measure of dispersion, is used to describe the variability in a sample or population. In statistics, there are three common measures of variability, as shown below:

  • (i) Range : It measures how spread apart the values in a sample set or data set are.

    Range = Maximum value - Minimum value

  • (ii) Variance : It describes how much a random variable differs from its expected value. It is computed as the average of the squared deviations from the mean:

    Variance = Σ (x_i - x̄)² / n

In this formula, n represents the total number of data points, x̄ represents the mean of the data points, and x_i represents an individual data point.

  • (iii) Standard Deviation : It measures the dispersion of a set of data from its mean and is computed as the square root of the variance.
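The same sample set can illustrate all three measures of variability; a minimal sketch using the population forms from Python's statistics module (the data values are made up for illustration):

    from statistics import pvariance, pstdev

    data = [2, 3, 3, 5, 7, 10]

    print(max(data) - min(data))  # range: 10 - 2 = 8
    print(pvariance(data))        # mean of squared deviations from the mean, ~7.67
    print(pstdev(data))           # square root of the variance, ~2.77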

2. Inferential Statistics : Inferential statistics makes inferences and predictions about a population based on a sample of data taken from that population. It generalizes from a dataset and applies probability to draw conclusions. It is used to explain the meaning of descriptive statistics and to analyze data, interpret results, and draw conclusions. Inferential statistics is mainly related to and associated with hypothesis testing, whose main target is to decide whether a null hypothesis can be rejected.

Hypothesis testing is an inferential procedure that uses sample data to evaluate and assess the credibility of a hypothesis about a population. Inferential statistics is generally used to determine how strong a relationship is within the sample. However, it is often very difficult to obtain a full population list and draw a truly random sample.

Inferential statistics can be carried out through the steps given below:

  1. Obtain and start with a theory.

  2. Generate a research hypothesis.

  3. Operationalize the variables.

  4. Identify the population to which the results of the study should apply.

  5. Form a null hypothesis for this population.

  6. Collect a sample from the population and run the study.

  7. Perform statistical tests to check whether the observed characteristics of the sample are sufficiently different from what would be expected under the null hypothesis, so that the null hypothesis can be rejected.

Types of inferential statistics – Various types of inferential statistics are widely used nowadays and are easy to interpret. Two common ones are given below:

  • One sample test of difference/One sample hypothesis test

  • Contingency Tables and Chi-Square Statistic
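As a rough illustration of the first type, here is a minimal sketch of a one-sample t-test with scipy.stats, testing whether a sample mean differs from a hypothesized population mean of 50 (the sample values are made up for illustration):

    from scipy.stats import ttest_1samp

    # sample observations (illustrative data)
    sample = [51.2, 49.8, 52.3, 50.1, 48.9, 53.0, 50.7, 49.5]
    stat, p_value = ttest_1samp(sample, popmean=50)

    # a small p-value (commonly < 0.05) is taken as evidence to reject the null hypothesis
    print(stat, p_value)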


Introduction to Statistical Data Distributions

A distribution is simply a collection of data, or scores, on a variable. Generally, these scores are arranged in order from smallest to largest, and they can then be presented graphically. Much data complies with well-known and well-understood mathematical functions.

A function can usually fit the data with some modification of its parameters. Once the distribution function is known and identified, it can be used as shorthand for describing and calculating related quantities, such as the likelihood of observations, and for plotting the relationship between observations in the domain.

Distributions are generally described in terms of their density functions. Density functions describe how the proportion of data, or the likelihood of observations, changes over the range of the distribution. Density functions are of two types:

  • Probability Density Function (PDF) – It calculates the relative likelihood of observing a given value.

  • Cumulative Distribution Function (CDF) – It calculates the probability of an observation being less than or equal to a given value.

Both PDFs and CDFs are continuous functions. For a discrete distribution, the equivalent of the PDF is called the Probability Mass Function (PMF).
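A short sketch with scipy.stats makes the distinction concrete (the evaluation points are chosen only for illustration):

    from scipy.stats import norm, binom

    # PDF: relative likelihood of the value 0.5 under a standard normal
    print(norm.pdf(0.5))              # ~0.352
    # CDF: probability of an observation less than or equal to 0.5
    print(norm.cdf(0.5))              # ~0.691
    # PMF: probability of exactly 3 successes in 10 trials with p = 0.5 (discrete)
    print(binom.pmf(3, n=10, p=0.5))  # ~0.117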

Types of Statistical Data Distributions :

  1. Gaussian Distribution

    It is named after Carl Friedrich Gauss. The Gaussian distribution is the focus of much of the field of statistics and is also known as the Normal Distribution. Data from many different fields of study can be described with it. A Gaussian distribution is generally described using two parameters:

    • Mean : It is denoted with the Greek lowercase letter mu (μ). It is the expected value of the distribution.

    • Variance : It is denoted with the Greek lowercase letter sigma raised to the second power (σ²), because the units of the variable are squared. It describes the spread of observations from the mean.

      It is very common to use a normalized calculation of the variance called the Standard Deviation, denoted with the Greek lowercase letter sigma (σ). It describes the normalized spread of observations from the mean.

    Example – The example given below creates a Gaussian PDF with a sample space from -5 to 5, a mean of 0, and a standard deviation of 1. A Gaussian with these values of mean and standard deviation is called the Standard Gaussian.

    Python Code for Line Plot of Gaussian Probability Density Function :

    # plot the gaussian pdf
    from numpy import arange
    from matplotlib import pyplot
    from scipy.stats import norm

    # define the distribution parameters
    sample_space = arange(-5, 5, 0.001)
    mean = 0.0
    stdev = 1.0

    # calculate the pdf over the sample space and plot it
    pdf = norm.pdf(sample_space, mean, stdev)
    pyplot.plot(sample_space, pdf)
    pyplot.show()

    When we run the above example, it creates a line plot that shows the sample space on the x-axis and the likelihood of each value on the y-axis. The line plot shows the familiar bell shape of the Gaussian distribution.

    In this plot, the top of the bell marks the expected value, or mean, which in this case is zero, as we specified when creating the distribution.

  2. T-Distribution

    It is named after William Sealy Gosset. The t-distribution arises when we attempt to estimate the mean of a normal distribution using samples of different sizes. It is very helpful for describing the uncertainty, or error, involved in estimating population statistics for data drawn from a Gaussian distribution when the sample size must be taken into account. The t-distribution can be described using a single parameter:

    • Number of Degrees of Freedom : It is denoted with the Greek lowercase letter nu (ν). It denotes the number of pieces of information used to describe a population quantity.

    Example – The example given below creates a t-distribution with a sample space from -5 to 5 and 10,000 - 1 degrees of freedom.

    Python Code for Line Plot of Student’s t-distribution Probability Density Function :

    # plot the t-distribution pdf
    from numpy import arange
    from matplotlib import pyplot
    from scipy.stats import t

    # define the distribution parameters
    sample_space = arange(-5, 5, 0.001)
    dof = len(sample_space) - 1

    # calculate the pdf over the sample space and plot it
    pdf = t.pdf(sample_space, dof)
    pyplot.plot(sample_space, pdf)
    pyplot.show()

    When we run the above example, it creates and plots the t-distribution PDF.

    You can see a bell shape similar to the normal distribution. The main difference is the fatter tails of the distribution, highlighting the increased likelihood of observations in the tails compared to the Gaussian distribution.

If you are part of the machine learning world, you will certainly be familiar with the central limit theorem, commonly known as the CLT. This theorem makes a data scientist's life simpler, yet it is often misunderstood and is frequently confused with the law of large numbers. It is a powerful statistical concept that every data scientist MUST know. Now, why is that? Let's see!

Why should you read this subsection? After completing it, you will know:

  • The CLT describes the shape of the distribution of sample means as a Gaussian, commonly known as the normal distribution, which is very popular in statistics.

  • An example of simulated dice rolls in Python to demonstrate the central limit theorem.

  • And finally, how, with the CLT, knowledge of the Gaussian distribution is used to make inferences about model performance in applied machine learning.

Before going to the statistical definition, let's consider an example to understand it better.

Consider a Computer Science class of 200 students. Our task is to calculate the average marks scored by the class in the Data Structures subject. Sounds simple, right?

One way is to take every student's marks, sum them all, and then divide by the total number of students to find the average.

But what if the size of the data is massive? Say you are asked to study the entire population of computer science students in India. Considering each student's marks would be a tiresome and long process. So what can be an alternative approach?

First, draw groups of students at random from the class (the population); these groups are called samples. We will draw multiple samples, each consisting of 30 students.

Note: Why 30? The CLT holds regardless of whether the source population is normal or skewed, provided the sample size is sufficiently large (usually n > 30). If the population has a normal distribution, you can use a sample size of less than 30 and the CLT will still hold, giving a true representation of the population.

[Figure: depiction of how the distribution of marks changes with sample size]

  • Calculate the individual mean of each of these samples.

  • Calculate the mean of these sample means.

  • This value gives us the approximate mean marks of the students in the computer science department.

  • Additionally, the histogram of the sample mean marks will resemble a bell curve (a normal distribution).

The central limit theorem in statistics states that, given a sufficiently large sample size, the sampling distribution of the mean for a variable will approximate a normal distribution regardless of that variable’s distribution in the population.

Unpacking the definition: irrespective of the population distribution, if the sample size is sufficient, the distribution of sample means calculated from repeated sampling will tend toward normality as the size of the samples grows.

It is important that each trial that results in an observation be independent and performed in the same way. This is to ensure that the sample is drawing from the same underlying population distribution. More formally, this expectation is referred to as independent and identically distributed, or iid.
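The dice-roll simulation promised earlier is a minimal sketch of this idea: each roll of a fair die is iid and the underlying distribution is uniform, not Gaussian, yet the sample means form a bell curve (the sample count and sample size are chosen only for illustration):

    # simulate the CLT with dice rolls
    from numpy import mean
    from numpy.random import randint, seed
    from matplotlib import pyplot

    seed(1)
    # 1,000 samples, each the mean of 50 fair six-sided die rolls
    means = [mean(randint(1, 7, 50)) for _ in range(1000)]

    # the histogram of sample means approximates a Gaussian centered near 3.5
    pyplot.hist(means, bins=30)
    pyplot.show()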

Firstly, the central limit theorem is impressive, especially as it holds no matter the shape of the population distribution from which we are drawing samples. It demonstrates that the distribution of errors from estimating the population mean fits a distribution that the field of statistics knows a lot about.

Secondly, this estimate of the Gaussian distribution becomes more accurate as the size of the samples drawn from the population increases. This means that if we use our knowledge of the Gaussian distribution to make inferences about the means of samples drawn from a population, these inferences become more useful as the sample size increases.
