Padhai Time

Distribution shapes

Another way of analyzing your data is to check the shape of its distribution.

First, let us understand what a distribution is.

A distribution summarizes the data by showing the possible values the data can take and how frequently each of those values occurs.

Didn't get it? No issues! Take a look at the example below:

When you roll a die, there are only 6 possible values (1, 2, 3, 4, 5, 6) that can come up.
When you flip a coin, there are only 2 possible values (Heads, Tails) that can come up.

Also, for a fair die or a fair coin, each outcome is equally likely.

When we flip a coin 10 times,

we may get Heads 4 times and Tails 6 times
Or, we may get Heads 5 times and Tails 5 times
Or, we may get Heads 6 times and Tails 4 times

But the chance of getting Heads 10 times and Tails 0 times is very low (for a fair coin, 1 in 1,024).

So, we can plot these outcomes and their frequencies on a chart, and this outcome-frequency representation is called a Distribution.
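As a minimal sketch of the coin example above, the snippet below computes how likely each number of Heads is in 10 flips of a fair coin and prints a rough text chart. It uses the binomial formula, which this article does not introduce, and the function name heads_distribution is a hypothetical helper chosen only for illustration.

```python
from math import comb


def heads_distribution(n_flips=10, p_heads=0.5):
    """Return {number of Heads: probability} for n_flips coin flips.

    Uses the binomial formula; p_heads=0.5 assumes a fair coin.
    """
    return {
        k: comb(n_flips, k) * p_heads**k * (1 - p_heads) ** (n_flips - k)
        for k in range(n_flips + 1)
    }


dist = heads_distribution()

# Print each outcome with its probability and a simple bar
for k, p in dist.items():
    print(f"{k:2d} Heads: {p:.4f} {'#' * round(p * 100)}")
```

The bars are tallest around 5 Heads and almost vanish at 0 and 10 Heads, which is the outcome-frequency picture described above.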

Now let us understand what distribution shape means.

When we plot outcome-frequency charts, the resulting distribution takes on some shape. Checking these distribution shapes helps a business make the right decisions. Let us look at the following topics one by one:

Modality
Skewness
Kurtosis
Central Limit Theorem

1) Modality:

Modality refers to the modes of a distribution, and we have already learnt about the mode (the value that occurs most often in the data). On a chart, each mode shows up as a peak.

Unimodal: Distribution which has just one peak

More passengers travel in “Economy” class than in “Business” or “Business Economy” class, so “Economy” is the single peak and the distribution is unimodal.

Bimodal: Distribution which has two peaks

Cinema halls are booked mostly on weekends (Saturday and Sunday)

9 to 10 AM and 6 to 7 PM are the two peak hours when traffic on the roads is highest

Multimodal: Distribution which has multiple peaks

Uniform Distribution: Distribution which has no peak

E.g. a fair die or a fair coin. All the possible outcomes have an equal chance of occurring.
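The modality types above can be sketched as a small peak counter. This is only an illustration: count_peaks is a hypothetical helper, a "peak" is taken to mean a frequency strictly higher than its neighbours, and all the frequency values are made up for the example.

```python
def count_peaks(freqs):
    """Count entries that are strictly higher than both neighbours.

    Entries at the ends only need to beat their single neighbour.
    """
    peaks = 0
    for i, f in enumerate(freqs):
        left = freqs[i - 1] if i > 0 else float("-inf")
        right = freqs[i + 1] if i < len(freqs) - 1 else float("-inf")
        if f > left and f > right:
            peaks += 1
    return peaks


# Hypothetical frequencies, invented for illustration only
travel_class = [700, 200, 100]                               # Economy dominates
hourly_traffic = [20, 35, 90, 40, 30, 30, 45, 95, 50, 25]    # morning and evening rush
fair_die = [10, 10, 10, 10, 10, 10]                          # every face equally frequent

print("travel class peaks:", count_peaks(travel_class))      # unimodal
print("hourly traffic peaks:", count_peaks(hourly_traffic))  # bimodal
print("fair die peaks:", count_peaks(fair_die))              # uniform, no peak
```

Real data is noisy, so a strict neighbour comparison like this would usually report many small spurious peaks. In practice the frequencies are smoothed or binned before counting.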

2) Skewness:

Skewness describes how symmetric or asymmetric the probability distribution of a variable is. As we have seen, a unimodal distribution has a single peak, but such distributions can still differ in shape: some unimodal distributions are symmetric and some are not.

Symmetric Distribution:

A distribution whose left half is the mirror image of its right half is known as a Symmetric Distribution.

Uniform, unimodal, bimodal and multimodal distributions can all be symmetric.

Now let us understand what an asymmetric distribution is.

A distribution whose left half is not a mirror image of its right half.

Skewness is a measure of the asymmetry of the probability distribution of a random variable. There are 3 categories:

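Zero skewed: the distribution is symmetric
Positively skewed (right skewed): the longer tail is on the right side
Negatively skewed (left skewed): the longer tail is on the left side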

The mode is the value with the highest frequency, so it sits at the peak of the distribution.

The mean is affected by outliers, so if the distribution is right skewed, the mean is pulled towards the right tail, and vice versa.

The median typically lies somewhere between the mode and the mean.
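The relationship between skewness, mean and median can be checked numerically. The sketch below assumes NumPy and SciPy are installed. Using an exponential sample as the right-skewed data is an assumption made for illustration, not something taken from this article.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)  # fixed seed so the run is repeatable

# Assumption: an exponential sample stands in for right-skewed data
right_skewed = rng.exponential(scale=2.0, size=100_000)
left_skewed = -right_skewed  # mirroring moves the long tail to the left
symmetric = rng.normal(loc=0.0, scale=2.0, size=100_000)

for name, data in [
    ("right skewed", right_skewed),
    ("left skewed", left_skewed),
    ("symmetric", symmetric),
]:
    print(
        f"{name:>12}: skewness={skew(data):+.2f}, "
        f"mean={np.mean(data):+.2f}, median={np.median(data):+.2f}"
    )
```

For the right-skewed sample the mean comes out larger than the median, for the left-skewed sample it comes out smaller, and for the symmetric sample the two are nearly equal.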

3) Kurtosis:

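Kurtosis describes how heavy the tails of a distribution are, which reflects the presence of outliers in your data. When there are more outliers, the tails of the distribution become heavier; when there are fewer outliers, the tails become lighter.

There are three major types of distributions, based on their kurtosis value. Kurtosis here means excess kurtosis, measured relative to the normal distribution, which is what scipy's kurtosis function returns by default:

Leptokurtic: kurtosis is positive

Mesokurtic: kurtosis is 0

Platykurtic: kurtosis is negative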

1) Mesokurtic:

Mesokurtic distributions have kurtosis close to zero. The normal distribution is the standard example.

2) Leptokurtic:

Leptokurtic distributions have higher kurtosis than mesokurtic distributions: their tails are thicker because of a larger number of outliers. They also typically have a higher peak than a mesokurtic curve.

3) Platykurtic:

Platykurtic distributions have lower kurtosis than mesokurtic distributions: their tails are thinner because of a smaller number of outliers. They also typically have a lower peak than a mesokurtic curve.

Python Exercise: Calculating Skewness and Kurtosis

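import numpy as np
import matplotlib.pyplot as plt  # needed for plt.title and plt.show
import seaborn as sns
from scipy.stats import kurtosis, skew

# Large sample from a normal (mesokurtic) distribution:
# skewness and kurtosis should both be close to 0
x = np.random.normal(0, 2, 100000)
sns.displot(x)
plt.title("Skewness: {0} Kurtosis: {1}".format(round(skew(x), 4), round(kurtosis(x), 4)))
plt.show()

# Small sample of only 50 points from the same distribution:
# the estimates are much noisier
x = np.random.normal(0, 2, 50)
sns.displot(x)
plt.title("Skewness: {0} Kurtosis: {1}".format(round(skew(x), 4), round(kurtosis(x), 4)))
plt.show()

# Another sample of 50 points: the values change again from run to run
x = np.random.normal(0, 2, 50)
sns.displot(x)
plt.title("Skewness: {0} Kurtosis: {1}".format(round(skew(x), 4), round(kurtosis(x), 4)))
plt.show()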
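The exercise above samples only from a normal distribution. As a hedged extension, the sketch below compares the three kurtosis types from section 3. The choice of the Laplace distribution as the leptokurtic example and the uniform distribution as the platykurtic example is an assumption made for illustration.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
n = 200_000

# Assumption: these three distributions are picked as textbook examples
samples = {
    "normal (mesokurtic)": rng.normal(size=n),
    "Laplace (leptokurtic)": rng.laplace(size=n),
    "uniform (platykurtic)": rng.uniform(-1, 1, size=n),
}

# scipy's kurtosis returns excess kurtosis by default (normal -> about 0)
results = {name: kurtosis(data) for name, data in samples.items()}

for name, value in results.items():
    print(f"{name:>22}: kurtosis = {value:+.3f}")
```

The normal sample lands near 0, the Laplace sample is clearly positive (heavy tails) and the uniform sample is clearly negative (no tails at all).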

4) Central Limit Theorem:

According to the Central Limit Theorem, the sampling distribution of the sample means tends towards a normal distribution as the sample size increases. Whether or not the population distribution is normal, the sampling distribution of the sample means will be approximately normal once the sample size is large enough.

As a rule of thumb, the sample size should be greater than or equal to 30. If your sample size is smaller than 30, the population distribution itself should be normal for the sampling distribution to be normal.

Population Distribution:

When we plot all the data points of the population on a histogram, the distribution we get is the population distribution. This distribution may or may not be normal.

Sample:

When we choose some data points at random from the population, the chosen set is called a sample.

Sampling Distribution of sample means:

When we take one sample of size ‘n’ from the population and calculate its mean, this is our first observation.
When we take a second sample of size ‘n’ from the population and calculate its mean, this is our second observation.
Similarly, when we keep taking samples of size ‘n’ one by one until we have ‘K’ samples and calculate the mean of each, we are left with ‘K’ sample means.
Now when we plot these ‘K’ sample means on a histogram, the distribution we get is called the Sampling Distribution of sample means.
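The steps above can be sketched with NumPy as follows. The population here is a hypothetical, deliberately non-normal (exponential) data set, and sample_means is an illustrative helper name, so treat this as a sketch under those assumptions.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(7)  # fixed seed so the run is repeatable

# Hypothetical population: exponential, so clearly not normal
population = rng.exponential(scale=10.0, size=100_000)


def sample_means(population, n, k, rng):
    """Draw k samples of size n (with replacement) and return their k means."""
    samples = rng.choice(population, size=(k, n), replace=True)
    return samples.mean(axis=1)


means = sample_means(population, n=40, k=10_000, rng=rng)

print(f"population mean      : {population.mean():.3f}")
print(f"mean of sample means : {means.mean():.3f}")
print(f"population skewness  : {skew(population):.3f}")
print(f"sample-mean skewness : {skew(means):.3f}")
```

Even though the population is strongly right skewed, the sample means are far closer to symmetric, and their average sits close to the population mean.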

We have considered a data set of marks for 1000 students. This data set can be downloaded from this link

This is the distribution of the students' marks given in the CSV. Their mean score is 66.089

Now suppose we are asked to find the mean score of the students, but we cannot use the complete data set. We can only draw samples, yet we need to come up with one reliable estimate of the mean.

So we are going to use the Central Limit Theorem for this scenario.

Draw multiple samples from the population with replacement, calculate the mean of each sample, and finally take the average of your sample means. This gives us one reliable estimate that will be close to the population mean.

Let us look at the distribution of the sample means

It looks somewhat normally distributed. As we keep increasing the number of iterations (the number of samples drawn), the histogram becomes smoother and looks closer to a normal curve. Look at the image below, where we have drawn 10,000 samples of size 40 each.

Here too the standard deviation is the same, 2.4, but the distribution looks normal.

If you want to decrease the standard deviation of the sample means (that is, to gain more confidence in the statistic), you need to increase the sample size. Look at the image below, where we have increased the sample size from 40 to 100 and the standard deviation has come down from 2.4 to just 1.53.

So we can use the Central Limit Theorem when we don’t have access to the complete population, and we can try multiple combinations of sample size and number of iterations to refine our final statistic.
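The drop from 2.4 to about 1.53 is what the standard formula for the standard deviation of a sample mean predicts: it equals the population standard deviation divided by the square root of the sample size, so going from 40 to 100 scales it by sqrt(40/100), and 2.4 × 0.632 ≈ 1.52. The sketch below checks this on simulated data. The original marks file is not used, so the marks array is a hypothetical stand-in, and sd_of_sample_means is an illustrative helper name.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the run is repeatable

# Hypothetical stand-in for the marks of 1000 students (not the real data set)
marks = rng.normal(loc=66, scale=15, size=1000)
sigma = marks.std()  # population standard deviation


def sd_of_sample_means(population, n, k=10_000):
    """Standard deviation of k sample means, each from a sample of size n."""
    means = rng.choice(population, size=(k, n), replace=True).mean(axis=1)
    return means.std()


sd_40 = sd_of_sample_means(marks, 40)
sd_100 = sd_of_sample_means(marks, 100)

for n, observed in [(40, sd_40), (100, sd_100)]:
    expected = sigma / np.sqrt(n)  # sigma / sqrt(n)
    print(f"n={n:3d}: observed SD = {observed:.3f}, sigma/sqrt(n) = {expected:.3f}")
```

Increasing the number of iterations k does not shrink this standard deviation. It only makes the histogram of sample means smoother. Only a larger sample size n narrows it.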
