Another way of analyzing your data is to check the shape of its distribution.
First, let us understand what a distribution is.
A distribution summarizes the data by showing the possible values the data can take and how frequently each of those values occurs.
Didn't get it? No issues! Take a look at the example below:
For a die or a coin, each outcome is equally likely.
When we toss a coin 10 times, we can get anywhere from 0 to 10 Heads.
But the chance of getting Heads all 10 times and Tails 0 times is very low (1 in 1024 for a fair coin).
So we can plot these outcomes and their frequencies in a chart, and this outcome-frequency representation is called a distribution.
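To make the coin example concrete, here is a minimal sketch that computes the exact outcome-frequency table for 10 tosses. It assumes a fair coin; the variable names are ours, chosen for illustration.

```python
from math import comb

n_tosses = 10  # number of coin tosses, as in the example above

# Probability of each possible number of Heads for a fair coin.
# comb(n, k) counts the toss sequences with exactly k Heads, and every
# individual sequence has probability (1/2) ** n.
distribution = {
    heads: comb(n_tosses, heads) / 2 ** n_tosses
    for heads in range(n_tosses + 1)
}

for heads, prob in distribution.items():
    print(f"{heads:2d} Heads, {n_tosses - heads:2d} Tails: {prob:.4f}")
```

The printed table is the distribution: 5 Heads and 5 Tails is the most likely outcome, while 10 Heads and 0 Tails sits at the far end with a probability of 1/1024.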
Now let us understand what distribution shape means.
When we plot outcome-frequency charts, the resulting distribution follows some shape. Checking these distribution shapes helps a business make the right decisions. Let us look at the various types of distribution shapes one by one:
1) Modality:
Modality refers to the mode, which we have already learnt about (it is the observation that occurs most often in the data). Modality describes how many peaks a distribution has.
Unimodal: Distribution which has just one peak
E.g. More passengers travel in “Economy” class than in “Business” or “Business Economy” class, so “Economy” is the single peak and the distribution is unimodal
Bimodal: Distribution which has two peaks
E.g. Cinema halls are booked mostly on weekends (Saturday and Sunday)
E.g. 9 to 10 AM and 6 to 7 PM are the two peak hours when road traffic is highest
Multimodal: Distribution which has multiple peaks
Uniform Distribution: Distribution which has no peak
E.g. A die or a coin. All the possible outcomes have an equal chance of occurrence
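The idea of counting peaks can be sketched in a few lines. The helper below, count_peaks, is a hypothetical function written for this illustration, and the frequency counts are made up to mirror the examples above. Real data is noisy, so in practice the counts would usually need smoothing before peaks are counted.

```python
def count_peaks(frequencies):
    """Count local maxima: bars strictly higher than both neighbours."""
    padded = [0] + list(frequencies) + [0]  # treat the edges as empty bars
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
    )

# Made-up frequency counts, for illustration only.
class_bookings = [70, 20, 10]                      # Economy, Business Economy, Business
hourly_traffic = [5, 20, 60, 25, 10, 30, 65, 20]   # morning and evening rush hours
dice_rolls = [100, 100, 100, 100, 100, 100]        # every face equally frequent

print(count_peaks(class_bookings))  # one peak: unimodal
print(count_peaks(hourly_traffic))  # two peaks: bimodal
print(count_peaks(dice_rolls))      # no peak: uniform
```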
2) Skewness
Skewness describes whether the probability distribution of a variable is symmetric or asymmetric. As we have seen, a unimodal distribution has a single peak, but its shape can still differ: some unimodal distributions are symmetric and some are not.
Symmetric Distribution:
A distribution in which the left half is the mirror image of the right half is known as a symmetric distribution.
Uniform, unimodal, bimodal and multimodal distributions can all be symmetric in nature.
Now let us understand what an asymmetric distribution is.
A distribution whose left half is not a mirror image of its right half is an asymmetric distribution.
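Skewness measures the asymmetry of the probability distribution of a random variable. There are 3 categories: zero skew (symmetric), positive skew (right-skewed, with a longer right tail) and negative skew (left-skewed, with a longer left tail).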
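The mode is the highest-frequency value, so it sits at the peak of the distribution.
The mean is affected by outliers, so if the distribution is right-skewed the mean shifts towards the right, and if it is left-skewed the mean shifts towards the left.
In a skewed unimodal distribution, the median typically lies somewhere between the mode and the mean.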
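The relationship between skewness, mean and median can be checked on simulated data. This is a minimal sketch: the exponential and normal populations are our own choice of example, not data from this article, and the seed is fixed only to make the run repeatable.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Illustrative populations (assumed for this sketch):
right_skewed = rng.exponential(scale=2.0, size=100_000)  # long right tail
left_skewed = -right_skewed                              # mirrored: long left tail
symmetric = rng.normal(loc=0.0, scale=2.0, size=100_000)

for name, data in [
    ("right-skewed", right_skewed),
    ("left-skewed", left_skewed),
    ("symmetric", symmetric),
]:
    print(
        f"{name:13s} mean={np.mean(data):7.3f} "
        f"median={np.median(data):7.3f} skewness={skew(data):7.3f}"
    )
```

In the right-skewed sample the mean is pulled above the median and the skewness is positive. In the mirrored sample the mean falls below the median and the skewness is negative.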
3) Kurtosis:
Kurtosis describes the presence of outliers in your data. When there are more outliers, the tails of the distribution become heavier; when there are fewer outliers, the tails become lighter.
There are three major types of distributions, classified by their excess kurtosis (kurtosis measured relative to a normal distribution):
Leptokurtic: excess kurtosis is positive
Mesokurtic: excess kurtosis is 0
Platykurtic: excess kurtosis is negative
1) Mesokurtic:
Mesokurtic distributions have excess kurtosis close to zero. The normal distribution is the standard example.
2) Leptokurtic:
Leptokurtic distributions have higher kurtosis than mesokurtic distributions: their tails are thick because of a high number of outliers. They also tend to have a higher peak than a mesokurtic curve.
3) Platykurtic:
Platykurtic distributions have lower kurtosis than mesokurtic distributions: their tails are thin because of a low number of outliers. They also tend to have a lower peak than a mesokurtic curve.
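The three types can be compared numerically. In this sketch the Laplace and uniform distributions are our own choice of leptokurtic and platykurtic examples, and scipy.stats.kurtosis is used with its default setting, which returns excess kurtosis (0 for a normal distribution).

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
n = 200_000

# Example distributions chosen for illustration (an assumption of this sketch).
samples = {
    "normal (mesokurtic)": rng.normal(size=n),
    "Laplace (leptokurtic)": rng.laplace(size=n),
    "uniform (platykurtic)": rng.uniform(-1, 1, size=n),
}

# kurtosis() defaults to Fisher's definition, i.e. excess kurtosis.
excess = {name: kurtosis(data) for name, data in samples.items()}

for name, value in excess.items():
    print(f"{name:22s} excess kurtosis = {value:6.3f}")
```

The heavy-tailed Laplace sample gives a clearly positive value, the uniform sample a clearly negative one, and the normal sample a value near zero.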
Python Exercise: Calculating Skewness and Kurtosis
import numpy as np
import matplotlib.pyplot as plt  # needed for plt.title and plt.show below
import seaborn as sns
from scipy.stats import kurtosis, skew

# Large sample from a normal (mesokurtic) distribution: skewness and kurtosis are both close to 0
x = np.random.normal(0, 2, 100000)
sns.displot(x)
plt.title("Skewness: {0} Kurtosis: {1}".format(round(skew(x), 4), round(kurtosis(x), 4)))

# Small sample (50 points) from the same distribution: the estimates are noisier
x = np.random.normal(0, 2, 50)
sns.displot(x)
plt.title("Skewness: {0} Kurtosis: {1}".format(round(skew(x), 4), round(kurtosis(x), 4)))

# Another 50 points: the skewness and kurtosis values change from draw to draw
x = np.random.normal(0, 2, 50)
sns.displot(x)
plt.title("Skewness: {0} Kurtosis: {1}".format(round(skew(x), 4), round(kurtosis(x), 4)))
plt.show()
4) Central Limit Theorem:
According to the Central Limit Theorem, the sampling distribution of the sample means tends towards a normal distribution as the sample size increases. Whether or not the population distribution is normal, the sampling distribution of the sample means will be approximately normal.
As a rule of thumb, this holds when the sample size is greater than or equal to 30. If the sample size is less than 30, the population distribution itself should be normal for the sampling distribution to be normal.
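The effect of sample size can be seen by measuring how skewed the sample means are. The sketch below assumes a strongly right-skewed (exponential) population, which is our own choice for illustration; sample_mean_skewness is a hypothetical helper, and the sample sizes 5, 30 and 100 are picked to bracket the rule of thumb.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

def sample_mean_skewness(sample_size, n_samples=20_000):
    """Skewness of the means of n_samples samples from an exponential population."""
    samples = rng.exponential(scale=1.0, size=(n_samples, sample_size))
    return skew(samples.mean(axis=1))

# Skewness of the sampling distribution for a few sample sizes.
skew_by_size = {n: sample_mean_skewness(n) for n in (5, 30, 100)}

for n, value in skew_by_size.items():
    print(f"sample size {n:3d}: skewness of sample means = {value:.3f}")
```

The skewness of the sample means shrinks towards 0 as the sample size grows, meaning the sampling distribution moves closer to a symmetric, normal shape even though the population itself is far from normal.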
Population Distribution:
When we plot all the data points on a histogram, the distribution we get is the population distribution. This distribution may or may not be normal.
Sample:
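When we choose some random data points from the population, the chosen set is called a sample.
Sampling Distribution of sample means:
When we draw many samples from the population, calculate the mean of each sample and plot those means, the distribution we get is the sampling distribution of the sample means.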
We have considered a data set of marks for 1000 students. This data set can be downloaded from this link
This is the distribution of the students' marks given in the CSV. Their mean score is 66.089.
Now suppose we are asked to find the mean score of the students, but we cannot use the complete data set. We can only draw samples, yet we need to come up with one reliable estimate of the mean.
So we are going to use the Central Limit Theorem for this scenario.
Draw multiple samples from the population with replacement, calculate the mean of each sample, and finally take the average of the sample means. This gives us one reliable estimate that will be close to the population mean.
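The procedure above can be sketched as follows. The students' CSV is not reproduced here, so the population is a synthetic stand-in of 1000 made-up marks, and sample_means is a hypothetical helper named for this illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for the marks of 1000 students (NOT the article's data set).
population = rng.uniform(40, 100, size=1000).round()

def sample_means(population, sample_size, n_samples, rng):
    """Draw n_samples samples with replacement and return the mean of each."""
    draws = rng.choice(population, size=(n_samples, sample_size), replace=True)
    return draws.mean(axis=1)

means = sample_means(population, sample_size=40, n_samples=10_000, rng=rng)
estimate = means.mean()  # average of the sample means

print(f"population mean:         {population.mean():.3f}")
print(f"average of sample means: {estimate:.3f}")
print(f"std of sample means:     {means.std():.3f}")
```

With real data, population would be replaced by the column of marks loaded from the file. The average of the sample means lands very close to the population mean even though each sample sees only 40 values.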
Let us look at the distribution of the sample means.
It looks somewhat normally distributed. As we keep increasing the number of iterations, the histogram gets smoother and closer to a normal curve. Look at the image below, where we have drawn 10,000 samples of size 40 each.
Here the standard deviation is still 2.4, but the distribution looks normal.
If you want to decrease the standard deviation (that is, to gain more confidence in the statistic), you need to increase the sample size. Look at the image below, where we have increased the sample size from 40 to 100 and the standard deviation has come down from 2.4 to 1.53.
So we can use the Central Limit Theorem when we don’t have access to the complete population. And we can try multiple combinations of sample size and number of iterations to refine our final statistic.
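The drop from 2.4 to roughly 1.5 is what the standard error formula predicts: the standard deviation of the sample means is the population standard deviation divided by the square root of the sample size. A quick check, using only the 2.4 figure quoted above (the variable names are ours):

```python
from math import sqrt

se_at_40 = 2.4  # standard deviation of the sample means at sample size 40 (from the text)

# Standard error = population_sd / sqrt(sample_size), so the population
# standard deviation implied by the figure above is:
implied_population_sd = se_at_40 * sqrt(40)

# Predicted standard deviation of the sample means at sample size 100.
se_at_100 = implied_population_sd / sqrt(100)

print(round(implied_population_sd, 2))
print(round(se_at_100, 2))
```

The predicted value is about 1.52, close to the 1.53 observed in the simulation. The small gap is expected, since the observed figure comes from random sampling.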