Padhai Time

Graphical Visualizations Part III

7) Q-Q plot

Q-Q (quantile-quantile) plot helps us to analyze and compare two probability distributions by plotting their quantiles against each other.

Suppose we have one variable and need to check whether it is uniformly distributed, normally distributed, or follows some other distribution. A Q-Q plot answers this question directly.

If the two distributions are very similar, the data points will fall close to a 45-degree straight line. If the data set has outliers or skewness, the data points will move away from that line.

To draw the data points on a Q-Q plot, we first sort the data in ascending order and then divide it into quantiles. If the two samples come from the same distribution, the items at a given position in the sorted samples will tend to be "similar".

For example, the 10% quantile of the first sample will be close to the 10% quantile of the second sample. Plot the quantiles of one sample on the x axis and those of the other on the y axis. If the samples have been generated from the same distribution, these data points will fall approximately on a straight line.
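The quantile matching described above can be sketched by hand with NumPy. This is only an illustration: the sample sizes, the seed and the names sample_a, sample_b and probs are assumptions made for the example, not part of the original walkthrough.

```python
import numpy as np

rng = np.random.default_rng(0)      # seed chosen only for reproducibility
sample_a = rng.normal(0, 1, 500)    # first sample
sample_b = rng.normal(0, 1, 300)    # second sample, same distribution, different size

# Evaluate both samples at the same quantile levels: 5%, 10%, ..., 95%
probs = np.linspace(0.05, 0.95, 19)
q_a = np.quantile(sample_a, probs)  # x coordinates of the Q-Q points
q_b = np.quantile(sample_b, probs)  # y coordinates of the Q-Q points

# Each (q_a[i], q_b[i]) pair is one point on the Q-Q plot
for p, xa, xb in zip(probs, q_a, q_b):
    print(f"{p:.2f}  {xa:+.3f}  {xb:+.3f}")

# If the samples share a distribution, the pairs are almost perfectly correlated
corr = np.corrcoef(q_a, q_b)[0, 1]
print("correlation of quantile pairs:", round(corr, 4))
```

Working with quantile levels rather than raw sorted positions is what allows two samples of different sizes to be compared.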

data_values_1 = np.random.normal(0, 1, 10) # Mean 0, Std_dev 1

Here we generate 10 random values from a normal distribution with mean 0 and standard deviation 1.

We can now check whether the values in the array "data_values_1" follow a normal distribution by drawing a Q-Q plot against the built-in distribution "stats.distributions.norm".

If we expect the data values to follow a uniform distribution instead, we can compare them against the built-in "stats.distributions.uniform" distribution.
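One caveat worth illustrating: the 45-degree line is only the right reference when the data and the reference distribution share the same location and scale. The sketch below uses an assumed sample with mean 5 and standard deviation 2 (values picked purely for illustration) and fits a straight line through the Q-Q points with np.polyfit. The plotting positions (i - 0.5) / n are one common convention, assumed here.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(3)                 # seed chosen only for reproducibility
shifted = rng.normal(5, 2, 1000)               # hypothetical data: mean 5, std 2

n = len(shifted)
probs = (np.arange(1, n + 1) - 0.5) / n        # assumed plotting positions
theoretical_q = stats.norm.ppf(probs)          # standard normal quantiles

# Raw data: the points still form a straight line, but not the 45-degree one
slope_raw, intercept_raw = np.polyfit(theoretical_q, np.sort(shifted), 1)
print("raw data:     slope", round(slope_raw, 2), "intercept", round(intercept_raw, 2))

# Standardized data: the points move back onto the line y = x
z = (shifted - shifted.mean()) / shifted.std(ddof=1)
slope_std, intercept_std = np.polyfit(theoretical_q, np.sort(z), 1)
print("standardized: slope", round(slope_std, 2), "intercept", round(intercept_std, 2))
```

For the raw data the fitted slope is close to the standard deviation and the intercept close to the mean, so the points are straight but do not follow y = x. In statsmodels, passing fit=True to qqplot, or choosing line='s' instead of line='45', are ways to handle data that is not already on the standard scale.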

Code:

import numpy as np
import statsmodels.api as sm
import scipy.stats as stats
import pylab

# Draw one Q-Q plot per sample size, each against the standard normal distribution
data_values_1 = np.random.normal(0, 1, 10) # Mean 0, Std_dev 1
sm.qqplot(data_values_1, stats.distributions.norm, line='45')

data_values_1 = np.random.normal(0, 1, 100) # 100 data points
sm.qqplot(data_values_1, stats.distributions.norm, line='45')

data_values_1 = np.random.normal(0, 1, 1000) # 1000 data points
sm.qqplot(data_values_1, stats.distributions.norm, line='45')

data_values_1 = np.random.normal(0, 1, 10000) # 10000 data points
sm.qqplot(data_values_1, stats.distributions.norm, line='45')

pylab.show() # Display all four figures

Output:

As you can see, as the sample size increases, the sample looks more and more normal and the data points fall ever closer to the straight line.

The X axis shows the theoretical quantiles, which come from the reference distribution we passed in (stats.distributions.norm). The Y axis shows the sorted data values of our sample (the sample quantiles).
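To make the two axes concrete, here is a minimal sketch that builds the same kind of coordinates by hand. The plotting positions (i - 0.5) / n are one common convention and an assumption here; statsmodels may use a slightly different formula internally. The seed and the variable names are illustrative only.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(1)             # seed chosen only for reproducibility
sample = rng.normal(0, 1, 200)

n = len(sample)
probs = (np.arange(1, n + 1) - 0.5) / n    # assumed plotting positions in (0, 1)

theoretical_q = stats.norm.ppf(probs)      # X axis: quantiles of the reference distribution
sample_q = np.sort(sample)                 # Y axis: sorted sample values

# For a sample that really is normal, the two sets of quantiles line up closely
corr = np.corrcoef(theoretical_q, sample_q)[0, 1]
print("correlation between theoretical and sample quantiles:", round(corr, 4))
```

Swapping stats.norm.ppf for stats.uniform.ppf would give the theoretical quantiles of a uniform reference distribution instead.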

Let us try the same thing for Uniform Distribution:

Code:

# Same experiment, this time against the standard uniform distribution on [0, 1]
data_values_2 = np.random.uniform(0, 1, 10) # low 0, high 1, size 10
sm.qqplot(data_values_2, stats.distributions.uniform, line='45')

data_values_2 = np.random.uniform(0, 1, 100) # 100 data points
sm.qqplot(data_values_2, stats.distributions.uniform, line='45')

data_values_2 = np.random.uniform(0, 1, 1000) # 1000 data points
sm.qqplot(data_values_2, stats.distributions.uniform, line='45')

data_values_2 = np.random.uniform(0, 1, 10000) # 10000 data points
sm.qqplot(data_values_2, stats.distributions.uniform, line='45')

pylab.show() # Display all four figures

Output:

Let us verify the distributions of data values 1 and 2 (of size 10,000 each) once more, this time using histograms:

Code:

import matplotlib.pyplot as plt
import seaborn as sns

# Histogram of each 10,000-point sample
sns.displot(data=data_values_1, kind='hist') # normal sample
sns.displot(data=data_values_2, kind='hist') # uniform sample

plt.show()

It is clear from these histograms that data set 1 follows a normal distribution and data set 2 follows a uniform distribution. The Q-Q plots reinforce this: in the 4th sub-chart of each of the two outputs above, the data points fall almost perfectly on the straight line, which indicates no noticeable skewness or outliers.

Finally, let us see what happens when the data set has a uniform distribution but we compare it against "stats.distributions.norm". How do the data points look on the chart?

Code:

sm.qqplot(data_values_2, stats.distributions.norm, line='45')
pylab.show()

Output:

The data points do not fall on the straight line, which makes it clear that the dataset "data_values_2" does not follow a normal distribution.
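The visual mismatch can also be expressed as a number. The sketch below uses scipy.stats.probplot, which returns the Q-Q coordinates together with the correlation coefficient r of a least-squares line through them. It is a rough illustration, not a formal test: the seed and the variable names are assumptions, and the sample is generated fresh rather than reusing "data_values_2".

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(2)                 # seed chosen only for reproducibility
uniform_sample = rng.uniform(0, 1, 1000)

# probplot returns ((theoretical quantiles, ordered values), (slope, intercept, r))
_, (_, _, r_norm) = stats.probplot(uniform_sample, dist="norm")
_, (_, _, r_unif) = stats.probplot(uniform_sample, dist="uniform")

print("r against normal :", round(r_norm, 4))
print("r against uniform:", round(r_unif, 4))
```

A value of r closer to 1 means the Q-Q points lie closer to a straight line, so the uniform reference should score higher than the normal one for this sample.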
