Padhai Time

Measure of Variability or Dispersion

We have already studied how to find the central value of a dataset. In this article we discuss variability (spread) in data, which shows how far the data points lie from the center.

The farther the data points are from the center, the greater the variability. The closer they are to the center, the lower the variability.
Take a look at the picture below:

There are three datasets, and each of them has a different distribution.

- Set A is more concentrated towards the center

- Set C is widely spread across the axis

Why is variability in a data set a concern?
When the variability or spread in a data set is low (Set A), the data is more stable and trustworthy. When the variability is high, we start doubting its stability.

For example, suppose that as part of my daily commute in Bengaluru, I travel from Koramangala to Indiranagar by one of two roads. Below is the history of my past 75 trips on each road:

Can you tell which road is the better choice for my upcoming ride?

Road A? Or Road B?

The blue line (Road A) is more concentrated towards the center, and the orange line (Road B) is widely spread across the axis. This means we can be highly confident that a trip on Road A will take about 30 to 35 mins, whereas on Road B there is a good chance that it takes more than 40 mins. Hence Road A is the more predictable choice and may prove faster than Road B for an upcoming ride. We will back this up with metrics later in this tutorial. Keep reading.

The measures of variability covered in this article are:

1) Variance

2) Standard Deviation

3) InterQuartile Range (IQR)

Note: You can download the data set from this link and try computing the metrics yourself.

1) Variance:

It is the average squared deviation of data points from the mean.

Steps:

Calculate the mean of the dataset
Subtract the mean from a data point
Square the result
Repeat for all the data points and add up the squared values
Finally, divide the sum by the number of elements in the dataset

This gives us the variance. The higher the value, the greater the variability in the dataset.
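The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the code used to produce the numbers in this article: the function name population_variance and the sample values are made up for the example. It divides by the number of elements, as in the last step, so it should agree with np.var on the same data.

```python
def population_variance(values):
    """Average squared deviation from the mean (divides by n)."""
    n = len(values)
    mean = sum(values) / n                                   # step 1: mean
    squared_deviations = [(v - mean) ** 2 for v in values]   # steps 2-3: subtract, square
    return sum(squared_deviations) / n                       # steps 4-5: sum, divide by n


# Hypothetical sample values, NOT the Road A / Road B data set
sample = [2, 4, 4, 4, 5, 5, 7, 9]
variance = population_variance(sample)
print("Variance:", variance)  # 4.0
```

Here the mean is 5, the squared deviations add up to 32, and 32 / 8 gives a variance of 4.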

The variance of the time taken through Road A for given 75 observations is 36.24

The variance of the time taken through Road B for given 75 observations is 97.81

Since the variance of Road B is higher, its data points vary more from the average time taken. So if we choose Road B, we are less confident about its stability, and the commute may take longer.

2) Standard Deviation:

It is the square root of the variance.

The standard deviation of the time taken through Road A for 75 observations is 6.02

The standard deviation of the time taken through Road B for 75 observations is 9.89

Standard deviation tells you the same thing as variance, but it is expressed as the square root of the variance.

In our case, a trip on Road B typically deviates from the usual (average) travel time by about 9.9 minutes, either more or less. The deviation is higher for Road B than for Road A, hence Road B is less reliable than Road A.
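As a quick check, the standard deviations quoted above can be recovered from the variances reported in the Variance section. The snippet below is a small sketch using only Python's math module; the variable names are chosen for this example.

```python
import math

# Variances reported earlier for the 75 trips on each road (minutes squared)
variance_road_a = 36.24
variance_road_b = 97.81

# Standard deviation is the square root of the variance (back in minutes)
std_road_a = math.sqrt(variance_road_a)
std_road_b = math.sqrt(variance_road_b)

print("Std dev, Road A:", round(std_road_a, 2))  # 6.02
print("Std dev, Road B:", round(std_road_b, 2))  # 9.89
```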

Difference between Variance and Standard Deviation:

Standard deviation is easy to interpret, while variance is harder to interpret. This is because the unit of standard deviation is the same as the unit of the data values, whereas the unit of variance is squared.
Suppose the data is given in cm (centimeters): the standard deviation will also be in cm, whereas the variance will be in cm^2.

3) InterQuartile Range (IQR):

Before IQR, let us understand the range.

The range is the difference between the highest and the lowest value of a dataset.

For our example above:

Range for Road A is 45 mins - 15 mins = 30 mins

Range for Road B is 55 mins - 10 mins = 45 mins

So, a higher range suggests higher variability.

But note that the range is easily affected by outliers.

E.g. Range of 15, 16, 17, 17, 18, 20, 20, 2000 = 2000 - 15 = 1985

As we can see, the actual data points vary from 15 to 20, and 2000 is just an outlier.
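The effect of that single outlier is easy to reproduce. Below is a minimal sketch; the helper name value_range is made up for this example.

```python
def value_range(values):
    """Difference between the highest and the lowest value."""
    return max(values) - min(values)


data = [15, 16, 17, 17, 18, 20, 20, 2000]

range_with_outlier = value_range(data)
range_without_outlier = value_range(data[:-1])  # drop the last value, 2000

print("Range with outlier:", range_with_outlier)        # 1985
print("Range without outlier:", range_without_outlier)  # 5
```

Removing one value out of eight changes the range from 1985 to 5, which shows how sensitive the range is.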

Hence, there is a more refined metric called IQR, which is far less influenced by outliers.

IQR (Interquartile Range) is the difference between Q3 (the third quartile, i.e. the 75th percentile) and Q1 (the first quartile, i.e. the 25th percentile).

So IQR gives us the range of the central 50% of the data, and it is largely unaffected by outliers in the dataset.
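To see this on the small outlier data set used for the range, here is a sketch based on Python's standard statistics module. The helper name iqr and the choice of method="inclusive" are assumptions made for this illustration; other quartile conventions (for example the 'midpoint' option in NumPy) can give slightly different values.

```python
import statistics


def iqr(values):
    """Q3 - Q1, with quartiles taken from statistics.quantiles."""
    # n=4 returns the three cut points Q1, Q2 (median) and Q3
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


data = [15, 16, 17, 17, 18, 20, 20, 2000]

iqr_with_outlier = iqr(data)
iqr_without_outlier = iqr(data[:-1])  # drop the last value, 2000

print("IQR with outlier:", iqr_with_outlier)        # 3.25
print("IQR without outlier:", iqr_without_outlier)  # 2.5
```

The IQR moves only from 2.5 to 3.25 when the value 2000 is included, whereas the range of the same data jumps to 1985.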

Example:

Data: the even numbers 0, 2, 4, 6, 8, 10, 12, 14, 16, ..., 194, 196, 198

import numpy as np

# Even numbers from 0 to 198
x = np.arange(0, 200, 2)

print("Mean is:", np.mean(x))

print("Median is:", np.median(x))

print("Variance is:", np.var(x))

print("Standard Deviation is:", np.std(x))

# NumPy 1.22+ names this argument method; older versions call it interpolation
q3 = np.percentile(x, 75, method='midpoint')

q1 = np.percentile(x, 25, method='midpoint')

print("Q3:", q3)

print("Q1:", q1)

print("IQR:", q3 - q1)

Output (rounded):

Mean is: 99

Median is: 99

Variance is: 3333

Standard Deviation is: 57.73

Q3: 149

Q1: 49

IQR: 100
