## Introduction

Recall that the **Interquartile range (IQR)** is the difference between the 75th percentile (0.75 quantile) and the 25th percentile (0.25 quantile). In the last tutorial, we learned how to compute the interquartile range from scratch.

**But how does the IQR help you in data science?**

One practical use of the IQR is to detect outliers in your data. The general rule is that outliers are observations that fall:

- below **25th percentile - 1.5 * IQR**, or
- above **75th percentile + 1.5 * IQR**

In fact, when you create a box plot from the data, this is exactly what you see:

*Image source: https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51*
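Before turning to real data, here is a minimal sketch of the rule on a made-up sample. The numbers are purely illustrative, and the quartiles are computed with the standard library's `statistics.quantiles` (linear interpolation), which is not the from-scratch method used later in this tutorial.

```python
from statistics import quantiles

# Hypothetical toy sample: nine small values and one extreme value
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

# Quartiles with linear interpolation between data points
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Fences of the 1.5 * IQR rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences is flagged as an outlier
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q3, iqr)               # 3.25 7.75 4.5
print(lower_fence, upper_fence)  # -3.5 14.5
print(outliers)                  # [100]
```

Only the value 100 lies outside the fences, so it is the only point flagged.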

In this tutorial, we will use the IQR method to see which countries in the world are outliers in terms of wealth.

## Practice

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```

### 1 – Dataset

According to Investopedia, Per capita GDP is a global measure for gauging the prosperity of nations and is used by economists, along with GDP, to analyze the prosperity of a country based on its economic growth.

```python
df = pd.read_csv("../data/gdp_per_capita.csv")

# Use the GDP per capita in 2018, the latest year available
df = df[['Country Name', '2018']]
df = df.dropna()
df = df.rename(columns={
    'Country Name': 'country',
    '2018': 'gdp'
})

# Display the first rows of the dataframe
df.head()
```

| | country | gdp |
|---|---|---|
| 1 | Afghanistan | 520.896603 |
| 2 | Angola | 3432.385736 |
| 3 | Albania | 5268.848504 |
| 4 | Andorra | 42029.762737 |
| 5 | United Arab Emirates | 43004.953358 |

### 2 – IQR Outlier Detection Method

```python
def get_percentile(df, percentile_rank):
    # First, sort by ascending gdp, reset the indices
    df = df.sort_values(by='gdp').reset_index()

    # Rule of three to get the index of the gdp
    # (int() truncates, so no interpolation between countries)
    index = (len(df.index)-1) * percentile_rank / 100.0
    index = int(index)

    # Return the gdp corresponding to the percentile rank
    # as well as the name of the corresponding country
    return (df.at[index, 'country'], df.at[index, 'gdp'])


def interquartile_range(df):
    c75, p75 = get_percentile(df, 75)  # 75th percentile country and gdp
    c25, p25 = get_percentile(df, 25)  # 25th percentile country and gdp
    iqr = p75 - p25  # Interquartile Range
    return iqr
```

First, compute the interquartile range of GDP per capita. In 2018, the difference between the country at the 25th percentile and the country at the 75th percentile was around USD$ 17,138 per person.

```python
iqr = interquartile_range(df)
iqr
# output: 17137.727817263032
```
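For comparison, pandas can return the same kind of percentile directly through `Series.quantile`. The sketch below uses a small hypothetical DataFrame (`toy`, not the GDP dataset) and assumes that `interpolation="lower"` is the closest match to the truncated index used in `get_percentile`; the default setting interpolates linearly and can give a different value.

```python
import pandas as pd

# Hypothetical toy data, not taken from the GDP dataset
toy = pd.DataFrame({
    "country": list("ABCDEFGHIJ"),
    "gdp": [500.0, 900.0, 1500.0, 3000.0, 6000.0,
            9000.0, 15000.0, 30000.0, 60000.0, 120000.0],
})

# interpolation="lower" returns the observation at or just below the
# fractional position, like the int() truncation in get_percentile
p25 = toy["gdp"].quantile(0.25, interpolation="lower")
p75 = toy["gdp"].quantile(0.75, interpolation="lower")
iqr = p75 - p25

# Same 75th percentile computed by hand with the truncated index
sorted_gdp = toy["gdp"].sort_values().reset_index(drop=True)
manual_p75 = sorted_gdp[int((len(sorted_gdp) - 1) * 75 / 100.0)]

print(p25, p75, iqr)               # 1500.0 15000.0 13500.0
print(manual_p75 == p75)           # True
print(toy["gdp"].quantile(0.75))   # 26250.0 (default linear interpolation)
```

On a dataset with a couple of hundred countries, the two interpolation choices usually land much closer together than in this ten-row example.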

Now detect the outliers using the IQR method:

```python
def get_outliers(df):
    # Compute the 25th percentile, the 75th percentile and the IQR
    c25, p25 = get_percentile(df, 25)
    c75, p75 = get_percentile(df, 75)
    iqr = interquartile_range(df)

    # "Minimum non-outlier value": 25th percentile - 1.5 * IQR
    min_val = p25 - 1.5*iqr

    # "Maximum non-outlier value": 75th percentile + 1.5 * IQR
    max_val = p75 + 1.5*iqr

    # Anything below the minimum or above the maximum becomes
    # an outlier
    outliers = df[(df['gdp'] < min_val) | (df['gdp'] > max_val)]
    return outliers
```

```python
outliers = get_outliers(df)
outliers = outliers.sort_values(by='gdp', ascending=False)

# Display the first 10 outliers
outliers.head(10)
```

*Printing the first 10 outliers*

| | country | gdp |
|---|---|---|
| 122 | Monaco | 185741.279992 |
| 117 | Luxembourg | 116597.295638 |
| 119 | Macao SAR, China | 87208.535913 |
| 33 | Switzerland | 82828.797372 |
| 144 | Norway | 81734.465574 |
| 92 | Ireland | 78582.948100 |
| 95 | Iceland | 73368.117464 |
| 161 | Qatar | 68793.784437 |
| 168 | Singapore | 64581.944018 |
| 198 | United States | 62886.836485 |

All of the outliers are **rich** countries. None of the less rich countries was considered to be an outlier. This is because the median GDP per capita is only about USD$ 6,344.

```python
c50, p50 = get_percentile(df, 50)
print("Median GDP per capita on earth = USD$ %s (%s)" % (int(p50), c50))
# Median GDP per capita on earth = USD$ 6344 (Ecuador)
```

Because the IQR is about USD$ 17,138 (see above), 1.5 times the IQR is far larger than the 25th percentile. The minimum non-outlier value is therefore negative, which is impossible for a Gross Domestic Product.

```python
# "Minimum non-outlier value": 25th percentile - 1.5 * IQR
c25, p25 = get_percentile(df, 25)
min_val = p25 - 1.5*iqr
print("Minimum non-outlier value = $ %s" % int(min_val))
# Prints a negative value: p25 is below the median (about 6,344),
# while 1.5 * IQR is about 25,707
```

However, the maximum non-outlier value is exceeded multiple times.

```python
plt.figure(figsize=(12,4))
plt.hist(df['gdp'], bins=25)
plt.title("Total number of countries: %s" % len(df.index))
plt.xlabel("GDP per capita in 2018 (USD)")

c75, p75 = get_percentile(df, 75)  # 75th percentile country and gdp
c50, p50 = get_percentile(df, 50)  # median country and gdp
c25, p25 = get_percentile(df, 25)  # 25th percentile country and gdp

# Vertical lines for each percentile of interest
plt.axvline(p25, linestyle='--', color='#ffa700', label='25th percentile')
plt.axvline(p50, linestyle='--', color='#5bc0de', label='50th percentile (median)')
plt.axvline(p75, linestyle='--', color='#5cb85c', label='75th percentile')

# Maximum non-outlier value
maxval = p75 + 1.5*iqr
plt.axvline(maxval, linestyle='--', color='#d9534f', label='Maximum non-outlier')

plt.legend(loc='best')
plt.show()
```

### In Conclusion

Using the IQR rule to detect outliers, we can see that, in 2018:

- no country in the world was *abnormally* poor compared to the rest, but
- several countries were *abnormally* rich compared to the rest in terms of GDP per capita

Also notice how the median (in light blue) is closer to the lower quartile (25th percentile) than to the upper quartile (75th percentile). This is because the data is skewed to the right: very few countries have a very high GDP per capita. This is an example of how statistics can show us that our generation still has a lot of work to do to address the problem of wealth disparity.
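The quartile asymmetry described above can be checked numerically. The sketch below uses a hypothetical right-skewed sample (not the GDP data) and the standard library's `statistics.quantiles`; it simply compares the distance from the lower quartile to the median with the distance from the median to the upper quartile.

```python
from statistics import quantiles

# Hypothetical right-skewed sample: many small values, a few very large ones
sample = [1, 2, 2, 3, 3, 4, 5, 8, 20, 60, 150]

q1, median, q3 = quantiles(sample, n=4, method="inclusive")

lower_gap = median - q1   # distance from the lower quartile to the median
upper_gap = q3 - median   # distance from the median to the upper quartile

print(q1, median, q3)         # 2.5 4.0 14.0
print(lower_gap < upper_gap)  # True: the median sits closer to the lower quartile
```

In a roughly symmetric sample the two gaps would be about equal, so a much larger upper gap is a quick sign of right skew.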

For an end-to-end version of this code see my GitHub notebook.