Quartiles, Quantiles and Percentiles

Introduction

Suppose you have a list of countries and their populations. The quantile rank (and percentile rank) of your country corresponds to the fraction of countries with a population lower than or equal to your country's.

The difference is that the quantile goes from 0 to 1, and the percentile goes from 0% to 100%.

  • 0.25 quantile = 25th percentile = lower quartile
  • 0.5 quantile = 50th percentile = median
  • 0.75 quantile = 75th percentile = upper quartile
  • etc.

So if your country has more inhabitants than 75% of the other countries in the world, it is

  • in the 0.75 quantile
  • in the 75th percentile
  • in the upper quartile.
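The equivalences above can be checked quickly with NumPy. This is only a small sketch on a made-up sample (the integers 1 to 9, not the population data): `np.quantile` takes a value between 0 and 1, while `np.percentile` takes a value between 0 and 100.

```python
import numpy as np

# Hypothetical sample for illustration: the integers 1 to 9
values = np.arange(1, 10)

# The same cut points, expressed as quantiles (0 to 1)
lower_quartile = np.quantile(values, 0.25)
median = np.quantile(values, 0.5)
upper_quartile = np.quantile(values, 0.75)

# ... and as percentiles (0 to 100)
p25 = np.percentile(values, 25)
p50 = np.percentile(values, 50)
p75 = np.percentile(values, 75)

print(lower_quartile, median, upper_quartile)  # 3.0 5.0 7.0
print(p25 == lower_quartile, p50 == median, p75 == upper_quartile)
```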

Let’s compute the quantile rank of your country.

Practice

import pandas as pd
import numpy as np

We will use a simplified version of the WorldBank population per country dataset – the original csv file is available here.

df = pd.read_csv("../data/countries-population-2018.csv")
df = df.dropna()
df['population'] = df['population'].astype(int)
df.to_csv("../data/countries-population-2018.csv", index=False)
df.head(3)
       country  population
0        aruba      105845
1  afghanistan    37172386
2       angola    30809762

def QuantileRank(df, country):
    
    # your country's population (first matching row)
    population = int(df.loc[df['country'] == country, 'population'].iloc[0])
    # countries with a population lower than or equal to your country's
    lower = df[df['population'] <= population]
    # number of such countries
    n_lower = len(lower.index)
    # total number of countries
    n_countries = len(df.index)
    # quantile rank
    quantile_rank = n_lower/n_countries
    return quantile_rank

def PercentileRank(df, country):
    
    # This is just the quantile rank, times 100
    quantile_rank = QuantileRank(df, country)
    percentile_rank = 100.0*quantile_rank
    return percentile_rank

Canada is in the 81st percentile

PercentileRank(df, 'canada')
81.73076923076923

India is in the 99th percentile

PercentileRank(df, 'india')
99.51923076923077
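If you want the rank of every country at once, pandas can probably do the counting for you. Here is a minimal sketch on a made-up table (the country names and numbers are hypothetical, not the WorldBank data). It relies on `Series.rank` with `method="max"`, which gives tied values the highest rank of the tie, so each rank equals the number of values lower than or equal to it, and on `pct=True`, which divides by the number of rows.

```python
import pandas as pd

# Made-up populations for illustration only (not the WorldBank data)
toy = pd.DataFrame({
    "country": ["a", "b", "c", "d"],
    "population": [10, 20, 20, 40],
})

# Fraction of countries with a population lower than or equal to each row's
toy["quantile_rank"] = toy["population"].rank(method="max", pct=True)
# Same thing on a 0 to 100 scale
toy["percentile_rank"] = 100.0 * toy["quantile_rank"]

print(toy)
```

On this toy table the result should match what the two functions above compute row by row: countries "b" and "c" are tied, so both get a quantile rank of 3/4.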

Full code on my GitHub here.

The Curse of Dimensionality – Illustrated With Matplotlib

Maybe you have already come across this famous Machine Learning quote by Charles Lee Isbell:

“As the number of features or dimensions grows, the amount of data we need to generalize accurately grows exponentially.”

Here is another explanation from Wikipedia

 “When the dimensionality increases, the volume of the space increases so fast that the available data become sparse. (…) In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality.”

I think the “Curse of Dimensionality” is easier to understand when visualized. Suppose you have 50 data points between 0 and 100.

1- Let’s try with one dimension first

import pandas as pd
import matplotlib.pyplot as plt
import random
import numpy as np

fig = plt.figure()
ax  = plt.axes()
fig.set_size_inches(12, 1)

x = random.sample(range(0, 100), 50) 
y = [0 for xval in x]
plt.scatter(x, y)

# Grid lines
for grid_pt in [20, 40, 60, 80]:
    plt.axvline(x=grid_pt, color='#D8D8D8')

ax.set_xlim((0,100))
ax.set_xlabel("Dimension #1", fontsize=14)
ax.set_ylabel("")
plt.yticks([], [])
plt.title("1D")
plt.show()

With 5 intervals in our first dimension, there will be an average of 50/5 = 10 points per cell, which is already low if you’d like to do any statistical analysis for each interval.
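To see the actual counts behind that average, here is a small sketch using `np.histogram`. The seed is an arbitrary choice of mine, and the sample is drawn with NumPy rather than `random.sample`, so the points will not be the same as in the plot.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, arbitrary choice
# 50 distinct integers between 0 and 99
x = rng.choice(100, size=50, replace=False)

# Same 5 intervals as the grid lines in the plot
counts, edges = np.histogram(x, bins=[0, 20, 40, 60, 80, 100])

print(counts)         # number of points in each interval
print(counts.mean())  # 10.0
```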

2- Moving to two dimensions

fig = plt.figure()
ax  = plt.axes()
fig.set_size_inches(8, 8)

# Now each point has 2 dimensions (x,y)
x = random.sample(range(0, 100), 50) 
y = random.sample(range(0, 100), 50) 

plt.scatter(x, y)

# Grid lines
for grid_pt in [20, 40, 60, 80]:
    plt.axvline(x=grid_pt, color='#D8D8D8')
    plt.axhline(y=grid_pt, color='#D8D8D8')

ax.set_xlim((0,100))
ax.set_ylim((0,100))
ax.set_xlabel("Dimension #1", fontsize=14)
ax.set_ylabel("Dimension #2", fontsize=14)
plt.title("2D")
plt.show()

With 5 intervals on the first dimension and 5 intervals on the 2nd dimension, we now have 50/(5×5) = 2 points per cell on average. In fact, we are already starting to see cells that do not have any data to work with.

3- Adding a third dimension

from mpl_toolkits import mplot3d

fig = plt.figure()
ax  = fig.add_subplot(1,1,1,projection='3d')
fig.set_size_inches(10, 8)

# Now each point has 3 dimensions (x,y,z)
x = random.sample(range(0, 100), 50) 
y = random.sample(range(0, 100), 50) 
z = random.sample(range(0, 100), 50)

ax.scatter(x, y, z)

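# Grid lines: axvline/axhline are 2D helpers, so on 3D axes we instead
# place the ticks (and the built-in grid) at the cell boundaries
ticks = [0, 20, 40, 60, 80, 100]
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_zticks(ticks)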

ax.set_xlim(0,100)
ax.set_ylim(0,100)
ax.set_zlim(0,100)

ax.set_xlabel("Dimension #1", fontsize=14)
ax.set_ylabel("Dimension #2", fontsize=14)
ax.set_zlabel("Dimension #3", fontsize=14)
plt.title("3D")
plt.show()

With 5 intervals on the third dimension as well, we are down to 50/(5×5×5) = 0.4 points per cell on average!
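The three averages can also be checked numerically, along with the number of empty cells. This is a rough sketch, assuming uniformly distributed points and an arbitrary seed; `cell_stats` is a helper name I made up, and the counting is done with `np.histogramdd`.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, arbitrary choice
n_points, n_intervals = 50, 5
edges = np.linspace(0, 100, n_intervals + 1)  # 0, 20, 40, 60, 80, 100

def cell_stats(n_dims):
    """Return (average points per cell, number of empty cells)."""
    points = rng.uniform(0, 100, size=(n_points, n_dims))
    counts, _ = np.histogramdd(points, bins=[edges] * n_dims)
    return counts.mean(), int((counts == 0).sum())

for d in (1, 2, 3):
    mean_count, n_empty = cell_stats(d)
    n_cells = n_intervals ** d
    print(f"{d}D: {mean_count:.1f} points per cell, "
          f"{n_empty} empty cells out of {n_cells}")
```

In 3D there are 125 cells for only 50 points, so at least 75 cells are guaranteed to be empty whatever the sample.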

In Conclusion

As you add new dimensions, you create “new space” that is usually not filled properly by your initial data. 

In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality.
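To put numbers on that exponential growth, here is a tiny sketch. It assumes you want to keep the 10 points per cell of the 1D example, with 5 intervals per dimension; `points_needed` is a hypothetical helper, not something from a library.

```python
def points_needed(points_per_cell, n_intervals, n_dims):
    """Data points required to keep a given average per cell."""
    return points_per_cell * n_intervals ** n_dims

for d in range(1, 6):
    print(d, points_needed(10, 5, d))
# 1 50
# 2 250
# 3 1250
# 4 6250
# 5 31250
```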

 

Creating an R Markdown PDF output (command line version)

 

Introduction

Today, I will be talking about how to generate a nice PDF report with text, code, plots, and formulas using R Markdown.

For those of you who are in a hurry, you will find the entire code at the end of this post. Simply skip to Step 2 to learn how to convert it into a PDF.

Once again, I will explain how to do this from the command line. Because why would anyone need a graphical interface when they have Vim?

 

Step 1 : Create a basic .Rmd file

Save the following lines in a file named, say, “my_report.Rmd” :

---
title: "My PDF Report with R Markdown"
author:
- First Name, Last Name
output: pdf_document
fontsize: 12pt
---

Step 2 : Convert .Rmd -> PDF

Run this command in the same directory:

Rscript -e "rmarkdown::render('./my_report.Rmd')"

You should find a file named my_report.pdf in the same directory.
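If you would rather trigger the conversion from a Python script (for instance as part of a larger pipeline), the same call can be wrapped with `subprocess`. This is only a sketch: `render_command` is a name I made up, and it assumes that `Rscript` is on your PATH and that the rmarkdown package and a LaTeX distribution are installed, none of which is checked here.

```python
import shutil
import subprocess
from pathlib import Path

def render_command(rmd_path):
    """Build the same Rscript call as on the command line."""
    return ["Rscript", "-e", f"rmarkdown::render('{rmd_path}')"]

rmd_path = "./my_report.Rmd"
cmd = render_command(rmd_path)

# Only run when Rscript is available and the .Rmd file exists
if shutil.which("Rscript") is not None and Path(rmd_path).exists():
    result = subprocess.run(cmd, check=False)
    print("Rscript exit code:", result.returncode)
else:
    print("Skipping render, would have run:", cmd)
```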

The file should look like this :

Step 3 : Add some text, and a formula

Simple Linear Regression :

$$
\begin{aligned}
         &X \beta &&= Y \\
\implies &X^{T} X \beta &&= X^{T} Y \\
\end{aligned}
$$

Repeat Step 2 to update the PDF.

Formulas are written using LaTeX formatting.
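As a side note, outside of the report itself, the second line of that formula is something you can solve numerically. Here is a minimal sketch in Python with NumPy, on made-up data generated from y = 1 + 2x so that the expected answer is known. It solves X^T X β = X^T Y with `np.linalg.solve` (for real data, `np.linalg.lstsq` is usually the safer choice).

```python
import numpy as np

# Hypothetical data generated from y = 1 + 2x, so the answer is known
x = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])  # intercept column, then x
Y = 1.0 + 2.0 * x

# Solve X^T X beta = X^T Y for beta
beta = np.linalg.solve(X.T @ X, X.T @ Y)

print(beta)  # approximately [1. 2.] (intercept, slope)
```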

Step 4 : Add some R code, and a plot

Here is some R code :

```{r}
# Define the cars vector with 5 values
cars <- c(1, 3, 6, 4, 9)

# Graph cars using blue points overlaid by a line
plot(cars, type="o", col="blue")

# Create a title with a red, bold/italic font
title(main="Autos", col.main="red", font.main=4)
```

Repeat Step 2 to update the PDF.

More “generic” plot ideas here.

Step 5 : Change the plot size

In the above code, replace the chunk header

```{r}

with

```{r, fig.width=8, fig.height=4}

Repeat Step 2 to update the PDF.

Step 6 : Summary

Here is the entire sample code and the resulting PDF you can expect to have.

---
title: "My PDF Report with R Markdown"
author:
- My Name
output: pdf_document
fontsize: 12pt
---

Simple Linear Regression :

$$
\begin{aligned}
         &X \beta &&= Y \\
\implies &X^{T} X \beta &&= X^{T} Y \\
\end{aligned}
$$

Here is some R code :

```{r, fig.width=8, fig.height=4}
# Define the cars vector with 5 values
cars <- c(1, 3, 6, 4, 9)

# Graph cars using blue points overlaid by a line
plot(cars, type="o", col="blue")

# Create a title with a red, bold/italic font
title(main="Autos", col.main="red", font.main=4)