The Curse of Dimensionality – Illustrated With Matplotlib

You may already have come across this famous quote about machine learning by Charles Lee Isbell:

“As the number of features or dimensions grows, the amount of data we need to generalize accurately grows exponentially.”

Here is another explanation, from Wikipedia:

 “When the dimensionality increases, the volume of the space increases so fast that the available data become sparse. (…) In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality.”

I think the “Curse of Dimensionality” is easier to understand when visualized. Suppose you have 50 data points with values between 0 and 100.

1- Let’s try with one dimension first

import random
import matplotlib.pyplot as plt

fig = plt.figure()
ax  = plt.axes()
fig.set_size_inches(12, 1)

x = random.sample(range(0, 100), 50) 
y = [0 for xval in x]
plt.scatter(x, y)

# Grid lines
for grid_pt in [20, 40, 60, 80]:
    plt.axvline(x=grid_pt, color='#D8D8D8')

ax.set_xlim((0,100))
ax.set_xlabel("Dimension #1", fontsize=14)
ax.set_ylabel("")
plt.yticks([], [])
plt.title("1D")
plt.show()

With 5 intervals in our first dimension, there will be an average of 50/5 = 10 points per cell, which is already low if you want to do any statistical analysis on each interval.

2- Moving to two dimensions

fig = plt.figure()
ax  = plt.axes()
fig.set_size_inches(8, 8)

# Now each point has 2 dimensions (x,y)
x = random.sample(range(0, 100), 50) 
y = random.sample(range(0, 100), 50) 

plt.scatter(x, y)

# Grid lines
for grid_pt in [20, 40, 60, 80]:
    plt.axvline(x=grid_pt, color='#D8D8D8')
    plt.axhline(y=grid_pt, color='#D8D8D8')

ax.set_xlim((0,100))
ax.set_ylim((0,100))
ax.set_xlabel("Dimension #1", fontsize=14)
ax.set_ylabel("Dimension #2", fontsize=14)
plt.title("2D")
plt.show()

With 5 intervals on the first dimension and 5 intervals on the second dimension, we now have 50/(5×5) = 2 points per cell on average. In fact, cells with no data at all are already starting to appear.
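We can make that sparsity concrete by actually counting how many of the 25 cells end up empty. Here is a minimal sketch using NumPy's histogram2d; the seed and the uniform sampling are arbitrary choices made for reproducibility, not part of the plots above:

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the count is reproducible

# 50 random 2D points in [0, 100), same setup as the scatter plot above
pts = rng.uniform(0, 100, size=(50, 2))

# Bin the points into the 5x5 grid defined by the grid lines
edges = [0, 20, 40, 60, 80, 100]
counts, _, _ = np.histogram2d(pts[:, 0], pts[:, 1], bins=[edges, edges])

print("cells:", counts.size)                      # 25
print("average points per cell:", counts.mean())  # 50 / 25 = 2.0
print("empty cells:", int((counts == 0).sum()))
```

Rerun with a different seed and the exact number of empty cells changes, but the average occupancy of 2 points per cell does not.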

3- Adding a third dimension

from mpl_toolkits import mplot3d

fig = plt.figure()
ax  = fig.add_subplot(1,1,1,projection='3d')
fig.set_size_inches(10, 8)

# Now each point has 3 dimensions (x, y, z)
x = random.sample(range(0, 100), 50) 
y = random.sample(range(0, 100), 50) 
z = random.sample(range(0, 100), 50)

ax.scatter(x, y, z)

# Axes3D draws its own grid by default; the 2D axvline/axhline
# helpers do not apply to a 3D axes, so no manual grid lines here

ax.set_xlim(0,100)
ax.set_ylim(0,100)
ax.set_zlim(0,100)

ax.set_xlabel("Dimension #1", fontsize=14)
ax.set_ylabel("Dimension #2", fontsize=14)
ax.set_zlabel("Dimension #3", fontsize=14)
plt.title("3D")
plt.show()

With 5 intervals on the third dimension as well, we have 50/(5×5×5) = 0.4 points per cell on average!
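The trend generalizes: with 50 points and 5 intervals per axis, the average occupancy is 50 / 5^d, which collapses quickly. A quick back-of-the-envelope loop (plain Python, no plotting) makes the decay explicit:

```python
n_points = 50
intervals_per_axis = 5

# Average points per cell for 1 through 5 dimensions
for d in range(1, 6):
    cells = intervals_per_axis ** d
    print(f"{d}D: {cells:>5} cells, "
          f"{n_points / cells:.4f} points per cell on average")
```

By 5 dimensions you are down to 0.016 points per cell; equivalently, keeping 10 points per cell would require 10 × 5^d samples.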

In Conclusion

As you add new dimensions, you create “new space” that your initial data usually cannot fill properly.

In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality.

 

Visualize a Decision Tree with Sklearn

Step 1: Install the libraries
sudo apt-get install graphviz

pip install graphviz
pip install pydotplus
pip install scikit-learn
pip install pydot
pip install pandas

Do the imports

import pydotplus
import pandas as pd
from sklearn import tree
from io import StringIO
import pydot
Step 2: Initialize the dataframe
data = [ 
    (0, 5, 0), 
    (1, 6, 0), 
    (2, 7, 1), 
    (3, 8, 1), 
    (4, 9, 1)
]
df = pd.DataFrame(data, index=range(5), columns=['x1','x2','y'])
Step 3: Train the decision tree
x_columns = ['x1','x2']

model = tree.DecisionTreeClassifier()
trained_model = model.fit(df[x_columns], df['y'])
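Before visualizing, it is worth sanity-checking that the tree actually learned the pattern. This quick check repeats the setup from the steps above so it can run standalone; an unconstrained decision tree fits a tiny, cleanly separable dataset like this one perfectly, so predictions on the training rows should match y exactly:

```python
import pandas as pd
from sklearn import tree

# Same toy dataset as above: y is 0 for the first two rows, 1 afterwards
data = [(0, 5, 0), (1, 6, 0), (2, 7, 1), (3, 8, 1), (4, 9, 1)]
df = pd.DataFrame(data, columns=['x1', 'x2', 'y'])
x_columns = ['x1', 'x2']

model = tree.DecisionTreeClassifier()
trained_model = model.fit(df[x_columns], df['y'])

# A fully grown tree memorizes this separable data
preds = trained_model.predict(df[x_columns])
print(list(preds))  # [0, 0, 1, 1, 1]
```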
Step 4: Display the decision tree

Two options

Option A: You want to save the decision tree as a file

dotfile = StringIO()

tree.export_graphviz(
    trained_model,  
    out_file        = dotfile,
    feature_names   = x_columns, 
    class_names     = ['[y=0]', '[y=1]'], # Ascending numerical order
    filled          = True,
    rounded         = True
)

(graph,) = pydot.graph_from_dot_data(dotfile.getvalue())
graph.write_png("tree.png")

 

This should generate an image named “tree.png” in your current directory.

Option B: You want to display the decision tree in your Jupyter notebook

from IPython.display import Image

dot_data = tree.export_graphviz(
    trained_model,
    out_file        = None, # Return the dot source as a string
    feature_names   = x_columns,
    class_names     = ['[y=0]', '[y=1]'], # Ascending numerical order
    filled          = True,
    rounded         = True
)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())

In either case, this is the tree you should get:
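If Graphviz is not installed, scikit-learn can also print the same tree as plain text via export_text, with no extra dependencies. This sketch repeats the training setup so it runs standalone:

```python
import pandas as pd
from sklearn import tree
from sklearn.tree import export_text

# Same toy dataset and features as in Steps 2 and 3
data = [(0, 5, 0), (1, 6, 0), (2, 7, 1), (3, 8, 1), (4, 9, 1)]
df = pd.DataFrame(data, columns=['x1', 'x2', 'y'])
x_columns = ['x1', 'x2']

trained_model = tree.DecisionTreeClassifier().fit(df[x_columns], df['y'])

# Prints an indented, ASCII rendering of the split rules and leaf classes
print(export_text(trained_model, feature_names=x_columns))
```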

 

References:

https://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html