Tutorial on Classification II: Dimension reduction#
In this tutorial, we use the same data set as in the previous one, but we now explore the concept of dimension reduction. We still consider the problem of classifying days into rainy and dry days. The original data set has 10 input features, and we already saw that pressure is a good predictor. What about the other variables?
Let’s first prepare the data set as we did in the previous tutorial: we load meteorological variables for Paris between 2000 and 2009 (a subset of the ERA5 data set). As before, we assign each day to the class “rainy day” or “dry day”. We also split this data set into a training set and a test set with the train_test_split method of the scikit-learn library.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("data/era5_paris_sf_2000_2009.csv", index_col='time', parse_dates=True)
# normalize each variable and average to daily resolution
df_norm = (df - df.mean()) / df.std()
df_day = df_norm.resample("D").mean()
# normalized precipitation threshold
precip_th = -0.2
# add tag: 1 for rainy days (tp > threshold), 0 for dry days
df_day['tag'] = df_day['tp'].where(df_day['tp'] > precip_th, 0)
df_day['tag'] = df_day['tag'].where(df_day['tp'] <= precip_th, 1)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df_day.drop(columns=['tp', 'tag']), df_day['tag'], test_size=0.3, random_state=0)
In the previous tutorial, we used the LDA method to predict the class of each day. If you assume that the data set is Gaussian, Fisher-LDA is actually equivalent to LDA, so the core functions are the same in scikit-learn. There is however one extra interesting feature of Fisher-LDA: it can also be used as a dimension reduction method. The new dimensions are the ones that maximize the between-class variance and minimize the within-class variance.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis()
Fit the LDA model to the training data set.
# your code here
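A minimal sketch of a possible answer, using the lda object created above:
lda.fit(X_train, y_train)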
Make a prediction for the train and test data sets.
# your code here
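One possible answer, assuming the model has been fitted as above (the variable names y_pred_train and y_pred_test are our own choice):
y_pred_train = lda.predict(X_train)
y_pred_test = lda.predict(X_test)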
You can visualize a summary of the results with classification_report.
#from sklearn.metrics import classification_report
# your code here
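A possible sketch, reusing the predictions above:
from sklearn.metrics import classification_report
print(classification_report(y_train, y_pred_train))
print(classification_report(y_test, y_pred_test))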
Question
Do you feel like this model with more input features is making better predictions than the previous model, which had only 2 input features?
So maybe one (or more) of these extra features helps to make a better prediction. Fisher-LDA can help us shrink the data set to a single dimension: the best dimension for classification. This dimension no longer has a physical meaning, but we can use it to compress the data set while still retaining the class separation.
Question
With the function lda.transform, project the test data onto that particular dimension.
# your code here
# X_lda =
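A minimal sketch:
X_lda = lda.transform(X_test)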
Question
What is the dimension of X_lda? Why?
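One quick way to check, assuming X_lda was computed as above:
print(X_lda.shape)  # (n_test_samples, 1): with K = 2 classes, LDA keeps at most K - 1 = 1 dimension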
Add the X_lda variable to the X_test DataFrame.
# your code here
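A possible sketch; the column name 'lda' is our own choice:
X_test = X_test.copy()  # work on a copy to avoid modifying a view
X_test['lda'] = X_lda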
In the same figure, draw a boxplot for each class. As before, you can use the by=y_test.values
argument in order to separate the data set into the rainy class and the dry class. You can remove the label below the figure with plt.xlabel("")
# your answer here
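One possible answer, assuming the 'lda' column added above:
X_test.boxplot(column='lda', by=y_test.values)
plt.xlabel("")
plt.show()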
Question
Plot the histogram of this projected data for each class.
Hint: in order to do this plot, you’ll have to first .groupby(y_test)
and then use .plot.hist(...)
# your answer here
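A possible sketch following the hint (y_test shares the index of X_test, so it can be used directly as the grouping key):
X_test['lda'].groupby(y_test).plot.hist(alpha=0.5, legend=True)
plt.show()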
So we shrank our 10-dimensional data set to a 1d data set… and this single dimension is even better than pressure alone for performing the classification. If we need to represent the data and highlight the class separation, this is probably the dimension to use.
Fisher LDA in 2D#
By design, the maximum number of dimensions that you can get is \(K-1\), where \(K\) is the number of classes. We are going to explore whether 3 different months really belong to 3 distinct categories.
Adjust the parameters in the code below to select 3 months: you can pick months that are close in time or instead far apart.
mo1 = 1
mo2 = 7
mo3 = 8
# select the days belonging to the three chosen months,
# keeping the same input features as before (drop precipitation and the rain tag)
months = [mo1, mo2, mo3]
X = df_day.drop(columns=['tp', 'tag'])[df_day.index.month.isin(months)]
y = df_day[df_day.index.month.isin(months)].index.month
Follow the exact same methodology as before and do a scatter plot of the projected data in 2D. Use a different color for each class.
# your code here
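A possible sketch following the same methodology (the model name lda2 is our own choice; with 3 classes the projection has 2 dimensions):
lda2 = LinearDiscriminantAnalysis()
X_lda2 = lda2.fit_transform(X, y)
# one scatter per month, each with its own color
for month in np.unique(y):
    sel = (y == month)
    plt.scatter(X_lda2[sel, 0], X_lda2[sel, 1], label=f"month {month}", alpha=0.5)
plt.xlabel("LD1")
plt.ylabel("LD2")
plt.legend()
plt.show()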
In this 2d space, you can visualize two elements:
- the between-class spread (how the classes are separated from each other);
- the within-class spread (how points within a class are separated from each other).
Of course, months that are well separated in time will exhibit a large between-class gap compared to months that are close together.
Questions
Which variables are responsible for that separation?
Try to remove these variables from your model and see if you can still separate the months.
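One way to start investigating (an assumption on our part, not the only approach) is to inspect the weights that LDA assigns to each input variable; assuming the data matrix is full rank, lda2.scalings_ has one row per feature and one column per discriminant direction.
# variables with large absolute weights drive the separation
weights = pd.DataFrame(lda2.scalings_, index=X.columns, columns=['LD1', 'LD2'])
print(weights.abs().sort_values(by='LD1', ascending=False))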
Credit#
Contributors include Bruno Deremble and Alexis Tantet.