Tutorial: Introduction to Unsupervised Learning with a Focus on PCA#
This tutorial accompanies the class Introduction to Unsupervised Learning with a Focus on PCA and is based on the same case study as Tutorial: Regularization, Model Selection and Evaluation.
Apply Principal Component Analysis (PCA) to climate data to analyze patterns of variability
(Combine PCA reduction/\(k\)-means clustering with Ordinary Least Squares (OLS) to predict climate variables)
(Use cross-validation to regularize the OLS with the number of retained Empirical Orthogonal Functions (EOFs) or clusters).
Dataset presentation#
Input:
Geopotential height at 500hPa
Domain: North Atlantic
Spatial resolution: \(0.5° \times 0.625°\)
Time resolution: monthly
Period: 1980-2021
Units: m
Source: MERRA-2 reanalysis
Target:
Onshore wind capacity factors
Domain: Metropolitan France
Spatial resolution: regional mean
Time resolution: daily
Period: 2014-2021
Units:
Source: RTE
Getting ready#
Reading the wind capacity factor and geopotential height data#
We follow the same procedure as in Tutorial: Regularization, Model Selection and Evaluation.
# Import modules
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
plt.rc('font', size=14)
# Data directory
DATA_DIR = Path('data')
# Filename to geopotential height at 500hPa from MERRA-2 reanalysis
START_DATE = '19800101'
END_DATE = '20220101'
filename = 'merra2_analyze_height_500_month_{}-{}.nc'.format(START_DATE, END_DATE)
z500_label = 'Geopotential height (m)'
Question
Read the geopotential height data using xarray.load_dataset and print it.
# answer cell
Question (optional)
Coarsen the grid resolution of the geopotential height field to reduce the number of variables.
# answer cell
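One possible way to coarsen a grid is block averaging with the coarsen method of data arrays. The sketch below uses a synthetic field on a grid with the same spacing as the dataset; the array, the dimension names lat and lon, and the block size of 4 are assumptions for illustration, not the names or settings of the actual MERRA-2 file.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the geopotential height field (hypothetical names)
rng = np.random.default_rng(0)
lat = np.arange(20.0, 80.0, 0.5)      # 120 latitudes
lon = np.arange(-80.0, 40.0, 0.625)   # 192 longitudes
da = xr.DataArray(
    rng.normal(5500.0, 100.0, size=(24, lat.size, lon.size)),
    dims=("time", "lat", "lon"),
    coords={"lat": lat, "lon": lon},
)

# Average non-overlapping blocks of 4 x 4 grid points;
# boundary="trim" drops points that do not fill a complete block
da_coarse = da.coarsen(lat=4, lon=4, boundary="trim").mean()
print(da.shape, "->", da_coarse.shape)
```

Averaging 4 x 4 blocks divides the number of grid points, and hence the size of the covariance matrix used later, by 16 in each dimension of that matrix.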
Representing the first moments of the geopotential height field#
Question
Compute the mean and the variance of the geopotential height with the mean and var methods.
Plot the mean with the plot method.
Do a filled-contour plot of the variance with the plot.contourf method.
# answer cell
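As a hint, the mean and var methods take the name of the dimension to reduce over. The sketch below applies them along an assumed time dimension of a synthetic data array; the dimension names and values are made up, and the plotting calls are left as a comment so that the block runs without a display.

```python
import numpy as np
import xarray as xr

# Synthetic field standing in for the real data (hypothetical names)
rng = np.random.default_rng(0)
lat = np.linspace(20.0, 80.0, 13)
lon = np.linspace(-80.0, 40.0, 25)
da = xr.DataArray(
    5500.0 + 50.0 * rng.standard_normal((120, lat.size, lon.size)),
    dims=("time", "lat", "lon"),
    coords={"lat": lat, "lon": lon},
)

# Reduce over time to get one map of the mean and one of the variance
z_mean = da.mean("time")
z_var = da.var("time")  # population variance (ddof=0) by default

# Plotting calls named in the question (not executed here):
# z_mean.plot(); z_var.plot.contourf(levels=20)
```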
Question
Plot the variance of the geopotential height.
# answer cell
Question
Scale the geopotential-height deviations to account for variations in the area represented by each grid point.
Plot the variance of the scaled geopotential height.
Qualitatively describe the mean and variance of the geopotential height.
# answer cell
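A common choice for area scaling on a regular latitude-longitude grid, assumed here rather than prescribed by the tutorial, is to multiply the deviations by the square root of the cosine of latitude, so that variances end up weighted by the cosine of latitude. A minimal sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
lat = np.array([0.0, 30.0, 60.0])
lon = np.linspace(-80.0, 40.0, 5)
z = 5500.0 + 50.0 * rng.standard_normal((500, lat.size, lon.size))

# Deviations from the time mean at each grid point
dev = z - z.mean(axis=0)

# Assumed weights: sqrt(cos(latitude)), broadcast along time and longitude
weights = np.sqrt(np.cos(np.deg2rad(lat)))
dev_scaled = dev * weights[np.newaxis, :, np.newaxis]

var_raw = dev.var(axis=0)
var_scaled = dev_scaled.var(axis=0)
print(var_scaled.mean(axis=1) / var_raw.mean(axis=1))  # ratio per latitude
```

With these weights, the variance at 60° latitude is halved relative to the unscaled field, which reflects that grid cells there cover about half the area of cells at the equator.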
Answer:
PCA of the geopotential height field#
Question
Estimate the covariance matrix of the scaled geopotential height using the stack method of data arrays.
# answer cell
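The stack method can flatten the two spatial dimensions into one, which turns the field into a (time, space) matrix from which a sample covariance matrix follows. The sketch below assumes dimensions named time, lat and lon and uses the unbiased normalization by the number of samples minus one; both are illustrative choices.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
da = xr.DataArray(
    rng.standard_normal((200, 4, 5)),
    dims=("time", "lat", "lon"),
    coords={"lat": np.linspace(30.0, 60.0, 4), "lon": np.linspace(-40.0, 0.0, 5)},
)

# Flatten (lat, lon) into a single "space" dimension of size 4 * 5 = 20
X = da.stack(space=("lat", "lon"))
X = X - X.mean("time")  # center each grid point in time

# Sample covariance between all pairs of grid points
n_time = X.sizes["time"]
cov = (X.values.T @ X.values) / (n_time - 1)
print(cov.shape)
```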
Question
Compute EOFs and corresponding variances using np.linalg.eigh.
# answer cell
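For reference, np.linalg.eigh is designed for symmetric matrices such as a covariance matrix. It returns the eigenvalues in ascending order and the eigenvectors as the columns of a matrix. The sketch below checks these properties on a small synthetic covariance matrix; the data are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 6)) * np.array([3.0, 2.0, 1.0, 0.5, 0.3, 0.1])
X -= X.mean(axis=0)
cov = X.T @ X / (X.shape[0] - 1)

# Eigenvalues are the variances, eigenvector columns are the EOFs
variances, eofs = np.linalg.eigh(cov)

print(variances)                               # ascending order
print(np.allclose(cov @ eofs, eofs * variances))  # eigen-equation holds
```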
Question
Sort the EOFs and corresponding variances by decreasing variances.
Plot the fraction of variance “explained” by the leading 20 EOFs.
Interpret your results.
# answer cell
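Since np.linalg.eigh returns eigenvalues in ascending order, sorting by decreasing variance amounts to reversing the order of both the eigenvalues and the eigenvector columns. A minimal sketch on synthetic data, with the fraction of explained variance computed as each eigenvalue divided by their sum:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((300, 6)) * np.array([0.1, 3.0, 0.5, 2.0, 1.0, 0.3])
X -= X.mean(axis=0)
variances, eofs = np.linalg.eigh(np.cov(X, rowvar=False))

# Reorder eigenvalues and the matching eigenvector columns together
order = np.argsort(variances)[::-1]
variances = variances[order]
eofs = eofs[:, order]

# Fraction of the total variance carried by each EOF
explained = variances / variances.sum()
print(explained)
```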
Answer:
Question
Plot the leading EOF on a map.
With what physical phenomenon could this pattern be associated?
# answer cell
Answer:
Question
Compute the principal component associated with the leading EOF.
Compare its variance to the corresponding eigenvalue and explain your result.
Plot this principal component.
Confirm or reconsider your previous answer on the physical phenomenon that could be associated with the leading EOF.
# answer cell
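A principal component is the projection of the centered data onto an EOF. If \(\mathbf{e}\) is a unit-norm eigenvector of the sample covariance matrix \(\mathbf{C}\) with eigenvalue \(\lambda\), the sample variance of the projection is \(\mathbf{e}^\top \mathbf{C} \mathbf{e} = \lambda\), provided the same normalization is used for both. The sketch below verifies this numerically on synthetic data.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((400, 5)) @ rng.standard_normal((5, 5))
X -= X.mean(axis=0)

cov = X.T @ X / (X.shape[0] - 1)
variances, eofs = np.linalg.eigh(cov)

# eigh sorts in ascending order, so the leading EOF is the last column
leading_eof = eofs[:, -1]
pc1 = X @ leading_eof  # one value per time step

# ddof=1 matches the (n - 1) normalization of the covariance matrix
pc1_var = pc1.var(ddof=1)
print(pc1_var, variances[-1])
```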
Answer:
Dealing with the seasonal cycle#
Question (optional)
Use scipy.signal.welch to estimate the Power Spectral Density (PSD) of the leading principal component and plot it.
Confirm or reconsider your previous answer on the physical phenomenon that could be associated with the leading EOF.
# answer cell
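To illustrate how scipy.signal.welch can be used on monthly data, the sketch below builds a synthetic series with an annual cycle plus noise. Setting fs to 12 samples per year gives frequencies in cycles per year; the segment length nperseg=120 (10 years) is an arbitrary choice for this example.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(3)
n_months = 504  # 42 years of monthly samples
t_years = np.arange(n_months) / 12.0

# Synthetic "principal component": annual cycle plus white noise
pc = 10.0 * np.cos(2.0 * np.pi * t_years) + rng.standard_normal(n_months)

# fs=12 samples per year, so freqs are in cycles per year
freqs, psd = signal.welch(pc, fs=12.0, nperseg=120)

peak_freq = freqs[np.argmax(psd)]
print(peak_freq)  # frequency of the largest spectral peak
```

A series dominated by the seasonal cycle shows a sharp peak at one cycle per year, which can be plotted with plt.semilogy(freqs, psd).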
Question
Compute the seasonal cycle of the geopotential height (averages over all years of the same month of the year for each month) with the groupby method of data arrays.
Plot all 12 months. You can use the col option of the plot method of data arrays.
Also plot the variance of the seasonal cycle on a map.
Confirm or reconsider your previous answer on the physical phenomenon that could be associated with the leading EOF.
# answer cell
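Grouping by "time.month" collects all Januaries, all Februaries, and so on, so that averaging each group yields a 12-month climatology. The sketch below uses a synthetic monthly field with a prescribed annual cycle; the dimension names are assumptions and the plotting call is left as a comment.

```python
import numpy as np
import pandas as pd
import xarray as xr

rng = np.random.default_rng(4)
time = pd.date_range("1980-01-01", periods=120, freq="MS")  # 10 years, monthly
month = time.month.to_numpy()
seasonal = 100.0 * np.cos(2.0 * np.pi * (month - 1) / 12.0)
data = 5500.0 + seasonal[:, None, None] + rng.standard_normal((120, 3, 4))
da = xr.DataArray(data, dims=("time", "lat", "lon"), coords={"time": time})

# Average all time steps sharing the same calendar month
clim = da.groupby("time.month").mean("time")

# Variance of the seasonal cycle at each grid point
clim_var = clim.var("month")

# One panel per month (not executed here):
# clim.plot(col="month", col_wrap=4)
```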
Answer:
Question
Compute seasonal anomalies (deviations from the seasonal cycle) of the geopotential height with groupby.
Plot the variance of the seasonal anomalies on a map.
How does it compare to the variance of the data with the seasonal cycle?
# answer cell
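Anomalies can be obtained by subtracting from each time step the climatological mean of its calendar month, which xarray supports through arithmetic on grouped arrays. The sketch below is a minimal illustration on synthetic data with assumed dimension names.

```python
import numpy as np
import pandas as pd
import xarray as xr

rng = np.random.default_rng(5)
time = pd.date_range("1980-01-01", periods=120, freq="MS")
seasonal = 100.0 * np.cos(2.0 * np.pi * (time.month.to_numpy() - 1) / 12.0)
data = 5500.0 + seasonal[:, None, None] + rng.standard_normal((120, 3, 4))
da = xr.DataArray(data, dims=("time", "lat", "lon"), coords={"time": time})

# Subtract the matching monthly climatology from every time step
grouped = da.groupby("time.month")
anom = grouped - grouped.mean("time")

# Compare the variance with and without the seasonal cycle
var_total = da.var("time")
var_anom = anom.var("time")
print(float(var_total.mean()), float(var_anom.mean()))
```

In this synthetic case almost all of the variance comes from the prescribed annual cycle, so the anomaly variance is far smaller than the total variance.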
Answer:
Representing and interpreting the EOFs#
Question
Estimate the covariance matrix of the anomalies (with the seasonal cycle subtracted).
Compute the EOFs and corresponding variances.
Plot the explained variances associated with the EOFs together with the cumulative sum of the explained variances.
What is the minimum number of EOFs that one needs to keep to explain at least 90% of the variance?
# answer cell
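Once the variances are sorted in decreasing order, the number of EOFs needed to reach a given fraction of variance can be read off the cumulative sum. The sketch below uses a made-up spectrum of seven eigenvalues, not values from the dataset.

```python
import numpy as np

# Hypothetical eigenvalue spectrum, already sorted in decreasing order
variances = np.array([50.0, 20.0, 12.0, 9.0, 4.0, 3.0, 2.0])

explained = variances / variances.sum()
cumulative = np.cumsum(explained)

# Index of the first EOF at which the cumulative fraction reaches 0.9,
# converted to a count by adding one
n_keep = int(np.argmax(cumulative >= 0.9)) + 1
print(cumulative)
print(n_keep)
```

Here the cumulative fractions are 0.50, 0.70, 0.82, 0.91, and so on, so four EOFs are needed.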
Answer:
Question
Plot the leading 4 EOFs and principal components.
Can you associate these patterns with known climate phenomena?
# answer cell
Answer:
Reconstructing the geopotential height field from the EOFs and PCs#
Question
Reconstruct the inputs from the leading 4 EOFs only.
Compare the original time series at a few arbitrary locations to the corresponding reconstructed time series.
Plot the variance of the reconstruction on a map.
Same question, but keeping more EOFs.
Interpret your results in terms of filtering.
# answer cell
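A truncated reconstruction projects the centered data onto the \(k\) leading EOFs and maps the resulting principal components back to grid space. The sketch below does this on synthetic data with four strong and four weak directions of variability; the helper reconstruct is a hypothetical name introduced for this example.

```python
import numpy as np

rng = np.random.default_rng(6)
scales = np.array([5.0, 4.0, 3.0, 2.0, 0.5, 0.4, 0.3, 0.2])
X = rng.standard_normal((500, 8)) * scales
X = X @ np.linalg.qr(rng.standard_normal((8, 8)))[0]  # mix the variables
X -= X.mean(axis=0)

variances, eofs = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(variances)[::-1]
variances, eofs = variances[order], eofs[:, order]


def reconstruct(X, eofs, k):
    """Project onto the k leading EOFs and map back to grid space."""
    pcs = X @ eofs[:, :k]
    return pcs @ eofs[:, :k].T


X4 = reconstruct(X, eofs, 4)  # keeps the 4 dominant patterns only
X8 = reconstruct(X, eofs, 8)  # keeps everything

# Fraction of the total variance retained by the 4-EOF reconstruction
frac_kept = X4.var(axis=0, ddof=1).sum() / X.var(axis=0, ddof=1).sum()
print(frac_kept)
```

Keeping all EOFs recovers the data exactly, while truncating removes the low-variance directions, which is the sense in which the reconstruction acts as a filter.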
Answer:
Using PCA to extract features for prediction#
Question (optional)
Design a linear model that best predicts present (not future) wind capacity factors in
data/reseaux_energies_capacityfactor_wind-onshore.csv
using geopotential-height principal components as inputs. To do so, use cross-validation to regularize based on the number of leading principal components retained.
# answer cell
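One way to select the number of retained principal components by cross-validation, assuming scikit-learn is used as in the companion tutorial, is to chain PCA and OLS in a pipeline and search over the number of components. The sketch below runs on synthetic inputs and targets; the candidate grid, the number of folds, and the scoring metric are illustrative choices.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(7)
n_samples, n_features = 300, 30

# Synthetic inputs with decaying variances across features
X = rng.standard_normal((n_samples, n_features)) * np.linspace(5.0, 0.5, n_features)
# Synthetic target driven by three high-variance features plus noise
y = X[:, :3] @ np.array([0.5, -0.3, 0.2]) + 0.5 * rng.standard_normal(n_samples)

# PCA is refitted on each training fold, then OLS is fitted on the PCs
model = Pipeline([("pca", PCA()), ("ols", LinearRegression())])
search = GridSearchCV(
    model,
    param_grid={"pca__n_components": [1, 2, 3, 5, 10, 20]},
    cv=KFold(n_splits=5),  # contiguous folds, no shuffling
    scoring="r2",
)
search.fit(X, y)

best_k = search.best_params_["pca__n_components"]
print(best_k, search.best_score_)
```

Placing the PCA inside the pipeline keeps the validation folds out of the EOF estimation, so the cross-validated score is not contaminated by the held-out data.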
Question (optional)
Use \(k\)-means clustering with sklearn.cluster.KMeans to detect “atmospheric regimes” from the geopotential-height data and compare the result to the EOFs obtained above.
Design a linear model as above but based on clusters rather than EOFs.
# answer cell
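A possible design for a cluster-based linear model is to assign each time step to a cluster and regress the target on the one-hot encoded cluster membership. The sketch below uses synthetic two-dimensional inputs drawn from three well-separated regimes; the data, the number of clusters, and the encoding are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(8)
centers = np.array([[-5.0, 0.0], [5.0, 0.0], [0.0, 8.0]])
true_regime = rng.integers(0, 3, size=300)

# Synthetic inputs scattered around three regime centers
X = centers[true_regime] + rng.standard_normal((300, 2))
# Synthetic target whose mean depends on the regime
y = np.array([0.1, 0.3, 0.5])[true_regime] + 0.02 * rng.standard_normal(300)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

# One-hot encode cluster membership; without an intercept, the OLS
# coefficients are the mean target values within each cluster
onehot = np.eye(3)[labels]
ols = LinearRegression(fit_intercept=False).fit(onehot, y)
r2 = ols.score(onehot, y)
print(ols.coef_, r2)
```

The cluster centers in kmeans.cluster_centers_ play the role of the regime patterns and can be mapped back to the grid for comparison with the EOFs. The number of clusters can be selected by cross-validation in the same way as the number of EOFs.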
Credit#
Contributors include Bruno Deremble and Alexis Tantet. Several slides and images are taken from the very good Scikit-learn course.