Tutorial: Introduction to Unsupervised Learning with a Focus on PCA#

This tutorial accompanies the class Introduction to Unsupervised Learning with a Focus on PCA. It is based on the same case study as Tutorial: Regularization, Model Selection and Evaluation.

Tutorial Objectives
  • Apply Principal Component Analysis (PCA) to climate data to analyze patterns of variability

  • (Combine PCA reduction or \(k\)-means clustering with Ordinary Least Squares (OLS) to predict climate variables)

  • (Use cross-validation to regularize the OLS model by selecting the number of retained Empirical Orthogonal Functions (EOFs) or clusters)

Dataset presentation#

  • Input:

  • Target:

    • Onshore wind capacity factors

      • Domain: Metropolitan France

      • Spatial resolution: regional mean

      • Time resolution: daily

      • Period: 2014-2021

      • Units:

      • Source: RTE

Getting ready#

Reading the wind capacity factor and geopotential height data#

We follow the same procedure as in Tutorial: Regularization, Model Selection and Evaluation.

# Import modules
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
plt.rc('font', size=14)

# Data directory
DATA_DIR = Path('data')

# Filename of the geopotential height data at 500 hPa from the MERRA-2 reanalysis
START_DATE = '19800101'
END_DATE = '20220101'
filename = f'merra2_analyze_height_500_month_{START_DATE}-{END_DATE}.nc'
z500_label = 'Geopotential height (m)'

Question

  • Read the geopotential height data using xarray.load_dataset and print it.

# answer cell

Question (optional)

  • Coarsen the grid resolution of the geopotential height field to reduce the number of variables.

# answer cell
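A minimal sketch of what coarsening amounts to, on a small synthetic array rather than the tutorial's dataset: each block of `factor` by `factor` grid points is replaced by its mean. The array shape and the name `factor` are assumptions made for illustration. With xarray, the `coarsen` method of datasets and data arrays followed by `mean` performs the same kind of block averaging.

```python
import numpy as np

# Hypothetical field with shape (time, lat, lon); not the tutorial's data
field = np.arange(2 * 4 * 6, dtype=float).reshape(2, 4, 6)

# Coarsening factor along both latitude and longitude (assumed to divide the grid size)
factor = 2
n_time, n_lat, n_lon = field.shape

# Group grid points into factor-by-factor blocks and average within each block
blocks = field.reshape(n_time, n_lat // factor, factor, n_lon // factor, factor)
coarse = blocks.mean(axis=(2, 4))
```

Coarsening by a factor of 2 in each direction divides the number of variables, and hence the size of the covariance matrix along each axis, by 4.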

Representing the first moments of the geopotential height field#

Question

  • Compute the mean and the variance of the geopotential height with the mean and var methods.

  • Plot the mean with the plot method.

  • Do a filled-contour plot of the variance with the plot.contourf method.

# answer cell

Question

  • Plot the variance of the geopotential height.

# answer cell

Question

  • Scale the geopotential-height deviations to account for variations in the area represented by each grid point.

  • Plot the variance of the scaled geopotential height.

  • Qualitatively describe the mean and variance of the geopotential height.

# answer cell
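One common convention for the scaling, sketched here on synthetic data, is to multiply the deviations from the time mean by the square root of the cosine of the latitude. On a regular latitude-longitude grid the cell area is proportional to the cosine of the latitude, so variances and covariances of the scaled field are then area-weighted. The grid, the random field and all variable names below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical regular grid and synthetic field with shape (time, lat, lon)
lat = np.linspace(30.0, 70.0, 9)
lon = np.linspace(-20.0, 20.0, 11)
z500 = 5500.0 + 100.0 * rng.standard_normal((120, lat.size, lon.size))

# Deviations from the time mean at each grid point
deviations = z500 - z500.mean(axis=0)

# Square-root-of-cosine weights: variances end up weighted by cos(lat)
weights = np.sqrt(np.cos(np.deg2rad(lat)))
scaled = deviations * weights[np.newaxis, :, np.newaxis]

# Variance maps before and after scaling
var_raw = deviations.var(axis=0)
var_scaled = scaled.var(axis=0)
```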

Answer:

PCA of the geopotential height field#

Question

  • Estimate the covariance matrix of the scaled geopotential height using the stack method of data arrays.

# answer cell
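The idea behind stacking can be sketched as follows, with NumPy's `reshape` standing in for the `stack` method and a random array standing in for the scaled geopotential height. Latitude and longitude are flattened into a single "space" dimension, which gives a data matrix with one row per time step and one column per grid point; the covariance matrix is then estimated from the centered columns. Shapes and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic scaled field with shape (time, lat, lon); a stand-in for the real data
n_time, n_lat, n_lon = 200, 6, 8
field = rng.standard_normal((n_time, n_lat, n_lon))

# Flatten (lat, lon) into one "space" dimension
X = field.reshape(n_time, n_lat * n_lon)

# Center each column (grid point) in time
X = X - X.mean(axis=0)

# Unbiased sample covariance between grid points, shape (space, space)
cov = X.T @ X / (n_time - 1)
```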

Question

  • Compute EOFs and corresponding variances using np.linalg.eigh.

# answer cell

Question

  • Sort the EOFs and corresponding variances by decreasing variances.

  • Plot the fraction of variance “explained” by the leading 20 EOFs.

  • Interpret your results.

# answer cell
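As a sketch of the eigendecomposition and sorting steps, on a synthetic covariance matrix: `np.linalg.eigh` is designed for symmetric matrices and returns the eigenvalues in ascending order, with the eigenvectors as columns, so both need to be reordered. The data and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic centered data matrix (time, space) with unequal column variances
n_time, n_space = 300, 30
X = rng.standard_normal((n_time, n_space)) * np.linspace(3.0, 0.5, n_space)
X = X - X.mean(axis=0)
cov = X.T @ X / (n_time - 1)

# Eigenvalues (variances) come out in ascending order; EOFs are the columns
variances, eofs = np.linalg.eigh(cov)

# Reorder by decreasing variance, applying the same order to the EOF columns
order = np.argsort(variances)[::-1]
variances = variances[order]
eofs = eofs[:, order]

# Fraction of the total variance carried by each of the leading 20 EOFs
explained = variances / variances.sum()
leading_fraction = explained[:20]
```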

Answer:

Question

  • Plot the leading EOF on a map.

  • With what physical phenomenon could this pattern be associated?

# answer cell

Answer:

Question

  • Compute the principal component associated with the leading EOF.

  • Compare its variance to the corresponding eigenvalue and explain your result.

  • Plot this principal component.

  • Confirm or reconsider your previous answer on the physical phenomenon that could be associated with the leading EOF.

# answer cell
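The relation between a principal component and its eigenvalue can be checked numerically on synthetic data. If \(X\) is the centered data matrix, \(C = X^\top X / (n - 1)\) its covariance and \(e_1\) the leading unit-norm EOF with eigenvalue \(\lambda_1\), then the principal component \(X e_1\) has sample variance \(e_1^\top C e_1 = \lambda_1\). The sketch below assumes the same \(n - 1\) normalization for both quantities; names and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic centered data matrix (time, space) with correlated columns
n_time, n_space = 250, 20
X = rng.standard_normal((n_time, n_space)) @ rng.standard_normal((n_space, n_space))
X = X - X.mean(axis=0)
cov = X.T @ X / (n_time - 1)

# EOFs sorted by decreasing variance
variances, eofs = np.linalg.eigh(cov)
variances, eofs = variances[::-1], eofs[:, ::-1]

# Leading principal component: projection of the data on the leading EOF
pc1 = X @ eofs[:, 0]

# ddof=1 matches the (n - 1) normalization used for the covariance matrix
pc1_variance = pc1.var(ddof=1)
```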

Answer:

Dealing with the seasonal cycle#

Question (optional)

  • Use scipy.signal.welch to estimate the Power Spectral Density (PSD) of the leading principal component and plot it.

  • Confirm or reconsider your previous answer on the physical phenomenon that could be associated with the leading EOF.

# answer cell
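A possible way to call `scipy.signal.welch`, sketched on a synthetic series that stands in for the leading principal component. The sketch assumes monthly sampling, so `fs=12` expresses frequencies in cycles per year; the segment length `nperseg=120` (10 years) is an arbitrary choice for illustration.

```python
import numpy as np
from scipy.signal import welch

rng = np.random.default_rng(4)

# Synthetic monthly series: an annual cycle plus white noise
n_months = 504  # 42 years
t = np.arange(n_months)
pc1 = 10.0 * np.cos(2.0 * np.pi * t / 12.0) + rng.standard_normal(n_months)

# With 12 samples per year, frequencies are returned in cycles per year
freqs, psd = welch(pc1, fs=12.0, nperseg=120)

# Frequency at which the estimated PSD is largest
peak_frequency = freqs[np.argmax(psd)]
```

For this synthetic series the PSD peaks at one cycle per year. Longer segments give a finer frequency resolution but a noisier estimate, since fewer segments are averaged.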

Question

  • Compute the seasonal cycle of the geopotential height (for each month of the year, the average over all years) with the groupby method of data arrays.

  • Plot all 12 months. You can use the col option of the plot method of data arrays.

  • Also plot the variance of the seasonal cycle on a map.

  • Confirm or reconsider your previous answer on the physical phenomenon that could be associated with the leading EOF.

# answer cell

Answer:

Question

  • Compute seasonal anomalies (deviations from the seasonal cycle) of the geopotential height with groupby.

  • Plot the variance of the seasonal anomalies on a map.

  • How does it compare to the variance of the data with the seasonal cycle?

# answer cell
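The seasonal cycle and the seasonal anomalies can be sketched in plain NumPy on a synthetic monthly field that starts in January and covers whole years (both assumptions). With xarray, the usual idiom is to group by `'time.month'`, take the mean over time to get the seasonal cycle, and subtract that cycle from the grouped data to get the anomalies.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic monthly field (time, lat, lon): annual cycle plus noise
n_years, n_lat, n_lon = 10, 4, 5
month = np.tile(np.arange(12), n_years)  # calendar month index of each time step
cycle = 50.0 * np.cos(2.0 * np.pi * month / 12.0)
field = cycle[:, None, None] + rng.standard_normal((12 * n_years, n_lat, n_lon))

# Seasonal cycle: mean over all years for each calendar month, shape (12, lat, lon)
seasonal_cycle = field.reshape(n_years, 12, n_lat, n_lon).mean(axis=0)

# Seasonal anomalies: subtract from each time step the mean of its calendar month
anomalies = field - seasonal_cycle[month]

# Variance maps with and without the seasonal cycle
var_total = field.var(axis=0)
var_anomalies = anomalies.var(axis=0)
```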

Answer:

Representing and interpreting the EOFs#

Question

  • Estimate the covariance matrix of the anomalies (with the seasonal cycle subtracted).

  • Compute the EOFs and corresponding variances.

  • Plot the explained variances associated with the EOFs together with the cumulative sum of the explained variances.

  • What is the minimum number of EOFs that one needs to keep to explain at least 90% of the variance?

# answer cell
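The 90% threshold can be read off the cumulative sum of the explained variance fractions. The eigenvalues below are made up for illustration and are assumed to be already sorted in decreasing order.

```python
import numpy as np

# Hypothetical eigenvalues, sorted in decreasing order
variances = np.array([5.0, 2.5, 1.2, 0.6, 0.4, 0.2, 0.1])

# Fraction of variance per EOF and its cumulative sum
explained = variances / variances.sum()
cumulative = np.cumsum(explained)

# Index of the first EOF at which the cumulative fraction reaches 90%,
# plus one to convert the index into a number of EOFs
n_keep = int(np.argmax(cumulative >= 0.9) + 1)
```

Here the cumulative fractions are 0.50, 0.75, 0.87, 0.93, ..., so four EOFs are needed.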

Answer:

Question

  • Plot the leading 4 EOFs and principal components.

  • Can you associate these patterns with known climate phenomena?

# answer cell

Answer:

Reconstructing the geopotential height field from the EOFs and PCs#

Question

  • Reconstruct the inputs from the leading 4 EOFs only.

  • Compare the original time series at a few arbitrary locations to the corresponding reconstructed time series.

  • Plot the variance of the reconstruction on a map.

  • Repeat the exercise keeping more EOFs.

  • Interpret your results in terms of filtering.

# answer cell
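A truncated reconstruction can be sketched as a projection onto the leading EOFs followed by a mapping back to grid space. The helper `reconstruct` and the synthetic data are assumptions for illustration. Because the EOFs are orthonormal, keeping all of them recovers the data exactly, and the total variance of a truncated reconstruction equals the sum of the retained eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic centered data matrix (time, space) and its sorted EOFs
n_time, n_space = 200, 15
X = rng.standard_normal((n_time, n_space)) * np.linspace(4.0, 0.2, n_space)
X = X - X.mean(axis=0)
cov = X.T @ X / (n_time - 1)
variances, eofs = np.linalg.eigh(cov)
variances, eofs = variances[::-1], eofs[:, ::-1]


def reconstruct(X, eofs, n_eofs):
    """Project X on the leading n_eofs EOFs and map back to grid space."""
    leading = eofs[:, :n_eofs]   # (space, n_eofs)
    pcs = X @ leading            # principal components, (time, n_eofs)
    return pcs @ leading.T       # reconstruction, (time, space)


X_4 = reconstruct(X, eofs, 4)
X_all = reconstruct(X, eofs, n_space)

# Total variance of the 4-EOF reconstruction, summed over grid points
var_4 = X_4.var(axis=0, ddof=1).sum()
```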

Answer:

Using PCA to extract features for prediction#

Question (optional)

  • Design a linear model that best predicts present (not future) wind capacity factors in data/reseaux_energies_capacityfactor_wind-onshore.csv using geopotential-height principal components as inputs. To do so, use cross-validation to regularize the model by selecting the number of leading principal components retained.

# answer cell
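One way to organize the model selection, sketched with NumPy only on synthetic inputs and a synthetic target (neither is the tutorial's data). The helper `cv_error`, the 5 contiguous folds and the candidate range are assumptions. The centering and the EOFs are estimated on each training fold only, so that the validation fold plays no part in fitting.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic inputs (time, space) and a target driven by two inputs plus noise
n_time, n_space = 240, 12
X = rng.standard_normal((n_time, n_space)) * np.linspace(3.0, 0.3, n_space)
y = X[:, 0] - 0.5 * X[:, 1] + 0.3 * rng.standard_normal(n_time)


def cv_error(X, y, n_pcs, n_folds=5):
    """Mean squared validation error of OLS on the leading n_pcs PCs."""
    folds = np.array_split(np.arange(len(y)), n_folds)  # contiguous folds
    errors = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        # Fit the centering and the EOFs on the training fold only
        x_mean, y_mean = X[train_idx].mean(axis=0), y[train_idx].mean()
        Xc = X[train_idx] - x_mean
        _, eofs = np.linalg.eigh(Xc.T @ Xc / (len(train_idx) - 1))
        leading = eofs[:, ::-1][:, :n_pcs]
        # OLS of the centered target on the training principal components
        coef, *_ = np.linalg.lstsq(Xc @ leading, y[train_idx] - y_mean, rcond=None)
        # Predict the held-out fold with the training mean, EOFs and coefficients
        y_pred = (X[test_idx] - x_mean) @ leading @ coef + y_mean
        errors.append(np.mean((y[test_idx] - y_pred) ** 2))
    return float(np.mean(errors))


# Validation error for each candidate number of retained PCs
candidates = np.arange(1, n_space + 1)
scores = np.array([cv_error(X, y, k) for k in candidates])
best_n_pcs = int(candidates[np.argmin(scores)])
```

Plotting `scores` against `candidates` shows the trade-off: too few components miss part of the signal, while too many add variance to the estimated coefficients.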

Question (optional)

  • Use \(k\)-means clustering with sklearn.cluster.KMeans to detect “atmospheric regimes” from the geopotential-height data and compare the result to the EOFs obtained above.

  • Design a linear model as above but based on clusters rather than EOFs.

# answer cell
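A minimal sketch of the clustering step with `sklearn.cluster.KMeans`, on synthetic data drawn around three made-up "regime" patterns. The number of clusters, `n_init` and `random_state` are arbitrary choices for illustration. The centroids play the role of spatial patterns that can be compared to EOFs, and the one-hot cluster memberships can serve as inputs to a linear model.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(8)

# Synthetic anomalies (time, space) scattered around 3 hypothetical regime patterns
n_time, n_space, n_regimes = 300, 10, 3
patterns = 5.0 * rng.standard_normal((n_regimes, n_space))
true_regime = rng.integers(0, n_regimes, size=n_time)
X = patterns[true_regime] + rng.standard_normal((n_time, n_space))

# Fit k-means; several initializations reduce the risk of a poor local optimum
kmeans = KMeans(n_clusters=n_regimes, n_init=10, random_state=0).fit(X)
centroids = kmeans.cluster_centers_  # one spatial pattern per cluster
labels = kmeans.labels_              # cluster index of each time step

# One-hot encoding of the cluster memberships, usable as regressors
membership = np.eye(n_regimes)[labels]
```

Since each row of `membership` sums to one, a linear model using all columns should omit the intercept (or drop one column) to avoid collinearity. Unlike EOFs, the centroids are not constrained to be orthogonal.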

Credit#

Contributors include Bruno Deremble and Alexis Tantet. Several slides and images are taken from the very good Scikit-learn course.

