# Tutorial: Introduction to Unsupervised Learning with a Focus on PCA

This tutorial accompanies the class Introduction to Unsupervised Learning with a Focus on PCA and is based on the same case study as Tutorial: Regularization, Model Selection and Evaluation.

**Tutorial Objectives**

- Apply Principal Component Analysis (PCA) to climate data to analyze patterns of variability.
- (Combine PCA reduction or \(k\)-means clustering with Ordinary Least Squares (OLS) to predict climate variables.)
- (Use cross-validation to regularize the OLS model through the number of retained Empirical Orthogonal Functions (EOFs) or clusters.)

## Dataset presentation

Input:

- Geopotential height at 500 hPa
- Domain: North Atlantic
- Spatial resolution: \(0.5° \times 0.625°\)
- Time resolution: monthly
- Period: 1980-2021
- Units: m
- Source: MERRA-2 reanalysis

Target:

- Onshore wind capacity factors
- Domain: Metropolitan France
- Spatial resolution: regional mean
- Time resolution: daily
- Period: 2014-2021
- Units:
- Source: RTE

## Getting ready

### Reading the wind capacity factor and geopotential height data

We follow the same procedure as in Tutorial: Regularization, Model Selection and Evaluation.

```
# Import modules
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
plt.rc('font', size=14)
# Data directory
DATA_DIR = Path('data')
# Filename to geopotential height at 500hPa from MERRA-2 reanalysis
START_DATE = '19800101'
END_DATE = '20220101'
filename = 'merra2_analyze_height_500_month_{}-{}.nc'.format(START_DATE, END_DATE)
z500_label = 'Geopotential height (m)'
```

Question

Read the geopotential height data using `xarray.load_dataset` and print it.

```
# answer cell
```

Question (optional)

Coarsen the grid resolution of the geopotential height field to reduce the number of variables.

```
# answer cell
```

### Representing the first moments of the geopotential height field

Question

Compute the mean and the variance of the geopotential height with the `mean` and `var` methods.

Plot the mean with the `plot` method.

Do a filled-contour plot of the variance with the `plot.contourf` method.

```
# answer cell
```

Question

Plot the variance of the geopotential height.

```
# answer cell
```

Question

Scale the geopotential-height deviations to account for variations in the area represented by each grid point.

Plot the variance of the scaled geopotential height.

Qualitatively describe the mean and variance of the geopotential height.

```
# answer cell
```

Answer:
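As a rough illustration of the area scaling asked for above, here is a minimal sketch on synthetic data. It assumes a regular latitude-longitude grid, on which the area of a grid cell is proportional to the cosine of the latitude. The grid, the array `z` and all variable names are hypothetical stand-ins, not the MERRA-2 data.

```python
import numpy as np

# Synthetic stand-in for the geopotential height, shape (time, lat, lon).
# The grid and the values are made up for illustration only.
rng = np.random.default_rng(0)
lat = np.linspace(20.0, 80.0, 13)   # degrees north (hypothetical grid)
lon = np.linspace(-80.0, 40.0, 25)  # degrees east (hypothetical grid)
z = 5500.0 + 50.0 * rng.standard_normal((120, lat.size, lon.size))

# Deviations from the time mean at each grid point
dev = z - z.mean(axis=0)

# On a regular lat-lon grid, cell area is proportional to cos(lat).
# Scaling the deviations by sqrt(cos(lat)) weights the variances by cos(lat).
weights = np.sqrt(np.cos(np.deg2rad(lat)))
dev_scaled = dev * weights[np.newaxis, :, np.newaxis]

var_raw = dev.var(axis=0)
var_scaled = dev_scaled.var(axis=0)
```

The square root is used because variances and covariances are quadratic in the deviations, so the product of two scaled deviations carries the full area weight.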

## PCA of the geopotential height field

Question

Estimate the covariance matrix of the scaled geopotential height using the `stack` method of data arrays.

```
# answer cell
```
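The idea behind stacking can be sketched with plain NumPy: the two spatial dimensions are flattened into a single "space" dimension so that the field becomes a (time, space) matrix, from which the sample covariance follows. The snippet below uses `reshape` on a synthetic array as a stand-in for `stack` on a data array; the array and its sizes are assumptions for illustration.

```python
import numpy as np

# Synthetic (time, lat, lon) field; sizes are arbitrary
rng = np.random.default_rng(1)
n_time, n_lat, n_lon = 200, 4, 5
field = rng.standard_normal((n_time, n_lat, n_lon))

# Flatten (lat, lon) into one spatial axis, the NumPy analogue of stacking
X = field.reshape(n_time, n_lat * n_lon)

# Center each grid point in time
X = X - X.mean(axis=0)

# Sample covariance matrix between grid points, shape (n_space, n_space)
C = X.T @ X / (n_time - 1)
```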

Question

Compute EOFs and corresponding variances using `np.linalg.eigh`.

```
# answer cell
```

Question

Sort the EOFs and corresponding variances by decreasing variances.

Plot the fraction of variance “explained” by the leading 20 EOFs.

Interpret your results.

```
# answer cell
```

Answer:
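The two steps above (eigendecomposition, then sorting) can be sketched as follows on a synthetic covariance matrix. `np.linalg.eigh` returns eigenvalues in ascending order with eigenvectors as columns, so both need to be reordered. The data matrix `X` is made up for illustration.

```python
import numpy as np

# Synthetic centered data matrix (time, space) with unequal column variances
rng = np.random.default_rng(2)
n_time, n_space = 300, 12
X = rng.standard_normal((n_time, n_space)) * np.linspace(3.0, 0.5, n_space)
X = X - X.mean(axis=0)
C = X.T @ X / (n_time - 1)

# eigh is meant for symmetric matrices; eigenvalues come out in ascending order
variances, eofs = np.linalg.eigh(C)

# Reorder by decreasing variance; column k of eofs is the k-th EOF
order = np.argsort(variances)[::-1]
variances = variances[order]
eofs = eofs[:, order]

# Fraction of the total variance carried by each EOF
explained = variances / variances.sum()
```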

Question

Plot the leading EOF on a map.

With what physical phenomenon could this pattern be associated?

```
# answer cell
```

Answer:

Question

Compute the principal component associated with the leading EOF.

Compare its variance to the corresponding eigenvalue and explain your result.

Plot this principal component.

Confirm or reconsider your previous answer on the physical phenomenon that could be associated with the leading EOF.

```
# answer cell
```

Answer:
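To make the comparison between the variance of a principal component and its eigenvalue concrete, here is a small sketch on synthetic data. It assumes the data are centered and that the covariance and the variance use the same normalization (here, division by the number of samples minus one).

```python
import numpy as np

# Synthetic correlated data, centered in time
rng = np.random.default_rng(3)
X = rng.standard_normal((500, 6)) @ rng.standard_normal((6, 6))
X = X - X.mean(axis=0)

# Covariance, EOFs and variances sorted in decreasing order
C = X.T @ X / (X.shape[0] - 1)
variances, eofs = np.linalg.eigh(C)
order = np.argsort(variances)[::-1]
variances, eofs = variances[order], eofs[:, order]

# Leading principal component: projection of the data on the leading EOF
pc1 = X @ eofs[:, 0]

# Its sample variance, with the same normalization as the covariance matrix
pc1_var = pc1.var(ddof=1)
```

Writing the leading EOF as \(e_1\), the variance of the projection is \(e_1^\top C e_1 = \lambda_1\), which is why the two numbers agree.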

### Dealing with the seasonal cycle

Question (optional)

Use `scipy.signal.welch` to estimate the Power Spectral Density (PSD) of the leading principal component and plot it.

Confirm or reconsider your previous answer on the physical phenomenon that could be associated with the leading EOF.

```
# answer cell
```
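A possible way to call `scipy.signal.welch` on a monthly series is sketched below. The series is synthetic (an annual sine wave plus noise) and only stands in for the leading principal component; the segment length `nperseg=120` is an arbitrary choice made for this example.

```python
import numpy as np
from scipy.signal import welch

# Synthetic monthly series over 42 years: annual cycle plus white noise
rng = np.random.default_rng(4)
n_months = 504
t = np.arange(n_months)
pc = np.sin(2 * np.pi * t / 12) + 0.5 * rng.standard_normal(n_months)

# With fs = 12 samples per year, frequencies are in cycles per year
freqs, psd = welch(pc, fs=12.0, nperseg=120)

# Frequency at which the estimated PSD is largest
peak_freq = freqs[np.argmax(psd)]
```

With the sampling frequency expressed in samples per year, a peak at one cycle per year points to an annual cycle in the series.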

Question

Compute the seasonal cycle of the geopotential height (for each month of the year, the average over all years) with the `groupby` method of data arrays.

Plot all 12 months. You can use the `col` option of the `plot` method of data arrays.

Also plot the variance of the seasonal cycle on a map.

Confirm or reconsider your previous answer on the physical phenomenon that could be associated with the leading EOF.

```
# answer cell
```

Answer:

Question

Compute seasonal anomalies (deviations from the seasonal cycle) of the geopotential height with `groupby`.

Plot the variance of the seasonal anomalies on a map.

How does it compare to the variance of the data with the seasonal cycle?

```
# answer cell
```

Answer:
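The logic of a monthly climatology and of the anomalies can be sketched in plain NumPy on synthetic data, as below. With a data array, grouping on the month of the time coordinate (for instance `groupby('time.month')`, assuming the coordinate is named `time`) plays the role of the explicit month index used here. All arrays are hypothetical.

```python
import numpy as np

# Synthetic monthly data (time, space): a seasonal cycle plus noise
rng = np.random.default_rng(5)
n_years, n_space = 10, 8
month = np.tile(np.arange(12), n_years)  # month index 0..11 of each time step
cycle_true = 100.0 * np.cos(2 * np.pi * np.arange(12) / 12)
z = (5500.0 + cycle_true[month][:, np.newaxis]
     + 20.0 * rng.standard_normal((month.size, n_space)))

# Seasonal cycle: average over all years, separately for each calendar month
seasonal_cycle = np.stack([z[month == m].mean(axis=0) for m in range(12)])

# Seasonal anomalies: subtract from each time step the mean of its month
anomalies = z - seasonal_cycle[month]

# Variance with and without the seasonal cycle
var_total = z.var(axis=0)
var_anom = anomalies.var(axis=0)
```

In this balanced example (the same number of samples for every month), the total variance splits exactly into the variance of the seasonal cycle plus the variance of the anomalies.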

### Representing and interpreting the EOFs

Question

Estimate the covariance matrix of the anomalies (with the seasonal cycle subtracted).

Compute the EOFs and corresponding variances.

Plot the explained variances associated with the EOFs together with the cumulative sum of the explained variances.

What is the minimum number of EOFs that one needs to keep to explain at least 90% of the variance?

```
# answer cell
```

Answer:
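The counting step can be sketched as follows. The eigenvalue spectrum below is invented for the example and is assumed to be sorted in decreasing order already.

```python
import numpy as np

# Hypothetical eigenvalues (variances), sorted in decreasing order
variances = np.array([5.0, 2.5, 1.2, 0.6, 0.3, 0.2, 0.1, 0.1])

# Fraction of variance explained by each EOF and its cumulative sum
explained = variances / variances.sum()
cumulative = np.cumsum(explained)

# Smallest number of EOFs whose cumulative explained variance reaches 90%
# (argmax returns the index of the first True; add 1 to turn it into a count)
n_keep = int(np.argmax(cumulative >= 0.9) + 1)
```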

Question

Plot the leading 4 EOFs and principal components.

Can you associate these patterns with known climate phenomena?

```
# answer cell
```

Answer:

### Reconstructing the geopotential height field from the EOFs and PCs

Question

Reconstruct the inputs from the leading 4 EOFs only.

Compare the original time series at a few arbitrary locations to the corresponding reconstructed time series.

Plot the variance of the reconstruction on a map.

Repeat the exercise while keeping more EOFs.

Interpret your results in terms of filtering.

```
# answer cell
```

Answer:
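A truncated reconstruction amounts to projecting the data on the retained EOFs and mapping the resulting principal components back to grid space. The sketch below does this on synthetic data; the helper `reconstruct` is a hypothetical name introduced for the example.

```python
import numpy as np

# Synthetic centered data (time, space) with decreasing column variances
rng = np.random.default_rng(6)
n_time, n_space = 400, 10
X = rng.standard_normal((n_time, n_space)) * np.linspace(4.0, 0.2, n_space)
X = X - X.mean(axis=0)

# EOFs and variances sorted in decreasing order
C = X.T @ X / (n_time - 1)
variances, eofs = np.linalg.eigh(C)
order = np.argsort(variances)[::-1]
variances, eofs = variances[order], eofs[:, order]


def reconstruct(X, eofs, k):
    """Project X on the leading k EOFs and map back to grid space."""
    pcs = X @ eofs[:, :k]       # principal components, shape (time, k)
    return pcs @ eofs[:, :k].T  # reconstruction, shape (time, space)


X4 = reconstruct(X, eofs, 4)           # leading 4 EOFs only
X_all = reconstruct(X, eofs, n_space)  # all EOFs: recovers the input

# Total variance of the truncated reconstruction
var_rec4 = X4.var(axis=0, ddof=1).sum()
```

The total variance of the reconstruction equals the sum of the retained eigenvalues, so the discarded EOFs account for exactly the variance that is filtered out.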

## Using PCA to extract features for prediction

Question (optional)

Design a linear model that best predicts present (not future) wind capacity factors in `data/reseaux_energies_capacityfactor_wind-onshore.csv` using geopotential-height principal components as inputs. To do so, use cross-validation to regularize the model through the number of leading principal components retained.

```
# answer cell
```
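One way to organize such a cross-validation is sketched below with NumPy only. The inputs and the target are synthetic, the helper `cv_mse` is a hypothetical name, and the use of five contiguous folds is an assumption made for the example. The EOFs are recomputed on each training fold so that the validation data do not leak into the features.

```python
import numpy as np

# Synthetic inputs (time, space) and a target driven by two input columns
rng = np.random.default_rng(7)
n_time, n_space = 240, 15
X = rng.standard_normal((n_time, n_space)) * np.linspace(3.0, 0.3, n_space)
y = 0.5 * X[:, 0] - 0.3 * X[:, 1] + 0.2 * rng.standard_normal(n_time)


def cv_mse(X, y, k, n_folds=5):
    """Cross-validated mean squared error of OLS on the leading k PCs."""
    indices = np.arange(len(y))
    errors = []
    for val in np.array_split(indices, n_folds):  # contiguous folds
        train = np.setdiff1d(indices, val)
        # Center and compute the EOFs from the training fold only
        x_mean, y_mean = X[train].mean(axis=0), y[train].mean()
        Xc = X[train] - x_mean
        variances, eofs = np.linalg.eigh(Xc.T @ Xc / (len(train) - 1))
        E = eofs[:, np.argsort(variances)[::-1][:k]]
        # OLS of the centered target on the leading k principal components
        coef, *_ = np.linalg.lstsq(Xc @ E, y[train] - y_mean, rcond=None)
        # Predict the validation fold with the training EOFs and means
        pred = (X[val] - x_mean) @ E @ coef + y_mean
        errors.append(np.mean((y[val] - pred) ** 2))
    return float(np.mean(errors))


# Validation error for each number of retained principal components
scores = {k: cv_mse(X, y, k) for k in range(1, n_space + 1)}
best_k = min(scores, key=scores.get)
```

Plotting `scores` against the number of components would show the usual trade-off: too few components underfit the target, while too many add variance to the estimated coefficients.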

Question (optional)

Use \(k\)-means clustering with `sklearn.cluster.KMeans` to detect “atmospheric regimes” from the geopotential-height data and compare the result to the EOFs obtained above.

Design a linear model as above but based on clusters rather than EOFs.

```
# answer cell
```
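The call pattern of `sklearn.cluster.KMeans` can be illustrated on a toy example. The three "regimes" below are synthetic point clouds in a two-dimensional space, and the number of clusters is assumed known, which is not the case for the real data. Encoding the cluster labels as indicator variables is one possible way, among others, to feed them to a linear model.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three synthetic "regimes" in a 2-D space (hypothetical stand-in for PCs)
rng = np.random.default_rng(8)
centers_true = np.array([[-5.0, 0.0], [5.0, 0.0], [0.0, 6.0]])
pcs = np.vstack([c + rng.standard_normal((100, 2)) for c in centers_true])

# Fit k-means; n_init restarts guard against poor initializations
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(pcs)
labels = kmeans.labels_              # regime index of each sample
centroids = kmeans.cluster_centers_  # one centroid per regime

# One-hot regime indicators that could serve as inputs to a linear model
indicators = np.eye(3)[labels]
```

If the linear model includes an intercept, one indicator column should be dropped, since the indicators sum to one for every sample.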

## Credit

Contributors include Bruno Deremble and Alexis Tantet. Several slides and images are taken from the very good Scikit-learn course.