{ "cells": [ { "cell_type": "markdown", "id": "586165b9", "metadata": {}, "source": [ "# Tutorial: Introduction to Unsupervised Learning with a Focus on PCA\n", "\n", "[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/git/https%3A%2F%2Fgitlab.in2p3.fr%2Fenergy4climate%2Fpublic%2Feducation%2Fmachine_learning_for_climate_and_energy/master?filepath=book%2Fnotebooks%2F07_tutorial_unsupervised_learning_pca.ipynb)\n", "\n", "Tutorial to the class [Introduction to Unsupervised Learning with a Focus on PCA](07_unsupervised_learning_pca.ipynb) based on the same case study as in [Tutorial: Regularization, Model Selection and Evaluation](05_tutorial_regularization_selection_evaluation.ipynb)." ] }, { "cell_type": "markdown", "id": "30780159", "metadata": {}, "source": [ "
\n", " Tutorial Objectives\n", " \n", "- Apply Principal Component Analysis (PCA) to climate data to analyze patterns of variability\n", "- (Combine PCA reduction/$k$-means clustering to Ordinary Least Squares (OLS) to predict climate variables)\n", "- (Use cross-validation to regularize the OLS with the number of retained Empirical Orthogonal Functions (EOFs) or clusters).\n", "
" ] }, { "cell_type": "markdown", "id": "384478ed", "metadata": {}, "source": [ "## Dataset presentation\n", "\n", "- Input:\n", " - [Geopotential height](https://en.wikipedia.org/wiki/Geopotential_height) at 500hPa\n", " - Domain: North Atlantic\n", " - Spatial resolution: $0.5° \\times 0.625°$\n", " - Time resolution: monthly\n", " - Period: 1980-2021\n", " - Units: m\n", " - Source: [MERRA-2 reanalysis](https://gmao.gsfc.nasa.gov/reanalysis/MERRA-2/)\n", "- Target:\n", " - Onshore wind capacity factors\n", " - Domain: Metropolitan France\n", " - Spatial resolution: regional mean\n", " - Time resolution: daily\n", " - Period: 2014-2021\n", " - Units:\n", " - Source: [RTE](https://opendata.reseaux-energies.fr/)" ] }, { "cell_type": "markdown", "id": "f3aa8fd2-c75e-4abc-b02f-55c2de8aeec5", "metadata": {}, "source": [ "## Getting ready\n", "\n", "### Reading the wind capacity factor and geopotential height data\n", "\n", "We follow the same procedure as in [# Tutorial: Regularization, Model Selection and Evaluation](05_tutorial_regularization_selection_evaluation.ipynb)." ] }, { "cell_type": "code", "execution_count": null, "id": "ef0a89b2", "metadata": {}, "outputs": [], "source": [ "# Import modules\n", "from pathlib import Path\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "plt.rc('font', size=14)\n", "\n", "# Data directory\n", "DATA_DIR = Path('data')\n", "\n", "# Filename to geopotential height at 500hPa from MERRA-2 reanalysis\n", "START_DATE = '19800101'\n", "END_DATE = '20220101'\n", "filename = 'merra2_analyze_height_500_month_{}-{}.nc'.format(START_DATE, END_DATE)\n", "z500_label = 'Geopotential height (m)'" ] }, { "cell_type": "markdown", "id": "c08db7e2", "metadata": {}, "source": [ "> ***Question***\n", "> - Read the geopotential height data using `xarray.load_dataset` and print it." ] }, { "cell_type": "code", "execution_count": null, "id": "f3ad78a4", "metadata": {}, "outputs": [], "source": [ "# answer cell\n" ] }, { "cell_type": "markdown", "id": "ff3f5aaf-26bd-42d9-ab4f-0b031ad81ce8", "metadata": {}, "source": [ "> ***Question (optional)***\n", "> - Coarsen the grid resolution of the geopotential height field to reduce the number of variables." ] }, { "cell_type": "code", "execution_count": null, "id": "a9ebcb05-3e6b-46e4-8f9f-45b2cde49635", "metadata": {}, "outputs": [], "source": [ "# answer cell\n" ] }, { "cell_type": "markdown", "id": "4736521f-81ee-4ea7-a98b-000a30211304", "metadata": {}, "source": [ "### Representing the first moments of the geopotential height field" ] }, { "cell_type": "markdown", "id": "2ffc0d21", "metadata": {}, "source": [ "> ***Question***\n", "> - Compute the mean and the variance of the geopotential height with the `mean` and `var` methods.\n", "> - Plot the mean with the `plot` method.\n", "> - Do a filled-contour plot of the variance with the `plot.contourf` method." ] }, { "cell_type": "code", "execution_count": null, "id": "609adb58", "metadata": {}, "outputs": [], "source": [ "# answer cell\n" ] }, { "cell_type": "markdown", "id": "2f13ffdd", "metadata": {}, "source": [ "> ***Question***\n", "> - Plot the variance of the geopotential height." ] }, { "cell_type": "code", "execution_count": null, "id": "abc00fc3", "metadata": {}, "outputs": [], "source": [ "# answer cell\n" ] }, { "cell_type": "markdown", "id": "d19aabd3", "metadata": {}, "source": [ "> ***Question***\n", "> - Scale the geopotential-height deviations to account for variations in the area represented by each grid point.\n", "> - Plot the variance of the scaled geopotential height.\n", "> - Qualitatively describe the mean and variance of the geopotential height." ] }, { "cell_type": "code", "execution_count": null, "id": "46594c02", "metadata": {}, "outputs": [], "source": [ "# answer cell\n" ] }, { "cell_type": "markdown", "id": "7539d8f5-0d8c-4a36-a7ea-3cef16959d12", "metadata": {}, "source": [ "Answer:" ] }, { "cell_type": "markdown", "id": "53117645-9411-447b-b8bf-7253fb6f38f6", "metadata": {}, "source": [ "## PCA of the geopotential height field" ] }, { "cell_type": "markdown", "id": "e7b40f91", "metadata": {}, "source": [ "> ***Question***\n", "> - Estimate the covariance matrix of the scaled geopotential height using the `stack` method of data arrays." ] }, { "cell_type": "code", "execution_count": null, "id": "380d72c5", "metadata": {}, "outputs": [], "source": [ "# answer cell\n" ] }, { "cell_type": "markdown", "id": "5d4641d8", "metadata": {}, "source": [ "> ***Question***\n", "> - Compute EOFs and corresponding variances using `np.linalg.eigh`." ] }, { "cell_type": "code", "execution_count": null, "id": "655fa439", "metadata": {}, "outputs": [], "source": [ "# answer cell\n" ] }, { "cell_type": "markdown", "id": "53d95ec3", "metadata": {}, "source": [ "> ***Question***\n", "> - Sort the EOFs and corresponding variances by decreasing variances.\n", "> - Plot the fraction of variance \"explained\" by the leading 20 EOFs.\n", "> - Interpret your results." ] }, { "cell_type": "code", "execution_count": null, "id": "058676f3", "metadata": {}, "outputs": [], "source": [ "# answer cell\n" ] }, { "cell_type": "markdown", "id": "76655df7", "metadata": {}, "source": [ "Answer: " ] }, { "cell_type": "markdown", "id": "99649fdf", "metadata": {}, "source": [ "> ***Question***\n", "> - Plot the leading EOF on a map.\n", "> - To what physical phenomenon could this pattern be associated to?" ] }, { "cell_type": "code", "execution_count": null, "id": "c753578a", "metadata": {}, "outputs": [], "source": [ "# answer cell\n" ] }, { "cell_type": "markdown", "id": "d300a943", "metadata": {}, "source": [ "Answer: " ] }, { "cell_type": "markdown", "id": "426067eb", "metadata": {}, "source": [ "> ***Question***\n", "> - Compute the principal component associated with the leading EOF.\n", "> - Compare its variance to the corresponding eigenvalue and explain your result.\n", "> - Plot this principal component.\n", "> - Confirm or reconsider you previous answer on the physical phenomenon that could be associated to the leading EOF." ] }, { "cell_type": "code", "execution_count": null, "id": "9498b5e1", "metadata": {}, "outputs": [], "source": [ "# answer cell\n" ] }, { "cell_type": "markdown", "id": "ceaab9ed", "metadata": {}, "source": [ "Answer: " ] }, { "cell_type": "markdown", "id": "84e4a8d7-dc5c-4f89-843e-b9c469e9a17d", "metadata": {}, "source": [ "### Dealing with the seasonal cycle" ] }, { "cell_type": "markdown", "id": "9d9f8dc6", "metadata": {}, "source": [ "> ***Question (optional)***\n", "> - Use the `scipy.signal.welch` to estimate the Power Spectral Density (PSD) of the leading principal component and plot it.\n", "> - Confirm or reconsider you previous answer on the physical phenomenon that could be associated to the leading EOF." ] }, { "cell_type": "code", "execution_count": null, "id": "a019076f", "metadata": {}, "outputs": [], "source": [ "# answer cell\n" ] }, { "cell_type": "markdown", "id": "1cebff8a", "metadata": {}, "source": [ "> ***Question***\n", "> - Compute the seasonal cycle of the geopotential height (averages over all years of the same month of the year for each month) with the `groupby` of data arrays.\n", "> - Plot all 12 months. You can use the `col` option of the `plot` method of data arrays.\n", "> - Also plot the variance of the seasonal cycle on a map.\n", "> - Confirm or reconsider you previous answer on the physical phenomenon that could be associated to the leading EOF." ] }, { "cell_type": "code", "execution_count": null, "id": "54364627", "metadata": {}, "outputs": [], "source": [ "# answer cell\n" ] }, { "cell_type": "markdown", "id": "f2287a22", "metadata": {}, "source": [ "Answer: " ] }, { "cell_type": "markdown", "id": "0570003f", "metadata": {}, "source": [ "> ***Question***\n", "> - Compute seasonal anomalies (deviations from the seasonal cycle) of the geopotential height with `groupby`.\n", "> - Plot the variance of the seasonal anomalies on a map.\n", "> - How does it compare to the variance of the data with the seasonal cycle?" ] }, { "cell_type": "code", "execution_count": null, "id": "9d63b8fd", "metadata": {}, "outputs": [], "source": [ "# answer cell\n" ] }, { "cell_type": "markdown", "id": "ff3b448e", "metadata": {}, "source": [ "Answer:" ] }, { "cell_type": "markdown", "id": "8c1cc30a-e710-404b-920b-fdf9ffc2443f", "metadata": {}, "source": [ "### Representing and interpreting the EOFs" ] }, { "cell_type": "markdown", "id": "6eaa9909", "metadata": {}, "source": [ "> ***Question***\n", "> - Estimate the covariance matrix of the anomalies (with the seasonal cycle subtracted).\n", "> - Compute the EOFs and corresponding variances.\n", "> - Plot the explained variances associated with the EOFs together with the cumulative sum of the explained variances.\n", "> - What is the minimum number of EOFs that one needs to keep to explain at least 90% of the variance." ] }, { "cell_type": "code", "execution_count": null, "id": "bd49c6b5", "metadata": {}, "outputs": [], "source": [ "# answer cell\n" ] }, { "cell_type": "markdown", "id": "1d5de3dd", "metadata": {}, "source": [ "Answer:" ] }, { "cell_type": "markdown", "id": "86df8b59", "metadata": {}, "source": [ "> ***Question***\n", "> - Plot the leading 4 EOFs and principal components.\n", "> - Can you associate these patterns to known climate phenomena?" ] }, { "cell_type": "code", "execution_count": null, "id": "a3a43eec", "metadata": { "scrolled": true, "tags": [] }, "outputs": [], "source": [ "# answer cell\n" ] }, { "cell_type": "markdown", "id": "03cb0e7c-21b0-477f-9c5f-2291dd5f8c74", "metadata": {}, "source": [ "### Reconstructing the geopotential height field from the EOFs and PCs" ] }, { "cell_type": "markdown", "id": "8e0a79e7", "metadata": {}, "source": [ "Answer: " ] }, { "cell_type": "markdown", "id": "5a111f97", "metadata": {}, "source": [ "> ***Question***\n", "> - Reconstruct the inputs from the leading 4 EOFs only.\n", "> - Compare the original time series at a few arbitrary locations to the corresponding reconstructed time series.\n", "> - Plot the variance of the reconstruction on a map.\n", "> - Same question but keeping more EOFs\n", "> - Interpret your results in terms of filtering." ] }, { "cell_type": "code", "execution_count": null, "id": "f8158ded", "metadata": { "scrolled": true }, "outputs": [], "source": [ "# answer cell\n" ] }, { "cell_type": "markdown", "id": "3d31acaf", "metadata": {}, "source": [ "Answer: " ] }, { "cell_type": "markdown", "id": "9d50858d-1609-4b0e-99bb-125745e687cc", "metadata": {}, "source": [ "## Using PCA to extract features for prediction" ] }, { "cell_type": "markdown", "id": "ccf49dae", "metadata": {}, "source": [ "> ***Question (optional)***\n", "> - Design a linear model that best predicts present (not future) wind capacity factors in `data/reseaux_energies_capacityfactor_wind-onshore.csv` using geopotential-height principal components as inputs. To do, use cross-validation to regularize based on the number of leading principal components retained." ] }, { "cell_type": "code", "execution_count": null, "id": "7ff42392-247b-4566-877c-9257081f3edc", "metadata": {}, "outputs": [], "source": [ "# answer cell\n" ] }, { "cell_type": "markdown", "id": "6d7c0e41", "metadata": {}, "source": [ "> ***Question (optional)***\n", "> - Use $k$-means clustering with `sklearn.cluster.KMeans` to detect \"atmospheric regimes\" from the geopotential-height data and compare the result to the EOFs obtained above.\n", "> - Design a linear model as above but based on clusters rather than EOFs." ] }, { "cell_type": "code", "execution_count": null, "id": "ae622613-4f1f-491f-9b4d-20d4a17a0abf", "metadata": {}, "outputs": [], "source": [ "# answer cell\n" ] }, { "cell_type": "markdown", "id": "4eed55f4", "metadata": {}, "source": [ "***\n", "## Credit\n", "\n", "[//]: # \"This notebook is part of [E4C Interdisciplinary Center - Education](https://gitlab.in2p3.fr/energy4climate/public/education).\"\n", "Contributors include Bruno Deremble and Alexis Tantet.\n", "Several slides and images are taken from the very good [Scikit-learn course](https://inria.github.io/scikit-learn-mooc/).\n", "\n", "
\n", "\n", "
\n", " \n", "\"Logo\n", "\n", "\"Logo\n", "\n", "\"Logo\n", "\n", "\"Logo\n", "\n", "\"Logo\n", "\n", "\"Logo\n", "\n", "\"Logo\n", " \n", "
\n", "\n", "
\n", "\n", "
\n", " \"Creative\n", "
This work is licensed under a   Creative Commons Attribution-ShareAlike 4.0 International License.\n", "
" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.16" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": true, "autocomplete": false, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 5 }