{ "cells": [ { "cell_type": "markdown", "id": "5eb8c31b", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Tutorial on Classification I: Generative models" ] }, { "cell_type": "markdown", "id": "3a485dc1", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", " Context\n", " \n", "- Era5 data set: surface data at paris.\n", "- Linear discriminant analysis\n", "\n", "
" ] }, { "cell_type": "markdown", "id": "ebf19950", "metadata": {}, "source": [ "### Prediction of the rain\n", "\n", "Prediction of the rain remains one of the most challenging task in numerical weather prediction. In fact the rain is the result of multiple scale phenomena: from the large-scale organization of weather system to the small scale microphysics of dropplet formation. Getting the right prediction for the rain implies that we have a model that captures well all these scales.\n", "\n", "Despite the fact that rain is hard to predict, there seem to be exist a correspondance between the surface pressure and the weather conditions as shown in the picture below:\n", "\n", "\"Barometer\"\n" ] }, { "cell_type": "markdown", "id": "4cbc239b", "metadata": {}, "source": [ "## Data set\n", "\n", "The data we are going to use in this notebook comes from the [ERA5 data base](https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-pressure-levels). To quote ECMWF: *Reanalysis combines model data with observations from across the world into a globally complete and consistent dataset using the laws of physics.* In the ERA5 data base, we can find 4d fields (time, latitude, longitude, height) such as temperature, wind, humidity, clouds, precipitation, etc... The time resolution is 1 hour, horizontal grid spacing is approx 20 km and vertical resolution varies with high resolution near the ground and coarse resolution near the top of the atmosphere." ] }, { "cell_type": "markdown", "id": "ecd01a21", "metadata": {}, "source": [ "To illustrate this notebook, I prepared a data set with surface variables only at a given location between 2000 and 2009 at the hourly resolution.\n", "\n", "The variables in this data set are the raw variables that you can find in the ERA reanalysis\n", "\n", "| Variable name | Description | Unit |\n", "| :------------- | :------------- | :------ |\n", "| t2m | Air temperature at 2 m above the ground | [K] |\n", "| d2m | [Dew point](https://en.wikipedia.org/wiki/Dew_point) at 2 m above the ground | [K] |\n", "| u10 | Zonal wind component at 10 m | [m/s] |\n", "| v10 | Meridional wind component at 10 m | [m/s] |\n", "| skt | Skin temperature | [K] |\n", "| tcc | Total cloud cover | [0-1] |\n", "| sp | Surface pressure | [Pa] |\n", "| tp | Total precipitation | [m] |\n", "| ssrd | Surface solar radiation (downwards) | [J/m^2] |\n", "| blh | Boundary layer height | [m] |\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "03e5af8d", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "id": "ed2a2262", "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(\"data/era5_paris_sf_2000_2009.csv\", index_col='time', parse_dates=True)" ] }, { "cell_type": "code", "execution_count": 3, "id": "00a691cd", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sktu10v10t2md2mtccsptpssrdblh
count87672.00000087672.00000087672.00000087672.00000087672.00000087672.00000087672.00000087672.0000008.767200e+0487672.000000
mean283.8114691.0631730.542577284.102522280.6031310.674363100306.2153020.0000814.692709e+05592.633676
std7.2823202.9205563.0554886.6157805.5368240.356627946.7109360.0002857.169155e+05436.894806
min258.046500-8.554123-8.692932260.682980258.5807000.00000095585.5600000.000000-1.429965e-0310.763875
25%278.694565-1.201481-1.761349279.443792276.8077300.38313399755.7950000.0000000.000000e+00215.325750
50%283.5661001.1555630.384865284.094120281.0823000.835373100369.5150000.0000002.118400e+04505.917295
75%288.5813303.0458722.631916288.708078284.8390270.996002100914.7012500.0000007.436800e+05898.669800
max313.90180014.18585214.439499309.334100296.1045501.000000102814.0600000.0056383.233472e+062987.135000
\n", "
" ], "text/plain": [ " skt u10 v10 t2m d2m \\\n", "count 87672.000000 87672.000000 87672.000000 87672.000000 87672.000000 \n", "mean 283.811469 1.063173 0.542577 284.102522 280.603131 \n", "std 7.282320 2.920556 3.055488 6.615780 5.536824 \n", "min 258.046500 -8.554123 -8.692932 260.682980 258.580700 \n", "25% 278.694565 -1.201481 -1.761349 279.443792 276.807730 \n", "50% 283.566100 1.155563 0.384865 284.094120 281.082300 \n", "75% 288.581330 3.045872 2.631916 288.708078 284.839027 \n", "max 313.901800 14.185852 14.439499 309.334100 296.104550 \n", "\n", " tcc sp tp ssrd blh \n", "count 87672.000000 87672.000000 87672.000000 8.767200e+04 87672.000000 \n", "mean 0.674363 100306.215302 0.000081 4.692709e+05 592.633676 \n", "std 0.356627 946.710936 0.000285 7.169155e+05 436.894806 \n", "min 0.000000 95585.560000 0.000000 -1.429965e-03 10.763875 \n", "25% 0.383133 99755.795000 0.000000 0.000000e+00 215.325750 \n", "50% 0.835373 100369.515000 0.000000 2.118400e+04 505.917295 \n", "75% 0.996002 100914.701250 0.000000 7.436800e+05 898.669800 \n", "max 1.000000 102814.060000 0.005638 3.233472e+06 2987.135000 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "markdown", "id": "7ebfde13", "metadata": {}, "source": [ "Take a moment to explore this data set. In this tutorial, we will be interested in the precipitation variable `tp`, the surface pressure `sp`, and the air temperature near the surface `t2m`. You can plot time series of these variables for the entire data set or for limited periods of time.\n", "\n", "Remember that you can use the index `df.index` to select part of the dataset: e.g. `df.index.year == 2000` or `df.index >= '2000-01-01'`. \n", "\n", "For more advanced users, you can compute the seasonal cycle with `.groupby(df.index.month).mean()`" ] }, { "cell_type": "code", "execution_count": 4, "id": "bf08629f", "metadata": {}, "outputs": [], "source": [ "# Explore the data set here. \n", "# You can add more cells if needed." ] }, { "cell_type": "markdown", "id": "b49a4cfd", "metadata": {}, "source": [ "> ***Question (optional)***\n", "> - Which variables hexhibit a seasonal cycle? a daily cycle? Could you have anticipated this result?" ] }, { "cell_type": "code", "execution_count": 5, "id": "f3c142cc", "metadata": {}, "outputs": [], "source": [ "# your code here" ] }, { "cell_type": "markdown", "id": "63884ebc", "metadata": {}, "source": [ "Can you build a new normalized data set called `df_norm`. You can simply achieve this by subtracting the mean and dividing by the standard deviation. " ] }, { "cell_type": "code", "execution_count": 6, "id": "33c1a455", "metadata": {}, "outputs": [], "source": [ "# your code here\n", "#df_norm = " ] }, { "cell_type": "markdown", "id": "d9a117f1", "metadata": {}, "source": [ "In the same figure, plot a time series of the total precipitation `tp` and surface pressure `sp`. You can plot this time series for a month in winter and a month in summer.\n", "\n", "> ***Question***\n", "> - Do you observe any correlation between rain and pressure?" ] }, { "cell_type": "code", "execution_count": 7, "id": "a9de6bda", "metadata": {}, "outputs": [], "source": [ "# your code here" ] }, { "cell_type": "markdown", "id": "3b6ed6cf", "metadata": {}, "source": [ "As you can see (if you zoom enough), rain is very noisy data set. Indeed, if you observe the rain pattern, it is often very localized. This is also the reason why it is very hard to predict. In order to smooth the data, we are going to work with daily averages. Use the `.resample` method to to get daily averages." ] }, { "cell_type": "code", "execution_count": 8, "id": "fc24b682", "metadata": { "scrolled": true }, "outputs": [], "source": [ "# your code here\n", "#df_day = " ] }, { "cell_type": "markdown", "id": "47f7a1c3", "metadata": {}, "source": [ "Let's classify the days into two classes: \"Rain\", \"Dry\". Because we use normalized data, the amount of precipitation of dry days is not zero but is close to $-0.2$. We use this value which corresponds roughly to $0.5$ mm/day." ] }, { "cell_type": "code", "execution_count": 9, "id": "5ee534d9", "metadata": { "scrolled": true }, "outputs": [], "source": [ "# normalized threshold (corresponds to 0.5 mm/day)\n", "#precip_th = -0.2\n", "#df_rain['tag'] = df_rain['tp'].where(df_rain['tp']> precip_th, 0)\n", "#df_rain['tag'] = df_rain['tag'].where(df_rain['tp']<= precip_th, 1)" ] }, { "cell_type": "markdown", "id": "7d47d48e", "metadata": {}, "source": [ "Use the function `plt.scatter` to plot a scatter plot of rain classification in the (pressure, temperature) space (`sp`, `t2m`). You need to adjust the color of the points so that we can see which category they belong to. Don't forget to add labels to your plot." ] }, { "cell_type": "code", "execution_count": 10, "id": "d436dab6", "metadata": {}, "outputs": [], "source": [ "# your code here\n", "#plt.scatter()" ] }, { "cell_type": "markdown", "id": "ab76c3d4", "metadata": {}, "source": [ "Use the .boxplot to plot the percentiles of the pressure distribution for the rainy days and dry days. Keywords that could be useful here are `column`, `by` and if you feel like it `patch_artist = True`" ] }, { "cell_type": "code", "execution_count": 11, "id": "6923e58c", "metadata": {}, "outputs": [], "source": [ "# your code here" ] }, { "cell_type": "markdown", "id": "bf365f4b", "metadata": {}, "source": [ "Let's split our dataset into a training set and a testing set. For this tutorial, we will only keep surface pressure `sp` as our input feature. Uncomment the lines below to generate the training and testing data sets." ] }, { "cell_type": "code", "execution_count": 12, "id": "3a16f001", "metadata": {}, "outputs": [], "source": [ "#from sklearn.model_selection import train_test_split\n", "#X_train, X_test, y_train, y_test = train_test_split(df_day[['sp']],df_day['tag'],test_size=.3,random_state=0)" ] }, { "cell_type": "markdown", "id": "18722b7f", "metadata": {}, "source": [ "For now on, we will train our model an `X_train` and `y_train`. Later on, we will validate our results with `X_test` and `y_test`.\n", "\n", "> - What is the number of day in each class?\n", "> - What is the probability of having a rainy day in this data set?" ] }, { "cell_type": "code", "execution_count": 13, "id": "521c2984", "metadata": {}, "outputs": [], "source": [ "# your code here" ] }, { "cell_type": "markdown", "id": "3ed0c4b6", "metadata": {}, "source": [ "In order to compute the linear discriminant analysis, we need to compute the mean and covariance matrix of each class\n", "\n", "> - What is the mean of each class? (Use the method `.groupby(y_train)`)" ] }, { "cell_type": "code", "execution_count": 14, "id": "73898d86", "metadata": {}, "outputs": [], "source": [ "#your code here" ] }, { "cell_type": "markdown", "id": "4c3b7ffb", "metadata": {}, "source": [ "> - What is the covariance matrix of each class? (Same hint)" ] }, { "cell_type": "code", "execution_count": 15, "id": "f96b2958", "metadata": {}, "outputs": [], "source": [ "#your code here" ] }, { "cell_type": "markdown", "id": "152f6d75", "metadata": {}, "source": [ "> - What is the weighted sum of the two covariance matrices" ] }, { "cell_type": "code", "execution_count": 16, "id": "a7040a54", "metadata": {}, "outputs": [], "source": [ "# your code here" ] }, { "cell_type": "markdown", "id": "4401f200", "metadata": {}, "source": [ "Suppose rain is only function of pressure.\n", "\n", "> - Compute the numerical coefficients of the two discriminant functions $\\delta_k(x) = x\\frac{\\mu_k}{\\sigma^2} - \\frac{\\mu_k^2}{2\\sigma^2} + \\log P_k$" ] }, { "cell_type": "code", "execution_count": 17, "id": "fe25412e", "metadata": {}, "outputs": [], "source": [ "#your code here" ] }, { "cell_type": "markdown", "id": "bd894645", "metadata": {}, "source": [ "What is the threshold pressure to discriminate between rainy days and dry days?" ] }, { "cell_type": "code", "execution_count": 18, "id": "b83bb3fa", "metadata": {}, "outputs": [], "source": [ "# your code here" ] }, { "cell_type": "markdown", "id": "ec2df850", "metadata": {}, "source": [ "> ***(Optional)***\n", "> - Same question but in the 2d space (pressure, temperature)\n", "> - Plot this decision boundary on top of your scatter plot" ] }, { "cell_type": "markdown", "id": "d987cfa4", "metadata": {}, "source": [ "Let's analyze the Linear discriminant analysis provided by scikit-learn " ] }, { "cell_type": "code", "execution_count": 19, "id": "1a7dacb2", "metadata": {}, "outputs": [], "source": [ "#from sklearn.discriminant_analysis import LinearDiscriminantAnalysis\n", "#lda = LinearDiscriminantAnalysis()\n", "#lda.fit(X_train,y_train)" ] }, { "cell_type": "markdown", "id": "747001bf", "metadata": {}, "source": [ "What is the class prediction according to this Linear Discriminant Analysis?" ] }, { "cell_type": "code", "execution_count": 20, "id": "0717b075", "metadata": {}, "outputs": [], "source": [ "# your code here" ] }, { "cell_type": "markdown", "id": "0dad63bc", "metadata": {}, "source": [ "What is the overall accuracy of our predictor? You can use the `classification_report` function." ] }, { "cell_type": "code", "execution_count": 21, "id": "1f67eab7", "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import classification_report\n", "# your code here" ] }, { "cell_type": "markdown", "id": "e8b7b95b", "metadata": {}, "source": [ "You can look more closely at the results with the [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix)" ] }, { "cell_type": "code", "execution_count": 22, "id": "6bd6e3cb", "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import confusion_matrix\n", "# your code here" ] }, { "cell_type": "markdown", "id": "87c1f1b7", "metadata": {}, "source": [ "Do you get a completely different confusion matrix with the test data than with the train data?" ] }, { "cell_type": "markdown", "id": "a19d8492", "metadata": {}, "source": [ "> ***Questions***\n", "> - Do you feel you have built a good predictor? \n", "> - What would be the score of a predictor that would predict rain every day? dry every day? \n", "> - What about a completely random predictor?\n", "> - What do you think of the [picture of the barometer](#Prediction-of-the-rain) at the beginning of this tutorial?" ] }, { "cell_type": "markdown", "id": "a183ae88", "metadata": {}, "source": [ "#### Further reading\n", "\n", "> - Read about the [ROC curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) and try to use the scikit-learn module to plot it.\n", "> - Does the Quadratic discriminant analysis performs better on this data set?" ] }, { "cell_type": "markdown", "id": "5e186998", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "***\n", "## Credit\n", "\n", "[//]: # \"This notebook is part of [E4C Interdisciplinary Center - Education](https://gitlab.in2p3.fr/energy4climate/public/education).\"\n", "Contributors include Bruno Deremble and Alexis Tantet.\n", "\n", "
\n", "\n", "
\n", " \n", "\"Logo\n", "\n", "\"Logo\n", "\n", "\"Logo\n", "\n", "\"Logo\n", "\n", "\"Logo\n", "\n", "\"Logo\n", "\n", "\"Logo\n", " \n", "
\n", "\n", "
\n", "\n", "
\n", " \"Creative\n", "
This work is licensed under a   Creative Commons Attribution-ShareAlike 4.0 International License.\n", "
" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": true, "autocomplete": false, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 5 }