Tutorial 1: Python basics#

Learning Outcomes
  • Data manipulation in python: numpy, matplotlib, pandas

  • First steps with scikit-learn

Python is a popular programming language. We will use it along with 4 libraries: numpy for array handling, pandas for time series, scikit-learn for machine learning tools, and matplotlib for plotting. There are many resources on the web: the user manual of each library remains the primary place to look for documentation about a specific function. https://stackoverflow.com/ is also a great source of information, since many people before you have already had the same question. A good intro to numpy: https://sebastianraschka.com/blog/2020/numpy-intro.html

Even though Python is a great language, don’t forget that it is an interpreted language, which means that it can be very slow. For this reason you should

  • Rely on existing libraries when available

  • Write compact code

  • Avoid loops

In fact, most modern libraries defer the heavy lifting to compiled code, which typically runs much faster than your own implementation.

Last, don’t forget to comment your code.

Numpy for array handling#

Numpy is the core library when it comes to manipulating “dense” arrays of numbers and you should always use it instead of raw python lists. Numpy also comes with a couple of linear algebra routines to perform basic operations (matrix multiplication, linear systems, eigenvalue solvers). However, for large-scale applications (and sparse matrices) we will use different libraries (like scipy).

import numpy as np

Vectors are first declared as a list of numbers and are then converted to numpy arrays. Note that numpy does not distinguish between row vectors and column vectors: all 1d vectors are displayed as rows.

x = np.array([1, 2, 3])
[1 2 3]

Matrices are declared as an array of arrays, in a similar way to vectors. Note that Python is case sensitive.

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
[[1 2 3]
 [4 5 6]
 [7 8 9]]

In order to convert a row vector to a column vector, one must add a new dimension as shown below:

x_col = x[:,None]

You can compute the transpose of an array with the attribute .T. However, keep in mind that the transpose of a 1d vector does not really make sense in numpy: if you want to transpose a vector, it must have at least two dimensions.


  • Try to transpose x and x_col and check their dimensions.

  • Hint: You can always check the dimensions of your array with the attribute .shape.

# your code here

Matrix multiplication can be achieved with np.dot, np.matmul, @ or np.einsum.
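For reference, here is a minimal sketch (with throwaway 2x2 arrays, not the A and x defined above) showing that the four syntaxes compute the same matrix-vector product:

B = np.array([[1.0, 2.0], [3.0, 4.0]])   # throwaway example matrix
c = np.array([5.0, 6.0])                 # throwaway example vector

p1 = np.dot(B, c)
p2 = np.matmul(B, c)
p3 = B @ c
p4 = np.einsum('ij,j->i', B, c)          # explicit index notation

print(p1, p2, p3, p4)                    # all four give [17. 39.]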


  • Pick the one you prefer to compute \(\mathbf A \mathbf x\), \(\mathbf x \mathbf x^\top\), and \(\mathbf x^\top \mathbf x\).

  • Hint: Did I already warn you about transposing 1d vectors in python?

  • Hint2: Einsum is the most advanced function but also the most complicated. This tutorial will tell you more about it

  • Hint3: you can add a question mark to the name of the function to access the help: np.einsum?

With the vectors and matrices defined above, you should get: \(\mathbf A \mathbf x = [14, 32, 50]^\top\)

(16)#\[\begin{align} \mathbf x \mathbf x^\top &= \begin{bmatrix} 1 & 2 & 3\\ 2 & 4 & 6\\ 3 & 6 & 9 \\ \end{bmatrix} \end{align}\]

and \(\mathbf x^\top \mathbf x = 14\)

# your code here

You can access the first element of an array with x[0] and the last element with x[-1]. You can also select a subset using slices like x[0:2] (or x[:2]). Be careful: the last index of the slice is not included in the subset.
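A quick illustration of these indexing rules on a throwaway array (the values are arbitrary):

y = np.array([10, 20, 30, 40])
print(y[0])     # first element: 10
print(y[-1])    # last element: 40
print(y[0:2])   # slice: [10 20] -- index 2 is NOT included
print(y[:2])    # same as above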


  • Can you select the last two elements of the vector \(\mathbf x\) with negative indices?

# your code here

Numpy comes with the standard math functions np.exp, np.sqrt, np.sin, etc. Each of these functions is applied element-wise to an array. For instance A**2 raises each element of A to the power of 2.
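A small sketch with a throwaway array to illustrate the element-wise behavior:

B = np.array([[1.0, 4.0], [9.0, 16.0]])
print(np.sqrt(B))   # element-wise square root: [[1. 2.] [3. 4.]]
print(B**2)         # each element squared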


  • Is A*A the same as A@A?

# your code here


In numpy, we can add, subtract, multiply (etc.) arrays element-wise. In principle the arrays have to be of the same size, but when we write A+1, the scalar 1 is added to each element of the array A. This type of operation is called broadcasting.
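A minimal illustration of broadcasting with a throwaway array (this is not the answer to the exercise below, which asks for a column-wise operation):

B = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
print(B + 1)                           # the scalar 1 is broadcast to every element
print(B + np.array([10, 20, 30]))      # a shape (3,) array is broadcast to each row of B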


  • Check the broadcasting rules in the numpy documentation and add x to each column of A

You should get

(17)#\[\begin{bmatrix} 2 & 3 & 4\\ 6 & 7 & 8\\ 10 & 11 & 12 \\ \end{bmatrix}\]
# your code here

Generating arrays#

Arrays filled with zeros can be generated with np.zeros. We can also generate arrays of random numbers with the functions in np.random (for instance np.random.rand for uniform samples and np.random.normal for Gaussian samples); a short syntax sketch is given below.
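As a syntax reference (the sizes and parameters below are arbitrary):

z = np.zeros(5)                                    # array of 5 zeros
u = np.random.rand(5)                              # 5 samples, uniform in [0, 1)
g = np.random.normal(loc=0.0, scale=1.0, size=5)   # 5 samples from N(0, 1)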


  • Generate 2 arrays \(v_1\) and \(v_2\) with \(N=500\) observations following the normal distribution \(\mathcal N (\mu_1,\sigma_1^2)\) for \(v_1\) and \(\mathcal N (\mu_2,\sigma_2^2)\) for \(v_2\), with \(\mu_1 = 2\), \(\mu_2 = 1\), \(\sigma_1 = 5\), \(\sigma_2 = 1\)

mu1 = 2
mu2 = 1

sigma1 = 5
sigma2 = 1

N = 500

# your code here


  • Compute the mean of each array.

  • How does it compare to \(\mu_1\) and \(\mu_2\)? Can you explain?

# your code here


  • Do you think \(v_1\) and \(v_2\) are correlated?

  • Can you compute the actual covariance between \(v_1\) and \(v_2\) with np.cov?

  • Recall the definition of the covariance between two variables. Hint

  • What is the expected covariance matrix in terms of \(\sigma_1\) and \(\sigma_2\)

  • Do you recognize the array Cv defined below?

# your code here
Cv = np.array([[sigma1**2, 0], [0,sigma2**2]])

As you just saw, \(v_1\) and \(v_2\) are not correlated. Let’s apply the linear transformation

(18)#\[\begin{align} w_1 &= \cos\theta \, v_1 - \sin\theta \, v_2\\ w_2 &= \sin\theta \, v_1 + \cos\theta \, v_2 \end{align}\]

You can also write this, for each realization, as

(19)#\[\begin{equation} \mathbf{w} = \mathbf{R}\mathbf{v} \end{equation}\]

with \(\mathbf R\) the rotation matrix

(20)#\[\begin{align} \mathbf{R} &= \begin{bmatrix} \cos \theta & -\sin \theta \\ \sin \theta & \cos \theta \end{bmatrix} \end{align}\]


(21)#\[\begin{align} \mathbf{v} &= \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} & \quad & \mathbf{w} &= \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} \end{align}\]
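As a syntax reminder, a 2x2 rotation matrix for a generic angle can be assembled like this (a sketch with a placeholder angle; the exercise below uses \(\theta = \pi/3\)):

theta_demo = np.pi/4   # placeholder angle, NOT the value requested below
R_demo = np.array([[np.cos(theta_demo), -np.sin(theta_demo)],
                   [np.sin(theta_demo),  np.cos(theta_demo)]])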


  • Create the rotation matrix R with \(\theta=\pi/3\)

  • In order to compute \(\mathbf w\) it is convenient to store \(\mathbf v = [v_1, v_2]\) in a matrix V = np.array([v1,v2]). Compute V.

  • Compute W with one matrix multiplication.


  • The equations above are valid for one single observation.

  • The layout for V and W is the opposite of the one we adopted in class. We will adopt this new layout just for this tutorial. Take a moment to check whether each observation is stored in a row or in a column.

# your code here


  • Compute the mean of \(\mathbf w = (w_1,w_2)\) with two different techniques:

    • direct computation with np.mean (be careful, you need to compute the mean along the second dimension only! Look at the documentation.)

    • with \(R\) and the mean of \(\mathbf v\).

For the second method, you will have to derive the formula from the definition of \(\mathbf w\)

# your answer here


  • Can you guess what the covariance between \(w_1\) and \(w_2\) will be? Hint

  • Compute the theoretical covariance matrix \(C_w\) as a function of \(R\) and \(C_v\) (the covariance matrix of \(\mathbf v\)) Hint

  • Compute the covariance matrix of \(\mathbf w\) with 2 methods:

    • with np.cov,

    • with R and Cv

  • Comment your results

You should get

(22)#\[\begin{align} C_w &=\begin{bmatrix} 7 & 10.4\\ 10.4 & 19 \end{bmatrix} \end{align}\]
# your answer here

Matplotlib for plotting#

To visualize data, matplotlib is the reference tool. Matplotlib is also what runs behind the scenes of higher-level python libraries (pandas, xarray), so knowing the basics will also be useful there.

import matplotlib.pyplot as plt


  • Can you visualize the data set V that we generated earlier? You can either pass the two variables to the function plt.plot in order to plot \(v_2\) as a function of \(v_1\), or use the function plt.scatter (a sketch of the relevant matplotlib calls is given after this list). Don’t forget to

    • adjust the type and size of the markers Hint

    • adjust the axes so that both are equally stretched Hint

    • add labels Hint
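A minimal sketch of the matplotlib calls involved, using dummy data (the marker style, sizes and labels are placeholders to adapt):

a = np.random.normal(size=100)      # dummy data standing in for v1
b = np.random.normal(size=100)      # dummy data standing in for v2

plt.plot(a, b, '.', markersize=3)   # or: plt.scatter(a, b, s=3)
plt.axis('equal')                   # same stretching on both axes
plt.xlabel('a')
plt.ylabel('b')
plt.show()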

#your answer here


  • Do you see the lack of correlation between \(v_1\) and \(v_2\)?

Let’s analyze the probability density function (pdf) of this data set. The function plt.hist provides a discrete estimate of the pdf along a given dimension.
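The basic call looks like this (a sketch with dummy data; bins and density are the parameters to play with):

data = np.random.normal(size=1000)      # dummy data
plt.hist(data, bins=30, density=True)   # density=True normalizes the histogram to a pdf estimate
plt.show()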


  • Use plt.hist to compute and plot an estimation of the pdf of \(w_1\) and \(w_2\). Don’t forget to adjust the number of bins.

  • (Optional) Compare it with the analytical probability density function of \(w_1\) and \(w_2\)? (Hint)

# your code here

We can also get a quick overview of the statistical distribution of the data with a box plot.
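For reference, the matplotlib call is simply (a sketch with dummy data):

data = np.random.normal(size=1000)   # dummy data
plt.boxplot(data)
plt.show()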


  • What is the meaning of a box plot?

  • Try it on \(w_1\)

# your code here

Question (optional)

  • Use np.histogram2d to compute the 2-dimensional pdf

  • (***) On a single plot, superimpose the cloud of points and the contour lines of the pdf.

# your answer here
# Solution sketch (commented out)
# The (x,y) indexing is tricky: using 2 different numbers of bins in each direction helps catch mistakes
#psi, xedge, yedge = np.histogram2d(W[0,:], W[1,:], bins=[5,10], density=True)  # density=True normalizes to a pdf

# bin centers from the bin edges
#x1 = 0.5*(xedge[1:] + xedge[:-1])
#y1 = 0.5*(yedge[1:] + yedge[:-1])

#xg, yg = np.meshgrid(x1, y1, indexing='ij')

# cloud of points with contour lines of the estimated pdf on top
#plt.plot(W[0,:], W[1,:], '.', alpha=0.1)
#plt.contour(xg, yg, psi)

Question (Optional):

  • The Central Limit Theorem states that if \(X_1, X_2, \dots, X_N\) are \(N\) random samples drawn from a population with overall mean \(\mu\) and finite variance \(\sigma^2\), and if \(\bar{X}_N\) is the sample mean, then the limiting form of the distribution \(Z = \lim_{N \to \infty} \sqrt{N}\left(\frac{\bar{X}_N - \mu}{\sigma}\right)\) is a standard normal distribution. Can you propose a visual illustration of this theorem?

# your answer here

Pandas for advanced array manipulation#

Compared to numpy, which handles plain arrays of numbers, pandas also handles metadata (labels attached to the rows and columns of an array). This will allow you to use predefined functions in pandas, especially when it comes to time series manipulation.

Pandas has a lot of other nice features like

  • Intuitive handling of missing data (NaN)

  • Input/output in many formats

  • Simplified plotting procedures (labels, axes, etc. are added automatically)

  • Better memory management

Data download#

If you are running a standalone version of the notebook, you will have to download the data first by running:

!mkdir data

!wget -P data/ https://scrippsco2.ucsd.edu/assets/data/atmospheric/stations/in_situ_co2/daily/daily_in_situ_co2_mlo.csv

and in case wget is not installed, you can install it with

!pip install wget

!python -m wget -o 'data/daily_in_situ_co2_mlo.csv' https://scrippsco2.ucsd.edu/assets/data/atmospheric/stations/in_situ_co2/daily/daily_in_situ_co2_mlo.csv

We are going to explore the basic functionalities of pandas with the historical CO2 measurements taken at Mauna Loa, also known as the Keeling curve. The data can be downloaded at scrippsco2.ucsd.edu and we are going to analyze the daily output.

Take a moment to inspect the file located at data/daily_in_situ_co2_mlo.csv (either use a text editor or navigate within jupyter to open the file). Read the header to understand what the data set is about (you may have to investigate further to find out about the units).


  • Why is there NaN in the data set?

import pandas as pd

We are going to use the function pd.read_csv to import the data in python. It is very important to do a clean import of the data so that we can work efficiently afterwards. Pay extra attention to

  • the commented part in the header (hint: comment)

  • add a name to each column. Use the names Year, Month, Day for the corresponding columns. This way, pandas will directly recognize what they are (hint: names)

  • pandas handles spaces in csv files in a very specific way: you may need to adjust its behavior with skipinitialspace.

Use pd.read_csv to load the content of data/daily_in_situ_co2_mlo.csv into a DataFrame called df.

df = pd.read_csv('data/daily_in_situ_co2_mlo.csv', comment='%',names=["Year", "Month", "Day", "CO2", "NB", "scale"], skipinitialspace=True)

To access your data, you will need to know the metadata, i.e. the labels of the columns in the csv file (in this case provided by yourself). You can get a quick overview of your data by typing df.head() or df.describe(). You will see that the DataFrame df can be treated like a spreadsheet with labeled columns and entries for each column. You can view the labels with df.columns and the row indices with df.index. Make sure you understand the type of each column of the dataset with df.dtypes.

#look at the data set here

You can select a column of the DataFrame by passing one name to df, like df["Year"]. You can also select multiple columns by passing a list of names: df[["Year", "Month"]]. To select specific rows, you need to use the method df.loc[i0:i1] with i0 and i1 the lower and upper bounds of the index (note that, unlike numpy slices, both bounds are included). In our case we did not provide labels for the index, so pandas automatically numbered the rows with a simple counter 0,1,.... In that case, i0 and i1 are integers, but we will see later that i0 and i1 can also be two dates (if the index is a date).
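For instance, the following sketch selects a handful of rows and two columns (the row bounds are arbitrary):

sub = df.loc[0:4, ["Year", "CO2"]]   # rows 0 to 4 (both included) and two columns
print(sub)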


  • Create a new DataFrame that contains only the first 100 entries and with the columns Year and CO2 only.

# your answer here

One killer feature of pandas is the way it handles time series. In order to use the full potential of this tool, one needs to change the index to a date that is recognized by pandas. We use pd.to_datetime to combine the information in the year, month and day columns of the DataFrame. pd.to_datetime handles several input formats and you should be careful to use the correct one. However, because we carefully chose the names of the columns when we imported the file, we can directly assemble the columns Year, Month, Day to create a new datetime object. You can follow the method in the documentation to re-allocate df.index to a datetime type variable; a possible pattern is sketched below. Henceforth, we will use df.index to access the time dimension.
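One possible pattern, assuming the column names Year, Month, Day chosen above (a sketch; check the documentation link for the recommended usage):

df.index = pd.to_datetime(df[["Year", "Month", "Day"]])   # assemble a datetime index from the three columns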

#your answer here

Question (Optional)

  • Note that one could have done this operation directly while reading the data. Use parse_dates in pd.read_csv to create a datetime object in df. Try it.

# your code here

Now df.index contains Timestamps. You can quickly access the day, month, year of the records by using df.index.day, df.index.month, df.index.year (etc., see documentation). You can also construct time intervals (or timedeltas) by subtracting two timestamps.


  • Subtract the first index from the last index to find out how many days this data set spans

#your answer here

You can now select specific rows of your dataset by passing dates to the .loc[] method, as in .loc[date0:date1]. The two dates date0 and date1 can be written as strings in the format 'YYYY-MM-DD' (e.g. '2020-09-20') or simply 'YYYY-MM'.

If date0 is only '2020-09' (you are missing the days), Pandas will interpret it as the first day of September 2020. Note that Pandas can also interpret '2020-09' as the whole month of September 2020 if this is the only argument as in .loc[date0].
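For instance, with a datetime index the following two calls select the same rows (shown here for an arbitrary month, not the one asked for below):

df.loc['2010-03']                    # partial-string indexing: the whole month
df.loc['2010-03-01':'2010-03-31']    # explicit date range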


  • Select the whole month of September 2020 with two different methods

#your code here

Pandas has built-in plotting features that are based on matplotlib. Since you now master matplotlib, you will find it very easy to plot data within pandas.


  • Plot the CO2 time series as a function of time. (Hint: select the column of interest in df and simply apply the method .plot())

# your answer here

The CO2 curve is increasing. This is not really a surprise, is it? However, we also notice wiggly patterns superimposed on this trend. Are we looking at noisy data or does it have a physical meaning?


  • In order to get a closer look, can you plot only 2 years of the CO2 time series between Jan 1st 2000 and Jan 1st 2002?

#your answer here

So there seems to be a lot of variability and many missing data points in this data set. We can try to average all available data for each year; that way we will probably see a smoother curve. The resample function creates a new time series with the desired sampling frequency. The acronyms for standard time intervals are listed in the Time series documentation. You need to append either the method .mean() or .sum() to the resampled data to indicate how to aggregate the data within each resampling interval.
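As a syntax sketch (shown here with a monthly frequency, not the one asked for below; the frequency string and the aggregation are the two things to choose):

monthly = df["CO2"].resample("MS").mean()   # "MS" = month start; .mean() averages within each interval
monthly.plot()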


  • Plot the yearly average evolution of the CO2 concentration.

#your answer here

These wiggles are not random but are the actual annual cycle of the data.

Question (Optional)

  • Use the method .groupby to plot the annual cycle

  • Discuss the physical origin of this annual cycle

#your answer here

Scikit-learn for data analysis#

Scikit-learn will be our toolbox for all statistical analysis. It provides the most popular machine learning algorithms, is well documented and has an active user community. Scikit-learn does not handle missing data well, so in order to have a smooth transition between pandas and scikit-learn, it is convenient to create a copy of your DataFrame without any NaN. Make sure you use (and understand) the .copy() method to create the new array, because we are going to make modifications to this new DataFrame. You can call this new DataFrame dfv.

dfv = df.dropna().copy()

Because we cannot do a linear regression with dates as indices, we need to create an additional column that contains the day number since the beginning of the dataset. Remember how we worked with timedeltas (and broadcasting) above?


  • Can you create this new column and name it counter?

Be careful that this new variable must have dtype int (or float) and not a timedelta. (Hint: look at the attributes of timedelta)
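As a hint on the syntax, a single timedelta exposes its length in days as a plain integer attribute (a sketch with arbitrary dates):

delta = pd.Timestamp("2020-01-31") - pd.Timestamp("2020-01-01")
print(delta.days)   # 30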

#dfv["counter"] = 

We are now ready to use the linear regression module.

from sklearn.linear_model import LinearRegression

We are first going to fit the data. As explained in the documentation, the fit method takes at least two arguments: the predictors and the observations. Predictors must be a 2d array of shape (n_samples, n_features), even if there is only ONE feature. One simple way to achieve this is to select a list of only one variable in the DataFrame (i.e. dfv[["counter"]], note the usage of double brackets here).
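Here is a minimal sketch of the fit call on synthetic data, to illustrate the required (n_samples, n_features) shape of the predictors (the data below are made up):

X_demo = np.arange(50)[:, None]        # 50 samples, ONE feature -> X must still be 2d, shape (50, 1)
y_demo = 2.0 * X_demo[:, 0] + 1.0      # a perfect line with slope 2 and intercept 1
reg_demo = LinearRegression().fit(X_demo, y_demo)
print(reg_demo.coef_, reg_demo.intercept_)   # ~[2.] and ~1.0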


  • Create the input variable X (day since first record) and the output variable y (CO2 concentration)

  • Perform the linear regression and put the output in the variable reg.

#X = 
#reg = 


  • What is the value of the regression coefficient?

  • What is its physical meaning?

#your code here


  • Use the method .predict to construct the predicted value at all observation times and add it to the dfv DataFrame.

#dfv["CO2_fit"] =


  • Plot the observed value along with your prediction on a single plot.

#your code here

It is now super easy to compute the CO2 growth rate in ppm/year for several time periods.


  • Perform 2 linear regressions: one between 1960 and 1980 and one between 2000 and 2020 and compare how the growth rate evolved over time.

  • Can you give an estimate of the CO2 concentration in 2050?

#reg_2000 = LinearRegression().fit(dfv[["counter"]][dfv.index.year>2000], dfv["CO2"][dfv.index.year>2000])
#reg_1960 = LinearRegression().fit(dfv[["counter"]][dfv.index.year<1980], dfv["CO2"][dfv.index.year<1980])

#print("The growth rate changed from {}ppm/day to {}ppm/day from 1970 to 2010".format(reg_1960.coef_,reg_2000.coef_))

## rough estimate of the number of days between 1960 and 2050 (90 years x 365 days)
#nday_2050 = 90*365
#predict_co2_2050 = reg_2000.intercept_ + nday_2050*reg_2000.coef_
#print(f"With the most recent growth rate, there will be {predict_co2_2050} ppm of CO2 in 2050"


Contributors include Bruno Deremble and Alexis Tantet.
