# Tutorial 1 : Python basics#

**Learning Outcomes**

Data manipulation in python: numpy, matplotlib, pandas

First steps with scikit-learn

Python is a popular computer language. We will use it along with 4 libraries: numpy for array handling, pandas for time series, scikit-learn for machine learning tools and matplotlib for plotting. There are *a lot* of resources on the web: the user manual remains the primary location where you should be looking for documentation about a specific function. https://stackoverflow.com/ is also a great source of information since many people before you have already had the same question as you. A good intro to numpy: https://sebastianraschka.com/blog/2020/numpy-intro.html

Even though python is a great language, don’t forget that it is an Interpreted language. That means that python can be very slow. For this reason you should

Rely on existing libraries when available

Write compact code

Avoid loops

In fact, most modern libraries deffer the heavy lifting part to compiled bits of codes which typically run much faster than your own implementation.

Last, don’t forget to comment your codes.

## Numpy for array handling#

Numpy is the core library when it comes to manipulating “dense” arrays of numbers and you should always use it instead of raw python lists. Numpy also comes with a couple of linear algebra routines to perform basic operations (matrix multiplication, linear systems, eigenvalue solver). However for large-scale application (and sparse matrices) we will use different libraries (like scipy).

```
import numpy as np
```

Vectors are first declared as a list of number and are then converted to numpy arrays. Note that python does not make the difference between row vectors and column vectors: all vectors are treated as rows.

```
x = np.array([1, 2, 3])
print(x)
```

```
[1 2 3]
```

Matrices are declared as an array of arrays in a similar way as vectors. Note that python is case sensitive

```
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(A)
```

```
[[1 2 3]
[4 5 6]
[7 8 9]]
```

In order to convert a row vector to a column vector, one must add a new dimension as shown below:

```
x_col = x[:,None]
print(x_col)
```

```
[[1]
[2]
[3]]
```

You can compute the transpose of a vector with the method `.T`

. However keep in mind that the transpose of a 1d vector does not really make sense *in python*. If you want to transpose a vector in python, it must have at least two dimensions.

Question

Try to transpose

`x`

and`x_col`

and check their dimensions.

Hint: You can always check the dimensions of your array with the method

`.shape`

.

```
# your code here
```

Matrix multiplication can be achieved with `np.dot`

, `np.matmul`

, `@`

or `np.einsum`

.

Question

Pick the one you prefer to compute \(\mathbf A \mathbf x\), \(\mathbf x \mathbf x^\top\), and \(\mathbf x^\top \mathbf x\).

Hint: Did I already warn you about transposing 1d vectors in python?

Hint2: Einsum is the most advanced function but also the most complicated. This tutorial will tell you more about it

Hint3: you can add a question mark to the name of the function to access the help:

`np.einsum?`

With the vectors and matrices defined above, you should get: \(\mathbf A \mathbf x = [14, 32, 50]^\top\)

and \(\mathbf x^\top \mathbf x = 14\)

```
# your code here
```

You can access the first element of an array with `x[0]`

and the last element with `x[-1]`

. You can also select a subset using slices like `x[0:2]`

(or `x[:2]`

). Be careful that the last index of the slice is not part of the subset.

Question

Can you select the last two element of the vector \(\mathbf x\) with negative indices?

```
# your code here
```

Numpy comes with the standard math function `np.exp`

, `np.sqrt`

, `np.sin`

. Each of these functions applied to an array is applied element-wise. For instance `A**2`

raises each element of `A`

to the power of 2.

Question

Is

`A*A`

the same as`A@A`

?

```
# your code here
```

### Broadcasting#

In python, we can add, subtract, multiply, (etc.) arrays elementwise. In this context arrays have to be of the same size. But when we do `A+1`

, we add the scalar `1`

to each element of the array `A`

. This type of operation is called broadcasting.

Question

Check the broadcasting rules (link above) and add

`x`

to each column of`A`

You should get

```
# your code here
```

### Generating arrays#

Empty arrays can be generated with `np.zeros`

. We can also generate arrays with random numbers. The most useful functions are

`np.random.rand`

(uniform distribution)`np.random.randn`

(standard normal distribution)`np.random.normal`

(normal distribution)

Question

Generate 2 arrays \(v_1\) and \(v_2\) with \(N=500\) observations following the normal distribution \(\mathcal N (\mu_1,\sigma_1^2)\) for \(v_1\) and \(\mathcal N (\mu_2,\sigma_2^2)\) for \(v_2\), with \(\mu_1 = 2\), \(\mu_2 = 1\), \(\sigma_1 = 5\), \(\sigma_2 = 1\)

```
mu1 = 2
mu2 = 1
sigma1 = 5
sigma2 = 1
N = 500
# your code here
```

Question

Compute the mean of each array.

How does it compare to \(\mu_1\) and \(\mu_2\)? Can you explain?

```
# your code here
```

Question

Do you think \(v_1\) and \(v_2\) are correlated?

Can you compute the actual covariance between \(v_1\) and \(v_2\) with

`np.cov`

?Recall the definition of the covariance between two variables. Hint

What is the expected covariance matrix in terms of \(\sigma_1\) and \(\sigma_2\)

Do you recognize the array

`Cv`

defined below?

```
# your code here
Cv = np.array([[sigma1**2, 0], [0,sigma2**2]])
```

As you just saw, \(v_1\) and \(v_2\) are not correlated. Let’s apply the linear transformation

That you can also write for each realization

with \(\mathbf R\) the rotation matrix

and

Question

Create the rotation matrix

`R`

with \(\theta=\pi/3\)In order to compute \(\mathbf w\) it is convenient to store \(\mathbf v = [v_1, v_2]\) in a matrix

`V = np.array([v1,v2])`

. Compute`V`

.Compute

`W`

with one matrix multiplication.

Remarks:

The equations above are valid for one single observation.

layout for

`V`

and`W`

is the opposite as the one we adopted in class. We will adopt this new layout just for this tutorial. Take a moment to check if each observation is stored in a row or in a column.

```
# your code here
```

Question

Compute the mean of \(\mathbf w = (w_1,w_2)\) with two different techniques:

direct computation with

`np.mean`

(be careful, you need to compute the mean along the second dimension only! Look at the documentation.)with \(R\) and the mean of \(\mathbf v\).

For the second method, you will have to derive the formula from the definition of \(\mathbf w\)

```
# your answer here
```

Question

Can you guess what the covariance between \(w_1\) and \(w_2\) will be? Hint

Compute the theoretical covariance matrix \(C_w\) as a function of \(R\) and \(C_v\) (the covariance matrix of \(\mathbf v\)) Hint

Compute the covariance matrix of \(\mathbf w\) with 2 methods:

with

`np.cov`

,with

`R`

and`Cv`

Comment your results

You should get

```
# your answer here
```

## Matplotlib for plotting#

To visualize data, matplotlib is the reference tool. Matplotlib is also what runs behind the scene of high level python libraries (pandas, xarray) and so knowing the basics will also be useful there.

```
import matplotlib.pyplot as plt
```

Question

Can you visualize the data set

`V`

that we generated earlier. you can either pass the two variables to the function`plt.plot`

in order to plot \(v_2\) as a function of \(v_1\) or we can use the function`plt.scatter`

. Don’t forget to

```
#your answer here
```

Question

Do you

seethe lack of correlation between \(v_1\) and \(v_2\)?

Let’s analyze the probability density function (pdf) of this data set. The function `plt.hist`

provides a discrete estimate of the pdf along a given dimension.

Question

Use

`plt.hist`

to compute and plot an estimation of the pdf of \(w_1\) and \(w_2\). Don’t forget to adjust the number of bins.

Compare it with the analytical probability density function of \(w_1\) and \(w_2\)? (Hint)(Optional)

```
# your code here
```

We can also get a quick overview of the statistical distribution of the data with box plot.

Question

What is the meaning of a box plot?

Try it on \(w_1\)

```
# your code here
```

Question (optional)

Use

`np.histogram2d`

to compute the 2-dimensional pdf(***) On a single plot, superimpose the cloud of points and the contour lines of the pdf.

```
# your answer here
```

```
# Solution
# The (x,y) indexing is tricky: use 2 different number of bins in each direction
#psi, xedge, yedge = np.histogram2d(W[0,:],W[1,:],bins=[5,10])
#x1 = 0.5*(xedge[1:] + xedge[:-1])
#y1 = 0.5*(yedge[1:] + yedge[:-1])
#xg,yg = np.meshgrid(x1,y1,indexing='ij')
#plt.plot(W[0,:],W[1,:],'.', alpha=0.1)
#plt.contour(xg,yg,psi)
#plt.axis('equal')
```

:Question (Optional)

The Central limit theorem states that If \(X_1, X_2,\dots ,X_{N}\) are \(N\) random samples drawn from a population with overall mean \(\mu\) and finite variance \(\sigma ^{2}\) \(\bar {X}_{N}\) is the sample mean, then the limiting form of the distribution, \(Z=\lim _{N\to \infty }{\sqrt {N}}{\left({\frac {{\bar {X}}_{N}-\mu }{\sigma }}\right)}\) is a standard normal distribution. Can you propose an visual illustration of this theorem?

```
# your answer here
```

## Pandas for advanced array manipulation#

Compared to numpy which already handles arrays of numbers, Pandas essentially handles **metadata**. This will allow you to use predefined functions in pandas especially when it comes to time series manipulations.

Pandas has a lot of other nice features like

Intuitive handling of missing data (NaN)

Input/output in many formats

Simplified plotting procedures (labels, axes, etc. are added automatically)

Better management of the memory

### Data download#

If you are running a standalone version of the notebook, you will have to download the data first by running:

`!mkdir data`

`!wget -P data/ https://scrippsco2.ucsd.edu/assets/data/atmospheric/stations/in_situ_co2/daily/daily_in_situ_co2_mlo.csv`

and in case wget is not installed, you can install it with

!pip install wget

!python -m wget -o ‘data/daily_in_situ_co2_mlo.csv’ https://scrippsco2.ucsd.edu/assets/data/atmospheric/stations/in_situ_co2/daily/daily_in_situ_co2_mlo.csv

We are going to explore the basic functionalities of pandas with the historical CO2 measurements taken at Mona Loa also known as the Keeling curve. The data can be downloaded at scrippsco2.ucsd.edu and we are going to analyze the daily output.

Take a moment to inspect the file that is located here `data/daily_in_situ_co2_mlo.csv`

(either use a text editor or navigate within jupyter to open the file). Read the header to understand what is the data set about (you may have to investigate more to know about the units).

Question

Why is there NaN in the data set?

```
import pandas as pd
```

We are going to use the function `pd.read_csv`

to import the data in python. It is very important to do a clean import of the data so that we can work efficiently afterwards. Pay extra attention to

the commented part in the header (hint:

`comment`

)add a name to each column. Use the names

**year**,**month**,**day**for the corresponding columns. This way, pandas will directly recognize what they are (hint:`names`

)pandas handles spaces in cvs files in a very specific way: you may need to adjust its behavior with

`skipinitialspace`

.

Use `pd.read_csv`

to load the content of `data/daily_in_situ_co2_mlo.csv`

into a DataFrame called `df`

.

```
df = pd.read_csv('data/daily_in_situ_co2_mlo.csv', comment='%',names=["Year", "Month", "Day", "CO2", "NB", "scale"], skipinitialspace=True)
```

To access your data, you will need to know the metadata, i.e. the label of each column in the csv file (in this case provided by yourself). You can get a quick overview of your data by typing `df.head()`

or `df.describe()`

. You will see that the Dataframe `df`

can be treated like a spreadsheet with labeled columns and entries for each column. You can view the labels with `df.columns`

and the rows indices with `df.index`

. Make sure you understand the type of each column of the dataset with `df.dtypes`

.

```
#look at the data set here
```

You can select a column of the DataFrame by passing one name to `df`

like `df["Year"]`

. You can also select multiple columns by passing a list of names `df[["Year", "Month"]]`

. To select specific rows, you need to use the metho `df.loc[i0:i1]`

with `i0`

and `i1`

, the lower and upper bounds of the index. In our case we did not provide labels for the index so pandas automatically numbered the rows with a simple counter `0,1,...`

. In that case, `i0`

and `i1`

are integers but we will see later that `i0`

and `i1`

can also be two dates (if the index is a date).

Question

Create a new DataFrame that contains only the first 100 entries and with the columns

`Year`

and`CO2`

only.

```
# your answer here
```

One killer feature of pandas is the way it handles time series. In order to use the full potential of such tool, one need to change the index to a date that is recognized by pandas. We use `pd.to_datetime`

to combine the information in the year, month and day columns of the dataframe. `pd.to_datetime`

handles several formats as input and you should be careful to use the correct input format. However, because we carefully chose the name of the columns when we imported the file, we can directly assemble the columns `Year`

, `Month`

, `Day`

to create a new datetime object. You can follow the method in the documentation to re-allocate `df.index`

to a `datetime`

type variable. Henceforth, we will use `df.index`

to access the time dimension.

```
#your answer here
```

Question (Optional)

Note that one could have done this operation directly while reading the data. Use

`parse_dates`

in`pd.read_csv`

to create a datetime object in`df`

. Try it.

```
# your code here
```

Now `df.index`

contain Timestamps. You can quickly access the day, month, year of the records by using `df.index.day`

, `df.index.month`

`df.index.year`

(etc. see documentation). You can also construct time intervals (or timedetlas) by subtracting two timestamps

Question

Subtract the first index to the last index to know how many days are in this data set

```
#your answer here
```

You can now select specific rows of your dataset by passing dates to the `.loc[]`

method, as in `.loc[date0:date1]`

. The two dates `date0`

and `date1`

can be written as strings in the format `'YYYY-MM-DD'`

(e.g. `'2020-09-20'`

) or simply `'YYYY-MM'`

.

If `date0`

is only `'2020-09'`

(you are missing the days), Pandas will interpret it as the first day of September 2020. Note that Pandas can also interpret `'2020-09'`

as the whole month of September 2020 if this is the only argument as in `.loc[date0]`

.

Question

Select the whole month of September 2020 with two different methods

```
#your code here
```

Pandas has built-in plotting features that are based on matplotlib. Since you now master matplotlib, you will find it very easy to plot data within pandas.

Question

Plot the CO2 time series as a function of time. (Hint: select the column of interest in df and simply apply the method

`.plot()`

)

```
# your answer here
```

The CO2 curve is increasing. This is not really a surprised isn’t it?. However, we also notice wiglly patterns superimposed to this trend. Are we looking at noisy data are does it have a physical meaning?

Question

In order to get a closer look, can you plot only 2 years of the CO2 time series between Jan 1st 2000 and Jan 1st 2002?

```
#your answer here
```

So there seem to be a lot of variability and many missing data in this data set. We can try to average all available data for each single year that we have. That way we will probably see a smoother curve. The `resample`

function is here to create a new time series with the desired sampling frequency. The acronym for standard time intervals are listed in the Time series documentation. You need to either add the method `.mean()`

or `.sum()`

to the resampled data to indicate how to treat the data in the resampling interval.

Question

Plot the yearly average evolution of the CO2 concentration.

```
#your answer here
```

These wiggles are not random but are the actual annual cycle of the data.

Question (Optional)

Use the method

`.groupby`

to plot the annual cycleDiscuss the physical origin of this anual cycle

```
#your answer here
```

## Scikit-learn for data analysis#

Scikit-learn will be our toolbox for all statistical analysis. It provides the most popular machine learning algorithm, is well documented and has an active user community. Scikit learn does not work well with missing data. So in order to have a smooth transition between pandas and scikit-learn, it is convenient to create a copy of your DataFrame without any `NaN`

. Make sure you use (and understand) the `.copy()`

method to create the new array because we are going to make modification to this new DataFrame. You can call this new DataFrame `dfv`

.

```
dfv = df.dropna().copy()
```

Because we cannot do a linear regression with dates as indices, we need to create an additional column that contains the day number since the begining of the dataset. Remember how we worked with timedelta (and broadcasting) above?

Question

Can you create this new column and name it

`counter`

?

*Be careful that this new variable must have dtype int (or float) and not a timedelta. (Hint: look at the attributes of timedelta)*

```
#dfv["counter"] =
```

We are now ready to use the linear regression module.

```
from sklearn.linear_model import LinearRegression
```

We are first going to fit the data. As explained in the documentation, this function takes at least two arguments: the predictors and the observations. Predictors must be a 2d array of shape (n_samples, n_features), **even if there is only ONE feature**. One simple way to achieve this is to select a list of only one variable in the DataFrame (i.e. `dfv[["counter"]]`

, note the usage of double brackets here).

Question

Create the input variable

`X`

(day since first record) and the output variable`y`

(CO2 concentration)Perform the linear regression and put the output in the variable

`reg`

.

```
#X =
#y
#reg =
```

Question

What is the value of the regression coefficient?

What is its physical meaning?

```
#your code here
```

Question

Use the method

`.predict`

to construct the predicted value at all observation times and add it to the`dfv`

DataFrame.

```
#dfv["CO2_fit"] =
```

Question

Plot the observed value along with your prediction on a single plot.

```
#your code here
```

It is now super easy to compute the co2 growth rate in ppm/year for several time periods.

Question

Perform 2 linear regressions: one between 1960 and 1980 and one between 2000 and 2020 and compare how the growth rate evolved over time.

Can you give an estimate of the CO2 concentration in 2050?

```
#reg_2000 = LinearRegression().fit(dfv[["counter"]][dfv.index.year>2000], dfv["CO2"][dfv.index.year>2000])
#reg_1960 = LinearRegression().fit(dfv[["counter"]][dfv.index.year<1980], dfv["CO2"][dfv.index.year<1980])
#print("The growth rate changed from {}ppm/day to {}ppm/day from 1970 to 2010".format(reg_1960.coef_,reg_2000.coef_))
## rough estimate of the number of days between 1960 and 2050 (90 years x 365 days)
#nday_2050 = 90*365
#predict_co2_2050 = reg_2000.intercept_ + nday_2050*reg_2000.coef_
#print(f"With the most recent growth rate, there will be {predict_co2_2050} ppm of CO2 in 2050"
```

## Credit#

Contributors include Bruno Deremble and Alexis Tantet.