Tutorial 1: Python basics#
Data manipulation in python: numpy, matplotlib, pandas
First steps with scikit-learn
Python is a popular programming language. We will use it along with 4 libraries: numpy for array handling, pandas for time series, scikit-learn for machine learning tools, and matplotlib for plotting. There are a lot of resources on the web: the user manual remains the primary place to look for documentation about a specific function. https://stackoverflow.com/ is also a great source of information since many people before you have already had the same questions. A good intro to numpy: https://sebastianraschka.com/blog/2020/numpy-intro.html
Even though Python is a great language, don’t forget that it is an interpreted language, which means that it can be very slow. For this reason you should:
Rely on existing libraries when available
Write compact code
Avoid loops
In fact, most modern libraries defer the heavy lifting to compiled bits of code, which typically run much faster than your own implementation.
Last, don’t forget to comment your code.
Numpy for array handling#
Numpy is the core library when it comes to manipulating “dense” arrays of numbers and you should always use it instead of raw Python lists. Numpy also comes with a couple of linear algebra routines to perform basic operations (matrix multiplication, linear systems, eigenvalue solvers). However, for large-scale applications (and sparse matrices) we will use different libraries (like scipy).
import numpy as np
Vectors are first declared as a list of numbers and are then converted to numpy arrays. Note that numpy does not distinguish between row vectors and column vectors: a 1d array is just a flat sequence of numbers, even though it prints like a row.
x = np.array([1, 2, 3])
print(x)
[1 2 3]
Matrices are declared as an array of arrays, in a similar way as vectors. Note that Python is case sensitive.
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(A)
[[1 2 3]
[4 5 6]
[7 8 9]]
In order to convert a row vector to a column vector, one must add a new dimension as shown below:
x_col = x[:,None]
print(x_col)
[[1]
[2]
[3]]
You can compute the transpose of an array with the attribute `.T`. However, keep in mind that the transpose of a 1d array does not really make sense in numpy: if you want to transpose a vector, it must have at least two dimensions.
Question
Try to transpose `x` and `x_col` and check their dimensions.
Hint: You can always check the dimensions of your array with the attribute `.shape`.
# your code here
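One possible check (a sketch; any equivalent code works):
print(x.T, x.T.shape)          # transposing a 1d array does nothing: shape stays (3,)
print(x_col.T, x_col.T.shape)  # the (3, 1) column becomes a (1, 3) row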
Matrix multiplication can be achieved with `np.dot`, `np.matmul`, `@`, or `np.einsum`.
Question
Pick the one you prefer to compute \(\mathbf A \mathbf x\), \(\mathbf x \mathbf x^\top\), and \(\mathbf x^\top \mathbf x\).
Hint: Did I already warn you about transposing 1d vectors in python?
Hint 2: `np.einsum` is the most advanced function but also the most complicated. This tutorial will tell you more about it.
Hint 3: You can add a question mark to the name of a function to access its help:
np.einsum?
With the vectors and matrices defined above, you should get: \(\mathbf A \mathbf x = [14, 32, 50]^\top\)
and \(\mathbf x^\top \mathbf x = 14\)
# your code here
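For reference, one possible solution with the `@` operator (other choices work equally well):
print(A @ x)            # A x -> array([14, 32, 50])
print(x_col @ x_col.T)  # x x^T -> a (3, 3) matrix
print(x @ x)            # x^T x -> 14 (no transpose needed for 1d arrays)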
You can access the first element of an array with `x[0]` and the last element with `x[-1]`. You can also select a subset using slices like `x[0:2]` (or `x[:2]`). Be careful: the last index of the slice is not part of the subset.
Question
Can you select the last two elements of the vector \(\mathbf x\) with negative indices?
# your code here
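A minimal sketch:
print(x[-2:])   # negative indices count from the end: array([2, 3])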
Numpy comes with the standard math functions `np.exp`, `np.sqrt`, `np.sin`, etc. Each of these functions is applied element-wise to an array. For instance, `A**2` raises each element of `A` to the power of 2.
Question
Is `A*A` the same as `A@A`?
# your code here
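A quick check (sketch):
print(A * A)   # element-wise product: each entry squared
print(A @ A)   # matrix product: rows times columns, a different result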
Broadcasting#
In Python, we can add, subtract, multiply (etc.) arrays element-wise. In this context, arrays have to be of the same size. But when we do `A+1`, we add the scalar `1` to each element of the array `A`. This type of operation is called broadcasting.
Question
Check the broadcasting rules in the numpy documentation and add `x` to each column of `A`.
You should get:
[[ 2  3  4]
 [ 6  7  8]
 [10 11 12]]
# your code here
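One possible solution, relying on the (3, 1) column vector defined earlier:
print(A + x_col)   # x_col broadcasts along the second axis, adding x to each column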
Generating arrays#
Arrays filled with zeros can be generated with `np.zeros`. We can also generate arrays of random numbers. The most useful functions are:
`np.random.rand` (uniform distribution)
`np.random.randn` (standard normal distribution)
`np.random.normal` (normal distribution)
Question
Generate 2 arrays \(v_1\) and \(v_2\) with \(N=500\) observations following the normal distribution \(\mathcal N (\mu_1,\sigma_1^2)\) for \(v_1\) and \(\mathcal N (\mu_2,\sigma_2^2)\) for \(v_2\), with \(\mu_1 = 2\), \(\mu_2 = 1\), \(\sigma_1 = 5\), \(\sigma_2 = 1\)
mu1 = 2
mu2 = 1
sigma1 = 5
sigma2 = 1
N = 500
# your code here
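A possible sketch using `np.random.normal(loc, scale, size)`:
v1 = np.random.normal(mu1, sigma1, N)
v2 = np.random.normal(mu2, sigma2, N)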
Question
Compute the mean of each array.
How does it compare to \(\mu_1\) and \(\mu_2\)? Can you explain?
# your code here
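One way to check (sketch):
print(v1.mean(), v2.mean())   # close to mu1 and mu2, but not exact:
                              # the sampling error scales like sigma/sqrt(N)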
Question
Do you think \(v_1\) and \(v_2\) are correlated?
Can you compute the actual covariance between \(v_1\) and \(v_2\) with `np.cov`? (Hint: recall the definition of the covariance between two variables.)
What is the expected covariance matrix in terms of \(\sigma_1\) and \(\sigma_2\)?
Do you recognize the array `Cv` defined below?
# your code here
Cv = np.array([[sigma1**2, 0], [0,sigma2**2]])
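A possible check (sketch): the empirical covariance matrix should be close to `Cv`.
print(np.cov(v1, v2))   # diagonal ~ sigma1**2 and sigma2**2, off-diagonal ~ 0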
As you just saw, \(v_1\) and \(v_2\) are not correlated. Let’s apply the linear transformation
\[\mathbf W = \mathbf R \mathbf V\]
that you can also write for each realization
\[\mathbf w = \mathbf R \mathbf v \, ,\]
with \(\mathbf R\) the rotation matrix
\[\mathbf R = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}\]
and \(\mathbf w = (w_1, w_2)^\top\), \(\mathbf v = (v_1, v_2)^\top\).
Question
Create the rotation matrix `R` with \(\theta=\pi/3\).
In order to compute \(\mathbf w\), it is convenient to store \(\mathbf v = [v_1, v_2]\) in a matrix with `V = np.array([v1,v2])`. Compute `V`.
Compute `W` with one matrix multiplication.
Remarks:
The equation \(\mathbf w = \mathbf R \mathbf v\) above is valid for one single observation.
The layout for `V` and `W` is the opposite of the one we adopted in class. We will adopt this new layout just for this tutorial. Take a moment to check whether each observation is stored in a row or in a column.
# your code here
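One possible solution (sketch):
theta = np.pi / 3
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
V = np.array([v1, v2])   # shape (2, N): each observation is a column
W = R @ V                # rotate all observations at once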
Question
Compute the mean of \(\mathbf w = (w_1,w_2)\) with two different techniques:
direct computation with `np.mean` (be careful: you need to compute the mean along the second dimension only! Look at the documentation);
with \(R\) and the mean of \(\mathbf v\).
For the second method, you will have to derive the formula from the definition of \(\mathbf w\).
# your answer here
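A possible sketch for both techniques:
m_direct = np.mean(W, axis=1)        # mean along the second (observation) dimension
m_rotated = R @ np.mean(V, axis=1)   # since w = R v, the mean transforms the same way
print(m_direct, m_rotated)           # the two results should match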
Question
Can you guess what the covariance between \(w_1\) and \(w_2\) will be?
Compute the theoretical covariance matrix \(C_w\) as a function of \(R\) and \(C_v\) (the covariance matrix of \(\mathbf v\)).
Compute the covariance matrix of \(\mathbf w\) with 2 methods:
with `np.cov`;
with `R` and `Cv`.
Comment on your results.
You should get approximately \(\begin{pmatrix} 7 & 10.4 \\ 10.4 & 19 \end{pmatrix}\) (the empirical estimate will match up to sampling error).
# your answer here
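A possible sketch for both methods (the formula \(C_w = R\,C_v\,R^\top\) follows from the definition of the covariance):
Cw_empirical = np.cov(W)    # estimated from the data
Cw_theory = R @ Cv @ R.T    # theoretical covariance after rotation
print(Cw_empirical)
print(Cw_theory)            # close to the empirical one, up to sampling error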
Matplotlib for plotting#
To visualize data, matplotlib is the reference tool. Matplotlib is also what runs behind the scenes of high-level python libraries (pandas, xarray), so knowing the basics will also be useful there.
import matplotlib.pyplot as plt
Question
Can you visualize the data set `V` that we generated earlier? You can either pass the two variables to the function `plt.plot` in order to plot \(v_2\) as a function of \(v_1\), or use the function `plt.scatter`. Don’t forget to label your axes.
#your answer here
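One possible solution with `plt.scatter` (a sketch; `plt.plot(v1, v2, '.')` works too):
plt.scatter(V[0, :], V[1, :])
plt.xlabel('v1')
plt.ylabel('v2')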
Question
Do you see the lack of correlation between \(v_1\) and \(v_2\)?
Let’s analyze the probability density function (pdf) of this data set. The function `plt.hist` provides a discrete estimate of the pdf along a given dimension.
Question
Use `plt.hist` to compute and plot an estimate of the pdf of \(w_1\) and \(w_2\). Don’t forget to adjust the number of bins.
(Optional) Compare it with the analytical probability density function of \(w_1\) and \(w_2\).
# your code here
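A possible sketch (the number of bins is a free choice):
plt.hist(W[0, :], bins=30, density=True, alpha=0.5, label='w1')
plt.hist(W[1, :], bins=30, density=True, alpha=0.5, label='w2')
plt.legend()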
We can also get a quick overview of the statistical distribution of the data with a box plot.
Question
What is the meaning of a box plot?
Try it on \(w_1\)
# your code here
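A minimal sketch:
plt.boxplot(W[0, :])   # box: quartiles; whiskers: data range; points: outliers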
Question (optional)
Use `np.histogram2d` to compute the 2-dimensional pdf.
(***) On a single plot, superimpose the cloud of points and the contour lines of the pdf.
# your answer here
# Solution
# The (x,y) indexing is tricky: use 2 different numbers of bins in each direction
#psi, xedge, yedge = np.histogram2d(W[0,:],W[1,:],bins=[5,10])
#x1 = 0.5*(xedge[1:] + xedge[:-1])
#y1 = 0.5*(yedge[1:] + yedge[:-1])
#xg,yg = np.meshgrid(x1,y1,indexing='ij')
#plt.plot(W[0,:],W[1,:],'.', alpha=0.1)
#plt.contour(xg,yg,psi)
#plt.axis('equal')
Question (Optional):
The central limit theorem states that if \(X_1, X_2,\dots ,X_{N}\) are \(N\) random samples drawn from a population with overall mean \(\mu\) and finite variance \(\sigma^{2}\), and \(\bar{X}_{N}\) is the sample mean, then the distribution of \(Z=\sqrt{N}\left(\frac{\bar{X}_{N}-\mu}{\sigma}\right)\) tends to a standard normal distribution as \(N\to\infty\). Can you propose a visual illustration of this theorem?
# your answer here
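One possible illustration (a sketch, assuming a uniform population, which has mean 0.5 and variance 1/12): draw many independent sample means and look at the histogram of the standardized variable.
Nrep = 2000                           # number of repeated experiments
Nclt = 500                            # sample size in each experiment
samples = np.random.rand(Nrep, Nclt)  # uniform (clearly non-Gaussian) population
Z = np.sqrt(Nclt) * (samples.mean(axis=1) - 0.5) / np.sqrt(1/12)
plt.hist(Z, bins=40, density=True)    # should look like a standard normal pdf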
Pandas for advanced array manipulation#
Compared to numpy, which handles raw arrays of numbers, pandas essentially adds metadata on top of the data. This will allow you to use predefined functions in pandas, especially when it comes to time series manipulation.
Pandas has a lot of other nice features like
Intuitive handling of missing data (NaN)
Input/output in many formats
Simplified plotting procedures (labels, axes, etc. are added automatically)
Better memory management
Data download#
If you are running a standalone version of the notebook, you will have to download the data first by running:
!mkdir data
!wget -P data/ https://scrippsco2.ucsd.edu/assets/data/atmospheric/stations/in_situ_co2/daily/daily_in_situ_co2_mlo.csv
and in case wget is not installed, you can install it with
!pip install wget
!python -m wget -o 'data/daily_in_situ_co2_mlo.csv' https://scrippsco2.ucsd.edu/assets/data/atmospheric/stations/in_situ_co2/daily/daily_in_situ_co2_mlo.csv
We are going to explore the basic functionalities of pandas with the historical CO2 measurements taken at Mauna Loa, also known as the Keeling curve. The data can be downloaded at scrippsco2.ucsd.edu and we are going to analyze the daily output.
Take a moment to inspect the file located at data/daily_in_situ_co2_mlo.csv (either use a text editor or navigate within jupyter to open the file). Read the header to understand what the data set is about (you may have to investigate more to know about the units).
Question
Why are there NaNs in the data set?
import pandas as pd
We are going to use the function `pd.read_csv` to import the data in python. It is very important to do a clean import of the data so that we can work efficiently afterwards. Pay extra attention to:
the commented part of the header (hint: `comment`);
adding a name to each column. Use the names Year, Month, Day for the corresponding columns; this way, pandas will directly recognize what they are (hint: `names`);
pandas handles spaces in csv files in a very specific way: you may need to adjust its behavior with `skipinitialspace`.
Use `pd.read_csv` to load the content of data/daily_in_situ_co2_mlo.csv into a DataFrame called `df`.
df = pd.read_csv('data/daily_in_situ_co2_mlo.csv', comment='%',names=["Year", "Month", "Day", "CO2", "NB", "scale"], skipinitialspace=True)
To access your data, you will need to know the metadata, i.e. the label of each column in the csv file (in this case provided by yourself). You can get a quick overview of your data by typing `df.head()` or `df.describe()`. You will see that the DataFrame `df` can be treated like a spreadsheet with labeled columns and entries for each column. You can view the labels with `df.columns` and the row indices with `df.index`. Make sure you understand the type of each column of the dataset with `df.dtypes`.
#look at the data set here
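For instance (sketch):
print(df.head())       # first rows of the table
print(df.describe())   # summary statistics of the numeric columns
print(df.dtypes)       # type of each column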
You can select a column of the DataFrame by passing one name to `df`, like `df["Year"]`. You can also select multiple columns by passing a list of names: `df[["Year", "Month"]]`. To select specific rows, you need to use the method `df.loc[i0:i1]` with `i0` and `i1` the lower and upper bounds of the index. In our case we did not provide labels for the index, so pandas automatically numbered the rows with a simple counter `0, 1, ...`. In that case, `i0` and `i1` are integers, but we will see later that `i0` and `i1` can also be two dates (if the index is a date).
Question
Create a new DataFrame that contains only the first 100 entries and only the columns `Year` and `CO2`.
# your answer here
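One possible solution (sketch): `.loc` bounds are inclusive, so `0:99` selects 100 rows.
df_small = df[["Year", "CO2"]].loc[0:99]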
One killer feature of pandas is the way it handles time series. In order to use the full potential of this tool, one needs to change the index to a date that is recognized by pandas. We use `pd.to_datetime` to combine the information in the year, month, and day columns of the DataFrame. `pd.to_datetime` handles several input formats and you should be careful to use the correct one. However, because we carefully chose the names of the columns when we imported the file, we can directly assemble the columns `Year`, `Month`, `Day` to create a new datetime object. You can follow the method in the documentation to re-assign `df.index` to a `datetime`-type variable. Henceforth, we will use `df.index` to access the time dimension.
#your answer here
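A possible sketch, following the column-assembly behavior of `pd.to_datetime` described above:
df.index = pd.to_datetime(df[["Year", "Month", "Day"]])   # columns named like date components are assembled into Timestamps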
Question (Optional)
Note that one could have done this operation directly while reading the data. Use `parse_dates` in `pd.read_csv` to create a datetime object in `df`. Try it.
# your code here
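One possibility (a sketch; note that combining several columns through `parse_dates` is deprecated in recent pandas versions, so the `pd.to_datetime` route above is more future-proof):
#df2 = pd.read_csv('data/daily_in_situ_co2_mlo.csv', comment='%',
#                  names=["Year", "Month", "Day", "CO2", "NB", "scale"],
#                  skipinitialspace=True,
#                  parse_dates={"date": ["Year", "Month", "Day"]}, index_col="date")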
Now `df.index` contains Timestamps. You can quickly access the day, month, and year of the records with `df.index.day`, `df.index.month`, `df.index.year` (etc., see the documentation). You can also construct time intervals (or timedeltas) by subtracting two timestamps.
Question
Subtract the first index from the last index to know how many days are covered by this data set.
#your answer here
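A minimal sketch:
span = df.index[-1] - df.index[0]   # a Timedelta
print(span.days)                    # number of days in the record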
You can now select specific rows of your dataset by passing dates to the `.loc[]` method, as in `.loc[date0:date1]`. The two dates `date0` and `date1` can be written as strings in the format `'YYYY-MM-DD'` (e.g. `'2020-09-20'`) or simply `'YYYY-MM'`.
If `date0` is only `'2020-09'` (you are missing the days), pandas will interpret it as the first day of September 2020. Note that pandas can also interpret `'2020-09'` as the whole month of September 2020 if it is the only argument, as in `.loc[date0]`.
Question
Select the whole month of September 2020 with two different methods
#your code here
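Two possible methods (sketch; the variable names are arbitrary):
sep_a = df.loc["2020-09"]                   # partial-string indexing: the whole month
sep_b = df.loc["2020-09-01":"2020-09-30"]   # explicit date range (bounds inclusive)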
Pandas has built-in plotting features that are based on matplotlib. Since you now master matplotlib, you will find it very easy to plot data within pandas.
Question
Plot the CO2 time series as a function of time. (Hint: select the column of interest in `df` and simply apply the method `.plot()`.)
# your answer here
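One possible solution (sketch; the axis label is a good-practice addition):
df["CO2"].plot()
plt.ylabel("CO2 [ppm]")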
The CO2 curve is increasing. This is not really a surprise, is it? However, we also notice wiggly patterns superimposed on this trend. Are we looking at noisy data or does it have a physical meaning?
Question
In order to get a closer look, can you plot only 2 years of the CO2 time series between Jan 1st 2000 and Jan 1st 2002?
#your answer here
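A possible sketch using partial-string date slicing:
df["CO2"].loc["2000-01-01":"2002-01-01"].plot()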
So there seems to be a lot of variability and many missing values in this data set. We can try to average all available data for each single year. That way we will probably see a smoother curve. The `resample` function is here to create a new time series with the desired sampling frequency. The acronyms for standard time intervals are listed in the Time series documentation. You need to append the method `.mean()` or `.sum()` to the resampled data to indicate how to treat the data in each resampling interval.
Question
Plot the yearly average evolution of the CO2 concentration.
#your answer here
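One possible solution (sketch; "YS" is the year-start frequency alias, see the pandas documentation for others):
df["CO2"].resample("YS").mean().plot()   # one averaged value per year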
These wiggles are not random but are the actual annual cycle of the data.
Question (Optional)
Use the method `.groupby` to plot the annual cycle.
Discuss the physical origin of this annual cycle.
#your answer here
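A possible sketch for the annual cycle:
df["CO2"].groupby(df.index.month).mean().plot()   # average CO2 for each calendar month
plt.xlabel("month")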
Scikit-learn for data analysis#
Scikit-learn will be our toolbox for all statistical analysis. It provides the most popular machine learning algorithms, is well documented, and has an active user community. Scikit-learn does not work well with missing data, so in order to have a smooth transition between pandas and scikit-learn, it is convenient to create a copy of your DataFrame without any `NaN`. Make sure you use (and understand) the `.copy()` method to create the new DataFrame, because we are going to make modifications to it. You can call this new DataFrame `dfv`.
dfv = df.dropna().copy()
Because we cannot do a linear regression with dates as indices, we need to create an additional column that contains the number of days since the beginning of the dataset. Remember how we worked with timedeltas (and broadcasting) above?
Question
Can you create this new column and name it `counter`?
Be careful: this new variable must have dtype int (or float), not timedelta. (Hint: look at the attributes of a timedelta.)
#dfv["counter"] =
We are now ready to use the linear regression module.
from sklearn.linear_model import LinearRegression
We are first going to fit the data. As explained in the documentation, this function takes at least two arguments: the predictors and the observations. Predictors must be a 2d array of shape (n_samples, n_features), even if there is only ONE feature. One simple way to achieve this is to select a list of only one variable in the DataFrame (i.e. `dfv[["counter"]]`; note the usage of double brackets here).
Question
Create the input variable `X` (days since first record) and the output variable `y` (CO2 concentration).
Perform the linear regression and put the output in the variable `reg`.
#X =
#y
#reg =
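A possible sketch:
X = dfv[["counter"]]   # 2d: (n_samples, 1 feature)
y = dfv["CO2"]
reg = LinearRegression().fit(X, y)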
Question
What is the value of the regression coefficient?
What is its physical meaning?
#your code here
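For instance (sketch):
print(reg.coef_)               # slope in ppm per day
print(reg.coef_[0] * 365.25)   # roughly the growth rate in ppm per year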
Question
Use the method `.predict` to construct the predicted values at all observation times and add them to the `dfv` DataFrame.
#dfv["CO2_fit"] =
Question
Plot the observed value along with your prediction on a single plot.
#your code here
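One possible solution (sketch):
dfv[["CO2", "CO2_fit"]].plot()   # observations and linear fit on the same axes
plt.ylabel("CO2 [ppm]")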
It is now super easy to compute the CO2 growth rate in ppm/year for several time periods.
Question
Perform 2 linear regressions, one between 1960 and 1980 and one between 2000 and 2020, and compare how the growth rate evolved over time.
Can you give an estimate of the CO2 concentration in 2050?
#reg_2000 = LinearRegression().fit(dfv[["counter"]][dfv.index.year>2000], dfv["CO2"][dfv.index.year>2000])
#reg_1960 = LinearRegression().fit(dfv[["counter"]][dfv.index.year<1980], dfv["CO2"][dfv.index.year<1980])
#print("The growth rate changed from {} ppm/day to {} ppm/day between the two periods".format(reg_1960.coef_, reg_2000.coef_))
## rough estimate of the number of days between the start of the record and 2050 (90 years x 365 days)
#nday_2050 = 90*365
#predict_co2_2050 = reg_2000.intercept_ + nday_2050*reg_2000.coef_
#print(f"With the most recent growth rate, there will be {predict_co2_2050} ppm of CO2 in 2050")
Credit#
Contributors include Bruno Deremble and Alexis Tantet.