Appendix: Elements of Probability Theory#

Probability Measures#

Sample Space
The set of all possible outcomes of an experiment is called the sample space and is denoted by \(\Omega\).

*Events* are defined as subsets of the sample space.

A collection \(\mathcal{F}\) of sets in \(\Omega\) is called a \(\sigma\)-algebra on \(\Omega\) if

  1. \(\emptyset \in \mathcal{F}\);

  2. if \(A \in \mathcal{F}\), then \(A^c \in \mathcal{F}\);

  3. if \(A_1, A_2, \ldots \in \mathcal{F}\), then \(\cup_{i = 1}^\infty A_i \in \mathcal{F}\).

Generated \(\sigma\)-algebra
The intersection of all the \(\sigma\)-algebras containing \(\mathcal{F}\), denoted \(\sigma(\mathcal{F})\), is a \(\sigma\)-algebra that we call the \(\sigma\)-algebra generated by \(\mathcal{F}\).
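For a finite sample space, the generated \(\sigma\)-algebra can be computed by brute force, which makes the closure properties above concrete. A minimal sketch (the function and example below are ours, for illustration only; on a finite \(\Omega\), closure under finite unions suffices):

```python
def generate_sigma_algebra(omega, collection):
    """Close `collection` under complement and union on the finite
    sample space `omega`, iterating until nothing new appears."""
    sigma = {frozenset(), frozenset(omega)} | {frozenset(s) for s in collection}
    changed = True
    while changed:
        changed = False
        current = list(sigma)
        for a in current:
            new = [frozenset(omega) - a] + [a | b for b in current]
            for s in new:
                if s not in sigma:
                    sigma.add(s)
                    changed = True
    return sigma

# Sigma-algebra generated by {{1}} on omega = {1, 2, 3}:
# the empty set, {1}, its complement {2, 3}, and omega itself
sa = generate_sigma_algebra({1, 2, 3}, [{1}])
print(sorted(sorted(s) for s in sa))  # [[], [1], [1, 2, 3], [2, 3]]
```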

Borel \(\sigma\)-algebra
Let \(\Omega = \mathbb{R}^p\). The \(\sigma\)-algebra generated by the open subsets of \(\mathbb{R}^p\) is called the Borel \(\sigma\)-algebra of \(\mathbb{R}^p\) and is denoted by \(\mathcal{B}(\mathbb{R}^p)\).

The \(\sigma\)-algebra on a sample space contains all the events of the experiment that we want to study.

Intuitively, the \(\sigma\)-algebra contains all the useful information that is available about the random experiment that we are performing.

Probability Measure
A probability measure \(\mathbb{P}\) on the measurable space \((\Omega, \mathcal{F})\) is a function \(\mathbb{P}: \mathcal{F} \to [0, 1]\) satisfying

  1. \(\mathbb{P}(\emptyset) = 0\), \(\mathbb{P}(\Omega) = 1\);

  2. for any \(A_1, A_2, \ldots \in \mathcal{F}\) with \(A_i \cap A_j = \emptyset\) for \(i \ne j\),

(170)#\[\begin{equation} \mathbb{P}(\cup_{i = 1}^\infty A_i) = \sum_{i = 1}^\infty \mathbb{P}(A_i) \end{equation}\]

Probability Space
The triple \((\Omega, \mathcal{F}, \mathbb{P})\) comprising a set \(\Omega\), a \(\sigma\)-algebra \(\mathcal{F}\) of subsets of \(\Omega\) and a probability measure \(\mathbb{P}\) on \((\Omega, \mathcal{F})\) is called a probability space.

Independent Sets
The sets \(A\) and \(B\) are *independent* if

(171)#\[\begin{equation} \mathbb{P}(A \cap B) = \mathbb{P}(A) \mathbb{P}(B). \end{equation}\]
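Independence can be checked exactly on a small discrete example. For two fair dice (an example of our choosing), the events "the first die is even" and "the sum is 7" turn out to be independent:

```python
from fractions import Fraction
from itertools import product

# Uniform probability measure on the 36 outcomes of two fair dice
omega = list(product(range(1, 7), repeat=2))
P = lambda event: Fraction(sum(1 for w in omega if event(w)), len(omega))

A = lambda w: w[0] % 2 == 0      # the first die is even
B = lambda w: w[0] + w[1] == 7   # the sum is 7

# P(A ∩ B) = P(A) P(B), so A and B are independent
print(P(lambda w: A(w) and B(w)) == P(A) * P(B))  # True
print(P(A), P(B), P(lambda w: A(w) and B(w)))     # 1/2 1/6 1/12
```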

Random Variables#

Measurable Space
A sample space \(\Omega\) equipped with a \(\sigma\)-algebra of subsets \(\mathcal{F}\) is called a measurable space.

This graph shows how a random variable is a function from all possible outcomes to real values.
It also shows how a random variable is used to define probability mass functions.
By Niyumard - Own work, CC BY-SA 4.0

Random Variable and Random Vector
Let \((\Omega, \mathcal{F})\) and \((\mathbb{R}^p, \mathcal{B}(\mathbb{R}^p))\) be two measurable spaces. A function \(\boldsymbol{X}: \Omega \to \mathbb{R}^p\) is called a measurable function, or random variable (\(p = 1\)), or random vector (\(p > 1\)), if the event

\(\{\omega \in \Omega: X_1(\omega) \le x_1, \ldots, X_p(\omega) \le x_p\}\) \(=: \{\omega \in \Omega: \boldsymbol{X}(\omega) \le \boldsymbol{x}\}\) \(=: \{\boldsymbol{X} \le \boldsymbol{x}\}\)

belongs to \(\mathcal{F}\) for any \(\boldsymbol{x} \in \mathbb{R}^p\).

In other words, the preimage of any Borel set under \(\boldsymbol{X}\) is an event.
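As a concrete illustration (our example), take \(\Omega\) to be the eight outcomes of three coin flips and \(X\) the number of heads; the random variable pushes the uniform measure on \(\Omega\) forward to a probability mass function on \(\{0, 1, 2, 3\}\):

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Omega: the 8 equally likely outcomes of three coin flips
omega = list(product('HT', repeat=3))
X = lambda w: w.count('H')  # random variable: number of heads

# Pushforward of the uniform measure: P(X = k) for each k
counts = Counter(X(w) for w in omega)
pmf = {k: Fraction(c, len(omega)) for k, c in sorted(counts.items())}
print({k: str(v) for k, v in pmf.items()})  # {0: '1/8', 1: '3/8', 2: '3/8', 3: '1/8'}
```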

Cumulative distribution function for the normal distribution.
By Inductiveload - self-made, Mathematica, Inkscape, Public Domain

Distribution Function of a Random Variable
Every random variable from a probability space \((\Omega, \mathcal{F}, \mu)\) to \((\mathbb{R}, \mathcal{B}(\mathbb{R}))\) induces a probability measure \(\mathbb{P}\) on \(\mathbb{R}\) that we identify with the probability distribution function \(F_X: \mathbb{R} \to [0, 1]\) defined as

\(F_X(x) = \mu(\{\omega \in \Omega: X(\omega) \le x\})\) \(=: \mathbb{P}(X \le x), \ x \in \mathbb{R}\).

In this case, \((\mathbb{R}, \mathcal{B}(\mathbb{R}), F_X)\) becomes a probability space.

If \(X\) is not a measurable function, there exists an \(x \in \mathbb{R}\) such that \(\{\omega \in \Omega: X(\omega) \le x\}\) is not an event. Then \(\mathbb{P}(X \le x) = \mu(\{\omega \in \Omega: X(\omega) \le x\})\) is not defined and we cannot define the distribution of \(X\).

This shows that it is the measurability of a random variable that makes it so special.
Many sample observations (black) drawn from a joint probability distribution.
The marginal densities are shown as well.
By IkamusumeFan - Own work, CC BY-SA 3.0

Joint Distribution Function
Let \(X\) and \(Y\) be two random variables. We can then define their joint distribution function as

(172)#\[\begin{equation} F_{X, Y}(x, y) = \mathbb{P}(X \le x, Y \le y) \end{equation}\]

We can view them as a random vector, i.e. a random variable from \(\Omega\) to \(\mathbb{R}^2\).

Independent Random Variables
Two random variables \(X\) and \(Y\) on \(\mathbb{R}\) are independent if the events \(\{\omega \in \Omega: X(\omega) \le x\}\) and \(\{\omega \in \Omega: Y(\omega) \le y\}\) are independent for all \(x, y \in \mathbb{R}\).

If \(X\) and \(Y\) are independent then \(F_{X, Y}(x, y) = F_X(x)F_Y(y)\).
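The factorization \(F_{X, Y}(x, y) = F_X(x)F_Y(y)\) can be checked empirically by sampling (a Monte Carlo sketch with parameters of our choosing):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = rng.normal(size=100_000)  # drawn independently of x

# Empirical CDFs at an arbitrary test point (a, b)
a, b = 0.5, -0.3
F_xy = np.mean((x <= a) & (y <= b))          # joint CDF
F_x, F_y = np.mean(x <= a), np.mean(y <= b)  # marginal CDFs
print(abs(F_xy - F_x * F_y) < 0.01)  # True: the joint CDF factorizes
```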

Distribution Function of a Random Vector
Every random vector \(\boldsymbol{X}\) from a probability space \((\Omega, \mathcal{F}, \mu)\) to \((\mathbb{R}^p, \mathcal{B}(\mathbb{R}^p))\) induces a probability measure \(\mathbb{P}\) on \(\mathbb{R}^p\) that we identify with the distribution function \(F_\boldsymbol{X}: \mathbb{R}^p \to [0, 1]\) defined as

(173)#\[\begin{equation} F_\boldsymbol{X}(\boldsymbol{x}) = \mathbb{P}(\boldsymbol{X} \le \boldsymbol{x}) := \mu(\{\omega \in \Omega: \boldsymbol{X}(\omega) \le \boldsymbol{x}\}) \ \ \ \ \boldsymbol{x} \in \mathbb{R}^p. \end{equation}\]

Expectation of Random Variables
Let \(\boldsymbol{X}\) be a random vector from \((\Omega, \mathcal{F}, \mu)\) to \((\mathbb{R}^p, \mathcal{B}(\mathbb{R}^p))\). We define the expectation of \(\boldsymbol{X}\) by

(174)#\[\begin{equation} \mathbb{E}(\boldsymbol{X}) = \int_{\mathbb{R}^p} \boldsymbol{x} dF_\boldsymbol{X}(\boldsymbol{x}). \end{equation}\]

More generally, let \(f: \mathbb{R}^p \to \mathbb{R}\) be measurable. Then

(175)#\[\begin{equation} \mathbb{E}(f(\boldsymbol{X})) = \int_{\mathbb{R}^p} f(\boldsymbol{x}) dF_\boldsymbol{X}(\boldsymbol{x}). \end{equation}\]

\(dF_\boldsymbol{X}(\boldsymbol{x}) = \mathbb{P}(d\boldsymbol{x}) = \mathbb{P}(dx_1, \ldots, dx_p)\) and \(\int\) denotes the Lebesgue integral.

When a function is continuous, its Lebesgue integral can be replaced by its Riemann-Stieltjes integral.
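For instance (a Monte Carlo sketch of ours), the law of large numbers lets us approximate \(\mathbb{E}(f(X))\) by a sample mean; with \(X \sim \mathcal{N}(0, 1)\) and \(f(x) = x^2\) we recover \(\mathbb{E}(X^2) = 1\):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1_000_000)  # sample from X ~ N(0, 1)

# E(f(X)) = ∫ f(x) dF_X(x), approximated by the sample mean of f(X)
estimate = np.mean(x**2)
print(abs(estimate - 1.0) < 0.01)  # True: E(X²) = Var(X) = 1
```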

\(L^p\) spaces
By \(L^p(\Omega, \mathcal{F}, \mu)\) we mean the Banach space of measurable functions \(X\) on \(\Omega\) with finite norm

(176)#\[\begin{equation} \|X\|_{L^p} = \left(E(|X|^p)\right)^{1/p}. \end{equation}\]

In particular, we say that \(X\) is integrable if \(\|X\|_{L^1} < \infty\) and that \(X\) has finite variance if \(\|X\|_{L^2} < \infty\).

Variance, Covariance and Correlation of Two Random Variables#

Variance of a Random Variable
Provided that it exists, we define the variance of a random variable \(X\) as

(177)#\[\begin{equation} \mathrm{Var}(X) := \mathbb{E}\left[(X - \mathbb{E}(X))^2\right] = \mathbb{E}(X^2) - \mathbb{E}(X)^2. \end{equation}\]
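The identity \(\mathbb{E}\left[(X - \mathbb{E}(X))^2\right] = \mathbb{E}(X^2) - \mathbb{E}(X)^2\) can be verified exactly for a fair die (our example) using rational arithmetic:

```python
from fractions import Fraction

# A fair die, with exact rational arithmetic
values = range(1, 7)
E = lambda f: sum(Fraction(1, 6) * f(k) for k in values)

mean = E(lambda k: k)                       # E(X) = 7/2
var_def = E(lambda k: (k - mean) ** 2)      # E[(X − E(X))²]
var_alt = E(lambda k: k ** 2) - mean ** 2   # E(X²) − E(X)²
print(var_def, var_alt)  # 35/12 35/12
```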

Several sets of (X, Y) points, with the corresponding Pearson
correlation coefficient. The correlation reflects the noisiness and
direction of a linear relationship (top row), but not the slope of that
relationship (middle), nor many aspects of nonlinear relationships (bottom).
N.B.: the figure in the center has a slope of 0, but in that case the
correlation coefficient is undefined because the variance of \(Y\) is zero.
By DenisBoigelot, original uploader was Imagecreator

Covariance, Variance and Correlation of Two Random Variables
Provided that it exists, we define the covariance of two random variables \(X\) and \(Y\) as

\(\mathrm{Cov}(X, Y) := \mathbb{E}\left[(X - \mathbb{E}(X)) (Y - \mathbb{E}(Y))\right]\) \(= \mathbb{E}(X Y) - \mathbb{E}(X) \mathbb{E}(Y).\)

The correlation of \(X\) and \(Y\) is

\(\mathrm{Corr}(X, Y) := \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)} \sqrt{\mathrm{Var}(Y)}}.\)

Conditional Expectation#

Conditional Probability
Let \(A\) and \(B\) be two events and suppose that \(\mathbb{P}(A) > 0\). The conditional probability of \(B\) given \(A\) is

(178)#\[\begin{equation} \mathbb{P}(B | A) := \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(A)}. \end{equation}\]

For random variables, the conditional probability of the event \(\{Y \le y\}\) given the event \(\{X \le x\}\) is

(179)#\[\begin{equation} \mathbb{P}(Y \le y | X \le x) := \frac{\mathbb{P}(X \le x, Y \le y)}{\mathbb{P}(X \le x)}. \end{equation}\]

Conditional Expectation
Assume that \(\boldsymbol{X} \in L^1(\Omega, \mathcal{F}, \mu)\) and let \(\mathcal{G}\) be a sub-\(\sigma\)-algebra of \(\mathcal{F}\). The conditional expectation of \(\boldsymbol{X}\) with respect to \(\mathcal{G}\) is the \(\mathcal{G}\)-measurable random variable \(\mathbb{E}(\boldsymbol{X} | \mathcal{G}): (\Omega, \mathcal{G}) \to \mathbb{R}^p\) satisfying

(180)#\[\begin{equation} \int_G \mathbb{E}(\boldsymbol{X} | \mathcal{G}) d\mathbb{P} = \int_G \boldsymbol{X} d\mathbb{P} \ \ \ \ \forall G \in \mathcal{G}. \end{equation}\]

It follows that (law of total expectation)

(181)#\[\begin{equation} \mathbb{E}(\mathbb{E}(\boldsymbol{X} | \mathcal{G})) = \mathbb{E}(\boldsymbol{X}). \end{equation}\]
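The law of total expectation can be verified exactly on a small joint distribution (the probability table below is ours, chosen for illustration):

```python
from fractions import Fraction

# A joint pmf of (X, Y) on a finite grid
p = {(0, 1): Fraction(1, 4), (0, 2): Fraction(1, 4),
     (1, 1): Fraction(1, 8), (1, 3): Fraction(3, 8)}

# Marginal of X and unconditional expectation of Y
pX = {x: sum(q for (a, _), q in p.items() if a == x) for x in (0, 1)}
E_Y = sum(y * q for (_, y), q in p.items())

# E(Y | X = x) = Σ_y y p(x, y) / p_X(x), then average over X
E_Y_given_X = {x: sum(y * q for (a, y), q in p.items() if a == x) / pX[x]
               for x in (0, 1)}
tower = sum(pX[x] * E_Y_given_X[x] for x in (0, 1))
print(tower == E_Y, tower)  # True 2
```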

Conditional Distribution Function
Given \(\mathcal{G}\) a sub-\(\sigma\)-algebra of \(\mathcal{F}\), we define the conditional distribution function

(182)#\[\begin{equation} F_\boldsymbol{X}(\boldsymbol{x} | \mathcal{G}) = \mathbb{P}(\boldsymbol{X} \le \boldsymbol{x} | \mathcal{G}) \ \ \ \ \forall \boldsymbol{x} \in \mathbb{R}^p. \end{equation}\]

Assume that \(f: \mathbb{R}^p \to \mathbb{R}\) is such that \(\mathbb{E}(|f(\boldsymbol{X})|) < \infty\). Then

(183)#\[\begin{equation} \mathbb{E}(f(\boldsymbol{X}) | \mathcal{G}) = \int_{\mathbb{R}^p} f(\boldsymbol{x}) dF_\boldsymbol{X}(\boldsymbol{x} | \mathcal{G}). \end{equation}\]

Conditional Expectation with respect to a Random Vector
The conditional expectation of \(\boldsymbol{X}\) given \(\boldsymbol{Y}\) is defined by

(184)#\[\begin{equation} \mathbb{E}(\boldsymbol{X} | \boldsymbol{Y}) := \mathbb{E}(\boldsymbol{X} | \sigma(\boldsymbol{Y})), \end{equation}\]

where \(\sigma(\boldsymbol{Y}) := \{\boldsymbol{Y}^{-1}(B): B \in \mathcal{B}(\mathbb{R}^p)\}\) is the \(\sigma\)-algebra generated by \(\boldsymbol{Y}\).

Conditional Variance and Least Squares#

Conditional Variance
Suppose that \(Y\) is a random variable and that \(\mathcal{G}\) is a sub-\(\sigma\)-algebra of \(\mathcal{F}\). Then, the random variable

(185)#\[\begin{equation} \mathrm{var}(Y | \mathcal{G}) := \mathbb{E}[(Y - \mathbb{E}(Y | \mathcal{G}))^2 | \mathcal{G}] \end{equation}\]

is called the conditional variance of \(Y\) knowing \(\mathcal{G}\).

It tells us how much variance is left if we use \(\mathbb{E}(Y | \mathcal{G})\) to predict \(Y\).

The Conditional Expectation Minimizes the Squared Deviations
Let \(X\) and \(Y\) be random variables with finite variance, and let \(g\) be a real-valued function such that \(\mathbb{E}[g(X)^2] < \infty\). Then (Theorem 10.1.4 in Gut 2005)

(186)#\[\begin{align} \mathbb{E}[(Y - g(X))^2] &= \mathbb{E}[\mathrm{Var}(Y | X)] + \mathbb{E}[(\mathbb{E}(Y | X) - g(X))^2]\\ &\ge \mathbb{E}[\mathrm{Var}(Y | X)], \end{align}\]

where equality is obtained for \(g(X) = \mathbb{E}(Y | X)\).

Thus, the expected conditional variance of \(Y\) given \(X\) shows up as the irreducible error of predicting \(Y\) given only the knowledge of \(X\).
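A Monte Carlo sketch of this result (the model \(Y = X^2 + \varepsilon\) is our choice): predicting with \(g(X) = \mathbb{E}(Y | X) = X^2\) attains the floor \(\mathbb{E}[\mathrm{Var}(Y | X)]\), while any other \(g\) does worse:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=200_000)
y = x**2 + rng.normal(scale=0.5, size=x.size)  # E(Y | X) = X², Var(Y | X) = 0.25

mse_best = np.mean((y - x**2) ** 2)  # g(X) = E(Y | X) attains the floor
mse_other = np.mean((y - x) ** 2)    # a different g(X) = X does worse
print(mse_best < mse_other, round(float(mse_best), 2))  # True 0.25
```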

Absolutely Continuous Distributions and Densities#

Absolutely Continuous Distribution
A distribution function \(F\) is absolutely continuous with respect to the Lebesgue measure (denoted \(dx\)) if and only if there exists a non-negative, Lebesgue integrable function \(f\), such that

(187)#\[\begin{equation} F(b) - F(a) = \int_a^b f(x) dx \ \ \ \ \forall a < b. \end{equation}\]

The function \(f\) is called the density of \(F\) and is denoted by \(\frac{dF}{dx}\).

Equivalently, \(F\) is absolutely continuous if and only if, for every measurable set \(A\), \(dx(A) = 0\) implies \(\mathbb{P}(X \in A) = 0\).

Marginal Density
From an absolutely continuous random vector \((X, Y)\) with density \(f_{X, Y}\), we can derive the density of \(X\), or marginal density by integrating over \(Y\):

(188)#\[\begin{equation} f_X(x) = \int_{-\infty}^\infty f_{X, Y}(x, y) dy. \end{equation}\]
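For example (our sketch), marginalizing the joint density of two independent standard normals by numerical integration recovers the standard normal density:

```python
import numpy as np

# Joint density of two independent standard normals:
# f_{X,Y}(x, y) = exp(-(x² + y²) / 2) / (2π)
f_xy = lambda x, y: np.exp(-(x**2 + y**2) / 2) / (2 * np.pi)

# f_X(x) = ∫ f_{X,Y}(x, y) dy by the trapezoidal rule, at x = 0.7
y = np.linspace(-8.0, 8.0, 2001)
vals = f_xy(0.7, y)
f_x = float(np.sum((vals[1:] + vals[:-1]) / 2 * np.diff(y)))

# Compare with the standard normal density at 0.7
exact = np.exp(-0.7**2 / 2) / np.sqrt(2 * np.pi)
print(abs(f_x - exact) < 1e-6)  # True
```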

If \(X\) is an absolutely continuous random variable, with density \(f_X\), \(g\) is a measurable function, and \(\mathbb{E}(|g(X)|) < \infty\), then

(189)#\[\begin{equation} \mathbb{E}(g(X)) = \int_{-\infty}^\infty g(x) f_X(x) dx. \end{equation}\]

If \(X\) and \(Y\) are absolutely continuous, then \(X\) and \(Y\) are independent if and only if the joint density is equal to the product of the marginal ones, that is

(190)#\[\begin{equation} f_{X, Y}(x, y) = f_X(x) f_Y(y) \ \ \ \ \forall x, y \in \mathbb{R}. \end{equation}\]

Conditional Density
Let \(X\) and \(Y\) have a joint absolutely continuous distribution. For \(f_X(x) > 0\), the conditional density of \(Y\) given that \(X = x\) equals

(191)#\[\begin{equation} f_{Y | X = x}(y) = \frac{f_{X, Y}(x, y)}{f_X(x)} \end{equation}\]

Then the conditional distribution of \(Y\) given that \(X = x\) is derived by

(192)#\[\begin{equation} F_{Y | X = x}(y) = \int_{-\infty}^y f_{Y | X = x}(z) dz. \end{equation}\]

If \(X\) and \(Y\) are independent then the conditional and the unconditional distributions are the same.

Sample Estimates#

Let \((x_i, y_i), i = 1, \ldots, N\) be a sample drawn from the joint distribution of the random variables \(X\) and \(Y\). Then we have the following unbiased estimates:

  • Sample mean: \(\bar{x} = \frac{1}{N} \sum_{i = 1}^N x_i\)

  • Sample variance: \(s_X^2 = \frac{1}{N - 1} \sum_{i = 1}^N \left(x_i - \bar{x}\right)^2\)

  • Sample covariance: \(q_{X,Y} = \frac{1}{N - 1} \sum_{i = 1}^N \left(x_i - \bar{x}\right) \left(y_i - \bar{y}\right)\)
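These estimators coincide with NumPy's when the `ddof=1` correction is used (a quick check on synthetic data of our choosing):

```python
import numpy as np

rng = np.random.default_rng(3)
x, y = rng.normal(size=(2, 1000))

n = x.size
xbar = x.sum() / n                                    # sample mean
s2 = ((x - xbar) ** 2).sum() / (n - 1)                # sample variance
qxy = ((x - xbar) * (y - y.mean())).sum() / (n - 1)   # sample covariance

print(np.isclose(s2, x.var(ddof=1)))        # True
print(np.isclose(qxy, np.cov(x, y)[0, 1]))  # True
```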

Confidence Intervals#

Example: Regional surface temperature in France#

```python
# Import modules
from pathlib import Path
import numpy as np
import pandas as pd
import holoviews as hv
import hvplot.pandas
import panel as pn

# Set data directory
data_dir = Path('data')

# Set keyword arguments for pd.read_csv
kwargs_read_csv = dict(header=0, index_col=0, parse_dates=True)

# Set first and last years
FIRST_YEAR = 1980  # assumed: original value lost; MERRA-2 begins in 1980
LAST_YEAR = 2021

# Define file path
filename = 'surface_temperature_merra2_{}-{}.csv'.format(FIRST_YEAR, LAST_YEAR)
filepath = Path(data_dir, filename)

# Read hourly temperature data averaged over each region
# and resample it to daily means
df_temp = pd.read_csv(filepath, **kwargs_read_csv).resample('D').mean()

# Plot parameters
temp_lim = [-5, 30]
label_temp = 'Temperature (°C)'
WIDTH = 260

# Time series, CDF and PDF plots of a region's temperature for a year
def plot_temp(region_name, year):
    df = df_temp[[region_name]].loc[str(year)]
    df.columns = [label_temp]
    nt = df.shape[0]
    std = df[label_temp].std()
    mean = pd.Series(df[label_temp].mean(), index=df.index)
    df_std = pd.DataFrame(
        {'low': mean - std, 'high': mean + std}, index=df.index)
    cdf = pd.DataFrame(index=df.sort_values(by=label_temp).values[:, 0],
                       data=(np.arange(nt)[:, None] + 1) / nt)
    cdf.index.name = label_temp
    cdf.columns = ['Probability']
    pts = df.hvplot(ylim=temp_lim, title='', width=WIDTH).opts(
        title='Time series, Mean, ± 1 STD') * hv.HLine(
        df[label_temp].mean()) * df_std.hvplot.area(
        y='low', y2='high', alpha=0.2)
    pcdf = cdf.hvplot(xlim=temp_lim, ylim=[0, 1], title='', width=WIDTH).opts(
        title='Cumulative Distrib. Func.') * hv.VLine(
        df[label_temp].mean())
    pkde = df.hvplot.kde(xlim=temp_lim,
                         width=WIDTH) * hv.VLine(
        df[label_temp].mean()).opts(title='Probability Density Func.')
    return pn.Row(pts, pcdf, pkde)

# Show
pn.interact(plot_temp, region_name=df_temp.columns,
            year=range(FIRST_YEAR, LAST_YEAR))
```



Contributors include Bruno Deremble and Alexis Tantet.
