{ "cells": [ { "cell_type": "markdown", "id": "ba125687", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Appendix: Elements of Probability Theory" ] }, { "cell_type": "markdown", "id": "ea64232c", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Probability Measures" ] }, { "cell_type": "markdown", "id": "cf6409b5", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "\n", "
\n", "\n", "**Sample Space**\n", "
\n", "The set of all possible outcomes of an experiment is called the *sample space* and is denoted by $\\Omega$.\n", "\n", "*Events* are defined as subsets of the sample space.\n", "
" ] }, { "cell_type": "markdown", "id": "b0414d57", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", "\n", "**$\\sigma$-algebra**\n", "
\n", "A collection $\\mathcal{F}$ of sets in $\\Omega$ is called a *$\\sigma$-algebra* on $\\Omega$ if\n", "\n", "1. $\\emptyset \\in \\mathcal{F}$;\n", "2. if $A \\in \\mathcal{F}$, then $A^c \\in \\mathcal{F}$;\n", "3. if $A_1, A_2, \\ldots \\in \\mathcal{F}$, then $\\cup_{i = 1}^\\infty A_i \\in \\mathcal{F}$.\n", "\n", "
\n" ] }, { "cell_type": "markdown", "id": "ead735c9", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", "\n", "**Generated $\\sigma$-algebra**\n", "
\n", "The intersection of all the $\\sigma$-algebras containing $\\mathcal{F}$, denoted $\\sigma(\\mathcal{F})$, is a $\\sigma$-algebra that we call the *$\\sigma$-algebra generated by $\\mathcal{F}$*.\n", "\n", "
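On a finite sample space, the generated $\\sigma$-algebra can be built by brute force: start from the given sets and keep adding complements and unions until nothing new appears. The sketch below assumes a hypothetical four-point sample space and a helper named `generate_sigma_algebra` that is not part of the original notebook; on a finite set, closure under finite unions is enough.

```python
from itertools import combinations


def generate_sigma_algebra(omega, collection):
    # Smallest family containing `collection` that is closed under
    # complement and union (sufficient when omega is finite).
    omega = frozenset(omega)
    sets = {frozenset(), omega} | {frozenset(a) for a in collection}
    changed = True
    while changed:
        changed = False
        for a in list(sets):
            if omega - a not in sets:
                sets.add(omega - a)
                changed = True
        for a, b in combinations(list(sets), 2):
            if a | b not in sets:
                sets.add(a | b)
                changed = True
    return sets


# Hypothetical example: omega = {1, 2, 3, 4}, generators {1} and {1, 2}
omega = {1, 2, 3, 4}
sigma = generate_sigma_algebra(omega, [{1}, {1, 2}])
for event in sorted(sigma, key=lambda s: (len(s), sorted(s))):
    print(set(event) if event else 'empty set')
```

The result is made of all unions of the atoms {1}, {2} and {3, 4}, so the set {3} alone is not an event of this $\\sigma$-algebra.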
" ] }, { "cell_type": "markdown", "id": "8034498b", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
\n", "\n", "**Borel $\\sigma$-algebra**\n", "
\n", "Let $\\Omega = \\mathbb{R}^p$. The $\\sigma$-algebra generated by the open subsets of $\\mathbb{R}^p$ is called the *Borel $\\sigma$-algebra* of $\\mathbb{R}^p$ and is denoted by $\\mathcal{B}(\\mathbb{R}^p)$.\n", "\n", "
" ] }, { "cell_type": "markdown", "id": "80a492d0", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The $\\sigma$-algebra of a sample space contains all possible outcomes of the experiment that we want to study.\n", "\n", "Intuitively, the $\\sigma$-algebra contains all the useful information that is available about the random experiment that we are performing." ] }, { "cell_type": "markdown", "id": "1bbe805d", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "
\n", "\n", "**Probability Measure**\n", "
\n", "A probability measure $\\mathbb{P}$ on the measurable space $(\\Omega, \\mathcal{F})$ is a function $\\mathbb{P}: \\mathcal{F} \\to [0, 1]$ satisfying\n", "\n", "1. $\\mathbb{P}(\\emptyset) = 0$, $\\mathbb{P}(\\Omega) = 1$;\n", "2. if $A_1, A_2, \\ldots \\in \\mathcal{F}$ are pairwise disjoint, i.e. $A_i \\cap A_j = \\emptyset$ for $i \\ne j$, then\n", "\n", "\\begin{equation}\n", "\\mathbb{P}(\\cup_{i = 1}^\\infty A_i) = \\sum_{i = 1}^\\infty \\mathbb{P}(A_i).\n", "\\end{equation}\n", "\n", "
" ] }, { "cell_type": "markdown", "id": "e624313d", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", "\n", "**Probability Space**\n", "
\n", "The triple $(\\Omega, \\mathcal{F}, \\mathbb{P})$ comprising a set $\\Omega$, a $\\sigma$-algebra $\\mathcal{F}$ of subsets of $\\Omega$ and a probability measure $\\mathbb{P}$ on $(\\Omega, \\mathcal{F})$ is called a *probability space*.\n", "\n", "
" ] }, { "cell_type": "markdown", "id": "cdaa047d", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
\n", "\n", "**Independent Sets**\n", "
\n", "The events $A$ and $B$ are *independent* if\n", "\n", "\\begin{equation}\n", "\\mathbb{P}(A \\cap B) = \\mathbb{P}(A) \\mathbb{P}(B).\n", "\\end{equation}\n", "\n", "
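As a small illustration of a probability measure and of independent events, the sketch below assumes a fair six-sided die with the uniform measure on its outcomes. The events `A`, `B` and `C` and the helper `prob` are hypothetical names chosen for this example.

```python
from fractions import Fraction

# Sample space of a fair die
omega = frozenset(range(1, 7))


def prob(event):
    # Uniform probability measure on the finite sample space
    return Fraction(len(frozenset(event) & omega), len(omega))


A = {2, 4, 6}   # the outcome is even
B = {1, 2}      # the outcome is at most 2
C = {1, 2, 3}   # the outcome is at most 3

# Additivity on disjoint events
print(prob({1, 2} | {5}) == prob({1, 2}) + prob({5}))
# A and B satisfy the product rule, A and C do not
print(prob(A & B) == prob(A) * prob(B))
print(prob(A & C) == prob(A) * prob(C))
```

Exact fractions are used so that the product rule is tested without rounding error.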
" ] }, { "cell_type": "markdown", "id": "886d8173", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Random Variables" ] }, { "cell_type": "markdown", "id": "520f9972", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "\n", "
\n", "\n", "**Measurable Space**\n", "
\n", "A sample space $\\Omega$ equipped with a $\\sigma$-algebra of subsets $\\mathcal{F}$ is called a *measurable space*.\n", "\n", "
" ] }, { "cell_type": "markdown", "id": "10529cf3", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", " \n", "\n", " \n", "
\n", "This graph shows how a random variable is a function\n", "from all possible outcomes to real values.\n", "
\n", "It also shows how a random variable is used\n", "to define probability mass functions.\n", "
\n", "By Niyumard - Own work, CC BY-SA 4.0\n", "
\n", "
\n", "
\n", "\n", "**Random Variable and Random Vector**\n", "
\n", "Let $(\\Omega, \\mathcal{F})$ and $(\\mathbb{R}^p, \\mathcal{B}(\\mathbb{R}^p))$ be two measurable spaces.\n", "A function $\\boldsymbol{X}: \\Omega \\to \\mathbb{R}^p$ is called a *measurable function*, or *random variable* ($p = 1$) or *random vector* ($p > 1$), if the set\n", "\n", "$\\{\\omega \\in \\Omega: X_1(\\omega) \\le x_1, \\ldots, X_p(\\omega) \\le x_p\\}$ $=: \\{\\omega \\in \\Omega: \\boldsymbol{X}(\\omega) \\le \\boldsymbol{x}\\}$ $=: \\{\\boldsymbol{X} \\le \\boldsymbol{x}\\}$ \n", "\n", "belongs to $\\mathcal{F}$ for any $\\boldsymbol{x} \\in \\mathbb{R}^p$.\n", "
" ] }, { "cell_type": "markdown", "id": "16775dce", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "In other words, the preimage of any Borel set under $\\boldsymbol{X}$ is an event." ] }, { "cell_type": "markdown", "id": "38ebd4bd", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "
\n", " \n", "\n", " \n", "
\n", "Cumulative distribution function for the normal distribution.
\n", "By Inductiveload - self-made, Mathematica, Inkscape, Public Domain\n", "
\n", "
\n", "
\n", "\n", "**Distribution Function of a Random Variable**\n", "
\n", "Every random variable $X$ from a probability space $(\\Omega, \\mathcal{F}, \\mu)$ to $(\\mathbb{R}, \\mathcal{B}(\\mathbb{R}))$ induces a probability measure $\\mathbb{P}$ on $\\mathbb{R}$ that we identify with the *probability distribution function* $F_X: \\mathbb{R} \\to [0, 1]$ defined as\n", "\n", "$F_X(x) = \\mu(\\{\\omega \\in \\Omega: X(\\omega) \\le x\\})$ $=: \\mathbb{P}(X \\le x), \\ x \\in \\mathbb{R}$.\n", "\n", "In this case, $(\\mathbb{R}, \\mathcal{B}(\\mathbb{R}), F_X)$ becomes a probability space.\n", "\n", "
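In practice the distribution function is often estimated from a sample by the fraction of observations that do not exceed $x$. The sketch below assumes a simulated standard normal sample; the helper name `empirical_cdf` is hypothetical.

```python
import numpy as np


def empirical_cdf(sample, x):
    # Fraction of observations less than or equal to x
    sample = np.sort(np.asarray(sample, dtype=float))
    return np.searchsorted(sample, x, side='right') / sample.size


# Simulated sample standing in for observations of X
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=10_000)
print(empirical_cdf(sample, [-1.0, 0.0, 1.0]))
```

The estimate is a non-decreasing step function that goes from 0 to 1, as a distribution function should.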
" ] }, { "cell_type": "markdown", "id": "3fd57cf5", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "If $X$ is not a measurable function, there exists an $x$ in $\\mathbb{R}$ such that $\\{\\omega \\in \\Omega: X \\le x\\}$ is not an event.\n", "Then, $\\mathbb{P}(X \\le x) = \\mu(\\omega \\in \\Omega: X \\le x)$ is not defined and we cannot define the distribution of $X$.\n", "\n", "
\n", "This shows that it is the measurability of a random variable that makes it so special.\n", "
" ] }, { "cell_type": "markdown", "id": "5f84a471", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", " \n", "\n", " \n", "
\n", "Many sample observations (black) drawn from a joint probability distribution.
\n", "The marginal densities are shown as well.
\n", "By IkamusumeFan - Own work, CC BY-SA 3.0\n", "
\n", "
\n", "
\n", "\n", "**Joint Distribution Function**\n", "
\n", "Let $X$ and $Y$ be two random variables.\n", "We can then define their joint distribution function as\n", "\n", "\\begin{equation}\n", "F_{X, Y}(x, y) = \\mathbb{P}(X \\le x, Y \\le y)\n", "\\end{equation}\n", "\n", "We can view them as a *random vector*, i.e. a random variable from $\\Omega$ to $\\mathbb{R}^2$.\n", "\n", "
" ] }, { "cell_type": "markdown", "id": "969e1e73", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", "\n", "**Independent Random Variables**\n", "
\n", "Two random variables $X$ and $Y$ on $\\mathbb{R}$ are independent if the events $\\{\\omega \\in \\Omega: X(\\omega) \\le x\\}$ and $\\{\\omega \\in \\Omega: Y(\\omega) \\le y\\}$ are independent for all $x, y \\in \\mathbb{R}$.\n", "\n", "
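By the definition of independent events, this means that $\\mathbb{P}(X \\le x, Y \\le y) = \\mathbb{P}(X \\le x) \\mathbb{P}(Y \\le y)$ for all $x, y$. The sketch below checks this product rule empirically on simulated data; the samples and the helper `joint_vs_product` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x_sample = rng.normal(size=n)
y_indep = rng.normal(size=n)                  # drawn independently of x_sample
y_dep = x_sample + 0.5 * rng.normal(size=n)   # built from x_sample


def joint_vs_product(xs, ys, x, y):
    # Empirical P(X <= x, Y <= y) and empirical P(X <= x) P(Y <= y)
    joint = np.mean((xs <= x) & (ys <= y))
    product = np.mean(xs <= x) * np.mean(ys <= y)
    return joint, product


print(joint_vs_product(x_sample, y_indep, 0.0, 0.0))
print(joint_vs_product(x_sample, y_dep, 0.0, 0.0))
```

For the independent pair the two numbers agree up to sampling noise, while for the dependent pair they differ markedly.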
" ] }, { "cell_type": "markdown", "id": "088faaba", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "If $X$ and $Y$ are independent then $F_{X, Y}(x, y) = F_X(x)F_Y(y)$." ] }, { "cell_type": "markdown", "id": "2eac23d5", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "
\n", "\n", "**Distribution Function of a Random Vector**\n", "
\n", "Every random vector $\\boldsymbol{X}$ from a probability space $(\\Omega, \\mathcal{F}, \\mu)$ to $(\\mathbb{R}^p, \\mathcal{B}(\\mathbb{R}^p))$ induces a probability measure $\\mathbb{P}$ on $\\mathbb{R}^p$ that we identify with the *distribution function* $F_\\boldsymbol{X}: \\mathbb{R}^p \\to [0, 1]$ defined as\n", "\n", "\\begin{equation}\n", "F_\\boldsymbol{X}(\\boldsymbol{x}) = \\mathbb{P}(\\boldsymbol{X} \\le \\boldsymbol{x}) := \\mu(\\{\\omega \\in \\Omega: \\boldsymbol{X}(\\omega) \\le \\boldsymbol{x}\\}) \\ \\ \\ \\ \\boldsymbol{x} \\in \\mathbb{R}^p.\n", "\\end{equation}\n", "\n", "
" ] }, { "cell_type": "markdown", "id": "37417cf6", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", "\n", "**Expectation of Random Variables**\n", "
\n", "Let $\\boldsymbol{X}$ be a random vector from $(\\Omega, \\mathcal{F}, \\mu)$ to $(\\mathbb{R}^p, \\mathcal{B}(\\mathbb{R}^p))$.\n", "We define the *expectation* of $\\boldsymbol{X}$ by\n", "\n", "\\begin{equation}\n", "\\mathbb{E}(\\boldsymbol{X}) = \\int_{\\mathbb{R}^p} \\boldsymbol{x} dF_\\boldsymbol{X}(\\boldsymbol{x}).\n", "\\end{equation}\n", "\n", "More generally, let $f: \\mathbb{R}^p \\to \\mathbb{R}$ be measurable.\n", "Then\n", "\n", "\\begin{equation}\n", "\\mathbb{E}(f(\\boldsymbol{X})) = \\int_{\\mathbb{R}^p} f(\\boldsymbol{x}) dF_\\boldsymbol{X}(\\boldsymbol{x}).\n", "\\end{equation}\n", "\n", "
\n", "\n", "$dF_\\boldsymbol{X}(\\boldsymbol{x}) = \\mathbb{P}(d\\boldsymbol{x}) = \\mathbb{P}(dx_1, \\ldots, dx_p)$ and $\\int$ denotes the Lebesgue integral." ] }, { "cell_type": "markdown", "id": "e25ef16f", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", "When a function is continuous, its Lebesgue integral can be replaced by its Riemann-Stieltjes integral.\n", "
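To make the integral against $dF_X$ concrete, the sketch below approximates $\\mathbb{E}(f(X))$ for a standard normal $X$ and $f(x) = x^2$ in two ways: a Stieltjes-type sum of $f$ against increments of $F_X$, and a Monte Carlo average. The choice of distribution, of $f$ and of the grid are assumptions made for illustration; the exact value in this case is 1.

```python
import math

import numpy as np


def normal_cdf(x):
    # Distribution function of the standard normal, via the error function
    return np.array([0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x])


def f(x):
    return x ** 2


# Stieltjes-type sum: f at cell midpoints times the increments of F
grid = np.linspace(-8.0, 8.0, 4001)
mid = 0.5 * (grid[1:] + grid[:-1])
stieltjes_sum = np.sum(f(mid) * np.diff(normal_cdf(grid)))

# Monte Carlo average of f over simulated draws of X
rng = np.random.default_rng(2)
monte_carlo = np.mean(f(rng.normal(size=200_000)))

print(stieltjes_sum, monte_carlo)
```

The first approach needs the distribution function, the second only needs a way to sample from it.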
" ] }, { "cell_type": "markdown", "id": "d8cfc449", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", "\n", "**$L^p$ spaces**\n", "
\n", "By $L^p(\\Omega, \\mathcal{F}, \\mu)$ we mean the Banach space of measurable functions on $\\Omega$ with finite norm\n", "\n", "\\begin{equation}\n", "\\|X\\|_{L^p} = \\left(\\mathbb{E}(|X|^p)\\right)^{1/p}.\n", "\\end{equation}\n", "\n", "
\n", "\n", "In particular, we say that $X$ is integrable if $\\|X\\|_{L^1} < \\infty$ and that $X$ has finite variance if $\\|X\\|_{L^2} < \\infty$." ] }, { "cell_type": "markdown", "id": "7ea85b12", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Variance, Covariance and Correlation of Two Random Variables" ] }, { "cell_type": "markdown", "id": "b72b0f05", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", "\n", "**Variance of a Random Variable**\n", "
\n", "Provided that it exists, we define the *variance* of a random variable $X$ as\n", "\n", "\\begin{equation}\n", "\\mathrm{Var}(X) := \\mathbb{E}\\left[(X - \\mathbb{E}(X))^2\\right]\n", "= \\mathbb{E}(X^2) - \\mathbb{E}(X)^2.\n", "\\end{equation}\n", "\n", "
" ] }, { "cell_type": "markdown", "id": "9fc3bba7", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", " \n", "\n", " \n", "
\n", "Several sets of (X, Y) points, with the corresponding Pearson
\n", "correlation coefficient. The correlation reflects the noisiness and
\n", "direction of a linear relationship (top row), but not the slope of that
\n", "relationship (middle), nor many aspects of nonlinear relationships (bottom).
\n", "N.B.: the figure in the center has a slope of 0 but in that case the
\n", "correlation coefficient is undefined because the variance of Y is zero.
\n", " DenisBoigelot, original uploader was Imagecreator\n", "
\n", "
\n", "
\n", "\n", "**Covariance, Variance and Correlation of Two Random Variables**\n", "
\n", "Provided that it exists, we define the *covariance* of two random variables $X$ and $Y$ as\n", "\n", "$\\mathrm{Cov}(X, Y) := \\mathbb{E}\\left[(X - \\mathbb{E}(X)) (Y - \\mathbb{E}(Y))\\right]$\n", "$= \\mathbb{E}(X Y) - \\mathbb{E}(X) \\mathbb{E}(Y).$\n", "\n", "The *correlation* of $X$ and $Y$ is\n", "\n", "$\\mathrm{Corr}(X, Y) := \\frac{\\mathrm{Cov}(X, Y)}{\\sqrt{\\mathrm{Var}(X)} \\sqrt{\\mathrm{Var}(Y)}}.$\n", "\n", "
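The two formulas above can be evaluated directly on a sample by replacing expectations with averages. The sketch below assumes simulated data with a linear relationship plus noise and compares the result with the corresponding NumPy routines.

```python
import numpy as np

# Simulated sample (assumption): y depends linearly on x, plus noise
rng = np.random.default_rng(3)
x = rng.normal(size=5_000)
y = 2.0 * x + rng.normal(size=5_000)

# Covariance and variances from E(XY) - E(X)E(Y), with sample averages
cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)
var_x = np.mean(x ** 2) - np.mean(x) ** 2
var_y = np.mean(y ** 2) - np.mean(y) ** 2
corr_xy = cov_xy / (np.sqrt(var_x) * np.sqrt(var_y))

print(cov_xy, np.cov(x, y, ddof=0)[0, 1])
print(corr_xy, np.corrcoef(x, y)[0, 1])
```

Here `ddof=0` makes `np.cov` divide by the sample size, which matches the averages used above; the correlation does not depend on that choice because the normalization cancels.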
" ] }, { "cell_type": "markdown", "id": "a5e0e636", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Conditional Expectation" ] }, { "cell_type": "markdown", "id": "9b4b0ea3", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
\n", "\n", "**Conditional Probability**\n", "
\n", "Let $A$ and $B$ be two events and suppose that $\\mathbb{P}(A) > 0$.\n", "The *conditional probability* of $B$ given $A$ is\n", "\n", "\\begin{equation}\n", "\\mathbb{P}(B | A) := \\frac{\\mathbb{P}(A \\cap B)}{\\mathbb{P}(A)}.\n", "\\end{equation}" ] }, { "cell_type": "markdown", "id": "febe7fc1", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", "\n", "For random variables $X$ and $Y$ and Borel sets $A$ and $B$ with $\\mathbb{P}(X \\in A) > 0$, the *conditional probability* of $\\{Y \\in B\\}$ knowing $\\{X \\in A\\}$ is\n", "\n", "\\begin{equation}\n", "\\mathbb{P}(Y \\in B | X \\in A) := \\frac{\\mathbb{P}(X \\in A, Y \\in B)}{\\mathbb{P}(X \\in A)}.\n", "\\end{equation}" ] }, { "cell_type": "markdown", "id": "e6b9a5bf", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", "\n", "**Conditional Expectation**\n", "
\n", "Assume that $X \\in L^1(\\Omega, \\mathcal{F}, \\mu)$ and let $\\mathcal{G}$ be a sub-$\\sigma$-algebra of $\\mathcal{F}$.\n", "The *conditional expectation* of $\\boldsymbol{X}$ with respect to $\\mathcal{G}$ is the function $\\mathbb{E}(\\boldsymbol{X} | \\mathcal{G}): (\\Omega, \\mathcal{G}) \\to \\mathbb{R}^p$, which is a random variable satisfying\n", "\n", "\\begin{equation}\n", "\\int_G \\mathbb{E}(\\boldsymbol{X} | \\mathcal{G}) d\\mathbb{P} = \\int_G \\boldsymbol{X} d\\mathbb{P} \\ \\ \\ \\ \\forall G \\in \\mathcal{G}.\n", "\\end{equation}\n", "\n", "
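When $\\mathcal{G}$ is generated by a variable taking finitely many values, the conditional expectation is constant on each level set and equal to the average of $X$ there. The sketch below assumes a simulated sample, a hypothetical discrete variable `y` with three values, and the empirical measure of the sample in place of $\\mathbb{P}$.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10_000
y = rng.integers(0, 3, size=n)    # discrete variable generating G
x = y + rng.normal(size=n)        # integrable variable to be conditioned

# E(X | G): the mean of x on each level set {y = k}
group_means = np.array([x[y == k].mean() for k in range(3)])
cond_exp = group_means[y]

# Defining property on each generating event G = {y = k}
for k in range(3):
    print(k, cond_exp[y == k].sum(), x[y == k].sum())

# Averaging the conditional expectation recovers the overall mean
print(cond_exp.mean(), x.mean())
```

Both checks hold up to rounding error because sums over each level set are preserved by construction.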
" ] }, { "cell_type": "markdown", "id": "62f5e568", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "It follows that (*law of total expectation*)\n", "\n", "\\begin{equation}\n", "\\mathbb{E}(\\mathbb{E}(\\boldsymbol{X} | \\mathcal{G})) = \\mathbb{E}(\\boldsymbol{X}).\n", "\\end{equation}" ] }, { "cell_type": "markdown", "id": "05fa276f", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", "\n", "**Conditional Distribution Function**\n", "
\n", "Given $\\mathcal{G}$ a sub-$\\sigma$-algebra of $\\mathcal{F}$, we define the *conditional distribution function*\n", "\n", "\\begin{equation}\n", "F_\\boldsymbol{X}(\\boldsymbol{x} | \\mathcal{G}) = \\mathbb{P}(\\boldsymbol{X} \\le \\boldsymbol{x} | \\mathcal{G}) \\ \\ \\ \\ \\forall \\boldsymbol{x} \\in \\mathbb{R}^p.\n", "\\end{equation}\n", "\n", "
" ] }, { "cell_type": "markdown", "id": "f5a948c0", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Assume that $f: \\mathbb{R}^p \\to \\mathbb{R}$ is such that $\\mathbb{E}(f(X)) < \\infty$.\n", "Then\n", "\n", "\\begin{equation}\n", "\\mathbb{E}(f(\\boldsymbol{X}) | \\mathcal{G}) = \\int_{\\mathbb{R}^p} f(\\boldsymbol{x}) dF_\\boldsymbol{X}(\\boldsymbol{x} | \\mathcal{G}).\n", "\\end{equation}" ] }, { "cell_type": "markdown", "id": "b4bcc905", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", "\n", "**Conditional Expectation with respect to a Random Vector**\n", "
\n", "The *conditional expectation of $\\boldsymbol{X}$ given $\\boldsymbol{Y}$* is defined by\n", "\n", "\\begin{equation}\n", "\\mathbb{E}(\\boldsymbol{X} | \\boldsymbol{Y}) := \\mathbb{E}(\\boldsymbol{X} | \\sigma(\\boldsymbol{Y})),\n", "\\end{equation}\n", "\n", "where $\\sigma(\\boldsymbol{Y}) := \\{\\boldsymbol{Y}^{-1}(B): B \\in \\mathcal{B}(\\mathbb{R}^p)\\}$ is the *$\\sigma$-algebra generated by $\\boldsymbol{Y}$*.\n", "
" ] }, { "cell_type": "markdown", "id": "6906645f", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Conditional Variance and Least Squares" ] }, { "cell_type": "markdown", "id": "1dca6a19", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
\n", "\n", "**Conditional Variance**\n", "
\n", "Suppose that $Y$ is a random variable and that $\\mathcal{G}$ is a sub-$\\sigma$-algebra of $\\mathcal{F}$.\n", "Then, the random variable\n", "\n", "\\begin{equation}\n", "\\mathrm{Var}(Y | \\mathcal{G}) := \\mathbb{E}[(Y - \\mathbb{E}(Y | \\mathcal{G}))^2 | \\mathcal{G}]\n", "\\end{equation}\n", "\n", "is called the *conditional variance* of $Y$ knowing $\\mathcal{G}$.\n", "\n", "
" ] }, { "cell_type": "markdown", "id": "9cf7576e", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "It tells us how much variance is left if we use $\\mathbb{E}(Y | \\mathcal{G})$ to predict $Y$." ] }, { "cell_type": "markdown", "id": "fd38b071", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", "\n", "**The Conditional Expectation Minimizes the Squared Deviations**\n", "
\n", "Let $X$ and $Y$ be random variables with finite variance, let $g$ be a real-valued function such that $\\mathbb{E}[g(X)^2] < \\infty$.\n", "Then (Theorem 10.1.4 in Gut 2005)\n", "\n", "\\begin{align}\n", "\\mathbb{E}[(Y - g(X))^2]\n", "&= \\mathbb{E}[\\mathrm{Var}(Y | X)] + \\mathbb{E}[(\\mathbb{E}(Y | X) - g(X))^2]\\\\\n", "&\\ge \\mathbb{E}[\\mathrm{Var}(Y | X)],\n", "\\end{align}\n", "\n", "where equality is obtained for $g(X) = \\mathbb{E}(Y | X)$.\n", "
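The decomposition can be checked numerically when $X$ is discrete, since conditional means and variances are then plain group statistics. The sketch below assumes a simulated sample, a hypothetical linear predictor `g`, and the empirical measure of the sample, for which the identity holds exactly.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50_000
x = rng.integers(0, 4, size=n)
y = x ** 2 + rng.normal(size=n)

# Conditional mean and variance of y given x, evaluated at each sample point
levels = np.arange(4)
cond_mean = np.array([y[x == k].mean() for k in levels])[x]
cond_var = np.array([y[x == k].var() for k in levels])[x]


def g(x):
    # Hypothetical competing predictor: a straight line
    return 3.0 * x - 1.0


lhs = np.mean((y - g(x)) ** 2)
rhs = np.mean(cond_var) + np.mean((cond_mean - g(x)) ** 2)
best = np.mean((y - cond_mean) ** 2)

print(lhs, rhs)   # both sides of the decomposition
print(best)       # error of the conditional mean, the lower bound
```

Any other choice of `g` can only increase the second term of the decomposition, which vanishes for the conditional mean.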
" ] }, { "cell_type": "markdown", "id": "869c64eb", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Thus, the expected conditional variance of $Y$ given $X$ shows up as the irreducible error of predicting $Y$ given only the knowledge of $X$." ] }, { "cell_type": "markdown", "id": "8356831f", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Absolutely Continuous Distributions and Densities" ] }, { "cell_type": "markdown", "id": "2a977ca4", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
\n", "\n", "**Absolutely Continuous Distribution**\n", "
\n", "A distribution function $F$ is *absolutely continuous* with respect to the Lebesgue measure (denoted $dx$) if and only if there exists a non-negative, Lebesgue integrable function $f$, such that\n", "\n", "\\begin{equation}\n", "F(b) - F(a) = \\int_a^b f(x) dx \\ \\ \\ \\ \\forall a < b.\n", "\\end{equation}\n", "\n", "The function $f$ is called the *density* of $F$ and is denoted by $\\frac{dF}{dx}$.\n", "\n", "
" ] }, { "cell_type": "markdown", "id": "d1b8a915", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Equivalently, $F$ is absolutely continuous if and only if, for every measurable set $A$, $dx(A) = 0$ implies $\\mathbb{P}(X \\in A) = 0$." ] }, { "cell_type": "markdown", "id": "997cf40b", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", "\n", "**Marginal Density**\n", "
\n", "From an absolutely continuous random vector $(X, Y)$ with density $f_{X, Y}$, we can derive the density of $X$, or *marginal density* by integrating over $Y$:\n", "\n", "\\begin{equation}\n", "f_X(x) = \\int_{-\\infty}^\\infty f_{X, Y}(x, y) dy.\n", "\\end{equation}\n", "\n", "
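As a numerical check of this formula, the sketch below integrates a bivariate normal density over $y$ on a grid and compares the result with the standard normal density, which is the known marginal in that case. The correlation value, the grids and the helper `joint_density` are assumptions made for illustration.

```python
import numpy as np

rho = 0.6   # assumed correlation of the bivariate normal
x = np.linspace(-3.0, 3.0, 61)
y = np.linspace(-10.0, 10.0, 4001)
dy = y[1] - y[0]


def joint_density(x, y):
    # Density of a bivariate normal with zero means and unit variances
    norm = 2.0 * np.pi * np.sqrt(1.0 - rho ** 2)
    quad = (x ** 2 - 2.0 * rho * x * y + y ** 2) / (1.0 - rho ** 2)
    return np.exp(-0.5 * quad) / norm


# Riemann sum over y for every value of x
f_xy = joint_density(x[:, None], y[None, :])
marginal = f_xy.sum(axis=1) * dy

exact = np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)
print(np.max(np.abs(marginal - exact)))
```

The integration range in $y$ is taken wide enough for the tails of the density to be negligible.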
" ] }, { "cell_type": "markdown", "id": "534bfcb8", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "If $X$ is an absolutely continuous random variable, with density $f_X$, $g$ is a measurable function, and $\\mathbb{E}(|g(X)|) < \\infty$, then\n", "\n", "\\begin{equation}\n", "\\mathbb{E}(g(X)) = \\int_{-\\infty}^\\infty g(x) f_X(x) dx.\n", "\\end{equation}" ] }, { "cell_type": "markdown", "id": "a2fa9ed6", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "If $X$ and $Y$ are absolutely continuous, then $X$ and $Y$ are independent if and only if the joint density is equal to the product of the marginal ones, that is\n", "\n", "\\begin{equation}\n", "f_{X, Y}(x, y) = f_X(x) f_Y(y) \\ \\ \\ \\ \\forall x, y \\in \\mathbb{R}.\n", "\\end{equation}" ] }, { "cell_type": "markdown", "id": "66b0b7d2", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", "\n", "**Conditional Density**\n", "
\n", "Let $X$ and $Y$ have a joint absolutely continuous distribution.\n", "For $f_X(x) > 0$, the *conditional density* of $Y$ given that $X = x$ equals\n", "\n", "\\begin{equation}\n", "f_{Y | X = x}(y) = \\frac{f_{X, Y}(x, y)}{f_X(x)}\n", "\\end{equation}\n", "\n", "
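The conditional density can be computed the same way on a grid. The sketch below assumes a bivariate normal with zero means, unit variances and correlation `rho`, conditions on a hypothetical value `x0`, and checks that the result integrates to one and has mean `rho * x0`, the known conditional mean for this distribution.

```python
import numpy as np

rho, x0 = 0.6, 1.0   # assumed correlation and conditioning value
y = np.linspace(-10.0, 10.0, 4001)
dy = y[1] - y[0]

# Joint density evaluated along the line x = x0
norm = 2.0 * np.pi * np.sqrt(1.0 - rho ** 2)
quad = (x0 ** 2 - 2.0 * rho * x0 * y + y ** 2) / (1.0 - rho ** 2)
joint = np.exp(-0.5 * quad) / norm

# Marginal density at x0, then the conditional density of Y given X = x0
f_x0 = joint.sum() * dy
cond_density = joint / f_x0

# Conditional distribution function and conditional mean on the grid
cond_cdf = np.cumsum(cond_density) * dy
cond_mean = np.sum(y * cond_density) * dy

print(cond_cdf[-1], cond_mean)
```

Dividing by the marginal is exactly what normalizes the slice of the joint density into a density in $y$.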
" ] }, { "cell_type": "markdown", "id": "56566f41", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Then the conditional distribution of $Y$ given that $X = x$ is derived by\n", "\n", "\\begin{equation}\n", "F_{Y | X = x}(y) = \\int_{-\\infty}^y f_{Y | X = x}(z) dz.\n", "\\end{equation}" ] }, { "cell_type": "markdown", "id": "f6369cc4", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "If $X$ and $Y$ are independent then the conditional and the unconditional distributions are the same." ] }, { "cell_type": "markdown", "id": "b9bc5550", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Sample Estimates" ] }, { "cell_type": "markdown", "id": "833af93e", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Let $(x_i, y_i), i = 1, \\ldots, N$ be a sample drawn from the joint distribution of the random variables $X$ and $Y$.\n", "Then we have the following unbiased estimates:\n", "\n", "- Sample mean: $\\bar{x} = \\sum_{i = 1}^N x_i$\n", "- Sample variance: $s_X^2 = \\frac{1}{N - 1} \\sum_{i = 1}^N \\left(x_i - \\bar{x}\\right)^2$\n", "- Sample covariance: $q_{X,Y} = \\frac{1}{N - 1} \\sum_{i = 1}^N \\left(x_i - \\bar{x}\\right) \\left(y_i - \\bar{y}\\right)$" ] }, { "cell_type": "markdown", "id": "f79b6c8f", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Confidence Intervals" ] }, { "cell_type": "markdown", "id": "4ab0f7fd", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Example: Regional surface temperature in France" ] }, { "cell_type": "code", "execution_count": 175, "id": "98874095", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# Import modules\n", "from pathlib import Path\n", "import numpy as np\n", "import pandas as pd\n", "import holoviews as hv\n", "import hvplot.pandas\n", "import panel as pn\n", "pn.extension()\n", "\n", "# Set data directory\n", "data_dir = Path('data')\n", 
"\n", "# Set keyword arguments for pd.read_csv\n", "kwargs_read_csv = dict(header=0, index_col=0, parse_dates=True)\n", "\n", "# Set first and last years\n", "FIRST_YEAR = 2014\n", "LAST_YEAR = 2021\n", "\n", "# Define file path\n", "filename = 'surface_temperature_merra2_{}-{}.csv'.format(\n", "    FIRST_YEAR, LAST_YEAR)\n", "filepath = Path(data_dir, filename)\n", "\n", "# Read hourly temperature data averaged over each region\n", "# and resample it to daily means\n", "df_temp = pd.read_csv(filepath, **kwargs_read_csv).resample('D').mean()\n", "temp_lim = [-5, 30]\n", "label_temp = 'Temperature (°C)'" ] }, { "cell_type": "code", "execution_count": 176, "id": "796c89fd", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "WIDTH = 260\n", "\n", "# Plot the time series, empirical CDF and kernel density estimate\n", "# of the daily temperature for a given region and year\n", "def plot_temp(region_name, year):\n", "    df = df_temp[[region_name]].loc[str(year)].copy()\n", "    df.columns = [label_temp]\n", "    nt = df.shape[0]\n", "    std = df[label_temp].std()\n", "    mean = pd.Series(df[label_temp].mean(), index=df.index)\n", "    df_std = pd.DataFrame(\n", "        {'low': mean - std, 'high': mean + std}, index=df.index)\n", "    # Empirical CDF: sorted temperatures against the ranks i / nt\n", "    cdf = pd.DataFrame(index=df.sort_values(by=label_temp).values[:, 0],\n", "                       data=(np.arange(nt)[:, None] + 1) / nt)\n", "    cdf.index.name = label_temp\n", "    cdf.columns = ['Probability']\n", "    pts = df.hvplot(ylim=temp_lim, title='', width=WIDTH).opts(\n", "        title='Time series, Mean, ± 1 STD') * hv.HLine(\n", "            df[label_temp].mean()) * df_std.hvplot.area(\n", "                y='low', y2='high', alpha=0.2)\n", "    pcdf = cdf.hvplot(xlim=temp_lim, ylim=[0, 1], title='', width=WIDTH).opts(\n", "        title='Cumulative Distrib. Func.') * hv.VLine(\n", "            df[label_temp].mean())\n", "    pkde = df.hvplot.kde(xlim=temp_lim,\n", "                         width=WIDTH) * hv.VLine(\n", "        df[label_temp].mean()).opts(title='Probability Density Func.')\n", "\n", "    return pn.Row(pts, pcdf, pkde)" ] }, { "cell_type": "code", "execution_count": 177, "id": "de18ea7f", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": {}, "metadata": {}, "output_type": "display_data" }, { "data": {}, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.holoviews_exec.v0+json": "", "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "
\n", "
\n", "" ], "text/plain": [ "Column\n", " [0] Column\n", " [0] Select(name='region_name', options=['Grand Est', ...], value='Grand Est')\n", " [1] DiscreteSlider(formatter='%d', name='year', options=[2014, 2015, 2016, ...], value=2014)\n", " [1] Row\n", " [0] Row\n", " [0] HoloViews(Overlay)\n", " [1] HoloViews(Overlay)\n", " [2] HoloViews(Overlay)" ] }, "execution_count": 177, "metadata": { "application/vnd.holoviews_exec.v0+json": { "id": "34261" } }, "output_type": "execute_result" } ], "source": [ "# Show\n", "pn.interact(plot_temp, region_name=df_temp.columns,\n", " year=range(FIRST_YEAR, LAST_YEAR))" ] }, { "cell_type": "markdown", "id": "e7f4df1a", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## References\n", "\n", "- [Gut, A., 2005. *Probability: A Graduate Course*. Springer, New York.](https://doi.org/10.1007/978-1-4614-4708-5)\n", "- [Probability Topics in Statistics and Data Science. Jupyter Book.](http://theoryandpractice.org/stats-ds-book/probability-topics.html)" ] }, { "cell_type": "markdown", "id": "30e5a177", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "***\n", "## Credit\n", "\n", "[//]: # \"This notebook is part of [E4C Interdisciplinary Center - Education](https://gitlab.in2p3.fr/energy4climate/public/education).\"\n", "Contributors include Bruno Deremble and Alexis Tantet.\n", "\n", "
\n", "\n", "
\n", " \n", "\"Logo\n", "\n", "\"Logo\n", "\n", "\"Logo\n", "\n", "\"Logo\n", "\n", "\"Logo\n", "\n", "\"Logo\n", "\n", "\"Logo\n", " \n", "
\n", "\n", "
\n", "\n", "
\n", " \"Creative\n", "
This work is licensed under a   Creative Commons Attribution-ShareAlike 4.0 International License.\n", "
" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": true, "autocomplete": false, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 5 }