{ "cells": [ { "cell_type": "markdown", "id": "1352898a-c6e9-4745-b568-0ab32819be96", "metadata": {}, "source": [ "# Linear regression" ] }, { "cell_type": "markdown", "id": "bfdba51f-b14a-44ee-8be5-8fae70ee79be", "metadata": {}, "source": [ "Suppose we conduct an experiment where we observe $n$ data pairs and\n", "call them $(x_i, y_i)$, $i = 1, \\dots, n$. We want to describe the\n", "underlying relationship between $y_i$ and $x_i$, including the\n", "measurement error $\\varepsilon_i$, by the following relation:\n", "\\begin{equation}\n", " \\label{eq:1}\n", " y_i = \\alpha + \\beta x_i + \\varepsilon_i .\n", "\\end{equation}\n", "This relationship between the true (but unobserved)\n", "parameters $\\alpha$ and $\\beta$ and the data points is called a *linear\n", "regression model*." ] }, { "cell_type": "markdown", "id": "08b1f8a1-fdfa-4778-92a2-c2cd17d54386", "metadata": {}, "source": [ "Our goal is to find estimated values, $\\widehat{\\alpha}$ and\n", "$\\widehat{\\beta}$, for the parameters $\\alpha$ and $\\beta$ that\n", "provide the \"best\" fit in some sense for the data points\n", "$(x_i, y_i)$. We choose the best fit in the least-squares sense: the\n", "best-fit line minimizes the sum of the squared residuals\n", "$\\widehat{\\varepsilon}_i$, which are the differences between measured\n", "and predicted values of the dependent variable $y$:\n", "\\begin{equation}\n", " \\widehat{\\varepsilon}_i = y_i - \\widehat{\\alpha} -\n", " \\widehat{\\beta} x_i. 
\n", "\\end{equation}" ] }, { "cell_type": "markdown", "id": "342255ff-0b08-478c-8b71-9e229aa038a9", "metadata": {}, "source": [ "That is, we are looking for the values $\\widehat{\\alpha}$ and\n", "$\\widehat{\\beta}$ that are the solutions of the following minimization problem:\n", "find\n", "\\begin{equation}\n", " \\min_{\\widehat{\\alpha}, \\,\\widehat{\\beta}}\n", " Q(\\widehat{\\alpha}, \\widehat{\\beta}),\n", "\\end{equation}\n", "where\n", "\\begin{equation}\n", " Q(\\widehat{\\alpha},\\widehat{\\beta}) = \\sum_{i=1}^n \\widehat{\\varepsilon}_i^2=\n", " \\sum_{i=1}^n \\left(y_i - \\widehat{\\alpha} - \\widehat{\\beta} x_i\\right)^2 . \n", "\\end{equation}" ] }, { "cell_type": "markdown", "id": "50bb13be-cff4-4dc1-8fc8-8a55464c8717", "metadata": {}, "source": [ "To find the minimum, we take the partial derivatives of $Q$ with respect to\n", "$\\widehat{\\alpha}$ and $\\widehat{\\beta}$ and set them equal to\n", "zero." ] }, { "cell_type": "markdown", "id": "ac06e451-5f8a-4382-911e-04bf2ef5d99d", "metadata": {}, "source": [ "Starting with $\\widehat{\\alpha}$:\n", "\\begin{equation}\n", " \\frac{\\partial}{\\partial \\widehat{\\alpha}}\n", " Q(\\widehat{\\alpha }, \\widehat{\\beta}) = \n", " -2 \\sum_{i=1}^n \\left(\n", " y_i - \\widehat{\\alpha} - \\widehat{\\beta} x_i\n", " \\right) = 0 ,\n", "\\end{equation}\n", "or\n", "\\begin{equation}\n", " \\sum_{i=1}^n \\left(y_i - \\widehat{\\alpha} - \\widehat{\\beta} x_i\n", " \\right) = 0 .\n", "\\end{equation}\n", "Rearranging the terms and\n", "introducing the notation $\\bar{x}$ and $\\bar{y}$ for the average\n", "values of the $x_i$ and $y_i$, respectively:\n", "\\begin{equation}\n", " \\bar{x} \\equiv \\frac{1}{n} \\sum_{i=1}^n x_i, \\quad\n", " \\bar{y} \\equiv \\frac{1}{n} \\sum_{i=1}^n y_i ,\n", "\\end{equation}\n", "we obtain:\n", "\\begin{equation}\n", " \\widehat{\\alpha} = \\bar{y} - \\widehat{\\beta} \\bar{x} , \n", "\\end{equation}\n", "or\n", "\\begin{equation}\n", " \\bar{y} = \\widehat{\\alpha} + \\widehat{\\beta} \\bar{x} .\n", 
"\\end{equation}\n", "This relation can be interpreted as follows: the best\n", "fit line passes through the ``center of mass'' of the data points.\n" ] }, { "cell_type": "markdown", "id": "4c88a7d0-fdfe-495e-9d2f-bae99fc60563", "metadata": {}, "source": [ "Now, take the derivative with respect to $\\widehat{\\beta}$:\n", "\\begin{equation}\n", " \\frac{\\partial}{\\partial \\widehat{\\beta}}\n", " Q(\\widehat{\\alpha}, \\widehat{\\beta}) =\n", " -2 \\sum_{i=1}^n \\left(\n", " \\left( y_i - \\bar{y} \\right) -\n", " \\widehat{\\beta} \\left( x_i - \\bar{x} \\right)\n", " \\right) \\left( x_i - \\bar{x} \\right) = 0 .\n", "\\end{equation}\n", "Rearranging the terms,\n", "\\begin{equation}\n", " \\sum_{i=1}^n \\left( y_i - \\bar{y} \\right)\n", " \\left(x_{i}- \\bar{x} \\right) -\n", " \\widehat{\\beta}\\sum_{i=1}^n \\left(x_i -\\bar{x} \\right)^{2} = 0,\n", "\\end{equation}\n", "or\n", "\\begin{equation}\n", " \\widehat{\\beta} = \\frac{\\displaystyle\n", " \\sum_{i=1}^n \\left( y_i - \\bar{y} \\right)\n", " \\left( x_i -\\bar{x} \\right)\n", " }{\n", " \\displaystyle \\sum_{i=1}^n \\left( x_i - \\bar{x} \\right)^{2}\n", " } .\n", "\\end{equation}\n", "We can now\n", "determine $\\widehat{\\alpha}$.\n", "\\begin{equation}\n", " \\widehat{\\alpha} = \\bar{y} - \\widehat{\\beta} \\bar{x} .\n", "\\end{equation}\n", "\n", "These relations solve the problem of finding the least squares fit to the data." ] }, { "cell_type": "markdown", "id": "424c8e2c-bad5-43e3-95a0-be014cdc3f6e", "metadata": {}, "source": [ "## Numerical example" ] }, { "cell_type": "code", "execution_count": null, "id": "395e1fcc-52f5-4bff-a85d-8919d79cd23f", "metadata": {}, "outputs": [], "source": [ "\n", "\"\"\"\n", " alpha, beta, sigma = linear_regression(x, y)\n", "\n", "Least square fit y = alpha + beta x. 
sigma is the standard error of the slope beta.\n", "\"\"\"\n", "function linear_regression(x, y)\n", " np = length(x) # number of data points\n", " xbar = sum(x)/np # sample means\n", " ybar = sum(y)/np\n", " x2 = sum((x .- xbar) .^ 2) # sum of squared deviations of x\n", " beta = sum((y .- ybar) .* (x .- xbar))/x2\n", " alpha = ybar - beta*xbar\n", " # residual variance with np - 2 degrees of freedom, divided by x2\n", " sigma = sqrt(sum((y .- alpha .- beta .* x) .^ 2)/((np - 2)*x2))\n", " return alpha, beta, sigma\n", "end" ] }, { "cell_type": "code", "execution_count": null, "id": "eacdde5f-76e9-497e-9263-c8f8d7735053", "metadata": {}, "outputs": [], "source": [ "using PyPlot" ] }, { "cell_type": "markdown", "id": "52040d06-d8cc-4c00-b41e-a4b011a7d376", "metadata": {}, "source": [ "Generate synthetic data: `np` equally spaced points on the line $y = 2x$, perturbed by Gaussian noise with standard deviation `sc`." ] }, { "cell_type": "code", "execution_count": null, "id": "1d28105d-882a-4071-9d5b-388cae054963", "metadata": {}, "outputs": [], "source": [ "\n", "xmin = 0.0\n", "xmax = 10.0\n", "np = 200\n", "sc = 1.5\n", "x = range(xmin, xmax, np)\n", "y = 2 .* x .+ sc .* randn(np);" ] }, { "cell_type": "code", "execution_count": null, "id": "808d66a6-e490-418b-a2d6-64cd3cec245e", "metadata": {}, "outputs": [], "source": [ "\n", "alpha, beta, sigma = linear_regression(x, y)" ] }, { "cell_type": "code", "execution_count": null, "id": "f94dec35-0ae6-4f2d-9750-b557b9a885d3", "metadata": {}, "outputs": [], "source": [ "\n", "plot(x, y, linestyle=\"none\", marker=\".\", label=\"noisy data\")\n", "plot(x, alpha .+ beta .* x, linestyle=\"solid\", label=\"linear lsq fit\")\n", "\n", "grid(true)\n", "xlabel(\"x\")\n", "ylabel(\"y\")\n", "title(\"Linear regression\")\n", "legend();" ] }, { "cell_type": "code", "execution_count": null, "id": "364bac66-5d32-4aee-b9fd-34db5c601ff6", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Julia 1.10.5", "language": "julia", "name": "julia-1.10" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.10.5" } }, "nbformat": 4, "nbformat_minor": 5 }