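Linear regression

In statistics, linear regression is a form of regression analysis that models the relationship between one or more independent variables and a dependent variable using a least-squares function called the linear regression equation. This function is a linear combination of one or more model parameters called regression coefficients. The case with a single independent variable is called simple regression; the case with more than one is called multiple regression (multivariable linear regression).^[1]

In linear regression, the data are modeled using linear predictor functions, and the unknown model parameters are estimated from the data. Such models are called linear models.^[2] Most commonly, the conditional mean of y given the value of X is assumed to be an affine function of X. Less commonly, the median or some other quantile of the conditional distribution of y given X is expressed as a linear function of X. Like all forms of regression analysis, linear regression focuses on the conditional probability distribution of y given X, rather than on the joint probability distribution of X and y, which is the domain of multivariate analysis.

Linear regression was the first type of regression analysis to be studied rigorously and to be used extensively in practical applications.^[3] This is because models that depend linearly on their unknown parameters are easier to fit than models that depend nonlinearly on their parameters, and because the statistical properties of the resulting estimators are easier to determine.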

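Linear regression has many practical uses. Most of them fall into one of the following two broad categories:

If the goal is prediction or mapping, linear regression can be used to fit a predictive model to an observed data set of y and X values. Once such a model has been fitted, it can be used to predict a value of y for a new value of X that arrives without its accompanying y.
Given a variable y and a number of variables $X_{1}$ ,..., $X_{p}$ that may be related to y, linear regression analysis can be used to quantify the strength of the relationship between y and each $X_{j}$ , to assess which $X_{j}$ are unrelated to y, and to identify which subsets of the $X_{j}$ contain redundant information about y.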

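Linear regression models are often fitted by the least-squares approach, but they may also be fitted in other ways, for example by minimizing the lack of fit in some other norm (as in least absolute deviations regression), or by minimizing a penalized version of the least-squares loss function, as in ridge regression. Conversely, the least-squares approach can be used to fit models that are not linear. Thus, although "least squares" and "linear model" are closely linked, they are not synonymous.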

The "regression" in linear regression refers to regression toward the mean.

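Introduction

Theoretical model

Given a random sample $(Y_{i},X_{i1},\ldots ,X_{ip}),\,i=1,\ldots ,n$ , a linear regression model allows the relationship between the regressand $Y_{i}$ and the regressors $X_{i1},\ldots ,X_{ip}$ to be imperfect: variables other than the X's also influence $Y_{i}$ . An error term $\varepsilon _{i}$ (itself a random variable) is added to capture every influence on $Y_{i}$ other than $X_{i1},\ldots ,X_{ip}$ . A multiple linear regression model therefore takes the form: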

Y_{i}=\beta _{0}+\beta _{1}X_{i1}+\beta _{2}X_{i2}+\ldots +\beta _{p}X_{ip}+\varepsilon _{i},\qquad i=1,\ldots ,n

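Models of any other form may be classed as nonlinear. A linear regression model need not be a linear function of the independent variables. Linear here means that the conditional mean of $Y_{i}$ is linear in the parameters $\beta$ . For example, the model $Y_{i}=\beta _{1}X_{i}+\beta _{2}X_{i}^{2}+\varepsilon _{i}$ is linear in $\beta _{1}$ and $\beta _{2}$ but nonlinear in $X_{i}$ , because the term $X_{i}^{2}$ is a nonlinear function of $X_{i}$ .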
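As a minimal sketch of this point, the quadratic model above can be fitted with ordinary linear least squares, because the unknowns $\beta _{1}$ and $\beta _{2}$ enter linearly. The function name `fit_quadratic_through_origin` and the noise-free sample data are illustrative assumptions, not part of the original text; the 2×2 normal equations are solved by Cramer's rule.

```python
def fit_quadratic_through_origin(xs, ys):
    """Least-squares fit of y = b1*x + b2*x**2 (no intercept term)."""
    # Sums that make up the 2x2 normal equations.
    s_x2 = sum(x ** 2 for x in xs)
    s_x3 = sum(x ** 3 for x in xs)
    s_x4 = sum(x ** 4 for x in xs)
    s_xy = sum(x * y for x, y in zip(xs, ys))
    s_x2y = sum(x ** 2 * y for x, y in zip(xs, ys))

    det = s_x2 * s_x4 - s_x3 ** 2
    if det == 0:
        raise ValueError("the columns x and x**2 are linearly dependent")

    # Cramer's rule for [[s_x2, s_x3], [s_x3, s_x4]] @ [b1, b2] = [s_xy, s_x2y].
    b1 = (s_xy * s_x4 - s_x3 * s_x2y) / det
    b2 = (s_x2 * s_x2y - s_x3 * s_xy) / det
    return b1, b2


# Hypothetical noise-free data generated from b1 = 2.0, b2 = 0.5.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 0.5 * x ** 2 for x in xs]
b1, b2 = fit_quadratic_through_origin(xs, ys)
print(b1, b2)
```

The fitted curve is a parabola in x, yet the estimation step is the same linear algebra as for a straight line.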

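Data and estimation

It is important to distinguish random variables from the observed values of those variables. Typically, the observations, or data (denoted by lowercase letters), consist of n values $(y_{i},x_{i1},\ldots ,x_{ip}),\,i=1,\ldots ,n$ .

There are $p+1$ parameters $\beta _{0},\ldots ,\beta _{p}$ to be determined. To estimate them, it is convenient to use matrix notation: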

Y=X\beta +\varepsilon \,

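where Y is a column vector containing the observations $Y_{1},\ldots ,Y_{n}$ , $\varepsilon$ is the column vector of unobserved random components $\varepsilon _{1},\ldots ,\varepsilon _{n}$ , and $X$ is the matrix of observed regressor values: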

X={\begin{pmatrix}1&x_{11}&\cdots &x_{1p}\\1&x_{21}&\cdots &x_{2p}\\\vdots &\vdots &\ddots &\vdots \\1&x_{n1}&\cdots &x_{np}\end{pmatrix}}

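X usually includes a constant term (the column of ones).

If the columns of X are linearly dependent, the parameter vector $\beta$ cannot be estimated by least squares unless $\beta$ is constrained, for example by requiring some of its components to sum to 0.

Classical assumptions

The sample is drawn at random from the population.
The dependent variable Y is continuous on the real line.
The error terms are independent and identically distributed (i.i.d.) and follow a Gaussian distribution.

These assumptions imply that the error term does not depend on the values of the independent variables, so $\varepsilon _{i}$ and the independent (predictor) variables X are mutually independent.

Under these assumptions, simple linear regression can be written explicitly as a model of the conditional expectation: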

{\mbox{E}}(Y_{i}\mid X_{i}=x_{i})=\alpha +\beta x_{i}\,

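Least-squares analysis

Least-squares estimation

The primary goal of regression analysis is to estimate the model parameters so as to achieve the best fit to the data. Among the various criteria for best fit, least squares is a particularly advantageous one. The resulting estimate can be written as: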

{\hat {\beta }}=(X^{T}X)^{-1}X^{T}y\,
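The estimator above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names `solve_linear_system` and `ols`, the tolerance `1e-12`, and the data are hypothetical, and the normal equations $X^{T}X\beta =X^{T}y$ are solved by Gaussian elimination rather than by forming the inverse explicitly. Production code would normally rely on a QR or SVD routine from a numerical library.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular: columns of X are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def ols(X, y):
    """Least-squares estimate obtained from the normal equations."""
    cols = [list(c) for c in zip(*X)]  # columns of X
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    return solve_linear_system(xtx, xty)


# Hypothetical design matrix: a column of ones plus two regressors.
X = [[1.0, 0.0, 1.0],
     [1.0, 1.0, 0.0],
     [1.0, 2.0, 1.0],
     [1.0, 3.0, 3.0],
     [1.0, 4.0, 2.0]]
# Noise-free responses generated from beta = (1, 2, -3).
y = [1.0 + 2.0 * row[1] - 3.0 * row[2] for row in X]
beta_hat = ols(X, y)
print(beta_hat)
```

If one column of X is a multiple of another, `ols` raises `ValueError`, which mirrors the remark that linearly dependent columns prevent least-squares estimation of $\beta$ without further constraints.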

Regression inference

For each $i=1,\ldots ,n$ , let $\sigma ^{2}$ denote the variance of the error term $\varepsilon _{i}$ . An unbiased estimate of it is:

{\hat {\sigma }}^{2}={\frac {S}{n-p}},

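where $S:=\sum _{i=1}^{n}{\hat {\varepsilon }}_{i}^{2}$ is the sum of squared errors (the residual sum of squares). In this section $p$ counts every estimated parameter, that is, every column of $X$ including the constant term. The estimate and the true value are related by: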

{\hat {\sigma }}^{2}\cdot {\frac {n-p}{\sigma ^{2}}}\sim \chi _{n-p}^{2}

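where $\chi _{n-p}^{2}$ denotes the chi-squared distribution with $n-p$ degrees of freedom.

The solution of the normal equations can be written as:

{\hat {\boldsymbol {\beta }}}=(\mathbf {X} ^{T}\mathbf {X} )^{-1}\mathbf {X} ^{T}\mathbf {y} .

This shows that the estimator is a linear combination of the values of the dependent variable. Furthermore, if the observational errors are normally distributed, the parameter estimates are jointly normally distributed. Under the present assumptions, the distribution of the estimated parameter vector is known exactly: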

{\hat {\beta }}\sim N(\beta ,\sigma ^{2}(X^{T}X)^{-1})

where $N(\cdot )$ denotes the multivariate normal distribution.

The standard error of the estimate of the $j$-th parameter is:

{\hat {\sigma }}_{j}={\sqrt {{\frac {S}{n-p}}\left[\mathbf {(X^{T}X)} ^{-1}\right]_{jj}}}.

A $100(1-\alpha )\%$ confidence interval for the parameter $\beta _{j}$ can be computed as:

{\hat {\beta }}_{j}\pm t_{{\frac {\alpha }{2}},n-p}{\hat {\sigma }}_{j}.

The residuals can be expressed as:

\mathbf {{\hat {r}}=y-X{\hat {\boldsymbol {\beta }}}=y-X(X^{T}X)^{-1}X^{T}y} .\,
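The inference formulas in this section can be traced numerically with a small sketch. Everything named here (`invert`, `ols_inference`, the data, and the critical value `t_crit = 3.182`, which is the two-sided 95% Student t value for 3 degrees of freedom read from a table) is an assumption made for illustration. As in the formulas above, `p` is the number of columns of the design matrix.

```python
import math


def invert(a):
    """Gauss-Jordan inverse of a small square matrix (illustrative only)."""
    n = len(a)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def ols_inference(X, y, t_crit):
    """Return estimates, residuals, variance estimate, standard errors, CIs."""
    n, p = len(X), len(X[0])  # p counts every column, intercept included
    cols = [list(c) for c in zip(*X)]
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(p)) for i in range(p)]
    resid = [yi - sum(b * x for b, x in zip(beta, row)) for row, yi in zip(X, y)]
    S = sum(r * r for r in resid)          # residual sum of squares
    sigma2 = S / (n - p)                   # unbiased variance estimate
    se = [math.sqrt(sigma2 * inv[j][j]) for j in range(p)]
    ci = [(b - t_crit * s, b + t_crit * s) for b, s in zip(beta, se)]
    return beta, resid, sigma2, se, ci


# Hypothetical data: intercept column plus one regressor, n = 5, p = 2.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
beta, resid, sigma2, se, ci = ols_inference(X, y, t_crit=3.182)
print(beta, sigma2, se, ci)
```

For these numbers the estimates are approximately 2.2 and 0.6, the variance estimate is 0.8, and the residuals sum to zero because the model contains a constant term.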

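Univariate linear regression

Univariate linear regression, also called simple linear regression (SLR), is the simplest regression model and yet is very widely used. Its regression equation is: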

Y=\alpha +\beta X+\varepsilon

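To estimate the best-fitting (minimum-error) $\alpha$ and $\beta$ from a sample $(y_{i},x_{i})$ , where $i=1,\ 2,\ldots ,n$ , the method of least squares is usually used. Its objective is to minimize the residual sum of squares: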

\sum _{i=1}^{n}\varepsilon _{i}^{2}=\sum _{i=1}^{n}(y_{i}-\alpha -\beta x_{i})^{2}

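The minimum is found by differentiation: take the first partial derivatives of the expression above with respect to $\alpha$ and $\beta$ and set them equal to 0: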

\left\{{\begin{array}{lcl}n\ \alpha +\sum \limits _{i=1}^{n}x_{i}\ \beta =\sum \limits _{i=1}^{n}y_{i}\\\sum \limits _{i=1}^{n}x_{i}\ \alpha +\sum \limits _{i=1}^{n}x_{i}^{2}\ \beta =\sum \limits _{i=1}^{n}x_{i}y_{i}\end{array}}\right.

This system of two linear equations in two unknowns can be solved by Cramer's rule, giving the solution ${\hat {\alpha }},\ {\hat {\beta }}$ :

{\hat {\beta }}={\frac {n\sum \limits _{i=1}^{n}x_{i}y_{i}-\sum \limits _{i=1}^{n}x_{i}\sum \limits _{i=1}^{n}y_{i}}{n\sum \limits _{i=1}^{n}x_{i}^{2}-\left(\sum \limits _{i=1}^{n}x_{i}\right)^{2}}}={\frac {\sum \limits _{i=1}^{n}(x_{i}-{\bar {x}})(y_{i}-{\bar {y}})}{\sum \limits _{i=1}^{n}(x_{i}-{\bar {x}})^{2}}}={\frac {{\text{cov}}(X,Y)}{{\text{var}}(X)}}\,

{\hat {\alpha }}={\frac {\sum \limits _{i=1}^{n}x_{i}^{2}\sum \limits _{i=1}^{n}y_{i}-\sum \limits _{i=1}^{n}x_{i}\sum \limits _{i=1}^{n}x_{i}y_{i}}{n\sum \limits _{i=1}^{n}x_{i}^{2}-\left(\sum \limits _{i=1}^{n}x_{i}\right)^{2}}}={\bar {y}}-{\bar {x}}{\hat {\beta }}

S=\sum \limits _{i=1}^{n}(y_{i}-{\hat {y}}_{i})^{2}=\sum \limits _{i=1}^{n}y_{i}^{2}-{\frac {n(\sum \limits _{i=1}^{n}x_{i}y_{i})^{2}+(\sum \limits _{i=1}^{n}y_{i})^{2}\sum \limits _{i=1}^{n}x_{i}^{2}-2\sum \limits _{i=1}^{n}x_{i}\sum \limits _{i=1}^{n}y_{i}\sum \limits _{i=1}^{n}x_{i}y_{i}}{n\sum \limits _{i=1}^{n}x_{i}^{2}-\left(\sum \limits _{i=1}^{n}x_{i}\right)^{2}}}

{\hat {\sigma }}^{2}={\frac {S}{n-2}}.
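The closed-form expressions above can be checked numerically. The sketch below is an illustration only: the function names and the five data points are assumptions. It evaluates the Cramer's-rule form of ${\hat {\alpha }}$ and ${\hat {\beta }}$ from the raw sums and compares the slope with the centered form, the ratio of the sum of cross-deviations to the sum of squared deviations of x.

```python
def simple_linear_regression(xs, ys):
    """Return (alpha_hat, beta_hat, S, sigma2_hat) for y = alpha + beta*x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    denom = n * sxx - sx ** 2
    if denom == 0:
        raise ValueError("all x values are identical")
    # Cramer's rule applied to the two normal equations.
    beta_hat = (n * sxy - sx * sy) / denom
    alpha_hat = (sxx * sy - sx * sxy) / denom
    # Residual sum of squares and the variance estimate S / (n - 2).
    S = sum((y - alpha_hat - beta_hat * x) ** 2 for x, y in zip(xs, ys))
    sigma2_hat = S / (n - 2)
    return alpha_hat, beta_hat, S, sigma2_hat


def slope_from_centered_sums(xs, ys):
    """Equivalent slope formula using deviations from the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


# Hypothetical sample.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
alpha_hat, beta_hat, S, sigma2_hat = simple_linear_regression(xs, ys)
beta_centered = slope_from_centered_sums(xs, ys)
print(alpha_hat, beta_hat, S, sigma2_hat, beta_centered)
```

Both slope formulas give 0.6 for this sample, and the intercept satisfies ${\hat {\alpha }}={\bar {y}}-{\bar {x}}{\hat {\beta }}$ with ${\bar {x}}=3$ and ${\bar {y}}=4$.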

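The covariance matrix of $({\hat {\alpha }},{\hat {\beta }})$ is $\sigma ^{2}(X^{T}X)^{-1}$ , where $(X^{T}X)^{-1}$ is: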

{\frac {1}{n\sum _{i=1}^{n}x_{i}^{2}-\left(\sum _{i=1}^{n}x_{i}\right)^{2}}}{\begin{pmatrix}\sum x_{i}^{2}&-\sum x_{i}\\-\sum x_{i}&n\end{pmatrix}}

The confidence interval for the mean response at $x_{d}$ is:

y_{d}=({\hat {\alpha }}+{\hat {\beta }}x_{d})\pm t_{{\frac {\alpha }{2}},n-2}{\hat {\sigma }}{\sqrt {{\frac {1}{n}}+{\frac {(x_{d}-{\bar {x}})^{2}}{\sum (x_{i}-{\bar {x}})^{2}}}}}

The prediction interval for a new response at $x_{d}$ is:

y_{d}=({\hat {\alpha }}+{\hat {\beta }}x_{d})\pm t_{{\frac {\alpha }{2}},n-2}{\hat {\sigma }}{\sqrt {1+{\frac {1}{n}}+{\frac {(x_{d}-{\bar {x}})^{2}}{\sum (x_{i}-{\bar {x}})^{2}}}}}
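The two intervals differ only by the extra 1 under the square root, which accounts for the noise in a single new observation. A minimal sketch, assuming a hypothetical helper `response_intervals`, made-up data, and a tabulated critical value `t_crit = 3.182` (two-sided 95%, 3 degrees of freedom):

```python
import math


def response_intervals(xs, ys, x_d, t_crit):
    """Return (center, half_width_mean, half_width_prediction) at x_d."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    S = sum((y - alpha_hat - beta_hat * x) ** 2 for x, y in zip(xs, ys))
    sigma_hat = math.sqrt(S / (n - 2))

    center = alpha_hat + beta_hat * x_d
    # Term shared by both intervals: 1/n + (x_d - x_bar)^2 / sum (x_i - x_bar)^2.
    shared = 1.0 / n + (x_d - x_bar) ** 2 / s_xx
    half_mean = t_crit * sigma_hat * math.sqrt(shared)
    half_pred = t_crit * sigma_hat * math.sqrt(1.0 + shared)
    return center, half_mean, half_pred


# Hypothetical sample; x_d is chosen equal to the sample mean of x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
center, half_mean, half_pred = response_intervals(xs, ys, x_d=3.0, t_crit=3.182)
print(center - half_mean, center + half_mean)   # interval for the mean response
print(center - half_pred, center + half_pred)   # interval for a new observation
```

Both half-widths grow as $x_{d}$ moves away from ${\bar {x}}$ , and the prediction interval is always the wider of the two.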

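Analysis of variance

In analysis of variance (ANOVA), the total sum of squares is partitioned into two or more components.

The total sum of squares, SST (sum of squares for total), is:

{\text{SST}}=\sum _{i=1}^{n}(y_{i}-{\bar {y}})^{2}

where:

{\bar {y}}={\frac {1}{n}}\sum _{i}y_{i}

Equivalently: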

{\text{SST}}=\sum _{i=1}^{n}y_{i}^{2}-{\frac {1}{n}}\left(\sum _{i}y_{i}\right)^{2}

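The regression sum of squares, SSReg (sum of squares for regression; also written SSM, sum of squares for model), is given by the following, where $\mathbf {u}$ denotes the $n\times 1$ vector of ones: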

{\text{SSReg}}=\sum \left({\hat {y}}_{i}-{\bar {y}}\right)^{2}={\hat {\boldsymbol {\beta }}}^{T}\mathbf {X} ^{T}\mathbf {y} -{\frac {1}{n}}\left(\mathbf {y^{T}uu^{T}y} \right),

The residual sum of squares, SSE (sum of squares for error), is:

{\text{SSE}}=\sum _{i}{\left({y_{i}-{\hat {y}}_{i}}\right)^{2}}=\mathbf {y^{T}y-{\hat {\boldsymbol {\beta }}}^{T}X^{T}y} .

The total sum of squares SST can also be written as the sum of SSReg and SSE:

{\text{SST}}=\sum _{i}\left(y_{i}-{\bar {y}}\right)^{2}=\mathbf {y^{T}y} -{\frac {1}{n}}\left(\mathbf {y^{T}uu^{T}y} \right)={\text{SSReg}}+{\text{SSE}}.

The coefficient of determination R² is:

R^{2}={\frac {\text{SSReg}}{\text{SST}}}=1-{\frac {\text{SSE}}{\text{SST}}}.
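The decomposition can be verified on a small example. The sketch below uses an assumed helper `anova_simple` and made-up data, and fits a simple regression with an intercept, which is the setting in which SST = SSReg + SSE holds.

```python
def anova_simple(xs, ys):
    """Return (SST, SSReg, SSE, R2) for a simple regression with intercept."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    beta_hat = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
                / sum((x - x_bar) ** 2 for x in xs))
    alpha_hat = y_bar - beta_hat * x_bar
    fitted = [alpha_hat + beta_hat * x for x in xs]

    sst = sum((y - y_bar) ** 2 for y in ys)                 # total
    ssreg = sum((f - y_bar) ** 2 for f in fitted)           # explained
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))     # residual
    return sst, ssreg, sse, ssreg / sst


# Hypothetical sample.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
sst, ssreg, sse, r2 = anova_simple(xs, ys)
print(sst, ssreg, sse, r2)
```

For this sample SST is 6, which splits into SSReg of about 3.6 and SSE of about 2.4, so R² is about 0.6 by either form of the formula.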

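Other methods

Generalized least squares

Generalized least squares can be used when the observation errors are heteroscedastic or autocorrelated.

Total least squares

Total least squares is used when the independent variables are also subject to error.

Generalized linear models

Generalized linear models apply when the error distribution is not normal, for example when it is exponential, gamma, inverse Gaussian, Poisson, or binomial.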

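Robust regression

Robust regression, in its least-absolute-deviations form, minimizes the mean absolute error, whereas ordinary linear regression minimizes the mean squared error.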
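The practical difference shows up when the data contain an outlier. The following sketch is a brute-force illustration, not an efficient algorithm: it relies on the fact that, for a straight-line fit, some minimizer of the sum of absolute residuals passes through at least two data points, so it simply tries every pair. The names `lad_line` and `ols_line` and the data are assumptions.

```python
from itertools import combinations


def lad_line(xs, ys):
    """Least-absolute-deviations line found by trying all point pairs."""
    pts = list(zip(xs, ys))
    best = None
    for (x1, y1), (x2, y2) in combinations(pts, 2):
        if x1 == x2:
            continue  # a vertical line is not a function of x
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        cost = sum(abs(y - intercept - slope * x) for x, y in pts)
        if best is None or cost < best[0]:
            best = (cost, intercept, slope)
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best[1], best[2]


def ols_line(xs, ys):
    """Ordinary least-squares line, for comparison."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope


# Hypothetical data: four points on y = x and one gross outlier.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 2.0, 3.0, 4.0, 50.0]
lad_intercept, lad_slope = lad_line(xs, ys)
ols_intercept, ols_slope = ols_line(xs, ys)
print(lad_intercept, lad_slope)
print(ols_intercept, ols_slope)
```

On these data the absolute-error fit stays on the line y = x, while the squared-error fit is pulled to a slope of 10 by the single outlier.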

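Applications of linear regression

Trend line

A trend line represents the long-term movement of time-series data. It tells us whether a particular data set (such as GDP, oil prices, or stock prices) has risen or fallen over a period of time. Although a trend line can be drawn roughly by eye through the plotted data points, a more appropriate approach is to compute its position and slope by linear regression.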

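Epidemiology

Early evidence on the effects of tobacco smoking on mortality and morbidity came from observational studies that used regression analysis. To reduce spurious correlations when analyzing observational data, researchers usually include several additional variables in their regression models besides the variable of primary interest. For example, suppose a regression model in which smoking is the independent variable of primary interest and the dependent variable is the smoker's lifespan, observed over a number of years. Researchers might include socioeconomic status as an additional independent variable, to ensure that any observed effect of smoking on lifespan is not due to differences in education or income. However, it is never possible to include every potentially confounding variable in an empirical analysis. For example, a hypothetical gene might both increase mortality and cause people to smoke more. For this reason, randomized controlled trials can often provide more compelling evidence of causal relationships than regression analysis of observational data. When controlled experiments are not feasible, variants of regression analysis such as instrumental variables regression may be used to try to estimate causal relationships from observational data.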

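Finance

The capital asset pricing model uses linear regression and the concept of the beta coefficient to analyze and quantify the systematic risk of an investment. This risk measure comes directly from the beta coefficient of the linear regression model that relates the return on the investment to the return on all risky assets.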
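As a rough illustration, beta can be computed as the slope of a simple regression of an asset's returns on the market's returns, that is, their covariance divided by the variance of the market return. The return figures below are invented for the example, the function name `beta_coefficient` is hypothetical, and the sketch works with raw returns; whether returns are measured in excess of a risk-free rate is a modeling choice not covered here.

```python
def beta_coefficient(asset_returns, market_returns):
    """Slope of the regression of asset returns on market returns."""
    n = len(market_returns)
    m_bar = sum(market_returns) / n
    a_bar = sum(asset_returns) / n
    cov = sum((m - m_bar) * (a - a_bar)
              for m, a in zip(market_returns, asset_returns))
    var = sum((m - m_bar) ** 2 for m in market_returns)
    return cov / var


# Invented per-period returns; the asset moves 1.5 times as much as the market.
market = [0.01, -0.02, 0.03, 0.015, -0.005]
asset = [0.002 + 1.5 * m for m in market]
beta = beta_coefficient(asset, market)
print(beta)
```

A beta above 1 indicates that the asset's return tends to move more than the market's return, a beta below 1 that it moves less.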

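Economics

Linear regression is the main empirical tool in economics. For example, it is used to predict consumption spending,^[4] fixed investment spending, inventory investment, purchases of a country's exports,^[5] spending on imports,^[5] the demand to hold liquid assets,^[6] labor demand,^[7] and labor supply.^[7]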

References

Citations

[1] Rencher, Alvin C.; Christensen, William F. Chapter 10, Multivariate regression – Section 10.1, Introduction. Methods of Multivariate Analysis. Wiley Series in Probability and Statistics 709, 3rd ed. John Wiley & Sons, 2012, p. 19. ISBN 9781118391679.
[2] Seal, Hilary L. The historical development of the Gauss linear model. Biometrika. 1967, 54 (1/2): 1–24. JSTOR 2333849. doi:10.1093/biomet/54.1-2.1.
[3] Yan, Xin. Linear Regression Analysis: Theory and Computing. World Scientific, 2009, pp. 1–2. ISBN 9789812834119. Regression analysis ... is probably one of the oldest topics in mathematical statistics dating back to about two hundred years ago. The earliest form of the linear regression was the least squares method, which was published by Legendre in 1805, and by Gauss in 1809 ... Legendre and Gauss both applied the method to the problem of determining, from astronomical observations, the orbits of bodies about the sun.
[4] Deaton, Angus. Understanding Consumption. Oxford University Press. 1992. ISBN 978-0-19-828824-4.
[5] Krugman, Paul R.; Obstfeld, M.; Melitz, Marc J. International Economics: Theory and Policy, 9th global ed. Harlow: Pearson. 2012. ISBN 9780273754091.
[6] Laidler, David E. W. The Demand for Money: Theories, Evidence, and Problems, 4th ed. New York: Harper Collins. 1993. ISBN 978-0065010985.
[7] Ehrenberg; Smith. Modern Labor Economics, 10th international ed. London: Addison-Wesley. 2008. ISBN 9780321538963.

Sources

Books

Cohen, J., Cohen P., West, S.G., & Aiken, L.S. Applied multiple regression/correlation analysis for the behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum Associates. 2003.
Draper, N.R. and Smith, H. Applied Regression Analysis. Wiley Series in Probability and Statistics. 1998.
Robert S. Pindyck and Daniel L. Rubinfeld. Chapter One. Econometric Models and Economic Forecasts. 1998.
Charles Darwin. The Variation of Animals and Plants under Domestication. (1868) (Chapter XIII describes what was known about reversion in Galton's time. Darwin uses the term "reversion".)

Journal articles

Galton, Francis. Regression Towards Mediocrity in Hereditary Stature. Journal of the Anthropological Institute. 1886, 15: 246–263.

Further reading

Pedhazur, Elazar J. Multiple regression in behavioral research: Explanation and prediction 2nd. New York: Holt, Rinehart and Winston. 1982. ISBN 0-03-041760-0.
Barlow, Jesse L. Chapter 9: Numerical aspects of Solving Linear Least Squares Problems. Rao, C.R. (ed.). Computational Statistics. Handbook of Statistics 9. North-Holland. 1993. ISBN 0-444-88096-8.
Björck, Åke. Numerical methods for least squares problems. Philadelphia: SIAM. 1996. ISBN 0-89871-360-9.
Goodall, Colin R. Chapter 13: Computation using the QR decomposition. Rao, C.R. (ed.). Computational Statistics. Handbook of Statistics 9. North-Holland. 1993. ISBN 0-444-88096-8.
National Physical Laboratory. Chapter 1: Linear Equations and Matrices: Direct Methods. Modern Computing Methods. Notes on Applied Science 16 2nd. Her Majesty's Stationery Office. 1961.

External links

Least-Squares Regression, PhET Interactive Simulations, University of Colorado at Boulder

