Y = θ0 + θ1X + W
where θ0 is the constant term in the equation,
θ1 is the coefficient of the variable X,
W is the error term, the difference between the actual value Y and the predicted value (θ0 + θ1X).
θ0 and θ1 are called the parameters of the linear equation, while X and Y are the independent and dependent variables respectively.
The aim is to estimate the coefficients from X and Y so that the line fits the training data.
The difference between the actual value and the predicted value is called the error or residual. To find the best-fit line, we need to estimate the coefficients, which requires minimizing the mean squared error.
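As a minimal sketch of this idea (the toy data and candidate parameters below are assumed for illustration), the residuals and mean squared error for a candidate line can be computed as follows:

```python
import numpy as np

# Assumed toy data: X is the independent variable, Y the dependent variable.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

# Candidate parameters: theta0 (constant term) and theta1 (coefficient of X).
theta0, theta1 = 0.0, 2.0

predictions = theta0 + theta1 * X   # predicted values (θ0 + θ1X)
residuals = Y - predictions         # W: actual value minus predicted value
mse = np.mean(residuals ** 2)       # mean squared error to be minimized

print(f"MSE for theta0={theta0}, theta1={theta1}: {mse:.4f}")
```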
Evaluating the model
- R-squared: R-squared is a useful performance metric to understand how well the regression model has fitted the training data. For example, an R-squared of 80% means that the model explains 80% of the variance in the dependent variable over the training data. A higher R-squared value generally indicates a better fit.
- Adjusted R-squared: The adjusted R-squared is a modified version of R-squared that takes into account the number of independent variables in the model. When a new variable is added, adjusted R-squared increases if that variable adds value to the model and decreases if it does not. This makes adjusted R-squared a better metric than R-squared for evaluating a regression model with multiple independent variables: it remains high only when all of those variables genuinely help predict the dependent variable, and it decreases if any of them have no significant effect on the predicted variable.
- RMSE: RMSE stands for Root Mean Squared Error. It is calculated as the square root of the mean of the squared differences between actual outputs and predictions. The lower the RMSE, the better the performance of the model. A sketch computing all three metrics follows this list.
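A minimal sketch, assuming NumPy arrays of actual values and predictions (the array names, toy values, and the number of independent variables are illustrative):

```python
import numpy as np

# Assumed actual values, model predictions, and number of independent variables.
y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.8, 5.2, 6.9, 9.3, 10.7])
p = 2                  # assumed number of independent variables
n = len(y_true)

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares

r2 = 1 - ss_res / ss_tot                          # R-squared
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)     # adjusted R-squared
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))   # root mean squared error

print(f"R-squared: {r2:.3f}, adjusted R-squared: {adj_r2:.3f}, RMSE: {rmse:.3f}")
```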
Estimating the coefficients θ
- Maximum Likelihood: the joint probability of all the observed events in the sample under the model. The likelihood is maximized to obtain the model parameters.
- Empirical Risk Minimization: the collective error made by the model over all the training records. The empirical risk is minimized to obtain the model parameters.
A higher empirical risk is equivalent to a lower likelihood and vice versa.
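Under the usual assumption of normally distributed errors, the negative log-likelihood differs from the mean squared error only by constants, so the line with the lowest empirical risk is also the line with the highest likelihood. A minimal sketch of this relationship (the toy data, candidate parameters, and error standard deviation are assumed for illustration):

```python
import numpy as np

# Assumed toy data and two candidate parameter pairs (theta0, theta1).
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])
candidates = [(0.0, 2.0), (1.0, 1.5)]
sigma = 1.0   # assumed standard deviation of the error term W

for theta0, theta1 in candidates:
    residuals = Y - (theta0 + theta1 * X)
    mse = np.mean(residuals ** 2)   # empirical risk (mean squared error)
    # Gaussian log-likelihood of the sample for these parameters.
    log_lik = np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                     - residuals ** 2 / (2 * sigma ** 2))
    print(f"theta=({theta0}, {theta1})  MSE={mse:.3f}  log-likelihood={log_lik:.3f}")
```

The candidate with the lower MSE always reports the higher log-likelihood here, illustrating the equivalence stated above.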
The regression line itself can be calculated using ordinary least squares (OLS) estimation, as in the sketch below.
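A minimal sketch of OLS for simple linear regression, using the closed-form estimates (the toy data are assumed; np.polyfit or scikit-learn's LinearRegression would give the same coefficients):

```python
import numpy as np

# Assumed toy data.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

# Closed-form OLS estimates for simple linear regression:
#   theta1 = cov(X, Y) / var(X),  theta0 = mean(Y) - theta1 * mean(X)
x_mean, y_mean = X.mean(), Y.mean()
theta1 = np.sum((X - x_mean) * (Y - y_mean)) / np.sum((X - x_mean) ** 2)
theta0 = y_mean - theta1 * x_mean

print(f"Estimated line: Y = {theta0:.3f} + {theta1:.3f} * X")
```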
Assumptions
- Linearity: the relation between the independent variables and the dependent variable must be linear; otherwise, the model might fail to identify the pattern in the data and produce a higher error.
- Multicollinearity: This refers to correlation among the independent variables in the data. Linear regression assumes that there is no multicollinearity between the independent variables. Multicollinearity reduces the precision of the estimated coefficients, which weakens the statistical power of the regression model, because correlated variables provide redundant information and the apparent significance of a variable can end up lower than it actually is. To avoid this, it is necessary to include only non-correlated independent variables in the model (a rough diagnostic sketch for this and the next two assumptions follows this list).
- Homoscedasticity: According to this assumption, the error associated with each data point should be equally spread (i.e., have constant variance) along the best-fit line. The derivation of the linear regression equations assumes that the error terms have constant variance, so the results from the model might not be reliable if this assumption is violated.
- Normal distribution of error terms: If the error terms are not normally distributed, confidence intervals may become too wide or too narrow. Unstable confidence intervals make it difficult to rely on the coefficients estimated by minimizing the squared errors. A non-normal distribution of the errors also suggests that there are a few unusual data points that should be studied closely to build a better model.
- Endogeneity: This is the phenomenon of independent variables being correlated with the error terms of the model. With endogeneity, the optimization process produces biased estimates of the model parameters, which adversely affects the performance of the model. To avoid this, linear regression assumes that the independent variables are not correlated with the error terms.
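As a rough, hedged sketch of how the multicollinearity, homoscedasticity, and normality assumptions might be checked in practice (the synthetic data, residuals, and thresholds below are assumed for illustration; formal tests such as Breusch-Pagan for heteroscedasticity are also common):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Assumed design matrix of independent variables, fitted values, and residuals.
X = rng.normal(size=(100, 3))
fitted = rng.normal(size=100)
residuals = rng.normal(scale=0.5, size=100)

# Multicollinearity: flag pairs of independent variables with high correlation.
corr = np.corrcoef(X, rowvar=False)
pairs = [(i, j) for i in range(corr.shape[0]) for j in range(i + 1, corr.shape[1])
         if abs(corr[i, j]) > 0.8]   # assumed threshold
print("Highly correlated variable pairs:", pairs)

# Homoscedasticity (rough check): compare residual variance for low vs high fitted values.
order = np.argsort(fitted)
low_half, high_half = residuals[order[:50]], residuals[order[50:]]
print("Variance ratio (high / low fitted):", np.var(high_half) / np.var(low_half))

# Normality of error terms: Shapiro-Wilk test on the residuals.
stat, p_value = stats.shapiro(residuals)
print("Shapiro-Wilk p-value:", p_value)
```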