Warm up in Regression Analysis

Oct 29, 2018 · 8 min read

Why regression analysis?

Regression analysis is a statistical process for estimating the relationships between independent variables (predictors) and dependent variables (responses). It helps us understand how the dependent variable changes when any one of the independent variables is varied while the other independent variables are held fixed.

Some Overview….

N -> number of data points
k -> number of weights (W), one per model parameter
model function -> Y = f(X, W)
If N < k, regression analysis can’t be performed, since the system is underdetermined.
If N = k, there is a unique solution if f is linear and the X are linearly independent; if f is nonlinear, a solution may not exist, or many solutions may exist.
If N > k, there is enough information to estimate the best-fitting model, and there are N - k degrees of freedom.

Linear Regression

Linear regression is a statistical technique in which the dependent variable (response) is predicted from the independent variables (predictors).
The goal of regression analysis is to use a sample from a population to estimate the properties of the population. The coefficients in the model are estimates of the actual population parameters, so we want those estimates to be as good as possible.
Before building a linear regression model, check the following assumptions:
  1. Linearity: the relationship between the predictors and the response can be represented graphically as a straight line.
  2. Homoscedasticity: the error term has mean 0 and the same finite variance at all levels of the independent variables.
  3. Multivariate normality: the residuals are normally distributed.
  4. Independence of errors: the residuals are uncorrelated with each other.
  5. Lack of multicollinearity: no independent variable is perfectly correlated with another independent variable.
There are two types of linear regression models: simple linear regression and multiple linear regression.
a) If there is only one independent variable (predictor/feature), it is a simple linear regression.
b) If there is more than one independent variable (predictor/feature), it is a multiple linear regression.
https://www.udemy.com/machinelearning/learn/v4/t/lecture/5732730?start=0
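As a minimal sketch of simple linear regression, the closed-form least-squares estimates of the intercept and slope can be computed directly. The function name `fit_simple_linear` and the data below are made up for illustration.

```python
def fit_simple_linear(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data lying roughly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]
intercept, slope = fit_simple_linear(xs, ys)
print(f"y = {intercept:.2f} + {slope:.2f} * x")
```

Here N = 5 and k = 2 (intercept and slope), so the fit has N - k = 3 degrees of freedom.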
For a categorical variable, we need to create dummy variables that encode the information in the categorical variable as 0/1 columns. (Caution: we always need to omit one dummy variable, because once n-1 of the n dummy variables are known, the last one is determined, and keeping it would introduce perfect multicollinearity!)
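The dummy-variable encoding can be sketched in a few lines. The helper `dummy_encode` and the example column are hypothetical; the first category in sorted order is the one dropped and acts as the baseline.

```python
def dummy_encode(values):
    """Encode a categorical column as 0/1 columns, dropping one category."""
    categories = sorted(set(values))
    kept = categories[1:]  # the dropped first category becomes the baseline
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return kept, rows


# Hypothetical categorical column with three levels.
states = ["New York", "California", "Florida", "California"]
columns, encoded = dummy_encode(states)
print(columns)
for state, row in zip(states, encoded):
    print(state, row)
```

A row of all zeros stands for the baseline category, so three levels need only two columns.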

Polynomial Regression

Polynomial regression models the relationship between the independent variable and the dependent variable as an nth-degree polynomial in the independent variable. The fitted function is nonlinear in the independent variable, although it remains linear in the coefficients, and it can be fitted with discrete or continuous features. It is one way to reduce the underfitting of a plain linear regression, since it increases the model complexity.
https://www.originlab.com/index.aspx?go=Products/Origin/DataAnalysis/CurveFitting
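Because a polynomial model is still linear in its weights, it can be fitted with ordinary least squares on the expanded features 1, x, x^2, and so on. The sketch below solves the normal equations with a small Gaussian-elimination helper; the function names and the data are assumptions for illustration, and a numerical library would normally be used instead.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit; returns [w0, w1, ..., w_degree]."""
    features = [[x ** p for p in range(degree + 1)] for x in xs]
    k = degree + 1
    # Normal equations: (F^T F) w = F^T y
    ftf = [[sum(f[i] * f[j] for f in features) for j in range(k)]
           for i in range(k)]
    fty = [sum(f[i] * y for f, y in zip(features, ys)) for i in range(k)]
    return solve_linear_system(ftf, fty)


# Hypothetical noise-free data generated from y = 1 + 2x + 3x^2.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [1 + 2 * x + 3 * x ** 2 for x in xs]
weights = fit_polynomial(xs, ys, degree=2)
print(weights)
```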

Logistic regression (linear classifier for classification)

Logistic regression uses the logistic model to predict a binary outcome from a linear combination of the predictors. Unlike linear regression, which outputs a numeric value, logistic regression outputs the probability that a given input belongs to a specific class.
Logistic regression is used when the dependent variable is binary; the independent variables can be discrete or continuous. In logistic regression, the response (dependent variable) has a finite number of possible outcomes.
Adding independent variables tends to increase the variance of the model and pushes it toward overfitting.
Before building a logistic regression model, check the following assumptions:
  1. The dependent variable (response) should be binary in nature.
  2. There are no influential outliers in the data.
  3. Lack of multicollinearity: no independent variable is perfectly correlated with another independent variable.
  4. The input data can be separated reasonably well into two ‘regions’ by a linear decision boundary.
https://www.biomedware.com/files/documentation/spacestat/Statistics/Multivariate_Modeling/Regression/Implementation_of_Logistic_GWR.htm
https://www.saedsayad.com/logistic_regression.htm
Why is the threshold of the logistic regression model 0.5?
The parameters of the best sigmoid curve are estimated with MLE (maximum likelihood estimation). The criterion is to make every sample labeled ‘1’ have a logistic function result as close to 1 as possible, and every sample labeled ‘0’ a result as close to 0 as possible. The output is therefore an estimated probability of class ‘1’, and 0.5 is the point at which the two classes are equally likely.
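The following sketch fits a one-predictor logistic model by maximizing the log-likelihood with plain gradient ascent, then classifies with the 0.5 threshold. The learning rate, step count, function names and data are assumptions chosen for illustration, not a recommended training setup.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w0, w1, x):
    """Estimated probability that x belongs to class 1."""
    return sigmoid(w0 + w1 * x)


def log_likelihood(w0, w1, xs, labels):
    total = 0.0
    for x, y in zip(xs, labels):
        p = predict_proba(w0, w1, x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


def fit_logistic(xs, labels, lr=0.5, steps=5000):
    """Gradient ascent on the average log-likelihood."""
    w0 = w1 = 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [y - predict_proba(w0, w1, x) for x, y in zip(xs, labels)]
        w0 += lr * sum(errors) / n
        w1 += lr * sum(e * x for e, x in zip(errors, xs)) / n
    return w0, w1


def predict_label(w0, w1, x, threshold=0.5):
    return 1 if predict_proba(w0, w1, x) >= threshold else 0


# Hypothetical data: larger x makes label 1 more likely, with one overlap.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
w0, w1 = fit_logistic(xs, labels)
print(predict_label(w0, w1, 0.5), predict_label(w0, w1, 4.0))
```

The fitted weights give a higher log-likelihood than the starting point (0, 0), which is exactly what "labels of 1 get outputs near 1, labels of 0 get outputs near 0" means in MLE terms.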

Decision Tree Regression

Tree-based model: it breaks the data down into smaller and smaller subsets while an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes.
If the target variable is a continuous numeric value with infinitely many possible outcomes, the decision tree is called a regression tree.
https://www.udemy.com/machinelearning/learn/v4/t/lecture/5732730?start=0
Mathematical Formula:
https://web.stanford.edu/~hastie/Papers/ESLII.pdf
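One standard way to write a regression tree, following the formulation in The Elements of Statistical Learning linked above, is sketched here. The symbols M (number of leaf regions), R_m (the m-th region), c_m (the constant predicted in that region), j (the splitting predictor) and s (the split point) come from that book and are not defined elsewhere in this post.

```latex
f(x) = \sum_{m=1}^{M} c_m \, I(x \in R_m),
\qquad
\hat{c}_m = \operatorname{ave}\left( y_i \mid x_i \in R_m \right)

\min_{j,\, s} \left[
  \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2
  + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2
\right],
\qquad
R_1(j,s) = \{ x \mid x_j \le s \},\quad
R_2(j,s) = \{ x \mid x_j > s \}
```

The first line says that the prediction in each region is the average response of the training points that fall into it; the second is the greedy search for the split that minimizes the sum of squared errors of the two resulting regions.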
The logic of a decision tree is to recursively split the data into binary sub-trees:
  1. At each split, consider every predictor as a possible split attribute. For each predictor (independent variable), evaluate several candidate split points and keep the one with the lowest sum of squared errors (SSE). Then compare the ‘best’ split points of all predictors at that level and choose the one with the lowest SSE overall.
  2. Repeat step 1 recursively on each sub-tree until one of the stop criteria below is met.
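Step 1 can be sketched as a search over candidate thresholds. In this illustration the candidates are the midpoints between consecutive distinct values of each predictor, which is one common choice and an assumption here; the function names and data are also made up.

```python
def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split_for_predictor(xs, ys):
    """Return (threshold, total_sse) of the best split x <= threshold, or None."""
    best = None
    distinct = sorted(set(xs))
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best


def best_split(rows, ys):
    """Compare the best split of every predictor; return (index, threshold, sse)."""
    best = None
    for j in range(len(rows[0])):
        candidate = best_split_for_predictor([r[j] for r in rows], ys)
        if candidate is not None and (best is None or candidate[1] < best[2]):
            best = (j, candidate[0], candidate[1])
    return best


# Hypothetical data: predictor 0 separates low and high responses cleanly.
rows = [[1, 5], [2, 3], [3, 8], [4, 1], [5, 7], [6, 2]]
ys = [10, 11, 10, 30, 31, 29]
feature, threshold, total_sse = best_split(rows, ys)
print(feature, threshold, round(total_sse, 3))
```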
Stop criterion: a decision tree can keep splitting until there is only one element in each node. Such trees are usually quite complex and lead to overfitting with high model variance, so a stop criterion is needed.
  1. Set a threshold for the minimum number of samples in each leaf; a node stops splitting once splitting it would violate that threshold.
  2. Set a threshold for the maximum depth of the tree; the tree stops splitting once it reaches that depth.
  3. Set a threshold on the decrease in the sum of squared errors; a node stops splitting when the best split reduces the SSE, compared to the parent node, by less than the threshold. However, because the algorithm is greedy, a good split can follow a bad split, so this criterion is not robust.
  4. The predictor values of all records in the node are identical, so no further split is possible.
https://www.datacamp.com/community/tutorials/decision-trees-R
http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html
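The recursive splitting and the stop criteria can be combined in a short sketch. It uses a single predictor to stay small, and the parameter names `max_depth`, `min_samples_leaf` and `min_sse_decrease` are illustrative assumptions, not the API of any particular library.

```python
def sse(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def build_tree(xs, ys, max_depth=3, min_samples_leaf=2,
               min_sse_decrease=1e-9, depth=0):
    """Grow a one-predictor regression tree; each leaf predicts the mean of y."""
    leaf = {"value": sum(ys) / len(ys)}
    # Stop: maximum depth reached, or all predictor values are identical.
    if depth >= max_depth or len(set(xs)) == 1:
        return leaf
    best = None
    distinct = sorted(set(xs))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [(x, y) for x, y in zip(xs, ys) if x <= t]
        right = [(x, y) for x, y in zip(xs, ys) if x > t]
        # Stop criterion: every leaf must keep enough samples.
        if len(left) < min_samples_leaf or len(right) < min_samples_leaf:
            continue
        total = sse([y for _, y in left]) + sse([y for _, y in right])
        if best is None or total < best[0]:
            best = (total, t, left, right)
    # Stop: no admissible split, or the SSE decrease is below the threshold.
    if best is None or sse(ys) - best[0] < min_sse_decrease:
        return leaf
    _, t, left, right = best
    args = (max_depth, min_samples_leaf, min_sse_decrease, depth + 1)
    return {
        "threshold": t,
        "left": build_tree([x for x, _ in left], [y for _, y in left], *args),
        "right": build_tree([x for x, _ in right], [y for _, y in right], *args),
    }


def predict(tree, x):
    while "threshold" in tree:
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree["value"]


# Hypothetical step-shaped data.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [5, 5, 5, 5, 20, 20, 20, 20]
tree = build_tree(xs, ys)
print(tree["threshold"], predict(tree, 2), predict(tree, 7))
```

After the first split both children are already pure, so the SSE-decrease criterion stops the recursion before the depth limit is reached.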
Worth mentioning: the ‘regions’ that a regression tree produces must satisfy the following criteria:
a) Disjoint (no region overlaps another, so every point falls into exactly one leaf).
b) Cover the whole space (each split divides the entire region it is applied to, leaving no gaps).
Advantages:
a. Simple to understand and interpret.
b. Requires little data preparation; for example, no normalization is needed.
c. Mirrors human decision making more closely than other approaches.
Disadvantages:
a. A small change in the training data can cause a large change in the decision tree.
b. Practical decision-tree learning algorithms are based on heuristics such as the greedy algorithm, where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree.

Random Forest Regression

A random forest is built at training time by fitting a multitude of decision trees, each on a bootstrap sample (re-sampled with replacement) of the training data, which is the bagging method. The final prediction is the average of the predictions across the trees.
https://databricks.com/blog/2015/01/21/random-forests-and-boosting-in-mllib.html
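The bagging step can be sketched with depth-1 trees (stumps) as the base learners. This is plain bagging on a single predictor: the random selection of features at each split that random forests typically add is left out, and the function names, seed and data are assumptions for illustration.

```python
import random


def fit_stump(xs, ys):
    """Depth-1 regression tree: returns (threshold, left_mean, right_mean)."""
    def sse(v):
        m = sum(v) / len(v)
        return sum((a - m) ** 2 for a in v)

    distinct = sorted(set(xs))
    if len(distinct) == 1:  # nothing to split on: predict the overall mean
        mean = sum(ys) / len(ys)
        return (distinct[0], mean, mean)
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        total = sse(left) + sse(right)
        if best is None or total < best[0]:
            best = (total, t, sum(left) / len(left), sum(right) / len(right))
    return best[1:]


def fit_bagged_stumps(xs, ys, n_trees=25, seed=0):
    """Fit one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_forest(forest, x):
    """Average the predictions of all trees."""
    preds = [left if x <= t else right for t, left, right in forest]
    return sum(preds) / len(preds)


# Hypothetical step-shaped data.
xs = list(range(1, 11))
ys = [1, 1, 1, 1, 1, 9, 9, 9, 9, 9]
forest = fit_bagged_stumps(xs, ys)
print(predict_forest(forest, 1), predict_forest(forest, 10))
```

Each stump sees a slightly different sample and so picks a slightly different threshold; averaging them smooths out those differences.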
Advantages:
a) Reduces overfitting: averaging the predictions of many trees lowers the variance of the model, which significantly reduces the risk of overfitting and usually makes the model more accurate.
b) Runs efficiently on large datasets and can handle thousands of input variables without variable deletion.

Support Vector Regression (SVR)

SVR is a type of support vector machine that supports linear and nonlinear regression.
The model produced by a Support Vector Machine (SVM) depends only on a subset of the training data, because the cost function for building the model ignores any training data whose target lies within an absolute difference of epsilon (ε) from the model prediction.
https://en.wikipedia.org/wiki/Support_vector_machine
https://upload.wikimedia.org/wikipedia/commons/7/7a/Svr_epsilons_demo.svg
The threshold ε sets the width of the band around the prediction inside which errors are ignored. As ε increases, the fitted model becomes flatter and the prediction becomes less sensitive to errors.
http://research.ncku.edu.tw/re/articles/e/20080620/images/080526035813E4MDyJ.gif
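The ε-insensitive cost described above can be written in one line. The sketch below shows the loss and which points fall on or outside the ε band; the helper names and numbers are made up for illustration.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the epsilon band, linear in the excess outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def points_outside_band(ys, preds, epsilon):
    """Indices of points whose error is at least epsilon."""
    return [i for i, (y, p) in enumerate(zip(ys, preds))
            if abs(y - p) >= epsilon]


# Hypothetical targets and predictions.
ys = [1.0, 2.0, 3.0]
preds = [1.1, 2.0, 4.0]
for eps in (0.05, 0.5, 2.0):
    losses = [epsilon_insensitive_loss(y, p, eps) for y, p in zip(ys, preds)]
    print(eps, losses, points_outside_band(ys, preds, eps))
```

With a wider band fewer points contribute to the cost, which is why a larger ε gives a model that reacts less to small errors.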

Optimization Techniques:

  1. Maximum Likelihood Estimation (more details will be published in a later post):
https://en.wikipedia.org/wiki/Maximum_likelihood_estimation
  2. Least Squares Estimation:
https://en.wikipedia.org/wiki/Least_squares

Regression Model Evaluation Metrics…..!

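  1. RMSE (root mean square error): the square root of the average squared difference between predicted values and observed values. Those differences are the residuals, so RMSE represents the sample standard deviation of the residuals. With N data points, observed values y_i and predicted values \hat{y}_i, the formula is:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}$$
  2. MAE (mean absolute error): the average absolute difference between predicted values and actual values, which means all individual differences are weighted equally. With the same notation, the formula is:
$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|$$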
RMSE v.s. MAE:
Similarities: both range from 0 to infinity, and a lower value indicates a better result.
Differences:
a. MAE takes the average of the absolute differences directly, while RMSE penalizes large differences between the actual and predicted values more heavily than MAE does.
b. RMSE ≥ MAE, and they are equal only when all the absolute differences between the actual and predicted values have the same magnitude (for example, when they are all zero).
c. RMSE is often more practical because the squared-error loss behind it is smoothly differentiable, which makes it easier to work with in mathematical operations.
d. RMSE is easier to optimize, but MAE is more robust to outliers.
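  3. R squared (R2): it represents the proportion of the variance of the dependent variable that is explained by the independent variables. In other words, it measures how well the observed outcomes are replicated by the model. With observed values y_i, predicted values \hat{y}_i and the mean \bar{y} of the observed values, the formula is:
$$R^2 = 1 - \frac{\sum_{i}\left(y_i - \hat{y}_i\right)^2}{\sum_{i}\left(y_i - \bar{y}\right)^2}$$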
https://en.wikipedia.org/wiki/Coefficient_of_determination
  4. Adjusted R squared: like R2, it measures how well the model fits the data, but it adjusts for the number of independent variables in the model. With N data points and p independent variables, the formula is:
$$\bar{R}^2 = 1 - \left(1 - R^2\right)\frac{N - 1}{N - p - 1}$$
https://en.wikipedia.org/wiki/Coefficient_of_determination
R2 v.s. Adjusted R2:
a. R2 increases or stays the same when more independent variables are added, even if they have no relationship with the dependent variable.
b. Adjusted R2 corrects R2 for the number of independent variables: it increases when an added variable improves the fit more than would be expected by chance (for example, a variable strongly correlated with the dependent variable), and decreases when the added variable does not.
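As a small sketch, R2 and adjusted R2 can be computed from their standard definitions. The function names and data are made up, and `n_predictors` is the number of independent variables in the model.

```python
def r_squared(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_predictors):
    """Penalize R2 for the number of independent variables."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Hypothetical observations and predictions.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.0]
r2 = r_squared(y_true, y_pred)
print(round(r2, 4))
print(round(adjusted_r_squared(r2, n_samples=5, n_predictors=1), 4))
print(round(adjusted_r_squared(r2, n_samples=5, n_predictors=3), 4))
```

For the same R2, the model that needed more independent variables gets the lower adjusted R2.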

Pros and Cons of Regression Evaluation Metric:

  1. RMSE is the better choice if we only care about model accuracy.
  2. Any metric such as MAE, which doesn’t take the square of the difference, is more robust to outliers.
  3. Any metric based on squared errors is more useful when large errors are particularly undesirable.
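A small sketch makes the outlier point concrete. The two hypothetical sets of predictions below have the same MAE, but the one that concentrates the error in a single large miss has a higher RMSE.

```python
import math


def rmse(y_true, y_pred):
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def mae(y_true, y_pred):
    n = len(y_true)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n


# Hypothetical data.
actual = [10.0, 12.0, 14.0, 16.0]
equal_errors = [11.0, 13.0, 13.0, 15.0]  # every absolute error is 1
one_outlier = [10.0, 12.0, 14.0, 20.0]   # a single error of 4

for name, pred in (("equal errors", equal_errors), ("one outlier", one_outlier)):
    print(name, "MAE =", mae(actual, pred), "RMSE =", rmse(actual, pred))
```

When every error has the same magnitude the two metrics coincide; the single large error leaves MAE unchanged and doubles RMSE.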
