This post looks at a regression technique that can help us alleviate the problem of outliers. Robust regression is a family of regression techniques that are largely resistant to the presence of outliers. Least Trimmed Squares (LTS) regression is one such technique: it fits a regression function that is not unduly affected by outliers, because it minimises the sum of squared residuals over only the subset of k points that fit best.
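Written out, the LTS criterion looks like this (a sketch of the standard definition; r-squared subscript (i) denotes the i-th smallest squared residual and k is the size of the retained subset, typically around half the sample — in robustbase::ltsReg the subset size is controlled by the alpha argument, which defaults to 0.5):

\hat{\beta}_{\mathrm{LTS}} = \arg\min_{\beta} \sum_{i=1}^{k} r^2_{(i)}(\beta), \quad \text{where } r^2_{(1)}(\beta) \le r^2_{(2)}(\beta) \le \dots \le r^2_{(n)}(\beta)

Because only the k best-fitting points enter the sum, a handful of wild observations cannot drag the fit towards themselves, which is exactly what happens with ordinary least squares.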
library(mlbench)
library(robustbase)
data(BostonHousing)
str(BostonHousing)
'data.frame': 506 obs. of 14 variables:
$ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
$ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
$ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
$ chas : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
$ rm : num 6.58 6.42 7.18 7 7.15 ...
$ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
$ dis : num 4.09 4.97 4.97 6.06 6.06 ...
$ rad : num 1 2 2 3 3 3 5 5 5 5 ...
$ tax : num 296 242 242 222 222 222 311 311 311 311 ...
$ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
$ b : num 397 397 393 395 397 ...
$ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
$ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
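Before fitting anything, it is worth glancing at the response variable itself; a minimal sketch with base graphics (only suggestive, since whether a point is an outlier in a regression also depends on the predictors):

boxplot(BostonHousing$medv, horizontal = TRUE, main = "Distribution of medv")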
reg1 <- lm(medv ~ ., data = BostonHousing)
summary(reg1)
Call:
lm(formula = medv ~ ., data = BostonHousing)
Residuals:
Min 1Q Median 3Q Max
-15.595 -2.730 -0.518 1.777 26.199
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
crim -1.080e-01 3.286e-02 -3.287 0.001087 **
zn 4.642e-02 1.373e-02 3.382 0.000778 ***
indus 2.056e-02 6.150e-02 0.334 0.738288
chas1 2.687e+00 8.616e-01 3.118 0.001925 **
nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
age 6.922e-04 1.321e-02 0.052 0.958229
dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
tax -1.233e-02 3.760e-03 -3.280 0.001112 **
ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
b 9.312e-03 2.686e-03 3.467 0.000573 ***
lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
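Before turning to the diagnostics, we can pull the significant terms straight out of the summary table; a small sketch ("Pr(>|t|)" is the p-value column name that summary.lm uses):

coefs <- summary(reg1)$coefficients
rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05]  # terms significant at the 5% level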
plot(reg1)  # four diagnostic plots: residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
As we can see from the results above, most of the predictors are statistically significant. The adjusted R-squared value is 0.73, and the overall p-value is far below 0.05, so the model as a whole is statistically significant: it explains about 73% of the variation in the response variable. The last diagnostic plot, residuals vs. leverage, suggests that observations such as 369 and 373 are outliers. We can test this formally and then run a robust regression.
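To zoom in on just that plot, plot.lm accepts a which argument (5 selects residuals vs. leverage, with the flagged points labelled):

plot(reg1, which = 5)  # residuals vs leverage only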
library(car)
outlierTest(reg1) #identify the outlier data pts
rstudent unadjusted p-value Bonferroni p
369 5.907411 6.4998e-09 3.2889e-06
372 5.491079 6.4185e-08 3.2478e-05
373 5.322247 1.5617e-07 7.9020e-05
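outlierTest reports the observations with the largest studentized residuals, with p-values Bonferroni-adjusted for the number of tests. A rough way to reproduce the shortlist by hand (a sketch; the cutoff of 3 is a common rule of thumb, not the exact criterion the test uses):

which(abs(rstudent(reg1)) > 3)  # observations with unusually large studentized residuals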
library(robustbase)
ltsFit <- ltsReg(medv ~ ., data = BostonHousing)  # alpha = 0.5 by default, so roughly half the observations enter the trimmed fit
summary(ltsFit)
Call:
ltsReg.formula(formula = medv ~ ., data = BostonHousing)
Residuals (from reweighted LS):
Min 1Q Median 3Q Max
-7.047 -1.340 0.000 1.306 7.086
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Intercept 7.133193 3.586372 1.989 0.04733 *
crim -0.507579 0.053559 -9.477 < 2e-16 ***
zn 0.033001 0.008588 3.843 0.00014 ***
indus -0.002682 0.036204 -0.074 0.94099
chas1 1.175452 0.548071 2.145 0.03253 *
nox -3.064635 2.274219 -1.348 0.17850
rm 5.457775 0.346509 15.751 < 2e-16 ***
age -0.054621 0.008078 -6.762 4.41e-11 ***
dis -0.912505 0.122830 -7.429 5.86e-13 ***
rad 0.222938 0.042951 5.191 3.23e-07 ***
tax -0.009213 0.002152 -4.281 2.29e-05 ***
ptratio -0.556512 0.077393 -7.191 2.84e-12 ***
b 0.010373 0.001665 6.230 1.10e-09 ***
lstat -0.162327 0.036601 -4.435 1.17e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.667 on 434 degrees of freedom
Multiple R-Squared: 0.8442, Adjusted R-squared: 0.8396
F-statistic: 180.9 on 13 and 434 DF, p-value: < 2.2e-16
The key thing we can see from the result above is that the adjusted R-squared has climbed to 0.84, so about 84% of the variation in the response variable is now explained by the predictors, up from 73% under ordinary least squares. Several coefficients also shift noticeably once the outlying observations are trimmed (crim, rm and age, for instance), which shows how much pull those points were exerting on the original fit.
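To make the comparison concrete, we can put the two coefficient vectors side by side and count how many observations the reweighting step dropped. A sketch, assuming the lts.wt component (the 0/1 weights ltsReg reports for the reweighted fit):

cbind(OLS = coef(reg1), LTS = coef(ltsFit))  # coefficients side by side
sum(ltsFit$lts.wt == 0)                      # observations given zero weight

If trimming half the data feels too aggressive, the alpha argument of ltsReg can be raised towards 1 (for example alpha = 0.75 keeps roughly three quarters of the observations).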