The Quantile regression gives a more comprehensive picture of the effect of the independent variables on the dependent variable. Instead of estimating the model with average effects using the OLS linear model, the quantile regression produces different effects along the distribution (quantiles) of the dependent variable. The dependent variable is continuous with no zeros or too many repeated values. Examples include estimating the effects of household income on food expenditures for low- and high-expenditure households, what are the factors influencing total medical expenditures for people with low, medium and high expenditures. The following oexample is based on Medical Expenditure Panel Survey MEPS. The dependent variable is the total medical expenditures, and the independent variables are: supplemental insurance, total number of chronic conditions, age, female, and white. We estimate an OLS regression, and quantile regression at 25th, 50th, and 75th quantile. The standard Ordinary Least Squares OLS models the relationship between one or more independent variables and the conditional mean of a dependent variable. The Quantile Regression models the relationship betwwn the conditional quantiles rather than just the conditional mean of the dependent variable. A quantile regression gives a more comprehensive picture of the effect of the independent variables on the dependent variable because we can show different effects (quantiles). One pratical consideration is that the distribution of the dependent variable has to be continuous and it shouldn’t has zero or too many repeated values. One important aspect to take in considertion in Quantile Regression is that coefficients can be significanlty different than the OLS coefficients, showing different effects along the distribution of the dependent variable. The advantages of the Quantile regression are: Flexibility for modeling data with heterogeneous conditional distributions. Median regression is more robust to outliers than the OLS regression. Quantile regression can show different effects of the independent variables on the dependent variable depending across the spectrum of the dependent variable.
library(quantreg)
mydata <- read.csv("C:/07 - R Website/dataset/ML/quantile_health.csv")
attach(mydata)
summary(mydata)
dupersid totexp ltotexp suppins
Min. :20004018 Min. : 3 Min. : 1.099 Min. :0.0000
1st Qu.:24476022 1st Qu.: 1433 1st Qu.: 7.268 1st Qu.:0.0000
Median :90123058 Median : 3334 Median : 8.112 Median :1.0000
Mean :62616065 Mean : 7290 Mean : 8.060 Mean :0.5915
3rd Qu.:94161512 3rd Qu.: 7492 3rd Qu.: 8.922 3rd Qu.:1.0000
Max. :98347025 Max. :125610 Max. :11.741 Max. :1.0000
totchr age female white
Min. :0.000 Min. :65.00 Min. :0.0000 Min. :0.0000
1st Qu.:1.000 1st Qu.:69.00 1st Qu.:0.0000 1st Qu.:1.0000
Median :2.000 Median :74.00 Median :1.0000 Median :1.0000
Mean :1.809 Mean :74.25 Mean :0.5841 Mean :0.9736
3rd Qu.:3.000 3rd Qu.:79.00 3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :7.000 Max. :90.00 Max. :1.0000 Max. :1.0000
# Define variables
Y <- cbind(totexp)
X <- cbind(suppins, totchr, age, female, white)
The variable totexp is the total expenditure and is dependent variable. The independent variables are suppins supplemental insurance, totchr total number of chronic conditions, age, female, and white. The first step is to perform an OLS regression.
# OLS regression
olsreg <- lm(Y ~ X, data=mydata)
summary(olsreg)
Call:
lm(formula = Y ~ X, data = mydata)
Residuals:
Min 1Q Median 3Q Max
-16146 -5372 -2804 457 115461
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 461.492 2777.453 0.166 0.86805
Xsuppins 585.984 436.309 1.343 0.17936
Xtotchr 2528.079 164.834 15.337 < 2e-16 ***
Xage 6.711 33.768 0.199 0.84248
Xfemale -1239.866 433.110 -2.863 0.00423 **
Xwhite 2193.155 1327.794 1.652 0.09870 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11520 on 2949 degrees of freedom
Multiple R-squared: 0.07828, Adjusted R-squared: 0.07672
F-statistic: 50.09 on 5 and 2949 DF, p-value: < 2.2e-16
The Xtotchr, the total of chronic condition, says that each of chronich condition brings 2528.079 more dollars in totexp total medical expenditure. Now, we perform a quantile regression.
# Simultaneous quantile regression
quantreg2575 <- rq(Y ~ X, data=mydata, tau=c(0.25, 0.75))
summary(quantreg2575)
Call: rq(formula = Y ~ X, tau = c(0.25, 0.75), data = mydata)
tau: [1] 0.25
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) -1412.88889 433.20179 -3.26150 0.00112
Xsuppins 453.44444 75.05348 6.04162 0.00000
Xtotchr 782.47222 37.55769 20.83388 0.00000
Xage 16.08333 6.19162 2.59760 0.00943
Xfemale 16.05556 72.20278 0.22237 0.82404
Xwhite 338.08333 71.51522 4.72743 0.00000
Call: rq(formula = Y ~ X, tau = c(0.25, 0.75), data = mydata)
tau: [1] 0.75
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) -4512.04545 2350.56284 -1.91956 0.05501
Xsuppins 708.40909 375.76929 1.88522 0.05950
Xtotchr 2855.31818 196.12587 14.55860 0.00000
Xage 87.36364 30.98410 2.81963 0.00484
Xfemale -554.59091 378.71501 -1.46440 0.14319
Xwhite 801.68182 370.96108 2.16109 0.03077
The Xtotchr, the total of chronic condition, for the 0.25 quantile is 782.47, and the interpretatation is: adding 25th quantile each of chronich condition brings only 782.42 more dollars in totexp total medical expenditure. This is a much ower value that we had before with OLS. This means, for low number of chronic conditions the medical expenditure is lower. On the oder hand, looking at the 0.75 quantile for the total of chronic condition we have 2855.31 more dollar per each more chronic condition. This value is more similar with the OLS coefficient, and in fact this time we have not a significant difference from the OLS coefficient.
We can also perform an ANOVA to compare the coefficient at 25th quantile vs. 75th quantile.
# Quantile regression at 25 the quanile
quantreg25 <- rq(Y ~ X, data=mydata, tau=0.25)
# Quantile regression at 75 the quanile
quantreg75 <- rq(Y ~ X, data=mydata, tau=0.75)
# ANOVA test for coefficient differences
anova(quantreg25, quantreg75)
Quantile Regression Analysis of Deviance Table
Model: Y ~ X
Joint Test of Equality of Slopes: tau in { 0.25 0.75 }
Df Resid Df F value Pr(>F)
1 5 5905 37.831 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As we can see from the result above, there is a significant difference in the coefficients, and this justify to use the quantile regression. Now, we can plot the data and the coefficinets we found from the quantile regression.
# Plotting data
quantreg.all <- rq(Y ~ X, tau = seq(0.05, 0.95, by = 0.05), data=mydata)
quantreg.plot <- summary(quantreg.all)
plot(quantreg.plot)
We focus our attention on the Xtotchr (total of chronic condition) graph. The red orizontal line is the OLS coefficient, and we can see that the value is exactly the same of what we found before (2528.079). Notice that the OLS line is flat along the quantile in the x-axis, because it cannot vary across the quantiles. Looking at the quantile trend (black curve with grey confidence intervals), we can see that for low quantiles there is a significant difference below OLS. On the contrary, there is a significant difference above OLS for high quantile. Again, we can see from the graph of Xtotchr that there is not a significant difference for the 75th quantile.
Looking at the Xage graph, there is not a significat difference from OLS across the quantiles, except at the last quantile.