Linear regression is a workhorse model in a marketing analyst's toolkit: it lets the analyst describe patterns in data, predict the values of marketing metrics, and, under the right conditions, make causal claims about the relationships between variables.
In this tutorial you will apply linear regression to get first-hand experience with these tools. We will focus both on how to run linear regressions in R and on how to correctly interpret the results.
You will use linear regression to evaluate the association between
product characteristics and product price in an internet mediated
market.
By the end of this tutorial you will be able to:
These lab assignments are not graded, but we
encourage you to invest time and effort into working through them from
start to finish. Add your solutions to the
lab-regression-answers.Rmd
file as you work through the
exercises so that you have a record of the work you have done.
Obtain a copy of both the question and answer files using Git. To clone a copy of this repository to your own PC, use the following command:
git clone https://github.com/tisem-digital-marketing/smwa-lab-regression.git
Once you have your copy, open the answer document in RStudio as an RStudio project and work through the questions.
The goal of the tutorials is to explore how to “do” the technical side of social media analytics. Use this as an opportunity to push your limits and develop new skills. When you are uncertain or do not know what to do next - ask questions of your peers and the instructors on the class Slack workspace.
The advent of the internet, and the rise in user generated content has had a large effect on sex markets. In 2008 and 2009, Scott Cunningham and Todd Kendall surveyed approximately 700 US internet mediated sex workers. The questions they asked included information about their illicit and legal labor market experiences and their demographics. Part of the survey asked respondents to share information about each of the previous four sessions with clients.
To gain access to the data, run the following code to download it and save it in the file data/sasp_panel.dta:
url <- "https://github.com/scunning1975/mixtape/raw/master/sasp_panel.dta"
# where to save data
out_file <- "data/sasp_panel.dta"
# download it!
download.file(url,
destfile = out_file,
mode = "wb"
)
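Note that download.file() will fail if the data/ folder does not yet exist. A defensive sketch of the snippet above (the only additions are the dir.create() call and the file.exists() guard; url and out_file are as defined above):

```r
# make sure the target directory exists (quiet if it already does)
dir.create("data", showWarnings = FALSE)

# skip the download if we already have a local copy
if (!file.exists(out_file)) {
  download.file(url,
                destfile = out_file,
                mode = "wb"
                )
}
```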
The data include the log hourly price, the log of the session length (in hours), characteristics of the client (such as whether he was a regular), whether a condom was used, and some characteristics of the provider (such as their race, marital status and education level). The goal of this exercise is to estimate the price premium of unsafe sex and think through any bias in the coefficients within the regression models we estimate.
You might need to use the following R libraries throughout this exercise:1
library(haven) # to read stata datasets
library(dplyr)
library(tidyr)
library(broom)
library(ggplot2)
library(modelsummary)
Load the data into R. The data are stored as a Stata dataset, so you will need the read_dta() function from the haven package.

sasp <- read_dta('data/sasp_panel.dta')
sasp <-
sasp %>%
drop_na()
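drop_na() performs listwise deletion, silently discarding every row with at least one missing value. Before doing this, it is worth checking how many observations you lose. A sketch (run before the drop_na() call above):

```r
# rows in the raw data vs rows with no missing values
n_total    <- nrow(sasp)
n_complete <- sum(complete.cases(sasp))
n_total - n_complete   # number of rows drop_na() will discard
```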
As mentioned above, the focus for the rest of this exercise is the
price premium for unprotected sex. In the sasp
data, there
is a variable lnw
which is the log of the hourly wage and a
variable unsafe
which takes the value 1 if there was unsafe
sex during the client’s appointment and 0 otherwise.
Plot a histogram of lnw for sessions featuring either unsafe or safe sex. Your plot should therefore have two histograms, potentially overlaying each other. Does there appear to be a difference in price between safe and unsafe sex?

sasp %>%
ggplot(aes(x = lnw, fill = factor(unsafe))) +
# I plot proportions, you don't need to
geom_histogram(aes(y = after_stat(count / sum(count))), alpha = 0.6) +
scale_fill_manual(values = c("#69b3a2", "#404080")) +
ylab("Fraction of Bookings") +
xlab("Log Hourly Wage") +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Run a regression of lnw on the variable unsafe. Report the results.

simple_reg <- lm(lnw ~ unsafe, data = sasp)
tidy(simple_reg, conf.int = TRUE)
## # A tibble: 2 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 5.63 0.0199 283. 0 5.60 5.67
## 2 unsafe -0.0351 0.0273 -1.29 0.198 -0.0886 0.0184
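As a side note, with a single binary regressor OLS simply reproduces group means: the intercept is the average lnw for safe sessions, and the slope is the difference in means between unsafe and safe sessions. A quick sketch to verify this equivalence:

```r
# compare group means of lnw to the regression output above
sasp %>%
  group_by(unsafe) %>%
  summarise(mean_lnw = mean(lnw))
# the gap between the two group means should equal the
# coefficient on `unsafe` (about -0.035)
```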
Interpret the coefficient on unsafe. Is it statistically significant?

Interpretation (1): On average, unsafe sex is associated with a decrease in log hourly wage of 0.035.
Interpretation (2): On average, unsafe sex is associated with a decrease in the hourly wage of approximately 3.5 percent.
Interpretation 2 utilizes the log-level interpretation of the regression. Technically, the size of the effect is \((\exp(\hat{\beta}_1) - 1) \times 100\) percent; for small values of \(\hat{\beta}_1\), \(\exp(\hat{\beta}_1) - 1 \approx \hat{\beta}_1\).
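The difference between the approximate and exact readings is easy to check in the console (using the estimate reported above):

```r
b1 <- -0.0351

# approximate log-level reading: beta * 100 percent
b1 * 100                 # -3.51

# exact transformation: (exp(beta) - 1) * 100 percent
(exp(b1) - 1) * 100      # about -3.45: close, since b1 is small
```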
Statistical Significance: the p-value is 0.198 \(>\) 0.05, so the effect is not statistically significant at the 5 percent level of significance.

Omitted Variable Bias: the effect of leaving out one or more relevant variables on the regression coefficients in the “misspecified” regression.
For omitted variable bias to occur we need:
\[ E[\hat{\beta}] = \beta + \text{bias} \]
which means that our estimated coefficient cannot accurately estimate the true population parameter, and thus cannot be interpreted causally.

Add the log of the session length, llength, as a second variable to your regression. Report the results. Did the coefficient on unsafe change?

twovar_reg <- lm(lnw ~ unsafe + llength, data = sasp)
tidy(twovar_reg, conf.int = TRUE)
## # A tibble: 3 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 6.74 0.0730 92.3 0 6.59 6.88
## 2 unsafe 0.00296 0.0254 0.117 9.07e- 1 -0.0469 0.0528
## 3 llength -0.265 0.0169 -15.6 5.13e-51 -0.298 -0.231
Explain why the inclusion of llength in your regression led the coefficient on unsafe to differ in sign between the single variable regression and the two variable regression.

The formula for omitted variable bias (assuming the omitted variable \(x_2\) has coefficient \(\beta_2\)) is:
\[ E[\hat{\beta_1}] = \beta_1 + \beta_2 \frac{\text{Cov}(x_1, x_2)}{\text{Var}(x_1)} \]
One would reason that unsafe sex is more likely in longer sessions, so \(\text{Cov}(x_1, x_2) > 0\), while longer sessions command a lower hourly wage, so \(\beta_2 < 0\)
\(\implies\) bias is negative, so that
\[ \begin{aligned} E[\hat{\beta_1}] &= \beta_1 + \text{something negative} \\ &< \beta_1 \end{aligned} \]
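You can check the empirical ingredient of this argument directly: the sign of \(\text{Cov}(x_1, x_2)\) is the sign of the slope from an auxiliary regression of the omitted variable on the included one. A sketch (run after loading sasp as above):

```r
# auxiliary regression of the omitted variable (llength) on the
# included variable (unsafe); the sign of the slope is the sign
# of Cov(unsafe, llength)
aux <- lm(llength ~ unsafe, data = sasp)
coef(aux)["unsafe"]

# combined with the negative coefficient on llength in the long
# regression, a positive slope here implies a negative bias on
# `unsafe` in the short regression
```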
Remark: in case it was not clear from above, once we control for session length, the negative association between unsafe sex and the log hourly wage disappears, which is consistent with a downward bias in the single variable regression.
Add a third variable: whether the client was a regular (reg in the data). Report your results and comment on any change in the regression estimate of unsafe.

threevar_reg <- lm(lnw ~ unsafe + reg + llength, data = sasp)
tidy(threevar_reg, conf.int = TRUE)
## # A tibble: 4 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 6.75 0.0731 92.3 0 6.61 6.89
## 2 unsafe 0.00418 0.0254 0.165 8.69e- 1 -0.0456 0.0539
## 3 reg -0.0608 0.0256 -2.38 1.75e- 2 -0.111 -0.0107
## 4 llength -0.260 0.0170 -15.3 4.62e-49 -0.293 -0.227
# I'll leave the comments out...
Marketers are generally interested in whether effects they find are heterogeneous, i.e. whether the reported coefficients vary across different observable characteristics.
het_reg <- lm(lnw ~ unsafe + unsafe:reg + reg + llength, data = sasp)
tidy(het_reg, conf.int = TRUE)
## # A tibble: 5 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 6.74 0.0747 90.2 0 6.59 6.88
## 2 unsafe 0.0320 0.0382 0.837 4.03e- 1 -0.0430 0.107
## 3 reg -0.0346 0.0372 -0.930 3.53e- 1 -0.107 0.0383
## 4 llength -0.260 0.0170 -15.3 5.05e-49 -0.293 -0.227
## 5 unsafe:reg -0.0495 0.0509 -0.972 3.31e- 1 -0.149 0.0504
# I'll leave the comments out...
First, notice that neither of the terms unsafe nor unsafe:reg is statistically significant, so we do not find overwhelming evidence of differences.
If we want to take the point estimates seriously (ignoring the standard errors, purely for the sake of interpretation practice), we would see that there seems to be evidence of price discrimination. Providers do not charge a price premium for unsafe sex with clients who are regulars. A potential reason could be that regulars are perceived to be less risky in terms of carrying STIs. I wouldn't want to push this argument too hard.
Descriptive at best.
Why not causal:
* Omitted variable bias
* Lack of randomization
There’s actually a very important omitted variable in the regressions we have run, which is a “fixed effect” for the provider. This boils down to dummy variables for each provider. We’ll discuss these in detail in the coming weeks.
Why does this matter:
* Think about omitted variable bias
* Providers have different willingness to engage in unsafe sex
* These providers also likely price differently
* This leads to a negative bias in our estimate (the formula for OVB with fixed effects is messier, so omitted)

fe_reg <- lm(lnw ~ unsafe*reg + llength + as.factor(id), data = sasp)
tidy(fe_reg, conf.int = TRUE)
## # A tibble: 464 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 7.56 0.159 47.6 5.73e-263 7.25 7.88
## 2 unsafe 0.0218 0.0345 0.633 5.27e- 1 -0.0458 0.0894
## 3 reg -0.0680 0.0258 -2.63 8.57e- 3 -0.119 -0.0173
## 4 llength -0.428 0.0138 -31.0 1.39e-149 -0.455 -0.401
## 5 as.factor(id)6 -0.532 0.191 -2.78 5.50e- 3 -0.906 -0.157
## 6 as.factor(id)7 0.0248 0.227 0.109 9.13e- 1 -0.421 0.471
## 7 as.factor(id)8 -0.294 0.190 -1.55 1.21e- 1 -0.667 0.0782
## 8 as.factor(id)9 0.709 0.289 2.46 1.41e- 2 0.143 1.28
## 9 as.factor(id)10 0.354 0.190 1.86 6.26e- 2 -0.0187 0.727
## 10 as.factor(id)11 -0.471 0.190 -2.48 1.33e- 2 -0.843 -0.0985
## # ℹ 454 more rows
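With hundreds of provider dummies, the lm() output above becomes unwieldy. Packages built for fixed effects absorb the dummies rather than reporting each one; for example, the fixest package (an assumption: it is not loaded above, so install it separately) estimates the same model with:

```r
library(fixest)

# absorb provider fixed effects after the `|` instead of
# estimating ~460 dummy coefficients explicitly
fe_reg2 <- feols(lnw ~ unsafe * reg + llength | id, data = sasp)
summary(fe_reg2)
```

The coefficients on unsafe, reg, and llength should match the lm() results, without the clutter of the as.factor(id) terms.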
Now that you have run a series of regressions, you want to present the results in a way that you could use in a report or a presentation.
Hint: You will want to use functions from the
modelsummary
package. We recommend reading the vignette to get started with using
the functions.
# a simple table - minimum customization
mods <- list(
simple_reg,
twovar_reg,
threevar_reg,
het_reg
)
msummary(mods,
coef_omit = "Interc",
gof_omit = "AIC|BIC|Log|Pseudo|F"
)
|              | (1)     | (2)     | (3)     | (4)     |
|--------------|---------|---------|---------|---------|
| unsafe       | −0.035  | 0.003   | 0.004   | 0.032   |
|              | (0.027) | (0.025) | (0.025) | (0.038) |
| llength      |         | −0.265  | −0.260  | −0.260  |
|              |         | (0.017) | (0.017) | (0.017) |
| reg          |         |         | −0.061  | −0.035  |
|              |         |         | (0.026) | (0.037) |
| unsafe × reg |         |         |         | −0.049  |
|              |         |         |         | (0.051) |
| Num.Obs.     | 1499    | 1499    | 1499    | 1499    |
| R2           | 0.001   | 0.141   | 0.144   | 0.145   |
| R2 Adj.      | 0.000   | 0.140   | 0.143   | 0.143   |
| RMSE         | 0.53    | 0.49    | 0.49    | 0.49    |
modelplot(mods,
coef_omit = "Interc|^reg|ll") +
geom_vline(xintercept = 0,
alpha = 0.5,
linetype = "dashed") +
xlab("Coefficient Estimate + 95% CI") +
coord_flip() +
theme_bw()
If you haven’t installed one or more of these packages, do so by entering install.packages("PKG_NAME") into the R console and pressing ENTER.
Generally, we need to be quite careful when we make decisions about dropping rows of data, and think through what the consequences of it might be. We’ve not done this here because our goal was to illustrate how to estimate and interpret regression estimates, but we would encourage you to be careful when you do this in your own work. At a minimum, you should mention why you’ve dropped rows, and whether there is likely to be selection bias in your subsequent results.