Motivation

Linear regression is a workhorse model in a Marketing Analyst’s toolkit. It lets the analyst describe patterns in data, predict the value of marketing metrics, and potentially make causal claims about the relationships between multiple variables.

In this tutorial you will apply linear regression to get first-hand experience with these tools. We will focus both on how to run linear regressions in R and on how to correctly interpret the results. You will use linear regression to evaluate the association between product characteristics and product price in an internet-mediated market.

Learning Goals

By the end of this tutorial you will be able to:

  1. Estimate Single and Multiple Regression models with R.
  2. Interpret regression coefficients.
  3. Discuss likely biases in regression coefficients due to omitted variable bias.
  4. Present regression coefficients in a table and in a plot.

Instructions to Students

These lab assignments are not graded, but we encourage you to invest time and effort into working through them from start to finish. Add your solutions to the lab-regression-answers.Rmd file as you work through the exercises so that you have a record of the work you have done.

Obtain a copy of both the question and answer files using Git. To clone a copy of this repository to your own PC, use the following command:

git clone https://github.com/tisem-digital-marketing/smwa-lab-regression.git

Once you have your copy, open the answer document in RStudio as an RStudio project and work through the questions.

The goal of the tutorials is to explore how to “do” the technical side of social media analytics. Use this as an opportunity to push your limits and develop new skills. When you are uncertain or do not know what to do next - ask questions of your peers and the instructors on the class Slack workspace.

Regression Analysis in the Wild

The advent of the internet and the rise of user-generated content have had a large effect on sex markets. In 2008 and 2009, Scott Cunningham and Todd Kendall surveyed approximately 700 US internet-mediated sex workers. The survey asked about their illicit and legal labor market experiences and their demographics. Part of the survey asked respondents to share information about each of their previous four sessions with clients.

To gain access to the data, run the following code to download it and save it in the file data/sasp_panel.dta:

url <- "https://github.com/scunning1975/mixtape/raw/master/sasp_panel.dta"
# where to save data
out_file <- "data/sasp_panel.dta"
# create the data folder if it does not already exist (harmless if it does)
dir.create("data", showWarnings = FALSE)
# download it!
download.file(url, 
              destfile = out_file, 
              mode = "wb"
              )

The data include the log hourly price, the log of the session length (in hours), characteristics of the client (such as whether he was a regular), whether a condom was used, and some characteristics of the provider (such as their race, marital status and education level). The goal of this exercise is to estimate the price premium of unsafe sex and think through any bias in the coefficients within the regression models we estimate.

You might need to use the following R libraries throughout this exercise:1

library(haven)        # to read stata datasets
library(dplyr)        # data manipulation
library(tidyr)        # drop_na() and other tidying helpers
library(broom)        # tidy() summaries of model output
library(ggplot2)      # plotting
library(modelsummary) # regression tables and coefficient plots
  1. Load the data. The data is stored as a Stata dataset, so it can be loaded with the read_dta() function from the haven package.
sasp <- read_dta('data/sasp_panel.dta')
  2. Some rows of the data have missing values. Let’s drop these.2 Write a short command to drop any rows which have missing values from the data.
sasp <- 
    sasp %>%
    drop_na()
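
If you want to see how many rows this removes before committing to it (see the footnote above on dropping data), a quick count can help. This is an optional sketch; it re-reads the raw file into a hypothetical sasp_raw object so the comparison does not depend on the order in which you run the chunks.

# optional check (a sketch): how many rows contain at least one missing value?
sasp_raw <- read_dta('data/sasp_panel.dta')
sum(!complete.cases(sasp_raw))
# should match the number of rows removed by drop_na()
nrow(sasp_raw) - nrow(sasp)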

As mentioned above, the focus for the rest of this exercise is the price premium for unprotected sex. In the sasp data, there is a variable lnw which is the log of the hourly wage and a variable unsafe which takes the value 1 if there was unsafe sex during the client’s appointment and 0 otherwise.

  3. Produce a diagram that plots a histogram of log hourly wage, lnw, for sessions featuring either unsafe or safe sex. Your plot should therefore have two histograms, potentially overlaying each other. Does there appear to be a difference in price between safe and unsafe sex?
sasp %>%
    ggplot(aes(x = lnw, fill = factor(unsafe))) +
    # I plot proportions, you don't need to
    geom_histogram(aes(y = after_stat(count / sum(count))),
                   bins = 30,
                   alpha = 0.6) +
    scale_fill_manual(values = c("#69b3a2", "#404080")) +
    ylab("Fraction of Bookings") +
    xlab("Log Hourly Wage") +
    theme_bw()
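If the overlapping histograms are hard to read, an overlaid density plot is a common alternative. The sketch below uses only ggplot2 functions already loaded above; the colour choices simply mirror the histogram.

sasp %>%
    ggplot(aes(x = lnw, fill = factor(unsafe))) +
    geom_density(alpha = 0.5) +
    scale_fill_manual(values = c("#69b3a2", "#404080")) +
    ylab("Density") +
    xlab("Log Hourly Wage") +
    theme_bw()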
  4. Let’s formalize this idea with a regression. Run a single variable regression of log hourly wage, lnw, on the variable unsafe. Report the results.
simple_reg <- lm(lnw ~ unsafe, data = sasp)

tidy(simple_reg, conf.int = TRUE)
## # A tibble: 2 × 7
##   term        estimate std.error statistic p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
## 1 (Intercept)   5.63      0.0199    283.     0       5.60      5.67  
## 2 unsafe       -0.0351    0.0273     -1.29   0.198  -0.0886    0.0184
  5. Interpret the coefficient on unsafe. Is it statistically significant?

Interpretation (1): On average, unsafe sex is associated with a decrease in log hourly wage of 0.035.

Interpretation (2): On average, unsafe sex is associated with a decrease in the hourly wage of approximately 3.5 percent.

Interpretation 2 uses the log-level interpretation of the regression. Technically, the size of the effect is \((\exp(\hat{\beta}_1) - 1) \times 100\) percent; for small values of \(\hat{\beta}_1\), \(\exp(\hat{\beta}_1) - 1 \approx \hat{\beta}_1\).
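
As a quick check on the approximation, you can compute the exact percentage effect implied by the point estimate (about −3.4 percent here, versus the −3.5 percent approximation). A minimal sketch, assuming simple_reg from above is still in memory:

# exact log-level effect implied by the estimated coefficient
b1 <- coef(simple_reg)[["unsafe"]]
(exp(b1) - 1) * 100
# compare with the approximation 100 * b1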

Statistical Significance: the p-value is 0.198 \(>\) 0.05, so the effect is not statistically significant at the 5 percent level of significance.
  6. A single variable regression most likely suffers from omitted variable bias. Explain what omitted variable bias is, and why it might impact your regression estimates.

Omitted Variable Bias: the bias in the regression coefficients of the “misspecified” regression that arises from leaving out one or more relevant variables.

For omitted variable bias to occur we need:

  1. The included \(X\) variable(s) to be correlated with the omitted variable
  2. The omitted variable to be a relevant determinant of \(y\)

(1) and (2) together lead to a violation of the exogeneity assumption \(E(u_i | x_i) = 0\). When we don’t have exogeneity,

\[ E[\hat{\beta}] = \beta + \text{bias} \]

which means that our estimated coefficient cannot accurately estimate the true population parameter, and thus can’t be interpreted causally.
  7. Add the log of the length of the session, llength, as a second variable to your regression. Report the results. Did the coefficient on unsafe change?
twovar_reg <- lm(lnw ~ unsafe + llength, data = sasp)

tidy(twovar_reg, conf.int = TRUE)
## # A tibble: 3 × 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)  6.74       0.0730    92.3   0          6.59      6.88  
## 2 unsafe       0.00296    0.0254     0.117 9.07e- 1  -0.0469    0.0528
## 3 llength     -0.265      0.0169   -15.6   5.13e-51  -0.298    -0.231
  8. Explain why ignoring llength in your regression led the coefficient on unsafe to differ in sign between the single variable regression and the two variable regression.

The formula for omitted variable bias (assuming the omitted variable \(x_2\) has coefficient \(\beta_2\)) is:

\[ E[\hat{\beta}_1] = \beta_1 + \beta_2 \frac{\text{Cov}(x_1, x_2)}{\text{Var}(x_1)} \]

One would reason that:

  • \(\beta_2 < 0\) … longer sessions lead to quantity discounts
  • \(\text{Cov}(x_1, x_2) > 0\) … longer sessions are more likely to feature unsafe sex

\(\implies\) bias is negative, so that

\[ \begin{aligned} E[\hat{\beta_1}] &= \beta_1 + \text{something negative} \\ &< \beta_1 \end{aligned} \]

Remark: In case the notation above was not clear:

  • \(\text{Cov}(x_1, x_2)\) is the covariance between \(x_1\) and \(x_2\)
  • \(\text{Var}(x_1)\) is the variance of \(x_1\)
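
You can also verify this decomposition numerically with the objects estimated above: the short-regression coefficient equals the long-regression coefficient plus the llength coefficient times the slope from an auxiliary regression of llength on unsafe. A minimal sketch (it assumes simple_reg and twovar_reg are still in memory and introduces an auxiliary object aux_reg):

# auxiliary regression of the omitted variable on the included one
aux_reg <- lm(llength ~ unsafe, data = sasp)
delta <- coef(aux_reg)[["unsafe"]]
# short-regression coefficient ...
coef(simple_reg)[["unsafe"]]
# ... equals long-regression coefficient plus beta_llength * delta
coef(twovar_reg)[["unsafe"]] + coef(twovar_reg)[["llength"]] * delta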
  9. Add a third variable to the regression, whether the client is a regular or not (reg in the data). Report your results and comment on any change in the regression estimate of unsafe.
threevar_reg <- lm(lnw ~ unsafe + reg + llength, data = sasp)

tidy(threevar_reg, conf.int = TRUE)
## # A tibble: 4 × 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)  6.75       0.0731    92.3   0          6.61      6.89  
## 2 unsafe       0.00418    0.0254     0.165 8.69e- 1  -0.0456    0.0539
## 3 reg         -0.0608     0.0256    -2.38  1.75e- 2  -0.111    -0.0107
## 4 llength     -0.260      0.0170   -15.3   4.62e-49  -0.293    -0.227
# I'll leave the comments out...

Marketers are generally interested in whether effects they find are heterogeneous, i.e. whether the reported coefficients vary across different observable characteristics.

  10. Estimate a regression model that allows the price effect of unsafe sex to differ between customers who are regulars and those who aren’t. Do this by modifying your regression command from (9). Report your results and discuss your findings.
het_reg <- lm(lnw ~ unsafe + unsafe:reg + reg + llength, data = sasp)

tidy(het_reg, conf.int = TRUE)
## # A tibble: 5 × 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)   6.74      0.0747    90.2   0          6.59      6.88  
## 2 unsafe        0.0320    0.0382     0.837 4.03e- 1  -0.0430    0.107 
## 3 reg          -0.0346    0.0372    -0.930 3.53e- 1  -0.107     0.0383
## 4 llength      -0.260     0.0170   -15.3   5.05e-49  -0.293    -0.227 
## 5 unsafe:reg   -0.0495    0.0509    -0.972 3.31e- 1  -0.149     0.0504
# I'll leave the comments out...
  11. Interpret the results you found in (10).

First, notice that neither unsafe nor unsafe:reg is statistically significant, so we do not find strong evidence that the price effect differs between regulars and non-regulars.

If we want to take the point estimates seriously (ignoring the standard errors, purely for the sake of interpretation practice), there seems to be evidence of price discrimination. Providers do not charge a price premium for unsafe sex with clients who are regulars. A potential reason could be that regulars are perceived to be less risky in terms of carrying STIs. I wouldn’t want to push this argument too hard.
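
To see where that reading comes from, you can combine the point estimates directly (roughly 0.03 for non-regulars and −0.02 for regulars with the estimates above). A small sketch, ignoring the standard errors and assuming het_reg from above is in memory:

b <- coef(het_reg)
# implied "unsafe" effect for non-regular clients
b[["unsafe"]]
# implied "unsafe" effect for regular clients
b[["unsafe"]] + b[["unsafe:reg"]]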

  12. Are the effects you documented causal, descriptive or predictive? Explain your answer.

Descriptive at best.

Why not causal:

  • Omitted variable bias
  • Lack of randomization

There’s actually a very important omitted variable in the regressions we have run, which is a “fixed effect” for the provider. This boils down to dummy variables for each provider. We’ll discuss these in detail in the coming weeks.

Why does this matter:

  • Think about omitted variable bias.
  • Providers have different willingness to engage in unsafe sex.
  • These providers also likely price differently.
  • This leads to a negative bias in our estimate (the formula for OVB with fixed effects is messier, so it is omitted here).
fe_reg <- lm(lnw ~ unsafe*reg + llength + as.factor(id), data = sasp)

tidy(fe_reg, conf.int = TRUE)
## # A tibble: 464 × 7
##    term            estimate std.error statistic   p.value conf.low conf.high
##    <chr>              <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
##  1 (Intercept)       7.56      0.159     47.6   5.73e-263   7.25      7.88  
##  2 unsafe            0.0218    0.0345     0.633 5.27e-  1  -0.0458    0.0894
##  3 reg              -0.0680    0.0258    -2.63  8.57e-  3  -0.119    -0.0173
##  4 llength          -0.428     0.0138   -31.0   1.39e-149  -0.455    -0.401 
##  5 as.factor(id)6   -0.532     0.191     -2.78  5.50e-  3  -0.906    -0.157 
##  6 as.factor(id)7    0.0248    0.227      0.109 9.13e-  1  -0.421     0.471 
##  7 as.factor(id)8   -0.294     0.190     -1.55  1.21e-  1  -0.667     0.0782
##  8 as.factor(id)9    0.709     0.289      2.46  1.41e-  2   0.143     1.28  
##  9 as.factor(id)10   0.354     0.190      1.86  6.26e-  2  -0.0187    0.727 
## 10 as.factor(id)11  -0.471     0.190     -2.48  1.33e-  2  -0.843    -0.0985
## # ℹ 454 more rows
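The coefficients of interest are easy to lose among the several hundred provider dummies (the unsafe:reg interaction is not even visible in the rows printed above). A quick way to pull them out, sketched with the dplyr and broom functions already loaded:

tidy(fe_reg, conf.int = TRUE) %>%
    filter(term %in% c("unsafe", "reg", "llength", "unsafe:reg"))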
If we want to take the point estimates seriously (ignoring the standard errors, purely for the sake of interpretation practice), there again seems to be evidence of price discrimination, but in the other direction. Providers charge a higher price for unsafe sex with clients who are regulars than with those who aren’t. A potential reason could be that regulars are less likely to switch to a different provider, so they’re taken advantage of and charged a higher premium. I wouldn’t want to push this argument too hard.

Now that you have run a series of regressions, you want to present the results in a way that you could use in a report or a presentation.

Hint: You will want to use functions from the modelsummary package. We recommend reading the vignette to get started with using the functions.

  13. Take your regression estimates and produce a regression table to summarize your results in one place. You can choose any of the estimates you like to produce the table, but we encourage you to think about how each column adds something to a story you could tell to explain your findings. The final result should look similar to a regression table you see in academic publications.
# a simple table - minimum customization
mods <- list(
  simple_reg,
  twovar_reg,
  threevar_reg,
  het_reg
)

msummary(mods,
         coef_omit = "Interc",
         gof_omit = "AIC|BIC|Log|Pseudo|F"
         )
              (1)       (2)       (3)       (4)
unsafe        −0.035    0.003     0.004     0.032
              (0.027)   (0.025)   (0.025)   (0.038)
llength                 −0.265    −0.260    −0.260
                        (0.017)   (0.017)   (0.017)
reg                               −0.061    −0.035
                                  (0.026)   (0.037)
unsafe × reg                                −0.049
                                            (0.051)
Num.Obs.      1499      1499      1499      1499
R2            0.001     0.141     0.144     0.145
R2 Adj.       0.000     0.140     0.143     0.143
RMSE          0.53      0.49      0.49      0.49
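
For a report you would usually also rename the coefficients to something readable and perhaps add significance stars. A sketch of one way to do this, using arguments documented in the modelsummary vignette (check your installed version if an argument is not recognised):

msummary(mods,
         coef_rename = c("unsafe" = "Unsafe sex",
                         "llength" = "Log session length",
                         "reg" = "Regular client",
                         "unsafe:reg" = "Unsafe sex x Regular client"),
         coef_omit = "Interc",
         gof_omit = "AIC|BIC|Log|Pseudo|F",
         stars = TRUE)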
  14. Take your regression estimates and produce a coefficient plot to summarize them in one place. You can choose any of the estimates you like to produce the plot, but we encourage you to think about how the plot you produce can be used as part of a story you could tell to explain your findings.
modelplot(mods,
          coef_omit = "Interc|^reg|ll") +
  geom_vline(xintercept = 0, 
             alpha = 0.5, 
             linetype = "dashed") +
  xlab("Coefficient Estimate + 95% CI") +
  coord_flip() +
  theme_bw()
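
If you want more control than modelplot() gives you, the same kind of plot can be built by hand from tidy() output using the dplyr, broom and ggplot2 functions already loaded. A sketch, restricted to the unsafe coefficient from the first and last models estimated above:

bind_rows(
    tidy(simple_reg, conf.int = TRUE) %>% mutate(model = "(1) Single variable"),
    tidy(het_reg, conf.int = TRUE) %>% mutate(model = "(4) With interaction")
) %>%
    filter(term == "unsafe") %>%
    ggplot(aes(x = model, y = estimate, ymin = conf.low, ymax = conf.high)) +
    geom_pointrange() +
    geom_hline(yintercept = 0, linetype = "dashed", alpha = 0.5) +
    ylab("Coefficient on unsafe + 95% CI") +
    xlab("") +
    theme_bw()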

  1. If you haven’t installed one or more of these packages, do so by entering install.packages("PKG_NAME") into the R console and pressing ENTER.↩︎

  2. Generally, we need to be quite careful when we make decisions about dropping rows of data, and think through what the consequences of it might be. We’ve not done this here because our goal was to illustrate how to estimate and interpret regression estimates, but we would encourage you to be careful when you do this in your own work. At a minimum, you should mention why you’ve dropped rows, and whether there is likely to be selection bias in your subsequent results.↩︎