By the end of this tutorial you will be able to:
These lab assignments are not graded, but we
encourage you to invest time and effort into working through them from
start to finish. Add your solutions to the
lab-identification-answers.Rmd
file as you work through the
exercises so that you have a record of the work you have done.
Obtain a copy of both the question and answer files using Git. To clone a copy of this repository to your own PC, use the following command:
git clone https://github.com/tisem-digital-marketing/smwa-lab-identification-experiments.git
Once you have your copy, open the answer document in RStudio as an RStudio project and work through the questions.
The goal of the tutorials is to explore how to “do” the technical side of social media analytics. Use this as an opportunity to push your limits and develop new skills. When you are uncertain or do not know what to do next - ask questions of your peers and the instructors on the class Slack workspace.
You will need to load the following R
libraries to
complete the exercises:
library(readr)
library(dplyr)
library(broom)
library(ggplot2)
library(ggthemes)
library(rsample)
library(tidyr)
library(tibble)
library(purrr)
library(car)
library(janitor)
You may need to install some of these if they are not already on your machine.
In this exercise we will work with the same data as in the “The Design of Empirical Research” slides. Recall that we had a data generating process that was defined by the following rules:
The dataset hair.csv
in the data/
directory
contains this data set.
df <-
read_csv("data/hair.csv") %>%
# for snakecase names
clean_names() %>%
# for ease later
mutate(hair = as.factor(hair))
## Rows: 5000 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Hair
## dbl (1): logIncome
## lgl (1): College
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# (a)
df %>%
group_by(hair) %>%
summarise(income = round(mean(log_income) , 2))
## # A tibble: 2 × 2
## hair income
## <fct> <dbl>
## 1 Brown 5.11
## 2 Other Color 5.09
# (b)
df %>%
filter(college) %>%
group_by(hair) %>%
summarise(income = round(mean(log_income) , 2))
## # A tibble: 2 × 2
## hair income
## <fct> <dbl>
## 1 Brown 5.34
## 2 Other Color 5.21
# (c) see lecture notes
Now let’s consider the following linear regression equation:
\[ \text{log_income}_i = \beta_0 + \beta_1 1[\text{Hair Color = Brown}]_i + \varepsilon_i \] where \(1[\text{Hair Color = Brown}]_i\) is a variable that takes the value 1 if an individual has brown hair, and the value 0 otherwise. \(\varepsilon_i\) is the regression error term which has \(E(\varepsilon_i) =0\).
HINT: You may need to modify the hair
variable to make
it a factor variable before running the regression. You may also need to
use the function relevel(variable_name, ref = some_number)
in your code so that the regression returns the coefficients you
need.
mod_1 <- lm(log_income ~ relevel(hair, ref = 2), data = df)
tidy(mod_1)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 5.09 0.0184 277. 0
## 2 relevel(hair, ref = 2)Brown 0.0160 0.0284 0.562 0.574
# note to TA: relate this back algaebraically to conditional means from the regression equation
mod_2 <- lm(log_income ~ relevel(hair, ref = 2), data = df %>% filter(college == TRUE))
tidy(mod_2)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 5.21 0.0274 190. 0
## 2 relevel(hair, ref = 2)Brown 0.133 0.0655 2.02 0.0432
# note to TA: relate this back algaebraically to conditional means from the regression equation
mod_3 <- lm(log_income ~ relevel(hair, ref = 2) + college, data = df)
tidy(mod_3)
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 5.00 0.0229 219. 0
## 2 relevel(hair, ref = 2)Brown 0.0800 0.0299 2.68 7.43e- 3
## 3 collegeTRUE 0.213 0.0323 6.60 4.60e-11
HINTS:
linearHypothesis()
functionmatchCoefs()
functionlinearHypothesis(mod_1, matchCoefs(mod_1, "hair"), rhs = 0.1)
## Linear hypothesis test
##
## Hypothesis:
## relevel(hair, ref = 2)Brown = 0.1
##
## Model 1: restricted model
## Model 2: log_income ~ relevel(hair, ref = 2)
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4999 4916.7
## 2 4998 4908.0 1 8.6102 8.768 0.00308 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
linearHypothesis(mod_2, matchCoefs(mod_2, "hair"), rhs = 0.1)
## Linear hypothesis test
##
## Hypothesis:
## relevel(hair, ref = 2)Brown = 0.1
##
## Model 1: restricted model
## Model 2: log_income ~ relevel(hair, ref = 2)
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1490 1375.2
## 2 1489 1374.9 1 0.22787 0.2468 0.6194
linearHypothesis(mod_3, matchCoefs(mod_3, "hair"), rhs = 0.1)
## Linear hypothesis test
##
## Hypothesis:
## relevel(hair, ref = 2)Brown = 0.1
##
## Model 1: restricted model
## Model 2: log_income ~ relevel(hair, ref = 2) + college
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4998 4866.1
## 2 4997 4865.7 1 0.4347 0.4464 0.5041
“Adding the college indicator as a control variable allows you to use the right variation in the regression model to identify the effect of brown hair on income.”
Explain what they mean by this statement.
What does controlling for college do?
sim_data()
that will simulate one
dataset that obeys the DGP. We’ve sketched out a template for you to
start from.# If you want to run this code when you "knit" the script, replace eval=FALSE with eval=TRUE.
# (a)
sim_data = function(){
df <-
YOUR_CODE_HERE %>%
mutate(hair = as.factor(hair))
return(df)
}
# (b)
set.seed(YOUR_SEED)
all_samples <- tibble::enframe(replicate(n = 1000,
sim_data(),
simplify = FALSE)
)
# (c)
model_output <-
all_samples %>%
mutate(mod_1 = purrr::map(value,
~tidy(lm(log_income ~ relevel(hair, ref = 2), data = .x),
conf.int = TRUE) %>%
filter(stringr::str_detect(term, 'hair'))
),
mod_2 = purrr::map(value,
~tidy(lm(log_income ~ relevel(hair, ref = 2), data = .x %>% filter(college == TRUE)),
conf.int = TRUE) %>%
filter(stringr::str_detect(term, 'hair'))
),
mod_3 = purrr::map(value,
~tidy(lm(log_income ~ relevel(hair, ref = 2) + college, data = .x),
conf.int = TRUE)%>%
filter(stringr::str_detect(term, 'hair'))
)
) %>%
unnest(c(mod_1, mod_2, mod_3),
names_sep = "_")
# (d)
YOUR_CODE %>%
ggplot() +
# model 1 estimates
stat_density(aes(x=YOUR_CODE_HERE), geom = 'line', color = "blue") +
# model 2 estimates
stat_density(aes(x=YOUR_CODE_HERE), geom = 'line', color = "purple") +
# model 3 estimates
stat_density(aes(x=YOUR_CODE_HERE), geom = 'line', color = "orange") +
xlab("YOUR_X_LABEL") +
theme_bw()
# Simulation
# function to simulate data
sim_data = function(){
df <-
tibble(college = runif(5000) < .3) %>%
mutate(hair = case_when(
runif(5000) < .2+.8*.4*(!college) ~ "Brown",
TRUE ~ "Other Color"
),
log_income = .1*(hair == "Brown") +
.2*college + rnorm(5000) + 5
) %>%
mutate(hair = as.factor(hair))
return(df)
}
# Simulate it!
set.seed(42)
all_samples <- tibble::enframe(replicate(n = 1000,
sim_data(),
simplify = FALSE)
)
model_output <-
all_samples %>%
mutate(mod_1 = purrr::map(value,
~tidy(lm(log_income ~ relevel(hair, ref = 2), data = .x),
conf.int = TRUE) %>%
filter(stringr::str_detect(term, 'hair'))
),
mod_2 = purrr::map(value,
~tidy(lm(log_income ~ relevel(hair, ref = 2), data = .x %>% filter(college == TRUE)),
conf.int = TRUE) %>%
filter(stringr::str_detect(term, 'hair'))
),
mod_3 = purrr::map(value,
~tidy(lm(log_income ~ relevel(hair, ref = 2) + college, data = .x),
conf.int = TRUE)%>%
filter(stringr::str_detect(term, 'hair'))
)
) %>%
tidyr::unnest(c(mod_1, mod_2, mod_3),
names_sep = "_")
model_output %>%
ggplot() +
stat_density(aes(x=mod_1_estimate), geom = 'line', color = "blue") +
stat_density(aes(x=mod_2_estimate), geom = 'line', color = "purple") +
stat_density(aes(x=mod_3_estimate), geom = 'line', color = "orange") +
xlab("Coefficient Estimate") +
theme_bw()
You are approached by Bol.com to help them work on a business problem they are facing. They provide the following brief:
Bol.com’s Seller Experience team is looking to expand the number of third party vendors who sell on the site. One problem they are facing is that new vendors have less, and noisier reviews by consumers about their reputation. As a result, third party sellers get less orders and many leave the platform within a few months. The Seller Experience team wants to design an experiment that introduces something new to the platform that helps third party sellers more readily gain reputation when they first join the platform.
Write your answer here
Write your answer here
Write your answer here
Write your answer here
Write your answer here
Write your answer here