This exercise studies online reputation management through firms' use
of public comments in response to online reviews. The content is
based on the article “Online reputation management: Estimating the
impact of management responses on consumer reviews” by Proserpio and
Zervas. The article was published in Marketing Science in 2017 and is
available in our course readings. You should read through the article
before answering these questions. The paper investigates the
relationship between a hotel’s use of management responses and its
online reputation (measured by star rating, stars) and seeks to
establish a causal relationship from the use of management responses to
online reputation. Your goal in this exercise is to explain key
arguments and replicate selected results from this paper. The data for
this exercise is located at data/responses.dta. [1]
For this exercise you might need the following packages:
library(haven)
library(dplyr)
library(tidyr)
library(fixest)
library(purrr)
library(broom)
library(modelsummary)
Key points:
Load the data and store it as hotels_orm. For this exercise you will only need the rows where xplatform_dd_obs = 1. Keep only the columns hotel_id, year, stars, after, ta_dummy, first_response, cum_avg_stars_lag, log_count_reviews_lag, t, ash_interval, traveler_type.
hotels_orm <-
# 0.5 pt
read_stata("data/responses.dta") %>%
# 1 pt
filter(xplatform_dd_obs == 1) %>%
# 0.5pt
select(hotel_id, year, stars, after,
first_response, cum_avg_stars_lag,
log_count_reviews_lag, ta_dummy,
t, ash_interval, traveler_type
)
Key points:
Create a data frame in which the rows take the values of first_response = 0 or 1, and the columns take the values of ta_dummy = 0 or 1. The values in the data frame should be the respective group means of stars.
Then compute the difference in mean stars between first_response = 1 and first_response = 0 for each of ta_dummy = 0 and ta_dummy = 1.
# answers (a) - 1 pt
pvt_tbl <- hotels_orm %>%
group_by(ta_dummy, first_response) %>%
summarize(stars = mean(stars)) %>%
pivot_wider(names_from = ta_dummy, values_from = stars)
## `summarise()` has grouped output by 'ta_dummy'. You can override using the
## `.groups` argument.
pvt_tbl %>%
# answers (b) - 1 pt
# across() replaces the deprecated mutate_all(funs(...)) idiom
mutate(across(everything(), ~ .x - lag(.x))) %>%
# answers (c) - 1pt
mutate(did_simple = `1` - `0`) %>%
# prints the answer
na.omit() %>%
select(did_simple)
## # A tibble: 1 × 1
## did_simple
## <dbl>
## 1 0.302
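The table of means above is the whole calculation, and the same number can be recovered from a regression with an interaction term. The sketch below illustrates this on a small made-up data frame in base R. The object toy and its values are hypothetical and are not taken from responses.dta. Only the column names are borrowed from the exercise.

```r
# Toy review-level data (made-up numbers, for illustration only).
toy <- data.frame(
  ta_dummy       = c(0, 0, 0, 0, 1, 1, 1, 1),
  first_response = c(0, 0, 1, 1, 0, 0, 1, 1),
  stars          = c(4, 5, 4, 4, 3, 4, 5, 4)
)

# Four cell means: rows are first_response, columns are ta_dummy.
cell_means <- tapply(toy$stars, list(toy$first_response, toy$ta_dummy), mean)

# Before/after difference within each platform,
# then the difference of those differences across platforms.
diff_by_platform <- cell_means["1", ] - cell_means["0", ]
did_simple <- unname(diff_by_platform["1"] - diff_by_platform["0"])

# The same quantity as the interaction coefficient of a saturated OLS model.
fit <- lm(stars ~ first_response * ta_dummy, data = toy)
did_ols <- unname(coef(fit)["first_response:ta_dummy"])

print(c(did_simple = did_simple, did_ols = did_ols))
```

With only two binary regressors and their interaction the model is saturated, so the fitted values reproduce the four cell means and the interaction coefficient matches the difference in differences up to floating-point error.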
The authors use the following regression equation to obtain the difference-in-differences estimate of the effect of management responses on online reputation:
\(Stars_{ijt} = \beta_1 After_{ijt} + \beta_2 TripAdvisor_{ij} + \delta After_{ijt} \times TripAdvisor_{ij} + X_{ijt}\gamma + \alpha_j + \tau_t + \varepsilon_{ijt}\)
where \(Stars_{ijt}\) is the star rating of review \(i\) for hotel \(j\) in calendar month \(t\), \(After_{ijt}\) is an indicator for reviews (on either platform) submitted after hotel \(j\) started responding, \(TripAdvisor_{ij}\) is an indicator for TripAdvisor ratings, \(X_{ijt}\) are control variables, \(\alpha_j\) are hotel fixed effects, \(\tau_t\) are calendar-month fixed effects and \(\varepsilon_{ijt}\) is the error term.
The relationship between the variables in the equation above and the variables in the dataset is: [2]
- \(After_{ijt}\) is first_response,
- \(TripAdvisor_{ij}\) is ta_dummy, and
- \(After_{ijt} \times TripAdvisor_{ij}\) is after.
Key points
[1 pt]
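The easiest way to do this is to compute the four means using the regression
model (it is OK to ignore X and simplify the time FE), build the DiD table, and
show that \(\delta\) is the difference in differences.
Without loss of generality, set X, \(\alpha_j\) and \(\tau_t\) to zero for all hotels and all time periods.
[2 pts] Then the before/after difference in expected stars on TripAdvisor is \(\beta_1 + \delta\), and the before/after difference on the other platform is \(\beta_1\).
Now if you subtract the latter from the former you get \(\delta\).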
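The steps above can be written out cell by cell. This is a sketch under the simplification already made (X, \(\alpha_j\) and \(\tau_t\) set to zero) plus the assumption that the error term has mean zero in every cell. The conditional-expectation notation is introduced here for illustration and is not taken from the paper.

\[
\begin{aligned}
E[Stars \mid After = 1, TripAdvisor = 1] &= \beta_1 + \beta_2 + \delta \\
E[Stars \mid After = 0, TripAdvisor = 1] &= \beta_2 \\
E[Stars \mid After = 1, TripAdvisor = 0] &= \beta_1 \\
E[Stars \mid After = 0, TripAdvisor = 0] &= 0
\end{aligned}
\]

\[
\underbrace{(\beta_1 + \beta_2 + \delta) - \beta_2}_{\text{TripAdvisor: after minus before}}
\; - \;
\underbrace{(\beta_1 - 0)}_{\text{other platform: after minus before}}
= (\beta_1 + \delta) - \beta_1 = \delta
\]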
NOTE: Wordy versions of this are OK if they completely articulate the argument. If not, deduct points for each missed part of the argument.
Estimate the regression equation above using the fixest package. In particular, produce four regression models:
- model_1 should be the regression equivalent of the simple DiD in 6 using the whole dataset. This estimate is not in Table 4 of Proserpio and Zervas.
- model_2 should extend model_1 by adding the fixed effects, and should use the whole dataset.
- model_3 should be the same estimating equation as model_2 but correct for the Ashenfelter dip.
- model_4 should augment model_2 by adding the variables cum_avg_stars_lag and log_count_reviews_lag to \(X_{ijt}\), and correct for the Ashenfelter dip.
For each model, standard errors should be clustered by hotel_id.
Use the following starter code for estimating each regression model:
model_x <- feols(YOUR_CODE ~ YOUR_CODE +
# t:ta_dummy is the platform specific linear time
# trend they mention in the notes of table 4
# you need this to get their estimates in models 2 thru 4
t:ta_dummy
|
# put any additional fixed effects here (if you need them)
YOUR_CODE,
data = YOUR_CODE,
cluster = ~ YOUR_CODE # what variable denotes the clusters
# for the standard errors
)
# 1 pt per correct model
ash_dip <- hotels_orm %>%
filter(ash_interval == 0)
model_1 <- feols(stars ~ after + first_response + ta_dummy,
data = hotels_orm,
cluster = ~hotel_id
)
model_2 <- feols(stars ~ after + first_response + ta_dummy + t:ta_dummy
|
t + hotel_id,
data = hotels_orm,
cluster = ~hotel_id
)
model_3 <- feols(stars ~ after + first_response + ta_dummy + t:ta_dummy
|
t + hotel_id,
data = ash_dip,
cluster = ~hotel_id
)
model_4 <- feols(stars ~ after + first_response + ta_dummy +
cum_avg_stars_lag + log_count_reviews_lag + t:ta_dummy
|
t + hotel_id,
data = ash_dip,
cluster = ~hotel_id
)
## NOTE: 3,368 observations removed because of NA values (RHS: 3,368).
# 0.5 pt per correct model
tidy(model_1, conf.int = TRUE)
## # A tibble: 4 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 4.10 0.0207 198. 0 4.06 4.14
## 2 after 0.302 0.0258 11.7 1.26e-30 0.252 0.353
## 3 first_response -0.0346 0.0228 -1.52 1.29e- 1 -0.0793 0.0100
## 4 ta_dummy -0.322 0.0242 -13.3 1.57e-38 -0.369 -0.275
tidy(model_2, conf.int = TRUE)
## # A tibble: 4 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 after 0.149 0.0207 7.21 8.37e-13 0.108 0.189
## 2 first_response -0.00532 0.0118 -0.450 6.53e- 1 -0.0285 0.0179
## 3 ta_dummy -1.01 0.0493 -20.4 4.77e-83 -1.10 -0.909
## 4 ta_dummy:t 0.00522 0.000434 12.0 3.73e-32 0.00437 0.00608
tidy(model_3, conf.int = TRUE)
## # A tibble: 4 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 after 0.123 0.0224 5.49 4.59e- 8 0.0790 0.167
## 2 first_response -0.0116 0.0128 -0.913 3.61e- 1 -0.0367 0.0134
## 3 ta_dummy -1.03 0.0508 -20.2 8.01e-82 -1.13 -0.927
## 4 ta_dummy:t 0.00556 0.000454 12.2 3.73e-33 0.00467 0.00645
tidy(model_4, conf.int = TRUE)
## # A tibble: 6 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 after 0.0970 0.0186 5.20 2.18e- 7 0.0604 0.134
## 2 first_response -0.00270 0.0111 -0.243 8.08e- 1 -0.0245 0.0191
## 3 ta_dummy -0.803 0.0439 -18.3 1.11e- 68 -0.889 -0.717
## 4 cum_avg_stars_lag 0.288 0.0109 26.5 1.04e-130 0.267 0.309
## 5 log_count_reviews_l… -0.00339 0.00491 -0.691 4.89e- 1 -0.0130 0.00623
## 6 ta_dummy:t 0.00473 0.000393 12.1 3.39e- 32 0.00396 0.00550
Key points:
The authors have a whole section on this: “Why do management responses affect hotel ratings?”
[1pt per factor] Three factors might explain the increase in ratings after management started to respond to reviews:
(i.) the increased cost of leaving a negative review,
(ii.) the increased benefit of leaving a positive review, and
(iii.) the possibly higher rating of returning customers.
- (i) Table 15, Cols 2-4 show that reviews with bad star ratings become longer, as reviewers need to defend their views.
- (ii) Table 14, Col 1 shows an increase in reviews after management responses (MR) start, suggesting this might be the case.
- (iii) No obvious evidence; the authors argue the data is not rich enough.
[1] Note that the data is stored in .dta format (i.e. a Stata dataset). We use the package haven with its read_stata() command to load a Stata dataset.
[2] This mapping is not immediately obvious, and is one of the (small) perils of using a dataset that one hasn’t constructed oneself from scratch. We hope this clarifies which variables need to be included in the regression.