My answers

A/B Testing Case Study: “Free Trial Screening”

Udacity is an online learning platform that offers courses and nano-degree programs in various fields of technology, business, and data science. They aim to provide accessible and high-quality education to individuals seeking to advance their careers or learn new skills in the rapidly evolving fields of technology and business.

Udacity courses currently have two options on the home page: “start free trial”, and “access course materials”. If the student clicks “start free trial”, they will be asked to enter their credit card information, and then they will be enrolled in a free trial for the paid version of the course. After 14 days, they will automatically be charged unless they cancel first. If the student clicks “access course materials”, they will be able to view the videos and take the quizzes for free, but they will not receive coaching support or a verified certificate, and they will not submit their final project for feedback.

Udacity have struggled to set clear expectations for students who enrol in their programs, particularly with regard to the workload in terms of expected study time. As a result, they face the problem of frustrated students who leave the course before completing the free trial. These students do not stay enrolled long enough to be charged for the paid version of the course. Udacity wants to improve the overall student and educator experience on the platform.

Udacity have designed an experiment to test the following change: if the student clicked “start free trial”, they were asked how much time they had available to devote to the course. If the student indicated 5 or more hours per week, they would be taken through the checkout process as usual. If they indicated fewer than 5 hours per week, a message would appear indicating that Udacity courses usually require a greater time commitment for successful completion, and suggesting that the student might like to access the course materials for free. At this point, the student would have the option to continue enrolling in the free trial, or access the course materials for free instead. The screenshot below shows what the experiment looks like for students allocated to the treatment group:

The unit of analysis is a cookie, although if the student enrolls in the free trial, they are tracked by user-id from that point forward. The same user-id cannot enroll in the free trial twice. For users that do not enroll, their user-id is not tracked in the experiment, even if they were signed in when they visited the course overview page.

The data that you need to analyze is located in data/udacity.csv. It contains the raw information needed to compute any of the metrics we discuss in the questions below. The data is broken down by day.

The definition of each variable is:

pageviews: Number of unique cookies to view the course overview page that day.
clicks: Number of unique cookies to click “Start Free Trial.”
enrollments: Number of user-ids to enroll in the free trial that day.
payments: Number of user-ids who enrolled and eventually pay after 14 days of being enrolled. Note that the date for the payments column is the start date, that is, the date of enrollment, rather than the date of the payment. The payment happened 14 days later. Because of this, the enrollments and payments are tracked for 14 fewer days than the other columns.

Outcome Variable and Baseline Variable Choices

Which of the following metrics (some of which you may need to compute from existing data) would you choose to measure for this experiment and why? For each metric you choose, indicate whether you would use it as a baseline metric or an outcome metric. The minimum detectable effect for each metric is included in parentheses.
1. Number of cookies: the number of unique cookies to view the course overview page. (Minimum Detectable Effect = 3000)
2. Number of user-ids: the number of users who enroll in the free trial. (Minimum Detectable Effect = 50)
3. Number of clicks: That is, number of unique cookies to click the “Start free trial” button (which happens before the free trial screener is triggered). (Minimum Detectable Effect = 240)
4. Click-through-probability: That is, number of unique cookies to click the “Start free trial” button divided by number of unique cookies to view the course overview page. (Minimum Detectable Effect = 0.01)
5. Gross conversion: That is, number of user-ids to complete checkout and enroll in the free trial divided by number of unique cookies to click the “Start free trial” button. (Minimum Detectable Effect = 0.01)
6. Retention: That is, number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by number of user-ids to complete checkout. (Minimum Detectable Effect = 0.01)
7. Net conversion: That is, number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of unique cookies to click the “Start free trial” button. (Minimum Detectable Effect = 0.0075)
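
As a rough illustration of how the four ratio metrics relate to the raw counts, the sketch below computes them for a single day. The helper name `funnel_metrics` is hypothetical (it is not part of the case study), and the counts are the first control-group day shown in the `glimpse()` output further down.

```r
# Hypothetical helper (not from the case study): computes the ratio metrics
# defined above from raw counts.
funnel_metrics <- function(pageviews, clicks, enrollments, payments) {
    c(
        click_through_probability = clicks / pageviews,
        gross_conversion = enrollments / clicks,
        retention = payments / enrollments,
        net_conversion = payments / clicks
    )
}

# First control-group day in the data (Sat, Oct 11)
m <- funnel_metrics(pageviews = 7723, clicks = 687, enrollments = 134, payments = 70)
round(m, 4)
```

By construction, net conversion is the product of gross conversion and retention, which is a useful sanity check when computing the metrics by hand.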

solution

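Outcome variables that we want to track:

Gross conversion
Net conversion
Retention

Baseline variables (measured before the screener is shown, so they should not change because of the experiment):

Number of clicks
Number of cookies (pageviews)
Click-through-probability (clicks / cookies)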

Determining Sample Size, Duration and Exposure

The following estimates are baseline values for some key numbers from Udacity in the time period before the experiment begins. Note that these numbers are fictitious.

# Number of unique cookies per day
n_cookies_per_day <- 40000
# Number of unique cookies that click "start free trial" per day
n_cookies_start_trial <- 3200
# Number of enrollments per day
n_enroll <- 660
# Click through probability for start free trial
n_ctr_free <- 0.08
# Probability of enrolling, given click
n_prob_enrol_click <- 0.20625
# Probability of payment, given enrol
n_prob_payment_enroll <- 0.53
# Probability of Payment, given click
n_prob_payment_click <- 0.109313
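
The baseline probabilities can be checked against the baseline counts. The following is a minimal sketch that repeats the values above so it can run on its own; the `implied_*` names are introduced here for illustration only.

```r
# Baseline values, repeated from above so the block is self-contained
n_cookies_per_day <- 40000
n_cookies_start_trial <- 3200
n_enroll <- 660
n_ctr_free <- 0.08
n_prob_enrol_click <- 0.20625
n_prob_payment_enroll <- 0.53
n_prob_payment_click <- 0.109313

# Probabilities implied by the counts (illustrative names)
implied_ctr <- n_cookies_start_trial / n_cookies_per_day
implied_enrol_click <- n_enroll / n_cookies_start_trial
implied_payment_click <- n_prob_enrol_click * n_prob_payment_enroll

c(implied_ctr, implied_enrol_click, implied_payment_click)
```

The implied values agree with the stated ones up to rounding, so the seven numbers describe one consistent funnel.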

How many page views will you need to collect to have adequate statistical power in your experiment? Ensure there is enough power for each of your metrics of choice. Use the following values in your analysis, \(\alpha = 0.05\), \(\beta = 0.2\)

solution

## Sample size for gross conversion
# start with clicks
# (power.prop.test() returns the sample size per group, hence the factor of 2)
n_clicks_test_size <-
    2 *
    round(
        power.prop.test(
            p1 = n_prob_enrol_click,
            p2 = n_prob_enrol_click + 0.01,
            power = 0.8,
            sig.level = 0.05
        )$n
    )
# from clicks to pageviews
pageviews_for_sample_gross_conversion <- n_clicks_test_size / n_ctr_free

## Sample size for retention
n_retention_test_size <-
    2 *
    round(
        power.prop.test(
            p1 = n_prob_payment_enroll,
            p2 = n_prob_payment_enroll + 0.01,
            power = 0.8,
            sig.level = 0.05
        )$n
    )
# enroll per pageview
pageviews_for_sample_retention <- round(n_retention_test_size / (n_enroll / n_cookies_per_day))

## Sample size for net conversion
n_conversion_test_size <-
    2 *
    round(
        power.prop.test(
            p1 = n_prob_payment_click,
            p2 = n_prob_payment_click + 0.0075,
            power = 0.8,
            sig.level = 0.05
        )$n
    )
## clicks to pageviews
n_conversion_test_size / n_ctr_free

## [1] 699600
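
To make the calculation less of a black box, the sketch below writes out the normal-approximation formula for the per-group sample size of a two-sided, two-sample test of proportions and compares it with `power.prop.test()` for gross conversion. The function name `n_per_group` is an assumption introduced here; the formula is the standard one without continuity correction.

```r
# Per-group sample size for a two-sided, two-sample test of proportions
# (normal approximation, no continuity correction).
# n_per_group is an illustrative name, not part of the case study.
n_per_group <- function(p1, p2, alpha = 0.05, power = 0.8) {
    p_bar <- (p1 + p2) / 2
    z_alpha <- qnorm(1 - alpha / 2)
    z_beta <- qnorm(power)
    numerator <- z_alpha * sqrt(2 * p_bar * (1 - p_bar)) +
        z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    (numerator / (p1 - p2))^2
}

# Gross conversion: baseline 0.20625, minimum detectable effect 0.01
manual <- n_per_group(0.20625, 0.21625)
builtin <- power.prop.test(
    p1 = 0.20625,
    p2 = 0.21625,
    power = 0.8,
    sig.level = 0.05
)$n

c(manual = manual, builtin = builtin)
```

The two numbers should agree up to the numerical tolerance of the root finder used inside `power.prop.test()`.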

What percentage of Udacity’s traffic would you divert to this experiment? Explain.
Given the percentage you chose, how long would the experiment take to run? If the answer is longer than four weeks, then this is unreasonably long, and you should go back and update your earlier decisions.

solution

## what ratio to divert
# if we rely on retention and divert 100% of traffic:
ceiling(pageviews_for_sample_retention / n_cookies_per_day)

## [1] 119

# too many days -- have to drop this metric

# so rely on gross conversion
# at 100 percent of traffic (probably not realistic):
ceiling(pageviews_for_sample_gross_conversion / n_cookies_per_day)

## [1] 17

# if we divert 60% -- 28 days -- that is about a month, which seems reasonable
ceiling(pageviews_for_sample_gross_conversion / (0.6*n_cookies_per_day))

## [1] 28
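
The duration calculation can be wrapped in a small function to compare diversion fractions and metrics side by side. This is a sketch under the baseline assumption of 40000 unique cookies per day; the function name `days_needed` is hypothetical, and the net conversion requirement of 699600 pageviews is the value printed in the sample size section.

```r
# Hypothetical helper: number of days needed to collect `pageviews`
# when a given fraction of daily traffic enters the experiment.
days_needed <- function(pageviews, fraction, cookies_per_day = 40000) {
    ceiling(pageviews / (fraction * cookies_per_day))
}

# Pageviews required for gross conversion (recomputed as in the solution)
n_gross <- 2 * round(
    power.prop.test(p1 = 0.20625, p2 = 0.21625, power = 0.8, sig.level = 0.05)$n
)
pageviews_gross <- n_gross / 0.08

# Pageviews required for net conversion (printed earlier in this document)
pageviews_net <- 699600

fractions <- c(0.5, 0.6, 0.75, 1)
data.frame(
    fraction = fractions,
    days_gross = days_needed(pageviews_gross, fractions),
    days_net = days_needed(pageviews_net, fractions)
)
```

Under these assumptions, net conversion needs slightly more pageviews than gross conversion, so at 60% of traffic it takes 30 days rather than 28. If net conversion is to be adequately powered within four weeks, a somewhat larger fraction may be worth considering.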

Balance Checks

Load the data into R.

solution

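library(tidyverse) # read_csv(), glimpse(), dplyr verbs, ggplot2
library(vtable)    # st() summary tables
library(broom)     # tidy() for regression output

udacity <- read_csv("data/udacity.csv")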

## Rows: 74 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): date, treatment
## dbl (4): pageviews, clicks, enrollments, payments
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(udacity)

## Rows: 74
## Columns: 6
## $ date        <chr> "Sat, Oct 11", "Sun, Oct 12", "Mon, Oct 13", "Tue, Oct 14"…
## $ pageviews   <dbl> 7723, 9102, 10511, 9871, 10014, 9670, 9008, 7434, 8459, 10…
## $ clicks      <dbl> 687, 779, 909, 836, 837, 823, 748, 632, 691, 861, 867, 838…
## $ enrollments <dbl> 134, 147, 167, 156, 163, 138, 146, 110, 131, 165, 196, 162…
## $ payments    <dbl> 70, 70, 95, 105, 64, 82, 76, 70, 60, 97, 105, 92, 56, 122,…
## $ treatment   <chr> "control", "control", "control", "control", "control", "co…

Are there any missing values in your data? Make a decision to drop or keep them, and justify your decision.

solution

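# The missing values are in enrollments and payments, which are tracked
# for 14 fewer days than pageviews and clicks. Those days carry no
# information about the outcome variables, so we drop them.
udacity <- 
    udacity %>%
    na.omit()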
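
Before dropping rows, it can help to confirm where the missing values are. The sketch below uses a small made-up data frame (the `toy` object and its last two rows are hypothetical) with the same column structure as the data; the same calls can be applied to `udacity`.

```r
# Hypothetical toy data with the same structure as udacity.csv: the last
# two days have no enrollments or payments recorded.
toy <- data.frame(
    date = c("day 1", "day 2", "day 3", "day 4"),
    pageviews = c(7723, 9102, 10511, 9871),
    clicks = c(687, 779, 909, 836),
    enrollments = c(134, 147, NA, NA),
    payments = c(70, 70, NA, NA)
)

# Count missing values per column
na_counts <- colSums(is.na(toy))
na_counts

# Check that enrollments and payments are missing on exactly the same rows
same_rows <- identical(is.na(toy$enrollments), is.na(toy$payments))

# Drop incomplete rows, as in the solution above
toy_complete <- na.omit(toy)
nrow(toy_complete)
```

If the missing values are confined to the outcome columns on the same rows, dropping them loses no usable outcome data.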

Verify the randomization into treatment and control group was successful. If you find evidence that the randomization failed look at the day by day data and see if you can offer any insight into what is causing the problem.

udacity %>%
    select(pageviews, clicks, treatment) %>%
    mutate(ctr = clicks/pageviews) %>%
    st(group = 'treatment', group.test = TRUE)

Summary Statistics

                control                      treatment
Variable    N   Mean      SD         N   Mean      SD        Test
pageviews   23  9224.478  850.296    23  9189.652  802.03    F=0.02
clicks      23  751.87    77.008     23  750.435   70.323    F=0.004
ctr         23  0.082     0.004      23  0.082     0.003     F=0.025

Statistical significance markers: * p<0.1; ** p<0.05; *** p<0.01

None of the F-tests is significant at the 10% level, so there is no evidence that the randomization failed.
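
A complementary check is whether the total number of pageviews is split evenly between the groups. The sketch below assumes the intended split was 50/50 (the case study does not state this) and reconstructs approximate totals from the rounded means in the summary table, so the result is indicative only.

```r
# Approximate pageview totals, reconstructed as 23 days x mean per day
# (means are rounded in the summary table, so totals are approximate)
control_pageviews <- round(23 * 9224.478)
treatment_pageviews <- round(23 * 9189.652)
total_pageviews <- control_pageviews + treatment_pageviews

# Exact binomial test of the share of pageviews in the control group,
# assuming an intended 50/50 split
split_check <- binom.test(control_pageviews, total_pageviews, p = 0.5)
split_check$estimate
split_check$p.value
```

A large p-value here is consistent with the summary table: the pageview split gives no sign of a randomization problem.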

Analysis

Compute the value of the outcome variables you chose above aggregating across all days in the experiment.

solution

agg_df <- udacity %>%
    group_by(treatment) %>%
    summarise(
        enrollment = sum(enrollments),
        click = sum(clicks),
        payment = sum(payments),
        gross_conversion = enrollment/click,
        net_conversion = payment / click
    )

agg_df

## # A tibble: 2 × 6
##   treatment enrollment click payment gross_conversion net_conversion
##   <chr>          <dbl> <dbl>   <dbl>            <dbl>          <dbl>
## 1 control         3785 17293    2033            0.219          0.118
## 2 treatment       3423 17260    1945            0.198          0.113

Produce plots of your data that show how the outcome variables of interest differ between the treatment and control groups.

solution

# for gross conversion
# note: net conversion is computed too; swap the y, ymin and ymax
# aesthetics below to plot it instead
udacity %>% 
    group_by(treatment) %>%
    summarise(
        enrollment = sum(enrollments),
        click = sum(clicks),
        payment = sum(payments),
        gross_conversion = enrollment/click,
        net_conversion = payment / click,
        # standard error of a proportion: the denominator is the number of
        # clicks, not n() (which is the number of days in each group)
        std_err_gc =  sqrt(gross_conversion * (1 - gross_conversion) / click),
        std_err_nc =  sqrt(net_conversion * (1 - net_conversion) / click)
    ) %>%
    ggplot() +
    geom_bar(aes(x = treatment, 
                 y = gross_conversion), 
             stat="identity", 
             fill="skyblue", 
             alpha=0.7
             ) +
    geom_errorbar(aes(x = treatment, 
                      ymin = gross_conversion - std_err_gc, 
                      ymax = gross_conversion + std_err_gc), 
                  width = 0.4, 
                  colour = "orange", 
                  alpha=0.9, 
                  linewidth = 1.5 # `size` for lines is deprecated since ggplot2 3.4.0
                  ) +
    theme_bw() + 
    ggtitle("Gross Conversion") +
    theme(text = element_text(size = 14),
          plot.title = element_text(hjust = 0.5))

Conduct the appropriate statistical tests to examine the effect of the treatment on your outcome variables using data aggregated across the duration of the experiment.

HINT: Use the prop.test() or t.test() functions.

solution

# gross conversion
prop.test(agg_df$enrollment, agg_df$click)

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  agg_df$enrollment out of agg_df$click
## X-squared = 21.983, df = 1, p-value = 2.751e-06
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.01193170 0.02917804
## sample estimates:
##    prop 1    prop 2 
## 0.2188747 0.1983198

# net conversion
prop.test(agg_df$payment, agg_df$click)

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  agg_df$payment out of agg_df$click
## X-squared = 1.9666, df = 1, p-value = 0.1608
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.001914623  0.011662068
## sample estimates:
##    prop 1    prop 2 
## 0.1175620 0.1126883
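
To connect these tests to the minimum detectable effects, the sketch below recomputes the confidence intervals by hand (treatment minus control, without continuity correction, so the bounds differ slightly from `prop.test()`) and compares them with the thresholds of 0.01 and 0.0075. The function name `diff_ci` and the two flag variables are introduced here for illustration.

```r
# Wald confidence interval for the difference in proportions
# (treatment minus control), without continuity correction.
# diff_ci is an illustrative name, not part of the case study.
diff_ci <- function(x_control, n_control, x_treatment, n_treatment, conf = 0.95) {
    p_c <- x_control / n_control
    p_t <- x_treatment / n_treatment
    se <- sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treatment)
    z <- qnorm(1 - (1 - conf) / 2)
    diff <- p_t - p_c
    c(diff = diff, lower = diff - z * se, upper = diff + z * se)
}

# Counts taken from agg_df above
gross <- diff_ci(3785, 17293, 3423, 17260)
net <- diff_ci(2033, 17293, 1945, 17260)
round(rbind(gross, net), 4)

# Is the whole gross conversion interval below -0.01 (its MDE)?
gross_practically_lower <- gross[["upper"]] < -0.01

# Does the net conversion interval rule out a drop larger than 0.0075 (its MDE)?
net_rules_out_harm <- net[["lower"]] > -0.0075
```

With these counts the gross conversion interval lies entirely below -0.01, while the net conversion interval spans zero and extends past -0.0075, so a practically relevant drop in net conversion cannot be ruled out.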

Now consider each day as an independent unit of observation. Use a linear regression to test whether the treatment impacts your outcome variables of choice. Interpret the results.

solution

# I do this in logs to give the coefficients a percentage-impact interpretation
tidy(lm(log(enrollments/clicks) ~ treatment, data = udacity))

## # A tibble: 2 × 5
##   term               estimate std.error statistic  p.value
##   <chr>                 <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)          -1.53     0.0433    -35.4  6.00e-34
## 2 treatmenttreatment   -0.107    0.0612     -1.75 8.70e- 2

tidy(lm(log(payments/clicks) ~ treatment, data = udacity))

## # A tibble: 2 × 5
##   term               estimate std.error statistic  p.value
##   <chr>                 <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)         -2.16      0.0582   -37.2   7.16e-35
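## 2 treatmenttreatment  -0.0589    0.0823    -0.716 4.78e- 1

On average across days, gross conversion is about 10% lower in the treatment group (a coefficient of -0.107 in logs), but with only 23 days per group the estimate is significant only at the 10% level (p = 0.087). The estimated effect on net conversion, about 6% lower, is not statistically distinguishable from zero (p = 0.478).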
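
Two points about the log specification can be sketched in code: how a log coefficient converts to a percentage change, and that with a single binary regressor the coefficient is just the difference in group means of the logged outcome. The helper `log_coef_to_pct` and the `toy` data are hypothetical and only serve the illustration.

```r
# Convert a coefficient from a log-outcome regression into a percentage change
# (illustrative helper, not part of the case study)
log_coef_to_pct <- function(b) (exp(b) - 1) * 100

pct_gross <- log_coef_to_pct(-0.107)   # coefficient from the first regression
pct_net <- log_coef_to_pct(-0.0589)    # coefficient from the second regression
round(c(gross = pct_gross, net = pct_net), 1)

# Hypothetical daily conversion rates, three days per group
toy <- data.frame(
    rate = c(0.22, 0.21, 0.23, 0.20, 0.19, 0.21),
    treatment = rep(c("control", "treatment"), each = 3)
)

# With one binary regressor, the slope equals the difference in mean log rates
fit <- lm(log(rate) ~ treatment, data = toy)
slope <- coef(fit)[["treatmenttreatment"]]
diff_in_means <- mean(log(toy$rate[toy$treatment == "treatment"])) -
    mean(log(toy$rate[toy$treatment == "control"]))
```

For small coefficients the percentage change is close to 100 times the coefficient, which is why the log scale is convenient here.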

Which of two sets of estimates do you prefer? Explain.

solution

This boils down to whether we are interested in total conversions or average conversions.

The regression computes a conversion metric for each day and then estimates the average effect on conversion, where the average is taken over the days of the trial. Every day therefore gets the same weight, regardless of how much traffic it had.

The proportions test uses the aggregated data and asks what the total effect on conversions is.

The total conversions measure is more interesting than the average in this case.
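
The difference between the two measures is a matter of weighting. A minimal sketch using the first three control-group days from the `glimpse()` output above shows that the aggregate rate equals a click-weighted average of the daily rates, while the regression-style average weights each day equally.

```r
# First three control-group days from the data
clicks <- c(687, 779, 909)
enrollments <- c(134, 147, 167)

daily_rate <- enrollments / clicks

# "Total" measure: pool all clicks and enrollments
aggregate_rate <- sum(enrollments) / sum(clicks)

# "Average" measure: every day counts equally
average_rate <- mean(daily_rate)

# Weighting the daily rates by clicks recovers the aggregate rate
weighted_rate <- weighted.mean(daily_rate, w = clicks)

round(c(aggregate = aggregate_rate, average = average_rate, weighted = weighted_rate), 4)
```

The two measures coincide only when every day has the same number of clicks or the same rate; otherwise high-traffic days count for more in the aggregate.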

Recommendations

Based on your analysis, would you launch this feature? Justify your answer.

solution

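The screener reduces enrollment: gross conversion falls by about 2 percentage points (95% confidence interval roughly 1.2 to 2.9 points), which exceeds the minimum detectable effect of 0.01. However, there is not enough evidence that more students go on to make payments: the change in net conversion is not statistically significant (p = 0.16), and its confidence interval includes a decrease larger than the 0.0075 minimum detectable effect.

I would not recommend launching this screener.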

Follow Up Experiment

If you wanted to reduce the number of frustrated students who cancel early in the course, what experiment would you try?
1. Give a brief description of the change you would make,
2. What your hypothesis would be about the effect of the change,
3. What metrics you would want to measure, and what unit of analysis you would use.

Include an explanation of each of your choices.

solution

No solution provided.

Many possibilities to discuss in class!

My name

27 April, 2024
