Reducing A/B test measurement variance by 30%+

Victor Lei posted April 11, 2019

As a data-driven company, TripAdvisor leverages A/B testing to ensure we are making decisions that drive real incremental change. A key challenge is that achieving sufficient sensitivity for these tests can be difficult due to the high variance observed in some metrics. Decreasing measurement variance is especially important given the size and popularity of TripAdvisor, where even a very small change in some metrics can affect millions of users. It can also mean the difference between a successful test that reaches statistical significance and is rolled out to the main site, and an inconclusive test that requires further testing.

We find that incorporating pre-experiment data in a post-experiment analysis significantly reduces the measurement uncertainty of our A/B tests compared to the normal method of using a two-sample t-test. The variance reduction is especially pronounced (over 30%) when measuring more frequently occurring outcomes like visits. An added benefit of this method is that we only need to modify our post-experiment analysis; no changes are needed to our A/B experiment sampling methodology. This means we can even re-analyze past A/B tests to improve the precision of those results.

The normal method

A typical methodology for analyzing A/B tests is to compare the results from the test and control cohorts using a two-sample t-test (such as Welch’s t-test) in order to measure the average treatment effect.

The natural way to increase the statistical power of these tests and measure the treatment effect more precisely is to increase the sample size by either:

  1. running longer experiments; or
  2. increasing the amount of traffic taking part in the experiment.

However, both methods come with trade-offs such as:

  1. potentially increasing the time taken for testing;
  2. reducing the number of concurrent A/B tests that can be run; and/or
  3. increasing the risks if a test actually has a negative impact.

Incorporating additional information

The key idea is that incorporating information from before the experiment can increase the precision with which we measure the treatment effect (Gelman and Hill, 2006). Suppose we are interested in the effect of an A/B experiment on revenue per user. Then:

  1. the treatment effect is measured as the difference between the mean revenue per user of the test and control cohorts; and
  2. the variance of the estimated treatment effect is the sum of the variances of the mean revenue per user in the test and control cohorts (written out below).
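
Writing Ȳ_t and Ȳ_c for the mean revenue per user in the test and control cohorts, and assuming the two cohorts are independent, this amounts to:

θ̂ = Ȳ_t − Ȳ_c,    Var(θ̂) = Var(Ȳ_t) + Var(Ȳ_c)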

If we were better able to predict revenue per user in both the control and test cohorts, the variance of the estimated treatment effect would be lower. Pre-experiment variables are often associated with the outcome metric we are measuring, so using them we should be better able to predict the outcome metric for each user. But which pre-experiment variables should we use? Prior empirical research provides some helpful guidance (Deng et al., 2013; Xie and Aurisset, 2016); a small sketch of how these variables might be encoded follows the list.

  • Use 2+ weeks worth of pre-experiment data
  • Use the outcome variable from the pre-experiment period
  • Use a binary variable to indicate missing pre-experiment data (e.g. new users)
  • It is possible to use variables that are not technically from the pre-experiment period, as long as they are not affected by the treatment (e.g. day of the week a user first enters the experiment)
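
As a concrete illustration (a minimal sketch, not our production pipeline), assume a hypothetical data frame `users` with per-user pre-experiment visits, pre-experiment revenue, and the date the user entered the experiment:

```r
# Hypothetical columns: pre_visits and pre_revenue are NA for users with no
# pre-experiment history; entry_date is the date the user entered the experiment.
users$pre_missing <- as.integer(is.na(users$pre_visits))    # indicator for missing pre-experiment data
users$pre_visits[is.na(users$pre_visits)]   <- 0            # impute zero alongside the indicator
users$pre_revenue[is.na(users$pre_revenue)] <- 0
users$entry_dow <- factor(weekdays(users$entry_date))       # day of week of entry; unaffected by the treatment
```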

The approach

There are 2 main ways we could incorporate this pre-experiment data in a post-experiment analysis:

  1. Stratification; and
  2. Control variates.

Let’s start with stratification as it can be conceptually easier to understand.

Stratification

Suppose we are interested in the treatment effect on an outcome (like revenue or visits), and we have pre-experiment data that buckets each user into one of 3 groups based on the outcome metric during the pre-experiment period: “Low”, “Medium” and “High”. Further, let’s assume the groups are positively correlated with the outcome, so people in the “Low” group during the pre-experiment period also tend to have relatively lower outcomes during the experiment period. For simplicity, let’s assume a constant, homogeneous treatment effect of +2 units for all individuals regardless of their pre-experiment group. Then, for a member of pre-experiment group k with treatment effect θ and treatment indicator T (1 if treated, 0 otherwise), we sample their post-experiment outcome Y from a normal distribution parameterized by:

Y ~ Normal(μ_k + θ·T, σ²)

We simulate 4,000 users with these characteristics, half in the treatment cohort and half in the control cohort. See the R code in the appendix for the full simulation code and exact sampling parameters; a condensed sketch of the setup is shown below.
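
A condensed sketch of this setup (the group means and standard deviation here are illustrative, not the exact values used in the appendix):

```r
set.seed(42)
n     <- 4000
group <- sample(c("Low", "Medium", "High"), n, replace = TRUE)   # pre-experiment bucket
treat <- rep(c(0, 1), each = n / 2)                              # half control, half treatment
mu    <- c(Low = 10, Medium = 20, High = 30)                     # illustrative group means
theta <- 2                                                       # constant treatment effect of +2 units
y     <- rnorm(n, mean = mu[group] + theta * treat, sd = 5)      # post-experiment outcome
sim   <- data.frame(y = y, treat = treat, group = group)
```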

The overall outcome distribution for the test and control cohorts might look as follows. Note the high variance of the overall distribution; this results in a noisy estimate of the average outcome in both cohorts, leading to a noisy estimate of the treatment effect.

Breaking down the outcome value by pre-experiment groups reveals the key to why stratification works; by splitting out the pre-experiment groups, we have effectively explained away some of the variation we observe in the overall distribution.

Rather than a single treatment effect estimate, we now have 3 treatment effect estimates (one for each pre-experiment group), which are then combined into the final estimate using a weighted average, with weights equal to the sample size in each pre-experiment group.

In the simulated example, the point estimates remain the same but we achieve ~80% variance reduction on the treatment effect estimate. The stronger the association between the pre-experiment variable and the outcome, the greater the variance reduction.
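
A rough sketch of the stratified estimate on the simulated data above (per-group differences in means, combined with sample-size weights):

```r
# Per-stratum treatment effect estimates, their variances and weights (sketch, using `sim` from above)
by_group <- split(sim, sim$group)
est <- sapply(by_group, function(d) mean(d$y[d$treat == 1]) - mean(d$y[d$treat == 0]))
se2 <- sapply(by_group, function(d) var(d$y[d$treat == 1]) / sum(d$treat == 1) +
                                    var(d$y[d$treat == 0]) / sum(d$treat == 0))
w   <- sapply(by_group, nrow) / nrow(sim)      # weights = share of users in each stratum

theta_strat <- sum(w * est)                    # weighted average of per-group effects
var_strat   <- sum(w^2 * se2)                  # variance of the weighted average
c(estimate = theta_strat, std_error = sqrt(var_strat))
```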

Control variates

Stratification works well when there are only a few control variates and they are all categorical. However, with a large number of pre-experiment variates, variates with continuous values, and potential interaction effects, stratification becomes challenging to scale. We will discuss 2 approaches that use control variates to address these issues:

  1. regression; and
  2. Controlled-experiment Using Pre-Experiment Data (CUPED).

Regression

The normal method of using a two-sample t-test can be expressed as a regression. If θ is the treatment effect, Y is the outcome, and T is a binary indicator equal to 1 for users in the treatment cohort and 0 otherwise, we can estimate the treatment effect (and its standard error) using linear regression:

Y = α + θ·T + ε

This makes it straightforward to add control variates, simply by adding more terms to the regression. If each X is a pre-experiment control variate:

Y = α + θ·T + β_1·X_1 + ⋯ + β_p·X_p + ε

The resulting inferences on the treatment effect should be very similar to those achieved with stratification if the control variates are categorical, as shown in the simulated example. The key benefit is the simplicity and flexibility of adding in different control variates as well as interactions and higher-order terms. Note that the usual linear regression assumptions still apply, so model specification is still important. We use robust standard errors to minimize the effects of heteroskedasticity.
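
As a sketch, on the simulated data from above, the regression and robust standard errors can be computed with the sandwich and lmtest packages:

```r
library(sandwich)   # heteroskedasticity-consistent covariance estimators
library(lmtest)     # coeftest() for inference with a supplied covariance matrix

# Sketch: treat the pre-experiment group as a (categorical) control variate
fit <- lm(y ~ treat + group, data = sim)

# Treatment effect estimate with a robust (HC) standard error
coeftest(fit, vcov. = vcovHC(fit, type = "HC1"))["treat", ]
```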

CUPED

A newer method called CUPED is related to regression but takes a different theoretical approach (Deng et al., 2013). In practice, we find it often yields estimates similar to the simpler regression method. With CUPED, rather than use the means of the treatment and control cohorts directly (as in the normal method), we augment them with a control variate X:

θ̂ = (Ȳ_t − β·X̄_t) − (Ȳ_c − β·X̄_c)

The resulting estimator has a lower variance but is still unbiased as long as the expected value of X is the same in the treatment and control cohorts (i.e. X is unaffected by the treatment), so the β·X̄ terms cancel out in the difference.

To minimize the variance of the estimator, it turns out that β equals the OLS coefficient from regressing the centered Y on the centered X (without including the treatment indicator). The authors recommend using the same β for both cohorts, which effectively means the regression should be run on the combined treatment and control data, but only after centering X and Y within each cohort separately (see the example in the appendix for clarification).

The paper does not go into detail about incorporating multiple control variates, and while we could keep using the same form of the equation, it can be tedious to multiply the different mean values by the modeled β coefficients. Instead, there is a straightforward way to calculate the estimator after fitting the uncentered regression of Y on X (or adding the mean values of each cohort back to the centered regression): take the difference in the mean of the residuals between the treatment and control cohorts.

A nice benefit of using CUPED is that the R-squared from the centered regression can be interpreted as the percentage variance reduction achieved compared to the normal method (Deng et al., 2013). In other words, the more accurately we can predict the outcome metric using pre-experiment data, the greater the variance reduction. It is also a convenient way to calculate the variance of the estimator.
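
Putting this together, a rough sketch of the CUPED calculation (assuming a single continuous pre-experiment covariate x; additional covariates extend the regression in the same way):

```r
# Sketch of CUPED on a data frame `df` with columns y (outcome),
# x (pre-experiment covariate) and treat (1 = treatment, 0 = control).
cuped <- function(df) {
  # 1. Centre Y and X within each cohort, then fit one regression on the
  #    combined data (no treatment indicator) to obtain beta.
  centred <- do.call(rbind, lapply(split(df, df$treat), function(d) {
    d$y_c <- d$y - mean(d$y)
    d$x_c <- d$x - mean(d$x)
    d
  }))
  fit  <- lm(y_c ~ x_c, data = centred)
  beta <- coef(fit)[["x_c"]]

  # 2. CUPED estimate: difference in cohort means of the adjusted outcome y - beta * x,
  #    i.e. (mean(y_t) - beta * mean(x_t)) - (mean(y_c) - beta * mean(x_c))
  adj   <- df$y - beta * df$x
  theta <- mean(adj[df$treat == 1]) - mean(adj[df$treat == 0])

  # 3. R-squared of the centred regression ~ proportion of variance reduced vs. the t-test
  list(estimate = theta, variance_reduction = summary(fit)$r.squared)
}
```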

Results

We recently ran a two-week A/B test on TripAdvisor that looked at the impact of incorporating Google One-Tap on our home page (which has since been rolled out to the main site). A small box appeared in the top right corner of the home page for logged-in Google users.

This would allow users with an existing Google account to easily sign up as a member and log in; naturally, we wanted to understand the impact of this change on our site metrics like membership, engagement and revenue.

The results for the impact on membership were decisively positive, but the impact on downstream metrics like engagement and revenue was less clear due to the high variance of the estimates. Using the regression method, we incorporated a number of control variates, such as:

  • Number of visits to our different business units in the 28 days pre-experiment (e.g. hotels, attractions, restaurants, flights etc.);
  • Revenue in the 28 days pre-experiment;
  • Web browser, operating system, and locale they first entered the experiment with;
  • Day of the experiment (between 1 and 14), and day of the week (Monday – Sunday) they first entered the experiment; and
  • Number of days since the user’s last visit.

As a result, we found substantial improvements in the sensitivity of our estimates for more common actions like visits, and more modest improvements for rarer events like revenue. This aligns with the theory: the variance reduction comes from the ability of the pre-experiment control variates to predict outcomes during the experiment, and rarer outcomes are usually harder to predict, leading to smaller reductions in the variance of our estimates. It also aligns with the empirical research, which has found limited variance reduction for rarer, downstream metrics compared with more common upstream metrics (Deng et al., 2013; Xie and Aurisset, 2016).

Conclusion

Incorporating pre-experiment data notably reduced the variance of our A/B experiment measurements and led to greater experiment sensitivity. This increased precision will improve confidence in our A/B experiment results and in our decisions to roll out new features to the platform in the future.

Author’s Biography

Victor Lei is a Senior Machine Learning Engineer in the Machine Learning team at TripAdvisor. In the past year since he joined the team, he has worked on TripAdvisor’s TV campaign and membership strategy. He has been particularly focused on applying causal inference in non-experimental settings, and implementing new methods at the intersection of causal inference and machine learning.

Prior to TripAdvisor, he worked at Legendary Entertainment, using data science to help inform the movie production and marketing process. Victor holds an MS in Computational Science and Engineering from Harvard University.

References

  • Gelman, Andrew, and Jennifer Hill. Data analysis using regression and multilevel/hierarchical models. Cambridge University Press, 2006, pp 175-177.
  • Deng, Alex, et al. “Improving the sensitivity of online controlled experiments by utilizing pre-experiment data.” Proceedings of the sixth ACM international conference on Web search and data mining. ACM, 2013.
  • Xie, Huizhi, and Juliette Aurisset. “Improving the sensitivity of online controlled experiments: Case studies at Netflix.” Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016.

Appendix – Simulation Code (R)