Challenges in Experimentation

Lyft has long had a strong culture of experimentation. The norm is to test each and every product change, to build up evidence to drive large decisions, and to use causal data to support — but not necessarily dictate — strategic direction. Our largest challenges come from:

  1. Strong network effects which make proving causation difficult
  2. The responsive, real-world nature of our business
  3. Our diverse lines of business which require experiments covering disparate populations
  4. Developing our culture of experimentation as our teams and product offerings grow

The Experimentation team’s job is to address these challenges. Our vision is to “enable decision-making at Lyft to be accurate, timely, and efficient, with every change to Lyft products and platforms being made with confidence.” In this post, we provide an overview of capabilities we have deployed and a preview of our plans for the future. We have a number of forthcoming blog posts and hope you enjoy this overview of our space!

Comic from XKCD

Measuring network effects

In a traditional A/B test, one portion of a user population is exposed to one experience while another receives a different treatment (e.g., a button that says “Request” versus “Request Now”). While this works well for understanding the impact on users, ride-sharing has strong interference or “crowd out” effects: what one user does may impact another! For example, if “Request Now” is clicked more often than “Request”, then riders who see “Request Now” may match with nearby drivers before a “Request” user has a chance. This violates the statistical assumption that one user’s outcome is unaffected by the treatment other users receive.
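
To make the user split mechanics concrete, assignment is typically a deterministic hash of the user ID and the experiment name, so a given user always lands in the same variant. The sketch below is illustrative only, not Lyft’s assignment service:

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically bucket a user for a user split experiment.

    Hashing the user ID with the experiment name keeps assignment stable
    across sessions while staying independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# A given rider always sees the same button copy for this experiment.
print(assign_variant("rider_123", "request_button_copy"))
```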

While user split experiments are still our most common test type, Lyft has pioneered experimentation techniques that better measure the true effects of changes on the network and have helped shape decision making around key aspects of the business. The second most common type of experiment we run is the time split test, which gives all users within a set geographic and time boundary the same experience. This is commonly used to test pricing, ETAs, routing, mapping, and how users experience the Lyft network. Time split testing is a powerful way to establish causation in the face of network effects because it reduces interference between users; the limitation is that users would have an inconsistent experience if they use the app in both treatment and control periods. For this reason, we try to limit the use of time split tests to changes that are opaque to users.
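
As a rough sketch of how a time split assignment could work (the hour-level windows and region keying here are assumptions for illustration, not Lyft’s implementation), every user in a region shares the variant chosen for the current time window:

```python
import hashlib
from datetime import datetime, timezone

def time_split_variant(region: str, experiment: str, when: datetime,
                       window_hours: int = 1,
                       variants=("control", "treatment")) -> str:
    """Give everyone in `region` the same experience during each time window."""
    window = int(when.timestamp() // (window_hours * 3600))
    digest = hashlib.sha256(f"{experiment}:{region}:{window}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# All riders and drivers in this region share a variant for the current hour.
print(time_split_variant("san_francisco", "eta_model_v2", datetime.now(timezone.utc)))
```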

Over the last year, we have built out full support for region split testing. Region split tests divide treatments by geography and use a synthetic control to conduct causal inference. Region splits ensure that users within a specific geographic area will maintain a consistent experience while still allowing evidence of treatment impact to be collected. This is popular for launching mapping features, for example. As the control, we use trends from before the test begins to develop counterfactual predictions of what would have happened had we not launched a treatment. Because region split tests can be vulnerable to interference from other large-scale tests, we also built a scheduling tool to coordinate experiments and regions.
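
For intuition, the synthetic control idea can be sketched as fitting weights over untreated regions that reproduce the treated region’s pre-launch history, then projecting those weights forward as the counterfactual. The least-squares version below is a simplification (production synthetic control methods typically constrain the weights and add proper inference), run on simulated data:

```python
import numpy as np

def synthetic_control_effect(treated_pre, controls_pre, treated_post, controls_post):
    """Estimate the launch effect in one region via a simple synthetic control.

    treated_pre:  (T_pre,)   metric for the treated region before launch
    controls_pre: (T_pre, K) same metric for K untreated regions before launch
    treated_post, controls_post: the same series after launch
    """
    # Weights over control regions that best reproduce the treated region's history.
    weights, *_ = np.linalg.lstsq(controls_pre, treated_pre, rcond=None)
    counterfactual = controls_post @ weights  # expected outcome had we not launched
    return float(np.mean(treated_post - counterfactual))

# Simulated example with a true post-launch lift of +2.
rng = np.random.default_rng(0)
controls = rng.normal(100, 5, size=(60, 4))
treated = controls.mean(axis=1) + rng.normal(0, 1, size=60)
treated[40:] += 2.0  # launch at t = 40
print(round(synthetic_control_effect(treated[:40], controls[:40],
                                     treated[40:], controls[40:]), 2))
```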

In addition, we are currently implementing time split variance reduction techniques like residualization to improve the speed and statistical power of time split tests. Our previous implementation of time splits was vulnerable to interference from storms, major events, and outages, which frequently forced us to restart experiments. By applying new causal inference methodologies, we can drastically improve the pace of experimentation.
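
Residualization can take several forms; one common illustration (a CUPED-style adjustment, shown here as an assumption rather than a description of Lyft’s implementation) subtracts the variation explained by a pre-experiment covariate, such as the same metric from a comparable earlier window, before comparing treatment and control periods:

```python
import numpy as np

def residualized_effect(y, x, is_treatment):
    """Treatment-minus-control difference after residualizing on a covariate.

    y:            observed metric per time bucket during the experiment
    x:            pre-experiment covariate per bucket (e.g., same hour last week)
    is_treatment: boolean mask marking treated time buckets
    """
    theta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)  # slope of y on x
    y_resid = y - theta * (x - x.mean())             # adjusted outcome, same mean as y
    return y_resid[is_treatment].mean() - y_resid[~is_treatment].mean()

# Simulated example: a true effect of +5 recovered with less noise.
rng = np.random.default_rng(1)
x = rng.normal(1000, 50, size=200)
mask = rng.random(200) < 0.5
y = 0.9 * x + rng.normal(0, 10, size=200) + 5 * mask
print(round(residualized_effect(y, x, mask), 1))
```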

Finally, we have put in place processes for testing a feature using multiple types of tests in order to more fully understand the impact to the business. An upcoming blog post will talk about using multiple test types to better measure long term marketplace effects.

Managing real-world dynamism

Lyft operates on streets that are impacted by the weather, in a labor market that is impacted by macroeconomic trends, and in support of expanding mobility options. Together, these forces mean our customers’ behavior can vary widely across time and space while changing rapidly from month to month. Experimental results may lose external validity over time, and parameters that were tested and set years ago may no longer be appropriate. Updating these parameters requires time-consuming time split tests, and the results may not even hold for long!

Our answer to managing our rapidly evolving markets is to make experimentation just as dynamic. We are investing in always-on adaptive experimentation platforms that will enable us to test widely and move quickly. One of these platforms is for parameter tuning, which enables us to jointly optimize multiple continuous parameters. We are also building reinforcement learning approaches, including contextual bandits, which help us test broad sets of treatments (especially customer communications) and converge on the best performing variants. This convergence gives customers the best experience sooner than traditional, fixed allocation A/B tests. Adaptive experiments will allow us to continually optimize our marketplace, dynamically adjusting to new conditions.
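
As a toy example of the bandit idea (variant names and conversion rates below are hypothetical, and this non-contextual sketch omits the context features an adaptive platform would use), Thompson sampling keeps a posterior per variant and gradually shifts traffic toward the best performer:

```python
import random

class ThompsonSampler:
    """Beta-Bernoulli Thompson sampling over a set of message variants."""

    def __init__(self, variants):
        # Beta(1, 1) prior per variant: [successes + 1, failures + 1].
        self.posteriors = {v: [1, 1] for v in variants}

    def choose(self):
        # Sample a plausible conversion rate for each variant; play the best draw.
        draws = {v: random.betavariate(a, b) for v, (a, b) in self.posteriors.items()}
        return max(draws, key=draws.get)

    def update(self, variant, converted):
        self.posteriors[variant][0 if converted else 1] += 1

# Hypothetical usage: traffic converges toward the best-performing message.
sampler = ThompsonSampler(["msg_a", "msg_b", "msg_c"])
true_rates = {"msg_a": 0.02, "msg_b": 0.05, "msg_c": 0.03}
for _ in range(10_000):
    v = sampler.choose()
    sampler.update(v, random.random() < true_rates[v])
print(sampler.posteriors)  # msg_b accumulates the most trials over time
```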

Supporting diverse lines of business

As Lyft has expanded into new lines of business, our experimentation needs have also grown. A/B tests that split across riders and drivers are the most frequent type of experiment at Lyft: we randomly assign a user to treatment or control based on their unique identity as a rider or driver. With the rise of the Lyft Transit, Bikes, and Scooters (TBS) business, the types of entities that need to be tested have expanded.

Person in jeans undocking a pink and black Lyft Bicycle from a dock on the side of the street

We have created new randomization units to support our expanding businesses. For example, we have added support for session splits (alternating treatment assignments with each session), the aforementioned region splits, and hardware splits (such as bikes) — including for Lyft’s new eBike, which was named one of the best inventions of 2021. These new types of A/B test randomization help us test hypotheses, inform strategic decision-making, and ensure reliability for all parts of our business.
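
Conceptually, supporting a new randomization unit mostly changes the key that gets hashed. Here is a sketch (with hypothetical unit names, not Lyft’s API) of the same bucketing applied to sessions, regions, or individual bikes:

```python
import hashlib

def assign(unit_type: str, unit_id: str, experiment: str,
           variants=("control", "treatment")) -> str:
    """Bucket any randomization unit: 'user', 'session', 'region', 'bike', ..."""
    key = f"{experiment}:{unit_type}:{unit_id}"
    return variants[int(hashlib.sha256(key.encode()).hexdigest(), 16) % len(variants)]

print(assign("session", "sess_42", "checkout_flow_v3"))
print(assign("bike", "ebike_9876", "motor_firmware_v2"))
```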

Supporting our culture of experimentation

With hundreds of job openings at the time of writing, Lyft is growing rapidly. Coordination problems increase as more people run experiments, but fresh ideas give us the opportunity to form ever stronger norms around science hygiene. We are continuously working to provide new tools and reinforce beneficial norms. The scale of this challenge is massive as teams across Lyft run many thousands of experiments per year.

The outcome of an experiment is one of: (a) shipping a treatment to a larger population, (b) abandoning the treatment by stopping the experiment, or (c) extending the period of the experiment to collect more data. This decision should be made based on the hypothesis being tested. In turn, the hypothesis should specify the target population, the primary decision metrics that we are trying to improve, and any guardrail metrics that should not be made worse. There are a number of potential pitfalls around hypothesis creation, multiple hypothesis testing, and trade-off coordination which could result in poor decision making.

The first pitfall is not following best practices of hypothesis creation while setting up the experiment. Gaps here could include not clearly defining the treatment, not specifying the primary metrics, leaving out guardrail metrics, or not registering a hypothesis before the experiment begins. To address these gaps, we created a guided hypothesis workflow that helps experimenters follow best practices while defining the hypothesis. In this workflow, the hypothesis is pre-registered and recorded in various experiment communications, which reduces HARKing (hypothesizing after the results are known) because the hypothesis is shared broadly at experiment launch and kept on record. We also built structured information into the hypothesis so we can record the association between an experiment, the hypothesis, and hypothesis components like primary metrics. Teams can later use this to quickly check their hypothesis after tests are completed and to locate institutional knowledge about what has previously been tested against a specific metric.
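
To make the structured hypothesis concrete, a pre-registered hypothesis might be captured in a record along these lines (field names are illustrative, not Lyft’s schema):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Hypothesis:
    """A pre-registered hypothesis linked to an experiment."""
    experiment: str
    treatment: str                  # what is actually changing
    target_population: str          # who should be affected
    primary_metrics: List[str]      # metrics we expect to improve
    guardrail_metrics: List[str] = field(default_factory=list)  # must not regress

hypothesis = Hypothesis(
    experiment="request_button_copy",
    treatment="Change the button label from 'Request' to 'Request Now'",
    target_population="Riders on the ride request screen",
    primary_metrics=["ride_requests_per_session"],
    guardrail_metrics=["cancellation_rate", "time_to_request"],
)
```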

The second pitfall is the multiple comparisons problem, where experimenters look at multiple metrics or multiple dimensional cuts (e.g., regions, operating systems, etc.). Examining a long list of metrics increases the chance of finding false positives: as the number of comparisons increases from 1 to 20, the probability of at least one type I error increases from 0.05 to 0.64 (for uncorrelated metrics). As a result, we may ship features that are worse than the control. Our work on the guided hypothesis workflow helps to mitigate, but does not eliminate, this risk. To address this challenge more holistically, the team used the Benjamini-Hochberg method to build a multiple hypothesis testing correction that adjusts p-values when multiple primary metrics are used. Essentially, each new metric that is examined represents a different hypothesis being tested, and we need to correct for that fact. We are currently evaluating adoption, feedback, and impact on decision-making, as this methodology may lean towards conservative outcomes.
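
For reference, the 0.64 figure above is 1 - 0.95^20: the chance of at least one false positive across 20 independent comparisons at a 0.05 significance level. The Benjamini-Hochberg procedure itself is short to sketch; the textbook version below (not Lyft’s internal implementation) returns adjusted p-values that can be compared directly against the significance threshold:

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Adjust p-values to control the false discovery rate across metrics."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjustments.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted, [p <= alpha for p in adjusted]

# Hypothetical p-values for four primary metrics.
adjusted, significant = benjamini_hochberg([0.001, 0.02, 0.04, 0.30])
print([round(p, 3) for p in adjusted])  # [0.004, 0.04, 0.053, 0.3]
print(significant)                      # [True, True, False, False]
```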

Comic from XKCD illustrating challenges with false discovery

The third pitfall that experimenters may face is unaligned trade-offs. As we invest in our business using different product features (e.g., pricing, incentives, coupons), we aim to stay aligned on investment decisions, which are challenging to coordinate as our teams grow. For example, we would not want one team to build an experience that costs $50 today to improve predicted rides by 1 next month while, unbeknownst to them, another team builds an optimization that generates $30 today but reduces predicted rides by 1 next month. The ride impacts cancel out, so together we have spent $50 to recover only $30: this destructive behavior is the equivalent of “buying high and selling low” and would lose us $20 (numbers are hypotheticals for example purposes only)! Today, Lyft has the Revenue Operations team that helps to coordinate investments. On the Experimentation side, we have built improved results and decision tracking on experiments to better quantify the thousands of decisions that are being made, and the trade-offs they imply. We aim to use this results log to generate consensus on proper investment trade-offs (e.g., how much to spend today for a benefit tomorrow) and then feed these back into the hypothesis workflow tool to help coordinate teams that are working in similar areas.

Conclusion

Our team is working on numerous aspects of the experimentation platform ranging from novel statistical techniques to improving decision-making. Many of these efforts are big bets we are making to support our growing business and dynamic marketplace. We will continue to share the results and learnings from these efforts in future posts.

A big shout out to all of our team members in San Francisco, New York City, and Kyiv working on the experimentation platform!

Interested in experimentation, applying science at scale, or working at Lyft in general? Lyft is hiring! Let me ([email protected] — product), Mohan ([email protected] — engineering), or Nick ([email protected] — science) know you are interested! Also let me know if there are additional topics you’d like to hear about.

