Migrations: Refactoring for Your System

 1 year ago
source link: https://blog.bitsrc.io/migrations-8f1b0273abfa

“Migration” as a term is misleading and kind of inadequate

The dictionary definition of migration is roughly: “the movement of things from one place to another”. A close analog in a distributed system might be “change all RPC clients of service A to service B”.

When steering migration efforts, I’ve taken care to clarify that I use “migration” in a broader sense: any wide-reaching technical effort imposed upon codebases or datasets.

A few points of clarification:

  • The level of effort could range from zero or near-zero (transparent migrations) to huge (customer-side rewrites)
  • “0 to 1” efforts are also in scope (“have all stateless services call new service C for new capabilities”)
  • Deprecation efforts (“1 to 0”) are also in scope (“stop using service C from all stateless services”)

The underlying details of these programs are often immaterial to the actual execution problems they face, and the way leadership thinks of them — so I’ve typically just acknowledged that when organizing migrations, many different kinds of work might be captured.

Migrations are like system-wide code refactoring

Regular refactoring of active codebases for the purposes of simplification, risk-reduction or performance improvement is good hygiene. As requirements or circumstances change, the design decisions made in the past for the existing code become less effective, and the effort required to work around those gaps becomes technical debt.

In a large distributed system, the problem is the same, but at a larger scale. Perhaps the KV storage solution from five years ago can’t scale to the current traffic, or the legacy search index can’t efficiently support EBR. Will Larson says migrations are the sole scalable fix to tech debt. This makes sense if you think of migrations as refactoring at scale.

These problems feel familiar to software engineers, but the challenge is that migration efforts span many more people, tend to be under-justified, and are often poorly executed. If you want your stack to consistently use the latest-and-greatest technology, or to stay optimized and responsive to modern business needs, then you should be doing more migrations, not fewer, while driving down the cost per migration over time.

Benefits of migration outcomes are not uniform across constituents

From the narrow perspective of a provider or platform team, it can feel like the benefits of the newer solution clearly justify the risk or effort imposed on customer teams. Justifications for the migration can sound like:

  • We’ll cut tail latencies in half!
  • We can support significantly larger datasets!
  • The new system more easily mitigates some class of compliance risk!
  • Shutting down the old system will free up our time for better customer support!
  • We can enable new features in the underlying 3rd party infrastructure components!

While all these may technically be true, the effort required to get to the new system probably affects teams who:

  • Don’t have tight RPC latency requirements
  • Don’t have large datasets
  • Don’t have a major compliance-related surface area
  • Don’t require significant customer support
  • Literally do not care about new infrastructure features and just want stuff to work

In these contexts, customer teams have little to no incentive to help with your migration — even if the organization at large stands to benefit. This is not intentionally malicious — it’s just reality.

The value of migration efforts tends to be under-justified. By that, I don’t mean that the objectives aren’t worthwhile, but that migration teams often don’t do the research and put in the work to convey the value to affected teams from a perspective they can understand. This can especially be the case when the migration effort is engineer-driven, and facing contention with other elements of roadmaps which are steered by product managers telling a more compelling story. These teams are left wondering “why are we doing this again?”

Ensuring some system is in place to validate and elevate business value for these projects, and carving out dedicated migration time every month or quarter amongst teams can help alleviate this problem. If the effort can’t be made mostly transparent (see below), then working with senior engineers to socialize the impact with engineering leadership is important.

Stronger owners for better outcomes

A sure path to misery for a large migration is failing to clearly articulate who in the organization owns the outcomes of the migration, and failing to ensure that the owning team understands its accountability for delivering those outcomes.

Migration ownership models prone to dysfunction include:

  • Migrations driven by a person
  • Migrations driven by a working group
  • Migrations owned by a team, but with migration outcomes not factored into the team or organizational strategy or OKRs
  • Migrations assigned to a team that doesn’t own, or only partially owns, the technical domain the migration is happening in

The best team to own and drive migration is the one whose strategy is already focused on supporting customer teams and innovating in that space, whether it's databases or client-side UX platforms or stream processing. If such a team doesn’t exist, it probably should.

The nice thing about lining up the migration outcomes with the team and organizational accountability structure is that the team is now incentivized to not “just” ship the program as defined, but find the least-painful path towards that outcome at every turn. More on this in the next section.

At Twitter, I was consulting on two independent migrations for two different service frameworks, with the goal of unifying all online services onto TwitterServer. One team was responsible for migrating services from framework A to TwitterServer, another team was responsible for migrating services from framework B. This effort was unimportant to most product teams, but a longtime nagging issue for several platform teams.

Framework A’s team added a few full-time team members to the project, added migration outcomes to the team OKRs, deferred a few other projects from their roadmap, and spent considerable time on automation and direct customer code changes.

Framework B’s team allocated a fraction of one engineer, plus a part-time cross-team working group, and provided migration guidance materials to customer teams, but ultimately placed the bulk of the work on them.

The first migration was completed on schedule; we canceled the second after several quarters — most teams simply refused to do the work.

All migration projects go more smoothly when there’s less work to be done

The hardest part of major migration projects is getting many people to do a specific kind of work within a specific time boundary. Setting aside the kinds of coordination headwinds imposed by large organizations in general, the migration project is competing with all kinds of other work the team wants to put on their roadmap. Sometimes, the decision to defer migration efforts might be entirely out of the customer team’s control.

Migration projects are often structured like this:

  • We’re migrating from A to B
  • Here’s a document outlining how to do the work on the customer side
  • We don’t really know how to validate the changes to customer systems so you’re on your own here, glhf
  • If you have any problems, here’s a Slack channel or mailing list
  • If this could be done by the end of next month that would be great

This approach fans out several costs to every customer team: gaining context on the migration itself, deciding on a migration approach for the affected applications (which are probably somewhat similar overall), doing the actual implementation, and carrying the risk of each independent implementation. Projects with this structure tend to surface weak abstractions and weak platform ownership, because customer teams are doing work to support their own providers. It’s kind of like when you call your cell carrier to argue over your bill, and the person on the phone asks you for your phone number.

Spending the time upfront to explore automation of the migration work, tool development, and centralization of the broadest possible class of implementation functionality means it takes longer to start a broad cross-team program — but this style of migration tends to be the one that actually finishes.

Less work means less contention for resources on affected teams’ roadmaps, which in turn means the ROI of the migration becomes much easier to justify. Bonus points if the end state of the migration reduces the surface area the customer team needs to care about in the future.

One time at Twitter, I consulted on a project to migrate service customers from performance monitoring tool A to tool B, which in large part consisted of migrating configuration files from one format and schema to another. The migration was originally structured in a way similar to the above — the intent was to file hundreds of tickets to customer teams with instructions for how to set up configurations for the new tool.

I pointed out a few things to the project team:

  • They owned the schema and formatting of both the old and new configuration files
  • They were the domain experts in this particular performance domain
  • Customer-based manual migration would likely be error-prone and suboptimal
  • Most customers didn’t really care about the configuration itself and just wanted the tool to work

This presented the team with an opportunity to not just automate the config changes on the customers’ behalf, but also to improve and optimize them based on their expert familiarity with Twitter service performance characteristics and monitoring.

The migration was a success.
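
A provider-owned config translation of this kind might be sketched roughly as follows. This is only an illustration: the field names (`service_name`, `latency_threshold_s`, `latency_alert_ms`, `sample_rate`), the unit change, and the default value are all hypothetical and do not describe the actual tools involved.

```python
from __future__ import annotations


def translate_config(old: dict) -> dict:
    """Translate one old-format monitoring config into the new format.

    Both schemas here are hypothetical, invented for illustration.
    """
    new = {
        "service": old["service_name"],
        # Assumed unit change: old tool in seconds, new tool in milliseconds.
        "latency_alert_ms": round(old["latency_threshold_s"] * 1000),
    }
    # The provider team supplies a default when the customer never set one.
    new["sample_rate"] = old.get("sample_rate", 0.01)
    return new


def translate_all(old_configs: list[dict]) -> tuple[list[dict], list[tuple[str, str]]]:
    """Translate every config, collecting failures for manual follow-up."""
    migrated, failures = [], []
    for cfg in old_configs:
        try:
            migrated.append(translate_config(cfg))
        except (KeyError, TypeError) as exc:
            # One malformed config should not block the rest of the fleet.
            failures.append((cfg.get("service_name", "<unknown>"), repr(exc)))
    return migrated, failures
```

The point of the design is that the team owning both schemas writes the translation once, and only the configs that fail to translate need a human to look at them.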

Urgency is paramount

A naive projection of migration cost might look something like:

migration_cost = impl_cost + (num_customers * customer_adoption_cost)

This is basically the work to be done (measured in, say, lines of code) on the migration itself.

Long-lived migrations introduce entropy and uncertainty into systems, which percolate into design decisions, implementations, and future deprecations across the board. The cost is a bit more like:

migration_cost = impl_cost + (num_customers * customer_adoption_cost)
+ (double_soln_operating_cost * time)
+ (num_customers * double_impl_cost * time)

This means that for any given migration, if the amount of work is the same but we can choose to spread it over a quarter or over a year, spreading it over a year costs more: the time-dependent terms keep accruing for as long as both solutions are in operation.
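
To make the comparison concrete, here is a small sketch that evaluates the second formula for a 13-week and a 52-week schedule. The input numbers are invented for illustration (think of them as engineer-days, with `time` in weeks); only the structure of the formula comes from the text above.

```python
def migration_cost(impl_cost, num_customers, customer_adoption_cost,
                   double_soln_operating_cost, double_impl_cost, time):
    """Total cost of a migration that takes `time` units to complete."""
    # One-time work, independent of how long the migration drags on.
    one_time = impl_cost + (num_customers * customer_adoption_cost)
    # Carrying costs that accrue while old and new solutions coexist.
    carrying = ((double_soln_operating_cost * time)
                + (num_customers * double_impl_cost * time))
    return one_time + carrying


# Hypothetical inputs: same amount of work, two different schedules.
inputs = dict(impl_cost=200, num_customers=50, customer_adoption_cost=10,
              double_soln_operating_cost=5, double_impl_cost=1)

over_a_quarter = migration_cost(**inputs, time=13)
over_a_year = migration_cost(**inputs, time=52)
```

With these made-up inputs the one-time work is 700 in both cases, but the total is 1415 over a quarter versus 3560 over a year; the whole difference comes from the terms multiplied by `time`.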

Once a migration project reaches maturity and buy-in for execution, it is critical to focus on its completion for the health and simplicity of the overall system state.

Design principles for migrations

There are two end states for written code: either it eventually becomes part of a (code, traffic, data) migration, or it never does because your company shutters.

One thing I encouraged my product platform teams at Twitter to consider is that “these will not be the last platforms you build”. Meaning that even while we might deliver the right services and solutions for the needs of today’s product roadmap, we need to think about how to facilitate a future migration when it becomes necessary.

In practice, that means things like:

  • Closely modeling services and schemas around the best-known invariant concepts underpinning the business
  • Carefully identifying what kinds or elements of services or data models require centralized ownership, development and operation versus more federated models
  • Simple and well-managed interfaces which don’t bleed implementation details into customer code

A lot of that sounds like general “software engineering best practices” and it is — within a program, an engineer might design for what happens if a particular class needs to be reimplemented. Within a distributed system, engineers should consider what happens if an entire business domain needs to be reimplemented, or simply thrown away.
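
As a small-scale illustration of an interface that doesn’t bleed implementation details, consider the sketch below. Everything in it is hypothetical (the `ProfileStore` name, the key format, the two backends); it only shows customer code depending on a business concept rather than on a storage engine, so the backend can be swapped without customer changes.

```python
from __future__ import annotations

from typing import Optional, Protocol


class ProfileStore(Protocol):
    """What customer code depends on: a business concept, not a backend."""

    def get_display_name(self, user_id: int) -> Optional[str]: ...


class LegacyKVProfileStore:
    """Stand-in for an old KV backend; the key format stays private here."""

    def __init__(self, kv: dict[str, str]) -> None:
        self._kv = kv

    def get_display_name(self, user_id: int) -> Optional[str]:
        return self._kv.get(f"profile:{user_id}:name")


class DocumentProfileStore:
    """Stand-in for a newer document-style backend."""

    def __init__(self, docs: dict[int, dict]) -> None:
        self._docs = docs

    def get_display_name(self, user_id: int) -> Optional[str]:
        doc = self._docs.get(user_id)
        return doc["name"] if doc else None


def greeting(store: ProfileStore, user_id: int) -> str:
    """Customer code: unchanged whichever backend sits behind the interface."""
    name = store.get_display_name(user_id)
    return f"Hello, {name}!" if name else "Hello!"
```

Had `greeting` built the `profile:{user_id}:name` key itself, replacing the KV store would have required a change in every customer codebase instead of one change inside the platform.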

There’s always a risk of over-engineering, and that needs to be carefully mitigated. But most technology businesses don’t remain static; they’re always changing.

