Migrations Done Well: Executing Migrations

 2 years ago
source link: https://blog.pragmaticengineer.com/migrations-done-well-executing-them/

This is Part 2 in the 3-part series on Migrations Done Well.

Part 1: Typical migrations

  1. The stories of four different migrations
  2. Types of migrations

Part 2: Executing a migration (this article)

  1. Preparation
  2. Pre-migration
  3. The migration
  4. After the migration
  5. The migration’s long-tail

Part 3: The people and the business side of migrations (an upcoming article)

  1. The people aspect of migrations
  2. Selling migrations to the business
  3. Closing advice for migrations
  4. Further reading

1. Preparing for migrations

Migrations are risky and when they go wrong, they can cause all kinds of significant damage. However, if you do some groundwork before starting the migration, you’ll reduce risk, gain confidence and understand the scope of the migration better. Here are a few things you should do:

  • Understand the reason for the migration. Why is this migration needed? What is wrong with the old solution? What is the impact of the old solution being unsuitable? What would happen if this migration did not complete?
  • What were the constraints in the past? When a migration is needed, it's because the current solution fails to meet a particular requirement. This could be availability, performance, or something else. Outline what these capability constraints are. You’ll need to make sure the new system solves for these past constraints.
  • A map of the differences of Old vs the New. Outline the old system vs the new system. What is changing and why, and how do these systems work? Are external interfaces changing at all? Are you adding or removing functionality for the new system?
  • Consumers and producers needing migration. Which customers of the old system need to be migrated to the new one? Map customers of the old system. Will all consumers move over, or will some remain on the old system?
  • Write and share a migration plan. Write up how the migration will happen. What are the migration steps? Who will do each step? For example, who will build the new service, who will migrate each of the consumers, and who will monitor the progress of the migration? Put this plan in writing and share it with consumers whom the migration will impact.
  • Capacity planning. What load is the current system under? What load will the new service need to handle? For load planning, don’t only look to the time of the migration, but look ahead into the future. How will you ensure the new system has the capacity to handle this load? How will you confirm it can handle the load?
  • Error budgets. For complex and large migrations you can expect some things to go wrong. You might also have migrations where you expect a small percentage of traffic to not work as expected. Instead of just accepting things going wrong, budget for it. What is the acceptable limit for errors during the migration? This could be defined as latency degradation, % of data that is not consistent for a certain time, percentage of application errors during the migration, and so on. By defining error budgets you’ll also have to measure them to keep yourself honest.
  • Edge cases. What are potential tricky situations you need to take into account? Write these up, then share it with the team working on the migration and all consumers impacted, so they can also add their edge cases. Common edge cases you should worry about include:
      • What happens with data producers during migration? For example, say you have processes writing to the database. What happens when the migration happens? Will these writes continue? Will you stop them? Will they be queued?
      • How will service consumers behave during migration? Will there be edge cases when clients behave in unexpected ways, or experience data consistency issues as they interact with a system undergoing migration?
  • Security audit of migrations. What are the attack vectors during and after the migration? Are there vulnerabilities a bad actor could take advantage of, as the migration happens? Vulnerabilities might not be strictly related to the code or data, they could include phishing attacks.

    This is exactly what happened to NFT marketplace OpenSea, in February of this year. The company announced a week-long migration to customers. A bad actor realized this migration was an opportunity for an attack and sent out emails that looked like they were from OpenSea, asking customers to take action for the migration to happen. Customers then were tricked into authorizing a smart contract they believed was from OpenSea, and dozens of unlucky customers lost over $1.7M in NFT assets.

    Had the company predicted such phishing and sought to counter it, it could have shortened the migration window, communicated the phishing risks, or monitored suspicious smart contracts authorized during the migration period and warned users against executing them.
  • Run a migration pre-mortem. As engineering director Nathan Gould suggests:

Love this. One of my favorite migration practices is to hold a "pre-mortem".

The basic play is:
- Write up a migration plan
- Gather the team running the migration, plus stakeholders
- Brainstorm stuff that could go wrong
- Identify and prioritize ways to mitigate risk

— Nathan Gould (@0xNathanGould) March 11, 2022
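
The error budget item in the preparation list above can be made concrete with a small check. This is a minimal sketch in Python, not a prescribed implementation: the `ErrorBudget` class, its field names and the 0.5% limit are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical budget: the share of requests allowed to fail during the migration."""
    max_error_rate: float  # e.g. 0.005 means 0.5% of requests may fail
    total_requests: int = 0
    failed_requests: int = 0

    def record(self, ok: bool) -> None:
        # Count every request, and separately count the failures.
        self.total_requests += 1
        if not ok:
            self.failed_requests += 1

    @property
    def error_rate(self) -> float:
        if self.total_requests == 0:
            return 0.0
        return self.failed_requests / self.total_requests

    def exhausted(self) -> bool:
        # The budget is spent once the observed rate exceeds the agreed limit.
        return self.error_rate > self.max_error_rate


budget = ErrorBudget(max_error_rate=0.005)
for i in range(1000):
    budget.record(ok=(i % 250 != 0))  # 4 simulated failures out of 1000
```

A check like `budget.exhausted()` could feed an alert, so the team pauses the migration when the agreed limit is crossed rather than debating it in the moment.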

1.1. Preparing for data migrations

Data migrations bring additional complexities of their own and tend to require more thorough planning. Some additional complexities you might need to consider are:

  • Too much data to migrate in one go. In the case of service replacements, this means that the new service might need to still use the old service to look up older data. In this scenario, smart approaches could be used. For example, when modifying a record, the new service might create a copy from the old system, and stop using the old system for that record.
  • How will data producers be migrated? What happens with data-producing processes, for example, processes writing to the database at the time of the migration? Will they be paused? The writes queued? The writes dropped?
  • Active-active challenges. When using an active-active data storage setup – distributing data across several clusters for resilience – you’ll have additional challenges when migrating. If migrating with downtime, you might be able to migrate all clusters during this downtime. However, if running a zero-downtime migration, consumers will see inconsistent data. How will they deal with this inconsistency?
  • Rollback. If the migration has issues, how will you revert to the previous state? When utilizing downtime to perform the migration, the rollback might be as simple as not switching over to the newly migrated data. However, with zero downtime migration, you might need tooling to write back data committed to the migrated database as you roll back to the old database.
  • Disaster recovery considerations. In the event of a major issue like data corruption or a ransomware attack, how easy is it to roll back to a pre-migration state of the system? Are there sufficient backups? Should you make a last snapshot save before the migration proceeds?
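
The "too much data to migrate in one go" approach from the list above, where the new service copies a record from the old system when it is first modified, might look roughly like this. It is a sketch under assumptions: plain dictionaries stand in for both data stores, and `LazyMigratingStore` is a hypothetical name.

```python
class LazyMigratingStore:
    """Hypothetical new store that falls back to the old one for unmigrated records."""

    def __init__(self, old_store: dict, new_store: dict):
        self.old_store = old_store
        self.new_store = new_store

    def read(self, key):
        # Prefer the new store; fall back to the old one for older data.
        if key in self.new_store:
            return self.new_store[key]
        return self.old_store[key]

    def update(self, key, **changes):
        # On first modification, copy the record over, then stop using the old copy.
        if key not in self.new_store:
            self.new_store[key] = dict(self.old_store[key])
        self.new_store[key].update(changes)
        return self.new_store[key]


old = {"user-1": {"name": "Ada", "plan": "free"}, "user-2": {"name": "Bob", "plan": "free"}}
store = LazyMigratingStore(old_store=old, new_store={})
store.update("user-1", plan="pro")
```

After the update, `user-1` lives in the new store while `user-2` is still served from the old one, which is why the old system has to stay readable until the remaining records are moved.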

2. Pre-migration steps

You have a plan in place and are confident you have the edge cases covered. Can you start the migration? Almost certainly not yet! First, you’ll need to ensure you have monitoring in place so you can track the status of the migration and detect problems. You’ll also want to validate that the migration will work with shadowing, dry runs and other processes.

[Figure: Shadowing a migration. See more details in Part 1 of the series.]

2.1. Monitoring

Monitoring the migration is the single most important action for making a migration successful, and for detecting one that is going wrong. The lack of dedicated migration monitoring is the reason for most migrations causing outages, in my experience.

  • Define what needs to be monitored. How can you tell if the migration goes well? What do you need to measure to know that customers are not being impacted and the service is healthy? For data migrations, how can you measure that the data is not corrupted?
  • Build the monitoring systems. Put in place the graphs, tools and alerts that tell you if the migration is healthy or if it has issues.
  • Throwaway monitoring is to be expected! Many engineering teams are hesitant to build monitoring as they’re used to building graphs and alerts for permanent features. Break away from this thinking! You need to build monitoring and yes, it will be removed once the migration is done.
  • Do you have monitoring for all error budgets? Remember how you defined your error budgets? Make sure there’s monitoring for all of those metrics.
  • If it’s painful to build one-off monitoring, take note for later. At companies which have invested heavily in platform teams, building monitoring for migrations should be a breeze. If it’s not, then you should incentivize making it easier to do. Do this by contributing to tooling, or by pushing management to invest in tools, be it via internal platform teams or contracting with vendors.
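
For data migrations, one way to answer "how can you measure that the data is not corrupted?" is to compare checksums of records in the old and new stores. The following is a throwaway sketch: the function names are hypothetical, and dictionaries keyed by record ID stand in for real tables.

```python
import hashlib
import json


def record_checksum(record: dict) -> str:
    # Serialize with sorted keys so equal records always hash the same.
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def find_mismatches(old_rows: dict, new_rows: dict) -> dict:
    """Return keys that are missing from the new store or whose content differs."""
    report = {"missing": [], "different": []}
    for key, old_record in old_rows.items():
        if key not in new_rows:
            report["missing"].append(key)
        elif record_checksum(old_record) != record_checksum(new_rows[key]):
            report["different"].append(key)
    return report


old_rows = {1: {"amount": 10}, 2: {"amount": 20}, 3: {"amount": 30}}
new_rows = {1: {"amount": 10}, 2: {"amount": 25}}
report = find_mismatches(old_rows, new_rows)
```

The counts in such a report are natural candidates for a graph and an alert tied to the data-consistency error budget.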

2.2. Validation

Validate that the migration will work as expected. Once you’ve completed most of the development work for the migration, you’ll want to hold off rolling out in production. Instead, you should validate the migration with production data and traffic. Here are some common ways to do this.

Shadowing, also commonly referred to as ‘parallel run’ or ‘shadow loading.’ This is a common approach, both with service replacements and service integrations. This approach means sending all production traffic to the new system or new integration as well, and monitoring that it works as expected.

Shadowing has several benefits:

  • Migration issues are caught early.
  • It’s as close to pre-production testing as it gets.
  • It serves as load testing. If your shadowed system can handle production load, you can be confident that it won’t have issues when switching over.

There are a few caveats with shadowing:

  • Shadowing validation. You’ll likely need to build shadowing validation tooling to confirm the new system works as expected.
  • Mocking might be needed to avoid side effects. The new system might need to mock certain functionality. For example, if you are migrating an old system which also sends emails in certain scenarios, you’ll want to stop the new system from sending emails while in shadowing mode. Otherwise, emails would be sent twice!
  • Not always practical. There may be times when shadowing is not an approach that’s pragmatic. This is the case when you would have to mock much of the shadowed system’s capability. For example, if you are migrating the email layer in an application, shadowing email sending without sending emails might be meaningless.
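
To illustrate shadowing with validation and a mocked side effect, here is a minimal sketch. Everything in it is assumed for the example: `ShadowRunner`, the two handler functions and the deliberately planted bug in the new system are hypothetical, and real shadowing would usually run the shadow call asynchronously.

```python
class ShadowRunner:
    """Hypothetical wrapper: the old system serves traffic, the new one is only compared."""

    def __init__(self, old_handler, new_handler):
        self.old_handler = old_handler
        self.new_handler = new_handler
        self.mismatches = []

    def handle(self, request):
        primary = self.old_handler(request)
        try:
            shadow = self.new_handler(request)
        except Exception as exc:  # a failing shadow must never affect customers
            shadow = f"error: {exc}"
        if shadow != primary:
            self.mismatches.append((request, primary, shadow))
        return primary  # customers only ever see the old system's response


sent_emails = []


def old_system(order_total):
    sent_emails.append(order_total)  # the real side effect happens once, here
    return order_total + order_total // 5  # total plus 20% tax


def noop_email(order_total):
    # Mocked side effect: in shadow mode the new system must not send emails.
    return None


def new_system(order_total, send_email=noop_email):
    send_email(order_total)
    # Deliberate bug for illustration: tax is not applied to totals above 100.
    return order_total + order_total // 5 if order_total <= 100 else order_total


runner = ShadowRunner(old_system, new_system)
responses = [runner.handle(total) for total in (50, 100, 150)]
```

The recorded mismatch for the request `150` is the kind of issue shadowing is meant to surface before any customer is switched over.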

Load testing is a common approach to confirm that the new system has the capacity to handle future, increased loads. While shadowing can give a sense of current load-handling capability, the goal of load testing is to simulate extreme loads.

Common ways to do load testing are these:

  • Use mocked or generated data. Generate test data and execute a load test. The benefit of this approach is that you can tweak the load test characteristics easily. The downside is that the test data will not exercise edge cases which only real-world data has.
  • Use production data, but in bulk. A more common load testing approach is to sidestep test data and use real data. Collect production data for a while, then use this production data for load testing.
  • Combine shadowing with a bulk release of production data. We used this approach at Uber with a system called Hailstorm. It buffered all production requests coming in for a defined time and replayed them in a shorter time window. For example, we released 10 hours’ worth of production data in one hour.
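
The arithmetic behind replaying buffered traffic in a shorter window can be sketched as follows. This is not how Hailstorm was implemented, which the article does not describe; `compress_schedule` is a hypothetical helper that only shows how recorded request times map onto a compressed replay timeline.

```python
def compress_schedule(timestamps, speedup):
    """Map recorded request times (seconds) onto a shorter replay window.

    Offsets from the first request are divided by `speedup`, so the order
    and relative spacing of requests are preserved.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    if not timestamps:
        return []
    start = min(timestamps)
    return [(t - start) / speedup for t in sorted(timestamps)]


# Ten hours of recorded traffic, one request per hour mark, replayed 10x faster.
recorded = [hour * 3600 for hour in range(11)]
replay_offsets = compress_schedule(recorded, speedup=10)
```

With a speedup of 10, a request recorded at the ten-hour mark is replayed one hour (3,600 seconds) after the replay starts.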

Performance sampling is an approach worth doing for high-load systems, especially when you are changing technologies like frameworks or programming languages. It’s easiest to do when combined with shadowing. What are the performance characteristics of the new system, compared to the old? How has latency changed, and has resource usage, such as CPU utilization and memory usage, decreased or increased?
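
A simple way to compare sampled latencies is to look at a high percentile rather than the average. The sketch below uses Python's standard `statistics.quantiles`; the sample data and the `compare_latency` helper are made up for illustration.

```python
import statistics


def p95(samples):
    # quantiles with n=20 returns 19 cut points; the last one is the 95th percentile.
    return statistics.quantiles(samples, n=20)[-1]


def compare_latency(old_ms, new_ms):
    """Compare the 95th percentile latency of the old and new systems."""
    old_p95, new_p95 = p95(old_ms), p95(new_ms)
    return {"old_p95": old_p95, "new_p95": new_p95, "regressed": new_p95 > old_p95}


# Hypothetical latency samples in milliseconds, as if collected while shadowing.
old_ms = [100 + i for i in range(100)]
new_ms = [80 + i for i in range(100)]
result = compare_latency(old_ms, new_ms)
```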

Do a dry-run of the full migration. Why wait to do the migration in production, when things may go wrong that cause outages? If possible, do a dry-run migration and inspect whether things work as expected.

You can combine a dry-run migration with shadowing. Dry-runs are common to do with data migrations, where this might be the best way to confirm that data moves as expected.

Test for events that will only happen in the future. Your migrated system might have to deal with edge cases that only occur on certain future dates. For example, if migrating a billing system that sends out bills at the end of the month, you’ll want to test that this functionality works.

One option is to do shadowing for long enough so that these future events occur, and you can validate them in real time. However, you might not have the luxury of waiting, especially if these events happen once every few months, or once a year. In this case, you’ll have to simulate these events and validate that the migrated system handles them as expected. A good example is testing billing systems like this:

Pre-migration:
- do a complete data migration to the new database and test new service features + performance with real life (anonymized) data.
- test future processes. E.g. billing run every 1st of the month: do it in the new system simulating future date and compare results.

— Jeroen Baidenmann (@jeroenbai) February 18, 2022
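
Simulating future dates is much easier when the code under test takes the current date as a parameter instead of reading the system clock. Here is a minimal sketch of that idea; the billing rule (bill active customers on the 1st of each month) and all names are hypothetical.

```python
from datetime import date, timedelta


def bills_due(customers, today: date):
    """Hypothetical billing rule: bill every active customer on the 1st of the month."""
    if today.day != 1:
        return []
    return sorted(name for name, active in customers.items() if active)


def simulate_billing_runs(billing_fn, customers, start: date, days: int):
    # Walk a simulated calendar instead of waiting for real time to pass.
    runs = {}
    for offset in range(days):
        day = start + timedelta(days=offset)
        due = billing_fn(customers, day)
        if due:
            runs[day] = due
    return runs


customers = {"acme": True, "globex": False, "initech": True}
runs = simulate_billing_runs(bills_due, customers, start=date(2022, 3, 20), days=45)
```

Running the same simulation against the old and the new billing function, then comparing the two results, would mirror the "simulate a future date and compare" advice quoted above.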

3. The Migration

We’ve planned the migration, done pre-migration validation and perhaps even shadowed the new system, assuming it’s practical to do. Time to start the migration!

Migration downtime

The most important decision in any migration should be whether or not to have downtime for consumers of the system being migrated. This decision will determine strategies you can use for the migration itself.

Zero-downtime migrations are when customers notice nothing of the migration. These are the ones that need the most preparation to achieve.

Doing zero downtime migrations is often a steep learning curve and involves more upfront work, when done the first time. However, at companies where zero downtime is the norm, the amount of additional work drops over time as people get better at it, and at building or utilizing tools which aid the process.

There will be many cases when zero downtime is too expensive to do, in terms of time spent building tooling. There will also be cases where it’s not possible, for example, with infrastructure migrations. However, if you start by considering what it takes to do zero downtime migrations, then you can make a more informed choice.

The first thing you should do is determine if you need zero downtime migrations.

— Bryan Liles (@bryanl) March 12, 2022

I would encourage teams to do at least one zero downtime migration, if they’ve never done one before. Teams that have never performed such a migration often overestimate their complexity, and underestimate the benefit of not having to work outside business hours when doing zero downtime migrations.

A benefit of zero downtime is being able to do all things calmly rather than in a hurry and not needing to time anything to a maintenance window. Especially in async work cultures, or cultures where stress and agitation is disliked, that is valuable.

— Felix Huttmann (@felixhuttmann) March 11, 2022

Migration with downtime affecting few to no customers is when you take the system down, but customers don’t notice anything happened. How is this possible?

For one, your business might be intentionally offline during certain hours. For example, if you have systems servicing the stock market which only operate during business hours, customers won’t notice migrations done outside these hours.

If your migration involves non time-sensitive functionality, you can also get away without customers noticing. In these cases you might be able to take the old system offline, queue incoming requests, do the migration, then replay those requests for the new system to process. For example, if you are migrating a system that sends out marketing emails, customers won’t notice – or care – that some marketing emails arrive an hour after they usually do, thanks to the migration.

Migration with reduced functionality is an approach where you don’t introduce downtime to a system, but some functionality will go offline. A common case is to turn a system to ‘read only’ mode during a data migration; all read operations still function, but new data can not be written. That data will either be queued for later writing, or it might be discarded until the migration is complete.
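
A read-only mode that queues writes for later could be sketched like this. It is an illustration only: `ReadOnlyDuringMigration` is a hypothetical wrapper around an in-memory dictionary, and a real system would need a durable queue and a policy for conflicting writes.

```python
class ReadOnlyDuringMigration:
    """Hypothetical store wrapper that queues writes while a migration runs."""

    def __init__(self, data: dict):
        self.data = data
        self.read_only = False
        self.pending_writes = []

    def read(self, key):
        return self.data.get(key)  # reads keep working throughout

    def write(self, key, value) -> bool:
        if self.read_only:
            self.pending_writes.append((key, value))  # queue instead of applying
            return False
        self.data[key] = value
        return True

    def finish_migration(self) -> int:
        # Leave read-only mode and apply queued writes in arrival order.
        self.read_only = False
        applied = 0
        while self.pending_writes:
            key, value = self.pending_writes.pop(0)
            self.data[key] = value
            applied += 1
        return applied


store = ReadOnlyDuringMigration({"balance": 100})
store.read_only = True
accepted = store.write("balance", 80)
value_during_migration = store.read("balance")
applied = store.finish_migration()
```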

Migration with planned downtime is an approach where you take the current system fully offline, perform the migration, then bring the migrated system online. You are able to accurately estimate the downtime needed and can communicate this downtime, ahead of time. You typically choose outside peak hours to perform the migration, usually in the middle of the night for most customers, or at the weekend.

Migration with downtime has the benefit that you don’t need to worry about edge cases of consumers accessing a system that is not fully migrated.

The downside of this approach – of taking the system offline – is that it adds pressure; if something goes wrong with the migration, there’s not much time to fix it without exceeding the communicated downtime window.

Migrations utilizing regular downtime periods is an approach common at more traditional businesses which have been doing migrations for years, or decades. Several banks fall into this category, which commonly allow for downtime to happen over the weekend or during bank holidays.

Having regular downtime periods that can be used for migrations is convenient for engineering teams. They can use these often long timeframes and not have to worry about a migration going wrong as they have ample time to revert it, and to get it right in the next period.

I personally find regular downtime periods tempt engineering teams to take the easy route of not preparing well for migrations, and disincentivize zero-downtime approaches. On top of this, regular downtime periods incentivize working outside business hours like at weekends or late at night.

Migration strategies

How will you perform the migration? Here are the most common migration strategies:

Switch over. With a flip of a switch, or more often a configuration change, you route all traffic to the new system. This is usually done after extensive shadowing.

There are migrations where a switch over is the only sensible strategy. Code migrations are one of them, and data migrations might be, too. Migrations that utilize downtime are almost always ones that use a switch over method once the migration is complete.

Staged rollout. This means rolling out the migration gradually to parts of the system, or to a certain group of consumers. As the rollout proceeds, the team monitors the system to make sure it works as expected, and pauses or reverses the rollout when they see issues.

Staged rollouts are popular when releasing new features, gradually rolling them out to all users. This means that many teams already have access to the tools – like feature flags – to use for staged rollouts.

Several types of migrations can easily be done as staged rollouts. However, more complex ones like data migrations or infrastructure migrations often require extra complexity to be added, in order to use a staged rollout approach. This is because in those migrations, both data migration and code changes often need to be tied together. A staged rollout with a data migration might mean writing more code to keep the migrated data and the code executed in sync; this new code is yet another source of bugs.
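
A common building block for staged rollouts is deterministic percentage bucketing, similar in spirit to what many feature flag tools do. The sketch below is an assumption about how one might do this, not a description of any specific tool; `in_rollout` is a hypothetical function.

```python
import hashlib


def in_rollout(consumer_id: str, percentage: int) -> bool:
    """Deterministically place a consumer in one of 100 buckets.

    A consumer stays in the same bucket between calls, so raising the
    percentage only ever adds consumers to the new system.
    """
    digest = hashlib.sha256(consumer_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percentage


consumers = [f"consumer-{n}" for n in range(1000)]
at_10_percent = {c for c in consumers if in_rollout(c, 10)}
at_50_percent = {c for c in consumers if in_rollout(c, 50)}
```

Because buckets are stable, everyone in the 10% stage is also in the 50% stage, and setting the percentage back to 0 reverses the rollout for all consumers.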

Writeback to the old system is a common approach with both service replacement migrations and some data migrations. In cases where the existing service has many internal consumers, the migration to the new system does not move these consumers over to the new system.

Instead, the new system writes data back to the old system, allowing for consumers to operate without changes. Now, the migration is complete in the sense that the new system operates as primary. However, there will be a long-tail migration effort to move all consumers of the old system to use the new system.
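
The writeback idea can be sketched as a primary system that mirrors each write to the old system and keeps track of failed writebacks. This is a simplified illustration: the class and function names are hypothetical, and the choice to accept the primary write even when the writeback fails is an assumption, not a recommendation from the article.

```python
class NewSystemWithWriteback:
    """Hypothetical primary system that mirrors every write to the old system."""

    def __init__(self, old_write):
        self.records = {}
        self.old_write = old_write  # callable that writes to the old system
        self.writeback_failures = []

    def write(self, key, value):
        self.records[key] = value  # the new system is the source of truth
        try:
            self.old_write(key, value)
        except Exception as exc:
            # Do not fail the primary write, but keep a trace to alert on.
            self.writeback_failures.append((key, str(exc)))


old_records = {}


def write_to_old(key, value):
    if key == "order-2":
        raise TimeoutError("old system did not respond")  # simulated failure
    old_records[key] = value


system = NewSystemWithWriteback(write_to_old)
system.write("order-1", "paid")
system.write("order-2", "refunded")
```

Consumers still reading from the old system would never see `order-2` here, which is why the failure list should be wired into monitoring and retried.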

Migration toolset

Let’s get to the migration! Here are approaches to consider using.

Go through your migration plan with the team, once more. This should be a document outlining:

  • Each migration step.
  • How each step will be validated to confirm it’s successful, so the next step can begin.
  • Who does what during the migration.
  • Contacts on teams who could be impacted.
  • Edge cases that might occur and how those will be handled.
  • A rollback plan on how to revert the migration if any of the steps go wrong.
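
The structure of such a plan (steps, a validation per step, and a rollback) can also be expressed as a small runner. This is a sketch with hypothetical names and a simulated failing step; real migrations typically involve manual checkpoints that a script like this would not replace.

```python
def run_migration(steps):
    """Run steps in order; on a failed validation, roll back completed steps in reverse.

    Each step is a dict with hypothetical keys: name, run, validate, rollback.
    """
    completed = []
    log = []
    for step in steps:
        step["run"]()
        log.append(f"ran {step['name']}")
        completed.append(step)
        if not step["validate"]():
            log.append(f"validation failed for {step['name']}")
            for done in reversed(completed):
                done["rollback"]()
                log.append(f"rolled back {done['name']}")
            return False, log
    return True, log


state = {"schema": "old", "traffic": "old"}
steps = [
    {
        "name": "migrate schema",
        "run": lambda: state.update(schema="new"),
        "validate": lambda: state["schema"] == "new",
        "rollback": lambda: state.update(schema="old"),
    },
    {
        "name": "switch traffic",
        "run": lambda: state.update(traffic="broken"),  # simulated bad step
        "validate": lambda: state["traffic"] == "new",
        "rollback": lambda: state.update(traffic="old"),
    },
]
succeeded, log = run_migration(steps)
```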

Validate that your migration monitoring is in place and working. The locations of graphs, alerts and tools should already be part of your migration plan: double check they are up, and test that they work.

Announce the timing of the migration ahead of time. How much time is enough time for this depends on the scale and impact of the migration. This could be days’ worth of heads up, or weeks’. For small migrations, announcing a few minutes beforehand to relevant teams and oncalls that the migration is starting, might also be sufficient.

I strongly suggest not starting a migration without announcing it to teams and stakeholders which could be impacted. Notify not only engineering teams, but also non-engineering groups like customer support, operations and other people who can let you know if they see unexpected issues during the migration.

Start! Once the announcement is out, start executing the migration.

If there’s an issue: don’t panic. There’s always a chance that something goes wrong during the migration. If this happens, keep your cool and respond the same way as you would during any outage.

Focus on mitigating the issue, instead of fixing the root cause. Use the rollback plan in your migration plan. Once you’ve rolled back, regroup to identify the issue and prepare for a new migration.

If you don’t have a rollback plan, or the rollback is not working, consider asking for a second opinion before proceeding with anything bold. You’re in the heat of the moment, under pressure and it’s much easier to miss things in this state. There are few situations where a second pair of eyes isn’t helpful.

4. After the migration

You’ve completed the migration. Congratulations! What now?

Announce the migration is complete. Notify not just the impacted stakeholders, but a broader group of potentially interested teams that the migration is complete.

If the migration was a non-trivial project, follow the same wrap-up steps that I suggest in the article Software Engineers Leading Projects, including the final project update:

“Write a concise summary of why people should care, what the impact – or expected impact – of the project is, summarize the work, highlight key contributors, and link to details.

“Shine a light on all team members, and anyone who meaningfully helped with the work. A final project update email is the time to praise people by name who worked on the project. Make sure to not omit key people, which is why getting feedback on this draft is a good idea.”

When announcing the completed migration, specify where you’d like people to report any issues they might see as a result of the migration.

Validate that the migration is successful. Confirm that everything works as expected. Don’t only look at logs and graphs, but look at business metrics and get details from customer support. Is anything off? If so, take a closer look.

Prioritize for follow-up work, ahead of new projects. If all goes well, there will be no further work on the migration. But what if problems surface, such as glitches for certain consumers or corrupted data?

I suggest you prioritize wrapping up the migration – and fixing any issues coming up – ahead of starting new work. If the issues are minor, you might be able to resolve them with the help of the support engineer or oncall engineer, if your team has such a rotation. If they are more complex issues, you might have to push back plans for future projects in favor of finishing the migration.

5. The migration long-tail

In organizations with lots of services, it’s often impractical to migrate all consumers to a new system. Instead, a writeback strategy is used and the migration is broken up into these steps:

  1. Migrate so the new system is primary. This new system uses writebacks or a similar approach to keep the old system up-to-date and usable by existing clients.
  2. The migration long-tail. Migrate existing consumers, one by one, to the new system. Do this either via the teams migrating as self-service, the migrating team providing hand-holding, or as a mix.
  3. Shut down the old system. Once all consumers are moved over, retire the old system. This step is often much later than most teams predict.

I’ve consistently observed the long-tail of migrations to take more time and more toll on teams owning the migration, than they expected. This often hits team morale as migrations are thankless enough to do, but they’re especially taxing when they drag on, and the team gets questioned on why it’s taking so long.

The risk of outages during this long-tail migration is high. It’s common to proceed with great caution during the first phase of migrating to the primary system. However, during the long tail, teams usually proceed with fewer guard rails, and outages frequently occur as consumers migrate.

Also, the longer the old system is kept alive, the more problems occur because of the difference in functionality between the old and the new system. This is especially problematic if code changes make it into the new system to support teams that have yet to migrate to it.

How do you manage this migration long-tail, to make it as efficient as possible? Here are things to do:

1. Put monitoring in-place for the writebacks to the old system. If the primary system has issues writing back to the old system, this can lead to major outages, as it did in the case of Uber. Even though the writeback is temporary, it’s critical you monitor its health.

2. Make it easy for teams to migrate using self-service tooling. Create runbooks, tools to do dry-run migrations and other ways for engineers owning systems to migrate, test and execute the migrations themselves. Adopt the platform team mindset by enabling the consumers to execute the migration. And if your team owns a major migration, chances are you’re either a platform team, or playing a platform team role during this migration.

3. Create visibility on the long-tail migration. Have visibility on:

  • Who uses the old service? Which consumers are dependent on it?
  • Which features do they use in the old service? This information will help with prioritizing migration, especially if the new service is missing functionality from the old service.
  • How much do they use it? How often, or with what load?
  • How critical is the old service for each team? If the old system breaks, which issues would it cause?
  • When does each consumer plan to migrate? Is this migration in their roadmap? Is this a soft commitment or a hard one?
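
The visibility questions above map naturally onto a small tracking structure. The sketch below is one possible shape, with hypothetical field names, team names and numbers; a spreadsheet would serve the same purpose.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Consumer:
    """Hypothetical row in a long-tail tracking sheet."""
    team: str
    features_used: list
    requests_per_day: int
    critical: bool
    planned_migration: Optional[date] = None  # None means no commitment yet


def needs_attention(consumers):
    # Critical consumers without a date come first, then the heaviest users.
    uncommitted = [c for c in consumers if c.planned_migration is None]
    return sorted(uncommitted, key=lambda c: (not c.critical, -c.requests_per_day))


consumers = [
    Consumer("checkout", ["charge"], 50_000, critical=True),
    Consumer("reports", ["export"], 200, critical=False),
    Consumer("invoicing", ["charge", "refund"], 9_000, critical=True,
             planned_migration=date(2022, 9, 1)),
    Consumer("loyalty", ["refund"], 1_200, critical=False),
]
follow_up = [c.team for c in needs_attention(consumers)]
```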

4. Get leadership support for the long-tail of the migration. I remember how challenging it was within Uber to motivate teams to move their systems over to the new payments system, instead of relying on the old systems that worked fine, thanks to writebacks. Most teams prioritized shipping business functionality over doing a migration that made no difference to their business.

It’s true that migrations rarely count in business impact terms for customers who are forced to do them. And it’s fair that they won’t prioritize them when they don’t think the work is important enough. This is where leadership comes in.

If a migration is truly strategic to a company, make leadership – up to and including the CTO or the engineering lead of the organization – understand why it’s strategic. And get them to prioritize the migration as one of the top initiatives for the organization during annual or bi-annual planning.

This was exactly how much of the migration effort to move to Uber’s new payments system was prioritized: the move became one of the top strategic priorities for the organization. Suddenly, teams had a valid reason to prioritize this work, ahead of some business initiatives.

Without leadership support, large migrations will drag out, and some will never get done. So get this support as early as you can.

A migration checklist

I created an extensive migration preparation checklist, available for newsletter subscribers. 🔒 Access the document here.

Part 3 will be published for all subscribers of my newsletter, and on this blog. Sign up here to not miss it. Paying subscribers can access the complete series and a migration checklist in this published article.
