How we migrated to Google Cloud

At Songkick, we recently completed a successful migration from our data centre to Google Cloud Platform. I’d like to give an overall account of how we successfully delivered this project with a small team in around 18 months, while allowing the rest of the company to continue delivering exciting products and features to our users.

Why migrate?

For roughly 9 years before the migration, our infrastructure lived in a data centre just outside of London. This had served us really well over the years, but had a number of important issues:

  • Only a very small number of developers on our team had the experience to debug and fix some low-level infrastructure problems that could occur, for example tricky networking issues. This led to us being over-reliant on those developers to fix occasional out-of-hours fires, and consequently to them not getting enough sleep!
  • During recent years, we’d done enough to put out fires and fix bugs in our infrastructure, but not enough to actually improve it, keep it up to date, and bring it in line with modern practices. We were at a point where doing that would be a huge task (likely larger than migrating to the cloud).
  • Adding capacity or making major changes to our infrastructure took too long. Buying new hardware, installing it and manually moving VMs around was time-consuming and made us slower to deliver as a team.
  • Moving to the cloud would give us access to new services and tools that would allow us to build even better things. For example, we are now using Google Cloud Machine Learning, Cloud Dataflow and BigQuery, and these have had a huge impact.

How did we start?

With a minimum of fuss! Once we’d decided that something needed to be done, we assembled a small team of 4 engineers, had a meeting to discuss first steps, and sent an email to the company:

[Image: the email announcing the migration project to the company]

We then worked through the migration in 5 phases:

Phase 1: MVP

Goal: work in a small team to deliver an MVP of a single service to production in 3 months

This included deciding on a hosting solution (self-hosted vs managed hosting vs various cloud options), deciding on the high-level platform architecture and major technology choices (for example, using Terraform and Puppet for provisioning), and migrating a single back-end service to fully serve production traffic. And all of this with only four engineers in three months!
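
As a rough illustration of the Terraform side of that choice (a minimal sketch, not our actual configuration; the project ID, machine type and startup script below are placeholders), a single service host on Google Compute Engine can be described like this, with Puppet taking over configuration once the instance is up:

```hcl
# Hypothetical example: provision one back-end host on Google Compute Engine.
provider "google" {
  project = "example-project"   # placeholder project ID
  region  = "europe-west2"
}

resource "google_compute_instance" "backend_service" {
  name         = "backend-service-1"
  machine_type = "n1-standard-2"
  zone         = "europe-west2-a"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-11"
    }
  }

  network_interface {
    network = "default"
    access_config {}            # ephemeral external IP
  }

  # Hand over to Puppet (or any bootstrap script) once the VM boots.
  metadata_startup_script = "puppet agent --test || true"
}
```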

Some important aspects of this were:

  1. Getting a production service migrated as quickly as possible allowed us to find out very quickly if our overall approach was likely to succeed. Instead of building a non-production prototype in a fully-fledged infrastructure that looked great, and only further down the line finding that we couldn’t handle the levels of traffic we needed, or that the new service was much less stable than the old one, this approach allowed us to fail fast if we were to fail at all.
  2. The small team effect: we’ve consistently found at Songkick that small autonomous teams tend to achieve more (per team-member) than larger teams, have more ownership of what they are building and are happier and more effective as a result. This recipe is even more successful when the team has a short deadline and a clear goal. Having a small team also allowed us to get started with minimal disruption to the rest of the company.
  3. To achieve the above with only four engineers, we had to be very ruthless with our prioritisation, and defer building a bunch of infrastructure that we would normally require for a production service. The most extreme example of this was that for the MVP service we would not have any monitoring or logging! This is a decision we'd normally be extremely uncomfortable with, but we decided it really was that important to gain confidence in the overall approach as quickly as possible.

The outcome: we did it! After three months of work, we had a single small back-end service deployed to Google Cloud and serving production traffic.


Phase 2: Build confidence

Goal: Prove that our high-level technology and architectural choices will work (3 months)

At this stage, the highest priority was to understand the biggest risks facing the success of the project and make sure that we could overcome them. As with phase 1, the idea here is that it’s much better to find that the approach needed a major re-think as early as possible, rather than when we’d already spent eight months migrating.

To do this, we sat down as a team and listed all the risks and uncertainties with our approach, and prioritised the most serious ones. A couple of examples of our biggest concerns were:

  1. Will it be too expensive?
  2. Will the network performance be good enough (we had very stable network performance in our data centre and some of our applications may rely on this)?

Next, for each prioritised issue, we came up with a plan to test how serious each concern was. For example, for the above questions:

  1. We spent time modelling everything we would need for our new platform, and then calculating likely best case and worst case costs.
  2. We migrated the application that most relied on network stability and monitored it over several months for any issues.

At the end of this phase, we found that we had no major issues, and we were ready to move onto phase 3:

Phase 3: Building the foundations

Goal: Build our monitoring, alerting, logging, and deployment infrastructure (3 months)

For each of the above, we got together to discuss requirements, investigated the available options, and then made a decision as a team (or decided to investigate further) before building the solution. Where possible, we tried to stick with solutions that were similar to our existing infrastructure, in order to reduce unknowns and make it easier for engineers to transition between platforms. On the other hand, where we saw large benefits in moving to something new, we did so. For example, we completely overhauled our monitoring system to use Prometheus, as it was much more cloud-friendly than our existing monitoring system.
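
To give a flavour of why Prometheus suits a cloud platform (a minimal sketch under assumed names; the project, zone and port below are placeholders, not our real setup): rather than maintaining a static host list, it can discover Compute Engine instances as they come and go via its built-in GCE service discovery:

```yaml
# Hypothetical prometheus.yml fragment: scrape every instance in one zone.
scrape_configs:
  - job_name: "gce-services"
    gce_sd_configs:
      - project: "example-project"   # placeholder project ID
        zone: "europe-west2-a"
        port: 9100                   # e.g. a node_exporter endpoint
    relabel_configs:
      # Label targets by instance name rather than raw IP address.
      - source_labels: [__meta_gce_instance_name]
        target_label: instance
```

New instances then appear as scrape targets automatically, which suits an environment where machines are created and destroyed far more often than in a data centre.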

Phase 4: Execution

Goal: Move everything to the new platform! (9 months)

The most exciting part of the project! This included moving all of our applications to the platform, including our frontend applications, services, databases, ETL system and more.

Some key aspects of this phase:

  1. We wanted to fully migrate one component at a time (and deprecate the data centre version of that component), rather than duplicate all of our infrastructure then do a big-bang release at the end. This meant we could deliver value to the company (taking advantage of our cloud infrastructure) as we migrated, pressure test our solution gradually and overcome issues early.
  2. At this stage of the project, education for the rest of the engineering team was key. As well as the 4 core engineers on the team, the rest of the engineering team had a weekly rota where one engineer would join us each week and help to migrate our applications. This meant that everyone on the team could gain confidence working with, and maintaining, the new platform. As well as this, we ran regular talks and workshops on various aspects of working with our new infrastructure.
  3. We made completing the project, and disbanding the migration team, one of our Tech Team Goals for 2019.
  4. About two months from completion, we decided on a deadline for turning off all the servers in the data centre. For any long-running project, you'll find that there's always more you could do as you get towards the end, and the final two months can easily turn into four months, six months or more. To avoid this, we regularly reprioritised and decided which corners we needed to cut to meet the deadline. For example, certain parts of our ETL infrastructure only run once per month, so we decided not to migrate them at all until after we shut off the data centre. Similarly, there were certain tasks that we had automated in the data centre that we decided to run manually, and have the engineering team migrate them later down the line if we decided they needed automating.

Phase 5: Celebration!

After 18 months of hard work, we had finally done it: the data centre was no more!


After such a huge project it’s important to celebrate your success!

[Photo: celebration cakes (members of the migration team were code-named the badgers)!]

What’s next?

After completing the migration, we spent a couple of months ironing out some of the tech debt and uncompleted work that had been de-prioritised to get the migration over the line.

Now, we’re working on the future of our platform, including:

  • keeping on top of tech debt and keeping our systems and dependencies up to date.
  • being proactive about improving our platform, so that we don't ever have to migrate again.
  • building amazing features for our users that we couldn’t build before.

We’ll also be sharing more around our architecture, technology choices and implementation in future posts.

Thanks!

