Trainline’s journey from MSMQ to SQS

Whenever I reach a milestone or finish something, I pause and reflect on the journey and what I have learnt. I’m going to talk about our journey of migrating one of our systems from MSMQ to SQS.

When we started, we had a system that looked like this:

Here are some of its characteristics:

fully asynchronous
built on NServiceBus using MSMQ for message transport
pub-sub model

It had its issues, but overall it worked well. However, when we started moving Trainline systems to the cloud (we use AWS), we carried out an audit and realised that this part of the system was not very cloud friendly.

What were the pain points to address?

No visibility

Message queues were local to machines. Every time we needed to take a look, we had to log in remotely to those machines. We had fixed machines with fixed IP addresses. Our machines were like pets. While we were in a non-cloud-hosted environment, this approach was just about workable. But once you’re in the cloud, such practices have to go. Cloud hosting forces you to treat machines like cattle, not pets, to use the popular, if slightly grisly, analogy. Your system should be able to cope with a machine getting killed and a new one spinning up. Relying on fixed IPs is not workable, as machines and IPs will change very frequently.

Flakiness

There were too many steps to make the whole system work. Having too many queues was a big contributor to the flakiness. We had incidents where messages were stuck on outgoing queues due to network issues. Then subscriptions did not work because the subscription service had not processed the message. Because failures like these were asynchronous, they did not throw exceptions and went unnoticed for a long time. The system was hard to diagnose, and narrowing down issues was very time consuming.

Not fully resilient

As the system was queue-based and asynchronous, it was resilient enough to cope with some scenarios, like a service restarting or a system being partially unavailable for a short period. But it was lacking resilience in other areas. As queues were local to machines, if we lost a machine, we had no way of tracking which messages were lost with it. Another example was that workers depended on the availability of the distributor, so if a distributor went down, workers would not receive any messages to process.

Scalability in AWS

A distributor managed a list of available workers. It kept track of each worker’s IP address and queue name to be able to forward messages. But in the cloud, we didn’t want to rely on fixed IPs or machines: if a worker node died, there wouldn’t be anything to tell the distributor about it.

Steep learning curve

There is a lot that you need to learn to start using NServiceBus effectively. It would take a long time for new team members to get up and running. There were lots of configurations and framework-specific idiosyncrasies that you needed to know just to understand what was going on!

We need a better queuing system!

So we concluded that we needed to get a better queuing system and, ideally, to reduce the complexity built up around NServiceBus.

SQS vs RabbitMQ

There are many queuing systems, but we eventually narrowed it down to SQS and RabbitMQ as the final contenders.

RabbitMQ

• We have experience. We already use RabbitMQ in multiple parts of our system.
• All the infrastructure is already set up, so we would not have to invest time in that.

SQS

• It’s AWS. As we are hosting our system in AWS, it makes sense to look first at the options available in our current ecosystem.
• Low maintenance. As it’s hosted by Amazon and we consume it only as a service, we would have almost no maintenance overheads.

Here is how they compared on points which were important to us:

After reviewing all the points mentioned above, we decided to go ahead with SQS.

How our system looks now

What did we gain?

Scalability:

Subscriptions are managed at the SNS level and abstracted away from the worker. So adding more workers or removing any of them does not need any subscription management. As soon as any new worker comes up, it just starts reading from a fixed queue.
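To make the idea concrete, here is a toy in-memory model written in Python. It is not our production code and does not call AWS: `Topic` stands in for an SNS topic, each `deque` stands in for an SQS queue, and the queue and worker names are made up for illustration. It only shows that the subscription list and the list of workers are independent of each other.

```python
from collections import deque


class Topic:
    """Toy stand-in for an SNS topic: copies each message to every subscribed queue."""

    def __init__(self):
        self.queues = []

    def subscribe(self, queue):
        self.queues.append(queue)

    def publish(self, message):
        for queue in self.queues:
            queue.append(message)


def drain(queue, worker_names):
    """Let workers take turns pulling from one shared queue until it is empty."""
    handled = {name: [] for name in worker_names}
    turn = 0
    while queue:
        name = worker_names[turn % len(worker_names)]
        handled[name].append(queue.popleft())
        turn += 1
    return handled


topic = Topic()
booking_queue = deque()  # hypothetical queue, one per subscribing service
email_queue = deque()    # hypothetical second subscriber
topic.subscribe(booking_queue)
topic.subscribe(email_queue)

for i in range(4):
    topic.publish(f"msg-{i}")

# Adding "worker-c" only changes who reads the queue; subscriptions are untouched.
result = drain(booking_queue, ["worker-a", "worker-b", "worker-c"])
```

Real workers would poll SQS concurrently rather than take turns, but the point carries over: the topic knows about queues, never about individual workers.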

Visibility:

Now queues are external and easy to inspect using the AWS Console. If a worker fails to process a message, after 5 attempts the message is moved to our dead-letter queues. We have alerting set up on the dead-letter queues. All of this happens without any need to log in to any worker nodes, which is a huge relief!
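In SQS, the "move after N attempts" rule is configured on the source queue through its `RedrivePolicy` attribute, which is a JSON string. The Python sketch below builds that string with a limit of 5 receives. The helper name `redrive_policy` and the ARN are placeholders for illustration, not values from our system.

```python
import json


def redrive_policy(dead_letter_queue_arn, max_receive_count=5):
    """Build the JSON string SQS expects in a queue's RedrivePolicy attribute."""
    return json.dumps(
        {
            "deadLetterTargetArn": dead_letter_queue_arn,
            "maxReceiveCount": str(max_receive_count),
        }
    )


# Placeholder ARN, for illustration only.
policy = redrive_policy("arn:aws:sqs:eu-west-1:123456789012:orders-dead-letter")
attributes = {"RedrivePolicy": policy}

# With a boto3 SQS client this could be applied to the source queue via:
#   sqs.set_queue_attributes(QueueUrl=queue_url, Attributes=attributes)
```

Once the policy is in place, SQS counts receives per message and moves the message itself, so workers need no retry-counting code of their own.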

Reliability/Robustness:

Another benefit of having external queues is that now, if any worker is taken down, we simply spin up a new one. There is no chance of losing any messages. And we can increase or decrease the number of workers as and when needed. This is the “cattle” principle in action!
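The reason a dying worker does not take messages with it is that SQS keeps a message until the consumer explicitly deletes it. A received but undeleted message becomes visible again once its visibility timeout expires. Here is a minimal Python sketch of a worker step that relies on this. The function name `poll_once` and the `handler` callback are hypothetical. `sqs` is assumed to be any object exposing `receive_message` and `delete_message` with the same keyword arguments as the boto3 SQS client.

```python
def poll_once(sqs, queue_url, handler):
    """Receive one batch and delete each message only after handler succeeds."""
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,  # long polling, to avoid hammering an empty queue
    )
    processed = 0
    for message in response.get("Messages", []):
        try:
            handler(message["Body"])
        except Exception:
            # Not deleted: SQS will show it again after the visibility timeout.
            continue
        sqs.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message["ReceiptHandle"],
        )
        processed += 1
    return processed


# Example wiring (assumes boto3 is installed and credentials are configured):
#   import boto3
#   sqs = boto3.client("sqs")
#   while True:
#       poll_once(sqs, queue_url, handle_booking_message)
```

Because a message can be delivered more than once under this scheme, the handler should be safe to run twice on the same message.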

Simplicity:

We no longer use NServiceBus for this part of our system. AWS SQS code is no more than 10 lines! Removing NServiceBus from the equation has significantly reduced the learning curve for new team members!

Conclusion

We still use NServiceBus in other parts of our system. We think it’s a very powerful framework and very useful in many use cases, especially when you have long-running sagas. But our needs for this part of the system are much simpler, so NServiceBus and RabbitMQ would have been overkill here.

If you have also gone through, or are going through, a similar phase, we would love to hear your questions, thoughts and insights.

About the author

Balpreet is a full-stack developer. He loves building back-end APIs and Mobile Apps. He tries hard to keep things simple so that the minimum of effort is required by anyone trying to learn the system.
