
Detectify’s journey to an AWS multi-account strategy

source link: https://blog.detectify.com/2023/04/13/aws-multi-account-strategy/

April 13, 2023

In the past year, we’ve shifted our infrastructure from a single Amazon Web Services (AWS) account owned by our Platform team to multiple domain-specific accounts. For each product domain and environment, we have created AWS accounts, which has allowed us to improve stability and security by reducing the blast radius. This setup also provides excellent scalability with good cost observability across the organization.

We’ve introduced a multi-account strategy at Detectify following two revered resources: AWS best practices and Team Topologies. Each domain has its own AWS account per environment, and we’ve transferred full domain ownership to the teams that own the domain.

In this article, I’ll share our reasons for using a multi-account strategy, our journey, the lessons we learned, and some of the benefits we’ve reaped.

Why we implemented the AWS multi-account strategy

Detectify has been using cloud infrastructure for many years, and our entire infrastructure lives on AWS.

We began as simply as possible in the early days to get things done.

One team with great collaboration focused on getting stuff done

Over time, our service and team both grew, and the cognitive load became too big for a single team. As a result, our system became more complicated and required more focus on infrastructure to make it scalable, stable, and secure.

We decided to split the teams and introduce a Platform team to relieve product teams of many complicated, repetitive tasks, leaving them more time to focus on product development.

Following the methodology of Team Topologies, we used four types of teams to categorize our teams and accounts:

Stream-aligned teams
These teams focus on a single stream of work, such as a single product, service, set of features, or user journey. Stream-aligned teams are positioned to deliver value autonomously and need to be cross-functional to build, test, and operate independently.

Platform teams
The objective of Platform teams is to make stream-aligned teams’ work more accessible by providing the internal structures and services stream-aligned teams need to build, test, and deliver value continuously.

Complicated subsystem teams
These distinctive teams help fill the gap by providing specialized expertise, so stream-aligned teams can focus on the work they do best.

Enabling teams
Finally, enabling teams provide support to stream-aligned teams, helping bridge knowledge and capability gaps by sharing best practices and educating teams on emerging technologies.

The aim of our Platform team would be to focus on building a platform that supports stream-aligned teams focused on product development, treating product teams as customers.

How Detectify introduced a Platform team

Stream-aligned teams could focus on product development with high productivity instead of on where and how their services or databases were deployed, which the Platform team took care of. The Platform team focused on setting up a container orchestration system using Amazon Elastic Kubernetes Service (EKS), the managed Kubernetes offering from AWS. It was easy to deploy new changes, as our infrastructure was scalable and secure.

This setup had pros and cons, but overall, we found it manageable for a while.

Pros

  • The cognitive load on teams was manageable.
  • It was easy to deploy current and future services.
  • Stream-aligned teams didn’t need to worry about infrastructure.

Cons

  • Teams were often dependent on the Platform team.
  • Resource management and isolating the blast radius were challenging.
  • We faced complexity in scaling the organization.
  • We had difficulty tracking the cost for specific products and features.
  • There was a lack of developer freedom to try other AWS services.

After using this setup for a few years, Detectify had a new investment round. As a result, the organization grew and needed to introduce more teams. Thus, we went from two to 10 domain teams.

More domain teams were introduced

Our system grew, as did the number of teams, and the Platform team became a bottleneck because it couldn’t cope with all the new requests coming in from the other teams. We began to feel the consequences of the cons described above.

Our systems were becoming less stable, and the Platform team was always on their toes. Since they already had so much to maintain, the Platform team felt hesitant to add new features and add to their workload. As a result, it became challenging to maintain the stability and security of the infrastructure while delivering quickly enough.

From the principles of Team Topologies, we’ve learned that “Each service must be fully owned by a team with the sufficient cognitive capacity to build and operate it”.

With our setup at the time, we had failed on this point. Domain teams needed to own their services and maintain them entirely. However, restricting a team’s permissions to only its own resources in a shared AWS account and Kubernetes cluster is challenging.

To resolve the issue, we decided to give domain teams more freedom, though in a more stable and secure way. To do so, we decided that our Platform team would also act as the Enabling team.

The journey to AWS multi-account strategy

Detectify’s journey towards an AWS multi-account strategy started after some of our teams voiced frustrations about being unable to develop quickly enough and to fully own what they’d built. We had found ourselves in a catch-22: domain teams felt blocked by the Platform team, yet the Platform team could only support them faster by jeopardizing quality and security, which caused stress and friction for both parties.

To resolve these issues, we introduced a number of key steps to keep us moving forward on our journey. 

Identify and interview stakeholders

First, we identified all stakeholders and discussed requirements, best practices, and recommendations. After interviewing all parties involved, we formed a better picture of their needs and current challenges:

  1. Domain teams wanted to fully own their systems without being dependent on the Platform team.
  2. Domain teams wanted to use AWS services that could solve business problems in a more accessible and scalable way.
  3. Domain teams were waiting too long to get basic permissions to conduct their work, which was costly for the company.
  4. The Platform team understood security requirements and was concerned about violating some of them, especially given that meeting security requirements is a core value of Detectify’s business.
  5. Regardless of the next steps we had in mind, the Security and compliance team required us to uphold our security policy. 

We all wanted to get things done faster and even more securely, so the ultimate goal was clear. 

So how did we get there? Now, that’s the journey!

Designing account structure

We didn’t want to increase the cognitive load on any team, so we approached our AWS accounts with that in mind throughout this process. Our first thought was to create an account for each team and a separate one for the Platform team.

Team-based AWS accounts

After some time, however, we found that this approach was neither scalable nor future-proof. Organizational structures evolve, and the names of individual teams change along with them. Every new team name would need to be reflected in the corresponding accounts and their related automation processes, which is a time-consuming task.
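The rename problem described above can be sketched in a few lines of Python. This is only an illustration: the domain names, team names, and the `domain-environment` naming pattern are hypothetical and not taken from Detectify's actual setup. Account names are derived from a domain and an environment, while team ownership lives in a separate mapping, so renaming a team changes one mapping and no account names.

```python
from dataclasses import dataclass

# Environments mentioned in the article; the naming pattern below is an assumption.
ENVIRONMENTS = ("development", "staging", "production")


@dataclass(frozen=True)
class Account:
    domain: str
    environment: str

    @property
    def name(self) -> str:
        # Hypothetical convention: the team name never appears in the account name.
        return f"{self.domain}-{self.environment}"


def accounts_for(domains):
    """Return one account per (domain, environment) pair."""
    return [Account(d, e) for d in domains for e in ENVIRONMENTS]


def rename_team(owners, old, new):
    """Return a new ownership mapping with one team renamed."""
    return {domain: (new if team == old else team) for domain, team in owners.items()}


# Hypothetical domains and owning teams.
owners = {"scanning": "team-alpha", "billing": "team-alpha"}

names_before = [a.name for a in accounts_for(owners)]
owners = rename_team(owners, "team-alpha", "team-phoenix")
names_after = [a.name for a in accounts_for(owners)]
```

After the rename, `names_before` and `names_after` are identical, which is the property a team-based naming scheme cannot offer.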

After reading a lot about AWS best practices for a multi-account strategy as well as the teachings of Team Topologies, we went with a different approach. The approach that we chose is scalable, secure, and protects teams against cognitive overload.

In order to match our AWS organization to the domains in our company’s structure, we used AWS Control Tower, as described in the image below.

To break this down a bit, here’s what we have in the picture above:

Workloads: A workload account represents a domain. In our case, the word domain comes from Domain-Driven Design (DDD), where it represents the sphere of knowledge and activity around which the application logic revolves. A team can own one or more domains, depending on the size of the domain and its importance to the company.

  • SDLC: This part represents the environment. We have “development” and “staging” in this Organization Unit (OU). Anything in non-production can be part of this OU.
  • Production: This OU represents the customer-facing environment.

Security: We have some security tooling in this OU to monitor the security configuration of the resources in the organization’s accounts. Besides that, we have some audit logs to monitor activities in all accounts.

  • Log Archive: An account used by the Platform team to access all logging information for our AWS accounts.
  • Audit: An account used by the Platform team to access the audit information made available by AWS Control Tower, including AWS Security Hub security check results.

Infrastructure:

  • Network: This account keeps all our network infrastructure. All the VPCs and NAT Gateways are handled in this account. The Platform team also owns this account, but it can be easily handed over to the Network team (once we introduce such a team).
  • Central services: We keep all central services in this account: DevOps Platform & Pipelines, Observability Platforms, Kafka clusters, and similar. The Platform team also owns this account.
  • Backups: We have an easy-to-maintain backup policy configured on all workload accounts. All databases and other storage are backed up to the backup vault in the workload account and then copied into the Backup account. This policy protects us if a specific workload account is harmed, because we can always restore data from the Backup account. In addition, we have a “break glass” policy, which means the Platform team can access this account when necessary.

Transitional: This OU represents our old accounts from where we are migrating all our services and databases.

Sandbox: This OU is not connected to our network. Instead, it’s used for exploration and innovation.

Vulnerable systems: We want to test our product to ensure that it will detect vulnerabilities in the system. However, such a system must not be deployed within our network, and for this reason, we have a particular OU for it with an isolated network infrastructure.

  • SDLC: Same as the SDLC OU under Workloads.
  • Production: Same as the Production OU under Workloads.
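The OU layout described above can be written down as a small tree, which is sometimes handy for documentation or for checking automation against the intended structure. The following Python sketch is a model only, not a Control Tower API call: the `OU` class, the helper functions, and the root name `Root` are assumptions made for illustration. Workload accounts are left out because the article does not name the individual domains.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class OU:
    """An organizational unit holding child OUs and named accounts."""
    name: str
    children: list[OU] = field(default_factory=list)
    accounts: list[str] = field(default_factory=list)


# Structure as described in the article; "Root" is an assumed name.
ORG = OU("Root", children=[
    OU("Workloads", children=[OU("SDLC"), OU("Production")]),
    OU("Security", accounts=["Log Archive", "Audit"]),
    OU("Infrastructure", accounts=["Network", "Central services", "Backups"]),
    OU("Transitional"),
    OU("Sandbox"),
    OU("Vulnerable systems", children=[OU("SDLC"), OU("Production")]),
])


def ou_paths(ou: OU, prefix: str = "") -> list[str]:
    """List the full path of every OU in depth-first order."""
    path = f"{prefix}/{ou.name}" if prefix else ou.name
    result = [path]
    for child in ou.children:
        result.extend(ou_paths(child, path))
    return result


def account_paths(ou: OU, prefix: str = "") -> dict[str, str]:
    """Map each named account to the path of the OU that contains it."""
    path = f"{prefix}/{ou.name}" if prefix else ou.name
    found = {account: path for account in ou.accounts}
    for child in ou.children:
        found.update(account_paths(child, path))
    return found


all_ous = ou_paths(ORG)
where = account_paths(ORG)
```

For example, `where["Network"]` resolves to `Root/Infrastructure`, and `all_ous` contains both `Root/Workloads/SDLC` and `Root/Vulnerable systems/SDLC` as separate OUs.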

With each of these components set up, we established a governance framework to limit regions in which we want to deploy our services. Guardrails and security policies were set to ensure that we can’t deploy or make changes in our systems that would violate our security requirements. 
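The article does not say how the region limit was implemented. One common way to express such a guardrail in AWS Organizations is a service control policy (SCP) that denies requests whose `aws:RequestedRegion` is outside an allow-list, with global services exempted through `NotAction`. The Python sketch below builds such a policy document. The region, the exempted services, and the `is_denied` helper are assumptions for illustration; the helper is a rough local check, not a replica of how AWS evaluates policies.

```python
import json
from fnmatch import fnmatchcase


def region_guardrail(allowed_regions, exempt_actions):
    """Build an SCP-style document denying requests outside allowed_regions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOutsideAllowedRegions",
                "Effect": "Deny",
                # Global services are exempted so they keep working everywhere.
                "NotAction": sorted(exempt_actions),
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {
                        "aws:RequestedRegion": sorted(allowed_regions)
                    }
                },
            }
        ],
    }


def is_denied(policy, action, region):
    """Simplified local check: would this Deny statement match the request?"""
    for stmt in policy["Statement"]:
        if stmt["Effect"] != "Deny":
            continue
        if any(fnmatchcase(action, pattern) for pattern in stmt["NotAction"]):
            continue  # exempted action, the statement does not apply
        allowed = stmt["Condition"]["StringNotEquals"]["aws:RequestedRegion"]
        if region not in allowed:
            return True
    return False


# Assumed values: one allowed region and a short, illustrative exemption list.
policy = region_guardrail(
    allowed_regions={"eu-west-1"},
    exempt_actions={"iam:*", "organizations:*", "route53:*", "sts:*", "support:*"},
)
policy_json = json.dumps(policy, indent=2)
```

With these assumed values, `is_denied(policy, "ec2:RunInstances", "us-east-1")` is true, while the same action in `eu-west-1` and any `iam:*` action are not denied. A real exemption list needs review against the services actually in use.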

We continue to improve these policies in order to make our environments even more secure. We’ll write about our security and compliance measures in upcoming blog posts.

Benefits and outcomes

After switching to multi-account orchestration, we’ve gained better cost observability per domain and team. However, one of the most visible benefits of this setup is the reduced blast radius.

We used to have frequent system disturbances: a team would deploy a service to the shared infrastructure and disturb all the other services running there, or a bug in one system would affect other systems. Such an incident would trigger alerts for all teams, including the Platform team.

System disturbances over time

We started with our multi-account orchestration journey at the end of 2021, and since then, we’ve gathered results that have demonstrated the increased stability of our systems.

Another, even more important benefit is a higher rate of both developer satisfaction and personal development. Most of our developers didn’t know much about AWS or infrastructure; now, they are pretty good at it!

Before our current strategy, our Platform team mostly worked with Kubernetes, Docker containers, and similar tools. They needed to gain more experience in other cloud services to be able to help domain teams. Today, they’re experts in a broad range of AWS services!

Advice on setting up an AWS multi-account structure

Multi-account orchestration can provide many benefits for modern organizations, but if you’re not careful, it can also cost you a lot of time spent in the wrong areas.

When setting up a similar strategy, my fundamental piece of advice is to be strict:

  • Start off with one or two teams. 
  • Limit the number of technologies your platform team can handle. 
  • Create good example projects or templates and, once tested enough, open the doors for more.
  • Take the most critical system that might disturb other systems in the previous infrastructure (Kubernetes, for example) and solve that before moving on to anything else.

The rest of the domain teams might become impatient at times, but business development will continue, and they will need more time on infrastructure coding when the time comes.

When you have many options, it requires a lot of time and effort to conduct the necessary research and development to make the right choice. Having options is great, but not having to choose is even better. Throughout our journey towards AWS multi-account strategy, we found this to be true.

Here are a few more tips that I’d suggest when taking on a similar initiative: 

  • Decide on infrastructure language and stick to it across the organization. 
  • If you are using Terraform or similar tools, create modules for all the services you expect domain teams to use (Lambda, ECS, API Gateway, and others).
  • Next, adapt the modules to your infrastructure and let domain teams use them. Being strict is the way to go. If something can be improved, it’s improved in one place only, not 10.
  • Don’t rush into using many new AWS services all at once. Master the commonly used ones, then add new ones as requirements arise.
