Search at Cookpad: building new infrastructure


TL;DR

  • Teams that own the full stack of application development and infrastructure components have better communication and collaboration.
  • Automating infrastructure operations with tools like Atlantis and Flux improves collaboration in multidisciplinary teams.
  • We built a new search infrastructure to develop search services by embedding an SRE in our search team.
  • This article focuses on the technical content of the project; tips on soft skills will be covered in another article.

Background

Cookpad is one of the world’s largest recipe-sharing platforms. A core part of this platform is our search engine, which allows our users to discover exciting recipes to cook.

Recently we have invested in the search experience, building new teams who have delivered a new version of our underlying search service, called search-v2. This team has embraced many new technologies (e.g. features powered by machine learning).

As well as building the next generation of our search engine, this team has also embraced new technologies and ways of working in the underlying infrastructure that enables us to deliver this service to users around the world.

I am a member of the team of Site Reliability Engineers at Cookpad. We support the infrastructure that our applications run on. Historically, most of our applications have been built with Ruby on Rails running on Amazon ECS (Elastic Container Service). Deployments on this infrastructure platform are fully supported by the SREs, without application engineers needing deep knowledge of the platform itself.

The medium term goal of the SRE team is to move away from this model, and build tools and a platform where application engineers can own and operate their own infrastructure. We will achieve this by adopting new models of collaboration and by adopting industry standard tooling, like Kubernetes.

During the early phases of development, search-v2 was built on top of infrastructure that was shared between our search and machine learning teams. This was done so that both these teams could be early adopters of Kubernetes and the new model of operating their own infrastructure platform.

Now, in the interest of improved autonomy, the search team wanted to split this infrastructure so that the team owns its own vertical stack, with each stack incorporating its own applications and infrastructure.

Although the motivation of the search team makes sense, the team didn’t yet have enough expert knowledge of infrastructure migrations and the technology stack they wanted to use. Therefore, we started a collaboration project, embedding a member of the SRE team (me) in the search team to build the new search infrastructure.

In this article, we will explain the technical aspects of how we built the new search infrastructure through collaboration between an embedded SRE and our search team. Other topics, such as project management and tips for the SRE embedding model, will be discussed in a future article.

Overview of search infrastructure

The figure below gives a high-level overview of the search infrastructure and illustrates how developers apply day-to-day changes via GitHub pull requests and Docker image updates.

[Figure: high-level overview of the search infrastructure]

As shown above, any change to the search infrastructure, including applying Kubernetes manifests, is delivered through automation tools such as Atlantis (which applies Terraform configuration) and Flux (which applies Kubernetes manifests). The following sections elaborate on each of them.

Managing AWS resources using Terraform

Terraform project structure

In Cookpad, we primarily use AWS to run server resources. Since our search and machine learning teams have been sharing a GitHub repository to store all Terraform configurations in a single place, we decided to inherit that Terraform management strategy this time. The directory structure looks like the one below.
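
As an illustration (the individual file names are hypothetical; only the flat terraform directory and the root-level Atlantis configuration file are described here):

```text
.
├── atlantis.yaml        # Atlantis repo-level configuration (described below)
└── terraform/
    ├── provider.tf
    ├── backend.tf       # remote state configuration
    ├── eks.tf           # EKS clusters (see the next section)
    ├── elasticsearch.tf
    ├── iam.tf
    └── ...
```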

As you can see, we put all Terraform configurations in the terraform directory. As long as we don’t have too many resources, this flat file and directory structure makes it easy to understand how we manage AWS resources. In the future, we may separate the configurations into different Terraform states and directories, but we decided to follow the current convention during the project.

In addition, we place an Atlantis configuration file at the root of the repository. Atlantis is explained later in this article.

Spinning up new EKS clusters

The search team was keen to use Kubernetes as a container orchestrator. In Cookpad, we primarily use AWS to run most of our server resources, and we chose EKS to deliver Kubernetes on AWS. The search team was already somewhat familiar with this combination thanks to the infrastructure shared with the machine learning team.

To launch new EKS clusters, we use the Terraform EKS module maintained by the SRE team at Cookpad. There are various ways to launch new EKS clusters on AWS, and running them requires understanding many technical concepts from both AWS and Kubernetes. This module provides a way to provision an EKS cluster based on the current best practices employed by Cookpad’s SRE team. For example, the following Terraform configuration can spin up a new EKS cluster using spot instances, together with the necessary VPC, IAM roles, Security Groups, and Auto Scaling groups.
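
The snippet below is a rough sketch only: the input names and submodule path are assumptions and may not match the module’s actual interface, so check the module’s README (https://github.com/cookpad/terraform-aws-eks) for the real variables.

```hcl
# Rough sketch of launching an EKS cluster with Cookpad's Terraform EKS module.
# Input names, versions, and values are illustrative assumptions, not the
# module's documented interface.
module "search_cluster" {
  source  = "cookpad/eks/aws"
  version = "~> 1.0"

  name = "search-production"

  # By default the module can create the VPC, IAM roles, and Security Groups;
  # existing ones can be passed in instead.
}

module "search_nodes" {
  source  = "cookpad/eks/aws//modules/asg_node_group"
  version = "~> 1.0"

  cluster_config = module.search_cluster.config

  max_size           = 30
  instance_lifecycle = "spot"  # run the worker nodes on spot instances
}
```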

Creating the VPC, IAM roles, and Security Groups is optional; you can specify existing ones instead of having the Terraform module create them.

Automation of Terraform plan and apply using Atlantis

Our search and machine learning teams use Atlantis, which automates `terraform plan` and `terraform apply` operations on GitHub pull requests. Because Terraform stores the current infrastructure and its configuration in state, Atlantis guarantees that only a single `terraform plan`/`terraform apply` is executed at a time.

Atlantis itself runs on one of our EKS clusters and is deployed using Atlantis’ Helm chart. We place the following `atlantis.yaml` in the Terraform repository:
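
A minimal sketch of such a file, assuming it sits at the repository root and points Atlantis at the terraform directory (the project name, Terraform version, and apply requirements shown here are illustrative):

```yaml
# atlantis.yaml -- Atlantis repo-level configuration (illustrative values).
version: 3
automerge: false
projects:
  - name: search-terraform          # hypothetical project name
    dir: terraform                  # run plan/apply from the terraform directory
    terraform_version: v0.14.7      # illustrative; pin to the version you actually use
    autoplan:
      when_modified: ["*.tf", "*.tfvars"]
      enabled: true
    apply_requirements: [approved]  # require an approved review before `atlantis apply`
```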

With the above setup, we can execute `terraform plan` and `terraform apply` via Atlantis on each GitHub PR as illustrated below.

[Figures: Atlantis commenting the results of terraform plan and terraform apply on a GitHub pull request]

Managing Kubernetes manifests using Kustomize and Flux

Managing Kubernetes manifests using Kustomize

There are many ways to manage Kubernetes manifests, such as vanilla manifests, Kustomize, Helm, kpt and others. We chose Kustomize to maintain Kubernetes manifests because of its simplicity: developers don’t have to learn much beyond the concept of Kubernetes manifests themselves. Moreover, Kustomize is supported not only by kubectl but also by many tools like ArgoCD and Flux, which we use to deliver GitOps for Kubernetes manifests.

The directory structure of the Kustomize repository is as follows. We collect all reusable manifests in the base directory and put manifests for each EKS cluster under the overlays directory.
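
An illustrative layout (the application and cluster names are hypothetical) might look like this:

```text
.
├── base/
│   ├── kustomization.yaml
│   └── search-api/                 # hypothetical application
│       ├── deployment.yaml
│       ├── service.yaml
│       └── kustomization.yaml
└── overlays/
    ├── search-production/          # one overlay per EKS cluster
    │   ├── kustomization.yaml
    │   └── flux-patch.yaml         # image-tag patches maintained by Flux (see below)
    └── search-staging/
        ├── kustomization.yaml
        └── flux-patch.yaml
```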

Employing GitOps for Kubernetes manifests using Flux

As mentioned above, we achieve GitOps for Kubernetes manifests with Flux v1.

Flux is a tool that automatically ensures that the state of a cluster matches the config in git. It uses an operator in the cluster to trigger deployments inside Kubernetes, which means you don’t need a separate CD tool. It monitors all relevant image repositories, detects new images, triggers deployments and updates the desired running configuration based on that (and a configurable policy).

Using Flux, we detect newly pushed Docker images and commits to the main branch of the Kustomize repository, and generate Kustomize patches to reflect the latest changes. Some people may notice that there is a brand-new Flux v2 project built on the GitOps Toolkit. However, we chose Flux v1 over v2 because of our familiarity with the software and its technology maturity.
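
To give a concrete idea of how this fits together, here is a hedged sketch of the two pieces Flux v1 relies on. First, a `.flux.yaml` in the overlay directory tells Flux to build the manifests with Kustomize and to record automated image updates in a patch file (the patch file name is illustrative):

```yaml
# .flux.yaml: Flux builds the manifests with Kustomize and writes
# automated image updates into flux-patch.yaml.
version: 1
patchUpdated:
  generators:
    - command: kustomize build .
  patchFile: flux-patch.yaml
```

Second, workloads opt in to image automation via annotations; the deployment name, container name, image, and tag filter below are hypothetical:

```yaml
# Opting a workload in to Flux v1's image automation (illustrative values).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: search-api
  annotations:
    fluxcd.io/automated: "true"      # let Flux roll out newly pushed images
    fluxcd.io/tag.app: glob:main-*   # only tags matching this filter, for the "app" container
spec:
  selector:
    matchLabels: {app: search-api}
  template:
    metadata:
      labels: {app: search-api}
    spec:
      containers:
        - name: app
          image: registry.example.com/search-api:main-2021-01-01  # hypothetical image
```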

Moreover, we integrate Flux with the Helm Operator. This operator enables us to manage Helm chart releases declaratively alongside Kustomize. The desired state of a Helm release is represented through a Kubernetes Custom Resource named HelmRelease. When a HelmRelease resource is created, mutated, or removed in a cluster, the operator performs the corresponding Helm actions. This way, we can combine Helm with Flux and automate releases in a GitOps manner. Here is an example of a HelmRelease manifest that generates the elasticsearch-operator manifests with the operator.
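
A sketch of what such a manifest looks like with the `helm.fluxcd.io/v1` API; the namespace, chart repository, and chart version are assumptions, since the article does not pin down which Elasticsearch operator chart is deployed:

```yaml
apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
metadata:
  name: elasticsearch-operator
  namespace: elastic-system          # hypothetical namespace
spec:
  releaseName: elasticsearch-operator
  chart:
    # Chart source is illustrative; point this at the operator chart you actually use.
    repository: https://helm.elastic.co
    name: eck-operator
    version: 1.3.0
  # Chart-specific values (if any) would go under spec.values; the Helm Operator
  # applies them whenever the HelmRelease changes.
```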

To keep track of which changes are deployed to clusters, we also send Flux events to a webhook using Fluxcloud. We use it to trigger Slack notifications when Flux applies changes to a cluster. It is deployed as a sidecar on the Flux deployment.
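
As a rough sketch of the sidecar wiring (the image tag, port, and environment variable names below are assumptions based on the fluxcloud project and should be verified against its documentation), the relevant part of the Flux Deployment looks something like this:

```yaml
# Excerpt of the Flux Deployment with a fluxcloud sidecar (values are illustrative).
spec:
  template:
    spec:
      containers:
        - name: flux
          args:
            # Usual Flux arguments omitted; point Flux's event stream at the sidecar.
            - --connect=ws://127.0.0.1:3032
        - name: fluxcloud
          image: justinbarrick/fluxcloud:v0.3.9     # tag is an assumption
          ports:
            - containerPort: 3032
          env:
            - name: LISTEN_ADDRESS                  # env var names are assumptions
              value: ":3032"
            - name: SLACK_URL
              value: https://hooks.slack.com/services/XXX/YYY/ZZZ  # Slack incoming webhook
            - name: SLACK_CHANNEL
              value: "#search-deploys"              # hypothetical channel
            - name: GITHUB_URL
              value: https://github.com/example-org/search-kustomize  # hypothetical repo link
```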

[Figure: Slack notification sent via Fluxcloud when Flux applies changes to a cluster]

Switching infrastructure with confidence

Strategy to switch v1 and v2 safely

As described in the background section, we’ve been transitioning our search services from v1 (Ruby/Rails) to v2 (Python/FastAPI). Because search-v1 handles our production workload with reasonable performance and availability, we wanted to ensure that the new search-v2 services can handle the same volume of requests with the same or better performance and availability. To achieve this goal, testing with production workloads without causing cascading outages of search-v1 and other services was crucial. We utilized the following three techniques to test the new search service.

Delivering v1 and v2 APIs using the Strangler Fig pattern

Because search-v1 has many features, we cannot replace the whole service straight away. At Cookpad we love to implement and refactor code incrementally, so we applied the Strangler Fig pattern to the search service replacement. The Strangler Fig pattern is a code refactoring/migration strategy: instead of replicating all of search-v1’s features in search-v2 at once (sometimes called a big-bang replacement), we replaced search-v1 features little by little with search-v2.

[Figure: Strangler Fig pattern, with features migrating incrementally from search-v1 to search-v2]

Placing a circuit-breaking proxy

To keep our entire service available even if search-v2 has problems during development, we also placed circuit breakers in a proxy layer. These circuit breakers monitor HTTP 5xx errors, latency, and network errors for search-v2. If search-v2 breaches the circuit breakers’ thresholds for those metrics, they immediately return HTTP 503. This way we can quickly signal that the search-v2 features are not meeting our service level expectations, so upstream services can stop passing requests to search-v2 until it satisfies those expectations.

In addition, we introduced a fallback mechanism in applications that use search-v2. Because we use the Strangler Fig pattern, the same features exist in both search-v1 and search-v2, so other services can get the same results by hitting search-v1. When errors come from search-v2, we fall back from v2 to v1 and return appropriate results.

[Figure: circuit-breaking proxy with fallback from search-v2 to search-v1]

To implement this proxy and circuit breaker, we use Traefik. We have availability and latency SLOs for the search service, so we generate Traefik configurations and circuit-breaking thresholds based on those SLOs with our in-house tools. The following configuration is an example of the in-house tool’s configuration for Traefik.
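
Because the tool is internal, the snippet below is a purely hypothetical sketch of what such a configuration could contain; every field name and value is invented for illustration:

```yaml
# Hypothetical input for the in-house Traefik config generator.
# Field names and values are invented for illustration only.
service: search
slo:
  availability: 99.0        # max 1% of responses may be HTTP 5xx or network errors
  latency_p50_ms: 100       # median latency budget
targets:
  - name: search-v1
    url: http://search-v1.internal:8080
  - name: search-v2
    url: http://search-v2.internal:8000
    mirror_percent: 10      # share of search-v1 traffic mirrored to search-v2
```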

Testing v2 APIs with traffic mirroring

Before releasing each search-v2 feature, we want to test its functionality, availability, and response time. To make this testing process simple and easy, we decided to mirror a portion of production requests destined for search-v1 to search-v2, using Traefik’s traffic mirroring function. If the mirrored requests cause HTTP 5xx errors, Traefik simply drops the responses, and we can investigate which functionality does not work as expected.

The example configuration shown in the circuit breaker section specifies the mirroring percentage for each target. These configurations are used to generate Traefik configurations via our in-house tools, roughly as sketched below.
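
For reference, the kind of Traefik v2 dynamic configuration such a tool could generate looks roughly like the following. The router rules, service URLs, and thresholds are illustrative, while the `circuitBreaker` and `mirroring` blocks follow Traefik’s actual configuration schema.

```yaml
# Sketch of generated Traefik dynamic configuration (illustrative names and numbers).
http:
  routers:
    search-v2:
      rule: "PathPrefix(`/v2/search`)"     # requests already migrated to search-v2
      service: search-v2
      middlewares:
        - search-v2-circuit-breaker
    search-v1:
      rule: "PathPrefix(`/search`)"        # remaining search-v1 traffic, partially mirrored
      service: search-v1-mirrored
  middlewares:
    search-v2-circuit-breaker:
      circuitBreaker:
        # Open the circuit (return HTTP 503) when error-rate or latency thresholds
        # derived from the SLOs are breached.
        expression: "NetworkErrorRatio() > 0.10 || ResponseCodeRatio(500, 600, 0, 600) > 0.01 || LatencyAtQuantileMS(50.0) > 100"
  services:
    search-v1-mirrored:
      mirroring:
        service: search-v1
        mirrors:
          - name: search-v2
            percent: 10                    # mirrored requests' responses are discarded
    search-v1:
      loadBalancer:
        servers:
          - url: http://search-v1.internal:8080
    search-v2:
      loadBalancer:
        servers:
          - url: http://search-v2.internal:8000
```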

Conclusion

We made technology choices such as Kubernetes, Terraform, Atlantis, and Flux to build the new search infrastructure. They allowed the search team to be effective in developing search features and operating the infrastructure.

Through the project, we gained more confidence that application engineers can adopt new infrastructure tools by collaborating with SRE members. The project was clearly successful: it has been about three months since I left the search team, and they continue to develop and operate their infrastructure without the SRE team’s support.

The success factors of this project also extend to soft skills. I will write up what I think is important for this kind of project (the collaboration model, technical comfort zones, and archetypes for building owned infrastructure) in another article.

Special Thanks!

Thanks to Ed for giving me valuable feedback on this article!

