
CircleCI security alert: Rotate any secrets stored in CircleCI

source link: https://circleci.com/blog/january-4-2023-security-alert/
CircleCI News | Last updated Dec 21, 2022 | 12 min read

An update on CircleCI's reliability

Rob Zuber
Chief Technology Officer

[Image: A hard hat, binoculars, and walkie-talkie await the CircleCI tiger team.]

Reliability update 2022-12-21

In the last couple of updates, I’ve talked about some of the actions we’ve been taking to build longer-term reliability at the core of our service. In this update, I want to describe a bit of how we identified the underlying issues, specifically by orienting around systemic issues, whether they are organizational or architectural.

Like many organizations, at CircleCI we have a “you build it, you run it” approach to software delivery, meaning teams are responsible for the full lifecycle of their own delivery streams. We also have shared practices for incident investigation and follow-up. However, with teams capable of managing this full lifecycle on their own, we started missing more systemic problems, thinking they were only present in the individual teams that were handling the follow-up.

Earlier this year, we turned our attention to those systemic issues by aggregating data from all of our sources. This included looking across post-incident reports, the associated historical data, and all of the follow-up work that had been done by individual teams. Inventorying all the data was important, but what stood out was how much work it took to reach the aggregate view we needed in order to make changes.

Most of our tooling is oriented towards individual incidents rather than that aggregate review. Even the tools that showed data across incidents didn’t expose what we wanted. We looked at everything from where time was being spent in incident response to the classifications of causes so we could organize ourselves and our systems for highest impact.

While we were able to do most of the aggregation in a spreadsheet, much of post-incident follow-up is very narrative driven. Doing the work of structuring that historical data enough to draw conclusions was hard but has been very helpful in seeing the bigger picture. We were able to see things that weren’t clear from reading one incident report or by having a deep dive with one of our teams.
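To make that rollup concrete, here is a minimal sketch in Python of the kind of cross-incident aggregation described above. The record fields and cause categories are hypothetical stand-ins; in practice our post-incident reports are narrative documents and the real structuring work happened in a spreadsheet.

    # Minimal sketch: roll up per-incident records into an aggregate view.
    # Field names and cause categories are illustrative, not CircleCI's real taxonomy.
    from collections import defaultdict
    from statistics import median

    incidents = [
        {"team": "pipelines", "cause": "dependency", "minutes_to_resolve": 95},
        {"team": "ui", "cause": "deploy", "minutes_to_resolve": 40},
        {"team": "pipelines", "cause": "capacity", "minutes_to_resolve": 180},
    ]

    by_cause = defaultdict(list)
    for incident in incidents:
        by_cause[incident["cause"]].append(incident["minutes_to_resolve"])

    # The systemic view: which classes of cause cost the most time overall,
    # regardless of which team handled the follow-up.
    for cause, durations in sorted(by_cause.items(), key=lambda kv: -sum(kv[1])):
        print(f"{cause}: {len(durations)} incidents, "
              f"median {median(durations)} min, total {sum(durations)} min")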

All of this work highlighted an interesting tension of maintaining our fast-moving, stream-aligned DevOps culture while bringing in a vantage point to eliminate system-wide challenges for our teams. This work provided us clearer insights into where we needed to address issues that were found in our organization or our architecture (or both). As a result, we are mitigating problems that are more systemic in nature, and providing guardrails so our teams can still move quickly and own what they build. With this aggregate view, we’re making progress on seeing and addressing points of failure before they occur.

To better building,

Rob Zuber

[email protected]

Reliability update 2022-10-27

Last month, I made a commitment to you that I’d bring you monthly updates on our reliability work: how it’s going, what’s working, and where we still have work to do.

I’d like to achieve three goals through these updates:

  1. Reinforce our ongoing commitment to reliability,
  2. Share more of our reliability roadmap, so you have broader context to any future updates we share on this work, and
  3. Be transparent about our work and insights, so that the community can also benefit from our experiences on this journey.

In the software community, we all suffer when services go down, but we advance together by sharing our learnings.

Last month (scroll down for the full update), I shared that we’d been working on isolating parts of our platform in order to protect customers’ builds and keep them running.

Today, I want to share more detail about our current approach: system isolation to protect your builds no matter what. I want to emphasize that this isn’t simply something we did and reported on; the principle of protecting builds will guide both our upcoming reliability investments and all new development work on the platform.

Like all platforms, we got here through evolution, so let me walk you through an (abridged) history of the CircleCI platform.

We started out with a monolith, like so many other companies. By default, in a monolith all your work is commingled. This necessarily creates coupling, and that leads to the potential for failures to cascade. With no separation, a failure somewhere in your codebase can lead to a failure anywhere else.

As we broke apart the monolith, we did so based on work stages, or what was happening at different points (such as workflow orchestration and job execution). That approach simplified our codebases and made delivery easier, but within those stages, we have a combination of active work and historical reporting on that work.

The work we’re doing now is to isolate at each of these stages, such that every component that is involved in running active builds can be protected from anything else.

We’re doing this work incrementally to ensure rapid results while minimizing disruptions. The first stage involved simple tools, like functions to disable historical viewing if needed. This creates a release valve.
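As an illustration of what a release valve like this could look like, here is a hedged sketch in Python of an operator-controlled flag that hides historical viewing while leaving active builds untouched. The flag store and function names are hypothetical, not CircleCI's actual implementation.

    # Hypothetical sketch of a "release valve": an operator-controlled flag that
    # disables historical viewing so components serving active builds stay protected.
    # The flag store and endpoint shape are illustrative, not CircleCI's actual code.
    FLAGS = {"history_view_enabled": True}

    def set_flag(name: str, value: bool) -> None:
        FLAGS[name] = value

    def load_history(pipeline_id: str) -> list:
        return []  # stand-in for the real historical query

    def get_pipeline_history(pipeline_id: str) -> dict:
        if not FLAGS["history_view_enabled"]:
            # Degrade gracefully: builds keep running, history is temporarily hidden.
            return {"pipeline": pipeline_id, "history": [],
                    "notice": "History temporarily unavailable"}
        return {"pipeline": pipeline_id, "history": load_history(pipeline_id)}

    # During an incident, operators flip the flag instead of shipping code:
    set_flag("history_view_enabled", False)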

We have also increased our use of read-only replicas for historical queries. And we are leveraging split deployments of some of these services to isolate compute resources even when the code is shared.
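As a rough sketch of what routing reads by workload can look like, the Python below steers historical queries to a read-only replica while the primary serves active builds. The connection targets are placeholders (sqlite stands in for a real database driver), not our actual infrastructure.

    # Illustrative routing by workload: the primary serves active builds,
    # a read-only replica serves historical queries.
    import sqlite3

    PRIMARY = "primary.db"   # writes and hot-path reads for running builds
    REPLICA = "replica.db"   # read-only historical queries

    def connect(workload: str) -> sqlite3.Connection:
        """Route by workload: 'active' goes to the primary, anything else to the replica."""
        return sqlite3.connect(PRIMARY if workload == "active" else REPLICA)

    active_conn = connect("active")    # e.g. update a running build's status
    history_conn = connect("history")  # e.g. list a project's past builds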

The next stage that we are moving into involves separating systems completely: the code paths and data for real-time builds versus the code paths and data for history. While replicas are helpful in distributing load, they require that all stores hold the same volume of data. This can be solved with sharding, but even then you are stuck with a schema design that is trying to support both access patterns. When they are fully separate, we can optimize each design, both for scale and for product capabilities. We’re early in that approach, but we’re again taking incremental steps to start realizing gains as quickly as possible.
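To illustrate why full separation helps, here is a hedged example of two stores whose schemas are each optimized for a single access pattern. The DDL below is purely illustrative of the idea, not CircleCI's actual data model.

    # Hot store: small rows, keyed for fast status updates while a build runs.
    ACTIVE_BUILDS_DDL = """
    CREATE TABLE active_builds (
        build_id   TEXT PRIMARY KEY,
        project_id TEXT NOT NULL,
        status     TEXT NOT NULL,        -- queued / running / finishing
        started_at TIMESTAMP NOT NULL
    );
    """

    # History store: denormalized and indexed for reporting over long time ranges.
    BUILD_HISTORY_DDL = """
    CREATE TABLE build_history (
        build_id    TEXT PRIMARY KEY,
        project_id  TEXT NOT NULL,
        branch      TEXT,
        outcome     TEXT NOT NULL,       -- success / failed / canceled
        started_at  TIMESTAMP NOT NULL,
        finished_at TIMESTAMP NOT NULL,
        duration_s  INTEGER NOT NULL
    );
    CREATE INDEX idx_history_project_time ON build_history (project_id, started_at);
    """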

This brings us to a good question: why didn’t we do this at the outset? Well, this is a need that comes with scale, and doing it right out of the gate would have been a mistake. Why? When you first create a platform like ours and make early architecture decisions, I believe it’s incredibly important to make decisions that enable you to pivot quickly and respond to the demands of your early customers. You don’t know what they’ll want, and therefore you don’t know what features your team will go on to build in order to support and delight those customers. It was possible at the beginning to imagine we’d face a project like this eventually, but I don’t think it’s possible to know which of an infinite set of scale tipping points we’d reach.

In all, what I’d like you to take from this update is that we are taking this seriously, and approaching it the way we approach all our work: incrementally, and with your needs at the core of our decisions. While we don’t want systems to break, it happens. Better isolation means we can march toward the real goal: ensuring your builds run, every day, no matter what else may be going on in our platform or the larger ecosystem.

If you’re curious about our incremental steps on this journey, check back here for monthly updates. And if not, that’s fine too; get back to building the things that support and delight your customers, but we wanted you to know what is going on behind the scenes at CircleCI.

To better building,

Rob Zuber

[email protected]

Reliability update 2022-09-19

Last week, the pipelines page was unavailable for a significant portion of a day. This prevented many teams from managing their work as expected. As an engineer and as a leader, I know how important it is to stay in flow, and have your tools there when you need them. We’re sorry for the disruption caused to your team’s work.

As I stated back in April (full post follows), my top priority as CTO is reducing the length and impact of incidents at CircleCI.

But when things look like they did last week, the headway we’ve made may not be apparent.

In addition to focusing on diagnostic speed since our original post, we’ve begun investing in protecting your ability to get work done (namely, run pipelines), even when things break. While our work is still in progress, we’ve made some key gains. But if you can’t see or feel the impact of this work so far, then we’re not succeeding. Not as a technical team, and not at creating the trusting relationship we want to build with you.

One gain worth noting is the work the team has done to begin segregating parts of our architecture. This lets us constrain incident impact and protect your pipelines when things otherwise go sideways. It allowed us to do things like temporarily shut off bits of the UI in order to make sure that pipelines could still run, which is what we did last week. But we didn’t share that with you. Instead, you saw that the site was down, and reasonably assumed that nothing had changed.

Again, we have more work to do here, and we remain deeply invested in it. We can’t stop things from breaking, but we can continue to find new ways to ensure your builds can always run, and give you better and more timely information about how to accomplish your work, even when Plan A fails.

Additionally, it’s been 5 months since our last reliability update, and we can do better. Going forward, I’m committing to updating you on our new reliability developments monthly. I welcome your feedback in the meantime.

Reliability update 2022-04-13

At CircleCI, our mission is to manage change so software teams can innovate faster. But lately, we know that our reliability hasn’t met our customers’ expectations. As the heart of our customers’ delivery pipelines, we know that when we go down, your ability to ship grinds to a halt as well. We’re sorry for the disruptions to your work and apologize for the inconvenience to you and your team.

What’s been happening

No single part of our platform or infrastructure is at fault for recent outages. Instead, we’ve seen a mix of sources: updates that introduced bugs, dependency issues, and upstream provider instability. The January update to our pricing plan brought increased traffic and usage to our platform. While we planned and modeled for this, it has contributed to us reaching inflection points in some of our systems.

While there is no clear pattern in the cause of recent incidents, we know our overall time to resolution has been too long. Diving into our incident response protocol has helped us uncover places where our team execution under pressure has not helped us. We fully embrace blameless engineering culture and the DevOps principle of “you build it, you run it,” but the distributed nature of both our system and our teams has made that connection, communication, and resolution difficult.

Why? Over the past 12 months, we’ve nearly doubled our engineering team. This growth has been intentional and provided some incredible velocity - last week alone we deployed over 850 times. But that growth also means our base of intuitive knowledge has become less central and cohesive. We need to rebuild both broad and deep systems understanding across all of our teams.

What we’re doing to move forward

For us, technology is all about people, and improving our reliability will take a people-first approach. As of last week, we’ve created a tiger team of first responders, including myself, on an on-call rotation. This is a global team of individuals empowered to both fix things quickly and effect long-term change through both process and technology. Our goal is to strengthen the impact of engineers who can drive an incident from identification to resolution and then help share insights with the larger team.

Historically, we’ve focused our reliability efforts on system “hot spots” that were known sources of downtime, including fleet management and machine provisioning. We’ve made deep investments there that have paid off. But as our organization has grown, our issues have been less about service-level disruptions, and more about the complex interactions of a large distributed system. Our goal for this tiger team is to get you back to working as quickly as possible, then use what we learn to resolve the underlying causes of those incidents.

We’re also making investments in our platform to build and rebuild with the future in mind. We recently hired a new chief architect to lead our efforts in platform scalability and building for long-term product innovation.

How you will know we’re making progress

While it would be unwise (and unbelievable!) to promise that we will never have another incident, we can commit to making them less of a burden for our customers.

As we continue to invest in our long-term platform stability, our short-term focus is on reducing incident length. For incidents where customer impact exceeds one hour, we commit to publishing an incident report on status.circleci.com.

As CTO, improving incident response is my top priority. We know we have work to do here, and we’re confident that the plans and team we have in place will help us make immediate improvements. Thank you to our customers and community for your ongoing support and patience.
