Statement regarding the ongoing SourceHut outage

source link: https://outage.sr.ht/
January 12, 2024 by Drew DeVault

Current service availability, subject to DNS propagation delays:

  • meta.sr.ht: read-only
  • git.sr.ht: read-only*
  • todo.sr.ht: read-only
  • lists.sr.ht: read-only

* This is based on a backup restored from a few hours prior to the start of the DDoS attack and is slightly out of date. We are refreshing the backup with the changes which took place since, which affects about 1,000 git repositories and will take some more time to complete. Furthermore, object storage is still being restored, so releases attached to git tags are not currently available.

Update 2024-01-12 16:38 UTC: todo and lists are coming online in read-only mode.

Update 2024-01-12 16:13 UTC: We are beginning to bring some services online. meta and git will become available as DNS propagates to your local nameserver.
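
For readers who want to check whether these names currently resolve from their own machine, here is a minimal sketch in Python. It assumes the hostnames from the availability list above; the names `system_resolver` and `check_hosts` are illustrative and not part of any SourceHut tooling. It only reports what the local resolver returns right now and cannot tell whether an address belongs to the new installation.

```python
import socket
from typing import Callable, Dict, Iterable, List

# Hostnames taken from the availability list above.
HOSTS = ["meta.sr.ht", "git.sr.ht", "todo.sr.ht", "lists.sr.ht"]


def system_resolver(host: str) -> List[str]:
    """Ask the system resolver for a host; return sorted unique addresses."""
    infos = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})


def check_hosts(
    hosts: Iterable[str],
    resolve: Callable[[str], List[str]] = system_resolver,
) -> Dict[str, List[str]]:
    """Map each host to its resolved addresses, or [] if resolution fails."""
    results: Dict[str, List[str]] = {}
    for host in hosts:
        try:
            results[host] = resolve(host)
        except OSError:
            # socket.gaierror is a subclass of OSError.
            results[host] = []
    return results
```

Calling `check_hosts(HOSTS)` returns a dictionary keyed by hostname; an empty list means the name did not resolve through the local nameserver at that moment. The `resolve` parameter exists only so the function can be exercised without network access.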


My name is Drew. I’m the founder of SourceHut and one of three SourceHut staff members working on the outage, alongside my colleagues Simon and Conrad. As you have noticed, SourceHut is down. I offer my deepest apologies for this situation. We have made a name for ourselves for reliability, and this is the most severe and prolonged outage we have ever faced. We spend a lot of time planning to make sure this does not happen, and we failed. We have all hands on deck working the problem to restore service as soon as possible.

In our emergency planning models, we have procedures in place for many kinds of eventualities. What has happened this week is essentially our worst-case scenario: “what if the primary datacenter just disappeared tomorrow?” We ask this question of ourselves seriously, and make serious plans for what we’d do if this were to come to pass, and we are executing those plans now – though we had hoped that we would never have to.

I humbly ask for your patience and support as we deal with a very difficult situation, and, again, I offer my deepest apologies that this situation has come to pass.

What is happening?

At 06:30 UTC on January 10th, two days prior to the time of writing, a distributed denial of service (DDoS) attack began targeting SourceHut. We still do not know many details – we don’t know who the attackers are or why they are targeting us, but we do know that they are targeting SourceHut specifically.

We deal with ordinary DDoS attacks in the normal course of operations, and we are generally able to mitigate them on our end. However, this is not an ordinary DDoS attack; the attacker possesses considerable resources and is operating at a scale beyond that which we have the means to mitigate ourselves. In response, before we could do much ourselves to understand or mitigate the problem, our upstream network provider null routed SourceHut entirely, rendering both the internet at large, and SourceHut staff, unable to reach our servers.

The primary datacenter, PHL, was affected by this problem. We rent colocation space from our PHL supplier, where we have our own servers installed. We purchase networking through our provider, who allocates us a block out of their AS, and who upstreams with Cogent, which is the upstream that ultimately black holed us. Unfortunately, our colocation provider went through two acquisitions in the past year, and we failed to notice that our account had been forgotten as they migrated between ticketing systems through one of these acquisitions. Thus unable to page them, we were initially forced to wait until their normal office hours began to contact them, 7 hours after the start of the incident.

When we did get them on the phone, our access to support ticketing was restored, they apologised profusely for the mistake, and we were able to work with them on restoring service and addressing the problems we were facing. This led to SourceHut’s availability being partially restored on the evening of January 10th, until the DDoS escalated in the early hours of January 11th, after which point our provider was forced to null route us again.

We have seen some collateral damage as well. You may have noticed that Hacker News was down on January 10th; we believe that was ultimately due to Cogent’s heavy-handed approach to mitigating the DDoS targeting SourceHut (sorry, HN, glad you got it sorted). Last night, a non-profit free software forge, Codeberg, also became subject to a DDoS, which is still ongoing and may be caused by the same actors. This caused our status page to go offline, as Codeberg has been kind enough to host it for us so that it’s reachable during an outage. We’re not sure if Codeberg was targeted because they hosted our status page or if this is part of a broader attack on free software forge platforms.

What are we doing about it?

We maintain three sites, PHL, FRE, and AMS. PHL is our primary and is offline, FRE is our backup site, and AMS is a research installation we eventually hoped to use to migrate our platform to European hosting. As we initially had no access whatsoever to PHL, we began restoring from backups to AMS to set up a parallel installation of SourceHut from scratch.

We have since received some assistance from our PHL provider in regaining access to our PHL servers out of band, which is speeding up affairs, but we do not expect to get PHL online soon and we are proceeding with the AMS installation for now.

The prognosis on user data loss is good. Our backups are working and regularly tested. The last full backup of git and hg was taken a few hours before the DDoS began, and we have out-of-band access to the live PHL servers, where all changes which occurred since the most recent backup are safely preserved. The database is replicated in real-time and was only seconds behind production before it went offline.

We have replicated the production database in AMS and started spinning up SourceHut services there: we have meta, todo, lists, paste, and the project hub fully operational against production data in our staging environment here. We are still working on the following services in order of priority:

  1. git.sr.ht
  2. hg.sr.ht
  3. pages.sr.ht
  4. chat.sr.ht
  5. man.sr.ht
  6. builds.sr.ht

These services, particularly git and hg, require large transfers of data across our networks to restore from backups, and will take some time. Chat does not require particularly large amounts of data to be managed, but has special networking concerns that we are addressing as well.

Our goal is to enable read-only access for the community as quickly as possible, then work on full read/write access following that. Object storage (used for git/hg releases, build artifacts, and SourceHut pages) presents a special set of problems; we are working on those separately. Finding suitable compute to run build jobs is another issue which requires special attention, but we have a plan for this as well.

One of our main concerns right now is finding a way of getting back online on a new network without the DDoS immediately following us there, which we have reason to believe it will. A layer 3 DDoS like the one we are facing is complex and expensive to mitigate. We spoke to Cloudflare and were quoted a price beyond our financial means, but we are investigating other solutions which may be more affordable and have a few avenues for research today, though we cannot disclose too many details without risking alerting the attackers to our plans.

How you can help

What we need the most right now is your patience and understanding. Mitigating this sort of attack is a marathon, not a sprint, and we have to take care not to overwork our staff, to make sure we’re getting enough sleep, and so on. We are working as hard as we can. There are many people hard at work on this problem for you – I’d like to thank Simon and Conrad in particular for their work, as well as the datacenter and network operators upstream of us who are doing their best as well.

You can receive updates on this page, so long as we’re able to keep it online (low priority), as well as on Mastodon. Mastodon is also a good place to share your words of support and encouragement, as is the #sr.ht IRC channel on Libera Chat. My inbox at [email protected] is also working (not without some effort, I’ll add), if you wish to send your support or offer any resources that might help.

Thank you for your patience and support. We are working to make things right with you.

