Reading material for Operations & Datacenter engineers and managers

Check out these projects, papers and blog posts if you're working on geo-redundant datacenters, or even if you only need to have your software hosted in one. It's good to know what you're in for.

  I collected these for a colleague. They have been super useful over
  the past 15+ years and will most likely help and/or entertain you.
  May be extended in the future.
  -- azet (@azet.org)

DNS geo & anycast:

https://dnsdist.org

https://yetiops.net/posts/anycast-bgp/

https://blog.cloudflare.com/a-brief-anycast-primer/ & https://www.cloudflare.com/en-gb/learning/dns/what-is-anycast-dns/

load balancing:

tcp/udp at the edge

Good general overview before you dive into any particular project: https://blog.cloudflare.com/unimog-cloudflares-edge-load-balancer/

open source load balancers / designs & projects (f5 is nothing compared to most of these):

https://traefik.io/traefik/ - https://github.com/traefik/traefik

https://github.blog/2016-09-22-introducing-glb/

https://engineering.fb.com/2018/05/22/open-source/open-sourcing-katran-a-scalable-network-load-balancer/

https://github.com/google/seesaw + https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/44824.pdf

dynamic route-based fail-over, load balancing & load sharing at the border

https://vincent.bernat.ch/en/blog/2018-multi-tier-loadbalancer

https://blog.cloudflare.com/cloudflares-architecture-eliminating-single-p/

https://vincent.bernat.ch/en/blog/2013-exabgp-highavailability - https://github.com/Exa-Networks/exabgp

https://youtu.be/jJTqGFs4LNo?si=CnjAJChNNSA45MaW

https://metallb.universe.tf/concepts/bgp/

IPv4 "mobility" & exhaustion:

https://blog.cloudflare.com/cloudflare-servers-dont-own-ips-anymore/

Datacenter level considerations:

Introductory: https://www.youtube.com/watch?v=KAFDI8j_h00
[watch me! A very informative and amusing talk from LISA2014 on Facebook/Meta switching from classical data-center use to the Open Compute Project. It was given a few years back, right when the first wave of big-impact changes started to hit. It's very interesting to see how much has changed since, but even more so how much changed between the speaker starting at the company and giving the talk. The speaker has previous experience running Microsoft and Telco hyperscale datacenters]

Compliance

https://www.youtube.com/watch?v=Ow9s2c7zYXc

Rack/Power considerations

Thermals: Cooling (HVAC) - means energy and heat transfer

Power Usage Effectiveness

PUE stands for Power Usage Effectiveness. It's a value that indicates how energy-efficiently an entire datacenter was built and is operated. As progress goes on it's not uncommon to see hyperscale datacenters or HPC sites with PUE values close to 1, which means almost no energy is spent on anything other than the IT equipment itself.

The calculation of power usage effectiveness is: PUE = total facility energy / IT equipment energy.
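As a minimal sketch of that ratio, the snippet below computes PUE from two energy readings taken over the same period. The function names (`pue`, `overhead_fraction`) and the kWh figures are made up for illustration and do not come from any of the linked sources.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    if total_facility_kwh < it_equipment_kwh:
        raise ValueError("facility total cannot be smaller than the IT load")
    return total_facility_kwh / it_equipment_kwh


def overhead_fraction(pue_value: float) -> float:
    """Share of total facility energy that does not reach IT equipment."""
    return 1.0 - 1.0 / pue_value


# Hypothetical monthly readings: 1200 MWh for the whole site, 1000 MWh for IT gear.
site_pue = pue(1_200_000, 1_000_000)
print(f"PUE: {site_pue:.2f}")
print(f"Overhead share: {overhead_fraction(site_pue):.1%}")
```

A value of exactly 1.0 would mean every kWh entering the facility reaches IT equipment. Cooling, power conversion losses and lighting push the number above that.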

https://www.datacenterknowledge.com/sustainability/what-data-center-pue-defining-power-usage-effectiveness

https://www.youtube.com/watch?v=5KM4Imy9LN8

https://www.youtube.com/watch?v=_Mmoh5cZx9s

https://sustainability.fb.com/data-centers/

[It has become essential to think about wasted energy, such as over-cooled DC rooms. Today many run cold aisles at 30-40°C instead of 20-25°C as in the past. This makes a huge impact on energy cost and efficiency, and it does actually lower power consumption if you have 80-95% efficient PSUs, fans and all the other kinds of equipment you see everywhere in a datacenter, over and over again, running continuously. Cooling in datacenters is more than just hot and cold aisles: it's becoming more and more frequent for bigger datacenter providers and cloud companies to use cold-air intake. If you have a DC in Norway, that's really good in winter! Others use geothermal energy or waterways to cool. In past HPC engagements I've seen bigger sites use the waste heat from their TOP100 machine to heat the adjacent buildings on campus. It's a concept known as "green computing", but I never liked the term because for the most part cloud providers like AWS will do ANYTHING to save $$$: "if it's good, that's a nice thing to market; if it's bad, well, let's just see that it doesn't become a legal issue". But credit where credit is due, most of these really smart energy efficiency schemes and green computing initiatives are run by engineers that do actually care]

DCF Security & Logging

Silent Data Corruption / HW Faults at scale

Network and Interconnect / Peering / Border & Edge

https://engineering.fb.com/2014/11/14/production-engineering/introducing-data-center-fabric-the-next-generation-facebook-data-center-network/
[Spine/Leaf Topologies, now de-facto standard among all major enterprise industry vendors]

Scaling L3 Switching to a unified Open Compute Standard on 100, 200, 400, 800G:

Running a globally-distributed Network, WAN and data-center interconnects at planetary scale:

Net.: Resilience, Congestion Control, QoS/CoS:

Net.: Topologies:

Net.: Change Management / Life Cycle / Zero Trust

Hardware Security Modules + Key Storage

Storage Key Management

https://www.youtube.com/watch?v=ZoTg-wwZ6Yw

"Planet Scale"

https://research.google/pubs/spanner-googles-globally-distributed-database/

https://research.google/pubs/large-scale-cluster-management-at-google-with-borg/

Time services & issues: the infamous leap second (don't forget about the leap day)

[Support fault-tolerant, highly-available NTP and PTP. Realistically you need to support both if you're running more than a few rented racks. Many routers, switches and telco gear (4G, 5G, ...) require PTP; NTP is for servers and VMs]

https://developers.google.com/time/smear
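A rough sketch of the linear smear idea described at the link above: instead of inserting the leap second at once, it is spread evenly over a 24-hour window running from noon UTC before the leap to noon UTC after it. The function name `smear_offset_seconds` is hypothetical. Python's `datetime` does not model leap seconds itself, so this only approximates the shape of the ramp and is not how any particular NTP server implements it.

```python
from datetime import datetime, timedelta, timezone


def smear_offset_seconds(now: datetime, leap_boundary: datetime,
                         leap: float = 1.0) -> float:
    """Portion of the leap second already absorbed at `now`.

    `leap_boundary` is the UTC midnight at which the leap second occurs.
    The ramp is linear over an assumed window of 12 hours on either side.
    """
    start = leap_boundary - timedelta(hours=12)
    end = leap_boundary + timedelta(hours=12)
    if now <= start:
        return 0.0
    if now >= end:
        return leap
    # timedelta / timedelta yields a plain float between 0 and 1.
    return leap * ((now - start) / (end - start))


# The leap second at the end of 2016, i.e. the boundary 2017-01-01 00:00 UTC.
boundary = datetime(2017, 1, 1, tzinfo=timezone.utc)
for hours in (-18, -12, -6, 0, 6, 12):
    t = boundary + timedelta(hours=hours)
    print(t.isoformat(), f"{smear_offset_seconds(t, boundary):.3f}s")
```

The point of the ramp is that no clock ever sees a repeated or 61st second. The trade-off is that smeared and non-smeared time sources disagree by up to half a second during the window, so they should not be mixed.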

Post mortems of outages you can learn from:

(most importantly: write a good post mortem in the first place. Provide a concise timeline, response times, things that went well & things that didn't. Provide a central place for customers and affected parties to call in, i.e. a "war room", so your engineers can do their work without their mobiles ringing every 20 seconds)

https://github.com/aphyr/partitions-post/blob/master/README.markdown
(this is a true gem that was passed around among SRE & software engineers some years back. partitions do exist.)

https://github.com/danluu/post-mortems
