Did Rust cause the Nov 18th 2025 Cloudflare outage?

Summary

On November 18th, 2025, a Cloudflare outage took down OpenAI, Anthropic, and an estimated 20% of the internet. The incident was traced to a misused .unwrap() call in Rust code that triggered a panic in production. This immediately ignited debate across programming language communities: was this a failure of language design, developer error, or code review processes? Different communities drew vastly different conclusions about what the outage revealed regarding the adoption of newer languages for critical infrastructure.

Contributions & Discussions

The Cloudflare Blog

Cloudflare outage on November 18, 2025

Primary source: official post-mortem. Cloudflare's CEO provides a detailed technical analysis attributing the outage to a database permissions change that generated oversized configuration files, which tripped a hardcoded limit check and caused an .unwrap() panic in FL2 (Rust) and silent bot-score failures in FL (C). The document emphasizes multiple system-level failures, including insufficient input validation, missing kill switches, and inadequate error handling across both codebases. It commits to treating internal configuration files with the same validation rigor as user input.
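
To make the validation commitment concrete, here is a minimal Rust sketch of treating an internally generated file like untrusted input. It is not taken from the post-mortem and is not Cloudflare's code: `MAX_FEATURES`, `FeatureSet`, `ConfigError`, `parse_features`, and `reload` are hypothetical names, and the limit value is purely illustrative. The assumption is that an oversized file should produce an error the caller can handle, here by keeping the last known-good configuration, rather than a panic.

```rust
use std::fmt;

/// Hypothetical hard limit; the value is illustrative only.
const MAX_FEATURES: usize = 4;

#[derive(Debug, Clone, PartialEq)]
pub struct FeatureSet {
    pub names: Vec<String>,
}

#[derive(Debug, PartialEq)]
pub enum ConfigError {
    TooManyFeatures { found: usize, limit: usize },
    Empty,
}

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ConfigError::TooManyFeatures { found, limit } => {
                write!(f, "feature file has {} entries, limit is {}", found, limit)
            }
            ConfigError::Empty => write!(f, "feature file is empty"),
        }
    }
}

impl std::error::Error for ConfigError {}

/// Parse one feature name per line, validating the file like untrusted input.
pub fn parse_features(raw: &str) -> Result<FeatureSet, ConfigError> {
    let names: Vec<String> = raw
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(String::from)
        .collect();

    if names.is_empty() {
        return Err(ConfigError::Empty);
    }
    if names.len() > MAX_FEATURES {
        return Err(ConfigError::TooManyFeatures {
            found: names.len(),
            limit: MAX_FEATURES,
        });
    }
    Ok(FeatureSet { names })
}

/// Keep serving with the last known-good set when a new file is rejected.
pub fn reload(current: FeatureSet, raw: &str) -> FeatureSet {
    match parse_features(raw) {
        Ok(new_set) => new_set,
        Err(err) => {
            eprintln!("rejecting new feature file: {}", err);
            current
        }
    }
}
```

Calling `reload` with a file that exceeds the limit logs the rejection and returns the previous `FeatureSet` unchanged. Whether falling back is the right policy depends on the system; the sketch only shows that the choice is available once the error is returned instead of unwrapped.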

Matthew Prince, Co-founder & CEO of Cloudflare
View source
Hacker News

Cloudflare outage on November 18, 2025 post mortem

Hacker News community analysis (916 comments) reveals a deep schism between language-level attribution and systems-level analysis. The discussion surfaces three competing narratives: (1) Rust critics use the incident to challenge language safety claims, noting that .unwrap() creates panic-based failure modes similar to null pointer exceptions; (2) systems engineers argue the root cause was operational (rapid global config deployment without adequate rollback or observability) rather than code quality; (3) multiple-failure advocates cite the "Swiss cheese model", identifying breakdowns in input validation, testing, database query design, and monitoring.

Notable technical details include an initial misdiagnosis as a DDoS attack that delayed the response by 2 hours, and the distinction that customers without Bot Management enabled experienced different failure modes (false positive bot scores vs. 5xx errors). The Rust community responds by defending .unwrap() as a legitimate tool when properly used, pointing to available clippy lints and arguing for organizational standards rather than language changes.

The thread demonstrates how the same incident generates conflicting lessons depending on the lens applied: language advocates see validation or refutation of design choices, reliability engineers see process gaps, and scale-focused engineers emphasize the consequences of rapid propagation without circuit breakers. Cloudflare executives' direct participation and detailed technical disclosure notably steer the community's tone toward constructive analysis over blame.

eastdakota aka Matthew Prince, CEO of Cloudflare
View source
r/programming subreddit

Cloudflare outage on November 18, 2025 - official response

A multi-language programming community analyzes the incident through an engineering process lens. Participants debate whether .unwrap() represented a legitimate "fail fast" design for invalid configuration or a preventable error, with Rust-experienced commenters recommending .expect() with descriptive messages and linter rules (#![deny(clippy::unwrap_used)]). The discussion identifies systemic failures beyond the panic: configuration validation gaps, missing canary deployments, and a 2-hour TTR despite a recent database change. The thread contextualizes the incident within broader infrastructure fragility (concurrent AWS, GitHub, Azure, 1Password outages) and praises Cloudflare's transparent post-mortem culture. Notable for distinguishing language capabilities from developer discipline and for avoiding tribal positioning: commenters familiar with Rust's error-handling model critique the specific implementation while acknowledging the trade-off between crashing on invalid state and graceful degradation with potential data corruption.
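
The two recommendations from the thread can be sketched together as follows. This is an illustration only: the `limits` module, `parse_limit`, `default_limit`, and the value "200" are hypothetical. The sketch uses the module-scoped form of the lint so that it can be pasted anywhere; the crate-wide form quoted in the thread, `#![deny(clippy::unwrap_used)]`, belongs at the top of `main.rs` or `lib.rs`. Either form only has an effect when the code is checked with `cargo clippy`.

```rust
// Module-scoped form of the lint quoted in the thread. Under `cargo clippy`,
// any `.unwrap()` inside this module is reported as an error.
#[deny(clippy::unwrap_used)]
pub mod limits {
    use std::num::ParseIntError;

    /// Untrusted input: return the error so the caller decides what to do.
    pub fn parse_limit(raw: &str) -> Result<usize, ParseIntError> {
        raw.trim().parse::<usize>()
    }

    /// A value the program itself controls: `expect` documents why a failure
    /// here would be a bug, and the message appears in the panic output.
    pub fn default_limit() -> usize {
        "200"
            .parse::<usize>()
            .expect("hard-coded default limit must parse as usize")
    }
}
```

Note that `.expect()` still panics; it is covered by a separate lint, `clippy::expect_used`, for teams that want to forbid both.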

Reddit user
View source
r/rust subreddit

Cloudflare outage on November 18, 2025 - Caused by single .unwrap()

The Rust community's discussion on r/rust immediately focused on the full failure chain rather than the .unwrap() itself. Key themes:

  • the database query returning duplicates was the root cause;
  • the panic prevented silent corruption that the previous C system exhibited;
  • the real failures were deployment processes (no canary, delayed alerting);
  • .unwrap() has legitimate uses for invariants.

Notable pushback against 'ban all unwraps' narratives, with emphasis on proper tooling (clippy lints) and development practices. The community positioned this as 'down vs. catastrophic security hole', referencing 2017's Cloudbleed for context.
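
As an illustration of the "legitimate uses for invariants" argument, the following sketch shows an `.unwrap()` guarding a condition that the surrounding code itself guarantees, as opposed to a condition that depends on external data. `NonEmpty` is a hypothetical type invented for this example.

```rust
/// A list that is non-empty by construction (hypothetical type).
pub struct NonEmpty {
    items: Vec<u32>,
}

impl NonEmpty {
    /// The only constructor: external data is checked here, and an empty
    /// list is rejected with `None` instead of a panic.
    pub fn new(items: Vec<u32>) -> Option<Self> {
        if items.is_empty() {
            None
        } else {
            Some(NonEmpty { items })
        }
    }

    /// `items` is private and never shrinks after construction, so a missing
    /// maximum would indicate a bug in this module, not bad input.
    pub fn max(&self) -> u32 {
        *self.items.iter().max().unwrap()
    }
}
```

The boundary check lives in `new`, where bad input is an expected case; the `.unwrap()` in `max` can only fire if the module breaks its own guarantee.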

Reddit user
View source
X (formerly Twitter)

This would never happen in Erlang/Elixir.

Not because they’re better languages.

But the philosophy of those languages is explicitly to assume things will fail - for all sorts of unknown reasons.

Assuming things will never fail will bite you in the end.


Sam Aaron (Sonic Pi creator) reframes debate away from language features toward design philosophy. Argues Erlang/Elixir's 'assume everything fails' mindset prevents this class of error regardless of type system strength. Notable for avoiding language superiority claims while highlighting philosophical differences. Thread spawns substantive discussion about exceptions vs Results, testing culture (Clojure), and whether OTP supervision actually solves this problem.
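
For readers unfamiliar with the supervision idea, here is a very rough Rust analogue of "assume things will fail". It is not OTP and has no restart strategies; `score` and `supervised_score` are hypothetical names, and the sketch assumes the default unwinding panic behavior (it would not work in a build configured with `panic = "abort"`). The worker runs on its own thread, and a panic there is treated as an ordinary outcome the caller handles.

```rust
use std::thread;

/// Hypothetical worker: panics on input it considers impossible.
fn score(input: u32) -> u32 {
    if input > 100 {
        panic!("input {} exceeds the expected range", input);
    }
    input * 2
}

/// Run the worker on its own thread and treat a panic as an expected
/// outcome: the caller receives `fallback` instead of crashing.
pub fn supervised_score(input: u32, fallback: u32) -> u32 {
    match thread::spawn(move || score(input)).join() {
        // The worker finished normally.
        Ok(value) => value,
        // The worker panicked; `join` reports that as an `Err`.
        Err(_) => {
            eprintln!("worker crashed; using fallback score {}", fallback);
            fallback
        }
    }
}
```

A supervisor that simply retried with the same input would fail again, so the sketch substitutes a fallback value instead of restarting the worker.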

Sam Aaron
View source
Ada Forum

Rust took out cloudflare

Ada community discussion comparing error handling approaches: participants debate whether SPARK formal verification could have prevented the incident, discuss panic vs. exception recovery models, and acknowledge that while bad code transcends language choice, different type systems and runtime models create distinct failure modes. Multiple contributors note Rust's approach forces explicit handling but makes recovery difficult, while Ada's exceptions enable graceful degradation but lack signature visibility. Thread evolves into broader architectural discussion of fault tolerance patterns (Erlang supervision) and formal methods.

Lucretia, Ada Forum user
View source
LinkedIn

LinkedIn discussion thread initiated by John De Goes examining whether the Cloudflare outage represents a Rust-specific failure or a universal developer trust problem. Thread evolves beyond language wars into substantive debate on type system limitations, formal verification approaches, and organizational practices. Notable for cross-community participation (Rust, Scala, Haskell, functional programming practitioners) and references to formal methods literature. Key tension: static types as 'first level debugging' vs. Rich Hickey's observation that 'most bugs pass the type checker.' Multiple commenters identify simple preventative measures (clippy lints) that weren't applied, shifting blame to process failures rather than language design.

John De Goes, creator of ZIO
View source
The Register

Cloudflare broke itself – and a big chunk of the Internet – with a bad database query

General tech practitioner community emphasizes process failures over language choice. Discussion reveals widespread experience with similar testing gaps and change control failures. Notable for cross-organizational war stories, centralization concerns, and pragmatic 'any language can fail' consensus that pushes back against both Rust advocacy and Rust criticism. Highlights gap between language safety guarantees and operational deployment practices.

Forum user
View source
LinkedIn

unwrap() in Rust: Cloudflare’s 2025 Outage: The First Billion-Dollar Rust Panic

Written one day after outage affecting millions. Argues unwrap() represents systemic infrastructure risk despite Rust's memory safety guarantees, implicitly challenging White House guidance that treated memory-safe languages as outage prevention. Acknowledges C code produced initial bug but frames Rust's panic-on-error as architectural failure: 'Rust didn't fail. A human bypassed Rust's safety contracts.' Demonstrates tension between correctness (Rust's goal) and availability (operations' goal) in polyglot production systems. Predicts industry shift toward panic-free Rust with mandatory linting.

David Kiarie Macharia
View source
Hackaday

How One Uncaught Rust Exception Took Out Cloudflare

Hackaday's analysis frames the incident as a Rust language failure, arguing that 'just because you're writing in a shiny new language that never misses an opportunity to crow about how memory safe it is, doesn't mean that you can skip due diligence on input validation.' The article questions whether the FL2 rewrite was justified, suggesting Cloudflare should return to the 'clearly bulletproof FL proxy.'

The comment section (74 comments at the time of writing this summary) reveals sharp disagreement with this framing. Multiple commenters note that the old C proxy also failed on the same input - but silently generated incorrect bot scores instead of crashing. Others point out that .unwrap() represents an explicit developer choice to panic rather than a language deficiency. The discussion evolves into broader debates about rewrite justification, the value of 'fail fast' vs. 'fail silent' error handling, and whether language choice matters compared to development practices.

Maya Posch
View source
Substack: Low Latency Trading Insights

The Rust Community Knew This Was Coming

Excavates 2018-2019 Rust Internals forum discussions showing developers explicitly warned that marketing terminology ("memory safety," "fearless concurrency") would create false expectations. Documents proposed alternatives that were rejected. Argues Cloudflare's reputational damage was predictable consequence of known gap between technical precision and marketing claims. Introduces "marketing debt" framing. Primary source documentation distinguishes this from real-time reactions.

Henrique Bucher
View source
Ziggit

Cloudflare and `.?`

The Zig community forum examined how their .? operator (equivalent to orelse unreachable) compares to Rust's .unwrap(), ultimately shifting focus to architectural questions about when subsystems should panic versus return errors. The discussion distinguished between microservice 'fail fast' wisdom and performance-critical monoliths where subsystem isolation matters, while several participants noted that Cloudflare's actual failure was process-related (lack of staged rollouts) rather than language-related. Notable technical contribution: demonstration of Zig's custom panic handlers that can disable specific panic sources at compile time.

Ziggit user
View source
LinkedIn

Italian LinkedIn engineering discussion uses the Cloudflare incident as evidence against adopting "trendy" languages (Rust, Elixir, Zig) in enterprise contexts. Thread argues most applications are I/O-bound and should prioritize mature ecosystems (.NET, Java Spring) for business reasons: easier hiring, established libraries, supply chain security. Discussion reveals the incident being weaponized in broader language adoption debates disconnected from technical root causes. Notable counter-argument: developer choice and containerization make language selection less critical than organizational mandates suggest.

Gabriele Santomaggio, Distributed Systems expert | RabbitMQ Team Member
View source