PJFP.com

Pursuit of Joy, Fulfillment, and Purpose


  • Cloudflare Down November 18 2025: Massive Global Outage Takes X (Twitter), ChatGPT, Discord, Spotify, League of Legends & Thousands of Websites Offline

    FINAL UPDATE – Post-Mortem Released: Cloudflare has released the detailed post-mortem for the November 18 event. The outage was caused by an internal software error triggered by a database permission change, not a cyberattack. Below is the technical breakdown of exactly what went wrong.


    TL;DR – The Summary

    • Start Time: 11:20 UTC – Significant traffic delivery failures began immediately following a database update.
    • The Root Cause: A permission change to a ClickHouse database caused a “feature file” (used for Bot Management) to double in size due to duplicate rows.
    • The Failure: The file grew beyond a hard-coded limit (200 features) in the new “FL2” proxy engine, causing the Rust-based code to crash (panic).
    • Resolution: 17:06 UTC – All systems fully restored (main traffic had recovered by 14:30 UTC).

    The Technical Details: A “Panic” in the Proxy

    The outage was a classic “cascading failure” scenario. Here is the simplified chain of events from the report:

    • The Trigger (11:05 UTC): Engineers applied a permission change to a ClickHouse database cluster to improve security. This inadvertently caused a query to return duplicate rows.
    • The Bloat: The bad data flowed into a configuration file used by the Bot Management system, pushing it well past its expected size.
    • The Crash: Cloudflare’s proxy software (specifically the FL2 engine, written in Rust) preallocates memory for at most 200 features. When the bloated file exceeded that limit, the code panicked (with the classic Rust error “called Result::unwrap() on an Err value”), causing the service to fail with HTTP 500 errors.
    • The Confusion: To make matters worse, Cloudflare’s external Status Page went down at the same time (returning 504 Gateway Timeout errors) by sheer coincidence, leading engineers to initially suspect a massive coordinated cyberattack.
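
    The failure mode in the chain above can be sketched in a few lines of Rust. This is an illustrative reconstruction only — the constant, function names, and logic below are hypothetical, not Cloudflare's actual FL2 code — but it shows how a hard preallocation limit plus an unhandled `Err` turns a bad config file into a process-wide panic, and how matching on the `Result` instead keeps the process alive:

    ```rust
    // Hypothetical sketch of the described failure mode; not Cloudflare code.
    const MAX_FEATURES: usize = 200; // hard-coded preallocation limit

    fn load_features(rows: &[String]) -> Result<Vec<String>, String> {
        if rows.len() > MAX_FEATURES {
            return Err(format!(
                "feature file too large: {} rows > {} limit",
                rows.len(),
                MAX_FEATURES
            ));
        }
        Ok(rows.to_vec())
    }

    fn main() {
        let unique: Vec<String> = (0..150).map(|i| format!("feature_{i}")).collect();
        // The permission change made the source query return each row twice,
        // doubling the file past the 200-feature limit.
        let duplicated: Vec<String> = unique
            .iter()
            .flat_map(|f| [f.clone(), f.clone()])
            .collect();

        // The failing code path effectively did `load_features(...).unwrap()`,
        // which panics with: called `Result::unwrap()` on an `Err` value.
        // Handling the Err keeps the proxy alive on the last good config:
        match load_features(&duplicated) {
            Ok(f) => println!("loaded {} features", f.len()),
            Err(e) => println!("rejected bad config, kept last good file: {e}"),
        }
    }
    ```

    The key design point is that `unwrap()` converts a recoverable configuration error into a crash of the whole request path — exactly the distinction the post-mortem draws.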

    Official Timeline (UTC)

    Time (UTC) Status Event Description
    17:06 Resolved All services restored. Remaining long-tail services restarted and full operations resumed.
    14:30 Remediating Main impact resolved. A known-good configuration file was manually deployed; core traffic began flowing normally.
    13:37 Identified Engineers identified the Bot Management feature file as the trigger and stopped automatic propagation of the bad file.
    13:05 Mitigating A bypass was implemented for Workers KV and Access to route around the failing proxy engine, reducing error rates.
    11:20 Outage starts The network began experiencing significant failures to deliver core traffic.
    11:05 Trigger Database access control change deployed.

    Final Thoughts

    Cloudflare’s CEO Matthew Prince was direct in the post-mortem: “We know we let you down today.” The company has identified the specific code path that failed and is implementing “global kill switches” for features to prevent a single configuration file from taking down the network in the future.
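
    One way to picture the “global kill switch” idea is a flag that gates an optional subsystem, so operators can disable it network-wide without shipping code. The sketch below is a hypothetical illustration in Rust (the names `BOT_MANAGEMENT_ENABLED` and `bot_score` are invented for this example, not Cloudflare's API): when the switch is off, the proxy simply serves traffic without bot scores instead of failing.

    ```rust
    // Hypothetical sketch of a per-feature global kill switch; not Cloudflare code.
    use std::sync::atomic::{AtomicBool, Ordering};

    static BOT_MANAGEMENT_ENABLED: AtomicBool = AtomicBool::new(true);

    // Returns a bot score for a request, or None when the feature is killed,
    // so requests degrade gracefully instead of erroring with HTTP 500s.
    fn bot_score(path: &str) -> Option<u8> {
        if !BOT_MANAGEMENT_ENABLED.load(Ordering::Relaxed) {
            return None;
        }
        Some((path.len() % 100) as u8) // placeholder scoring logic
    }

    fn main() {
        assert!(bot_score("/index.html").is_some());
        // An operator flips the switch when the feature's config goes bad:
        BOT_MANAGEMENT_ENABLED.store(false, Ordering::Relaxed);
        assert!(bot_score("/index.html").is_none());
        println!("traffic still flows with Bot Management disabled");
    }
    ```

    The design choice this illustrates: optional features should fail open (degrade) rather than fail closed (crash the proxy), which is the lesson Cloudflare draws from this incident.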

    Read the full technical post-mortem: Cloudflare Blog: 18 November 2025 Outage