November 18, 2025 will go down in history as the day Cloudflare, one of the fundamental pillars of Internet infrastructure, suffered its worst outage since 2019. According to the company's official statement, an internal technical issue caused the Bot Management system to generate a configuration file with duplicated features that exceeded the system limit, paralyzing much of the network for approximately 6 hours (11:20 - 17:06 UTC). From social networks to productivity tools, the global dependence on this infrastructure was exposed in the most dramatic way possible.
What is Cloudflare and why is its outage so significant?
Cloudflare is a company that offers security and performance services for websites, acting as an intermediary between servers and end users. Its content delivery network (CDN) and DDoS attack protection make it an essential component for millions of sites worldwide. According to official data, Cloudflare manages approximately 20% of global web traffic, protecting and accelerating millions of websites, applications and services.
When Cloudflare experiences problems, the impact is felt globally. It's not just another service: it's the backbone of much of the modern Internet. Its infrastructure is designed to be resilient, but even the most robust systems can fail under certain extraordinary circumstances.
Detailed timeline of the outage
Incident timeline (according to Cloudflare's official statement)
- 11:05 UTC - Database access control change deployed. Normal state.
- 11:20 UTC - Impact begins. Deployment reaches customer environments, first errors observed in customer HTTP traffic.
- 11:32 - 13:05 UTC - Team investigates elevated traffic levels and errors in Workers KV service. Mitigations implemented such as traffic manipulation and account limiting.
- 13:05 UTC - Bypass implemented for Workers KV and Cloudflare Access, reducing impact.
- 13:37 UTC - Work focuses on reverting Bot Management configuration file to a known good version.
- 14:24 UTC - Creation and propagation of new Bot Management configuration files stopped. Bot Management module identified as source of 500 errors.
- 14:30 UTC - Main impact resolved. Correct Bot Management configuration file deployed globally and most services begin operating correctly.
- 17:06 UTC - Incident fully resolved. All downstream services restarted and operations completely restored.
Affected platforms: the domino effect
Cloudflare's outage had a domino effect on multiple critical services. The affected platforms, ranging from social networks to productivity tools, demonstrate the global dependence on this infrastructure.
Even Downdetector, the site specialized in monitoring service outages, was affected due to its own dependence on Cloudflare. This created an ironic situation where the tool designed to detect problems was experiencing the same problems it was trying to monitor.
What really happened? The official technical explanation
According to Cloudflare's official statement, the problem was NOT caused by a cyberattack or malicious activity. In reality, it was a much more specific internal technical issue:
🔍 Root cause of the incident
The problem was triggered by a change in permissions of one of the database systems (ClickHouse) that caused the database to generate multiple duplicate entries in a "features file" used by Cloudflare's Bot Management system.
This features file doubled in size and was propagated to all machines on the network. The software that reads the file enforces a limit on the number of features it will accept; the duplicated file exceeded that limit, causing the software to fail and return HTTP 500 errors.
The technical problem in detail
Cloudflare's Bot Management system uses a machine learning model that requires a "features" configuration file. This file is refreshed every few minutes and distributed across the network so the model stays current against new bot threats.
The problem began when a change in ClickHouse database access control (deployed at 11:05 UTC) caused a query to return duplicate rows. These duplicates inflated the number of features in the final file, pushing it past the 200-feature limit the system had configured.
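To make the numbers concrete, here is a minimal, hypothetical sketch (not Cloudflare's actual tooling) of how a metadata query that suddenly returns duplicate rows can push the generated feature file past a hard limit. The 200-feature limit and the roughly 60 features used in normal operation come from Cloudflare's statement; the feature names, the duplication factor and the code itself are invented for illustration.

```rust
use std::collections::BTreeSet;

const FEATURE_LIMIT: usize = 200; // limit cited in Cloudflare's statement
const NORMAL_FEATURES: usize = 60; // typical feature count cited in the statement

fn main() {
    // Normal query result: one row per feature.
    let normal: Vec<String> = (0..NORMAL_FEATURES)
        .map(|i| format!("feature_{i}"))
        .collect();

    // After the permissions change, the same query returns each row several
    // times. The exact multiplier is not given in the article; what matters
    // is that the duplicated result pushed the file past the 200-feature limit.
    let duplicated: Vec<String> = normal
        .iter()
        .cycle()
        .take(NORMAL_FEATURES * 4)
        .cloned()
        .collect();

    for (label, rows) in [("normal", &normal), ("after permissions change", &duplicated)] {
        let distinct: BTreeSet<&String> = rows.iter().collect();
        println!(
            "{}: {} rows ({} distinct features), limit {} -> {}",
            label,
            rows.len(),
            distinct.len(),
            FEATURE_LIMIT,
            if rows.len() <= FEATURE_LIMIT { "within limit" } else { "over limit" }
        );
    }
}
```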
💡 Why was there a limit?
Cloudflare has a 200-feature limit for the Bot Management system because, for performance reasons, it pre-allocates memory for the features. The system normally uses approximately 60 features, so the 200-feature limit sat well above normal usage. When a file with more than 200 features reached the servers, the limit was exceeded and the software panicked, generating 500 errors.
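The following sketch, again purely illustrative and not Cloudflare's code, shows why a pre-allocated limit turns an oversized configuration into a hard failure: the loader refuses files above the limit, and an unchecked error on the refresh path becomes a panic that clients see as an HTTP 500. The 200-feature limit and the panic behaviour are taken from the description above; all type and function names are assumptions.

```rust
const FEATURE_LIMIT: usize = 200;

struct FeatureTable {
    // Memory for the feature names is reserved once, up front, so the hot
    // path never has to reallocate while serving requests.
    names: Vec<String>,
}

impl FeatureTable {
    fn load(features: Vec<String>) -> Result<Self, String> {
        if features.len() > FEATURE_LIMIT {
            // A refresh should never exceed the pre-allocated space, so the
            // loader treats an oversized file as a fatal error.
            return Err(format!(
                "feature file has {} entries, limit is {}",
                features.len(),
                FEATURE_LIMIT
            ));
        }
        let mut names = Vec::with_capacity(FEATURE_LIMIT);
        names.extend(features);
        Ok(Self { names })
    }
}

// Stand-in for the configuration-refresh path: an unchecked `unwrap()` turns
// the oversized file into a panic, which the surrounding service surfaces to
// clients as HTTP 500 errors.
fn handle_config_refresh(features: Vec<String>) -> usize {
    let table = FeatureTable::load(features).unwrap(); // panics if over limit
    table.names.len()
}

fn main() {
    let good: Vec<String> = (0..60).map(|i| format!("feature_{i}")).collect();
    println!("good file loaded: {} features", handle_config_refresh(good));

    let oversized: Vec<String> = (0..250).map(|i| format!("feature_{i}")).collect();
    // This call panics, mirroring the 500 errors described above.
    handle_config_refresh(oversized);
}
```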
What made this incident especially confusing was that the file was generated every five minutes, and was only generated incorrectly if the query ran on a part of the ClickHouse cluster that had been updated. This caused the system to recover and fail alternately, making it initially appear to be a large-scale DDoS attack.
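That flapping behaviour can be simulated in a few lines. In the toy model below, only queries served by already-updated ClickHouse nodes produce the oversized file, so successive five-minute generation cycles alternate between a good file and a bad one; the node layout and the feature counts (other than the 200 limit) are invented for the example.

```rust
const FEATURE_LIMIT: usize = 200;

#[derive(Clone, Copy)]
struct Node {
    updated: bool, // has this node already received the permissions change?
}

// Feature count of the file generated when a given node serves the metadata
// query: updated nodes produce the oversized file, the rest the normal one.
// Both counts are invented; only the 200 limit comes from the article.
fn generated_feature_count(node: Node) -> usize {
    if node.updated { 240 } else { 60 }
}

fn main() {
    // A cluster mid-rollout: some nodes carry the permissions change, some don't.
    let cluster = [
        Node { updated: true },
        Node { updated: false },
        Node { updated: false },
        Node { updated: true },
        Node { updated: false },
        Node { updated: true },
    ];

    // Every "five minutes" a different node happens to serve the query, so
    // the propagated file flips between good and bad and the network appears
    // to recover and fail in waves.
    for (cycle, node) in cluster.iter().enumerate() {
        let count = generated_feature_count(*node);
        let outcome = if count > FEATURE_LIMIT {
            "bad file propagated, 500 errors return"
        } else {
            "good file propagated, traffic recovers"
        };
        println!("generation cycle {}: {} features -> {}", cycle, count, outcome);
    }
}
```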
Global impact: services affected according to Cloudflare
According to Cloudflare's official statement, the following services were impacted during the incident:
Affected services:
- Core CDN and security services: HTTP 5xx errors in customer traffic
- Turnstile: Failed to load, preventing dashboard access
- Workers KV: Significantly elevated level of HTTP 5xx errors
- Dashboard: Most users could not log in because Turnstile failed to load
- Email Security: Temporary loss of access to IP reputation source, reducing spam detection accuracy
- Cloudflare Access: Widespread authentication failures from the start of the incident until 13:05 UTC
In addition to HTTP 5xx errors, Cloudflare also observed significant increases in latency in its CDN responses during the impact period. This was due to large amounts of CPU being consumed by debugging and observability systems that automatically enhance uncaught errors with additional debugging information.
Lessons learned: what can we do?
This incident underscores the critical need for redundant systems and mitigation strategies to face possible failures in critical services. The technology community must reflect on how to build a more robust Internet less susceptible to massive disruptions.
💡 Recommendations for companies
- Provider diversification: Do not depend exclusively on a single CDN or infrastructure provider (see the sketch after this list).
- Contingency plan: Have clear backup plans tested regularly.
- Proactive monitoring: Implement early warning systems to detect problems before they affect users.
- Transparent communication: Keep users informed during incidents.
- Post-mortem analysis: Conduct detailed analyses after each incident to prevent future problems.
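As a rough illustration of the first and third points, here is a minimal health-check-and-failover sketch. The endpoints, port and timeout are placeholders, and in practice this logic usually lives at the DNS or load-balancer layer rather than in application code; the point is simply that a backup path and an automated probe must exist before the incident, not after.

```rust
use std::net::{TcpStream, ToSocketAddrs};
use std::time::Duration;

const PROBE_TIMEOUT: Duration = Duration::from_secs(2);

/// Returns true if a TCP connection to `endpoint` (e.g. "host:443") can be
/// established within the timeout.
fn is_reachable(endpoint: &str) -> bool {
    endpoint
        .to_socket_addrs()
        .ok()
        .and_then(|mut addrs| addrs.next())
        .map(|addr| TcpStream::connect_timeout(&addr, PROBE_TIMEOUT).is_ok())
        .unwrap_or(false)
}

fn main() {
    // Placeholder endpoints: a primary CDN/provider and an independent backup.
    let providers = ["primary.example.com:443", "backup.example.net:443"];

    for provider in providers {
        if is_reachable(provider) {
            println!("routing traffic through {}", provider);
            return;
        }
        println!("{} unreachable, trying the next provider", provider);
    }
    println!("no provider reachable: trigger the incident-response plan");
}
```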
Conclusion: a reminder of the Internet's fragility
Cloudflare's November 18, 2025 outage exposed the vulnerability of the Internet when one of its main providers experiences problems. Millions of users and companies were affected, highlighting the importance of diversifying technological dependencies and strengthening the resilience of digital infrastructure.
This historic incident reminds us that, although the Internet seems omnipresent and robust, it remains a complex network of interdependent systems. A single point of failure can have global consequences. The lesson is clear: redundancy and preparation are not optional, they are essential.
Need help with your IT infrastructure?
At everyWAN we help companies build resilient infrastructures prepared for any eventuality. Contact us for a personalized consultation.
Contact everyWAN