When a Cloudflare outage disrupted large numbers of websites and online services yesterday, the company initially thought it was hit by a “hyper-scale” DDoS (distributed denial-of-service) attack.
“I worry this is the big botnet flexing,” Cloudflare co-founder and CEO Matthew Prince wrote in an internal chat room yesterday, while he and others discussed whether Cloudflare was being hit by attacks from the prolific Aisuru botnet. But upon further investigation, Cloudflare staff realized the problem had an internal cause: an important file had unexpectedly doubled in size and propagated across the network.
The oversized file caused trouble for software that reads it to keep Cloudflare’s bot management system, which uses a machine learning model to protect against security threats, up to date. Cloudflare’s core CDN, security services, and several other products were affected.
“After we initially wrongly suspected the symptoms we were seeing were caused by a hyper-scale DDoS attack, we correctly identified the core issue and were able to stop the propagation of the larger-than-expected feature file and replace it with an earlier version of the file,” Prince wrote in a post-mortem of the outage.
Prince explained that the problem “was triggered by a change to one of our database systems’ permissions which caused the database to output multiple entries into a ‘feature file’ used by our Bot Management system. That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network.”
These machines run software that routes traffic across the Cloudflare network. The software “reads this feature file to keep our Bot Management system up to date with ever changing threats,” Prince wrote. “The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.”
Sorry for the pain, Internet
After replacing the bloated feature file with an earlier version, the flow of core traffic “largely” returned to normal, Prince wrote. But it took another two-and-a-half hours “to mitigate increased load on various parts of our network as traffic rushed back online.”
Like Amazon Web Services, Cloudflare is relied upon by many online services and can take down much of the web when it has a technical problem. “On behalf of the entire team at Cloudflare, I would like to apologize for the pain we caused the Internet today,” Prince wrote, saying that any outage is unacceptable because of “Cloudflare’s importance in the Internet ecosystem.”
Cloudflare’s bot management system classifies bots as good or bad with “a machine learning model that we use to generate bot scores for every request traversing our network,” Prince wrote. “Our customers use bot scores to control which bots are allowed to access their sites—or not.”
Prince explained that the configuration file this system relies upon describes “features,” or individual traits “used by the machine learning model to make a prediction about whether the request was automated or not.” This file is updated every five minutes “and published to our entire network and allows us to react to variations in traffic flows across the Internet. It allows us to react to new types of bots and new bot attacks. So it’s critical that it is rolled out frequently and rapidly as bad actors change their tactics quickly.”
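In other words, the pipeline boils down to a small loop: a central process regenerates the feature file, and every machine re-reads it on a roughly five-minute cadence. The sketch below illustrates only that pattern; the file name, format, and Rust code are illustrative assumptions, not Cloudflare’s actual implementation.

```rust
use std::{fs, thread, time::Duration};

// Illustrative sketch only: the file name and one-feature-per-line format
// are assumptions, not Cloudflare's actual configuration format.
fn load_feature_file(path: &str) -> std::io::Result<Vec<String>> {
    Ok(fs::read_to_string(path)?
        .lines()
        .map(|line| line.to_string())
        .collect())
}

fn main() {
    loop {
        match load_feature_file("bot_features.conf") {
            Ok(features) => println!("refreshed {} bot-detection features", features.len()),
            Err(e) => eprintln!("failed to refresh feature file: {e}"),
        }
        // Per the post-mortem, a new version of the file is generated and
        // pushed to the entire network roughly every five minutes.
        thread::sleep(Duration::from_secs(300));
    }
}
```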
Unexpected query response
Each new version of the file is generated by a query running on a ClickHouse database cluster, Prince wrote. When Cloudflare made a change granting additional permissions to database users, the query response suddenly contained more metadata than it previously had.
Cloudflare staff assumed “that the list of columns returned by a query like this would only include the ‘default’ database.” But the query didn’t filter on the database name, so it returned duplicate columns, Prince wrote.
This is the type of query that Cloudflare’s bot management system uses “to construct each input ‘feature’ for the file,” he wrote. The extra metadata more than doubled the rows in the response, “ultimately affecting the number of rows (i.e. features) in the final file output,” Prince wrote.
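Here is a toy reconstruction of the logic Prince describes, in Rust: column metadata is gathered for a feature table, and when the filter on the database name is left out, every column appears once per database the query can now see, doubling the rows. The table and database names below are assumptions for illustration, not Cloudflare’s schema or query.

```rust
// Illustrative sketch only: a toy model of column metadata.
struct ColumnMeta {
    database: String,
    table: String,
    name: String,
}

// Collect the column names that become "features", optionally restricted
// to a single database.
fn feature_columns(metadata: &[ColumnMeta], only_database: Option<&str>) -> Vec<String> {
    metadata
        .iter()
        .filter(|c| c.table == "bot_features")
        .filter(|c| only_database.map_or(true, |db| c.database == db))
        .map(|c| c.name.clone())
        .collect()
}

fn main() {
    let col = |database: &str, name: &str| ColumnMeta {
        database: database.to_string(),
        table: "bot_features".to_string(),
        name: name.to_string(),
    };
    // The same logical table, visible through a second database after the
    // permissions change (database names here are assumptions).
    let metadata = vec![
        col("default", "feature_a"),
        col("default", "feature_b"),
        col("underlying", "feature_a"),
        col("underlying", "feature_b"),
    ];

    // Intended behavior: restrict the metadata query to one database.
    assert_eq!(feature_columns(&metadata, Some("default")).len(), 2);

    // What actually happened: no database filter, so every column comes back
    // twice and the generated feature file more than doubles in size.
    assert_eq!(feature_columns(&metadata, None).len(), 4);
}
```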
Cloudflare’s proxy service has limits to prevent excessive memory consumption; the bot management system has “a limit on the number of machine learning features that can be used at runtime.” That limit is 200, well above the actual number of features used.
“When the bad file with more than 200 features was propagated to our servers, this limit was hit—resulting in the system panicking” and outputting errors, Prince wrote.
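A minimal sketch of that failure mode, assuming a hard cap and a Rust-style unwrap at the call site (the 200-feature limit is from Prince’s account; everything else is invented for illustration):

```rust
// Illustrative sketch only: names and error handling are assumptions.
// The limit of 200 runtime features is the figure Prince cites.
const MAX_FEATURES: usize = 200;

fn load_features(names: Vec<String>) -> Result<Vec<String>, String> {
    if names.len() > MAX_FEATURES {
        // Exceeding the preallocated limit is treated as unrecoverable.
        return Err(format!(
            "feature file has {} features, above the limit of {}",
            names.len(),
            MAX_FEATURES
        ));
    }
    Ok(names)
}

fn main() {
    // A bad file with more than 200 entries, as in the incident.
    let bad_file: Vec<String> = (0..260).map(|i| format!("feature_{i}")).collect();

    // Unwrapping the error panics, analogous to the proxy "panicking"
    // when the oversized feature file was loaded.
    let _features = load_features(bad_file).unwrap();
}
```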
Worst Cloudflare outage since 2019
The number of 5xx error HTTP status codes served by the Cloudflare network is normally “very low” but soared after the bad file spread across the network. “The spike, and subsequent fluctuations, show our system failing due to loading the incorrect feature file,” Prince wrote. “What’s notable is that our system would then recover for a period. This was very unusual behavior for an internal error.”
This unusual behavior was explained by the fact “that the file was being generated every five minutes by a query running on a ClickHouse database cluster, which was being gradually updated to improve permissions management,” Prince wrote. “Bad data was only generated if the query ran on a part of the cluster which had been updated. As a result, every five minutes there was a chance of either a good or a bad set of configuration files being generated and rapidly propagated across the network.”
This fluctuation initially “led us to believe this might be caused by an attack. Eventually, every ClickHouse node was generating the bad configuration file and the fluctuation stabilized in the failing state,” he wrote.
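A small, deliberately artificial simulation shows why a partial rollout produces exactly that fluctuating signal: while only some ClickHouse nodes have the new permissions, each five-minute regeneration may come from a good node or a bad one, and once every node is updated, every regeneration is bad. The node counts and rollout schedule below are invented.

```rust
// Invented numbers, purely to illustrate the fluctuation Prince describes:
// while the rollout is partial, some regenerations are good and some are bad;
// once every node is updated, every regeneration is bad and the errors stabilize.
fn main() {
    let total_nodes = 8;
    let mut updated_nodes = 1; // gradual permissions rollout across the cluster

    for cycle in 0..16 {
        // Every five-minute cycle, one node regenerates the feature file.
        let node = cycle % total_nodes;
        let bad = node < updated_nodes; // updated nodes emit the oversized file
        println!(
            "cycle {cycle:2}: node {node} generates a {} file",
            if bad { "bad (oversized)" } else { "good" }
        );
        if cycle % 2 == 1 && updated_nodes < total_nodes {
            updated_nodes += 1; // rollout continues until all nodes are updated
        }
    }
}
```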
Prince said that Cloudflare “solved the problem by stopping the generation and propagation of the bad feature file and manually inserting a known good file into the feature file distribution queue,” and then “forcing a restart of our core proxy.” The team then worked on “restarting remaining services that had entered a bad state” until the 5xx error code volume returned to normal later in the day.
Prince said the outage was Cloudflare’s worst since 2019 and that the firm is taking steps to protect against similar failures in the future. Cloudflare will work on “hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input; enabling more global kill switches for features; eliminating the ability for core dumps or other error reports to overwhelm system resources; [and] reviewing failure modes for error conditions across all core proxy modules,” according to Prince.
While Prince can’t promise that Cloudflare will never have another outage of the same scale, he said that previous outages have “always led to us building new, more resilient systems.”
