CloudFlare Was Down Due To Edge Routers Crashing, Taking Down 785,000 Websites Including 4chan, Wikileaks, Metallica.com
Security and caching service CloudFlare was down for close to an hour due to an issue with its edge routers. Because the service sits as a layer between 785,000 websites and their users in order to speed up traffic and prevent DDoS attacks and other security issues, all of those websites were affected, including 4chan.
“At around 9:47 UTC (1:47 AM in California), a change got pushed out. It caused the edge routers in our network to crash,” co-founder and CEO Matthew Prince told us over the phone. “I don’t want to throw the routers’ vendor under the bus, but it caused them to crash. If you sent a packet to one of our IP addresses, you would get back a response that there was no router,” he continued.
Even though the operations team monitors the service 24/7, the routers crashed in such a way that they had to be rebooted manually. That meant calling every data center around the world and reverting the rule that had been pushed out. The DNS and proxy servers were perfectly fine; they were just unreachable.
“Starting about 30 minutes after the first report, the routers started coming online,” Prince said. But for a little while, the service was still unavailable because all of the traffic was being redirected to the first data centers to come back online. It took about an hour before normal traffic was flowing in and out of CloudFlare’s servers.
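To see why the first data centers to recover could not carry the load on their own, consider a rough sketch in Python. It assumes traffic is split evenly across whichever sites are reachable, which is a simplification. The traffic volume, the number of sites and the per-site capacity are all hypothetical figures chosen for illustration, not numbers from CloudFlare.

```python
import math

def load_per_site(total_traffic, sites_online):
    """Split the total traffic evenly across the sites that are reachable."""
    if sites_online <= 0:
        raise ValueError("at least one site must be online")
    return total_traffic / sites_online

def is_overloaded(total_traffic, sites_online, capacity_per_site):
    """True if each reachable site receives more traffic than it can serve."""
    return load_per_site(total_traffic, sites_online) > capacity_per_site

def min_sites_needed(total_traffic, capacity_per_site):
    """Smallest number of sites that can absorb the full traffic volume."""
    return math.ceil(total_traffic / capacity_per_site)

# Hypothetical figures, for illustration only.
TOTAL_TRAFFIC = 2300      # arbitrary units of traffic
CAPACITY_PER_SITE = 200   # arbitrary units each site can serve

if __name__ == "__main__":
    for online in (1, 5, 12, 23):
        load = load_per_site(TOTAL_TRAFFIC, online)
        state = "overloaded" if is_overloaded(
            TOTAL_TRAFFIC, online, CAPACITY_PER_SITE) else "ok"
        print(f"{online:2d} sites online: {load:7.1f} per site ({state})")
```

Under these made-up numbers, the network only copes once at least 12 sites are back. With fewer sites online, each one receives more than it can serve, which fits Prince's account of the service remaining unavailable while the first routers recovered.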
“This is a completely unacceptable event to us,” Prince said. “In our four years of life, this is our third significant outage,” he continued.
The company’s own website and status page were down as well, returning a 502 error. Its Twitter status account was the only communication channel still available.
CloudFlare generates so many pageviews that, counted as a single site, it would rank as the tenth-largest website in the world. Back in September, the company announced at TechCrunch Disrupt SF that it served 70 billion monthly pageviews to 600 million unique visitors. Prince corrected that number, saying that the service now serves “well over 100 billion pageviews” every month.
The downtime is a reminder to developers that relying on a third-party service carries risk: when that service fails, every site behind it fails too. In CloudFlare’s case, the service handles a crucial task: delivering traffic.
Prince also addressed CloudFlare’s paying customers, saying the company will act to make up for the outage: “We are extremely disappointed and we’ll definitely be honoring our paying customers.”
we're experencing a network-wide issue. Looking into the root cause.
CloudFlareStatus (@CloudFlareSys) March 03, 2013
Update –> It appears to be a bad routing issue. Team is working to mitigate. Service should be coming back for all customers.
CloudFlareStatus (@CloudFlareSys) March 03, 2013
@zomidaily Your site should be coming back online. Issue is clearing up.
CloudFlareStatus (@CloudFlareSys) March 03, 2013