Data center outages are becoming less frequent and less severe, according to new research, highlighting greater resilience across the industry.
As the number of data centers continues to expand, the total number of data center-related outages can be expected to increase. But new research from the Uptime Institute shows that there has been a steady downward trend in both the frequency and severity of outages over the past few years.
According to the institute's 2023 data center survey, just over half (55%) of data center operators said they had experienced an outage in the past three years. But that's down from 60% in 2022 and 69% in 2021.
Similarly, only one in ten outages in 2023 were considered serious or severe, marking a decrease compared to 2022 and 2021.
Uptime said a key factor behind the improvement is that most organizations are investing more in physical infrastructure redundancy year after year.
“While the industry may move further toward distributed and software-based resilience models, maintaining and increasing redundancy at the site level remains a high priority for most operators,” Uptime said.
According to its data, only a small proportion of enterprise, colocation or cloud providers were reducing redundancy. Across all of those groups, about a third were increasing their cooling and power redundancy levels, while the rest were keeping them stable.
Still, the report found that outages, while rarer, are becoming more expensive. More than half of respondents said their most recent significant outage cost more than $100,000, while an unlucky 16% reported costs of more than $1 million.
The industry advisory firm estimated that, on average, between 10 and 20 high-profile IT outages or data center events occur around the world each year, causing serious or severe financial losses along with disruption for both businesses and consumers.
Power is a key factor in data center outages
Uptime's research found that on-site power distribution disruptions are the most common cause of impactful outages.
“This is not surprising given the intolerance of IT hardware to any significant power disturbance, such as voltage fluctuations or complete power loss, lasting more than fractions of a second,” Uptime said.
By contrast, cooling system failures can be tolerated for (a little) longer without causing problems. The study noted that while IT-based failures may occur more frequently, their impact is often isolated to specific applications or data sets.
Issues with third-party providers are also a growing factor, which Uptime says reflects the increasing reliance on SaaS and colocation providers. Other, less common causes included problems with networks and fire suppression systems.
Human error remains a widespread problem
But regardless of how the problem manifests itself, humans are likely responsible for it.
Uptime said human error can be the result of a number of factors, such as poor training, the quality of procedures implemented, staff fatigue and the enormous complexity of operating the equipment involved.
Based on 25 years of data, Uptime estimated that human error, either directly or indirectly, contributed to between two-thirds and four-fifths of all incidents.
These disruptions are primarily due either to staff not following procedures (cited by 48% of respondents who had a serious disruption in the past three years) or to the procedures themselves being simply inadequate (43%).
Four in five respondents said their most recent major outage could have been avoided with better management, processes and configuration.
“This suggests that, as in previous years, there is an opportunity to reduce disruptions through training and process review,” the report says.
The report adds that the aftershocks of the COVID-19 pandemic continue to have an impact on the data center industry. For example, supply chain disruptions continue to slow capital projects, which has led many organizations to delay maintenance and infrastructure upgrades.
Such projects can themselves cause outages, so outage numbers may rebound at some point in the future.
Uptime also warned that the global shift toward more transactive, dynamic and renewable power grids could reduce grid reliability. This could mean more outages in the future, as outages often occur when an uninterruptible power supply system or generator fails to respond to a grid outage.
Extreme weather events exacerbated by climate change have also been associated with data center outages in recent years.
“This trend is likely to intensify and increase the risks of outages until preventative measures are taken,” Uptime warned.