On July 11th, 2022, Adeptcloud experienced a critical outage at our Phoenix datacenter at around 8:00 AM CST. We have determined the cause to be a faulty secondary PDU inside our cabinet. Irregular power from the faulty secondary PDU caused the primary PDU to trip, resulting in a brief power outage. Datacenter technicians confirmed that equipment was powering on when they arrived at the cabinet, and we regained access to our out-of-band management network (provided through our Cisco high-availability firewalls) shortly thereafter. After confirming that all management virtual machines were powered on and operating correctly, we began powering on customer environments in a staggered fashion while fielding partner calls and emails about the issue.
We worked to bring customer environments back online as quickly as possible without creating additional sudden power issues or overloading our Nimble storage arrays.
A timeline of the day's events, as they relate to the outage, follows.
• 8:01 AM – Alerts and reports received for total connectivity loss at the Phoenix Datacenter.
• 8:10 AM – Total connectivity loss confirmed by Adeptcloud staff.
• 8:15 AM – Ticket opened and call placed with the datacenter team regarding the ongoing issue.
• 8:25 AM – Datacenter technicians dispatched to the affected cabinet.
• 8:30 AM – Additional details gathered by Adeptcloud staff, with signs pointing to a power loss.
• 8:40 AM – Datacenter team reported that all equipment was powered on but that a secondary PDU breaker had tripped.
• 8:55 AM – One of the two Cisco firewalls (in HA mode) online and accessible.
• 8:57 AM – Secondary Cisco firewall (in HA mode) online and accessible.
• 9:08 AM – Management virtual machines powering on across multiple ESXi hosts.
• 9:25 AM – Virtual firewalls powering on across multiple ESXi hosts.
• 9:30 AM – Tenant Active Directory servers being powered on in a staggered fashion.
• 9:45 AM – Datacenter team continuing to investigate original cause of power loss.
• 10:00 AM – Tenant Horizon servers being powered on in a staggered fashion.
• 10:30 AM – Linux-based virtual machines being powered on.
• 10:35 AM – Manual confirmation of network connectivity to tenant Horizon servers.
• 11:00 AM – Selective powering on of remote desktop servers and custom LOB servers.
• 11:20 AM – Nearly 90% of customer environments back online and functional.
• 12:10 PM – 100% of customer environments back online and functional.
• 4:00 PM – Further alerts received regarding PSU switchovers on the servers.
• 4:20 PM – Review of datacenter documents, PDU power draws, and server power draw statistics.
• 7:05 PM – Replacement of B-side PDU by datacenter staff.
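For readers who want a concrete picture of the staggered power-on described in the timeline above, here is a minimal sketch in Python. It only builds a schedule and does not talk to any hypervisor. The function names, the batch size, the delay, and the VM names are all hypothetical and are not taken from our actual tooling; only the tier order (management, virtual firewalls, Active Directory, Horizon, and so on) follows the timeline.

```python
from typing import Iterator, List, Sequence, Tuple


def staggered_batches(vm_names: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield the VM names in consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(vm_names), batch_size):
        yield list(vm_names[start:start + batch_size])


def power_on_plan(
    tiers: Sequence[Tuple[str, Sequence[str]]],
    batch_size: int,
    delay_seconds: int,
) -> List[Tuple[int, str, List[str]]]:
    """Build a schedule of (offset_seconds, tier_name, batch).

    Tiers are processed in the order given, so every batch of an earlier
    tier is scheduled before any batch of a later tier. Each batch starts
    delay_seconds after the previous one, which spreads out the power
    draw and the storage boot load.
    """
    plan: List[Tuple[int, str, List[str]]] = []
    offset = 0
    for tier_name, vms in tiers:
        for batch in staggered_batches(vms, batch_size):
            plan.append((offset, tier_name, batch))
            offset += delay_seconds
    return plan


if __name__ == "__main__":
    # Hypothetical VM names; the tier order mirrors the timeline.
    example_tiers = [
        ("management", ["mgmt-01", "mgmt-02", "mgmt-03"]),
        ("virtual-firewalls", ["vfw-01", "vfw-02"]),
        ("active-directory", ["ad-tenant-a", "ad-tenant-b"]),
    ]
    for offset, tier, batch in power_on_plan(example_tiers, batch_size=2, delay_seconds=60):
        print(f"t+{offset:>4}s  {tier:<18} {', '.join(batch)}")
```

A smaller batch size or a longer delay lengthens recovery but lowers the peak load on power and storage, which is the trade-off we were balancing on the day.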
We are continuing to monitor the power usage across our environment and are working closely with datacenter staff to ensure a future PDU fault does not cause a total power loss scenario.
The secondary (B-side) PDU in our cabinet was replaced and we are actively exploring available options to prevent such an issue in the future.
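As an illustration of the kind of check that power monitoring can support, the sketch below tests whether either PDU could carry the whole cabinet on its own if the other one failed. It is a simplified assumption and not a description of our monitoring setup: the `PduReading` type, the 80% continuous-load derating (a common rule of thumb), and all amperage figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PduReading:
    name: str
    load_amps: float     # current measured draw on this PDU
    breaker_amps: float  # rated breaker capacity of this PDU


def survives_single_pdu_failure(a: PduReading, b: PduReading, derating: float = 0.8) -> bool:
    """Return True if the combined load fits on either PDU alone.

    If one side fails, dual-corded servers shift their full draw to the
    surviving side, so the combined load must stay within the derated
    capacity of the smaller breaker. The 0.8 derating is an assumed
    rule of thumb, not a measured value.
    """
    combined = a.load_amps + b.load_amps
    usable = derating * min(a.breaker_amps, b.breaker_amps)
    return combined <= usable


if __name__ == "__main__":
    # Hypothetical readings for illustration only.
    a_side = PduReading("A-side", load_amps=10.0, breaker_amps=30.0)
    b_side = PduReading("B-side", load_amps=9.0, breaker_amps=30.0)
    print(survives_single_pdu_failure(a_side, b_side))
```

Under these assumptions, each side needs to run at roughly 40% of its breaker rating or less during normal operation for a failover to stay within limits.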