Outage at Primary Data Center

Incident Report for Adeptcore, Inc.

Postmortem

We did our best to bring customer environments back online as quickly as possible without causing further sudden power issues or overloading the Nimble storage arrays.

A timeline of the day as it relates to the outage follows.

• 8:01 AM – Alerts and reports received for total connectivity loss at the Phoenix Datacenter.

• 8:10 AM – Total connectivity loss confirmed by Adeptcloud staff.

• 8:15 AM – Ticket opened and call placed with the datacenter team regarding the ongoing issue.

• 8:25 AM – Datacenter technicians dispatched to the affected cabinet.

• 8:30 AM – Additional details gathered by Adeptcloud staff; signs point to a power loss.

• 8:40 AM – Datacenter team reported that all equipment was powered on but that a breaker on the secondary PDU had tripped.

• 8:55 AM – One of the two Cisco firewalls (in HA mode) online and accessible.

• 8:57 AM – Secondary Cisco firewall (in HA mode) online and accessible.

• 9:08 AM – Management virtual machines powering on across multiple ESXi hosts.

• 9:25 AM – Virtual firewalls powering on across multiple ESXi hosts.

• 9:30 AM – Tenant Active Directory servers being powered on in a staggered fashion.

• 9:45 AM – Datacenter team continuing to investigate original cause of power loss.

• 10:00 AM – Tenant Horizon servers being powered on in a staggered fashion.

• 10:30 AM – Linux-based virtual machines being powered on.

• 10:35 AM – Manual confirmation of network connectivity to tenant Horizon servers.

• 11:00 AM – Selective powering on of remote desktop servers and custom LOB servers.

• 11:20 AM – Nearly 90% of all customer environments back online and functional.

• 12:10 PM – 100% of customer environments back online and functional.

• 4:00 PM – Receipt of further alerts regarding PSU switchovers from the servers.

• 4:20 PM – Review of datacenter documents, PDU power draws, server power draw statistics.

• 7:05 PM – Replacement of B-side PDU by datacenter staff.
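The staggered recovery order in the timeline above can be sketched roughly as follows. This is an illustration only, not the tooling used during the incident: the tier order mirrors the timeline, while the function name `plan_power_on`, the batch size, and the VM names are hypothetical.

```python
from typing import Dict, Iterator, List, Tuple

# Tier order taken from the timeline; the tier labels themselves are
# hypothetical names chosen for this sketch.
TIER_ORDER = [
    "management",
    "virtual-firewall",
    "active-directory",
    "horizon",
    "linux",
    "remote-desktop-lob",
]


def plan_power_on(
    vms_by_tier: Dict[str, List[str]], batch_size: int = 2
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (tier, batch) pairs in dependency order.

    Each tier is split into small batches, and a tier is only started
    once every batch of the previous tier has been yielded.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for tier in TIER_ORDER:
        vms = vms_by_tier.get(tier, [])
        for start in range(0, len(vms), batch_size):
            yield tier, vms[start:start + batch_size]


if __name__ == "__main__":
    # Hypothetical inventory, for illustration only.
    inventory = {
        "active-directory": ["dc-tenant-a", "dc-tenant-b", "dc-tenant-c"],
        "management": ["vcenter", "monitoring"],
        "horizon": ["horizon-tenant-a"],
    }
    for tier, batch in plan_power_on(inventory):
        # A real runbook would power on the batch here and wait for
        # storage latency and power draw to settle before continuing.
        print(tier, batch)
```

Batching within each tier is what keeps the sudden load on power and storage small; the pause between batches is left to the operator in this sketch.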

We are continuing to monitor the power usage across our environment and are working closely with datacenter staff to ensure a future PDU fault does not cause a total power loss scenario.

The secondary (B-side) PDU in our cabinet was replaced and we are actively exploring available options to prevent such an issue in the future.
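One way to reason about whether a single PDU fault could lead to a total power loss is to check that either PDU could carry the cabinet's combined load on its own. The sketch below is an illustration under stated assumptions, not Adeptcloud's or the datacenter's actual method: the function name, the 80% continuous-load derating, and the example figures are all hypothetical.

```python
def survives_single_pdu_loss(
    a_side_amps: float,
    b_side_amps: float,
    breaker_rating_amps: float,
    derating: float = 0.8,
) -> bool:
    """Return True if one PDU could carry the combined A-side and B-side load.

    With dual-corded servers, losing one PDU shifts its load onto the
    other, so the surviving PDU needs headroom for both.
    The 0.8 derating is an assumed continuous-load limit, not a value
    taken from this incident.
    """
    usable_amps = breaker_rating_amps * derating
    return (a_side_amps + b_side_amps) <= usable_amps
```

For example, with 10 A on the A side and 9 A on the B side of 30 A breakers, the check passes; with 14 A and 13 A it fails, meaning a fault on one side could overload and trip the other.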

Posted Jul 13, 2022 - 15:25 CDT

Resolved

On July 11th, 2022, Adeptcloud experienced a critical outage at our Phoenix datacenter at around 8:00 AM CDT. We have determined the cause to be a faulty secondary PDU inside our cabinet. Irregular power from the faulty secondary PDU caused the primary PDU to trip, resulting in a brief power outage. Datacenter technicians confirmed that equipment was powering on when they arrived at the cabinet, and we regained access to our out-of-band management network (provided through our Cisco high-availability firewalls) shortly thereafter. After ensuring all management virtual machines were powered on and operating correctly, we began powering on customer environments in a staggered fashion while fielding partner calls and emails regarding the issue. The full timeline and follow-up actions are given in the postmortem above.

Posted Jul 13, 2022 - 15:23 CDT

Monitoring

We've confirmed that all customer servers are online and operational. We continue to monitor the infrastructure and to work with the datacenter team.

We will be publishing a post-mortem on this situation within the next 48 hours once we have gathered all relevant details.

Posted Jul 11, 2022 - 12:12 CDT

Update

The vast majority of customer servers have been brought back online. Some Remote Desktop and File servers are currently still being brought back online. A final update will be posted once all of the affected servers have finished powering on.

The datacenter team is working to determine the reason for the original power loss scenario at this time.

Posted Jul 11, 2022 - 11:21 CDT

Update

We are still in the process of bringing customer environments back online. We are bringing them online in a staggered fashion, and as a result, some customers may see some servers show up online before others. We will continue to post updates as we have them.

Posted Jul 11, 2022 - 10:28 CDT

Update

Customer environments are in the process of coming back online. We will provide another update once all environments are expected to be operational.

Posted Jul 11, 2022 - 09:31 CDT

Update

The environment is slowly coming back online, and customer environments will follow once we have verified that all network and infrastructure components are operating correctly.

Posted Jul 11, 2022 - 09:19 CDT

Update

The identified issue has been isolated to a power event at PHX DC. Back-end infrastructure has been restored and we are working on restoring customer services. Next update to follow in 20 minutes.

Posted Jul 11, 2022 - 08:53 CDT

Identified

Working with DC OPS, we have identified the issue and are restoring services now. Next update to follow in 20 minutes.

Posted Jul 11, 2022 - 08:40 CDT

Investigating

We are aware of an ongoing issue affecting customers at our primary data center and are working on a resolution. Our engineers are working to identify, isolate, and resolve this issue as quickly as possible. We will update this page as soon as we have more information.

Posted Jul 11, 2022 - 08:19 CDT

This incident affected: Adeptcloud Infrastructure (ACP - Network).