This incident has been resolved. There have been no further incidents in the several months since we overhauled our power infrastructure at the datacenter.
Mar 23, 12:07 CDT
Updates from datacenter visit that occurred on 10/20/2022 and 10/21/2022.
Below is an update for everyone from Denis' trip to the datacenter on October 20th and 21st.
- iSCSI connectivity was confirmed to run across two NICs on each host, on two separate PCI network cards.
- iSCSI connectivity was tested and confirmed with the datacenter team. We tested each switch and network link across the entire iSCSI path from our network racks back to the datacenter-provided Nimble storage arrays. We have re-routed our iSCSI connections with the datacenter to run across two separate network paths. Our drops now come into the datacenter storage cage from both the east and west sides.
- We performed in-rack testing of iSCSI network redundancy across the switching fabric.
- iSCSI is configured with MPIO. Should any portion of a path go down, traffic is rerouted via the remaining paths.
- The power has been upgraded to 208V 30A circuits for both the primary and secondary circuits.
- New PDUs were installed in our network racks.
- We've tested the redundancy of each host's power, network connectivity and SAN connectivity.
- We've tested the redundancy of each switch and firewall, including their network connectivity and power inputs.
- We've performed thorough testing of the PDUs in each rack. During this testing, we shut down the A-side PDUs in each rack and confirmed a seamless transition to the B-side PDUs without any interruptions or outages. This was tested and confirmed three times overnight.
> The same tests were performed on the B-side PDUs, confirming a seamless transition to the A-side PDUs without interruptions or outages.
> Additional testing was performed on each PDU bank and breaker.
- All the power cords inside of our racks were replaced with Z-lock locking power cords.
- We have established a policy with the datacenter provider that, for the time being, two specific remote hand technicians will be assigned to our equipment. Both technicians took part in some of the testing and received an in-person walkthrough of our configuration, infrastructure and specific needs. They were a great help during testing and in getting everything replaced within our established timeframe.
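For readers who want a picture of the MPIO behavior described above, the following is a minimal sketch of multipath failover. It is a toy model, not our actual configuration: the path names and the round-robin selection policy are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Path:
    name: str            # hypothetical label, e.g. "nic1-east"
    healthy: bool = True


class MultipathSelector:
    """Toy model of multipath I/O: rotate over healthy paths, skip failed ones."""

    def __init__(self, paths):
        self.paths = list(paths)
        self._next = 0   # index where the next search starts

    def pick(self) -> Path:
        n = len(self.paths)
        for offset in range(n):
            candidate = self.paths[(self._next + offset) % n]
            if candidate.healthy:
                # Start the next search just after the path we chose.
                self._next = (self._next + offset + 1) % n
                return candidate
        raise RuntimeError("no healthy iSCSI path available")


# Hypothetical two-path setup mirroring the east and west drops.
east = Path("nic1-east")
west = Path("nic2-west")
selector = MultipathSelector([east, west])

first = selector.pick().name     # traffic alternates while both paths are up
second = selector.pick().name

east.healthy = False             # simulate losing the east path
after_failure = [selector.pick().name for _ in range(3)]
```

With both paths healthy, requests alternate between them. Once one path is marked down, every request lands on the surviving path, which is the behavior the redundancy tests were checking for.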
Additional updates on future upgrades and changes to our infrastructure will be provided on December 23rd, 2022 in line with our original timeframe.
Oct 25, 12:47 CDT
We experienced a loss of our primary SAN network on 10/11/2022 at around 4:10 PM CDT. The cause was an incorrect datacenter networking configuration that had been applied shortly beforehand. Our primary SAN network connection was restored at 4:55 PM CDT. After the datacenter networking configuration was restored, virtual machine bootup took longer than expected because the datacenter Nimble storage arrays did not have enough IOPS available. As a result of this outage, we will be making the following changes to our infrastructure.
- Immediate Action Plan:
Denis will be arriving at the PHX DC on Thursday morning (10/20/2022) to perform the following tasks:
Review networking and storage network redundancy from Adeptcloud equipment to upstream datacenter switches and storage arrays.
Review networking and power redundancy in Adeptcloud environment between the hosts and other rack devices.
Replace all power equipment in the network racks with 208V 30A circuits, then replace all power cables with locking cables.
Power will be upgraded for storage upgrades that will occur around December.
Redundancy testing will not cause any downtime, as hosts will be cleared of all client virtual machines at the time of testing.
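As a rough illustration of the capacity reasoning behind A/B power redundancy on 208V 30-amp circuits, the sketch below checks whether a rack could keep running on a single feed. It assumes single-phase circuits and the common practice of loading a circuit to no more than 80% continuously. The rack load figures are hypothetical, not measurements from our racks.

```python
def usable_va(volts: float, amps: float, derate: float = 0.8) -> float:
    """Continuous capacity of one circuit in volt-amperes (assumed 80% derating)."""
    return volts * amps * derate


def survives_feed_loss(rack_load_va: float,
                       volts: float = 208.0,
                       amps: float = 30.0,
                       derate: float = 0.8) -> bool:
    """True if the whole rack load fits on one feed when the other feed is lost."""
    return rack_load_va <= usable_va(volts, amps, derate)


# Hypothetical rack loads, for illustration only.
light_rack_ok = survives_feed_loss(3500.0)   # fits on one feed
heavy_rack_ok = survives_feed_loss(5500.0)   # exceeds one feed's continuous capacity
```

The point of the check is that redundancy only holds if the combined load of both sides stays within what one side can carry alone.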
> Important Note
We will be posting another update here on 10/24/2022 regarding this week's datacenter visit and the items addressed in the immediate action plan.
- Short Term Action Plan
Following this visit, the following will occur in the next sixty days:
Denis will be arriving at the PHX DC again to set up new storage upgrades.
Install storage servers and then migrate off the datacenter Nimble environment, which will help us guarantee IOPS, specifically for boot storms and failover situations.
Storage failover testing (to occur while empty and with no customer virtual machines running on storage).
Install new environmental monitors in the network racks to maintain optimal operating temperature and humidity.
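To show why guaranteed IOPS matters for boot storms, here is a back-of-the-envelope sketch. It assumes boot time is limited purely by storage I/O, which is a simplification, and every number in it is hypothetical rather than a measurement from this incident.

```python
def boot_storm_seconds(vm_count: int, ios_per_boot: int, available_iops: float) -> float:
    """Estimated time for all VMs to boot if storage I/O is the only bottleneck."""
    if available_iops <= 0:
        raise ValueError("available_iops must be positive")
    return vm_count * ios_per_boot / available_iops


def required_iops(vm_count: int, ios_per_boot: int, target_seconds: float) -> float:
    """IOPS needed to boot all VMs within target_seconds under the same assumption."""
    if target_seconds <= 0:
        raise ValueError("target_seconds must be positive")
    return vm_count * ios_per_boot / target_seconds


# Hypothetical figures: 200 VMs, 50,000 I/O operations per boot.
constrained = boot_storm_seconds(200, 50_000, 10_000)   # shared array under contention
needed = required_iops(200, 50_000, 300)                # IOPS for a 5-minute recovery
```

Under these made-up figures, the same fleet that takes many minutes to boot on a contended array would need a much larger guaranteed IOPS budget to recover within five minutes.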
> Important Note
We will be posting another update here on 12/23/2022 regarding the subsequent datacenter visit and the items addressed in the short term action plan.
- Continuous Action Plan
Subsequently, the following measures will be implemented in the next 90 days:
Adeptcloud staff will schedule visits to the PHX DC on a quarterly basis to test redundancy and failover plans.
We will be limiting remote hand technicians' access to our network racks and equipment to emergency use only.
All future hardware upgrades and server installations will be performed by Adeptcloud staff or with our staff present during any such installations performed by the datacenter.
Oct 17, 15:34 CDT
At this time all servers appear to be back online and functionality has been restored. A full after-action report and post-mortem will be published and made available by the end of the week.
Oct 11, 19:43 CDT
We are currently monitoring the virtual machines as they come up. Approximately 30% of virtual machines are back online at this time. NETOPS has restored the configuration for the SAN network. We are working on verifying that all virtual machines come up.
Oct 11, 17:35 CDT
We have identified the issue as being related to our storage network. The primary SAN network went down, and the DC team is now investigating the issue and restoring the network config. We expect this to be completed in the next 15 minutes and for VMs to start coming back online.
Oct 11, 16:45 CDT
We are aware of reports of issues with some servers in the PHX datacenter and are currently investigating the cause. We will post updates here as we gather more information.
Oct 11, 16:13 CDT