PHX Datacenter I/O latency
Incident Report for Adeptcore, Inc.
Postmortem

Message to our MSP Partners regarding storage issues on 6/6/2023:

Our Environment:

Adeptcloud maintains two separate Nimble storage array clusters provided by our datacenter. All physical hosts connect to the storage through two separate switches and across two separate uplinks to each storage array. The decision to run two separate arrays came from our experience with previous providers, as a way to limit how much of our environment any single storage failure can affect. That design decision worked as intended: while a sizable portion of clients were affected, an equally sized portion were not. Each array consists of dual controllers and is provisioned with five 10TB volumes and two 5TB volumes, attached to the Adeptcloud infrastructure through two separate physical paths.

Array names: NIMBLE Storage Cluster 100 and NIMBLE Storage Cluster 200.

Volumes: NIMBLE-SAN300, NIMBLE-SAN400, NIMBLE-SAN500, NIMBLE-SAN600, NIMBLE-SAN700, NIMBLE-SAN800, NIMBLE-SAN900 and NIMBLE-SAN101, NIMBLE-SAN201, NIMBLE-SAN301, NIMBLE-SAN401, NIMBLE-SAN501.

What Happened:

Around 10AM CST on 6/6/2023 we received reports of virtual machines showing offline in partner tools (RMM and ScreenConnect), followed by users not being able to connect to their RDS servers.

At the same time that we started to receive tickets and calls from our partners, we received monitoring notifications for high storage latency affecting PHX Clusters.

While investigating the disconnection reports and latency alerts, we pulled up our storage monitoring dashboards and immediately noticed that NIMBLE Storage Cluster 200 volumes were reporting extremely high latency, with one volume going from a read latency of 22 ms at the 10:01AM reporting interval to 12,500 ms at the 10:06AM reporting interval.

Screenshot of reporting on volume NIMBLE-SAN301: http://i.adeptcore.com/GBsbLjsz709t.png
Screenshot of the top 10 datastore read latencies during the affected time range: http://i.adeptcore.com/wAsqyak3R4VX.png

Approximately 9 minutes passed from the time the issues were reported and we received alerts to the time we determined that the cause was storage latency and saw the spike described above. By the 10:10AM reporting interval the latency had subsided to a fraction of its peak, and by 10:15AM the latency, while still higher than we normally like to see, was back in a usable range and customers were reporting that their issues were resolved. Some clients' VMs had to be rebooted or have services restarted.

For the next hour the latency remained stable and we ruled out performance bottlenecks on our side: we checked our switching and port statistics, I/O for individual VMs, internal services, and connectivity from our environment to the SAN environment, and confirmed everything was in order. Network latency from our environment to the datacenter SAN controllers never exceeded 0.300 ms, even during the worst of the spikes.
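As an illustration of the kind of path check we ran, here is a minimal sketch that pings SAN-facing controller addresses from a Linux management host and reports the worst round-trip time seen. The addresses, sample count, and 1 ms flag threshold are placeholders, not our actual monitoring configuration.

# Illustrative network path check toward the SAN controllers (placeholder IPs).
import subprocess

SAN_CONTROLLERS = ["192.0.2.10", "192.0.2.11"]   # hypothetical controller addresses
SAMPLES = 20                                      # pings per target

def max_rtt_ms(host: str) -> float:
    """Ping a host and return the maximum RTT in milliseconds (Linux ping output)."""
    out = subprocess.run(
        ["ping", "-c", str(SAMPLES), "-q", host],
        capture_output=True, text=True, check=True,
    ).stdout
    # Summary line looks like: "rtt min/avg/max/mdev = 0.101/0.142/0.287/0.040 ms"
    for line in out.splitlines():
        if "min/avg/max" in line:
            stats = line.split("=")[1].strip().split()[0]
            return float(stats.split("/")[2])     # the "max" field
    raise RuntimeError(f"could not parse ping output for {host}")

if __name__ == "__main__":
    for controller in SAN_CONTROLLERS:
        worst = max_rtt_ms(controller)
        flag = "OK" if worst < 1.0 else "INVESTIGATE"
        print(f"{controller}: worst RTT {worst:.3f} ms [{flag}]")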

10:40AM CST - Ticket opened with the datacenter.

We began moving clients to NIMBLE Storage Cluster 100, but the migrations were taking extremely long, so we cancelled the operation to avoid stressing the affected storage array any further.

Between 11:10AM and 11:20AM CST another latency spike occurred, peaking at roughly 1/12 of the first one: approximately 1,000 ms versus the earlier high of 12,500 ms. It again subsided quickly, but behaved a little differently than the first spike. It was still confined to NIMBLE Storage Cluster 200 and hit one volume hardest (NIMBLE-SAN101), while the other volumes reported higher-than-average latency that stayed at the upper end of the usable range. Screenshot of the top 10 latency metrics at the 11:20AM reporting interval: http://i.adeptcore.com/jkYFCXfNDYtS.png

We then closely examined every virtual machine running on that volume to make sure we had not missed a rogue VM demanding excessive IOPS. We were unable to find a culprit in our environment; all reporting showed individual VMs requesting normal levels of IOPS. After this spike, average latency settled at an even lower level than after the first spike, everything appeared to be stabilizing again, and customers were able to get back in.
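The rogue-VM check amounts to looking for demand outliers in per-VM I/O metrics. Below is a minimal, hypothetical sketch of that idea working from a CSV export; the file name, column names, and 3-sigma cutoff are illustrative assumptions rather than our actual tooling.

# Illustrative outlier scan over a hypothetical per-VM IOPS export (vm_name,iops).
import csv
import statistics

EXPORT_FILE = "vm_iops_export.csv"   # assumed export format, one row per VM

def find_iops_outliers(path: str, sigma: float = 3.0) -> list[tuple[str, float]]:
    """Return VMs whose IOPS exceed mean + sigma * stdev of the population."""
    with open(path, newline="") as f:
        rows = [(row["vm_name"], float(row["iops"])) for row in csv.DictReader(f)]
    values = [iops for _, iops in rows]
    if len(values) < 2:
        return []
    cutoff = statistics.mean(values) + sigma * statistics.pstdev(values)
    flagged = [(name, iops) for name, iops in rows if iops > cutoff]
    return sorted(flagged, key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    outliers = find_iops_outliers(EXPORT_FILE)
    if not outliers:
        print("No VM stands out; I/O demand looks evenly distributed.")
    for name, iops in outliers:
        print(f"Possible rogue VM: {name} at {iops:,.0f} IOPS")

In our case this style of check came back empty, which is consistent with the cause being on the array side rather than a tenant workload.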

At 12:00PM CST another spike occurred, lower still than the 11:10AM CST one, but even at that reduced level it was high enough to take a subset of the originally affected VMs down again. This spike lasted for about 45 minutes. As shown in this screenshot (http://i.adeptcore.com/IAuMt2Ane4Uw.png), the latency, while much lower than the earlier spikes, persisted much longer, resulting in a prolonged issue.

We noticed the spikes recurring approximately every hour, with another spike at 2:15PM CST.

The last and largest spike since the first one occurred at 3:40PM CST and stabilized by 3:45PM CST. After this spike, latency returned to normal levels, back in our good range of under 15 ms. We continued to monitor, and latency dropped further and stayed below what we normally see on a day-to-day basis. After the fix was applied, and throughout the day on 6/7/2023, we continued to see lower-than-average latency for NIMBLE Storage Cluster 200 even though CPU and RAM utilization had returned to average day-to-day levels. This suggests the same fix that resolved the latency spikes also improved the cluster's performance by a significant factor. As of the time of writing, average latency for the affected cluster is sitting at 4.8 ms across all five volumes.

 

How storage latency impacts virtual machines:

When the backend storage experiences high latency, the symptoms can look like a network outage to affected clients. You might have seen servers dropping out of your RMM and remote control tools such as ScreenConnect, and users unable to log in due to various errors. The virtual machine itself, however, is experiencing an extreme delay reading from and writing to storage, which causes services such as VMware Tools, the Horizon Agent, and others to freeze. Some affected virtual machines may then crash because Windows cannot cope with the long I/O wait times.
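To make this concrete, below is a minimal, hypothetical in-guest probe (not part of our tooling) that times small synchronous writes. During an event like this one, the same write that normally completes in a few milliseconds can block for many seconds, and that stall is exactly what freezes agents such as VMware Tools. The file path, threshold, and sampling interval are illustrative assumptions.

# Illustrative in-guest I/O latency probe (hypothetical path and threshold).
import os
import time

PROBE_FILE = r"C:\probe\latency_probe.bin"   # assumed path on the affected volume
THRESHOLD_MS = 100                           # assumed "unhealthy" latency threshold

def probe_write_latency(size_bytes: int = 4096) -> float:
    """Write one small block, force it to disk, and return the elapsed time in ms."""
    payload = os.urandom(size_bytes)
    start = time.perf_counter()
    with open(PROBE_FILE, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())                 # block until the write actually reaches storage
    return (time.perf_counter() - start) * 1000.0

if __name__ == "__main__":
    os.makedirs(os.path.dirname(PROBE_FILE), exist_ok=True)
    for _ in range(120):                     # sample for about an hour at 30-second intervals
        latency_ms = probe_write_latency()
        status = "OK" if latency_ms < THRESHOLD_MS else "HIGH I/O WAIT"
        print(f"{time.strftime('%H:%M:%S')} write latency: {latency_ms:8.1f} ms [{status}]")
        time.sleep(30)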

 

Timeline of communication with the datacenter:

11:00AM CST - Datacenter began reviewing the issue.

11:45AM CST - Datacenter communication to Adeptcloud

12:54PM CST - Datacenter communication to Adeptcloud

1:02PM CST - Datacenter communication to Adeptcloud

1:35PM CST - Datacenter communication to Adeptcloud

2:04PM CST - Datacenter communication to Adeptcloud informing us of the Vendor Escalation

2:33PM CST - Datacenter communication to Adeptcloud

2:57PM CST - Datacenter communication to Adeptcloud

2:59PM CST - Datacenter communication to Adeptcloud

3:19PM CST - Datacenter communication to Adeptcloud

4:25PM CST - Datacenter communication to Adeptcloud informing us of Vendor Resolution

4:28PM CST - Datacenter communication to Adeptcloud

4:32PM CST - Root cause still under investigation. The vendor will provide a report to the datacenter, which will then provide it to us. The vendor is HPE, as these are Nimble storage arrays.

 

We were responding to the datacenter's communications throughout this time as well, providing troubleshooting data, update requests and responses, and metrics analysis. We were informed that the issue also brought down the datacenter's own VPDC environments, since they use this same storage to provide their cloud services.

We are still awaiting a complete root cause analysis to learn what caused the latency and what actions the vendor took to resolve the situation. The datacenter will provide the report to us within 7-10 business days, so we expect to receive it by the end of next week. We will share the details with partners as soon as we have received and reviewed them and confirmed they are detailed enough.

 

What we are doing to prevent this problem from happening in the future:

 

Our first SSD storage array is already built out with 80TB of raw capacity, which will provide 30TB of usable capacity.

89.6TB of NVMe capacity is being added to local host clusters, with 40TB of usable capacity expected.

 

We have been working on a storage solution for the last six months because, in our experience, most of these kinds of issues are outside of our control, and that brings a great deal of stress to us internally, to our partners, and most importantly to our partners' tenants. Over that time we have been internally testing, building, and buying different solutions to validate the performance and reliability we expect. Our goal has been to improve the reliability of, and resolution times for, anything storage related, as well as to increase performance by moving off hybrid storage arrays. We do not want to accelerate our timeline and risk a situation where we end up causing more issues than we solve, which is why we have been building and testing everything internally, benchmarking our results, and setting up different failover scenarios to account for the things we have experienced over the last few years.

With that in mind, past storage-related issues have been caused by networking: we have had two major network outages related to storage networks on the provider's side of the networking equipment. Bringing storage into our own racks will alleviate those issues, as we will have 100% control of the paths from start to finish. This is the first outage caused by a storage provider's storage performance, and while we do not yet have the exact cause, in our experience it feels like a read cache on a hybrid array failed and then propagated the issue in waves throughout the day. We do not want to make assumptions, however, and are awaiting the root cause analysis from our datacenter provider and their vendor's documentation.

Our testing of different storage solutions has helped us determine that we can achieve approximately 300K IOPS over an iSCSI connection while keeping array sizes fairly small (within a 2U configuration). This is without any caching technology, based purely on flash performance, connected to a live ESXi host via iSCSI over the same switching we use in all our datacenters. Performance-wise we expect this to be a huge win, and having total control of the storage throughout the entire stack and network will make reliability and resolution much faster and better.

Our goal is to achieve 99.99% uptime on all storage, which allows a maximum of approximately 52 minutes and 34 seconds of unexpected storage downtime in the first year (0.01% of a 365-day year), and 100% uptime from the second year onwards. We plan to provision storage capacity at 2.5x the size of the active datasets and rotate which array is active on a monthly basis, giving each array time in service while maintenance and routine checks are performed on the arrays rotated out.
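For reference on the ~300K IOPS figure above, below is a minimal sketch of the kind of raw-flash benchmark run that produces such numbers, wrapping the open-source fio tool. The device path and job parameters are illustrative, and the JSON fields read at the end ("iops", "lat_ns") reflect recent fio versions; this is not our exact test harness.

# Illustrative benchmark wrapper around fio (assumes fio is installed and
# TEST_DEVICE is a scratch iSCSI LUN containing no data you care about).
import json
import subprocess

TEST_DEVICE = "/dev/sdX"   # hypothetical test LUN presented over iSCSI

def run_random_read_test(runtime_s: int = 60, iodepth: int = 32, jobs: int = 8) -> dict:
    """Run a 4K random-read fio job and return the parsed JSON report."""
    cmd = [
        "fio",
        "--name=randread-4k",
        f"--filename={TEST_DEVICE}",
        "--rw=randread",
        "--bs=4k",
        "--direct=1",                # bypass the page cache so we measure the array, not RAM
        "--ioengine=libaio",
        f"--iodepth={iodepth}",
        f"--numjobs={jobs}",
        f"--runtime={runtime_s}",
        "--time_based",
        "--group_reporting",
        "--output-format=json",
    ]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return json.loads(result.stdout)

if __name__ == "__main__":
    report = run_random_read_test()
    read = report["jobs"][0]["read"]
    print(f"IOPS: {read['iops']:,.0f}")
    print(f"Mean read latency: {read['lat_ns']['mean'] / 1e6:.2f} ms")

A 4K random-read profile approximates peak small-block IOPS; mixed read/write profiles behave differently, so any quoted IOPS figure depends heavily on the test profile used.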

 

We are also experimenting with an approach that does not rely on storage arrays failing over to one another during an outage, keeping enough spare capacity on a separate array at all times for an immediate spin-up without user intervention. We understand that something may fail at the worst possible moment, and our goal is to get users back up and working as fast as humanly possible with as little troubleshooting as possible.

 

We plan to host quarterly town halls with partners to go over our solutions and upgrade schedules, and we want to be as open in our communications with partners as possible. This gives you a place to voice concerns, a way to communicate them to us, and an opportunity to give us ideas for improving the service, because at the end of the day this affects you and your clients as well as our reputation. We got started in this business with the support of our wonderful partners. Some of you have been with us since we had our first server and had to schedule downtime to upgrade RAM just to bring on more partners; others have only just joined and started to offer our cloud services.

 

Next Steps:

We hope to see you at the town hall regarding this issue, which we are trying to schedule for 5PM CST this Friday, 6/9/2023. We know this is short notice, so all affected partners will receive an email tomorrow to confirm whether we will hold it this Friday or move it to next week.

Thank you,

Denis Zhirovetskiy

Posted Jun 07, 2023 - 17:39 CDT

Resolved
This incident has been resolved.
Posted Jun 07, 2023 - 17:32 CDT
Monitoring
We have received confirmation that the latency issue has been resolved. We are leaving this incident in monitoring status for 24 hours. We have requested a root cause analysis from HPE and are awaiting confirmation of all details related to this event. We will post another update within 24 hours with our plan of action.
Posted Jun 06, 2023 - 16:31 CDT
Update
All datastores on NIMBLE Storage Cluster 200 have returned to pre-issue latency numbers. We are awaiting confirmation of the root cause as well as confirmation that the issue will not recur.
Posted Jun 06, 2023 - 16:00 CDT
Update
We are currently detecting another spike and are still working with datacenter engineers and the vendor (HPE) on a resolution.
Posted Jun 06, 2023 - 15:41 CDT
Update
Latency has been stable since the last spike. Datacenter engineering and the vendor (HPE) are working on resolving the issue with NIMBLE Storage Cluster 200. The majority of VMs should be fine; we are noticing that if a VM froze up due to a storage latency spike, its CPU utilization goes up while it catches up on the missed I/O.
Posted Jun 06, 2023 - 15:27 CDT
Update
Datacenter engineering has escalated the issue to the vendor's (HPE) engineering team, and both are actively working on resolving the issue.
Posted Jun 06, 2023 - 15:03 CDT
Update
We are seeing the latency spikes continue to shrink, and utilization is back to normal as of right now. The spikes appear to have been arriving at roughly one-hour intervals. We are continuing to work with the datacenter engineers to resolve this issue, and our indicators point to a read cache problem causing the read latency on NIMBLE Storage Cluster 200.
Posted Jun 06, 2023 - 14:57 CDT
Update
One of the five datastores on the second Nimble array is still showing higher-than-normal spikes. We are working with storage engineers at our datacenter to determine the root cause and resolve the issue.
Posted Jun 06, 2023 - 14:30 CDT
Update
Latency is once again subsiding; over the last 15 minutes it has stabilized back to normal levels, and VMs might take a few moments to catch up. We are running migrations to a different Nimble array, but doing so carefully so as not to add to the latency problem.
Posted Jun 06, 2023 - 12:53 CDT
Identified
We are seeing increased storage latency in a portion of our PHX datacenter. The latency has subsided and we are migrating VMs, which is taking longer than usual due to the latency. VMs may experience higher-than-normal queue depths, which will present as slow or degraded performance in most cases. We will update you as we get more information.
Posted Jun 06, 2023 - 12:28 CDT
This incident affected: Adeptcloud Infrastructure (ACP - Storage).