Storage Performance for PHX Cluster 100

Incident Report for Adeptcore, Inc.

Resolved

We have continued to monitor the performance and latency of all storage arrays throughout the day today and have not noticed any other issues.

We have also received confirmation from the datacenter storage team that a component failure was behind the increased latency and degraded performance of the storage array yesterday. The storage array rebalanced its cache, and some of the workload on the array was offloaded to ensure proper operation until the component could be replaced.

We consider this issue resolved and do not anticipate any further performance-related issues. As such, we will be closing this incident, and we encourage anyone who is still experiencing issues to contact us directly.

Thank you for your patience and understanding.

Posted Oct 02, 2024 - 16:30 CDT

Update

We continued monitoring the latency and performance of the affected storage array throughout the evening yesterday. The latency levels returned to normal at 6:48 PM yesterday and have remained normal throughout the night and into the morning.

We are awaiting additional information from the datacenter storage team regarding what caused the latency issues yesterday and will update this incident once we receive it.

Currently, the affected storage array is not experiencing any issues with latency or performance. We will continue to monitor these statistics and the overall performance of the storage array until we are confident that the issue has been completely resolved and the datacenter storage team has confirmed this.

Additional updates to this incident will be provided as soon as we have any additional information.

Posted Oct 02, 2024 - 08:56 CDT

Update

The cache is continuing to rebuild on the affected storage array at this time. We are also continuing to work with the datacenter storage engineers and HP's enterprise team to ensure the rebuild is completed successfully.

We will continue to monitor the affected storage array throughout the evening to ensure smooth operations and decreased latency.

At this time, we are also in the process of offloading IOPS from the affected storage array to minimize the disruption some virtual machines have been experiencing.

We apologize for any inconvenience caused by this disruption.

This incident will be updated as soon as we have any additional information to share.

Posted Oct 01, 2024 - 16:23 CDT

Update

The datacenter storage team has confirmed that the SSD caching on this storage array experienced an unexpected failure and that the cache is actively being rebuilt on the array.

We are monitoring the array and continue to see a downward trajectory in overall latency.

We are awaiting confirmation from the datacenter team once the cache has been successfully rebuilt.

As it currently stands, the latency and overall performance on the cluster have improved but have not yet returned to the expected levels.

Given the ongoing cache rebuilding process, we will likely continue to see occasional spikes in latency, which may affect performance on a small percentage of virtual machines, but the overall average latency is slowly returning to normal.

We will continue to work with the datacenter storage team regarding this ongoing issue and will update this incident once we receive additional information.

Posted Oct 01, 2024 - 13:50 CDT

Identified

We have received reports of some virtual machines experiencing degraded performance this morning. During our investigation, we noticed that one of our Nimble SAN clusters was experiencing higher-than-expected latency. We have contacted datacenter storage engineers to investigate the issue.

We are still seeing higher-than-expected latency, but it is currently trending downward. Adeptcloud support is monitoring the status of virtual machines and migrating clusters when necessary.

Latency has been cut by about 50% in the last hour, and we are seeing it trend back down toward normal with occasional spikes. We are continuing to monitor the situation and will post an update once the issue is fully resolved or when we learn of any new information in the meantime.

Posted Oct 01, 2024 - 09:43 CDT

This incident affected: Adeptcloud Infrastructure (ACP - Storage).