[STO2] Storage issue in STO2 region
Incident Report for Cleura Public Cloud
Postmortem

Incident Report - 2021-09-09 - Storage outage - STO2

Incident Management Report

Country: Sweden
Location: STO2
Report Type: Incident Report
Date: 09 Sep 2021
Incident Start Time: 09 Sep 2021, 10:45 (all times in this document are in the CEST timezone)
Incident End Time: 10 Sep 2021, 11:40
Priority: P1
Impact: High
Urgency: High

Executive Summary

From the 9th of September at 10:45 to the 10th of September at 09:00, City Network experienced a major storage outage affecting a large number of customers in the CityCloud STO2 region.

The root cause of this outage was a hardware failure that occurred while a large amount of data was being synced, leading to a cache data mismatch.

Background

At the beginning of 2021, work started on moving virtual instances from our old NFS-based storage to new, advanced Ceph storage clusters. At the same time, it was also planned to upgrade the Ceph storage by replacing all existing hard disk drive (HDD) based storage with NVMe solid state drives. We ordered new hardware and expected delivery in March 2021. Delivery was postponed several times due to the high worldwide demand for electronic components.

To improve overall performance, NVMe drives were added to some of the existing clusters, and a write cache for volumes was enabled in April. In May, we enabled the same write cache for VMs as well.
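
The report does not detail how this write cache was implemented. A common way to layer an NVMe write cache over an existing pool in Ceph is a cache tier in writeback mode; a minimal sketch, assuming hypothetical pool names "volumes" (existing data pool) and "volumes-cache" (NVMe pool):

    # Illustrative only: attach an NVMe pool as a writeback cache tier
    # in front of an existing data pool (pool names are examples).
    ceph osd tier add volumes volumes-cache
    ceph osd tier cache-mode volumes-cache writeback
    ceph osd tier set-overlay volumes volumes-cache
    ceph osd pool set volumes-cache hit_set_type bloom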

Once again, we received new delivery times for the hardware, with a new expected ETA at the end of July.

On the 24th of July, slow ops increased significantly. The storage team investigated and tested different configuration changes to mitigate the issue as much as possible.

On the 18th of August, we began racking the new hardware, followed by installation, configuration, and burn-in test phases.

On the 4th of September, we started syncing data to bring the new storage nodes into production. The first step was syncing images over the weekend, followed by volumes on the 7th of September.
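
The report does not describe the exact mechanism used for this sync. As one hedged illustration: in Ceph, data can be moved onto new NVMe-backed OSDs by pointing a pool's CRUSH rule at the new device class, after which the cluster rebalances that pool's data onto the new drives in the background (the rule and pool names below are hypothetical):

    # Illustrative only: create a CRUSH rule targeting NVMe OSDs and
    # switch a pool to it, triggering a background data migration.
    ceph osd crush rule create-replicated fast-nvme default host nvme
    ceph osd pool set images crush_rule fast-nvme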

The volume sync was estimated to take 5-6 days and was planned to run as a slow, throttled background task to minimize slow ops. On the evening of the 8th of September, one of the new storage nodes rebooted. As the volume sync was still running, engineers verified that the cluster health status was OK.
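
The report does not say exactly how the sync was throttled; in Ceph, background data movement of this kind is usually kept slow with the recovery and backfill settings, along the lines of:

    # Illustrative only: limit concurrent backfills and add a small sleep
    # between recovery operations so client I/O keeps priority.
    ceph config set osd osd_max_backfills 1
    ceph config set osd osd_recovery_max_active 1
    ceph config set osd osd_recovery_sleep 0.1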

Sync was still ongoing, with 75% done at the time of this incident.

Description of the incident

On Wednesday, the 9th of September, at 10:45, we observed that one of the new storage nodes had crashed and rebooted. The server was taken out of production, as the investigation pointed to a memory hardware error.
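
The exact commands are not given in the report; taking a failed storage node out of production in Ceph and checking the resulting cluster state typically looks something like the following (the OSD id is a made-up example, repeated for each OSD on the node):

    # Illustrative only: mark the failed node's OSDs out so data
    # rebalances away from them, then inspect cluster health.
    ceph osd out osd.42
    ceph -s
    ceph health detail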

After the server had been taken out of production, the engineering team could not bring the cluster back to a healthy state. Due to the ongoing sync, the broken server was still holding cache data.

To get the server back into production and regain access to that cache, the broken memory had to be replaced. An infrastructure engineer went to the data center and replaced the faulty memory, and the server was then added back into the cluster.

Due to a cache mismatch, the cluster was still in an error state, so we applied a workaround, which appeared to resolve the issue. Service was back again at 15:30.

At 17:30, we saw that the cluster tried to remove a non-existent copy of data from the cache and went into an error state again. A deeper analysis was started, and it indicated that the problem was in the core Ceph source code. We contacted Ceph developers to involve them in the investigation.

On September 10th at 01:00, we moved the cache to proxy mode to see if it could mitigate the situation, but because the old storage servers were still in production, we continued to see slow IOPS. Together with external developers, we started to patch the Ceph source code. When the code work was finished, we updated the cluster with this new, forked version, and at 09:00 the file system started to work correctly again. We continued to sync volume data throughout the day to finish that operation. No further work on the cluster was done over the weekend, only monitoring to ensure the situation had stabilized. As before the incident, we could still see slow IOPS from the old storage servers, and we expected this to continue as long as they remained in production.
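
Switching the cache to proxy mode is not shown in detail in the report; in Ceph cache tiering it would roughly correspond to the following, with new I/O proxied straight to the backing pool while cached objects can still be flushed (pool name hypothetical):

    # Illustrative only: stop caching new I/O; reads and writes are
    # proxied to the backing pool from this point on.
    ceph osd tier cache-mode volumes-cache proxy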

On September 15th, we started syncing all virtual instances to the new hardware. This was the last step in the migration work and was predicted to take four days to complete. We decided to increase the speed of the sync so that the old servers could be taken out of production as soon as possible.
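
The report gives no specifics on how the sync was sped up; it would typically mean relaxing the same backfill and recovery limits used to throttle it earlier, for example:

    # Illustrative only: allow more parallel backfills and remove the
    # recovery sleep to finish the migration faster.
    ceph config set osd osd_max_backfills 4
    ceph config set osd osd_recovery_max_active 4
    ceph config set osd osd_recovery_sleep 0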

After midnight, the sync completed without any additional disturbances.

On September 16th, we removed the main part of the old hardware from the cluster, and monitoring indicated that all slow ops disappeared at that point. Some old servers still exist in the configuration but only run monitoring and management services.

This incident was closed on September 16th at 11:40.

Incident Timeline (all times in CEST)

Date Time Observation/Action
09 Sep 2021 10:45 One of the new storage servers crashed and rebooted due to broken memory.
09 Sep 2021 15:30 Storage service was restored.
09 Sep 2021 17:30 The cluster entered an error state again, affecting the storage service.
09 Sep 2021 21:00 Working with upstream developers on a fix.
10 Sep 2021 01:00 The cache was moved to proxy mode as a step to mitigate the issue. This did not help, as slow IOPS continued.
10 Sep 2021 09:00 After installation of the updated Ceph software, the service was back online again.
10 Sep 2021 11:00 Close monitoring while the volume data sync was ongoing. Slow ops were recurring as before the incident.

Root Cause

During this incident, several problems and root causes were discovered:

  1. Broken hardware - Memory failed on one of the new servers while the data sync was ongoing. Even though the servers had been through intense burn-in tests, the failure happened anyway.
  2. Ceph bug - After adding the repaired server back into production and fixing the cache mismatch, a Ceph bug surfaced when the cluster tried to remove already removed data.

Corrective and Preventative Measures

No. | Action Description | Company/Contractor/Supplier | Planned Completion Date | Status
1 | Remove the cache from the cluster to increase performance (see sketch below). | City Network | Q4 2021 | ONGOING
2 | Move Ceph internal services to new hardware. | City Network | Q1 2022 | TO DO
3 | Investigate whether we should extend burn-in tests on new hardware. | City Network | Q4 2021 | TO DO
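
Corrective measure 1 is not detailed further in the report. Removing a Ceph cache tier generally means draining it first and then detaching it from the base pool, roughly as sketched below (pool names are hypothetical):

    # Illustrative only: drain and detach a cache tier.
    ceph osd tier cache-mode volumes-cache proxy     # stop caching new writes
    rados -p volumes-cache cache-flush-evict-all     # flush and evict remaining objects
    ceph osd tier remove-overlay volumes             # stop routing client I/O via the cache
    ceph osd tier remove volumes volumes-cache       # detach the cache tier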

Final Words

We apologize for the inconvenience caused during this incident, and thank you again for your patience.

Service availability is always our main priority, and we will do everything we can to learn from the incident and avoid a recurrence in the future.

If you experience any problems, then please open a ticket with our support team.

Sincerely,

City Network International AB

Posted Oct 08, 2021 - 16:38 CEST

Resolved
This incident is now resolved and a postmortem will be published shortly.
Posted Sep 16, 2021 - 11:41 CEST
Update
We have now completed all synchronisation tasks to the new storage servers, and there will not be any more slow IOPS from the cluster.

Service is fully operational and we will update with the next steps in this change around 13:00.
Posted Sep 16, 2021 - 09:34 CEST
Update
We would like to inform you about the current status of the storage work in STO2.

We have now increased the speed of the synchronisation moving VM data from the old storage nodes into the new NVMe based storage nodes. The goal is to have this done before tomorrow morning to solve the initial problem with high latency and slow IOPS.

As long as this synchronisation is ongoing, there will be recurring slow IOPS that might affect VMs, but we are monitoring closely and will resolve them as soon as they occur.
Posted Sep 15, 2021 - 14:05 CEST
Update
We would like to inform you about the current status for the storage work in STO2.

* Synchronization of VM images started yesterday evening (Tuesday, September 14th) and will run continuously in the background. We will update with an expected time frame for this sync soon.

Customer impact:

* Continuous improvement in the stability of our storage system as more and more data is offloaded onto the new storage.
* Recurring slow IOPS may occasionally occur while the sync is ongoing.
Posted Sep 15, 2021 - 11:35 CEST
Monitoring
Slow ops are gone and we continue to monitor this issue closely.
Posted Sep 13, 2021 - 13:13 CEST
Identified
Engineers are investigating slow ops in the cluster at the moment.
Posted Sep 13, 2021 - 12:53 CEST
Monitoring
We are continuing to closely monitor this situation.
If you still experience issues with your VM, please contact support.
Posted Sep 10, 2021 - 12:15 CEST
Update
Work is still ongoing to resolve this incident, but the situation has been stable since the last update.

If you still experience any issues, a hard reboot might help.
If that does not resolve them, please contact support.
Posted Sep 10, 2021 - 11:05 CEST
Update
Work is still ongoing.
At this time, we have managed to resolve most of the urgent problems, and we can see that the storage system is returning to normal operation. We are still cautious and the team is still working, but you may notice that your instances and volumes are functional again. We are monitoring closely and will continue to provide updates.
Posted Sep 10, 2021 - 09:43 CEST
Update
Work to restore the service has been ongoing all night.
The engineering team is working together with external resources on specific actions at this moment.
We have no ETA at this time.

For affected VMs, a hard reboot will not help.

We will update this status at 09:30.
Posted Sep 10, 2021 - 08:41 CEST
Update
We continue to work on a resolution for this on-going issue.
Next update will be at 08:00 CEST
Posted Sep 10, 2021 - 00:26 CEST
Update
We continue to work on a resolution for this on-going issue.
Next update will be at 00:30 CEST
Posted Sep 09, 2021 - 23:29 CEST
Update
We continue to work on a resolution for this on-going issue.
Next update will be at 23:30 CEST
Posted Sep 09, 2021 - 22:36 CEST
Update
We continue to work on a resolution for this on-going issue.
The next update will be at 22:30 CEST
Posted Sep 09, 2021 - 21:28 CEST
Identified
We have identified another issue related to the new nodes in the cluster. Our storage team is currently working on it.
Posted Sep 09, 2021 - 17:28 CEST
Monitoring
We have now restored the service and will keep monitoring the situation closely.

Affected VMs should be working again; if not, perform a hard reboot.
If you still see issues, please contact support.
Posted Sep 09, 2021 - 15:39 CEST
Update
This issue is related to the ongoing sync to the new storage nodes, during which one of the new servers failed.
The storage team is working on restoring the service.
Posted Sep 09, 2021 - 15:20 CEST
Update
We are still working on this storage issue.
Rebooting servers will not help at this point.
Posted Sep 09, 2021 - 14:57 CEST
Identified
Slow operations have been mitigated, and the storage team is currently working on this issue.

If you still have issues with a VM, please try a hard reboot.
Posted Sep 09, 2021 - 12:00 CEST
Investigating
Our engineering team is investigating a storage issue in the Stockholm region (STO2).

We apologize for the inconvenience and will share an update once we have more information.
Posted Sep 09, 2021 - 10:58 CEST
This incident affected: STO2 (Stockholm) (Block Storage).