[STO2] Storage issue in STO2 region
Incident Report for Cleura Public Cloud
Postmortem

Incident Report - 2021-09-09 - Storage outage - STO2

Incident Management Report

Country: Sweden
Location: STO2
Report Type: Incident Report
Date: 09 Sep 2021
Incident Start Time: 09 Sep 2021, 10:45 (all times in this document are in the CEST timezone)
Incident End Time: 10 Sep 2021, 11:40
Priority: P1
Impact: High
Urgency: High

Executive Summary

From the 9th of September at 10:45 to the 10th of September at 09:00, City Network experienced a major storage outage affecting a large number of customers in the CityCloud STO2 region.

The root cause of this outage was a hardware failure that occurred while a large amount of data was being synced, leading to a cache data mismatch.

Background

At the beginning of 2021, work started on moving virtual instances from our old NFS-based storage to new, advanced Ceph storage clusters. At the same time, it was also planned to upgrade the Ceph storage by replacing all existing hard disk drive (HDD) based storage with NVMe solid state drives. We ordered new hardware and expected delivery in March 2021. Delivery was postponed several times due to the high worldwide demand for electronic components.

To improve overall performance, NVMe drives were added to some of the existing clusters, and a write cache for volumes was enabled in April. In May, we enabled the same write cache for VMs as well.
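
The report does not detail how this write cache was implemented. A common way to layer an NVMe write cache over an existing pool in Ceph is a cache tier in writeback mode; a minimal sketch, assuming hypothetical pool names "volumes" (existing data pool) and "volumes-cache" (NVMe pool):

    # Illustrative only: attach an NVMe pool as a writeback cache tier
    # in front of an existing data pool (pool names are examples).
    ceph osd tier add volumes volumes-cache
    ceph osd tier cache-mode volumes-cache writeback
    ceph osd tier set-overlay volumes volumes-cache
    ceph osd pool set volumes-cache hit_set_type bloom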

Once again, we received new delivery times for the hardware, with a new expected ETA at the end of July.

On the 24th of July, slow ops increased significantly. The storage team investigated and tested different configuration changes to mitigate the issue as much as possible.

On the 18th of August, we began racking the new hardware, followed by installation, configuration, and burn-in test phases.

On the 4th of September, we started syncing data to bring the new storage nodes into production. The first step was syncing images over the weekend, followed by volumes on the 7th of September.
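
The report does not describe the exact mechanism used for this sync. As one hedged illustration: in Ceph, data can be moved onto new NVMe-backed OSDs by pointing a pool's CRUSH rule at the new device class, after which the cluster rebalances that pool's data onto the new drives in the background (the rule and pool names below are hypothetical):

    # Illustrative only: create a CRUSH rule targeting NVMe OSDs and
    # switch a pool to it, triggering a background data migration.
    ceph osd crush rule create-replicated fast-nvme default host nvme
    ceph osd pool set images crush_rule fast-nvme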

The volume sync was estimated to take 5-6 days and was planned to run as a slow, throttled background task to minimize slow ops. On the evening of the 8th of September, one of the new storage nodes rebooted. As the volume sync was still running, engineers verified that the cluster health status was OK.
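
The report does not say exactly how the sync was throttled; in Ceph, background data movement of this kind is usually kept slow with the recovery and backfill settings, along the lines of:

    # Illustrative only: limit concurrent backfills and add a small sleep
    # between recovery operations so client I/O keeps priority.
    ceph config set osd osd_max_backfills 1
    ceph config set osd osd_recovery_max_active 1
    ceph config set osd osd_recovery_sleep 0.1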

Sync was still ongoing, with 75% done at the time of this incident.

Description of the incident

On Wednesday, the 9th of September, at 10:45, we observed that one of the new storage nodes had crashed and rebooted. The server was taken out of production, as the investigation pointed to a memory hardware error.
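
The exact commands are not given in the report; taking a failed storage node out of production in Ceph and checking the resulting cluster state typically looks something like the following (the OSD id is a made-up example, repeated for each OSD on the node):

    # Illustrative only: mark the failed node's OSDs out so data
    # rebalances away from them, then inspect cluster health.
    ceph osd out osd.42
    ceph -s
    ceph health detail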

After the server had been taken out of production, the engineering team could not bring the cluster back to a healthy state. Due to the ongoing sync, the broken server was still holding cache data.

To get the server back into production and regain access to that cache, the broken memory had to be replaced. An infrastructure engineer went to the data center and replaced the faulty memory, and the server was then added back into the cluster.

Due to a cache mismatch, the cluster was still in an error state, so we applied a workaround, which appeared to resolve the issue. Service was back again at 15:30.

At 17:30, we saw that the cluster tried to remove a non-existent copy of data from the cache and went into an error state again. A deeper analysis was started, and it indicated that the problem was in the core Ceph source code. We contacted Ceph developers to involve them in the investigation.

On September 10th at 01:00, we moved the cache to proxy mode to see if it could mitigate the situation, but because the old storage servers were still in production, we continued to see slow IOPS. Together with external developers, we started to patch the Ceph source code. When the code work was finished, we updated the cluster with this new, forked version, and at 09:00 the file system started to work correctly again. We continued to sync volume data throughout the day to finish that operation. No further work on the cluster was done over the weekend, only monitoring to ensure the situation had stabilized. As before the incident, we could still see slow IOPS from the old storage servers, and we expected this to continue as long as they remained in production.
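
Switching the cache to proxy mode is not shown in detail in the report; in Ceph cache tiering it would roughly correspond to the following, with new I/O proxied straight to the backing pool while cached objects can still be flushed (pool name hypothetical):

    # Illustrative only: stop caching new I/O; reads and writes are
    # proxied to the backing pool from this point on.
    ceph osd tier cache-mode volumes-cache proxy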

On September 15th, we started syncing all virtual instances to the new hardware. This was the last step in the migration work and was predicted to take four days to complete. We decided to increase the speed of the sync so that the old servers could be taken out of production as soon as possible.
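
The report gives no specifics on how the sync was sped up; it would typically mean relaxing the same backfill and recovery limits used to throttle it earlier, for example:

    # Illustrative only: allow more parallel backfills and remove the
    # recovery sleep to finish the migration faster.
    ceph config set osd osd_max_backfills 4
    ceph config set osd osd_recovery_max_active 4
    ceph config set osd osd_recovery_sleep 0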

After midnight, the sync completed without any additional disturbances.

On September 16th, we removed the main part of the old hardware from the cluster, and monitoring indicated that all slow ops disappeared at that point. Some old servers still exist in the configuration but only run monitoring and management services.

This incident was closed on September 16th at 11:40.

Incident Timeline (all times in CEST)

Date Time Observation/Action
09 Sep 2021 10:45 One of the new storage servers crashed and rebooted due to broken memory.
09 Sep 2021 15:30 Storage service was restored.
09 Sep 2021 17:30 The cluster entered an error state again, affecting the storage service.
09 Sep 2021 21:00 Working with upstream developers on a fix.
10 Sep 2021 01:00 The cache was moved to proxy mode as a step to mitigate the issue. This did not help, as slow IOPS continued.
10 Sep 2021 09:00 After installation of the updated Ceph software, the service was back online again.
10 Sep 2021 11:00 Close monitoring while the volume data sync was ongoing. Slow ops were recurring as before the incident.

Root Cause

During this incident, several problems and root causes were discovered:

  1. Broken hardware - Memory failed on one of the new servers while the data sync was ongoing. Even though the servers had been through intense burn-in tests, the failure happened anyway.
  2. Ceph bug - After adding the repaired server back into production and fixing the cache mismatch, a Ceph bug surfaced when the cluster tried to remove already removed data.

Corrective and Preventative Measures

No. | Action Description | Company/Contractor/Supplier | Planned Completion Date | Status
1 | Remove the cache from the cluster to increase performance (see sketch below). | City Network | Q4 2021 | ONGOING
2 | Move Ceph internal services to new hardware. | City Network | Q1 2022 | TO DO
3 | Investigate whether we should extend burn-in tests on new hardware. | City Network | Q4 2021 | TO DO
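
Corrective measure 1 is not detailed further in the report. Removing a Ceph cache tier generally means draining it first and then detaching it from the base pool, roughly as sketched below (pool names are hypothetical):

    # Illustrative only: drain and detach a cache tier.
    ceph osd tier cache-mode volumes-cache proxy     # stop caching new writes
    rados -p volumes-cache cache-flush-evict-all     # flush and evict remaining objects
    ceph osd tier remove-overlay volumes             # stop routing client I/O via the cache
    ceph osd tier remove volumes volumes-cache       # detach the cache tier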

Final Words

We apologize for the inconvenience caused during this incident, and thank you again for your patience.

Service availability is always our main priority, and we will do everything we can to learn from the incident and avoid a recurrence in the future.

If you experience any problems, then please open a ticket with our support team.

Sincerely,

City Network International AB

Posted Oct 08, 2021 - 16:38 CEST

Resolved
This incident is now resolved and a postmortem will be published shortly.
Posted Sep 16, 2021 - 11:41 CEST
Update
We have now completed all synchronisation tasks to the new storage servers, and there will not be any more slow IOPS from the cluster.

Service is fully operational and we will update with the next steps in this change around 13:00.
Posted Sep 16, 2021 - 09:34 CEST
Update
We would like to inform you about the current status of the storage work in STO2.

We have now increased the speed of the synchronisation moving VM data from the old storage nodes into the new NVMe based storage nodes. The goal is to have this done before tomorrow morning to solve the initial problem with high latency and slow IOPS.

As long as this synchronisation is ongoing, there will be recurring slow IOPS that might affect VMs, but we are monitoring closely and will resolve them as soon as they occur.
Posted Sep 15, 2021 - 14:05 CEST
Update
We would like to inform you about the current status for the storage work in STO2.

* Synchronization of VM images started yesterday evening (Tuesday, September 14th) and will run continuously in the background. We will update with an expected time frame for this sync soon.

Customer impact:

* Continuous improvement in the stability of our storage system as more and more data is offloaded onto the new storage.
* Recurring slow IOPS may occasionally occur while the sync is ongoing.
Posted Sep 15, 2021 - 11:35 CEST
Monitoring
Slow ops are gone and we continue to monitor this issue closely.
Posted Sep 13, 2021 - 13:13 CEST
Identified
Engineers are investigating slow ops in the cluster at the moment.
Posted Sep 13, 2021 - 12:53 CEST
Monitoring
We are continuing to closely monitor this situation.
If you still experience issues with your VM, please contact support.
Posted Sep 10, 2021 - 12:15 CEST
Update
Work is still ongoing to resolve this incident, but the situation has been stable since the last update.

If you still experience any issues, a hard reboot might help.
If that does not resolve them, please contact support.
Posted Sep 10, 2021 - 11:05 CEST
Update
Work is still ongoing.
At this time, we have managed to resolve most of the urgent problems, and we can see that the storage system is returning to normal operation. We are still cautious and the team is still working, but you may notice that your instances and volumes are functional again. We are monitoring closely and will continue to provide updates.
Posted Sep 10, 2021 - 09:43 CEST
Update
Work to restore the service has been ongoing all night.
The engineering team is working together with external resources on specific actions at this moment.
We have no ETA at this time.

For affected VMs, a hard reboot will not help.

We will update this status at 09:30.
Posted Sep 10, 2021 - 08:41 CEST
Update
We continue to work on a resolution for this on-going issue.
Next update will be at 08:00 CEST
Posted Sep 10, 2021 - 00:26 CEST
Update
We continue to work on a resolution for this on-going issue.
Next update will be at 00:30 CEST
Posted Sep 09, 2021 - 23:29 CEST
Update
We continue to work on a resolution for this on-going issue.
Next update will be at 23:30 CEST
Posted Sep 09, 2021 - 22:36 CEST
Update
We continue to work on a resolution for this on-going issue.
The next update will be at 22:30 CEST
Posted Sep 09, 2021 - 21:28 CEST
Identified
We have identified another issue related to the new nodes in the cluster. Our storage team is currently working on it.
Posted Sep 09, 2021 - 17:28 CEST
Monitoring
We have now restored the service and will keep monitoring the situation closely.

Affected VMs should be working again; if not, perform a hard reboot.
If you still see issues, please contact support.
Posted Sep 09, 2021 - 15:39 CEST
Update
This issue is related to the ongoing sync to the new storage nodes, during which one of the new servers failed.
The storage team is working on restoring the service.
Posted Sep 09, 2021 - 15:20 CEST
Update
We are still working on this storage issue.
Rebooting servers will not help at this point.
Posted Sep 09, 2021 - 14:57 CEST
Identified
Slow operations have been mitigated, and the storage team is currently working on this issue.

If you still have issues with a VM, please try a hard reboot.
Posted Sep 09, 2021 - 12:00 CEST
Investigating
Our engineering team is investigating a storage issue in the Stockholm region (STO2).

We apologize for the inconvenience and will share an update once we have more information.
Posted Sep 09, 2021 - 10:58 CEST
This incident affected: STO2 (Stockholm) (Block Storage).