Country | Sweden |
---|---|
Location | STO2 |
Report Type | Incident Report |
Date | 09 Sep 2021 |
Incident Start time | 10:45 CEST (all times mentioned in this document are in the CEST timezone) |
Incident End time | 10 Sep 2021 11:40 |
Priority | P1 |
Impact | High |
Urgency | High |
From the 9th of September at 10:45 to the 10th of September at 09:00, City Network experienced a major storage outage affecting a large number of customers in the CityCloud STO2 region.
The root cause of this outage was a hardware failure that occurred while syncing a large amount of data, leading to a cache data mismatch.
At the beginning of 2021, work started to move virtual instances from our old NFS-based storage to new, advanced Ceph storage clusters. At the same time, it was also planned to upgrade the Ceph storage by replacing all existing hard disk drive (HDD) based storage with solid state drives (NVMe). We ordered new hardware and expected delivery in March 2021. Delivery was pushed back several times due to the high worldwide demand for electronic components.
To improve overall performance in the meantime, NVMe drives were added to some of the existing clusters, and a write cache for volumes was enabled in April. In May, we enabled the same write cache for VMs as well.
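For background, the sketch below illustrates how a writeback cache tier is typically attached to an existing Ceph pool using the standard cache-tiering commands. It is a minimal sketch only; the pool names are hypothetical placeholders and do not describe our actual configuration.

```python
# Minimal sketch (not our actual configuration): attaching a writeback cache
# tier to an existing Ceph pool. Pool names are hypothetical placeholders.
import subprocess

def run(cmd):
    """Run a Ceph CLI command and fail loudly if it returns non-zero."""
    subprocess.run(cmd, check=True)

backing_pool = "volumes"        # assumed name of the existing HDD-backed pool
cache_pool = "volumes-cache"    # assumed name of the new NVMe-backed pool

# Attach the fast pool as a cache tier in front of the backing pool.
run(["ceph", "osd", "tier", "add", backing_pool, cache_pool])
# Writeback mode: client writes land on the fast tier first.
run(["ceph", "osd", "tier", "cache-mode", cache_pool, "writeback"])
# Route client traffic through the cache tier.
run(["ceph", "osd", "tier", "set-overlay", backing_pool, cache_pool])
# A hit set is required so the tiering agent can track object usage.
run(["ceph", "osd", "pool", "set", cache_pool, "hit_set_type", "bloom"])
```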
We then received new delivery times for the new hardware, with a revised ETA at the end of July.
On the 24th of July, slow ops increased significantly, and the storage team investigated and tested different configuration changes to mitigate the issue as much as possible.
On the 18th of August, we began racking the new hardware, followed by installation, configuration, and burn-in test phases.
On the 4th of September, syncing of data to bring the new storage nodes into production was started. The first step was syncing images over the weekend, followed by volumes on the 7th of September.
The volume sync was estimated to take 5-6 days and was planned to run as a slow, controlled background task to minimize slow ops. On the evening of the 8th of September, one of the new storage nodes rebooted. As the volume sync was still running, engineers verified that the cluster health status was okay.
Sync was still ongoing, with 75% done at the time of this incident.
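To illustrate what running the sync as a slow, controlled background task and verifying cluster health can look like in practice, here is a minimal sketch; the throttling values and polling interval are assumptions for illustration, not our production settings.

```python
# Minimal sketch: throttling background data movement and polling cluster
# health while a sync runs. The values here are illustrative assumptions,
# not our production settings.
import json
import subprocess
import time

def ceph(*args):
    """Run a ceph CLI command and return its stdout as text."""
    result = subprocess.run(["ceph", *args], check=True,
                            capture_output=True, text=True)
    return result.stdout

# Keep backfill/recovery slow so client IO sees as few slow ops as possible.
ceph("config", "set", "osd", "osd_max_backfills", "1")
ceph("config", "set", "osd", "osd_recovery_max_active", "1")

# Poll overall cluster health once a minute while the sync is in progress.
while True:
    status = json.loads(ceph("status", "--format", "json"))
    health = status["health"]["status"]   # HEALTH_OK / HEALTH_WARN / HEALTH_ERR
    print("cluster health:", health)
    if health == "HEALTH_ERR":
        # Dump the details so the on-call engineer can act on them.
        print(ceph("health", "detail"))
        break
    time.sleep(60)
```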
On Wednesday, the 9th of September, at 10:45, we observed that one of the new storage nodes had crashed and rebooted. The server was taken out of production, as the investigation pointed to a memory hardware error.
After the server was taken out of production, the engineering team could not get the cluster back into a healthy state. Due to the ongoing sync, the broken server was still holding cache data.
To bring the server back into production and regain access to that cache data, an infrastructure engineer went to the data center and replaced the broken memory. The server was then repaired and added back to the cluster.
Due to a cache mismatch, the cluster was still in an error state, so we proceeded with a workaround, which seemed to work. Service was back again at 15:30.
At 17:30, we saw that the cluster tried to remove a non-existent copy of data from the cache and went into an error state again. A deep analysis was started, which indicated that the problem was in the core Ceph source code. We contacted Ceph developers to involve them in the investigation.
On the 10th of September at 01:00, we moved the cache to proxy mode to see if it could mitigate the situation, but since we still had the old storage servers in production, we continued to see slow IOPS. Together with external developers, we started to update the Ceph source code. When the code work was finished, we updated the cluster with this new, forked version, and at 09:00 the file system started to work correctly again. We continued to sync volume data throughout the day to finish that operation. No further work on the cluster was done over the weekend, only monitoring to ensure the situation stabilized. As before the incident, we could still see slow IOPS from the old storage servers, and we expected this to continue as long as they remained in production.
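For reference, switching a Ceph cache tier from writeback to proxy mode, as mentioned above, typically looks like the sketch below; the pool name is a hypothetical placeholder and the sketch does not reflect the exact steps taken during the incident.

```python
# Minimal sketch: switching a Ceph cache tier from writeback to proxy mode,
# so new reads and writes are forwarded to the backing pool instead of being
# cached. The pool name is a hypothetical placeholder.
import subprocess

cache_pool = "volumes-cache"  # assumed name of the cache-tier pool

subprocess.run(
    ["ceph", "osd", "tier", "cache-mode", cache_pool, "proxy"],
    check=True,
)
```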
On September 15th, we started to sync all virtual instances to the new hardware. This was the last step in the migration work and was predicted to take four days to complete. We decided to increase the speed of the sync so that the old servers could be taken out of production as soon as possible.
After midnight, the sync was completed without any increased disturbance.
On September 10th, we removed the main part of the old hardware from the cluster, and monitoring indicated that all slow ops disappeared at that time. Some old servers still exist in the configuration but only run monitoring and management services.
This incident was closed on September 15th at 11:40.
Date | Time | Observation/Action |
---|---|---|
09 Sep 2021 | 10:45 | One of the new storage servers crashed and rebooted due to broken memory |
09 Sep 2021 | 15:30 | Storage service was back. |
09 Sep 2021 | 17:30 | The cluster went into an error state again, affecting the storage service |
09 Sep 2021 | 21:00 | Working with upstream developers on a fix. |
10 Sep 2021 | 01:00 | The cache was moved to proxy mode as a step to mitigate the issue; this did not help, as slow IOPS increased |
10 Sep 2021 | 09:00 | After the installation of the updated Ceph software, the service was back online again. |
10 Sep 2021 | 11:00 | Close monitoring while the volume data sync was ongoing. Slow ops recurred as before the incident. |
During this incident, several problems and root causes were discovered, leading to the following actions:
No | Action Description | Company/Contractor/Supplier | Planned Completion Date | Status |
---|---|---|---|---|
1 | Remove cache from the cluster to increase performance (see the sketch after this table). | City Network | Q4 2021 | ON-GOING |
2 | Move Ceph internal services to new hardware. | City Network | Q1 2022 | TO DO |
3 | Investigate if we should increase Burn-In tests on new hardware. | City Network | Q4 2021 | TO DO |
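For context on Action 1, the sketch below outlines the generic upstream Ceph procedure for retiring a writeback cache tier. Pool names are hypothetical placeholders, and the steps shown are the general procedure, not our exact implementation plan.

```python
# Minimal sketch: the generic upstream procedure for retiring a writeback
# cache tier in Ceph. Pool names are hypothetical placeholders.
import subprocess

def run(cmd):
    """Run a CLI command and fail loudly if it returns non-zero."""
    subprocess.run(cmd, check=True)

backing_pool = "volumes"        # assumed name of the backing pool
cache_pool = "volumes-cache"    # assumed name of the cache-tier pool

# 1. Stop caching new IO; proxy it to the backing pool instead.
run(["ceph", "osd", "tier", "cache-mode", cache_pool, "proxy"])
# 2. Flush and evict every object still held by the cache tier.
run(["rados", "-p", cache_pool, "cache-flush-evict-all"])
# 3. Detach the overlay so clients talk to the backing pool directly.
run(["ceph", "osd", "tier", "remove-overlay", backing_pool])
# 4. Finally remove the cache tier from the backing pool.
run(["ceph", "osd", "tier", "remove", backing_pool, cache_pool])
```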
We apologize for the inconvenience caused during this process, and thank you again for your patience.
Service availability is always our main priority, and we will do everything we can to learn from the incident and avoid a recurrence in the future.
If you experience any problems, then please open a ticket with our support team.
Sincerely,
City Network International AB