Storage disturbances in KNA1
Incident Report for Cleura Public Cloud
Postmortem

On September 11, 2020, City Network experienced an outage involving our City Cloud block storage system in region KNA1, impacting a large portion of our customers with active services in that location. We are truly sorry about this incident.

While we are still working on a full Root Cause Analysis and on follow-up work around this incident, we wanted to give you some explanation of what happened and how City Network chose to address the problem.

Why did you experience an outage?

During the initial investigation, it became clear that one of the storage back-ends serving City Cloud region KNA1 was unresponsive and misbehaving. This led to compute nodes receiving errors, which caused the file systems of some VMs to go into a Read-Only state, and a hard reboot of those servers was needed.
To discover VMs in this faulty state, we checked their console output, and we performed a hard reboot of every server where the Read-Only state was visible.
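As a rough illustration of that step (a sketch, not our exact internal tooling), the Python example below walks the servers in an OpenStack project with the standard openstack CLI, scans each console log for typical read-only kernel messages, and hard-reboots the matches. The marker strings and the assumption that CLI credentials are already loaded in the environment are illustrative assumptions, not details taken from this incident.

```python
#!/usr/bin/env python3
"""Sketch: find VMs whose console log indicates a read-only file system
and hard-reboot them, using the standard OpenStack CLI."""
import subprocess

# Kernel messages that typically appear when a file system drops to read-only.
# These markers are illustrative assumptions, not taken from the incident.
RO_MARKERS = ("remounting filesystem read-only", "read-only file system")

def run(*args: str) -> str:
    """Run an openstack CLI command and return its stdout."""
    return subprocess.run(("openstack",) + args, check=True,
                          capture_output=True, text=True).stdout

# List all servers as "<id> <name>" pairs.
for line in run("server", "list", "-f", "value", "-c", "ID", "-c", "Name").splitlines():
    server_id, name = line.split(maxsplit=1)
    # Fetch the console log and look for the read-only markers.
    console = run("console", "log", "show", server_id).lower()
    if any(marker in console for marker in RO_MARKERS):
        print(f"{name} ({server_id}) looks read-only, issuing hard reboot")
        run("server", "reboot", "--hard", server_id)
```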

What are NFS and Ceph?

City Cloud was launched using a storage solution based on NFS; more recent architectural decisions were made to replace NFS with Ceph. We currently run Ceph globally in many different data centers, and City Cloud in region KNA1 is currently undergoing a project to replace NFS with Ceph.

Network File System (NFS) is a distributed file system protocol that allows access to files over a computer network much like local storage is accessed. Ceph, on the other hand, is a complete open-source software storage platform that implements object storage on a single distributed computer cluster and provides three interfaces in one: object-, block- and file-level storage.

The benefit of Ceph is that it allows us to replicate data and make it fault-tolerant. As a result of its design, the system is both self-healing and self-managing, which in itself adds layers of safety to operations.
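As an illustration of what that replication means in practice, the sketch below inspects and sets the replica count of a Ceph pool using the standard ceph CLI from Python. The pool name "volumes" and the replica count of three are examples for illustration, not details of the KNA1 deployment.

```python
#!/usr/bin/env python3
"""Sketch: inspect and set the replication factor of a Ceph pool."""
import subprocess

POOL = "volumes"  # hypothetical pool name backing block-storage volumes

def ceph(*args: str) -> str:
    """Run a ceph CLI command and return its stdout."""
    return subprocess.run(("ceph",) + args, check=True,
                          capture_output=True, text=True).stdout.strip()

print(ceph("osd", "pool", "get", POOL, "size"))  # current replica count, e.g. "size: 3"
ceph("osd", "pool", "set", POOL, "size", "3")    # keep three copies of every object
print(ceph("health"))                            # the cluster reports its own state, e.g. HEALTH_OK
```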

A new storage system based on Ceph is already running in production in KNA1 for some services.

What are we doing now?

City Network engineers will deploy a new pool of compute nodes, allowing new capacity to be spawned on a new architecture and to utilize a new, more reliable, and resilient storage system (Ceph).
This will allow both system disks and volumes to use the new storage system. We will communicate when this is fully implemented.
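As a rough illustration of what this enables (a sketch, not the exact rollout procedure), the example below creates a volume on a Ceph-backed volume type and boots a server from it using the standard OpenStack CLI. The volume type "ceph-std", image "ubuntu-20.04", flavor "b.2c4gb" and network "default" are hypothetical names used only for illustration.

```python
#!/usr/bin/env python3
"""Sketch: put both the system disk and a data volume on a Ceph-backed
volume type by booting a server from a volume."""
import subprocess

def openstack(*args: str) -> str:
    """Run an openstack CLI command and return its stdout."""
    return subprocess.run(("openstack",) + args, check=True,
                          capture_output=True, text=True).stdout

# A bootable 50 GB volume created from an image on the (hypothetical)
# Ceph-backed volume type, so the system disk lives on Ceph as well.
openstack("volume", "create", "--size", "50", "--type", "ceph-std",
          "--image", "ubuntu-20.04", "--bootable", "boot-vol")

# Boot a server from that volume instead of from an NFS-backed disk.
openstack("server", "create", "--flavor", "b.2c4gb",
          "--volume", "boot-vol", "--network", "default", "ceph-backed-vm")
```

In practice one would wait for the volume to reach the "available" state before booting a server from it.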

Meanwhile, the NFS storage system will undergo diagnostics and non-disruptive maintenance so that actions can be taken to try to prevent additional disruptions. In parallel, engineers are finalizing the investigation of migration tooling that will allow data to be transferred from NFS to Ceph. Our Customer Engagement department will be responsible for coordinating any upcoming data migration with you.

If you have any questions or concerns, please contact your customer team or open a customer request with our Service Desk.

Posted Sep 23, 2020 - 13:05 CEST

Resolved
This incident is now closed. If you experience any issues, please contact our support.
Posted Sep 17, 2020 - 11:26 CEST
Monitoring
We have rebooted the impacted instances, all services are operational again. We will continue to monitor the situation.
Posted Sep 16, 2020 - 19:42 CEST
Update
We are still working on rebooting VMs showing Read-Only status.
Posted Sep 16, 2020 - 17:59 CEST
Update
We are now checking VMs for a Read-Only state and will perform a hard reboot on those that are in this state.
Posted Sep 16, 2020 - 17:17 CEST
Identified
Technicians are working on the issue. If you have any issues with your VM, try performing a hard reboot.
Posted Sep 16, 2020 - 17:06 CEST
Investigating
Our monitoring captured timeouts between certain compute nodes and back-end storage; our technicians are looking into the issue.
Posted Sep 16, 2020 - 16:41 CEST
This incident affected: KNA1 (Karlskrona) (Compute).