On September 11, 2020, City Network experienced an outage involving our City Cloud block storage system in region KNA1, impacting a large portion of our customers with active service in that location. We are truly sorry about this incident.
While we are still working on a full Root Cause Analysis and on follow-up work around this incident, we want to give you some explanation of what happened and how City Network chose to address the problem.
During the initial investigation, it became clear that part of the storage back-end serving City Cloud region KNA1 was unresponsive and misbehaving. This led to compute nodes receiving errors, which caused some VMs to go into a read-only state, and a hard reboot of those servers was needed.
To discover VMs in a faulty state, we are checking console output. For the servers where we could see this state, we performed a hard reboot.
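As an illustration only, the console-output check described above could be sketched as follows. This is not City Network's actual tooling: the function names, the sample VM names, and the list of log patterns are assumptions chosen for the example, and how the console output is retrieved from each VM is deliberately left out.

```python
import re

# Assumed indicators of a read-only root file system in Linux console output.
# The real set of patterns used during the incident is not known.
READ_ONLY_PATTERNS = [
    re.compile(r"remounting filesystem read-only", re.IGNORECASE),
    re.compile(r"read-only file system", re.IGNORECASE),
]


def looks_read_only(console_output: str) -> bool:
    """Return True if the console text matches any read-only indicator."""
    return any(p.search(console_output) for p in READ_ONLY_PATTERNS)


def find_faulty_vms(console_logs: dict[str, str]) -> list[str]:
    """Return the sorted names of VMs whose console output looks read-only."""
    return sorted(
        name for name, text in console_logs.items() if looks_read_only(text)
    )


# Hypothetical sample data: VM name mapped to its captured console output.
sample_logs = {
    "vm-a": "EXT4-fs (vda1): Remounting filesystem read-only",
    "vm-b": "login: ",
}
print(find_faulty_vms(sample_logs))  # prints ['vm-a']
```

A check like this only flags candidates. A VM whose console shows no such message may still be affected, so the result is a starting point for the hard reboots rather than a complete list.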
City Cloud was launched using a storage technique based on NFS. More recent architectural decisions were made to replace NFS with Ceph, and we currently run Ceph globally in many different data centers. City Cloud in region KNA1 is currently undergoing a project to replace NFS with Ceph.
Network File System (NFS) is a distributed file system protocol that allows access to files over a computer network much like local storage is accessed. Ceph, on the other hand, is a complete open-source software storage platform that implements object storage on a single distributed computer cluster and provides 3-in-1 interfaces for object-, block-, and file-level storage.
The benefit of Ceph is that it allows us to replicate data and make it fault-tolerant. As a result of its design, the system is both self-healing and self-managing, which in itself adds layers of safety to operations.
A new Ceph-based storage system is already running in production in KNA1 for some services.
City Network engineers will deploy a new pool of compute nodes, allowing new capacity to be spawned on a new architecture that utilizes a new, more reliable, and more resilient storage system (Ceph).
This will allow both system disks and volumes to use the new storage system. We will communicate when this is fully implemented.
Meanwhile, the NFS storage system will undergo diagnostics and non-disruptive maintenance so that we can take action to help prevent additional disruptions. In parallel, engineers are finalizing the investigation of migration tooling to allow data to be transferred from NFS to Ceph. Our Customer Engagement department will be responsible for coordinating any upcoming data migration with you.
If you have any questions or concerns, please contact your customer team or open a customer request with our Service Desk.