Starting on 22.08.2025 at 18:00, we noticed flapping OSDs. OSDs are Ceph's storage daemons; each daemon manages a specific disk in our cluster. Flapping means that these OSD storage processes were periodically marked down: around 70 of our 1710 OSDs were down at any given time, and they were distributed evenly across the cluster. This led to incomplete or down Placement Groups (PGs). A Placement Group is associated with a set of data that is stored on the OSDs it maps to, so a down PG means that data is unavailable.
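As an aside, a down-OSD count like the one above can also be pulled programmatically through librados' mon_command interface. The following is only a minimal sketch of that approach, not the tooling we used; the client id, config path and use of the "osd stat" command are assumptions for a typical admin setup.

// Minimal sketch: ask the monitors for "osd stat" via librados and print the
// raw JSON reply (it contains the total number of OSDs and how many are up).
// Equivalent to "ceph osd stat --format json" on the command line.
#include <iostream>
#include <string>
#include <rados/librados.hpp>

int main() {
    librados::Rados cluster;
    if (cluster.init("admin") < 0) return 1;            // connect as client.admin (assumed)
    cluster.conf_read_file("/etc/ceph/ceph.conf");      // default config path (assumed)
    if (cluster.connect() < 0) return 1;

    librados::bufferlist inbl, outbl;
    std::string outs;
    int r = cluster.mon_command(
        R"({"prefix": "osd stat", "format": "json"})", inbl, &outbl, &outs);
    if (r < 0) {
        std::cerr << "mon_command failed: " << outs << std::endl;
        return 1;
    }

    std::cout << std::string(outbl.c_str(), outbl.length()) << std::endl;
    cluster.shutdown();
    return 0;
}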
Since we were in the middle of upgrading all our nodes to a newly built operating system image, we initially suspected that the two were related. However, we had not seen anything similar in our tests, despite many node reinstalls.
Looking through the Ceph logs, we found the following assertion failure:

src/erasure-code/ErasureCode.cc: 153: FAILED ceph_assert(chunk.is_contiguous())

This pointed us directly to the source of the problem: a buffer list was being rebuilt with rebuild_aligned_size_and_memory, but was not contiguous afterwards.
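To make the assertion more tangible, here is a standalone sketch of what is being checked. It is an illustrative mock, not the actual Ceph bufferlist code: a buffer list is a chain of memory segments, and before the erasure-code plugin can work on a chunk, the rebuild step must have merged that chain into a single aligned, contiguous segment. In the failing OSDs this post-condition did not hold, so ceph_assert(chunk.is_contiguous()) aborted the daemon.

// Illustrative mock of a scatter/gather buffer list and the contiguity check.
#include <cassert>
#include <cstdint>
#include <list>
#include <vector>

struct buffer_list {
    std::list<std::vector<uint8_t>> segments;   // chain of memory segments

    size_t length() const {
        size_t n = 0;
        for (const auto &s : segments) n += s.size();
        return n;
    }
    bool is_contiguous() const { return segments.size() <= 1; }

    // Copy all segments into one allocation padded to align_size -- the
    // behaviour the rebuild step is expected to guarantee.
    void rebuild_aligned(size_t align_size) {
        std::vector<uint8_t> merged;
        for (const auto &s : segments)
            merged.insert(merged.end(), s.begin(), s.end());
        merged.resize(((merged.size() + align_size - 1) / align_size) * align_size, 0);
        segments.clear();
        segments.push_back(std::move(merged));
    }
};

int main() {
    buffer_list chunk;
    chunk.segments.push_back(std::vector<uint8_t>(1000, 1));  // fragmented client write
    chunk.segments.push_back(std::vector<uint8_t>(24, 2));

    chunk.rebuild_aligned(4096);
    // This is the post-condition that failed in our OSDs.
    assert(chunk.is_contiguous());
    return 0;
}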
A search on the internet and in the Ceph bug tracker revealed only a single report of this bug, on Reddit, where the company “croit” had helped create a patch and a patched version.
It turned out that the Ceph code in question was quite cryptic, so we searched for patches to this code in more recent versions. We found only a single relevant patch, related to a stack corruption, which is present only in more recent releases. We built a custom Ceph version with this patch and tested it on a single node, but it did not solve our problem. We concluded that the patch for this particular bug had not been released yet.
As the bug was triggered by client operations on erasure-coded pools, we suspected that a particular client access pattern was causing the issue. We therefore increased the logging level and looked at the most recent operations before each assertion. To our surprise, it was always a single client performing the same operation just before the assertion fired. A quick search for the client IP showed that it was a specific Palma node.
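The triage itself was plain log reading, but the idea can be sketched as a small helper: collect the log lines immediately preceding the assertion and count which client address appears most often. The log line format matched below is an assumption (it varies between Ceph versions and debug levels), so this illustrates the approach rather than the exact script we used.

// Sketch of the log triage: keep a window of lines before the assert and
// count client IP addresses in it (log format is an assumption).
#include <deque>
#include <fstream>
#include <iostream>
#include <map>
#include <regex>
#include <string>

int main(int argc, char **argv) {
    if (argc < 2) { std::cerr << "usage: " << argv[0] << " <osd-log>\n"; return 1; }
    std::ifstream log(argv[1]);
    std::deque<std::string> window;           // lines just before the crash
    const size_t window_size = 200;
    std::string line;
    while (std::getline(log, line)) {
        if (line.find("FAILED ceph_assert(chunk.is_contiguous())") != std::string::npos)
            break;                            // keep only pre-assert context
        window.push_back(line);
        if (window.size() > window_size) window.pop_front();
    }
    // Count client addresses such as "client.12345 10.0.0.42:0/..." (assumed format).
    std::regex client_re(R"(client\.\d+\s+(\d+\.\d+\.\d+\.\d+))");
    std::map<std::string, int> hits;
    for (const auto &l : window) {
        std::smatch m;
        if (std::regex_search(l, m, client_re)) ++hits[m[1]];
    }
    for (const auto &[ip, n] : hits)
        std::cout << ip << " appeared " << n << " times before the assert\n";
    return 0;
}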
Interestingly, shutting down the single application accessing the disk on this node did not stop the problem, so the kernel was probably triggering the issue while flushing its cache. Unmounting the filesystem made things even worse: suddenly 270 OSDs were down. A hard reboot of the machine finally resolved the issue.
So there is still a bug in the Ceph code, although it occurs very rarely. However, we now have a way to quickly identify and shut down a misbehaving client that triggers the issue. Furthermore, we have asked croit for the patch, since they have already fixed the underlying problem, so that we can build a fixed version ourselves. We are also discussing options to limit access from Palma, as there have been various performance issues caused by Palma users in the past.