Storage Capacity in Uni Cloud Münster – OpenStack: Status Update

Status update on the storage situation in Uni Cloud Münster – OpenStack: why our Ceph cluster is full, why expanding it requires a larger network rebuild, and where we currently stand.

TL;DR: Our Ceph storage cluster is very full, and we currently cannot grant most requests for additional storage. New hardware is already on site, but before we can add it we need to rebuild our network in two ways: into a two-tier (spine–leaf) topology, and from an L2-based design (VLANs, MC-LAG on Cumulus) to a fully L3-routed fabric (BGP, EVPN, SRv6 on SONiC). This work is in progress, but due to limited personnel and ongoing operations we cannot give a reliable date yet.

If you are a user of our Uni Cloud Münster – OpenStack service and have recently asked us about additional storage – or are planning to – this post is for you. We want to be transparent about where we currently stand, why expanding our storage is taking longer than any of us would like, and what we are doing about it.

The current situation

The Ceph cluster that backs our OpenStack service is currently filled well above our warning threshold. At this fill level we have to be very careful with new allocations: Ceph needs free capacity to rebalance data, to tolerate disk or node failures, and to keep performance predictable. As a consequence, we are currently unable to grant most requests for additional volumes or quota increases.
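To illustrate why a nearly full cluster leaves so little room to manoeuvre, here is a minimal sketch with made-up numbers (the node count, node size, and usage below are hypothetical and are not our actual cluster figures). It assumes equally sized nodes and uses Ceph's default nearfull warning ratio of 0.85. When a node fails, Ceph re-creates the lost replicas on the remaining nodes, so the same amount of raw data has to fit into less raw capacity.

```python
# Hypothetical illustration: fill ratio before and after losing one storage node.
# All numbers are invented for the example; only the 0.85 default is from Ceph.

NEARFULL_RATIO = 0.85  # Ceph's default nearfull warning threshold


def fill_ratio(used_raw_tb: float, node_count: int, node_capacity_tb: float) -> float:
    """Raw fill ratio of a cluster made of equally sized nodes."""
    return used_raw_tb / (node_count * node_capacity_tb)


def fill_ratio_after_node_loss(
    used_raw_tb: float, node_count: int, node_capacity_tb: float
) -> float:
    """Fill ratio after one node is gone and its data has been re-replicated.

    The raw data volume stays the same (lost replicas are rebuilt elsewhere),
    but the raw capacity shrinks by one node.
    """
    if node_count < 2:
        raise ValueError("need at least two nodes to survive a node loss")
    return fill_ratio(used_raw_tb, node_count - 1, node_capacity_tb)


if __name__ == "__main__":
    used, nodes, size = 800.0, 10, 100.0  # hypothetical: 800 TB raw used on 10 x 100 TB
    before = fill_ratio(used, nodes, size)
    after = fill_ratio_after_node_loss(used, nodes, size)
    print(f"before failure: {before:.1%}")
    print(f"after one node failure: {after:.1%}")
    print("above nearfull after failure:", after > NEARFULL_RATIO)
```

In this invented example, a cluster that looks acceptable at 80% would end up at roughly 89% after a single node failure, which is above the warning threshold. This is the kind of headroom we have to protect when deciding on new allocations.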

We know that this is painful. Several research groups and projects are actively waiting for more storage, and we have been answering these requests individually. This blog post is meant to give everyone the same, complete picture in one place.

Why we can’t just add more disks

The short answer: we have the hardware, but we cannot plug it in yet.

New storage nodes need network ports, and our current switches are full. To grow the cluster further, we have to restructure the underlying network into a two-tier (spine–leaf) design. Only then can we add additional leaf switches – and with them the new storage nodes that are already sitting in our racks, waiting to be put into production.
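As a rough sketch of why a spine–leaf design makes growth possible, the following counts the ports available for servers in such a fabric. The switch sizes and counts are hypothetical and do not describe our actual hardware; the sketch simply assumes that every leaf switch uses one uplink port per spine switch and offers all remaining ports to servers.

```python
# Hypothetical illustration: server-facing ports in a two-tier spine-leaf fabric.
# Port and switch counts are invented; they are not our real hardware figures.


def server_ports(
    leaf_count: int,
    ports_per_leaf: int,
    spine_count: int,
    uplinks_per_spine: int = 1,
) -> int:
    """Total ports left for servers when each leaf connects to every spine."""
    uplinks = spine_count * uplinks_per_spine
    if uplinks >= ports_per_leaf:
        raise ValueError("uplinks use up all leaf ports")
    return leaf_count * (ports_per_leaf - uplinks)


if __name__ == "__main__":
    # Hypothetical fabric: 2 spines, leaves with 32 ports each.
    print("4 leaves:", server_ports(4, 32, 2))
    # Adding a fifth leaf adds a full leaf's worth of server ports,
    # as long as the spines still have free ports for its uplinks.
    print("5 leaves:", server_ports(5, 32, 2))
```

The point of the design is that capacity grows by adding leaf switches instead of replacing existing ones, which is what the new storage nodes need.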

Adding a spine layer on its own would be straightforward. In our case, however, we are changing several fundamentals at the same time. We are moving from an L2-based design (VLANs, MC-LAG on Cumulus) to a fully L3-routed fabric (BGP, EVPN, SRv6 on SONiC), and we are aligning with the university’s new network color concept for network segmentation. Keeping the old L2 design was not actually an option, not least because the switch generation we procured cannot provide stable MC-LAG.

Where we are in this process

  • Concept: done. The new network architecture is fully designed.
  • Proof of concept: done. We have already rolled the new design out on one of our test clusters and it works as expected.
  • Migration planning: in progress. Moving a production cluster from the current L2 design to the new L3 fabric without significant downtime is a complex operation that needs to be planned step by step.
  • Testing and bug hunting: in progress. Before we touch production we want to be confident that we have found the rough edges in a controlled environment.

Once the migration is complete, integrating the new storage hardware will be comparatively straightforward.

Why it is taking so long

We want to be honest about the reasons for the delay:

  • Limited personnel. Our team is small for the scope of services we operate. We are working intensively on this, but there is a hard limit to how many complex changes we can drive forward in parallel.
  • Ongoing operations don’t pause. Security updates, incident response, and general maintenance of all the services we run cannot be deferred indefinitely just because another task is urgent. Keeping the existing infrastructure healthy and secure has to remain a priority, and that is also in the interest of the users currently waiting for more storage.
  • Unforeseen issues. As with any complex infrastructure, unexpected problems regularly come up and need attention before we can return to planned work.

What this means for you

We are not in a position to give a reliable date for when additional storage will be available again. Any number we put out now would be a guess, and we would rather be honest than over-promise.

What we can promise:

  • We are aware that this is blocking real work for some of you.
  • This topic has high priority for our team and we are actively working on it.
  • As soon as we have a realistic timeline – or once new capacity actually becomes available – we will communicate it here and to the affected users.

For general questions, you can always reach us at openstack@uni-muenster.de.

Thank you for your patience and understanding.