Service outage on 02.07.2025 at SP4

A short review of the circumstances that led to this incident, and our takeaways.

Introduction

Power maintenance on 01.07.2025 at around 22:30 caused a single power line to fail. All hardware kept running, but we observed the following issues, which caused a total outage:

General slowness during power maintenance (unsolved)

SSH logins took up to 30 seconds after successful authentication, and ps faux as well as some other commands took 2-4 seconds to produce output. We currently do not know the underlying cause and cannot debug this issue further.

API server connections intermittently not working

There were some leftover Gateway / HTTPRoute resources from our Gateway API tests. The Gateway was bound to port 443, which somehow caused conflicts with the API server, which listens on a different IP address but also on port 443. We assume that Cilium failed to properly determine which packets had to be routed to Envoy and which to the backend service.
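The cleanup boils down to finding and deleting the leftover resources; a rough sketch, assuming the Gateway API CRDs are still installed (names and namespaces are placeholders):

kubectl get gateways,httproutes --all-namespaces
kubectl delete gateway <name> -n <namespace>
kubectl delete httproute <name> -n <namespace>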

Kube-apiserver --etcd-servers needs all etcd servers

We added all control plane IPs to the kube-apiserver --etcd-servers flag, so that on server failures the kube-apiserver does not automatically fail but connects to another etcd server. The default is to connect only to the local etcd server, because in normal Cluster API deployments the IP addresses of the other servers are generally not known.
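For illustration, the resulting flag looks roughly like this (the IP addresses below are placeholders, not our actual control plane addresses):

--etcd-servers=https://192.0.2.11:2379,https://192.0.2.12:2379,https://192.0.2.13:2379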

However, the kube-apiserver fails to start with an error message if not all etcd servers can be contacted. We had kube-apiservers failing due to network connectivity issues, and this additional problem caused the apiserver to be permanently unavailable. The problem originates in a preflight check, which was included until Kubernetes 1.22 (https://github.com/kubernetes/kubernetes/pull/101993). Also, the timeout is not respected properly: https://github.com/kubernetes/kubernetes/issues/113892

Kubelet cannot bind to port 10250

TCP connections automatically use a dynamically allocated free local port for outgoing connect() syscalls. The port range is determined by the net.ipv4.ip_local_port_range sysctl, and we had set it to "10240 65535". However, kubelet wants to listen on port 10250, which means that this port must not be in use by other connections. As it turned out, kubelet itself makes network connections on startup and dynamically allocates local ports for them. In our case, kubelet used port 10250 for an outgoing connection before starting its listener, so the listener could no longer bind to the port.

We now start the range at port 32768 to further minimize the risk of colliding with statically allocated ports:

sysctl -w net.ipv4.ip_local_port_range="32768 65535"
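To verify the active range and to see whether a connection is occupying the kubelet port, something like the following can be used (the new range should also be persisted, e.g. via a file in /etc/sysctl.d/):

sysctl net.ipv4.ip_local_port_range
ss -tanp | grep ':10250'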

Ceph monitor volume too small

Ceph monitors normally use only a few gigabytes of storage for their database. During data migrations, recoveries or outages, however, the storage demand increases; in our case it grew to over 80 GB. We manually increased the monitor storage until enough space was available:

lvextend -L100G /dev/ceph/cephmon
cryptsetup resize luks-cephmon
xfs_growfs /var/lib/ceph/mon
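Afterwards, the new capacity and the actual size of the monitor database can be checked with, for example:

df -h /var/lib/ceph/mon
du -sh /var/lib/ceph/mon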

Ceph mon CPU limit

We saw that some Ceph monitors had a CPU load of around 1. Since we had previously limited their CPU usage to 1, it was an easy choice to increase the limit to 2. The monitors still use at most around 1 CPU, but we can now handle usage spikes without throttling.
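Whether a monitor actually gets throttled can be read from its cgroup statistics; a rough sketch, assuming cgroup v2 and that the path to the monitor's cgroup is known (the path below is a placeholder):

cat /sys/fs/cgroup/<path-to-ceph-mon-cgroup>/cpu.stat

A rising nr_throttled counter in the output means the limit was actually hit.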

Ceph mgr memory limit

Ceph mgr daemons were OOM-killed at their 6 GB memory limit.

We increased the limit to 8 GB and set TCMALLOC_AGGRESSIVE_DECOMMIT for the daemons to further reduce their memory usage.
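How the variable is set depends on the deployment; as a sketch, for systemd-managed mgr daemons it could be a drop-in like the following (unit and file names are assumptions, in containerized deployments the variable would be set on the container instead):

mkdir -p /etc/systemd/system/ceph-mgr@.service.d
cat > /etc/systemd/system/ceph-mgr@.service.d/tcmalloc.conf <<'EOF'
[Service]
Environment=TCMALLOC_AGGRESSIVE_DECOMMIT=true
EOF
systemctl daemon-reload
systemctl restart ceph-mgr.target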

Ceph RBDs were still mapped on other nodes

Some RBD devices were still mapped on other nodes. This caused the MariaDB server, which is central to the OpenStack services, to be unavailable. Since the devices were only mapped but not actually mounted on those nodes, they could easily be unmapped.
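Finding and cleaning up such leftovers is straightforward: on each affected node the mapped devices can be listed, checked for mounts and then unmapped (device names below are examples):

rbd showmapped
findmnt /dev/rbd0
rbd unmap /dev/rbd0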