There was power maintenance on 01.07.2025 at around 22:30 which caused a single power line to fail. All hardware kept running, but we saw the following issues, which caused a total outage:
ssh login took up to 30 seconds after successful authentication.
ps faux and some other commands took 2-4 seconds to output data.
We currently do not know the underlying reason and cannot debug this issue further.
There were some leftover Gateway / HTTPRoute resources from our Gateway API tests. The Gateway bound to port 443, which somehow caused conflicts with the API server, which listens on a different IP address, also on port 443. We assume that Cilium somehow failed to properly determine which packets had to be routed to Envoy and which to the backend service.
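To find and clean up such leftovers, something along these lines can be used (a sketch; the names and namespace in the delete commands are hypothetical examples):

kubectl get gateways.gateway.networking.k8s.io -A
kubectl get httproutes.gateway.networking.k8s.io -A
kubectl delete gateway.gateway.networking.k8s.io test-gateway -n gateway-tests
kubectl delete httproute.gateway.networking.k8s.io test-route -n gateway-tests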
We have added all control plane IPs to the --etcd-servers kube-apiserver flag, so that on etcd server failures the kube-apiserver does not simply fail, but connects to another etcd server. The default is to connect only to the local etcd server, because in normal Cluster API deployments the IP addresses of the other servers are generally not known.
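As a sketch with placeholder addresses for three control plane nodes (the real values are site-specific), the flag looks like this:

--etcd-servers=https://10.0.1.10:2379,https://10.0.1.11:2379,https://10.0.1.12:2379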
However, the kube-apiserver fails to start with an error message if not all etcd servers can be contacted. We had kube-apiservers failing due to network connectivity issues, and this additional problem caused the apiserver to be permanently unavailable. The behaviour originates in a preflight check that has been included since Kubernetes 1.22 (https://github.com/kubernetes/kubernetes/pull/101993). Also, the timeout is not respected properly: https://github.com/kubernetes/kubernetes/issues/113892
TCP connections automatically use a dynamically allocated free local port on connect() syscalls. The port range is determined by the net.ipv4.ip_local_port_range sysctl. We had set the range to "10240 65535".
However, the kubelet wants to listen on port 10250, which means this port must not be used by other connections. As we found out, the kubelet itself makes outgoing network connections on startup and dynamically allocates local ports for them. In our case, the kubelet used port 10250 for an outgoing connection before starting the listener, so the listener could no longer bind to it. We now start the ephemeral port range at 32768 to further minimize the risk of colliding with statically allocated ports:
sysctl -w net.ipv4.ip_local_port_range="32768 65535"
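The new range can be verified, and any connection still occupying the kubelet port inspected, roughly like this (a sketch; ss is assumed to be available on the node):

sysctl net.ipv4.ip_local_port_range
ss -tanp 'sport = :10250'

Note that sysctl -w does not persist across reboots; for a permanent change the setting also needs to go into a sysctl configuration file, for example under /etc/sysctl.d/.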
Ceph monitors normally use only a few gigabytes of storage for their database. However, during data migrations, recoveries, or outages the storage demand increases; in our case it required over 80 GB. We manually increased the monitor storage until we had enough space available:
lvextend -L100G /dev/ceph/cephmon
cryptsetup resize luks-cephmon
xfs_growfs /var/lib/ceph/mon
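Afterwards the new size can be verified, for example with:

lvs ceph/cephmon
df -h /var/lib/ceph/mon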
We have seen that some Ceph monitors had a CPU load of around 1. We had previously limited their CPU usage to 1, so it was an easy choice to increase the CPU limit to 2. The Ceph monitors still use around 1 CPU at most, but we can now handle usage spikes without throttling.
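How the limit is applied depends on the deployment tooling. Assuming a Rook-managed cluster with the default namespace and cluster name (an assumption, not stated above), a sketch would be:

kubectl -n rook-ceph patch cephcluster rook-ceph --type merge \
  -p '{"spec":{"resources":{"mon":{"limits":{"cpu":"2"}}}}}'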
Ceph mgr daemons were OOM-killed at their 6 GB memory limit. We increased the limit to 8 GB and set TCMALLOC_AGGRESSIVE_DECOMMIT for the daemons to further reduce memory usage.
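Under the same Rook assumption as above, the mgr memory limit can be raised analogously; how TCMALLOC_AGGRESSIVE_DECOMMIT reaches the daemon environment depends on the deployment and is not shown here:

kubectl -n rook-ceph patch cephcluster rook-ceph --type merge \
  -p '{"spec":{"resources":{"mgr":{"limits":{"memory":"8Gi"}}}}}'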
Some RBD devices were still mapped on other nodes. This caused the MariaDB server, which is central to the OpenStack services, to be unavailable. Since the RBD devices were not actually mounted on those nodes, they could easily be unmapped.
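Stale mappings can be found and removed roughly like this (a sketch; the pool/image name and the device path are hypothetical examples):

rbd status volumes/mariadb-volume   # shows which clients still watch the image
rbd showmapped                      # on the node: list mapped RBD devices
rbd unmap /dev/rbd0                 # unmap the stale device reported by showmapped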