Monitoring
We aim to seamlessly connect services within Kubernetes to our monitoring stack for metrics, logs, and traces, using a number of tools for this purpose. This page outlines how data from Kubernetes is structured within the monitoring system and how the different components can be used. For detailed information, please refer to the monitoring system itself.
Organization of the Data
The collected metrics, logs, and traces are aggregated based on the organizationalUnit specified in the request. Read and write access to the data of an organizationalUnit cannot be further subdivided.
Grafana
Access to the data is provided through our Grafana instance, where a Grafana organization can be associated with the data of an organizationalUnit.
Within these organizations, the corresponding data sources for accessing metrics, logs, and traces are created automatically. These sources are named:
- Metrics: `WWU Kube: Mimir`
- Logs: `WWU Kube: Loki`
- Traces: `WWU Kube: Tempo`
The prefix `WWU Kube` can be customized, and we are currently in the process of switching to new default names here.
Metrics
To gather metrics in Kubernetes, we utilize Prometheus in combination with the Prometheus Operator. As a long-term storage solution, we automatically forward the metrics to Mimir.
To gather metrics from services in Kubernetes, several Custom Resource Definitions (CRDs) are available for specifying which metrics should be collected. A more detailed description of these CRDs can be found in the documentation of the Prometheus Operator. The most common case is to use a `ServiceMonitor` to obtain metrics from the pods behind a Service.
For instance, the following resource collects metrics from all pods of a service, as determined by the fields `namespaceSelector`, `selector`, and `endpoints`:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example
  namespace: example
spec:
  endpoints:
    - interval: 30s
      path: /metrics
      port: http-metrics
  namespaceSelector:
    matchNames:
      - example
  selector:
    matchLabels:
      app.kubernetes.io/component: example
      app.kubernetes.io/instance: example
      app.kubernetes.io/name: example
```
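The `ServiceMonitor` above selects Services by their labels and scrapes the named port of the pods behind them. A minimal sketch of a matching Service could look as follows; the port number and the pod selector are placeholders:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example
  namespace: example
  labels:
    # These labels are what the ServiceMonitor's selector.matchLabels matches on.
    app.kubernetes.io/component: example
    app.kubernetes.io/instance: example
    app.kubernetes.io/name: example
spec:
  selector:
    app.kubernetes.io/name: example
  ports:
    # The port name must match spec.endpoints[].port in the ServiceMonitor.
    - name: http-metrics
      port: 8080
      targetPort: 8080
```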
Certain metrics, such as whether a service is up or its CPU and memory consumption, are not produced by the services themselves but are collected by cluster-level services. To enable our tenants to use such metrics as well, we enrich the metrics of an organizationalUnit with selected cluster metrics; however, only metrics that correspond to the namespaces of the organizationalUnit are included.
Currently, this comprises a subset of the metrics from the Kubelet and from the kube-state-metrics service. If additional cluster metrics are required, we will explore the possibility of making them available.
Logs
Logs in Kubernetes are collected using Vector and then sent to the central Loki of our monitoring system.
The logs of all pods in the namespaces of the Kubernetes clusters are automatically collected under the respective organizationalUnit. Technically, this is achieved via the `owner` annotation, which is automatically added to each pod.
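As a rough illustration, the automatically injected metadata might look like the following sketch; the exact annotation key and value format are managed by the platform and may differ:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
  annotations:
    # Added automatically by the platform; the value shown here is purely illustrative.
    owner: my-organizational-unit
spec:
  containers:
    - name: app
      image: registry.example.org/my-app:latest
```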
Retention Time
By default, Loki stores logs for a period of 14 days. However, the log retention time can be increased to either 90 or 180 days by adding the `cloud.uni-muenster.de/vector-retention` annotation to your pod specification and setting it to `90d` or `180d` accordingly.
For a 90-day retention period, add the following annotation:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
  labels:
    app: my-app
  annotations:
    cloud.uni-muenster.de/vector-retention: "90d"
spec:
  containers:
    ...
```
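In practice, pods are usually created by a workload controller rather than defined directly. In that case, set the annotation on the pod template so that it is applied to every pod the controller creates; a sketch with placeholder names, using a Deployment:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
      annotations:
        # Annotations on the pod template are propagated to every pod of the Deployment.
        cloud.uni-muenster.de/vector-retention: "180d"
    spec:
      containers:
        - name: app
          image: registry.example.org/my-app:latest
```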
Traces
Traces from services can also be collected in Kubernetes. For this purpose, we use the OpenTelemetry Collector to ingest data in various formats. The data is then sent to the central Tempo of our monitoring system.
Sending Traces
On each node, an OpenTelemetry Collector agent is running that accepts traces via various ports and protocols. From within the pods, this agent can be reached under the node's host IP using the usual host:port syntax; a configuration sketch follows the table below.
Currently, we support the following interfaces:
| Protocol | Port |
| --- | --- |
| OTLP | tcp/4317 |
| Jaeger gRPC | tcp/14250 |
| Jaeger Thrift | tcp/14268 |
| Jaeger Thrift Binary | udp/6832 |
| Jaeger Thrift Compact | udp/6831 |
| Zipkin | tcp/9411 |
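For example, a pod can obtain the node's IP through the downward API and point an OpenTelemetry SDK at the agent's OTLP port. This is only a sketch: the image name is a placeholder, and the exact environment variables depend on the SDK and protocol in use.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-traced-app
spec:
  containers:
    - name: app
      image: registry.example.org/my-app:latest
      env:
        # Expose the node's IP to the container via the downward API.
        - name: HOST_IP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        # Standard OpenTelemetry SDK variable; 4317 is the OTLP gRPC port from the table above.
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "http://$(HOST_IP):4317"
```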
We ourselves rely predominantly on OTLP wherever possible and recommend it to our customers as well. Tracing appears to be gradually consolidating around this protocol. In the future, we plan to deactivate protocols that have not gained traction.
To facilitate the onboarding process, there are examples available on how to send such traces.