Monitoring with Prometheus

Monitoring Flux with Prometheus Operator and Grafana.

This guide walks you through configuring monitoring for the Flux control plane.

Flux uses kube-prometheus-stack to provide a monitoring stack made out of:

Prometheus Operator - manages Prometheus clusters atop Kubernetes
Prometheus - collects metrics from the Flux controllers and Kubernetes API
Grafana dashboards - displays the Flux control plane resource usage and reconciliation stats
kube-state-metrics - generates metrics about the state of the Kubernetes objects

Install the kube-prometheus-stack

To install the monitoring stack with flux, first register the toolkit Git repository on your cluster:

flux create source git monitoring \
  --interval=30m \
  --url=https://github.com/fluxcd/flux2 \
  --branch=main

Then apply the manifests/monitoring/kube-prometheus-stack kustomization:

flux create kustomization monitoring-stack \
  --interval=1h \
  --prune=true \
  --source=monitoring \
  --path="./manifests/monitoring/kube-prometheus-stack" \
  --health-check="Deployment/kube-prometheus-stack-operator.monitoring" \
  --health-check="Deployment/kube-prometheus-stack-grafana.monitoring"

The above Kustomization will install the kube-prometheus-stack in the monitoring namespace.

Prometheus Configuration

Note that the above configuration is not suitable for production. In order to configure long term storage for metrics and highly availability for Prometheus consult the Helm chart documentation.

Install Flux Grafana dashboards

Note that the Flux controllers expose the /metrics endpoint on port 8080. When using Prometheus Operator you need a PodMonitor object to configure scraping for the controllers.

Apply the manifests/monitoring/monitoring-config containing the PodMonitor and the ConfigMap with Flux’s Grafana dashboards:

flux create kustomization monitoring-config \
  --interval=1h \
  --prune=true \
  --source=monitoring \
  --path="./manifests/monitoring/monitoring-config"

You can access Grafana using port forwarding:

kubectl -n monitoring port-forward svc/kube-prometheus-stack-grafana 3000:80

To log in to the Grafana dashboard, you can use the default credentials from the kube-prometheus-stack chart:

username: admin
password: prom-operator

Flux dashboards

Control plane dashboard http://localhost:3000/d/flux-control-plane:

Cluster reconciliation dashboard http://localhost:3000/d/flux-cluster:

If you wish to use your own Prometheus and Grafana instances, then you can import the dashboards from GitHub.

Metrics

For each toolkit.fluxcd.io kind, the controllers expose a gauge metric to track the Ready condition status, and a histogram with the reconciliation duration in seconds.

Ready status metrics:

gotk_reconcile_condition{kind, name, namespace, type="Ready", status="True"}
gotk_reconcile_condition{kind, name, namespace, type="Ready", status="False"}
gotk_reconcile_condition{kind, name, namespace, type="Ready", status="Unknown"}
gotk_reconcile_condition{kind, name, namespace, type="Ready", status="Deleted"}

Suspend status metrics:

gotk_suspend_status{kind, name, namespace}

Time spent reconciling:

gotk_reconcile_duration_seconds_bucket{kind, name, namespace, le}
gotk_reconcile_duration_seconds_sum{kind, name, namespace}
gotk_reconcile_duration_seconds_count{kind, name, namespace}

Alert manager example:

groups:
- name: GitOpsToolkit
  rules:
  - alert: ReconciliationFailure
    expr: max(gotk_reconcile_condition{status="False",type="Ready"}) by (namespace, name, kind) + on(namespace, name, kind) (max(gotk_reconcile_condition{status="Deleted"}) by (namespace, name, kind)) * 2 == 1
    for: 10m
    labels:
      severity: page
    annotations:
      summary: '{{ $labels.kind }} {{ $labels.namespace }}/{{ $labels.name }} reconciliation has been failing for more than ten minutes.'