Grafana Dashboards
Kubernetes Number of Pods per Node
This dashboard shows the current number of pods on each node in the cluster.
Dashboard Layout
Each box on the dashboard shows the name of a node and the number of pods on that node. This number includes pods with statuses such as “DeadlineExceeded” and “Completed”.
The exception is the Max Pods per Node box. This is a constant, set on creation, for the maximum number of pods allowed per node.
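To cross-check the numbers in the boxes from the command line, per-node counts can be derived from the pod list. The following is a minimal sketch, not the query the dashboard itself uses: count_pods_per_node is a hypothetical helper name, and it assumes each input line has the form "<namespace> <pod> <node>".

```shell
# Hypothetical helper: count pods per node.
# Input (stdin): one pod per line as "<namespace> <pod> <node>".
# Output: "<count> <node>", highest count first.
count_pods_per_node() {
  awk 'NF >= 3 { count[$3]++ } END { for (n in count) print count[n], n }' | sort -rn
}

# One possible way to feed it, assuming kubectl access to the cluster:
#   kubectl get pods -A --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,NODE:.spec.nodeName \
#     | count_pods_per_node
```

Pods that have not been scheduled yet are listed under <none>, so the totals may differ slightly from the dashboard.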
The current architecture does not allow the instance group ID to be shown on the dashboard.
We currently have 2 instance groups:
- Default worker node group (r6i.2xlarge)
- Monitoring node group (r6i.8xlarge)
Because the dashboard is sorted in descending order, the last two boxes are normally from the monitoring node group (2 instances), and the rest are from the default worker node group.
You can run the following command to confirm this and get more information about a node:
kubectl describe node <node_name>
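Since the dashboard cannot show the instance group, the group can be inferred from each node's instance type. The sketch below assumes the two instance types listed above are the only ones in use and that nodes carry the well-known node.kubernetes.io/instance-type label. Both function names are hypothetical.

```shell
# Hypothetical helper: map an instance type to the group names used in this document.
instance_group_for() {
  case "$1" in
    r6i.2xlarge) echo "default worker node group" ;;
    r6i.8xlarge) echo "monitoring node group" ;;
    *)           echo "unknown" ;;
  esac
}

# Input (stdin): "<node> <instance-type>" per line.
# Output: "<node><TAB><group>" per line.
label_node_groups() {
  while read -r node itype; do
    if [ -n "$node" ]; then
      printf '%s\t%s\n' "$node" "$(instance_group_for "$itype")"
    fi
  done
}

# One possible way to feed it (assumes the instance-type label is set on every node):
#   kubectl get nodes --no-headers -L node.kubernetes.io/instance-type \
#     | awk '{ print $1, $NF }' | label_node_groups
```

Nodes with any other instance type, or without the label, are reported as unknown rather than guessed.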
Troubleshooting
Fixing “failed to load dashboard” errors
The OpenSearch alert has reported an error similar to:
Grafana failed to load one or more dashboards - This could prevent new dashboards from being created ⚠️
You can also see errors from the Grafana pod by running:
kubectl logs -n monitoring prometheus-operator-grafana-<pod-id> -f -c grafana
You’ll see an error similar to:
t=2021-12-03T13:37:35+0000 lvl=eror msg="failed to load dashboard from " logger=provisioning.dashboard type=file name=sidecarProvider file=/tmp/dashboards/<MY-DASHBOARD>.json error="invalid character 'c' looking for beginning of value"
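When several dashboards fail at once, it can help to reduce the log to a list of affected files. This is a minimal sketch that assumes the log lines follow the format shown above (a file=<path> field with no spaces in the path). failed_dashboards is a hypothetical helper name.

```shell
# Hypothetical helper: list the dashboard files named in
# "failed to load dashboard" log lines read from stdin.
failed_dashboards() {
  grep 'failed to load dashboard' |
    sed -n 's/.* file=\([^ ]*\).*/\1/p' |  # keep only the file= value
    sed 's|.*/||' |                        # strip the directory part
    sort -u
}

# Possible usage, without -f so the command terminates:
#   kubectl logs -n monitoring prometheus-operator-grafana-<pod-id> -c grafana | failed_dashboards
```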
Identify the namespace and name of the configmap which contains this dashboard name by running:
kubectl get configmaps -A -ojson | jq -r '.items[] | select (.data."<MY-DASHBOARD>.json") | .metadata.namespace + "/" + .metadata.name'
This returns the namespace and name of the configmap that contains the dashboard config. Describe the namespace to find the user’s slack-channel, which is stored as an annotation on the namespace:
kubectl describe namespace <namespace>
Contact the user in the given slack-channel and ask them to fix it. Provide the list of affected dashboards and the error message to help diagnose the issue.
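An error such as "invalid character 'c' looking for beginning of value" means the dashboard file is not valid JSON, so it can be useful to include the exact parse error when contacting the user. The sketch below assumes jq is installed locally and that the dashboard has first been saved to a file. check_dashboard_json is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the file parses as JSON.
# On failure jq prints the parse error (with line and column) to stderr
# and the function returns non-zero.
check_dashboard_json() {
  jq empty "$1"
}

# One possible way to extract the dashboard from its configmap first:
#   kubectl get configmap <configmap-name> -n <namespace> -o json \
#     | jq -r '.data."<MY-DASHBOARD>.json"' > dashboard.json
#   check_dashboard_json dashboard.json
```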
Fixing “duplicate dashboard uid” errors
The OpenSearch alert has reported an error similar to:
Duplicate Grafana dashboard UIDs found
To help in identifying the dashboards, you can exec into the Grafana pod as follows:
kubectl exec -n monitoring --stdin --tty prometheus-operator-grafana-xxxxxx-yyyyy --container grafana-sc-dashboard -- sh
Then cd into the /tmp/dashboards directory and run the find command below:
cd /tmp/dashboards/
for i in $(find ./ -type f); do tail $i; done|grep '"uid": '|cut -f2 -d':'|sort|uniq -c|grep -v ' 1 '
This will output any duplicate UIDs and the number of occurrences of each.
For example:
2 "[duplicate-dashboard-uid]",
...
Now you can grep the dashboard JSON files within the directory to identify which dashboards contain the duplicate UID:
grep -Rnw . -e "[duplicate-dashboard-uid]"
./my-test-dashboard.json: "uid": "duplicate-dashboard-uid",
./my-test-dashboard-2.json: "uid": "duplicate-dashboard-uid",
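The two steps above (finding duplicate UIDs, then finding the files that contain them) can be combined. The following is a minimal sketch under the same assumption as the loop above, namely that the dashboard-level "uid" appears in the last 10 lines of each file. list_duplicate_uids is a hypothetical helper name, and it only looks at files ending in .json.

```shell
# Hypothetical helper: print each duplicated UID followed by the files that declare it.
# Argument: directory to scan (defaults to /tmp/dashboards).
list_duplicate_uids() {
  dir="${1:-/tmp/dashboards}"
  find "$dir" -type f -name '*.json' | while IFS= read -r f; do
    # Extract UID values from the tail of the file, one per line
    tail "$f" | sed -n 's/.*"uid": *"\([^"]*\)".*/\1/p' | while IFS= read -r uid; do
      printf '%s\t%s\n' "$uid" "$f"
    done
  done | sort | awk -F '\t' '
    { count[$1]++; files[$1] = files[$1] " " $2 }
    END { for (u in count) if (count[u] > 1) print u ":" files[u] }
  '
}

# Possible usage inside the grafana-sc-dashboard container:
#   list_duplicate_uids /tmp/dashboards
```

As with the original loop, a panel or datasource "uid" that happens to sit in the last 10 lines of a file would also be counted, so treat the output as a list of candidates to verify.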
Identify the namespace and name of the configmap which contains this dashboard name by running:
kubectl get configmaps -A -ojson | jq -r '.items[] | select (.data."my-test-dashboard.json") | .metadata.namespace + "/" + .metadata.name'
This returns the namespace and name of the configmap that contains the dashboard config. Describe the namespace to find the user’s slack-channel, which is stored as an annotation on the namespace:
kubectl describe namespace <namespace>
Contact the user in the given slack-channel and ask them to fix it. Provide the list of affected dashboards and the error message to help diagnose the issue.