Troubleshooting
Node bootstrap failures when using CABPK with cloud-init
Failures during Node bootstrapping can have a lot of different causes. For example, Cluster API resources might be misconfigured or there might be problems with the network. The following steps describe how bootstrap failures can be troubleshooted systematically.
- Access the Node via ssh.
- Take a look at cloud-init logs via
less /var/log/cloud-init-output.log
orjournalctl -u cloud-init --since "1 day ago"
. (Note: cloud-init persists logs of the commands it executes (like kubeadm) only after they have returned.) - It might also be helpful to take a look at
journalctl --since "1 day ago"
. - If you see that kubeadm times out waiting for the static Pods to come up, take a look at:
- containerd:
crictl ps -a
,crictl logs
,journalctl -u containerd
- Kubelet:
journalctl -u kubelet --since "1 day ago"
(Note: it might be helpful to increase the Kubelet log level by e.g. setting--v=8
viasystemctl edit --full kubelet && systemctl restart kubelet
)
- containerd:
- If Node bootstrapping consistently fails and the kubeadm logs are not verbose enough, the
kubeadm
verbosity can be increased viaKubeadmConfigSpec.Verbosity
.
Labeling nodes with reserved labels such as node-role.kubernetes.io
fails with kubeadm error during bootstrap
Self-assigning Node labels such as node-role.kubernetes.io
using the kubelet --node-labels
flag
(see kubeletExtraArgs
in the CABPK examples)
is not possible due to a security measure imposed by the
NodeRestriction
admission controller
that kubeadm enables by default.
Assigning such labels to Nodes must be done after the bootstrap process has completed:
kubectl label nodes <name> node-role.kubernetes.io/worker=""
For convenience, here is an example one-liner to do this post installation
# Kubernetes 1.19 (kubeadm 1.19 sets only the node-role.kubernetes.io/master label)
kubectl get nodes --no-headers -l '!node-role.kubernetes.io/master' -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}' | xargs -I{} kubectl label node {} node-role.kubernetes.io/worker=''
# Kubernetes >= 1.20 (kubeadm >= 1.20 sets the node-role.kubernetes.io/control-plane label)
kubectl get nodes --no-headers -l '!node-role.kubernetes.io/control-plane' -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}' | xargs -I{} kubectl label node {} node-role.kubernetes.io/worker=''
Cluster API with Docker
When provisioning workload clusters using Cluster API with the Docker infrastructure provider, provisioning might be stuck:
-
if there are stopped containers on your machine from previous runs. Clean unused containers with docker rm -f .
-
if the docker space on your disk is being exhausted
- Run docker system df to inspect the disk space consumed by Docker resources.
- Run docker system prune --volumes to prune dangling images, containers, volumes and networks.
Cluster API with Docker Desktop - “too many open files”
When using Cluster API and the Docker infrastructure provider on MacOS with Docker Desktop an error of “too many open files” has been observed.
One solution to this issue is to increase the maximum inotify file watch settings in the Docker Desktop VM.
To do so:
- Enter the Docker Desktop VM
nc -U ~/Library/Containers/com.docker.docker/Data/debug-shell.sock
- Increase the inotify limits using sysctl
sysctl fs.inotify.max_user_watches=1048576
sysctl fs.inotify.max_user_instances=8192
- Exit the Docker Desktop VM
exit
Note: This error was observed in Docker 4.3. An alternative solution is to stick with an older version of Docker Desktop while this issue is being resolved. An issue is currently open on the Docker Desktop repository.
Failed clusterctl init - ‘failed to get cert-manager object’
When using older versions of Cluster API 0.4 and 1.0 releases - 0.4.6, 1.0.3 and older respectively - Cert Manager may not be downloadable due to a change in the repository location. This will cause clusterctl init
to fail with the error:
$ cluster-api % clusterctl init --infrastructure docker
Fetching providers
Installing cert-manager Version="v1.7.2"
Error: action failed after 10 attempts: failed to get cert-manager object /, Kind=, /: Object 'Kind' is missing in 'unstructured object has no kind'
This error was fixed in more recent Cluster API releases on the 0.4 and 1.0 release branches. The simplest way to resolve the issue is to upgrade to a newer version of Cluster API for a given release. For who need to continue using an older release it is possible to override the repository used by clusterctl init
in the clusterctl config file. The default location of this file is in ~/.cluster-api/clusterctl.yaml
.
To do so add the following to the file:
cert-manager:
url: "https://github.com/cert-manager/cert-manager/releases/latest/cert-manager.yaml"
Alternatively a Cert Manager yaml file can be placed in the clusterctl overrides layer which is by default in $HOME/.cluster-api/overrides
. A Cert Manager yaml file can be placed at $(HOME)/.cluster-api/overrides/cert-manager/v1.7.2/cert-manager.yaml
More information on the clusterctl config file can be found at its page in the book