Using the Cluster Autoscaler

Cluster Autoscaler is a tool that automatically adjusts the size of the Kubernetes cluster based on the utilization of Pods and Nodes in your cluster. For more general information about the Cluster Autoscaler, please see the project documentation.

The following instructions are a reproduction of the Cluster API provider specific documentation from the Autoscaler project documentation.

Cluster Autoscaler on Cluster API

The cluster autoscaler on Cluster API uses the cluster-api project to manage the provisioning and de-provisioning of nodes within a Kubernetes cluster.

Kubernetes Version

The cluster-api provider requires Kubernetes v1.16 or greater to run the v1alpha3 version of the API.

Starting the Autoscaler

To enable the Cluster API provider, you must first specify it in the command line arguments to the cluster autoscaler binary. For example:

cluster-autoscaler --cloud-provider=clusterapi

Please note, this example only shows the cloud provider options, you will most likely need other command line flags. For more information you can invoke cluster-autoscaler --help to see a full list of options.

Configuring node group auto discovery

If you do not configure node group auto discovery, cluster autoscaler will attempt to match nodes against any scalable resources found in any namespace and belonging to any Cluster.

Limiting cluster autoscaler to only match against resources in the blue namespace

--node-group-auto-discovery=clusterapi:namespace=blue

Limiting cluster autoscaler to only match against resources belonging to Cluster test1

--node-group-auto-discovery=clusterapi:clusterName=test1

Limiting cluster autoscaler to only match against resources matching the provided labels

--node-group-auto-discovery=clusterapi:color=green,shape=square

These can be mixed and matched in any combination, for example to only match resources in the staging namespace, belonging to the purple cluster, with the label owner=jim:

--node-group-auto-discovery=clusterapi:namespace=staging,clusterName=purple,owner=jim

Connecting cluster-autoscaler to Cluster API management and workload Clusters

You will also need to provide the path to the kubeconfig(s) for the management and workload cluster you wish cluster-autoscaler to run against. To specify the kubeconfig path for the workload cluster to monitor, use the --kubeconfig option and supply the path to the kubeconfig. If the --kubeconfig option is not specified, cluster-autoscaler will attempt to use an in-cluster configuration. To specify the kubeconfig path for the management cluster to monitor, use the --cloud-config option and supply the path to the kubeconfig. If the --cloud-config option is not specified it will fall back to using the kubeconfig that was provided with the --kubeconfig option.

Autoscaler running in a joined cluster using service account credentials

+-----------------+
| mgmt / workload |
| --------------- |
|    autoscaler   |
+-----------------+

Use in-cluster config for both management and workload cluster:

cluster-autoscaler --cloud-provider=clusterapi

Autoscaler running in workload cluster using service account credentials, with separate management cluster

+--------+              +------------+
|  mgmt  |              |  workload  |
|        | cloud-config | ---------- |
|        |<-------------+ autoscaler |
+--------+              +------------+

Use in-cluster config for workload cluster, specify kubeconfig for management cluster:

cluster-autoscaler --cloud-provider=clusterapi \
                   --cloud-config=/mnt/kubeconfig

Autoscaler running in management cluster using service account credentials, with separate workload cluster

+------------+             +----------+
|    mgmt    |             | workload |
| ---------- | kubeconfig  |          |
| autoscaler +------------>|          |
+------------+             +----------+

Use in-cluster config for management cluster, specify kubeconfig for workload cluster:

cluster-autoscaler --cloud-provider=clusterapi \
                   --kubeconfig=/mnt/kubeconfig \
                   --clusterapi-cloud-config-authoritative

Autoscaler running anywhere, with separate kubeconfigs for management and workload clusters

+--------+               +------------+             +----------+
|  mgmt  |               |     ?      |             | workload |
|        |  cloud-config | ---------- | kubeconfig  |          |
|        |<--------------+ autoscaler +------------>|          |
+--------+               +------------+             +----------+

Use separate kubeconfigs for both management and workload cluster:

cluster-autoscaler --cloud-provider=clusterapi \
                   --kubeconfig=/mnt/workload.kubeconfig \
                   --cloud-config=/mnt/management.kubeconfig

Autoscaler running anywhere, with a common kubeconfig for management and workload clusters

+---------------+             +------------+
| mgmt/workload |             |     ?      |
|               |  kubeconfig | ---------- |
|               |<------------+ autoscaler |
+---------------+             +------------+

Use a single provided kubeconfig for both management and workload cluster:

cluster-autoscaler --cloud-provider=clusterapi \
                   --kubeconfig=/mnt/workload.kubeconfig

Enabling Autoscaling

To enable the automatic scaling of components in your cluster-api managed cloud there are a few annotations you need to provide. These annotations must be applied to either MachineSet or MachineDeployment resources depending on the type of cluster-api mechanism that you are using.

There are two annotations that control how a cluster resource should be scaled:

cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size - This specifies the minimum number of nodes for the associated resource group. The autoscaler will not scale the group below this number. Please note that the cluster-api provider will not scale down to, or from, zero unless that capability is enabled (see Scale from zero support).
cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size - This specifies the maximum number of nodes for the associated resource group. The autoscaler will not scale the group above this number.

The autoscaler will monitor any MachineSet or MachineDeployment containing both of these annotations.

Scale from zero support

The Cluster API community has defined an opt-in method for infrastructure providers to enable scaling from zero-sized node groups in the Opt-in Autoscaling from Zero enhancement. As defined in the enhancement, each provider may add support for scaling from zero to their provider, but they are not required to do so. If you are expecting built-in support for scaling from zero, please check with the Cluster API infrastructure providers that you are using.

If your Cluster API provider does not have support for scaling from zero, you may still use this feature through the capacity annotations. You may add these annotations to your MachineDeployments, or MachineSets if you are not using MachineDeployments (it is not needed on both), to instruct the cluster autoscaler about the sizing of the nodes in the node group. At the minimum, you must specify the CPU and memory annotations, these annotations should match the expected capacity of the nodes created from the infrastructure.

For example, if my MachineDeployment will create nodes that have “16000m” CPU, “128G” memory, 2 NVidia GPUs, and can support 200 max pods, the folllowing annotations will instruct the autoscaler how to expand the node group from zero replicas:

apiVersion: cluster.x-k8s.io/v1alpha4
kind: MachineDeployment
metadata:
  annotations:
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "5"
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "0"
    capacity.cluster-autoscaler.kubernetes.io/memory: "128G"
    capacity.cluster-autoscaler.kubernetes.io/cpu: "16"
    capacity.cluster-autoscaler.kubernetes.io/gpu-type: "nvidia.com/gpu"
    capacity.cluster-autoscaler.kubernetes.io/gpu-count: "2"
    capacity.cluster-autoscaler.kubernetes.io/maxPods: "200"

Note the maxPods annotation will default to 110 if it is not supplied. This value is inspired by the Kubernetes best practices Considerations for large clusters.

RBAC changes for scaling from zero

If you are using the opt-in support for scaling from zero as defined by the Cluster API infrastructure provider, you will need to add the infrastructure machine template types to your role permissions for the service account associated with the cluster autoscaler deployment. The service account will need permission to get and list the infrastructure machine templates for your infrastructure provider.

For example, when using the Kubemark provider you will need to set the following permissions:

rules:
  - apiGroups:
    - infrastructure.cluster.x-k8s.io
    resources:
    - kubemarkmachinetemplates
    verbs:
    - get
    - list

Pre-defined labels and taints on nodes scaled from zero

The Cluster API provider currently does not support the addition of pre-defined labels and taints for node groups that are scaling from zero. This work is on-going and will be included in a future release once the API for specifying those labels and taints has been accepted by the community.

Specifying a Custom Resource Group

By default all Kubernetes resources consumed by the Cluster API provider will use the group cluster.x-k8s.io, with a dynamically acquired version. In some situations, such as testing or prototyping, you may wish to change this group variable. For these situations you may use the environment variable CAPI_GROUP to change the group that the provider will use.

Please note that setting the CAPI_GROUP environment variable will also cause the annotations for minimum and maximum size to change. This behavior will also affect the machine annotation on nodes, the machine deletion annotation, and the cluster name label. For example, if CAPI_GROUP=test.k8s.io then the minimum size annotation key will be test.k8s.io/cluster-api-autoscaler-node-group-min-size, the machine annotation on nodes will be test.k8s.io/machine, the machine deletion annotation will be test.k8s.io/delete-machine, and the cluster name label will be test.k8s.io/cluster-name.

Specifying a Custom Resource Version

When determining the group version for the Cluster API types, by default the autoscaler will look for the latest version of the group. For example, if MachineDeployments exist in the cluster.x-k8s.io group at versions v1alpha1 and v1beta1, the autoscaler will choose v1beta1.

In some cases it may be desirable to specify which version of the API the cluster autoscaler should use. This can be useful in debugging scenarios, or in situations where you have deployed multiple API versions and wish to ensure that the autoscaler uses a specific version.

Setting the CAPI_VERSION environment variable will instruct the autoscaler to use the version specified. This works in a similar fashion as the API group environment variable with the exception that there is no default value. When this variable is not set, the autoscaler will use the behavior described above.

Sample manifest

A sample manifest that will create a deployment running the autoscaler is available. It can be deployed by passing it through envsubst, providing these environment variables to set the namespace to deploy into as well as the image and tag to use:

export AUTOSCALER_NS=kube-system
export AUTOSCALER_IMAGE=us.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler:v1.20.0
envsubst < examples/deployment.yaml | kubectl apply -f-

A note on permissions

The cluster-autoscaler-management role for accessing cluster api scalable resources is scoped to ClusterRole. This may not be ideal for all environments (eg. Multi tenant environments). In such cases, it is recommended to scope it to a Role mapped to a specific namespace.

Autoscaling with ClusterClass and Managed Topologies

For users using ClusterClass and Managed Topologies the Cluster Topology controller attempts to set MachineDeployment replicas based on the spec.topology.workers.machineDeployments[].replicas field. In order to use the Cluster Autoscaler this field can be left unset in the Cluster definition.

The below Cluster definition shows which field to leave unset:

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: "my-cluster"
  namespace: default
spec:
  clusterNetwork:
    services:
      cidrBlocks: ["10.128.0.0/12"]
    pods:
      cidrBlocks: ["192.168.0.0/16"]
    serviceDomain: "cluster.local"
  topology:
    class: "quick-start"
    version: v1.24.0
    controlPlane:
      replicas: 1 
    workers:
      machineDeployments:
        - class: default-worker
          name: linux
       ## replicas field is not set. 
       ## replicas: 1

Warning: If the Autoscaler is enabled and the replicas field is set for a MachineDeployment or MachineSet the Cluster may enter a broken state where replicas become unpredictable.

If the replica field is unset in the Cluster definition Autoscaling can be enabled as described above

Special note on GPU instances

As with other providers, if the device plugin on nodes that provides GPU resources takes some time to advertise the GPU resource to the cluster, this may cause Cluster Autoscaler to unnecessarily scale out multiple times.

To avoid this, you can configure kubelet on your GPU nodes to label the node before it joins the cluster by passing it the --node-labels flag. For the CAPI cloudprovider, the label format is as follows:

cluster-api/accelerator=<gpu-type>

<gpu-type> is arbitrary.

The Cluster API Book