HowTo: Migrate your EKS cluster to Graviton2

In this article, I’m going to show you how to migrate an existing EKS cluster to Graviton2 workers in a few simple steps. I’ll assume you already have an EKS cluster running on x86-based infrastructure (Intel and/or AMD instance types).

Step 1: Add Graviton2 worker nodes to your cluster.

This depends on how your EKS cluster is configured; you could have one of the following scenarios:

  • NodeGroup workers: a typical EKS setup where worker nodes are deployed in nodegroups.
  • No nodegroups / autoscaled workers: worker nodes are managed by something like Karpenter, so we need to tell Karpenter how to add the Graviton2 nodes.

NodeGroup setup

In this case, you will have to add a new nodegroup that supports Graviton2 instances. The exact steps depend on how you created your nodegroups, but the approach is the same. As an example, I will show how to do this with the open-source Terraform EKS module (terraform-aws-modules/eks); this is a snippet of our cluster definition:

module "eks_cluster" {
  source  = "terraform-aws-modules/eks/aws"

  cluster_name      = "my-cluster"
  cluster_version   = "1.28"
  subnets           = ["subnet-xxxxxxx", "subnet-yyyyyyy"] 
  vpc_id            = "vpc-xxxxxxx"
  manage_aws_auth   = true
  node_groups = [
    {
      desired_capacity = 2
      max_capacity     = 10
      min_capacity     = 1
      instance_type    = "m6i.xlarge"
      name             = "intel-node-group"
    },
...
  ]
}

In this case, if we want to add a Graviton2 nodegroup, we just have to define a new one similar to the Intel one but with Graviton2 instances. For example:

    "arm64-node-group" = {
      desired_capacity = 2
      max_capacity     = 10
      min_capacity     = 1
      instance_types   = ["m6g.xlarge"]
    }

and reapply the terraform for the cluster definition.

Karpenter

If your EKS cluster is using Karpenter to manage worker nodes, then it does not use nodegroups, and we need to tell Karpenter itself to provision Graviton2 nodes for the cluster. I’ve written a dedicated article about this use case that you can follow here.
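In short, the key change is allowing the arm64 architecture in the object Karpenter uses to provision nodes. Here is a minimal sketch, assuming the NodePool API from recent Karpenter releases (older releases use a Provisioner with the same kubernetes.io/arch requirement); the names and the EC2NodeClass reference are placeholders:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        # allow Karpenter to launch both Graviton2 (arm64) and Intel/AMD (amd64) nodes
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64", "amd64"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default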

Step 2: Convert your workloads to Graviton2.

Now that we have Graviton2 worker nodes in the cluster, we can convert our workloads to arm64. There are two kinds of workloads we need to handle:

  • DaemonSets: special Kubernetes workloads that run on every single node. These are typically monitoring or logging agents, security agents, and other cluster-wide infrastructure software that we usually don’t write ourselves but install from open-source projects or commercial vendors.
  • Custom services: our own software that we build and run in our clusters.

DaemonSets

If any of your DaemonSets don’t support arm64, you will notice it immediately after you add your first Graviton2 worker to the cluster, because their pods will fail to run there. Most of the time we install these with Helm charts from an open-source repository or a vendor.

For this to work, all the container images used by the DaemonSet must be available for both amd64 and arm64. You can check this on the image’s page in Docker Hub, Amazon ECR, etc. If arm64 is not supported, we will probably need to work with the vendor or the open-source maintainers to have them build multi-arch images and make them available.
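You can also check from the command line which architectures a published image supports; a quick sketch, assuming Docker is installed and using a placeholder image name:

# list the platforms published for an image; look for arm64 in the output
docker manifest inspect docker.io/project/agent:latest | grep -i architecture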

Assuming the registry hosts a multi-arch image that includes arm64, each EKS node will automatically pull the variant matching its own architecture, without any special configuration. Very cool! This is the happy path: upstream already supports arm64, or we can get support added upstream, and we don’t have to change anything on our end.

Unfortunately, this doesn’t always work. In that case, you will have to build the multi-arch images yourself. Hopefully the project publishes its Dockerfile, and it is not too difficult to rebuild the image for multiple architectures with something like docker buildx (there is a sketch of this after the snippet below). Once we have the image, we can push it to an internal repository and use it in the Helm chart instead of the official image. Most Helm charts let you override the image with your own custom image; look for that in the values.yaml. This would look something like:

image:
  # -- image registry
  registry: "docker.io"
  # -- Image repository.
  repository: project/agent
  # -- image tag: latest or defined.
  tag: null
  # -- image pull policy.
  pullPolicy: IfNotPresent
  # -- Optional set of image pull secrets.
  pullSecrets: []

Here we would replace the registry with our own (for example, a private ECR registry) and the repository with our custom repository in ECR.
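As a rough sketch of the whole flow, assuming Docker with the buildx plugin and a private ECR repository (the account ID, region, repository name, and tag below are placeholders; on an x86 machine, cross-building arm64 also requires QEMU emulation to be set up):

# authenticate to the private ECR registry
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

# one-time setup: create a builder that can target multiple platforms
docker buildx create --use

# build and push a multi-arch (amd64 + arm64) image from the project's Dockerfile
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t 123456789012.dkr.ecr.us-east-1.amazonaws.com/project/agent:v1.0.0 \
  --push .

and then point the chart at it in values.yaml:

image:
  registry: "123456789012.dkr.ecr.us-east-1.amazonaws.com"
  repository: project/agent
  tag: "v1.0.0"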

Custom software

The rest of the deployments running in the EKS cluster should be our own software, which we have full control over and can compile and build as multi-arch images. During the transition phase, while some deployments still don’t have arm64 images, we need to make sure they use a node selector that keeps them on Intel nodes; otherwise they will fail to start on Graviton2 nodes. We can do this with something like:

      nodeSelector:
        intent: apps
        kubernetes.io/arch: amd64

inside the pod template of the workload (the intent: apps label is just an example of an existing custom label; the relevant part is kubernetes.io/arch). This will force the workload to run on the original Intel worker nodes.

Once we have built an arm64 (or multi-arch) image, we can simply remove that selector and allow the workload to be scheduled on any node type, or change it to kubernetes.io/arch: arm64 to force it onto Graviton2 nodes, as shown below.
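For example, to pin a workload to Graviton2 nodes instead:

      nodeSelector:
        intent: apps
        kubernetes.io/arch: arm64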

Depending on how many services you have deployed in your cluster and how much their build pipelines differ, this can be a tedious process until you have switched all of them to Graviton2 nodes, but it should be worth it, bringing significant performance and cost savings along the way.

Once you have migrated all your workloads to Graviton2, you can retire the Intel nodegroups, and after that your EKS cluster will be running entirely on Graviton2. Boom!
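Before deleting the Intel nodegroup, it’s good practice to cordon and drain the remaining Intel nodes so any leftover pods are rescheduled gracefully onto the Graviton2 nodes; a sketch, using the standard architecture node label:

# cordon and drain the remaining Intel/AMD nodes so pods reschedule onto Graviton2 nodes
kubectl cordon -l kubernetes.io/arch=amd64
kubectl drain -l kubernetes.io/arch=amd64 --ignore-daemonsets --delete-emptydir-data

Once they are drained, remove the Intel nodegroup from the cluster definition and apply the change.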
