Terraform EKS module upgrade from v17.x to v18.x
Bill Hegazy, July 20, 2022. #AWS #Terraform #EKS

If you're like me and you use the awesome terraform-aws-eks module to manage your EKS clusters, then you should know that there are many breaking changes when upgrading the module from v17.x to v18.x. In this guide I will share the steps I took to ease the upgrade a little bit (we have more than 12 clusters with a similar config), and I hope it helps someone.
I highly recommend that you also read the official guide UPGRADE-18.0.md.
1) Change module version and variables
Check the variable changes here. Everyone's configuration is different, but in my case I ended up with the following:
module "eks" {
  source = "terraform-aws-modules/eks/aws"
  # check releases for the latest version
  # https://github.com/terraform-aws-modules/terraform-aws-eks/releases
  version = "18.26.2"

  # Needed for the EKS module upgrade
  # https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/docs/UPGRADE-18.0.md#upgrade-from-v17x-to-v18x
  prefix_separator                    = ""
  iam_role_name                       = var.cluster_name
  cluster_security_group_name         = var.cluster_name
  cluster_security_group_description  = "EKS cluster security group."

  # Add this to avoid issues with the AWS Load Balancer Controller
  # "error":"expect exactly one securityGroup tagged kubernetes.io/cluster/xxx
  # https://github.com/terraform-aws-modules/terraform-aws-eks/issues/1986
  node_security_group_tags = {
    "kubernetes.io/cluster/${var.cluster_name}" = null
  }

  cluster_name                    = local.name
  cluster_version                 = local.cluster_version
  cluster_endpoint_private_access = true
  cluster_endpoint_public_access  = true

  cluster_addons = {
    coredns = {
      resolve_conflicts = "OVERWRITE"
    }
    kube-proxy = {
      resolve_conflicts = "OVERWRITE"
    }
  }

  cluster_encryption_config = [
    {
      provider_key_arn = aws_kms_key.eks.arn
      resources        = ["secrets"]
    }
  ]

  vpc_id = module.vpc.vpc_id
  # Rename subnets to subnet_ids
  subnet_ids = module.vpc.private_subnets

  # Rename node_groups_defaults to eks_managed_node_group_defaults
  eks_managed_node_group_defaults = {
    ami_type                   = "AL2_x86_64"
    instance_types             = ["m5.large", "m5a.large", "t3.large", "m5.xlarge"]
    iam_role_attach_cni_policy = true

    update_config = {
      max_unavailable_percentage = 50
    }

    block_device_mappings = {
      xvda = {
        device_name = "/dev/xvda"
        ebs = {
          volume_size = 100
          volume_type = "gp3"
          encrypted   = true
          kms_key_id  = aws_kms_key.ebs.arn
        }
      }
    }

    metadata_options = {
      http_endpoint               = "enabled"
      http_tokens                 = "required"
      http_put_response_hop_limit = 2
      instance_metadata_tags      = "disabled"
    }
  }

  # Rename node_groups to eks_managed_node_groups
  # the variables in the submodule also changed, so be sure to rename them!
  eks_managed_node_groups = {
    default = {
      name            = "default"
      use_name_prefix = true
      subnet_ids      = module.vpc.private_subnets

      desired_size = 3
      max_size     = 10
      min_size     = 3

      labels = {
        environment = local.environment
        capacity    = "on_demand"
      }

      tags = local.tags
    },
    spot = {
      name            = "spot"
      use_name_prefix = true
      capacity_type   = "SPOT"
      instance_types  = ["m5a.xlarge", "m5.xlarge", "r5.xlarge", "r5a.xlarge"]
      subnet_ids      = module.vpc.private_subnets

      desired_size = 3
      max_size     = 15
      min_size     = 3

      cluster_version = local.cluster_version

      labels = {
        environment = local.environment
        capacity    = "spot"
      }

      tags = local.tags
    }
  }

  tags = local.tags
}
If your iam_role_name variable is NOT the cluster name, get the cluster IAM role name using the AWS CLI and pass that value instead.
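For example, something like this should print the role ARN attached to the cluster control plane (the cluster name is a placeholder); the value you need for iam_role_name is the last path segment of that ARN:

# Print the IAM role ARN used by the cluster control plane
aws eks describe-cluster --name my-cluster --query "cluster.roleArn" --output text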
2) Do a terraform plan to see what changes
You should see a lot of resource changes, including, for example, IAM resources, node groups, and even a cluster control plane replacement (we will fix that in the next step).
Since there are a lot of changes and we wanted to keep the upgrade simple, the best approach is to remove the existing node groups and related resources from the Terraform state. This suggestion also came from users in the related GitHub issue.
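Terraform needs to download the new module version before it can plan, so a minimal sequence looks something like this:

# Pull the new module version, then review what Terraform wants to change
terraform init -upgrade
terraform plan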
3) Terraform state migration and deletion
Everyone's setup is different, so be careful when playing with the Terraform state file, as your state might get corrupted. Always back up the state file first! The steps I took are outlined as comments below; example commands follow the list.
# Backup terraform state first
# Remove node_groups from tf state
# Rename the cluster iam role tf resource to new name
# Remove policy attachment for node groups from tf state
# Remove node groups security group from state
# Remove node groups aws_security_group_rule resources from tf state
# Remove cluster aws_security_group_rule resources from tf state
# Remove IAM Role of node groups from tf state
# If you are managing the aws-auth configmap using the EKS module,
# then remove the aws-auth configmap from the tf state, since the module dropped support for it
# If you use add-ons, then remove them from the tf state as well
# Then import the add-ons at the module's new addresses
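The exact commands depend on your configuration and on the addresses the v17 module created in your state, so what follows is only a sketch of the outline above: the resource addresses, the node group keys (default, spot), the standalone add-on resource names, and the my-cluster placeholder are all assumptions based on a setup like mine. Run terraform state list first and adapt every address to what you actually have.

# Back up the state before touching anything
terraform state pull > pre-upgrade-backup.tfstate

# List what the module manages so you can adapt the addresses below
terraform state list | grep module.eks

# Remove the old node groups (v17 created them in a node_groups submodule)
terraform state rm 'module.eks.module.node_groups.aws_eks_node_group.workers["default"]'
terraform state rm 'module.eks.module.node_groups.aws_eks_node_group.workers["spot"]'

# Rename the cluster IAM role resource to its new address
terraform state mv 'module.eks.aws_iam_role.cluster[0]' 'module.eks.aws_iam_role.this[0]'

# Remove the node group IAM policy attachments
terraform state rm 'module.eks.aws_iam_role_policy_attachment.workers_AmazonEKSWorkerNodePolicy[0]'
terraform state rm 'module.eks.aws_iam_role_policy_attachment.workers_AmazonEKS_CNI_Policy[0]'
terraform state rm 'module.eks.aws_iam_role_policy_attachment.workers_AmazonEC2ContainerRegistryReadOnly[0]'

# Remove the node group security group, all security group rules (node and cluster), and the node IAM role
terraform state rm 'module.eks.aws_security_group.workers[0]'
terraform state list | grep 'module.eks.aws_security_group_rule' | xargs -n 1 terraform state rm
terraform state rm 'module.eks.aws_iam_role.workers[0]'

# Remove the aws-auth configmap if the module was managing it
terraform state rm 'module.eks.kubernetes_config_map.aws_auth[0]'

# Remove standalone add-on resources (names are placeholders for whatever yours are called)
# and import them at the module's new addresses
terraform state rm 'aws_eks_addon.coredns' 'aws_eks_addon.kube_proxy'
terraform import 'module.eks.aws_eks_addon.this["coredns"]' my-cluster:coredns
terraform import 'module.eks.aws_eks_addon.this["kube-proxy"]' my-cluster:kube-proxy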
4) Test terraform plan again
Make sure you DO NOT see any of the following:
- The cluster must be replaced.
- The old node groups must be destroyed.
- Node group related resources (security groups, security group rules, the old node IAM role) must be destroyed.
You should see the following:
- Some cluster policy changes.
- New node groups being added.
- New node group related resources being created.
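On a large plan it is easy to miss a destructive change. One rough sanity check (not a substitute for actually reading the plan) is to grep the uncolored plan output for Terraform's destructive-action wording:

# Should print nothing destructive for the cluster or old node group resources
terraform plan -no-color | grep -E 'must be replaced|will be destroyed'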
5) Proceed with terraform apply
You can now start applying the changes with terraform apply.
After the apply is complete, you should see the new nodes joining the cluster and working as expected.
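Assuming your kubeconfig already points at this cluster, a quick way to confirm is to watch the nodes register after the apply:

terraform apply

# Watch the new nodes join the cluster and become Ready
kubectl get nodes --watch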
6) Manually delete the old node group and related resources
Now you can go ahead and delete the old node group manually, and the pods will be restarted on the new node groups.
Other related resources should be manually deleted as well, such as the old node group security groups.
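You can do this from the console, or with AWS CLI calls along these lines; the cluster name, old node group name, and security group ID below are placeholders for your pre-upgrade resources:

# Delete the old node group; pods get rescheduled onto the new node groups
aws eks delete-nodegroup --cluster-name my-cluster --nodegroup-name old-default
aws eks wait nodegroup-deleted --cluster-name my-cluster --nodegroup-name old-default

# Remove the old node group security group once nothing references it anymore
aws ec2 delete-security-group --group-id sg-0123456789abcdef0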
Over to you
Like this post? Consider following me on Medium billhegazy.
If you have questions or want to reach out, add me on LinkedIn.