EKS — Security Groups for Pods
The problem
Containerised applications running in Kubernetes frequently require access to other services within the cluster as well as to external AWS services, such as Amazon RDS or Amazon ElastiCache for Redis. On AWS, network-level access between services is typically controlled with EC2 security groups. Until recently, you could only assign security groups at the node level, so every pod on a node shared the same security groups.
The saviour
Security groups for pods makes it easy to achieve network security compliance by running applications with varying network security requirements on shared compute resources. Network security rules that span pod to pod and pod to external AWS service traffic can be defined in a single place with EC2 security groups, and applied to applications with Kubernetes native APIs.
Considerations
The following is a list of considerations before embarking on this implementation:
- Your cluster must be running at least Kubernetes version 1.17 with platform version `eks.3`.
- Traffic to and from pods with associated security groups is not subjected to Calico network policy enforcement; it is limited to Amazon EC2 security group enforcement only. This will likely be addressed in a future release of Calico.
- Security groups for pods only work if the nodes themselves are in a private subnet configured with a NAT gateway or instance. Pods with assigned security groups deployed to public subnets are not able to access the internet.
- Kubernetes services of type `NodePort` and `LoadBalancer` using instance targets with `externalTrafficPolicy` set to `Local` are not supported with pods that you assign security groups to.
- This feature is only available on a selection of instance types within AWS. The full list can be found at the bottom of this page.
The implementation
So the bit you all came to read! How on earth do you implement this?
1. Cluster IAM permissions
Firstly we need to associate the following IAM policy with the IAM role attached to your EKS cluster.
arn:aws:iam::aws:policy/AmazonEKSVPCResourceController
This can be done via the following snippet of Terraform code:
locals {
  cluster_policies = [
    "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy",
    "arn:aws:iam::aws:policy/AmazonEKSServicePolicy",
    "arn:aws:iam::aws:policy/AmazonEKSVPCResourceController",
  ]
}

resource "aws_iam_role_policy_attachment" "k8s_master_attach" {
  for_each   = toset(local.cluster_policies)
  role       = aws_iam_role.k8s_master.name
  policy_arn = each.value
}
2. AWS CNI adjustments
The following changes need to be made to the AWS CNI DaemonSet:
- As you are likely leveraging liveness and readiness probes within your deployments, we need to set `DISABLE_TCP_EARLY_DEMUX` to `true` so that the `kubelet` can connect to pods on branch network interfaces via TCP.
- We need to set `ENABLE_POD_ENI` to `true`. This will add the label `vpc.amazonaws.com/has-trunk-attached` to the node if it is possible to attach an additional ENI, and it is an essential configuration required to enable security groups for pods.
initContainers:
  - name: aws-vpc-cni-init
    env:
      - name: DISABLE_TCP_EARLY_DEMUX
        value: "true"
containers:
  - name: aws-node
    env:
      - name: ENABLE_POD_ENI
        value: "true"
You can validate that these changes have been successful by executing the following command:
kubectl get nodes -o wide -l vpc.amazonaws.com/has-trunk-attached=true
3. Security Group setup
The configuration of the security groups is the most important step for this whole thing to work.
Let’s take the scenario whereby we have a specific set of pods that require access to a specific RDS database within our setup.
I want to use the diagram below to aid our implementation:
Key:
RQ1 — The pod security group requires outbound internet access.
RQ2 — Our ingress nodes need inbound access to our pod security group to route external traffic.
RQ3 — The pod security group needs outbound access to the RDS security group to allow communication with the database.
RQ4 — The pod security group has to resolve DNS by using the worker security group.
RQ5 — The worker security group needs access to the pod security group for liveness and readiness probes to work.
Terraform code
The following snippets are taken from our `rds-mysql` module.
RDS security group
# ==========================
# Security for RDS cluster
# ==========================
resource "aws_security_group" "rds" {
name = "${var.name}"
description = "Default security group for ${var.name}."
vpc_id = var.vpc_id
tags = {
Name = "${var.name}"
Region = data.aws_region.current.name
Platform = var.platform
CreatedBy = "Terraform"
}
}
resource "aws_security_group_rule" "rds_ingress_from_pods" {
description = "Ingress: Allow access from ${aws_security_group.pods.name}."
from_port = 3306
protocol = "tcp"
security_group_id = aws_security_group.rds.id
source_security_group_id = aws_security_group.pods.id
to_port = 3306
type = "ingress"
}
resource "aws_security_group_rule" "rds_ingress_to_self" {
description = "Ingress: Allow access to itself."
self = true
from_port = 3306
protocol = "tcp"
security_group_id = aws_security_group.rds.id
to_port = 3306
type = "ingress"
}
Pod security group
# ==========================
# Security for pods to leverage
# ==========================
resource "aws_security_group" "pods" {
name = "${var.name}-for-pods"
description = "Default security group for ${var.name}-for-pods."
vpc_id = var.vpc_id
tags = {
Name = "${var.name}-for-pods"
Region = data.aws_region.current.name
Platform = var.platform
CreatedBy = "Terraform"
}
}
resource "aws_security_group_rule" "pods_ingress_all_tcp_from_worker_security_group" {
description = "Ingress: Allow all TCP from the worker security group (required for probes to work)."
from_port = 0
protocol = "tcp"
security_group_id = aws_security_group.pods.id
source_security_group_id = var.security_groups["k8s_worker"]
to_port = 65535
type = "ingress"
}
resource "aws_security_group_rule" "pods_ingress_all_tcp_from_ingress_security_group" {
description = "Ingress: Allow all TCP from the ingress security group (required for ingress to work)."
from_port = 0
protocol = "tcp"
security_group_id = aws_security_group.pods.id
source_security_group_id = var.security_groups["k8s_ingress"]
to_port = 65535
type = "ingress"
}
resource "aws_security_group_rule" "pods_egress_to_rds" {
description = "Egress: Allow access to ${aws_security_group.rds.name}."
from_port = 3306
protocol = "tcp"
security_group_id = aws_security_group.pods.id
source_security_group_id = aws_security_group.rds.id
to_port = 3306
type = "egress"
}
resource "aws_security_group_rule" "pods_egress_https_all" {
description = "Egress: HTTPS"
from_port = 443
protocol = "tcp"
security_group_id = aws_security_group.pods.id
cidr_blocks = ["0.0.0.0/0"]
to_port = 443
type = "egress"
}
resource "aws_security_group_rule" "pods_egress_http_all" {
description = "Egress: HTTP"
from_port = 80
protocol = "tcp"
security_group_id = aws_security_group.pods.id
cidr_blocks = ["0.0.0.0/0"]
to_port = 80
type = "egress"
}
resource "aws_security_group_rule" "pods_egress_dns_tcp_to_worker_security_group" {
description = "Egress: Allow DNS resolution (TCP) via worker security group"
from_port = 53
protocol = "tcp"
security_group_id = aws_security_group.pods.id
source_security_group_id = var.security_groups["k8s_worker"]
to_port = 53
type = "egress"
}
resource "aws_security_group_rule" "pods_egress_dns_udp_to_worker_security_group" {
description = "Egress: Allow DNS resolution (UDP) via worker security group"
from_port = 53
protocol = "udp"
security_group_id = aws_security_group.pods.id
source_security_group_id = var.security_groups["k8s_worker"]
to_port = 53
type = "egress"
}
Worker security group updates
resource "aws_security_group_rule" "worker_ingress_dns_tcp_from_pods_security_group" {
description = "Ingress: Allow DNS resolution (via TCP) from ${aws_security_group.pods.name}."
from_port = 53
protocol = "tcp"
security_group_id = var.security_groups["k8s_worker"]
source_security_group_id = aws_security_group.pods.id
to_port = 53
type = "ingress"
}
resource "aws_security_group_rule" "worker_ingress_dns_udp_from_pods_security_group" {
description = "Ingress: Allow DNS resolution (via UDP) from ${aws_security_group.pods.name}."
from_port = 53
protocol = "udp"
security_group_id = var.security_groups["k8s_worker"]
source_security_group_id = aws_security_group.pods.id
to_port = 53
type = "ingress"
}
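With the security groups in place, the final piece is to tell EKS which pods should receive the pod security group. This is done with the `SecurityGroupPolicy` custom resource provided by the VPC resource controller. A minimal sketch — the name, namespace, labels, and security group ID below are illustrative placeholders:

```yaml
apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: my-app            # illustrative
  namespace: my-namespace # illustrative
spec:
  podSelector:
    matchLabels:
      app: my-app         # pods carrying this label receive the security group
  securityGroups:
    groupIds:
      - sg-0123456789abcdef0 # e.g. the ID of aws_security_group.pods
```

Any pod scheduled in that namespace matching the `podSelector` will then be attached to a branch network interface with the listed security group applied.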
Conclusion
Every company has its own security and compliance policies, some of which are tightly coupled to security groups. I found it very easy to configure our clusters to use security groups for pods, and I don’t believe any engineer will struggle with it. However, this is yet another Kubernetes resource which further expands, and effectively complicates, various configurations.
The one important gotcha is the need for the worker security group to be able to reach the pod security group on any ports used by liveness and readiness probes. This caught me out for a little while, which is why I have included it here as a mandatory step.
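If the all-TCP rule from the worker security group feels too broad, RQ5 can instead be satisfied by opening only the probe ports. A sketch in the style of the module above, assuming the application’s probes listen on port 8080 (an illustrative value):

```hcl
# Illustrative alternative to the all-TCP ingress rule: allow only kubelet
# probe traffic. Assumes liveness/readiness probes listen on port 8080.
resource "aws_security_group_rule" "pods_ingress_probes_from_worker_security_group" {
  description              = "Ingress: Allow kubelet probe traffic from the worker security group."
  type                     = "ingress"
  protocol                 = "tcp"
  from_port                = 8080
  to_port                  = 8080
  security_group_id        = aws_security_group.pods.id
  source_security_group_id = var.security_groups["k8s_worker"]
}
```

The trade-off is that every probe port used by your workloads must be listed explicitly, so the all-TCP rule remains the simpler option on shared clusters.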