Introduction

This post documents my process of learning and trying to build an EKS (Elastic Kubernetes Service) cluster. EKS is AWS's managed Kubernetes service. I hope it helps you understand EKS better.

Background knowledge to have before reading this post:

Kubernetes: core components (pod, service, controller, node); the more you know, the better.
AWS: services such as IAM, EC2, VPC, Security Group, ...
Terraform: one of the tools for doing infrastructure as code.

Container orchestration in general, and Kubernetes in particular, is still complex technology with many components. Companies and organizations that adopt it usually modify it to suit their purposes, and each cloud platform such as AWS, Azure, and GCP has its own customized k8s distribution (k8s is short for Kubernetes).

Learning Kubernetes is a long and bumpy journey. Please point out mistakes, suggest improvements, and discuss in the comments section below so that I and other readers can understand EKS better.

Glossary: keywords and abbreviations I will use

k8s: short for Kubernetes.

k: in the example commands you may see that I call k instead of kubectl. It is a shell alias for kubectl, defined in bash with: alias k=kubectl

worker node, EC2 instance: both refer to a worker node in the EKS cluster.

Getting started

The basic architecture of EKS looks like this: https://docs.aws.amazon.com/eks/latest/userguide/what-is-eks.html

EKS consists of:

EKS control plane: equivalent to the control plane (master) nodes of upstream Kubernetes. As far as I understand so far, the EKS control plane is managed by AWS. Users cannot access the control plane machines directly; they can only interact with it through the APIs it exposes.
Worker nodes: where pods and backend containers run. With EKS I can use AWS's EC2 service directly to create virtual machines as worker nodes. Worker nodes (EC2 instances) can belong to an "Auto Scaling Group", a feature that automatically increases or decreases the number of EC2 instances (by workload, by traffic, on a schedule, ...) so you do not have to create a new node by hand and then join it to the cluster.

OK, "simple". Let's start creating EKS. Before that, prepare the following:

aws command line: search Google for "install aws cli".
credentials in ~/.aws/credentials: if you do not know how to set this up, search Google for "how to configure aws credentials". It will look like this:

[default]
aws_access_key_id = SECRET
aws_secret_access_key = SECRET
region=ap-southeast-1

terraform command line: search Google for "install terraform".
kubectl command line

Initializing the project

Initialize Terraform with aws as the provider. Here I use version 5.24 of the aws provider.

terraform.tf

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "5.24"
    }
  }

  required_version = ">= 1.2.0"
}

provider.tf

provider "aws" {
  region                   = "ap-southeast-1"
  shared_credentials_files = ["~/.aws/credentials"]
  profile                  = "default"
}

Then run terraform init to download the provider.

terraform init

Once initialization succeeds, you can start setting up an EKS cluster.

Creating the network

Before creating EKS, networking needs to be set up. I will create a simple network:

1 VPC with the IP range 10.0.0.0/16
3 subnets with different IP ranges
Each subnet sits in a different availability zone
These subnets assign a public IP to every EC2 instance launched in them: map_public_ip_on_launch = true
An internet gateway and a route that give the EC2 instances in these subnets outbound access

NOTE: map_public_ip_on_launch = true is mandatory with the current setup. These subnets reach the internet only through the internet gateway, so an instance without a public IP has no outbound path to the EKS API endpoint or the image registry. Without this setting, cluster creation can fail with errors such as "Instances failed to join the cluster" (see the error summary near the end of the post, where I come back to this).

network.tf

resource "aws_vpc" "zero" {
  cidr_block = "10.0.0.0/16"

  tags = {
    Name = "zero"
  }
}

resource "aws_subnet" "zero_one" {
  vpc_id                  = aws_vpc.zero.id
  cidr_block              = "10.0.1.0/24"
  availability_zone       = "ap-southeast-1a"
  map_public_ip_on_launch = true

  tags = {
    Name = "zero_one"
  }
}

resource "aws_subnet" "zero_two" {
  vpc_id                  = aws_vpc.zero.id
  cidr_block              = "10.0.2.0/24"
  availability_zone       = "ap-southeast-1b"
  map_public_ip_on_launch = true

  tags = {
    Name = "zero_two"
  }
}

resource "aws_subnet" "zero_three" {
  vpc_id                  = aws_vpc.zero.id
  cidr_block              = "10.0.3.0/24"
  availability_zone       = "ap-southeast-1c"
  map_public_ip_on_launch = true

  tags = {
    Name = "zero_three"
  }
}

resource "aws_internet_gateway" "public_access" {
  vpc_id = aws_vpc.zero.id
}

resource "aws_route" "public_access" {
  route_table_id         = aws_vpc.zero.default_route_table_id
  destination_cidr_block = "0.0.0.0/0"
  gateway_id             = aws_internet_gateway.public_access.id
}
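
The three subnets described above differ only in their index and availability zone, so they could also be generated. Below is a minimal sketch of that alternative, not what the rest of this post uses: the local zero_subnets and the resource name zero_generated are names made up for illustration, and the block would replace the three hand-written subnets rather than sit next to them (the CIDR ranges would collide).

```hcl
locals {
  # Hypothetical map: availability zone => subnet index.
  # Index N yields the range 10.0.N.0/24 inside the VPC.
  zero_subnets = {
    "ap-southeast-1a" = 1
    "ap-southeast-1b" = 2
    "ap-southeast-1c" = 3
  }
}

resource "aws_subnet" "zero_generated" {
  for_each = local.zero_subnets

  vpc_id = aws_vpc.zero.id
  # Add 8 bits to the /16 prefix and pick network number each.value.
  cidr_block              = cidrsubnet(aws_vpc.zero.cidr_block, 8, each.value)
  availability_zone       = each.key
  map_public_ip_on_launch = true

  tags = {
    Name = "zero_${each.value}"
  }
}
```

cidrsubnet("10.0.0.0/16", 8, 1) evaluates to 10.0.1.0/24, so the ranges match the hand-written ones. With this variant the subnet IDs would be referenced as values(aws_subnet.zero_generated)[*].id.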

Creating permissions for the control plane

The control plane needs permissions to work with the resources an EKS cluster requires, for example:

View the network information of the current cluster
View and modify auto scaling, controlling the number of nodes in the cluster
View worker node information (network, disk, CPU, RAM)

These permissions are already bundled in an AWS managed policy called AmazonEKSClusterPolicy.

Now we create the role EKSClusterRole and attach the policy AmazonEKSClusterPolicy to it (policy details: https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonEKSClusterPolicy.html).

This role can be assumed by the eks service, meaning the AWS-managed control plane uses this role to interact with resources inside AWS.

iam-cluster.tf

data "aws_iam_policy_document" "eks_cluster_role_trust_policy" {
  statement {
    effect = "Allow"

    actions = [
      "sts:AssumeRole"
    ]

    principals {
      type        = "Service"
      identifiers = ["eks.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "eks_cluster_role" {
  name               = "EKSClusterRole"
  assume_role_policy = data.aws_iam_policy_document.eks_cluster_role_trust_policy.json
}

resource "aws_iam_role_policy_attachment" "eks_cluster_role_policy_attachment_1" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy"
  role       = aws_iam_role.eks_cluster_role.name
}

With the permissions set up, a new cluster can be created.

Creating the control plane (cluster)

For this part I followed https://registry.terraform.io/providers/hashicorp/aws/5.24.0/docs/resources/eks_cluster#basic-usage. I will create an EKS cluster as follows:

the cluster name is zero
the Kubernetes version is 1.28
the role is the one just created, named EKSClusterRole
the subnets are the 3 subnets created above

main.tf

resource "aws_eks_cluster" "zero" {
  name     = "zero"
  role_arn = aws_iam_role.eks_cluster_role.arn
  version  = "1.28"

  vpc_config {
    subnet_ids = [
      aws_subnet.zero_one.id,
      aws_subnet.zero_two.id,
      aws_subnet.zero_three.id,
    ]
  }

  # Ensure that IAM Role permissions are created before and deleted after EKS Cluster handling.
  # Otherwise, EKS will not be able to properly delete EKS managed EC2 infrastructure such as Security Groups.
  depends_on = [
    aws_iam_role_policy_attachment.eks_cluster_role_policy_attachment_1,
  ]
}

Let's try terraform apply and see.

...
aws_eks_cluster.zero: Still creating... [17m50s elapsed]
aws_eks_cluster.zero: Still creating... [18m0s elapsed]
aws_eks_cluster.zero: Still creating... [18m10s elapsed]
aws_eks_cluster.zero: Creation complete after 18m12s [id=zero]

Apply complete! Resources: 7 added, 0 changed, 0 destroyed.

I do not know why it takes so long, nearly 20 minutes; it must be heavy! According to https://docs.aws.amazon.com/eks/latest/userguide/clusters.html, creating an EKS cluster includes creating a control plane across multiple availability zones, with a load balancer in front of it.

Setting up a k8s cluster has never been easy, sigh...

This step is error-prone. If you hit an error, leave a comment or message me directly (tuana9a) and I will help. If it went fine, open the AWS web console and check.

It is up. If there were an error, the status field would be bright red with a big "Error" label. That was a sweaty moment. You can click on the cluster and look at the information shown in the console.

The role was created under the IAM service; you can check it in the console the same way.

At this point the control plane has been created successfully. You can fetch the kubeconfig with aws eks update-kubeconfig --name zero, where zero is the name of the cluster you created.

vagrant@tuana9a-dev aws-eks-example $ aws eks update-kubeconfig --name zero
Added new context arn:aws:eks:ap-southeast-1:384588864907:cluster/zero to /home/vagrant/.kube/config
vagrant@tuana9a-dev aws-eks-example $
vagrant@tuana9a-dev aws-eks-example $ kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-6996c975cb-ppchn 0/1 Pending 0 31m
kube-system coredns-6996c975cb-vb8d9 0/1 Pending 0 31m
vagrant@tuana9a-dev aws-eks-example $ kubectl get nodes -o wide
No resources found
vagrant@tuana9a-dev aws-eks-example $

Above, I could already use kubectl to inspect resources. The pods are in the Pending state, and the kubectl get nodes command right after explains why: there are no worker nodes. So the next task is to create worker nodes.
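
To avoid retyping the cluster name when fetching the kubeconfig, Terraform outputs can print the relevant values after each apply. A small sketch, assuming a new file named outputs.tf; the file name and the output names are arbitrary and nothing else in this post depends on them.

```hcl
# outputs.tf (hypothetical file name)

output "cluster_name" {
  description = "Name of the EKS cluster"
  value       = aws_eks_cluster.zero.name
}

output "cluster_endpoint" {
  description = "URL of the Kubernetes API server managed by AWS"
  value       = aws_eks_cluster.zero.endpoint
}

output "update_kubeconfig_command" {
  description = "Command that writes the kubeconfig entry for this cluster"
  value       = "aws eks update-kubeconfig --name ${aws_eks_cluster.zero.name} --region ap-southeast-1"
}
```

After terraform apply, running terraform output update_kubeconfig_command prints the command to copy and run.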

Creating permissions for the worker nodes

Like the control plane, worker nodes need permissions to access cluster information, for example:

View information about the current Kubernetes cluster
View the current networking information
Pull kube-system images such as coredns, ...

These permissions are also bundled in AWS managed policies:

AmazonEKSWorkerNodePolicy: lets the worker view cluster information.
AmazonEKS_CNI_Policy: lets the worker manage networking.
AmazonEC2ContainerRegistryReadOnly: I am not sure about this one; it probably lets the worker pull the kube-system container images from AWS's ECR registry.

To be honest I am not certain about this part. Roughly, if I drop any one of the 3 policies, I get an error like "Instances failed to join the cluster" again and cluster creation fails; this error is covered near the end of the post. I found this AWS page, which briefly explains the worker permissions: https://docs.aws.amazon.com/eks/latest/userguide/create-node-role.html. If anyone can explain this further, that would be great =))

OK, now I create a role named EKSNodeRole. This role can be assumed by the ec2 service, meaning EC2 worker nodes can use it. The role gets the policies above attached: AmazonEKSWorkerNodePolicy, AmazonEKS_CNI_Policy and AmazonEC2ContainerRegistryReadOnly.

iam-worker.tf

# setup IAM for the worker node
data "aws_iam_policy_document" "eks_node_role_trust_policy" {
  statement {
    effect = "Allow"

    actions = [
      "sts:AssumeRole"
    ]

    principals {
      type        = "Service"
      identifiers = ["ec2.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "eks_node_role" {
  name               = "EKSNodeRole"
  assume_role_policy = data.aws_iam_policy_document.eks_node_role_trust_policy.json
}

resource "aws_iam_role_policy_attachment" "eks_node_role_policy_attachment_1" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"
  role       = aws_iam_role.eks_node_role.name
}

resource "aws_iam_role_policy_attachment" "eks_node_role_policy_attachment_2" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
  role       = aws_iam_role.eks_node_role.name
}

resource "aws_iam_role_policy_attachment" "eks_node_role_policy_attachment_3" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
  role       = aws_iam_role.eks_node_role.name
}

Creating the worker nodes

There are a few ways to create worker nodes:

EKS managed node groups
Self managed nodes
AWS Fargate

See the details at https://docs.aws.amazon.com/eks/latest/userguide/eks-compute.html. To keep things simple I use the first option: it is managed by AWS, so it is simpler and needs no extra setup. Terraform docs: https://registry.terraform.io/providers/hashicorp/aws/5.24.0/docs/resources/eks_node_group

I will create the worker node group by adding the block below to main.tf:

resource "aws_eks_cluster" "zero" {
  ...
}

resource "aws_eks_node_group" "zero" {
  cluster_name    = aws_eks_cluster.zero.name
  node_group_name = "zero"
  node_role_arn   = aws_iam_role.eks_node_role.arn

  subnet_ids = [
    aws_subnet.zero_one.id,
    aws_subnet.zero_two.id,
    aws_subnet.zero_three.id,
  ]

  capacity_type = "SPOT"

  scaling_config {
    desired_size = 1
    max_size     = 3
    min_size     = 1
  }

  update_config {
    max_unavailable = 1
  }

  # Ensure that IAM Role permissions are created before and deleted after EKS Node Group handling.
  # Otherwise, EKS will not be able to properly delete EC2 Instances and Elastic Network Interfaces.
  depends_on = [
    aws_iam_role_policy_attachment.eks_node_role_policy_attachment_1,
    aws_iam_role_policy_attachment.eks_node_role_policy_attachment_2,
    aws_iam_role_policy_attachment.eks_node_role_policy_attachment_3,
  ]
}

A bit of explanation: this node group manages the number of worker nodes in the Kubernetes cluster. In detail:

The group name is zero
The group's control plane is zero, the name of the cluster whose control plane I created above.
Nodes in this group use the role I just created above.
Nodes in this group live in the 3 subnets created above; each node's IP falls in the range of its subnet.
This node group can scale between 1 and 3 nodes: min: 1, max: 3

There is also the SPOT parameter; search Google for "aws spot instance" to learn about it.
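
The node group in main.tf leaves the instance type and disk size at their defaults. Here is a hedged sketch of the same node group with those made explicit. instance_types and disk_size are arguments of aws_eks_node_group, while the concrete values below (two instance types, 20 GiB) are only illustrative assumptions, not settings used elsewhere in this post.

```hcl
resource "aws_eks_node_group" "zero" {
  cluster_name    = aws_eks_cluster.zero.name
  node_group_name = "zero"
  node_role_arn   = aws_iam_role.eks_node_role.arn

  subnet_ids = [
    aws_subnet.zero_one.id,
    aws_subnet.zero_two.id,
    aws_subnet.zero_three.id,
  ]

  # "SPOT" uses spare EC2 capacity that AWS can reclaim; "ON_DEMAND" is the alternative.
  capacity_type = "SPOT"

  # Assumed values: similar-sized types give Spot more capacity pools to draw from.
  instance_types = ["t3.medium", "t3a.medium"]

  # Root volume size of each worker node, in GiB (assumed value).
  disk_size = 20

  scaling_config {
    desired_size = 1
    max_size     = 3
    min_size     = 1
  }

  update_config {
    max_unavailable = 1
  }

  depends_on = [
    aws_iam_role_policy_attachment.eks_node_role_policy_attachment_1,
    aws_iam_role_policy_attachment.eks_node_role_policy_attachment_2,
    aws_iam_role_policy_attachment.eks_node_role_policy_attachment_3,
  ]
}
```

Changing instance_types or disk_size on an existing node group is expected to make Terraform replace the group, so it is easiest to decide on them before the first apply.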

OK, that is quite a lot already. Let's try apply.

An error. This one is caused by the missing CNI, the component that makes networking work inside a k8s cluster (pod-to-pod networking, ...). In EKS these components are managed as addons; besides the CNI there are also coredns and kube-proxy. Now that the cause is known, adding these addons fixes it.

I add all 3 addons to the EKS cluster as follows:

addon.tf

resource "aws_eks_addon" "zero_vpccni" {
  cluster_name  = aws_eks_cluster.zero.name
  addon_name    = "vpc-cni"
  addon_version = "v1.14.1-eksbuild.1"
}

resource "aws_eks_addon" "zero_kubeproxy" {
  cluster_name  = aws_eks_cluster.zero.name
  addon_name    = "kube-proxy"
  addon_version = "v1.28.1-eksbuild.1"
}

resource "aws_eks_addon" "zero_coredns" {
  cluster_name  = aws_eks_cluster.zero.name
  addon_name    = "coredns"
  addon_version = "v1.10.1-eksbuild.2"

  depends_on = [aws_eks_node_group.zero]
}

Notice that the coredns addon has the line depends_on = [aws_eks_node_group.zero]. It makes coredns get created only after the worker nodes are created successfully. I once hit the case where coredns was created before the node group: coredns needs an available worker node to deploy its pods, so with no node available its creation either fails or takes very long, and terraform apply reports an error. When I applied again, Terraform replaced the old coredns addon, and since a worker node was already ready, cluster creation succeeded.

To avoid the CNI not being initialized, the node group should in turn wait for the addons to be created first. Add dependencies to the node group like this:

main.tf

resource "aws_eks_node_group" "zero" {
  ...
  depends_on = [
    ...
    aws_eks_addon.zero_vpccni,
    aws_eks_addon.zero_kubeproxy,
  ]
}
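
The addon versions in addon.tf are hardcoded and tied to Kubernetes 1.28. One way to look up a compatible version instead of guessing is the aws_eks_addon_version data source. A minimal sketch for vpc-cni follows; the data source label and the output name are illustrative, and most_recent = true picks the newest compatible version rather than the exact one pinned in this post.

```hcl
# Look up the newest vpc-cni addon version that supports the cluster's Kubernetes version.
data "aws_eks_addon_version" "vpccni" {
  addon_name         = "vpc-cni"
  kubernetes_version = aws_eks_cluster.zero.version
  most_recent        = true
}

# Print the resolved version so it can be reviewed before pinning it.
output "vpccni_latest_version" {
  value = data.aws_eks_addon_version.vpccni.version
}
```

The value could also be used directly as addon_version = data.aws_eks_addon_version.vpccni.version, at the cost of version upgrades arriving on later applies without an explicit change in the code.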

OK, apply again.

aws_eks_node_group.zero: Still creating... [2m0s elapsed]
aws_eks_node_group.zero: Still creating... [2m10s elapsed]
aws_eks_node_group.zero: Creation complete after 2m10s [id=zero:zero]
aws_eks_addon.zero_coredns: Creating...
aws_eks_addon.zero_coredns: Still creating... [10s elapsed]
aws_eks_addon.zero_coredns: Creation complete after 14s [id=zero:coredns]

Apply complete! Resources: 4 added, 0 changed, 1 destroyed.

It finally works. Let's try some basic kubectl get commands. If you have not fetched the kubeconfig yet, run aws eks update-kubeconfig --name zero again.

14:17:15 vagrant@tuana9a-dev aws-eks-example master ? aws eks update-kubeconfig --name zero
Updated context arn:aws:eks:ap-southeast-1:384588864907:cluster/zero in /home/vagrant/.kube/config
14:17:20 vagrant@tuana9a-dev aws-eks-example master ? kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-1-177.ap-southeast-1.compute.internal Ready <none> 59m v1.28.3-eks-e71965b
14:17:55 vagrant@tuana9a-dev aws-eks-example master ?
14:18:04 vagrant@tuana9a-dev aws-eks-example master ? kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system aws-node-npzwb 2/2 Running 0 60m
kube-system coredns-fd4c6955-s4w8k 1/1 Running 0 59m
kube-system coredns-fd4c6955-zh7rq 1/1 Running 0 59m
kube-system kube-proxy-s7m4x 1/1 Running 0 60m
14:18:14 vagrant@tuana9a-dev aws-eks-example master ?

Yay, it is up. Let's test a deployment.

test-deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 10
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.14.2
          ports:
            - containerPort: 80

kubectl apply -f test-deployment.yaml

vagrant@tuana9a-dev ~/github.com/tuana9a/aws-eks-example/zero (master?) $ k get pods
NAME READY STATUS RESTARTS AGE
nginx-deployment-86dcfdf4c6-2wbj7 1/1 Running 0 67s
nginx-deployment-86dcfdf4c6-58vl9 1/1 Running 0 67s
nginx-deployment-86dcfdf4c6-89m92 1/1 Running 0 67s
nginx-deployment-86dcfdf4c6-btlqc 1/1 Running 0 67s
nginx-deployment-86dcfdf4c6-cr4kn 1/1 Running 0 67s
nginx-deployment-86dcfdf4c6-ff7xq 1/1 Running 0 67s
nginx-deployment-86dcfdf4c6-mjbtw 1/1 Running 0 67s
nginx-deployment-86dcfdf4c6-qchlc 1/1 Running 0 67s
nginx-deployment-86dcfdf4c6-vxd7n 1/1 Running 0 67s
nginx-deployment-86dcfdf4c6-xcjdg 1/1 Running 0 67s
vagrant@tuana9a-dev ~/github.com/tuana9a/aws-eks-example/zero (master?) $

Good. Now raise the deployment to 30 pods to see how EKS auto scaling behaves; just change it to replicas: 30

...
spec:
  replicas: 30
...

Apply again and look at the result.

vagrant@tuana9a-dev ~/github.com/tuana9a/aws-eks-example/zero (master?) $ k get pods
NAME READY STATUS RESTARTS AGE
nginx-deployment-86dcfdf4c6-2wbj7 1/1 Running 0 18m
nginx-deployment-86dcfdf4c6-58vl9 1/1 Running 0 18m
nginx-deployment-86dcfdf4c6-5wc8g 0/1 Pending 0 12m
nginx-deployment-86dcfdf4c6-89m92 1/1 Running 0 18m
nginx-deployment-86dcfdf4c6-9bwf5 0/1 Pending 0 12m
nginx-deployment-86dcfdf4c6-9d286 0/1 Pending 0 12m
nginx-deployment-86dcfdf4c6-bf2xq 0/1 Pending 0 12m
nginx-deployment-86dcfdf4c6-btlqc 1/1 Running 0 18m
nginx-deployment-86dcfdf4c6-ccn4x 1/1 Running 0 12m
nginx-deployment-86dcfdf4c6-cr4kn 1/1 Running 0 18m
nginx-deployment-86dcfdf4c6-dj757 0/1 Pending 0 12m
nginx-deployment-86dcfdf4c6-f2xp2 0/1 Pending 0 12m
nginx-deployment-86dcfdf4c6-ff7xq 1/1 Running 0 18m
nginx-deployment-86dcfdf4c6-g9xsm 1/1 Running 0 12m
nginx-deployment-86dcfdf4c6-gmckm 1/1 Running 0 12m
nginx-deployment-86dcfdf4c6-hm6j7 0/1 Pending 0 12m
nginx-deployment-86dcfdf4c6-kbj5w 0/1 Pending 0 12m
nginx-deployment-86dcfdf4c6-kfgj4 0/1 Pending 0 12m
nginx-deployment-86dcfdf4c6-lzqww 0/1 Pending 0 12m
nginx-deployment-86dcfdf4c6-mjbtw 1/1 Running 0 18m
nginx-deployment-86dcfdf4c6-nwgjc 0/1 Pending 0 12m
nginx-deployment-86dcfdf4c6-p9llh 0/1 Pending 0 12m
nginx-deployment-86dcfdf4c6-qchlc 1/1 Running 0 18m
nginx-deployment-86dcfdf4c6-tcldb 0/1 Pending 0 12m
nginx-deployment-86dcfdf4c6-vvzv7 0/1 Pending 0 12m
nginx-deployment-86dcfdf4c6-vxd7n 1/1 Running 0 18m
nginx-deployment-86dcfdf4c6-xcjdg 1/1 Running 0 18m
nginx-deployment-86dcfdf4c6-xwfrw 0/1 Pending 0 12m
nginx-deployment-86dcfdf4c6-zjhg2 0/1 Pending 0 12m
nginx-deployment-86dcfdf4c6-zlzns 0/1 Pending 0 12m
vagrant@tuana9a-dev ~/github.com/tuana9a/aws-eks-example/zero (master?) $

Hmm, some pods are Pending. Grep the output to make it easier to read.

vagrant@tuana9a-dev ~/github.com/tuana9a/aws-eks-example/zero (master?) $ k get pods | grep Pending
nginx-deployment-86dcfdf4c6-5wc8g 0/1 Pending 0 14m
nginx-deployment-86dcfdf4c6-9bwf5 0/1 Pending 0 14m
nginx-deployment-86dcfdf4c6-9d286 0/1 Pending 0 14m
nginx-deployment-86dcfdf4c6-bf2xq 0/1 Pending 0 14m
nginx-deployment-86dcfdf4c6-dj757 0/1 Pending 0 14m
nginx-deployment-86dcfdf4c6-f2xp2 0/1 Pending 0 14m
nginx-deployment-86dcfdf4c6-hm6j7 0/1 Pending 0 14m
nginx-deployment-86dcfdf4c6-kbj5w 0/1 Pending 0 14m
nginx-deployment-86dcfdf4c6-kfgj4 0/1 Pending 0 14m
nginx-deployment-86dcfdf4c6-lzqww 0/1 Pending 0 14m
nginx-deployment-86dcfdf4c6-nwgjc 0/1 Pending 0 14m
nginx-deployment-86dcfdf4c6-p9llh 0/1 Pending 0 14m
nginx-deployment-86dcfdf4c6-tcldb 0/1 Pending 0 14m
nginx-deployment-86dcfdf4c6-vvzv7 0/1 Pending 0 14m
nginx-deployment-86dcfdf4c6-xwfrw 0/1 Pending 0 14m
nginx-deployment-86dcfdf4c6-zjhg2 0/1 Pending 0 14m
nginx-deployment-86dcfdf4c6-zlzns 0/1 Pending 0 14m
vagrant@tuana9a-dev ~/github.com/tuana9a/aws-eks-example/zero (master?) $ k get pods | grep Pending | wc -l
17
vagrant@tuana9a-dev ~/github.com/tuana9a/aws-eks-example/zero (master?) $

17 pods are still Pending. I waited a full 14 minutes and EKS still had not scaled out more EC2 nodes for me.
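
A likely explanation for why exactly 13 of the 30 nginx pods are running: with the default VPC CNI settings, the number of pods a node can host is bounded by its network interfaces. The sketch below works through the AWS formula for a t3.medium (the instance type this node group got by default), assuming 3 network interfaces with 6 IPv4 addresses each. The local and output names are made up for illustration.

```hcl
locals {
  # Assumed t3.medium limits, taken from the EC2 documentation.
  enis_per_instance = 3
  ipv4_per_eni      = 6

  # One address per interface is the interface's own primary IP and is not given to pods;
  # the +2 covers the pods that use host networking (aws-node and kube-proxy).
  max_pods = local.enis_per_instance * (local.ipv4_per_eni - 1) + 2
}

output "t3_medium_max_pods" {
  value = local.max_pods # 3 * (6 - 1) + 2 = 17
}
```

A limit of 17 pods per node matches the kubectl output: 4 kube-system pods (aws-node, two coredns, kube-proxy) plus 13 nginx pods fill the single node, and the other 17 nginx pods have nowhere to go.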

I found the cause: I do not have a cluster autoscaler. By default an EKS cluster does not scale nodes by itself; this is an optional feature that AWS leaves for the user to choose.

Fine, as you wish. There is more than one way to implement cluster autoscaling; see "Autoscaling - Amazon EKS" in the AWS docs. I chose Cluster Autoscaler on AWS.

Actually, at this point creating the EKS cluster can be considered done, with a static number of workers: you set the node count through the node group's desired_size value. But using the cloud and still doing that much by hand feels limiting.

If you are interested, let's begin.

cluster autoscaler

In short, I add more permissions to the worker node role and then apply a deployment manifest. The pod in this deployment picks up the permissions from the node and can act on AWS resources to provision more EC2 instances for the EKS cluster.

Adding auto scaling permissions to the worker nodes

I add a new policy and attach it to the worker role created earlier, in iam-worker.tf

...
# OPTIONAL for cluster autoscaler
resource "aws_iam_policy" "eks_auto_scaler" {
  name        = "EKSAutoScaler"
  path        = "/"
  description = "EKS cluster auto scaler"

  # Terraform's "jsonencode" function converts a
  # Terraform expression result to valid JSON syntax.
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "autoscaling:DescribeAutoScalingGroups",
          "autoscaling:DescribeAutoScalingInstances",
          "autoscaling:DescribeLaunchConfigurations",
          "autoscaling:DescribeScalingActivities",
          "autoscaling:DescribeTags",
          "ec2:DescribeInstanceTypes",
          "ec2:DescribeLaunchTemplateVersions"
        ]
        Resource = "*"
      },
      {
        Effect = "Allow"
        Action = [
          "autoscaling:SetDesiredCapacity",
          "autoscaling:TerminateInstanceInAutoScalingGroup",
          "ec2:DescribeImages",
          "ec2:GetInstanceTypesFromInstanceRequirements",
          "eks:DescribeNodegroup"
        ]
        Resource = "*"
      },
    ]
  })
}
...
resource "aws_iam_role_policy_attachment" "eks_node_role_policy_attachment_4" {
  policy_arn = aws_iam_policy.eks_auto_scaler.arn
  role       = aws_iam_role.eks_node_role.name
}

Add it to the node group's depends_on for completeness, in main.tf

resource "aws_eks_node_group" "zero" {
  depends_on = [
    ...
    aws_iam_role_policy_attachment.eks_node_role_policy_attachment_4, # OPTIONAL for cluster autoscaler
    ...
  ]
}

Adding the deployment that performs auto scaling

Once the worker nodes have the permissions, a pod must be deployed to do the actual auto scaling. I took the manifest from https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml. Note that the <YOUR CLUSTER NAME> placeholder must be replaced with the name of your EKS cluster.

- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<YOUR CLUSTER NAME>

For convenience I use Terraform's template engine to fill in that field automatically. This requires adding the hashicorp/local provider to terraform.tf and provider.tf.

terraform.tf now looks like this

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "5.24"
    }
    local = {
      source  = "hashicorp/local"
      version = "2.4.0"
    }
  }

  required_version = ">= 1.2.0"
}

And this is provider.tf

provider "aws" {
  region                   = "ap-southeast-1"
  shared_credentials_files = ["~/.aws/credentials"]
  profile                  = "default"
}

provider "local" {
  # Configuration options
}

Run terraform init -upgrade to download the new provider. If you forget, Terraform reports an error.

Then create a templates directory and, inside it, the file cluster-autoscaler-autodiscover.yaml.tftpl

templates/cluster-autoscaler-autodiscover.yaml.tftpl

---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
  name: cluster-autoscaler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-autoscaler
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
rules:
  - apiGroups: [""]
    resources: ["events", "endpoints"]
    verbs: ["create", "patch"]
  - apiGroups: [""]
    resources: ["pods/eviction"]
    verbs: ["create"]
  - apiGroups: [""]
    resources: ["pods/status"]
    verbs: ["update"]
  - apiGroups: [""]
    resources: ["endpoints"]
    resourceNames: ["cluster-autoscaler"]
    verbs: ["get", "update"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["watch", "list", "get", "update"]
  - apiGroups: [""]
    resources:
      - "namespaces"
      - "pods"
      - "services"
      - "replicationcontrollers"
      - "persistentvolumeclaims"
      - "persistentvolumes"
    verbs: ["watch", "list", "get"]
  - apiGroups: ["extensions"]
    resources: ["replicasets", "daemonsets"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["policy"]
    resources: ["poddisruptionbudgets"]
    verbs: ["watch", "list"]
  - apiGroups: ["apps"]
    resources: ["statefulsets", "replicasets", "daemonsets"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses", "csinodes", "csidrivers", "csistoragecapacities"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["batch", "extensions"]
    resources: ["jobs"]
    verbs: ["get", "list", "watch", "patch"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["create"]
  - apiGroups: ["coordination.k8s.io"]
    resourceNames: ["cluster-autoscaler"]
    resources: ["leases"]
    verbs: ["get", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["create", "list", "watch"]
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["cluster-autoscaler-status", "cluster-autoscaler-priority-expander"]
    verbs: ["delete", "get", "update", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-autoscaler
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-autoscaler
subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler
    namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: cluster-autoscaler
subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler
    namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    app: cluster-autoscaler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '8085'
    spec:
      priorityClassName: system-cluster-critical
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534
        fsGroup: 65534
        seccompProfile:
          type: RuntimeDefault
      serviceAccountName: cluster-autoscaler
      containers:
        - image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.26.2
          name: cluster-autoscaler
          resources:
            limits:
              cpu: 100m
              memory: 600Mi
            requests:
              cpu: 100m
              memory: 600Mi
          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --expander=least-waste
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/${cluster_name}
          volumeMounts:
            - name: ssl-certs
              mountPath: /etc/ssl/certs/ca-certificates.crt # /etc/ssl/certs/ca-bundle.crt for Amazon Linux Worker Nodes
              readOnly: true
          imagePullPolicy: "Always"
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            readOnlyRootFilesystem: true
      volumes:
        - name: ssl-certs
          hostPath:
            path: "/etc/ssl/certs/ca-bundle.crt"

Add this block to main.tf

# OPTIONAL for cluster autoscaler
resource "local_file" "autoscaler-manifest" {
  content = templatefile("templates/cluster-autoscaler-autodiscover.yaml.tftpl", {
    cluster_name = aws_eks_cluster.zero.name
  })
  filename = "cluster-autoscaler-autodiscover.yaml"
}

Now run terraform apply again. After a successful apply, the file cluster-autoscaler-autodiscover.yaml appears in the current directory; kubectl apply it

kubectl apply -f cluster-autoscaler-autodiscover.yaml

Wait a moment, and

13:48:18 vagrant@tuana9a-dev aws-eks-example master ? k get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-1-177.ap-southeast-1.compute.internal Ready <none> 24h v1.28.3-eks-e71965b
ip-10-0-2-226.ap-southeast-1.compute.internal Ready <none> 101s v1.28.3-eks-e71965b
ip-10-0-3-79.ap-southeast-1.compute.internal Ready <none> 103s v1.28.3-eks-e71965b
13:48:21 vagrant@tuana9a-dev aws-eks-example master ?
13:50:52 vagrant@tuana9a-dev aws-eks-example master ? k get pods | grep Pending
13:50:58 vagrant@tuana9a-dev aws-eks-example master ? k get pods | grep Pending | wc -l
0
13:51:02 vagrant@tuana9a-dev aws-eks-example master ?

Ha, that did it. Exhausting. So we have successfully created an EKS cluster with node auto scaling.
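
One follow-up worth considering now that the cluster autoscaler changes the node count: main.tf still says desired_size = 1, so a later terraform apply would try to set the group back to one node. Below is a hedged sketch of a common workaround using Terraform's lifecycle block. It is a fragment to merge into the existing node group, not a complete resource on its own.

```hcl
resource "aws_eks_node_group" "zero" {
  # All existing arguments of the node group in main.tf stay unchanged here.

  lifecycle {
    # Let the cluster autoscaler own the current node count;
    # min_size and max_size are still managed by Terraform.
    ignore_changes = [scaling_config[0].desired_size]
  }
}
```

With this in place, desired_size only acts as the initial size when the node group is first created.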

Error summary

worker node cannot join the cluster

For this error, check the IAM permissions again, or the network settings (VPC, subnet, ...).

forgot to initialize the cni

container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized

When you click through to the details of a node in the node group

Scroll down to the Conditions section and you will see

This error happens because you forgot to create the vpc-cni addon.

next step

Set up a private ECR Docker registry for worker nodes to pull images from.
Set up an EBS storage class for EKS, with dynamic volume provisioning.
Security: with the current setup both the control plane and the worker nodes have public IPs, which is an open attack surface. There is clearly no need to expose the worker nodes to the public internet like that, so I need to find a way to deploy them privately.
Instance types: by default I was given t3.medium. This parameter should clearly be controlled, and likewise the disk size of each worker node.
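
For the security item, one small step that does not require redesigning the network is to restrict who can reach the public API endpoint. This is a sketch under assumptions: endpoint_private_access, endpoint_public_access and public_access_cidrs are arguments of the vpc_config block of aws_eks_cluster, while 203.0.113.10/32 is a placeholder documentation address to replace with your own IP. Moving the worker nodes into private subnets is a larger change and is not shown.

```hcl
resource "aws_eks_cluster" "zero" {
  name     = "zero"
  role_arn = aws_iam_role.eks_cluster_role.arn
  version  = "1.28"

  vpc_config {
    subnet_ids = [
      aws_subnet.zero_one.id,
      aws_subnet.zero_two.id,
      aws_subnet.zero_three.id,
    ]

    # Let worker nodes reach the API server from inside the VPC,
    # so they do not depend on the public allow list below.
    endpoint_private_access = true

    # Keep the public endpoint for kubectl, limited to known addresses.
    endpoint_public_access = true
    public_access_cidrs    = ["203.0.113.10/32"] # placeholder, use your own IP
  }

  depends_on = [
    aws_iam_role_policy_attachment.eks_cluster_role_policy_attachment_1,
  ]
}
```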

destroy

After everything is created and you have played around with it, run terraform destroy to "destroy" everything that you and I just created. See you in the next post about EKS; there is still plenty to tinker with here.

References

https://registry.terraform.io/providers/hashicorp/aws/5.24.0/docs/resources/eks_cluster

https://registry.terraform.io/providers/hashicorp/aws/5.24.0/docs/resources/eks_node_group

https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider/aws
