Amazon EKS

The following instructions are community-authored and may not be up to date.

Consult the official AWS EKS documentation for current best practices and configurations, and check the latest version of the Assemblyline Helm chart.

Prerequisites

Tool      Version   Purpose
aws CLI   v2+       AWS resource management
kubectl   1.30+     Kubernetes management
helm      3.x       Chart deployment
eksctl    latest    (Optional) EKS helpers

Ensure your AWS credentials are configured with permissions for EKS, EC2, S3, ElastiCache, IAM, and ELBv2.

# Verify AWS identity
aws sts get-caller-identity

# Point kubectl at the cluster (run this once the cluster exists; see EKS Cluster below)
aws eks update-kubeconfig --name assemblyline-cluster --region us-east-1

AWS Infrastructure

VPC & Networking

The EKS cluster runs in an existing VPC with 3 subnets across 3 AZs:

Resource      Value
VPC           <YOUR_VPC_ID>
Subnet AZ-a   <SUBNET_AZ_A>
Subnet AZ-b   <SUBNET_AZ_B>
Subnet AZ-c   <SUBNET_AZ_C>

Requirements:

  • Subnets must be private (with NAT gateway) or public with auto-assign public IP
  • Subnets must be tagged for ALB discovery:

    kubernetes.io/cluster/assemblyline-cluster = shared
    kubernetes.io/role/elb = 1              # for internet-facing ALB
    kubernetes.io/role/internal-elb = 1     # for internal ALB (if needed)
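
The tags above could be applied with `aws ec2 create-tags`. The following is a minimal sketch, assuming the subnets are public and should host an internet-facing ALB; the function name `tag_alb_subnets` is hypothetical, and private subnets would use `kubernetes.io/role/internal-elb` instead.

```shell
# Hypothetical helper: tag subnets so the AWS Load Balancer Controller can discover them.
# Arguments: cluster name, then one or more subnet IDs.
tag_alb_subnets() {
  cluster="$1"
  shift
  aws ec2 create-tags \
    --resources "$@" \
    --tags "Key=kubernetes.io/cluster/${cluster},Value=shared" \
           "Key=kubernetes.io/role/elb,Value=1" \
    --region us-east-1
}

# Example (replace the placeholders with real subnet IDs):
# tag_alb_subnets assemblyline-cluster <SUBNET_AZ_A> <SUBNET_AZ_B> <SUBNET_AZ_C>
```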
    

EKS Cluster

# Create the EKS cluster
aws eks create-cluster \
  --name assemblyline-cluster \
  --region us-east-1 \
  --kubernetes-version 1.30 \
  --role-arn arn:aws:iam::<AWS_ACCOUNT_ID>:role/<YOUR_CLUSTER_ROLE> \
  --resources-vpc-config \
    "subnetIds=<SUBNET_AZ_A>,<SUBNET_AZ_B>,<SUBNET_AZ_C>,securityGroupIds=<YOUR_ADDITIONAL_SG>,endpointPublicAccess=true,endpointPrivateAccess=true"

# Wait for cluster to become ACTIVE (~10 minutes)
aws eks wait cluster-active --name assemblyline-cluster --region us-east-1

# Update kubeconfig
aws eks update-kubeconfig --name assemblyline-cluster --region us-east-1

Node Groups

We use a two-tier architecture with workload separation via taints and labels:

Node Group    Instance                  Purpose              Count / Capacity            Taint                      Label
r6a-infra     r6a.large (2 vCPU, 16GB)  Core infrastructure  4 ON_DEMAND                 workload=infra:NoSchedule  workload=infra
m6a-services  m6a.large (2 vCPU, 8GB)   Service pods         4-5 ON_DEMAND (autoscales)  (none)                     workload=services

Create the infra node group:

aws eks create-nodegroup \
  --cluster-name assemblyline-cluster \
  --nodegroup-name r6a-infra \
  --node-role arn:aws:iam::<AWS_ACCOUNT_ID>:role/<YOUR_NODE_GROUP_ROLE> \
  --subnets <SUBNET_AZ_A> <SUBNET_AZ_B> <SUBNET_AZ_C> \
  --instance-types r6a.large \
  --scaling-config minSize=3,maxSize=6,desiredSize=4 \
  --capacity-type ON_DEMAND \
  --ami-type AL2023_x86_64_STANDARD \
  --disk-size 20 \
  --labels workload=infra \
  --taints "key=workload,value=infra,effect=NO_SCHEDULE" \
  --region us-east-1 \
  --tags Project=AssemblyLine4,Environment=prod

Create the services node group:

aws eks create-nodegroup \
  --cluster-name assemblyline-cluster \
  --nodegroup-name m6a-services \
  --node-role arn:aws:iam::<AWS_ACCOUNT_ID>:role/<YOUR_NODE_GROUP_ROLE> \
  --subnets <SUBNET_AZ_A> <SUBNET_AZ_B> <SUBNET_AZ_C> \
  --instance-types m6a.large \
  --scaling-config minSize=2,maxSize=10,desiredSize=4 \
  --capacity-type ON_DEMAND \
  --ami-type AL2023_x86_64_STANDARD \
  --disk-size 20 \
  --labels workload=services \
  --region us-east-1 \
  --tags Project=AssemblyLine4,Environment=prod

Wait for both node groups:

aws eks wait nodegroup-active --cluster-name assemblyline-cluster --nodegroup-name r6a-infra --region us-east-1
aws eks wait nodegroup-active --cluster-name assemblyline-cluster --nodegroup-name m6a-services --region us-east-1
kubectl get nodes -L workload
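
To confirm that the taint landed on the infra nodes as well as the label, something like the following could be used. It is a small sketch built on `kubectl get nodes -o custom-columns`; the function name `show_node_placement` is hypothetical.

```shell
# Hypothetical helper: list each node with its workload label and taint keys/effects.
show_node_placement() {
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,WORKLOAD:.metadata.labels.workload,TAINTS:.spec.taints[*].key,EFFECTS:.spec.taints[*].effect'
}

# Example:
# show_node_placement
```

Infra nodes should list `workload` under TAINTS with effect `NoSchedule`; services nodes should show no taint.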

Security Groups (Cross-SG Rules)

EKS managed node groups may receive different security groups. Elasticsearch requires port 9300 connectivity between all nodes for cluster formation. Add cross-SG rules if node groups get different SGs:

# Identify the 3 security groups
# SG1: your custom data-plane SG (if exists)
# SG2: eks-cluster-sg-assemblyline-cluster-* (auto-created by EKS)
# SG3: assemblyline-cluster-node-* (if exists)

# Add cross-SG ingress rules (all traffic between the SGs)
SG1=<YOUR_DATA_PLANE_SG>
SG2=<YOUR_EKS_CLUSTER_SG>
SG3=<YOUR_NODE_SG>

aws ec2 authorize-security-group-ingress --group-id $SG1 --protocol -1 --source-group $SG2 --region us-east-1
aws ec2 authorize-security-group-ingress --group-id $SG2 --protocol -1 --source-group $SG1 --region us-east-1
aws ec2 authorize-security-group-ingress --group-id $SG2 --protocol -1 --source-group $SG3 --region us-east-1
aws ec2 authorize-security-group-ingress --group-id $SG3 --protocol -1 --source-group $SG2 --region us-east-1

Check which security groups the node instances received:

aws ec2 describe-instances --filters Name=tag:eks:nodegroup-name,Values=r6a-infra --query 'Reservations[*].Instances[*].SecurityGroups' --region us-east-1

IMDS Hop Limit

EKS managed node groups default to IMDSv2 with a hop limit of 1. Pods that read instance metadata (such as the ALB controller) need a hop limit of 2.

Note

The loop below only changes instances that are already running, so nodes added later by autoscaling need the same fix. For a permanent fix, consider a launch template with MetadataOptions.HttpPutResponseHopLimit: 2.

Set the hop limit on all running node instances:

# Fix hop limit on all cluster nodes
for INSTANCE_ID in $(aws ec2 describe-instances \
  --filters "Name=tag:eks:cluster-name,Values=assemblyline-cluster" "Name=instance-state-name,Values=running" \
  --query 'Reservations[*].Instances[*].InstanceId' --output text --region us-east-1); do
  aws ec2 modify-instance-metadata-options \
    --instance-id "$INSTANCE_ID" \
    --http-put-response-hop-limit 2 \
    --region us-east-1
done
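
The permanent fix mentioned in this section could look roughly like this. It is a sketch under assumptions, not a tested procedure: the helper names and the template name `al-imds-hop2` are hypothetical, and the template only sets the metadata options.

```shell
# Hypothetical helper: write launch template data that enforces IMDSv2 with hop limit 2.
# Argument: path of the JSON file to write.
write_imds_template_data() {
  cat > "$1" <<'EOF'
{
  "MetadataOptions": {
    "HttpEndpoint": "enabled",
    "HttpTokens": "required",
    "HttpPutResponseHopLimit": 2
  }
}
EOF
}

# Hypothetical helper: create the launch template (the default name is an assumption).
create_imds_launch_template() {
  data_file="$(mktemp)"
  write_imds_template_data "$data_file"
  aws ec2 create-launch-template \
    --launch-template-name "${1:-al-imds-hop2}" \
    --launch-template-data "file://${data_file}" \
    --region us-east-1
}

# Example:
# create_imds_launch_template al-imds-hop2
```

A node group would then be created with `--launch-template name=al-imds-hop2,version=1`. When a launch template is supplied, `--disk-size` should be dropped from `aws eks create-nodegroup` and the volume size set in the template instead.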

AWS Managed Services

S3 Filestore Bucket

aws s3api create-bucket \
  --bucket <YOUR_S3_BUCKET> \
  --region us-east-1

# Block public access
aws s3api put-public-access-block \
  --bucket <YOUR_S3_BUCKET> \
  --public-access-block-configuration \
    BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true

ElastiCache Redis

Redis must be accessible from the EKS VPC. Ensure the ElastiCache subnet group uses the same subnets as EKS, and the security group allows port 6379 from the EKS node security groups.

Create a Redis cluster (replication group with 2 nodes):

aws elasticache create-replication-group \
  --replication-group-id al-redis \
  --replication-group-description "AssemblyLine Redis" \
  --engine redis \
  --cache-node-type cache.t3.medium \
  --num-cache-clusters 2 \
  --cache-subnet-group-name <your-subnet-group> \
  --security-group-ids <your-redis-sg> \
  --region us-east-1
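
The security group requirement and the endpoint lookup could be scripted as below. This is a sketch with hypothetical helper names; it assumes the replication group is created with cluster mode disabled, as in the command above, so there is a single primary endpoint.

```shell
# Hypothetical helper: allow Redis traffic (TCP 6379) into the Redis SG from a node SG.
# Arguments: Redis security group ID, node security group ID.
allow_redis_from_nodes() {
  aws ec2 authorize-security-group-ingress \
    --group-id "$1" \
    --protocol tcp \
    --port 6379 \
    --source-group "$2" \
    --region us-east-1
}

# Hypothetical helper: print the primary endpoint address of the replication group.
redis_primary_endpoint() {
  aws elasticache describe-replication-groups \
    --replication-group-id "${1:-al-redis}" \
    --query 'ReplicationGroups[0].NodeGroups[0].PrimaryEndpoint.Address' \
    --output text \
    --region us-east-1
}

# Example:
# allow_redis_from_nodes <your-redis-sg> <YOUR_EKS_CLUSTER_SG>
# aws elasticache wait replication-group-available --replication-group-id al-redis --region us-east-1
# redis_primary_endpoint al-redis
```

The address printed by `redis_primary_endpoint` is the value to use for `<YOUR_REDIS_ENDPOINT>` in the Helm values file.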

ACM Certificate

Request a certificate for your domain:

aws acm request-certificate \
  --domain-name assemblyline.example.com \
  --validation-method DNS \
  --region us-east-1

# Note the certificate ARN output — you'll need it for the Helm values
# Complete DNS validation as prompted
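
DNS validation can be followed from the CLI. The sketch below uses hypothetical helper names and assumes a single-domain certificate, so the first entry in `DomainValidationOptions` is the one of interest.

```shell
# Hypothetical helper: print the DNS record that ACM expects for validation.
# Argument: certificate ARN returned by request-certificate.
acm_validation_record() {
  aws acm describe-certificate \
    --certificate-arn "$1" \
    --query 'Certificate.DomainValidationOptions[0].ResourceRecord' \
    --output json \
    --region us-east-1
}

# Hypothetical helper: block until the certificate has been validated and issued.
wait_for_certificate() {
  aws acm wait certificate-validated --certificate-arn "$1" --region us-east-1
}

# Example:
# acm_validation_record <CERTIFICATE_ARN>   # create this CNAME in your DNS zone
# wait_for_certificate <CERTIFICATE_ARN>
```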

IAM Roles & Policies

Cluster Role

The EKS cluster service role needs:

  • AmazonEKSClusterPolicy

Role ARN: arn:aws:iam::<AWS_ACCOUNT_ID>:role/<YOUR_CLUSTER_ROLE>

Node Group Role

The EC2 node group role needs:

  • AmazonEKSWorkerNodePolicy
  • AmazonEKS_CNI_Policy
  • AmazonEC2ContainerRegistryReadOnly
  • AmazonSSMManagedInstanceCore (optional, for SSM access)

Role ARN: arn:aws:iam::<AWS_ACCOUNT_ID>:role/<YOUR_NODE_GROUP_ROLE>

S3 IRSA Role

Create an IAM role using IAM Roles for Service Accounts (IRSA) so that pods can access S3 without static credentials:

# 1. Create OIDC provider for the cluster (one-time)
eksctl utils associate-iam-oidc-provider --cluster assemblyline-cluster --approve --region us-east-1

# 2. Create the IAM role with S3 permissions
# Trust policy must reference the EKS OIDC provider and the specific service account

Parameter         Value
Role ARN          arn:aws:iam::<AWS_ACCOUNT_ID>:role/<YOUR_S3_IRSA_ROLE>
Service Account   assemblyline in namespace al
Bucket            <YOUR_S3_BUCKET>

The role needs this S3 policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::<YOUR_S3_BUCKET>",
        "arn:aws:s3:::<YOUR_S3_BUCKET>/*"
      ]
    }
  ]
}
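
The trust policy mentioned in step 2 is not shown in this guide. The following is a sketch of how it could be generated, assuming the standard IRSA trust relationship (`sts:AssumeRoleWithWebIdentity` restricted to one service account); the helper names and the file name `trust.json` are hypothetical.

```shell
# Hypothetical helper: print an IRSA trust policy for one service account.
# Arguments: AWS account ID, OIDC provider (issuer URL without https://),
#            namespace, service account name.
render_irsa_trust_policy() {
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::$1:oidc-provider/$2"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "$2:sub": "system:serviceaccount:$3:$4",
          "$2:aud": "sts.amazonaws.com"
        }
      }
    }
  ]
}
EOF
}

# Hypothetical helper: look up the cluster's OIDC issuer and strip the scheme.
cluster_oidc_provider() {
  aws eks describe-cluster --name "$1" --region us-east-1 \
    --query 'cluster.identity.oidc.issuer' --output text | sed 's|^https://||'
}

# Example:
# render_irsa_trust_policy <AWS_ACCOUNT_ID> "$(cluster_oidc_provider assemblyline-cluster)" al assemblyline > trust.json
# aws iam create-role --role-name <YOUR_S3_IRSA_ROLE> --assume-role-policy-document file://trust.json
```

The S3 policy shown above can then be attached to the role, for example with `aws iam put-role-policy`.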

ALB Controller IRSA

The AWS Load Balancer Controller needs its own IRSA role with the standard ALB controller IAM policy. See AWS docs.

Kubernetes Add-ons

AWS Load Balancer Controller

Pass --set vpcId=<YOUR_VPC_ID> (or add --aws-vpc-id arg) to bypass IMDS VPC lookup. Without this, the controller will crash if IMDS hop limit is 1.

# Add the EKS Helm repo
helm repo add eks https://aws.github.io/eks-charts
helm repo update

# Install the controller
helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
  -n kube-system \
  --set clusterName=assemblyline-cluster \
  --set serviceAccount.create=true \
  --set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=<ALB_CONTROLLER_ROLE_ARN> \
  --set replicaCount=2 \
  --set vpcId=<YOUR_VPC_ID>

Verification:

# Should show 2/2 READY
kubectl get deployment -n kube-system aws-load-balancer-controller

EBS CSI Driver

Required for gp2 PersistentVolumes (Elasticsearch data):

# Install as EKS managed add-on
aws eks create-addon \
  --cluster-name assemblyline-cluster \
  --addon-name aws-ebs-csi-driver \
  --service-account-role-arn <EBS_CSI_ROLE_ARN> \
  --region us-east-1

Verification:

# Should show provisioner: kubernetes.io/aws-ebs
kubectl get storageclass gp2

Cluster Autoscaler (Optional)

If using cluster autoscaler, install it and ensure node group ASG tags include:

k8s.io/cluster-autoscaler/enabled = true
k8s.io/cluster-autoscaler/assemblyline-cluster = owned
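
Whether the Auto Scaling groups already carry these tags could be checked with a query like the one below. This is a sketch; the function name `list_autoscaler_tags` is hypothetical.

```shell
# Hypothetical helper: list ASG tags that the cluster autoscaler uses for discovery.
# Argument: EKS cluster name.
list_autoscaler_tags() {
  aws autoscaling describe-tags \
    --filters "Name=key,Values=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/$1" \
    --query 'Tags[*].[ResourceId,Key,Value]' \
    --output table \
    --region us-east-1
}

# Example:
# list_autoscaler_tags assemblyline-cluster
```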

AssemblyLine Helm Deployment

Add Helm Repository

helm repo add assemblyline https://cybercentrecanada.github.io/assemblyline-helm-chart/
helm repo update

# Verify chart availability
helm search repo assemblyline/assemblyline --versions

Create Namespace & Secrets

# Create namespace
kubectl create namespace al

# The Helm chart auto-generates most secrets, but you may want to pre-create:
# - assemblyline-system-passwords (contains datastore-password)
# These are typically auto-generated on first install.

Helm Values File

Save the following as deployment/k8s/values.yaml:

# 1. Ingress Configuration (AWS Load Balancer Controller)
ingressAnnotations:
  kubernetes.io/ingress.class: "alb"
  alb.ingress.kubernetes.io/ip-address-type: dualstack
  alb.ingress.kubernetes.io/scheme: internet-facing
  alb.ingress.kubernetes.io/inbound-cidrs: "<YOUR_ALLOWED_CIDRS>"
  alb.ingress.kubernetes.io/target-type: ip
  alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]'
  alb.ingress.kubernetes.io/target-group-attributes: stickiness.enabled=true,stickiness.lb_cookie.duration_seconds=3600
  alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:<AWS_ACCOUNT_ID>:certificate/<YOUR_CERT_ID>

tlsSecretName: assemblyline-tls

# 2. Storage Classes (EBS gp2)
persistantStorageClass: gp2

# 3. Assemblyline Configuration
useInternalRedis: false
configuration:
  # 3.1 Redis (ElastiCache - plain TCP, no SSL)
  core:
    metrics:
      redis:
        host: "<YOUR_REDIS_ENDPOINT>"
        port: 6379
    redis:
      nonpersistent:
        host: "<YOUR_REDIS_ENDPOINT>"
        port: 6379
      persistent:
        host: "<YOUR_REDIS_ENDPOINT>"
        port: 6379

  # 3.2 Filestore (S3 via IRSA)
  filestore:
    storage: ["s3://s3.amazonaws.com?s3_bucket=<YOUR_S3_BUCKET>&use_ssl=True&aws_region=us-east-1"]
    cache: ["s3://s3.amazonaws.com?s3_bucket=<YOUR_S3_BUCKET>&use_ssl=True&aws_region=us-east-1"]

  services:
    default_auto_update: true
  ui:
    fqdn: "assemblyline.example.com"

# 4. Service Accounts (IRSA for S3 Access)

# 4.1 Add annotations for the Scaler service account to use the S3 IRSA role
serviceAccountAnnotations:
  eks.amazonaws.com/role-arn: "arn:aws:iam::<AWS_ACCOUNT_ID>:role/<YOUR_S3_IRSA_ROLE>"

# 4.2 Set the service account for the other core components to use the default IRSA role (if needed)
coreServiceAccountName: "<IRSA_SERVICE_ACCOUNT_NAME>"  # e.g., "assemblyline-core-irsa"

# 5. Internal components
internalFilestore: false
internalELKStack: true
seperateInternalELKStack: true
internalDatastore: true
enableInternalEncryption: false

metricbeat:
  deployment:
    resources:
      requests:
        cpu: "100m"
        memory: "256Mi"
      limits:
        cpu: "500m"
        memory: "1Gi"

# 6. Resource Limits for the AL components
defaultReqCPU: "100m"
dispatcherReqCPU: "100m"
esMetricsReqCPU: "100m"
ingesterReqCPU: "100m"
metricsReqCPU: "100m"
scalerReqCPU: "100m"
uiReqCPU: "100m"
redisPersistentReqCPU: "100m"
redisVolatileReqCPU: "100m"

# 7. Service Server
useAutoScaler: true
serviceServerInstances: 1
serviceServerInstancesMax: 1
serviceServerReqRam: "1Gi"
serviceServerLimRam: "12Gi"
serviceServerReqCPU: "250m"
serviceServerLimCPU: "4000m"

# 8. Node affinity: schedule core pods ONLY on infra-tainted nodes
tolerations:
  - key: "workload"
    operator: "Equal"
    value: "infra"
    effect: "NoSchedule"

nodeAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    nodeSelectorTerms:
      - matchExpressions:
          - key: workload
            operator: In
            values:
              - infra

# 9. Elastic configurations for Kibana and Elasticsearch
kibana:
  resources:
    requests:
      cpu: "250m"

datastore:
  volumeClaimTemplate:
    storageClassName: gp2
  resources:
    requests:
      cpu: "250m"
  tolerations:
    - key: "workload"
      operator: "Equal"
      value: "infra"
      effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: workload
                operator: In
                values:
                  - infra

log-storage:
  replicas: 2
  volumeClaimTemplate:
    storageClassName: gp2
  resources:
    requests:
      cpu: "250m"
  tolerations:
     - key: "workload"
       operator: "Equal"
       value: "infra"
       effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: workload
                operator: In
                values:
                  - infra
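
Before installing, it may help to confirm that no `<PLACEHOLDER>` tokens remain in the values file. The check below is a small sketch; the function name `check_placeholders` is hypothetical, and it assumes placeholders follow the upper-case `<NAME>` style used in this guide.

```shell
# Hypothetical helper: print any <PLACEHOLDER> tokens left in a values file.
# Returns non-zero when at least one placeholder remains.
check_placeholders() {
  if grep -n '<[A-Z][A-Z0-9_]*>' "$1"; then
    echo "Unreplaced placeholders found in $1" >&2
    return 1
  fi
  echo "No placeholders left in $1"
}

# Example: render the chart locally only if the file is fully filled in.
# check_placeholders deployment/k8s/values.yaml && \
#   helm template assemblyline assemblyline/assemblyline -n al -f deployment/k8s/values.yaml > /dev/null
```

Rendering with `helm template` also surfaces YAML syntax errors in the values file without touching the cluster.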

Install the Chart

helm install assemblyline assemblyline/assemblyline \
  -n al \
  -f deployment/k8s/values.yaml \
  --version 7.1.6 \
  --wait --timeout 15m

# Wait for pods to start (this will take several minutes as Elasticsearch and other stateful services initialize)
watch kubectl get pods -n al

Verification:

# 1. All pods running
echo "=== Pod Status ==="
kubectl get pods -n al | grep -v Running | grep -v Completed | grep -v NAME

# 2. Elasticsearch cluster health
kubectl exec -n al datastore-master-0 -c datastore -- \
  curl -s -u "elastic:$(kubectl get secret -n al assemblyline-system-passwords -o jsonpath='{.data.datastore-password}' | base64 -d)" \
  'http://localhost:9200/_cluster/health?pretty'

# 3. Node placement verification
echo "=== Infra pods ==="
for node in $(kubectl get nodes -l workload=infra -o name); do
  n=$(echo $node | sed 's|node/||')
  echo "  $n: $(kubectl get pods -n al --field-selector spec.nodeName=$n --no-headers | wc -l) pods"
done
echo "=== Service pods ==="
for node in $(kubectl get nodes -l workload=services -o name); do
  n=$(echo $node | sed 's|node/||')
  echo "  $n: $(kubectl get pods -n al --field-selector spec.nodeName=$n --no-headers | wc -l) pods"
done

# 4. ALB target group health
aws elbv2 describe-target-groups --region us-east-1 \
  --query 'TargetGroups[?starts_with(TargetGroupName,`k8s-al-`)].TargetGroupArn' --output text | \
  tr '\t' '\n' | while read tg; do
    echo "=== $(echo $tg | grep -oP 'k8s-al-[^/]+') ==="
    aws elbv2 describe-target-health --target-group-arn "$tg" --region us-east-1 \
      --query 'TargetHealthDescriptions[*].{IP:Target.Id,State:TargetHealth.State}' --output table
  done

# 5. Test the API
curl -sk -o /dev/null -w "HTTP %{http_code} in %{time_total}s\n" https://assemblyline.example.com/api/v4/user/whoami/
# Expected: HTTP 401 (unauthenticated is normal)

# 6. Check for broken service pods
for pod in $(kubectl get pods -n al -o name | grep alsvc- | grep -v updates); do
  kubectl logs -n al $(echo $pod | sed 's|pod/||') --tail=3 2>&1 | \
    grep -q "Waiting for receive task" && echo "BROKEN: $pod"
done

Known Issues & Workarounds

Issue                          Impact                           Workaround                                     Permanent Fix
IMDS hop limit 1 on new nodes  ALB controller can't get VPC ID  Set --aws-vpc-id + fix hop limit on instances  Launch template with HttpPutResponseHopLimit: 2
Cross-SG connectivity          ES cluster formation fails       Manual cross-SG ingress rules                  Ensure all node groups share the same SG
Terraform drift                IaC not in sync                  None (CLI-managed)                             Import all resources into Terraform state
ALB rule priority              WebSocket misrouted              Correct priority: socketio before frontend     Stable after initial fix