Container Checkpointing in Kubernetes With a Custom API

Tutorials

AWS Kubernetes

Problem Statement

Challenge

Organizations running containerized applications in Kubernetes often need to capture and preserve the state of running containers for:

Disaster recovery
Application migration
Debug/troubleshooting
State preservation
Environment reproduction

However, there’s no straightforward, automated way to:

Create container checkpoints on-demand
Store these checkpoints in a standardized format
Make them easily accessible across clusters
Trigger checkpointing through a standard interface

Current Limitations

Manual checkpoint creation requires direct cluster access
No standardized storage format for checkpoints
Limited integration with container registries
Lack of programmatic access for automation
Complex coordination between containerd and storage systems

Solution

A Kubernetes sidecar service that:

Exposes checkpoint functionality via REST API
Automatically converts checkpoints to OCI-compliant images
Stores images in ECR for easy distribution
Integrates with existing Kubernetes infrastructure
Provides a standardized interface for automation

This solves the core problems by:

Automating the checkpoint process
Standardizing checkpoint storage
Making checkpoints portable
Enabling programmatic access
Simplifying integration with existing workflows

Target users:

DevOps teams
Platform engineers
Application developers
Site Reliability Engineers (SREs)

Forensic container checkpointing is based on Checkpoint/Restore In Userspace (CRIU) and allows the creation of stateful copies of a running container without the container knowing that it is being checkpointed. The copy of the container can be analyzed and restored in a sandbox environment multiple times without the original container being aware of it. Forensic container checkpointing was introduced as an alpha feature in Kubernetes v1.25.

This article will guide you on how to deploy Golang code that can be used to take a container checkpoint using an API.

The code takes a pod identifier, retrieves the container ID from containerd as an input, and then uses the ctr command to checkpoint the specific container in the k8s.io namespace of containerd:

Prerequisites

Kubernetes cluster
Install ctr commandline tool. if you are able to run ctr commands on the kubelet or worker node; if not, install or adjust AMI to contain the ctr.
kubectl configured to communicate with your cluster
Docker installed locally
Access to a container registry (e.g., Docker Hub, ECR)
Helm (for installing Nginx Ingress Controller)

Step 0: Code to Create Container Checkpoint Using GO

Create a file named checkpoint_container.go with the following content:

package main

​

import (

    "context"

    "fmt"

    "log"

    "os"

    "os/exec"

    "strings"

​

    "github.com/aws/aws-sdk-go/aws"

    "github.com/aws/aws-sdk-go/aws/session"

    "github.com/aws/aws-sdk-go/service/ecr"

    "github.com/containerd/containerd"

    "github.com/containerd/containerd/namespaces"

)

​

func init() {

    log.SetOutput(os.Stdout)

    log.SetFlags(log.Ldate | log.Ltime | log.Lmicroseconds | log.Lshortfile)

}

​

func main() {

    if len(os.Args) < 4 {

        log.Fatal("Usage: checkpoint_container <pod_identifier> <ecr_repo> <aws_region>")

    }

​

    podID := os.Args[1]

    ecrRepo := os.Args[2]

    awsRegion := os.Args[3]

​

    log.Printf("Starting checkpoint process for pod %s", podID)

​

    containerID, err := getContainerIDFromPod(podID)

    if err != nil {

        log.Fatalf("Error getting container ID: %v", err)

    }

​

    err = processContainerCheckpoint(containerID, ecrRepo, awsRegion)

    if err != nil {

        log.Fatalf("Error processing container checkpoint: %v", err)

    }

​

    log.Printf("Successfully checkpointed container %s and pushed to ECR", containerID)

}

​

func getContainerIDFromPod(podID string) (string, error) {

    log.Printf("Searching for container ID for pod %s", podID)

    client, err := containerd.New("/run/containerd/containerd.sock")

    if err != nil {

        return "", fmt.Errorf("failed to connect to containerd: %v", err)

    }

    defer client.Close()

​

    ctx := namespaces.WithNamespace(context.Background(), "k8s.io")

​

    containers, err := client.Containers(ctx)

    if err != nil {

        return "", fmt.Errorf("failed to list containers: %v", err)

    }

​

    for _, container := range containers {

        info, err := container.Info(ctx)

        if err != nil {

            continue

        }

        if strings.Contains(info.Labels["io.kubernetes.pod.uid"], podID) {

            log.Printf("Found container ID %s for pod %s", container.ID(), podID)

            return container.ID(), nil

        }

    }

​

    return "", fmt.Errorf("container not found for pod %s", podID)

}

​

func processContainerCheckpoint(containerID, ecrRepo, region string) error {

    log.Printf("Processing checkpoint for container %s", containerID)

    checkpointPath, err := createCheckpoint(containerID)

    if err != nil {

        return err

    }

    defer os.RemoveAll(checkpointPath)

​

    imageName, err := convertCheckpointToImage(checkpointPath, ecrRepo, containerID)

    if err != nil {

        return err

    }

​

    err = pushImageToECR(imageName, region)

    if err != nil {

        return err

    }

​

    return nil

}

​

func createCheckpoint(containerID string) (string, error) {

    log.Printf("Creating checkpoint for container %s", containerID)

    checkpointPath := "/tmp/checkpoint-" + containerID

    cmd := exec.Command("ctr", "-n", "k8s.io", "tasks", "checkpoint", containerID, "--checkpoint-path", checkpointPath)

    output, err := cmd.CombinedOutput()

    if err != nil {

        return "", fmt.Errorf("checkpoint command failed: %v, output: %s", err, output)

    }

    log.Printf("Checkpoint created at: %s", checkpointPath)

    return checkpointPath, nil

}

​

func convertCheckpointToImage(checkpointPath, ecrRepo, containerID string) (string, error) {

    log.Printf("Converting checkpoint to image for container %s", containerID)

    imageName := ecrRepo + ":checkpoint-" + containerID

​

    cmd := exec.Command("buildah", "from", "scratch")

    containerId, err := cmd.Output()

    if err != nil {

        return "", fmt.Errorf("failed to create container: %v", err)

    }

​

    cmd = exec.Command("buildah", "copy", string(containerId), checkpointPath, "/")

    err = cmd.Run()

    if err != nil {

        return "", fmt.Errorf("failed to copy checkpoint: %v", err)

    }

​

    cmd = exec.Command("buildah", "commit", string(containerId), imageName)

    err = cmd.Run()

    if err != nil {

        return "", fmt.Errorf("failed to commit image: %v", err)

    }

​

    log.Printf("Created image: %s", imageName)

    return imageName, nil

}

​

func pushImageToECR(imageName, region string) error {

    log.Printf("Pushing image %s to ECR in region %s", imageName, region)

    sess, err := session.NewSession(&aws.Config{

        Region: aws.String(region),

    })

    if err != nil {

        return fmt.Errorf("failed to create AWS session: %v", err)

    }

​

    svc := ecr.New(sess)

​

    authToken, registryURL, err := getECRAuthorizationToken(svc)

    if err != nil {

        return err

    }

​

    err = loginToECR(authToken, registryURL)

    if err != nil {

        return err

    }

​

    cmd := exec.Command("podman", "push", imageName)

    err = cmd.Run()

    if err != nil {

        return fmt.Errorf("failed to push image to ECR: %v", err)

    }

​

    log.Printf("Successfully pushed checkpoint image to ECR: %s", imageName)

    return nil

}

​

func getECRAuthorizationToken(svc *ecr.ECR) (string, string, error) {

    log.Print("Getting ECR authorization token")

    output, err := svc.GetAuthorizationToken(&ecr.GetAuthorizationTokenInput{})

    if err != nil {

        return "", "", fmt.Errorf("failed to get ECR authorization token: %v", err)

    }

​

    authData := output.AuthorizationData[0]

    log.Print("Successfully retrieved ECR authorization token")

    return *authData.AuthorizationToken, *authData.ProxyEndpoint, nil

}

​

func loginToECR(authToken, registryURL string) error {

    log.Printf("Logging in to ECR at %s", registryURL)

    cmd := exec.Command("podman", "login", "--username", "AWS", "--password", authToken, registryURL)

    err := cmd.Run()

    if err != nil {

        return fmt.Errorf("failed to login to ECR: %v", err)

    }

    log.Print("Successfully logged in to ECR")

    return nil

}

Step 1: Initialize the go Module

Shell

go mod init checkpoint_container

Modify the go.mod file:

module checkpoint_container

​

go 1.23

​

require (

    github.com/aws/aws-sdk-go v1.44.298

    github.com/containerd/containerd v1.7.2

)

require (

    github.com/jmespath/go-jmespath v0.4.0 // indirect

    github.com/opencontainers/go-digest v1.0.0 // indirect

    github.com/opencontainers/image-spec v1.1.0-rc2.0.20221005185240-3a7f492d3f1b // indirect

    github.com/pkg/errors v0.9.1 // indirect

    google.golang.org/genproto v0.0.0-20230306155012-7f2fa6fef1f4 // indirect

    google.golang.org/grpc v1.53.0 // indirect

    google.golang.org/protobuf v1.30.0 // indirect

)

Run the following command:

Shell

go mod tidy

Step 2: Build and Publish Docker Image

Create a Dockerfile in the same directory:

Dockerfile

# Build stage

FROM golang:1.20 as builder

​

WORKDIR /app

COPY . .

RUN CGO_ENABLED=0 GOOS=linux go build -o checkpoint_container

​

# Final stage

FROM amazonlinux:2

​

# Install necessary tools

RUN yum update -y && \

    amazon-linux-extras install -y docker && \

    yum install -y awscli containerd skopeo && \

    yum clean all

​

# Copy the built Go binary

COPY --from=builder /app/checkpoint_container /usr/local/bin/checkpoint_container

​

EXPOSE 8080

​

ENTRYPOINT ["checkpoint_container"]

This Dockerfile does the following:

Uses golang:1.20 as the build stage to compile your Go application.
Uses amazonlinux:2 as the final base image.
Installs the AWS CLI, Docker (which includes containerd), and skopeo using yum and amazon-linux-extras.
Copies the compiled Go binary from the build stage.

Shell

docker build -t <your-docker-repo>/checkpoint-container:v1 .

docker push <your-docker-repo>/checkpoint-container:v1

Replace <your-docker-repo> with your actual Docker repository.

Step 3: Apply the RBAC resources

Create a file named rbac.yaml:

YAML

apiVersion: v1

kind: ServiceAccount

metadata:

  name: checkpoint-sa

  namespace: default

​

---

apiVersion: rbac.authorization.k8s.io/v1

kind: Role

metadata:

  name: checkpoint-role

  namespace: default

rules:

- apiGroups: [""]

  resources: ["pods"]

  verbs: ["get", "list"]

​

---

apiVersion: rbac.authorization.k8s.io/v1

kind: RoleBinding

metadata:

  name: checkpoint-rolebinding

  namespace: default

subjects:

- kind: ServiceAccount

  name: checkpoint-sa

  namespace: default

roleRef:

  kind: Role

  name: checkpoint-role

  apiGroup: rbac.authorization.k8s.io

Apply the RBAC resources:

Shell

kubectl apply -f rbac.yaml

Step 4: Create a Kubernetes Deployment

Create a file named deployment.yaml:

YAML

apiVersion: apps/v1

kind: Deployment

metadata:

  name: main-app

  namespace: default

spec:

  replicas: 1

  selector:

    matchLabels:

      app: main-app

  template:

    metadata:

      labels:

        app: main-app

    spec:

      serviceAccountName: checkpoint-sa

      containers:

      - name: main-app

        image: nginx:latest  # Replace with your main application image

      - name: checkpoint-sidecar

        image: <your-docker-repo>/checkpoint-container:v1

        ports:

        - containerPort: 8080

        securityContext:

          privileged: true

        volumeMounts:

        - name: containerd-socket

          mountPath: /run/containerd/containerd.sock

      volumes:

      - name: containerd-socket

        hostPath:

          path: /run/containerd/containerd.sock

          type: Socket

Apply the deployment:

Shell

kubectl apply -f deployment.yaml

In deployment.yaml, update the following:

YAML

image: <your-docker-repo>/checkpoint-container:v1

Step 5: Kubernetes Service

Create a file named service.yaml:

YAML

apiVersion: v1

kind: Service

metadata:

  name: checkpoint-service

  namespace: default

spec:

  selector:

    app: main-app

  ports:

    - protocol: TCP

      port: 80

      targetPort: 8080

Apply the service:

Shell

kubectl apply -f service.yaml

Step 6: Install Ngnix Ingress Contoller

Shell

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx

helm repo update

helm install ingress-nginx ingress-nginx/ingress-nginx

Step 7: Create Ingress Resource

Create a file named ingress.yaml:

YAML

apiVersion: networking.k8s.io/v1

kind: Ingress

metadata:

  name: checkpoint-ingress

  annotations:

    kubernetes.io/ingress.class: nginx

    nginx.ingress.kubernetes.io/ssl-redirect: "false"

spec:

  rules:

  - http:

      paths:

      - path: /checkpoint

        pathType: Prefix

        backend:

          service:

            name: checkpoint-service

            port: 

              number: 80

Apply the Ingress:

Shell

kubectl apply -f ingress.yaml

Step 8: Test the API

Shell

kubectl get services ingress-ngnix-contoller -n ingress-ngnix

Shell

curl -X POST http://<EXTERNAL-IP>/checkpoint \

 -H "Content-Type: application/json" \

 -d '{"podId": "your-pod-id", "ecrRepo": "your-ecr-repo", "awsRegion": "your-aws-region"}'

Replace <EXTERNAL-IP> with the actual external IP.

Additional Considerations

Security.
- Implement HTTPS by setting up TLS certificates
- Add authentication to the API
Monitoring. Set up logging and monitoring for the API and checkpoint process.
Resource management. Configure resource requests and limits for the sidecar container.
Error handling. Implement robust error handling in the Go application.
Testing. Thoroughly test the setup in a non-production environment before deploying it to production.
Documentation. Maintain clear documentation on how to use the checkpoint API.

Conclusion

This setup deploys the checkpoint container as a sidecar in Kubernetes and exposes its functionality through an API accessible from outside the cluster. It provides a flexible solution for managing container checkpoints in a Kubernetes environment.

AWS/EKS Specific

Step 7: Install the AWS Load Balancer Controller

Instead of using the Nginx Ingress Controller, we’ll use the AWS Load Balancer Controller. This controller will create and manage ALBs for our Ingress resources.

1. Add the EKS chart repo to Helm:

Shell

helm repo add eks https://aws.github.io/eks-charts

2. Install the AWS Load Balancer Controller:

Shell

helm install aws-load-balancer-controller eks/aws-load-balancer-controller \

  -n kube-system \

  --set clusterName=<your-cluster-name> \

  --set serviceAccount.create=false \

  --set serviceAccount.name=aws-load-balancer-controller

Replace <your-cluster-name> with your EKS cluster name.

Note: Ensure that you have the necessary IAM permissions set up for the AWS Load Balancer Controller. You can find the detailed IAM policy in the AWS documentation.

Step 8: Create Ingress Resource

Create a file named ingress.yaml:

YAML

apiVersion: networking.k8s.io/v1

kind: Ingress

metadata:

  name: checkpoint-ingress

  annotations:

    kubernetes.io/ingress.class: alb

    alb.ingress.kubernetes.io/scheme: internet-facing

    alb.ingress.kubernetes.io/target-type: ip

spec:

  rules:

  - http:

      paths:

      - path: /checkpoint

        pathType: Prefix

        backend:

          service:

            name: checkpoint-service

            port: 

              number: 80

​

Apply the Ingress:

Shell

kubectl apply -f ingress.yaml

Step 9: Test the API

1. Get the ALB DNS name:

Shell

kubectl get ingress checkpoint-ingress

Look for the ADDRESS field, which will be the ALB’s DNS name.

2. Send a test request:

Shell

curl -X POST http://<ALB-DNS-NAME>/checkpoint \

     -H "Content-Type: application/json" \

     -d '{"podId": "your-pod-id", "ecrRepo": "your-ecr-repo", "awsRegion": "your-aws-region"}'

Replace <ALB-DNS-NAME> with the actual DNS name of your ALB from step 1.

Additional Considerations for AWS ALB

1. Security groups. The ALB will have a security group automatically created. Ensure it allows inbound traffic on port 80 (and 443 if you set up HTTPS).

2. SSL/TLS: To enable HTTPS, you can add the following annotations to your Ingress:

YAML

alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]'

alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:region:account-id:certificate/certificate-id

3. Access logs. Enable access logs for your ALB by adding the following:

YAML

alb.ingress.kubernetes.io/load-balancer-attributes: access_logs.s3.enabled=true,access_logs.s3.bucket=your-log-bucket,access_logs.s3.prefix=your-log-prefix

4. WAF integration. If you want to use AWS WAF with your ALB, you can add:

YAML

alb.ingress.kubernetes.io/waf-acl-id: your-waf-web-acl-id

5. Authentication. You can set up authentication using Amazon Cognito or OIDC by using the appropriate ALB Ingress Controller annotations.

These changes will set up your Ingress using an AWS Application Load Balancer instead of Nginx. The ALB Ingress Controller will automatically provision and configure the ALB based on your Ingress resource.

Conclusion

Remember to ensure that your EKS cluster has the necessary IAM permissions to create and manage ALBs. This typically involves creating an IAM policy and a service account with the appropriate permissions.

This setup will now use AWS’s native load-balancing solution, which integrates well with other AWS services and can be more cost-effective in an AWS environment.

Source:
https://dzone.com/articles/container-checkpointing-kubernetes-api