Anjuna Nitro Kubernetes toolset troubleshooting guide

This section provides troubleshooting steps for Pods that are not running properly inside an AWS Nitro enclave using the Anjuna Nitro Runtime for EKS. If the Pods you have deployed are not behaving as expected, keep reading to scope the problem and identify a solution.

A line containing # <snip>... indicates that some lines have been removed from the full configuration.

Anjuna Nitro EKS deployment overview

  1. A Pod is deployed

  2. The Pod is mutated by the anjuna-nitro-webhook

  3. The mutated Pod is scheduled on an AWS Nitro-based worker Node

  5. The Node allocates AWS Nitro-related devices using the anjuna-device-manager

  5. The Pod is launched using the anjuna-launcher-pod image

  6. The launcher Pod acquires the Enclave Image File (EIF)

  7. The launcher Pod launches an enclave using the EIF

This troubleshooting guide will help you identify which step is not working as expected.

Verifying the infrastructure

First, verify that the infrastructure is set up correctly. The infrastructure includes the anjuna-nitro-webhook-app Pod, the anjuna-nitro-device-manager Pods, and the anjuna-nitro-launcher image.

Verifying the images are available in the image registry

There are three images that are relevant for the Anjuna infrastructure: webhook, device manager, and launcher Pod. Verify that all three exist in the image registry that your cluster uses.

When using the Amazon Elastic Container Registry (ECR), use the following command to list the repositories:

$ aws ecr describe-repositories

Look for repositories ending in anjuna-nitro-webhook, anjuna-nitro-launcher, and anjuna-device-manager.
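
If the repository list is long, a JMESPath filter can narrow it to the Anjuna repositories (this assumes their names contain anjuna):

$ aws ecr describe-repositories \
    --query "repositories[?contains(repositoryName, 'anjuna')].repositoryName" \
    --output text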

Additionally, you may want to list the images available in a repository to verify that the correctly tagged images are present. When using Amazon ECR, use the following command to list the images:

$ aws ecr list-images --repository-name <an Anjuna ECR repo>

The methods for listing repositories and images differ between image registry systems. Consult the documentation for your registry if you are not using Amazon ECR.

If the images are not present in ECR, follow the instructions in Importing the Anjuna Docker images to AWS.

Verifying the webhook is properly configured and running

Start by running the following command and verifying that the anjuna-nitro-webhook-app Pod is running correctly.

$ kubectl get pod
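
If you only need the Pod’s status, a jsonpath query prints its phase directly (this assumes the webhook Pod runs in the default namespace, as in the webhook configuration shown below):

$ kubectl get pod anjuna-nitro-webhook-app -o jsonpath='{.status.phase}'

The expected output is Running.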

Then run the following command:

$ kubectl logs anjuna-nitro-webhook-app

Check that the logs start with messages similar to:

ANJ-WEBHOOK: 2022/07/29 17:26:18.823395 anjuna-k8s-nitro-webhook version master.0301 (build commit: 71a979c)
ANJ-WEBHOOK: 2022/07/29 17:26:18.823847 Starting server on :443
ANJ-WEBHOOK: 2022/07/29 17:26:18.824070 Using TLS certificate with CA issuer subject 'Webhook One-off CA'

Next, verify that the webhook service is properly configured by running the following command:

$ kubectl describe service anjuna-nitro-webhook-svc

Make sure that the output contains information similar to the following:

Selector:      	name=anjuna-nitro-webhook-app
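
If this selector matches the webhook Pod’s labels, the Service will have at least one endpoint. An empty ENDPOINTS column in the output of the following command indicates a selector/label mismatch:

$ kubectl get endpoints anjuna-nitro-webhook-svc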

Lastly, verify that the mutation context is properly configured by running the following command:

$ kubectl describe MutatingWebhookConfiguration anjuna-nitro-webhook

Make sure that the output contains information similar to the following:

Webhooks:
  # <snip>...
  Client Config:
      # <snip>...
      Service:
      Name:        anjuna-nitro-webhook-svc
      Namespace:   default
      Path:        /transform
      Port:        443
  # <snip>...
  Object Selector:
      Match Labels:
      nitro.k8s.anjuna.io/managed:  yes

It is important that nitro.k8s.anjuna.io/managed: yes appears in the output.
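
You can also extract the object selector directly; the following assumes the configuration contains a single webhook entry:

$ kubectl get mutatingwebhookconfiguration anjuna-nitro-webhook -o jsonpath='{.webhooks[0].objectSelector}'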

Setting the webhook log level

When viewing the webhook logs with kubectl logs anjuna-nitro-webhook-app, the log level can be increased to view additional details, which may be helpful for troubleshooting webhook issues. The default log level is info; it can be set to info, debug, or trace, in order of increasing verbosity.

Set the log level in the webhook YAML configuration ConfigMap (helm-charts/anjuna-tools/templates/anjuna-nitro-webhook.yaml) by adding a log-level value, and install the Helm chart again to apply the changes to the cluster.

The following is an example, where <log level> is one of the three log levels:

apiVersion: v1
kind: ConfigMap
metadata:
  name: anjuna-nitro-webhook-config
data:
  webhook-config.yaml: |
    listen-addr: :443
    logging-enabled: true
    log-level: <log level>
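
After editing, apply the change by reinstalling the chart. For example, assuming the chart was installed as a release named anjuna-tools from the helm-charts/anjuna-tools directory:

$ helm upgrade anjuna-tools ./helm-charts/anjuna-tools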

Verifying the device manager is properly configured

Start by verifying that the device manager DaemonSet is properly configured by running the following command:

$ kubectl describe ds anjuna-nitro-device-manager

Check that the output contains information similar to the following:

Name:       	anjuna-nitro-device-manager
Selector:   	name=anjuna-nitro-device-manager
Node-Selector:  anjuna-nitro-device-manager=enabled
# <snip>...
Pod Template:
  # <snip>...
  Containers:
   anjuna-nitro-device-manager:
      Image:  	557884445442.dkr.ecr.us-east-2.amazonaws.com/anjuna-device-manager:1.0
      # <snip>...
      Mounts:
      /dev/nitro_enclaves from nitro-enclaves-dev (rw)
      /var/lib/kubelet/device-plugins from device-plugin (rw)
  Volumes:
   device-plugin:
      Type:      	HostPath (bare host directory volume)
      Path:      	/var/lib/kubelet/device-plugins
      HostPathType:
   nitro-enclaves-dev:
      Type:      	HostPath (bare host directory volume)
      Path:      	/dev/nitro_enclaves
      HostPathType:
# <snip>...

Next, verify that the Pods are running by running the following command:

$ kubectl get pods
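
To narrow the listing to the device manager Pods and see which Node each one runs on, you can use the DaemonSet’s selector shown above:

$ kubectl get pods -l name=anjuna-nitro-device-manager -o wide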

Verify that you have as many anjuna-nitro-device-manager Pods running as you have AWS Nitro-based worker Nodes. If there are fewer, identify which Node does not have this Pod running, and run the following command:

$ kubectl describe node <Node>

Verify that it has the label:

Labels:         	anjuna-nitro-device-manager=enabled
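
If the label is missing and the Node is intended to run enclave workloads, you can add it:

$ kubectl label node <Node> anjuna-nitro-device-manager=enabled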

Finally, you can run the following command on any of the device manager Pods:

$ kubectl logs <device manager Pod>

Then search the output for the following information:

ANJ-DEVMGR: 2022/07/29 17:26:19.365126 INFO  Loading smarter-device-manager
# <snip>...
ANJ-DEVMGR: 2022/07/29 17:26:19.370206 INFO  Registered device plugin for smarter-devices/nitro_enclaves with Kubelet
ANJ-DEVMGR: 2022/07/29 17:26:19.370212 INFO  All devices successfully restarted

You can also run the following command to see that the Node provides the required devices:

$ kubectl describe node <any Node>

Check that the Node’s capacity has the following:

hugepages-2Mi:                   <some memory amount>
smarter-devices/nitro_enclaves:  1

Notice that the amount presented for hugepages-2Mi is the maximum size of the enclave that you can launch.
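
To check these capacities across all Nodes at once, a jsonpath query such as the following can be used; it prints each Node’s name, hugepages-2Mi capacity, and enclave device count:

$ kubectl get nodes -o jsonpath="{range .items[*]}{.metadata.name}{'\t'}{.status.capacity.hugepages-2Mi}{'\t'}{.status.capacity['smarter-devices/nitro_enclaves']}{'\n'}{end}"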

Setting the device manager log level

When viewing the device manager logs with kubectl logs <device manager Pod>, the log level can be increased to view additional details, which may be helpful for troubleshooting device manager issues. The default log level is info; it can be set to fatal, error, info, debug, or trace, in order of increasing verbosity.

This setting can be changed via a command-line argument to the smarter-device-management executable within the device manager image. One option is to specify the argument with the Docker CMD instruction in the Dockerfile used for creating the device manager image, as in the following example, where <log level> is one of the previously listed log level names:

CMD ["smarter-device-management","-log-level","<log level>"]

Alternatively, the Dockerfile CMD can be overridden when the device manager Pod is deployed using the Kubernetes command instruction in the YAML spec configuration. For example:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: anjuna-nitro-device-manager
  labels:
    name: anjuna-nitro-device-manager
    role: agent
spec:
# <snip>...
  template:
# <snip>...
    spec:
# <snip>...
      containers:
      - name: anjuna-nitro-device-manager
# <snip>...
        command: ["smarter-device-management"]
        args: ["-log-level","<log level>"]

Verifying a Pod deployment

Once you have verified that the infrastructure is running correctly, you can troubleshoot faulty Pod deployments.

Verifying the Pod definition

Verify that the Pod definition is properly labeled to allow it to be mutated by your webhook. In the Pod definition (e.g., the YAML file used to deploy it), make sure that the label nitro.k8s.anjuna.io/managed exists in the metadata.labels field and is set to yes. If this label is not defined, or is not set to the correct value, fix the issue and try the deployment again.
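
For reference, here is a minimal sketch of the relevant part of a Pod definition (the Pod name is hypothetical). Quoting the value ensures YAML parses it as the string yes rather than a boolean:

apiVersion: v1
kind: Pod
metadata:
  name: my-enclave-pod                 # hypothetical Pod name
  labels:
    nitro.k8s.anjuna.io/managed: "yes"
# <snip>...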

Verifying the Pod was mutated

Verify that the webhook received the Pod definition to mutate by running the following command:

$ kubectl logs anjuna-nitro-webhook-app

Check that the output contains the following:

ANJ-WEBHOOK: 2022/07/29 17:31:04.853594 Received pod description for <Pod name>
ANJ-WEBHOOK: 2022/07/29 17:31:04.855804 Returning patched response for <Pod name>

Next, make sure that the mutated Pod definition contains the expected changes by running the following command:

$ kubectl describe pod <Pod name>

Check the output for the following line:

Image:      	<image registry>/<launcher image repo>:<launcher image tag>

Check the output for the requests:

Requests:
      # <snip>...
      hugepages-2Mi:                   <the Pod’s defined memory limit>
      smarter-devices/nitro_enclaves:  1

If you are trying to deploy an EIF built on the fly, also verify that the Pod has the following environment variable defined:

ANJ_ENCLAVE_DOCKER_IMAGE:       	<Pod image>

If you are trying to deploy a prebuilt EIF, check that the following environment variable is defined:

ANJ_ENCLAVE_IMAGE_LOCATION:     	<URI to your prebuilt EIF location>
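
To dump only the environment variables of the mutated Pod, a jsonpath query such as the following can help (this assumes the launcher is the first container in the spec):

$ kubectl get pod <Pod name> -o jsonpath='{.spec.containers[0].env}'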

Verifying the Pod was given access to required resources

Identify the Node that the container is running on, and then identify the specific anjuna-nitro-device-manager running on that Node.
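
For example, the first command below shows the Node that the Pod was scheduled on, and the second lists the device manager Pod on that Node (using the DaemonSet selector shown earlier):

$ kubectl get pod <Pod name> -o wide
$ kubectl get pods -l name=anjuna-nitro-device-manager --field-selector spec.nodeName=<Node>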

Examine the logs of that Pod by running the following command:

$ kubectl logs <device manager Pod>

Search for the following output from about the time that the Pod was deployed:

ANJ-DEVMGR: 2022/07/29 17:31:04.894412 INFO  Device plugin for smarter-devices/nitro_enclaves allocated 1 devices

Verifying the Pod launched successfully

Examine the logs of the Pod by running the following command:

$ kubectl logs <Pod name>

Look for the launcher logs:

ANJ-LAUNCH: anjuna-k8s-launcher version master.0301 (build commit: 71a979c)
ANJ-LAUNCH: Running: /opt/anjuna/nitro/bin/anjuna-nitro-cli --version
ANJ-LAUNCH: Anjuna Nitro CLI master.0301 (build commit: 0c196d9)
ANJ-LAUNCH:
ANJ-LAUNCH: Created "/run/nitro_enclaves"
ANJ-LAUNCH: Created "/var/log/nitro_enclaves"
ANJ-LAUNCH: Created "/opt/anjuna/nitro/images"

If you are using an EIF built on the fly, look for the logs:

ANJ-LAUNCH: Building EIF file
ANJ-LAUNCH: Generated enclave config:
# <snip>...
ANJ-LAUNCH: Running: /usr/bin/docker pull <Pod image>
ANJ-LAUNCH: Running: /opt/anjuna/nitro/bin/anjuna-nitro-cli build-enclave ...
ANJ-LAUNCH: EIF build successful

If you are using a prebuilt EIF, look for the logs:

ANJ-LAUNCH: Downloading EIF file from '<uri to EIF>'
ANJ-LAUNCH: EIF download successful

Next, verify that the enclave launched successfully by searching for a log that looks like:

ANJ-LAUNCH: Started enclave with enclave-cid: 17, memory: 2048 MiB, cpu-ids: [1, 5]

And lastly, verify that the Pod’s CMD is executed by looking for the log:

ANJ-ENCLAVE: Launched "<path to Pod image’s CMD>" with pid=...

FAQs

The S3 encrypted configuration is not applied

First, make sure that the enclave configuration file contains an entry of the form:

encryptedConfig:
  type: s3
  uri:  <URI to encrypted configuration>
  allowList:
    ...

Note that the allowList must contain at least one environment variable or one file.

Second, make sure that the file is available to the enclave.
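
Assuming the URI is an S3 URI, you can verify from a machine with read access that the object exists and the URI is not misspelled (note that the enclave itself fetches it with the instance’s credentials, so this check only rules out a missing object):

$ aws s3 ls <URI to encrypted configuration>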

Next, make sure that the enclave downloaded the correct encrypted configuration by looking at the Pod’s logs and searching for a log line of the form:

ANJ-ENCLAVE: Encrypted config from <URI to encrypted configuration> applied

If there are issues with obtaining the encrypted configuration data, see Updating the KMS Policy to Authorize AWS Nitro Enclaves for instructions on how to verify that the enclave has access to the key used to encrypt the encrypted configuration.

Cannot connect to network services on a Pod

When trying to connect to a Pod, the client receives connection timeouts or errors such as “Connection refused”. This is most likely due to a misconfigured Pod definition. Make sure that the Pod definition declares the required container ports under spec.containers[].ports, as in the example below.
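
A minimal sketch of the relevant part of a Pod definition follows (the container name and port numbers are hypothetical):

# <snip>...
spec:
  containers:
  - name: my-app                       # hypothetical container name
    image: <Pod image>
    ports:
    - containerPort: 80                # one entry per port to expose
    - containerPort: 443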

Next, verify that the Pod’s mutated definition has the environment variable ANJ_ENCLAVE_PORT_ALL and that it contains all ports that you want to expose, by running the following command:

$ kubectl describe pod <Pod name>
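
To pick out just that variable from the describe output, you can filter it, for example:

$ kubectl describe pod <Pod name> | grep ANJ_ENCLAVE_PORT_ALL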

Finally, verify that the launcher Pod has launched the anjuna-nitro-netd-parent with all required ports exposed, by running the following command:

$ kubectl logs <Pod name>

Look for a log line starting with:

ANJ-LAUNCH: Running: /opt/anjuna/nitro/bin/anjuna-nitro-netd-parent

Additionally, look for flags of the form --expose <port num>.

You can also look for log lines of the following form to verify that the network daemon has exposed the correct ports:

ANJ-NETD-PARENT: [info] Expose enclave port 80 -> 80

A Pod is stuck in pending state during deployment

Start by verifying that all of the Nodes are running and have the required capacities (see Verifying the device manager is properly configured, above).

Second, verify that the Pod is not requesting any more resources than are available (see Verifying the Pod was mutated, above).

If the Pod’s memory requirements exceed the available memory capacity of all Nodes, consider increasing the memory capacity of your Nodes (specifically, the available hugepages-2Mi).

If your Pod is requesting more smarter-devices/nitro_enclaves than are available, consider adding more AWS Nitro-based Nodes to your cluster.
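
The scheduler also records why it cannot place the Pod. Run the following command and check the Events section at the end of the output:

$ kubectl describe pod <Pod name>

A FailedScheduling event usually names the missing resource; a typical message (Node counts will vary) looks like:

Warning  FailedScheduling  ...  0/3 nodes are available: 3 Insufficient hugepages-2Mi.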

Examples of insufficient memory errors

The following is an example of an error message that might be seen when the allocated memory is insufficient for the built EIF:

ANJ-LAUNCH: [ E26 ] Insufficient memory requested. User provided `memory` is 256 MB, but based on the EIF file size, the minimum memory should be 336 MB

An out-of-memory condition may also be the cause if run-container-process does not start. This is indicated by the log output stopping before the last line of the following example log excerpt:

ANJ-NETD-PARENT: [info] new connection from vm(17):2101522098 to vm(3):1024
ANJ-NETD-PARENT: [info] Assigning 10.0.4.214/32 gw 169.254.1.1 to vm(17):2101522098
ANJ-NETD-PARENT: [info] Connection closed from vm(17):2101522098
ANJ-NETD-PARENT: [info] new connection from vm(17):2101522099 to vm(3):1024
ANJ-NETD-PARENT: [debug] Connection closed to vm(17):2101522098
ANJ-ENCLAVE: run-container-process version master.0294 (build commit: 89783f9)

A Pod crashes randomly or with odd error messages

These issues might be related to insufficient memory for the Pod in the enclave. Pods running in an enclave have different memory requirements from Pods running directly on worker Nodes.

This is because when a Pod runs directly on the worker Node, it only needs to account for the memory required by the applications it launches. When a Pod runs in an enclave, memory is consumed not only by the applications, but also by the kernel and the filesystem.

Try increasing the memory limits for the Pod in the Pod definition and then redeploying.
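
For example, the memory limit is raised in the resources section of the Pod definition (the container name and value are hypothetical); as noted above, the webhook derives the Pod’s hugepages-2Mi request from this limit:

# <snip>...
spec:
  containers:
  - name: my-app                       # hypothetical container name
    image: <Pod image>
    resources:
      limits:
        memory: 4Gi                    # increase this value, then redeploy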

The webhook produced a log about “bad certificate”

If your webhook produced the following log, it means that the CA provided in the MutatingWebhookConfiguration is not the one used to sign the webhook’s certificate:

ANJ-WEBHOOK: http: TLS handshake error from <some address>: remote error: tls: bad certificate

Re-run the following script and follow the instructions that it provides, including updating the Kubernetes secret anjuna-nitro-webhook-cert and updating the MutatingWebhookConfiguration to use the newly produced CA bundle.

$ bash generate-webhook-tls-cert.sh