Anjuna Nitro Kubernetes Toolset troubleshooting guide
This section provides troubleshooting steps for Pods that are not running properly inside an AWS Nitro enclave using the Anjuna Nitro Runtime for EKS. If the Pods you deploy are not behaving as expected, use the steps below to scope the problem and identify a solution.
On this page, some code blocks are shortened to emphasize only the relevant configuration. A line with <snip>… indicates that some lines have been removed from the full configuration.
Anjuna Nitro EKS deployment overview
- A Pod is deployed
- The Pod is mutated by the anjuna-nitro-webhook
- The mutated Pod is scheduled on an AWS Nitro-based worker Node
- The Node allocates AWS Nitro related devices using the anjuna-device-manager
- The Pod is launched using the anjuna-launcher-pod image
- The launcher Pod acquires the EIF
- The launcher Pod launches an enclave using the EIF
This troubleshooting guide will help you identify which step is not working as expected.
Verifying the infrastructure
First, verify that the infrastructure is set up correctly.
The infrastructure includes the anjuna-nitro-webhook-app Pod, the anjuna-nitro-device-manager Pods, and the anjuna-nitro-launcher image.
Verifying the images are available in the image registry
There are three images that are relevant for the Anjuna infrastructure: webhook, device manager, and launcher Pod. Verify that all three exist in the image registry that your cluster uses.
When using the Amazon Elastic Container Registry (ECR), use the following command to list the repositories:
$ aws ecr describe-repositories
Look for repositories ending in anjuna-nitro-webhook, anjuna-nitro-launcher, and anjuna-device-manager.
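If the output is long, you can, for example, narrow it to repository names with the AWS CLI's --query option (standard JMESPath filtering, not specific to Anjuna):
$ aws ecr describe-repositories --query "repositories[].repositoryName" --output text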
Additionally, you may want to list the images available in a repository to verify that the correctly tagged images are present. When using Amazon ECR, use the following command to list the images:
$ aws ecr list-images --repository-name <an Anjuna ECR repo>
The methods to list repositories and images differ for other image registry systems. Consult the relevant documentation for your registry when not using Amazon ECR.
If the images are not present in ECR, follow the instructions in Importing the Anjuna Docker images to AWS.
Verifying the webhook is properly configured and running
Start by running the following command and verifying that the anjuna-nitro-webhook-app is running correctly.
$ kubectl get pod
Then run the following command:
$ kubectl logs anjuna-nitro-webhook-app
Check that the logs start with messages similar to:
ANJ-WEBHOOK: 2022/07/29 17:26:18.823395 anjuna-k8s-nitro-webhook version master.0301 (build commit: 71a979c)
ANJ-WEBHOOK: 2022/07/29 17:26:18.823847 Starting server on :443
ANJ-WEBHOOK: 2022/07/29 17:26:18.824070 Using TLS certificate with CA issuer subject 'Webhook One-off CA'
Next, verify that the webhook service is properly configured by running the following command:
$ kubectl describe service anjuna-nitro-webhook-svc
Make sure that the output contains information similar to the following:
Selector: name=anjuna-nitro-webhook-app
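As an additional generic Kubernetes check (not specific to Anjuna), you can confirm that the Service selector actually matches the webhook Pod by listing its endpoints; an empty endpoints list indicates a label/selector mismatch:
$ kubectl get endpoints anjuna-nitro-webhook-svc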
Lastly, verify that the mutation context is properly configured by running the following command:
$ kubectl describe MutatingWebhookConfiguration anjuna-nitro-webhook
Make sure that the output contains information similar to the following:
Webhooks:
  # <snip>...
  Client Config:
    # <snip>...
    Service:
      Name:       anjuna-nitro-webhook-svc
      Namespace:  default
      Path:       /transform
      Port:       443
  # <snip>...
  Object Selector:
    Match Labels:
      nitro.k8s.anjuna.io/managed: yes
It is important that nitro.k8s.anjuna.io/managed: yes appears in the output.
Setting the webhook log level
When viewing the webhook logs with kubectl logs anjuna-nitro-webhook-app, the log level can be increased to view additional details, which may be helpful for troubleshooting webhook issues. The default log level is info, and it can be set to one of the following: info, debug, or trace, in order from lowest to highest verbosity level.
Set the log level in the webhook ConfigMap (helm-charts/anjuna-tools/templates/anjuna-nitro-webhook.yaml) by adding a log-level value, and install the Helm chart again to apply the change to the cluster. Following is an example, where <log level> is one of the three log levels:
apiVersion: v1
kind: ConfigMap
metadata:
  name: anjuna-nitro-webhook-config
data:
  webhook-config.yaml: |-
    listen-addr: :443
    logging-enabled: true
    log-level: <log level>
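A sketch of reapplying the change, assuming the chart is installed from the local helm-charts/anjuna-tools directory under a release named anjuna-tools (both are assumptions; substitute your actual release name and chart path):
$ helm upgrade --install anjuna-tools ./helm-charts/anjuna-tools   # release name "anjuna-tools" is an assumption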
Verifying the device manager is properly configured
Start by verifying that the device manager DaemonSet is properly configured by running the following command:
$ kubectl get ds anjuna-nitro-device-manager
Check that the output contains information similar to the following:
Name:           anjuna-nitro-device-manager
Selector:       name=anjuna-nitro-device-manager
Node-Selector:  anjuna-nitro-device-manager=enabled
# <snip>...
Pod Template:
  # <snip>...
  Containers:
   anjuna-nitro-device-manager:
    Image:  557884445442.dkr.ecr.us-east-2.amazonaws.com/anjuna-device-manager:1.0
    # <snip>...
    Mounts:
      /dev/nitro_enclaves from nitro-enclaves-dev (rw)
      /var/lib/kubelet/device-plugins from device-plugin (rw)
  Volumes:
   device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:
   nitro-enclaves-dev:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/nitro_enclaves
    HostPathType:
# <snip>...
Next, verify that the Pods are running by running the following command:
$ kubectl get pods
Verify that you have as many anjuna-nitro-device-manager Pods running as you have AWS Nitro-based worker Nodes.
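To see where the device manager Pods are running, you can list them with the DaemonSet's label selector (name=anjuna-nitro-device-manager, as shown in the DaemonSet output above):
$ kubectl get pods -l name=anjuna-nitro-device-manager -o wide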
If there are fewer, identify which Node does not have this Pod running, and run the following command:
$ kubectl describe node <Node>
Verify that it has the label:
Labels: anjuna-nitro-device-manager=enabled
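If the label is missing and the Node is intended to run enclaves, it can be added with a standard kubectl command (confirm first that your environment does not manage this label elsewhere, for example through node group configuration):
$ kubectl label node <Node> anjuna-nitro-device-manager=enabled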
Finally, you can run the following command on any of the device manager Pods:
$ kubectl logs <device manager Pod>
Then search the output for the following information:
ANJ-DEVMGR: 2022/07/29 17:26:19.365126 INFO Loading smarter-device-manager
# <snip>...
ANJ-DEVMGR: 2022/07/29 17:26:19.370206 INFO Registered device plugin for smarter-devices/nitro_enclaves with Kubelet
ANJ-DEVMGR: 2022/07/29 17:26:19.370212 INFO All devices successfully restarted
You can also run the following command to see that the Node provides the required devices:
$ kubectl describe node <any Node>
Check that the Node’s capacity has the following:
hugepages-2Mi:                    <some memory amount>
smarter-devices/nitro_enclaves:   1
Notice that the amount presented for hugepages-2Mi is the maximum size of the enclave that you can launch.
Setting the device manager log level
When viewing the device manager logs with kubectl logs <device manager Pod>, the log level can be increased to view additional details, which may be helpful for troubleshooting device manager issues. The default log level is info, and it can be set to one of the following: fatal, error, info, debug, or trace, in order from lowest to highest verbosity level.
This setting can be changed via a command-line argument to the smarter-device-management executable within the device manager image, by specifying the option in the Docker CMD instruction of the Dockerfile used to create the device manager image. Following is an example, where <log level> is one of the previously listed log level names:
CMD ["smarter-device-management","-log-level","<log level>"]
Alternatively, the Dockerfile CMD can be overridden when the device manager Pod is deployed, using the Kubernetes command and args fields in the YAML spec configuration. For example:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: anjuna-nitro-device-manager
  labels:
    name: anjuna-nitro-device-manager
    role: agent
spec:
  # <snip>...
  template:
    # <snip>...
    spec:
      # <snip>...
      containers:
        - name: anjuna-nitro-device-manager
          # <snip>...
          command: ["smarter-device-management"]
          args: ["-log-level", "<log level>"]
Verifying a Pod deployment
Once you have verified that the infrastructure is running correctly, you can troubleshoot faulty Pod deployments.
Verifying the Pod definition
Verify that the Pod definition is properly tagged to allow it to be mutated by your webhook.
In the Pod definition (e.g., the YAML file used to deploy it), make sure that the label nitro.k8s.anjuna.io/managed exists in the metadata.labels field and is set to yes. If this label is not defined, or is not set to the correct value, fix the issue and try the deployment again.
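A minimal sketch of a Pod definition carrying this label might look like the following (the Pod name and image are placeholders, not values from your deployment):
apiVersion: v1
kind: Pod
metadata:
  name: my-enclave-app                    # hypothetical name
  labels:
    nitro.k8s.anjuna.io/managed: "yes"    # quoted so YAML keeps the value a string
spec:
  containers:
    - name: my-enclave-app                # hypothetical name
      image: <image registry>/<application image>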
Verifying the Pod was mutated
Verify that the webhook received the Pod definition to mutate by running the following command:
$ kubectl logs anjuna-nitro-webhook-app
Check that the output contains the following:
ANJ-WEBHOOK: 2022/07/29 17:31:04.853594 Received pod description for <Pod name>
ANJ-WEBHOOK: 2022/07/29 17:31:04.855804 Returning patched response for <Pod name>
Next, make sure that the mutated Pod definition contains the expected changes by running the following command:
$ kubectl describe pod <Pod name>
Check the output for the following line:
Image: <image registry>/<launcher image repo>:<launcher image tag>
Check the output for the requests:
Requests:
  # <snip>...
  hugepages-2Mi:                    <the Pod’s defined memory limit>
  smarter-devices/nitro_enclaves:   1
If you are trying to deploy an EIF built on the fly, also verify that the Pod has the following environment variable defined:
ANJ_ENCLAVE_DOCKER_IMAGE: <Pod image>
If you are trying to deploy a prebuilt EIF, check that the following environment variable is defined:
ANJ_ENCLAVE_IMAGE_LOCATION: <URI to your prebuilt EIF location>
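One way to check for these variables (a generic kubectl pattern, not an Anjuna-specific command) is to dump the mutated Pod spec and filter for the ANJ_ENCLAVE prefix, which covers both variable names above:
$ kubectl get pod <Pod name> -o yaml | grep -A 1 "ANJ_ENCLAVE"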
Verifying the Pod was given access to required resources
Identify the Node that the container is running on, and then identify the specific anjuna-nitro-device-manager Pod running on that Node.
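For example (standard kubectl usage), the -o wide output includes the Node that each Pod is scheduled on:
$ kubectl get pod <Pod name> -o wide
$ kubectl get pods -o wide | grep anjuna-nitro-device-manager
The first command shows the Node hosting your Pod; the second lists the device manager Pods so you can match the one on that Node.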
Examine the logs of that Pod by running the following command:
$ kubectl logs <device manager Pod>
Search for the following output from about the time that the Pod was deployed:
ANJ-DEVMGR: 2022/07/29 17:31:04.894412 INFO Device plugin for smarter-devices/nitro_enclaves allocated 1 devices
Verifying the Pod launched successfully
Examine the logs of the Pod by running the following command:
$ kubectl logs <Pod name>
Look for the launcher logs:
ANJ-LAUNCH: anjuna-k8s-launcher version master.0301 (build commit: 71a979c)
ANJ-LAUNCH: Running: /opt/anjuna/nitro/bin/anjuna-nitro-cli --version
ANJ-LAUNCH: Anjuna Nitro CLI master.0301 (build commit: 0c196d9)
ANJ-LAUNCH:
ANJ-LAUNCH: Created "/run/nitro_enclaves"
ANJ-LAUNCH: Created "/var/log/nitro_enclaves"
ANJ-LAUNCH: Created "/opt/anjuna/nitro/images"
If you are using an EIF built on the fly, look for the logs:
ANJ-LAUNCH: Building EIF file
ANJ-LAUNCH: Generated enclave config:
# <snip>...
ANJ-LAUNCH: Running: /usr/bin/docker pull <Pod image>
ANJ-LAUNCH: Running: /opt/anjuna/nitro/bin/anjuna-nitro-cli build-enclave ...
ANJ-LAUNCH: EIF build successful
If you are using a prebuilt EIF, look for the logs:
ANJ-LAUNCH: Downloading EIF file from '<uri to EIF>'
ANJ-LAUNCH: EIF download successful
Next, verify that the enclave launched successfully by searching for a log that looks like:
ANJ-LAUNCH: Started enclave with enclave-cid: 17, memory: 2048 MiB, cpu-ids: [1, 5]
And lastly, verify that the Pod’s CMD is executed by looking for the log:
ANJ-ENCLAVE: Launched "<path to Pod image’s CMD>" with pid=...
FAQs
The S3 encrypted configuration is not applied
First, make sure that the enclave configuration file contains an entry of the form:
encryptedConfig:
  type: s3
  uri: <URI to encrypted configuration>
  allowList:
    ...
Note that the allowList must contain at least one environment variable or one file.
Second, make sure that the file is available to the enclave.
Next, make sure that the enclave downloaded the correct encrypted configuration by looking at the Pod’s logs and searching for a log line of the form:
ANJ-ENCLAVE: Encrypted config from <URI to encrypted configuration> applied
If there are issues with obtaining the encrypted configuration data, see Updating the KMS policy to authorize AWS Nitro Enclaves. That section contains instructions on how to verify that the enclave has access to the key used to encrypt the encrypted configuration.
No identity-based policy allows the kms:Decrypt action
See No identity-based policy allows the kms:Decrypt action on the main Troubleshooting page.
Cannot connect to network services on a Pod
When trying to connect to a Pod, the client receives connection timeouts or errors such as Connection refused.
This is most likely due to a misconfigured Pod definition.
Make sure that the Pod’s definition has the required container ports in the field spec.containers.ports.
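As an illustration, a Pod that should accept traffic on port 80 would include an entry like the following (the container name is a placeholder; use your Pod’s actual values):
spec:
  containers:
    - name: my-enclave-app        # hypothetical name
      # <snip>...
      ports:
        - containerPort: 80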
Next, verify that the Pod’s mutated definition has the environment variable ANJ_ENCLAVE_PORT_ALL and that it contains all ports that you want to expose, by running the following command:
$ kubectl describe pod <Pod name>
Finally, verify that the launcher Pod has launched the anjuna-nitro-netd-parent with all required ports exposed, by running the following command:
$ kubectl logs <Pod name>
Look for a log line starting with:
ANJ-LAUNCH: Running: /opt/anjuna/nitro/bin/anjuna-nitro-netd-parent
Additionally, look for flags of the form --expose <port num>.
You can also look for log lines of the following form to verify that the network daemon has exposed the correct ports:
ANJ-NETD-PARENT: [info] Expose enclave port 80 -> 80
A Pod is stuck in pending state during deployment
Start by verifying that all of the Nodes are running and have the required capacities (see Verifying the device manager is properly configured, above).
Second, verify that the Pod is not requesting more resources than are available (see Verifying the Pod was mutated, above).
If the Pod’s memory requirements exceed the available memory capacity of all Nodes, consider increasing the memory capacity of your Nodes (specifically, the available hugepages-2Mi). If your Pod is requesting more smarter-devices/nitro_enclaves than are available, consider adding more AWS Nitro-based Nodes to your cluster.
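To compare those requests with what your Nodes actually provide, you can, for example, scan the Nodes' Capacity and Allocatable sections (generic kubectl and grep usage, not specific to Anjuna):
$ kubectl describe nodes | grep -E "hugepages-2Mi|nitro_enclaves"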
Insufficient memory allocated examples
The following is an example of an error message that might be seen when the memory allocated is insufficient for the built EIF:
ANJ-LAUNCH: [ E26 ] Insufficient memory requested. User provided `memory` is 256 MB, but based on the EIF file size, the minimum memory should be 336 MB
An out-of-memory condition may also be occurring if run-container-process does not start, which would be indicated by the log output stopping before the last line in the following example log excerpt:
ANJ-NETD-PARENT: [info] new connection from vm(17):2101522098 to vm(3):1024
ANJ-NETD-PARENT: [info] Assigning 10.0.4.214/32 gw 169.254.1.1 to vm(17):2101522098
ANJ-NETD-PARENT: [info] Connection closed from vm(17):2101522098
ANJ-NETD-PARENT: [info] new connection from vm(17):2101522099 to vm(3):1024
ANJ-NETD-PARENT: [debug] Connection closed to vm(17):2101522098
ANJ-ENCLAVE: run-container-process version master.0294 (build commit: 89783f9)
A Pod crashes randomly or with odd error messages
These issues might be related to insufficient memory for the Pod in the enclave. Pods running in an enclave have different memory requirements from Pods running directly on worker Nodes.
This is because when Pods run directly on the worker Node, the Pod only needs to consider the memory required for the applications launched. When Pods run in enclaves, the memory is used not only by the applications, but also by the kernel and the filesystem.
Try increasing the memory limits for the Pod in the Pod definition and then redeploying.
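For example, a larger limit can be set in the Pod definition as follows (2Gi is only an illustrative value; pick a limit that covers your application plus the enclave’s kernel and filesystem overhead):
spec:
  containers:
    - name: my-enclave-app        # hypothetical name
      # <snip>...
      resources:
        limits:
          memory: 2Gi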
The webhook produced a log about “bad certificate”
If your webhook produced the following log, it means that the CA provided in the MutatingWebhookConfiguration is not the one used to sign the webhook’s certificate:
ANJ-WEBHOOK: http: TLS handshake error from <some address>: remote error: tls: bad certificate
Re-run the following script, and follow the instructions that it provides, including updating the Kubernetes secret anjuna-nitro-webhook-cert and updating the MutatingWebhookConfiguration to use the newly-produced CA bundle.
$ bash generate-webhook-tls-cert.sh