Version: Release 22.2

Troubleshooting Operator Based Installer

This section describes how to troubleshoot some of the common issues you may face when installing the Release application using the Operator-based installer.

Error while processing YAML document at line 1 of XL YAML file

Symptom

When the deployment starts, the XL CLI script fails to run and reports an error while processing the YAML document at line 1 of the XL YAML file.

Solution

Restart the deployment using XL CLI.

xl apply -v -f digital-ai.yaml

Note: The above command applies the digital-ai.yaml file, which bundles all other files, such as the infrastructure file, the environment file, and so on.

Only the Operator control manager pod gets deployed on the Kubernetes cluster

Symptom

The XL CLI script runs successfully, but only the Operator control manager pods are deployed on the Kubernetes cluster. No other pods are deployed.

Solution

Clear the Operator deployment as follows:

Run the following command to list the custom resource definitions (CRDs):
kubectl get crd
Delete the CRD that corresponds to the Operator:
kubectl delete crd digitalaireleases.xlr.digital.ai
Go to the /digital-ai/kubernetes/template path in the extracted ZIP file, and run the following command: kubectl delete -f
Restart the deployment using XL CLI.

Note: To troubleshoot the issue on an OpenShift AWS cluster, replace the kubectl command with oc.

Deployment activation fails after deleting operator

Symptom

After deleting the operator custom resource definition (CRD) and the operator, the redeployment process fails to create pods when you attempt to activate the deployment process by running the following command:
xl apply -v -f digital-ai.yaml

Solution

Use the kubectl delete -f command to remove the Release instance only if you do not have a local Release instance. If you have a local Release instance with deployment details, use the make undeploy command to remove the Operator, and then retry the deployment process.

Upgrade to Operator-based solution fails

Symptom

The upgrade to Operator-based solution from the Helm Charts-based solution fails.

Solution

Restore the database instance.
Clean the deployments. For more information, see Uninstall Release.
Update the dairelease_cr.yaml file to use the external database as follows:

Search for the UseExistingDB parameter in the dairelease_cr.yaml file.
Set the Enabled parameter to True.
Remove the comment tag from the following parameters:

XL_DB_PASSWORD
XL_DB_URL
XL_DB_USERNAME

Update the external database credentials.
Redeploy the Release instance.
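
To locate the parameters mentioned in the steps above before editing them, a small helper along these lines may be useful. It is only a sketch: the function name show_existing_db_block and the CONTEXT_LINES default are hypothetical, and it assumes that the Enabled and XL_DB_* parameters appear within a few lines after UseExistingDB in your copy of dairelease_cr.yaml.

```shell
#!/bin/sh
# Print the UseExistingDB line plus the lines that follow it, with line
# numbers, so the Enabled flag and the commented XL_DB_* parameters can be
# reviewed. CONTEXT_LINES is a hypothetical default; increase it if the
# block in your file is longer.
show_existing_db_block() {
  grep -n -A "${CONTEXT_LINES:-5}" 'UseExistingDB' "$1"
}

# Only run when the CR file is present in the current directory.
if [ -f dairelease_cr.yaml ]; then
  show_existing_db_block dairelease_cr.yaml
fi
```

Run it from the directory that contains dairelease_cr.yaml. The helper does not modify the file.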

Symptom

The Operator-to-Operator upgrade fails with the following error:
"Fetching values from cluster... / Missing CRD and CR resources during Upgrade, Could not upgrade: exit status 1"

Solution

During the upgrade, the CRD and CR resources are backed up in the dairelease_cr_original.yaml file. To troubleshoot the issue:

Restore the CRD using the following command:
kubectl apply -f dairelease_cr_original.yaml
Restart the upgrade.

If the CRD or CR files are not backed up, then you can only perform a fresh installation after performing a cleanup. To clean up the existing resources, run the following command:

Note: Make sure that you do not delete the PVCs (answer Yes to the question "Should we preserve persisted volume claims?").

xl op --clean

Run the following cleanup script:

kubectl delete crd digitalaireleases.xlr.digital.ai
kubectl delete role xlr-operator-leader-election-role
kubectl delete clusterrole xlr-operator-manager-role
kubectl delete clusterrole xlr-operator-metrics-reader
kubectl delete clusterrole xlr-operator-proxy-role
kubectl delete rolebinding xlr-operator-leader-election-rolebinding
kubectl delete clusterrolebinding xlr-operator-manager-rolebinding
kubectl delete clusterrolebinding xlr-operator-proxy-rolebinding
kubectl delete service xlr-operator-controller-manager-metrics-service
kubectl delete deployment xlr-operator-controller-manager

On OpenShift Keycloak pod is not starting on the namespace

Symptom

The Keycloak pod does not start on the OpenShift cluster, and the following error is reported for the Keycloak StatefulSet:

Warning FailedCreate 2m11s (x3 over 2m11s) statefulset-controller create Pod dai-ocp-xlr-cn1502-k-0 in StatefulSet dai-ocp-xlr-cn1502-k failed error: pods "dai-ocp-xlr-cn1502-k-0" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount

Solution

Add the anyuid security context constraint (SCC) to the service accounts group of your custom namespace:

oc adm policy add-scc-to-group anyuid system:serviceaccounts:<custom-namespace>

Periodic Out of Memory (OOM) on the Operator Pod

Symptom

The Operator pod xlr-operator-controller-manager periodically restarts with an OOM error. You can see the error in the Events section when you describe the pod.

Solution

Edit the deployment for xlr-operator-controller-manager and increase the value for the memory limits: spec.template.spec.containers[1].resources.limits.memory
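
As an alternative to editing the deployment interactively, the same field can be changed with a JSON patch. The following is a minimal sketch, not the documented procedure: the helper name memory_limit_patch, the NEW_LIMIT value of 512Mi, and the APPLY switch are all hypothetical, and it assumes that a memory limit is already set on the container at index 1. Add -n with your namespace to the kubectl command if the Operator does not run in the current namespace.

```shell
#!/bin/sh
# Sketch: raise the memory limit of the container at index 1 in the
# xlr-operator-controller-manager deployment using a JSON patch.
# NEW_LIMIT is a hypothetical value; choose one that suits your cluster.
NEW_LIMIT="${NEW_LIMIT:-512Mi}"

# Build the JSON patch for
# spec.template.spec.containers[1].resources.limits.memory.
memory_limit_patch() {
  printf '[{"op":"replace","path":"/spec/template/spec/containers/1/resources/limits/memory","value":"%s"}]' "$1"
}

# By default the command is only printed; set APPLY=1 to run it.
if [ "${APPLY:-0}" = "1" ]; then
  kubectl patch deployment xlr-operator-controller-manager \
    --type=json -p "$(memory_limit_patch "$NEW_LIMIT")"
else
  echo "kubectl patch deployment xlr-operator-controller-manager --type=json -p '$(memory_limit_patch "$NEW_LIMIT")'"
fi
```

The "replace" operation fails if the path does not exist, which makes a mistaken container index visible rather than silently adding a limit in the wrong place.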
