Troubleshooting the Operator-Based Installer
This section describes how to troubleshoot common issues that you may face when installing the Deploy application using the Operator-based installer.
Error while processing YAML document at line 1 of XL YAML file
Symptom
When the deployment starts, the XL CLI script fails to run and reports an error while processing the YAML document at line 1 of the XL YAML file.
Solution
Restart the deployment using XL CLI.
xl apply -v -f digital-ai.yaml
Note: The above command applies the digital-ai.yaml wrapper file, which bundles all other files, such as the infrastructure file, the environment file, and so on.
Only the Operator control manager pods get deployed on the Kubernetes cluster
Symptom
The XL CLI script runs successfully, but only the Operator control manager pods are deployed on the Kubernetes cluster. No other pods are deployed.
Solution
Clear the Operator deployment as follows:
- Run the following command:
`kubectl get crd`
- Delete the CRD that corresponds to the Operator:
kubectl delete crd digitalaideployocps.xldocp.digital.ai
- Go to the /digital-ai/kubernetes/template path in the extracted ZIP file and run the following command:
kubectl delete -f
- Restart the deployment using XL CLI.
Note: To troubleshoot the issue on an OpenShift AWS cluster, replace the kubectl command with oc.
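When the cluster has many CRDs, it can help to narrow the kubectl get crd output to the Digital.ai ones before deleting anything. The following is a minimal sketch, not part of the product tooling: the function names and the KUBE_CLI variable are hypothetical, and the filter assumes that the relevant CRD names end in .digital.ai, as the one shown above does.

```shell
#!/bin/sh
# KUBE_CLI is a hypothetical variable: set it to "oc" on OpenShift.
KUBE_CLI="${KUBE_CLI:-kubectl}"

# Keep only the CRD names that end in the digital.ai API group.
# Returns success even when nothing matches.
filter_digitalai_crds() {
  grep -E '\.digital\.ai$' || true
}

# List the Digital.ai CRDs currently registered in the cluster,
# one "resource/name" entry per line.
list_digitalai_crds() {
  "$KUBE_CLI" get crd -o name | filter_digitalai_crds
}
```

Calling list_digitalai_crds prints each matching CRD in the resource/name form that kubectl delete accepts. Review the list before deleting any entry.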
Deployment activation fails after deleting operator
Symptom
After deleting the Operator custom resource definition (CRD) and the Operator, the redeployment fails to create pods when you attempt to activate the deployment by running the following command: xl apply -v -f digital-ai.yaml
Solution
Use the kubectl delete -f command to remove the Deploy instance only if you do not have a local Deploy instance. If you have a local Deploy instance with deployment details, use the make undeploy command to remove the Operator, and then retry the deployment process.
Upgrade to Operator-based solution fails
Symptom
The upgrade to Operator-based solution from the Helm Charts-based solution fails.
Solution
- Restore the database instance.
- Clean the deployments. For more information, see Uninstall Deploy.
- Update the daideploy_cr.yaml file for Deploy to use the external database as follows:
  a. Search for the UseExistingDB parameter in the daideploy_cr.yaml file for Deploy.
  b. Set the Enabled parameter to True.
  c. Remove the comment tag from the following parameters:
     XL_DB_PASSWORD
     XL_DB_URL
     XL_DB_USERNAME
  d. Update the external database credentials.
- Redeploy the Deploy instance.
Symptom
The Operator-to-Operator upgrade fails with the following error:
"Fetching values from cluster... / Missing CRD and CR resources during Upgrade, Could not upgrade: exit status 1"
Solution
During the upgrade, the CRD and CR resources are backed up in the daideploy_cr_original.yaml file.
To troubleshoot the issue:
- Restore the CRD using the following command:
kubectl apply -f daideploy_cr_original.yaml
- Restart the upgrade.
If the CRD or CR files are not backed up, you can only perform a fresh installation after a cleanup. To clean up the existing resources, run the following command:
xl op --clean
Note: Make sure that you do not delete the PVCs. Answer Yes to the question Should we preserve persisted volume claims?
Alternatively, run the following cleanup script:
kubectl delete crd digitalaideploys.xld.digital.ai
kubectl delete role xld-operator-leader-election-role
kubectl delete clusterrole xld-operator-manager-role
kubectl delete clusterrole xld-operator-metrics-reader
kubectl delete clusterrole xld-operator-proxy-role
kubectl delete rolebinding xld-operator-leader-election-rolebinding
kubectl delete clusterrolebinding xld-operator-manager-rolebinding
kubectl delete clusterrolebinding xld-operator-proxy-rolebinding
kubectl delete service xld-operator-controller-manager-metrics-service
kubectl delete deployment xld-operator-controller-manager
Keycloak pod does not start in the namespace on OpenShift
Symptom
The Keycloak pod does not start on the OpenShift cluster, and the following error is reported for the Keycloak StatefulSet:
Warning FailedCreate 2m11s (x3 over 2m11s) statefulset-controller create Pod dai-ocp-xld-cn1502-k-0 in StatefulSet dai-ocp-xld-cn1502-k failed error: pods "dai-ocp-xld-cn1502-k-0" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount
Solution
Add the anyuid security context constraint (SCC) to the service accounts group of your custom namespace:
oc adm policy add-scc-to-group anyuid system:serviceaccounts:<custom-namespace>
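The group name in the command above follows the pattern system:serviceaccounts:&lt;namespace&gt;, so the call is easy to wrap when the namespace is held in a variable. A minimal sketch follows. The function names are hypothetical, and the namespace is whatever custom namespace you installed into.

```shell
#!/bin/sh
# Build the service-account group name for a namespace.
sa_group_for_namespace() {
  printf 'system:serviceaccounts:%s\n' "$1"
}

# Grant the anyuid SCC to all service accounts in the given namespace.
# Requires the oc CLI and sufficient cluster privileges.
grant_anyuid() {
  oc adm policy add-scc-to-group anyuid "$(sa_group_for_namespace "$1")"
}
```

For example, grant_anyuid my-namespace targets the group system:serviceaccounts:my-namespace, where my-namespace is a placeholder.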
Periodic Out of Memory (OOM) on the Operator Pod
Symptom
The Operator pod xld-operator-controller-manager periodically restarts with an OOM error. You can see the error in the Events section of the output of kubectl describe for the pod.
Solution
Edit the deployment for xld-operator-controller-manager and update the value for the memory limits:
spec.template.spec.containers[1].resources.limits.memory
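As an alternative to editing the deployment interactively, the same field can be changed with kubectl patch and a JSON Patch document. The sketch below is illustrative only: the function names are hypothetical, the namespace and the new limit value are assumptions you must supply, and the container index 1 is taken from the field path above.

```shell
#!/bin/sh
# Build a JSON Patch that replaces the memory limit of the container
# at the given index in the pod template.
memory_limit_patch() {
  index="$1"
  limit="$2"
  printf '[{"op":"replace","path":"/spec/template/spec/containers/%s/resources/limits/memory","value":"%s"}]\n' "$index" "$limit"
}

# Apply the patch to the Operator deployment.
# $1: namespace of the Operator (assumption: you know where it runs)
# $2: new memory limit, for example 512Mi (an example value, not a recommendation)
raise_operator_memory_limit() {
  namespace="$1"
  limit="$2"
  kubectl -n "$namespace" patch deployment xld-operator-controller-manager \
    --type=json -p "$(memory_limit_patch 1 "$limit")"
}
```

The replace operation only succeeds when a memory limit is already set on that container. If the patch is rejected, fall back to editing the deployment manually.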
The Operator upgrade fails when you upgrade from 10.3 to the latest 22.2
Symptom
The cc pod is not initialized because of the following error:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 55s statefulset-controller create Claim data-dir-dai-xld-nsxld-digitalai-deploy-cc-server-0 Pod dai-xld-nsxld-digitalai-deploy-cc-server-0 in StatefulSet dai-xld-nsxld-digitalai-deploy-cc-server success
Warning FailedCreate 34s (x13 over 55s) statefulset-controller create Pod dai-xld-nsxld-digitalai-deploy-cc-server-0 in StatefulSet dai-xld-nsxld-digitalai-deploy-cc-server failed error: Pod "dai-xld-nsxld-digitalai-deploy-cc-server-0" is invalid: spec.initContainers[0].volumeMounts[0].name: Not found: "source-dir"
Solution
- Edit the StatefulSet dai-xld-nsxld-digitalai-deploy-cc-server:
kubectl edit statefulset.apps/dai-xld-nsxld-digitalai-deploy-cc-server
- Update the volumes section as shown below.
volumes:
  - name: source-dir
    persistentVolumeClaim:
      claimName: data-dir-dai-xld-nsxld-digitalai-deploy-master-0
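Before saving the change, it may be worth confirming that the claim referenced in claimName exists in the namespace. Kubernetes names StatefulSet claims as the claim template name, the StatefulSet name, and the pod ordinal, joined by hyphens. The sketch below uses that rule. The function names are hypothetical, and the master StatefulSet name dai-xld-nsxld-digitalai-deploy-master is inferred from the claim name above, so adjust it to your release.

```shell
#!/bin/sh
# StatefulSet PVCs are named <claim-template>-<statefulset>-<ordinal>.
statefulset_pvc_name() {
  printf '%s-%s-%s\n' "$1" "$2" "$3"
}

# Check that the master claim exists before referencing it in the
# volumes section of the cc-server StatefulSet.
# $1: namespace (assumption: the namespace of the Deploy installation)
# $2: name of the master StatefulSet
check_master_claim() {
  namespace="$1"
  master_statefulset="$2"
  kubectl -n "$namespace" get pvc \
    "$(statefulset_pvc_name data-dir "$master_statefulset" 0)"
}
```

If kubectl reports that the claim is not found, list the claims in the namespace with kubectl get pvc and use the name of the existing master data claim instead.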