Install troubleshooting

Diagnose and fix the most common first-install failures — "Auth server unreachable", pods stuck without a Dapr sidecar, api-table not ready, and a failed Helm release.

This runbook covers the failures you are most likely to hit on a first helm install and how to recover from each. Most of them share a single root cause: on a fresh install the Helm chart applies the Dapr control plane, the database-migration jobs, and every application at the same time, so anything that prevents a pod from becoming Ready can stall the rest.

Throughout, replace <ns> with your release namespace (the platform namespace — e.g. scrydon-platform).

Quick triage

# Overall pod health — look for anything not Running/2-of-2
kubectl -n <ns> get pods -o wide

# Helm release state — "failed" means --wait gave up on a not-ready resource
helm -n <ns> history <release>

# What blocked the install (the message names the offending resource)
helm -n <ns> status <release>

A pod count of 1/2 or 0/2 (app + Dapr sidecar), a release STATUS: failed, or pods in CrashLoopBackOff / ContainerStatusUnknown all point at one of the cases below.

"Auth server unreachable" in the platform UI

The platform shows:

Auth server unreachable — The platform can't reach the authentication service right now.

This almost always means the auth server (api-platform) is reachable but a database-backed call returned a 500 — and on a fresh install that is because the database was never migrated (the schema is empty).

Confirm

# api-platform is up and serving (these should be 200)
kubectl -n <ns> logs deploy/api-platform -c api-platform --tail=20   # "Auth server is ready to accept requests."

# The setup-status call the platform makes — a 500 here is the smoking gun
kubectl -n <ns> run authchk --rm -it --restart=Never --image=curlimages/curl:8.7.1 -- \
  curl -s -m 6 -X POST -H "Content-Type: application/json" -d '{}' \
  http://svc-api-platform.<ns>.svc.cluster.local/api/auth/config/get-setup-status

If api-platform logs contain relation "user" does not exist (or any relation … does not exist), the schema is missing.

Cause

Database migrations run as a Helm hook job (post-install,pre-upgrade). Post-install hooks only fire after helm install --wait reports every resource Ready. If any Deployment never became Ready, the install is marked failed and the migration hooks are skipped — leaving every database empty. (See Helm release shows failed for why a pod might not become ready.)

Fix

First make the apps healthy (work through the other sections below).
Re-run the migrations. The cleanest way is to let Helm's pre-upgrade hook run them — it executes before the readiness wait, so it completes even if a slow service is still settling:
```
helm -n <ns> upgrade <release> <chart> --reuse-values
```

Verify the schema exists, then reload the platform:

kubectl -n <ns> exec deploy/db -c postgres -- psql -U postgres -d auth -c '\dt' | head

See Database migrations → Migrations never ran on a fresh install for the manual job-based recovery if you cannot run helm upgrade.

Pods stuck at `1/1` with no Dapr sidecar

Right after a fresh install, some pods come up 1/1 (no daprd sidecar) while others are 2/2. Restarting the 1/1 pods makes them 2/2.

Cause

Dapr injects its sidecar through a mutating webhook (dapr-sidecar-injector). The webhook only injects into pods created after it is registered and Ready. A single helm install creates the Dapr control plane and the app Deployments together, so any app pod scheduled before the injector is ready gets no sidecar. Services that depend on Dapr then fail their readiness checks.

Fix

Confirm the control plane is healthy. With the bundled Dapr (dapr.installControlPlane: true, the default) it runs in the release namespace, not dapr-system:

kubectl -n <ns> get pods | grep dapr     # dapr-operator, dapr-sidecar-injector, dapr-sentry, dapr-placement, dapr-scheduler
kubectl get mutatingwebhookconfigurations | grep dapr

Once the injector is Ready, roll the app deployments so they get sidecars:
```
kubectl -n <ns> rollout restart deployment
```

To avoid the race entirely on repeat installs, install Dapr first and set dapr.installControlPlane: false, then run helm install once the injector is ready. See Helm.

`api-table` stuck at `1/2` (readiness failing)

api-table stays 1/2 (its daprd sidecar is ready, the app container is not). Its readiness probe gates on three dependencies — Postgres, StarRocks, and the Dapr sidecar.

Diagnose

POD_IP=$(kubectl -n <ns> get pod -l app=api-table -o jsonpath='{.items[0].status.podIP}')
kubectl -n <ns> run rtest --rm -it --restart=Never --image=curlimages/curl:8.7.1 -- \
  curl -s -m 8 http://$POD_IP:7500/api/table/health/ready
# => {"status":"not_ready","checks":{"postgresql":true,"starrocks":false,"dapr":true,"opa":true}}

The checks object names the failing dependency. starrocks:false is the most common on a fresh install.

Fix StarRocks (`starrocks:false`)

The chart ships a single-pod StarRocks (infra.starrocks.enabled: true, the allin1 image). Two things commonly keep it from serving queries:

Replication factor. The default STARROCKS_REPLICATION_NUM assumes a multi-node backend. On the single-pod deployment set it to 1, or table writes/queries fail with "tablet replica … not enough":
```
apiTable:
  starrocks:
    replicationNum: "1"   # single-pod StarRocks has one backend
```
Cold start. StarRocks takes longer to accept connections than api-table. Confirm the FE is up and a backend is alive, then let api-table re-probe (it retries every 10s):
```
kubectl -n <ns> logs deploy/starrocks-fe --tail=30 | grep -i 'alive\|backend'
```

If StarRocks is genuinely not needed, disable Managed Tables with apiTable.enabled: false and infra.starrocks.enabled: false.

Helm release shows `failed`

helm history reports failed with a message like:

Release "scrydon" failed: resource Deployment/<ns>/api-table not ready

Cause

helm install --wait waited for every resource to become Ready and one Deployment never did (commonly api-table waiting on StarRocks, or a pod that was OOM-killed — see below). When the wait fails, the release is marked failed and the post-install hooks (database migrations) are skipped — which is what produces "Auth server unreachable".

Fix

Resolve the readiness failure named in the message (the sections above cover the usual ones).
Re-run helm upgrade to reconcile the release to deployed. The migration hook is idempotent and becomes a no-op once schemas exist:
```
helm -n <ns> upgrade <release> <chart> --reuse-values
```

Pods OOM-killed or crash-looping (single small node)

Symptoms: pods in CrashLoopBackOff or ContainerStatusUnknown, exit code 137, every pod failing readiness at once shortly after install.

Cause

The whole stack (all apps + the Dapr control plane + Postgres + StarRocks + object storage) was scheduled onto one undersized node. When every pod starts at once, memory limits overcommit and the kubelet OOM-kills pods, producing a cold-start storm.

Fix

Meet the Resource Requirements — 6 vCPU / 32 GB minimum for the full default stack — and prefer at least two nodes so the scheduler can spread the load and the control plane isn't competing with the apps.

On Azure, scaling a node pool can fail with InsufficientVCPUQuota. Check and raise your regional vCPU quota for the VM family before scaling:

az aks nodepool scale --cluster-name <cluster> --resource-group <rg> --name <pool> --node-count 3
# ERROR: (ErrCode_InsufficientVCPUQuota) … request a quota increase in the Azure portal

To run on a smaller footprint, trim optional components — see Helm → Trimming for low-resource installs.

If /setup bounces straight to /auth/sign-in on a brand-new install where no admin exists yet, the auth server is reporting setup as already complete. Verify with the get-setup-status call shown above — {"setupCompleted":true} with no human user present indicates the platform considers itself initialised. Restore from backup if this is an unexpected state, and contact support — do not edit the auth tables by hand.

Install troubleshooting

On this page