openclaw-mission-control/docs/09-ops-runbooks.md
2026-02-11 06:30:08 +00:00

Ops / runbooks

Deep dives

This page is the operator/SRE entry point. It links to existing deeper docs rather than duplicating them, to minimize churn.

“First 30 minutes” incident checklist

  1. Confirm user impact + scope

    • What is broken: UI, API, auth, or gateway integration?
    • Is it all users or a subset?
  2. Check service health

    • Backend: /healthz and /readyz
    • Frontend: can it load? does it reach the API?
  3. Check auth (Clerk) configuration

    • Frontend: is Clerk enabled unexpectedly (i.e., is a publishable key set)?
    • Backend: is CLERK_JWKS_URL configured correctly?
  4. Check DB connectivity

    • Can backend connect to Postgres (DATABASE_URL)?
  5. Check logs

    • Backend logs for 5xx spikes or auth failures.
    • Frontend logs for proxy/API URL misconfiguration.
  6. Stabilize

    • Roll back the last change if a rollback is available.
    • Temporarily disable optional integrations (gateway) to isolate.
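
Steps 2 to 4 of the checklist can be partly scripted. The following is a minimal sketch, not part of the project's tooling: it probes /healthz and /readyz and sanity-checks CLERK_JWKS_URL and DATABASE_URL. The BACKEND_URL variable, its http://localhost:8000 default, and the function names are assumptions made for illustration; the check only validates URL shape and does not open a database connection or fetch the JWKS document.

```python
import os
import urllib.error
import urllib.request
from urllib.parse import urlparse


def probe(base_url, path, timeout=5.0):
    """Return (path, HTTP status or None, detail) for one backend endpoint."""
    url = base_url.rstrip("/") + path
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return path, resp.status, "ok"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (e.g. 503 from /readyz).
        return path, exc.code, str(exc.reason)
    except OSError as exc:
        # Covers URLError, connection refused, DNS failures, and timeouts.
        return path, None, str(exc)


def check_config(env):
    """Return a list of problems found in an environment-like mapping."""
    problems = []
    jwks = urlparse(env.get("CLERK_JWKS_URL", ""))
    if jwks.scheme not in ("http", "https") or not jwks.netloc:
        problems.append("CLERK_JWKS_URL is missing or not an http(s) URL")
    # Accepts postgres://, postgresql://, and driver variants such as postgresql+psycopg://
    if not urlparse(env.get("DATABASE_URL", "")).scheme.startswith("postgres"):
        problems.append("DATABASE_URL is missing or not a Postgres URL")
    return problems


if __name__ == "__main__":
    # BACKEND_URL is a hypothetical variable used only by this sketch.
    base = os.environ.get("BACKEND_URL", "http://localhost:8000")
    for endpoint in ("/healthz", "/readyz"):
        print(probe(base, endpoint))
    for problem in check_config(os.environ):
        print("CONFIG:", problem)
```

Run it in the same environment as the backend so that it sees the same variables. If Clerk is intentionally disabled in that environment, the CLERK_JWKS_URL warning can be ignored.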

Backups / restore (placeholder)

  • Define backup cadence and restore steps once production deployment is finalized.