openclaw-mission-control/docs/09-ops-runbooks.md
2026-02-11 12:59:37 +00:00

# Operations
This is the ops/SRE entrypoint.
It aims to answer, quickly:
- “Is the system up?”
- “What changed?”
- “What should I check next?”
Deep dives:
- [Deployment](deployment/README.md)
- [Production](production/README.md)
- [Troubleshooting deep dive](troubleshooting/README.md)
## First 30 minutes (incident checklist)
### 0) Stabilize communications
- Identify incident lead and comms channel.
- Capture last deploy SHA/tag and time window.
- Do not paste secrets into chat/tickets.
### 1) Confirm impact
- Which layer is broken: UI, API, auth, DB, or gateway integration?
- Does it affect all users or only a subset?
### 2) Health checks
- Backend:
  - `curl -f http://<backend-host>:8000/healthz`
  - `curl -f http://<backend-host>:8000/readyz`
- Frontend:
  - can the UI load?
  - in browser devtools, are `/api/v1/*` requests failing?
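
The two backend probes above can be wrapped in one helper so that both endpoints are always checked together. This is a minimal sketch, not part of the project: the function name `check_backend` and the 5-second timeout are assumptions, and the port defaults to the 8000 used in the checklist.

```bash
#!/usr/bin/env bash
# Probe the backend health endpoints named in the checklist.
# Usage: check_backend <backend-host> [port]   (port defaults to 8000)
check_backend() {
  local host="$1" port="${2:-8000}" path failed=0
  for path in healthz readyz; do
    # -f: non-zero exit on HTTP >= 400; -sS: quiet, but still print errors.
    # --max-time 5 is an assumed budget so a hung backend cannot stall the check.
    if curl -fsS --max-time 5 -o /dev/null "http://${host}:${port}/${path}"; then
      echo "OK   /${path}"
    else
      echo "FAIL /${path}"
      failed=1
    fi
  done
  return "$failed"
}
```

Call it as `check_backend <backend-host>`. A non-zero return code means at least one endpoint failed, which makes the helper usable in a shell `if` or a cron check.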
### 3) Configuration sanity
Common misconfigs that look like outages:
- `NEXT_PUBLIC_API_URL` wrong → UI loads but API calls fail.
- `CORS_ORIGINS` missing frontend origin → browser CORS errors.
- Clerk misconfig → auth redirects/401s.
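
The first two misconfigurations above can be caught with a quick environment check. The sketch below rests on assumptions: the function name `check_config` is hypothetical, the variables are expected to be loaded into the current shell already, and `CORS_ORIGINS` is treated as a comma-separated list, which may not match how the backend actually parses it. Clerk settings are not covered.

```bash
#!/usr/bin/env bash
# Flag missing variables and a CORS list that omits the frontend origin.
# Usage: check_config <frontend-origin>
check_config() {
  local frontend_origin="$1" problems=0 name origins
  for name in NEXT_PUBLIC_API_URL CORS_ORIGINS; do
    # ${!name} is bash indirect expansion: the value of the variable named by $name.
    if [ -z "${!name:-}" ]; then
      echo "MISSING ${name}"   # report the name only, never the value
      problems=1
    fi
  done
  # Assumption: comma-separated origins; whitespace around entries is ignored.
  origins="${CORS_ORIGINS:-}"
  origins="${origins//[[:space:]]/}"
  case ",${origins}," in
    *",${frontend_origin},"*) ;;
    *) echo "CORS_ORIGINS does not include ${frontend_origin}"; problems=1 ;;
  esac
  return "$problems"
}
```

The helper prints variable names but never their values, so its output is safe to paste into the incident channel.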
### 4) Database
- If backend is 5xxing broadly, DB is a top suspect.
- Verify `DATABASE_URL` points at the correct host.
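
To verify the host without exposing credentials, the host and port can be extracted from `DATABASE_URL` before anything is printed. This is a sketch that assumes the usual `scheme://user:password@host:port/dbname` shape with URL-encoded credentials; the function name `db_host_from_url` is hypothetical.

```bash
#!/usr/bin/env bash
# Print only host[:port] from a database URL, dropping credentials and path.
# Usage: db_host_from_url "$DATABASE_URL"
db_host_from_url() {
  local rest="${1#*://}"   # drop the scheme, e.g. "postgresql://"
  rest="${rest%%/*}"       # drop "/dbname" and anything after it
  rest="${rest%%\?*}"      # drop a query string if there was no path
  rest="${rest##*@}"       # drop "user:password@" if present
  printf '%s\n' "$rest"
}
```

For example, `db_host_from_url "$DATABASE_URL"` prints something like `db.internal:5432`, which is safe to share. If the PostgreSQL client tools are installed, `pg_isready -h <host> -p <port>` can then confirm that the server accepts connections.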
### 5) Logs
Compose:
```bash
docker compose -f compose.yml --env-file .env logs -f --tail=200
```
Targeted:
```bash
docker compose -f compose.yml --env-file .env logs -f --tail=200 backend
```
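
When the incident window is known, it can help to bound the logs by time rather than by line count. The wrapper below is a sketch that reuses the same `compose.yml` and `.env` as the commands above; the function name `logs_since` is hypothetical.

```bash
#!/usr/bin/env bash
# Show timestamped logs from a recent window, optionally for specific services.
# Usage: logs_since <duration> [service...]   e.g. logs_since 30m backend
logs_since() {
  local since="$1"
  shift
  # --since takes a relative duration (e.g. 30m) or a timestamp;
  # --no-color keeps the output clean when it is piped to grep or a file.
  docker compose -f compose.yml --env-file .env \
    logs --no-color --timestamps --since "$since" "$@"
}
```

For example, `logs_since 30m backend > backend-incident.log` saves the last 30 minutes of backend logs for later review. Check the file for secrets before sharing it.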
### 6) Rollback / isolate
- If there was a recent deploy and the symptoms align with it, roll back to the last known good release.
- If gateway integration is implicated, isolate by disabling gateway-dependent flows.
## Common failure modes
- UI loads, Activity feed blank → `NEXT_PUBLIC_API_URL` wrong/unreachable.
- Repeated auth redirects/errors → Clerk keys/redirects misconfigured.
- Backend 5xx → DB outage/misconfig; migration failure.
- Backend won't start → config validation failure (e.g. empty `CLERK_SECRET_KEY`).
## Backups
Evidence: `docs/production/README.md`.
- Minimum viable: periodic `pg_dump` to off-host storage.
- Treat restore as a drill (quarterly), not a one-time checklist.
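
A periodic dump along these lines could look like the sketch below. It is not the project's backup tooling: the function name `backup_db` and the `mission-control-` file prefix are hypothetical, the URL is assumed to be a plain `postgresql://` connection string that `pg_dump` accepts, and copying the file to off-host storage is left to whatever transfer tool is in use.

```bash
#!/usr/bin/env bash
# Dump the database to a timestamped file and print that file's path.
# Usage: backup_db <database-url> <backup-dir>
backup_db() {
  local url="$1" dir="$2" stamp out
  stamp="$(date -u +%Y%m%dT%H%M%SZ)"              # UTC timestamp, sortable
  out="${dir}/mission-control-${stamp}.dump"      # hypothetical naming scheme
  mkdir -p "$dir" || return 1
  # --format=custom writes a compressed archive that pg_restore can read.
  pg_dump --format=custom --file="$out" --dbname="$url" || return 1
  printf '%s\n' "$out"
}
```

A scheduled job could call `backup_db "$DATABASE_URL" /var/backups/mc` and then ship the printed path off-host. A URL passed on the command line is visible in the process list, so a `~/.pgpass` file may be preferable for the password. For the quarterly drill, restore the archive into a scratch database with `pg_restore`.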