Operations
This is the ops/SRE entrypoint.
It aims to answer, quickly:
- “Is the system up?”
- “What changed?”
- “What should I check next?”
Deep dives:
First 30 minutes (incident checklist)
0) Stabilize communications
- Identify incident lead and comms channel.
- Capture last deploy SHA/tag and time window.
- Do not paste secrets into chat/tickets.
1) Confirm impact
- UI broken vs API broken vs auth vs DB vs gateway integration.
- All users or subset?
2) Health checks
- Backend:
curl -f http://<backend-host>:8000/healthz
curl -f http://<backend-host>:8000/readyz
- Frontend:
- can the UI load?
- in browser devtools, are /api/v1/* requests failing?
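The two backend probes above can be wrapped in one helper so both endpoints are always checked and each result is reported separately. This is a minimal sketch: the function name check_backend and the 5-second timeout are assumptions, not part of the project; the port and paths are the ones shown above.

```shell
# Probe both backend health endpoints and report each result.
# Usage: check_backend <backend-host>
# check_backend is a hypothetical helper name, not a project script.
check_backend() {
  host="$1"
  rc=0
  for endpoint in healthz readyz; do
    # -f: treat HTTP >= 400 as failure; -sS: quiet, but still print errors;
    # --max-time 5: assumed timeout so a hung host does not block the check.
    if curl -fsS --max-time 5 -o /dev/null "http://${host}:8000/${endpoint}"; then
      echo "ok   /${endpoint}"
    else
      echo "FAIL /${endpoint}"
      rc=1
    fi
  done
  return "$rc"
}
```

If /healthz passes while /readyz fails, the process is up but a dependency it waits on may not be, which usually points at step 4 (Database) or step 3 (Configuration sanity) next.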
3) Configuration sanity
Common misconfigs that look like outages:
- NEXT_PUBLIC_API_URL wrong → UI loads but API calls fail.
- CORS_ORIGINS missing frontend origin → browser CORS errors.
- Clerk misconfig → auth redirects/401s.
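A quick way to rule out the simplest version of these misconfigs is to confirm the variables are present and non-empty in the env file. This is a sketch under assumptions: check_env_keys is a hypothetical helper, and whether all of these variables live in the same .env file depends on your deployment. It only detects missing or empty values, not wrong ones, and it never prints the values themselves.

```shell
# Report keys that are missing or empty in an env file (values are never printed).
# Usage: check_env_keys <env-file> KEY...
# check_env_keys is a hypothetical helper name, not a project script.
check_env_keys() {
  env_file="$1"
  shift
  missing=0
  for key in "$@"; do
    # Require KEY= followed by a real character; KEY=, KEY="" and KEY='' count as empty.
    if ! grep -Eq "^${key}=[\"']?[^\"'[:space:]]" "$env_file"; then
      echo "missing or empty: ${key}"
      missing=1
    fi
  done
  return "$missing"
}
```

For example: check_env_keys .env NEXT_PUBLIC_API_URL CORS_ORIGINS CLERK_SECRET_KEY DATABASE_URL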
4) Database
- If backend is 5xx’ing broadly, DB is a top suspect.
- Verify DATABASE_URL points at the correct host.
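Because DATABASE_URL usually embeds credentials, it helps to extract only the host before comparing it or sharing it in the incident channel. The sketch below assumes a URL of the form scheme://user:password@host:port/dbname; db_host is a hypothetical helper, and IPv6 literal hosts are not handled.

```shell
# Print only the host portion of a database URL, without credentials.
# Usage: db_host "$DATABASE_URL"
# db_host is a hypothetical helper name, not a project script.
db_host() {
  rest="${1#*://}"     # drop the scheme (e.g. postgresql://)
  rest="${rest%%/*}"   # drop /dbname and anything after it
  rest="${rest##*@}"   # drop user:password@ if present
  echo "${rest%%:*}"   # drop :port if present
}
```

If the PostgreSQL client tools are installed, the result can be passed to pg_isready -h to check whether that host is accepting connections.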
5) Logs
Compose:
docker compose -f compose.yml --env-file .env logs -f --tail=200
Targeted:
docker compose -f compose.yml --env-file .env logs -f --tail=200 backend
6) Rollback / isolate
- If there was a recent deploy and symptoms align, rollback to last known good.
- If gateway integration is implicated, isolate by disabling gateway-dependent flows.
Common failure modes
- UI loads, Activity feed blank → NEXT_PUBLIC_API_URL wrong/unreachable.
- Repeated auth redirects/errors → Clerk keys/redirects misconfigured.
- Backend 5xx → DB outage/misconfig; migration failure.
- Backend won’t start → config validation failure (e.g. empty CLERK_SECRET_KEY).
Backups
Evidence: docs/production/README.md.
- Minimum viable: periodic pg_dump to off-host storage.
- Treat restore as a drill (quarterly), not a one-time checklist.
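A periodic dump can be as small as the sketch below. It assumes DATABASE_URL is a plain postgresql:// URI that pg_dump accepts directly (a driver-suffixed scheme would need rewriting first); backup_name and run_backup are hypothetical helpers, and the off-host copy is left to whatever transfer tool you already use.

```shell
# Build a sortable, UTC-timestamped dump file name.
backup_name() {
  echo "backup-$(date -u +%Y%m%dT%H%M%SZ).dump"
}

# Dump the database into <out-dir> and print the resulting file path.
# Usage: run_backup <out-dir>   (requires DATABASE_URL in the environment)
# run_backup is a hypothetical helper name, not a project script.
run_backup() {
  out_dir="$1"
  file="${out_dir}/$(backup_name)"
  # --format=custom writes an archive that pg_restore can read.
  pg_dump --format=custom --file="$file" "$DATABASE_URL" || return 1
  echo "$file"
  # Next step (not shown): copy "$file" to off-host storage.
}
```

For the quarterly drill, restoring one of these files with pg_restore into a scratch database is the part that shows whether the backup is actually usable.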