docs: add missing context to overview/architecture/ops/troubleshooting

Abhimanyu Saharan
2026-02-11 09:45:29 +00:00
parent 04c6822ea8
commit 7be38140a1
4 changed files with 106 additions and 46 deletions

@@ -6,32 +6,33 @@
- [Production](production/README.md)
- [Troubleshooting](troubleshooting/README.md)
This page is the operator/SRE entry point. It intentionally links to existing deeper docs to minimize churn.
This page is the operator entrypoint. It points to the existing deep-dive runbooks and adds a short “first 30 minutes” checklist.
## First 30 minutes incident checklist
## First 30 minutes (incident checklist)
1. **Confirm user impact + scope**
- What is broken: UI, API, auth, or gateway integration?
- Is it all users or a subset?
1. **Confirm impact**
- What's broken: UI, API, auth, or gateway integration?
- All users or a subset?
2. **Check service health**
- Backend: `/healthz` and `/readyz`
- Frontend: can it load? does it reach the API?
3. **Check auth (Clerk) configuration**
- Frontend: is Clerk enabled unexpectedly? (publishable key set)
- Backend: is `CLERK_JWKS_URL` configured correctly?
3. **Check auth (Clerk)**
- Frontend: did Clerk get enabled unintentionally? (publishable key set)
- Backend: is `CLERK_SECRET_KEY` configured correctly?
4. **Check DB connectivity**
- Can backend connect to Postgres (`DATABASE_URL`)?
5. **Check logs**
- Backend logs for 5xx spikes or auth failures.
- Frontend logs for proxy/API URL misconfig.
- Frontend logs for API URL/proxy misconfig.
6. **Stabilize**
- Roll back the last change if available.
- Roll back the last change if you can.
- Temporarily disable optional integrations (gateway) to isolate.
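Steps 2 and 4 of the checklist can be scripted. The following is a minimal sketch, not part of the repository: the function names (`probe`, `check_backend`, `postgres_host_port`, `tcp_reachable`) are hypothetical, and it assumes the backend answers HTTP 200 on `/healthz` and `/readyz` when healthy and that `DATABASE_URL` is a standard `postgresql://` URL. The TCP check only shows that the database port accepts connections; it does not verify credentials.

```python
import socket
import urllib.error
import urllib.request
from urllib.parse import urlsplit


def probe(url, timeout=3.0):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with a 4xx/5xx status
    except OSError:
        return None  # DNS failure, connection refused, timeout, ...


def check_backend(base_url, paths=("/healthz", "/readyz"), fetch=probe):
    """Map each path to True when it returned HTTP 200 (assumed healthy status)."""
    base = base_url.rstrip("/")
    return {path: fetch(base + path) == 200 for path in paths}


def postgres_host_port(database_url):
    """Extract (host, port) from a DATABASE_URL, defaulting to port 5432."""
    parts = urlsplit(database_url)
    return parts.hostname, parts.port or 5432


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `check_backend("http://localhost:8000")` (a placeholder address) returns a dict such as `{"/healthz": True, "/readyz": False}`, and `tcp_reachable(*postgres_host_port(os.environ["DATABASE_URL"]))` (after `import os`) reports whether the Postgres port is reachable from the machine running the script.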
## Backups / restore (placeholder)
- Define backup cadence and restore steps once production deployment is finalized.
## Backups / restore
See [Production](production/README.md). If you run Mission Control in production, treat backup/restore as a regular drill, not a one-time setup.