3.5 KiB
3.5 KiB
Gateway Agent Provisioning and Check-In Troubleshooting
This guide explains how agent provisioning converges to a healthy state, and how to debug when an agent appears stuck.
Fast Convergence Policy
Mission Control now uses a fast convergence policy for wake/check-in:
- Check-in deadline after each wake: 30 seconds
- Maximum wake attempts without check-in: 3
- If no check-in after the third attempt: agent is marked offline and provisioning escalation stops
This applies to both gateway-main and board agents.
Expected Lifecycle
- Mission Control provisions/updates the agent and sends wake.
- A delayed reconcile task is queued for the check-in deadline.
- Agent should call heartbeat quickly after startup/bootstrap.
- If heartbeat arrives:
last_seen_atis updated- wake escalation state is reset (
wake_attempts=0, check-in deadline cleared)
- If heartbeat does not arrive by deadline:
- reconcile re-runs lifecycle (wake again)
- up to 3 total wake attempts
- If still no heartbeat after 3 attempts:
- agent status becomes
offline last_provision_erroris set
- agent status becomes
Startup Check-In Behavior
Templates now explicitly require immediate first-cycle check-in:
- Main agent heartbeat instructions require immediate check-in after wake/bootstrap.
- Board lead bootstrap requires heartbeat check-in before orchestration.
- Board worker bootstrap already included immediate check-in.
If a gateway still has older templates, run template sync and reprovision/wake.
What You Should See in Logs
Healthy flow usually includes:
lifecycle.queue.enqueuedqueue.worker.success(for lifecycle tasks)lifecycle.reconcile.skip_not_stuck(after heartbeat lands)
If agent is not checking in:
lifecycle.reconcile.deferred(before deadline)lifecycle.reconcile.retriggered(retry wake)lifecycle.reconcile.max_attempts_reached(final fail-safe at attempt 3)
If you do not see lifecycle events at all, verify queue worker health first.
Common Failure Modes
Wake was sent, but no check-in arrived
Possible causes:
- Agent process never started or crashed during bootstrap
- Agent ignored startup instructions due to stale templates
- Heartbeat call failed (network/auth/base URL mismatch)
Actions:
- Confirm current templates were synced to gateway.
- Re-run provisioning/update to trigger a fresh wake.
- Verify agent can reach Mission Control API and send heartbeat with
X-Agent-Token.
Agent stays provisioning/updating with no retries
Possible causes:
- Queue worker not running
- Queue/Redis mismatch between API process and worker process
Actions:
- Verify worker process is running continuously.
- Verify
rq_redis_urlandrq_queue_nameare identical for API and worker. - Check worker logs for dequeue/handler errors.
Agent ended offline quickly
This is expected when no check-in is received after 3 wake attempts. The system fails fast by design.
Actions:
- Fix check-in path first (startup, network, token, API reachability).
- Re-run provisioning/update to start a new attempt cycle.
Operator Recovery Checklist
- Ensure queue worker is running.
- Sync templates for the gateway.
- Trigger agent update/provision from Mission Control.
- Watch logs for:
lifecycle.queue.enqueuedlifecycle.reconcile.retriggered(if needed)- heartbeat activity /
skip_not_stuck
- If still failing, capture:
- gateway logs around bootstrap
- worker logs around lifecycle events
- agent
last_provision_error,wake_attempts,last_seen_at