🛠️ SOP-002: Diagnostics & Troubleshooting Guide
This Standard Operating Procedure (SOP) runbook provides step-by-step diagnostic paths to resolve common system anomalies inside the AI Workflow Orchestrator.
🔒 1. Issue: SQLite File Lockups (Database is Locked)
Due to concurrent async execution blocks during agent debate rounds, SQLite may occasionally report file access errors: sqlite3.OperationalError: database is locked.
Recovery Steps:
- Identify active processes locking the database file:bash
fuser storage/memory.db - Halt blocking processes (replace
<PID>with the actual process ID):bashkill -9 <PID> - Enable Write-Ahead Logging (WAL) to allow simultaneous read/write cycles without locks:bashWAL mode resolves concurrent access block issues during intense debate iterations.
sqlite3 storage/memory.db "PRAGMA journal_mode=WAL;"
🔌 2. Issue: Token Budget Trips (Token Budget Exceeded)
The FastAPI Token Budget Middleware triggers a runtime circuit breaker. If aggregate API consumption crosses 100,000 tokens during a session, all active tasks are terminated: ValueError: Agent aborted: Token budget exceeded.
Recovery Steps:
- Reset Session Budgets by restarting the Uvicorn web server process:bash
kill -9 $(pgrep -f uvicorn) python -m uvicorn api.routes:app --host 0.0.0.0 --port 7860 - Increase Budget Thresholds inside the
security/token_budget.pyconfiguration file (e.g. updating limits from100000to200000tokens):python# Temporary configuration adjustment for complex operations self.max_input_tokens = 200000 self.max_output_tokens = 50000
🔑 3. Issue: Model Connection Errors (HTTP 429 / 401)
Errors such as 429 Too Many Requests or 401 Unauthorized indicate issues with LLM API credentials.
Recovery Steps:
- Verify credential validity by running a basic python connection test:python
import os from google import genai client = genai.Client(api_key=os.environ.get("GOOGLE_API_KEY")) response = client.models.generate_content(model="gemini-2.5-flash", contents="Hi") print(response.text) - If the test script fails, update the active key strings in your env parameters.
- The KeyManager automatically intercepts 429 exceptions and blocks the failing key for 60 seconds. If all configured keys are blocked, execution will stop. In this scenario, inject fresh API keys:bash
export GOOGLE_API_KEY="new_key_1,new_key_2"
🔄 4. Auditing Self-Healing Traces
If a task execution verification fails, the orchestrator triggers automated recovery. The complete diagnostics and repair flow is captured inside JSON trace objects.
File Locations:
All traces are written to the local storage: storage/traces/<trace_id>.json
Reading Self-Healing Diagnostics:
Read the newest trace file using jq to inspect Critic recommendations and Solution repairs:
cat $(ls -t storage/traces/*.json | head -n 1) | jq .Verify the "self_healing" nested parameters within the JSON object:
"critic_diagnosis": Explanation of what caused the command execution failure."solution_patch": Repaired execution commands routed back to the sandboxed runner."healed_successfully": Boolean value indicating recovery success.
⚖️ 5. Manual Calibration of Agent ELO Ratings
For QA testing or debate tuning, administrators may manually alter or reset agent ELO reputation coefficients.
Resetting all agent ratings to the default 1200:
Execute the following SQLite command in your terminal:
sqlite3 storage/memory.db "UPDATE agent_elo SET elo = 1200.0, matches = 0;"Boosting specific agent ratings (e.g. promoting the Security agent to 1400 ELO):
sqlite3 storage/memory.db "UPDATE agent_elo SET elo = 1400.0 WHERE agent_id = 'agent_security';"Altering Elo ratings immediately increases the agent's voting multiplier in consensus calculations.