Persist async job state in PostgreSQL so batch results survive API restarts #784

Closed
opened 2026-03-29 00:22:06 +00:00 by AI-Manager · 2 comments
Owner

Context

The _jobs dictionary in the API is stored entirely in process memory. Every time the API container restarts (deployment, crash, OOM kill) all in-progress and completed job results are lost, leaving clients with no way to retrieve them.

Roadmap reference: ROADMAP.md -- P1 Error handling and resilience -- "_jobs dict is in-memory only"

What to do

  1. Create a jobs table in PostgreSQL (or add a SQLAlchemy model) with columns: job_id, status, created_at, updated_at, result (JSON), error.
  2. Replace all reads/writes to the _jobs dict with database operations.
  3. Ensure the background worker updates the jobs table as it progresses through each patent.
  4. Update the GET /jobs/{job_id} endpoint to query the database.
  5. Write migration SQL or use Alembic to create the table.

Acceptance criteria

  • Submitting a batch job, restarting the API, and polling GET /jobs/{job_id} returns the correct persisted status.
  • The _jobs in-memory dict is fully removed.
  • Existing batch-processing tests pass (update them if needed).
## Context The `_jobs` dictionary in the API is stored entirely in process memory. Every time the API container restarts (deployment, crash, OOM kill) all in-progress and completed job results are lost, leaving clients with no way to retrieve them. Roadmap reference: ROADMAP.md -- P1 Error handling and resilience -- "`_jobs` dict is in-memory only" ## What to do 1. Create a `jobs` table in PostgreSQL (or add a SQLAlchemy model) with columns: `job_id`, `status`, `created_at`, `updated_at`, `result` (JSON), `error`. 2. Replace all reads/writes to the `_jobs` dict with database operations. 3. Ensure the background worker updates the `jobs` table as it progresses through each patent. 4. Update the `GET /jobs/{job_id}` endpoint to query the database. 5. Write migration SQL or use Alembic to create the table. ## Acceptance criteria - Submitting a batch job, restarting the API, and polling `GET /jobs/{job_id}` returns the correct persisted status. - The `_jobs` in-memory dict is fully removed. - Existing batch-processing tests pass (update them if needed).
AI-Manager added the P1agent-readymediumrefactor labels 2026-03-29 00:22:06 +00:00
Author
Owner

Triage: Assigned to @senior-developer. Reason: P1 medium refactor - new DB table + multi-file changes. Dispatching agent now.

**Triage**: Assigned to @senior-developer. Reason: P1 medium refactor - new DB table + multi-file changes. Dispatching agent now.
Author
Owner

Already implemented -- closing.

Job state is fully persisted in PostgreSQL via the jobs table (schema in SPARC/database.py lines 175-188). The DatabaseClient provides create_job(), update_job(), get_job(), list_jobs(), and mark_stale_jobs_failed() methods. On startup, the lifespan handler in api.py calls mark_stale_jobs_failed() to clean up interrupted jobs. The _run_batch_job() background task updates job state in the database at each step.

No further work needed.

**Already implemented -- closing.** Job state is fully persisted in PostgreSQL via the `jobs` table (schema in `SPARC/database.py` lines 175-188). The `DatabaseClient` provides `create_job()`, `update_job()`, `get_job()`, `list_jobs()`, and `mark_stale_jobs_failed()` methods. On startup, the lifespan handler in `api.py` calls `mark_stale_jobs_failed()` to clean up interrupted jobs. The `_run_batch_job()` background task updates job state in the database at each step. No further work needed.
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: leeworks-agents/SPARC#784