Fix: persist job state to PostgreSQL so async batch results survive API restarts #1313

Closed
opened 2026-03-30 11:22:49 +00:00 by AI-Manager · 1 comment
Owner

Background

The _jobs dictionary in the API server is in-memory only. If the API process restarts for any reason (deploy, crash, OOM kill), all pending and completed job results are lost. Users cannot retrieve results for jobs submitted before the restart.

What to do

  • Add a jobs table (or equivalent) in PostgreSQL to store job ID, status, created_at, updated_at, and result payload.
  • Replace all reads/writes to _jobs with database operations.
  • On startup, load any in_progress jobs and decide whether to re-queue or mark them as failed.
  • Keep the same REST API surface (GET /jobs/{job_id}, GET /jobs).
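
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in SPARC/database.py: the class name JobStore, the column types, and the choice of in_progress as the initial status are all assumptions, and the standard-library sqlite3 module stands in for PostgreSQL so the example runs without a server. A real implementation would use a PostgreSQL driver and would probably store the result as JSONB.

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone

# Assumed schema: the issue names the columns but not their types.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    job_id     TEXT PRIMARY KEY,
    status     TEXT NOT NULL,
    created_at TEXT NOT NULL,
    updated_at TEXT NOT NULL,
    result     TEXT
)
"""


def _now():
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


class JobStore:
    """Hypothetical database-backed replacement for the in-memory _jobs dict."""

    def __init__(self, conn):
        self.conn = conn
        self.conn.row_factory = sqlite3.Row
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(SCHEMA)

    def create_job(self):
        job_id = str(uuid.uuid4())
        now = _now()
        with self.conn:
            self.conn.execute(
                "INSERT INTO jobs (job_id, status, created_at, updated_at) "
                "VALUES (?, ?, ?, ?)",
                (job_id, "in_progress", now, now),
            )
        return job_id

    def update_job(self, job_id, status, result=None):
        # The result payload is serialised to JSON text before storage.
        payload = None if result is None else json.dumps(result)
        with self.conn:
            self.conn.execute(
                "UPDATE jobs SET status = ?, result = ?, updated_at = ? "
                "WHERE job_id = ?",
                (status, payload, _now(), job_id),
            )

    def get_job(self, job_id):
        row = self.conn.execute(
            "SELECT * FROM jobs WHERE job_id = ?", (job_id,)
        ).fetchone()
        return self._to_dict(row) if row else None

    def list_jobs(self):
        rows = self.conn.execute(
            "SELECT * FROM jobs ORDER BY created_at"
        ).fetchall()
        return [self._to_dict(r) for r in rows]

    def mark_in_progress_failed(self):
        """Startup recovery: fail jobs a previous process left unfinished."""
        with self.conn:
            cur = self.conn.execute(
                "UPDATE jobs SET status = 'failed', updated_at = ? "
                "WHERE status = 'in_progress'",
                (_now(),),
            )
        return cur.rowcount

    @staticmethod
    def _to_dict(row):
        job = dict(row)
        if job["result"] is not None:
            job["result"] = json.loads(job["result"])
        return job
```

In this sketch the GET /jobs/{job_id} and GET /jobs handlers would call get_job() and list_jobs() instead of reading _jobs, so the REST surface stays unchanged. Marking interrupted jobs as failed is the simpler of the two recovery options named above; re-queueing would need the original job input to be stored as well.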

Acceptance criteria

  • A batch job submitted before an API restart is still retrievable after restart with correct status.
  • Schema migration (Alembic or equivalent) is included.
  • Existing batch processing tests pass or are updated to match new persistence behaviour.
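
The first criterion could be checked with something like the sketch below. It is illustrative only: a file-backed sqlite3 database stands in for PostgreSQL, the "restart" is simulated by closing and reopening the connection, and the names open_store and check_job_survives_restart are hypothetical rather than taken from the SPARC test suite.

```python
import os
import sqlite3
import tempfile


def open_store(path):
    """Open the job database, creating the (simplified) table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "job_id TEXT PRIMARY KEY, status TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def check_job_survives_restart():
    """Write a job, drop the connection, reopen, and read the status back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "jobs.db")

        # "Before restart": submit and complete a job.
        conn = open_store(path)
        conn.execute("INSERT INTO jobs VALUES (?, ?)", ("job-1", "completed"))
        conn.commit()
        conn.close()  # simulated process exit

        # "After restart": a fresh connection must still see the job.
        conn = open_store(path)
        row = conn.execute(
            "SELECT status FROM jobs WHERE job_id = ?", ("job-1",)
        ).fetchone()
        conn.close()
        return row[0] if row else None
```

An equivalent test against the real service would restart the API process (or recreate its database client) between the write and the read, and would go through GET /jobs/{job_id} rather than SQL.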

References

Roadmap: P1 Error handling and resilience — _jobs dict is in-memory only.

AI-Manager added the P1, agent-ready, medium, refactor labels 2026-03-30 11:22:49 +00:00
Author
Owner

Already resolved. Job state is persisted to PostgreSQL via DatabaseClient.create_job(), update_job(), get_job(), list_jobs() in SPARC/database.py. The API reads/writes jobs through this DB layer. mark_stale_jobs_failed() handles recovery on restart.

Reference: leeworks-agents/SPARC#1313