Persist async job state in PostgreSQL so jobs survive API restarts #1404

Closed
opened 2026-03-30 18:22:17 +00:00 by AI-Manager · 1 comment
Owner

Context

Roadmap item: P1 -- Error handling and resilience

The _jobs dictionary in the API holds all batch job status in memory. When the API process restarts (e.g., due to a crash or redeploy), all in-progress and completed job results are lost. Users have no way to retrieve results after a restart.

What to do

  • Design and migrate a jobs table in PostgreSQL with columns for job_id, status, created_at, updated_at, result (JSONB), and error.
  • Replace all reads/writes to _jobs with database queries.
  • On startup, reconcile any jobs that were running when the process died (mark them as failed with an appropriate error message).
  • Ensure the existing job status and results endpoints continue to work.
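The steps above could be sketched roughly as follows. This is a minimal illustration, not the project's code. The column types in `POSTGRES_DDL` are assumptions. The function names match those mentioned in the triage comment, but their signatures here are guesses. The runnable part uses the standard-library `sqlite3` module as a stand-in for PostgreSQL so it works without a server, which means `JSONB` becomes `TEXT` and timestamps become ISO strings.

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone

# PostgreSQL DDL matching the columns listed in the issue.
# Column types are assumptions, not taken from the repository.
POSTGRES_DDL = """
CREATE TABLE IF NOT EXISTS jobs (
    job_id     TEXT PRIMARY KEY,
    status     TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    result     JSONB,
    error      TEXT
);
"""

# SQLite stand-in so the sketch runs locally: JSONB and TIMESTAMPTZ become TEXT.
SQLITE_DDL = """
CREATE TABLE IF NOT EXISTS jobs (
    job_id     TEXT PRIMARY KEY,
    status     TEXT NOT NULL,
    created_at TEXT NOT NULL,
    updated_at TEXT NOT NULL,
    result     TEXT,
    error      TEXT
);
"""


def _now():
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


def init_db(conn):
    conn.execute(SQLITE_DDL)
    conn.commit()


def create_job(conn, status="pending"):
    """Insert a new job row and return its generated id."""
    job_id = str(uuid.uuid4())
    now = _now()
    conn.execute(
        "INSERT INTO jobs (job_id, status, created_at, updated_at) "
        "VALUES (?, ?, ?, ?)",
        (job_id, status, now, now),
    )
    conn.commit()
    return job_id


def update_job(conn, job_id, status, result=None, error=None):
    """Overwrite status, result, and error for one job."""
    payload = json.dumps(result) if result is not None else None
    conn.execute(
        "UPDATE jobs SET status = ?, result = ?, error = ?, updated_at = ? "
        "WHERE job_id = ?",
        (status, payload, error, _now(), job_id),
    )
    conn.commit()


def get_job(conn, job_id):
    """Return the job as a dict, or None if the id is unknown."""
    row = conn.execute(
        "SELECT job_id, status, result, error FROM jobs WHERE job_id = ?",
        (job_id,),
    ).fetchone()
    if row is None:
        return None
    return {
        "job_id": row[0],
        "status": row[1],
        "result": json.loads(row[2]) if row[2] is not None else None,
        "error": row[3],
    }


def mark_stale_jobs_failed(conn, message="Job interrupted by API restart"):
    """Startup reconciliation: fail every job still marked running."""
    cur = conn.execute(
        "UPDATE jobs SET status = 'failed', error = ?, updated_at = ? "
        "WHERE status = 'running'",
        (message, _now()),
    )
    conn.commit()
    return cur.rowcount  # number of jobs that were reconciled
```

Calling `mark_stale_jobs_failed()` once during startup, before the API accepts requests, keeps the reconciliation from touching jobs started by the new process.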

Acceptance criteria

  • Job records are visible in PostgreSQL after creation.
  • Restarting the API does not lose existing job records.
  • A job that was running at restart time is marked failed after the next startup.
  • Tests cover job persistence across a simulated restart.
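One way the last criterion might be tested is to write jobs through one connection, close it to simulate the process dying, and then run startup again against the same database. The sketch below is an assumption-laden illustration: it uses a temporary SQLite file in place of PostgreSQL, a reduced three-column table, and hypothetical names (`open_db`, `STALE_MESSAGE`) that do not come from the repository.

```python
import os
import sqlite3
import tempfile

STALE_MESSAGE = "Job interrupted by API restart"  # hypothetical wording


def open_db(path):
    """Stand-in for API startup: connect, ensure schema, reconcile stale jobs."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "job_id TEXT PRIMARY KEY, status TEXT NOT NULL, error TEXT)"
    )
    # Any job still marked running belongs to a process that no longer exists.
    conn.execute(
        "UPDATE jobs SET status = 'failed', error = ? WHERE status = 'running'",
        (STALE_MESSAGE,),
    )
    conn.commit()
    return conn


def test_jobs_survive_restart():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "jobs.db")

        # First "process": one finished job and one job still in flight.
        conn = open_db(path)
        conn.executemany(
            "INSERT INTO jobs (job_id, status) VALUES (?, ?)",
            [("done-1", "completed"), ("busy-1", "running")],
        )
        conn.commit()
        conn.close()  # simulated crash or redeploy

        # Second "process": startup reconciliation runs inside open_db.
        conn = open_db(path)
        rows = dict(conn.execute("SELECT job_id, status FROM jobs"))
        error = conn.execute(
            "SELECT error FROM jobs WHERE job_id = 'busy-1'"
        ).fetchone()[0]
        conn.close()

    assert rows == {"done-1": "completed", "busy-1": "failed"}
    assert error == STALE_MESSAGE
    return rows
```

A test against the real service would presumably point at a disposable PostgreSQL database and go through the API's own startup path rather than a helper like `open_db`.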
AI-Manager added the P1, agent-ready, medium, refactor labels 2026-03-30 18:22:17 +00:00
Author
Owner

Triage: Already resolved in main.

Job state is persisted in PostgreSQL via the create_job(), update_job(), get_job(), and get_jobs() methods in SPARC/database.py. On startup, mark_stale_jobs_failed() is called to handle jobs that were running when the API restarted (api.py line 189). The job listing endpoint uses database queries. Closing as complete.

Reference: leeworks-agents/SPARC#1404