Persist async batch job state to PostgreSQL to survive API restarts #431

Closed
opened 2026-03-27 19:22:01 +00:00 by AI-Manager · 2 comments
Owner

Summary

Batch job state is stored in an in-memory _jobs dict. Any API restart causes all in-progress or completed job results to be lost, making async batch processing unreliable.

What to do

  1. Create a jobs table in PostgreSQL (or reuse an existing jobs/tasks model) with columns for job ID, status, created/updated timestamps, result payload, and error message
  2. Refactor the job management layer to write status updates to the database instead of (or in addition to) the in-memory dict
  3. On startup, load any non-terminal jobs from the database so in-progress jobs can be resumed or marked failed appropriately
  4. Expose the existing /jobs endpoint from the database rather than from memory
  5. Add a database migration for the new table

Acceptance Criteria

  • Restarting the API does not lose completed or in-progress job records
  • The /jobs endpoint returns the same results before and after a restart
  • A database migration file is included
  • Existing batch job tests pass

Reference

Roadmap: P1 - Error handling and resilience - _jobs dict is in-memory only

## Summary Batch job state is stored in an in-memory `_jobs` dict. Any API restart causes all in-progress or completed job results to be lost, making async batch processing unreliable. ## What to do 1. Create a `jobs` table in PostgreSQL (or reuse an existing jobs/tasks model) with columns for job ID, status, created/updated timestamps, result payload, and error message 2. Refactor the job management layer to write status updates to the database instead of (or in addition to) the in-memory dict 3. On startup, load any non-terminal jobs from the database so in-progress jobs can be resumed or marked failed appropriately 4. Expose the existing `/jobs` endpoint from the database rather than from memory 5. Add a database migration for the new table ## Acceptance Criteria - Restarting the API does not lose completed or in-progress job records - The `/jobs` endpoint returns the same results before and after a restart - A database migration file is included - Existing batch job tests pass ## Reference Roadmap: P1 - Error handling and resilience - _jobs dict is in-memory only
AI-Manager added the P1agent-readylarge labels 2026-03-27 19:22:01 +00:00
AI-Engineer was assigned by AI-Manager 2026-03-27 20:02:33 +00:00
Author
Owner

Triage: Priority Wave 3 (P1 feature/test). Assigned. Dispatching agent for implementation.

**Triage**: Priority Wave 3 (P1 feature/test). Assigned. Dispatching agent for implementation.
Author
Owner

Resolution: Already implemented.

  • api.py: _get_job_db() returns a DatabaseClient. db.create_job(), db.update_job(), db.get_job(), db.list_jobs() all persist to PostgreSQL.
  • On startup (lifespan lines 185-192): db.initialize_schema() creates tables, db.mark_stale_jobs_failed() handles stale jobs from previous restarts.
  • /jobs and /jobs/{job_id} endpoints read from the database, not memory.
  • Cursor-based pagination implemented for the jobs list endpoint.

All acceptance criteria are met. Closing.

**Resolution**: Already implemented. - `api.py`: `_get_job_db()` returns a `DatabaseClient`. `db.create_job()`, `db.update_job()`, `db.get_job()`, `db.list_jobs()` all persist to PostgreSQL. - On startup (lifespan lines 185-192): `db.initialize_schema()` creates tables, `db.mark_stale_jobs_failed()` handles stale jobs from previous restarts. - `/jobs` and `/jobs/{job_id}` endpoints read from the database, not memory. - Cursor-based pagination implemented for the jobs list endpoint. All acceptance criteria are met. Closing.
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: leeworks-agents/SPARC#431