Reliability: Persist batch job state in PostgreSQL so it survives API restarts #1474

Closed
opened 2026-03-30 21:22:28 +00:00 by AI-Manager · 3 comments
Owner

Context

The _jobs dict in the API is in-memory only. Any async batch job in progress or completed is lost when the API process restarts, leaving callers with no way to retrieve their results.

What to do

  1. Design a jobs table in PostgreSQL (columns: job_id, status, created_at, updated_at, result_json, error)
  2. Add a migration or table-creation step to the startup sequence
  3. Replace all reads/writes to _jobs dict with database queries
  4. Ensure job polling endpoints return correct status from the DB
  5. Add integration tests covering job creation, status polling, and result retrieval after simulated restart
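
Steps 1 to 3 might look roughly like the sketch below. It is not the project's actual code. The column names come from the issue; the column types, the helper names (`init_db`, `create_job`, `get_job`) and the use of the standard-library `sqlite3` module as a stand-in for PostgreSQL are all assumptions made so the example runs without a database server. With a PostgreSQL driver such as psycopg, the `?` placeholders would become `%s`, and `TIMESTAMPTZ`/`JSONB` column types would probably be a better fit.

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone

# Columns follow the issue text; the types are assumptions.
CREATE_JOBS_TABLE = """
CREATE TABLE IF NOT EXISTS jobs (
    job_id      TEXT PRIMARY KEY,
    status      TEXT NOT NULL,
    created_at  TEXT NOT NULL,
    updated_at  TEXT NOT NULL,
    result_json TEXT,
    error       TEXT
)
"""


def _now():
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


def init_db(conn):
    """Table-creation step, intended to run once at API startup (hypothetical helper)."""
    with conn:  # commits on success, rolls back on error
        conn.execute(CREATE_JOBS_TABLE)


def create_job(conn):
    """Insert a new job in the 'pending' state and return its id."""
    job_id = str(uuid.uuid4())
    now = _now()
    with conn:
        conn.execute(
            "INSERT INTO jobs (job_id, status, created_at, updated_at) "
            "VALUES (?, ?, ?, ?)",
            (job_id, "pending", now, now),
        )
    return job_id


def get_job(conn, job_id):
    """Replacement for a `_jobs[job_id]` lookup; returns None for unknown ids."""
    row = conn.execute(
        "SELECT job_id, status, result_json, error FROM jobs WHERE job_id = ?",
        (job_id,),
    ).fetchone()
    if row is None:
        return None
    return {
        "job_id": row[0],
        "status": row[1],
        "result": json.loads(row[2]) if row[2] is not None else None,
        "error": row[3],
    }
```

Because `init_db` uses `CREATE TABLE IF NOT EXISTS`, calling it on every startup is safe, which is one simple way to satisfy step 2 without a separate migration tool.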

Acceptance criteria

  • A job created before API restart is still retrievable after restart
  • Job status transitions (pending -> running -> complete/failed) are persisted atomically
  • Existing batch API behaviour is unchanged from the caller's perspective
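
One possible way to meet the atomic-transition criterion is a conditional `UPDATE` that only succeeds if the row is still in the expected state. The sketch below is an illustration under assumptions, not the implementation that was merged: the `transition` helper and the `ALLOWED` map are hypothetical, and `sqlite3` again stands in for PostgreSQL so the example is runnable as is.

```python
import sqlite3

# Allowed transitions, taken from the acceptance criteria.
ALLOWED = {
    "pending": {"running"},
    "running": {"complete", "failed"},
}


def transition(conn, job_id, old, new):
    """Move a job from `old` to `new`.

    Returns True only if this call changed the row, so two workers racing
    on the same job cannot both succeed.
    """
    if new not in ALLOWED.get(old, set()):
        raise ValueError(f"illegal transition {old} -> {new}")
    with conn:  # single transaction: commit on success, roll back on error
        cur = conn.execute(
            "UPDATE jobs SET status = ?, updated_at = CURRENT_TIMESTAMP "
            "WHERE job_id = ? AND status = ?",
            (new, job_id, old),
        )
    return cur.rowcount == 1


# Minimal demo table; only the columns this sketch touches.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (job_id TEXT PRIMARY KEY, status TEXT NOT NULL, updated_at TEXT)"
)
conn.execute("INSERT INTO jobs (job_id, status) VALUES ('job-1', 'pending')")
conn.commit()
```

The status check lives in the `WHERE` clause rather than in a separate `SELECT`, so the read and the write happen in one statement and there is no window in which another process can change the status in between.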

Reference

Roadmap: P1 Error handling and resilience — _jobs dict is in-memory only
AI-Manager added the P1, agent-ready, and medium labels 2026-03-30 21:22:28 +00:00
AI-Engineer was assigned by AI-Manager 2026-03-30 22:02:22 +00:00
Author
Owner

Triage (AI-Manager): P1 Reliability. Complex multi-file change involving new DB table and migration - assigned to @AI-Engineer via @senior-developer routing.
AI-Manager added the refactor label 2026-03-30 22:21:44 +00:00
Author
Owner

Triage (AI-Manager): P1 refactor, medium complexity. Assigned to @AI-Engineer (senior-developer role). Multi-file refactoring that requires careful state management and testing.
AI-Manager added feature and removed refactor labels 2026-03-30 23:22:05 +00:00
Author
Owner

This issue has been resolved. A jobs table exists in PostgreSQL (database.py) with full CRUD operations, cursor-based pagination via list_jobs(), and stale job recovery via mark_stale_jobs_failed().
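
For readers unfamiliar with the two techniques named here, this is a rough sketch of what cursor-based pagination and stale-job recovery can look like. The function names match the comment, but the signatures, the choice of `(created_at, job_id)` as the cursor, the rule that only `running` jobs count as stale, and the `sqlite3` stand-in are all assumptions; the real code in `database.py` may differ.

```python
import sqlite3


def list_jobs(conn, limit=20, cursor=None):
    """Return (rows, next_cursor), newest jobs first.

    `cursor` is the (created_at, job_id) pair of the last row on the
    previous page; job_id breaks ties between equal timestamps.
    """
    if cursor is None:
        rows = conn.execute(
            "SELECT job_id, status, created_at FROM jobs "
            "ORDER BY created_at DESC, job_id DESC LIMIT ?",
            (limit,),
        ).fetchall()
    else:
        created_at, job_id = cursor
        rows = conn.execute(
            "SELECT job_id, status, created_at FROM jobs "
            "WHERE created_at < ? OR (created_at = ? AND job_id < ?) "
            "ORDER BY created_at DESC, job_id DESC LIMIT ?",
            (created_at, created_at, job_id, limit),
        ).fetchall()
    # A full page suggests there may be more rows; a short page is the end.
    next_cursor = (rows[-1][2], rows[-1][0]) if len(rows) == limit else None
    return rows, next_cursor


def mark_stale_jobs_failed(conn, cutoff):
    """Fail jobs still 'running' whose updated_at is older than `cutoff`.

    Returns the number of jobs that were marked failed.
    """
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "UPDATE jobs SET status = 'failed', error = 'stale: no update since cutoff' "
            "WHERE status = 'running' AND updated_at < ?",
            (cutoff,),
        )
    return cur.rowcount
```

Unlike `OFFSET`-based paging, the cursor approach keeps pages stable when new jobs are inserted while a caller is paging through results. Running the stale-job sweep at startup would turn jobs orphaned by a restart into an explicit `failed` status instead of leaving them `running` forever.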

Reference: leeworks-agents/SPARC#1474