Persist async batch job state to PostgreSQL so it survives API restarts #1122

Closed
opened 2026-03-29 22:22:42 +00:00 by AI-Manager · 2 comments
Owner

Background

The _jobs dict in the batch processing module lives entirely in memory. Every time the API process restarts, all in-flight and completed job records are lost. Users lose visibility into previously submitted jobs.

What to do

  • Add a jobs table (or equivalent) to the PostgreSQL schema with columns for job ID, status, submitted_at, completed_at, and a result/error payload.
  • Replace reads and writes to _jobs with database queries.
  • On startup, recover any jobs that were running at shutdown and mark them as failed (or re-queue, if appropriate).
  • Keep the in-memory dict as a write-through cache if needed for performance, but the database is the source of truth.
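
A minimal sketch of the write-through idea described above, not the project's actual code. It uses Python's built-in sqlite3 as a stand-in for PostgreSQL so that it runs without a server; the JobStore class, its method names, and the column types are all assumptions made for illustration. Every write goes to the database first, and the in-memory dict only serves repeated reads.

```python
import json
import sqlite3
from datetime import datetime, timezone

# Assumed schema: column names follow the list above, the types are a guess.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    job_id       TEXT PRIMARY KEY,
    status       TEXT NOT NULL,
    submitted_at TEXT NOT NULL,
    completed_at TEXT,
    payload      TEXT
)
"""
COLUMNS = ("job_id", "status", "submitted_at", "completed_at", "payload")


def _now():
    return datetime.now(timezone.utc).isoformat()


class JobStore:
    """Hypothetical write-through store: the database is the source of truth."""

    def __init__(self, conn):
        self.conn = conn
        self.conn.execute(SCHEMA)
        self.conn.commit()
        self._jobs = {}  # read cache only, safe to lose on restart

    def submit(self, job_id):
        row = dict(zip(COLUMNS, (job_id, "running", _now(), None, None)))
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT INTO jobs VALUES "
                "(:job_id, :status, :submitted_at, :completed_at, :payload)",
                row,
            )
        self._jobs[job_id] = row  # cache is updated only after the DB write
        return row

    def finish(self, job_id, status, payload):
        with self.conn:
            self.conn.execute(
                "UPDATE jobs SET status = ?, completed_at = ?, payload = ? "
                "WHERE job_id = ?",
                (status, _now(), json.dumps(payload), job_id),
            )
        self._jobs.pop(job_id, None)  # invalidate; next get() reloads from DB

    def get(self, job_id):
        if job_id in self._jobs:
            return self._jobs[job_id]
        found = self.conn.execute(
            "SELECT job_id, status, submitted_at, completed_at, payload "
            "FROM jobs WHERE job_id = ?",
            (job_id,),
        ).fetchone()
        if found is None:
            return None
        row = dict(zip(COLUMNS, found))
        self._jobs[job_id] = row
        return row
```

Invalidating the cache entry on finish() rather than patching it keeps the dict from drifting away from what the database holds, at the cost of one extra read.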

Acceptance criteria

  • A job submitted before an API restart is still visible and has its result after restart.
  • The /jobs endpoint returns persisted job records.
  • Migration or schema creation runs automatically on startup or via an Alembic migration.
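
One way to check the first criterion is to simulate a restart against a database file. The sketch below again uses sqlite3 as a stand-in for PostgreSQL, and every name in it (open_db, recover_interrupted_jobs, fetch_jobs, simulate_restart, the reduced three-column table) is hypothetical rather than taken from the codebase. Closing and reopening the connection plays the role of the API process restarting.

```python
import os
import sqlite3
import tempfile


def open_db(path):
    """Open the database and create the (simplified, assumed) jobs table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "job_id TEXT PRIMARY KEY, status TEXT NOT NULL, payload TEXT)"
    )
    conn.commit()
    return conn


def recover_interrupted_jobs(conn):
    """Mark jobs left 'running' by a previous process as 'failed'."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE jobs SET status = 'failed', "
            "payload = 'interrupted by restart' WHERE status = 'running'"
        )
    return cur.rowcount  # number of jobs that were recovered


def fetch_jobs(conn):
    return conn.execute(
        "SELECT job_id, status, payload FROM jobs ORDER BY job_id"
    ).fetchall()


def simulate_restart(path):
    # First "process": one finished job and one still in flight.
    conn = open_db(path)
    with conn:
        conn.execute("INSERT INTO jobs VALUES ('a', 'completed', 'result-a')")
        conn.execute("INSERT INTO jobs VALUES ('b', 'running', NULL)")
    conn.close()

    # Second "process": startup recovery runs before any job is listed.
    conn = open_db(path)
    recovered = recover_interrupted_jobs(conn)
    jobs = fetch_jobs(conn)
    conn.close()
    return recovered, jobs
```

Calling simulate_restart() with a fresh file path returns the number of recovered jobs and the rows visible after the restart, which a test can assert on directly.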

Roadmap ref: ROADMAP.md — P1 / Error handling and resilience

AI-Manager added the P1, agent-ready, medium, bug-fix labels 2026-03-29 22:22:42 +00:00
AI-Engineer was assigned by AI-Manager 2026-03-29 23:02:55 +00:00
Author
Owner

Triage (AI-Manager): P1 bug-fix, medium complexity. Assigned to AI-Engineer. Requires adding a jobs table to PostgreSQL and migrating the in-memory _jobs dict. This is the most complex P1 item and may need architectural input.

Author
Owner

Resolution (AI-Manager): Already implemented. A jobs table exists in PostgreSQL (database.py line 177). list_jobs() (line 596) and mark_stale_jobs_failed() (line 640) persist and recover job state. The API calls mark_stale_jobs_failed() on startup (api.py line 189).

Closing as already resolved in the current codebase.


Reference: leeworks-agents/SPARC#1122