Compare commits

..

1 Commits

Author SHA1 Message Date
agent-company 4cb1a6ed21 Update ROADMAP.md to reflect completed work and add next-horizon items
Move all completed items (security hardening, structured logging, dark mode,
export, webhooks, scheduled analysis, multi-model, trend charts, CI, etc.)
into a new Completed section. Reorganize remaining P1/P2/P3 items to reflect
current priorities. Add new next-horizon items: historical diffing, patent
classification tagging, user API keys, batch export, and multi-tenant support.

Closes leeworks-agents/SPARC#1659

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-20 19:18:22 +00:00
2 changed files with 125 additions and 88 deletions
+118 -76
View File
@@ -7,86 +7,131 @@ Semiconductor Patent & Analytics Report Core -- development priorities.
SPARC is a patent analysis platform with a working end-to-end pipeline:
Python/FastAPI backend, React/TypeScript frontend, PostgreSQL for persistence
and caching, Docker Compose for local development, and Gitea Actions CI/CD for
image builds. Core features (patent retrieval via SerpAPI, PDF parsing, LLM
analysis via OpenRouter/Claude, batch processing, JWT authentication, analytics
dashboard) are all implemented and functional.
image builds and testing. Core features include patent retrieval via SerpAPI,
PDF parsing, LLM analysis via OpenRouter (multi-model: Claude, GPT-4o, Gemini,
Llama), batch processing, JWT authentication, analytics dashboard with patent
trend charts, scheduled recurring analysis with alerting, webhook notifications
(Slack/Discord), CSV and PDF export, S3/MinIO storage, side-by-side company
comparison, and dark mode.
---
## Completed
Items that have been implemented and merged into main.
### Security hardening
- ~~Rotate default JWT secret.~~ Startup check refuses to start with the
default secret in non-development environments.
- ~~CORS allow-origins are hardcoded.~~ Allowed origins are now configurable
via environment variable.
- ~~Database credentials in docker-compose.yml.~~ Compose references `.env`
for sensitive values.
### Error handling and resilience
- ~~`get_db_client()` creates a new `DatabaseClient` on every call.~~ Refactored
to a shared pooled singleton initialized at startup.
- ~~No rate limiting on auth endpoints.~~ Rate limiting middleware added to
`/auth/login` and `/auth/register`.
### Test coverage
- ~~API tests bypass authentication.~~ JWT auth integration tests added (33
cases covering registration, login, protected routes, token refresh, and
admin-only endpoints).
- ~~No test stage in CI.~~ Gitea Actions workflow now runs `pytest` and gates
the build.
- ~~No linting or type checking in CI.~~ `ruff` (Python) and `tsc --noEmit`
(TypeScript) added to CI pipeline.
### Backend
- ~~Add structured logging.~~ Python `logging` module used throughout.
- ~~Make LLM model configurable.~~ `MODEL` environment variable accepted;
multi-model support with per-analysis selection (GPT-4o, Gemini, Claude,
Llama).
- ~~SERP cache TTL hardcoded.~~ `SERP_CACHE_TTL_HOURS` exposed as env var.
- ~~Patent PDF storage.~~ S3/MinIO object storage backend added alongside
local filesystem. Volume mount requirement documented.
- ~~`analyze_single_patent` assumes local file.~~ Auto-download from cached
metadata link integrated.
- ~~`Patent.patent_id` typed as `int`.~~ Fixed to `str`.
### Frontend
- ~~No loading/error states.~~ Skeleton loaders and error states added to
Batch and Analytics pages.
- ~~No dark mode.~~ Full dark mode support with theme-aware chart colors.
- ~~Missing lockfile.~~ `package-lock.json` committed.
### Features (formerly P3)
- ~~Export analysis reports.~~ CSV and PDF export endpoints implemented.
- ~~Comparison view.~~ Side-by-side company patent portfolio comparison added.
- ~~Scheduled/recurring analysis.~~ APScheduler-based periodic re-analysis
with configurable interval and change-threshold alerting.
- ~~Webhook/notification support.~~ Slack, Discord, and generic HTTP POST
webhooks with retry logic.
- ~~Multi-model support.~~ Model picker in Analysis and Batch pages; backend
allow-list validation.
- ~~Patent trend charts.~~ Filing frequency and category distribution
visualizations added to Analytics page.
- ~~OpenAPI client generation.~~ TypeScript API client auto-generated from
FastAPI spec with CI freshness check.
---
## P1 -- High Priority
These items address correctness, security, and reliability gaps that should be
These items address correctness, reliability, and coverage gaps that should be
resolved before broader production use.
### Security hardening
### Resilience
- **Rotate default JWT secret.** `auth.py` ships a fallback
`sparc-secret-key-change-in-production` that will be used if `JWT_SECRET` is
unset. Add a startup check that refuses to start with the default secret in
non-development environments.
- **CORS allow-origins are hardcoded.** `api.py` only permits
`localhost:3000` and `localhost:5173`. Make the allowed origins configurable
via environment variable so the dashboard works when deployed behind a real
domain.
- **Database credentials in docker-compose.yml.** The compose file embeds
`postgres:postgres` in plain text. Reference a `.env` file or Docker secrets
instead.
- **`_jobs` dict is in-memory only.** Job state is lost on API restart.
Persist job status in PostgreSQL or Redis so async batch results survive
restarts.
### Error handling and resilience
### Test coverage gaps
- **`get_db_client()` in `auth.py` creates a new `DatabaseClient` on every
call.** This bypasses the connection pool and can exhaust database
connections under load. Refactor to share a single pooled client.
- **`_jobs` dict is in-memory only.** Job state is lost on API restart. Persist
job status in PostgreSQL or Redis so async batch results survive restarts.
- **No rate limiting on auth endpoints.** `/auth/login` and `/auth/register`
are unprotected against brute-force or abuse. Add rate limiting middleware.
### Test coverage for auth and admin
- The existing API tests (`tests/test_api.py`) bypass authentication entirely.
Add tests that exercise the JWT flow: registration, login, protected-route
access, token refresh, and admin-only endpoints.
- **Export endpoint tests.** The CSV and PDF export endpoints (`/export/`)
lack test coverage. Add tests covering auth, success, 404, and edge cases.
*(Issue #1655)*
- **Tracked company admin endpoint tests.** The `/admin/tracked` CRUD
endpoints and scheduler integration lack test coverage. *(Issue #1656)*
---
## P2 -- Medium Priority
Improvements to usability, performance, and developer experience.
Improvements to reliability, test coverage, and code quality.
### Backend
### Test coverage
- **Add structured logging.** Replace `print()` calls throughout `analyzer.py`,
`serp_api.py`, and `llm.py` with Python `logging` so log levels and
formatting are consistent.
- **Make LLM model configurable.** `llm.py` hardcodes
`anthropic/claude-3.5-sonnet`. Accept a `MODEL` environment variable to allow
switching models without code changes.
- **SERP cache TTL is hardcoded to 24 hours.** Expose `SERP_CACHE_TTL_HOURS`
as an environment variable in `config.py`.
- **Patent PDF storage.** PDFs are saved to a local `patents/` directory. For
containerized deployments, consider object storage (S3/MinIO) or at minimum
document the volume mount requirement more prominently.
- **`analyze_single_patent` assumes local file path.** The method constructs
`patents/{patent_id}.pdf` and reads from disk, but does not download the PDF
first. Either integrate the download step or document the prerequisite.
- **`Patent.patent_id` typed as `int` in `types.py` but used as `str`
everywhere.** Fix the type annotation to `str`.
- **Webhook integration tests.** The retry logic, Slack/Discord payload
format, and multi-URL dispatch in `webhooks.py` need test coverage.
*(Issue #1657)*
- **S3/MinIO storage backend tests.** `storage.py` has local filesystem tests
but no unit tests for the S3 backend (read, write, exists, delete,
error handling). *(Issue #1660)*
- **`analyze_single_patent` auto-download path tests.** The auto-download
fallback (cache lookup, PDF download, FileNotFoundError) in
`analyzer.py` lacks test coverage. *(Issue #1661)*
### Frontend
### Code quality
- **No loading/error states on several pages.** The Batch and Analytics pages
would benefit from skeleton loaders and user-friendly error messages.
- **No dark mode.** Tailwind is configured but no dark variant is applied.
- **Missing `package-lock.json` or `pnpm-lock.yaml`.** The frontend has no
lockfile committed, leading to non-reproducible builds.
- **Scheduler creates its own DatabaseClient.** `scheduler.py` bypasses the
application-level pooled client, creating a new connection on every tick.
Refactor to use `get_db_client()`. *(Issue #1658)*
### CI/CD
### API improvements
- **No test stage in the Gitea Actions workflow.** `build.yaml` builds and
pushes images but never runs `pytest`. Add a test job that gates the build.
- **No linting or type checking.** Add `ruff` (Python) and `tsc --noEmit`
(TypeScript) to CI.
- **API pagination.** The `/analyze/batch` and `/jobs` endpoints could benefit
from cursor-based pagination for large result sets.
- **Request validation improvements.** Add stricter input validation for
company names (disallow special characters, enforce length limits).
---
@@ -94,23 +139,20 @@ Improvements to usability, performance, and developer experience.
Lower-urgency enhancements and future features.
- **Export analysis reports.** Allow users to download analysis results as PDF
or CSV from the dashboard.
- **Comparison view.** Side-by-side comparison of two companies' patent
portfolios.
- **Scheduled/recurring analysis.** Periodically re-analyze tracked companies
and alert on significant changes.
- **Webhook/notification support.** Send alerts (Slack, Discord, email) when
batch jobs complete or when a company's innovation score changes
significantly.
- **Multi-model support.** Let users choose between LLM providers per analysis
(e.g., GPT-4o, Gemini, Claude) and compare outputs.
- **Patent trend charts.** Visualize patent filing frequency and technology
category distribution over time in the Analytics page.
- **API pagination.** The `/analyze/batch` and `/jobs` endpoints could benefit
from cursor-based pagination for large result sets.
- **OpenAPI client generation.** Auto-generate the TypeScript API client from
the FastAPI OpenAPI spec to keep frontend types in sync.
- **Historical analysis diffing.** Show what changed between two analysis runs
for the same company, highlighting new patents and score shifts.
- **Patent classification tagging.** Automatically tag patents by technology
domain (AI, semiconductors, materials science) using LLM classification.
- **User-level API keys.** Allow users to generate personal API keys for
programmatic access without JWT token refresh.
- **Batch export.** Export analysis results for multiple companies at once as
a ZIP archive.
- **Rate limiting dashboard.** Surface rate limit status and usage statistics
in the admin panel.
- **Async webhook delivery.** Move webhook delivery to a background task queue
(e.g., Celery, arq) to avoid blocking the scheduler.
- **Multi-tenant support.** Scope analysis results and tracked companies per
user or organization.
---
+7 -12
View File
@@ -2,17 +2,14 @@
Uses APScheduler to periodically re-analyze tracked companies and
detect significant changes in patent counts.
The scheduler reuses the application-level pooled DatabaseClient
(from ``SPARC.auth``) instead of creating its own connection, which
avoids exhausting the database connection pool under load.
"""
import logging
import os
from SPARC import config
from SPARC.analyzer import CompanyAnalyzer
from SPARC.auth import get_db_client
from SPARC.database import DatabaseClient
logger = logging.getLogger(__name__)
@@ -24,13 +21,10 @@ CHANGE_THRESHOLD_PERCENT = int(os.getenv("CHANGE_THRESHOLD_PERCENT", "20"))
def run_scheduled_analysis() -> None:
"""Re-analyze all tracked companies and check for significant changes.
Uses the shared pooled DatabaseClient from ``SPARC.auth.get_db_client()``
rather than creating a disposable connection, so the scheduler participates
in the same connection pool as the rest of the application.
"""
db = get_db_client()
"""Re-analyze all tracked companies and check for significant changes."""
db = DatabaseClient(config.database_url)
db.connect()
db.initialize_schema()
tracked = db.list_tracked_companies()
if not tracked:
@@ -80,6 +74,7 @@ def run_scheduled_analysis() -> None:
except Exception as e:
logger.error("Error analyzing tracked company %s: %s", name, e)
db.close()
logger.info("Scheduled analysis complete")