Compare commits

..

1 Commits

Author SHA1 Message Date
agent-company 63ca18e9bf Add S3/MinIO storage backend tests for storage.py
21 test cases covering:
- S3StorageBackend: read, write, exists, path_for with mocked boto3
- Error handling: NoSuchKey exception, generic 404, non-404 re-raise
- Bucket auto-creation on init and graceful handling of creation failure
- Constructor credential/endpoint passthrough
- LocalStorageBackend: round-trip read/write, missing file, empty file
- get_storage_backend() factory: local/s3 selection, case-insensitivity

Closes leeworks-agents/SPARC#1660

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-20 19:20:06 +00:00
2 changed files with 339 additions and 118 deletions
+76 -118
View File
@@ -7,131 +7,86 @@ Semiconductor Patent & Analytics Report Core -- development priorities.
SPARC is a patent analysis platform with a working end-to-end pipeline: SPARC is a patent analysis platform with a working end-to-end pipeline:
Python/FastAPI backend, React/TypeScript frontend, PostgreSQL for persistence Python/FastAPI backend, React/TypeScript frontend, PostgreSQL for persistence
and caching, Docker Compose for local development, and Gitea Actions CI/CD for and caching, Docker Compose for local development, and Gitea Actions CI/CD for
image builds and testing. Core features include patent retrieval via SerpAPI, image builds. Core features (patent retrieval via SerpAPI, PDF parsing, LLM
PDF parsing, LLM analysis via OpenRouter (multi-model: Claude, GPT-4o, Gemini, analysis via OpenRouter/Claude, batch processing, JWT authentication, analytics
Llama), batch processing, JWT authentication, analytics dashboard with patent dashboard) are all implemented and functional.
trend charts, scheduled recurring analysis with alerting, webhook notifications
(Slack/Discord), CSV and PDF export, S3/MinIO storage, side-by-side company
comparison, and dark mode.
---
## Completed
Items that have been implemented and merged into main.
### Security hardening
- ~~Rotate default JWT secret.~~ Startup check refuses to start with the
default secret in non-development environments.
- ~~CORS allow-origins are hardcoded.~~ Allowed origins are now configurable
via environment variable.
- ~~Database credentials in docker-compose.yml.~~ Compose references `.env`
for sensitive values.
### Error handling and resilience
- ~~`get_db_client()` creates a new `DatabaseClient` on every call.~~ Refactored
to a shared pooled singleton initialized at startup.
- ~~No rate limiting on auth endpoints.~~ Rate limiting middleware added to
`/auth/login` and `/auth/register`.
### Test coverage
- ~~API tests bypass authentication.~~ JWT auth integration tests added (33
cases covering registration, login, protected routes, token refresh, and
admin-only endpoints).
- ~~No test stage in CI.~~ Gitea Actions workflow now runs `pytest` and gates
the build.
- ~~No linting or type checking in CI.~~ `ruff` (Python) and `tsc --noEmit`
(TypeScript) added to CI pipeline.
### Backend
- ~~Add structured logging.~~ Python `logging` module used throughout.
- ~~Make LLM model configurable.~~ `MODEL` environment variable accepted;
multi-model support with per-analysis selection (GPT-4o, Gemini, Claude,
Llama).
- ~~SERP cache TTL hardcoded.~~ `SERP_CACHE_TTL_HOURS` exposed as env var.
- ~~Patent PDF storage.~~ S3/MinIO object storage backend added alongside
local filesystem. Volume mount requirement documented.
- ~~`analyze_single_patent` assumes local file.~~ Auto-download from cached
metadata link integrated.
- ~~`Patent.patent_id` typed as `int`.~~ Fixed to `str`.
### Frontend
- ~~No loading/error states.~~ Skeleton loaders and error states added to
Batch and Analytics pages.
- ~~No dark mode.~~ Full dark mode support with theme-aware chart colors.
- ~~Missing lockfile.~~ `package-lock.json` committed.
### Features (formerly P3)
- ~~Export analysis reports.~~ CSV and PDF export endpoints implemented.
- ~~Comparison view.~~ Side-by-side company patent portfolio comparison added.
- ~~Scheduled/recurring analysis.~~ APScheduler-based periodic re-analysis
with configurable interval and change-threshold alerting.
- ~~Webhook/notification support.~~ Slack, Discord, and generic HTTP POST
webhooks with retry logic.
- ~~Multi-model support.~~ Model picker in Analysis and Batch pages; backend
allow-list validation.
- ~~Patent trend charts.~~ Filing frequency and category distribution
visualizations added to Analytics page.
- ~~OpenAPI client generation.~~ TypeScript API client auto-generated from
FastAPI spec with CI freshness check.
--- ---
## P1 -- High Priority ## P1 -- High Priority
These items address correctness, reliability, and coverage gaps that should be These items address correctness, security, and reliability gaps that should be
resolved before broader production use. resolved before broader production use.
### Resilience ### Security hardening
- **`_jobs` dict is in-memory only.** Job state is lost on API restart. - **Rotate default JWT secret.** `auth.py` ships a fallback
Persist job status in PostgreSQL or Redis so async batch results survive `sparc-secret-key-change-in-production` that will be used if `JWT_SECRET` is
restarts. unset. Add a startup check that refuses to start with the default secret in
non-development environments.
- **CORS allow-origins are hardcoded.** `api.py` only permits
`localhost:3000` and `localhost:5173`. Make the allowed origins configurable
via environment variable so the dashboard works when deployed behind a real
domain.
- **Database credentials in docker-compose.yml.** The compose file embeds
`postgres:postgres` in plain text. Reference a `.env` file or Docker secrets
instead.
### Test coverage gaps ### Error handling and resilience
- **Export endpoint tests.** The CSV and PDF export endpoints (`/export/`) - **`get_db_client()` in `auth.py` creates a new `DatabaseClient` on every
lack test coverage. Add tests covering auth, success, 404, and edge cases. call.** This bypasses the connection pool and can exhaust database
*(Issue #1655)* connections under load. Refactor to share a single pooled client.
- **Tracked company admin endpoint tests.** The `/admin/tracked` CRUD - **`_jobs` dict is in-memory only.** Job state is lost on API restart. Persist
endpoints and scheduler integration lack test coverage. *(Issue #1656)* job status in PostgreSQL or Redis so async batch results survive restarts.
- **No rate limiting on auth endpoints.** `/auth/login` and `/auth/register`
are unprotected against brute-force or abuse. Add rate limiting middleware.
### Test coverage for auth and admin
- The existing API tests (`tests/test_api.py`) bypass authentication entirely.
Add tests that exercise the JWT flow: registration, login, protected-route
access, token refresh, and admin-only endpoints.
--- ---
## P2 -- Medium Priority ## P2 -- Medium Priority
Improvements to reliability, test coverage, and code quality. Improvements to usability, performance, and developer experience.
### Test coverage ### Backend
- **Webhook integration tests.** The retry logic, Slack/Discord payload - **Add structured logging.** Replace `print()` calls throughout `analyzer.py`,
format, and multi-URL dispatch in `webhooks.py` need test coverage. `serp_api.py`, and `llm.py` with Python `logging` so log levels and
*(Issue #1657)* formatting are consistent.
- **S3/MinIO storage backend tests.** `storage.py` has local filesystem tests - **Make LLM model configurable.** `llm.py` hardcodes
but no unit tests for the S3 backend (read, write, exists, delete, `anthropic/claude-3.5-sonnet`. Accept a `MODEL` environment variable to allow
error handling). *(Issue #1660)* switching models without code changes.
- **`analyze_single_patent` auto-download path tests.** The auto-download - **SERP cache TTL is hardcoded to 24 hours.** Expose `SERP_CACHE_TTL_HOURS`
fallback (cache lookup, PDF download, FileNotFoundError) in as an environment variable in `config.py`.
`analyzer.py` lacks test coverage. *(Issue #1661)* - **Patent PDF storage.** PDFs are saved to a local `patents/` directory. For
containerized deployments, consider object storage (S3/MinIO) or at minimum
document the volume mount requirement more prominently.
- **`analyze_single_patent` assumes local file path.** The method constructs
`patents/{patent_id}.pdf` and reads from disk, but does not download the PDF
first. Either integrate the download step or document the prerequisite.
- **`Patent.patent_id` typed as `int` in `types.py` but used as `str`
everywhere.** Fix the type annotation to `str`.
### Code quality ### Frontend
- **Scheduler creates its own DatabaseClient.** `scheduler.py` bypasses the - **No loading/error states on several pages.** The Batch and Analytics pages
application-level pooled client, creating a new connection on every tick. would benefit from skeleton loaders and user-friendly error messages.
Refactor to use `get_db_client()`. *(Issue #1658)* - **No dark mode.** Tailwind is configured but no dark variant is applied.
- **Missing `package-lock.json` or `pnpm-lock.yaml`.** The frontend has no
lockfile committed, leading to non-reproducible builds.
### API improvements ### CI/CD
- **API pagination.** The `/analyze/batch` and `/jobs` endpoints could benefit - **No test stage in the Gitea Actions workflow.** `build.yaml` builds and
from cursor-based pagination for large result sets. pushes images but never runs `pytest`. Add a test job that gates the build.
- **Request validation improvements.** Add stricter input validation for - **No linting or type checking.** Add `ruff` (Python) and `tsc --noEmit`
company names (disallow special characters, enforce length limits). (TypeScript) to CI.
--- ---
@@ -139,20 +94,23 @@ Improvements to reliability, test coverage, and code quality.
Lower-urgency enhancements and future features. Lower-urgency enhancements and future features.
- **Historical analysis diffing.** Show what changed between two analysis runs - **Export analysis reports.** Allow users to download analysis results as PDF
for the same company, highlighting new patents and score shifts. or CSV from the dashboard.
- **Patent classification tagging.** Automatically tag patents by technology - **Comparison view.** Side-by-side comparison of two companies' patent
domain (AI, semiconductors, materials science) using LLM classification. portfolios.
- **User-level API keys.** Allow users to generate personal API keys for - **Scheduled/recurring analysis.** Periodically re-analyze tracked companies
programmatic access without JWT token refresh. and alert on significant changes.
- **Batch export.** Export analysis results for multiple companies at once as - **Webhook/notification support.** Send alerts (Slack, Discord, email) when
a ZIP archive. batch jobs complete or when a company's innovation score changes
- **Rate limiting dashboard.** Surface rate limit status and usage statistics significantly.
in the admin panel. - **Multi-model support.** Let users choose between LLM providers per analysis
- **Async webhook delivery.** Move webhook delivery to a background task queue (e.g., GPT-4o, Gemini, Claude) and compare outputs.
(e.g., Celery, arq) to avoid blocking the scheduler. - **Patent trend charts.** Visualize patent filing frequency and technology
- **Multi-tenant support.** Scope analysis results and tracked companies per category distribution over time in the Analytics page.
user or organization. - **API pagination.** The `/analyze/batch` and `/jobs` endpoints could benefit
from cursor-based pagination for large result sets.
- **OpenAPI client generation.** Auto-generate the TypeScript API client from
the FastAPI OpenAPI spec to keep frontend types in sync.
--- ---
+263
View File
@@ -0,0 +1,263 @@
"""Tests for S3/MinIO storage backend in storage.py.
Covers issue #1660:
- S3StorageBackend read, write, exists, path_for
- Error handling: NoSuchKey, generic S3 errors, bucket auto-creation
- get_storage_backend() factory function
- LocalStorageBackend (basic sanity checks)
"""
from unittest.mock import MagicMock, patch
import pytest
from SPARC.storage import LocalStorageBackend, S3StorageBackend, get_storage_backend
# ---------- S3StorageBackend ----------
class TestS3StorageBackend:
"""Tests for the S3-compatible storage backend."""
@pytest.fixture
def s3_backend(self):
"""Create an S3StorageBackend with a fully mocked boto3 client."""
with patch.dict("sys.modules", {"boto3": MagicMock()}):
import boto3 as mock_boto
mock_s3 = MagicMock()
mock_boto.client.return_value = mock_s3
mock_s3.head_bucket.return_value = {}
backend = S3StorageBackend(
bucket="test-bucket",
endpoint_url="http://minio:9000",
access_key="minioadmin",
secret_key="minioadmin",
)
# Expose mock for assertions
backend._mock_s3 = mock_s3
yield backend
def test_write_puts_object(self, s3_backend):
"""write() calls put_object with correct bucket, key, and body."""
s3_backend.write("US-12345678-B2.pdf", b"PDF content here")
s3_backend._mock_s3.put_object.assert_called_once_with(
Bucket="test-bucket",
Key="US-12345678-B2.pdf",
Body=b"PDF content here",
ContentType="application/pdf",
)
def test_read_returns_body(self, s3_backend):
"""read() returns the Body content from get_object."""
mock_body = MagicMock()
mock_body.read.return_value = b"PDF data"
s3_backend._mock_s3.get_object.return_value = {"Body": mock_body}
result = s3_backend.read("US-12345678-B2.pdf")
assert result == b"PDF data"
s3_backend._mock_s3.get_object.assert_called_once_with(
Bucket="test-bucket",
Key="US-12345678-B2.pdf",
)
def test_read_nosuchkey_raises_file_not_found(self, s3_backend):
"""read() raises FileNotFoundError when object does not exist."""
# Create a NoSuchKey exception class on the mock
nosuchkey = type("NoSuchKey", (Exception,), {})
s3_backend._mock_s3.exceptions.NoSuchKey = nosuchkey
s3_backend._mock_s3.get_object.side_effect = nosuchkey("not found")
# Reassign s3 to trigger the except branch
s3_backend.s3 = s3_backend._mock_s3
with pytest.raises(FileNotFoundError, match="S3 object not found"):
s3_backend.read("missing.pdf")
def test_read_generic_404_raises_file_not_found(self, s3_backend):
"""read() handles generic 404 errors from S3-compatible APIs."""
nosuchkey = type("NoSuchKey", (Exception,), {})
s3_backend._mock_s3.exceptions.NoSuchKey = nosuchkey
s3_backend.s3 = s3_backend._mock_s3
s3_backend.s3.get_object.side_effect = Exception("An error occurred (404)")
with pytest.raises(FileNotFoundError, match="S3 object not found"):
s3_backend.read("missing.pdf")
def test_read_other_error_re_raises(self, s3_backend):
"""read() re-raises non-404 errors."""
nosuchkey = type("NoSuchKey", (Exception,), {})
s3_backend._mock_s3.exceptions.NoSuchKey = nosuchkey
s3_backend.s3 = s3_backend._mock_s3
s3_backend.s3.get_object.side_effect = Exception("Internal server error")
with pytest.raises(Exception, match="Internal server error"):
s3_backend.read("some-file.pdf")
def test_exists_returns_true_for_existing_object(self, s3_backend):
"""exists() returns True when head_object succeeds with content."""
s3_backend._mock_s3.head_object.return_value = {"ContentLength": 1024}
assert s3_backend.exists("US-12345678-B2.pdf") is True
def test_exists_returns_false_for_missing_object(self, s3_backend):
"""exists() returns False when head_object raises an exception."""
s3_backend._mock_s3.head_object.side_effect = Exception("Not Found")
assert s3_backend.exists("missing.pdf") is False
def test_exists_returns_false_for_zero_length(self, s3_backend):
"""exists() returns False when object has zero content length."""
s3_backend._mock_s3.head_object.return_value = {"ContentLength": 0}
assert s3_backend.exists("empty.pdf") is False
def test_path_for_returns_s3_uri(self, s3_backend):
"""path_for() returns an s3:// URI."""
path = s3_backend.path_for("US-12345678-B2.pdf")
assert path == "s3://test-bucket/US-12345678-B2.pdf"
def test_constructor_creates_bucket_if_missing(self):
"""Constructor creates the bucket if head_bucket fails."""
with patch.dict("sys.modules", {"boto3": MagicMock()}):
import boto3 as mock_boto
mock_s3 = MagicMock()
mock_boto.client.return_value = mock_s3
mock_s3.head_bucket.side_effect = Exception("Bucket not found")
S3StorageBackend(
bucket="new-bucket",
endpoint_url="http://minio:9000",
access_key="admin",
secret_key="admin",
)
mock_s3.create_bucket.assert_called_once_with(Bucket="new-bucket")
def test_constructor_handles_bucket_creation_failure(self):
"""Constructor logs warning but does not crash if bucket creation fails."""
with patch.dict("sys.modules", {"boto3": MagicMock()}):
import boto3 as mock_boto
mock_s3 = MagicMock()
mock_boto.client.return_value = mock_s3
mock_s3.head_bucket.side_effect = Exception("Bucket not found")
mock_s3.create_bucket.side_effect = Exception("Permission denied")
# Should not raise
backend = S3StorageBackend(
bucket="locked-bucket",
endpoint_url="http://minio:9000",
access_key="admin",
secret_key="admin",
)
assert backend.bucket == "locked-bucket"
def test_constructor_passes_endpoint_and_credentials(self):
"""Constructor passes endpoint_url and credentials to boto3.client."""
with patch.dict("sys.modules", {"boto3": MagicMock()}):
import boto3 as mock_boto
mock_s3 = MagicMock()
mock_boto.client.return_value = mock_s3
S3StorageBackend(
bucket="test",
endpoint_url="http://minio:9000",
access_key="mykey",
secret_key="mysecret",
)
mock_boto.client.assert_called_with(
"s3",
endpoint_url="http://minio:9000",
aws_access_key_id="mykey",
aws_secret_access_key="mysecret",
)
# ---------- LocalStorageBackend ----------
class TestLocalStorageBackend:
"""Basic sanity checks for the local filesystem backend."""
def test_write_and_read(self, tmp_path):
"""Write and read round-trip produces identical content."""
backend = LocalStorageBackend(base_dir=str(tmp_path))
backend.write("test.pdf", b"hello world")
result = backend.read("test.pdf")
assert result == b"hello world"
def test_read_missing_file_raises(self, tmp_path):
"""Reading a non-existent file raises FileNotFoundError."""
backend = LocalStorageBackend(base_dir=str(tmp_path))
with pytest.raises(FileNotFoundError):
backend.read("nonexistent.pdf")
def test_exists_true_for_written_file(self, tmp_path):
"""exists() returns True after writing a file."""
backend = LocalStorageBackend(base_dir=str(tmp_path))
backend.write("test.pdf", b"data")
assert backend.exists("test.pdf") is True
def test_exists_false_for_missing_file(self, tmp_path):
"""exists() returns False for non-existent file."""
backend = LocalStorageBackend(base_dir=str(tmp_path))
assert backend.exists("missing.pdf") is False
def test_exists_false_for_empty_file(self, tmp_path):
"""exists() returns False for zero-length file."""
backend = LocalStorageBackend(base_dir=str(tmp_path))
backend.write("empty.pdf", b"")
assert backend.exists("empty.pdf") is False
def test_path_for_returns_full_path(self, tmp_path):
"""path_for() returns the full filesystem path."""
backend = LocalStorageBackend(base_dir=str(tmp_path))
path = backend.path_for("test.pdf")
assert path == str(tmp_path / "test.pdf")
# ---------- get_storage_backend() factory ----------
class TestGetStorageBackend:
"""Tests for the storage backend factory function."""
@patch("SPARC.storage.config")
def test_returns_local_backend_by_default(self, mock_config):
"""Default config returns LocalStorageBackend."""
mock_config.storage_backend = "local"
backend = get_storage_backend()
assert isinstance(backend, LocalStorageBackend)
@patch("SPARC.storage.config")
def test_returns_s3_backend_when_configured(self, mock_config):
"""Setting storage_backend=s3 returns S3StorageBackend."""
mock_config.storage_backend = "s3"
mock_config.s3_bucket = "test-bucket"
mock_config.s3_endpoint_url = "http://minio:9000"
mock_config.s3_access_key = "key"
mock_config.s3_secret_key = "secret"
with patch.dict("sys.modules", {"boto3": MagicMock()}):
backend = get_storage_backend()
assert isinstance(backend, S3StorageBackend)
@patch("SPARC.storage.config")
def test_case_insensitive_backend_selection(self, mock_config):
"""Backend selection is case-insensitive."""
mock_config.storage_backend = "LOCAL"
backend = get_storage_backend()
assert isinstance(backend, LocalStorageBackend)