diff --git a/ROADMAP.md b/ROADMAP.md index 42b571a..5b177d9 100644 --- a/ROADMAP.md +++ b/ROADMAP.md @@ -7,86 +7,131 @@ Semiconductor Patent & Analytics Report Core -- development priorities. SPARC is a patent analysis platform with a working end-to-end pipeline: Python/FastAPI backend, React/TypeScript frontend, PostgreSQL for persistence and caching, Docker Compose for local development, and Gitea Actions CI/CD for -image builds. Core features (patent retrieval via SerpAPI, PDF parsing, LLM -analysis via OpenRouter/Claude, batch processing, JWT authentication, analytics -dashboard) are all implemented and functional. +image builds and testing. Core features include patent retrieval via SerpAPI, +PDF parsing, LLM analysis via OpenRouter (multi-model: Claude, GPT-4o, Gemini, +Llama), batch processing, JWT authentication, analytics dashboard with patent +trend charts, scheduled recurring analysis with alerting, webhook notifications +(Slack/Discord), CSV and PDF export, S3/MinIO storage, side-by-side company +comparison, and dark mode. + +--- + +## Completed + +Items that have been implemented and merged into main. + +### Security hardening + +- ~~Rotate default JWT secret.~~ Startup check refuses to start with the + default secret in non-development environments. +- ~~CORS allow-origins are hardcoded.~~ Allowed origins are now configurable + via environment variable. +- ~~Database credentials in docker-compose.yml.~~ Compose references `.env` + for sensitive values. + +### Error handling and resilience + +- ~~`get_db_client()` creates a new `DatabaseClient` on every call.~~ Refactored + to a shared pooled singleton initialized at startup. +- ~~No rate limiting on auth endpoints.~~ Rate limiting middleware added to + `/auth/login` and `/auth/register`. + +### Test coverage + +- ~~API tests bypass authentication.~~ JWT auth integration tests added (33 + cases covering registration, login, protected routes, token refresh, and + admin-only endpoints). +- ~~No test stage in CI.~~ Gitea Actions workflow now runs `pytest` and gates + the build. +- ~~No linting or type checking in CI.~~ `ruff` (Python) and `tsc --noEmit` + (TypeScript) added to CI pipeline. + +### Backend + +- ~~Add structured logging.~~ Python `logging` module used throughout. +- ~~Make LLM model configurable.~~ `MODEL` environment variable accepted; + multi-model support with per-analysis selection (GPT-4o, Gemini, Claude, + Llama). +- ~~SERP cache TTL hardcoded.~~ `SERP_CACHE_TTL_HOURS` exposed as env var. +- ~~Patent PDF storage.~~ S3/MinIO object storage backend added alongside + local filesystem. Volume mount requirement documented. +- ~~`analyze_single_patent` assumes local file.~~ Auto-download from cached + metadata link integrated. +- ~~`Patent.patent_id` typed as `int`.~~ Fixed to `str`. + +### Frontend + +- ~~No loading/error states.~~ Skeleton loaders and error states added to + Batch and Analytics pages. +- ~~No dark mode.~~ Full dark mode support with theme-aware chart colors. +- ~~Missing lockfile.~~ `package-lock.json` committed. + +### Features (formerly P3) + +- ~~Export analysis reports.~~ CSV and PDF export endpoints implemented. +- ~~Comparison view.~~ Side-by-side company patent portfolio comparison added. +- ~~Scheduled/recurring analysis.~~ APScheduler-based periodic re-analysis + with configurable interval and change-threshold alerting. +- ~~Webhook/notification support.~~ Slack, Discord, and generic HTTP POST + webhooks with retry logic. +- ~~Multi-model support.~~ Model picker in Analysis and Batch pages; backend + allow-list validation. +- ~~Patent trend charts.~~ Filing frequency and category distribution + visualizations added to Analytics page. +- ~~OpenAPI client generation.~~ TypeScript API client auto-generated from + FastAPI spec with CI freshness check. --- ## P1 -- High Priority -These items address correctness, security, and reliability gaps that should be +These items address correctness, reliability, and coverage gaps that should be resolved before broader production use. -### Security hardening +### Resilience -- **Rotate default JWT secret.** `auth.py` ships a fallback - `sparc-secret-key-change-in-production` that will be used if `JWT_SECRET` is - unset. Add a startup check that refuses to start with the default secret in - non-development environments. -- **CORS allow-origins are hardcoded.** `api.py` only permits - `localhost:3000` and `localhost:5173`. Make the allowed origins configurable - via environment variable so the dashboard works when deployed behind a real - domain. -- **Database credentials in docker-compose.yml.** The compose file embeds - `postgres:postgres` in plain text. Reference a `.env` file or Docker secrets - instead. +- **`_jobs` dict is in-memory only.** Job state is lost on API restart. + Persist job status in PostgreSQL or Redis so async batch results survive + restarts. -### Error handling and resilience +### Test coverage gaps -- **`get_db_client()` in `auth.py` creates a new `DatabaseClient` on every - call.** This bypasses the connection pool and can exhaust database - connections under load. Refactor to share a single pooled client. -- **`_jobs` dict is in-memory only.** Job state is lost on API restart. Persist - job status in PostgreSQL or Redis so async batch results survive restarts. -- **No rate limiting on auth endpoints.** `/auth/login` and `/auth/register` - are unprotected against brute-force or abuse. Add rate limiting middleware. - -### Test coverage for auth and admin - -- The existing API tests (`tests/test_api.py`) bypass authentication entirely. - Add tests that exercise the JWT flow: registration, login, protected-route - access, token refresh, and admin-only endpoints. +- **Export endpoint tests.** The CSV and PDF export endpoints (`/export/`) + lack test coverage. Add tests covering auth, success, 404, and edge cases. + *(Issue #1655)* +- **Tracked company admin endpoint tests.** The `/admin/tracked` CRUD + endpoints and scheduler integration lack test coverage. *(Issue #1656)* --- ## P2 -- Medium Priority -Improvements to usability, performance, and developer experience. +Improvements to reliability, test coverage, and code quality. -### Backend +### Test coverage -- **Add structured logging.** Replace `print()` calls throughout `analyzer.py`, - `serp_api.py`, and `llm.py` with Python `logging` so log levels and - formatting are consistent. -- **Make LLM model configurable.** `llm.py` hardcodes - `anthropic/claude-3.5-sonnet`. Accept a `MODEL` environment variable to allow - switching models without code changes. -- **SERP cache TTL is hardcoded to 24 hours.** Expose `SERP_CACHE_TTL_HOURS` - as an environment variable in `config.py`. -- **Patent PDF storage.** PDFs are saved to a local `patents/` directory. For - containerized deployments, consider object storage (S3/MinIO) or at minimum - document the volume mount requirement more prominently. -- **`analyze_single_patent` assumes local file path.** The method constructs - `patents/{patent_id}.pdf` and reads from disk, but does not download the PDF - first. Either integrate the download step or document the prerequisite. -- **`Patent.patent_id` typed as `int` in `types.py` but used as `str` - everywhere.** Fix the type annotation to `str`. +- **Webhook integration tests.** The retry logic, Slack/Discord payload + format, and multi-URL dispatch in `webhooks.py` need test coverage. + *(Issue #1657)* +- **S3/MinIO storage backend tests.** `storage.py` has local filesystem tests + but no unit tests for the S3 backend (read, write, exists, delete, + error handling). *(Issue #1660)* +- **`analyze_single_patent` auto-download path tests.** The auto-download + fallback (cache lookup, PDF download, FileNotFoundError) in + `analyzer.py` lacks test coverage. *(Issue #1661)* -### Frontend +### Code quality -- **No loading/error states on several pages.** The Batch and Analytics pages - would benefit from skeleton loaders and user-friendly error messages. -- **No dark mode.** Tailwind is configured but no dark variant is applied. -- **Missing `package-lock.json` or `pnpm-lock.yaml`.** The frontend has no - lockfile committed, leading to non-reproducible builds. +- **Scheduler creates its own DatabaseClient.** `scheduler.py` bypasses the + application-level pooled client, creating a new connection on every tick. + Refactor to use `get_db_client()`. *(Issue #1658)* -### CI/CD +### API improvements -- **No test stage in the Gitea Actions workflow.** `build.yaml` builds and - pushes images but never runs `pytest`. Add a test job that gates the build. -- **No linting or type checking.** Add `ruff` (Python) and `tsc --noEmit` - (TypeScript) to CI. +- **API pagination.** The `/analyze/batch` and `/jobs` endpoints could benefit + from cursor-based pagination for large result sets. +- **Request validation improvements.** Add stricter input validation for + company names (disallow special characters, enforce length limits). --- @@ -94,23 +139,20 @@ Improvements to usability, performance, and developer experience. Lower-urgency enhancements and future features. -- **Export analysis reports.** Allow users to download analysis results as PDF - or CSV from the dashboard. -- **Comparison view.** Side-by-side comparison of two companies' patent - portfolios. -- **Scheduled/recurring analysis.** Periodically re-analyze tracked companies - and alert on significant changes. -- **Webhook/notification support.** Send alerts (Slack, Discord, email) when - batch jobs complete or when a company's innovation score changes - significantly. -- **Multi-model support.** Let users choose between LLM providers per analysis - (e.g., GPT-4o, Gemini, Claude) and compare outputs. -- **Patent trend charts.** Visualize patent filing frequency and technology - category distribution over time in the Analytics page. -- **API pagination.** The `/analyze/batch` and `/jobs` endpoints could benefit - from cursor-based pagination for large result sets. -- **OpenAPI client generation.** Auto-generate the TypeScript API client from - the FastAPI OpenAPI spec to keep frontend types in sync. +- **Historical analysis diffing.** Show what changed between two analysis runs + for the same company, highlighting new patents and score shifts. +- **Patent classification tagging.** Automatically tag patents by technology + domain (AI, semiconductors, materials science) using LLM classification. +- **User-level API keys.** Allow users to generate personal API keys for + programmatic access without JWT token refresh. +- **Batch export.** Export analysis results for multiple companies at once as + a ZIP archive. +- **Rate limiting dashboard.** Surface rate limit status and usage statistics + in the admin panel. +- **Async webhook delivery.** Move webhook delivery to a background task queue + (e.g., Celery, arq) to avoid blocking the scheduler. +- **Multi-tenant support.** Scope analysis results and tracked companies per + user or organization. --- diff --git a/SPARC/scheduler.py b/SPARC/scheduler.py index 5af3940..4428bfd 100644 --- a/SPARC/scheduler.py +++ b/SPARC/scheduler.py @@ -2,14 +2,17 @@ Uses APScheduler to periodically re-analyze tracked companies and detect significant changes in patent counts. + +The scheduler reuses the application-level pooled DatabaseClient +(from ``SPARC.auth``) instead of creating its own connection, which +avoids exhausting the database connection pool under load. """ import logging import os -from SPARC import config from SPARC.analyzer import CompanyAnalyzer -from SPARC.database import DatabaseClient +from SPARC.auth import get_db_client logger = logging.getLogger(__name__) @@ -21,10 +24,13 @@ CHANGE_THRESHOLD_PERCENT = int(os.getenv("CHANGE_THRESHOLD_PERCENT", "20")) def run_scheduled_analysis() -> None: - """Re-analyze all tracked companies and check for significant changes.""" - db = DatabaseClient(config.database_url) - db.connect() - db.initialize_schema() + """Re-analyze all tracked companies and check for significant changes. + + Uses the shared pooled DatabaseClient from ``SPARC.auth.get_db_client()`` + rather than creating a disposable connection, so the scheduler participates + in the same connection pool as the rest of the application. + """ + db = get_db_client() tracked = db.list_tracked_companies() if not tracked: @@ -74,7 +80,6 @@ def run_scheduled_analysis() -> None: except Exception as e: logger.error("Error analyzing tracked company %s: %s", name, e) - db.close() logger.info("Scheduled analysis complete") diff --git a/tests/test_analyze_single_patent.py b/tests/test_analyze_single_patent.py new file mode 100644 index 0000000..3b2283b --- /dev/null +++ b/tests/test_analyze_single_patent.py @@ -0,0 +1,211 @@ +"""Tests for analyze_single_patent auto-download path. + +Covers issue #1661: +- PDF exists on disk: direct analysis (happy path) +- PDF not on disk, cached link exists: auto-download and analyze +- PDF not on disk, no cached link: FileNotFoundError +- Analysis failure after PDF found: graceful error message +- Model override parameter passthrough +""" + +import os +from unittest.mock import MagicMock, patch + +import pytest + +from SPARC.analyzer import CompanyAnalyzer +from SPARC.types import Patent + + +@pytest.fixture(autouse=True) +def mock_db(mocker): + """Mock DatabaseClient so no real DB is needed.""" + mock_db_cls = mocker.patch("SPARC.analyzer.DatabaseClient") + mock_db_instance = MagicMock() + mock_db_instance.get_cached_patent.return_value = None + mock_db_instance.get_cached_serp_query.return_value = None + mock_db_cls.return_value = mock_db_instance + return mock_db_instance + + +@pytest.fixture +def analyzer(mocker, mock_db): + """Create a CompanyAnalyzer with mocked LLM and DB.""" + mocker.patch("SPARC.analyzer.LLMAnalyzer") + return CompanyAnalyzer(openrouter_api_key="test-key") + + +class TestAnalyzeSinglePatentAutoDownload: + """Test the auto-download logic in analyze_single_patent.""" + + def test_pdf_on_disk_analyzed_directly(self, analyzer, mocker, tmp_path): + """When PDF exists on disk, it is analyzed directly without download.""" + patent_id = "US-11234567-B2" + + # Create the patents dir and PDF file + patents_dir = tmp_path / "patents" + patents_dir.mkdir() + pdf_path = patents_dir / f"{patent_id}.pdf" + pdf_path.write_bytes(b"fake PDF content") + + mock_parse = mocker.patch("SPARC.analyzer.SERP.parse_patent_pdf") + mock_minimize = mocker.patch("SPARC.analyzer.SERP.minimize_patent_for_llm") + mock_parse.return_value = {"abstract": "test", "claims": "test claims"} + mock_minimize.return_value = "minimized content" + analyzer.llm_analyzer.analyze_patent_content.return_value = "Good patent." + + # Change cwd so patents/{patent_id}.pdf resolves to our tmp_path + original_cwd = os.getcwd() + os.chdir(tmp_path) + try: + result = analyzer.analyze_single_patent(patent_id, "TestCo") + finally: + os.chdir(original_cwd) + + assert result == "Good patent." + # DB cache should not have been queried since file existed + analyzer.db.get_cached_patent.assert_not_called() + + def test_auto_download_from_cached_link(self, analyzer, mocker, tmp_path): + """When PDF is not on disk but link is cached, auto-download occurs.""" + patent_id = "US-99887766-A1" + + # No patents dir exists (PDF not on disk) + mock_save = mocker.patch("SPARC.analyzer.SERP.save_patents") + downloaded_patent = Patent(patent_id=patent_id, pdf_link="https://example.com/patent.pdf") + downloaded_patent.pdf_path = f"patents/{patent_id}.pdf" + mock_save.return_value = downloaded_patent + + # Cached patent has a PDF link + analyzer.db.get_cached_patent.return_value = { + "patent_id": patent_id, + "pdf_link": "https://example.com/patent.pdf", + } + + # Mock the rest of the analysis pipeline + mock_parse = mocker.patch("SPARC.analyzer.SERP.parse_patent_pdf") + mock_minimize = mocker.patch("SPARC.analyzer.SERP.minimize_patent_for_llm") + mock_parse.return_value = {"abstract": "test abstract"} + mock_minimize.return_value = "minimized content" + analyzer.llm_analyzer.analyze_patent_content.return_value = "Strong innovation." + + # Change cwd so patents/{patent_id}.pdf does NOT exist + original_cwd = os.getcwd() + os.chdir(tmp_path) + try: + result = analyzer.analyze_single_patent(patent_id, "DownloadCo") + finally: + os.chdir(original_cwd) + + assert result == "Strong innovation." + analyzer.db.get_cached_patent.assert_called_once_with(patent_id) + mock_save.assert_called_once() + # Verify the Patent passed to save_patents has the correct ID and link + saved_patent = mock_save.call_args[0][0] + assert saved_patent.patent_id == patent_id + assert saved_patent.pdf_link == "https://example.com/patent.pdf" + + def test_no_cached_link_raises_file_not_found(self, analyzer, mocker, tmp_path): + """When PDF is not on disk and no cached link, FileNotFoundError raised.""" + patent_id = "US-00000000-X1" + + analyzer.db.get_cached_patent.return_value = None + + original_cwd = os.getcwd() + os.chdir(tmp_path) + try: + with pytest.raises(FileNotFoundError, match="no download link is cached"): + analyzer.analyze_single_patent(patent_id, "MissingCo") + finally: + os.chdir(original_cwd) + + def test_cached_patent_without_pdf_link_raises(self, analyzer, mocker, tmp_path): + """When cached patent exists but has no pdf_link, FileNotFoundError raised.""" + patent_id = "US-11111111-B1" + + analyzer.db.get_cached_patent.return_value = { + "patent_id": patent_id, + "pdf_link": None, + } + + original_cwd = os.getcwd() + os.chdir(tmp_path) + try: + with pytest.raises(FileNotFoundError, match="no download link is cached"): + analyzer.analyze_single_patent(patent_id, "NoPDFCo") + finally: + os.chdir(original_cwd) + + def test_analysis_exception_returns_error_message(self, analyzer, mocker, tmp_path): + """When analysis pipeline fails, returns error string instead of raising.""" + patent_id = "US-22222222-A2" + + # Create the PDF on disk so it skips download + patents_dir = tmp_path / "patents" + patents_dir.mkdir() + (patents_dir / f"{patent_id}.pdf").write_bytes(b"fake PDF") + + # Parse fails + mocker.patch( + "SPARC.analyzer.SERP.parse_patent_pdf", + side_effect=ValueError("Corrupt PDF"), + ) + + original_cwd = os.getcwd() + os.chdir(tmp_path) + try: + result = analyzer.analyze_single_patent(patent_id, "ErrorCo") + finally: + os.chdir(original_cwd) + + assert "Failed to analyze patent" in result + assert "Corrupt PDF" in result + + def test_model_override_passed_to_llm(self, analyzer, mocker, tmp_path): + """The model parameter is forwarded to the LLM analyzer.""" + patent_id = "US-33333333-B2" + + patents_dir = tmp_path / "patents" + patents_dir.mkdir() + (patents_dir / f"{patent_id}.pdf").write_bytes(b"fake PDF") + + mocker.patch("SPARC.analyzer.SERP.parse_patent_pdf", return_value={"abstract": "test"}) + mocker.patch("SPARC.analyzer.SERP.minimize_patent_for_llm", return_value="content") + analyzer.llm_analyzer.analyze_patent_content.return_value = "Analysis result." + + original_cwd = os.getcwd() + os.chdir(tmp_path) + try: + result = analyzer.analyze_single_patent( + patent_id, "ModelCo", model="openai/gpt-4o" + ) + finally: + os.chdir(original_cwd) + + assert result == "Analysis result." + analyzer.llm_analyzer.analyze_patent_content.assert_called_once_with( + patent_content="content", + company_name="ModelCo", + model="openai/gpt-4o", + ) + + def test_file_not_found_during_parse_re_raised(self, analyzer, mocker, tmp_path): + """FileNotFoundError during parsing is re-raised, not caught.""" + patent_id = "US-44444444-C1" + + patents_dir = tmp_path / "patents" + patents_dir.mkdir() + (patents_dir / f"{patent_id}.pdf").write_bytes(b"fake PDF") + + mocker.patch( + "SPARC.analyzer.SERP.parse_patent_pdf", + side_effect=FileNotFoundError("PDF file vanished"), + ) + + original_cwd = os.getcwd() + os.chdir(tmp_path) + try: + with pytest.raises(FileNotFoundError, match="PDF file vanished"): + analyzer.analyze_single_patent(patent_id, "VanishCo") + finally: + os.chdir(original_cwd) diff --git a/tests/test_export.py b/tests/test_export.py new file mode 100644 index 0000000..d0c856f --- /dev/null +++ b/tests/test_export.py @@ -0,0 +1,224 @@ +"""Tests for export endpoints: CSV and PDF export of analysis results. + +Covers issue #1655: +- GET /export/{company_name} (CSV export) +- GET /export/{company_name}/pdf (PDF export) + +All tests mock the database layer and use JWT auth fixtures from test_auth patterns. +""" + +from datetime import datetime, timezone +from unittest.mock import MagicMock, patch + +import pytest +from fastapi.testclient import TestClient + +from SPARC.api import app +from SPARC.auth import create_access_token + + +@pytest.fixture +def client(): + """Create test client.""" + return TestClient(app) + + +@pytest.fixture(autouse=True) +def mock_db(): + """Mock the database client used by export and auth endpoints.""" + db = MagicMock() + + # Default: user exists for auth + db.get_user_by_id.return_value = { + "id": 1, + "email": "user@test.com", + "role": "user", + "created_at": datetime(2025, 1, 1, tzinfo=timezone.utc), + } + + # Mock get_conn for export queries + mock_cursor = MagicMock() + mock_conn = MagicMock() + mock_conn.cursor.return_value.__enter__ = MagicMock(return_value=mock_cursor) + mock_conn.cursor.return_value.__exit__ = MagicMock(return_value=False) + db.get_conn.return_value.__enter__ = MagicMock(return_value=mock_conn) + db.get_conn.return_value.__exit__ = MagicMock(return_value=False) + db._mock_cursor = mock_cursor + + with patch("SPARC.api.get_db_client", return_value=db), \ + patch("SPARC.auth.get_db_client", return_value=db): + yield db + + +def _auth_header(): + """Create an Authorization header with a valid access token.""" + token = create_access_token(1, "user@test.com", "user") + return {"Authorization": f"Bearer {token}"} + + +def _sample_rows(): + """Return sample llm_messages rows as tuples (matching cursor.fetchall format).""" + return [ + ( + "NVIDIA", + "company_analysis", + "anthropic/claude-3.5-sonnet", + "Strong AI patent portfolio with focus on GPU architectures.", + datetime(2025, 6, 15, 10, 30, 0), + ), + ( + "NVIDIA", + "patent_analysis", + "openai/gpt-4o", + "Patent US-12345678-B2 covers novel tensor core design.", + datetime(2025, 6, 14, 9, 0, 0), + ), + ] + + +class TestCSVExport: + """GET /export/{company_name} -- CSV export.""" + + def test_csv_export_success(self, client, mock_db): + """Valid company with results returns a CSV file.""" + mock_db._mock_cursor.fetchall.return_value = _sample_rows() + + response = client.get("/export/NVIDIA", headers=_auth_header()) + + assert response.status_code == 200 + assert response.headers["content-type"].startswith("text/csv") + assert "attachment" in response.headers.get("content-disposition", "") + assert "sparc_nvidia_export.csv" in response.headers["content-disposition"] + + # Verify CSV content (CSV uses \r\n line endings) + lines = response.text.strip().split("\n") + assert len(lines) == 3 # header + 2 data rows + assert lines[0].strip() == "company_name,analysis_type,model,analysis,timestamp" + assert "NVIDIA" in lines[1] + assert "company_analysis" in lines[1] + + def test_csv_export_no_results_returns_404(self, client, mock_db): + """Unknown company returns 404.""" + mock_db._mock_cursor.fetchall.return_value = [] + + response = client.get("/export/nonexistent", headers=_auth_header()) + + assert response.status_code == 404 + assert "No analysis results found" in response.json()["detail"] + + def test_csv_export_unauthenticated_returns_401(self, client): + """Request without token returns 401.""" + response = client.get("/export/NVIDIA") + assert response.status_code == 401 + + def test_csv_export_invalid_token_returns_401(self, client): + """Request with invalid token returns 401.""" + response = client.get( + "/export/NVIDIA", + headers={"Authorization": "Bearer invalid.token.here"}, + ) + assert response.status_code == 401 + + def test_csv_export_filename_sanitization(self, client, mock_db): + """Company names with spaces get sanitized in the filename.""" + mock_db._mock_cursor.fetchall.return_value = [ + ( + "Tesla Motors", + "company_analysis", + "anthropic/claude-3.5-sonnet", + "EV patent portfolio analysis.", + datetime(2025, 6, 15, 10, 0, 0), + ), + ] + + response = client.get("/export/Tesla Motors", headers=_auth_header()) + + assert response.status_code == 200 + assert "tesla_motors" in response.headers["content-disposition"] + + def test_csv_export_single_row(self, client, mock_db): + """Single analysis result produces valid CSV with one data row.""" + mock_db._mock_cursor.fetchall.return_value = [_sample_rows()[0]] + + response = client.get("/export/NVIDIA", headers=_auth_header()) + + assert response.status_code == 200 + lines = response.text.strip().split("\n") + assert len(lines) == 2 # header + 1 data row + + +class TestPDFExport: + """GET /export/{company_name}/pdf -- PDF report export.""" + + def test_pdf_export_success(self, client, mock_db): + """Valid company with results returns a PDF file.""" + mock_db._mock_cursor.fetchall.return_value = _sample_rows() + + response = client.get("/export/NVIDIA/pdf", headers=_auth_header()) + + assert response.status_code == 200 + assert response.headers["content-type"] == "application/pdf" + assert "attachment" in response.headers.get("content-disposition", "") + # PDF files start with %PDF + assert response.content[:4] == b"%PDF" + + def test_pdf_export_no_results_returns_404(self, client, mock_db): + """Unknown company returns 404.""" + mock_db._mock_cursor.fetchall.return_value = [] + + response = client.get("/export/nonexistent/pdf", headers=_auth_header()) + + assert response.status_code == 404 + assert "No analysis results found" in response.json()["detail"] + + def test_pdf_export_unauthenticated_returns_401(self, client): + """Request without token returns 401.""" + response = client.get("/export/NVIDIA/pdf") + assert response.status_code == 401 + + def test_pdf_export_invalid_token_returns_401(self, client): + """Request with invalid token returns 401.""" + response = client.get( + "/export/NVIDIA/pdf", + headers={"Authorization": "Bearer invalid.token.here"}, + ) + assert response.status_code == 401 + + def test_pdf_export_filename_contains_date(self, client, mock_db): + """PDF filename includes the analysis date.""" + mock_db._mock_cursor.fetchall.return_value = _sample_rows() + + response = client.get("/export/NVIDIA/pdf", headers=_auth_header()) + + assert response.status_code == 200 + disposition = response.headers["content-disposition"] + assert "nvidia-analysis-" in disposition + assert ".pdf" in disposition + + def test_pdf_export_special_chars_in_response(self, client, mock_db): + """Analysis text with XML-special chars (<, >, &) does not break PDF generation.""" + rows = [ + ( + "TestCo", + "company_analysis", + "anthropic/claude-3.5-sonnet", + "Revenue > $1B & growth <20% for Q4. Test escaping.", + datetime(2025, 6, 15, 10, 0, 0), + ), + ] + mock_db._mock_cursor.fetchall.return_value = rows + + response = client.get("/export/TestCo/pdf", headers=_auth_header()) + + assert response.status_code == 200 + assert response.content[:4] == b"%PDF" + + def test_pdf_export_multiple_analyses(self, client, mock_db): + """Multiple analysis records produce a valid PDF with content.""" + mock_db._mock_cursor.fetchall.return_value = _sample_rows() + + response = client.get("/export/NVIDIA/pdf", headers=_auth_header()) + + assert response.status_code == 200 + # PDF should have reasonable size (more than just headers) + assert len(response.content) > 500 diff --git a/tests/test_storage.py b/tests/test_storage.py new file mode 100644 index 0000000..ba75be8 --- /dev/null +++ b/tests/test_storage.py @@ -0,0 +1,263 @@ +"""Tests for S3/MinIO storage backend in storage.py. + +Covers issue #1660: +- S3StorageBackend read, write, exists, path_for +- Error handling: NoSuchKey, generic S3 errors, bucket auto-creation +- get_storage_backend() factory function +- LocalStorageBackend (basic sanity checks) +""" + +from unittest.mock import MagicMock, patch + +import pytest + +from SPARC.storage import LocalStorageBackend, S3StorageBackend, get_storage_backend + + +# ---------- S3StorageBackend ---------- + +class TestS3StorageBackend: + """Tests for the S3-compatible storage backend.""" + + @pytest.fixture + def s3_backend(self): + """Create an S3StorageBackend with a fully mocked boto3 client.""" + with patch.dict("sys.modules", {"boto3": MagicMock()}): + import boto3 as mock_boto + mock_s3 = MagicMock() + mock_boto.client.return_value = mock_s3 + mock_s3.head_bucket.return_value = {} + + backend = S3StorageBackend( + bucket="test-bucket", + endpoint_url="http://minio:9000", + access_key="minioadmin", + secret_key="minioadmin", + ) + # Expose mock for assertions + backend._mock_s3 = mock_s3 + yield backend + + def test_write_puts_object(self, s3_backend): + """write() calls put_object with correct bucket, key, and body.""" + s3_backend.write("US-12345678-B2.pdf", b"PDF content here") + + s3_backend._mock_s3.put_object.assert_called_once_with( + Bucket="test-bucket", + Key="US-12345678-B2.pdf", + Body=b"PDF content here", + ContentType="application/pdf", + ) + + def test_read_returns_body(self, s3_backend): + """read() returns the Body content from get_object.""" + mock_body = MagicMock() + mock_body.read.return_value = b"PDF data" + s3_backend._mock_s3.get_object.return_value = {"Body": mock_body} + + result = s3_backend.read("US-12345678-B2.pdf") + + assert result == b"PDF data" + s3_backend._mock_s3.get_object.assert_called_once_with( + Bucket="test-bucket", + Key="US-12345678-B2.pdf", + ) + + def test_read_nosuchkey_raises_file_not_found(self, s3_backend): + """read() raises FileNotFoundError when object does not exist.""" + # Create a NoSuchKey exception class on the mock + nosuchkey = type("NoSuchKey", (Exception,), {}) + s3_backend._mock_s3.exceptions.NoSuchKey = nosuchkey + s3_backend._mock_s3.get_object.side_effect = nosuchkey("not found") + + # Reassign s3 to trigger the except branch + s3_backend.s3 = s3_backend._mock_s3 + + with pytest.raises(FileNotFoundError, match="S3 object not found"): + s3_backend.read("missing.pdf") + + def test_read_generic_404_raises_file_not_found(self, s3_backend): + """read() handles generic 404 errors from S3-compatible APIs.""" + nosuchkey = type("NoSuchKey", (Exception,), {}) + s3_backend._mock_s3.exceptions.NoSuchKey = nosuchkey + s3_backend.s3 = s3_backend._mock_s3 + s3_backend.s3.get_object.side_effect = Exception("An error occurred (404)") + + with pytest.raises(FileNotFoundError, match="S3 object not found"): + s3_backend.read("missing.pdf") + + def test_read_other_error_re_raises(self, s3_backend): + """read() re-raises non-404 errors.""" + nosuchkey = type("NoSuchKey", (Exception,), {}) + s3_backend._mock_s3.exceptions.NoSuchKey = nosuchkey + s3_backend.s3 = s3_backend._mock_s3 + s3_backend.s3.get_object.side_effect = Exception("Internal server error") + + with pytest.raises(Exception, match="Internal server error"): + s3_backend.read("some-file.pdf") + + def test_exists_returns_true_for_existing_object(self, s3_backend): + """exists() returns True when head_object succeeds with content.""" + s3_backend._mock_s3.head_object.return_value = {"ContentLength": 1024} + + assert s3_backend.exists("US-12345678-B2.pdf") is True + + def test_exists_returns_false_for_missing_object(self, s3_backend): + """exists() returns False when head_object raises an exception.""" + s3_backend._mock_s3.head_object.side_effect = Exception("Not Found") + + assert s3_backend.exists("missing.pdf") is False + + def test_exists_returns_false_for_zero_length(self, s3_backend): + """exists() returns False when object has zero content length.""" + s3_backend._mock_s3.head_object.return_value = {"ContentLength": 0} + + assert s3_backend.exists("empty.pdf") is False + + def test_path_for_returns_s3_uri(self, s3_backend): + """path_for() returns an s3:// URI.""" + path = s3_backend.path_for("US-12345678-B2.pdf") + + assert path == "s3://test-bucket/US-12345678-B2.pdf" + + def test_constructor_creates_bucket_if_missing(self): + """Constructor creates the bucket if head_bucket fails.""" + with patch.dict("sys.modules", {"boto3": MagicMock()}): + import boto3 as mock_boto + mock_s3 = MagicMock() + mock_boto.client.return_value = mock_s3 + mock_s3.head_bucket.side_effect = Exception("Bucket not found") + + S3StorageBackend( + bucket="new-bucket", + endpoint_url="http://minio:9000", + access_key="admin", + secret_key="admin", + ) + + mock_s3.create_bucket.assert_called_once_with(Bucket="new-bucket") + + def test_constructor_handles_bucket_creation_failure(self): + """Constructor logs warning but does not crash if bucket creation fails.""" + with patch.dict("sys.modules", {"boto3": MagicMock()}): + import boto3 as mock_boto + mock_s3 = MagicMock() + mock_boto.client.return_value = mock_s3 + mock_s3.head_bucket.side_effect = Exception("Bucket not found") + mock_s3.create_bucket.side_effect = Exception("Permission denied") + + # Should not raise + backend = S3StorageBackend( + bucket="locked-bucket", + endpoint_url="http://minio:9000", + access_key="admin", + secret_key="admin", + ) + assert backend.bucket == "locked-bucket" + + def test_constructor_passes_endpoint_and_credentials(self): + """Constructor passes endpoint_url and credentials to boto3.client.""" + with patch.dict("sys.modules", {"boto3": MagicMock()}): + import boto3 as mock_boto + mock_s3 = MagicMock() + mock_boto.client.return_value = mock_s3 + + S3StorageBackend( + bucket="test", + endpoint_url="http://minio:9000", + access_key="mykey", + secret_key="mysecret", + ) + + mock_boto.client.assert_called_with( + "s3", + endpoint_url="http://minio:9000", + aws_access_key_id="mykey", + aws_secret_access_key="mysecret", + ) + + +# ---------- LocalStorageBackend ---------- + +class TestLocalStorageBackend: + """Basic sanity checks for the local filesystem backend.""" + + def test_write_and_read(self, tmp_path): + """Write and read round-trip produces identical content.""" + backend = LocalStorageBackend(base_dir=str(tmp_path)) + backend.write("test.pdf", b"hello world") + + result = backend.read("test.pdf") + assert result == b"hello world" + + def test_read_missing_file_raises(self, tmp_path): + """Reading a non-existent file raises FileNotFoundError.""" + backend = LocalStorageBackend(base_dir=str(tmp_path)) + + with pytest.raises(FileNotFoundError): + backend.read("nonexistent.pdf") + + def test_exists_true_for_written_file(self, tmp_path): + """exists() returns True after writing a file.""" + backend = LocalStorageBackend(base_dir=str(tmp_path)) + backend.write("test.pdf", b"data") + + assert backend.exists("test.pdf") is True + + def test_exists_false_for_missing_file(self, tmp_path): + """exists() returns False for non-existent file.""" + backend = LocalStorageBackend(base_dir=str(tmp_path)) + + assert backend.exists("missing.pdf") is False + + def test_exists_false_for_empty_file(self, tmp_path): + """exists() returns False for zero-length file.""" + backend = LocalStorageBackend(base_dir=str(tmp_path)) + backend.write("empty.pdf", b"") + + assert backend.exists("empty.pdf") is False + + def test_path_for_returns_full_path(self, tmp_path): + """path_for() returns the full filesystem path.""" + backend = LocalStorageBackend(base_dir=str(tmp_path)) + path = backend.path_for("test.pdf") + + assert path == str(tmp_path / "test.pdf") + + +# ---------- get_storage_backend() factory ---------- + +class TestGetStorageBackend: + """Tests for the storage backend factory function.""" + + @patch("SPARC.storage.config") + def test_returns_local_backend_by_default(self, mock_config): + """Default config returns LocalStorageBackend.""" + mock_config.storage_backend = "local" + + backend = get_storage_backend() + + assert isinstance(backend, LocalStorageBackend) + + @patch("SPARC.storage.config") + def test_returns_s3_backend_when_configured(self, mock_config): + """Setting storage_backend=s3 returns S3StorageBackend.""" + mock_config.storage_backend = "s3" + mock_config.s3_bucket = "test-bucket" + mock_config.s3_endpoint_url = "http://minio:9000" + mock_config.s3_access_key = "key" + mock_config.s3_secret_key = "secret" + + with patch.dict("sys.modules", {"boto3": MagicMock()}): + backend = get_storage_backend() + + assert isinstance(backend, S3StorageBackend) + + @patch("SPARC.storage.config") + def test_case_insensitive_backend_selection(self, mock_config): + """Backend selection is case-insensitive.""" + mock_config.storage_backend = "LOCAL" + + backend = get_storage_backend() + + assert isinstance(backend, LocalStorageBackend) diff --git a/tests/test_webhooks.py b/tests/test_webhooks.py new file mode 100644 index 0000000..9950030 --- /dev/null +++ b/tests/test_webhooks.py @@ -0,0 +1,280 @@ +"""Tests for webhook notification system: retry logic and Slack/Discord payload format. + +Covers issue #1657: +- Retry logic with exponential backoff in _send_with_retry +- Slack/Discord payload formatting in _build_payload +- Generic HTTP POST payload formatting +- notify() dispatching to multiple URLs +- notify_job_completed() and notify_alert() convenience helpers +""" + +from datetime import datetime +from unittest.mock import MagicMock, patch, call + +import pytest +import requests + +from SPARC.webhooks import ( + MAX_RETRIES, + _build_payload, + _is_slack_url, + _send_with_retry, + notify, + notify_alert, + notify_job_completed, +) + + +class TestIsSlackUrl: + """Tests for Slack/Discord URL detection.""" + + def test_slack_webhook_url(self): + assert _is_slack_url("https://hooks.slack.com/services/T00/B00/xxx") is True + + def test_discord_webhook_url(self): + assert _is_slack_url("https://discord.com/api/webhooks/123/abc") is True + + def test_generic_url(self): + assert _is_slack_url("https://example.com/webhook") is False + + def test_empty_url(self): + assert _is_slack_url("") is False + + +class TestBuildPayload: + """Tests for payload construction.""" + + def test_generic_payload_structure(self): + """Generic payload includes event type, timestamp, and data.""" + payload = _build_payload("job_completed", {"job_id": "abc123"}) + + assert payload["event"] == "job_completed" + assert payload["job_id"] == "abc123" + assert "timestamp" in payload + # Timestamp should be ISO format ending with Z + assert payload["timestamp"].endswith("Z") + + def test_slack_payload_wraps_in_text(self): + """Slack payload wraps content in a 'text' field.""" + payload = _build_payload("patent_alert", {"company_name": "NVIDIA"}, slack=True) + + assert "text" in payload + assert "patent_alert" in payload["text"] + assert "NVIDIA" in payload["text"] + # Slack payload should NOT have the event/timestamp at top level + assert "event" not in payload + assert "timestamp" not in payload + + def test_generic_payload_does_not_have_text_field(self): + """Non-Slack payload does not wrap in text.""" + payload = _build_payload("job_completed", {"status": "done"}) + + assert "text" not in payload + assert payload["status"] == "done" + + def test_slack_payload_contains_bold_header(self): + """Slack payload starts with bold event header using Slack markdown.""" + payload = _build_payload("job_completed", {"count": 5}, slack=True) + + assert payload["text"].startswith("*[SPARC] job_completed*") + + def test_payload_merges_all_data_keys(self): + """All data keys are included in the generic payload.""" + data = {"key1": "val1", "key2": 42, "key3": True} + payload = _build_payload("test_event", data) + + assert payload["key1"] == "val1" + assert payload["key2"] == 42 + assert payload["key3"] is True + + +class TestSendWithRetry: + """Tests for retry logic in _send_with_retry.""" + + @patch("SPARC.webhooks.time.sleep") + @patch("SPARC.webhooks.requests.post") + def test_success_on_first_attempt(self, mock_post, mock_sleep): + """Successful delivery on first attempt, no retries.""" + mock_post.return_value = MagicMock(status_code=200) + + result = _send_with_retry("https://example.com/hook", {"event": "test"}) + + assert result is True + mock_post.assert_called_once() + mock_sleep.assert_not_called() + + @patch("SPARC.webhooks.time.sleep") + @patch("SPARC.webhooks.requests.post") + def test_success_on_second_attempt(self, mock_post, mock_sleep): + """Fails first, succeeds on retry.""" + mock_post.side_effect = [ + MagicMock(status_code=500), + MagicMock(status_code=200), + ] + + result = _send_with_retry("https://example.com/hook", {"event": "test"}) + + assert result is True + assert mock_post.call_count == 2 + mock_sleep.assert_called_once() + + @patch("SPARC.webhooks.time.sleep") + @patch("SPARC.webhooks.requests.post") + def test_all_retries_exhausted(self, mock_post, mock_sleep): + """Returns False after all retries fail.""" + mock_post.return_value = MagicMock(status_code=500) + + result = _send_with_retry("https://example.com/hook", {"event": "test"}) + + assert result is False + assert mock_post.call_count == MAX_RETRIES + assert mock_sleep.call_count == MAX_RETRIES - 1 + + @patch("SPARC.webhooks.time.sleep") + @patch("SPARC.webhooks.requests.post") + def test_exponential_backoff_timing(self, mock_post, mock_sleep): + """Backoff wait times follow exponential pattern (2^attempt).""" + mock_post.return_value = MagicMock(status_code=500) + + _send_with_retry("https://example.com/hook", {"event": "test"}) + + # With BACKOFF_BASE=2: attempt 1 -> sleep(2), attempt 2 -> sleep(4) + expected_waits = [call(2 ** i) for i in range(1, MAX_RETRIES)] + assert mock_sleep.call_args_list == expected_waits + + @patch("SPARC.webhooks.time.sleep") + @patch("SPARC.webhooks.requests.post") + def test_network_error_triggers_retry(self, mock_post, mock_sleep): + """Network exceptions trigger retry, not immediate failure.""" + mock_post.side_effect = [ + requests.ConnectionError("Connection refused"), + MagicMock(status_code=200), + ] + + result = _send_with_retry("https://example.com/hook", {"event": "test"}) + + assert result is True + assert mock_post.call_count == 2 + + @patch("SPARC.webhooks.time.sleep") + @patch("SPARC.webhooks.requests.post") + def test_timeout_error_triggers_retry(self, mock_post, mock_sleep): + """Timeout exceptions trigger retry.""" + mock_post.side_effect = [ + requests.Timeout("Request timed out"), + MagicMock(status_code=200), + ] + + result = _send_with_retry("https://example.com/hook", {"event": "test"}) + + assert result is True + assert mock_post.call_count == 2 + + @patch("SPARC.webhooks.time.sleep") + @patch("SPARC.webhooks.requests.post") + def test_2xx_status_codes_accepted(self, mock_post, mock_sleep): + """Any 2xx status code is treated as success.""" + mock_post.return_value = MagicMock(status_code=204) + + result = _send_with_retry("https://example.com/hook", {"event": "test"}) + + assert result is True + mock_post.assert_called_once() + + @patch("SPARC.webhooks.time.sleep") + @patch("SPARC.webhooks.requests.post") + def test_posts_json_payload(self, mock_post, mock_sleep): + """Payload is sent as JSON with correct timeout.""" + mock_post.return_value = MagicMock(status_code=200) + payload = {"event": "test", "data": "value"} + + _send_with_retry("https://example.com/hook", payload) + + mock_post.assert_called_once_with( + "https://example.com/hook", json=payload, timeout=10 + ) + + +class TestNotify: + """Tests for the notify() dispatcher.""" + + @patch("SPARC.webhooks._send_with_retry") + @patch("SPARC.webhooks.WEBHOOK_URLS", ["https://example.com/hook1", "https://example.com/hook2"]) + def test_dispatches_to_all_urls(self, mock_send): + """notify() sends to every configured webhook URL.""" + mock_send.return_value = True + + notify("job_completed", {"job_id": "test123"}) + + assert mock_send.call_count == 2 + + @patch("SPARC.webhooks._send_with_retry") + @patch("SPARC.webhooks.WEBHOOK_URLS", []) + def test_no_urls_configured_returns_immediately(self, mock_send): + """No-op when no webhook URLs are configured.""" + notify("job_completed", {"job_id": "test123"}) + + mock_send.assert_not_called() + + @patch("SPARC.webhooks._send_with_retry") + @patch("SPARC.webhooks.WEBHOOK_URLS", [ + "https://hooks.slack.com/services/T00/B00/xxx", + "https://example.com/generic", + ]) + def test_slack_url_gets_slack_payload(self, mock_send): + """Slack URLs receive Slack-formatted payloads, others get generic.""" + mock_send.return_value = True + + notify("test_event", {"key": "val"}) + + # First call (Slack URL) should have "text" key + slack_payload = mock_send.call_args_list[0][0][1] + assert "text" in slack_payload + + # Second call (generic URL) should have "event" key + generic_payload = mock_send.call_args_list[1][0][1] + assert "event" in generic_payload + assert generic_payload["event"] == "test_event" + + +class TestNotifyJobCompleted: + """Tests for notify_job_completed() convenience function.""" + + @patch("SPARC.webhooks.notify") + def test_sends_correct_event_and_data(self, mock_notify): + """Job completion sends proper event type and summary.""" + notify_job_completed( + job_id="batch-001", + status="completed", + total_companies=10, + successful=8, + failed=2, + ) + + mock_notify.assert_called_once() + event, data = mock_notify.call_args[0] + assert event == "job_completed" + assert data["job_id"] == "batch-001" + assert data["successful"] == 8 + assert data["failed"] == 2 + assert "8/10" in data["summary"] + + +class TestNotifyAlert: + """Tests for notify_alert() convenience function.""" + + @patch("SPARC.webhooks.notify") + def test_sends_correct_event_and_data(self, mock_notify): + """Alert notification sends patent_alert event type.""" + notify_alert( + company_name="NVIDIA", + alert_type="patent_count_change", + message="Patent count increased by 30%", + ) + + mock_notify.assert_called_once() + event, data = mock_notify.call_args[0] + assert event == "patent_alert" + assert data["company_name"] == "NVIDIA" + assert data["alert_type"] == "patent_count_change" + assert "30%" in data["message"]