Files
aoc/ROADMAP.md
T
tomas.kracmar 7639f5f69d
Release / build-and-push (push) Successful in 1m23s
CI / lint-and-test (push) Successful in 1m22s
Release v1.7.18: fix Alpine.js SRI + CSP, add frontend modernization roadmap
- Revert @alpinejs/csp (CSP build has no support for template literals,
  optional chaining, or x-html — all used in the app template); switch
  back to the regular alpinejs build
- Pin Alpine.js to 3.15.12 with a verified SRI hash (replaces the
  floating @3.x.x tag that caused the integrity check failure)
- Restore 'unsafe-eval' to script-src (required by Alpine.js's
  new Function() expression evaluator; inline script was already
  eliminated in v1.7.17 so 'unsafe-inline' stays removed)
- Add Phase 7.5 Frontend Modernization to ROADMAP: Vue 3 + Vite
  migration plan that will allow a clean CSP without unsafe-eval

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 08:01:57 +02:00

153 lines
7.0 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# AOC Roadmap
This roadmap tracks planned improvements for the Admin Operations Center (AOC) project, organized by phase.
---
## Phase 1: Harden ✅
Goal: fix critical security and reliability gaps before production use.
- [x] Fix JWT signature verification in `auth.py`
- [x] Fix broken frontend auth button references (`loginBtn` / `logoutBtn`)
- [x] Add MongoDB indexes (`dedupe_key`, `timestamp`, `service+timestamp`, `id`, text search)
- [x] Add MongoDB TTL index for data retention (`RETENTION_DAYS`)
- [x] Add `/health` endpoint with database connectivity check
- [x] Replace manual `os.getenv` parsing with Pydantic Settings (`pydantic-settings`)
- [x] Add structured JSON logging (`structlog`)
- [x] Configure CORS middleware via `CORS_ORIGINS` environment variable
- [x] Escape user input before MongoDB `$regex` queries (`routes/events.py`)
- [x] Fix incorrect return value in `maintenance.py dedupe()`
---
## Phase 2: Stabilize ✅
Goal: improve resilience, code quality, and development experience.
- [x] Cache Graph API tokens and reuse them until near expiry
- [x] Add exponential backoff / retry logic for Graph API and Office 365 API calls
- [x] Add unit tests for `normalize_event()`, `_make_dedupe_key()`, and `auth.py`
- [x] Add integration tests for `/api/events` and `/api/fetch-audit-logs`
- [x] Configure linter/formatter (`ruff`) and pre-commit hooks
- [x] Set up GitHub Actions CI pipeline (lint + test)
- [x] Add Pydantic request/response models for API endpoints
- [x] Validate `page_size` and `hours` with strict FastAPI constraints
---
## Phase 3: Scale ✅
Goal: handle larger data volumes and support real-time ingestion.
- [x] Replace skip-based pagination with cursor-based (search-after) pagination
- [x] Add Prometheus `/metrics` endpoint and a Grafana dashboard
- [x] Implement incremental fetch watermarking per source (store last fetch timestamp)
- [x] Add webhook endpoints to receive Microsoft Graph change notifications
- [x] Evaluate Elasticsearch or Azure Cognitive Search for advanced full-text search (MongoDB text index sufficient for current scale)
- [x] Add request ID / correlation ID middleware for distributed tracing
---
## Phase 4: Enhance ✅
Goal: evolve from a polling dashboard into a full security operations tool.
- [x] Migrate frontend to Alpine.js for better state management and maintainability
- [x] Add rule-based alerting (e.g., alert on privileged operations, after-hours activity)
- [x] Add SIEM export (Splunk, Sentinel, syslog webhook)
- [x] Build an audit trail for AOC itself (who queried what, who triggered fetches)
- [x] Add event tagging and commenting (e.g., `investigating`, `false_positive`)
- [x] Add export functionality (CSV / JSON) from the UI
- [x] Add source health dashboard showing last fetch time and status per source
---
## Phase 5: Intelligence ✅
Goal: add AI-powered analysis and external tool integration.
- [x] AI feature flag (`AI_FEATURES_ENABLED`) to gate LLM-dependent features
- [x] Natural language query endpoint (`/api/ask`) with intent extraction and smart sampling
- [x] MCP (Model Context Protocol) server for Claude Desktop / Cursor integration
- [x] Valkey caching for LLM responses and frequent queries
- [x] Async queue (arq) for LLM requests to prevent timeout/cost explosions at scale
- [ ] Advanced analytics dashboard (trending operations, anomaly detection)
## Completed in this PR
All Phase 5 items marked done were implemented in v1.3.0v1.5.0.
Redis caching + async queue implemented in v1.6.0, switched to Valkey.
UI polish (topbar, footer, clickable pills) in v1.6.1v1.6.4.
---
## Phase 6: Security Hardening ✅
Goal: address penetration test findings and threat model gaps.
- [x] Fix CORS credentials leak (v1.7.12)
- [x] Add security headers (X-Frame-Options, X-Content-Type-Options, Referrer-Policy, Permissions-Policy) (v1.7.12)
- [x] Make rate limiter fail-closed on Redis failure (v1.7.12)
- [x] Disable OpenAPI docs by default (v1.7.12)
- [x] Hide tenant_id/client_id from config endpoint when auth disabled (v1.7.12)
- [x] Validate webhook validationToken before echo (v1.7.12)
- [x] Gate `/metrics` behind IP allowlist (v1.7.12)
- [x] Add LLM domain allowlist (`LLM_ALLOWED_DOMAINS`) (v1.7.14)
- [x] Add SIEM webhook SSRF guard + domain allowlist (v1.7.14)
- [x] Add SRI hashes to CDN scripts (v1.7.14)
- [x] Add startup warning for auth misconfiguration (v1.7.14)
- [x] Add Azure Key Vault integration for secrets storage (v1.7.14)
- [x] Internal penetration test + threat model documentation
---
## Phase 7.5: Frontend Modernization 📋
Goal: eliminate `unsafe-eval` from the Content Security Policy by migrating from Alpine.js to a compiled frontend framework.
Status: **Planned**. Current Alpine.js requires `unsafe-eval` because it uses `new Function()` to evaluate attribute expressions at runtime. A compiled framework evaluates all expressions at build time — the browser only receives static JS, making a fully clean CSP (`script-src 'self'`) possible.
### Recommended approach: Vue 3 + Vite
Alpine.js was inspired by Vue, so the migration is largely mechanical:
| Alpine.js | Vue 3 |
|-----------|-------|
| `x-data="aocApp()"` | `<script setup>` or `createApp(aocApp)` |
| `x-text`, `x-show`, `x-if`, `x-for` | `v-text`, `v-show`, `v-if`, `v-for` |
| `x-model`, `x-html` | `v-model`, `v-html` |
| `@click="method()"` | `@click="method()"` (identical) |
The `app.js` logic (`aocApp()` function body, ~820 lines) translates almost directly.
The CDN dependencies on `cdn.jsdelivr.net` and `alcdn.msauth.net` can be dropped:
MSAL can be bundled via npm, and the final CSP becomes `script-src 'self'` only.
### Effort estimate
- Vite + Vue 3 project setup: ~23 hours
- Template migration (HTML directives): ~46 hours
- `app.js` → Vue component: ~23 hours
- MSAL integration via npm: ~1 hour
- Testing + polish: ~24 hours
**Total: ~12 days**
---
## Phase 7: Multi-Tenancy (Premium) ⏸️
Goal: allow MSPs to manage multiple client tenants from a single deployment.
Status: **Planned — not started**. Architecture designed, pending validation of core features (SIEM export, alerting) in production first.
### Architecture
- Row-level isolation: `tenant_id` field on every MongoDB document
- Each tenant has their own Microsoft Entra tenant + app registration credentials
- Auth: user's JWT `tid` claim maps to tenant config automatically
- Super-admin role for MSP staff to access all tenants
### Implementation phases
- **Phase 7.1** (23 days): Tenant model & registry, tenant-aware data layer, per-tenant Graph API auth
- **Phase 7.2** (1 day): Tenant-scoped API routes, tenant-specific config endpoints
- **Phase 7.3** (2 days): Frontend tenant switcher, tenant name display, admin page
- **Phase 7.4** (1 day): License gating — signed JWT `LICENSE_KEY` gates multi-tenant mode
### Licensing model
- Single-tenant: remains MIT/free
- Multi-tenant: premium feature requiring a signed license key
- License key is a JWT with claims: `plan`, `max_tenants`, `exp`, `features`
- Offline license generation tool included
### Effort estimate
~79 days total. Deferred until SIEM export and alerting are battle-tested.