
# AOC Roadmap

This roadmap tracks planned improvements for the Admin Operations Center (AOC) project, organized by phase.

---

## Phase 1: Harden ✅
Goal: fix critical security and reliability gaps before production use.

- [x] Fix JWT signature verification in `auth.py`
- [x] Fix broken frontend auth button references (`loginBtn` / `logoutBtn`)
- [x] Add MongoDB indexes (`dedupe_key`, `timestamp`, `service+timestamp`, `id`, text search)
- [x] Add MongoDB TTL index for data retention (`RETENTION_DAYS`)
- [x] Add `/health` endpoint with database connectivity check
- [x] Replace manual `os.getenv` parsing with Pydantic Settings (`pydantic-settings`)
- [x] Add structured JSON logging (`structlog`)
- [x] Configure CORS middleware via `CORS_ORIGINS` environment variable
- [x] Escape user input before MongoDB `$regex` queries (`routes/events.py`)
- [x] Fix incorrect return value of `dedupe()` in `maintenance.py`
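
As an illustration of the `$regex` escaping item above, here is a minimal sketch that assumes the fix relies on the standard library's `re.escape`. The helper name `build_regex_filter` is hypothetical and is not taken from `routes/events.py`.

```python
import re


def build_regex_filter(field: str, user_input: str) -> dict:
    """Build a MongoDB filter that matches user_input as literal text.

    re.escape() neutralizes regex metacharacters, so input such as ".*"
    is matched literally instead of being interpreted as a pattern.
    """
    return {field: {"$regex": re.escape(user_input), "$options": "i"}}


# Example: the dot and star are escaped before reaching MongoDB.
print(build_regex_filter("actor", "a.b*"))
```

Without escaping, a search string like `.*` would match every document and could be abused to build expensive patterns.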

---

## Phase 2: Stabilize ✅
Goal: improve resilience, code quality, and development experience.

- [x] Cache Graph API tokens and reuse them until near expiry
- [x] Add exponential backoff / retry logic for Graph API and Office 365 API calls
- [x] Add unit tests for `normalize_event()`, `_make_dedupe_key()`, and `auth.py`
- [x] Add integration tests for `/api/events` and `/api/fetch-audit-logs`
- [x] Configure linter/formatter (`ruff`) and pre-commit hooks
- [x] Set up GitHub Actions CI pipeline (lint + test)
- [x] Add Pydantic request/response models for API endpoints
- [x] Validate `page_size` and `hours` with strict FastAPI constraints
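
The retry item above can be sketched roughly as follows. This is not the project's actual implementation: `TransientError`, `backoff_delays`, and `call_with_retry` are hypothetical names, and the base delay, cap, and jitter are assumed values.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (e.g. HTTP 429 or 503)."""


def backoff_delays(retries: int, base: float = 0.5, cap: float = 30.0) -> list[float]:
    """Delays that double on each attempt and never exceed cap."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]


def call_with_retry(
    fn: Callable[[], T],
    retries: int = 4,
    base: float = 0.5,
    cap: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on TransientError with exponential backoff plus jitter."""
    for delay in backoff_delays(retries, base, cap):
        try:
            return fn()
        except TransientError:
            # Random jitter spreads out retries from concurrent workers.
            sleep(delay + random.uniform(0, delay / 2))
    # Final attempt: any error now propagates to the caller.
    return fn()
```

Passing `sleep` as a parameter keeps the helper easy to unit-test without real waiting.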

---

## Phase 3: Scale ✅
Goal: handle larger data volumes and support real-time ingestion.

- [x] Replace skip-based pagination with cursor-based (search-after) pagination
- [x] Add Prometheus `/metrics` endpoint and a Grafana dashboard
- [x] Implement incremental fetch watermarking per source (store last fetch timestamp)
- [x] Add webhook endpoints to receive Microsoft Graph change notifications
- [x] Evaluate Elasticsearch or Azure Cognitive Search for advanced full-text search (outcome: the MongoDB text index is sufficient for current scale)
- [x] Add request ID / correlation ID middleware for distributed tracing
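
To illustrate the cursor-based pagination item above, the sketch below assumes results are sorted by `timestamp` and then `id`, both descending, and that the cursor is an opaque base64 token. The function names and the cursor format are assumptions, not the project's actual API.

```python
import base64
import json
from typing import Optional


def encode_cursor(timestamp: str, event_id: str) -> str:
    """Pack the sort key of the last returned event into an opaque token."""
    raw = json.dumps([timestamp, event_id]).encode()
    return base64.urlsafe_b64encode(raw).decode()


def decode_cursor(cursor: str) -> tuple[str, str]:
    """Recover (timestamp, id) from a token produced by encode_cursor."""
    timestamp, event_id = json.loads(base64.urlsafe_b64decode(cursor))
    return timestamp, event_id


def page_filter(cursor: Optional[str]) -> dict:
    """MongoDB filter selecting events strictly after the cursor position.

    Assumes the query sorts by (timestamp desc, id desc); `id` breaks ties
    between events that share a timestamp.
    """
    if cursor is None:
        return {}  # first page: no lower bound
    ts, eid = decode_cursor(cursor)
    return {"$or": [{"timestamp": {"$lt": ts}}, {"timestamp": ts, "id": {"$lt": eid}}]}
```

Unlike `skip`, this filter can use an index on the sort key, so the cost of fetching a page does not grow with the page number.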

---

## Phase 4: Enhance ✅
Goal: evolve from a polling dashboard into a full security operations tool.

- [x] Migrate frontend to Alpine.js for better state management and maintainability
- [x] Add rule-based alerting (e.g., alert on privileged operations, after-hours activity)
- [x] Add SIEM export (Splunk, Sentinel, syslog webhook)
- [x] Build an audit trail for AOC itself (who queried what, who triggered fetches)
- [x] Add event tagging and commenting (e.g., `investigating`, `false_positive`)
- [x] Add export functionality (CSV / JSON) from the UI
- [x] Add source health dashboard showing last fetch time and status per source

---

## Phase 5: Intelligence ✅
Goal: add AI-powered analysis and external tool integration.

- [x] AI feature flag (`AI_FEATURES_ENABLED`) to gate LLM-dependent features
- [x] Natural language query endpoint (`/api/ask`) with intent extraction and smart sampling
- [x] MCP (Model Context Protocol) server for Claude Desktop / Cursor integration
- [x] Valkey caching for LLM responses and frequent queries
- [x] Async queue (arq) for LLM requests to prevent timeout/cost explosions at scale
- [ ] Advanced analytics dashboard (trending operations, anomaly detection)

## Completed in this PR
- All Phase 5 items marked done were implemented in v1.3.0–v1.5.0.
- Caching and the async queue were implemented in v1.6.0 on Redis, then switched to Valkey.
- UI polish (topbar, footer, clickable pills) shipped in v1.6.1–v1.6.4.

---

## Phase 6: Security Hardening ✅
Goal: address penetration test findings and threat model gaps.

- [x] Fix CORS credentials leak (v1.7.12)
- [x] Add security headers (X-Frame-Options, X-Content-Type-Options, Referrer-Policy, Permissions-Policy) (v1.7.12)
- [x] Make rate limiter fail-closed on Redis failure (v1.7.12)
- [x] Disable OpenAPI docs by default (v1.7.12)
- [x] Hide tenant_id/client_id from config endpoint when auth disabled (v1.7.12)
- [x] Validate webhook validationToken before echo (v1.7.12)
- [x] Gate `/metrics` behind IP allowlist (v1.7.12)
- [x] Add LLM domain allowlist (`LLM_ALLOWED_DOMAINS`) (v1.7.14)
- [x] Add SIEM webhook SSRF guard + domain allowlist (v1.7.14)
- [x] Add SRI hashes to CDN scripts (v1.7.14)
- [x] Add startup warning for auth misconfiguration (v1.7.14)
- [x] Add Azure Key Vault integration for secrets storage (v1.7.14)
- [x] Internal penetration test + threat model documentation

---

## Phase 7.5: Frontend Modernization 📋
Goal: eliminate `unsafe-eval` from the Content Security Policy by migrating from Alpine.js to a compiled frontend framework.

Status: **Planned**. Alpine.js currently requires `unsafe-eval` because it uses `new Function()` to evaluate attribute expressions at runtime. A compiled framework turns template expressions into render functions at build time, so the browser receives only static JS and a fully clean CSP (`script-src 'self'`) becomes possible.

### Recommended approach: Vue 3 + Vite
Alpine.js was inspired by Vue, so the migration is largely mechanical:

| Alpine.js | Vue 3 |
|-----------|-------|
| `x-data="aocApp()"` | `<script setup>` or `createApp(aocApp)` |
| `x-text`, `x-show`, `x-if`, `x-for` | `v-text`, `v-show`, `v-if`, `v-for` |
| `x-model`, `x-html` | `v-model`, `v-html` |
| `@click="method()"` | `@click="method()"` (identical) |

The `app.js` logic (`aocApp()` function body, ~820 lines) translates almost directly.
The CDN dependencies on `cdn.jsdelivr.net` and `alcdn.msauth.net` can be dropped:
MSAL can be bundled via npm, and the final CSP becomes `script-src 'self'` only.

### Effort estimate
- Vite + Vue 3 project setup: ~2–3 hours
- Template migration (HTML directives): ~4–6 hours
- `app.js` → Vue component: ~2–3 hours
- MSAL integration via npm: ~1 hour
- Testing + polish: ~2–4 hours

**Total: ~11–17 hours (~1–2 days)**

---

## Phase 7: Multi-Tenancy (Premium) ⏸️
Goal: allow MSPs to manage multiple client tenants from a single deployment.

Status: **Planned, not started**. The architecture is designed; work is pending production validation of core features (SIEM export, alerting).

### Architecture
- Row-level isolation: `tenant_id` field on every MongoDB document
- Each tenant has their own Microsoft Entra tenant + app registration credentials
- Auth: user's JWT `tid` claim maps to tenant config automatically
- Super-admin role for MSP staff to access all tenants
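
The architecture above might translate into a data-layer helper along these lines. Since Phase 7 has not started, everything here is a hypothetical sketch: `TenantConfig`, `resolve_tenant`, `tenant_scoped`, the `roles` claim, and the `super_admin` role name are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional

SUPER_ADMIN_ROLE = "super_admin"  # hypothetical role name for MSP staff


@dataclass(frozen=True)
class TenantConfig:
    tenant_id: str
    name: str


def resolve_tenant(
    claims: dict,
    registry: dict[str, TenantConfig],
    requested: Optional[str] = None,
) -> TenantConfig:
    """Map the JWT `tid` claim to a tenant; only super-admins may pick another tenant."""
    tid = claims.get("tid")
    if requested and requested != tid:
        if SUPER_ADMIN_ROLE not in claims.get("roles", []):
            raise PermissionError("cross-tenant access requires super-admin")
        tid = requested
    if tid not in registry:
        raise LookupError(f"unknown tenant: {tid!r}")
    return registry[tid]


def tenant_scoped(query: dict, tenant: TenantConfig) -> dict:
    """Wrap a query so it can only match documents owned by the tenant."""
    # $and keeps the tenant constraint even if `query` mentions tenant_id itself.
    return {"$and": [{"tenant_id": tenant.tenant_id}, query]}
```

Routing every read and write through one wrapper like `tenant_scoped` is one way to make row-level isolation hard to bypass by accident.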

### Implementation phases
- **Phase 7.1** (2–3 days): Tenant model & registry, tenant-aware data layer, per-tenant Graph API auth
- **Phase 7.2** (1 day): Tenant-scoped API routes, tenant-specific config endpoints
- **Phase 7.3** (2 days): Frontend tenant switcher, tenant name display, admin page
- **Phase 7.4** (1 day): License gating — signed JWT `LICENSE_KEY` gates multi-tenant mode

### Licensing model
- Single-tenant: remains MIT/free
- Multi-tenant: premium feature requiring a signed license key
- License key is a JWT with claims: `plan`, `max_tenants`, `exp`, `features`
- Offline license generation tool included
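
A license check along the lines described above could be sketched as follows. To stay within the standard library, this example assumes an HS256 (shared-secret) signature. A real offline license scheme would more likely use an asymmetric algorithm, so that the key shipped with a deployment can verify licenses but not mint them. `sign_license` and `verify_license` are hypothetical names.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url_encode(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _b64url_decode(segment: str) -> bytes:
    # JWT segments omit base64 padding; restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def sign_license(claims: dict, secret: bytes) -> str:
    """Produce a compact HS256 JWT carrying the license claims."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    sig = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url_encode(sig)}"


def verify_license(token: str, secret: bytes, now: Optional[float] = None) -> dict:
    """Return the claims if the signature is valid and `exp` is in the future."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
    except ValueError:
        raise ValueError("malformed license key") from None
    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise ValueError("invalid license signature")
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unexpected signing algorithm")
    claims = json.loads(_b64url_decode(payload_b64))
    if claims["exp"] <= (time.time() if now is None else now):
        raise ValueError("license expired")
    return claims
```

Multi-tenant mode would then be enabled only when `verify_license` succeeds and the number of configured tenants does not exceed the `max_tenants` claim.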

### Effort estimate
~7–9 days total. Deferred until SIEM export and alerting are battle-tested.