- Add async Redis client singleton (redis_client.py) for caching and arq pool
- Add arq job functions (jobs.py) for background LLM processing
- Cache ask/explain LLM responses with TTL (1h ask, 24h explain)
- Add async mode to /api/ask: enqueue job, return job_id; client polls /api/jobs/{job_id}
- Add GET /api/jobs/{job_id} endpoint for job status polling
- Add arq worker service to docker-compose (dev + prod)
- Switch from Redis to Valkey (BSD-licensed Redis fork) in Docker Compose
- Add REDIS_URL config setting
- Add tests for cache hit, async mode, and job status
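
The caching item in the list above can be sketched roughly as follows. This is a stdlib-only illustration, not the code in `redis_client.py`: `CACHE_TTL_SECONDS`, `make_cache_key`, `TTLCache`, and `cached_llm_call` are hypothetical names, and an in-memory dict stands in for the Valkey/Redis client. Only the TTL values (1h for ask, 24h for explain) come from the change list.

```python
import hashlib
import json
import time

# TTLs from the change list: 1 hour for ask, 24 hours for explain.
CACHE_TTL_SECONDS = {"ask": 60 * 60, "explain": 24 * 60 * 60}


def make_cache_key(kind: str, payload: dict) -> str:
    """Build a deterministic key from the request kind and its parameters."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"llm:{kind}:{digest}"


class TTLCache:
    """In-memory stand-in for a Valkey/Redis client (hypothetical helper)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: behave like a cache miss
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._store[key] = (self._clock() + ttl_seconds, value)


def cached_llm_call(cache, kind, payload, call_llm):
    """Return a cached response if present, otherwise call the LLM and store it."""
    key = make_cache_key(kind, payload)
    hit = cache.get(key)
    if hit is not None:
        return hit
    response = call_llm(payload)
    cache.set(key, response, CACHE_TTL_SECONDS[kind])
    return response
```

Hashing the canonical JSON of the request keeps keys short and makes them independent of parameter order. A real client would typically delegate expiry to the server instead of checking timestamps itself.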
# AOC Roadmap

This roadmap tracks planned improvements for the Admin Operations Center (AOC) project, organized by phase.

---
## Phase 1: Harden ✅

Goal: fix critical security and reliability gaps before production use.

- [x] Fix JWT signature verification in `auth.py`
- [x] Fix broken frontend auth button references (`loginBtn` / `logoutBtn`)
- [x] Add MongoDB indexes (`dedupe_key`, `timestamp`, `service+timestamp`, `id`, text search)
- [x] Add MongoDB TTL index for data retention (`RETENTION_DAYS`)
- [x] Add `/health` endpoint with database connectivity check
- [x] Replace manual `os.getenv` parsing with Pydantic Settings (`pydantic-settings`)
- [x] Add structured JSON logging (`structlog`)
- [x] Configure CORS middleware via `CORS_ORIGINS` environment variable
- [x] Escape user input before MongoDB `$regex` queries (`routes/events.py`)
- [x] Fix incorrect return value in `maintenance.py dedupe()`
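
As an illustration of the `$regex` escaping item above, a minimal sketch might look like this. It is not the code in `routes/events.py`: the helper name `build_contains_filter` and the `operation` field in the comment are assumptions, and it relies on Python's `re.escape` producing escapes that MongoDB's regex engine also treats as literals.

```python
import re


def build_contains_filter(field: str, user_input: str) -> dict:
    """Build a case-insensitive 'contains' filter that treats input literally."""
    # re.escape neutralises regex metacharacters such as '.', '*' and '(',
    # so a search for "a.b*" matches that exact text and nothing broader.
    return {field: {"$regex": re.escape(user_input), "$options": "i"}}


# Example (hypothetical field name):
# build_contains_filter("operation", "a.b*")
```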
---
## Phase 2: Stabilize ✅

Goal: improve resilience, code quality, and development experience.

- [x] Cache Graph API tokens and reuse them until near expiry
- [x] Add exponential backoff / retry logic for Graph API and Office 365 API calls
- [x] Add unit tests for `normalize_event()`, `_make_dedupe_key()`, and `auth.py`
- [x] Add integration tests for `/api/events` and `/api/fetch-audit-logs`
- [x] Configure linter/formatter (`ruff`) and pre-commit hooks
- [x] Set up GitHub Actions CI pipeline (lint + test)
- [x] Add Pydantic request/response models for API endpoints
- [x] Validate `page_size` and `hours` with strict FastAPI constraints
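
The backoff/retry item above could be sketched along these lines. This is a generic illustration, not AOC's actual retry code: `retry_with_backoff` and its parameters are hypothetical, and the jitter strategy is one common choice among several.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=0.5, max_delay=30.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call func(), retrying on failure with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # The delay ceiling doubles each attempt (base, 2*base, 4*base, ...),
            # capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Random jitter spreads retries out so clients do not retry in lockstep.
            sleep(random.uniform(0, delay))
```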
---
## Phase 3: Scale ✅

Goal: handle larger data volumes and support real-time ingestion.

- [x] Replace skip-based pagination with cursor-based (search-after) pagination
- [x] Add Prometheus `/metrics` endpoint and a Grafana dashboard
- [x] Implement incremental fetch watermarking per source (store last fetch timestamp)
- [x] Add webhook endpoints to receive Microsoft Graph change notifications
- [x] Evaluate Elasticsearch or Azure Cognitive Search for advanced full-text search (MongoDB text index sufficient for current scale)
- [x] Add request ID / correlation ID middleware for distributed tracing
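
To illustrate the cursor-based (search-after) pagination item above, here is a minimal sketch. It assumes events are sorted by `timestamp` descending with `id` as a tie-breaker (both field names appear in the Phase 1 index list); the function names and the cursor encoding are hypothetical and may differ from what AOC actually does.

```python
import base64
import json


def encode_cursor(timestamp: str, event_id: str) -> str:
    """Pack the sort key of the last returned event into an opaque token."""
    raw = json.dumps([timestamp, event_id]).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_cursor(cursor: str) -> tuple:
    """Recover (timestamp, event_id) from a token made by encode_cursor."""
    timestamp, event_id = json.loads(base64.urlsafe_b64decode(cursor.encode("ascii")))
    return timestamp, event_id


def page_filter(cursor=None) -> dict:
    """MongoDB-style filter for the next page under (timestamp DESC, id DESC)."""
    if cursor is None:
        return {}  # first page: no lower bound
    timestamp, event_id = decode_cursor(cursor)
    # Select events strictly after the cursor in sort order, so the database
    # seeks via the index instead of scanning and discarding skipped rows.
    return {
        "$or": [
            {"timestamp": {"$lt": timestamp}},
            {"timestamp": timestamp, "id": {"$lt": event_id}},
        ]
    }
```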
---
## Phase 4: Enhance ✅

Goal: evolve from a polling dashboard into a full security operations tool.

- [x] Migrate frontend to Alpine.js for better state management and maintainability
- [x] Add rule-based alerting (e.g., alert on privileged operations, after-hours activity)
- [x] Add SIEM export (Splunk, Sentinel, syslog webhook)
- [x] Build an audit trail for AOC itself (who queried what, who triggered fetches)
- [x] Add event tagging and commenting (e.g., `investigating`, `false_positive`)
- [x] Add export functionality (CSV / JSON) from the UI
- [x] Add source health dashboard showing last fetch time and status per source
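
The two example rules named in the alerting item above (privileged operations, after-hours activity) might be expressed roughly as below. Everything here is an assumption for illustration: the operation names, the business-hours window, the event shape, and the rule labels are hypothetical, and timestamps are treated as naive local times.

```python
from datetime import datetime, time

# Hypothetical configuration; a real deployment would load these from settings.
PRIVILEGED_OPERATIONS = {"Add member to role", "Update application"}
BUSINESS_HOURS = (time(8, 0), time(18, 0))


def is_after_hours(ts: datetime) -> bool:
    """True on weekends or outside the configured business-hours window."""
    start, end = BUSINESS_HOURS
    return ts.weekday() >= 5 or not (start <= ts.time() < end)


def evaluate_rules(event: dict) -> list:
    """Return the names of all rules that the event triggers."""
    alerts = []
    if event.get("operation") in PRIVILEGED_OPERATIONS:
        alerts.append("privileged_operation")
    if is_after_hours(event["timestamp"]):
        alerts.append("after_hours_activity")
    return alerts
```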
---
## Phase 5: Intelligence

Goal: add AI-powered analysis and external tool integration.

- [x] AI feature flag (`AI_FEATURES_ENABLED`) to gate LLM-dependent features
- [x] Natural language query endpoint (`/api/ask`) with intent extraction and smart sampling
- [x] MCP (Model Context Protocol) server for Claude Desktop / Cursor integration
- [x] Valkey caching for LLM responses and frequent queries
- [x] Async queue (arq) for LLM requests to prevent timeout/cost explosions at scale
- [ ] Advanced analytics dashboard (trending operations, anomaly detection)
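
The enqueue-then-poll flow behind the async queue item above can be sketched in-process as follows. This is not the arq-based implementation in `jobs.py`: a module-level dict and `asyncio` tasks stand in for the queue and worker, and the function names and status strings are hypothetical.

```python
import asyncio
import uuid

JOBS = {}  # job_id -> job record; stand-in for the real job store


async def run_llm_job(job_id, question, call_llm):
    """Worker body: run the slow LLM call and record the outcome."""
    JOBS[job_id]["status"] = "in_progress"
    try:
        JOBS[job_id]["result"] = await call_llm(question)
        JOBS[job_id]["status"] = "complete"
    except Exception as exc:
        JOBS[job_id]["status"] = "failed"
        JOBS[job_id]["error"] = str(exc)


def enqueue_ask(question, call_llm):
    """What async-mode /api/ask might do: schedule the job and return its id.

    Must be called from within a running event loop. The task is returned so
    the caller can keep a reference to it while it runs.
    """
    job_id = uuid.uuid4().hex
    JOBS[job_id] = {"status": "queued", "result": None}
    task = asyncio.create_task(run_llm_job(job_id, question, call_llm))
    return job_id, task


def get_job_status(job_id):
    """What GET /api/jobs/{job_id} might return; None would map to a 404."""
    return JOBS.get(job_id)
```

Returning the job id immediately keeps the HTTP request short regardless of how long the LLM call takes; the client then polls the status endpoint until the job is complete or failed.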
## Completed in this PR

The Phase 5 items marked done, other than caching and the async queue, were implemented in v1.3.0–v1.5.0.

Redis-based caching and the async queue (arq) were implemented in v1.6.0, with the backing store then switched to Valkey.