- Add async Redis client singleton (redis_client.py) for caching and arq pool
- Add arq job functions (jobs.py) for background LLM processing
- Cache ask/explain LLM responses with TTL (1h ask, 24h explain)
- Add async mode to /api/ask: enqueue job, return job_id, poll /api/jobs/{id}
- Add GET /api/jobs/{job_id} endpoint for job status polling
- Add arq worker service to docker-compose (dev + prod)
- Switch from Redis to Valkey (BSD-licensed fork of Redis) in Docker Compose
- Add REDIS_URL config setting
- Add tests for cache hit, async mode, and job status
AOC Roadmap
This roadmap tracks planned improvements for the Admin Operations Center (AOC) project, organized by phase.
Phase 1: Harden ✅
Goal: fix critical security and reliability gaps before production use.
- Fix JWT signature verification in `auth.py`
- Fix broken frontend auth button references (`loginBtn` / `logoutBtn`)
- Add MongoDB indexes (`dedupe_key`, `timestamp`, `service+timestamp`, `id`, text search)
- Add MongoDB TTL index for data retention (`RETENTION_DAYS`)
- Add `/health` endpoint with database connectivity check
- Replace manual `os.getenv` parsing with Pydantic Settings (`pydantic-settings`)
- Add structured JSON logging (`structlog`)
- Configure CORS middleware via `CORS_ORIGINS` environment variable
- Escape user input before MongoDB `$regex` queries (`routes/events.py`)
- Fix incorrect return value in `maintenance.py` `dedupe()`
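As an illustration of the `$regex` escaping item, here is a minimal sketch of turning untrusted search text into a literal-match filter. The helper name `build_regex_filter` and the field name `operation` are hypothetical; the actual code in `routes/events.py` may differ.

```python
import re


def build_regex_filter(field: str, user_text: str) -> dict:
    """Build a case-insensitive MongoDB substring filter from untrusted text.

    Hypothetical helper for illustration only.
    """
    # re.escape backslash-escapes regex metacharacters, so input such as ".*"
    # is matched literally instead of being interpreted as a pattern.
    return {field: {"$regex": re.escape(user_text), "$options": "i"}}


query = build_regex_filter("operation", "a.*b")
```

Without the escape, a search term such as `.*` would match every document, and a pathological pattern could make the query expensive.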
Phase 2: Stabilize ✅
Goal: improve resilience, code quality, and development experience.
- Cache Graph API tokens and reuse them until near expiry
- Add exponential backoff / retry logic for Graph API and Office 365 API calls
- Add unit tests for `normalize_event()`, `_make_dedupe_key()`, and `auth.py`
- Add integration tests for `/api/events` and `/api/fetch-audit-logs`
- Configure linter/formatter (`ruff`) and pre-commit hooks
- Set up GitHub Actions CI pipeline (lint + test)
- Add Pydantic request/response models for API endpoints
- Validate `page_size` and `hours` with strict FastAPI constraints
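The retry item above can be sketched with the standard library alone. This is an assumed shape, not the project's implementation: the function names, the default delays, and the choice of "full jitter" are all illustrative, and real Graph API calls would also want to honor any `Retry-After` header.

```python
import random
import time
from typing import Callable, List, Tuple, Type, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int, base: float = 0.5, cap: float = 30.0) -> List[float]:
    """Upper bounds for each retry wait: base * 2**attempt, capped at `cap`."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]


def call_with_retry(
    func: Callable[[], T],
    retries: int = 4,
    base: float = 0.5,
    cap: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
) -> T:
    """Call `func`, retrying transient failures with exponential backoff."""
    delays = backoff_delays(retries, base, cap)
    for attempt in range(retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == retries:
                raise  # retries exhausted: surface the last error
            # Full jitter: wait a random time up to the exponential bound,
            # which spreads out retries from concurrent callers.
            sleep(random.uniform(0, delays[attempt]))
    raise RuntimeError("unreachable")
```

Passing `sleep` as a parameter keeps the helper easy to unit test, since a test can substitute a no-op and avoid real waiting.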
Phase 3: Scale ✅
Goal: handle larger data volumes and support real-time ingestion.
- Replace skip-based pagination with cursor-based (search-after) pagination
- Add Prometheus `/metrics` endpoint and a Grafana dashboard
- Implement incremental fetch watermarking per source (store last fetch timestamp)
- Add webhook endpoints to receive Microsoft Graph change notifications
- Evaluate Elasticsearch or Azure Cognitive Search for advanced full-text search (MongoDB text index sufficient for current scale)
- Add request ID / correlation ID middleware for distributed tracing
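To make the cursor-based (search-after) pagination item concrete, the sketch below encodes the last row's sort key into an opaque cursor and turns it back into a MongoDB filter for the next page. It assumes a newest-first sort on `(timestamp, id)` and that timestamps are stored in a form that compares correctly (for example ISO 8601 strings in UTC); the function names are hypothetical.

```python
import base64
import json
from typing import Optional, Tuple


def encode_cursor(timestamp: str, event_id: str) -> str:
    """Pack the last row's sort key into an opaque, URL-safe token."""
    raw = json.dumps([timestamp, event_id]).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_cursor(cursor: str) -> Tuple[str, str]:
    """Inverse of encode_cursor."""
    timestamp, event_id = json.loads(base64.urlsafe_b64decode(cursor))
    return timestamp, event_id


def cursor_filter(cursor: Optional[str]) -> dict:
    """Filter selecting rows strictly after the cursor in newest-first order."""
    if cursor is None:
        return {}  # first page: no restriction
    timestamp, event_id = decode_cursor(cursor)
    return {
        "$or": [
            {"timestamp": {"$lt": timestamp}},
            # Tie-break on id so rows sharing a timestamp are not skipped.
            {"timestamp": timestamp, "id": {"$lt": event_id}},
        ]
    }
```

Unlike `skip`, this filter can use an index on the sort key, so the cost of fetching a page does not grow with how deep the client has paged.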
Phase 4: Enhance ✅
Goal: evolve from a polling dashboard into a full security operations tool.
- Migrate frontend to Alpine.js for better state management and maintainability
- Add rule-based alerting (e.g., alert on privileged operations, after-hours activity)
- Add SIEM export (Splunk, Sentinel, syslog webhook)
- Build an audit trail for AOC itself (who queried what, who triggered fetches)
- Add event tagging and commenting (e.g., `investigating`, `false_positive`)
- Add export functionality (CSV / JSON) from the UI
- Add source health dashboard showing last fetch time and status per source
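The rule-based alerting item names two example rules, privileged operations and after-hours activity. A minimal sketch of how such rules could be evaluated per event follows; the operation names, the 08:00 to 18:00 business window, and the event shape are assumptions for illustration, and timestamps are treated as naive local times.

```python
from datetime import datetime, time
from typing import List

# Hypothetical examples; a real deployment would load these from configuration.
PRIVILEGED_OPERATIONS = {"Add member to role", "Update application"}


def is_after_hours(ts: datetime, start: time = time(8, 0), end: time = time(18, 0)) -> bool:
    """True on weekends or outside the [start, end) window on weekdays."""
    return ts.weekday() >= 5 or not (start <= ts.time() < end)


def evaluate_rules(event: dict) -> List[str]:
    """Return the names of all rules the event triggers."""
    alerts = []
    if event.get("operation") in PRIVILEGED_OPERATIONS:
        alerts.append("privileged_operation")
    if is_after_hours(datetime.fromisoformat(event["timestamp"])):
        alerts.append("after_hours")
    return alerts


late_role_change = {"operation": "Add member to role", "timestamp": "2024-03-04T23:15:00"}
alerts = evaluate_rules(late_role_change)
```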
Phase 5: Intelligence
Goal: add AI-powered analysis and external tool integration.
- AI feature flag (`AI_FEATURES_ENABLED`) to gate LLM-dependent features
- Natural language query endpoint (`/api/ask`) with intent extraction and smart sampling
- MCP (Model Context Protocol) server for Claude Desktop / Cursor integration
- Valkey caching for LLM responses and frequent queries
- Async queue (arq) for LLM requests to prevent timeout/cost explosions at scale
- Advanced analytics dashboard (trending operations, anomaly detection)
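The LLM response caching item can be illustrated with an in-memory stand-in for Valkey. The TTL values (1 hour for ask, 24 hours for explain) come from the commit summary; the key prefix `aoc:llm:`, the `TTLCache` class, and the payload shape are hypothetical, and a real client would delegate expiry to the server.

```python
import hashlib
import json
import time
from typing import Any, Callable, Dict, Optional, Tuple

TTL_SECONDS = {"ask": 3600, "explain": 86400}  # 1 h for ask, 24 h for explain


def cache_key(kind: str, payload: dict) -> str:
    """Derive a stable key: identical requests hash to the same key."""
    canonical = json.dumps(payload, sort_keys=True).encode("utf-8")
    return f"aoc:llm:{kind}:{hashlib.sha256(canonical).hexdigest()}"


class TTLCache:
    """In-memory stand-in for a Valkey client, for illustration only."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._store: Dict[str, Tuple[Any, float]] = {}
        self._clock = clock

    def set(self, key: str, value: Any, ttl: float) -> None:
        self._store[key] = (value, self._clock() + ttl)

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[key]  # lazily evict expired entries
            return None
        return value
```

On a cache hit the LLM call is skipped entirely, which is where the latency and cost savings come from.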
Completed in this PR
All Phase 5 items marked done were implemented in v1.3.0–v1.5.0. Redis-based caching and the async queue were implemented in v1.6.0, with the server then switched to Valkey.
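The async mode described in the commit summary (enqueue a job, return a `job_id`, poll `/api/jobs/{job_id}`) follows a common pattern that can be sketched without any queue library. The code below is an in-process approximation using `asyncio`, not the arq-based implementation: the job store, the status names, and `answer_question` are all hypothetical stand-ins.

```python
import asyncio
import uuid
from typing import Dict, Set

JOBS: Dict[str, dict] = {}          # stand-in for job state kept by the queue
_TASKS: Set[asyncio.Task] = set()   # keep references so tasks are not collected


async def answer_question(question: str) -> str:
    """Hypothetical stand-in for the slow LLM call."""
    await asyncio.sleep(0)
    return f"stub answer for: {question}"


async def _run_job(job_id: str, question: str) -> None:
    JOBS[job_id]["status"] = "in_progress"
    try:
        JOBS[job_id]["result"] = await answer_question(question)
        JOBS[job_id]["status"] = "complete"
    except Exception as exc:
        JOBS[job_id]["status"] = "failed"
        JOBS[job_id]["error"] = str(exc)


async def enqueue_ask(question: str) -> dict:
    """What an async-mode ask handler might return: a job id, not an answer."""
    job_id = uuid.uuid4().hex
    JOBS[job_id] = {"status": "queued", "result": None}
    task = asyncio.create_task(_run_job(job_id, question))
    _TASKS.add(task)
    task.add_done_callback(_TASKS.discard)
    return {"job_id": job_id, "status": "queued"}


def get_job_status(job_id: str) -> dict:
    """What a job status endpoint might return when polled."""
    job = JOBS.get(job_id)
    if job is None:
        return {"job_id": job_id, "status": "not_found"}
    return {"job_id": job_id, **job}
```

The client receives a response immediately and polls until the status is `complete` or `failed`, so a slow LLM call no longer holds an HTTP request open.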