aoc/ROADMAP.md
Tomas Kracmar f75f165911
feat: Redis caching + async queue for LLM scaling (v1.6.0)
- Add async Redis client singleton (redis_client.py) for caching and arq pool
- Add arq job functions (jobs.py) for background LLM processing
- Cache ask/explain LLM responses with TTL (1h ask, 24h explain)
- Add async mode to /api/ask: enqueue job, return job_id, poll /api/jobs/{id}
- Add GET /api/jobs/{job_id} endpoint for job status polling
- Add arq worker service to docker-compose (dev + prod)
- Switch from Redis to Valkey (BSD fork) in Docker Compose
- Add REDIS_URL config setting
- Add tests for cache hit, async mode, and job status
2026-04-22 09:55:05 +02:00

# AOC Roadmap
This roadmap tracks planned improvements for the Admin Operations Center (AOC) project, organized by phase.
---
## Phase 1: Harden ✅
Goal: fix critical security and reliability gaps before production use.
- [x] Fix JWT signature verification in `auth.py`
- [x] Fix broken frontend auth button references (`loginBtn` / `logoutBtn`)
- [x] Add MongoDB indexes (`dedupe_key`, `timestamp`, `service+timestamp`, `id`, text search)
- [x] Add MongoDB TTL index for data retention (`RETENTION_DAYS`)
- [x] Add `/health` endpoint with database connectivity check
- [x] Replace manual `os.getenv` parsing with Pydantic Settings (`pydantic-settings`)
- [x] Add structured JSON logging (`structlog`)
- [x] Configure CORS middleware via `CORS_ORIGINS` environment variable
- [x] Escape user input before MongoDB `$regex` queries (`routes/events.py`)
- [x] Fix incorrect return value in `maintenance.py dedupe()`
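
The `$regex` escaping item above can be illustrated with a minimal sketch. It assumes the fix amounts to passing user text through `re.escape` before building the filter; the helper name `build_regex_filter` and the `operation` field are hypothetical and not taken from `routes/events.py`.

```python
import re


def build_regex_filter(field: str, user_text: str) -> dict:
    """Build a case-insensitive MongoDB filter that matches user_text literally.

    Hypothetical helper: the real code in routes/events.py may differ.
    """
    # re.escape backslash-escapes regex metacharacters such as '.', '*' and '(',
    # so user input cannot inject its own pattern into the query.
    return {field: {"$regex": re.escape(user_text), "$options": "i"}}


# The "operation" field name is an assumption used only for illustration.
query = build_regex_filter("operation", "a.b")
print(query)
```

Without the escape, a search for `a.b` would also match `axb`, and patterns such as `(a+)+$` could cause expensive backtracking on the server.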
---
## Phase 2: Stabilize ✅
Goal: improve resilience, code quality, and development experience.
- [x] Cache Graph API tokens and reuse them until near expiry
- [x] Add exponential backoff / retry logic for Graph API and Office 365 API calls
- [x] Add unit tests for `normalize_event()`, `_make_dedupe_key()`, and `auth.py`
- [x] Add integration tests for `/api/events` and `/api/fetch-audit-logs`
- [x] Configure linter/formatter (`ruff`) and pre-commit hooks
- [x] Set up GitHub Actions CI pipeline (lint + test)
- [x] Add Pydantic request/response models for API endpoints
- [x] Validate `page_size` and `hours` with strict FastAPI constraints
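
For the retry item above, one common shape for exponential backoff is sketched below. This is an assumption about the approach, not the project's actual implementation: the function name, the default delays, the jitter, and the choice of `ConnectionError` as the retryable error are all illustrative.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on ConnectionError with exponential backoff.

    Hypothetical helper; a real client would probably also retry on
    HTTP 429/5xx responses and honor any Retry-After header.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt (0.5s, 1s, 2s, ...) up to max_delay.
            delay = min(max_delay, base_delay * 2**attempt)
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so that tests can record the delays instead of actually waiting.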
---
## Phase 3: Scale ✅
Goal: handle larger data volumes and support real-time ingestion.
- [x] Replace skip-based pagination with cursor-based (search-after) pagination
- [x] Add Prometheus `/metrics` endpoint and a Grafana dashboard
- [x] Implement incremental fetch watermarking per source (store last fetch timestamp)
- [x] Add webhook endpoints to receive Microsoft Graph change notifications
- [x] Evaluate Elasticsearch or Azure Cognitive Search for advanced full-text search (outcome: the MongoDB text index is sufficient at the current scale)
- [x] Add request ID / correlation ID middleware for distributed tracing
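
The cursor-based pagination item above could look roughly like the following sketch. It assumes events are returned newest first and ordered by the indexed `timestamp` and `id` fields; the cursor encoding and the function names are hypothetical.

```python
import base64
import json
from typing import Optional, Tuple


def encode_cursor(timestamp: str, event_id: str) -> str:
    """Pack the sort key of the last returned event into an opaque token."""
    raw = json.dumps([timestamp, event_id]).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_cursor(cursor: str) -> Tuple[str, str]:
    """Recover the (timestamp, id) pair from a token made by encode_cursor."""
    timestamp, event_id = json.loads(base64.urlsafe_b64decode(cursor))
    return timestamp, event_id


def page_filter(cursor: Optional[str]) -> dict:
    """Build a MongoDB filter selecting events that sort after the cursor.

    Assumes a descending sort on (timestamp, id); the first page has no cursor.
    """
    if cursor is None:
        return {}
    timestamp, event_id = decode_cursor(cursor)
    return {
        "$or": [
            # Strictly older events ...
            {"timestamp": {"$lt": timestamp}},
            # ... or events with the same timestamp and a smaller id (tie-break).
            {"timestamp": timestamp, "id": {"$lt": event_id}},
        ]
    }
```

Unlike `skip`, this filter lets the database seek straight to the page boundary through an index, so the cost of fetching a page does not grow with how deep the page is.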
---
## Phase 4: Enhance ✅
Goal: evolve from a polling dashboard into a full security operations tool.
- [x] Migrate frontend to Alpine.js for better state management and maintainability
- [x] Add rule-based alerting (e.g., alert on privileged operations, after-hours activity)
- [x] Add SIEM export (Splunk, Sentinel, syslog webhook)
- [x] Build an audit trail for AOC itself (who queried what, who triggered fetches)
- [x] Add event tagging and commenting (e.g., `investigating`, `false_positive`)
- [x] Add export functionality (CSV / JSON) from the UI
- [x] Add source health dashboard showing last fetch time and status per source
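
As an illustration of the rule-based alerting item above, the sketch below evaluates the two example rules it names: privileged operations and after-hours activity. The event shape, the rule names, the operation list, and the office hours are all assumptions, and timestamps are treated as already being in local office time.

```python
from datetime import datetime
from typing import Dict, List

# Hypothetical list; a real deployment would load its rules from configuration.
PRIVILEGED_OPERATIONS = {"Add member to role", "Update application"}


def is_after_hours(moment: datetime, start_hour: int = 8, end_hour: int = 18) -> bool:
    """Return True on weekends or outside the start_hour..end_hour window."""
    if moment.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return True
    return not (start_hour <= moment.hour < end_hour)


def matched_rules(event: Dict[str, str]) -> List[str]:
    """Return the names of all rules that the event triggers."""
    matches = []
    if event["operation"] in PRIVILEGED_OPERATIONS:
        matches.append("privileged_operation")
    if is_after_hours(datetime.fromisoformat(event["timestamp"])):
        matches.append("after_hours")
    return matches


print(matched_rules({"operation": "Add member to role",
                     "timestamp": "2026-04-22T23:10:00"}))
```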
---
## Phase 5: Intelligence
Goal: add AI-powered analysis and external tool integration.
- [x] AI feature flag (`AI_FEATURES_ENABLED`) to gate LLM-dependent features
- [x] Natural language query endpoint (`/api/ask`) with intent extraction and smart sampling
- [x] MCP (Model Context Protocol) server for Claude Desktop / Cursor integration
- [x] Valkey caching for LLM responses and frequent queries
- [x] Async queue (arq) for LLM requests to prevent timeout/cost explosions at scale
- [ ] Advanced analytics dashboard (trending operations, anomaly detection)
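
The LLM response caching item above depends on a stable cache key and a per-endpoint TTL. The sketch below shows one way to derive them, using the 1-hour (ask) and 24-hour (explain) TTLs from the v1.6.0 commit notes; the key prefix `aoc:llm`, the payload shape, and the function name are assumptions rather than the contents of `redis_client.py`.

```python
import hashlib
import json

# TTLs from the v1.6.0 notes: 1 hour for ask, 24 hours for explain.
CACHE_TTL_SECONDS = {"ask": 60 * 60, "explain": 24 * 60 * 60}


def cache_key(kind: str, payload: dict) -> str:
    """Derive a deterministic cache key for an LLM request.

    Hypothetical scheme: the payload is serialized canonically (sorted keys,
    no extra whitespace) so that equal requests always hash to the same key.
    """
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"aoc:llm:{kind}:{digest}"


key = cache_key("ask", {"question": "Who added admins last week?", "hours": 168})
print(key, CACHE_TTL_SECONDS["ask"])
```

With a Redis-compatible client, the key and TTL would typically be passed to a `SET` with an expiry, so that stale answers age out without explicit invalidation.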
## Completed in this PR
All Phase 5 items marked done were implemented in v1.3.0 through v1.5.0.
Redis caching and the arq async queue were implemented in v1.6.0, which also switched the Docker Compose services from Redis to Valkey.