Sync from dev @ 497baf0

Source: main (497baf0)
Excluded: live tenant exports, generated artifacts, and dev-only tooling.
2026-04-21 22:21:43 +02:00
parent b6ac9524f7
commit 2c41eaca44
25 changed files with 2258 additions and 79 deletions

View File

@@ -0,0 +1,245 @@
# ASTRAL Change Probe
Event-driven backup trigger for ASTRAL. Monitors Intune and Entra ID audit logs via Microsoft Graph, debounces change bursts, and queues the Azure DevOps backup pipeline only when actual drift is detected.
## Why this exists
Microsoft Graph change notifications and delta queries do **not** support Intune device management or Conditional Access resources. The only viable event-driven approach is polling the Graph audit log APIs, which have a 5–15 minute propagation delay. This probe implements a debouncer on top of that polling to avoid backup storms during bulk changes.
## Architecture
```
┌────────────────┐  5 min   ┌─────────────┐  quiet window   ┌────────────────┐
│ Timer Trigger  │ ───────► │ probe_timer │ ──────────────► │ backup-trigger │
│ (probe_timer)  │          │ (debouncer) │ (15 min armed)  │     -queue     │
└───────┬────────┘          └─────────────┘                 └───────┬────────┘
        │ load/save state                                           │ dequeue
        │ (Azure Table Storage)                                     ▼
        │                                                  ┌────────────────┐
        │                                                  │ queue_consumer │
        └─────────────────────────────────────────────────►│ (ADO REST API) │
                                                           └───────┬────────┘
                                                                   │
                                                                   ▼
                                                           ┌────────────────┐
                                                           │  Azure DevOps  │
                                                           │ backup pipeline│
                                                           └────────────────┘
```
## Components
### `probe_timer` (Timer Trigger)
- **Schedule**: every 5 minutes (`0 */5 * * * *`)
- **Input**: `TimerRequest` from Functions runtime
- **Output**: queue message to `backup-trigger-queue` (via `func.Out[str]`)
- **Actions**:
1. Load debouncer state from Azure Table Storage (`ProbeState` / `singleton` / `default`).
2. Run `scripts/probe_tenant_changes.py` via subprocess.
3. Save updated state back to Table Storage.
4. If `trigger=true`, emit a queue message.
### `queue_consumer` (Queue Trigger)
- **Input**: `QueueMessage` from `backup-trigger-queue`
- **Actions**:
1. Parse JSON payload (`reason`, `checked_at`).
2. Call Azure DevOps REST API to queue the backup pipeline run.
3. Raise on failure so the Functions runtime handles retry and poison-queue logic.
### `scripts/probe_tenant_changes.py`
Standalone CLI script that can also be run locally. It:
- Queries Intune (`deviceManagement/auditEvents`) and Entra (`directoryAudits`) audit logs.
- Implements a three-state debouncer: `idle` → `armed` → `cooldown`.
- Returns JSON with `trigger`, `reason`, and `new_state`.
### `scripts/trigger_backup_pipeline.py`
Standalone CLI script that queues an Azure DevOps pipeline run via REST API. Can be used locally or from the queue consumer.
## Debouncer State Machine
| State | Condition to transition | Output |
|---|---|---|
| **idle** | Audit log shows a new change | → `armed` |
| **armed** | Quiet window elapsed (default 15 min) with no newer events | → `cooldown`, `trigger=true` |
| **armed** | Newer event arrives while armed | Stay `armed`, extend quiet window |
| **cooldown** | Cooldown elapsed (default 30 min) | → `idle` |
| **cooldown** | New event arrives | Stay `cooldown` (change is buffered until cooldown ends) |
## Configuration
All settings are provided via Function App application settings (environment variables):
| Setting | Required | Default | Description |
|---|---|---|---|
| `AzureWebJobsStorage` | Yes | — | Storage account connection string (tables + queues) |
| `PROBE_APP_ID` | Yes* | — | Entra app registration client ID |
| `PROBE_APP_SECRET` | Yes* | — | Entra app client secret |
| `TENANT_ID` | Yes* | — | Microsoft 365 tenant ID |
| `GRAPH_TOKEN` | No | — | Optional passthrough token (skips the client credentials flow) |
| `ADO_ORGANIZATION` | Yes | — | Azure DevOps organization name |
| `ADO_PROJECT` | Yes | — | Azure DevOps project name |
| `ADO_PIPELINE_ID` | Yes | — | Backup pipeline definition ID |
| `ADO_TOKEN` | Yes | — | Azure DevOps PAT with **Build (read & execute)** |
| `ADO_BRANCH` | No | `main` | Git ref to queue the pipeline against |
| `PROBE_QUIET_WINDOW_MINUTES` | No | `15` | Minutes to wait for change burst to settle |
| `PROBE_COOLDOWN_MINUTES` | No | `30` | Minutes between successive triggers |
\* Required unless `GRAPH_TOKEN` is provided.
## Local Development
### Prerequisites
- Python 3.11+
- [Azure Functions Core Tools](https://learn.microsoft.com/en-us/azure/azure-functions/functions-run-local)
- An Azure Storage account (or Azurite for local emulation)
### Install dependencies
```bash
cd infra/change-probe
pip install -r requirements.txt
```
### Copy shared scripts
The probe reuses scripts from the repository root. Copy them into this directory before building or running locally:
```bash
cp ../../scripts/common.py scripts/
cp ../../scripts/probe_tenant_changes.py scripts/
cp ../../scripts/trigger_backup_pipeline.py scripts/
```
### Run locally
```bash
# Start Azurite (Storage emulator)
azurite --silent --location ./azurite --debug ./azurite/debug.log

# Copy local settings template
cp local.settings.json.example local.settings.json
# Edit local.settings.json with your values

# Start the Functions host
func start
```
### Run the probe script standalone
```bash
cd ../..
python3 scripts/probe_tenant_changes.py \
  --client-id "$PROBE_APP_ID" \
  --client-secret "$PROBE_APP_SECRET" \
  --tenant-id "$TENANT_ID" \
  --state-file ./probe-state.json \
  --output ./probe-result.json
```
### Trigger the backup pipeline standalone
```bash
python3 scripts/trigger_backup_pipeline.py \
  --organization "contoso" \
  --project "Intune" \
  --pipeline-id 1 \
  --token "$ADO_TOKEN" \
  --branch refs/heads/main
```
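Under the hood this is a single REST call to the Pipelines "Runs" endpoint. The sketch below builds that request (URL shape and `api-version=7.1` follow the public Azure DevOps API; the helper name is illustrative and not taken from `trigger_backup_pipeline.py`):

```python
import base64


def build_run_request(org: str, project: str, pipeline_id: int, token: str,
                      branch: str = "refs/heads/main"):
    """Return (url, headers, body) for POSTing a pipeline run to Azure DevOps."""
    url = (
        f"https://dev.azure.com/{org}/{project}/_apis/pipelines/"
        f"{pipeline_id}/runs?api-version=7.1"
    )
    # PATs are sent as HTTP Basic auth with an empty username.
    basic = base64.b64encode(f":{token}".encode()).decode()
    headers = {
        "Authorization": f"Basic {basic}",
        "Content-Type": "application/json",
    }
    body = {"resources": {"repositories": {"self": {"refName": branch}}}}
    return url, headers, body
```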
## Deployment
Use the unified provisioning script:
```powershell
.\deploy\provision-change-probe.ps1 `
  -TenantName "contoso.onmicrosoft.com" `
  -ResourceGroupName "rg-astral-probe" `
  -Location "westeurope" `
  -DeployFunctionApp
```
The script will:
1. Register an Entra app (or reuse an existing one).
2. Grant admin consent for Graph permissions.
3. Create a client secret.
4. Provision Resource Group, Storage Account, and Function App (Linux Consumption, Python 3.11).
5. Configure application settings.
6. Build and deploy the function package.
### Manual deployment (zip package)
If you prefer to deploy manually:
```bash
cd infra/change-probe
# Copy shared scripts into the package directory
cp ../../scripts/common.py scripts/
cp ../../scripts/probe_tenant_changes.py scripts/
cp ../../scripts/trigger_backup_pipeline.py scripts/
# Install production dependencies into the package
pip install -r requirements.txt --target .python_packages/lib/site-packages
# Build the zip (Linux Consumption requires .python_packages/lib/site-packages, NOT python3.11/)
zip -r function-package.zip \
  probe_timer/ queue_consumer/ scripts/ .python_packages/ \
  host.json requirements.txt \
  -x "*.pyc" -x "__pycache__/*"

# Upload the package via zip deploy
az functionapp deployment source config-zip \
  --resource-group rg-astral-probe \
  --name func-astral-probe \
  --src function-package.zip
```
## Permissions
### Entra App (Graph access)
The probe requires the same read permissions as the main backup pipeline:
- `DeviceManagementConfiguration.Read.All`
- `DeviceManagementApps.Read.All`
- `AuditLog.Read.All`
- `Directory.Read.All`
### Azure DevOps PAT
The `ADO_TOKEN` must have:
- **Build** → *Read & execute*
## Monitoring
Check the `ProbeState` table for current debouncer state:
```bash
az storage entity query --table-name ProbeState --account-name <storage>
```
Check the queue depth:
```bash
az storage queue list --account-name <storage>
```
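The `state` column of the `ProbeState` entity holds the debouncer state as a JSON string; a small hypothetical helper for decoding it once you have the entity (the entity shape matches what `probe_timer` writes, but the `phase` key in the sample is illustrative):

```python
import json


def decode_probe_state(entity: dict) -> dict:
    """Decode the JSON-encoded debouncer state from a ProbeState table entity."""
    raw = entity.get("state", "{}")
    return json.loads(raw) if isinstance(raw, str) else dict(raw)


entity = {"PartitionKey": "singleton", "RowKey": "default", "state": "{\"phase\": \"armed\"}"}
print(decode_probe_state(entity))  # → {'phase': 'armed'}
```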
## Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| Timer fires but no state update | `schedule_status["last"]` case mismatch (fixed in current version) | Ensure deployed code uses `.get("Last")` |
| Probe script `ModuleNotFoundError` | Bundled packages in wrong path | Use `.python_packages/lib/site-packages`, not `python3.11/site-packages` |
| Queue message lands in poison queue | `ADO_TOKEN` missing or invalid | Verify token in Function App settings and restart |
| Probe never triggers | No audit events in Graph window | Normal if tenant is idle; verify `AuditLog.Read.All` permission |
| Duplicate pipeline runs | Multiple messages queued | Check debouncer state; cooldown should prevent this |

View File

@@ -0,0 +1,15 @@
{
  "version": "2.0",
  "logging": {
    "applicationInsights": {
      "samplingSettings": {
        "isEnabled": true,
        "excludedTypes": "Request"
      }
    }
  },
  "extensionBundle": {
    "id": "Microsoft.Azure.Functions.ExtensionBundle",
    "version": "[4.*, 5.0.0)"
  }
}

View File

@@ -0,0 +1,19 @@
{
  "IsEncrypted": false,
  "Values": {
    "AzureWebJobsStorage": "UseDevelopmentStorage=true",
    "FUNCTIONS_WORKER_RUNTIME": "python",
    "PROBE_APP_ID": "",
    "PROBE_APP_SECRET": "",
    "TENANT_ID": "",
    "GRAPH_TOKEN": "",
    "ADO_ORGANIZATION": "",
    "ADO_PROJECT": "",
    "ADO_PIPELINE_ID": "",
    "ADO_TOKEN": "",
    "ADO_BRANCH": "main",
    "PROBE_QUIET_WINDOW_MINUTES": "15",
    "PROBE_COOLDOWN_MINUTES": "30",
    "REPO_ROOT": "../../"
  }
}

View File

@@ -0,0 +1,137 @@
#!/usr/bin/env python3
"""Azure Function timer trigger that probes tenant audit logs and queues a backup run when changes are detected."""
from __future__ import annotations

import json
import logging
import os
import subprocess
import sys
from typing import Any

import azure.functions as func
from azure.data.tables import TableServiceClient

_TABLE_NAME = "ProbeState"
_PARTITION_KEY = "singleton"
_ROW_KEY = "default"


def _repo_root() -> str:
    """Resolve the repository root so we can invoke scripts/probe_tenant_changes.py."""
    env_root = os.environ.get("REPO_ROOT", "").strip()
    if env_root:
        return os.path.abspath(env_root)
    return os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))


def _load_state(connection_string: str) -> dict[str, Any]:
    """Load persisted probe state from Azure Table Storage."""
    try:
        service = TableServiceClient.from_connection_string(conn_str=connection_string)
        table = service.get_table_client(table_name=_TABLE_NAME)
        entity = table.get_entity(partition_key=_PARTITION_KEY, row_key=_ROW_KEY)
        raw = entity.get("state", "{}")
        return json.loads(raw) if isinstance(raw, str) else dict(raw)
    except Exception as exc:
        logging.warning(f"Unable to load state from Table Storage ({exc}); starting fresh.")
        return {}


def _save_state(connection_string: str, state: dict[str, Any]) -> None:
    """Persist probe state to Azure Table Storage."""
    service = TableServiceClient.from_connection_string(conn_str=connection_string)
    table = service.get_table_client(table_name=_TABLE_NAME)
    table.upsert_entity(
        {
            "PartitionKey": _PARTITION_KEY,
            "RowKey": _ROW_KEY,
            "state": json.dumps(state),
        }
    )


def main(mytimer: func.TimerRequest, msg: func.Out[str]) -> None:
    utc_now = mytimer.schedule_status.get("Last", "n/a") if mytimer.schedule_status else "n/a"
    logging.info(f"Probe timer triggered at {utc_now}")

    client_id = os.environ.get("PROBE_APP_ID", "").strip()
    client_secret = os.environ.get("PROBE_APP_SECRET", "").strip()
    tenant_id = os.environ.get("TENANT_ID", "").strip()
    token = os.environ.get("GRAPH_TOKEN", "").strip()

    auth_args: list[str] = []
    if token:
        auth_args = ["--token", token]
    elif client_id and client_secret and tenant_id:
        auth_args = [
            "--client-id", client_id,
            "--client-secret", client_secret,
            "--tenant-id", tenant_id,
        ]
    else:
        logging.error("No Graph authentication configured (PROBE_APP_ID/SECRET/TENANT_ID or GRAPH_TOKEN).")
        return

    connection_string = os.environ.get("AzureWebJobsStorage", "").strip()
    if not connection_string:
        logging.error("AzureWebJobsStorage connection string is missing.")
        return

    state = _load_state(connection_string)
    state_json = json.dumps(state) if state else ""

    quiet_window = os.environ.get("PROBE_QUIET_WINDOW_MINUTES", "15")
    cooldown = os.environ.get("PROBE_COOLDOWN_MINUTES", "30")

    probe_script = os.path.join(_repo_root(), "scripts", "probe_tenant_changes.py")
    if not os.path.exists(probe_script):
        logging.error(f"Probe script not found at {probe_script}")
        return

    cmd = [
        sys.executable,
        probe_script,
        *auth_args,
        "--quiet-window-minutes", quiet_window,
        "--cooldown-minutes", cooldown,
    ]
    if state_json:
        cmd.extend(["--state-json", state_json])

    logging.info(f"Running probe script: {probe_script}")
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
    except subprocess.TimeoutExpired:
        logging.error("Probe script timed out after 60 seconds.")
        return
    except Exception as exc:
        logging.error(f"Failed to run probe script ({exc}).")
        return

    if result.returncode != 0:
        logging.error(f"Probe script failed (exit {result.returncode}): {result.stderr}")
        return

    try:
        output = json.loads(result.stdout)
    except json.JSONDecodeError as exc:
        logging.error(f"Probe script returned invalid JSON ({exc}): {result.stdout[:500]}")
        return

    new_state = output.get("new_state", state)
    _save_state(connection_string, new_state)

    trigger = output.get("trigger", False)
    reason = output.get("reason", "no reason given")
    logging.info(f"Probe result: trigger={trigger}, reason={reason}")

    if trigger:
        queue_payload = json.dumps(
            {
                "reason": reason,
                "checked_at": output.get("checked_at", ""),
            }
        )
        msg.set(queue_payload)
        logging.info("Queued backup trigger message.")

View File

@@ -0,0 +1,18 @@
{
  "scriptFile": "__init__.py",
  "bindings": [
    {
      "name": "mytimer",
      "type": "timerTrigger",
      "direction": "in",
      "schedule": "0 */5 * * * *"
    },
    {
      "name": "msg",
      "type": "queue",
      "direction": "out",
      "queueName": "backup-trigger-queue",
      "connection": "AzureWebJobsStorage"
    }
  ]
}

View File

@@ -0,0 +1,77 @@
#!/usr/bin/env python3
"""Azure Function queue trigger that calls the Azure DevOps REST API to queue a backup pipeline run."""
from __future__ import annotations

import json
import logging
import os
import subprocess
import sys

import azure.functions as func


def _repo_root() -> str:
    """Resolve the repository root so we can invoke scripts/trigger_backup_pipeline.py."""
    env_root = os.environ.get("REPO_ROOT", "").strip()
    if env_root:
        return os.path.abspath(env_root)
    return os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))


def main(msg: func.QueueMessage) -> None:
    body = msg.get_body().decode("utf-8")
    logging.info(f"Queue consumer received message: {body}")

    org = os.environ.get("ADO_ORGANIZATION", "").strip()
    project = os.environ.get("ADO_PROJECT", "").strip()
    pipeline_id = os.environ.get("ADO_PIPELINE_ID", "").strip()
    token = os.environ.get("ADO_TOKEN", "").strip()
    branch = os.environ.get("ADO_BRANCH", "main").strip()

    if not all([org, project, pipeline_id, token]):
        logging.error("Missing one or more ADO configuration variables (ADO_ORGANIZATION, ADO_PROJECT, ADO_PIPELINE_ID, ADO_TOKEN).")
        # Raising causes the Functions runtime to retry the message after the visibility timeout.
        raise RuntimeError("Incomplete ADO configuration")

    trigger_script = os.path.join(_repo_root(), "scripts", "trigger_backup_pipeline.py")
    if not os.path.exists(trigger_script):
        logging.error(f"Trigger script not found at {trigger_script}")
        raise RuntimeError("Trigger script missing")

    cmd = [
        sys.executable,
        trigger_script,
        "--organization", org,
        "--project", project,
        "--pipeline-id", pipeline_id,
        "--token", token,
        "--branch", branch,
    ]

    logging.info(f"Triggering ADO pipeline {pipeline_id} ...")
    try:
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=60,
        )
    except subprocess.TimeoutExpired:
        logging.error("Trigger script timed out after 60 seconds.")
        raise
    except Exception as exc:
        logging.error(f"Failed to run trigger script ({exc}).")
        raise

    if result.returncode != 0:
        logging.error(f"Trigger script failed (exit {result.returncode}): {result.stderr}")
        raise RuntimeError(f"Trigger script failed: {result.stderr}")

    logging.info(f"Trigger script succeeded: {result.stdout.strip()}")

View File

@@ -0,0 +1,12 @@
{
  "scriptFile": "__init__.py",
  "bindings": [
    {
      "name": "msg",
      "type": "queueTrigger",
      "direction": "in",
      "queueName": "backup-trigger-queue",
      "connection": "AzureWebJobsStorage"
    }
  ]
}

View File

@@ -0,0 +1,3 @@
azure-functions
azure-data-tables
azure-storage-queue