astral/deploy/onboarding-runbook.md
Tomas Kracmar 2c41eaca44 Sync from dev @ 497baf0
Source: main (497baf0)
Excluded: live tenant exports, generated artifacts, and dev-only tooling.
2026-04-21 22:21:43 +02:00
# ASTRAL Onboarding Runbook
This guide walks through deploying ASTRAL into a new Azure DevOps organization and Microsoft 365 tenant.
## Prerequisites
- Azure DevOps organization and project created.
- Owner or Contributor access to the target Microsoft 365 tenant.
- Permission to create app registrations and grant admin consent in Entra ID.
- PowerShell 7+ or Windows PowerShell 5.1 with the `Microsoft.Graph` module (for the bootstrap script).
## Step 1: Import the repository
1. In Azure DevOps, create a new Git repository in your project.
2. Push the contents of this repository into it, or use **Import repository** from a public Git URL.
## Step 2: Create the tenant variable group
1. In Azure DevOps, go to **Pipelines > Library** and create a new Variable Group.
2. Recommended name: `vg-astral-tenant` (you can choose any name).
3. Add the variables from `templates/variables-tenant.yml`. Use your real tenant values:
| Variable | Example value | Notes |
| --- | --- | --- |
| `TENANT_NAME` | `contoso.onmicrosoft.com` | Your M365 tenant domain |
| `SERVICE_CONNECTION_NAME` | `sc-astral-backup` | Name you will use for the service connection |
| `USER_NAME` | `ASTRAL Backup Service` | Git committer name |
| `USER_EMAIL` | `astral-backup@contoso.com` | Git committer email |
| `AGENT_POOL_NAME` | `Azure Pipelines` | Change if using a self-hosted pool |
| `BACKUP_TIMEZONE` | `Europe/Prague` | Valid tz database name |
| `FULL_RUN_HOUR` | `00` | Hour that triggers full export |
| `AUTO_REMEDIATE_RESTORE_PIPELINE_ID` | *(leave empty)* | Filled in Step 8 |
4. If you plan to use Azure OpenAI summaries, also add:
- `ENABLE_PR_AI_SUMMARY` = `true`
- `AZURE_OPENAI_ENDPOINT`
- `AZURE_OPENAI_DEPLOYMENT`
- `AZURE_OPENAI_API_KEY` *(mark as secret)*
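
Before saving the variable group, it can help to sanity-check the values locally. The sketch below is one possible check, not part of ASTRAL: the function name `check_tenant_variables` is hypothetical, and treating `FULL_RUN_HOUR` as an integer between 0 and 23 is an assumption based on the example value `00`.

```python
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError

# Variables from the table above that are expected to be non-empty.
REQUIRED = (
    "TENANT_NAME",
    "SERVICE_CONNECTION_NAME",
    "USER_NAME",
    "USER_EMAIL",
    "AGENT_POOL_NAME",
    "BACKUP_TIMEZONE",
    "FULL_RUN_HOUR",
)


def check_tenant_variables(values: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    problems = [
        f"{name} is missing or empty"
        for name in REQUIRED
        if not values.get(name, "").strip()
    ]

    tz = values.get("BACKUP_TIMEZONE", "").strip()
    if tz:
        try:
            ZoneInfo(tz)  # raises if the name is not in the tz database
        except (ZoneInfoNotFoundError, ValueError, OSError):
            problems.append(f"BACKUP_TIMEZONE '{tz}' is not a known tz database name")

    hour = values.get("FULL_RUN_HOUR", "").strip()
    # Assumption: the hour is written as one or two ASCII digits, 0 through 23.
    valid_hour = hour.isascii() and hour.isdigit() and len(hour) <= 2 and int(hour) <= 23
    if hour and not valid_hour:
        problems.append(f"FULL_RUN_HOUR '{hour}' is not an hour between 00 and 23")

    return problems
```

Calling `check_tenant_variables({...})` with the values you intend to enter returns a list of messages you can fix before the first pipeline run. The timezone check relies on the tz database being available to Python on the machine where you run it.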
## Step 3: Link the variable group to the pipelines
Open each pipeline YAML and uncomment the variable group line near the top:
```yaml
variables:
- group: vg-astral-tenant # <-- uncomment this line
- template: templates/variables-common.yml
```
Do this for:
- `azure-pipelines.yml`
- `azure-pipelines-review-sync.yml`
- `azure-pipelines-restore.yml`
Commit and push the changes.
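
If you would rather script the edit than open three files by hand, the following is a minimal sketch. It assumes the commented-out line looks like `# - group: vg-astral-tenant`; check the actual YAML first, since the exact comment style in the pipeline files is not shown here. The helper name `uncomment_variable_group` is hypothetical.

```python
import re

# Assumed shape of the commented-out line: optional indent, '#', then '- group: <name>'.
_COMMENTED_GROUP = re.compile(r"^(\s*)#\s*(-\s*group:\s*\S.*)$")


def uncomment_variable_group(yaml_text: str) -> str:
    """Uncomment the first commented '- group:' line; leave everything else as-is."""
    lines = yaml_text.splitlines()
    for i, line in enumerate(lines):
        match = _COMMENTED_GROUP.match(line)
        if match:
            # Keep the original indentation, drop only the leading '#'.
            lines[i] = match.group(1) + match.group(2)
            break
    trailing_newline = "\n" if yaml_text.endswith("\n") else ""
    return "\n".join(lines) + trailing_newline
```

The function is idempotent: running it on a file where the line is already uncommented returns the text unchanged. Review the resulting diff before committing.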
## Step 4: Run the tenant bootstrap script
Run `deploy/bootstrap-tenant.ps1` in a PowerShell session authenticated to your target tenant.
```powershell
# Example
.\deploy\bootstrap-tenant.ps1 -TenantName "contoso.onmicrosoft.com" -ServiceConnectionName "sc-astral-backup"
```
The script will:
1. Create a single-tenant app registration.
2. Add required Microsoft Graph application permissions.
3. Grant admin consent.
4. Create a federated identity credential (workload identity federation) for Azure DevOps.
5. Print the App ID and instructions for creating the Azure DevOps service connection.
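
For orientation, a federated identity credential on an app registration is a small JSON object in Microsoft Graph. The sketch below builds such a request without sending it. It is not taken from `bootstrap-tenant.ps1`, whose internals are not shown here. The function name is hypothetical, and the `issuer` and `subject` values are assumed to be copied from the draft service connection in Azure DevOps.

```python
import json

GRAPH_BASE = "https://graph.microsoft.com/v1.0"


def federated_credential_request(
    app_object_id: str, name: str, issuer: str, subject: str
) -> tuple[str, str]:
    """Return (url, json_body) for adding a federated identity credential.

    app_object_id is the application's object ID in Entra ID, not its client ID.
    """
    url = f"{GRAPH_BASE}/applications/{app_object_id}/federatedIdentityCredentials"
    body = {
        "name": name,
        "issuer": issuer,    # assumed: copied from the Azure DevOps service connection
        "subject": subject,  # assumed: copied from the Azure DevOps service connection
        "audiences": ["api://AzureADTokenExchange"],
    }
    return url, json.dumps(body, indent=2)
```

Printing the returned URL and body lets you compare them with what the bootstrap script created, for example when troubleshooting a failing token exchange.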
## Step 5: Create the Azure DevOps service connection
1. In Azure DevOps, go to **Project settings > Service connections**.
2. Click **New service connection > Azure Resource Manager > Workload identity federation (manual)**.
3. Fill in:
- **Subscription**: leave blank or select if you also want ARM access (not required).
- **Tenant ID**: your Microsoft 365 tenant ID.
- **Service Connection Name**: the same value you set in `SERVICE_CONNECTION_NAME` (e.g. `sc-astral-backup`).
- **App ID**: from the bootstrap script output.
4. Save the service connection.
## Step 6: Import the pipelines
1. Go to **Pipelines > Create pipeline > Azure Repos Git**.
2. Select your repository.
3. Choose **Existing Azure Pipelines YAML file**.
4. Import each of the three YAMLs one by one:
- `azure-pipelines.yml` (main backup)
- `azure-pipelines-review-sync.yml` (review sync)
- `azure-pipelines-restore.yml` (restore)
## Step 7: Grant repository permissions to the build identity
1. Go to **Project settings > Repositories**.
2. Select your repository.
3. Under **Security**, grant the **Build Service** account:
- Contribute
- Create branch
- Force push
- Create pull request
- Edit pull request
- Tag creation (if you enable tagging)
4. Under **Pipelines**, grant the build service **Queue builds** permission on `azure-pipelines-restore.yml` if you plan to use auto-remediation.
## Step 8: Set the restore pipeline definition ID
After importing `azure-pipelines-restore.yml`, find its definition ID:
1. Open the restore pipeline in Azure DevOps.
2. The URL contains `definitionId=XX`. Note the number.
3. Go back to your variable group (`vg-astral-tenant`) and set:
- `AUTO_REMEDIATE_RESTORE_PIPELINE_ID` = `XX`
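
If you are collecting IDs for several pipelines, the number can be pulled out of the copied URL programmatically. This is a small sketch using only the Python standard library; the helper name `definition_id_from_url` and the sample URL are illustrative.

```python
from urllib.parse import parse_qs, urlparse


def definition_id_from_url(url: str) -> int:
    """Extract the numeric definitionId query parameter from a pipeline URL."""
    query = parse_qs(urlparse(url).query)
    values = query.get("definitionId")
    if not values or not values[0].isdigit():
        raise ValueError("URL has no numeric definitionId query parameter")
    return int(values[0])


# Illustrative URL; your organization and project names will differ.
print(definition_id_from_url("https://dev.azure.com/contoso/astral/_build?definitionId=42"))
```

The same helper works for the `ADO_PIPELINE_ID` setting used by the change probe, which is the definition ID of `azure-pipelines.yml`.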
## Step 9: Validate the deployment
1. Import `deploy/validate-deployment.yml` as a one-time pipeline.
2. Run it.
3. Verify that all checks pass:
- Graph token acquisition
- Required roles present
- Test read from Graph
- Test PR creation and abandonment
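
If the "Required roles present" check fails, one way to see what the token actually carries is to decode its payload and read the `roles` claim, which app-only Microsoft Graph tokens use for application permissions. The sketch below does not verify the signature and is meant for diagnostics only. It is not the logic of `validate-deployment.yml`, and the function names are hypothetical.

```python
import base64
import json


def token_roles(access_token: str) -> set[str]:
    """Read the 'roles' claim from a JWT payload without verifying the signature."""
    payload_b64 = access_token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64 padding
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return set(claims.get("roles", []))


def missing_roles(access_token: str, required: set[str]) -> set[str]:
    """Return the required roles that the token does not contain."""
    return required - token_roles(access_token)
```

Pass the set of roles your deployment expects as `required`. An empty result means every expected role is in the token, while a non-empty result usually points to missing admin consent.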
## Step 10: Run the first backup
1. Queue a manual run of `azure-pipelines.yml`.
2. Set `forceFullRun=true` to get a complete initial snapshot.
3. Verify that `tenant-state/` is populated and a rolling PR is created.
## Step 11: Provision the event-driven change probe (optional but recommended)
The change probe replaces the previous hourly polling model with responsive, event-driven backup triggers.
### Option A: Automated provisioning
Run the unified provisioning script:
```powershell
.\deploy\provision-change-probe.ps1 `
-TenantName "contoso.onmicrosoft.com" `
-ResourceGroupName "rg-astral-probe" `
-Location "westeurope" `
-DeployFunctionApp
```
The script will create an Entra app, grant admin consent, provision Azure resources, and deploy the Function App.
### Option B: Manual provisioning
If you prefer manual setup:
1. **Create an app registration** in Entra ID for the probe.
2. **Grant admin consent** for:
- `DeviceManagementConfiguration.Read.All`
- `DeviceManagementApps.Read.All`
- `AuditLog.Read.All`
- `Directory.Read.All`
3. **Create a client secret** and note the value.
4. **Provision Azure resources**:
- Resource Group
- Storage Account (Standard LRS)
- Function App (Linux Consumption, Python 3.11)
5. **Configure Function App settings**:
| Setting | Value |
|---|---|
| `AzureWebJobsStorage` | Storage account connection string |
| `PROBE_APP_ID` | App registration client ID |
| `PROBE_APP_SECRET` | App registration client secret |
| `TENANT_ID` | Your Microsoft 365 tenant ID |
| `ADO_ORGANIZATION` | Your Azure DevOps org name |
| `ADO_PROJECT` | Your Azure DevOps project name |
| `ADO_PIPELINE_ID` | Definition ID of `azure-pipelines.yml` |
| `ADO_TOKEN` | Azure DevOps PAT with the **Build (read & execute)** scope |
| `ADO_BRANCH` | `main` (or your baseline branch) |
6. **Deploy the function package** using `WEBSITE_RUN_FROM_PACKAGE` (see `infra/change-probe/README.md`).
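
To show how the `ADO_*` settings fit together, here is a sketch that builds (but does not send) a request to queue a pipeline run through the Azure DevOps Pipelines REST API. Whether the probe uses this exact endpoint is an assumption, as are the `api-version` value and the function name `build_run_request`.

```python
import base64
import json
from urllib.parse import quote
from urllib.request import Request


def build_run_request(settings: dict[str, str]) -> Request:
    """Build a 'run pipeline' request from the Function App settings."""
    org = quote(settings["ADO_ORGANIZATION"], safe="")
    project = quote(settings["ADO_PROJECT"], safe="")
    pipeline_id = int(settings["ADO_PIPELINE_ID"])
    url = (
        f"https://dev.azure.com/{org}/{project}"
        f"/_apis/pipelines/{pipeline_id}/runs?api-version=7.1"
    )
    # Run against the configured branch of the pipeline's own repository.
    body = {
        "resources": {
            "repositories": {"self": {"refName": f"refs/heads/{settings['ADO_BRANCH']}"}}
        }
    }
    # A PAT is sent as HTTP Basic auth with an empty user name.
    basic = base64.b64encode(f":{settings['ADO_TOKEN']}".encode()).decode()
    return Request(
        url,
        data=json.dumps(body).encode(),
        method="POST",
        headers={"Authorization": f"Basic {basic}", "Content-Type": "application/json"},
    )
```

Passing the returned object to `urllib.request.urlopen` would queue the run. Doing that once from a workstation is a quick way to confirm that the PAT scope and the pipeline ID are correct before deploying the function.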
### Verify the probe
1. Make a test change in Intune (e.g., create a temporary device configuration profile).
2. Wait 5-20 minutes for the audit log to propagate.
3. Check the `ProbeState` table in your Storage Account. The `singleton/default` entity should show `debouncer.state = armed`.
4. After the quiet window (default 15 min) elapses, a queue message will be emitted.
5. The `queue_consumer` will dequeue it and queue the backup pipeline.
6. Verify the pipeline run appears in Azure DevOps with reason `manual` (API-triggered runs show as manual).
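
The armed state and quiet window described in the steps above follow a debounce pattern: each detected change restarts the timer, and a backup is requested only after a full quiet window passes with no further changes. The sketch below illustrates that pattern in isolation. It is not the probe's actual code, and the class name and the `idle` state name are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

QUIET_WINDOW = timedelta(minutes=15)  # default quoted in the verification steps


@dataclass
class Debouncer:
    state: str = "idle"  # "idle" or "armed"; "idle" is an assumed name
    last_change: Optional[datetime] = None

    def record_change(self, seen_at: datetime) -> None:
        """A new audit event arms the debouncer and restarts the quiet window."""
        self.state = "armed"
        self.last_change = seen_at

    def should_emit(self, now: datetime) -> bool:
        """Return True once, when armed and quiet for a full window."""
        if self.state != "armed" or self.last_change is None:
            return False
        if now - self.last_change < QUIET_WINDOW:
            return False
        self.state = "idle"  # disarm so the same burst is not emitted twice
        return True
```

With this model, a burst of Intune edits spread over ten minutes results in a single backup run roughly 15 minutes after the last edit, rather than one run per edit.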
> **Note:** The probe can share an Entra app with the main backup pipeline. To reuse the app registration created by `bootstrap-tenant.ps1`, add the `AuditLog.Read.All` permission and create a client secret for it.
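
With a client secret in place, an app can obtain a Microsoft Graph token through the OAuth 2.0 client credentials flow. The sketch below builds the token request without sending it, so you can see which values from the settings table go where. It is not the probe's implementation, and the function name `build_token_request` is hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_token_request(tenant_id: str, client_id: str, client_secret: str) -> Request:
    """Build a client-credentials token request for Microsoft Graph."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    form = urlencode(
        {
            "grant_type": "client_credentials",
            "client_id": client_id,          # PROBE_APP_ID in the settings table
            "client_secret": client_secret,  # PROBE_APP_SECRET in the settings table
            "scope": "https://graph.microsoft.com/.default",
        }
    )
    return Request(
        url,
        data=form.encode(),
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
```

Sending the request with `urllib.request.urlopen` returns a JSON document whose `access_token` field holds the token. Avoid logging the request body, since it contains the secret.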
## Optional: progressive feature rollout
| Phase | What to enable |
| --- | --- |
| Backup-only | `ENABLE_PR_REVIEW_SUMMARY=false`, `ENABLE_PR_REVIEWER_DECISIONS=false`, `AUTO_REMEDIATE_AFTER_MERGE=false` |
| Review package | `ENABLE_PR_REVIEW_SUMMARY=true`, `ENABLE_PR_REVIEWER_DECISIONS=true` |
| Full package | Also enable restore and set `AUTO_REMEDIATE_AFTER_MERGE=true` if desired |
| AI summaries | `ENABLE_PR_AI_SUMMARY=true` plus Azure OpenAI variables |
## Troubleshooting
| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Pipeline fails at "Get Graph Token" | Wrong service connection name or missing federated credential | Verify `SERVICE_CONNECTION_NAME` matches the service connection exactly and that the federated credential exists on the app registration |
| "Missing required Graph roles" | Admin consent not granted | Run bootstrap script again or grant consent manually in Entra ID |
| Rolling PR not created | Build identity lacks PR permissions | Add **Create pull request** and **Edit pull request** permissions |
| Restore pipeline queue fails | `AUTO_REMEDIATE_RESTORE_PIPELINE_ID` wrong or missing queue permission | Verify the ID and grant **Queue builds** on the restore pipeline |
| Empty `tenant-state/` after run | Graph returned no data (for example, missing permissions) or the `BACKUP_FOLDER` path is wrong | Verify Graph permissions and the `BACKUP_FOLDER` path, then re-run |