3
0
mirror of https://github.com/spantaleev/matrix-docker-ansible-deploy.git synced 2026-02-28 01:43:10 +00:00

Align homeserver/coturn service priorities to avoid first-start cert race

The startup issue came from a timing dependency around coturn TLS certs:

- `matrix-coturn.service` depends on
  `matrix-traefik-certs-dumper-wait-for-domain@<matrix-fqdn>.service`
- That waiter succeeds only after Traefik has obtained and dumped a cert for
  the Matrix hostname (typically driven by homeserver labels/routes becoming
  active)
- If coturn is started too early, it can block/fail waiting for cert files
  that are not yet present

Historically, coturn priority was mode-dependent:

- `one-by-one`: coturn at 1500 (delayed after homeserver)
- other modes: coturn at 900 (before homeserver)

This could still trigger undesirable startup ordering and confusing behavior
in non-`one-by-one` modes, especially during initial bootstrap/restart flows
where cert availability lags service startup.

This change makes ordering explicit and consistent:

1. Introduce `matrix_homeserver_systemd_service_manager_priority` (default 1000)
   in `roles/custom/matrix-base/defaults/main.yml`.
2. Use that variable for the homeserver service entry in
   `group_vars/matrix_servers`.
3. Set coturn priority relative to homeserver priority in all modes:
   `matrix_homeserver_systemd_service_manager_priority + 500`.
4. Update inline documentation comments in `group_vars/matrix_servers` to
   match the new behavior and rationale.

Result:

- Homeserver/coturn ordering is deterministic and mode-agnostic.
- Coturn is intentionally started later than the homeserver by default,
  reducing first-start certificate wait/fail races.
- Priority intent is now centralized and configurable via a dedicated
  homeserver priority variable.
- Coturn may still be stated earlier, because the homeserver typically
  has a `Wants` "dependency" on it, but that's alright
This commit is contained in:
Slavi Pantaleev
2026-02-20 23:54:21 +02:00
parent 976d2c4cd0
commit 4761ff7e9a
2 changed files with 11 additions and 8 deletions

View File

@@ -246,15 +246,14 @@ matrix_addons_homeserver_systemd_services_list: |
# - so that addon services (starting later) can communicte with the homeserver via Traefik's internal entrypoint
# (see `matrix_playbook_internal_matrix_client_api_traefik_entrypoint_enabled`)
# - core services (the homeserver) get a level of ~1000
# - services that the homeserver depends on (database, Redis, ntfy, coturn, etc.) get a lower level — between 500 and 1000
# - coturn gets a higher priority level (= starts later) if `devture_systemd_service_manager_service_restart_mode == 'one-by-one'` to intentionally delay it, because:
# - starting services one by one means that the service manager role waits for each service to fully start before proceeding to the next one
# - services that the homeserver depends on (database, Redis, ntfy, etc.) get a lower level — between 500 and 1000
# - coturn gets a higher priority level (= starts later) in all cases, to intentionally delay it in relation to the homeserver, because:
# - when starting services one by one, the service manager waits for each service to fully start before proceeding to the next one
# - if coturn has a lower priority than the homeserver, it would be started before it
# - since coturn is started before the homeserver, there's no container label telling Traefik to get a `matrix.example.com` certificate
# - if coturn is started before the homeserver, there'd be no container label (usually on the homeserver) telling Traefik to get a `matrix.example.com` certificate
# - thus, coturn would spin and wait for a certificate until it fails. We'd get a playbook failure due to it, but service manager will proceed to start all other services anyway.
# - only later, when the homeserver actually starts, would that certificate be fetched and dumped
# - this is not a problem with `all-at-once` (default) or `priority-batched` (services start concurrently),
# or with `clean-stop-start` (everything stops first, then starts in priority order — coturn at 900 is fine)
# - this is a problem for `one-by-one`, `clean-stop-start` (which behaves like one-by-one initially) and possibly other modes, except `all-at-once`
# - reverse-proxying services get level 3000
# - Matrix utility services (bridges, bots) get a level of 2000/2200, so that:
# - they can start before the reverse-proxy
@@ -607,7 +606,7 @@ devture_systemd_service_manager_services_list_auto: |
+
([{
'name': ('matrix-' + matrix_homeserver_implementation + '.service'),
'priority': 1000,
'priority': matrix_homeserver_systemd_service_manager_priority,
'restart_necessary': true,
'groups': ['matrix', 'homeservers', matrix_homeserver_implementation],
}] if matrix_homeserver_enabled else [])
@@ -635,7 +634,7 @@ devture_systemd_service_manager_services_list_auto: |
+
([{
'name': (coturn_identifier + '.service'),
'priority': (1500 if devture_systemd_service_manager_service_restart_mode == 'one-by-one' else 900),
'priority': (matrix_homeserver_systemd_service_manager_priority + 500),
'restart_necessary': (coturn_restart_necessary | bool),
'groups': ['matrix', 'coturn'],
}] if coturn_enabled else [])

View File

@@ -92,6 +92,10 @@ matrix_homeserver_enabled: true
# Note that the homeserver implementation of a server will not be able to be changed without data loss.
matrix_homeserver_implementation: synapse
# The priority that the homeserver starts with (lower = starts earlier).
# Related to the systemd_service_manager role and `devture_systemd_service_manager_services_list*` variables.
matrix_homeserver_systemd_service_manager_priority: 1000
# This contains a secret, which is used for generating various other secrets later on.
matrix_homeserver_generic_secret_key: ''