From 4761ff7e9a1f60116106ddf563bcf51474bb6102 Mon Sep 17 00:00:00 2001
From: Slavi Pantaleev
Date: Fri, 20 Feb 2026 23:54:21 +0200
Subject: [PATCH] Align homeserver/coturn service priorities to avoid first-start cert race

The startup issue came from a timing dependency around coturn TLS certs:
- `matrix-coturn.service` depends on `matrix-traefik-certs-dumper-wait-for-domain@.service`
- That waiter succeeds only after Traefik has obtained and dumped a cert for the Matrix hostname (typically driven by homeserver labels/routes becoming active)
- If coturn is started too early, it can block/fail waiting for cert files that are not yet present

Historically, coturn priority was mode-dependent:
- `one-by-one`: coturn at 1500 (delayed after homeserver)
- other modes: coturn at 900 (before homeserver)

This could still trigger undesirable startup ordering and confusing behavior in non-`one-by-one` modes, especially during initial bootstrap/restart flows where cert availability lags service startup.

This change makes ordering explicit and consistent:
1. Introduce `matrix_homeserver_systemd_service_manager_priority` (default 1000) in `roles/custom/matrix-base/defaults/main.yml`.
2. Use that variable for the homeserver service entry in `group_vars/matrix_servers`.
3. Set coturn priority relative to homeserver priority in all modes: `matrix_homeserver_systemd_service_manager_priority + 500`.
4. Update inline documentation comments in `group_vars/matrix_servers` to match the new behavior and rationale.

Result:
- Homeserver/coturn ordering is deterministic and mode-agnostic.
- Coturn is intentionally started later than the homeserver by default, reducing first-start certificate wait/fail races.
- Priority intent is now centralized and configurable via a dedicated homeserver priority variable.
- Coturn may still be started earlier, because the homeserver typically has a `Wants` "dependency" on it, but that's alright
---
 group_vars/matrix_servers                  | 15 +++++++--------
 roles/custom/matrix-base/defaults/main.yml |  4 ++++
 2 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/group_vars/matrix_servers b/group_vars/matrix_servers
index 805bfb819..95154d1d4 100755
--- a/group_vars/matrix_servers
+++ b/group_vars/matrix_servers
@@ -246,15 +246,14 @@ matrix_addons_homeserver_systemd_services_list: |
 #   - so that addon services (starting later) can communicte with the homeserver via Traefik's internal entrypoint
 #     (see `matrix_playbook_internal_matrix_client_api_traefik_entrypoint_enabled`)
 # - core services (the homeserver) get a level of ~1000
-# - services that the homeserver depends on (database, Redis, ntfy, coturn, etc.) get a lower level — between 500 and 1000
-#   - coturn gets a higher priority level (= starts later) if `devture_systemd_service_manager_service_restart_mode == 'one-by-one'` to intentionally delay it, because:
-#     - starting services one by one means that the service manager role waits for each service to fully start before proceeding to the next one
+# - services that the homeserver depends on (database, Redis, ntfy, etc.) get a lower level — between 500 and 1000
+#   - coturn gets a higher priority level (= starts later) in all cases, to intentionally delay it in relation to the homeserver, because:
+#     - when starting services one by one, the service manager waits for each service to fully start before proceeding to the next one
 #     - if coturn has a lower priority than the homeserver, it would be started before it
-#     - since coturn is started before the homeserver, there's no container label telling Traefik to get a `matrix.example.com` certificate
+#     - if coturn is started before the homeserver, there'd be no container label (usually on the homeserver) telling Traefik to get a `matrix.example.com` certificate
 #     - thus, coturn would spin and wait for a certificate until it fails. We'd get a playbook failure due to it, but service manager will proceed to start all other services anyway.
 #     - only later, when the homeserver actually starts, would that certificate be fetched and dumped
-#   - this is not a problem with `all-at-once` (default) or `priority-batched` (services start concurrently),
-#     or with `clean-stop-start` (everything stops first, then starts in priority order — coturn at 900 is fine)
+#   - this is a problem for `one-by-one`, `clean-stop-start` (which behaves like one-by-one initially) and possibly other modes, except `all-at-once`
 # - reverse-proxying services get level 3000
 # - Matrix utility services (bridges, bots) get a level of 2000/2200, so that:
 #   - they can start before the reverse-proxy
@@ -607,7 +606,7 @@ devture_systemd_service_manager_services_list_auto: |
     +
     ([{
       'name': ('matrix-' + matrix_homeserver_implementation + '.service'),
-      'priority': 1000,
+      'priority': matrix_homeserver_systemd_service_manager_priority,
       'restart_necessary': true,
       'groups': ['matrix', 'homeservers', matrix_homeserver_implementation],
     }] if matrix_homeserver_enabled else [])
@@ -635,7 +634,7 @@ devture_systemd_service_manager_services_list_auto: |
     +
     ([{
       'name': (coturn_identifier + '.service'),
-      'priority': (1500 if devture_systemd_service_manager_service_restart_mode == 'one-by-one' else 900),
+      'priority': (matrix_homeserver_systemd_service_manager_priority + 500),
       'restart_necessary': (coturn_restart_necessary | bool),
       'groups': ['matrix', 'coturn'],
     }] if coturn_enabled else [])
diff --git a/roles/custom/matrix-base/defaults/main.yml b/roles/custom/matrix-base/defaults/main.yml
index 49b3c89f3..202b1aea3 100644
--- a/roles/custom/matrix-base/defaults/main.yml
+++ b/roles/custom/matrix-base/defaults/main.yml
@@ -92,6 +92,10 @@ matrix_homeserver_enabled: true
 # Note that the homeserver implementation of a server will not be able to be changed without data loss.
 matrix_homeserver_implementation: synapse
 
+# The priority that the homeserver starts with (lower = starts earlier).
+# Related to the systemd_service_manager role and `devture_systemd_service_manager_services_list*` variables.
+matrix_homeserver_systemd_service_manager_priority: 1000
+
 # This contains a secret, which is used for generating various other secrets later on.
 matrix_homeserver_generic_secret_key: ''
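
Usage note (not part of the patch): a minimal sketch of how the new variable could be overridden from an inventory vars file, assuming the usual Ansible variable precedence applies. The file path and the value 1200 are illustrative assumptions only; the variable name and the `+ 500` offset come from the diff above.

```yaml
# Hypothetical location: inventory/host_vars/matrix.example.com/vars.yml
# Start the homeserver slightly later than the default priority of 1000.
matrix_homeserver_systemd_service_manager_priority: 1200

# No coturn setting is needed here: with this patch, group_vars/matrix_servers
# computes the coturn priority as
#   matrix_homeserver_systemd_service_manager_priority + 500
# which would evaluate to 1700 for the value above.
```

Because the coturn priority is derived from the homeserver priority, moving the homeserver should keep coturn 500 levels behind it without any further changes.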