Identity & Directory
Unified user/group identity store with multi-provider sync, distributed cache, and event-driven lifecycle
Overview
The identity subsystem provides a unified user and group store that aggregates identity data from multiple upstream providers into a single cluster-wide distributed cache. It serves as the authoritative source for user lookups, group memberships, authentication status checks, and account lifecycle events across HexonGateway.
Core capabilities:
- Distributed directory cache with O(1) indexed lookups
- Multi-provider identity aggregation (LDAP, SCIM 2.0, OIDC RP)
- Full and delta synchronization with configurable intervals
- Automatic credential revocation on user disable (OIDC tokens, sessions, VPN)
- Event-driven callbacks for user disable and user update notifications
- Nested group resolution with DAG traversal and cycle detection
- Multi-provider merge with priority-based conflict resolution
- Real-time webhook updates from SCIM providers (Okta, Azure AD, OneLogin)
- External IdP authentication via OIDC Relying Party (PKCE, DPoP, PAR)
The subsystem is organized into three layers:
Providers (data sources):
- LDAP provider: LDAP client with connection pooling and bind auth
- SCIM provider: SCIM 2.0 pull sync and webhook push sync
- OIDC RP provider: OIDC Relying Party for external IdP SSO

Directory (unified cache):
- Cluster-wide cache with indexed queries and sync orchestration

Consumers (downstream modules):
- Authentication, authorization, proxy, VPN, bastion, firewall

Architecture
Data flow from providers to consumers:
    LDAP Server -----> [ldap provider] ----+
                                           |
    SCIM Provider ---> [scim provider] ----+--> [directory cache] --> cluster storage
    (Okta, Azure AD)   (pull + webhook)    |    (indexes, TTL)
                                           |
    OIDC IdP --------> [oidc rp] ----------+
    (Azure, Google)    (SSO claims)        |
                                           v
                     [consumer modules query directory]
                     auth, proxy, firewall, vpn, bastion

Directory cache indexes (all O(1)):
- email -> username
- username -> groups
- groupname -> members
- disabled users

Synchronization modes:
- LDAP full sync: rebuilds entire cache (default: every 60 minutes)
- LDAP delta sync: incremental via modifyTimestamp (default: every 5 minutes)
- SCIM pull sync: scheduled per-provider (default: every 15 minutes)
- SCIM push sync: real-time webhooks with HMAC-SHA256 verification

Cluster behavior:
- Each node maintains independent provider connections and sync loops.
- Directory data is replicated to all nodes (eventual consistency).
- Queries are local-only for low latency with no quorum requirements.
- OIDC auth sessions are replicated cluster-wide for cross-node callback handling.

Relationships
Child modules:
- directory: Central cache and query API. All other modules consume
  directory for user lookups, group checks, and auth status verification.
- identity.ldap: Primary on-premise provider. Supplies users and groups
  via full/delta sync. Handles password bind authentication.
- identity.scim: Cloud identity provider. Syncs from Okta, Azure AD,
  OneLogin via pull and webhook push. Multi-provider merge by priority.
- identity.oidc_rp: External IdP authentication. Enables SSO via
  Authorization Code Flow with PKCE, DPoP, and PAR support.

Key consumers:
- Authentication modules: query directory for user existence, disabled
  status, password expiry, and group memberships.
- Proxy: fetches fresh group memberships on every request for
  authorization and identity header injection.
- Firewall: uses group memberships for ACL rule evaluation.
- Sessions: receives revocation calls when users are disabled.
- OIDC provider: receives token revocation on user disable.
- VPN/Bastion: sessions terminated on user disable.
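The consumer wiring above (sessions, VPN, and OIDC reacting to a user disable) can be pictured as a callback registry. This is an illustrative sketch only; `Directory`, `OnUserDisabled`, and `MarkDisabled` are made-up names, not the actual HexonGateway API.

```go
package main

import "fmt"

// Directory is a hypothetical stand-in for the directory module's
// event-driven callback registry; the real API may differ.
type Directory struct {
	onDisable []func(username string)
}

// OnUserDisabled registers a consumer callback (VPN, sessions, OIDC...).
func (d *Directory) OnUserDisabled(cb func(username string)) {
	d.onDisable = append(d.onDisable, cb)
}

// MarkDisabled flags a user as disabled and fans out to all consumers.
func (d *Directory) MarkDisabled(username string) {
	for _, cb := range d.onDisable {
		cb(username) // each consumer revokes its own credentials
	}
}

func main() {
	var revoked []string
	dir := &Directory{}
	// A VPN-like consumer terminates sessions when a user is disabled.
	dir.OnUserDisabled(func(u string) { revoked = append(revoked, "vpn:"+u) })
	// A session store invalidates web sessions.
	dir.OnUserDisabled(func(u string) { revoked = append(revoked, "sessions:"+u) })
	dir.MarkDisabled("alice")
	fmt.Println(revoked) // [vpn:alice sessions:alice]
}
```

The registry pattern keeps the directory decoupled from its consumers: new modules subscribe without the directory knowing about them.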
Directory Cache
Cluster-wide distributed user/group directory cache with LDAP sync and automatic index management
Overview
The directory module provides a distributed user/group cache that synchronizes data from LDAP and stores it for fast cluster-wide access. It serves as the foundation for authentication and authorization decisions across the entire HexonGateway cluster.
Core capabilities:
- Periodic full and delta syncs from LDAP via the ldap provider module
- Cluster-wide data distribution replicated to all nodes
- Automatic index maintenance on data changes
- Fast O(1) query API for user, group, email, and membership lookups
- Paginated listing of users and groups with server-side offset/limit
- Comprehensive authentication status checks (exists, disabled, password expiry)
- Automatic credential revocation when users are disabled
- Callback registration for user disable and user update events
- Background sync loops running independently on all cluster nodes
Data flow:
    LDAP Server -> [ldap module] -> [directory module] -> cluster-wide cache
                                            |
                              [auth modules query directory]

Indexes maintained automatically:
- Email to username (fast email lookup)
- Username to groups (fast group membership lookup)
- Group name to members (fast member listing)
- Disabled users list

Eventual consistency model: each node syncs independently, and data is replicated to all nodes. Queries run on the local node for low latency with no quorum requirements.
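The four indexes can be pictured as plain hash maps rebuilt in one pass over the user set. The types and function names below are illustrative, not the module's actual implementation.

```go
package main

import "fmt"

// User mirrors the cached identity attributes (illustrative shape).
type User struct {
	Username string
	Email    string
	Groups   []string
	Disabled bool
}

// Indexes holds the O(1) lookup maps the directory maintains.
type Indexes struct {
	EmailToUser   map[string]string
	UserToGroups  map[string][]string
	GroupToUsers  map[string][]string
	DisabledUsers map[string]bool
}

// BuildIndexes rebuilds all indexes in a single pass over the user set,
// as a full sync would after all users are stored.
func BuildIndexes(users []User) Indexes {
	ix := Indexes{
		EmailToUser:   map[string]string{},
		UserToGroups:  map[string][]string{},
		GroupToUsers:  map[string][]string{},
		DisabledUsers: map[string]bool{},
	}
	for _, u := range users {
		ix.EmailToUser[u.Email] = u.Username
		ix.UserToGroups[u.Username] = u.Groups
		for _, g := range u.Groups {
			ix.GroupToUsers[g] = append(ix.GroupToUsers[g], u.Username)
		}
		if u.Disabled {
			ix.DisabledUsers[u.Username] = true
		}
	}
	return ix
}

func main() {
	ix := BuildIndexes([]User{
		{Username: "alice", Email: "alice@example.com", Groups: []string{"admins"}},
		{Username: "bob", Email: "bob@example.com", Groups: []string{"admins", "devs"}, Disabled: true},
	})
	fmt.Println(ix.EmailToUser["alice@example.com"]) // alice
	fmt.Println(ix.GroupToUsers["admins"])           // [alice bob]
	fmt.Println(ix.DisabledUsers["bob"])             // true
}
```

Once built, every query listed above is a single map lookup, which is why the module can serve lookups locally without touching LDAP.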
Config
Configuration is primarily driven by LDAP settings since directory syncs from LDAP. The directory module uses these config keys:
[identity.ldap]
url = "ldaps://ldap.example.com"
delta_sync = "5m"    # Delta sync interval (default: 5 minutes)
full_sync = "60m"    # Full sync interval (default: 60 minutes)
base_dn = "dc=example,dc=com"
user_base_dn = "ou=users,dc=example,dc=com"
group_base_dn = "ou=groups,dc=example,dc=com"
bind_dn = "cn=service,dc=example,dc=com"
password = "service-password"
# ... additional LDAP config (see ldap module)

Sync behavior:

Full sync: retrieves ALL users and groups from LDAP and rebuilds the entire cache. Users are processed in small batches to prevent cluster overload on large directories. Indexes are built in bulk after all users are stored, avoiding per-user-per-group round trips that scale as O(users x groups). Runs on startup plus a periodic interval.

Delta sync: retrieves only MODIFIED users/groups since the last sync timestamp and updates changed entries cluster-wide. Indexes are updated incrementally. Default interval: 5 minutes.

Data TTL: 24 hours by default. Entries are evicted automatically if not refreshed by sync. This acts as a safety net against stale data.

Hot-reloadable: sync intervals, LDAP connection settings.
Cold (restart required): none specific to the directory module.
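The batched processing used by full sync comes down to chunking the user set before writing to the cluster. A minimal sketch; the batch size and names are illustrative, not the module's actual values.

```go
package main

import "fmt"

// batches splits items into chunks of at most size, so a full sync can
// store users in small groups instead of flooding the cluster at once.
func batches(items []string, size int) [][]string {
	var out [][]string
	for len(items) > 0 {
		n := size
		if len(items) < n {
			n = len(items)
		}
		out = append(out, items[:n])
		items = items[n:]
	}
	return out
}

func main() {
	users := []string{"u1", "u2", "u3", "u4", "u5"}
	for _, b := range batches(users, 2) {
		// store this batch cluster-wide, then continue
		fmt.Println(b)
	}
	// Output: [u1 u2] [u3 u4] [u5], one batch per line
}
```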
Troubleshooting
Common symptoms and diagnostic steps:
User not found in directory cache:
- Check LDAP connectivity: 'auth ldap' or LDAP health check
- Verify user exists in LDAP: search by username in LDAP directly
- Check sync status: 'directory status' to see last sync time and health
- Force full sync: 'directory sync' to trigger immediate LDAP sync
- Check delta_sync interval: user may not yet be synced if recently created
- TTL expiry: if sync has not run in 24 hours, entries may have been evicted

Stale group memberships (user has old groups):
- Delta sync only picks up changes since the last sync timestamp
- Force full sync to rebuild the entire cache from LDAP
- Check the modifyTimestamp field in LDAP (delta_field config)
- Group membership changes in LDAP must update the user's modifyTimestamp
- Verify index consistency: indexes rebuild automatically on data changes

User disabled but still has active sessions:
- Directory auto-revokes OIDC tokens and web sessions on disable
- Revocations are asynchronous and replicated to all nodes (eventual)
- VPN sessions terminated via user-disabled callback (if VPN enabled)
- SSH bastion sessions terminate on the next token refresh cycle
- Check if the disable was detected: look for directory disable log entries

Sync failures (directory shows unhealthy):
- LDAP connection failures: check network, TLS certs, bind credentials
- Timeout errors: increase search_timeout in LDAP config
- Large directory: increase page_size for paginated LDAP searches
- Memory pressure: large user counts may stress the cluster cache
- Check metrics: directory_sync_total{result="failure"} for error counts
- Check logs: search for "directory sync" in telemetry output

Cluster operation queue saturation during sync:
- Can occur during full sync on directories with many groups per user
- Check 'directory status' -- if the last sync timestamp matches the error, sync is the cause
- The directory uses batched sync with bulk indexing to prevent this, but extremely large directories or low concurrency limits may trigger it
- Mitigation: increase operations.max_concurrent_ops in cluster config

Index inconsistency (email lookup fails but user exists):
- Indexes are maintained automatically and should auto-repair on the next sync
- Force full sync to rebuild all indexes from scratch
- Check whether the email field is populated in the LDAP user entry
- Verify ldap_attribute_map.email matches the LDAP schema

Metrics for monitoring:
- directory_sync_total{type="full|delta", result="success|failure"}
- directory_users_synced (gauge): users synchronized in last sync
- directory_groups_synced (gauge): groups synchronized in last sync
- directory_sync_duration{type="full|delta"} (histogram): sync timing
- Use these to track sync health, LDAP connectivity, and capacity growth

Security
Security properties and enforcement:
No password caching:
The directory module never stores passwords. Authentication always goes through LDAP bind (via the ldap provider module). The cache contains only identity attributes: username, email, groups, and status flags.

Automatic credential revocation on user disable:
When a user is marked as disabled (Disabled=true), the directory module automatically revokes all authentication credentials:
1. OIDC tokens: all access and refresh tokens deleted cluster-wide
2. Web sessions: all user sessions invalidated immediately
3. VPN sessions: terminated via registered user-disabled callback
4. SSH bastion: sessions terminate on next token refresh
All revocations are replicated asynchronously to all cluster nodes.

Account status enforcement:
- Disabled accounts: checked by all auth flows before granting access
- Password expiry: tracked and enforced; expiry time available in AuthStatus
- Group membership: drives authorization decisions across all modules

Cluster-wide consistency:
- Data replicated to all nodes automatically
- Disabled user detection triggers revocation on all nodes
- No single point of failure for authorization decisions

Input validation:
- Username lookups are case-sensitive (as stored in LDAP)
- Group name matching is case-insensitive in consumer modules
- Email index uses a normalized form for consistent lookups

Interpreting tool output:

'directory status':
  Healthy: Status="Ready / Healthy", Sync Errors=0, Consecutive Errors=0
  Degraded: Status shows errors, Consecutive Errors > 0 -- LDAP may be unreachable
  Stale: Last Sync time is old (> 2x sync interval) -- sync may be stuck
  Action: if Degraded, run 'auth ldap' to check LDAP connection health

'directory user <username>':
  Found: shows username, email, groups, disabled status, password expiry
  Not found: user does not exist in directory cache -- check LDAP source
  Disabled=true: user is locked -- sessions will be revoked at next refresh
  Expired password: user gets a password-expired session type on next login
  Action: if user is missing, run 'directory sync' to force an immediate LDAP sync

Relationships
Module dependencies and interactions:
- LDAP provider: Primary data source. Directory calls LDAP for all sync
  operations (search users, search groups, full and delta sync).
- SCIM provider: Alternative data source. SCIM providers sync
  users/groups into the directory, bypassing LDAP for cloud-sourced
  identities (Okta, Azure AD, OneLogin).
- Cluster cache: All user/group data stored with 24h TTL and replicated
  to all cluster nodes. Indexes maintained automatically on data changes.
- Sessions: Session revocation on user disable. Directory triggers
  revocation of all web sessions cluster-wide.
- OIDC provider: Token revocation on user disable. Directory triggers
  deletion of all access and refresh tokens.
- Firewall: Consumes group membership for ACL rule matching.
  Groups fetched at peer chain update time via directory queries.
- VPN (IKEv2/WireGuard/OpenVPN): VPN session termination on user disable.
  Group changes trigger firewall updates automatically.
- Configuration: Hot-reloadable sync intervals and LDAP settings.
- Telemetry: Structured logging for sync operations, user
  disable events, and error conditions.

LDAP Provider
LDAP client library with connection pooling, search, bind, and directory integration
Overview
The LDAP module is a passive client library that manages connection pools and provides search, bind, and query operations against LDAP directories. It serves as the foundation for the directory module to cache and serve user/group data across the cluster.
Core capabilities:
- Pre-populated connection pool with configurable size
- TLS connections with optional custom CA certificate validation
- User search with custom LDAP filters and paginated result sets
- Group search with nested group resolution (recursive member expansion)
- User authentication via LDAP bind (no password caching)
- Delta sync support via modifyTimestamp queries for incremental updates
- Health checks with per-server latency reporting
- Configurable timeouts for search, bind, connection, and pool operations
- Smart startup retry logic with permanent vs transient error classification
- Hot-reloadable configuration for connection settings
Each cluster node maintains its own LDAP connection pool independently. The LDAP module does not replicate data or maintain caches — the directory module handles cluster-wide caching. No quorum requirements for LDAP operations.
Supported LDAP schemas:
- FreeIPA
- Active Directory
- OpenLDAP
- Generic LDAP servers (via configurable attribute mapping)

Config
Required configuration in config.toml:
[identity.ldap]
url = "ldaps://ldap.example.com"              # Primary LDAP server (ldaps:// for TLS)
base_dn = "dc=example,dc=com"                 # Base DN for all searches
user_base_dn = "ou=users,dc=example,dc=com"   # Base DN for user searches
group_base_dn = "ou=groups,dc=example,dc=com" # Base DN for group searches
bind_dn = "cn=service,dc=example,dc=com"      # Service account DN for binding
password = "service-password"                 # Service account password
user_attribute = "uid"                        # Primary user identifier attribute
user_filter = "(&(objectClass=inetOrgPerson)(uid=*))" # User search filter
delta_field = "modifyTimestamp"               # Field for delta sync queries
page_size = 1000                              # LDAP paged search size
ldap_connection_pool = 5                      # Connection pool size
ca_pem = """-----BEGIN CERTIFICATE-----...""" # CA cert for TLS validation

# Timeout configuration (Go duration strings)
search_timeout = "30s"       # Search operation timeout (default: 30s)
bind_timeout = "10s"         # Bind/auth operation timeout (default: 10s)
connection_timeout = "10s"   # New connection timeout (default: 10s)
pool_wait_timeout = "5s"     # Pool connection wait timeout (default: 5s)

# Delta sync and full sync intervals (used by directory module)
delta_sync = "5m"            # Delta sync interval
full_sync = "60m"            # Full sync interval

[identity.ldap_attribute_map]
username = "uid"           # Username attribute
full_name = "cn"           # Full name attribute
email = "mail"             # Email attribute
given_name = "givenName"   # First name attribute
surname = "sn"             # Last name attribute
member_of = "memberOf"     # Group membership attribute
# Additional attributes configurable per LDAP schema

Fallback URLs:
Multiple LDAP URLs can be configured. The connection pool tries the primary URL first and falls back to alternates on failure.

Hot-reloadable: connection settings, timeouts, attribute mappings. New connections use the current config automatically.
Cold (restart required): none, but the pool is recreated on config change.
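Any user-supplied value interpolated into a filter like user_filter must be escaped per RFC 4515 (this is what go-ldap's ldap.EscapeFilter does). A minimal sketch of the escaping rule, assuming only the core reserved characters:

```go
package main

import "fmt"

// escapeFilter hex-escapes the characters RFC 4515 reserves in filter
// values, mirroring the behavior of go-ldap's ldap.EscapeFilter.
func escapeFilter(s string) string {
	out := ""
	for _, b := range []byte(s) {
		switch b {
		case '(', ')', '*', '\\', 0:
			out += fmt.Sprintf(`\%02x`, b)
		default:
			out += string(b)
		}
	}
	return out
}

func main() {
	// A crafted username that would otherwise widen the search filter.
	user := "ad*)(uid=*"
	fmt.Printf("(&(objectClass=inetOrgPerson)(uid=%s))\n", escapeFilter(user))
	// (&(objectClass=inetOrgPerson)(uid=ad\2a\29\28uid=\2a))
}
```

Without escaping, the crafted value above would terminate the `uid` clause and inject a wildcard match; escaped, it only matches a literal (and nonexistent) username.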
Troubleshooting
Common symptoms and diagnostic steps:
LDAP connection failures on startup:
- Permanent errors (fail fast, no retry):
  * LDAP Code 49: Invalid Credentials (wrong bind_dn or password)
  * LDAP Code 32: No Such Object (wrong base_dn)
  * LDAP Code 34: Invalid DN Syntax
  * Certificate validation failures (wrong or expired CA)
- Transient errors (retry with exponential backoff up to 2 minutes):
  * Connection timeout, connection refused
  * DNS resolution failures, network errors
- Check: 'auth ldap' for LDAP health status

Connection pool exhaustion:
- Symptom: pool_wait_timeout errors, slow LDAP operations
- Check pool metrics: ldap_pool_available, ldap_pool_utilization_pct
- Increase ldap_connection_pool size in config
- Check for slow LDAP queries holding connections: ldap_operation_duration
- Check for stale connections: ldap_pool_reconnects counter

Authentication failures (Bind operation):
- LDAP Code 49: Invalid credentials for user
- Check the ldap_bind_failures{reason="invalid_credentials"} metric
- Verify user DN resolution: user_attribute + user_base_dn must match
- Check whether the user is locked/disabled in LDAP (not the same as a bind failure)

Search returning no results:
- Verify user_filter syntax: must be a valid LDAP filter expression
- Check user_base_dn: must be the correct organizational unit
- Verify page_size: too small may cause incomplete results
- Check search_timeout: large directories may need a longer timeout
- Test with a custom filter via the SearchUsers operation

Delta sync not detecting changes:
- Verify the delta_field attribute exists in the LDAP schema (e.g., modifyTimestamp)
- Check the timestamp format: must be generalizedTime
- Some LDAP servers don't update modifyTimestamp on group membership changes
- Force a full sync via the directory module if delta sync misses changes

Health check reporting unhealthy:
- HealthCheck probes each configured URL individually (primary + fallbacks)
- Returns per-server latency and error details
- At least one server must be reachable for Healthy=true
- Check network connectivity, firewall rules, TLS certificates

Key metrics for monitoring:
- ldap_operations_total{operation, status}: operation success/failure rates
- ldap_operation_duration{operation}: latency histograms
- ldap_pool_utilization_pct: connection pool health
- ldap_pool_errors{reason}: pool-level errors
- ldap_bind_success / ldap_bind_failures: authentication rates

Security
Security properties and hardening:
Transport security:
All connections use TLS (ldaps://). The module supports custom CA certificate validation via the ca_pem config field. TLS is mandatory; plain LDAP (ldap://) connections are not supported in production.

No password caching:
User authentication is performed via LDAP bind on every request. Passwords are never stored, cached, or logged by the module. The service account password is the only credential stored in config.

LDAP injection prevention:
All user-supplied values in LDAP filters are properly escaped using the go-ldap library's escaping functions. This prevents filter injection attacks where crafted usernames could alter search semantics.

DN validation:
Distinguished Names are validated to prevent directory traversal attacks where crafted DNs could access entries outside the configured base DN.

Startup credential validation:
The module refuses to start if LDAP credentials are invalid (Code 49). This prevents running with a misconfigured service account that would silently fail all authentication attempts.

Connection pool security:
- Service account bind performed on every new connection
- Stale connections detected and reconnected automatically
- Pool connections are not shared between user bind operations
- Each user bind gets a dedicated connection from the pool

Module data:
Module data storage has been moved to Hexon KV (NATS JetStream). LDAP is no longer used as a moduledata storage backend.

Relationships
Module dependencies and interactions:
- Directory: Primary consumer. The directory module calls LDAP for all sync
  operations (full and delta sync), user authentication via bind, individual
  user lookups, and readiness checks before starting syncs.
- Authentication modules: Use LDAP bind indirectly via the directory for
  authentication decisions. Some modules use bind directly for password
  verification.
- webauthn/wireguard/totp: Module data storage is now handled by the
  moduledata module (Hexon KV), not by LDAP.
- Configuration: Hot-reloadable. New connections automatically use current
  config settings. Attribute mappings and timeouts are reloadable.
- Telemetry: Structured logging for connection events,
  search operations, bind results, and error conditions.

OIDC Relying Party
OpenID Connect Relying Party for external IdP authentication with PKCE, DPoP, PAR, and token introspection
Overview
The OIDC Relying Party module enables HexonGateway to authenticate users with external identity providers such as Azure AD, Okta, Google, and any RFC-compliant OIDC provider. It implements the Authorization Code Flow with comprehensive security features.
Core capabilities:
- Authorization Code Flow with PKCE (RFC 7636, S256 method only)
- DPoP token binding (RFC 9449) for proof-of-possession
- Pushed Authorization Requests (RFC 9126) for secure parameter submission
- Token Introspection (RFC 7662) for active token validation
- Token Revocation (RFC 7009) for controlled token lifecycle
- ID token validation with signature, issuer, audience, nonce, and age checks
- UserInfo endpoint for fetching additional user claims
- Configurable claim mapping for provider-specific attribute names
- OIDC discovery with 24-hour caching and lazy initialization
- JWKS fetching with 1-hour caching and key rotation support
- AES-GCM encrypted state parameters with cluster-derived keys
- Multi-provider support with independent configuration per provider
- Session management (OIDC Session 1.0) with session ID tracking
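The S256 challenge derivation behind the PKCE capability is just base64url(SHA-256(verifier)) without padding (RFC 7636 section 4.2). A standalone sketch using the test vector from the RFC's appendix B:

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// s256Challenge derives the PKCE code_challenge from a code_verifier
// using the S256 method: BASE64URL(SHA256(ASCII(code_verifier))).
func s256Challenge(verifier string) string {
	sum := sha256.Sum256([]byte(verifier))
	return base64.RawURLEncoding.EncodeToString(sum[:])
}

func main() {
	// Test vector from RFC 7636 appendix B.
	verifier := "dBjftJeZ4CVP-mB92K27uhbUJU1p1r_wW1gFWFOEjXk"
	fmt.Println(s256Challenge(verifier))
	// E9Melhoa2OwvFrEMTJguCHaoeK1t8URWbuGJSstw-cM
}
```

The verifier never leaves the server side of the flow; only the challenge appears in the authorization request, which is what defeats code interception.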
Authorization flow:
1. Client calls Authorize -> module generates PKCE verifier, encrypts state
2. Module stores session in distributed cache (10 min TTL)
3. Module returns authorization URL for user redirect
4. User authenticates with external IdP
5. IdP redirects back with authorization code and state
6. Client calls Callback -> module validates state, exchanges code for tokens
7. Module validates ID token (signature, claims, nonce)
8. Module returns tokens and user identity claims

Startup behavior:
The module does NOT pre-fetch discovery metadata on startup, to avoid blocking initialization if IdPs are temporarily unreachable. Discovery is fetched lazily on first use and cached for 24 hours.

Config
Providers configured via TOML array:
[[identity.oidc_providers]]
name = "azure"                          # Internal provider identifier
display_name = "Microsoft Azure AD"     # UI display name
icon = "microsoft"                      # Icon identifier (optional)
issuer = "https://login.microsoftonline.com/{tenant}/v2.0" # OIDC issuer URL
client_id = "your-client-id"            # OAuth 2.0 client ID
client_secret = "your-client-secret"    # Client secret (optional with PKCE)
scopes = ["openid", "profile", "email", "groups"] # Must include "openid"
redirect_uris = ["https://app.example.com/callback"] # Allowed redirect URIs
pkce_required = true                    # Require PKCE (default: true)
dpop_enabled = false                    # Enable DPoP token binding
timeout = "30s"                         # HTTP timeout (default: 30s)
discovery_ttl = "24h"                   # Discovery cache TTL (default: 24h)
dev_mode = false                        # Relaxed validation (NEVER in production)
clock_skew_tolerance = "2m"             # Per-provider clock tolerance
strict_key_expiry = false               # Reject expired JWKS keys
required_amr = ["mfa"]                  # Required auth methods (optional)
suppress_error_details = false          # Hide error details in responses

[identity.oidc_providers.claim_mapping]
preferred_username = "upn"              # Map IdP claims to standard names
groups = "groups"

[identity.oidc_providers.extra_params]
prompt = "select_account"               # Additional authorize endpoint params

Multiple providers can be configured simultaneously. Each provider has independent settings, discovery cache, and health status.

Cache TTLs (distributed across the cluster):
- Discovery metadata: 24 hours
- JWKS keys: 1 hour
- Auth sessions: 10 minutes
- DPoP JTI replay prevention: 2 minutes

Hot-reloadable: provider settings, claim mappings, extra params, timeouts. Discovery and JWKS caches refresh on TTL expiry.
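The encrypted state parameter mentioned in the capabilities list can be sketched with the standard library's AES-GCM. Assumptions are flagged in the comments: the key here is hardcoded for illustration, whereas the module derives it from cluster secrets with domain separation.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
)

// seal encrypts a state payload with AES-256-GCM; the random nonce is
// prepended to the ciphertext so open can recover it.
func seal(key, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// open reverses seal, authenticating the ciphertext; tampering fails.
func open(key, sealed []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	ns := gcm.NonceSize()
	return gcm.Open(nil, sealed[:ns], sealed[ns:], nil)
}

func main() {
	key := make([]byte, 32) // illustration only; real key is cluster-derived
	state := []byte(`{"provider":"azure","nonce":"abc"}`)
	sealed, _ := seal(key, state)
	plain, _ := open(key, sealed)
	fmt.Println(string(plain)) // round-trips to the original state JSON
}
```

Because GCM authenticates as well as encrypts, a tampered or forged state parameter fails decryption outright rather than producing garbage that must be validated separately.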
Troubleshooting
Common symptoms and diagnostic steps:
Authorization flow fails with “invalid_request”:
- Verify redirect_uri matches exactly what is registered with the IdP
- Check that scopes include "openid" (required for OIDC)
- Verify client_id is correct for the target IdP
- Check whether the provider requires client_secret (some do even with PKCE)
- For PAR failures, the module falls back to a standard authorization URL

Callback fails with “invalid_grant”:
- Authorization code may have expired (typically 5-10 minutes)
- Code already exchanged (single-use enforcement by IdP)
- PKCE code_verifier mismatch (state corruption or session expired)
- Check auth session TTL: sessions expire after 10 minutes

ID token validation fails:
- Issuer mismatch: verify issuer config matches the IdP exactly
- Audience mismatch: client_id must be in the token's audience claim
- Token expired: check clock sync between gateway and IdP
- Nonce mismatch: state/session corruption during the flow
- Signature failure: JWKS cache may be stale; force a refresh
- Blocked algorithm: none and HS256/384/512 are rejected for security
- Allowed algorithms: RS256-512, ES256-512, EdDSA

Discovery fetch failures:
- Check network connectivity to the IdP's .well-known endpoint
- Verify the issuer URL is correct (common mistake: wrong tenant ID)
- TLS errors: the gateway must trust the IdP's certificate chain
- Timeout: increase the timeout setting for slow IdP responses
- Check metrics: oidc_rp.discovery_fetch_failures_total

DPoP errors:
- "invalid_dpop_proof": proof JWT validation failed
- "use_dpop_nonce": server requires a DPoP nonce (retry with nonce)
- JTI replay detected: same proof used twice (2-min dedup window)
- Key thumbprint mismatch between proof and token binding

Token refresh fails:
- Refresh token expired or revoked by the IdP
- Scope validation: requested scopes must be a subset of the original grant
- DPoP-bound tokens require a DPoP proof on refresh
- Check oidc_rp.token_refresh_failures_total{reason} for details

Provider health check shows unhealthy:
- Discovery endpoint unreachable (network, DNS, TLS issues)
- JWKS not cached (no tokens validated yet, lazy fetch)
- Check per-provider health via the 'auth oidc' admin command

Key metrics for monitoring:
- oidc_rp.authorization_initiated_total: flow start rate
- oidc_rp.token_exchange_success_total / failures_total: conversion rate
- oidc_rp.state_validation_failures_total: security events
- oidc_rp.discovery_fetch_failures_total: IdP connectivity
- oidc_rp.token_exchange_duration: IdP latency

Security
Security measures and hardening:
PKCE (RFC 7636):
A 64-byte code_verifier (512 bits of entropy) with the S256 method only; the plain method is blocked. The code verifier is stored server-side only, never exposed to the browser. This prevents authorization code interception attacks.

State protection:
AES-GCM encryption with a cluster-derived key and domain separation. Single-use enforcement (deleted after validation). A 10-minute TTL prevents replay attacks on stale authorization sessions.

ID token validation (defense-in-depth):
- Signature verified against JWKS from the IdP
- Issuer must match the configured issuer exactly
- Audience must contain the configured client_id
- Expiration checked with configurable clock skew tolerance
- Nonce validated to prevent replay attacks
- Token age validation (max 10 minutes from issuance)
- at_hash validation linking ID token to access token (OIDC Core 3.1.3.6)

Algorithm restrictions:
- Blocked: none, HS256, HS384, HS512 (symmetric algorithms)
- Allowed: RS256-512, ES256-512, EdDSA (asymmetric only)
- RSA key size: 2048-8192 bits only
- RSA exponent: only standard values (3, 17, 65537)
These restrictions prevent algorithm confusion and weak key attacks.

DPoP (RFC 9449):
Proof-of-possession for token binding. JTI replay prevention with a 2-minute distributed cache. JWK thumbprint computation per RFC 7638. Confirmation claim (cnf) validation is required when DPoP is enabled.

Pushed Authorization Requests (RFC 9126):
Authorization parameters are sent directly to the IdP (not in the browser URL). This prevents parameter exposure in browser history and the URL bar, and allows larger parameter payloads without URL length limits. Falls back to a standard URL if the PAR endpoint is unavailable.

Error disclosure control:
Configurable suppress_error_details for production environments. Sensitive information is masked in API responses; full details are logged server-side for debugging. Token values and secrets are never logged.

Connection security:
TLS 1.2+ required for all IdP connections. Max 3 redirects, with HTTPS required on redirect. Response size limits prevent memory exhaustion (Discovery: 1MB, JWKS: 512KB, Token/UserInfo: 256KB).

Relationships
Module dependencies and interactions:
- Sign-in flow engine: Primary consumer. The sign-in flow engine uses
  the OIDC RP module to initiate authorization flows and process callbacks
  for SSO-based authentication.
- Proxy auth: Reverse proxy mappings can use OIDC providers for
  per-application SSO authentication via the unified cookie solution.
- Distributed memory cache: Distributed cache for
  auth sessions (10 min TTL), discovery metadata (24h), JWKS (1h), and
  DPoP JTI replay prevention (2 min). Enables cross-node callback handling
  when the user returns from the IdP to a different cluster node.
- Directory: After OIDC authentication, the user identity is
  matched against the directory for group memberships and authorization.
  OIDC claims (email, groups) may be used for just-in-time provisioning.
- Configuration: Hot-reloadable provider configuration. Provider
  settings, claim mappings, and timeouts are reloadable.
- Cluster: Read operations run locally for low latency. Session storage
  and DPoP JTI tracking are replicated to all nodes so callbacks work
  regardless of which node the user returns to.
- Telemetry: Structured logging for authorization flows,
  token operations, and security events. Metrics exported for all major
  operations with provider-level labels.

SCIM Identity Provider
SCIM 2.0 identity provider with multi-provider merge, webhook push sync, circuit breaker, and deletion safety
Overview
The SCIM identity provider synchronizes users and groups from external SCIM 2.0 providers (Okta, Azure AD, OneLogin, JumpCloud, etc.) into the directory with full lifecycle management. It implements RFC 7643 (Core Schema), RFC 7644 (Protocol), and RFC 7644 Section 3.10 (Path Expressions).
Core capabilities:
- Pull sync: scheduled full sync at configurable intervals with delta
  computation for minimal directory writes, per-provider sync workers
- Push sync (webhooks): real-time updates via HMAC-SHA256 signed events with
  atomic deduplication, fail-closed for destructive operations
- Multi-provider merge: priority-based attribute conflict resolution when
  multiple SCIM providers are configured (lower priority number wins)
- Nested group resolution: DAG traversal with cycle detection, configurable
  direction (up/down/both), and max depth limits
- Deletion safety: per-sync thresholds, cumulative daily limits, zero-user
  protection, two-step delete (disable then remove)
- Circuit breaker: consecutive failure detection, exponential backoff with
  automatic recovery on success
- Authentication: OAuth2 client_credentials, Bearer token, HTTP Basic
- SCIM path expressions: simple (userName), nested (name.givenName),
  array filter (emails[primary eq true].value)
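The nested group resolution described above (DAG traversal with a visited set and a depth cap) can be sketched for the "up" direction. The function name and graph shape are illustrative, not the provider's actual code.

```go
package main

import "fmt"

// resolveGroups walks group-of-group edges upward, collecting transitive
// memberships. A visited set provides cycle detection, and maxDepth
// mirrors the max_nesting_depth limit.
func resolveGroups(parents map[string][]string, direct []string, maxDepth int) []string {
	visited := map[string]bool{}
	var out []string
	var walk func(g string, depth int)
	walk = func(g string, depth int) {
		if visited[g] || depth > maxDepth {
			return // cycle detected or depth limit reached
		}
		visited[g] = true
		out = append(out, g)
		for _, p := range parents[g] {
			walk(p, depth+1)
		}
	}
	for _, g := range direct {
		walk(g, 1)
	}
	return out
}

func main() {
	// eng is nested under all-staff; a and b form a deliberate cycle.
	parents := map[string][]string{
		"eng": {"all-staff"},
		"a":   {"b"},
		"b":   {"a"},
	}
	fmt.Println(resolveGroups(parents, []string{"eng"}, 5)) // [eng all-staff]
	fmt.Println(resolveGroups(parents, []string{"a"}, 5))   // terminates: [a b]
}
```

The visited set is what turns a potentially infinite walk over a cyclic membership graph into a bounded one, independent of the depth cap.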
Read operations (status, health, user/group queries) run on the local node. Write operations (sync, directory updates) are replicated to all nodes for cluster-wide consistency. Each node maintains independent SCIM clients and sync loops.

Config
Providers configured via TOML array:
[[identity.scim_providers]]
name = "okta"                    # Internal identifier for the provider
enabled = true                   # Whether provider is active (default: true)
priority = 1                     # Merge priority, lower = higher (default: 10)
base_url = "https://example.okta.com/scim/v2" # SCIM 2.0 base URL
auth_type = "oauth2"             # Authentication: "oauth2", "bearer", "basic"
oauth2_token_url = "https://example.okta.com/oauth2/v1/token"
oauth2_client_id = "client_id"   # OAuth2 client ID
oauth2_client_secret = "secret"  # OAuth2 client secret
oauth2_scopes = ["scim"]         # OAuth2 scopes to request
bearer_token = ""                # Static bearer token (auth_type = "bearer")
basic_username = ""              # HTTP Basic username (auth_type = "basic")
basic_password = ""              # HTTP Basic password (auth_type = "basic")
sync_interval = "15m"            # Background sync interval (default: "15m")
sync_timeout = "3m"              # Per-sync timeout (default: "3m")
max_nesting_depth = 5            # Maximum group nesting depth (default: 5)
nested_groups = false            # Enable nested group resolution (default: false)
nested_groups_direction = "up"   # Resolution direction: "up", "down", "both"
webhook_secret = "min-32-byte-secret" # HMAC-SHA256 secret (minimum 32 bytes)

[identity.scim_providers.attribute_map]
username = "userName"
email = "emails[primary eq true].value"
full_name = "displayName"
given_name = "name.givenName"
surname = "name.familyName"
groups = "groups[].display"

Multiple providers with merge:
Provider okta (priority: 1) and provider azure (priority: 2) both have user alice. If the two report different emails, okta's email wins (lower priority number). Group memberships are merged as a union across all providers.

Webhook endpoint:
POST /webhook/scim/{provider}

Signature headers (checked in order):
X-Webhook-Signature: sha256=<hex-hmac>
X-Hub-Signature-256: sha256=<hex-hmac>
X-Signature-256: sha256=<hex-hmac>

Max payload size: 256KB
Supported events: user.created, user.updated, user.deleted, user.disabled, group.created, group.updated, group.deleted

Hot-reloadable: provider settings, attribute maps, sync intervals, timeouts, webhook secrets, nested group settings.
Cold (restart required): none; providers fully reinitialize on config change.
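The webhook signature check described above (headers tried in order, `sha256=<hex-hmac>` format, 256KB payload cap, 32-byte minimum secret) can be sketched as follows. This is an illustrative sketch of the receiving side, not the gateway's actual implementation; the function name and header handling are assumptions based on the documented behavior.

```python
import hashlib
import hmac

# Signature headers, tried in the documented order.
SIGNATURE_HEADERS = ("X-Webhook-Signature", "X-Hub-Signature-256", "X-Signature-256")

def verify_webhook(headers: dict, body: bytes, secret: bytes) -> bool:
    """Return True only if a valid sha256=<hex-hmac> signature is present."""
    if len(secret) < 32:            # webhook secret must be at least 32 bytes
        return False
    if len(body) > 256 * 1024:      # max payload size: 256KB
        return False
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    for name in SIGNATURE_HEADERS:
        provided = headers.get(name)
        if provided is not None:
            # Constant-time comparison prevents timing attacks; note the
            # signature is checked before any payload parsing happens.
            return hmac.compare_digest(provided, expected)
    return False                    # no signature header: reject

secret = b"0123456789abcdef0123456789abcdef"  # 32 bytes
body = b'{"event": "user.disabled", "id": "evt-1"}'
sig = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
print(verify_webhook({"X-Webhook-Signature": sig}, body, secret))        # True
print(verify_webhook({"X-Webhook-Signature": "sha256=bad"}, body, secret))  # False
```

Verifying before parsing means malformed or unsigned payloads are rejected without ever touching the JSON decoder, which keeps the endpoint's unauthenticated attack surface minimal.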
Troubleshooting
Common symptoms and diagnostic steps:
Sync not running or providers not initializing:
- Check provider enabled=true in config
- Verify base_url is reachable from the gateway node
- Check auth credentials: OAuth2 token URL, client_id/secret, bearer token
- Check: 'scim status' for provider initialization and sync status
- Check: 'scim health' for per-provider connectivity

Sync completing but no users/groups appearing:
- Verify attribute_map matches the provider's SCIM schema
- Check SCIM path expressions match the provider's data format
- Trigger manual sync: 'scim sync' to test
- Check: 'directory users' and 'directory groups' for cached data

Circuit breaker open (sync suspended):
- The circuit opens after 10 consecutive sync failures
- Backoff starts at 30 seconds, doubling each time up to a 30-minute max
- Check: 'scim health' for circuit breaker state
- Fix the underlying issue and wait for auto-recovery, or trigger 'scim sync' manually

Webhook events not being processed:
- Verify webhook_secret is configured (minimum 32 bytes)
- Check the HMAC signature format from the provider
- Payload must be valid JSON, max 256KB
- Deduplication: same event ID processed only once (1 hour window)
- Check: 'logs search webhook' for rejection reasons

Webhook deletions being blocked:
- Per-sync threshold: defaults to max 10% of users or 50 absolute per cycle
- Cumulative daily limit: defaults to 200 deletions in a rolling 24-hour window
- Zero-user protection: deletions blocked when current user count is zero
- Timestamp freshness: destructive events require a timestamp within 5 minutes

Multi-provider merge conflicts:
- Lower priority number wins for conflicting attributes
- Group memberships are always a union (no conflict)
- Check: 'logs search "merge conflict"' for conflict details

Nested group resolution issues:
- Verify nested_groups=true and the correct direction
- Check max_nesting_depth (default 5): deep hierarchies may be truncated
- Circular references: detected and logged, cycles broken at the detection point

Security
Security properties and hardening:
Webhook verification (HMAC-SHA256):
Constant-time signature comparison prevents timing attacks. The signature is verified before any payload parsing. The webhook secret must be at least 32 bytes. Webhooks are rejected if no secret is configured for the provider.

Fail-closed destructive operations:
When deduplication fails due to cache errors, delete and disable events are blocked rather than allowed through. This prevents accidental mass deletion if the distributed cache is temporarily unavailable.

Deletion safety (defense-in-depth):
Per-sync thresholds limit the percentage and absolute count of deletions per cycle. A cumulative daily limit caps total deletions in a rolling 24-hour window. Zero-user protection blocks deletions when the current count is zero. Two-step delete: disable first (triggers session revocation), then remove.

Timestamp freshness:
Destructive webhook events require a timestamp within 5 minutes. Stale destructive events are rejected to prevent replay.

Deduplication:
Atomic single-use enforcement with a 1-hour TTL. Each webhook event ID is consumed exactly once across the cluster. Prevents replay attacks.

Input validation:
UTF-8 correctness, control character rejection, length limits (256 chars max for usernames). Case-insensitive identity matching.

Connection security:
TLS 1.2+ required for all provider connections. Per-provider HTTP client with connection pooling. Configurable timeout (default 30s per request).

Relationships
Module dependencies and interactions:
- Directory: Primary consumer of synced data. SCIM writes users and groups to the directory with cluster-wide replication for consistency.
- Sessions: Receives cascading callbacks on user deprovisioning. When a user is disabled or deleted, active sessions are revoked immediately.
- OIDC provider: Receives cascading callbacks on user removal for token revocation and session cleanup.
- Configuration: Hot-reloadable provider settings, attribute maps, sync intervals, timeouts, webhook secrets, and nested group settings.
- Admin CLI: 'scim status', 'scim health', 'scim sync' commands for diagnostics and manual sync triggering.
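To close, the multi-provider merge rules described in the Config section (lower priority number wins per conflicting attribute; group memberships are unioned across providers) can be illustrated with a minimal sketch. The `merge_users` function and its record format are hypothetical, not HexonGateway's API:

```python
def merge_users(records):
    """Merge per-provider views of one user.

    records: list of (priority, attrs) pairs, where attrs is a dict that
    may include a 'groups' set. Lower priority number wins for scalar
    attributes; group memberships are the union across all providers.
    """
    merged, groups = {}, set()
    for _, attrs in sorted(records, key=lambda r: r[0]):  # best priority first
        groups |= set(attrs.get("groups", ()))
        for key, value in attrs.items():
            if key != "groups":
                merged.setdefault(key, value)  # first (highest-priority) value wins
    merged["groups"] = groups
    return merged

# okta (priority 1) and azure (priority 2) both know user alice:
alice = merge_users([
    (2, {"email": "alice@azure.example", "groups": {"eng"}}),        # azure
    (1, {"email": "alice@okta.example", "groups": {"eng", "ops"}}),  # okta
])
print(alice["email"])           # okta's email wins: lower priority number
print(sorted(alice["groups"]))  # union of both providers' memberships
```

Sorting once by priority and filling attributes with `setdefault` keeps conflict resolution deterministic regardless of the order providers finish syncing.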