Authentication

Multi-method authentication hub with MFA orchestration, session management, and federation

Overview

The authentication subsystem is the central identity verification layer for Hexon. It coordinates multiple authentication methods, multi-factor authentication (MFA), and session lifecycle across the cluster.

Supported primary authentication methods:

passwd: LDAP password authentication with bind verification
passkey: WebAuthn/FIDO2 passwordless authentication (Touch ID, YubiKey)
x509: X.509 client certificate authentication (Subject DN mapping)
oidc: OpenID Connect SSO via internal provider or external IdP (RP)
saml: SAML 2.0 Identity Provider for SP-initiated and IdP-initiated SSO
magiclink: Email-based passwordless sign-in using RFC 8628 device code polling
kerberos: SPNEGO/Kerberos ticket-based authentication (Active Directory)

Supported MFA methods (second factor):

otp: Email-delivered one-time password via SMTP (per-device fingerprinting)
totp: Time-based one-time password (RFC 6238, authenticator apps)

Additional modules:

devicecode: RFC 8628 device authorization grant (bastion SSH, magic link infra)
jit2fa: Just-in-time second factor enrollment and verification
scim: SCIM 2.0 identity provider with multi-provider merge and webhooks

The signin service orchestrates authentication flows. It selects the primary method (configurable via service.signin.primary), falls through to secondary methods, enforces MFA requirements per method, and manages session creation with cluster-wide replication.

Architecture

Authentication flow (signin service orchestration):

Client request arrives at /signin (or /api/signin for API clients)
Method selection: primary method presented first, secondary methods available
Credential verification dispatched to the appropriate auth module:

   - passwd:    LDAP bind (no password storage)
   - passkey:   WebAuthn challenge-response ceremony
   - x509:      Certificate chain validation + Subject DN mapping
   - oidc:      Authorization Code + PKCE exchange
   - saml:      AuthnRequest processing + signed assertion generation
   - magiclink: Device code creation + magic link email delivery
   - kerberos:  SPNEGO token validation

Identity lookup: username resolved to user record from directory
Group resolution: group memberships fetched from directory
Account status check: disabled/locked accounts rejected synchronously
MFA gate (if require_mfa includes the method):

   a. Pre-authentication session created (limited, 5-minute TTL)
   b. MFA challenge presented (OTP email or TOTP authenticator)
   c. MFA code verified
   d. Pre-auth session revoked, new authenticated session created (rotation)

Session creation: replicated to all nodes (cluster-wide quorum)
Directory sync: user record synchronized cluster-wide
Session cookie set, redirect to return_url or landing page

All auth modules are invoked cluster-wide, ensuring consistency and observability regardless of which node handles the request.

Session types:

Authenticated: full access, configurable TTL (default 24h)
MFA pending: limited capabilities, short TTL (default 5min)
Password expired: forced password change, restricted access

Configuration:

  [service.signin]
  primary = "passkey"              # Default authentication method
  secondary = ["passwd", "x509"]   # Alternative methods shown on signin page
  require_mfa = ["passwd"]         # Methods that require MFA after primary auth
  mfa_methods = ["otp", "totp"]    # Available MFA methods for users
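
As an illustration of how these keys interact, the following Go sketch orders the methods for the signin page and decides whether the MFA gate applies to a given method. The SigninConfig type and its methods are hypothetical names chosen for this example; they are not Hexon's internal API.

```go
package main

import "fmt"

// SigninConfig mirrors the [service.signin] keys shown above.
// The type and method names are illustrative assumptions.
type SigninConfig struct {
	Primary    string
	Secondary  []string
	RequireMFA []string
	MFAMethods []string
}

// MethodOrder returns the methods in presentation order: the primary
// method first, then each secondary method, skipping duplicates.
func (c SigninConfig) MethodOrder() []string {
	order := []string{c.Primary}
	seen := map[string]bool{c.Primary: true}
	for _, m := range c.Secondary {
		if !seen[m] {
			seen[m] = true
			order = append(order, m)
		}
	}
	return order
}

// RequiresMFA reports whether a successful primary authentication with
// the given method must be followed by a second factor.
func (c SigninConfig) RequiresMFA(method string) bool {
	for _, m := range c.RequireMFA {
		if m == method {
			return true
		}
	}
	return false
}

func main() {
	cfg := SigninConfig{
		Primary:    "passkey",
		Secondary:  []string{"passwd", "x509"},
		RequireMFA: []string{"passwd"},
		MFAMethods: []string{"otp", "totp"},
	}
	fmt.Println(cfg.MethodOrder())                                     // [passkey passwd x509]
	fmt.Println(cfg.RequiresMFA("passwd"), cfg.RequiresMFA("passkey")) // true false
}
```

With the configuration above, only a passwd login would be routed through the pre-authentication session and MFA challenge; passkey and x509 logins would proceed directly to session creation.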

Relationships

Child modules (authentication.*):

oidc: OIDC provider — SSO hub for proxy, bastion, external apps
webauthn: FIDO2/WebAuthn — passwordless passkey authentication
ldap: LDAP authentication backend — password bind verification
x509: X.509 certificate auth — client cert to username mapping
saml: SAML 2.0 IdP — assertion generation, SLO, metadata
kerberos: SPNEGO/Kerberos — Active Directory ticket authentication
otp: Email OTP — one-time codes via SMTP with device fingerprinting
totp: TOTP — RFC 6238 authenticator app verification
devicecode: RFC 8628 device authorization — bastion SSH, magic link infra
magiclink: Email-based passwordless — magic link token generation/verification
jit2fa: Just-in-time 2FA — enrollment and verification middleware

Upstream dependencies:

directory: User lookup, group membership, account status (disabled/locked)
sessions: Session creation (quorum), revocation, TTL management
smtp: Email delivery for OTP codes and magic link emails
firewall: Network-level access rules applied before auth endpoints

Downstream consumers:

proxy: Proxy SSO via OIDC provider (dedicated internal client)
bastion: SSH authentication via device authorization grant
services: All HTTP services check session cookies for access control
vpn: IKEv2 VPN authentication via EAP-TLS or certificate methods

Cross-cutting:

protection: Rate limiting on signin endpoints (JA4 fingerprint-based)
cluster: All auth operations are cluster-wide for consistency
notify: Authentication event notifications (webhooks, email alerts)

Device Code Authorization

RFC 8628 Device Authorization Grant for input-constrained devices

Overview

The device code module implements RFC 8628 (OAuth 2.0 Device Authorization Grant) for authenticating input-constrained devices such as smart TVs, CLI tools, IoT devices, and headless systems that lack a web browser.

Core capabilities:

Full RFC 8628 compliance (Sections 3.1 through 3.5, 6.1)
BASE20 user codes using consonants only (BCDFGHJKLMNPQRSTVWXZ) to avoid profanity in generated codes

Configurable code length (default: 8 characters) and expiration TTL
Constant-time comparison for user code validation (timing attack prevention)
SHA-256 hashed cache keys to prevent code enumeration
Optimistic locking with version-based concurrency control to prevent double-authorization race conditions in distributed environments

Distributed code storage with cluster-wide replication and quorum consensus
Automatic expiration with configurable TTL (default: 10 minutes)
Single-use enforcement: codes cannot be reused after authorization or denial
Directory integration for fresh user claims at authorization time

Flow summary:

  1. Device requests authorization codes
  2. Device displays short user_code to the user (e.g., "BCDFGHJK")
  3. User visits verification URI on another device (phone or computer)
  4. User enters user_code and authorizes or denies the device
  5. Device polls token endpoint until authorized, denied, or expired
  6. On authorization, device receives access token via OIDC token endpoint

The OIDC service handles the HTTP endpoints (/device page) and token endpoint with device_code grant type. The device code module provides the core logic; the OIDC service provides the HTTP transport.

Config

Device code behavior is configured under the OIDC authentication section:

  [authentication.oidc]
  device_code_ttl = "10m"            # Code expiration (default: 10 minutes)
  device_code_interval = 5           # Minimum polling interval in seconds (default: 5)
  device_code_user_code_length = 8   # User code character count (default: 8)

Code generation parameters:

  - Device code: 40-digit cryptographically random token for client polling
  - User code: 8-character BASE20 string (consonants only) for human entry
  - Verification URI: auto-generated from server base URL + /device path
  - VerificationURIComplete: includes pre-filled user_code query parameter
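
A minimal sketch of user code generation, assuming each character is drawn uniformly from the BASE20 alphabet with crypto/rand. The function name generateUserCode is hypothetical, and the module's actual generator may differ.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
	"strings"
)

// base20Alphabet is the consonant-only charset described above.
const base20Alphabet = "BCDFGHJKLMNPQRSTVWXZ"

// generateUserCode returns a random code of the requested length.
// rand.Int draws a uniform value in [0, 20), so no modulo bias is introduced.
func generateUserCode(length int) (string, error) {
	if length <= 0 {
		return "", fmt.Errorf("invalid user code length %d", length)
	}
	limit := big.NewInt(int64(len(base20Alphabet)))
	var sb strings.Builder
	for i := 0; i < length; i++ {
		n, err := rand.Int(rand.Reader, limit)
		if err != nil {
			return "", err
		}
		sb.WriteByte(base20Alphabet[n.Int64()])
	}
	return sb.String(), nil
}

func main() {
	code, err := generateUserCode(8) // 8 is the documented default length
	if err != nil {
		panic(err)
	}
	fmt.Println(code)
}
```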

Polling behavior (per RFC 8628 Section 3.5):

  - Clients must wait at least device_code_interval seconds between polls
  - "slow_down" response instructs client to add 5 seconds to interval
  - "authorization_pending" means user has not yet acted
  - "expired_token" means device_code TTL has passed
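
The polling rules above can be sketched from the client's side as follows. This is an illustration only: poll stands in for the HTTP token request, sleep is injected so the loop can run without real delays, and both names are assumptions rather than part of any Hexon client library.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	ErrAccessDenied = errors.New("access_denied")
	ErrExpiredToken = errors.New("expired_token")
)

// pollUntilDone waits intervalSec seconds between polls until the flow
// finishes. poll returns the RFC 8628 error code, or "" once a token has
// been issued. The final interval is returned for inspection.
func pollUntilDone(poll func() string, intervalSec int, sleep func(sec int)) (int, error) {
	for {
		sleep(intervalSec)
		switch code := poll(); code {
		case "":
			return intervalSec, nil
		case "authorization_pending":
			// The user has not acted yet; keep the current interval.
		case "slow_down":
			// RFC 8628 Section 3.5: add 5 seconds to the interval.
			intervalSec += 5
		case "access_denied":
			return intervalSec, ErrAccessDenied
		case "expired_token":
			return intervalSec, ErrExpiredToken
		default:
			return intervalSec, fmt.Errorf("unexpected error code %q", code)
		}
	}
}

func main() {
	responses := []string{"authorization_pending", "slow_down", "authorization_pending", ""}
	i := 0
	poll := func() string { r := responses[i]; i++; return r }
	noSleep := func(int) {} // a real client would call time.Sleep here
	final, err := pollUntilDone(poll, 5, noSleep)
	fmt.Println(final, err) // 10 <nil>
}
```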

Hot-reloadable: device_code_ttl, device_code_interval. Cold (restart required): device_code_user_code_length.

The module auto-enables when OIDC is configured. No separate enable flag is needed. Magic link module also auto-enables device code when activated.

Troubleshooting

Common symptoms and diagnostic steps:

User code not accepted at verification page:

  - Verify code format: must be exactly 8 consonants from the BASE20 charset (8 is the default device_code_user_code_length)
  - Check expiration: codes expire after device_code_ttl (default: 10 minutes)
  - Check single-use: codes cannot be reused after authorization or denial
  - Verify AlreadyHandled flag: VerifyUserCode returns AlreadyHandled=true if
    the code was already authorized or denied
  - Case sensitivity: user codes are case-insensitive but stored uppercase

Device polling returns "expired_token" too quickly:

  - Check device_code_ttl configuration (default: 10m)
  - Verify cluster time synchronization (NTP) across nodes
  - Check if code was created with custom TTL override via AdditionalData

Device polling returns "slow_down" repeatedly:

  - Client must increase polling interval by 5 seconds on each slow_down
  - Minimum interval: device_code_interval (default: 5 seconds)
  - Verify client implements backoff correctly per RFC 8628 Section 3.5

"authorization_pending" never resolves:

  - Verify user visited the correct verification URI
  - Check that user entered the correct user_code
  - Verify OIDC service handlers are registered and accessible
  - Check network connectivity to the verification endpoint
  - Confirm user completed the full authorization flow (not just code entry)

Race condition or double authorization:

  - Optimistic locking detects concurrent modifications via version counter
  - Post-broadcast verification rejects stale version authorization attempts
  - Check structured logs for "version mismatch" warnings
  - Multiple users entering same code: statistically improbable with 8-character BASE20 codes (20^8, about 25.6 billion combinations)

Token exchange fails after authorization:

  - Verify OIDC token endpoint is configured and accessible
  - Check client_id matches between authorization and token request
  - Verify scope is valid for the OIDC provider configuration
  - Check directory module health (user claims fetched at authorization time)

Codes not replicating across cluster nodes:

  - Check cluster health and quorum status
  - Verify memory storage module is healthy
  - Check cluster connectivity between nodes
  - Codes use distributed storage with quorum; partial cluster may cause issues

Diagnostic commands:

  - auth devicecodes: list active device code authorization flows
  - auth status: check authentication system overview
  - health components: verify device code subsystem health

Security

Security features and hardening measures:

BASE20 charset (RFC 8628 Section 6.1):

  User codes use only consonants (BCDFGHJKLMNPQRSTVWXZ) to prevent profanity
  in randomly generated codes. This is an explicit RFC recommendation.

Constant-time comparison:

  User code validation uses crypto/subtle.ConstantTimeCompare to prevent
  timing side-channel attacks that could leak valid codes. This complements
  the user code brute-force guidance in RFC 8628 Section 5.1.
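
A short sketch of such a comparison, assuming the input is trimmed and uppercased before it is compared with the stored uppercase code. The helper name userCodesEqual is hypothetical.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"strings"
)

// userCodesEqual compares user input against the stored code.
// Input is normalized to uppercase because codes are case-insensitive
// but stored uppercase. ConstantTimeCompare returns 0 immediately when
// the lengths differ, so only the code length can leak through timing.
func userCodesEqual(input, stored string) bool {
	normalized := strings.ToUpper(strings.TrimSpace(input))
	return subtle.ConstantTimeCompare([]byte(normalized), []byte(stored)) == 1
}

func main() {
	fmt.Println(userCodesEqual("bcdfghjk", "BCDFGHJK")) // true
	fmt.Println(userCodesEqual("BCDFGHJX", "BCDFGHJK")) // false
}
```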

SHA-256 hashed storage keys:

  Cache keys for device codes are SHA-256 hashed to prevent enumeration
  attacks. Even with access to the storage layer, codes cannot be extracted
  from their hash keys.
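
As a rough illustration, a hashed cache key could be derived like this. The key prefix and the helper name cacheKey are assumptions made for the example; the module's real key layout is not documented here.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// cacheKey derives a storage key from a code so that the raw code
// never appears in the key space. The prefix is a hypothetical namespace.
func cacheKey(prefix, code string) string {
	sum := sha256.Sum256([]byte(code))
	return prefix + hex.EncodeToString(sum[:])
}

func main() {
	fmt.Println(cacheKey("devicecode:", "example-device-code"))
}
```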

Optimistic locking (distributed race prevention):

  - Each authorization increments a version counter
  - Post-broadcast verification detects concurrent modifications
  - Rejects authorization if version mismatch detected
  - Prevents double-authorization in multi-node clusters
  - Critical for environments where multiple users may attempt simultaneous auth
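
The version check can be sketched with a single-process stand-in for the distributed store. A mutex-guarded map replaces cluster replication here, and all type and function names are hypothetical; the point is only to show how a stale version is rejected.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var (
	ErrUnknownCode     = errors.New("unknown code")
	ErrVersionMismatch = errors.New("version mismatch")
	ErrAlreadyHandled  = errors.New("already handled")
)

// DeviceAuth is one pending or completed device authorization.
type DeviceAuth struct {
	Status  string // "pending", "authorized", or "denied"
	Subject string
	Version uint64
}

// Store is an in-memory stand-in for the replicated code storage.
type Store struct {
	mu      sync.Mutex
	records map[string]DeviceAuth
}

// Authorize succeeds only if the caller saw the current version.
// Each successful authorization increments the version counter.
func (s *Store) Authorize(key, subject string, expected uint64) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	rec, ok := s.records[key]
	if !ok {
		return ErrUnknownCode
	}
	if rec.Version != expected {
		return ErrVersionMismatch
	}
	if rec.Status != "pending" {
		return ErrAlreadyHandled
	}
	rec.Status, rec.Subject = "authorized", subject
	rec.Version++
	s.records[key] = rec
	return nil
}

func main() {
	s := &Store{records: map[string]DeviceAuth{"k": {Status: "pending"}}}
	// Two users read version 0 and both try to authorize.
	fmt.Println(s.Authorize("k", "alice", 0)) // <nil>
	fmt.Println(s.Authorize("k", "bob", 0))   // version mismatch
}
```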

Single-use enforcement:

  Once a code is authorized or denied, it cannot be reused. The AlreadyHandled
  flag prevents replay attacks on consumed codes.

Directory re-validation:

  CompleteAuthorization fetches the latest user data from the directory module
  rather than relying on stale session data. This ensures:
  - Disabled users cannot complete device authorization
  - Group memberships reflect current state (security-critical)
  - ID tokens contain fresh, authoritative user claims
  - Graceful fallback to session metadata if directory is temporarily unavailable

Automatic expiration:

  Codes expire after configurable TTL (default: 10 minutes). Expired codes
  are automatically cleaned up from distributed storage.

Fuzz testing coverage:

  - FuzzUserCodeValidation: injection attack resistance
  - FuzzUserCodeConstantTimeComparison: timing attack verification
  - FuzzDeviceCodeGeneration: cryptographic randomness quality
  - FuzzOptimisticLockingVersionHandling: race condition prevention
  - FuzzDeviceAuthorizationRequest: parameter handling validation

Relationships

Module dependencies and interactions:

OIDC service: Primary consumer. OIDC token endpoint handles the
  device_code grant type. OIDC service provides HTTP handlers for the /device
  verification page. Token generation occurs after CompleteAuthorization.

Magic link: Reuses device code infrastructure for its polling
  mechanism. Magic link auto-enables device code module when activated.

Directory: Canonical source for user attributes at authorization
  time. CompleteAuthorization fetches email, full_name, given_name, surname,
  and group memberships from directory. Graceful fallback to session metadata
  if directory is temporarily unavailable.

Distributed memory cache: Cache for code storage. Codes replicated
  across cluster with quorum consensus. TTL-based automatic cleanup.

Sessions: Session integration for authenticated user context
  during the verification flow.

Client access: Server-side device code auth for hexonclient QUIC tunnels.
  Gateway generates device code, sends challenge to client
  over QUIC control stream, polls until authorized. Same pattern as bastion SSH.

Bastion SSH: Server-side device code auth for SSH sessions.
  Gateway generates device code, displays QR in terminal.

OIDC service: HTTP transport layer. Handles /device endpoint rendering,
  /oidc/device/authorize for code generation, and token endpoint for code
  exchange.

config: Runtime configuration access for TTL, interval, and code length
  settings. Hot-reload supported for TTL and interval.

telemetry: Structured logging for all device code operations including
  authorization attempts, completions, and expiration events.

Just-In-Time Two-Factor Authentication

Transparent OTP-based 2FA for legacy applications via login interception and credential replay

Overview

JIT-2FA adds two-factor authentication to legacy web applications without any backend modifications. It operates as a transparent middleware layer within a proxy mapping, intercepting form-based login submissions and gating access with email-based OTP verification.

Core capabilities:

Transparent login interception: intercepts POST submissions to configurable login paths
Webhook credential validation: validates username/password via external HTTP webhook
Email OTP challenge: sends one-time password to user email extracted from webhook response
Credential replay: after OTP success, replays the original POST request to the backend
Auth header mode: alternative to replay, injects X-Hexon-* headers for proxy-aware backends
Asymmetric encryption: NaCl box (X25519 + XSalsa20 + Poly1305) for credential storage
Split-knowledge security: server holds ciphertext, client holds private key in HttpOnly cookie
Secure memory handling: plaintext credentials zeroed immediately after encryption
OTP resend without re-encryption: same ciphertext and cookie reused across resends
Session-based access: authenticated sessions bypass 2FA for subsequent requests
Double logout: destroys both JIT-2FA session and forwards logout to backend

Two trust models controlled by inject_credentials config option:

Credential Replay Mode (inject_credentials = true, default):

  For legacy apps with no proxy-auth support. The full NaCl encryption pipeline
  encrypts the login POST body, stores ciphertext in session, then decrypts and
  replays the original request after OTP verification.
  Flow: Login POST -> Encrypt body -> Store ciphertext -> OTP -> Decrypt -> Replay POST -> Backend

Auth Header Mode (inject_credentials = false):

  For apps supporting trusted reverse proxy authentication (Grafana, GitLab, Gitea,
  Jenkins, etc.). Eliminates the encryption pipeline entirely. After OTP success,
  redirects user to login URL. The proxy layer injects auth headers (X-Hexon-User,
  X-Hexon-Mail, etc.) on every authenticated request.
  Flow: Login POST -> Webhook validate -> OTP -> Redirect 302 -> Auth headers injected -> Backend
  Requires add_auth_headers = true on the parent proxy mapping.

Request flow:

  1. Request arrives at proxy mapping with JIT-2FA enabled
  2. Logout path check: if match, destroy session and forward to backend
  3. Login path POST check: if match, extract credentials and call webhook
  4. Webhook success with email: encrypt body (replay mode) or store username (header mode)
  5. Send OTP email and render verification page
  6. User submits OTP: verify code, decrypt and replay POST (or redirect with headers)
  7. Authenticated session established for subsequent requests
  8. Non-login requests: check session validity, forward if authenticated or redirect to login

Config

JIT-2FA is configured per proxy mapping under [proxy.mapping.jit_2fa]:

  [proxy.mapping.jit_2fa]
  enabled = true                    # Enable JIT-2FA for this mapping
  login_url = "/login"              # Redirect target for unauthenticated users
  login_path_regex = "^/login$"     # Regex matching login POST endpoint
  logout_path_regex = "^/logout$"   # Regex matching logout endpoint
  username_field = "username"       # Form field name for username extraction
  password_field = "password"       # Form field name for password extraction
  inject_credentials = true         # true = credential replay, false = auth header mode
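
To show how login_path_regex and logout_path_regex might drive the interception decision, here is a small Go sketch. The Matcher type and Action values are hypothetical; the ordering (logout check first, then POST-only login matching) follows the request flow described in the overview.

```go
package main

import (
	"fmt"
	"net/http"
	"regexp"
)

// Action is the decision taken for an incoming request.
type Action int

const (
	Forward Action = iota
	InterceptLogin
	Logout
)

// Matcher holds the compiled per-mapping patterns (Go RE2 syntax).
type Matcher struct {
	login, logout *regexp.Regexp
}

// NewMatcher compiles both patterns and reports invalid regex syntax.
func NewMatcher(loginRe, logoutRe string) (*Matcher, error) {
	l, err := regexp.Compile(loginRe)
	if err != nil {
		return nil, fmt.Errorf("login_path_regex: %w", err)
	}
	o, err := regexp.Compile(logoutRe)
	if err != nil {
		return nil, fmt.Errorf("logout_path_regex: %w", err)
	}
	return &Matcher{login: l, logout: o}, nil
}

// Classify checks the logout path first, then intercepts only POST
// requests whose path matches the login pattern.
func (m *Matcher) Classify(method, path string) Action {
	if m.logout.MatchString(path) {
		return Logout
	}
	if method == http.MethodPost && m.login.MatchString(path) {
		return InterceptLogin
	}
	return Forward
}

func main() {
	m, err := NewMatcher("^/login$", "^/logout$")
	if err != nil {
		panic(err)
	}
	fmt.Println(m.Classify("POST", "/login") == InterceptLogin) // true
	fmt.Println(m.Classify("GET", "/login") == Forward)         // true
}
```

Anchoring both patterns with ^ and $ keeps them from matching longer paths such as /login/help or /api/logout-all.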

Webhook configuration under [proxy.mapping.jit_2fa.webhook]:

  [proxy.mapping.jit_2fa.webhook]
  url = "https://api.internal/validate"   # Webhook endpoint URL
  method = "GET"                          # HTTP method (GET or POST)
  timeout = "5s"                          # Webhook response timeout (default: 5s)
  success_field = "$.status"              # JSONPath to success indicator in response
  success_value = "ok"                    # Expected value at success_field
  extract_email = "$.email"               # JSONPath to user email for OTP delivery

Optional HTTP transport tuning (defaults aligned with proxy connection pool):

  max_idle_conns = 50                     # Total idle connections (default: 50)
  max_idle_conns_per_host = 20            # Idle connections per host (default: 20)
  force_attempt_http2 = true              # Force HTTP/2 (default: true)
  disable_compression = true              # Disable compression (default: true)
  write_buffer_size = 32768               # Write buffer bytes (default: 32768)
  read_buffer_size = 32768                # Read buffer bytes (default: 32768)
  dial_timeout = "30s"                    # TCP dial timeout (default: 30s)
  keep_alive = "30s"                      # TCP keepalive interval (default: 30s)
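
The way success_field, success_value, and extract_email are applied to a webhook response can be sketched as below. This example assumes a flat JSON object and supports only top-level paths of the form $.name; it is not a full JSONPath implementation, and the function names are hypothetical.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// lookup resolves a top-level "$.name" path to a string value.
// Nested paths and array indexes are deliberately out of scope.
func lookup(doc map[string]interface{}, path string) (string, bool) {
	if !strings.HasPrefix(path, "$.") {
		return "", false
	}
	key := strings.TrimPrefix(path, "$.")
	if key == "" || strings.ContainsAny(key, ".[") {
		return "", false
	}
	s, ok := doc[key].(string)
	return s, ok
}

// evaluateWebhook returns the email for OTP delivery when the response
// reports success, or an error describing why the login must stop.
func evaluateWebhook(body []byte, successField, successValue, extractEmail string) (string, error) {
	var doc map[string]interface{}
	if err := json.Unmarshal(body, &doc); err != nil {
		return "", fmt.Errorf("invalid webhook JSON: %w", err)
	}
	status, ok := lookup(doc, successField)
	if !ok || status != successValue {
		return "", errors.New("webhook did not report success")
	}
	email, ok := lookup(doc, extractEmail)
	if !ok || !strings.Contains(email, "@") {
		return "", errors.New("webhook response has no usable email")
	}
	return email, nil
}

func main() {
	body := []byte(`{"status":"ok","email":"alice@example.com"}`)
	email, err := evaluateWebhook(body, "$.status", "ok", "$.email")
	fmt.Println(email, err) // alice@example.com <nil>
}
```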

OTP configuration under [proxy.mapping.jit_2fa.otp]:

  [proxy.mapping.jit_2fa.otp]
  type = "numeric"                  # OTP type: "numeric" or "base20" (default: global)
  length = 6                        # OTP digit count
  valid = "5m"                      # OTP validity duration
  max_retries = 3                   # Maximum OTP entry attempts (default: global)
  resend_time = 30                  # Seconds before resend allowed (default: global)

When using auth header mode (inject_credentials = false), the parent proxy mapping must also set add_auth_headers = true to inject X-Hexon-User, X-Hexon-Mail, and other identity headers on authenticated requests.

All OTP settings fall back to global email OTP defaults when not specified per mapping.

Troubleshooting

Common symptoms and diagnostic steps:

User submits login but sees an error instead of OTP page:

  - Webhook failure: check webhook URL reachability and response format
  - JSONPath mismatch: verify success_field and success_value match the webhook response
  - No email in response: extract_email JSONPath must resolve to a valid email address
  - Webhook timeout: increase timeout if backend validation is slow (default 5s)
  - Form field names wrong: username_field and password_field must match the HTML form

OTP email not received:

  - Check SMTP configuration: 'smtp health' to verify email delivery system
  - Email address extraction: webhook must return email at the configured JSONPath
  - Rate limiting: protection module may throttle OTP requests
  - Check email OTP module health: OTP generation depends on the emailotp service

OTP verification fails (invalid code):

  - Expired OTP: default validity is 5 minutes, user may have waited too long
  - Max retries exceeded: after max_retries (default 3), session is invalidated
  - Wrong mapping context: DeviceID is mappingID:sessionID, must match original
  - Clock skew: cluster nodes must have synchronized time for OTP validation

Credential replay fails after OTP success:

  - Private key cookie missing: browser may have cleared cookies or cookie expired (5 min)
  - Session expired: NATS session data has TTL, check if ciphertext still exists
  - Decryption error: private key cookie must match the public key used for encryption
  - Backend rejected replayed POST: CSRF token in original form may have expired
  - Content-Type mismatch: replayed request preserves original Content-Type header

Auth header mode not working (inject_credentials = false):

  - Missing add_auth_headers = true on parent proxy mapping configuration
  - Backend not configured to trust X-Hexon-* headers
  - Redirect loop: login_url must match the path the backend expects for login
  - Session cookie not set: check browser cookie settings and SameSite policy

Session issues (user keeps getting redirected to login):

  - Cookie blocked: Secure flag requires HTTPS, SameSite=Strict blocks cross-origin
  - Session storage: verify NATS/JetStream connectivity for session persistence
  - Multiple domains: session cookies are domain-scoped, check cookie domain setting
  - Logout path regex matching too broadly: verify logout_path_regex specificity

Login requests not intercepted:

  - Regex syntax: login_path_regex uses Go regexp syntax (RE2)
  - Path normalization: check if proxy rewrites the path before JIT-2FA sees it
  - Method filter: only POST requests to login_path_regex trigger interception
  - Case sensitivity: regex is case-sensitive by default

Performance and webhook diagnostics:

  - Webhook latency: high timeout values block the user login flow
  - Connection pooling: webhook HTTP transport shares pool settings with proxy
  - Cluster-wide OTP tracking: retries tracked across all cluster nodes

Security

Cryptographic design and security properties:

Encryption model (credential replay mode):

  NaCl box authenticated encryption using X25519 key agreement, XSalsa20 stream
  cipher, and Poly1305 message authentication. Fresh X25519 keypair generated per
  login attempt. Ciphertext includes 32-byte ephemeral public key and 16-byte
  authentication tag (48 bytes overhead total).

Split-knowledge architecture:

  Server stores: encrypted body ciphertext and public key (cannot decrypt alone)
  Client stores: private key in HttpOnly cookie (cannot access ciphertext alone)
  Both halves required to recover plaintext credentials. Compromise of either
  storage in isolation reveals nothing about the original credentials.

Cookie security:

  Private key cookie attributes: HttpOnly, Secure, SameSite=Strict, Max-Age=300
  - HttpOnly: prevents JavaScript access to private key
  - Secure: only transmitted over HTTPS connections
  - SameSite=Strict: cookie is never sent on cross-site requests, which limits CSRF exposure
  - Max-Age=300: 5-minute window to complete OTP verification
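
For reference, these attributes map onto Go's net/http cookie type as sketched below. The cookie name jit2fa_key, the base64url encoding of the key, and Path=/ are assumptions made for the example, not documented behavior.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"net/http"
)

// privateKeyCookie builds a cookie carrying the client half of the
// split-knowledge pair with the attributes listed above.
func privateKeyCookie(name string, key []byte) *http.Cookie {
	return &http.Cookie{
		Name:     name,
		Value:    base64.RawURLEncoding.EncodeToString(key),
		Path:     "/",
		MaxAge:   300, // 5-minute window to complete OTP verification
		HttpOnly: true,
		Secure:   true,
		SameSite: http.SameSiteStrictMode,
	}
}

// isHardened reports whether a cookie carries all four attributes.
func isHardened(c *http.Cookie) bool {
	return c.HttpOnly && c.Secure &&
		c.SameSite == http.SameSiteStrictMode && c.MaxAge == 300
}

func main() {
	c := privateKeyCookie("jit2fa_key", make([]byte, 32))
	fmt.Println(isHardened(c)) // true
	fmt.Println(c.String())
}
```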

Memory safety:

  - Plaintext credentials zeroed immediately after encryption
  - Private key zeroed on server side immediately after decryption
  - Zeroing uses subtle.ConstantTimeCopy to prevent compiler optimization
  - No plaintext credentials ever written to disk or session storage
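
A minimal sketch of in-place zeroing with subtle.ConstantTimeCopy, as mentioned above. The helper name wipe is hypothetical. Note that Go makes no formal guarantee about dead-store elimination or about copies the runtime may have made elsewhere, so this is best-effort hygiene rather than a proof of erasure.

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

// wipe overwrites buf with zeros in place. ConstantTimeCopy copies the
// second slice into the first when the selector is 1; both slices must
// have the same length.
func wipe(buf []byte) {
	zeros := make([]byte, len(buf))
	subtle.ConstantTimeCopy(1, buf, zeros)
}

func main() {
	password := []byte("correct horse")
	// ... encrypt or use the password here ...
	wipe(password)
	fmt.Println(password[0] == 0) // true
}
```

This is also why credentials are held as []byte rather than string: Go strings are immutable and cannot be cleared in place.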

OTP security:

  - OTP hashed with bcrypt before storage (not stored in plaintext)
  - Constant-time comparison prevents timing side-channel attacks
  - Cluster-wide retry tracking prevents distributed brute-force attempts
  - Rate limiting inherited from protection module
  - DeviceID binding: OTP tied to specific mapping and session (prevents reuse)

Webhook security:

  - Webhook URL should use HTTPS for credential transmission
  - Webhook timeout prevents slow-loris style resource exhaustion
  - Credentials sent to webhook only, never stored in plaintext on server
  - JSONPath extraction validates response structure before proceeding

Auth header mode security:

  - No credential storage or encryption needed (eliminates cryptographic attack surface)
  - Backend must be configured to only trust headers from the gateway IP
  - X-Hexon-* headers stripped from external requests by the proxy layer
  - Session-based: authentication state maintained via secure session cookie
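
The strip-then-inject behavior can be illustrated as follows. X-Hexon-User and X-Hexon-Mail come from the description above; the function names and the exact point in the proxy pipeline where this runs are assumptions.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

const identityPrefix = "X-Hexon-"

// stripIdentityHeaders removes every client-supplied X-Hexon-* header.
// Keys are compared in canonical form and deleted by their raw map key,
// so non-canonical spellings such as "x-hexon-user" are removed too.
func stripIdentityHeaders(h http.Header) {
	for name := range h {
		if strings.HasPrefix(http.CanonicalHeaderKey(name), identityPrefix) {
			delete(h, name)
		}
	}
}

// injectIdentity replaces any inbound identity headers with values
// taken from the authenticated session.
func injectIdentity(h http.Header, user, mail string) {
	stripIdentityHeaders(h)
	h.Set("X-Hexon-User", user)
	h.Set("X-Hexon-Mail", mail)
}

func main() {
	h := http.Header{}
	h.Set("X-Hexon-User", "admin") // spoofed by the client
	h.Set("Accept", "text/html")
	injectIdentity(h, "alice", "alice@example.com")
	fmt.Println(h.Get("X-Hexon-User"), h.Get("Accept")) // alice text/html
}
```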

CSRF protection:

  - Original form CSRF tokens preserved in encrypted body for replay
  - OTP form uses separate anti-replay mechanism
  - SameSite=Strict cookies prevent cross-origin request forgery

Relationships

Module dependencies and interactions:

proxy: Parent module. JIT-2FA is configured per proxy mapping and runs as
  middleware in the proxy request pipeline. Auth header mode requires
  add_auth_headers = true on the mapping. Proxy handles X-Hexon-* header
  injection on authenticated requests.

authentication.emailotp: Provides OTP generation, delivery, and verification.
  JIT-2FA delegates all OTP operations to emailotp using DeviceID format
  of mappingID:sessionID for cluster-wide tracking. OTP settings (type,
  length, validity, max_retries, resend_time) can be overridden per mapping
  or fall back to global emailotp defaults.

smtp: Email delivery for OTP codes. SMTP health directly affects OTP delivery.
  Check smtp health when OTP emails are not received.

sessions: Session storage via NATS/JetStream. Stores encrypted credentials
  (replay mode) or username/email (header mode). Session TTL governs how long
  authenticated state persists. Session destruction on logout.

protection.ratelimit: Rate limiting for login attempts and OTP submissions.
  Prevents brute-force attacks on both webhook validation and OTP verification.

identity.directory: User identity enrichment. In auth header mode, directory
  attributes populate X-Hexon-* headers (user, email, groups, display name).

config: Per-mapping configuration under [proxy.mapping.jit_2fa]. Webhook,
  OTP, and transport settings are all configurable. Changes require proxy
  mapping reload to take effect.

protection.pow: Related but independent POST body preservation mechanism.
  PoW uses symmetric AES-256-GCM for short-lived form data during proof-of-work
  challenges. JIT-2FA uses asymmetric NaCl box for longer-lived credential
  storage during OTP verification. Both implement split-knowledge security
  but with different threat models and durations.

telemetry: Structured logging for login interceptions, webhook calls, OTP
  events, encryption operations, and session lifecycle. Metrics for monitoring
  JIT-2FA health and usage patterns.

Kerberos Ticket Management & SPNEGO Browser SSO

Kerberos ticket acquisition for SSH bastion (proxy model) and SPNEGO browser authentication (server model)

Overview

The Kerberos module provides pure in-memory Kerberos ticket acquisition and management using gokrb5 v8.4.4 for KDC authentication. Hexon acts as an authentication proxy: users provide credentials, Hexon authenticates to the KDC on their behalf, and stores tickets for SSH jump hosts, proxies, and service delegation.

Core capabilities:

Pure in-memory operation: passwords NEVER touch disk
gokrb5 v8.4.4 pure Go Kerberos implementation (no external binaries for auth)
Pure in-memory TGT extraction from gokrb5 session data
CCache format marshaling (version 4, big-endian, MIT Kerberos compatible)
Automatic encryption at rest via the sessions module
ACL protection for ticket retrieval operations
Memory locking via mlockall(MCL_CURRENT) to prevent password swapping
Comprehensive security audit logging for all operations
Password change support via kpasswd protocol (RFC 3244)
Prometheus metrics for ticket lifecycle monitoring

Architecture highlights:

Hexon is NOT part of the Kerberos realm (no keytab required)
Authenticates to KDC on behalf of users (proxy model)
TGT extracted in-memory from gokrb5 session
CCache bytes built manually in standard MIT Kerberos format
Tickets stored as sessions (Type: "kerberos", ModuleKey: principal)
Ticket lifetime synchronized with session TTL
Compatible with SSH GSSAPI, kinit, klist, and all Kerberos tools

Platform notes:

Memory locking requires CAP_IPC_LOCK capability
Container deployment: use --cap-add=IPC_LOCK
Graceful degradation if memory locking fails (logs warning)

Config

Kerberos module configuration:

  [authentication.kerberos]
  realm = "EXAMPLE.COM"           # Kerberos realm (uppercase by convention)
  kdc = "kdc.example.com"         # Key Distribution Center address
  ticket_ttl = "8h"               # Ticket lifetime (default: 8 hours)
  password_change = true           # Enable kpasswd password change (default: false)
  kpasswd_path = "/usr/bin/kpasswd"  # Optional: override kpasswd binary path

Ticket storage model:

  Tickets are stored as sessions (type: "kerberos") indexed by the Kerberos
  principal (e.g., "alice@EXAMPLE.COM"). Session metadata includes: CCache
  bytes (auto-encrypted), ticket type, realm, principal, creation timestamp,
  and authentication method.

  This provides:
    - Principal-based indexing for fast user lookup
    - Automatic TTL expiration matching Kerberos ticket lifetime
    - Distributed storage with encryption across cluster
    - Cluster-wide ticket access from any node

Password change feature:

  When password_change = true, users can change their Kerberos passwords
  via the ChangePassword operation. Uses standard kpasswd protocol (RFC 3244).
  Password complexity is enforced by the KDC policy, not Hexon.
  All existing tickets are automatically revoked after a successful change.
  Requires kpasswd binary (auto-detected in PATH or specify kpasswd_path).

Hot-reloadable: ticket_ttl, password_change. Cold (restart required): realm, kdc, kpasswd_path.

Troubleshooting

Common symptoms and diagnostic steps:

AcquireTicket fails with authentication error:

  - Verify KDC is reachable: check network connectivity to kdc address
  - Verify realm is correct (must be uppercase by Kerberos convention)
  - Check user credentials: invalid password returns auth_failed
  - KDC clock skew: Kerberos requires clocks within 5 minutes (check NTP)
  - DNS resolution: KDC hostname must resolve correctly

Ticket not found after acquisition:

  - Check session storage health across cluster
  - Verify session TTL has not expired (matches ticket_ttl config)
  - Check cluster quorum status: tickets require quorum for distributed write
  - Verify cluster connectivity between nodes

GetTicket returns access denied:

  - ACLs control which modules can retrieve tickets
  - Only authorized modules (SSH proxy, bastion) should have access
  - Check ACL configuration in the cluster authorization policy
  - Verify the calling module is in the allowed list

SSH GSSAPI authentication fails with valid ticket:

  - Verify CCache format compatibility: use 'klist -c <file>' to inspect
  - Check KRB5CCNAME environment variable is set to the temp file path
  - Verify the ticket principal matches the SSH service principal
  - Check ticket expiration: expired tickets are rejected by SSH server
  - Ensure SSH server has GSSAPIAuthentication enabled

WriteTicketFile fails:

  - Check filesystem permissions for temp directory
  - Verify disk space available for temporary file creation
  - Remember: caller MUST securely delete temp file after use

Reflection errors (TGT extraction):

  - gokrb5 internal structure may change between versions
  - Module is pinned to gokrb5 v8.4.4; do not upgrade without testing
  - Check structured logs for reflection failure messages
  - Fallback behavior may apply if structure changes

Password change fails:

  - Verify password_change = true in configuration
  - Check kpasswd binary availability (auto-detect or kpasswd_path)
  - KDC password policy may reject the new password (complexity requirements)
  - Check structured logs for kpasswd protocol errors
  - Verify KDC supports kpasswd protocol (RFC 3244)

Memory locking warnings:

  - CAP_IPC_LOCK capability required for mlockall
  - Container: add --cap-add=IPC_LOCK to docker run
  - Kubernetes: add IPC_LOCK to securityContext capabilities
  - Without memory locking, passwords may be swapped to disk (security risk)

Ticket lifecycle monitoring:

  - kerberos_ticket_acquisition_total: track acquisition success/failure
  - kerberos_ticket_refresh_total: monitor refresh operations
  - kerberos_ticket_revocation_total: verify revocation operations
  - kerberos_password_change_total: audit password changes

Diagnostic commands:

  - auth kerberos: check Kerberos health and configuration
  - sessions list --type=kerberos: list active Kerberos ticket sessions
  - health components: verify Kerberos subsystem health

Security

Security model and hardening measures:

In-memory password handling:

  Passwords are typed as []byte (not string) to enable secure clearing.
  Every password is cleared immediately after use.
  gokrb5 authenticates with the KDC entirely in memory. Passwords are
  NEVER written to disk, logs, or any persistent storage.
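
The clear-after-use pattern can be illustrated with a minimal sketch, written in Python rather than the module's Go. It assumes a mutable bytearray stands in for []byte; clear_bytes and authenticate_and_clear are hypothetical names, not Hexon APIs, and the verify callback stands in for the KDC exchange.

```python
from typing import Callable


def clear_bytes(buf: bytearray) -> None:
    """Overwrite every byte in place so the secret no longer sits in this buffer."""
    for i in range(len(buf)):
        buf[i] = 0


def authenticate_and_clear(password: bytearray,
                           verify: Callable[[bytearray], bool]) -> bool:
    """Run a verification callback, then clear the password buffer no matter what."""
    try:
        return verify(password)
    finally:
        clear_bytes(password)  # runs on success, failure, and exceptions
```

The try/finally wipes the buffer even when verification raises. A Python runtime may still hold transient copies elsewhere, so this shows the pattern rather than a guarantee.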

Memory locking:

  mlockall(MCL_CURRENT) prevents the process memory (including passwords
  and ticket data) from being swapped to disk. Requires CAP_IPC_LOCK
  capability. Graceful degradation: logs a warning if locking fails but
  continues operating.

Pure in-memory TGT extraction:

  gokrb5 stores the TGT in private internal fields. The module uses reflection
  to read those fields and extract the TGT, session key, timestamps, and renewal
  data. This is version-pinned to gokrb5 v8.4.4, with error handling for
  structural changes.

CCache format security:

  CCache bytes are built manually in standard MIT Kerberos format (version 4,
  big-endian). This ensures compatibility with all Kerberos tools while
  maintaining full control over the byte layout. No external dependencies
  for marshaling.
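
For orientation, the first bytes of a version 4 credential cache can be sketched as follows. The layout follows the MIT ccache file format documentation (two-byte version marker, a header with one time-offset field, then the default principal). This is a Python illustration with hypothetical function names, not the module's marshaling code, and it stops before the credential records.

```python
import struct

KRB5_NT_PRINCIPAL = 1  # standard Kerberos name type for a user principal


def counted(data: bytes) -> bytes:
    """Encode a ccache 'data' field: 32-bit big-endian length, then the bytes."""
    return struct.pack(">I", len(data)) + data


def ccache_prefix(realm: str, components: list,
                  offset_sec: int = 0, offset_usec: int = 0) -> bytes:
    """Build the start of a version 4 ccache: marker, header, default principal."""
    marker = bytes([5, 4])  # first byte is always 5, second is the format version
    # The only defined header field: tag 1, length 8, KDC time offset.
    header_field = struct.pack(">HHii", 1, 8, offset_sec, offset_usec)
    header = struct.pack(">H", len(header_field)) + header_field
    # Principal: name type, component count (realm not counted), realm, components.
    principal = struct.pack(">II", KRB5_NT_PRINCIPAL, len(components))
    principal += counted(realm.encode())
    for comp in components:
        principal += counted(comp.encode())
    return marker + header + principal
```

Every integer is packed with ">" (big-endian), which is what versions 3 and 4 of the format require.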

Sessions encryption:

  CCache bytes stored in session metadata are automatically encrypted at
  rest by the sessions module. No manual encryption is needed. Encryption
  keys are managed by the sessions infrastructure.

Access control (defense in depth):

  - ACLs restrict which modules can retrieve and revoke tickets
  - Typically limited to SSH proxy, bastion, and service delegation modules
  - ACL configuration in the cluster authorization policy
  - Encryption provides second layer even if ACL is misconfigured

Constant-time comparison:

  Uses crypto/subtle.ConstantTimeCompare for security-sensitive comparisons.
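
A comparable primitive in Python is hmac.compare_digest. The sketch below assumes byte-string secrets; tokens_match is a hypothetical helper name used only for illustration.

```python
import hmac


def tokens_match(presented: bytes, expected: bytes) -> bool:
    """Compare two secrets without short-circuiting on the first differing byte."""
    return hmac.compare_digest(presented, expected)
```

An ordinary == comparison can return as soon as one byte differs, which is the timing signal a constant-time comparison is meant to remove.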

Secure file handling:

  WriteTicketFile creates files with 0600 permissions. Secure file deletion
  overwrites with random data before removal. Callers MUST securely delete
  temp files after use.
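
The write-then-securely-delete contract could look roughly like this for a caller. This is a Python sketch under stated assumptions: write_private_file and secure_delete are hypothetical helpers, not the WriteTicketFile implementation, and POSIX file permissions are assumed.

```python
import os
import tempfile
from typing import Optional


def write_private_file(data: bytes, directory: Optional[str] = None) -> str:
    """Write data to a new temp file that only the owner can read or write."""
    fd, path = tempfile.mkstemp(dir=directory)  # mkstemp creates the file owner-only
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path


def secure_delete(path: str) -> None:
    """Overwrite the file with random bytes, flush to disk, then unlink it."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        f.write(os.urandom(size))
        f.flush()
        os.fsync(f.fileno())
    os.remove(path)
```

Overwriting in place is best-effort: copy-on-write filesystems and SSD wear leveling may keep old blocks, so short file lifetimes still matter.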

Audit logging:

  All ticket operations (acquire, access, revoke, password change) are logged
  via the telemetry system with structured fields. Security events logged at
  appropriate severity levels for SIEM integration.

On-behalf-of trust boundary:

  In the proxy model (AcquireTicket), Hexon is NOT part of the Kerberos realm
  and requires no keytab; only SPNEGO validation uses one. Users provide
  credentials directly. The Hexon cluster is the security perimeter.
  Tickets are used for SSH jump hosts, proxies, and delegation.

SPNEGO

SPNEGO/Negotiate browser authentication (server model):

SPNEGO (RFC 4559) enables transparent SSO for domain-joined workstations. When a browser hits a protected route, the gateway challenges with “WWW-Authenticate: Negotiate”; the browser then obtains a service ticket from the KDC and sends it back. The gateway validates the ticket against a keytab file, so no password crosses the wire.

This is the SERVER model, contrasting with the existing PROXY model (AcquireTicket) where Hexon authenticates to the KDC on behalf of users.

Two authentication paths (mirrors the X.509 pattern):

  1. Explicit: /signin/kerberos — user navigates here, browser gets 401
     Negotiate challenge, sends SPNEGO token, session created, redirect.
  2. Auto-SPNEGO: When spnego_auto_auth=true, proxy routes try a Negotiate
     challenge before falling back to OIDC redirect. Uses a marker cookie
     (hexon_spnego_tried, 60s TTL) to prevent infinite 401 loops for
     non-domain browsers.

Configuration:

  [authentication.kerberos]
    spnego_enabled = true
    keytab_path = "/etc/krb5.keytab"         # File path (traditional)
    keytab_base64 = ""                       # Base64 string (K8s/containers)
    service_principal = "HTTP/gw.example.com" # Default: HTTP/<service.hostname>
    spnego_auto_auth = false                 # Transparent SPNEGO on proxy routes
    spnego_exclude_nets = ["10.200.0.0/16"]  # Skip auto-SPNEGO for external nets
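
The auto-SPNEGO decision described above can be sketched as a small predicate. This is an illustrative Python sketch, not the gateway's code: should_challenge and challenge_headers are hypothetical names, and the cookie value and attributes are assumptions (only the cookie name and the 60s TTL come from the text).

```python
import ipaddress
from typing import Dict, Iterable, Mapping

MARKER_COOKIE = "hexon_spnego_tried"  # cookie name from the text above
MARKER_TTL_SECONDS = 60


def should_challenge(client_ip: str, cookies: Mapping[str, str],
                     auto_auth: bool, exclude_nets: Iterable[str]) -> bool:
    """Decide whether to send a Negotiate challenge before the OIDC fallback."""
    if not auto_auth:
        return False
    if MARKER_COOKIE in cookies:  # already tried once: avoid a 401 loop
        return False
    ip = ipaddress.ip_address(client_ip)
    for net in exclude_nets:
        if ip in ipaddress.ip_network(net):
            return False
    return True


def challenge_headers() -> Dict[str, str]:
    """Headers for the 401 response: the challenge plus the loop-guard cookie."""
    cookie = f"{MARKER_COOKIE}=1; Max-Age={MARKER_TTL_SECONDS}; Path=/; HttpOnly"
    return {"WWW-Authenticate": "Negotiate", "Set-Cookie": cookie}
```

A browser that cannot answer the challenge comes back with the marker cookie set, so the second request skips the challenge and takes the OIDC redirect.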

Keytab setup (FreeIPA example):

  ipa service-add HTTP/gateway.example.com
  ipa-getkeytab -s ipa.example.com -p HTTP/gateway.example.com -k /etc/krb5.keytab
  chmod 0600 /etc/krb5.keytab

Browser compatibility:

  - Chrome/Edge (Windows/macOS): automatic for domain-joined machines
  - Firefox: requires network.negotiate-auth.trusted-uris configuration
  - Safari (macOS): uses system Kerberos ticket
  - Mobile browsers: no SPNEGO support; authentication falls through to OIDC/password

Troubleshooting:

  - "keytab unavailable": check keytab_path permissions (should be 0600)
  - SPNEGO token unmarshal fails: token may not be a valid SPNEGO token
  - Auth failure: check SPN matches keytab (klist -k /etc/krb5.keytab)
  - Clock skew: Kerberos requires clocks within 5 minutes (check NTP)
  - Non-domain browser loop: hexon_spnego_tried cookie should prevent it
  - "user disabled": valid Kerberos ticket but user disabled in directory

Relationships

Module dependencies and interactions:

SSH bastion: Primary consumer. SSH bastion uses Kerberos tickets for
  GSSAPI authentication to target hosts. Retrieves tickets via GetTicket
  and sets KRB5CCNAME for SSH connections. Writes temp files via
  WriteTicketFile for tools requiring file-based credential caches.

Sessions: Distributed ticket storage with automatic encryption at rest.
  Sessions provide TTL expiration, cluster-wide replication,
  principal-based indexing via ModuleKey, and atomic operations.

Directory: User identity verification. Directory provides the canonical
  username and group memberships used in ticket principal construction
  and access control decisions.

Cluster: ACL definitions control which modules can retrieve tickets.

Config: Hot-reloadable configuration for ticket_ttl and password_change.
  Realm, KDC address, and kpasswd_path require restart.

Telemetry: Security audit logging for all ticket operations. Metrics
  exported as Prometheus counters for monitoring ticket lifecycle, KDC
  health, authentication failures, and password change operations.

External dependency: gokrb5 v8.4.4 for pure Go Kerberos protocol.
  Version-pinned due to in-memory TGT extraction from internal fields.

External dependency: kpasswd binary for password change operations
  (auto-detected in PATH or configured via kpasswd_path).

LDAP Authentication

LDAP bind-based username/password authentication with directory cache integration and pre-flight account status checks

Overview

The LDAP authentication module provides username/password verification by performing LDAP bind operations against configured directory servers. It acts as a bridge between the directory cache (for fast pre-flight checks) and the LDAP provider (for live password verification).

Core capabilities:

LDAP bind authentication (no local password storage)
Pre-flight account status checks via directory cache (disabled, expired)
Group membership retrieval from directory cache
Full user profile enrichment on successful authentication (email, name, groups)
Graceful degradation when directory details unavailable after successful bind
Prometheus metrics for authentication success/failure with labeled reasons
Stateless operation suitable for any cluster node

Authentication flow (5-step pipeline):

  1. Input validation: trim username, reject empty fields
  2. Directory status check: existence, disabled, password expiry (via directory cache)
  3. LDAP bind: live password verification against LDAP server
  4. User details retrieval: full profile from directory cache
  5. Response construction: comprehensive result with user metadata

The module never stores, caches, or logs passwords. Every authentication attempt requires a live LDAP bind, ensuring password policy enforcement is always delegated to the LDAP server (lockouts, complexity, expiry).

Failure reasons returned in AuthenticateResponse.Reason:

  - "username required" / "password required" (input validation)
  - "user not found" (not in directory cache)
  - "account disabled" / "password expired" (pre-flight status)
  - "invalid credentials" (LDAP bind failed)
  - "directory unavailable" / "authentication service unavailable" (module errors)
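
The pipeline and its failure reasons can be sketched in a few lines. This is a simplified Python illustration, not the module's implementation: CachedUser, AuthResult, and the lookup/bind callbacks are hypothetical stand-ins for the directory cache and the LDAP provider, and the infrastructure error reasons are omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class CachedUser:  # hypothetical shape of a directory cache entry
    username: str
    email: str = ""
    groups: list = field(default_factory=list)
    disabled: bool = False
    password_expired: bool = False


@dataclass
class AuthResult:
    success: bool
    reason: str = ""
    user: Optional[CachedUser] = None


def authenticate(username: str, password: str,
                 lookup: Callable[[str], Optional[CachedUser]],
                 bind: Callable[[str, str], bool]) -> AuthResult:
    username = username.strip()                  # 1. input validation
    if not username:
        return AuthResult(False, "username required")
    if not password:
        return AuthResult(False, "password required")
    user = lookup(username)                      # 2. directory status check
    if user is None:
        return AuthResult(False, "user not found")
    if user.disabled:
        return AuthResult(False, "account disabled")
    if user.password_expired:
        return AuthResult(False, "password expired")
    if not bind(username, password):             # 3. live LDAP bind
        return AuthResult(False, "invalid credentials")
    return AuthResult(True, user=user)           # 4-5. profile and response
```

For brevity the sketch reuses the entry fetched in step 2 as the profile for step 4; the text describes these as separate directory cache reads.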

Config

The LDAP authentication module itself has no dedicated configuration section. It depends entirely on configuration from two upstream modules:

Directory module [directory]:

  url = "ldaps://ldap.example.com:636"    # LDAP server URL
  bind_dn = "cn=svc,dc=example,dc=com"   # Service account for searches
  bind_password = "secret"                # Service account password
  user_base = "ou=users,dc=example,dc=com"  # User search base DN
  group_base = "ou=groups,dc=example,dc=com" # Group search base DN
  sync_interval = "5m"                    # Delta sync interval (default: 5m)
  full_sync_interval = "60m"              # Full sync interval (default: 60m)

LDAP provider module [ldap]:

  url = "ldaps://ldap.example.com:636"    # LDAP server URL for bind operations
  bind_dn = "cn=svc,dc=example,dc=com"   # Service account DN
  user_base = "ou=users,dc=example,dc=com" # User search base for DN resolution
  user_filter = "(uid=%s)"               # User lookup filter (%s = username)
  user_attribute = "uid"                  # Username attribute (uid, sAMAccountName)

Active Directory considerations:

  - Use user_attribute = "sAMAccountName" for AD environments
  - Use user_filter = "(sAMAccountName=%s)" for AD user lookups
  - AD lockout policies enforced server-side via LDAP bind
  - Password expiry detected via directory cache sync

Connection pooling is managed by the LDAP provider module, not this module. LDAP bind operations reuse pooled connections for reduced overhead.

Cache staleness window:

  - Account status changes (disable, expiry) reflected within sync_interval
  - Default: up to 5 minutes delay for status changes to propagate
  - Full sync ensures eventual consistency every 60 minutes
  - Immediate effect: password changes always verified live via LDAP bind

Troubleshooting

Common symptoms and diagnostic steps:

User gets “Invalid username or password” but credentials are correct:

  - Run 'diagnose user <username>' to check cross-subsystem status
  - Run 'directory user <username>' to verify user exists in cache
  - Check directory sync status: 'directory status' for last sync time
  - If user recently created, wait for sync or trigger manual sync
  - Verify LDAP server reachability: 'auth ldap' for connection health
  - Check if account locked in LDAP (server-side lockout policy)
  - Verify user_attribute matches LDAP schema (uid vs sAMAccountName)

User gets “account disabled” but account is active in LDAP:

  - Directory cache may be stale; check last sync: 'directory status'
  - Trigger manual sync: 'directory sync <username>' to refresh user
  - Verify the disabled attribute mapping in directory config
  - Check delta sync interval (default 5m) for expected propagation delay

User gets “password expired” unexpectedly:

  - Verify password expiry attribute mapping in directory config
  - Check LDAP password policy (ppolicy overlay or AD fine-grained policy)
  - Trigger user sync to refresh expiry status: 'directory sync <username>'

Authentication returns “directory unavailable”:

  - Check directory module health: 'directory status'
  - Verify cluster bridge status: 'cluster status'
  - Check LDAP server connectivity: 'auth ldap'
  - Review logs: 'logs search "directory"' for connection errors
  - Verify directory module is registered and running

Authentication returns “authentication service unavailable”:

  - Check LDAP provider module health: 'auth ldap'
  - Verify LDAP server URL and port in configuration
  - Check TLS certificate validity for ldaps:// connections
  - Test LDAP connectivity: 'net tcp <ldap-host>:636 --tls'
  - Review logs: 'logs search "ldap"' for bind or connection errors
  - Check connection pool: 'connpool stats' for pool exhaustion

Authentication is slow:

  - LDAP bind is the slow path (50-200ms typical, network dependent)
  - Check LDAP server latency: 'net latency <ldap-host>:636 --tls'
  - Verify connection pooling is working: 'connpool pools'
  - High latency indicates LDAP server load or network issues
  - Directory cache lookups should be <5ms (fast path)

All logins failing simultaneously:

  - LDAP server down: 'auth ldap' for health status
  - Network partition: 'net tcp <ldap-host>:636' for connectivity
  - TLS certificate expired: 'net tls <ldap-host>:636' to inspect cert
  - DNS failure: 'dns test <ldap-hostname>' for resolution check
  - Check cluster health: 'health status' for node-level issues

Metrics for monitoring:

  - ldap_authentication_total{result="success"} -- successful logins
  - ldap_authentication_total{result="failure",reason="invalid_credentials"} -- wrong passwords
  - ldap_authentication_total{result="failure",reason="user_not_found"} -- unknown users
  - ldap_authentication_total{result="failure",reason="account_disabled"} -- disabled accounts
  - ldap_authentication_total{result="failure",reason="directory_unavailable"} -- infra issues
  - ldap_authentication_total{result="failure",reason="ldap_unavailable"} -- LDAP down
  - Spike in invalid_credentials may indicate brute force or credential stuffing
  - Spike in directory_unavailable or ldap_unavailable indicates infrastructure problems

Security

Password handling and credential security:

No local password storage:

  Passwords are never stored, cached, or hashed locally. Every authentication
  requires a live LDAP bind, eliminating the risk of a local password database
  compromise. No password appears in logs, telemetry, metrics, or response objects.

Pre-authentication checks (fail-fast security):

  Account status is verified BEFORE attempting LDAP bind. This prevents
  unnecessary LDAP queries for disabled or expired accounts, reducing load on
  the LDAP server and providing faster rejection of invalid accounts.
  Evaluation order: existence -> disabled -> expired -> LDAP bind.

Enumeration prevention:

  The module returns distinct internal reasons ("user not found" vs "invalid
  credentials") but consuming services MUST map these to a generic message
  (e.g., "Invalid username or password") to prevent username enumeration.
  Timing is kept consistent: directory cache lookups are fast (<5ms) regardless
  of user existence. The module itself does not expose any public API that
  reveals user existence.
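
A consuming service might apply the mapping like this. The sketch is in Python and assumes only the reason strings listed for AuthenticateResponse.Reason; user_facing_message and the wording of the outage message are hypothetical.

```python
GENERIC_FAILURE = "Invalid username or password"

# Reasons that would reveal whether a username exists if shown verbatim.
ENUMERATION_SENSITIVE = {"user not found", "invalid credentials"}

INFRASTRUCTURE = {"directory unavailable", "authentication service unavailable"}


def user_facing_message(reason: str) -> str:
    """Collapse enumeration-sensitive reasons into one generic message."""
    if reason in ENUMERATION_SENSITIVE:
        return GENERIC_FAILURE
    if reason in INFRASTRUCTURE:
        return "Sign-in is temporarily unavailable"  # hypothetical wording
    return reason
```

Whether "account disabled" and "password expired" should also be collapsed is a policy choice for the consuming service; the sketch passes them through unchanged.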

Brute force and credential stuffing:

  Account lockout is delegated to the LDAP server's password policy (ppolicy
  overlay or Active Directory lockout settings). The module does not implement
  its own lockout or rate limiting. Consuming services (signin, proxy auth)
  should implement:
  - Per-IP rate limiting (recommended: 10 attempts/minute)
  - Per-username rate limiting (recommended: 5 attempts/minute)
  - CAPTCHA after repeated failures
  - Device fingerprinting for anomaly detection

Injection prevention:

  Username is trimmed of whitespace before use. LDAP filter escaping is handled
  by the downstream LDAP provider module. There are no local database queries
  or command executions, eliminating SQL injection and command injection
  vectors entirely.
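
To show what filter escaping involves, here is a minimal Python sketch of the RFC 4515 rules for filter values. It is not the LDAP provider's code; escape_filter_value and build_user_filter are hypothetical names, and the "%s" template style is taken from the user_filter setting.

```python
# Characters RFC 4515 requires to be escaped inside a filter value.
_ESCAPES = {"\\": r"\5c", "*": r"\2a", "(": r"\28", ")": r"\29", "\x00": r"\00"}


def escape_filter_value(value: str) -> str:
    """Replace each reserved character with its backslash-hex escape."""
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def build_user_filter(template: str, username: str) -> str:
    """Fill a '(uid=%s)'-style template with a trimmed, escaped username."""
    return template.replace("%s", escape_filter_value(username.strip()))
```

With escaping in place, an input such as *)(uid=* is matched literally instead of widening the search filter.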

Password policy enforcement:

  All password complexity, history, and rotation requirements are enforced by the
  LDAP server. The module reports password expiry status from the directory cache
  but does not enforce policies locally. This ensures a single source of truth
  for password policy (the LDAP directory).

Credential logging policy:

  Debug level: username and operation stage (never password)
  Info level: successful authentication with username and groups
  Warn level: authentication failures with reason (never password)
  Error level: infrastructure failures with error details
  Never logged: password, email (unless required for specific audit)

Memory safety:

  Password memory clearing after LDAP bind is handled by the LDAP provider
  module. The authentication module passes the password through to the bind
  operation and does not retain references after the call completes.

Relationships

Module dependencies and interactions:

Directory: Primary dependency for pre-flight checks. Provides cached user
  metadata for account status checks (existence, disabled, expired, groups)
  and full profile retrieval (email, name). Directory cache is synced from
  LDAP on configurable intervals (delta: 5m, full: 60m). Cache staleness
  determines the window for status change propagation.

LDAP provider: Primary dependency for password verification. Performs LDAP
  bind operations for username/password verification. Manages LDAP
  connection pooling, user DN resolution, and TLS negotiation. Bind
  success/failure is the authoritative password check.

Sign-in service: Primary consumer. The sign-in flow engine calls ldapauth
  Authenticate as part of the username/password authentication stage. The flow
  engine maps internal failure reasons to user-facing messages and manages
  session creation on success.

Reverse proxy: Consumer for proxy authentication. HTTP proxied applications
  can require LDAP authentication via proxy auth provider configuration. Uses
  the same Authenticate operation with credentials from Basic Auth or form POST.

Telemetry: All operations logged with structured fields (username, groups,
  error, type). Prometheus metrics exported for authentication success/failure
  counts with reason labels. Metrics enable real-time monitoring, security
  event detection, and capacity planning.

Cluster: All operations are node-local with no cluster coordination required.
  The module is stateless and does not require session affinity or leader election.

Rate limiting: Not directly integrated. Rate limiting for authentication
  endpoints should be configured at the service layer (signin, proxy) using the
  rate limit module. Recommended: per-IP and per-username limits.

Sessions: On successful authentication, the consuming service creates a session
  with the returned user metadata (username, email, groups). Session lifecycle is
  managed by the session module, not the authentication module.

Cluster behavior:

  Fully stateless -- no local state, no cluster coordination required. All state
  lives in the directory cache (distributed via NATS/JetStream) and the LDAP
  server. Any cluster node can handle authentication independently. No session
  affinity needed. Directory cache consistency is bounded by sync intervals.

Magic Link Authentication

Passwordless sign-in via email magic links with cross-device support

Overview

The magic link module implements passwordless authentication by sending a sign-in link to the user’s email address. Users click the link to authenticate without entering a password or code.

Core capabilities:

Passwordless authentication via email-delivered links
Cross-device support: request link on one device, click on another
Three verification actions: authorize (remote), sign-in-here (local), deny
Anti-enumeration: identical response shape regardless of email validity
Session-based tokens (UUID v4: 128-bit identifier with 122 random bits)
Atomic single-use via cluster-wide session revocation
Per-IP and per-email rate limiting to prevent abuse and inbox flooding
Directory re-validation at verify time (disabled users cannot complete auth)
PreVerify is read-only (safe from link-preview bots consuming tokens)
Confirmation page shows request context (IP, location, browser) for phishing detection
Geo-enriched emails showing request origin for user awareness

Flow summary:

  1. User enters email on /signin/magiclink
  2. Module creates device code pair with geo context in AdditionalData
  3. If email matches an active directory user, a "magiclink" session is
     created (cluster-replicated) containing user info and device code key
  4. Email sent with link: /signin/magiclink/verify?token=<SESSION_ID>
  5. Frontend polls /api/signin/magiclink/poll with the device_code
  6. User clicks link in email (possibly on a different device)
  7. PreVerify validates session (read-only) and renders confirmation page
     showing destination, browser, IP, and geographic location
  8. User chooses: Authorize, Sign in here, or Deny
  9. Verify revokes session (atomic single-use) and acts on device code
  10. Polling returns "authorized", "completed_elsewhere", or "denied"

The module reuses the device code module (RFC 8628) for the polling mechanism and the sessions module for cluster-replicated token storage.

Config

Magic link is configured under the signin service section:

[service.signin.magiclink]

  enabled = true              # Master switch (default: false)
  code_ttl = "10m"            # Link validity duration (default: 10 minutes)
  rate_limit = "5/1m"         # Per-IP rate limit (default: 5 per minute)
  rate_limit_email = "3/10m"  # Per-email rate limit (default: 3 per 10 minutes)

Prerequisites:

  - SMTP must be configured for email delivery
  - Device code module is auto-enabled when magic link is activated
  - Directory module must be available for user lookup by email

UI integration:

  When enabled, sign-in templates render a "Send me a sign in link" text link
  below the secondary method buttons. Magic link is NOT injected into the
  secondary methods array. It appears as a separate, lower-emphasis option
  via the "magiclink_enabled" template variable. Operators only need to set
  enabled = true; the link appears on all sign-in pages (passkey, password,
  x509) automatically.

Rate limiting behavior:

  - Per-IP limit (rate_limit): returns error "rate_limited" when exceeded,
    service responds with HTTP 429
  - Per-email limit (rate_limit_email): silently creates orphaned device code
    as decoy (anti-enumeration), no email sent
  - Both limits reset on their respective sliding windows
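
The two limits can be modeled with a small sliding-window limiter. This Python sketch assumes the "count/duration" form seen in the config values ("5/1m", "3/10m"); parse_rate and SlidingWindowLimiter are hypothetical names, and the real rate limit module may accept other formats or use a different algorithm.

```python
import re
from collections import defaultdict, deque
from typing import Tuple

_UNITS = {"s": 1, "m": 60, "h": 3600}


def parse_rate(spec: str) -> Tuple[int, int]:
    """Parse 'count/window' such as '5/1m' into (count, window_seconds)."""
    m = re.fullmatch(r"(\d+)/(\d+)([smh])", spec)
    if m is None:
        raise ValueError(f"bad rate limit spec: {spec!r}")
    return int(m.group(1)), int(m.group(2)) * _UNITS[m.group(3)]


class SlidingWindowLimiter:
    """Allow at most `limit` events per key within any `window`-second span."""

    def __init__(self, spec: str):
        self.limit, self.window = parse_rate(spec)
        self._events = defaultdict(deque)

    def allow(self, key: str, now: float) -> bool:
        q = self._events[key]
        while q and q[0] <= now - self.window:  # drop events outside the window
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True
```

In the flow described above, a False result from the per-IP limiter would map to HTTP 429, while a False result from the per-email limiter would take the silent decoy path.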

Anti-enumeration design:

  Initiate always returns the same response shape (DeviceCode + ExpiresIn)
  regardless of whether the email exists, is disabled, or is rate-limited.
  When the email is invalid or per-email rate-limited, a real but orphaned
  device code is created as a decoy so timing and response structure are
  identical. The frontend polls normally and eventually gets "expired",
  which is indistinguishable from a valid request where the user never
  clicked the link.
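
The decoy behavior can be sketched as an Initiate function that always mints a device code. This is an illustrative Python sketch: InitiateResponse, initiate, and the three callbacks are hypothetical stand-ins for the directory lookup, the per-email limiter, and SMTP delivery.

```python
import secrets
import uuid
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class InitiateResponse:
    device_code: str
    expires_in: int


def initiate(email: str,
             find_active_user: Callable[[str], Optional[str]],
             email_allowed: Callable[[str], bool],
             send_link: Callable[[str, str], None],
             ttl_seconds: int = 600) -> InitiateResponse:
    """Always mint a device code; only valid, non-throttled emails get a link."""
    device_code = secrets.token_urlsafe(32)  # created in every branch (real or decoy)
    user = find_active_user(email)           # None for unknown or disabled users
    if user is not None and email_allowed(email):
        token = str(uuid.uuid4())            # session ID doubling as the link token
        send_link(email, token)
    # Same response shape whether or not an email was sent.
    return InitiateResponse(device_code, ttl_seconds)
```

A production version would also need email delivery to run off the request path, since a synchronous send would make the valid-email branch measurably slower.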

Hot-reloadable: code_ttl, rate_limit, rate_limit_email. Cold (restart required): enabled.

Troubleshooting

Common symptoms and diagnostic steps:

User never receives magic link email:

  - Check SMTP health: 'smtp health' to verify email delivery is working
  - Verify email belongs to an active directory user
  - Check per-email rate limit: silent suppression after 3/10m (no error shown)
  - Check spam/junk folders for the magic link email
  - Verify the user's email address in directory matches what was entered
  - Check structured logs for SMTP delivery errors

Magic link says “expired” or “invalid” when clicked:

  - Default TTL is 10 minutes; check if user clicked in time
  - Token is single-use: clicking a second time returns "already consumed"
  - Check cluster time synchronization (NTP) across nodes
  - Verify session replication health across cluster

Polling ends in “expired” although no email ever arrived (anti-enumeration):

  - This is expected behavior for non-existent emails (by design)
  - Per-email rate limit exceeded: creates orphaned decoy device code
  - User disabled in directory: treated as non-existent (anti-enumeration)
  - No way to distinguish from legitimate "user never clicked" scenario

“completed_elsewhere” status on polling device:

  - User chose "Sign in here" on the verifying device (the device where
    they clicked the email link)
  - This is intentional: the session was created on the verifying device only
  - Polling browser displays a friendly message, not an error
  - Detected via a cluster-wide signal for cross-device coordination

Confirmation page shows wrong location or IP:

  - Geo data comes from the GeoAccess module's IP-to-country/ASN lookup
  - Check geo database freshness and availability
  - Proxy or CDN may mask the original client IP
  - X-Forwarded-For header processing depends on trusted proxy configuration

Rate limiting triggered unexpectedly:

  - Per-IP limit: 5 requests per minute (shared across all emails from one IP)
  - Per-email limit: 3 requests per 10 minutes (shared across all IPs)
  - Corporate NAT may cause many users to share one IP
  - Adjust rate_limit and rate_limit_email in config as needed

Magic link feature not visible on sign-in page:

  - Verify enabled = true in [service.signin.magiclink]
  - Check that SMTP is configured (prerequisite)
  - Template variable "magiclink_enabled" drives visibility
  - The link appears below secondary method buttons, not in the methods array

Diagnostic commands:

  - smtp health: verify email delivery subsystem
  - auth status: check authentication system overview
  - sessions list --type=magiclink: list active magic link sessions
  - health components: verify magic link subsystem health

Security

Security features and hardening measures:

Token entropy:

  Magic link tokens are session IDs (UUID v4). A version 4 UUID is 128 bits
  long, of which 122 bits are random; the remaining 6 encode the version and
  variant. The session ID doubles as the magic link token in the verification
  URL, providing sufficient randomness to resist brute-force guessing.

Single-use enforcement:

  Tokens are consumed via atomic session revocation (replicated to all nodes).
  Once revoked, the token cannot be reused. Double-click on the verification
  link returns AlreadyDone=true (idempotent, no error).
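
The consume-once semantics can be modeled in a single process as follows. This Python sketch uses a lock where the real system relies on cluster-wide session revocation; TokenStore and its methods are hypothetical names.

```python
import threading


class TokenStore:
    """In-memory stand-in for the replicated session store (single process only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._live = {}
        self._consumed = set()

    def create(self, token: str, payload: dict) -> None:
        with self._lock:
            self._live[token] = payload

    def consume(self, token: str):
        """Return (payload, already_done). Only the first caller gets the payload."""
        with self._lock:
            if token in self._live:
                self._consumed.add(token)
                return self._live.pop(token), False
            return None, token in self._consumed
```

The check and the removal happen under one lock, so two concurrent clicks cannot both receive the payload; the later one sees already_done=True.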

Anti-enumeration:

  The Initiate operation returns identical response structure regardless of
  whether the email exists, the user is disabled, or per-email rate limit
  is exceeded. Orphaned device codes serve as timing-identical decoys.
  This prevents attackers from using magic link requests to discover valid
  email addresses in the directory.

Directory re-validation:

  At Verify time, the module re-validates the user against the directory.
  If the user has been disabled between Initiate and Verify, authentication
  fails. This prevents race conditions where an admin disables a user who
  already has a pending magic link.

Link-preview bot protection:

  PreVerify (GET request when link is clicked) is read-only and does not
  consume the token. Link-preview bots that fetch URLs in emails cannot
  accidentally authorize or deny the request.

Phishing detection:

  The confirmation page displays the request context (source IP, browser
  User-Agent, country, ISP/ASN) so the user can verify whether they
  initiated the request. Suspicious requests can be denied.

Cross-device security:

  The "sign-in-here" action denies the device code and stores a signal,
  so the polling browser sees "completed_elsewhere" rather than "authorized".
  This prevents unintended sessions on the original (potentially shared) device.

Rate limiting:

  - Per-IP: prevents abuse from a single source (default: 5/1m)
  - Per-email: prevents inbox flooding for a target user (default: 3/10m)
  - Per-email limit is silent (anti-enumeration): no error, decoy created

Relationships

Module dependencies and interactions:

Device code: Core dependency. Provides RFC 8628 device code pair generation
  and polling infrastructure. Magic link auto-enables device code when
  activated. Device code handles the polling lifecycle; magic link provides
  the email-based authorization trigger.

Sessions: Cluster-replicated token storage. Magic link tokens are stored as
  sessions with automatic TTL cleanup and atomic single-use via cluster-wide
  revocation. Session metadata contains user info, device code key, and
  request context.

Directory: User lookup by email at Initiate time and re-validation at Verify
  time. Disabled users are treated as non-existent (anti-enumeration).
  Directory provides canonical user attributes (username, email, full name,
  groups) stored in session metadata.

SMTP: HTML/text magic link email delivery. SMTP must be configured as a
  prerequisite for magic link functionality.

Geo access: IP-to-country and ASN lookup for email context and confirmation
  page display. Helps users detect phishing attempts.

Rate limiting: Per-IP and per-email request throttling. Per-IP returns
  HTTP 429; per-email silently creates decoy (anti-enumeration).

Config: Runtime access to [service.signin.magiclink] settings. Hot-reload
  supported for TTL and rate limit values.

Sign-in service: HTTP handlers for /signin/magiclink routes and
  /api/signin/magiclink/poll that delegate to this module.

Distributed memory cache: Stores cross-device flow coordination signals so
  the polling browser knows when authentication completed elsewhere.

OIDC Provider

OpenID Connect provider with OAuth 2.0, PKCE, DPoP, mTLS, PAR, DCR, M2M, and Personal Access Tokens

Overview

The OIDC module implements a full OpenID Connect 1.0 provider with OAuth 2.0 authorization server capabilities. It serves as the central authentication hub for all Hexon services including proxy SSO, bastion SSH device authorization, and machine-to-machine workload authentication.

Core capabilities:

Authorization Code Flow with PKCE (RFC 7636) for user authentication
OIDC Core §3.1.2.1 prompt parameter (none, login, consent)
OIDC Core §3.1.2.1 max_age parameter with session age enforcement
OIDC Core §3.1.3.6 at_hash claim in ID tokens
OIDC Core §2 auth_time claim from real session creation time
Dynamic ACR/AMR claims based on actual authentication method (RFC 8176)
Response mode support: query (default) and form_post (OAuth 2.0 Form Post)
Per-client skip_consent for trusted first-party applications
Pushed Authorization Requests (RFC 9126) for enhanced security
DPoP token binding (RFC 9449) with nonce-based replay protection
Mutual TLS client authentication and certificate-bound tokens (RFC 8705)
Client Credentials Grant (RFC 6749 Section 4.4) for M2M service auth
JWT Bearer Grant (RFC 7523) for certificate-based M2M authentication
Dynamic Client Registration (RFC 7591) for native OAuth clients
Device Authorization Grant (RFC 8628) for headless device flows
Token introspection (RFC 7662) and revocation (RFC 7009)
UserInfo endpoint (OIDC Core Section 5.3)
JWKS and OpenID Configuration discovery endpoints
Optional PKCE plain method deprecation (OAuth 2.1 hardening)
Deterministic Ed25519 key derivation for cluster-wide key consistency
Distributed token storage with quorum-based replication
Personal Access Tokens (PATs) for CLI, CI, proxy access, and automation use
Token introspection supports access tokens, refresh tokens, and PATs

All operations are cluster-wide — token storage, revocation, and key management are replicated across all nodes automatically.

A built-in proxy SSO client provides unified single sign-on for all proxy mappings. Its redirect URIs are validated against live proxy configuration to prevent open redirect attacks. This client is managed automatically and does not appear in the TOML configuration.

Config

Core configuration under [authentication.oidc]:

[authentication.oidc]

  signing_key = "..."              # REQUIRED: Min 32 chars, used for deterministic key derivation via HKDF
  signing_algorithm = "ES256"      # ES256 (default), ES384, ES512, or EdDSA
                                   # MUST be identical across all cluster nodes
  hostname = "auth.example.com"    # REQUIRED: OIDC issuer URL (appears in token claims and discovery)
  enable_test_callback = false     # Enable test callback URL (NEVER enable in production)
  dpop_proactive_nonce = true      # Send DPoP-Nonce header in all token responses (default: true)
  par_ttl = "5m"                   # PAR request_uri TTL (range: 1m-10m per RFC 9126)
  enable_dcr = false               # Enable Dynamic Client Registration (RFC 7591)
  rate_limit_dcr = "10/1m"         # DCR endpoint rate limit per IP
  allow_dcr_from = []              # CIDR allowlist for DCR (empty = allow all)
  allow_dcr_redirect_domains = [] # Allowed redirect URI domains (loopback always allowed, supports *.example.com)
  disable_plain_pkce = false       # Reject "plain" PKCE method (OAuth 2.1 hardening, S256 only when true)
  pat_enabled = false              # Master switch for PATs (default: disabled)
  pat_max_ttl = "2160h"            # Maximum PAT lifetime (default 90 days, max 365 days)
  pat_max_per_user = 10            # Maximum PATs per user (default 10)
  pat_required_groups = []         # Groups allowed to create PATs (empty = any authenticated user)

[[authentication.oidc.clients]]

  name = "my-app"                  # REQUIRED: Client identifier (used as client_id)
  clientsecret = "..."             # Min 32 chars with entropy validation (omit for public/mTLS clients)
  redirect_urls = ["https://..."]  # REQUIRED: Allowed redirect URIs (strict validation, wildcard support)
  origin_urls = ["https://..."]    # Allowed CORS origins
  allowed_scopes = ["openid", "profile", "email", "groups"]  # Permitted scopes
  allowed_grant_types = ["authorization_code", "refresh_token"]  # Default grant types
  require_pkce = false             # Enforce PKCE (MUST be true for public clients)
  skip_consent = false             # Skip consent screen for trusted first-party clients
  allow_client_from = ["0.0.0.0/0"]  # IP allowlist in CIDR notation
  client_credentials_ttl = "1h"   # Access token TTL for client_credentials grant

  # mTLS configuration (RFC 8705)
  token_endpoint_auth_method = "tls_client_auth"  # Enable mTLS client auth
  tls_client_auth_san_uri = "spiffe://..."         # URI SAN identity (SPIFFE)
  tls_client_auth_san_dns = "service.local"        # DNS SAN identity
  tls_client_auth_san_email = "svc@example.com"    # Email SAN identity
  tls_client_auth_subject_dn = "CN=service"        # Subject DN identity
  certificate_bound_tokens = true                  # Bind tokens to client certificate
  client_ca_pem = "/path/to/ca.pem"               # Per-client CA trust (inline PEM or file path)

  # JWT Bearer configuration (RFC 7523)
  jwt_public_key = "-----BEGIN PUBLIC KEY-----..."  # Public key for JWT assertion verification
  jwt_algorithm = "RS256"          # RS256/384/512, ES256/384/512, EdDSA
  jwt_issuer = "service-name"      # Expected issuer claim
  jwt_subject = "service-name"     # Expected subject claim

  # Scope-to-group mapping for M2M authorization
  scope_group_mapping = { "api:read" = ["readers"], "api:write" = ["writers"] }

Token storage and TTL defaults:

  Authorization codes:  10 minutes, single-use
  Access tokens:        1 hour (configurable), replicated cluster-wide
  Refresh tokens:       30 days (configurable), replicated cluster-wide
  DPoP JTIs:            120 seconds, replicated cluster-wide (best-effort)
  DPoP nonces:          60 seconds, single-use, replicated cluster-wide (best-effort)
  PAR requests:         5 minutes (configurable 1-10m), replicated cluster-wide (best-effort)
  PAT sessions:         up to pat_max_ttl (default 90 days), managed by sessions module

Key management:

  Signing keys are derived deterministically from the signing_key using
  HKDF (RFC 5869). Supports ES256 (ECDSA P-256,
  default), ES384 (P-384), ES512 (P-521), and EdDSA (Ed25519). All cluster
  nodes derive identical keypairs from the same signing_key, requiring no
  key synchronization. Keys remain stable across restarts.
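The determinism property can be illustrated with a minimal HKDF-SHA256 sketch following RFC 5869. The info label, the empty salt, and the 32-byte output length are assumptions made for illustration; the module's actual derivation context and the step that turns the output into an ECDSA or Ed25519 keypair are not described here.

```python
import hashlib
import hmac


def hkdf_sha256(ikm: bytes, info: bytes, length: int = 32, salt: bytes = b"") -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract, then expand to `length` bytes."""
    hash_len = hashlib.sha256().digest_size
    if length > 255 * hash_len:
        raise ValueError("requested output too long for HKDF-SHA256")
    # Extract: an empty salt is replaced by hash_len zero bytes.
    prk = hmac.new(salt or b"\x00" * hash_len, ikm, hashlib.sha256).digest()
    # Expand: chain HMAC blocks T(1), T(2), ... until enough output exists.
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


# Hypothetical label; the real derivation context is not specified in this document.
INFO = b"example-oidc-signing-key"

signing_key = b"an-example-signing-key-of-at-least-32-chars"
node_a_seed = hkdf_sha256(signing_key, INFO)  # computed on one node
node_b_seed = hkdf_sha256(signing_key, INFO)  # computed independently on another
```

Because the output depends only on signing_key and the fixed label, both nodes arrive at the same seed without exchanging any key material.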

Hot-reloadable: client configurations, scopes, redirect URIs, IP allowlists. Cold (restart required): signing_key, signing_algorithm, hostname (issuer URL).

Security

Token signing:

  ID tokens signed with configurable algorithm: ES256 (default), ES384, ES512, or EdDSA.
  ES256/384/512 are compatible with Kubernetes kube-apiserver --oidc-signing-algs.
  Two signing modes with automatic failover:
    Threshold: distributed key — no single node holds the full private key (requires cluster quorum).
    Deterministic fallback: all nodes derive identical keys from signing_key for cross-node consistency.
  The module auto-switches between modes based on cluster health.
  All token issuance logs include signer_type attribute ("threshold" or "deterministic").
  Signing key entropy validated at startup.
  Keys derived via HKDF-SHA256 (RFC 5869) for cross-node consistency.

JWT algorithm hardening:

  All JWT parsing enforces strict algorithm allowlists.
  ID token validation: ES256, ES384, ES512, EdDSA only (server-issued tokens).
  JWT Bearer assertion: RS256-512, ES256-512, EdDSA (client-signed assertions).
  DPoP proof validation: RS256-512, ES256-512, EdDSA (client-signed proofs).
  id_token_hint validation: ES256, ES384, ES512, EdDSA only (server-issued tokens).
  Symmetric algorithms (HS256/384/512) always rejected, preventing algorithm
  confusion attacks. DPoP proofs validate typ header per RFC 9449 Section 4.3.

PKCE (Proof Key for Code Exchange, RFC 7636):

  Supports S256 (SHA-256) and plain methods. S256 strongly recommended.
  Optional disable_plain_pkce config rejects plain method (OAuth 2.1 hardening).
  When disable_plain_pkce=true, discovery advertises only S256.
  MANDATORY for public clients (no client_secret configured).
  RECOMMENDED for all confidential clients as defense-in-depth.
  Prevents authorization code interception in mobile and SPA scenarios.
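As a rough illustration of the check performed at code exchange, the sketch below derives the S256 challenge defined by RFC 7636 and compares it in constant time. The function names and the disable_plain_pkce argument are illustrative stand-ins, not the module's actual API.

```python
import base64
import hashlib
import hmac


def s256_challenge(code_verifier: str) -> str:
    """code_challenge = BASE64URL(SHA256(ASCII(code_verifier))), without padding."""
    digest = hashlib.sha256(code_verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def verify_pkce(code_verifier: str, code_challenge: str, method: str,
                disable_plain_pkce: bool = False) -> bool:
    """Check a verifier against the stored challenge using a constant-time compare."""
    if method == "S256":
        expected = s256_challenge(code_verifier)
    elif method == "plain" and not disable_plain_pkce:
        expected = code_verifier
    else:
        return False  # unknown method, or plain rejected by configuration
    return hmac.compare_digest(expected.encode(), code_challenge.encode())
```

With disable_plain_pkce set, a request that stored a plain challenge fails verification even when the verifier matches, which mirrors the OAuth 2.1 hardening described above.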

DPoP (Demonstrating Proof-of-Possession, RFC 9449):

  Binds tokens to client cryptographic key, preventing token theft and replay.
  Supports RSA, ECDSA, and Ed25519 proof keys.
  JTI replay prevention with 120-second distributed cache TTL.
  Optional nonce-based replay protection (proactive nonce delivery by default).
  Server issues DPoP-Nonce header in all token responses when enabled.
  Introspection returns cnf claim with jkt field for DPoP-bound tokens.
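The jkt value is a JWK SHA-256 thumbprint as defined by RFC 7638. A minimal sketch of that computation follows; the function name is illustrative, and how the module extracts the JWK from the proof header is not shown.

```python
import base64
import hashlib
import json

# Required members per key type (RFC 7638 Section 3.2; OKP per RFC 8037).
REQUIRED_MEMBERS = {
    "EC": ("crv", "kty", "x", "y"),
    "RSA": ("e", "kty", "n"),
    "OKP": ("crv", "kty", "x"),
}


def jwk_thumbprint(jwk: dict) -> str:
    """base64url(SHA-256(canonical JSON of the required JWK members))."""
    members = REQUIRED_MEMBERS[jwk["kty"]]
    # Only required members, sorted lexicographically, with no whitespace.
    canonical = json.dumps({name: jwk[name] for name in members},
                           sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
```

Optional members such as kid or use do not affect the result, so the same key always yields the same jkt regardless of how the client serializes it.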

Mutual TLS (RFC 8705):

  Client authentication via X.509 certificate presented during TLS handshake.
  Four identity methods: URI SAN (SPIFFE), DNS SAN, Email SAN, Subject DN.
  Configure exactly one identity method per client.
  Certificate-bound tokens contain cnf.x5t#S256 (SHA-256 thumbprint).
  Binding validated at token refresh and UserInfo endpoints.
  Mutual exclusion: tokens are DPoP-bound OR cert-bound, never both.
  Per-client CA trust via client_ca_pem provides defense-in-depth.
  Certificate DER size limited to 16KB. Raw certificates never logged.
  SPIFFE integration: workloads authenticate with existing X.509-SVIDs.
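The certificate binding can be sketched as follows: the thumbprint is the base64url-encoded SHA-256 hash of the DER certificate (RFC 8705), compared against the cnf claim in constant time. The function names are illustrative, and the token is represented only by its cnf dictionary.

```python
import base64
import hashlib
import hmac

MAX_CERT_DER_BYTES = 16 * 1024  # 16KB limit stated above


def x5t_s256(cert_der: bytes) -> str:
    """SHA-256 thumbprint of a DER-encoded certificate, base64url without padding."""
    if len(cert_der) > MAX_CERT_DER_BYTES:
        raise ValueError("certificate DER exceeds 16KB limit")
    digest = hashlib.sha256(cert_der).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def cert_matches_binding(cert_der: bytes, token_cnf: dict) -> bool:
    """True when the presented certificate matches the token's cnf.x5t#S256."""
    bound = token_cnf.get("x5t#S256", "")
    return hmac.compare_digest(x5t_s256(cert_der).encode(), bound.encode())
```

A token without a cnf.x5t#S256 member never matches, so an unbound token cannot pass this check by accident.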

Pushed Authorization Requests (RFC 9126):

  Authorization parameters stored server-side, not exposed in browser URL.
  request_uri enforces single-use consumption (prevents replay attacks).
  request_uri format: urn:ietf:params:oauth:request_uri:<base64url(32 bytes)>.
  Client binding: request_uri locked to creating client_id.
  DPoP integration: optional key binding at PAR time.
  Claims/id_token_hint limited to 8KB to prevent DoS.
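A small sketch of request_uri generation and single-use consumption is shown below. ParStore is a hypothetical in-memory stand-in for the replicated store, and TTL handling is omitted. The sketch keeps base64 padding, an assumption chosen because it yields the 78-character length mentioned in the troubleshooting notes (34-character prefix plus 44 characters).

```python
import base64
import hmac
import secrets

PREFIX = "urn:ietf:params:oauth:request_uri:"


def new_request_uri() -> str:
    # 32 random bytes encode to 44 base64url characters when padding is kept.
    token = base64.urlsafe_b64encode(secrets.token_bytes(32)).decode("ascii")
    return PREFIX + token


class ParStore:
    """In-memory stand-in for the replicated PAR store (hypothetical)."""

    def __init__(self) -> None:
        self._pending: dict[str, tuple[str, dict]] = {}

    def push(self, client_id: str, params: dict) -> str:
        uri = new_request_uri()
        self._pending[uri] = (client_id, params)
        return uri

    def consume(self, client_id: str, request_uri: str) -> dict:
        entry = self._pending.get(request_uri)
        if entry is None:
            raise KeyError("unknown, expired, or already consumed request_uri")
        owner, params = entry
        # Client binding: only the client that pushed the request may use it.
        if not hmac.compare_digest(owner.encode(), client_id.encode()):
            raise PermissionError("client_mismatch")
        del self._pending[request_uri]  # single use
        return params
```

A second consume call with the same request_uri raises KeyError, which corresponds to the replay case the server is expected to reject.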

OIDC Core compliance (§2, §3.1.2.1, §3.1.3.6, §5.5.1):

  prompt parameter:
    prompt=none: returns error if user not authenticated (no login redirect).
    prompt=login: forces re-authentication even with active session.
    prompt=consent: forces consent screen even for skip_consent clients.
    The three values are mutually exclusive and are validated at the authorization endpoint.
    Error redirects (login_required, consent_required) validated against
    registered redirect URIs to prevent open redirect.

  max_age parameter:
    Limits maximum authentication age in seconds.
    If session is older than max_age, forces re-authentication.
    Validates session CreatedAt against current time.
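The interaction between prompt and max_age can be sketched as a single decision function. The function name, its parameters, and the returned labels are hypothetical; consent_needed stands for whatever the server concludes from the client's skip_consent setting.

```python
import time
from typing import Optional


def authorization_decision(prompt: Optional[str], max_age: Optional[int],
                           session_created_at: Optional[float],
                           consent_needed: bool,
                           now: Optional[float] = None) -> str:
    """Return 'login_required', 'consent_required', 'login', 'consent', or 'proceed'."""
    now = time.time() if now is None else now
    if prompt not in (None, "none", "login", "consent"):
        raise ValueError("Unsupported prompt value")
    authenticated = session_created_at is not None
    # max_age: a session older than max_age seconds no longer counts as fresh.
    too_old = (authenticated and max_age is not None
               and now - session_created_at > max_age)
    must_login = not authenticated or too_old or prompt == "login"
    must_consent = consent_needed or prompt == "consent"
    if prompt == "none":
        # No interaction allowed: report an error instead of redirecting.
        if must_login:
            return "login_required"
        return "consent_required" if consent_needed else "proceed"
    if must_login:
        return "login"
    return "consent" if must_consent else "proceed"
```

The now argument exists only to make the sketch deterministic in tests; a real handler would read the clock itself.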

  auth_time claim:
    Reflects the real time the user authenticated, not when the token was issued.
    Carried through the entire token lifecycle (auth code, refresh, ID token).

  at_hash claim (§3.1.3.6):
    Left half of SHA hash of access token, base64url-encoded.
    Hash algorithm matched to signing algorithm:
      ES256 → SHA-256, ES384 → SHA-384, ES512/EdDSA → SHA-512.
    Included in all ID tokens issued alongside an access token.
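The at_hash rule above translates into a few lines of code. This is a sketch of the computation only; the table simply restates the algorithm-to-hash mapping listed above, and the function name is illustrative.

```python
import base64
import hashlib

# Hash function matched to the ID token signing algorithm, as listed above.
HASH_FOR_ALG = {
    "ES256": hashlib.sha256,
    "ES384": hashlib.sha384,
    "ES512": hashlib.sha512,
    "EdDSA": hashlib.sha512,
}


def at_hash(access_token: str, signing_algorithm: str) -> str:
    """base64url (unpadded) of the left half of the access token's hash."""
    digest = HASH_FOR_ALG[signing_algorithm](access_token.encode("ascii")).digest()
    left_half = digest[: len(digest) // 2]
    return base64.urlsafe_b64encode(left_half).rstrip(b"=").decode("ascii")
```

A relying party recomputes the value from the access token it received and compares it with the claim in the ID token to detect token substitution.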

  ACR/AMR claims (RFC 8176):
    ACR (Authentication Context Class Reference):
      "1" = single factor (password only)
      "2" = multi-factor or strong single factor (WebAuthn, x509)
    AMR (Authentication Methods References):
      Values: pwd (password), otp (TOTP/email OTP), and hwk (WebAuthn) per RFC 8176; x509 (client certificate) is not an RFC 8176 value.
      Carried through the entire token lifecycle.
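One way to picture the ACR/AMR derivation is a lookup from the Hexon method names to AMR values, followed by a strength check. The table and the function are a hypothetical sketch based on the values listed above, not the module's actual logic.

```python
# Hexon method name -> AMR value, following the list above.
AMR_FOR_METHOD = {
    "passwd": "pwd",
    "totp": "otp",
    "otp": "otp",
    "passkey": "hwk",
    "x509": "x509",
}
STRONG_SINGLE_FACTOR = {"passkey", "x509"}


def acr_amr(methods: list[str]) -> tuple[str, list[str]]:
    """Return (acr, amr) for the methods used during sign-in, in order."""
    amr: list[str] = []
    for method in methods:
        value = AMR_FOR_METHOD[method]
        if value not in amr:
            amr.append(value)
    # ACR "2": more than one distinct factor, or a strong single factor.
    strong = len(amr) > 1 or any(m in STRONG_SINGLE_FACTOR for m in methods)
    return ("2" if strong else "1"), amr
```

For example, a password followed by TOTP yields ACR "2" with AMR ["pwd", "otp"], while a password alone yields ACR "1".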

  response_mode parameter:
    query (default): authorization code delivered via redirect query string.
    form_post: code delivered via auto-submitting HTML form (POST).
    form_post includes security headers (X-Frame-Options, Referrer-Policy).

  Consent:
    Per-client skip_consent config skips consent screen for first-party apps.
    The built-in proxy SSO client and DCR clients skip consent.
    Unknown clients always show consent screen.
    prompt=consent overrides skip_consent.

Timing attack protection:

  All security-sensitive comparisons use crypto/subtle.ConstantTimeCompare:
  client secrets, PKCE verifiers, authorization code validation, refresh token
  client binding, DPoP thumbprints, token ownership, mTLS SAN/DN matching,
  and certificate thumbprint binding.

Client security:

  Client secrets require minimum 32 characters with entropy validation.
  Strict redirect URI validation with wildcard security (HTTPS enforced).
  State parameter minimum entropy requirements (32+ characters).
  IP allowlisting per client via CIDR notation.
  Public clients (no secret) MUST set require_pkce=true.
  mTLS clients authenticate via certificate (no secret needed).

Proxy SSO client:

  Automatically managed — not configured via TOML.
  Secret derived from cluster key (consistent across all nodes).
  PKCE S256 required. Redirect URIs validated against live proxy mappings.
  Token exchange handled internally (no external HTTP round-trips).
  Invalid or disabled proxy mappings excluded from redirect URI validation.

Personal Access Tokens (PATs):

  Pre-issued long-lived tokens for hexonclient CLI, CI pipelines, and automation.
  Each PAT is a signed JWT backed by a server-side session for revocation control.
  The JWT allows stateless validation; the session enables instant revocation.
  Step-up 2FA required before creation (TOTP or email OTP) — even if already logged in.
  Server-side revocation: revoking a PAT invalidates the JWT at the next validation check.
  Per-user limit (pat_max_per_user, default 10) prevents token accumulation.
  Max TTL cap (pat_max_ttl, default 90d, max 365d) limits blast radius of stolen tokens.
  Optional IP restriction (allowed_ips) checked at validation time.
  Email notification on creation — user alerted if PAT created without their knowledge.
  Last-used tracking (IP + timestamp) for forensics and audit trail.
  Auto-revoke on user disable — directory bulk revocation includes PATs.
  Active connector (QUIC) connections severed immediately on revocation.
  PATs are distinguished from other token types by a dedicated audience claim.
  PAT names optional (default "Token <date>"), duplicate names rejected (case-insensitive).
  Optional group restriction (pat_required_groups) — when set, user must have any listed group.
  Group check enforced at issuance (OIDC module), profile UI (hides section), and bastion CLI.

  PoW-free proxy access:
    All Bearer tokens (opaque access tokens, JWT ID tokens, PATs) bypass Proof-of-Work
    challenges and OIDC browser redirects entirely. The proxy middleware chain resolves
    Bearer tokens at step 1 — before PoW, before OIDC redirect. Two on-ramps:
      Browser: PoW → OIDC SSO → cookie → proxy (human path)
      Machine: Bearer <token> → proxy (machine path, no round-trips)
    Token types: client_credentials grant (M2M), kubelogin ID tokens, PATs (long-lived
    with session-backed revocation + IP restrictions). Same group authorization, identity
    headers, and Ed25519 signing apply to both paths.

Dynamic Client Registration (RFC 7591):

  Fully stateless — no database, no KV storage, no cache.
  Client IDs use "dcr-" prefix + UUID for recognition.
  Client secrets deterministically derived from the cluster signing key.
  All cluster nodes derive identical secrets. PKCE always required.
  Redirect URIs: loopback always allowed (RFC 8252 §7.3):
    http://localhost[:port][/path], http://127.0.0.1[:port][/path], http://[::1][:port][/path].
  Additional domains via allow_dcr_redirect_domains (exact match or *.example.com wildcard).
  Use allow_dcr_redirect_domains = ["*"] to allow any HTTPS domain (for web-based MCP clients).
  Non-loopback redirect URIs require HTTPS.
  CIDR allowlist (allow_dcr_from) controls which IPs can register.
  Rate limited per IP via rate_limit_dcr config.
  Cannot revoke individual DCR clients — toggle enable_dcr=false to disable all.
  MCP service requires enable_dcr = true for OAuth-based MCP client authentication.
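The redirect URI rules above can be approximated with the following sketch. The function name is hypothetical, and two details are assumptions: plain HTTP is accepted only for the three loopback hosts, and a *.example.com pattern matches subdomains but not the bare domain.

```python
from urllib.parse import urlsplit

LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def dcr_redirect_allowed(redirect_uri: str, allowed_domains: list[str]) -> bool:
    parts = urlsplit(redirect_uri)
    host = (parts.hostname or "").lower()
    if parts.scheme == "http":
        return host in LOOPBACK_HOSTS  # plain HTTP only for loopback
    if parts.scheme != "https" or not host:
        return False
    for pattern in allowed_domains:
        pattern = pattern.lower()
        if pattern == "*" or pattern == host:
            return True
        # "*.example.com" is compared as the suffix ".example.com".
        if pattern.startswith("*.") and host.endswith(pattern[1:]):
            return True
    return False
```

Comparing against the suffix including the leading dot keeps a host such as evilexample.com from matching *.example.com.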

Troubleshooting

Common symptoms and diagnostic steps:

Token exchange failures (invalid_grant):

  - Authorization code expired (10-minute TTL): user took too long to complete flow
  - Code already consumed (single-use): possible replay attack or double-submit
  - PKCE verifier mismatch: client sent wrong code_verifier for the code_challenge
  - Client ID mismatch: code was issued to a different client
  - Redirect URI mismatch: URI in token request differs from authorization request
  - Start with: 'auth status' to check OIDC module health
  - Check: 'diagnose user <username>' for cross-subsystem user access diagnostic

DPoP validation failures:

  - proof_too_old: DPoP proof timestamp older than 60 seconds (clock skew?)
  - proof_from_future: client clock ahead of server (NTP issue)
  - jti_replay: same JTI used twice within 120 seconds (SECURITY: possible attack)
  - invalid_nonce: nonce not found or expired (60-second TTL, single-use)
  - htm_mismatch / htu_mismatch: proof HTTP method or URI does not match request
  - thumbprint_error: JWK thumbprint computation failed (malformed key)
  - Monitor: alert on ANY oidc_dpop_jti_replay_total increments
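To make the timing-related errors concrete, the sketch below applies the 60-second proof age limit and the 120-second JTI window to a single proof. JtiCache is a hypothetical in-memory stand-in for the distributed cache, and the sketch allows no clock-skew tolerance for future timestamps, which a real deployment may handle differently.

```python
import time
from typing import Optional

MAX_PROOF_AGE = 60   # seconds, per proof_too_old above
JTI_TTL = 120        # seconds, per jti_replay above


class JtiCache:
    """In-memory stand-in for the distributed JTI cache (hypothetical)."""

    def __init__(self) -> None:
        self._seen: dict[str, float] = {}

    def check(self, jti: str, iat: float, now: Optional[float] = None) -> str:
        now = time.time() if now is None else now
        if iat > now:
            return "proof_from_future"
        if now - iat > MAX_PROOF_AGE:
            return "proof_too_old"
        # Drop entries whose TTL has passed before looking up the JTI.
        self._seen = {j: exp for j, exp in self._seen.items() if exp > now}
        if jti in self._seen:
            return "jti_replay"
        self._seen[jti] = now + JTI_TTL
        return "ok"
```

Because the JTI window is longer than the proof age limit, a proof cannot be replayed after its JTI entry expires: by then the proof itself is too old.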

mTLS authentication failures:

  - No client certificate: TLS handshake did not include certificate
  - SAN/DN mismatch: certificate identity does not match client config
  - Certificate too large: DER exceeds 16KB limit
  - CA trust failure: certificate not signed by expected CA (check client_ca_pem)
  - Wrong identity method: client configured with san_uri but cert has san_dns
  - Check: 'auth status' for authentication system overview

Token refresh failures:

  - Refresh token expired (default 30-day TTL)
  - Client ID mismatch: refresh token bound to different client
  - Certificate binding mismatch (mTLS): presented cert differs from original
  - DPoP key mismatch: different key used than at token issuance
  - Token revoked: check if bulk revocation was triggered
  - Check: 'sessions list --user=<username>' for active sessions

M2M (client_credentials / jwt-bearer) failures:

  - ip_not_allowed: source IP not in client allow_client_from CIDR list
  - Invalid client secret: ensure 32+ chars, check for trailing whitespace
  - Wrong grant type: client must have grant type in allowed_grant_types
  - Scope not allowed: requested scope not in client allowed_scopes
  - JWT assertion: check algorithm matches jwt_algorithm, verify issuer/subject
  - JWT public key: ensure PEM format is correct and algorithm matches key type

PAR (Pushed Authorization Request) failures:

  - replay_attempt: request_uri already consumed (SECURITY: possible replay attack)
  - expired: request_uri TTL exceeded (default 5 minutes)
  - client_mismatch: different client_id attempting to use another client's request_uri
  - invalid_length: request_uri format does not match expected 78-character URN
  - Monitor: alert on oidc_par_consume_total result=replay_attempt

Authorization endpoint (OIDC Core) issues:

  - prompt=none returns login_required: user has no active session; expected behavior
  - prompt=none returns consent_required: client requires consent but prompt=none forbids it
  - prompt=login redirect loop: session freshness check prevents infinite loops (30s guard)
  - max_age forces re-auth: session age exceeds max_age seconds; user must re-authenticate
  - "Unsupported prompt value": client sent invalid prompt value (only none, login, consent allowed)
  - "Invalid max_age": client sent non-numeric max_age value
  - Consent screen shown unexpectedly: check skip_consent on client config ('config show authentication')
  - at_hash missing in ID token: at_hash only present when ID token issued alongside access token
  - ACR shows "1" despite MFA: check session auth_method metadata matches expected method
  - form_post not working: ensure client accepts POST at redirect_uri; check response_mode=form_post

Dynamic Client Registration (RFC 7591) failures:

  - 404 on POST /oidc/register: enable_dcr is false in configuration
  - access_denied: source IP not in allow_dcr_from CIDR allowlist
  - invalid_redirect_uri: redirect domain not in allow_dcr_redirect_domains and not loopback
  - Non-loopback redirect URIs must use HTTPS
  - For web-based MCP clients: set allow_dcr_redirect_domains = ["*"] to allow any HTTPS domain
  - For native CLI MCP clients: no domain config needed (loopback always allowed)
  - Client secret not working: ensure client is using the client_secret returned at registration
  - Token exchange fails: DCR clients require PKCE (S256); ensure code_challenge is sent
  - Check: 'config show authentication' to verify enable_dcr and allow_dcr_redirect_domains settings

PAT (Personal Access Token) failures:

  - "PAT revoked or expired": session deleted or TTL exceeded — check 'sessions list --type=pat --user=X'
  - "maximum PAT limit reached": user has pat_max_per_user tokens — revoke unused ones first
  - "PAT name already exists": case-insensitive duplicate — use a different name
  - "authentication failed" after revoke: expected — session deletion invalidates JWT at next check
  - Token not working after creation: ensure hexonclient uses --token flag with full JWT string
  - IP restriction error: remote IP not in allowed_ips metadata — check 'pats show <session_id>'
  - "your groups do not permit PAT creation": user not in pat_required_groups — check 'config show authentication' and user's groups
  - PAT section hidden in profile: pat_required_groups is set and user not in any listed group
  - Step-up verification required: user must complete TOTP or email OTP before PAT creation
  - PAT not working as proxy Bearer token: check 'logs search "handlers.bearer"' — look for
    "PAT rejected" (revoked session) or "Cached PAT rejected" (stale cache, auto-invalidated)
  - PAT introspection returns {active: false}: ensure token_type_hint is "" or "pat",
    check session exists ('sessions list --type=pat'), verify JWT not expired
  - PAT proxy access denied despite valid token: check allowed_ips — proxy enforces IP restriction
    from session metadata. Use 'pats show <session_id>' to see allowed_ips list
  - Check: 'pats list --user=X' to see all PATs for a user
  - Check: 'sessions list --type=pat' for all PAT sessions cluster-wide
  - Check: 'logs search "oidc.pat"' for PAT issuance and validation logs
  - Check: 'logs search "handlers.bearer"' for proxy bearer middleware PAT validation logs

Proxy SSO redirect loops:

  - OIDC callback failing: check proxy oidc_providers configuration
  - Token exchange fails: proxy exchanges tokens internally (no external HTTP hairpin)
  - Cross-domain cookie: verify proxy hostname matches cookie domain
  - Check: 'sessions list --type=proxy --user=<username>'
  - Check: 'proxy traffic <app>' for per-route metrics

Threshold signing issues:

  - signer_type=deterministic when threshold expected: check cluster quorum, 'cluster status'
  - "Threshold signing unavailable but required": threshold_required is set but quorum lost
  - "OIDC switched to deterministic fallback signing": threshold signer lost, using HKDF key
  - Algorithm mismatch: threshold signer algorithm must match signing_algorithm config
  - Check logs: 'logs search "oidc.keys"' for signing mode transitions

Key rotation / history issues:

  - Token validation fails after key rotation: check 'auth keys' — is the old kid still listed?
  - Key history empty: keys are recorded on first token signing or key rotation
  - Historical key expired from history: TTL may be too short relative to token lifetimes
  - Token signed with unknown kid: historical key may have expired from KV (key history is reloaded from KV on restart)
  - Check: 'auth keys' — shows kid, algorithm, curve, expiry, and remaining TTL

Health check failures:

  - signing_key_loaded=false: signing key derivation failed (check signing_key length)
  - entropy_validated=false: signing key has insufficient entropy (weak key)
  - issuer_configured=false: hostname not set in configuration
  - Use: 'auth status' for OIDC health overview

General diagnostic commands:

  'auth status'              - Authentication system status overview
  'auth tokens'              - Active OIDC tokens and sessions
  'auth oidc'                - OIDC provider config and registered clients
  'auth keys'                - Active signing keys with kid, algorithm, and TTL
  'diagnose user <username>' - Cross-subsystem user access diagnostic
  'sessions list --user=X'   - List active sessions for a user
  'sessions revoke-user X'   - Revoke all sessions for a user (emergency)
  'logs search oidc'         - Search logs for OIDC-related entries
  'metrics prometheus oidc'  - Raw OIDC Prometheus metrics

Architecture

How the OIDC provider works at the cluster level:

The OIDC module operates cluster-wide. All token operations, key management, and revocation are replicated to all nodes automatically. The HTTP service layer handles request parsing and delegates to the OIDC module internally.

Operation categories:

Authorization (user login flows)

   - Authorization code generated after user authentication (10-minute single-use TTL)
   - Code exchange validates PKCE, client credentials, redirect URI, then issues tokens
   - Supports: prompt (none/login/consent), max_age, response_mode (query/form_post)
   - Per-client skip_consent controls whether the consent screen is shown

Token management

   - Refresh validates client binding, DPoP key, and certificate binding
   - Bulk revocation is replicated to all nodes for immediate effect
   - Introspection returns confirmation claims for DPoP-bound and cert-bound tokens
   - Introspection also supports PATs (returns token name and ID)

Machine-to-machine (M2M)

   - Client credentials: secret-based auth with scope-to-group mapping
   - JWT bearer: certificate-based auth with public key validation
   - Both return access tokens only (no refresh token, no ID token)
   - Scope-to-group mapping bridges OAuth scopes to Hexon group authorization

Device authorization

   - Issues tokens after the device authorization flow completes
   - Used by bastion SSH for user authentication via browser

Discovery

   - JWKS exposes signing public keys for external JWT verification
   - OpenID Configuration provides standard OIDC discovery metadata
   - Discovery advertises supported response modes, claims, and PKCE methods

Dynamic Client Registration (DCR)

   - Stateless: each DCR client gets a unique ID (dcr- prefix) and derived secret
   - No storage needed — client credentials are deterministically reproducible
   - PKCE required; redirect URIs: loopback always allowed + operator-configured domains

Pushed Authorization Requests (PAR)

   - Authorization parameters stored server-side (not exposed in browser URL)
   - Single-use consumption prevents replay attacks
   - Client binding enforced with constant-time comparison

Personal Access Tokens (PATs)

   - JWT signed and displayed once at creation — never stored server-side
   - Three validation paths: connector (QUIC), HTTP proxy Bearer header, introspection
   - All paths verify JWT signature + server-side session existence + optional IP restriction
   - Revocation deletes the session and immediately disconnects active connections

Token replication model:

  - Authorization codes: local node only (short-lived, single-use)
  - Access/refresh tokens: replicated to all nodes with quorum
  - DPoP JTIs and nonces: best-effort replication (short TTL)
  - PAR requests: best-effort replication (short TTL, single-use)
  - PAT sessions: managed by the sessions module (TTL per token, up to pat_max_ttl)

Key management:

  Signing keypair derived deterministically from signing_key using HKDF-SHA256.
  Supports ES256 (P-256, default), ES384 (P-384), ES512 (P-521), and EdDSA (Ed25519).
  All cluster nodes produce identical keys from the same signing_key — no key
  synchronization needed. Keys remain stable across restarts.

  Threshold signing is preferred when cluster quorum is available. The OIDC module
  auto-switches signing mode based on cluster health. If threshold_required is set
  in config, the deterministic fallback is disabled (fail-closed on quorum loss).

  Key history (rotation support):
  On key rotation, old signing keys are retained so that tokens signed with the
  previous key can still be verified. Each key is identified by its kid (Key ID).
  Historical keys have a TTL based on the longest-lived token signed with them.
  JWKS endpoint serves all active keys (current + historical).
  Inspect active keys: 'auth keys' shows kid, algorithm, curve, and TTL.

Metrics and observability:

  Comprehensive Prometheus metrics exported for all operations:
  - Token operations: exchange, refresh, revocation, introspection, userinfo
  - DPoP: validation, JTI replay detection, nonce generation and validation
  - mTLS: authentication attempts, certificate binding validations
  - PAR: request creation, consumption, replay detection
  - Latency histograms: ID token, auth code, access token generation
  - Validation failures: PKCE, scope, redirect URI, signing key entropy

Relationships

Module dependencies and interactions:

proxy: Provides SSO authentication via a built-in proxy client. Authorization
  codes are exchanged internally (no external HTTP round-trips). Redirect URIs
  validated against live proxy mapping config. Proxy sessions use 24-hour token TTL.

devicecode: Issues tokens after device authorization flow completes. Used
  by bastion SSH — trusted internal callers skip client validation.

directory: Provides user information (groups, email, name) for token claims.
  When a user is disabled in the directory, all their tokens are revoked
  cluster-wide. Group memberships are included in ID tokens and used for
  scope-to-group mapping in M2M flows.

sessions: OIDC tokens create sessions for proxy and bastion flows.
  Session revocation triggers token revocation for the associated user.

authentication.x509: TLS layer validates client certificates against the
  global CA pool. OIDC performs identity matching (SAN/DN) and optional
  per-client CA trust validation on top of TLS-layer authentication.

spiffe: SPIFFE X.509-SVIDs used for mTLS client authentication via URI SAN.
  No separate CA infrastructure needed; reuses ACME SPIFFE profile certificates.

bastion: Bastion SSH uses device authorization flow for user authentication.
  Bastion shell also provides 'pat create/list/revoke' commands with inline
  TOTP/email OTP verification.

firewall: Network-level access rules applied before OIDC HTTP endpoints.
  IP allowlisting per client provides additional application-layer restriction.

protection: Rate limiting applied to token and authorization endpoints.
  Prevents brute-force attacks on client credentials and authorization codes.

mcp: MCP service uses DCR for OAuth-based authentication. MCP clients
  register dynamically via POST /oidc/register, then complete Authorization
  Code + PKCE flow. Also supports static bearer token auth as fallback.

connector (hexonclient): PATs are used for QUIC connector authentication.
  Validates JWT signature + session existence. Active connections are severed
  immediately when a PAT is revoked. Last-used metadata updated on each use.

proxy (Bearer tokens): PATs can be used as HTTP Bearer tokens for proxy access.
  Bearer middleware validates the JWT and checks the server-side session on every
  request (revocation takes effect immediately). IP restrictions from the PAT
  are enforced at the proxy layer.

profile: Profile web UI allows PAT creation (with step-up 2FA gate), listing,
  and revocation.

admin CLI: 'pats' command for cross-user PAT management with step-up verification.

smtp: Email notification sent on PAT creation, including token name, expiry,
  and the IP address used during creation.

cluster: All token operations are replicated cluster-wide. Key derivation
  ensures all nodes produce identical signing keypairs from the same signing_key.

Email OTP

Email-based one-time password with distributed storage, timing-attack resistance, and brute-force protection

Overview

The Email OTP module implements email-based one-time password authentication with cryptographically secure code generation, distributed cluster storage, and advanced timing-attack resistance.

Core capabilities:

Cryptographically secure code generation (numeric or BASE20 consonant-only)
Rejection sampling for modulo bias elimination in random digit generation
Device-based rate limiting and resend delay enforcement
Distributed OTP storage with cluster quorum (>50% node confirmation)
Constant-time validation using crypto/subtle with dummy value protection
Replay attack prevention via one-time use enforcement
Email domain allowlisting for authorized domains
Brute-force protection with configurable max retry limits and OTP locking
SHA-256 hashed cache keys for email privacy protection
JIT-2FA override support for webhook-validated scenarios

Authentication flow:

  1. GenerateOTP: Validate email domain, check device restrictions, generate code
  2. Code sent to user via SMTP module (fire-and-forget, non-blocking)
  3. ValidateOTP: Constant-time code comparison, retry tracking, replay prevention

OTP code types:

  numeric: Digits 0-9 with rejection sampling (no modulo bias)
  base20: 20 uppercase consonants (BCDFGHJKLMNPQRSTVWXZ), avoids profanity
          per RFC 8628 Section 6.1 recommendation for user-facing codes
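Both code types can be generated with the same rejection-sampling helper, sketched below. The function names are illustrative, and the sketch draws one random byte per attempt, which is one of several ways to remove modulo bias.

```python
import secrets

BASE20_ALPHABET = "BCDFGHJKLMNPQRSTVWXZ"
DIGITS = "0123456789"


def random_index(n: int) -> int:
    """Uniform index in [0, n) from random bytes using rejection sampling."""
    limit = 256 - (256 % n)  # largest multiple of n that fits in one byte
    while True:
        byte = secrets.token_bytes(1)[0]
        if byte < limit:  # bytes at or above the limit would skew the result
            return byte % n


def generate_otp(length: int = 6, code_type: str = "numeric") -> str:
    if not 4 <= length <= 12:
        raise ValueError("length must be between 4 and 12")
    # Unknown types fall back to numeric, as described for the config option.
    alphabet = BASE20_ALPHABET if code_type == "base20" else DIGITS
    return "".join(alphabet[random_index(len(alphabet))] for _ in range(length))
```

For digits, bytes 250 to 255 are discarded so that each of the ten values is backed by exactly 25 byte values; for base20 the cutoff is 240.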

Storage: OTPs stored in distributed memory cache (cache type "otp_codes") with cluster-wide broadcast and quorum requirement. Cache keys are SHA-256 hashes of "email|deviceID" for privacy protection. TTL-based automatic expiration.
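The cache key construction can be sketched in a few lines. Hex encoding of the digest is an assumption, as is the absence of any normalization of the email or device ID before hashing.

```python
import hashlib


def otp_cache_key(email: str, device_id: str) -> str:
    """SHA-256 over "email|deviceID", so the cache never holds the raw address."""
    return hashlib.sha256(f"{email}|{device_id}".encode("utf-8")).hexdigest()
```

The same email on two devices produces two unrelated keys, which is what allows rate limiting and resend delays to be tracked per device.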

Config

Configuration under [authentication.otp]:

[authentication.otp]

  length = 6                        # OTP code length (4-12, recommended: 4-8)
  type = "numeric"                  # Code type: "numeric" or "base20"
  valid = "5m"                      # OTP expiration duration (bounds: 1m-30m)
  resend_time = 60                  # Minimum seconds between OTP requests per device
  max_retries = 5                   # Max failed validation attempts before OTP locked
  avoid_replay = true               # Delete OTP after successful validation
  mask_email = true                 # Mask email in MFA page ("user****@example.com")
  domains = [                       # Allowed email domains (empty or ["*"] = all allowed)
    "example.com",
    "company.org",
  ]

Code type selection:

  "numeric": Standard digit-only codes, works with any keyboard layout
  "base20": Consonant-only uppercase codes, prevents generating offensive words
  Invalid type values fall back to "numeric" with a warning log

Override fields for JIT-2FA and programmatic callers:

  TypeOverride: Override code type per-request (empty = global config)
  CodeLengthOverride: Override code length per-request (bounds: 4-12)
  TTLOverride: Override expiration per-request (bounds: 1m-30m)
  ResendTimeOverride: Override resend cooldown per-request (bounds: 10s-5m)
  SkipDomainCheck: Bypass email domain allowlist (for webhook-validated flows)
  MaxRetriesOverride: Override max failed attempts per-request (bounds: 1-10)
  Resolution chain for all overrides: per-request > global config > default
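
The resolution chain above can be sketched as a small helper. This is a minimal illustration, not the module's code: the function name resolveInt is hypothetical, and treating zero as "unset" and skipping out-of-range values are assumptions, since the text does not say how out-of-bounds overrides are handled.

```go
package main

import "fmt"

// resolveInt picks the first candidate that is set (non-zero) and within [lo, hi].
// Order: per-request override, then global config, then the built-in default.
// Zero-as-unset and skipping out-of-range values are assumptions of this sketch.
func resolveInt(override, global, def, lo, hi int) int {
	for _, v := range []int{override, global} {
		if v != 0 && v >= lo && v <= hi {
			return v
		}
	}
	return def
}

func main() {
	// CodeLengthOverride unset, global length = 8, default 6, bounds 4-12.
	fmt.Println(resolveInt(0, 8, 6, 4, 12))
	// A per-request override wins when it is within bounds.
	fmt.Println(resolveInt(10, 8, 6, 4, 12))
}
```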

MaxRetries behavior:

  When retry count reaches max_retries, OTP is locked (not deleted).
  Locked OTPs block both validation AND resend requests.
  This prevents brute-force bypass via the resend trick (request new code
  after exhausting retries on the current one).
  Locked OTPs expire naturally via TTL for automatic cleanup.

Resend behavior:

  Retry and attempt counters are preserved across resends for the same email.
  This prevents attackers from resetting counters by requesting a new code.
  Counters only reset when a different email is used from the same device.
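
The locking and resend rules above can be modeled with a small state sketch. The otpEntry type and its method names are hypothetical; the real metadata layout is not described in this document.

```go
package main

import "fmt"

// otpEntry is a hypothetical view of the per-device OTP metadata.
type otpEntry struct {
	Email   string
	Retries int
	Locked  bool
}

// recordFailure counts a failed validation and locks the entry at maxRetries.
func (e *otpEntry) recordFailure(maxRetries int) {
	e.Retries++
	if e.Retries >= maxRetries {
		e.Locked = true // locked, not deleted; TTL expiry removes it later
	}
}

// canResend reports whether a new code may be issued: the entry must not be
// locked and the resend delay must have elapsed.
func (e *otpEntry) canResend(secondsSinceLast, resendTime int) bool {
	return !e.Locked && secondsSinceLast >= resendTime
}

// reissue models a resend: counters survive for the same email and reset
// only when a different email is used from the same device.
func (e *otpEntry) reissue(email string) {
	if email != e.Email {
		e.Email, e.Retries = email, 0
	}
}

func main() {
	e := &otpEntry{Email: "user@example.com"}
	for i := 0; i < 5; i++ {
		e.recordFailure(5)
	}
	fmt.Println(e.Locked, e.canResend(120, 60))
}
```

The point of keeping Locked separate from deletion is that a resend request finds the locked entry and is refused, which closes the resend bypass.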

All settings are hot-reloadable (read dynamically on each operation).

Troubleshooting

Common symptoms and diagnostic steps:

User does not receive OTP email:

  - Check email domain is in the allowed domains list
  - Verify SMTP module health: 'smtp health'
  - Check telemetry logs for "Failed to initiate OTP email delivery"
  - Note: email delivery is fire-and-forget, so GenerateOTP succeeds even if SMTP fails
  - Verify the user's email address format is valid (must contain @)

“email domain not allowed” error:

  - Email domain not in [authentication.otp] domains list
  - Domain check is case-insensitive
  - Empty domains list or ["*"] allows all domains
  - JIT-2FA callers should set SkipDomainCheck=true if webhook validates

“unidentified device” error:

  - DeviceID is empty in the GenerateOTP request
  - Handler must generate device fingerprint before calling OTP module
  - DeviceID is required for rate limiting and device-email binding

“this device has already requested a code” error:

  - Device has an active (non-expired) OTP for a different email address
  - Prevents attacker from using victim's device session for their email
  - Wait for existing OTP to expire, or use a different device identifier

“please wait X before requesting another code” error:

  - Resend delay not elapsed (default: 60 seconds between requests)
  - Check resend_time config or ResendTimeOverride bounds (10s-5m)

“too many failed attempts” error:

  - OTP locked after max_retries exceeded (default: 5 attempts)
  - Locked OTPs also block resend requests to prevent bypass
  - User must wait for OTP to expire (TTL) then request a new one
  - Check logs for "SECURITY: OTP locked due to max retry attempts exceeded"

OTP validation returns Valid=false without error:

  - Code expired (check valid duration in config)
  - Incorrect code submitted (case-insensitive comparison)
  - No OTP found for email/device combination
  - OTP already consumed (avoid_replay=true deletes after success)
  - OTP locked from previous max retries exceeded

“OTP storage quorum not reached” error:

  - Insufficient cluster nodes confirmed storage (need >50%)
  - Check cluster health: 'cluster status'
  - May indicate network partition or node failures

Metrics for monitoring:

  - otp.codes_generated (type=numeric|base20): Generation count by type
  - otp.validations_total (result=valid|invalid): Overall validation outcomes
  - otp.validation_failures (reason=not_found|expired|invalid_code|max_retries|locked):
    Failure breakdown by reason
  - otp.replay_prevented: Successful validations where OTP was deleted

Security

Security design and hardening:

Code generation:

  Cryptographically secure random generation using crypto/rand.
  Rejection sampling eliminates modulo bias in digit selection:
    For numeric (base 10): Accept bytes 0-249, reject 250-255 (2.3% rejection rate).
    For BASE20 (base 20): Accept bytes 0-239, reject 240-255.
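  Because 250 and 240 are exact multiples of 10 and 20, every accepted byte maps to each
  character with equal probability, giving a uniform distribution across all code characters.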
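
The rejection sampling described above can be sketched as follows. This is an illustrative implementation under the stated byte thresholds, not the module's actual code; the function name generateCode and the constant names are hypothetical.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

const (
	numericAlphabet = "0123456789"
	base20Alphabet  = "BCDFGHJKLMNPQRSTVWXZ"
)

// generateCode returns a code of the given length drawn uniformly from alphabet.
// It assumes len(alphabet) is between 1 and 256.
func generateCode(alphabet string, length int) (string, error) {
	n := len(alphabet)
	// Largest multiple of n that fits in a byte: 250 for n=10, 240 for n=20.
	limit := 256 - (256 % n)
	out := make([]byte, 0, length)
	buf := make([]byte, 1)
	for len(out) < length {
		if _, err := rand.Read(buf); err != nil {
			return "", err
		}
		if int(buf[0]) >= limit {
			continue // reject: keeping this byte would bias toward low indexes
		}
		out = append(out, alphabet[int(buf[0])%n])
	}
	return string(out), nil
}

func main() {
	for _, alphabet := range []string{numericAlphabet, base20Alphabet} {
		code, err := generateCode(alphabet, 6)
		if err != nil {
			panic(err)
		}
		fmt.Println(code)
	}
}
```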

Constant-time validation (timing attack resistance):

  All code paths execute identical operations regardless of OTP existence.
  When OTP not found: dummy code "DUMMY0000" and expired metadata are used.
  crypto/subtle.ConstantTimeCompare always called, even on storage errors.
  No early returns before the comparison operation.
  Prevents attackers from determining OTP existence via response time analysis.
  Prevents code enumeration through timing side channels.
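
A minimal sketch of the dummy-value pattern described above, assuming a hypothetical storedOTP record with a Unix-seconds expiry. Note that crypto/subtle.ConstantTimeCompare returns immediately when the two inputs differ in length, so only same-length comparisons are constant-time.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"strings"
)

// storedOTP is a hypothetical record shape; the real metadata layout is not shown here.
type storedOTP struct {
	Code      string
	ExpiresAt int64 // Unix seconds
}

const dummyCode = "DUMMY0000"

// validate compares submitted against rec. When rec is nil, a dummy code with
// already-expired metadata is substituted so the comparison still runs.
func validate(rec *storedOTP, submitted string, now int64) bool {
	found := 1
	if rec == nil {
		found = 0
		rec = &storedOTP{Code: dummyCode, ExpiresAt: 0}
	}
	// Case-insensitive comparison: normalize both sides before comparing.
	a := []byte(strings.ToUpper(rec.Code))
	b := []byte(strings.ToUpper(submitted))
	match := subtle.ConstantTimeCompare(a, b)
	notExpired := 0
	if now < rec.ExpiresAt {
		notExpired = 1
	}
	// No early return above: the result is combined only after the comparison.
	return found&match&notExpired == 1
}

func main() {
	rec := &storedOTP{Code: "BCDFGH", ExpiresAt: 2000}
	fmt.Println(validate(rec, "bcdfgh", 1000))    // matching code, not expired
	fmt.Println(validate(nil, "DUMMY0000", 1000)) // no OTP stored
}
```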

Brute-force protection:

  Configurable max retry limit (default: 5 failed attempts).
  OTP locked (not deleted) after max retries — blocks both validation and resend.
  For 6-digit numeric: 5 guesses out of 1,000,000 codes = 0.0005% maximum success probability per OTP.
  Retry counters preserved across resends to prevent counter-reset bypass.
  Security event logged at WARN level when max retries exceeded.

Device-email binding:

  Each device can only have one active OTP at a time.
  Device cannot switch to a different email while an active OTP exists.
  Prevents attacker from using a compromised device session for their own email.

Email privacy protection:

  Cache keys are SHA-256 hashes of "email|deviceID" (base64url encoded).
  Email addresses never stored directly in cache keys.
  Prevents email enumeration via cache key inspection.
  Deterministic hashing ensures consistent key derivation across cluster nodes.
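
The key derivation above can be sketched in a few lines. The function name cacheKey is hypothetical, and the choice of unpadded base64url is an assumption; the text only says the hash is base64url encoded.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// cacheKey hashes "email|deviceID" with SHA-256 and encodes the digest as
// base64url. Unpadded encoding is an assumption of this sketch.
func cacheKey(email, deviceID string) string {
	sum := sha256.Sum256([]byte(email + "|" + deviceID))
	return base64.RawURLEncoding.EncodeToString(sum[:])
}

func main() {
	// The same inputs produce the same key on every node.
	fmt.Println(cacheKey("user@example.com", "device-123"))
}
```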

Replay prevention:

  Configurable via AvoidReplay setting (recommended: true).
  OTP deleted from all cluster nodes after successful validation.
  Prevents code reuse even within the validity window.

Resend abuse prevention:

  Per-device resend delay (configurable, default 60 seconds).
  Locked OTPs block resend requests (prevents brute-force via fresh codes).
  Retry counters preserved across resends for the same email.

Cluster storage security:

  OTP broadcast to all cluster nodes with quorum requirement (>50%).
  Ensures OTP availability across node failures.
  Retry count updates also require cluster quorum.
  TTL-based automatic expiration prevents stale OTP accumulation.
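
The quorum rule (strictly more than 50% of nodes) reduces to a one-line integer check. The function name quorumReached is hypothetical.

```go
package main

import "fmt"

// quorumReached reports whether strictly more than half of the nodes confirmed.
// Integer arithmetic avoids rounding problems with even cluster sizes.
func quorumReached(confirmed, total int) bool {
	return total > 0 && confirmed*2 > total
}

func main() {
	fmt.Println(quorumReached(2, 3)) // 2 of 3 nodes confirmed
	fmt.Println(quorumReached(2, 4)) // exactly half is not a quorum
}
```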

Relationships

Module dependencies and interactions:

signin: Primary consumer for email-based MFA. When MFAMethods includes “otp”,

  users see the email OTP option on the MFA page. The signin flow engine calls
  GenerateOTP to send a code, then ValidateOTP when the user submits it.
  Successful validation completes the login flow.

smtp: Email delivery for OTP codes. OTP generation triggers email delivery via

  the SMTP module (fire-and-forget). Email includes the code, validity duration, and
  is localized using the Language field from the request.

Distributed memory cache: Backend for OTP metadata. Uses cache type

  "otp_codes" with SHA-256 hashed keys. All writes use cluster broadcast
  with quorum for consistency.

authentication.totp: Sibling MFA method. Users may see both email OTP and

  TOTP options on the MFA page. Email OTP requires no prior enrollment but
  depends on email delivery; TOTP is faster but requires authenticator app setup.

config: Reads [authentication.otp] settings dynamically at runtime.

  All settings are hot-reloadable. Override fields in requests take precedence
  over global config values.

telemetry: Structured logging with email context for all operations.

  Security events logged at WARN level (max retries exceeded, OTP locked).
  Metrics counters for generation, validation outcomes, and failure reasons.

Rate limiting: External rate limiting layer. Handlers should implement

  IP-based rate limiting in addition to the module's device-based limiting.

jit_2fa: JIT-2FA webhook flow uses override fields (SkipDomainCheck,

  TTLOverride, CodeLengthOverride) for customized OTP behavior when the
  webhook has already validated the user.

RADIUS Authentication (RADSEC + UDP)

Dual-mode RADIUS AAA server — RADSEC (TCP+TLS, RFC 6614) and plain UDP (RFC 2865) — for network device authentication and group-based authorization

Overview

The RADIUS module provides authentication and authorization for network devices — VPN concentrators, WiFi controllers, switches, and other NAS (Network Access Server) equipment — supporting both RADSEC (RFC 6614, RADIUS over TCP+TLS) and plain UDP (RFC 2865).

Transport modes (controlled by radsec_only, default: true):

radsec_only=true: RADSEC TCP+TLS only on configured port (default 2083)
radsec_only=false: Dual mode — RADSEC on port 2083 + plain UDP on configured port (default 1812)

Core capabilities:

RADSEC listener for Access-Request packets over TCP+TLS (always active)
Plain UDP RADIUS listener for legacy NAS equipment (when dual mode enabled)
TLS certificate cascade: per-client → module-level → auto_tls (ACME) → service default
Per-client mTLS: optional NAS device certificate verification via client_ca_pem
NAS client validation via CIDR matching and shared secret verification (CIDR defaults to 0.0.0.0/0 if empty)
HXEP (Hexon Edge Protocol) support: real NAS IP through SNAT/edge proxy (same as VPN)
Password authentication via LDAP bind (standard RADIUS User-Password)
X.509 certificate authentication via RADSEC peer certificates, using the same authentication.x509 module (7-layer validation: expiry, chain, CRL, identity extraction via cert_subject_map, directory lookup, revocation check)

Group-based authorization mappings with priority ordering (first match wins)
RADIUS attribute-value pair (AVP) responses: VLANs, ACLs, privilege levels
Per-NAS rate limiting (sliding window) and per-user lockout after failed attempts
Global concurrent authentication cap for DoS protection
Full audit logging of authentication decisions with NAS and user context

Both transports share the same packet processing pipeline — authentication, authorization, and response building are transport-independent.

Config

RADIUS configuration under [radius] section:

[radius]

  enabled = true                    # Enable RADIUS service
  radsec_only = true                # true: RADSEC TCP+TLS only; false: dual mode (UDP + RADSEC)
  network_interface = ""            # Bind interface (defaults to service.network_interface → "eth0")
  radsec_port = 2083                # RADSEC TCP+TLS port (default 2083, RFC 6614)
  plain_port = 1812                 # Plain UDP RADIUS port (default 1812, RFC 2865, dual mode only)
  accounting_port = 2083            # Reserved for future accounting
  auth_methods = ["password"]       # Methods: "password" (LDAP bind), "x509" (RADSEC peer cert)
  idle_timeout = "30s"              # Per-connection idle timeout (default: 30s)
  session_ttl = "1h"               # Auth event visibility in session list (1m-24h)
  tls_min_version = "1.2"          # Minimum TLS version: "1.1", "1.2", "1.3"

  # TLS: module-level certificate (optional, falls back to service default)
  tls_cert = ""                     # Server cert (file path or inline PEM)
  tls_key = ""                      # Server private key (file path or inline PEM)
  auto_tls = false                  # Issue cert from internal ACME CA

[radius.rate_limit]

  max_requests_per_second_per_nas = 100   # Per-NAS rate limit
  max_auth_attempts_per_user = 5          # Failed attempts before lockout
  auth_lockout_duration = "5m"            # Lockout period after max failures
  max_concurrent_authentications = 1000   # Global concurrent auth cap

NAS client definitions (at least one required):

[[radius.client]]

  name = "vpn-concentrator"
  description = "Fortinet FG-100F at DC1"
  cidr = "10.0.1.0/24"               # Defaults to 0.0.0.0/0 if empty (WARNING logged)
  secret = "base64:c2VjdXJlLXJhbmRvbS1zZWNyZXQ="  # min 16 bytes decoded

  # Per-client TLS overrides (optional)
  tls_cert = ""                     # NAS-specific server cert
  tls_key = ""                      # NAS-specific server key
  client_ca_pem = ""                # CA to verify NAS device cert (enables mTLS)

Group-based authorization mappings (evaluated by priority, highest first):

[[radius.mapping]]

  name = "network-admins"
  groups = ["admins", "network-ops"]
  priority = 100
  [radius.mapping.attributes]
    "Service-Type" = "6"                 # Administrative
    "Tunnel-Type" = "13"                 # VLAN
    "Tunnel-Medium-Type" = "6"           # IEEE 802
    "Tunnel-Private-Group-ID" = "10"     # VLAN 10

[radius.mfa]

  enabled = false                    # Enable MFA for RADIUS password auth
  mode = "challenge"                 # "challenge" (Access-Challenge) or "append" (password+code)
  methods = ["totp"]                 # Priority list: "totp", "otp" (email)
  separator = ":"                    # Append mode separator (split at last occurrence)
  challenge_timeout = "60s"          # Access-Challenge response timeout (10s-300s)
  required_groups = []               # Groups requiring MFA (empty = all users)
  skip_if_unavailable = false        # Skip MFA if no method available (false = reject)
  otp_ttl = "5m"                     # Email OTP validity override (1m-10m)
  otp_code_length = 6                # Email OTP code length (4-8)

Per-client MFA override (optional field on [[radius.client]]):

  mfa_override = ""                  # "" = inherit global, "off" = disable, "challenge", "append"

Hot-reloadable: all settings except ports and TLS settings, which require a restart.

Troubleshooting

Common RADIUS issues and diagnostic steps:

NAS cannot connect to RADIUS server:

  - RADSEC: verify port 2083/tcp is open; 'firewall show' to check rules
  - UDP (dual mode): verify configured port (default 1812/udp) is open
  - Verify NAS IP falls within a configured [[radius.client]] CIDR
  - Test connectivity from NAS to gateway on configured port
  - Check: 'config show radius' to verify enabled = true and radsec_only setting
  - TLS handshake failures logged with NAS name and source IP (RADSEC only)
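
When checking whether a NAS IP falls within a configured CIDR, a quick offline test can help. The sketch below uses Go's net/netip; the nasClient type and matchNAS function are hypothetical, and returning the first matching client in configuration order is an assumption, since the text does not state how overlapping CIDRs are resolved.

```go
package main

import (
	"fmt"
	"net/netip"
)

// nasClient mirrors the name and cidr fields of a [[radius.client]] entry.
type nasClient struct {
	Name string
	CIDR string
}

// matchNAS returns the name of the first client whose CIDR contains ip.
// An empty CIDR is treated as 0.0.0.0/0, as the config comment describes.
func matchNAS(clients []nasClient, ip string) (string, bool) {
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return "", false
	}
	for _, c := range clients {
		cidr := c.CIDR
		if cidr == "" {
			cidr = "0.0.0.0/0"
		}
		prefix, err := netip.ParsePrefix(cidr)
		if err != nil {
			continue // skip malformed entries in this sketch
		}
		// Unmap converts IPv4-mapped IPv6 addresses so they match IPv4 prefixes.
		if prefix.Contains(addr.Unmap()) {
			return c.Name, true
		}
	}
	return "", false
}

func main() {
	clients := []nasClient{{Name: "vpn-concentrator", CIDR: "10.0.1.0/24"}}
	fmt.Println(matchNAS(clients, "10.0.1.7"))
	fmt.Println(matchNAS(clients, "192.168.1.1"))
}
```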

TLS handshake failures:

  - "no TLS certificate available": no cert configured at any level
  - Check TLS cascade: per-client tls_cert → module tls_cert → auto_tls → service cert
  - If using auto_tls, verify ACME CA is configured and reachable
  - If client_ca_pem set: NAS must present valid client certificate (mTLS)
  - Minimum TLS version defaults to 1.2 — check tls_min_version setting
  - Set tls_min_version = "1.1" only for legacy NAS devices that don't support 1.2+

Authentication failures (Access-Reject):

  - Access-Reject always returns "Access denied" in Reply-Message (no internal detail leak)
  - Check server logs for the actual reason (detailed reason logged at each reject point)
  - "bad authenticator" in logs: shared secret mismatch between NAS and config
  - "LDAP bind failed" in logs: user credentials incorrect or user not in directory
  - "User account disabled" in logs: user is disabled in directory
  - "Account temporarily locked" in logs: too many failed attempts, wait for lockout to expire
  - Lockout auto-clears after auth_lockout_duration expires (default 5m)
  - Abandoned lockout entries (< max failures, then idle) are cleaned up after 2× auth_lockout_duration
  - Check rate_limit settings if legitimate users are being locked out

X.509 certificate authentication issues:

  - x509 only works on RADSEC (TCP+TLS) — NAS must present client cert during TLS handshake
  - "Certificate validation service unavailable": [authentication.x509] not enabled or bridge error
  - "Certificate validation failed": cert expired, chain untrusted, revoked, or identity not in directory
  - Identity from cert is authoritative (RADIUS User-Name attribute is optional for x509)
  - Uses same authentication.x509 config (ca_pem, cert_subject_map, OCSP) as web signin
  - Check: 'config show authentication.x509' for CA pool and identity mapping settings

No RADIUS response (NAS timeout):

  - RADSEC: connections from unknown NAS IPs are dropped before any TLS handshake
  - UDP: unknown source IPs silently dropped (no information leak)
  - Per-NAS rate limit exceeded: increase max_requests_per_second_per_nas
  - Global concurrent auth limit reached: increase max_concurrent_authentications
  - LDAP service not ready: check directory service health
  - Idle timeout (default 30s): increase idle_timeout if NAS sends infrequent requests

HXEP (edge proxy / SNAT) issues:

  - "HXEP resolved real NAS IP" log: normal — shows socket IP → real NAS IP resolution
  - NAS rejected after HXEP: real NAS IP doesn't match any client CIDR — add correct CIDR
  - HXEP not resolving: verify service.hexon_edge_protocol = true and edge IP in service.hexon_edge_cidr
  - TLS handshake fails via edge: HXEP header parsed during TLS handshake read — check edge proxy config
  - UDP via edge: HXEP wrapping is transparent — no RADIUS-specific config needed
  - "Rejecting HXEP connection — NAS has per-client mTLS": client_ca_pem is incompatible
    with HXEP edge proxy — mTLS cannot be enforced because TLS handshake occurs before
    HXEP reveals the real NAS IP. Remove client_ca_pem or connect the NAS directly (no edge)

MFA issues:

  - "MFA enrollment required": user has no TOTP enrolled and skip_if_unavailable=false
    → Enroll user's TOTP via bastion 'totp enroll' or web signup, or set skip_if_unavailable=true
  - "Challenge expired or invalid": user took too long, increase challenge_timeout (max 300s)
  - Access-Challenge not working: NAS may not support Access-Challenge — use mfa_override="append"
  - Append mode "Invalid credentials": password+code not split correctly
    → Check separator config (default ":"), user must type password:123456
  - Email OTP not delivered: verify SMTP configured and user has email in directory
  - Per-client MFA override: set mfa_override on [[radius.client]] to "off", "challenge", or "append"
  - MFA only applies to password auth — x509 certificate is the second factor
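
Append mode splits the submitted value at the last occurrence of the separator, so passwords that themselves contain the separator still work. A minimal sketch (the function name splitAppend is hypothetical):

```go
package main

import (
	"fmt"
	"strings"
)

// splitAppend separates "password<sep>code" at the LAST occurrence of sep.
// ok is false when sep is empty or does not occur in input.
func splitAppend(input, sep string) (password, code string, ok bool) {
	if sep == "" {
		return "", "", false
	}
	i := strings.LastIndex(input, sep)
	if i < 0 {
		return "", "", false
	}
	return input[:i], input[i+len(sep):], true
}

func main() {
	// The password "pa:ss" contains the separator; only the last ":" splits.
	fmt.Println(splitAppend("pa:ss:123456", ":"))
}
```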

Mapping not applied (wrong VLAN/attributes):

  - Mappings evaluated by priority (highest first), first match wins
  - Empty groups = catch-all, ensure it has lowest priority
  - Verify user's group membership in directory matches mapping groups
  - Check: user groups via directory service
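
The evaluation order above can be sketched as follows. The mapping type and selectMapping function are hypothetical, and matching on membership in any one of the listed groups is an assumption; the text states only that the highest priority is evaluated first, the first match wins, and empty groups act as a catch-all.

```go
package main

import (
	"fmt"
	"sort"
)

// mapping mirrors the fields of a [[radius.mapping]] entry.
type mapping struct {
	Name       string
	Groups     []string
	Priority   int
	Attributes map[string]string
}

// selectMapping evaluates mappings by priority (highest first) and returns
// the first one that matches the user's groups.
func selectMapping(mappings []mapping, userGroups []string) (mapping, bool) {
	sorted := append([]mapping(nil), mappings...) // do not reorder the caller's slice
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Priority > sorted[j].Priority
	})
	member := make(map[string]bool, len(userGroups))
	for _, g := range userGroups {
		member[g] = true
	}
	for _, m := range sorted {
		if len(m.Groups) == 0 {
			return m, true // catch-all
		}
		for _, g := range m.Groups {
			if member[g] {
				return m, true
			}
		}
	}
	return mapping{}, false
}

func main() {
	mappings := []mapping{
		{Name: "default", Priority: 1},
		{Name: "network-admins", Groups: []string{"admins", "network-ops"}, Priority: 100},
	}
	m, _ := selectMapping(mappings, []string{"network-ops"})
	fmt.Println(m.Name)
}
```

If a catch-all mapping carried a higher priority than a group-specific one, it would always win, which is why the catch-all should have the lowest priority.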

Relationships

Module dependencies and interactions:

LDAP module: Password authentication uses LDAP bind for credential verification.

  RADIUS waits for LDAP readiness before accepting connections.

X.509 auth module: Certificate authentication validates client certificates against the CA.

  Full 7-layer validation: expiry, chain, CRL, identity extraction, directory, revocation.
  Uses same [authentication.x509] config as web signin (ca_pem, cert_subject_map, OCSP).
  Identity extracted from certificate is authoritative (RADIUS User-Name optional for x509).

Directory service: Group membership lookups for authorization mapping evaluation.

  User disabled status checked before authentication.

Certmanager: TLS certificate cascade — module cert, auto_tls (ACME), or service default.

  Per-client TLS overrides built at init time for NAS-specific certificates.

Managed listener: TCP and UDP socket lifecycle managed by Hexon’s listener infrastructure.

  RADSEC: TLS applied per-connection (not at listener level) for per-client cert selection.
  UDP: packets matched to NAS by source IP, dispatched directly to handlePacket.
  HXEP (Hexon Edge Protocol): real NAS IP resolved through SNAT/edge proxy — same pattern as VPN.
  TCP: two-phase NAS matching (socket IP for TLS config → HXEP real IP for final NAS match).
  UDP: HXEP PacketConn wrapper transparently resolves real IP — no handler changes needed.

TOTP module: MFA checks TOTP enrollment and validates codes (including recovery codes).
Email OTP module: MFA generates and validates email OTP codes.

  Bypasses web domain allowlist since RADIUS users may not match web-configured domains.

Cluster: All cross-module calls use standard cluster communication.
Metrics: Exposes radius_connections_total, radius_packets_total, radius_auth_total,

  radius_auth_duration, radius_errors_total, and radius_mapping_matches_total metrics.

Sessions module: Auth events recorded as type “radius” sessions on Access-Accept.

  Visible via 'sessions list --type=radius', 'sessions show', cluster-wide.
  TTL controlled by session_ttl config (default 1h).
  Rich metadata per session: NAS name/IP, transport (tcp/udp), TLS version,
  auth method, mapping, RADIUS attributes, user groups, packet ID, timing
  metrics (total_ms, auth_ms, authz_ms), and cert info for x509 (serial,
  subject, issuer, expiry, CA type).

Configuration: Reads [radius] TOML section. Validated at startup.
Admin CLI: RADIUS status and diagnostics available through admin commands.

SAML 2.0 Identity Provider

SAML 2.0 IdP with SSO/SLO, ECDSA-SHA256 signing, per-SP access control, and replay prevention

Overview

Hexon acts as a SAML 2.0 Identity Provider (IdP), authenticating users for external Service Providers (SPs). Users authenticate against Hexon’s directory module (LDAP, local users, SCIM-provisioned identities) and receive signed SAML assertions that the SP trusts.

Supported capabilities:

SAML 2.0 SSO (Single Sign-On) via HTTP-POST and HTTP-Redirect bindings
SAML 2.0 SLO (Single Logout) for both SP-initiated and IdP-initiated flows
ECDSA P-256 with SHA-256 assertion and response signing via ACME CA
RSA-SHA256 signature verification for incoming SP requests
Per-SP configuration with group-based access control (OR semantics)
Configurable NameID formats: persistent, email, transient, unspecified
Custom attribute mapping per SP (e.g., email to mail, groups to memberOf)
Replay attack prevention using distributed cache with Request ID tracking
Configurable assertion TTL and session TTL per deployment
Automatic signing key rotation via ACME CA lifecycle
Cluster-consistent signing (same key derived on all nodes)
Built-in SP simulator for testing (enabled only in test mode)

Authentication flow:

SP sends AuthnRequest to Hexon IdP (HTTP-POST or HTTP-Redirect binding)
Hexon validates the request (signature, format, replay check)
User authenticates via Hexon’s sign-in flow (directory-backed)
Hexon generates a signed SAML Response with Assertion
Browser POSTs the SAML Response back to the SP’s ACS URL
SP validates the assertion signature and establishes a local session

All SAML operations are cluster-wide, ensuring consistency and proper access control regardless of which node handles the request. Signing operations use the ACME CA key pair.

Config

Configuration under [authentication.saml] in hexon.toml:

[authentication.saml]

  enabled = true                    # Enable SAML IdP functionality
  assertion_ttl = "5m"              # Lifetime of SAML assertions (default: 5 minutes)
  session_ttl = "8h"                # SAML session duration for SLO tracking (default: 8 hours)
  require_signed_requests = false   # Require AuthnRequests to be signed by SP
  sign_responses = true             # Sign the entire SAML Response envelope
  sign_assertions = true            # Sign individual Assertion elements
  name_id_format = "persistent"     # Default NameID format (persistent, email, transient, unspecified)

Per-SP configuration via [[authentication.saml.service_providers]]:

[[authentication.saml.service_providers]]

  name = "Example SP"                                    # Display name for admin visibility
  entity_id = "https://sp.example.com/saml"              # SP's unique entity identifier
  acs_url = "https://sp.example.com/saml/acs"            # Assertion Consumer Service URL (POST binding)
  slo_url = "https://sp.example.com/saml/slo"            # Single Logout URL (optional)
  name_id_format = "email"                               # Override default NameID format for this SP
  allowed_groups = ["users", "admins"]                    # Group-based access (OR logic, empty = all users)
  attribute_mapping = { "email" = "mail", "groups" = "memberOf" }  # SAML attribute name mapping

The IdP entity ID is automatically derived from the service hostname:

  https://{service.hostname}/saml

Signing uses the ACME CA key pair. No separate signing key configuration is needed. The signing certificate is published in IdP metadata and rotates automatically with the ACME CA lifecycle.

Hot-reloadable: assertion_ttl, session_ttl, require_signed_requests, sign_responses, sign_assertions, name_id_format, service_providers list (add/remove/modify SPs). Cold (restart required): enabled.

Troubleshooting

Common symptoms and diagnostic steps:

SAML Response rejected by SP (signature validation failure):

  - Signing certificate rotated: SP may have cached old IdP metadata
  - Fetch fresh metadata from /saml/metadata and update SP configuration
  - Check: 'saml signing' to inspect current signing certificate details
  - Check: 'saml health' for overall IdP health status
  - Verify SP expects ECDSA-SHA256 (some older SPs only support RSA-SHA256)

AuthnRequest processing fails:

  - Malformed XML: check SP's AuthnRequest encoding (Base64 + URL-encoding for Redirect)
  - Replay detected: Request ID already seen in distributed cache (5 min window)
  - Invalid signature: SP signing cert not matching, or RSA-SHA256 verification failing
  - Time validation: IssueInstant clock skew exceeding allowed tolerance
  - Size limit exceeded: request too large (DoS protection)
  - Check: 'logs search saml' for detailed error messages

User cannot access SP (403 or no assertion generated):

  - Group restriction: user not in SP's allowed_groups list
  - Check: 'directory user <username>' for group membership
  - Check: 'diagnose user <username>' for cross-subsystem access diagnostic
  - Empty allowed_groups means all authenticated users can access the SP

SLO (Single Logout) not working:

  - SP's slo_url not configured in the service_providers entry
  - Session already expired (session_ttl exceeded)
  - LogoutRequest signature verification failing
  - SP not sending proper LogoutRequest or LogoutResponse
  - Check: 'sessions list --user=X' for active session state

IdP metadata endpoint returning errors:

  - ACME CA not available: 'autotls status' and 'certs list' for certificate state
  - Signing key derivation failing: 'cluster status' for cluster health
  - Check: 'saml metadata' to inspect metadata XML output

Assertion attribute mapping issues:

  - attribute_mapping maps Hexon field names to SAML attribute names
  - Common mappings: email->mail, groups->memberOf, username->uid
  - Directory must have the source field populated for the user
  - Check: 'directory user <username>' for available user attributes
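
The direction of attribute_mapping (Hexon field name to SAML attribute name) can be illustrated with a small sketch. The mapAttributes function is hypothetical, and omitting unmapped or unpopulated fields from the output is an assumption of this example.

```go
package main

import "fmt"

// mapAttributes renames directory fields to SAML attribute names.
// Fields that are missing or empty in the directory record are skipped.
func mapAttributes(user map[string][]string, mapping map[string]string) map[string][]string {
	out := make(map[string][]string)
	for field, samlName := range mapping {
		if values, ok := user[field]; ok && len(values) > 0 {
			out[samlName] = values
		}
	}
	return out
}

func main() {
	user := map[string][]string{
		"email":  {"alice@example.com"},
		"groups": {"users", "admins"},
	}
	mapping := map[string]string{"email": "mail", "groups": "memberOf"}
	fmt.Println(mapAttributes(user, mapping))
}
```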

Cross-cluster SAML issues:

  - All nodes must derive the same signing key (from the shared ACME CA)
  - Distributed cache must be consistent for replay prevention
  - Check: 'cluster status' for quorum and node health

NameID format mismatch:

  - SP expects a specific format (e.g., email) but IdP sends persistent
  - Per-SP name_id_format overrides the global default
  - Supported: persistent, email, transient, unspecified

Security

SAML-specific security measures and considerations:

Assertion Signing (ECDSA P-256 with SHA-256):

  - Algorithm: http://www.w3.org/2001/04/xmldsig-more#ecdsa-sha256
  - Key source: ACME CA key pair via cluster bridge
  - Cluster consistency: same key derived on all nodes
  - Automatic rotation: follows ACME CA certificate lifecycle
  - Both Response and Assertion can be signed independently (sign_responses, sign_assertions)
  - Signing certificate published in IdP metadata at /saml/metadata

SP Signature Verification:

  - AuthnRequest: verified when SP has certificate_pem configured; require_signed_requests rejects SPs without certs
  - LogoutRequest: verified when SP has certificate_pem configured (prevents forged logout attacks)
  - LogoutResponse: verified when SP has certificate_pem configured
  - Algorithm: ECDSA-SHA256 or RSA-SHA256 for incoming SP signatures
  - Best practice: always configure certificate_pem for each SP to enable signature verification

XXE (XML External Entity) Prevention:

  - XML parsers configured to disable external entity resolution
  - Prevents server-side request forgery via crafted XML payloads
  - Prevents file disclosure attacks via entity expansion
  - Size limits enforced on all incoming XML to prevent billion laughs attack

Replay Attack Prevention:

  - AuthnRequest IDs tracked in distributed cache (5-minute TTL)
  - Duplicate Request IDs rejected immediately
  - Writes are replicated to all nodes for consistency
  - Time validation: IssueInstant checked with configurable clock skew tolerance
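
The Request ID tracking above can be sketched with a single-node, in-memory stand-in for the distributed cache. The replayCache type and its methods are hypothetical, and cluster replication is omitted.

```go
package main

import (
	"fmt"
	"sync"
)

// replayCache remembers request IDs until their TTL expires.
type replayCache struct {
	mu   sync.Mutex
	ttl  int64            // seconds
	seen map[string]int64 // request ID -> expiry (Unix seconds)
}

func newReplayCache(ttlSeconds int64) *replayCache {
	return &replayCache{ttl: ttlSeconds, seen: make(map[string]int64)}
}

// accept returns true the first time id is seen within the TTL window and
// false when the same id is replayed before it expires.
func (c *replayCache) accept(id string, now int64) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	for k, exp := range c.seen {
		if exp <= now {
			delete(c.seen, k) // TTL-based cleanup
		}
	}
	if _, dup := c.seen[id]; dup {
		return false
	}
	c.seen[id] = now + c.ttl
	return true
}

func main() {
	cache := newReplayCache(300) // 5-minute window
	fmt.Println(cache.accept("_req-1", 0))
	fmt.Println(cache.accept("_req-1", 100))
}
```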

ACS URL Validation:

  - Strict URL matching against configured acs_url per SP
  - Prevents assertion redirect attacks where attacker substitutes ACS URL
  - URL must exactly match the registered SP configuration

Session Security:

  - SAML sessions stored in distributed cache with configurable TTL (default 8h)
  - Pending SSO requests stored during login (10-minute TTL, auto-cleanup)
  - All storage is replicated to all nodes for cluster-wide consistency
  - Sessions support both SP-initiated and IdP-initiated logout

Group-Based Access Control:

  - Per-SP allowed_groups with OR semantics (user needs any one listed group)
  - Groups fetched from directory at assertion time (not cached)
  - Empty allowed_groups list permits all authenticated users
  - Group changes take effect on next authentication (no active assertion revocation)
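
The OR semantics of allowed_groups can be expressed as a short check. The function name spAccessAllowed is hypothetical.

```go
package main

import "fmt"

// spAccessAllowed reports whether a user may receive an assertion for an SP.
// An empty allowedGroups list permits every authenticated user; otherwise
// membership in any one listed group is sufficient.
func spAccessAllowed(userGroups, allowedGroups []string) bool {
	if len(allowedGroups) == 0 {
		return true
	}
	member := make(map[string]struct{}, len(userGroups))
	for _, g := range userGroups {
		member[g] = struct{}{}
	}
	for _, g := range allowedGroups {
		if _, ok := member[g]; ok {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(spAccessAllowed([]string{"admins"}, []string{"users", "admins"}))
	fmt.Println(spAccessAllowed([]string{"guests"}, []string{"users", "admins"}))
}
```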

Clock Synchronization:

  - NTP required on all cluster nodes for consistent timestamp validation
  - IssueInstant validation uses configurable clock skew tolerance
  - Assertion NotBefore/NotOnOrAfter use assertion_ttl for time window

Relationships

Module dependencies and interactions:

directory: User authentication and group membership lookup. All user attributes
  (username, email, display name, groups) sourced from directory for assertion
  generation. Supports LDAP, local users, and SCIM-provisioned identities.

acme_ca: Provides the signing key pair for ECDSA-SHA256 signatures. Certificate
  rotation handled automatically. Signing certificate published in IdP metadata.
  Check certificate status via 'autotls status' and 'certs list'.

sessions: SAML session tracking for Single Logout (SLO). Sessions stored in
  distributed cache (saml_sessions) with broadcast writes. Supports both
  SP-initiated and IdP-initiated logout flows.

SAML service: HTTP endpoint registration for /saml/* routes. The service layer
  handles HTTP binding specifics (POST, Redirect) and delegates to module operations
  for assertion processing.

signin: User authentication flow. When an unauthenticated user hits the SSO
  endpoint, they are redirected through the sign-in flow. After successful
  authentication, the SSO continuation endpoint (/saml/continue) completes
  the SAML response generation.

cluster: Distributed cache for replay prevention (saml_requests) and session
  tracking. Writes are replicated to all nodes for consistency. Quorum required
  for reliable operation.

authentication.oidc: Shares the authentication infrastructure but operates
  independently. Users authenticated via OIDC can also have SAML sessions.
  No direct interaction between SAML and OIDC assertion/token flows.

firewall: Network-level access rules applied before SAML endpoints are reached.
  Firewall rules can restrict access to /saml/* paths by source IP or user.

waf: Web Application Firewall rules checked on incoming SAML requests before
  processing. Protects against XML-based attacks at the HTTP layer.

TOTP Authenticator

RFC 6238 time-based one-time password with QR enrollment, replay protection, and recovery codes

Overview

The TOTP module implements Time-based One-Time Password authentication per RFC 6238 for use with Google Authenticator, Authy, 1Password, and other TOTP-compatible authenticator apps.

Core capabilities:

Secret enrollment with QR code generation (otpauth:// URI)
Code validation with configurable time skew for clock drift tolerance
Replay protection via time step tracking (rejects step <= last used)
One-time recovery codes (SHA-256 hashed, consumed on use)
Enrollment confirmation requiring first code verification
Per-user enrollment status and secret deletion
Integration with signin MFA flow for second-factor authentication

Enrollment flow:

  1. Enroll: Generate 160-bit secret + QR code PNG (secret NOT persisted yet)
  2. ConfirmEnroll: User submits first code from authenticator to prove QR scan
     - Validates code, generates 10 recovery codes, persists secret to moduledata
     - Recovery codes returned in plaintext exactly once (user must save them)
  3. Subsequent authentications use Validate or ValidateRecoveryCode
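
Step 1 of the enrollment flow can be sketched in Go as below: a 160-bit secret from crypto/rand, Base32-encoded, plus the otpauth:// URI that a QR code would carry. The function names, the "Issuer:account" label layout, and the query parameter set are assumptions for this example. QR PNG rendering and the moduledata write in ConfirmEnroll are left out.

```go
package main

import (
	"crypto/rand"
	"encoding/base32"
	"fmt"
	"net/url"
	"strconv"
)

// newSecret returns 20 random bytes (160 bits) as unpadded Base32.
// Hypothetical helper for illustration.
func newSecret() (string, error) {
	raw := make([]byte, 20)
	if _, err := rand.Read(raw); err != nil {
		return "", err
	}
	enc := base32.StdEncoding.WithPadding(base32.NoPadding)
	return enc.EncodeToString(raw), nil
}

// provisioningURI builds an otpauth:// URI for authenticator apps.
// The label format and parameter names follow the widely used
// key-URI convention; treat them as an assumption about Hexon.
func provisioningURI(issuer, account, secret, algorithm string, digits, period int) string {
	q := url.Values{}
	q.Set("secret", secret)
	q.Set("issuer", issuer)
	q.Set("algorithm", algorithm)
	q.Set("digits", strconv.Itoa(digits))
	q.Set("period", strconv.Itoa(period))
	u := url.URL{
		Scheme:   "otpauth",
		Host:     "totp",
		Path:     "/" + issuer + ":" + account,
		RawQuery: q.Encode(),
	}
	return u.String()
}

func main() {
	secret, err := newSecret()
	if err != nil {
		panic(err)
	}
	// The secret would be held only in the pending enrollment until
	// the user confirms with a first valid code.
	fmt.Println(provisioningURI("HexonGateway", "alice", secret, "SHA1", 6, 30))
}
```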

Storage: TOTP secrets stored in the moduledata system (module name: “totp”) alongside other per-user credentials like passkeys (webauthn) and certificates (x509). Backend is Hexon KV (NATS JetStream) with automatic cluster replication.

RFC compliance:

RFC 6238 (TOTP) built on RFC 4226 (HOTP)
HMAC-SHA1 default (SHA256/SHA512 configurable but reduce app compatibility)
Dynamic truncation per RFC 4226 Section 5.4
Time step T = floor(unix_time / period) per RFC 6238 Section 4
Configurable skew window for clock drift tolerance
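
The RFC 4226/6238 computation listed above is short enough to show in full. The Go sketch below covers only the HMAC-SHA1 default and uses illustrative function names. It is a reference-style rendering of the RFC algorithm, not an excerpt of the module.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha1"
	"encoding/binary"
	"fmt"
)

// timeStep computes T = floor(unixTime / period) for non-negative times.
func timeStep(unixTime, period int64) int64 {
	return unixTime / period
}

// hotp computes the RFC 4226 value for a counter using HMAC-SHA1 and
// dynamic truncation (Section 5.4). TOTP is hotp with counter = T.
func hotp(secret []byte, counter uint64, digits int) string {
	var msg [8]byte
	binary.BigEndian.PutUint64(msg[:], counter)

	mac := hmac.New(sha1.New, secret)
	mac.Write(msg[:])
	sum := mac.Sum(nil)

	// The low nibble of the last byte selects a 4-byte window;
	// the top bit of that window is masked off.
	offset := sum[len(sum)-1] & 0x0f
	bin := binary.BigEndian.Uint32(sum[offset:offset+4]) & 0x7fffffff

	mod := uint32(1)
	for i := 0; i < digits; i++ {
		mod *= 10
	}
	return fmt.Sprintf("%0*d", digits, bin%mod)
}

func main() {
	// RFC 6238 Appendix B test secret, evaluated at Unix time 59.
	secret := []byte("12345678901234567890")
	fmt.Println(hotp(secret, uint64(timeStep(59, 30)), 8)) // 94287082
}
```

With digits = 6 the same inputs yield 287082, which is also the RFC 4226 test value for counter 1.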

Config

Configuration under [authentication.totp]:

[authentication.totp]

  enabled = true                    # Enable TOTP module
  issuer = "HexonGateway"          # Shown in authenticator apps (otpauth URI)
  algorithm = "SHA1"               # HMAC algorithm: SHA1 (most compatible), SHA256, SHA512
  digits = 6                       # Code length: 6 (standard) or 8
  period = 30                      # Time step in seconds (30 is RFC default)
  skew = 1                         # Allow +/- N steps for clock drift (1 = 30s tolerance)
  recovery_codes = 10              # Number of one-time recovery codes generated
  recovery_code_length = 6         # Character length of each recovery code
  rate_limit_auth = "10/1m"        # Rate limit for validation attempts

Algorithm compatibility notes:

  SHA1: Works with all authenticator apps (Google, Authy, 1Password, etc.)
  SHA256: Limited app support (may not work with Google Authenticator)
  SHA512: Minimal app support (not recommended for broad deployments)

Period and skew interaction:

  With period=30 and skew=1, codes are valid for ~90 seconds (current + 1 past + 1 future).
  Increasing skew improves tolerance for clock drift but reduces security.
  Period changes require re-enrollment of all users.
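
The arithmetic behind the acceptance window is (2 * skew + 1) * period. A small Go sketch makes the trade-off concrete. The helper names are assumptions for this example.

```go
package main

import "fmt"

// acceptedSteps lists the time steps a validator would try for the
// current step and a symmetric skew (hypothetical helper).
func acceptedSteps(current, skew int64) []int64 {
	if skew < 0 {
		skew = 0
	}
	steps := make([]int64, 0, 2*skew+1)
	for s := current - skew; s <= current+skew; s++ {
		steps = append(steps, s)
	}
	return steps
}

// acceptanceWindowSeconds returns how long a single code can be
// accepted, in seconds, for a given period and skew.
func acceptanceWindowSeconds(period, skew int64) int64 {
	return (2*skew + 1) * period
}

func main() {
	fmt.Println(acceptedSteps(100, 1))          // [99 100 101]
	fmt.Println(acceptanceWindowSeconds(30, 1)) // 90
	fmt.Println(acceptanceWindowSeconds(30, 2)) // 150
}
```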

Storage: Hexon KV (NATS JetStream) — no user password needed for writes.

All settings are cold (restart required to take effect on new enrollments). Existing enrollments retain their original algorithm, digits, and period.

Troubleshooting

Common symptoms and diagnostic steps:

User cannot enroll TOTP (enrollment fails):

  - Verify [authentication.totp] enabled = true
  - Check if user already has TOTP enrolled: 'totp status <username>'
  - If re-enrolling, delete the existing enrollment first (an admin must call the Delete operation)
  - Check telemetry logs for "Failed to generate TOTP secret" errors

QR code not scanning in authenticator app:

  - Verify issuer is set (some apps reject empty issuer)
  - Check algorithm compatibility: SHA1 works universally, SHA256/SHA512 may not
  - Ensure digits=6 and period=30 for maximum compatibility
  - Try manual entry using the Base32 secret string instead of QR

TOTP code rejected during authentication:

  - Clock drift: user device clock may be off by more than skew * period seconds
  - Replay protection: code was already used (step <= last_used_step)
  - Wrong authenticator entry: user may have multiple entries for same issuer
  - Check enrollment status: 'totp status <username>' to confirm enrollment exists
  - Verify algorithm matches: stored secret uses algorithm from enrollment time

Recovery code rejected:

  - Code already consumed (one-time use, removed from storage after validation)
  - No codes remaining: check RecoveryCodesRemaining in status response
  - Case sensitivity: codes are case-sensitive
  - Storage update failure: check logs for "Failed to consume recovery code"

Replay detection false positives:

  - Rapid successive code submissions: same 30-second window generates same code
  - Step update failed: if persisting the step counter fails, validation is rejected (fail-closed)
  - Check logs for "TOTP replay detected" with step and last_used_step values

TOTP Delete fails:

  - Cluster not ready: moduledata requires cluster connectivity
  - Delete is idempotent: returns Success=true even if no enrollment exists

Metrics for monitoring:

  - totp.enrollments_initiated: Enroll calls (QR generated)
  - totp.enrollments_confirmed: Successful ConfirmEnroll (secret persisted)
  - totp.enrollments_deleted: Successful Delete calls
  - totp.validations_total (result=valid|invalid|replay): Validate outcomes
  - totp.recovery_validations_total (result=valid|invalid|no_codes): Recovery code outcomes

Security

Security design and hardening:

Secret generation:

  160-bit random secrets (20 bytes) from crypto/rand, Base32-encoded.
  Provides 160 bits of entropy (2^160 possible secrets); brute-forcing the secret is computationally infeasible.

Code validation:

  Constant-time comparison via crypto/subtle prevents timing attacks.
  Attacker cannot determine partial code correctness from response time.

Replay protection:

  Each successful validation records the time step (LastUsedStep).
  Subsequent codes at step <= LastUsedStep are rejected.
  Step update is synchronous (not fire-and-forget) to prevent race conditions.
  If step persistence fails, validation is rejected (fail-closed).
  This prevents concurrent requests from replaying the same code.

Recovery codes:

  Generated with crypto/rand, stored as SHA-256 hashes.
  Plaintext returned to user exactly once during enrollment confirmation.
  Each code is consumed (removed) after successful validation.
  Matching uses constant-time comparison for timing-attack resistance.
  Consumption is synchronous with fail-closed semantics.
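
The recovery-code handling described above can be sketched in Go as follows. Storage is modelled as an in-memory slice of SHA-256 digests, and the function names are assumptions. A real implementation would also persist the shortened list before reporting success.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"fmt"
)

// hashCode returns the SHA-256 digest stored in place of the plaintext.
func hashCode(code string) [32]byte {
	return sha256.Sum256([]byte(code))
}

// consumeRecoveryCode compares the submitted code against every stored
// digest in constant time and, on a match, returns the list with that
// digest removed. Hypothetical helper for illustration.
func consumeRecoveryCode(stored [][32]byte, code string) ([][32]byte, bool) {
	h := hashCode(code)
	match := -1
	for i := range stored {
		if subtle.ConstantTimeCompare(stored[i][:], h[:]) == 1 {
			match = i
		}
	}
	if match < 0 {
		return stored, false
	}
	remaining := make([][32]byte, 0, len(stored)-1)
	remaining = append(remaining, stored[:match]...)
	remaining = append(remaining, stored[match+1:]...)
	return remaining, true
}

func main() {
	stored := [][32]byte{hashCode("K7Q2ZD"), hashCode("M3X9PA")}
	stored, ok := consumeRecoveryCode(stored, "K7Q2ZD")
	fmt.Println(ok, len(stored)) // true 1
	_, ok = consumeRecoveryCode(stored, "K7Q2ZD")
	fmt.Println(ok) // false
}
```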

Enrollment security:

  Two-phase enrollment: Enroll generates secret, ConfirmEnroll verifies first code.
  This proves the user successfully scanned the QR and their authenticator works.
  Re-enrollment blocked while existing enrollment exists (prevents overwrite race).

Clock drift tolerance:

  Configurable skew parameter allows +/- N time steps.
  Default skew=1 with period=30 accepts codes from 3 consecutive 30-second windows.
  Wider skew reduces security: skew=2 means a valid code window of 150 seconds.

Authentication flow integration:

  TOTP is a second factor only — never used as primary authentication.
  Requires prior successful primary authentication (password, certificate, etc.).
  MFA pending session must exist before TOTP validation is attempted.
  Failed TOTP does not reveal whether the user has TOTP enrolled.

Audit logging:

  All operations logged via telemetry with security context (username).
  Enrollment initiation, confirmation, validation (success/failure/replay),
  recovery code use, and deletion all generate structured log entries.
  Replay attempts logged at WARN level for security monitoring.

Relationships

Module dependencies and interactions:

signin: Primary consumer via MFA flow. When RequireMFA includes "passwd" and
  MFAMethods includes "totp", users with TOTP enrolled see the authenticator
  option on the MFA page. After primary auth creates "mfa_pending" session,
  user submits 6-digit code, signin calls totp.Validate, and on success
  the signin flow completes the login.

moduledata: Storage backend for TOTP secrets. Module name "totp" in moduledata
  stores the per-user secret, algorithm, digits, period, last used step, and
  recovery codes.

directory: Provides user context and group membership.
  TOTP enrollment status can influence access policies.

sessions: MFA pending session must exist before TOTP validation.
  Successful TOTP validation triggers session upgrade to fully authenticated.

authentication.otp: Sibling MFA method. Users may see both TOTP and email OTP
  options on the MFA page. TOTP is preferred when enrolled (no email delivery delay).

config: Reads [authentication.totp] settings dynamically at runtime.
  Algorithm, digits, and period from enrollment time are stored with the secret,
  so config changes only affect new enrollments.

telemetry: Structured logging with security context for all operations.
  Metrics counters for enrollment, validation, and recovery code operations.

Admin CLI: TOTP management commands (list enrollments, check status, delete).
  Admin can delete TOTP enrollment for locked-out users.

WebAuthn Passkeys

FIDO2/WebAuthn passwordless authentication with passkey management and clone detection

Overview

The WebAuthn module implements FIDO2/WebAuthn Level 2 passwordless authentication, acting as a WebAuthn Relying Party (RP). It manages the full passkey lifecycle: registration, authentication, revocation, and expiration monitoring.

Key capabilities:

Multiple passkeys per user (laptop, phone, YubiKey, etc.)
Challenge-response registration and authentication ceremonies
Platform authenticators (Touch ID, Face ID, Windows Hello)
Cross-platform authenticators (YubiKey, other FIDO2 security keys)
Attestation statement validation (none, packed, fido-u2f formats)
Clone detection via signature counter monitoring
ECDSA P-256 (ES256) and RSA-2048 (RS256) public key cryptography
Passkey expiration scheduler with email reminders
Distributed passkey storage (replicated or shared filesystem)
Session creation after successful authentication
Optional device naming for passkey identification

Operations: registration ceremonies, authentication ceremonies, passkey management (revoke, get, list), observability metrics, and scheduled expiration reminders.

Storage architecture follows a layered approach:

LDAP is the single source of truth for passkey data
Multi-passkey format: supports multiple passkeys per user with revocation tracking
Legacy single-passkey format auto-detected and migrated on first write
Directory module syncs LDAP to memory cache (including passkey data)
WebAuthn reads passkey data from the directory cache
No separate passkey cache — eliminates synchronization issues
Temporary challenge sessions use in-memory storage with 5-10 minute TTL
Passkey records also persisted to distributed file storage

Config

Configuration under [authentication.webauthn]:

  name = "Hexon Identity"              # RP name shown to users during ceremony
  rpid = "login.example.com"           # Relying Party ID (must match origin domain)
  origin = "https://login.example.com" # Origin URL (must match browser origin exactly)
  skip_port_check = true               # Skip port in origin validation (default: true)
  type = "preferred"                   # Authenticator type: "platform", "cross-platform", "preferred"
  valid_days = 365                     # Passkey validity period in days
  rate_limit_register = "5/1h"         # Registration rate limit per user
  rate_limit_auth = "20/1m"            # Authentication rate limit per user

Expiration reminder settings:

  renewal_reminder_enabled = true      # Enable expiration reminder emails (default: true)
  renewal_reminder_interval = "24h"    # Check frequency (default: "24h")
  renewal_reminder_days = 15           # Days before expiry to start sending (default: 15)
  renewal_reminder_timeout = "5m"      # Operation timeout (default: "5m")
  renewal_reminder_retries = 3         # Max retry attempts (default: 3)
  renewal_reminder_retry_delay = "30s" # Delay between retries (default: "30s")

Hot-reload behavior:

  Hot-reloaded (immediate effect):
    - valid_days: applies to newly registered passkeys only
    - Scheduler settings: interval, timeout, retries, retry_delay (via IntervalFunc)

  Cold (require restart):
    - rpid, origin, type, skip_port_check: cached at init time
    - Reason: changing RP ID or origin mid-flight breaks existing passkey validation

Cluster storage modes:

  Replicated mode (filesystem.mode = "replicated"):
    - Passkeys broadcast to all nodes with quorum (>50% must confirm)
    - Automatic cross-node synchronization
  Shared mode (filesystem.mode = "shared"):
    - Passkeys on shared filesystem (NFS), no replication needed
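
The quorum rule for replicated mode (more than 50% of nodes must confirm) reduces to one comparison. The Go helper below is illustrative only. Its name and the choice to count the writing node among the acknowledgements are assumptions.

```go
package main

import "fmt"

// quorumReached reports whether strictly more than half of the
// cluster nodes acknowledged a write (hypothetical helper).
func quorumReached(acks, nodes int) bool {
	return nodes > 0 && 2*acks > nodes
}

func main() {
	fmt.Println(quorumReached(2, 3)) // true
	fmt.Println(quorumReached(2, 4)) // false: exactly 50% is not enough
	fmt.Println(quorumReached(3, 4)) // true
}
```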

Troubleshooting

Common symptoms and diagnostic steps:

Registration failures (“invalid attestation”):

  - RP ID mismatch: rpid must match the domain portion of origin
  - Origin mismatch: origin must exactly match the browser URL (scheme + host + port)
  - Port issues in containers: set skip_port_check=true for K8s/Docker deployments
  - Unsupported attestation format: only none, packed, fido-u2f are supported
  - Check config: 'config show authentication.webauthn'
  - Diagnose user: 'diagnose user <username>'

Authentication failures (“signature verification failed”):

  - Passkey expired: check valid_until in passkey record ('webauthn list <username>')
  - Wrong RP ID hash: rpid changed since passkey was registered (requires re-registration)
  - Corrupted public key: revoke and re-register the passkey
  - Check passkey details: 'webauthn list <username>'

Clone detection alerts (“counter did not increase”):

  - Possible cloned authenticator: investigate immediately (security event)
  - Counter validation only enforced when both stored and new counters are non-zero
  - Some authenticators do not support counters (always 0) -- this is normal
  - Counter wrapped around (rare, requires 2^32 uses)
  - Authenticator reset: requires re-registration after investigation
  - Check logs: 'logs search "clone" --module=webauthn'

Challenge expired or not found:

  - Challenge TTL is 5-10 minutes; user took too long to respond
  - Challenge already consumed (single-use; cannot retry with same challenge)
  - Memory storage broadcast delay in large clusters
  - Retry the ceremony from the beginning (BeginRegistration/BeginAuthentication)

Expiration reminders not being sent:

  - Verify scheduler is enabled: renewal_reminder_enabled = true
  - Check SMTP health: 'smtp health'
  - Verify user has email in directory: 'directory user <username>'
  - Disabled users are skipped (by design)
  - Check scheduler status: 'health components'
  - Only the cluster leader runs the check (leader-only scheduling)
  - Look for errors: 'logs search "expiration" --module=webauthn'

Passkey not found during authentication:

  - User has no passkey registered: 'webauthn list <username>'
  - Specific passkey was revoked: 'webauthn list <username>' shows revoked status
  - Credential ID mismatch: browser sending different credential than stored
  - Directory sync delay: passkey in LDAP but not yet in memory cache
  - Trigger sync: 'directory sync <username>'
  - Legacy format issue: check if user's moduledata has old flat format vs new array

502/503 during WebAuthn ceremony:

  - Filestorage unavailable: check filesystem health
  - Quorum not reached in replicated mode: check cluster status ('cluster status')
  - Memory storage broadcast failure: check cluster connectivity ('ping')

Metrics not updating:

  - Check metrics endpoint: 'webauthn metrics'
  - Verify telemetry module is healthy: 'health components'

Security

Critical security requirements:

Challenge-Response Protocol:

  - 32-byte cryptographic random challenges (crypto/rand)
  - Single-use: challenge deleted immediately after validation
  - TTL: 5-10 minutes, expired challenges rejected
  - Prevents replay attacks entirely

Clone Detection (Signature Counter):

  - Authenticator maintains incrementing signature counter
  - On each authentication: new counter must exceed stored counter
  - If new <= stored (both non-zero): REJECT -- possible cloned authenticator
  - Counter=0 authenticators exempt (per WebAuthn specification)
  - Counter updates NOT persisted to LDAP (avoids write on every auth)
  - Detection works by comparing against registration-time stored value
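
The counter rule above fits in a few lines of Go. The function name is an assumption. Note that, because counter updates are not persisted, the stored value in this scheme is the one captured at registration.

```go
package main

import "fmt"

// counterAcceptable applies the clone-detection rule: the check is
// enforced only when both counters are non-zero, and then the new
// counter must be strictly greater than the stored one.
// Hypothetical helper for illustration.
func counterAcceptable(stored, received uint32) bool {
	if stored == 0 || received == 0 {
		return true // authenticator does not maintain a counter
	}
	return received > stored
}

func main() {
	fmt.Println(counterAcceptable(10, 11)) // true
	fmt.Println(counterAcceptable(10, 10)) // false: possible clone
	fmt.Println(counterAcceptable(0, 0))   // true: counters unsupported
}
```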

Attestation Validation:

  - Performed during registration for all supported formats
  - Current mode: permissive (registration succeeds even if validation fails)
  - Validation results logged for security auditing
  - For stricter enforcement: modify FinishRegistration to reject failures
  - Future: FIDO Metadata Service (MDS) for authenticator trust verification

Origin and RP ID Validation:

  - Origin must be HTTPS (WebAuthn specification requirement)
  - RP ID must match the domain in the origin URL
  - Browser enforces same-origin policy on credentials
  - skip_port_check=true relaxes port matching only (not scheme or domain)
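
The interaction between origin, rpid, and skip_port_check can be sketched in Go as below. This is an assumption-laden illustration: the function name, the case-insensitive host comparison, the default port of 443, and accepting an RP ID that is a parent domain of the host are choices made for this example, not confirmed module behavior.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// effectivePort returns the explicit port or the HTTPS default.
func effectivePort(u *url.URL) string {
	if p := u.Port(); p != "" {
		return p
	}
	return "443"
}

// originAllowed checks a browser-reported origin against the configured
// origin and RP ID. Scheme and host are always compared; the port is
// compared only when skipPortCheck is false. Hypothetical helper.
func originAllowed(clientOrigin, configuredOrigin, rpID string, skipPortCheck bool) bool {
	got, err := url.Parse(clientOrigin)
	if err != nil {
		return false
	}
	want, err := url.Parse(configuredOrigin)
	if err != nil {
		return false
	}
	if got.Scheme != "https" || want.Scheme != "https" {
		return false
	}
	if !strings.EqualFold(got.Hostname(), want.Hostname()) {
		return false
	}
	if !skipPortCheck && effectivePort(got) != effectivePort(want) {
		return false
	}
	host := strings.ToLower(got.Hostname())
	rp := strings.ToLower(rpID)
	return host == rp || strings.HasSuffix(host, "."+rp)
}

func main() {
	cfg := "https://login.example.com"
	fmt.Println(originAllowed("https://login.example.com:8443", cfg, "login.example.com", true))
	fmt.Println(originAllowed("https://login.example.com:8443", cfg, "login.example.com", false))
}
```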

Public Key Cryptography:

  - Keys stored in COSE format (RFC 8152)
  - ES256 (ECDSA P-256): primary algorithm
  - RS256 (RSA-2048): secondary algorithm
  - Private keys never leave the authenticator hardware
  - Public keys stored base64-encoded in LDAP ModuleData

Rate Limiting:

  - Registration: configurable per-user limit (default 5/1h)
  - Authentication: configurable per-user limit (default 20/1m)
  - Prevents brute-force and denial-of-service attacks

Operational security recommendations:

  - Monitor clone detection alerts as critical security events
  - Set appropriate valid_days for your security policy
  - Implement passkey rotation procedures
  - Revoke passkeys immediately on device loss or compromise
  - Enable expiration reminders to prevent credential lapses
  - Audit all authentication events via telemetry logs
  - Consider enabling stricter attestation for high-security deployments

Relationships

Module dependencies and interactions:

directory: Primary passkey data source. WebAuthn reads passkeys from the
  directory's in-memory cache (synced from LDAP). Also provides
  user listing for expiration checks.
  User's FullName used for personalized reminder emails.

LDAP: Ultimate source of truth for passkey storage. Passkeys stored in
  the module data LDAP attribute. The calling layer is responsible for
  writing passkey data to LDAP after registration.

filestorage: Distributed credential storage with active/ and revoked/
  directories. Supports replicated mode (quorum broadcast) and shared mode
  (NFS). Used for passkey record persistence alongside LDAP.

sessions: Creates authenticated sessions after successful WebAuthn
  authentication. Session module and TTL configurable per-authentication
  request (e.g., "sshproxy" module, 8h TTL).

storage.memory: Temporary challenge session storage with broadcast to all
  cluster nodes. TTL-based expiration (5-10 minutes). Challenges stored under
  cache type "webauthn_sessions".

smtp: Sends passkey expiration reminder emails via SMTP module.
  ACL enforced -- only the webauthn module is authorized to call this operation.

telemetry: Security audit logging at multiple levels. LevelError for clone
  detection and signature failures. LevelWarn for expired passkeys and invalid
  challenges. LevelInfo for successful operations.

scheduler: Expiration check runs as a leader-only scheduled task
  (distributed lock for safety). Configurable interval, timeout,
  retries, and retry delay.

config: Hot-reloadable configuration via the configuration system. Some fields cached
  at init (rpid, origin, type, skip_port_check) to prevent mid-flight breakage.

External dependency:

CBOR decoding for attestation objects and COSE key parsing (RFC 8152).

X.509 Client Certificate Authentication

Client certificate authentication with external PKI validation, internal CA enrollment, CRL/OCSP revocation checking, and auto-renewal

Overview

X.509 client certificate authentication provides two distinct modes of operation:

External PKI Validation:

  Validates client certificates from external PKI infrastructure (FreeIPA, Active Directory,
  LDAP) presented during the TLS handshake. Hexon performs validation only -- certificate
  lifecycle (issuance, renewal, revocation) is managed by the external PKI.

Internal CA Enrollment:

  Issues and manages client certificates using Hexon's internal ACME CA. Users self-enroll
  at /signup/x509 after authenticating. Supports auto-renewal, self-revocation, and
  multi-certificate overlap (max 2 active per user during renewal windows).

Validation is performed as an ordered, defense-in-depth pipeline:

  1. Certificate expiration check (NotBefore/NotAfter)
  2. TLS handshake validation against ClientCAs pool (chain, signature, trust)
  3. Application-level chain validation (full chain verify with client auth usage check)
  4. CRL check -- O(1) in-memory lookup with atomic map swap (if enabled)
  5. Identity extraction from certificate subject (cn, uid, email, or upn)
  6. Directory lookup (user exists and is active)
  7. OCSP check with cluster-cached responses and configurable soft-fail (if enabled)
  8. Session creation with username, email, groups, and certificate metadata

All validation operations are cluster-wide, ensuring consistent behavior regardless of which node handles the authentication request.

Typical authentication latency:

  - Cached path (CRL + cached OCSP): 20-30ms total
  - Uncached path (first OCSP query): 70ms-5s depending on ocsp_timeout
  - CRL lookup: less than 1ms (in-memory hash map)
  - OCSP cached lookup: less than 1ms (cluster memory)

Memory footprint:

  - CRL map: ~100 bytes per revoked certificate (10K certs = ~1MB)
  - OCSP cache: ~200 bytes per response (1K users = ~200KB)

Config

Core configuration under [authentication.x509]:

[authentication.x509]

  enabled = true                     # Enable X.509 authentication
  ca_pem = """..."""                 # CA certificate(s) in PEM format (root + intermediates)

CRL (Certificate Revocation List):

  crl_enabled = true                 # Enable CRL-based revocation checking
  crl_url = "http://ca.example.com/ca.crl"  # CRL distribution point URL
  crl_refresh = "1h"                # CRL refresh interval (default: 1h)
  crl_timeout = "30s"               # HTTP download timeout (default: 30s)
  crl_max_size = 0                  # Max CRL size in bytes (0 = unlimited)

OCSP (Online Certificate Status Protocol):

  ocsp_enabled = true               # Enable OCSP revocation checking
  ocsp_url = "http://ocsp.example.com"  # OCSP responder URL
  ocsp_cache = "15m"                # Cache duration for OCSP responses (default: 15m)
  ocsp_timeout = "5s"               # HTTP timeout for OCSP queries (default: 5s)
  ocsp_soft_fail = true             # Allow auth if OCSP is unreachable (default: true)

IMPORTANT: OCSP timeout is independent of operations.wait_timeout. X.509 validation uses a dynamic timeout of ocsp_timeout + 5s buffer. This ensures OCSP queries complete with their full configured timeout regardless of the global wait_timeout.

Identity Mapping:

  [identity.cert_subject_map]
  username = "cn"                   # Certificate field for username extraction
                                     # Options: "cn" (CommonName), "uid" (LDAP UID OID),
                                     # "email" (email address), "upn" (AD User Principal Name)

Internal CA Enrollment:

  enroll_enabled = true              # Enable self-service certificate enrollment
  enroll_validity_days = 365         # Certificate validity period (default: 365)
  enroll_algorithm = "ECDSA-P256"   # Key algorithm: "ECDSA-P256" or "RSA-2048"
  enroll_max_active_certs = 10       # Max active certificates per user (1-50, default: 10)
  enroll_rate_limit = "3/1h"         # Enrollment rate limit per user (default: "3/1h")
  revoke_rate_limit = "5/1h"         # Revocation rate limit per user (default: "5/1h")
  enroll_p12_min_entropy = 60        # Min entropy bits for PKCS#12 password (default: 60)

Auto-Renewal:

  enroll_auto_renew = true           # Enable automatic renewal before expiry (default: true)
  enroll_auto_renew_days = 15        # Days before expiry to trigger renewal (default: 15)
  enroll_auto_renew_interval = "24h" # Check interval for expiring certs (default: "24h")
  enroll_auto_renew_timeout = "5m"   # Scheduler operation timeout (default: "5m")
  enroll_auto_renew_retries = 3      # Max retry attempts on failure (default: 3)
  enroll_auto_renew_retry_delay = "30s"  # Delay between retries (default: "30s")

PKI-Specific Identity Mapping:

  FreeIPA:          username = "uid"   (FreeIPA uses UID, not CN)
  Active Directory: username = "upn"   (AD uses User Principal Name)
  Generic LDAP:     username = "cn"    (CommonName is default)

Hot-reloadable: ca_pem, CRL settings, OCSP settings, identity mapping, enrollment settings. Cold (restart required): enabled.

Troubleshooting

Common error messages and diagnostic steps:

“certificate revoked (CRL)”:

  - Certificate serial number found in the downloaded CRL
  - Verify revocation status with external CA tools
  - User must obtain a new certificate from the PKI
  - Check CRL freshness: 'certs x509 metrics' for last refresh time

“user not found in directory”:

  - Identity field extracted from certificate does not match any directory user
  - Check cert_subject_map.username setting matches your PKI convention
  - Use 'directory user <username>' to verify user exists in directory
  - Use 'diagnose user <username>' for cross-subsystem check
  - Verify directory sync is current: 'directory status'

“failed to extract identity”:

  - The configured subject field (cn/uid/email/upn) is missing from the certificate
  - Inspect certificate subject with: openssl x509 -in cert.pem -noout -subject
  - Change cert_subject_map.username to a field present in the certificate

“OCSP query failed (soft-fail)”:

  - OCSP responder is unreachable but authentication proceeds (warning only)
  - Soft-fail is the default behavior (ocsp_soft_fail = true)
  - Check OCSP URL: 'net http <ocsp_url>'
  - Verify OCSP responder is operational
  - If hard-fail is required, set ocsp_soft_fail = false

“OCSP query failed (hard-fail)”:

  - OCSP responder is unreachable and ocsp_soft_fail = false
  - Authentication is blocked until OCSP responder recovers
  - Consider enabling soft-fail if OCSP outages are frequent
  - Check connectivity: 'net tcp <ocsp_host:port>'

“failed to download CRL”:

  - CRL URL is unreachable or returned an error
  - Check URL: 'net http <crl_url>'
  - Existing in-memory CRL continues to be used until refresh succeeds
  - Check for size limits: crl_max_size may be rejecting a large CRL

“certificate validation timeout”:

  - OCSP query or validation step exceeded the dynamic timeout
  - X.509 uses a dynamic timeout of ocsp_timeout + 5s, NOT operations.wait_timeout
  - Increase ocsp_timeout if OCSP responder is slow
  - Check OCSP responder latency: 'net latency <ocsp_host:port>'

“certificate expired or not yet valid”:

  - Certificate NotBefore/NotAfter check failed
  - Check certificate dates: openssl x509 -in cert.pem -noout -dates
  - Verify system clock is correct (NTP drift can cause false failures)

Session extension rejected (“x509_revocation”):

  - Certificate was revoked after the initial session was created
  - Internal CA: serial checked against the revocation index
  - External CA: OCSP check performed using stored certificate data from session
  - User must obtain a new certificate and re-authenticate

Enrollment failures:

  - "rate limit exceeded": user hit enroll_rate_limit, wait for window to reset
  - "PKCS#12 password too weak": password entropy below enroll_p12_min_entropy
  - "enrollment not enabled": set enroll_enabled = true in config
  - Check enrollment metrics: 'certs x509 metrics'

Auto-renewal not working:

  - User has no email in directory (skipped with warning)
  - User opted out via /signup/x509 status page (auto-renewal opt-out)
  - Certificate missing stored certificate data (older certificates)
  - enroll_auto_renew = false in config
  - Cluster lock contention: only one node processes renewals at a time
  - Check: 'certs x509 list' for certificate status per user

Browser not prompting for certificate:

  - Firefox: Settings > Privacy & Security > Certificates > View Certificates > Import
  - Chrome: Settings > Privacy and Security > Security > Manage Certificates > Import
  - Certificate must include ExtKeyUsageClientAuth
  - CA certificate must be in browser trust store
  - Verify TLS listener has ClientCAs configured (check logs for "x509 CA loaded")

Security

Defense-in-Depth Validation Pipeline:

Six independent validation layers ensure no single check failure compromises security:

  1. Certificate expiration (NotBefore/NotAfter checked first, fail-fast)
  2. TLS handshake with ClientCAs pool (chain, signature, trust anchor verification)
  3. Application-level chain validation (full chain verify with client auth usage check)
  4. CRL revocation check -- O(1) in-memory, race-condition safe (if enabled)
  5. Directory lookup confirms user exists and is active
  6. OCSP real-time revocation check with cluster caching (if enabled)

Identity is extracted ONLY after successful validation. Unvalidated certificate fields are never trusted.

TOCTOU Protection for CRL:

  CRL updates use atomic.Value to prevent Time-of-Check-Time-of-Use race conditions.
  The entire revoked serial map is built from the new CRL, then atomically swapped.
  Readers always see a consistent snapshot. No locks required for O(1) lookups.

Memory Exhaustion Protection:

  - CRL downloads have configurable timeout (crl_timeout, default 30s)
  - CRL size capped by crl_max_size (prevents DoS via malicious CRL files)
  - OCSP responses cached with TTL to limit memory growth

Configurable Soft-Fail OCSP:

  When ocsp_soft_fail = true (default), OCSP infrastructure failures allow authentication
  to proceed. The certificate is already validated by expiration + TLS handshake + chain
  validation + CRL + directory lookup before OCSP is checked.
  IMPORTANT: Revoked certificates ALWAYS block authentication regardless of soft-fail mode.
  Only infrastructure failures (unreachable, timeout) are affected by the soft-fail setting.
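
The soft-fail decision can be written as a small Go function. The status type and names below are assumptions for this example. Only the "unavailable" outcome depends on ocsp_soft_fail.

```go
package main

import "fmt"

// ocspStatus is an illustrative outcome of an OCSP check.
type ocspStatus int

const (
	ocspGood        ocspStatus = iota // responder says the certificate is valid
	ocspRevoked                       // responder says the certificate is revoked
	ocspUnavailable                   // responder unreachable or timed out
)

// allowAfterOCSP decides whether authentication may proceed. Revoked
// always blocks; softFail only affects infrastructure failures.
func allowAfterOCSP(status ocspStatus, softFail bool) bool {
	switch status {
	case ocspGood:
		return true
	case ocspRevoked:
		return false
	default:
		return softFail
	}
}

func main() {
	fmt.Println(allowAfterOCSP(ocspUnavailable, true))  // true
	fmt.Println(allowAfterOCSP(ocspUnavailable, false)) // false
	fmt.Println(allowAfterOCSP(ocspRevoked, true))      // false
}
```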

Session TTL Capping:

  X.509 sessions are automatically capped to the certificate validity period.
  Session TTL = min(configured_TTL, cert_not_after - now). This prevents sessions from
  outliving their authenticating certificate. Applied at both signin (caller-side) and
  sessions module (defense-in-depth). Example: if certificate expires in 12h but config
  TTL is 24h, session TTL is capped to 12h.

Session Extension Revocation Check:

  When an X.509 session is extended, revocation is re-checked automatically:
  - Internal CA: serial checked against the revocation index
  - External CA: OCSP cache checked, full OCSP query if certificate data is available
  - Revoked certificates always block extension; soft-fail allows extension if OCSP is down

Internal CA Enrollment Security:

  - PKCS#12 bundles encrypted with Modern2023 profile (AES-256-CBC, SHA-256 HMAC)
  - Minimum password entropy enforced (enroll_p12_min_entropy, default 60 bits)
  - Rate limiting on enrollment and revocation endpoints (per-user)
  - Re-enrollment auto-revokes ALL existing certificates (fresh start with new key)
  - Auto-renewal preserves existing public key (only re-signs with new validity)
  - Maximum 2 active certificates per user (oldest auto-revoked when limit exceeded)
  - Revocation reason codes follow RFC 5280

Logging Security:

  Certificate serial numbers are logged only at DEBUG level. INFO logs contain username
  only, preventing information disclosure in production log aggregation systems.

Cluster Caching:

  OCSP responses are replicated to all nodes asynchronously. Eventual consistency
  is acceptable for cache data. Cache TTL is controlled by ocsp_cache config
  (default 15m).

Relationships

Module dependencies and interactions:

directory: User lookup during validation step 6. Confirms user exists and is
  active, returns email, full name, and group memberships. Also provides email
  addresses for auto-renewal notifications.

sessions: Session creation after successful validation. Session TTL capped to
  certificate validity. Revocation is re-checked when sessions are extended.
  Session metadata stores certificate data for external CA OCSP re-checks.

acme: Internal CA certificate signing for enrollment. Certificate revocation triggers
  CRL rebuild. Updated CRL is replicated to all nodes immediately.

identity: cert_subject_map configuration determines which certificate field maps to
  username (cn, uid, email, upn). Shared config section [identity.cert_subject_map].

signin: The /signin/x509 route triggers X.509 authentication flow. Validates
  the certificate and creates a session on success.

proxy: Per-mapping mTLS support (mtls=true) uses X.509 for mutual TLS at the
  route level. Certificate validated against ACME CA bundle or external PKI.

cluster: OCSP responses cached in distributed memory and replicated to all nodes.
  Auto-renewal uses a distributed lock to prevent duplicate processing across
  cluster nodes.

smtp: Auto-renewal sends renewed certificate bundles to users via email. Users
  without email addresses in directory are skipped with a warning.

moduledata: Certificate records stored per-user in the directory backend.
  Each user can have up to 2 active certificates (during renewal overlap),
  plus a revocation history and an auto-renewal opt-out flag.

Onboarding Service

Self-service user onboarding with magic link verification and passkey enrollment

Overview

The onboarding service provides a streamlined SPA flow for new users to verify their email and enroll a passkey. It combines the magic link passwordless flow with WebAuthn passkey registration into a single guided experience.

The service is a single GET endpoint at /onboarding that renders different steps based on the user’s authentication state. All actual operations (magic link, passkey enrollment) are delegated to existing API endpoints — no new backend APIs are needed.

Onboarding flow (4 steps):

  Step 0: Email entry — user submits email address
  Step 1: Magic link polling — browser polls for authorization while user clicks link in email
  Step 2: Passkey enrollment — WebAuthn ceremony to register a biometric/hardware key
  Step 3: Success — animated confirmation, auto-redirect to /profile

Three handler states:

  1. No session — render email step (unauthenticated users start here)
  2. Authenticated session + no passkey — create mfa_pending session, render passkey step
  3. Authenticated session + has passkey — redirect to /profile (already onboarded)
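
The three handler states reduce to a decision on two facts: whether a valid authenticated session exists and whether the user already has a passkey. The Go sketch below illustrates that branching only; decideOnboardingStep and the action constants are hypothetical names, and the real handler also creates the mfa_pending session in the second case.

```go
package main

import "fmt"

// onboardingAction is a hypothetical label for what the handler does next.
type onboardingAction string

const (
	renderEmailStep   onboardingAction = "render-email-step"
	renderPasskeyStep onboardingAction = "render-passkey-step"
	redirectToProfile onboardingAction = "redirect-to-profile"
)

// decideOnboardingStep maps the user's state to one of the three handler states.
func decideOnboardingStep(authenticated, hasPasskey bool) onboardingAction {
	switch {
	case !authenticated:
		return renderEmailStep // state 1: no session
	case !hasPasskey:
		return renderPasskeyStep // state 2: session, no passkey yet
	default:
		return redirectToProfile // state 3: already onboarded
	}
}

func main() {
	fmt.Println(decideOnboardingStep(false, false)) // render-email-step
	fmt.Println(decideOnboardingStep(true, false))  // render-passkey-step
	fmt.Println(decideOnboardingStep(true, true))   // redirect-to-profile
}
```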

The service is gated by the portal being enabled (portal = true). When portal is disabled, the /onboarding route is not registered.

Config

The onboarding service has no dedicated configuration section. It relies on:

  [service]
    portal = true                    # Must be enabled for onboarding route registration
    session_mfa_pending = "5m"       # TTL for the mfa_pending session during passkey enrollment
    cookie_name = "hexon"            # Session cookie name (for detecting authenticated users)
    cookie_domain = ""               # Cookie domain for cross-subdomain support

  [service.signin.magiclink]         # Magic link settings used by /api/signin/magiclink
    enabled = true
    code_ttl = "10m"
    rate_limit = "5/1m"

  [protection]
    pow = true                       # PoW protection applied automatically (no DisablePoW on route)

The onboarding page inherits PoW protection from the global middleware. Authenticated users skip PoW automatically (valid session cookies are detected).

Endpoints

UI endpoint:

  GET /onboarding                    Onboarding SPA page (all steps rendered client-side)

The SPA calls existing API endpoints via fetch():

  POST /api/signin/magiclink         Send magic link email (existing signin service)
  POST /api/signin/magiclink/poll    Poll for magic link authorization (existing signin service)
  POST /api/signup/passkey/begin     Begin WebAuthn registration ceremony (existing signup service)
  POST /api/signup/passkey/finish    Complete WebAuthn registration (existing signup service)

On magic link authorization, the poll handler (in the signin service) creates an authenticated "user" session. The onboarding JS then reloads the page; the handler detects the session, creates an mfa_pending session for passkey enrollment, and renders the passkey step.

Session flow:

  1. Poll authorized → poll handler creates "user" session + sets hexon cookie
  2. Page reload → handler reads hexon cookie → validates user session
  3. No passkey found → creates mfa_pending session + sets mfa_session_id cookie
  4. Passkey begin/finish use mfa_session_id cookie for authorization
  5. On passkey success → JS redirects to /profile

Troubleshooting

Common issues and diagnostic steps:

Onboarding page shows email step despite being logged in:

  - Verify session exists: 'sessions list --user=<username>'
  - Check session type is "user" with auth_status "authenticated"
  - Check cookie: session cookie name must match config (default: hexon)
  - PoW interference: if PoW cookie expired, user may be redirected to challenge first

Passkey step not appearing after magic link click:

  - Check magic link poll response: should return status "authorized"
  - Verify the poll handler created the session: 'sessions list --user=<username>'
  - JS reloads page after authorized — check for network/redirect issues
  - Server log should show "Onboarding: authenticated user entering passkey enrollment"

Passkey registration failing:

  - Check mfa_session_id cookie exists and session is valid
  - Session TTL: mfa_pending session defaults to 5 minutes (session_mfa_pending config)
  - WebAuthn RP ID must match hostname
  - Browser must support PublicKeyCredential API (HTTPS required)
  - Server logs: look for "Begin registration request" and "FinishRegistration failed"

PoW challenge blocking onboarding:

  - Normal behavior for first-time visitors without PoW session cookie
  - Authenticated users skip PoW (middleware checks application session)
  - PoW session TTL: default 30 minutes (pow_session_ttl config)

Page redirect loop or landing on / after magic link:

  - return_url must be HMAC-sealed (handler passes sealed URL to template data)
  - Unsealed URLs fall back to "/"
  - Check that sealed_return_url is present in onboarding-data JSON
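
To make the sealed-URL behavior concrete, here is a minimal Go sketch of HMAC sealing with a "/" fallback. The wire format (base64url payload, a dot, base64url HMAC-SHA256 tag) and the names sealURL and openURL are assumptions for illustration; the actual sealing format used by Hexon is not documented here.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
	"strings"
)

// sealURL returns "<base64url(url)>.<base64url(HMAC-SHA256(key, url))>".
// The format is hypothetical.
func sealURL(key []byte, url string) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(url))
	enc := base64.RawURLEncoding
	return enc.EncodeToString([]byte(url)) + "." + enc.EncodeToString(mac.Sum(nil))
}

// openURL verifies the tag and returns the URL, or "/" if anything is wrong.
func openURL(key []byte, sealed string) string {
	const fallback = "/"
	payload, tag, ok := strings.Cut(sealed, ".")
	if !ok {
		return fallback // not sealed at all
	}
	enc := base64.RawURLEncoding
	rawURL, err := enc.DecodeString(payload)
	if err != nil {
		return fallback
	}
	gotTag, err := enc.DecodeString(tag)
	if err != nil {
		return fallback
	}
	mac := hmac.New(sha256.New, key)
	mac.Write(rawURL)
	if !hmac.Equal(gotTag, mac.Sum(nil)) { // constant-time comparison
		return fallback
	}
	return string(rawURL)
}

func main() {
	key := []byte("example-key-not-for-production")
	sealed := sealURL(key, "/profile")
	fmt.Println(openURL(key, sealed))     // /profile
	fmt.Println(openURL(key, "/profile")) // /
}
```

An unsealed or tampered value never reaches the redirect, which matches the symptom of landing on / when sealed_return_url is missing.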

Session proliferation on page refresh:

  - Handler reuses existing valid mfa_pending session (checks mfa_session_id cookie first)
  - If mfa_session_id expired, a new session is created on refresh (normal behavior)
  - Old expired sessions are cleaned up by session TTL

Relationships

Module dependencies and interactions:

signin (magiclink): Provides the magic link email flow. POST /api/signin/magiclink
  initiates the flow, POST /api/signin/magiclink/poll checks status. On authorization,
  the poll handler creates the "user" session that onboarding detects.

signup (passkey): Provides WebAuthn enrollment. POST /api/signup/passkey/begin and
  /finish handle the ceremony. Both require a valid mfa_session_id cookie pointing to
  an mfa_pending session with signup_flow="passkey".

sessions: Used for session detection (Validate) and mfa_pending session creation (Create).
  The handler checks the main session cookie for authenticated users, and creates a
  separate mfa_session_id cookie for the passkey enrollment session.

webauthn: Used to check if user already has a passkey. Users with an
  existing passkey are redirected to /profile immediately.

render: Template rendering. Uses the onboarding manifest entry for CSS/JS asset bundling.

locale: i18n translations via template {{t "onb.*"}} function. All UI text comes from
  locale TOML files ([onb] section in 10 language files).

protection (PoW): Global PoW middleware protects the route:
  unauthenticated users solve PoW challenge before seeing the page.

portal: Onboarding route registration is gated by IsPortalEnabled(). Both services
  share the same user-facing domain.

Signin Service

Authentication coordinator with multi-method sign-in, pluggable MFA, magic links, and session management

Overview

The signin service is the central authentication coordinator for Hexon. It orchestrates the complete user sign-in lifecycle across multiple authentication methods and modules, handling primary authentication, multi-factor verification, magic link passwordless flows, and session creation.

Supported primary methods:

  - passwd: LDAP password authentication (bind-based, no local password storage)
  - passkey: WebAuthn/FIDO2 passwordless (hardware keys, biometrics, phishing-resistant)
  - x509: Client certificate authentication (Subject DN to username mapping)
  - oidc: OpenID Connect single sign-on via external identity provider
  - magiclink: Email-based passwordless authentication (BASE-20 tokens, RFC 8628 polling)

Supported MFA methods (pluggable):

  - otp: Email-delivered verification code (via emailotp module)
  - totp: Time-based One-Time Password / authenticator apps (RFC 6238)

Authentication flow stages:

  1. Primary authentication — credential verification against backend (LDAP/WebAuthn/X.509)
  2. MFA challenge (if required) — pre-auth session created, MFA code verified
  3. Session creation — quorum-replicated across cluster, cookie set
  4. Directory sync — fire-and-forget background user data refresh
  5. Redirect — user sent to original destination (return_url)

Magic link flow (cross-device passwordless):

  1. User submits email on /signin/magiclink
  2. Device code created (RFC 8628), BASE-20 token generated (rejection sampling, no modulo bias)
  3. Token-to-device-code mapping stored as SHA-256 hashes (tokens never in cleartext)
  4. Magic link email sent via SMTP (fire-and-forget, anti-enumeration)
  5. Browser polls /api/signin/magiclink/poll every 5 seconds
  6. User clicks link on any device, token validated, device code marked authorized
  7. Next poll detects authorization, session created on polling browser only
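
Steps 2 and 3 can be sketched in Go as follows. This is an illustration under assumptions, not the module's implementation: the 20-letter alphabet is the one suggested for user codes in RFC 8628 section 6.1 and may differ from the alphabet Hexon uses, and newToken and hashToken are hypothetical names.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// base20Alphabet is an assumed alphabet (the consonant set from RFC 8628 section 6.1).
const base20Alphabet = "BCDFGHJKLMNPQRSTVWXZ"

// newToken draws n characters uniformly from the alphabet using rejection sampling.
func newToken(n int) (string, error) {
	// 240 is the largest multiple of 20 that fits in a byte. Random bytes >= 240
	// are discarded so that b % 20 is uniformly distributed (no modulo bias).
	const limit = 240
	out := make([]byte, 0, n)
	buf := make([]byte, 1)
	for len(out) < n {
		if _, err := rand.Read(buf); err != nil {
			return "", err
		}
		if buf[0] >= limit {
			continue // reject and resample
		}
		out = append(out, base20Alphabet[int(buf[0])%len(base20Alphabet)])
	}
	return string(out), nil
}

// hashToken returns the hex SHA-256 digest, so only the hash needs to be stored.
func hashToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:])
}

func main() {
	tok, err := newToken(10)
	if err != nil {
		panic(err)
	}
	fmt.Println(tok, hashToken(tok))
}
```

With the default code_length of 10, a token drawn this way carries 10 × log2(20) ≈ 43 bits of entropy.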

Session security:

  - Session rotation after MFA (new ID prevents session fixation attacks)
  - MFA pending sessions are short-lived (default 5 minutes) and revoked after upgrade
  - Sessions bound to IP address and TLS fingerprint
  - Configurable max concurrent sessions per user (default: 1)
  - Cluster-wide session storage with quorum replication (available on all nodes)

Config

Configuration under [service.signin] in TOML:

[service.signin]
  primary = "passkey"              # Default authentication method shown at /signin
                                   # Options: "passwd", "passkey", "x509", "oidc", "magiclink"
  secondary = ["passwd", "x509"]   # Alternative methods (shown as links on sign-in page)
  require_mfa = ["passwd"]         # Methods that require MFA after primary auth
                                   # Empty list = MFA never required
  mfa_methods = ["otp", "totp"]    # Available MFA methods presented to user
                                   # Order determines default selection

[service.signin.magiclink]
  enabled = true                   # Enable magic link passwordless sign-in
  code_length = 10                 # Token length in BASE-20 characters (range: 6-40, default: 10)
  code_ttl = "10m"                 # Link validity duration (default: 10 minutes)
  rate_limit = "5/1m"              # Per-IP rate limit on magic link requests
  rate_limit_email = "3/10m"       # Per-email rate limit (anti-flooding protection)

Session configuration (under [service.signin] or related session config):

  session_ttl                      # Authenticated session lifetime
  session_password_expired         # Session TTL for expired password flow
  session_mfa_pending              # Pre-auth session TTL (default: 5 minutes)
  max_concurrent_sessions = 1      # Max active sessions per user (default: 1)

Password policy (enforced during passwd authentication):

  - Strength validation via zxcvbn algorithm (configurable score 0-4)
  - Character requirements: uppercase, lowercase, digits, special characters
  - Minimum length and entropy requirements (all configurable via TOML)
  - Password expiry enforcement with dedicated session type

MFA settings:

  max_retries = 5                  # Maximum MFA verification attempts before lockout

Hot-reloadable: primary method, secondary methods, require_mfa list, mfa_methods, magiclink settings, session TTLs, password policy, rate limits. Cold (restart required): service.signin.enabled.

Endpoints

UI endpoints (serve HTML pages):

  GET  /signin                     Redirect to primary authentication method
  GET  /signin/passwd              LDAP password sign-in page
  GET  /signin/passkey             WebAuthn passkey sign-in page
  GET  /signin/x509                X.509 certificate sign-in page
  GET  /signin/magiclink           Magic link email form
  GET  /signin/magiclink/verify    Magic link verification (clicked from email)
  GET  /signin/mfa                 MFA verification page (OTP or TOTP)

API endpoints (JSON/form):

  POST /api/signin                 Authenticate with credentials
                                   Body: {"method", "username", "password", "remember_me"}
                                   Returns: success with session_token, or requires_mfa with
                                   pre-auth session and available mfa_methods

  POST /api/signin/magiclink       Submit magic link request
                                   Body: email, return_url, auth_flow (form-encoded)
                                   Returns: device_code and expires_in for polling
                                   Rate limited: per-IP (5/1m) and per-email (3/10m)

  POST /api/signin/magiclink/poll  Poll magic link authorization status
                                   Body: device_code (form-encoded)
                                   Returns: {"status":"pending"} or {"status":"authorized","redirect":"..."}

  POST /api/signin/mfa             Verify MFA code
                                   Body: {"method", "code", "session_id" (HMAC-sealed), "trust_device"}
                                   Returns: success with redirect (session_id not exposed in response)

  POST /api/signin/mfa/resend      Resend OTP code (email OTP only)

X.509 over HTTP/3 note: QUIC does not support TLS renegotiation. If a user attempts X.509 auth over HTTP/3 without a client certificate, the server responds with Alt-Svc: clear and a 307 redirect to force retry over HTTP/2, which properly prompts for client certificate selection.

Troubleshooting

Common symptoms and diagnostic steps:

Authentication failures (generic “Invalid username or password”):

  - LDAP backend unreachable: 'auth ldap' to check connection health
  - Account locked in LDAP (nsAccountLock attribute): 'directory user <username>'
  - User not found in directory: 'directory user <username>' to verify existence
  - Incorrect bind DN or password: check LDAP module configuration
  - Start with: 'diagnose user <username>' for cross-subsystem check

MFA verification failing:

  - TOTP clock drift: user device time must be within 30-second window
  - OTP expired: default validity window is short, check 'auth otp'
  - Email OTP not delivered: 'smtp health' to verify SMTP service
  - Rate limited (429): max_retries exceeded, check 'metrics ratelimit'
  - Session expired: MFA pending session has 5-minute TTL by default
  - Check MFA session: 'sessions list --user=<username>' for pre-auth sessions

Magic link issues:

  - Email not received: 'smtp health' and 'notify health' to verify delivery path
  - Anti-enumeration: same response whether email exists or not (by design)
  - Token expired: default code_ttl is 10 minutes, check timing
  - Rate limited: per-IP (5/1m) or per-email (3/10m), check 'metrics ratelimit'
  - Poll returns "pending" indefinitely: verify SMTP delivery, check device code
    status via 'auth devicecodes'
  - "Link already used" error: tokens are single-use, mapping deleted after verify

Session creation failures:

  - Cluster quorum not met: 'cluster status' to verify quorum health
  - Session replication timeout: check cluster health for latency
  - Max concurrent sessions reached: 'sessions list --user=<username>'
  - Cookie not set: verify service hostname matches cookie domain
  - Session bound to wrong IP: check proxy/load balancer X-Forwarded-For headers

WebAuthn/passkey errors:

  - No passkey registered: 'webauthn list <username>' to check enrollments
  - Browser not supporting WebAuthn: requires HTTPS and a supported browser
  - Relying party ID mismatch: hostname must match RP ID in WebAuthn config
  - Challenge expired: WebAuthn challenges are cached temporarily

X.509 certificate sign-in issues:

  - Certificate not requested by browser: check TLS configuration
  - HTTP/3 fallback: Alt-Svc: clear redirect expected for QUIC connections
  - Certificate chain validation failure: check CA bundle configuration
  - Subject DN mapping: verify DN-to-username mapping rules
  - Check: 'certs x509 list' for registered client certificates

Password policy rejections:

  - zxcvbn score too low: user password not meeting strength requirements
  - Missing character classes: check uppercase/lowercase/digit/special requirements
  - Password expired: user gets dedicated session type, must change password
  - Check policy: 'config show service.signin' for password policy settings

Redirect loops after sign-in:

  - return_url invalid or pointing to sign-in page itself
  - Session cookie domain mismatch: verify service.hostname configuration
  - OIDC callback failure: check oidc_providers configuration
  - Check: 'sessions list --user=<username>' and 'auth status'

Relationships

Module dependencies and interactions:

authentication.ldap: Primary backend for passwd method. LDAP bind authentication
  with connection pooling. Reports account lock status (nsAccountLock). Password
  policy enforcement (strength, expiry, character requirements).

authentication.webauthn: Primary backend for passkey method. WebAuthn/FIDO2
  credential storage and verification. Hardware key and biometric support.

authentication.x509: Primary backend for X.509 certificate method. Certificate
  chain validation, Subject DN to username mapping, revocation checking.

authentication.oidc: Backend for OIDC single sign-on method. Redirects to
  external identity provider for authentication.

authentication.magiclink: Magic link token generation, email composition.
  Uses BASE-20 encoding with rejection sampling for unbiased token generation.

authentication.devicecode: RFC 8628 device code flow. Provides polling
  infrastructure and expiration for magic link authorization tracking.

authentication.otp: Email OTP generation and verification for MFA. Delivers
  codes via emailotp module with device fingerprinting.

authentication.totp: TOTP verification for MFA. Validates RFC 6238 codes
  from authenticator apps (Google Authenticator, Authy, etc.).

sessions: Cluster-wide session management with quorum replication.
  Creates authenticated sessions, MFA pending sessions, and password-expired
  sessions. Session rotation after MFA completion.

directory: User data synchronization after authentication (fire-and-forget).
  Provides user lookup by email (magic link), group membership, account status.
  Fresh data sync ensures up-to-date authorization after sign-in.

smtp: Email delivery for magic link messages and OTP codes. Fire-and-forget
  delivery ensures consistent response timing (anti-enumeration).

signout: Companion service for session termination and logout flows.

onboarding: Uses magic link flow for email verification, then transitions to
  passkey enrollment. The onboarding SPA calls /api/signin/magiclink and
  /api/signin/magiclink/poll directly via fetch(). After authorization, the
  poll handler creates a "user" session which onboarding detects on page reload.

passwordchange: Handles password change flows when password-expired session
  is active. Redirects back to sign-in after successful change.

firewall: Network-level access rules applied before sign-in endpoints.

protection: Rate limiting (fingerprint-based) on all sign-in endpoints.
  Prevents brute force attacks on credentials and MFA codes.