Load Balancer

Load Balancer

Multi-algorithm load balancing with health checks, circuit breakers, and outlier detection

Overview

Cluster-coordinated load balancer module providing enterprise-grade backend pool management. Combines features that typically require multiple tools (Envoy for advanced LB, Consul for health checks, custom code for circuit breaking) into a single module with TOML configuration.

Key differentiators:

  • Cluster-native state: all nodes share circuit breaker, health, and connection state
via memory module. One node's health check benefits all nodes. Circuit trips propagate
instantly across the cluster.
  • HTTP/3 backend support: full QUIC support for backend connections with HTTP/3 health
checks. Per-protocol circuit breaking allows HTTP/3 to fail independently while
HTTP/2 continues working (e.g., QUIC blocked by firewall).
  • Expression-based circuit breaker trip conditions: custom logic like
"error_ratio > 0.1 && p99_latency > 2000" instead of rigid thresholds.
  • JA4/JA4H fingerprint routing: session affinity based on TLS fingerprint, no cookies
required. Works for any protocol, including non-HTTP.
  • 7 health check types: TCP, HTTP, HTTP/3, gRPC, MySQL, PostgreSQL, Redis with native
protocol checks (not just TCP port probes).
  • 7 load balancing algorithms: adaptive (epsilon-greedy, default), round_robin,
weighted (EDF scheduling), least_conn (Power of Two Choices), hash (xxhash),
maglev (Google consistent hashing), random.
  • Outlier detection with three independent detection mechanisms: consecutive failures,
success rate analysis (statistical), and failure percentage (threshold-based).
  • DNS-based service discovery with automatic backend add/remove on DNS changes.
  • Token bucket rate limiting with per-pool or per-user limits, distributed across cluster.

All state is shared cluster-wide. Reads are served from local in-memory cache for near-instant response. Writes are replicated to all nodes with eventual consistency.

Config

Configuration is primarily done through [[proxy.mapping]] sub-tables. The load balancer does not have its own top-level TOML section; pools are created and managed automatically by the proxy service when mappings have multiple backends.

[[proxy.mapping]]

service = ["http://be1:8080", "http://be2:8080"] # Array triggers LB pool creation
lb_strategy = "adaptive" # Algorithm: adaptive, round_robin, weighted, least_conn,
# hash, maglev, random (default: adaptive)
lb_weights = [5, 3] # Backend weights for "weighted" strategy; length must match service array
lb_hash_key = "cookie:session_id" # Hash key for "hash"/"maglev" strategies
# Options: cookie:<name>, ja4, ja4h, ip, header:<name>
enable_http3 = false # Enable HTTP/3 (QUIC) backend connections
protocol_preference = "prefer_http3" # Protocol preference: prefer_http3, prefer_http2
dns_discovery = false # Enable DNS-based service discovery
dns_refresh = "30s" # DNS refresh interval

Default health check (active by default, even without this section). Applies to ALL mappings that don't define an explicit [proxy.mapping.health_check]. Any non-5xx response counts as healthy; only connection errors, timeouts, and 5xx responses mark a backend unhealthy. The health check path is derived from each mapping's route path.

[proxy.default_health_check]

enabled = true # Set false to disable default health checks
type = "http" # Check type: tcp, http, http3, grpc
method = "GET" # HTTP method (default: GET)
interval = "15s" # Check interval (default: 15s)
timeout = "5s" # Check timeout (default: 5s)
unhealthy_threshold = 3 # Consecutive failures to mark unhealthy (default: 3)
healthy_threshold = 2 # Consecutive successes to mark healthy (default: 2)

Per-mapping health check (overrides default_health_check for this mapping)

[proxy.mapping.health_check]

enabled = true # Enable health checking (default: true)
type = "http" # Check type: tcp, http, http3, grpc
path = "/health" # HTTP/HTTP3 health check path
method = "GET" # HTTP method (default: GET)
expected_status = [200] # Expected HTTP status codes (empty = any non-5xx is healthy)
interval = "10s" # Check interval (default: 10s)
timeout = "5s" # Check timeout (default: 5s)
unhealthy_threshold = 3 # Consecutive failures to mark unhealthy (default: 3)
healthy_threshold = 2 # Consecutive successes to mark healthy (default: 2)
tls_skip_verify = false # Skip TLS certificate verification
grpc_service = "" # gRPC service name for grpc.health.v1.Health/Check

[proxy.mapping.circuit_breaker]

enabled = true # Enable circuit breaker (default: false)
error_ratio_threshold = 0.5 # Max error ratio 0.0-1.0 (threshold mode)
latency_p95_threshold = "1s" # Max P95 latency (threshold mode)
network_error_threshold = 0.3 # Max network error ratio (threshold mode)
combine_mode = "or" # Threshold combine: "or" (any) or "and" (all)
trip_expression = "" # Expression mode (overrides thresholds when set)
# Variables: error_ratio, success_ratio, p50_latency,
# p95_latency, p99_latency, avg_latency (all ms),
# network_error_ratio, timeout_ratio, request_count
min_samples = 10 # Minimum requests before evaluation (default: 10)
error_window = "60s" # Sliding window for error tracking (default: 60s)
fallback_duration = "30s" # Duration circuit stays open (default: 30s)
success_threshold = 3 # Successes in half-open to close (default: 3)
fallback_mode = "error" # Open behavior: "error" (503) or "pool" (fallback pool)
fallback_pool_id = "" # Pool ID for fallback_mode="pool"
response_code = 503 # HTTP status when fallback_mode="error" (default: 503)
per_protocol = false # Track circuits per protocol (http3/http2/http1)
fallback_protocol = "http2" # Protocol when per-protocol circuit opens

[proxy.mapping.outlier_detection]

enabled = true # Enable outlier detection (default: false)
consecutive_5xx = 5 # Eject after N consecutive 5xx (default: 5)
consecutive_gateway_failure = 5 # Eject after N consecutive 502/503/504 (default: 5)
consecutive_local_failure = 5 # Eject after N consecutive connection errors (default: 5)
success_rate_enabled = true # Enable statistical success rate analysis
success_rate_min_hosts = 5 # Min hosts for success rate calculation
success_rate_min_requests = 100 # Min requests per host for success rate
success_rate_stdev_factor = 1.9 # Standard deviation factor for outlier threshold
success_rate_enforcing_pct = 100 # Percentage of detected outliers actually ejected
failure_percentage_enabled = true # Enable failure percentage threshold
failure_percentage_threshold = 85 # Failure % above which backend is ejected
failure_percentage_min_hosts = 5 # Min hosts for failure percentage
failure_percentage_min_reqs = 50 # Min requests for failure percentage
interval = "10s" # Detection sweep interval (default: 10s)
base_ejection_time = "30s" # Initial ejection duration (default: 30s)
max_ejection_time = "300s" # Maximum ejection duration (default: 300s)
max_ejection_percent = 10 # Max % of backends ejected at once (default: 10)
ejection_jitter_pct = 10 # Random jitter on re-admission time (default: 10)

[proxy.mapping.rate_limit]

enabled = false # Enable per-pool rate limiting
requests_per_window = 100 # Token bucket capacity
window = "1m" # Rate limit window duration
burst = 0 # Optional burst allowance (0 = no burst)
per_user = false # Per-user limits (vs per-pool)

DNS discovery modes (dns_mode field in programmatic API):

"internal" (default): Use Hexon DNS module with DNSSEC, caching, cluster config
"system": Use system resolver directly (no DNSSEC)
"custom": Use custom resolvers directly (requires dns_resolvers list)

Maglev table size default: 65537 (prime number). Configurable via programmatic API.

Troubleshooting

Common symptoms and diagnostic steps:

Backend never selected (no healthy backends):

- Check health status: 'proxy health' or 'proxy health <pool-id>'
- Verify backend is reachable: 'dns test <backend-hostname>'
- Wrong health check type: TCP check passes but HTTP check fails (wrong path/status)
- gRPC health check requires grpc.health.v1.Health service on backend
- HTTP/3 health check failing: QUIC/UDP may be blocked by firewall
- Database checks (mysql/postgresql/redis): verify protocol handshake, not auth
- Threshold too strict: lower unhealthy_threshold or increase interval

Circuit breaker tripped unexpectedly:

- Check circuit state: 'proxy circuits'
- Review trip expression: expression variables are in milliseconds for latency
- min_samples too low: brief error bursts trip the circuit prematurely
- error_window too short: transient errors accumulate faster
- gRPC pitfall: HTTP status is always 200; circuit uses gRPC status codes
(codes 4,8,13,14,15 = server error). Check grpc-status trailers.
- Per-protocol circuit: HTTP/3 circuit may be open while HTTP/2 works fine;
check per-protocol states via 'proxy circuits'
- Manual reset: 'proxy reset <pool-id> <backend-id>'

Uneven traffic distribution:

- Weighted strategy: verify lb_weights array matches service array length
- Least-connections: requires ConnectionOpened/ConnectionClosed tracking;
check 'proxy backends <pool-id>' for connection counts
- Hash/Maglev: same hash key always routes to same backend (by design);
verify lb_hash_key is sufficiently varied across requests
- Maglev table imbalance: small backend count can cause uneven distribution;
increase table size or add more backends

All backends ejected (outlier detection too aggressive):

- Check outlier state: 'proxy outliers' or 'proxy outliers <pool-id>'
- max_ejection_percent too high: keep it in the 10-33% range so some backends always stay in rotation
- consecutive_5xx threshold too low: increase from 5 to 10+ for noisy backends
- success_rate_stdev_factor too low: increase from 1.9 to 2.5+ for high variance
- Manual re-admit: 'proxy uneject <pool-id> <backend-id>'
- Ejection backoff: duration doubles each re-ejection (base * 2^count),
capped at max_ejection_time

DNS discovery not updating backends:

- Check discovery state: 'proxy pools <pool-id>' shows discovery config
- DNS resolution failing: 'dns test <service-hostname>'
- DNSSEC validation failing on unsigned zone: use dns_mode="system"
- Custom resolvers unreachable: check dns_resolvers list
- Exponential backoff active: after DNS failures, refresh backs off up to 5 minutes
- Force immediate refresh via programmatic API (RefreshDiscovery operation)

Connection pool exhaustion (high latency, timeouts):

- Check pool stats: 'connpool stats' and 'connpool pools'
- Backend slow to respond: connections pile up, least-conn skews
- Circuit breaker not tripping: error_window may be too long to catch slow failures

Rate limit hitting unexpectedly:

- Per-user vs per-pool: per-pool limit shared across all users
- Cluster-wide counting: nodes may have slight count drift due to replication lag
- Burst exhausted: burst tokens consumed, waiting for window refill
- Check: 'proxy traffic <app-name>' for request rate metrics

Cross-subsystem diagnostics:

- Full domain diagnostic: 'diagnose domain <hostname>'
- Full user access diagnostic: 'diagnose user <username>'

Architecture

Cluster state management and design:

State categories and retention:

Pool configurations: retained for 24h, refreshed on update
Backend health states: retained for 30s, updated by health checker
Circuit breaker states: retained for 24h
Outlier detection states: retained for 24h per backend
Connection counts: retained for 60s per backend
Rate limit counters: retained for window duration + 1 minute
Backend metrics: retained for circuit breaker decision window

Read/write model:

Reads (e.g., backend selection, health status, circuit state) are served
from local in-memory cache for near-instant response.
Writes (e.g., pool creation, health updates, circuit trips) are replicated
to all nodes with eventual consistency.

Load balancing algorithm details:

adaptive: Epsilon-greedy selection that learns from request outcomes. Starts with
round-robin (learning phase, first 50 requests), then exploits the best-scoring
backend 95% of the time and explores randomly 5%. Scores combine success rate,
latency (EMA), timeout rate, and consecutive failures with decay. Default strategy.
round_robin: Sequential rotation across backends. O(1) per selection.
weighted: Earliest Deadline First (EDF) scheduling. Higher weight = smaller deadline
increment = selected more often. Smoother than GCD-based approaches (interleaved,
not batched). O(log n) per selection.
least_conn: Power of Two Choices (P2C) — picks 2 random backends, selects the one
with fewer connections. O(1) complexity with near-optimal distribution. Weighted:
effective_connections = actual_connections * 1000 / weight.
hash: Consistent hashing with xxhash. Same key always routes to same backend unless
pool membership changes. O(1) per selection.
maglev: Google Maglev consistent hashing. Table size default 65537 (prime). Better
distribution than ring hash with minimal disruption on backend changes. O(1) lookup
after O(n * table_size) table build.
random: Uniform random selection. O(1) per selection.

Circuit breaker state machine:

closed --> open: Trip conditions met (after min_samples within error_window)
open --> half_open: fallback_duration elapsed
half_open --> closed: success_threshold consecutive successes
half_open --> open: Any failure
Per-protocol mode: independent state machines for http3, http2, http1, tcp.
Automatic fallback chain: http3 -> http2 -> http1 (configurable via fallback_protocol).
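A minimal sketch of that state machine, assuming hypothetical names; the open → half_open transition is driven by an explicit timeout() call here, where the real module uses a fallback_duration timer.

```go
package main

import "fmt"

type state int

const (
	closed state = iota
	open
	halfOpen
)

// breaker models the documented transitions: closed->open on trip,
// open->half_open after fallback_duration, half_open->closed after
// success_threshold consecutive successes, half_open->open on any failure.
type breaker struct {
	st               state
	successes        int
	successThreshold int
}

func (b *breaker) trip() {
	b.st = open
	b.successes = 0
}

// timeout stands in for the fallback_duration timer expiring.
func (b *breaker) timeout() {
	if b.st == open {
		b.st = halfOpen
	}
}

func (b *breaker) record(ok bool) {
	if b.st != halfOpen {
		return
	}
	if !ok {
		b.trip() // any failure in half-open reopens immediately
		return
	}
	b.successes++
	if b.successes >= b.successThreshold {
		b.st = closed
		b.successes = 0
	}
}

func main() {
	b := &breaker{st: closed, successThreshold: 3}
	b.trip()
	b.timeout()
	b.record(true)
	b.record(true)
	b.record(true)
	fmt.Println(b.st == closed) // three half-open successes close the circuit
}
```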

Outlier detection mechanisms:

1. Consecutive failures (immediate): eject after N consecutive 5xx / gateway / local errors
2. Success rate (statistical): eject if success_rate < (cluster_avg - stdev_factor * stdev)
3. Failure percentage (threshold): eject if failure% > threshold
Ejection backoff: base_ejection_time * 2^(ejection_count - 1), capped at max_ejection_time.
Re-admission: automatic after ejection duration expires; counters reset.
Jitter: ejection_jitter_pct random jitter on re-admission to prevent thundering herd.
Safety: max_ejection_percent prevents ejecting all backends simultaneously.

Health check architecture:

Each node independently runs health checks for its configured pools.
Results are replicated to all nodes -- the most recent check from any node is used.
Threshold logic: unhealthy_threshold consecutive failures to mark down;
healthy_threshold consecutive successes to mark up.
HTTP/3 checks: quic-go with connection pooling and idle connection cleanup.
gRPC checks: native grpc.health.v1.Health/Check RPC with connection pooling.
Database checks: protocol handshake only (MySQL initial packet, PostgreSQL SSLRequest,
Redis PING/PONG). No authentication or query execution.
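The threshold hysteresis can be sketched as follows (hypothetical names; the real checker also replicates results cluster-wide).

```go
package main

import "fmt"

// checker applies the documented hysteresis: unhealthy_threshold consecutive
// failures mark a backend down, healthy_threshold consecutive successes mark
// it up. A single opposite result resets the running streak.
type checker struct {
	healthy            bool
	fails, oks         int
	unhealthyThreshold int
	healthyThreshold   int
}

func (c *checker) observe(ok bool) {
	if ok {
		c.fails = 0 // a success resets the failure streak
		c.oks++
		if !c.healthy && c.oks >= c.healthyThreshold {
			c.healthy = true
		}
	} else {
		c.oks = 0 // a failure resets the success streak
		c.fails++
		if c.healthy && c.fails >= c.unhealthyThreshold {
			c.healthy = false
		}
	}
}

func main() {
	c := &checker{healthy: true, unhealthyThreshold: 3, healthyThreshold: 2}
	// Two failures, one success (streak resets), then three consecutive
	// failures: only the final run of three marks the backend down.
	for _, ok := range []bool{false, false, true, false, false, false} {
		c.observe(ok)
	}
	fmt.Println(c.healthy)
}
```

The streak reset is why the Troubleshooting advice suggests raising unhealthy_threshold for flappy backends: intermittent successes keep resetting the counter.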

DNS discovery lifecycle:

1. Periodically resolves hostname to IP addresses (refresh interval)
2. New IPs: automatically added as backends to pool
3. Removed IPs: automatically removed from pool
4. DNS failure: exponential backoff up to 5 minutes
5. Modes: internal (Hexon DNS with DNSSEC), system (OS resolver), custom (direct resolvers)

Interpreting tool output:

'proxy pools':
Healthy: All pools show ActiveBackends = TotalBackends, Strategy listed
Degraded: ActiveBackends < TotalBackends — some backends ejected or unhealthy
Action: Degraded → 'proxy health' for per-backend status, 'proxy outliers' for ejections
'proxy health' (per-pool):
Healthy: All backends Status=healthy, consecutive failures=0
Unhealthy: Status=unhealthy with failure reason (connection_refused, timeout, http_error)
Action: Unhealthy → check backend directly with 'net tcp <host:port>' or 'net http <url>'
'proxy circuits':
Closed: Normal operation — requests flowing to backend
Half-open: Testing recovery — limited requests allowed through, do NOT reset manually
Open: Tripped — backend failing, shows TripCondition and ErrorRate
Action: Open → fix backend root cause first, then 'proxy reset <pool> <backend>' to clear
'proxy outliers':
Normal: No ejected backends
Ejected: Backend removed from rotation — shows EjectionTime and FailureRate
Action: Fix backend, then 'proxy uneject <pool> <backend>' to re-admit

Relationships

Module dependencies and interactions:

  • proxy: Primary consumer. Creates and manages LB pools automatically when
[[proxy.mapping]] has multiple backends in the service array. Selects backends on
every request, tracks connections, records results for circuit breaker and outlier
detection. All LB configuration flows through proxy mapping sub-tables.
  • distributed cache: State storage backend. All pool configs, health states, circuit
breaker states, outlier states, connection counts, rate limit counters, and backend
stats are stored in the cluster-wide cache with appropriate TTLs.
  • dns: Backend hostname resolution for DNS-based service discovery. Supports three
modes: internal (Hexon DNS module with DNSSEC), system (OS resolver), custom
(direct resolvers). Per-pool DNS configuration.
  • sessions: Session affinity for hash-based algorithms. Cookie-based hash keys read
session cookies. JA4/JA4H fingerprint routing uses TLS fingerprint from session.
  • connection_pool: Backend HTTP connection management. Circuit breaker integration
prevents new connections to tripped backends. Connection counts feed least_conn
algorithm decisions.
  • certificates: TLS for backend connections. Health checks honor TLS configuration
(tls_skip_verify). HTTP/3 health checks require valid QUIC/TLS setup.

Reverse Proxy

HTTP reverse proxy with load balancing, circuit breaking, and OIDC SSO

Overview

The proxy service acts as an authenticating reverse proxy gateway for backend applications. It provides:

  • Host-based and path-based routing with 3-tier hybrid matcher (exact → prefix → regex)
  • Per-route OIDC SSO authentication with cross-domain cookie support
  • Group-based authorization (OR semantics — user needs any one listed group)
  • Identity header injection (X-Hexon-User, X-Hexon-Mail, X-Hexon-Name, X-Hexon-Groups)
  • Ed25519 header signing and optional full request signing for backend verification
  • Response header URL rewriting (Link, Content-Location, Refresh) and HTML body rewriting
  • JavaScript interceptor injection for dynamic URL rewrites (fetch, XHR, window.open)
  • Logout toolbar injection for authenticated routes (draggable, shows user + app name)
  • WebSocket and gRPC support, HTTP/3 (QUIC) backend connections
  • Zero-copy streaming mode for API routes (rewrite_host=false, saves 8-15ms, 4x throughput)
  • Brotli compression support (decompress, rewrite URLs, re-compress)
  • Per-mapping mTLS, CIDR subnet restriction, HTTP method filtering
  • Multi-backend load balancing (round-robin, weighted, least-conn, consistent hash, Maglev)
  • Circuit breakers with expression-based trip conditions and outlier detection
  • Native health checks (TCP, HTTP, HTTP/3, gRPC)
  • Hot-reload of routes, backends, auth rules without restart (atomic, +2ns overhead)
  • Landing page listing all accessible apps filtered by user groups (folder/tag grouping)
  • PROXY protocol v1/v2 support for preserving client IP through L4 load balancers
  • Per-mapping protection overrides (rate limit, size limit bypass)
  • Custom response headers with three-state logic (set/strip/inherit)
  • Cookie domain rewriting for SSO across subdomains (RFC 6265 compliant)
  • Request shadowing/mirroring for canary testing (async fire-and-forget, per-route sampling)
  • JWT Bearer token verification cache (sessions module, SHA256 of token as session ID, configurable TTL)
  • Personal Access Token (PAT) authentication via Bearer header with session validation and IP enforcement

Config

Core configuration under [proxy] and [[proxy.mapping]]:

[proxy]

enabled = true # Enable reverse proxy service
hostname = "apps.hexon.es" # Landing page hostname (optional; if unset, serves on service.hostname)
signing_enabled = true # Ed25519 header signing (default: true if cluster_key set)
signing_rotation = "15m" # Key rotation interval (HKDF-SHA256 from cluster_key)
brotli_support = true # Decompress/reencode Brotli for URL rewriting
group_refresh_interval = "15m" # Background session group membership refresh (0 to disable)
bearer_cache_ttl = "5m" # JWT Bearer token verification cache TTL (0 to disable)
gzip = true # Enable gzip compression
headers = {} # Global response header overrides (three-state: value/"-"/empty)
header_user = "X-Hexon-User" # Identity header name overrides (rarely changed)
header_mail = "X-Hexon-Mail"
header_name = "X-Hexon-Name"
header_groups = "X-Hexon-Groups"

[[proxy.mapping]]

app = "Name" # Display name (shown in landing page and toolbar)
host = "app.hexon.es" # Hostname for routing (SNI matching)
path = "^/.*" # Path regex (auto-classified into matching tiers)
service = "https://backend:8080" # Backend URL(s) — string or array for load balancing
auth = true # Require authentication
groups = ["users"] # Authorized groups (OR logic, empty = any authenticated user)
add_auth_headers = true # Inject X-Hexon-* identity headers
add_bearer = true # Inject signed JWT Bearer token to backend (SSO via OIDC)
allow_upgrade = true # WebSocket upgrade support
rewrite_host = true # HTML URL rewriting (default: true; false = zero-copy mode)
inject_toolbar = true # Logout toolbar (default: true when auth=true)
rewrite_hosts = [["backend.com","app.hexon.es"]] # Multi-domain URL mapping pairs
priority = 500 # Route priority (auto 0-1000, manual >1000 overrides)
cert = "/path/to/cert.pem" # Per-mapping TLS certificate
key = "/path/to/key.pem" # Per-mapping TLS private key
tls_check = true # Verify backend TLS certificate
mtls = false # Require client certificate (default: false)
allowed_subnets = ["10.0.0.0/8"] # CIDR subnet restriction (OR logic, 403 if no match)
allowed_methods = ["GET","POST"] # HTTP method filter (empty = all allowed, 405 if no match)
audience = "custom" # Custom audience for header signing (default: service URL)
sign_request = false # Full request signing (method, path, query, body)
sign_request_max_body = "10MB" # Max body size to hash (default 10MB; "0" = skip body hash)
bearer_cache_ttl = "5m" # Per-mapping JWT cache TTL override (inherits from global)
oidc_providers = ["internal"] # OIDC provider(s) for authentication
dnssec = true # Per-route DNSSEC override
dns_resolvers = ["10.0.0.1:53"] # Per-route DNS resolver override
brotli_support = true # Per-route Brotli override (falls back to global)
permissions_policy = "..." # Browser Permissions-Policy header
referrer_policy = "..." # Browser Referrer-Policy header
csp_header = "..." # Content-Security-Policy header
headers = {} # Per-route response headers (completely replaces global)
disable_rate_limit = false # Bypass rate limiting for this route
rate_limit = "200/1m" # Custom rate limit
disable_size_limit = false # Bypass size limiting
max_bytes = "100MB" # Custom max body size
forward_request_headers = false # Forward Authorization header to backend
forward_response_headers = false # Forward WWW-Authenticate header from backend
folder = "Category" # Landing page folder grouping
tags = ["tag1"] # Landing page tags for filtering
display = true # Show in portal/access list (default: true, set false for API-only)

DNS: Centralized in [dns] section. Proxy uses DNS module by default ([proxy.dns] use_cluster=true). Per-route overrides via dnssec and dns_resolvers fields for backends with special DNS needs (e.g., internal backends with internal DNS, or backends in unsigned DNS zones).

Load balancing: service can be an array of URLs. Configure algorithm and weights via lb_strategy and lb_weights fields. Default strategy is adaptive (epsilon-greedy).

Circuit breaker: [proxy.mapping.circuit_breaker] with trip expression and recovery settings.

Health checks: all mappings get HTTP health checks by default (any non-5xx = healthy, 15s interval). Override globally via [proxy.default_health_check] or per-mapping via [proxy.mapping.health_check]. 4 active check types: tcp, http, http3, grpc. expected_status is an array (e.g. [200, 302]); empty means any non-5xx response is healthy. Health check path is derived from the mapping's route path when using defaults. Health state is shared cluster-wide; unhealthy backends are removed from rotation until they recover.

Connection pool: [connection_pool.http] for global pool settings (max_connections, adaptive_scaling).

Hot-reloadable: routes, backends, auth rules, paths, rewrite rules, protection overrides, identity header names, per-route DNS, certificates. Cold (restart required): proxy.enabled, global connection pool settings, DNS module config, cache.

Security

Identity Headers (when add_auth_headers=true):

X-Hexon-User: Username
X-Hexon-Mail: Email address
X-Hexon-Name: Full name (display name)
X-Hexon-Groups: Comma-separated group list (e.g. "users,admins,developers")

Groups are fetched fresh from directory on every request (not cached), ensuring immediate enforcement of group changes without re-authentication. Backends can trust these headers for SSO without implementing their own authentication.

Header Signing — Ed25519 (enabled by default when cluster_key is set):

Additional headers injected when signing is enabled:
X-Hexon-Audience: Route audience string (custom audience field, or service URL)
X-Hexon-Timestamp: Unix epoch seconds when signature was created
X-Hexon-Request-Id: Unique request correlation ID for tracing
X-Hexon-Signature: Signature in format: v2.{timestamp}.{base64_ed25519}
Signed payload (pipe-delimited, 7 fields):
{timestamp}|{request_id}|{audience}|{user}|{email}|{name}|{groups}
IMPORTANT: The groups field (last) may itself contain pipe characters.
Backends MUST parse with SplitN(payload, "|", 7) — NOT Split(payload, "|").
Why Ed25519 instead of HMAC:
- HMAC requires sharing the secret key with verifiers, enabling forgery
- Ed25519 distributes only the PUBLIC key — backends can verify but NOT forge
- The private key never leaves the Hexon cluster
Key derivation: HKDF-SHA256 (RFC 5869) from cluster_key with a versioned
domain-specific salt. Rotates every signing_rotation (default 15m).
Current + previous keypair kept in memory for rotation boundary handling.
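For Go backends, the SplitN rule reads as below; `parsePayload` is an illustrative helper, not a published API.

```go
package main

import (
	"fmt"
	"strings"
)

// parsePayload splits the signed payload into its 7 fields. SplitN with a
// limit of 7 leaves any "|" characters inside the trailing groups field
// intact; a plain Split would shatter the group list and break verification.
func parsePayload(payload string) (timestamp, requestID, audience, user, email, name, groups string, err error) {
	parts := strings.SplitN(payload, "|", 7)
	if len(parts) != 7 {
		return "", "", "", "", "", "", "", fmt.Errorf("expected 7 fields, got %d", len(parts))
	}
	return parts[0], parts[1], parts[2], parts[3], parts[4], parts[5], parts[6], nil
}

func main() {
	// A hypothetical payload whose groups field itself contains pipes.
	payload := "1732800000|abc123|https://backend:8080|jdoe|jdoe@example.com|John Doe|users|admins"
	_, _, _, user, _, _, groups, err := parsePayload(payload)
	fmt.Println(err, user, groups) // groups survives as "users|admins"
}
```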
Backend Verification — Option 1: Delegated (simple, recommended for most backends):
POST /.well-known/header-signing.verify
Content-Type: application/json
Request body fields: signature, timestamp, request_id, audience, user, email, name, groups
Example:
{"signature":"v2.1732800000.base64ed25519==","timestamp":1732800000,
"request_id":"abc123","audience":"https://backend:8080",
"user":"jdoe","email":"jdoe@example.com","name":"John Doe","groups":"admin,users"}
Responses:
200 OK: {"valid": true}
401 Unauthorized: {"valid": false, "error": "signature mismatch"}
400 Bad Request: {"valid": false, "error": "missing required field: signature"}
503 Service Unavailable: {"valid": false, "error": "signing not enabled"}
Benefits: zero crypto code in backend, automatic key rotation handling,
works with nginx auth_request directive.
Backend Verification — Option 2: Direct (fast, recommended for high-throughput):
GET /.well-known/header-signing.key?t={timestamp}
Response (200 OK):
{"public_key":"base64_32_byte_key","valid_from":1732800000,"valid_until":1732800900}
Verify Ed25519 signature locally:
1. Parse X-Hexon-Signature: split by "." → [version, timestamp, signature_base64]
2. Verify version is "v2", decode signature (64 bytes)
3. Check X-Hexon-Audience matches expected audience
4. Check timestamp is within 30 seconds of current time
5. Fetch public key from /.well-known/header-signing.key?t={timestamp}
6. Reconstruct payload: timestamp|request_id|audience|user|email|name|groups
7. ed25519.Verify(publicKey, payload, signature) → true/false
Public key is safe to cache (32 bytes, cannot create signatures — only verify).
Clock synchronization requirements:
- NTP required on all nodes (chrony or systemd-timesyncd)
- Clock drift should be <1 second for reliable operation
- Verification allows 30-second tolerance for network delays
- Key rotation windows calculated from Unix epoch

Request Signing — Ed25519 (optional, per-route sign_request=true):

Signs the entire HTTP request for end-to-end integrity verification.
Protects against: method tampering, host header attacks, path manipulation,
query injection, body tampering.
Header: X-Hexon-Request-Signature: v1|{timestamp}|{base64_ed25519}
Signed payload:
REQ|{timestamp}|{method}|{host}|{path}|{query_hash}|{body_hash}
Canonicalization rules:
Path: URL-decoded → dot-segments resolved → slashes collapsed → leading slash ensured
/api/../admin → /admin, /api//users → /api/users, /api/foo%2Fbar → /api/foo/bar
Query: parsed → sorted alphabetically by key → re-encoded with URL escaping
b=2&a=1 → a=1&b=2
Body: SHA256 hash (base64). Bodies over sign_request_max_body → "SKIPPED".
Empty body → hash of empty string (47DEQpj8HBSa...).
Set sign_request_max_body = "0" to always skip body hashing.
Verify via: POST /.well-known/request-signing.verify (same JSON format)
or GET /.well-known/request-signing.key?t={timestamp} (same keypair as header signing).
Header signing vs request signing:
Header signing: covers auth headers only, enabled by default, X-Hexon-Signature
Request signing: covers entire request, opt-in per route, X-Hexon-Request-Signature
Use both on sensitive routes (e.g., payment gateways) for maximum security.
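The canonicalization rules map closely onto Go's standard library; this is a sketch, not the gateway's exact code, and it reproduces the worked examples above.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
)

// canonicalPath applies the documented path rules: URL-decode, resolve
// dot-segments, collapse duplicate slashes, ensure a leading slash.
func canonicalPath(p string) string {
	decoded, err := url.PathUnescape(p)
	if err != nil {
		decoded = p // fall back to the raw path on malformed escapes
	}
	// path.Clean resolves ".." segments and collapses "//"; the prepended
	// "/" guarantees the result keeps its leading slash.
	return path.Clean("/" + decoded)
}

// canonicalQuery parses the query, sorts keys alphabetically, re-encodes
// with URL escaping. url.Values.Encode sorts by key.
func canonicalQuery(q string) string {
	v, err := url.ParseQuery(q)
	if err != nil {
		return q
	}
	return v.Encode()
}

func main() {
	fmt.Println(canonicalPath("/api/../admin"))  // /admin
	fmt.Println(canonicalPath("/api//users"))    // /api/users
	fmt.Println(canonicalPath("/api/foo%2Fbar")) // /api/foo/bar
	fmt.Println(canonicalQuery("b=2&a=1"))       // a=1&b=2
}
```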

Bearer Token Injection (when add_bearer=true):

Injects a signed JWT ID token as Authorization: Bearer <token> on proxied requests.
Backend verifies the token via the /oidc/cert JWKS endpoint (standard OIDC discovery).
Token signed with the OIDC provider's signing key — supports threshold ECDSA (TSS/DKG)
or deterministic HKDF-derived keys, auto-swaps transparently.
JWT claims: iss (gateway issuer), sub (username), aud (app name or custom audience),
email, preferred_username, groups, exp, iat. Same structure as regular OIDC ID tokens.
Audience defaults to the mapping's app name. Override via audience = "custom" field.
Tokens are cached in the session module (type "proxy_bearer") with deterministic IDs
derived from the user session and audience, distributed across all cluster nodes. With threshold
signing, only one node performs the signing ceremony per user:audience pair. Cached JWTs
are AES-256-GCM encrypted at rest using a cluster key derivative. Refreshed at 80% of TTL.
Pre-minting: optimizes ECDSA signing time on first request by minting during OIDC callback.
Existing Bearer tokens are NOT overwritten — if the request already carries a Bearer
(e.g., kubelogin passthrough), the injection is skipped. This allows mixed usage:
M2M clients with their own tokens and browser users with injected tokens on the same route.
Use case: backends like ArgoCD, Rancher, Grafana, Kubernetes API trust the gateway's
OIDC issuer and get SSO for free — no redirect flow, no separate auth integration.

PAT Bearer Authentication:

Personal Access Tokens work as standard Bearer tokens for proxy access.
Flow: Authorization: Bearer <PAT-JWT> → middleware detects opaque miss →
JWT verify → PAT detection (jti non-empty) → PAT session validation → bearer auth.
Session check on every request ensures instant revocation (~5-50µs local KV).
IP restriction (allowed_ips) enforced from session metadata (exact IP + CIDR).
Last-used tracking: fire-and-forget metadata update preserves fixed PAT expiry.
Cache (bearer_cache session, SHA256 key): stores is_pat, jti, allowed_ips metadata.
Cache hits skip JWT verify but always re-validate session (revocation gate).
Stale cache entries auto-deleted when revocation detected.
Groups from JWT claims used for per-route group authorization (OR logic).
Two access paths to the same proxy mapping:
Browser: PoW challenge → OIDC SSO → session cookie → proxy (human-optimized)
Machine: Authorization: Bearer <token> → proxy (machine-optimized)
Bearer resolves at step 1 of the middleware chain — before PoW, before OIDC redirect.
All three token types (opaque access tokens, JWT ID tokens, PATs) bypass PoW and
browser redirects. Provides direct, redirect-free, cookieless proxy access for CI/CD,
monitoring, CLI tools (kubelogin), and service-to-service calls. Same group authorization,
identity headers, and Ed25519 signing apply — only the authentication on-ramp differs.
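The resolution order above can be sketched as follows. This is an illustrative Python sketch, not the module's implementation; `verify_jwt` and `load_session` are hypothetical stand-ins for the Ed25519 verifier and the session KV lookup.

```python
import hashlib

def resolve_bearer(token, cache, verify_jwt, load_session):
    """Illustrative Bearer resolution order (names are hypothetical).

    Cache hits skip the JWT signature check, but the session is re-validated
    on every request so revocation takes effect immediately.
    """
    key = hashlib.sha256(token.encode()).hexdigest()   # bearer_cache key
    entry = cache.get(key)
    if entry is None:
        claims = verify_jwt(token)                     # Ed25519 verify
        if claims is None:
            return None                                # fall through to OIDC
        entry = {"is_pat": bool(claims.get("jti")), "jti": claims.get("jti")}
        cache[key] = entry
    if entry["is_pat"]:
        session = load_session(entry["jti"])           # revocation gate
        if session is None:
            cache.pop(key, None)                       # drop stale cache entry
            return None
        return {"auth": "pat", "jti": entry["jti"]}
    return {"auth": "bearer"}
```

Note that even on the cached path the session lookup runs, which is what makes revocation instant despite caching.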

Mutual TLS (per-mapping):

mtls=false (default): no client certificate requested (no browser popup)
mtls=true: TLS handshake requires valid client certificate (RequireAndVerifyClientCert)
Certificate validated against ACME CA bundle or configured external PKI.
Applied at TLS layer via GetConfigForClient callback during handshake.

Subnet Restriction:

allowed_subnets uses CIDR notation, OR logic (client IP must match at least one).
Uses X-Forwarded-For if present (for CDN/LB scenarios), falls back to direct IP.
Enforced AFTER authentication but BEFORE proxy forwarding (defense-in-depth).
CIDR validated at config load time — startup fails on invalid format.
All violations logged at LevelWarn with app and host labels.
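The OR-logic check with X-Forwarded-For preference can be sketched with Python's standard `ipaddress` module (an assumption for illustration; the module itself is not written in Python):

```python
import ipaddress

def subnet_allowed(client_ip, allowed_subnets, forwarded_for=None):
    """OR logic: the effective client IP must fall in at least one CIDR.

    Prefers the first X-Forwarded-For address when present (CDN/LB in front),
    falling back to the direct connection IP.
    """
    effective = forwarded_for.split(",")[0].strip() if forwarded_for else client_ip
    addr = ipaddress.ip_address(effective)
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowed_subnets)
```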

Cookie Handling:

Set-Cookie domains rewritten from backend domain to proxy domain for SSO.
Cookies intentionally shared across all subdomains (*.hexon.es) for single sign-on.
RFC 6265 compliant: case-insensitive attribute parsing (Domain=/domain=/DOMAIN=).
HttpOnly and Secure flags preserved during rewriting.
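A minimal sketch of the case-insensitive Domain rewrite (RFC 6265 attribute names are case-insensitive), preserving all other attributes; this is illustrative, not the module's code:

```python
import re

# Case-insensitive Domain attribute per RFC 6265 (Domain= / domain= / DOMAIN=).
_DOMAIN_ATTR = re.compile(r"(?i)(;\s*domain=)([^;]+)")

def rewrite_cookie_domain(set_cookie, backend_domain, proxy_domain):
    """Rewrite the Domain attribute from the backend to the proxy domain.

    Other attributes (HttpOnly, Secure, Path, ...) pass through untouched.
    """
    def sub(m):
        if m.group(2).lstrip(".").lower() == backend_domain.lower():
            return m.group(1) + proxy_domain
        return m.group(0)
    return _DOMAIN_ATTR.sub(sub, set_cookie)
```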

Response Header Overrides (three-state logic):

"" (empty/omit): Inherit from backend (pass through unchanged)
"-" (dash): Strip header from response
"value": Override with the specified value
Per-route headers completely replace global [proxy].headers (no merging).
Empty map (headers = {}) disables all header processing for that route.
Forbidden headers (blocked at config validation):
Transfer-Encoding, Content-Length, Connection, Keep-Alive, Upgrade, Proxy-Connection, TE
When both legacy fields (permissions_policy, referrer_policy, csp_header) and headers
map target the same header, the headers map takes precedence.
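A hypothetical mapping showing all three states together (host, path, and service values are placeholders):

```toml
[[proxy.mapping]]
host = "app.example.com"
path = ".*"
service = ["https://backend.internal"]
# "value" overrides, "-" strips, omitted headers pass through unchanged:
headers = { "X-Frame-Options" = "DENY", "Server" = "-" }
```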

Troubleshooting

Common symptoms and diagnostic steps:

502/503 Bad Gateway:

- Backend unreachable: 'proxy health' shows backend down
- Circuit breaker open: 'proxy circuits' shows tripped breakers
- DNS resolution failure: 'dns test <backend-hostname>' to verify
- DNSSEC failure on unsigned zone: set dnssec=false on that route
- All custom resolvers failing: the proxy falls back to system DNS; check 'dns resolvers'
- Start with: 'diagnose domain <hostname>' for cross-subsystem check

Auth redirect loops:

- OIDC callback failing: check oidc_providers configuration
- Cross-domain cookie issue: verify proxy hostname matches cookie domain
- Session group mismatch: group_refresh_interval revoked the session
- Multiple providers: ensure provider selection page renders correctly
- Check: 'sessions list --user=X' and 'auth status'

WebSocket upgrade failures:

- Missing allow_upgrade=true on the mapping
- Backend not responding to Upgrade handshake
- Rate limiting blocking upgrade requests: check disable_rate_limit
- TLS verification failing: check tls_check setting

gRPC errors (circuit breaker tripping unexpectedly):

- gRPC always returns HTTP 200; actual status is in grpc-status trailer
- Circuit breaker uses gRPC-aware status extraction (codes 4,8,13,14,15 = server error)
- Expression variables: grpc_error_rate, grpc_unavailable_rate, grpc_timeout_rate
- Backend must implement grpc.health.v1.Health for native gRPC health checks
- Enable: grpc=true on the mapping, grpc_health_check=true on circuit_breaker config

Slow responses:

- HTML buffering: set rewrite_host=false for API routes (saves 8-15ms, 4x throughput)
- Brotli decompression cost: set brotli_support=false on specific routes
- Backend health degrading: 'proxy backends' for connection stats
- Circuit breaker half-open: 'proxy circuits' for breaker states
- Connection pool exhaustion: 'connpool stats' for pool metrics
- Route matching slow: too many regex routes in Tier 3, convert to prefix patterns
- Enable debug mode for Server-Timing header: shows route/auth/backend_ttfb/tls timing
breakdown in browser DevTools Network tab (do NOT enable in production)

Header signing verification failures:

- Clock skew >1s between proxy and backend: ensure NTP is running on all nodes
- Key rotation window boundary: verification allows 30s tolerance
- Wrong audience: check mapping's audience field matches backend expectation
- Signature format: expect v2.{timestamp}.{base64}, parse with Split(".", 3)
- Groups with pipes: backend must use SplitN("|", 7), not Split("|")
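Backend-side parsing of the two formats above can be sketched in Python. Note the limit mapping: Go's Split(".", 3) yields at most 3 parts, which is Python's split(".", 2); Go's SplitN("|", 7) is Python's split("|", 6). Function names here are illustrative.

```python
import base64, time

def parse_signed_header(value, max_skew=30):
    """Parse the v2.{timestamp}.{base64} signature format (sketch).

    Split into at most 3 parts so the base64 payload may contain dots;
    reject timestamps outside the documented 30s tolerance window.
    """
    version, ts, sig_b64 = value.split(".", 2)      # at most 3 parts
    if version != "v2":
        raise ValueError("unsupported signature version")
    if abs(time.time() - int(ts)) > max_skew:
        raise ValueError("timestamp outside tolerance window")
    return int(ts), base64.b64decode(sig_b64)

def split_groups_payload(payload):
    """At most 7 fields: the last (groups) may itself contain pipes."""
    return payload.split("|", 6)
```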

Request signing verification failures:

- Path canonicalization mismatch: backend must URL-decode, resolve dots, collapse slashes
- Query parameter order: backend must sort alphabetically before hashing
- Body hash "SKIPPED": body exceeded sign_request_max_body, backend must handle this case
- Large file uploads: set sign_request_max_body = "0" to skip body hashing
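The canonicalization rules above can be sketched in Python for backend implementers. This is a sketch under stated assumptions: the exact dot-resolution and hash encoding must match the proxy's actual algorithm, which is not fully specified here.

```python
import hashlib, posixpath, re
from urllib.parse import unquote, parse_qsl, urlencode

def canonicalize_path(path):
    """URL-decode, collapse duplicate slashes, resolve dot segments."""
    decoded = unquote(path)
    collapsed = re.sub(r"/+", "/", decoded)
    resolved = posixpath.normpath(collapsed)
    return resolved if resolved != "." else "/"

def canonicalize_query(query):
    """Sort parameters alphabetically before hashing, as required above."""
    pairs = sorted(parse_qsl(query, keep_blank_values=True))
    return urlencode(pairs)

def body_digest(body, max_body):
    """Bodies above sign_request_max_body hash as the literal "SKIPPED"."""
    if max_body and len(body) > max_body:
        return "SKIPPED"
    return hashlib.sha256(body).hexdigest()
```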

Landing page not showing apps:

- User not in required groups: 'directory user <username>'
- proxy.hostname not configured: landing page serves on service.hostname at /
- Route auth=false: public apps show with PUBLIC badge
- App not visible: check display=true (default) and folder/tags grouping in mapping config

PROXY protocol issues:

- Backend not expecting PROXY protocol: set proxy_protocol=false
- Wrong protocol version: check proxy_protocol_version (v1 text vs v2 binary)

PAT Bearer token not authenticating:

- PAT falls through to session/OIDC: check 'logs search "handlers.bearer"' for PAT validation logs
- "PAT rejected" in logs: session revoked or expired — check 'sessions list --type=pat --user=X'
- "Cached PAT rejected" in logs: stale bearer_cache entry — auto-invalidated, retry should work
- "source IP not allowed" in logs: PAT has allowed_ips restriction — check 'pats show <session_id>'
- PAT works for QUIC but not proxy: ensure Authorization: Bearer header is sent correctly
- Groups not matching route: a PAT carries the groups from its creation time. If the
user's groups have since changed, create a new PAT with the current groups, or use a
route whose required groups the PAT still carries

403 Forbidden on specific routes:

- Subnet restriction: client IP not in allowed_subnets (check 'proxy traffic' metrics)
- HTTP method not allowed (405): check allowed_methods list
- Group authorization failed: user missing required group membership
- mTLS required but no client certificate: check mtls setting

Interpreting tool output:

'proxy health':
Healthy: All backends Status=healthy, Latency < 100ms
Warning: Status=healthy but Latency > 500ms — backend is slow, not down
Degraded: Status=unhealthy with Reason: connection_refused | tls_error | timeout | dns_failed
No pools: "No load balancer pools configured" means all mappings are single-backend (normal)
Action: All backends unhealthy → 'proxy circuits' for open breakers
'proxy circuits':
Healthy: all breakers State=closed — normal operation
Half-open: breaker is testing recovery — allow a few requests through, do NOT reset
Open: breaker tripped — backend is failing, check Reason and TripCondition
No pools: same as above — circuit breakers only exist for multi-backend pools
Action: Open breaker → check 'proxy backends' for error counts, fix backend, then 'proxy reset'
'proxy backends':
Healthy: ActiveConns reasonable, ErrorRate < 1%, Latency stable
Degraded: ErrorRate > 5% or Latency spiking — backend may be overloaded
Ejected: Outlier detection removed backend — 'proxy uneject' to re-admit after fixing
'proxy traffic':
Normal: RequestRate steady, ErrorRate < 1%, Latency p99 < 500ms
Abnormal: Sudden RequestRate spike (possible attack), ErrorRate > 5% (backend issue)
Zero traffic: Route exists but no requests — check DNS/certificate for that hostname

Relationships

Module dependencies and interactions:

  • loadbalancer: Pool management, backend selection, health checks, circuit breakers,
outlier detection. Multi-algorithm support (round-robin, weighted, least-conn,
consistent hash, Maglev).
  • sessions: Authentication enforcement via session cookies. Session creation during
OIDC callback. Session group monitor revokes on group changes.
  • certificates: TLS termination, SNI-based certificate selection for per-mapping certs.
Falls back to service.tls_cert/tls_key if no mapping-specific cert. Invalid or missing
certificates prevent route from mounting.
  • waf: Request filtering applied before proxy forwarding (WAF rules checked first).
  • authentication.oidc: SSO via internal OIDC provider. Uses a dedicated internal OIDC
client with PKCE S256. Back-channel token exchange is in-process (no network hairpin in K8s).
  • directory: Group membership lookup on every request (fresh, not cached). Powers both
per-request authorization and X-Hexon-Groups header, plus landing page app filtering.
  • dns: Backend hostname resolution with DNSSEC validation. Centralized in [dns] section.
Per-route overrides for internal backends or unsigned DNS zones.
  • firewall: Network-level access rules applied before proxy routing.
  • protection: Rate limiting (JA4 fingerprint-based) and size limiting at router level.
Per-mapping bypass via disable_rate_limit/disable_size_limit context keys.
  • connection_pool: Backend HTTP connection management with adaptive pool sizing,
circuit breaker integration, and performance metrics. Pool consolidation for routes
with identical transport configuration.
  • render: Landing page and toolbar asset serving (CSS, JS, images).

Architecture

Request flow:

  1. Client request arrives → TLS termination with SNI certificate selection
  2. Hostname match: O(1) hash map lookup → PathMatcher for that hostname
  3. Path match: 3-tier hybrid matcher (exact → prefix tree → regex scan)
  4. Middleware chain: rate limit → size limit → subnet check → method filter
  5. Authentication: Bearer token check (opaque cache → JWT session cache → Ed25519 verify)
PAT detection: if JWT has jti → PAT session validation (revocation + IP restriction) → last_used update
then OIDC SSO check → redirect to /oidc/auth if no session
  6. Authorization: group membership verified (fresh directory lookup, OR logic)
  7. Director: inject identity headers, sign headers (Ed25519), sign request if enabled
  8. Request forwarded to backend (load balanced if multi-backend)
  9. Response processing: URL rewriting (HTML + response headers), cookie domain rewriting,
toolbar + JS interceptor injection, custom header overrides
  10. Response sent to client (zero-copy mode skips step 9 for API routes)

Route matching — 3-tier hybrid matcher:

Tier 1 (exact): O(1) hash map — static paths like ^/api/v1/users$
Tier 2 (prefix): O(log n) radix tree — wildcard paths like ^/api/.*
Tier 3 (regex): O(n) sequential — complex patterns with alternation
Auto-priority: (prefix_length × 100) + end_anchor_bonus(100) - alternation_penalty(50)
Manual priority >1000 overrides auto-calculation. Catch-all always priority 0.
Performance: exact ~50ns, prefix ~100-200ns, regex ~500ns-50μs (scales with route count).
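The auto-priority formula can be sketched as follows. This is an illustrative reading of the formula, assuming "prefix_length" means the literal prefix before the first regex metacharacter; the matcher's actual definition may differ.

```python
def auto_priority(pattern):
    """Sketch of: (prefix_length * 100) + end_anchor_bonus(100)
    - alternation_penalty(50). Prefix length is taken as the literal
    run before any regex metacharacter (assumption).
    """
    meta = set(".*+?[](){}|\\^$")
    literal = pattern.lstrip("^")
    prefix = 0
    for ch in literal:
        if ch in meta:
            break
        prefix += 1
    score = prefix * 100
    if pattern.endswith("$"):
        score += 100            # end-anchor bonus
    if "|" in pattern:
        score -= 50             # alternation penalty
    return score
```

Under this reading, exact patterns like ^/api/v1/users$ naturally outrank wildcard prefixes like ^/api/.*, which matches the tier ordering above.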

OIDC SSO flow (solves cross-domain cookie problem):

1. User hits proxy host with no session cookie
2. Redirect to OIDC provider with PKCE challenge
3. If user has existing session on main domain → auto-approved (no login prompt)
4. Redirect back: /_hexon/oidc/callback?code=...
5. Token exchange in-process (no network hairpin)
6. Session cookie set ON the proxy host domain → future requests have session
Security: PKCE S256, AES-GCM encrypted state with cluster key derivative,
CSRF double-submit cookie, 10-minute state expiry, host binding in state,
open redirect prevention (return URL validated against proxy hostname).

Hot-reload mechanism:

1. Config change detected (file watcher or SIGHUP)
2. Config hash compared to detect actual changes (skip if identical)
3. Routes rebuilt from new configuration
4. HTTP transport cache checked for connection pool reuse
5. Circuit breaker state preserved for unchanged routes
6. Response cache selectively invalidated (only changed routes)
7. Server routes re-registered (atomic hostname updates)
8. Proxy state swapped atomically (lock-free reads, +2ns overhead)
Reload is all-or-nothing: failure keeps old routes active, error logged.
Duration: 50-200ms for typical configs (10-50 routes).

HTML processing pipeline (when rewrite_host=true):

1. Backend response received → check Content-Type (only process text/html)
2. Decompress if needed (gzip always; Brotli if brotli_support=true)
3. Replace backend URLs with proxy URLs in HTML body
4. Rewrite URL-containing response headers (Link, Content-Location, Refresh)
5. Rewrite Set-Cookie domains (case-insensitive attribute matching)
6. Inject JavaScript interceptor (rewrites fetch, XHR, dynamic elements, window.open)
7. Inject logout toolbar before </body> (if auth=true and inject_toolbar=true)
8. Re-compress response (gzip or Brotli based on client Accept-Encoding)
Zero-copy mode (rewrite_host=false, inject_toolbar=false): skip all HTML processing,
eliminating 10MB allocation per request. Ideal for APIs, WebSocket, streaming.

Request Shadow/Mirror

Asynchronous HTTP request shadowing for proxy mappings with sampling and dedicated transport

Overview

The shadow module duplicates HTTP requests to secondary backends for testing, analytics, or migration validation. Shadow requests are completely asynchronous and never affect the primary request/response flow.

Core capabilities:

  • Asynchronous dispatch (no blocking the main request path)
  • Configurable sampling via runtime_fraction (percentage or fractional modes)
  • Dedicated HTTP transport pool separate from main proxy traffic
  • Shadow identification headers for backend awareness (X-Hexon-Shadow-*)
  • Per-shadow timeout and body size limits to prevent resource exhaustion
  • Distributed trace ID propagation for end-to-end observability
  • Per-mapping shadow configuration with global defaults
  • Multiple shadow targets per proxy mapping (e.g., canary + analytics)
  • Connector site routing: shadow targets at remote sites via QUIC tunnels

Shadow dispatch flow:

1. Client request arrives at proxy handler
2. Proxy forwards to primary backend (normal flow)
3. For each configured shadow target, sampling decision is evaluated
4. If sampled, request is dispatched asynchronously to the shadow module
5. Shadow module replays the request to the shadow backend with its own transport
6. Shadow response is discarded (metrics only, no client impact)
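The fire-and-forget dispatch in steps 3-5 can be sketched like this. `send` and `sampler` are hypothetical stand-ins for the shadow transport and the runtime_fraction decision; the actual module is not written in Python.

```python
import threading

def dispatch_shadows(request, shadow_targets, send, sampler):
    """Fire-and-forget shadow dispatch: the primary path never waits.

    Responses are discarded by the caller of `send`; only metrics would
    be recorded. No join()/Wait() anywhere on the primary path.
    """
    threads = []
    for target in shadow_targets:
        if not sampler(target):
            continue                       # per-target sampling decision
        t = threading.Thread(target=send, args=(target, request), daemon=True)
        t.start()                          # asynchronous dispatch
        threads.append(t)
    return threads
```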

Shadow identification headers (when AddHeaders is enabled):

X-Hexon-Shadow: "true" - Identifies this as a shadow request
X-Hexon-Shadow-Name: "<name>" - Shadow target name for routing/filtering
X-Hexon-Shadow-Source: "<host>" - Original request host
X-Hexon-Shadow-Time: "<unix>" - Unix timestamp of original request
X-Hexon-Trace-ID: "<uuid>" - Distributed trace ID
X-Forwarded-Host: "<host>" - Standard forwarded host header

Sampling modes:

Percentage: runtime_fraction = { percent = 10 } for 10% of requests
Fractional: runtime_fraction = { numerator = 1, denominator = 1000 } for 0.1%

Use cases:

- Canary deployments: shadow 10% of traffic to new version before cutover
- Analytics pipelines: mirror requests to analytics backend for processing
- Migration validation: compare primary and shadow responses offline
- Load testing: replay production traffic to staging environments
- A/B backend testing: shadow to alternative implementation

Config

Global shadow defaults under [proxy.shadow]:

[proxy.shadow]

enabled = true # Enable shadow dispatch globally
timeout = "5s" # Default timeout for shadow requests
max_body_size = "10MB" # Maximum request body size to shadow
add_headers = true # Add X-Hexon-Shadow-* identification headers
max_idle_conns = 50 # Transport pool: max idle connections
max_idle_conns_per_host = 10 # Transport pool: max idle per host
max_conns_per_host = 100 # Transport pool: max total per host
idle_conn_timeout = 90 # Idle connection timeout (seconds)
tls_handshake_timeout = 10 # TLS handshake timeout (seconds)
tls_verify = true # Verify shadow backend TLS certificates

Per-mapping shadow configuration (overrides global defaults):

[[proxy.mapping]]

host = "api.example.com"
path = ".*"
service = ["https://primary.internal"]
[[proxy.mapping.shadow]]
name = "canary" # Shadow target name (used in metrics and headers)
service = "https://canary.internal:8443" # Shadow backend URL
runtime_fraction = { percent = 10 } # Sample 10% of requests
add_headers = true # Override global add_headers
[[proxy.mapping.shadow]]
name = "analytics"
service = "https://analytics.internal"
runtime_fraction = { numerator = 1, denominator = 1000 } # 0.1% sampling
timeout = "2s" # Override global timeout
# Shadow target at a remote site via connector tunnel:
[[proxy.mapping.shadow]]
name = "staging-mirror"
service = "https://staging.internal:8443"
site = "staging-eu" # Routes through connector tunnel

Sampling configuration:

Percentage mode (0-100):
runtime_fraction = { percent = 10 } # 10% of requests
Fractional mode (precise low rates):
runtime_fraction = { numerator = 1, denominator = 1000 } # 0.1%
No runtime_fraction: 100% of requests are shadowed.
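The three cases above reduce to a per-request random draw. A minimal sketch of the documented semantics (not the module's code):

```python
import random

def should_shadow(runtime_fraction, rng=random.random):
    """Evaluate a runtime_fraction table per the documented semantics.

    Percentage mode: {"percent": 10} -> ~10% of requests.
    Fractional mode: {"numerator": 1, "denominator": 1000} -> 0.1%.
    Missing/empty runtime_fraction -> shadow 100% of requests.
    """
    if not runtime_fraction:
        return True
    if "percent" in runtime_fraction:
        return rng() * 100 < runtime_fraction["percent"]
    return rng() * runtime_fraction["denominator"] < runtime_fraction["numerator"]
```

Because the decision is an independent random draw per request, observed rates converge to the configured fraction only over many requests, which is why short windows show variance.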

Hot-reloadable: runtime_fraction, timeout, add_headers, max_body_size. Cold (restart required): enabled, transport pool settings (max_idle_conns, etc.).

Troubleshooting

Common symptoms and diagnostic steps:

Shadow requests not reaching the backend:

- Verify [proxy.shadow] enabled = true
- Check shadow target name matches in proxy mapping configuration
- Verify shadow service URL is reachable from the Hexon server
- Test connectivity: 'net tcp <shadow-host>:<port> --tls'
- Check runtime_fraction is set correctly (0% = no traffic)
- Verify max_body_size is sufficient for the request payload
- Check shadow metrics: shadow_requests_total should be incrementing

Shadow requests timing out:

- Increase timeout setting (default 5s may be too short for slow backends)
- Check shadow backend health and response times
- Verify network path between Hexon and shadow backend
- Check max_conns_per_host limit (100 default) is not exhausted
- Monitor shadow_request_duration histogram for latency distribution

Transport pool exhaustion:

- Increase max_idle_conns (default 50) for high-traffic deployments
- Increase max_conns_per_host (default 100) for single-target shadows
- Reduce timeout to free connections faster
- Check idle_conn_timeout (default 90s) is appropriate
- Use short timeouts (2-5s) to prevent connection buildup

TLS errors to shadow backend:

- Verify shadow backend TLS certificate is valid
- Check tls_verify setting (set to false only for testing)
- Verify tls_handshake_timeout is sufficient (default 10s)
- Check if shadow backend requires specific TLS version or ciphers

Sampling rate seems incorrect:

- Percentage mode: percent = 10 means approximately 10% (not exact)
- Fractional mode: numerator/denominator for precise low rates
- Sampling is per-request, random; short windows may show variance
- Check shadow_requests_total vs total proxy requests for actual rate

Shadow affecting primary request latency:

- Shadow dispatch should be fire-and-forget (no Wait())
- If primary slows down, check if body buffering is the cause
- Reduce max_body_size to limit memory allocation for large payloads
- Verify shadow dispatch is non-blocking (asynchronous)

Metrics and monitoring:

- shadow_requests_total{shadow_name}: total dispatched shadow requests
- shadow_success_total{shadow_name}: requests with 2xx/3xx responses
- shadow_errors_total{shadow_name, error_type}: requests with errors
- shadow_request_duration{shadow_name}: latency histogram

Relationships

Module dependencies and interactions:

  • proxy: Primary consumer. The reverse proxy handler evaluates shadow
configuration for each proxy mapping and dispatches shadow requests
asynchronously when sampling criteria are met.
Shadow config is nested under [[proxy.mapping.shadow]].
  • config: Global defaults from [proxy.shadow] merged with per-mapping
shadow overrides. Runtime_fraction, timeout, and add_headers are
hot-reloadable. Transport pool settings require restart.
  • telemetry: Shadow metrics exported for monitoring: request counts,
success/error counts, and latency histograms per shadow target name.
Structured logging for dispatch and response events.
  • dns: Shadow backend hostnames resolved via the DNS module with standard
resolution and caching behavior.
  • certs: TLS certificate verification for shadow backends uses the
system trust store. tls_verify controls whether verification is enforced.
  • Cluster RPC: Shadow dispatch uses the fire-and-forget pattern to ensure
zero impact on primary request path. No cluster coordination needed;
shadow runs on the receiving node only.
  • connector: When a shadow target specifies a "site" parameter, requests
route through the QUIC connector tunnel to the remote site instead of a
direct connection. Transport is cached per site for connection reuse.