Observability

DNS Resolution

Custom DNS resolvers with DNSSEC validation, distributed caching, health checks, and adaptive selection

Overview

The DNS module provides secure DNS resolution services for all Hexon components including proxy, bastion, cluster discovery, and VPN.

Capabilities:

  • Custom DNS resolvers with automatic failover and circuit breaker pattern
  • DNSSEC validation in two modes: resolver-trust (fast) and full cryptographic (secure)
  • Distributed DNS caching via the memory storage module (local reads, broadcast writes)
  • Lookup coalescing: concurrent requests for the same hostname share a single upstream lookup, shrinking the window for cache poisoning
  • Hostname validation to block DNS injection attacks (null bytes, CRLF)
  • IPv4 preference when both A and AAAA records are available
  • CNAME flattening with configurable depth limit (default 16, per RFC 1034)
  • DNS-over-TLS (DoT) support for encrypted transport (RFC 7858)
  • Adaptive resolver selection using epsilon-greedy algorithm (20-40% lower latency)
  • Health checking with exponential backoff and automatic system DNS fallback
  • Typed DNS queries for 30+ record types (A, AAAA, CAA, TLSA, SRV, MX, etc.)
  • Context propagation for request cancellation and graceful shutdown
  • TTL sanitization to prevent integer overflow attacks (capped at 1 week)

Operations:

  • Resolve: DNS resolution with optional DNSSEC, caching, and resolver selection
  • ValidateHostname: RFC-compliant hostname validation against injection attacks
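
The checks behind ValidateHostname can be sketched as follows. This is an illustrative Python approximation of the rules named on this page (null bytes, CRLF, RFC length limits); the real implementation may apply additional rules.

```python
import re

# Illustrative sketch of the injection checks described above; not the
# module's actual implementation.
_LABEL = re.compile(r"^[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?$")

def validate_hostname(hostname: str) -> bool:
    """Reject hostnames that could carry DNS injection payloads."""
    if not hostname or len(hostname) > 253:            # RFC 1035 length cap
        return False
    if "\x00" in hostname or "\r" in hostname or "\n" in hostname:
        return False                                    # null bytes, CRLF
    labels = hostname.rstrip(".").split(".")
    return all(_LABEL.match(label) for label in labels)
```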

Config

Core configuration under [dns]:

[dns]

timeout = 5 # DNS query timeout in seconds (default: 5)
cache_ttl = 300 # Default cache TTL in seconds (default: 300)
cache_override = false # Ignore DNS server TTL, always use cache_ttl (default: false)
resolvers = ["1.1.1.1:53", "8.8.8.8:53", "9.9.9.9:53"] # DNS resolvers (default: cluster.cluster_dns_resolvers)
flatten_cname = true # Follow CNAMEs to final A/AAAA records (default: true)
max_cname_depth = 16 # Max CNAME chain depth to prevent loops (default: 16)

DNSSEC settings:

dnssec_full_validation = false # Full cryptographic RRSIG/DNSKEY validation (default: false)
dnssec_strict = false # Fail if zone is not DNSSEC-signed (default: false)

DNS-over-TLS (DoT):

dot_enabled = false # Enable DNS-over-TLS transport (default: false)
dot_port = 853 # DoT port per RFC 7858 (default: 853)
dot_verify_server_cert = true # Verify resolver TLS certificate (default: true)

Health checking:

health_check_enabled = true # Enable resolver health monitoring (default: true)
health_check_interval = 30 # Health check interval in seconds (default: 30)
health_failure_threshold = 2 # Consecutive failures before marking unhealthy (default: 2)
health_check_query = "google.com" # Domain used for health check probes (default: "google.com")

Adaptive resolver selection (epsilon-greedy ML):

adaptive_selector_enabled = true # Enable adaptive resolver selection (default: true)
adaptive_exploration_rate = 0.10 # Exploration rate 0.0-1.0 (default: 0.10 = 10%)
adaptive_smoothing_factor = 0.3 # EMA smoothing factor for latency tracking (default: 0.3)
adaptive_min_sample_size = 100 # Queries before switching from learning to intelligent mode (default: 100)
adaptive_load_balance_enabled = true # Penalize recently-used resolvers to spread load (default: true)

Resolver architecture — three separate resolver pools:

dns.resolvers # Infrastructure resolvers (health-checked, used by all modules)
cluster.cluster_dns_resolvers # Cluster discovery resolvers (fallback if dns.resolvers unset)
proxy.dns.resolvers # Proxy-specific override (must be subset of dns.resolvers)

Per-route proxy DNS overrides in [[proxy.mapping]]:

dnssec = true # Override global DNSSEC setting for this route
dns_resolvers = ["10.0.0.1:53"] # Override resolvers for this route (must be in dns.resolvers)

TTL precedence (cache_override=false): DNS server TTL > dns.cache_ttl > 300s default.
TTL precedence (cache_override=true): dns.cache_ttl > 300s default.
TTL bounds: minimum 1 second, maximum 604800 seconds (1 week).
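
A sketch of the precedence and clamping rules above, with the bounds and defaults taken from this page (the function name is illustrative):

```python
MIN_TTL, MAX_TTL, DEFAULT_TTL = 1, 604_800, 300  # 1s, 1 week, 300s

def effective_ttl(server_ttl, cache_ttl, cache_override):
    """Pick the TTL per the precedence above, then clamp to [1s, 1 week]."""
    if cache_override or server_ttl is None:
        ttl = cache_ttl if cache_ttl else DEFAULT_TTL
    else:
        ttl = server_ttl
    if ttl <= 0:                                  # zero defaults to 300s
        ttl = DEFAULT_TTL
    return max(MIN_TTL, min(ttl, MAX_TTL))
```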

Cache key format: "dns_cache:{hostname}:{resolver_hash}", where resolver_hash is a SHA-256 hash of the resolver list truncated to 128 bits. Cache reads are local (no network). Cache writes broadcast to the cluster (fire-and-forget).
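
A minimal sketch of that key derivation. Sorting the resolver list before hashing and truncating the digest to 32 hex characters (128 bits) are assumptions; the function name is illustrative.

```python
import hashlib

def dns_cache_key(hostname: str, resolvers: list) -> str:
    """Illustrative cache-key builder: SHA-256 over the (sorted) resolver
    list, truncated to 128 bits, as described above."""
    digest = hashlib.sha256(",".join(sorted(resolvers)).encode()).hexdigest()
    return f"dns_cache:{hostname}:{digest[:32]}"
```

Sorting makes the hash insensitive to resolver ordering, so the same pool always maps to the same cache entry.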

Hot-reloadable: resolvers, DNSSEC settings, cache TTL, health check parameters, adaptive settings. Cold (restart required): dot_enabled, dot_port.

Troubleshooting

Common symptoms and diagnostic steps:

DNS resolution failures:

- Check resolver health: 'dns resolvers' shows status, latency, and failure counts
- Test specific hostname: 'dns test <hostname>' performs live resolution
- All resolvers unhealthy: module falls back to system DNS (/etc/resolv.conf)
- Resolver filtered out: proxy resolvers must be a subset of dns.resolvers
- Cross-subsystem check: 'diagnose domain <hostname>' tests DNS + proxy + TLS together

DNSSEC validation errors:

- Zone not signed: set dnssec_strict=false to allow unsigned zones (default)
- Resolver-trust mode: compromised resolver can fake AD bit — use dnssec_full_validation=true
- Full validation slow: first query ~200ms (chain of trust), cached queries ~50ms
- Clock skew: DNSSEC signatures have validity windows — ensure NTP is running
- Check validation: 'dns test <hostname> --dnssec' shows validation result and mode
- Strict mode blocking: dnssec_strict=true rejects all unsigned zones — check per-route override

Slow DNS resolution:

- Check cache hit rate: 'dns cache' shows hit/miss ratio and entry count
- High cache miss: increase cache_ttl or set cache_override=true for static backends
- Resolver latency: 'dns resolvers' shows per-resolver average latency (EMA)
- Adaptive selector: 'dns adaptive' shows resolver scores and selection distribution
- Learning phase: first 100 queries use round-robin — performance improves after
- CNAME chains: deep chains add latency per hop — check with 'dns test <hostname>'

All resolvers down (circuit breaker tripped):

- Health checker marks resolver unhealthy after 2 consecutive failures (configurable)
- Backoff schedule: 30s, 1m, 2m, 4m, 8m, 15m (max)
- System DNS fallback activates automatically when all custom resolvers fail
- Recovery is automatic — resolver returns to pool when health check succeeds
- Force re-check: 'dns health --reset' clears backoff timers
- Check: 'dns resolvers' shows healthy/unhealthy status and next retry time

Cache poisoning concerns:

- Lookup coalescing: concurrent requests for same hostname share single lookup result
- Per-hostname locking prevents race conditions (no global bottleneck)
- Enable DNSSEC (dnssec_full_validation=true) for cryptographic validation
- Use DoT (dot_enabled=true) to encrypt DNS transport against snooping

CNAME resolution issues:

- CNAME not followed: check flatten_cname=true (default)
- CNAME loop detected: max_cname_depth exceeded (default 16) — check DNS zone config
- CNAME + ACL: ACL checks use original hostname, not CNAME target (prevents bypass)
- Metrics: dns.cname_resolutions_total tracks success and depth_exceeded counts
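
The flattening and depth-limit behavior can be sketched as follows, assuming a lookup callback that returns either a CNAME target or a final record set (names and the return shape are illustrative):

```python
class CNAMEDepthExceeded(Exception):
    """Raised when a CNAME chain exceeds the configured depth (loop guard)."""

def flatten(hostname, lookup, max_depth=16):
    """Follow CNAMEs to the final A/AAAA answer, refusing loops.

    `lookup(host)` returns ("CNAME", target) or ("A", [ips]) in this sketch.
    """
    for _ in range(max_depth):
        rtype, value = lookup(hostname)
        if rtype != "CNAME":
            return value            # final A/AAAA records
        hostname = value            # follow the chain one hop
    raise CNAMEDepthExceeded(hostname)
```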

DoT connection failures:

- Port blocked: DoT uses port 853 (RFC 7858) — verify firewall rules
- Certificate error: set dot_verify_server_cert=false to diagnose (re-enable after)
- Non-standard port: module warns if dot_port is not 853

502/503 from proxy due to DNS:

- DNSSEC failure blocks connection (no system DNS fallback for security)
- DNS infrastructure failure falls back to system DNS (availability)
- Fix: set dnssec=false on specific proxy routes for unsigned internal zones
- Verify: 'dns test <backend-hostname> --dnssec' to check DNSSEC status

Interpreting tool output:

'dns health':
Healthy: Status=healthy, Healthy resolvers = total resolvers
Degraded: Healthy < total — some resolvers failing, but DNS still works
Down: Healthy=0 — all resolvers failed, system DNS fallback active
Action: Degraded/Down → 'dns resolvers' for per-resolver breakdown
'dns resolvers':
Healthy: Status=healthy, Latency < 50ms, Score > 100
Degraded: Status=unhealthy with BackoffUntil timestamp — resolver in circuit breaker
Learning: Score near 100 with low QueryCount — adaptive selector still calibrating (normal)
Action: All unhealthy → check network connectivity to resolver IPs, verify port 53/853 open
'dns test <hostname>':
Success: IPs returned, TTL shown, DNSSEC=valid (if enabled)
DNSSEC failure: DNSSEC=invalid — zone is unsigned or signatures expired
No results: hostname does not resolve — check DNS zone configuration
Action: DNSSEC failure + proxy 502 → set dnssec=false on that proxy route

Architecture

Resolution flow:

  1. Resolve request arrives (from proxy, bastion, VPN, or discovery)
  2. Hostname validation: RFC compliance check, injection prevention (null bytes, CRLF, length)
  3. Cache lookup: local memory read for "dns_cache:{hostname}:{resolver_hash}"
  4. If cache hit: return cached IPs immediately (no network call)
  5. If cache miss: acquire per-hostname lock (coalescing for concurrent requests)
  6. Resolver selection: adaptive selector picks best resolver (or round-robin during learning)
  7. Health filter: only healthy resolvers considered (circuit breaker pattern)
  8. DNS query: send query via UDP (or DoT if enabled) with configured timeout
  9. DNSSEC validation (if enabled):
a. Resolver-trust mode: check AD bit in response
b. Full validation: verify RRSIG signatures, validate DNSKEY chain to root trust anchor
  10. CNAME handling: if CNAME response and flatten_cname=true, recursively resolve target
  11. IPv4 preference: sort results with A records before AAAA records
  12. TTL extraction: from DNS response (DNSSEC/custom resolver) or use configured default
  13. TTL sanitization: clamp to [1s, 604800s], zero defaults to 300s
  14. Cache store: broadcast write to cluster memory (fire-and-forget, best-effort)
  15. Release per-hostname lock, waiting callers receive same result
  16. Return ResolveResponse with IPs, TTL, cached flag, DNSSEC validity
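
The IPv4-preference step amounts to a stable sort that keeps A results ahead of AAAA results. A sketch, with an illustrative helper name:

```python
import ipaddress

def prefer_ipv4(ips):
    """Stable sort: IPv4 (A) addresses before IPv6 (AAAA), original order
    preserved within each family."""
    return sorted(ips, key=lambda ip: ipaddress.ip_address(ip).version)
```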

Adaptive resolver selection (epsilon-greedy):

Learning phase (first 100 queries): round-robin across all healthy resolvers
Intelligent phase: 90% exploitation (best score), 10% exploration (random)
Score = 100 + (success_rate * 50) - (avg_latency_ms / 10) - (timeout_pct * 30) - (consecutive_failures * 20) - (recently_used * 10)
Latency tracked via EMA: new_avg = 0.3 * sample + 0.7 * old_avg
Load balancing penalty: -10 points if resolver used within last 1 second
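
The scoring formula and selection policy can be sketched directly. The formula terms and EMA constant come from this page; the stats layout and function names are illustrative.

```python
import random

def score(s):
    """Resolver score per the formula above (higher is better)."""
    return (100 + s["success_rate"] * 50 - s["avg_latency_ms"] / 10
            - s["timeout_pct"] * 30 - s["consecutive_failures"] * 20
            - (10 if s["recently_used"] else 0))

def pick(resolvers, stats, exploration_rate=0.10, rng=random):
    """Epsilon-greedy: mostly exploit the best score, sometimes explore."""
    if rng.random() < exploration_rate:
        return rng.choice(resolvers)
    return max(resolvers, key=lambda r: score(stats[r]))

def ema(old_avg, sample, alpha=0.3):
    """Latency EMA: new_avg = 0.3 * sample + 0.7 * old_avg."""
    return alpha * sample + (1 - alpha) * old_avg
```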

Health checker circuit breaker:

Healthy: failure_count = 0, available for selection
Unhealthy: failure_count >= threshold (default 2), excluded from selection
Backoff: 30s -> 1m -> 2m -> 4m -> 8m -> 15m (max)
Recovery: single successful health check returns resolver to healthy state
System DNS fallback: automatic when ALL custom resolvers are unhealthy
Memory cleanup: Resolver sync removes stale entries on config reload
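
The backoff schedule above is plain doubling from 30s capped at 15 minutes; a sketch (function name illustrative):

```python
def backoff_seconds(failures_past_threshold: int) -> int:
    """Exponential backoff: 30s doubling up to a 15m cap, reproducing the
    schedule above (30s, 1m, 2m, 4m, 8m, 15m)."""
    return min(30 * 2 ** max(failures_past_threshold, 0), 15 * 60)
```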

DNSSEC full validation chain:

1. Query resolver with DO bit set
2. Extract RRSIG from response
3. Fetch DNSKEY for target zone (cached with TTL)
4. Verify RRSIG signature using DNSKEY (RSA/SHA-256, ECDSA P-256, Ed25519)
5. Fetch DS record from parent zone
6. Verify DNSKEY hash matches DS record
7. Recurse up to root zone
8. Validate root DNSKEY against hardcoded IANA trust anchor (KSK 20326)
9. Validate NSEC/NSEC3 for authenticated denial of existence

Distributed caching via memory module:

Read path: local-only (no network, no quorum)
Write path: broadcast to all cluster nodes (fire-and-forget)
Key format: "dns_cache:{hostname}:{sha256_hash_of_resolvers}" (collision-resistant)
Eviction: TTL-based (respects DNS TTL or configured override)
Coalescing: per-hostname mutex prevents concurrent duplicate lookups
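
The per-hostname coalescing can be sketched with one mutex per hostname guarding a shared cache. This illustrative version omits TTL eviction and cluster broadcast; the double-check under the lock is what makes concurrent callers share a single upstream lookup.

```python
import threading
from collections import defaultdict

class Coalescer:
    """Illustrative per-hostname lookup coalescing (no global lock)."""

    def __init__(self, lookup):
        self._lookup = lookup
        self._locks = defaultdict(threading.Lock)   # one mutex per hostname
        self._cache = {}

    def resolve(self, hostname):
        if hostname in self._cache:                 # local read, no network
            return self._cache[hostname]
        with self._locks[hostname]:
            if hostname not in self._cache:         # losers reuse the result
                self._cache[hostname] = self._lookup(hostname)
            return self._cache[hostname]
```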

Metrics emitted:

dns.resolve_total (tags: status, cached, dnssec)
dns.resolve_latency_ms (histogram)
dns.cache_hit_total / dns.cache_miss_total
dns.health_check_total (tags: resolver, status)
dns.adaptive_resolver_selected (tags: resolver, reason)
dns.resolver_score (gauge, tags: resolver)
dns.resolver_avg_latency_ms (gauge, tags: resolver)
dns.cname_resolutions_total (tags: status)

Relationships

Module dependencies and interactions:

  • proxy: Backend hostname resolution for all proxy routes. Uses [dns] configuration
by default. Per-route overrides via dnssec and dns_resolvers fields in [[proxy.mapping]].
DNSSEC validation failure blocks connection (no system DNS fallback — prevents downgrade).
DNS infrastructure failure falls back to system DNS (availability).
  • bastion: SSH connection and port forwarding hostname resolution. Uses [dns] configuration
directly (no bastion-specific overrides). DNSSEC protects against SSH destination poisoning.
  • discovery: Cluster peer discovery via DNS SRV records. Uses [dns] configuration for
resolver settings. Critical for cluster formation and membership.
  • vpn (IKEv2): VPN client DNS resolution and split DNS support for tunneled queries.
Uses [dns] configuration for upstream resolver selection.
  • acme: ACME challenge validation uses typed DNS queries (CAA record checking per RFC 8659).
SERVFAIL handling distinguishes "no records" from "DNS infrastructure error" for security.
  • memory: Distributed DNS cache storage. Local reads (fast), broadcast writes (best-effort).
No quorum required — cache is opportunistic, falls back to fresh lookup on miss.
  • config: Reads [dns] and [cluster] TOML sections. Hot-reload updates resolvers, DNSSEC
settings, cache parameters, health check configuration, and adaptive selection tuning.
Resolver sync cleans up stale resolver state on reload (memory leak prevention).
  • metrics (telemetry): Emits counters, histograms, and gauges for resolution, caching,
health checks, and adaptive selection. Enables monitoring dashboards and alerting.

Kubernetes CRD Configuration

Kubernetes-native configuration via Custom Resource Definitions with bootstrap reconciliation, live watching, and status feedback

Overview

HexonGateway supports Kubernetes-native configuration through Custom Resource Definitions (CRDs). When running in Kubernetes, operators can manage gateway configuration using standard kubectl commands instead of (or alongside) TOML files.

The system defines 55 CRD types covering every configuration section:

- Service, cluster, telemetry, health, DNS, SMTP, filesystem, memory
- Proxy mappings, connection pools, TCP proxy, forward proxy, subrequest
- Authentication: OIDC clients, SAML service providers, auth flows, signup flows
- Identity: LDAP, OIDC providers, SCIM providers
- Protection: WAF config, WAF rules, firewall rules/aliases, rate limiting
- Infrastructure: VPN, bastion, SQL bastion, SSH certificates, port forwarding
- Certificates: ACME CA server, ACME client
- Operations: admin, MCP, LLM, playbooks, webhooks, SPIFFE, RADIUS
- Observability: log intelligence, notifications

CRDs are optional — the gateway runs identically on VMs, Docker, or Kubernetes using TOML configuration. CRDs provide a Kubernetes-native alternative that integrates with GitOps tools like ArgoCD and Flux.

All CRDs belong to the config.hexon.io API group with v1alpha1 version. Namespaced scope — instances live in the hexon-system namespace by default.

Config

CRD Installation:

kubectl apply -f https://registry.hexon.io/crds/latest/hexon-crds.yaml

Or a specific version:

kubectl apply -f https://registry.hexon.io/crds/0.9.1/hexon-crds.yaml

Individual CRDs can also be installed:

kubectl apply -f https://registry.hexon.io/crds/0.9.1/hexonproxies.yaml

CRD Lifecycle:

1. Bootstrap: On first start, the cluster leader creates CRD instances from
the running config (TOML + env overrides + defaults merged). Each instance
is labeled config.hexon.io/origin: bootstrap.
2. Pruning: Bootstrap-owned array CRDs no longer in config are automatically
deleted, along with their companion Secrets. Operator-owned CRDs are never
touched. This ensures TOML deletions propagate to Kubernetes.
3. Watching: Informers watch for CRD changes via the Kubernetes API. Changes are
debounced (500ms window) to batch rapid edits.
4. Apply: CRD spec is converted to the internal config struct, validated, and
applied atomically. Config reload callbacks fire for all modules.
5. Status: Each CRD instance gets status conditions reflecting apply success/failure.
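
The debouncing in step 3 can be sketched with a resettable timer: every watch event restarts a 500ms window, and only the last event in a burst triggers an apply. Class and wiring here are illustrative.

```python
import threading

class Debouncer:
    """Collapse bursts of events into one apply after a quiet window."""

    def __init__(self, apply_fn, window=0.5):   # 500ms default window
        self._apply, self._window = apply_fn, window
        self._timer = None
        self._lock = threading.Lock()

    def notify(self):
        """Called on every watch event; restarts the window."""
        with self._lock:
            if self._timer:
                self._timer.cancel()            # previous event superseded
            self._timer = threading.Timer(self._window, self._apply)
            self._timer.start()
```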

Example — create a proxy mapping:

apiVersion: config.hexon.io/v1alpha1
kind: HexonProxy
metadata:
  name: dashboard
  namespace: hexon-system
spec:
  hostname: dashboard.example.com
  target: http://dashboard-service:3000
  auth_type: oidc

Sensitive fields (SecretKeyRef):

Sensitive config fields (certificates, private keys, passwords, API secrets,
RADIUS shared secrets, OIDC client secrets) are never stored in CRD specs.
Instead, they are stored in companion Kubernetes Secrets and referenced via
SecretKeyRef entries in the CRD spec:
spec:
  apiKey:
    name: hexon-hexonproxies-dashboard # Secret name
    key: apiKey # Key within the Secret
- Empty sensitive fields (e.g., no custom certificate) produce no Secret.
The field stays empty and the gateway uses its default (e.g., wildcard cert).
- Non-empty fields are stored in a companion Secret named
hexon-<plural>-<instance> (e.g., hexon-hexonproxies-dashboard).
- Operators can reference any Secret they create — not limited to the
bootstrap naming convention.
- RBAC: The gateway pod needs get/list/create/update/delete on core Secrets.

Ownership model:

- Bootstrap-created CRDs and companion Secrets have label:
config.hexon.io/origin: bootstrap
- Remove the label to "take ownership" — bootstrap will no longer overwrite
- Operator-created CRDs (no label) are never modified or deleted by bootstrap
- Bootstrap-owned array CRDs removed from config are pruned on next restart

Singleton vs Array CRDs:

- Singleton: one instance named "default" (e.g., HexonClusterConfig, HexonDNSConfig)
- Array: multiple instances, name derived from config (e.g., HexonProxy per mapping)

Resource naming:

- K8s resource names are sanitized from config names (lowercased, spaces/underscores
to dashes, special chars to dashes, max 253 chars). Example: config app "Kubernetes /
Production" becomes resource name "kubernetes---production".
- The original config name is preserved in the CRD spec (e.g., spec.app for proxies).
- The "crd show" command accepts either the K8s resource name or the original config name.
- Use "crd list <kind>" to see both resource names and config names side by side.
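
The sanitization rule can be sketched as below; the exact allowed-character set is an assumption based on the description above (dots and dashes kept, everything else dashed).

```python
import re

def sanitize_k8s_name(config_name: str) -> str:
    """Lowercase, turn spaces/underscores/special chars into dashes,
    cap at 253 characters (the K8s name limit cited above)."""
    name = config_name.lower()
    name = re.sub(r"[^a-z0-9.-]", "-", name)
    return name[:253]
```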

Status conditions:

Every CRD instance reports an "Applied" condition:
Applied=True reason=ConfigValid — config applied successfully
Applied=False reason=ApplyError — config apply failed
Applied=False reason=ConversionError — CRD-to-config conversion failed
Check status: kubectl get hexonproxies -o wide
The "Applied" printer column shows the current phase (Ready/Error).

Troubleshooting

CRD instances not being created on startup:

- Only the cluster leader runs bootstrap reconciliation
- Check logs for "bootstrap reconciliation complete" message
- Verify CRD definitions are installed: kubectl get crd | grep hexon

CRD changes not applying:

- Changes are debounced with a 500ms window — wait briefly
- Check status conditions: kubectl describe <crd-kind> <name>
- Look for Applied=False with reason and message
- Verify RBAC: the gateway pod needs get/list/watch/create/update/patch permissions
on all config.hexon.io resources and their status subresources

Status shows Applied=False reason=ConversionError:

- The CRD spec doesn't match the expected config structure
- Check field names match TOML keys (snake_case in spec)
- Verify enum values are valid (e.g., auth_type must be a recognized method)

Bootstrap keeps overwriting my changes:

- Remove the config.hexon.io/origin label from the CRD instance:
kubectl label hexonproxy <name> config.hexon.io/origin-
- Once the label is removed, bootstrap treats it as operator-owned and skips it
- Do the same for companion Secrets if you want to manage them independently

Sensitive field shows empty after CRD apply:

- Check the companion Secret exists: kubectl get secret hexon-<plural>-<name>
- Verify the Secret has the expected key: kubectl get secret <name> -o jsonpath='{.data}'
- Check RBAC allows Secret read: the gateway pod needs get on core/v1 secrets
- If the Secret was manually deleted, restart the gateway to recreate it via bootstrap

CRD still exists after removing mapping from TOML:

- Bootstrap prunes only CRDs with the config.hexon.io/origin: bootstrap label
- If the label was removed (operator-owned), delete it manually:
kubectl delete hexonproxy <name>

Config export for migration:

- Use the admin CLI: config export
- Exports running config as multi-document YAML CRD manifests
- Filter by section: config export proxy
- JSON format: config export --format=json
- Only available when running in Kubernetes

Relationships

Module dependencies and interactions:

  • Configuration system: CRD changes are applied to the same config store used by
TOML and environment variables. All modules see changes via the standard config
reload mechanism. CRDs have the same precedence as TOML — environment variables
still override.
  • Cluster coordination: Bootstrap reconciliation runs on the cluster leader only.
Config changes from CRDs propagate to all nodes via the standard config reload
broadcast (NATS-based).
  • Admin CLI: The “config export” command generates CRD YAML from running config,
enabling migration from TOML to Kubernetes-native management. When using
"config export --apply", companion Secrets are created for sensitive fields
(without the bootstrap label — operator-owned). The "config show" and
"config describe" commands work regardless of config source.
  • Helm chart: CRDs are distributed separately from the Helm chart. Install CRDs
first, then deploy the chart. This avoids Helm's CRD lifecycle limitations
(no update on upgrade, deletion on uninstall).
  • CI/CD integration: CRD manifests are published to registry.hexon.io/crds/ with
versioning. Compatible with ArgoCD, Flux, and other GitOps tools. The all-in-one
bundle (hexon-crds.yaml) contains all 55 CRD definitions.
  • Codegen tool: CRD YAML manifests are generated from Go struct tags using the
build tool (build-crd.sh). OpenAPI v3 schemas include validation constraints
derived from struct tags (required, enum, min, max, default, desc).

AI Assistant

Built-in AI-powered natural language interface for gateway operations via the bastion shell

Overview

The AI assistant enables natural language interaction with all gateway admin tools through the bastion shell’s “ai” command. It shares the same tool set and execution path as MCP, ensuring identical tool visibility, read/write enforcement, metrics, and audit logging.

Capabilities:

Tool execution - Runs any admin CLI command via an agentic loop. The AI
reads tool results, reasons about them, and decides what to run next.
Read-only commands execute automatically. Write operations pause for
interactive operator approval in the SSH session.
Multi-provider support - Works with Anthropic (Claude), OpenAI (GPT-4),
Azure OpenAI, Google Gemini, and Ollama/vLLM for local models. Provider
auto-detected from the API URL or set explicitly.
Conversation context - Maintains per-session conversation history so
follow-up questions build on prior answers. Operators can set session
context hints and the AI sees recent shell commands for awareness.
Background monitoring - The schedule_task tool runs commands periodically
in the background. Results appear between shell prompts. Operators
manage tasks with "task list", "task stop".
Inline monitoring loops - The sleep tool pauses the AI within its
reasoning loop, then resumes with full context. Enables "check health,
wait 30s, check again, compare, report changes" patterns. Each sleep
extends the tool-calling budget so monitoring does not fight the
per-query round limit. Governed by max_sleep_duration (default 5m per
call) and max_sleeps_per_query (default 60 iterations). Ctrl+C
interrupts immediately.
Cluster knowledge base - Persistent cross-session memory for operational
insights and rules. The AI learns from investigations and applies that
knowledge in future sessions.
Prompt caching - Anthropic provider supports prompt caching (5m or 1h
TTL) to reduce token costs on repeated interactions.

Configuration: [llm] section with api_url, api_key, model, required_groups. Enable in bastion with [bastion] use_llm = true. Per-user custom instructions via moduledata or config.

Safety

Multiple layers prevent runaway AI behavior:

Tool round limit - max_tool_rounds (default 15) caps the number of
reasoning cycles per query. Sleep calls extend this budget so
monitoring loops get additional rounds.
Write operation limit - max_write_ops_per_query (default 3) caps
mutations per query. The AI cannot retry failing write commands
with slight variations.
Sleep guardrails - max_sleep_duration (default 5m) caps individual
pauses. max_sleeps_per_query (default 60) caps total iterations.
Token cost on each wake-up naturally limits runaway loops.
Failed operation dedup - Commands that fail are tracked by operation
key. The AI cannot re-execute the same failing command.
RBAC - required_groups restricts which operators can use AI features.
allowed_commands whitelist limits which tools the AI can call.
Interactive approval - Write operations prompt the operator for y/n
confirmation in the SSH session before execution.
Audit trail - All AI interactions logged with distributed tracing.
Sensitive data redacted by default.
Rate limiting - Per-user query rate limit (default 10/1m) prevents
excessive API usage.

Security

Multiple defense layers protect the AI assistant:

RBAC - required_groups restricts which operators can use AI features.
Only operators in the configured groups can access the "ai" command.
Command whitelist - allowed_commands limits which tools the AI can call.
Operators cannot override this from within the AI session.
Write protection - Read-only commands execute automatically. Write
operations pause for interactive operator approval (y/n) in the SSH
session before execution. Cannot be overridden by the AI.
Rate limiting - Per-user query rate limit (default 10/1m) prevents
excessive API usage and token cost accumulation.
Audit trail - All AI interactions logged with distributed tracing.
Sensitive data redacted by default (redact_sensitive = true).
Runaway prevention - Tool round limit (default 15), write operation
limit (default 3), sleep guardrails (5m per call, 60 iterations max),
and failed operation dedup all prevent excessive token consumption.

Troubleshooting

Common symptoms and diagnostic steps:

AI command not available in bastion:

- Verify [bastion] use_llm = true in config
- Verify [llm] section is configured with api_url, api_key, model
- Check required_groups: operator must be in one of the listed groups
- Check: 'config show llm' to verify configuration

AI returns errors or empty responses:

- Check API connectivity: verify api_url is reachable from the gateway
- Check API key validity: invalid keys produce authentication errors
- Check provider detection: auto-detect uses api_url hostname, set
provider explicitly if using a proxy or non-standard endpoint
- Check: 'logs search llm --since=5m' for API errors

Write operations not being approved:

- Write ops require interactive SSH session (not available via MCP)
- Operator must respond y/n to the approval prompt
- max_write_ops_per_query (default 3) may be exhausted for the query
- Check allowed_commands whitelist if specific commands are blocked

AI stops responding mid-conversation:

- max_tool_rounds (default 15) reached: increase if needed for complex
queries, but be aware of token cost implications
- Sleep monitoring loop: max_sleeps_per_query (default 60) may be
exhausted. Ctrl+C interrupts immediately
- Check: 'logs search llm "round limit"' for limit violations

Background tasks not running:

- 'task list' shows scheduled tasks and their status
- 'task stop <id>' to cancel a misbehaving task
- Only read-only commands can be scheduled as background tasks

High API costs:

- Enable prompt caching for Anthropic provider (cache_ttl setting)
- Reduce max_tool_rounds to limit reasoning cycles
- Review max_sleeps_per_query for monitoring loops
- Check per-user rate limits (default 10 queries/minute)

Relationships

Cross-subsystem interactions:

  • Admin CLI: Single source of truth for all tools. The AI calls the same
command handlers available via MCP and the bastion shell.
  • MCP: Shares system instructions, tool definitions, and response
formatting. Both interfaces use the same execution path.
  • Bastion shell: Hosts the “ai” command and interactive AI mode. Manages
conversation history, approval prompts, and background task lifecycle.
  • Cluster knowledge: Memory entries (insights and rules) stored in
cluster-wide distributed storage with configurable TTL.
  • Admin CLI commands: diagnose, health, proxy, sessions, certs, vpn, dns,
directory, config, and 30+ more — all available as AI tools.

Notification Service

Unified notification routing to email and webhooks with template-driven payloads

Overview

The notify module handles all outbound notification delivery including single event notifications, digest notifications, and health checks for configured endpoints. It routes messages to email and/or webhooks simultaneously with template-driven payload formatting.

Core capabilities:

  • Multi-channel routing: email, Slack, Teams, Discord, PagerDuty, custom webhooks
  • Single event notifications (Send) with subject, body, and severity
  • Digest notifications (SendDigest) batching multiple results into one message
  • Five builtin webhook payload formats: generic, slack, teams, discord, pagerduty
  • Custom Go text/template payloads with json, severityColor, severityEmoji helpers
  • Partial success model: Success=true if at least one endpoint delivers
  • Branded HTML email templates rendered via the render module
  • Plain text fallback when render module is unavailable
  • Targeted routing: empty Webhook sends to all, “email” for email only,
or a specific webhook name for single-target delivery
  • Health checking for all configured notification endpoints

Routing logic for the Webhook field:

- "" (empty): broadcast to all enabled channels (email + all webhooks)
- "email": send to email channel only (requires To field)
- "<name>": send to the named webhook only (e.g., "slack-ops")
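
That routing table can be sketched as a small dispatch function. Treating an unknown webhook name as no delivery is an assumption of this sketch, as are the function and parameter names.

```python
def route(webhook_field, webhook_names, email_enabled=True):
    """Return the channels a notification fans out to, per the rules above."""
    if webhook_field == "":                       # broadcast to everything
        return (["email"] if email_enabled else []) + list(webhook_names)
    if webhook_field == "email":                  # email only (requires To)
        return ["email"]
    if webhook_field in webhook_names:            # single named webhook
        return [webhook_field]
    return []                                     # unknown name (assumption)
```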

Email delivery chain:

1. Notify module requests email rendering with template + data
2. Render module loads the appropriate notification and digest templates
3. Rendered HTML and plain text forwarded to SMTP module as multipart
4. Fallback: if render unavailable, plain text auto-wrapped in <pre> tags

Webhook payload formats:

- generic: flat JSON with subject, body, severity, username, hostname, timestamp
- slack: Block Kit with header, severity emoji, code block body, context footer
- teams: Adaptive Card v1.4 with TextBlock, FactSet, monospace body
- discord: Embed with severity color mapping, code block body, footer
- pagerduty: Events API v2 with routing_key from Metadata, severity mapping
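
A sketch of the generic format: flat JSON with exactly the fields listed above. The timestamp encoding used here (ISO 8601 via isoformat) is an assumption.

```python
import json
from datetime import datetime, timezone

def generic_payload(subject, body, severity, username, hostname):
    """Build the 'generic' flat-JSON webhook body described above."""
    return json.dumps({
        "subject": subject,
        "body": body,
        "severity": severity,
        "username": username,
        "hostname": hostname,
        "timestamp": datetime.now(timezone.utc).isoformat(),  # assumed format
    })
```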

Config

Notification configuration under [notify] section:

[notify]

digest_window = "5m" # Window for batching digest items

[notify.email]

enabled = true # Enable email notifications (uses [smtp] config)

[[notify.webhooks]]

name = "slack-ops" # Webhook name (used for targeted routing)
url = "https://hooks.slack.com/services/T00/B00/XXX" # Webhook endpoint URL
format = "slack" # Payload format: generic, slack, teams, discord, pagerduty
timeout = "10s" # Request timeout (default: 10s)

[[notify.webhooks]]

name = "teams-infra"
url = "https://outlook.office.com/webhook/XXX"
format = "teams"
timeout = "15s"

[[notify.webhooks]]

name = "pagerduty-critical"
url = "https://events.pagerduty.com/v2/enqueue"
format = "pagerduty"

[[notify.webhooks]]

name = "custom-endpoint"
url = "https://api.internal/alerts"
format = "generic" # Base format (overridden by body_template)
content_type = "application/json" # Custom content type
body_template = '{"alert": "{{json .Subject}}", "detail": "{{json .Body}}"}'
[notify.webhooks.headers]
Authorization = "Bearer token123" # Custom headers sent with every request

Template functions available in custom body_template:

{{json .Field}} - JSON-escape a string (quotes, backslashes, control chars)
{{severityColor .Sev}} - Map severity to Discord embed color (int)
{{severityEmoji .Sev}} - Map severity to Slack emoji string
{{severityPD .Sev}} - Map severity to PagerDuty severity level

Template variables (TemplateData fields):

.Subject, .Body, .Severity, .Username, .Hostname, .Timestamp,
.Metadata (map[string]string), .Items (digest), .ItemCount (digest)
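As a sketch of how a custom body_template might be executed with Go's text/template (which the troubleshooting section names as the template format): the jsonEscape helper below is a plausible implementation of the documented {{json}} function, not the module's actual code.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"text/template"
)

// jsonEscape is a plausible implementation of the documented {{json}} helper:
// it escapes quotes, backslashes, and control characters, without adding
// surrounding quotes (the template supplies those).
func jsonEscape(s string) string {
	b, _ := json.Marshal(s)
	return string(b[1 : len(b)-1]) // strip the quotes json.Marshal adds
}

// renderBody parses and executes a body_template with the json helper bound.
func renderBody(tmplText string, data any) (string, error) {
	tmpl, err := template.New("body").
		Funcs(template.FuncMap{"json": jsonEscape}).
		Parse(tmplText)
	if err != nil {
		return "", err // parse errors surface at config load time
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return "", err // execution errors prevent delivery (fail-safe)
	}
	return buf.String(), nil
}

func main() {
	out, err := renderBody(
		`{"alert": "{{json .Subject}}", "detail": "{{json .Body}}"}`,
		struct{ Subject, Body string }{`He said "hi"`, "line1\nline2"},
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Note how the quotes and newline in the data are escaped so the output remains valid JSON; interpolating .Subject without {{json}} would break the payload.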

Email template variables (passed to render module):

Subject, Body, Severity, SeverityLabel, Username, Hostname,
Timestamp, Disclaimer, Items (digest), ItemCount (digest)

The HTMLBody field on SendRequest bypasses template rendering entirely, allowing callers to provide custom HTML content. Email requires the To field to be set (recipient address).

Hot-reloadable: webhook URLs, formats, headers, timeouts, email enabled. Cold (restart required): none (all notify config is hot-reloadable).

Troubleshooting

Common symptoms and diagnostic steps:

Webhook not receiving notifications:

- Run 'notify health' to check connectivity to all endpoints
- Verify webhook URL is correct and accessible from the Hexon server
- Check webhook name matches exactly (case-sensitive) when using targeted routing
- Verify format is one of: generic, slack, teams, discord, pagerduty
- Check timeout setting (default 10s) is sufficient for the endpoint
- For custom endpoints, verify content_type and body_template are valid
- Check custom headers (Authorization, API keys) are correct

Email notifications not being delivered:

- Verify [notify.email] enabled = true
- Check SMTP module health: 'smtp health'
- Verify To field is set on the SendRequest
- Check render module is available for template rendering
- If render unavailable, plain text fallback should still work
- Check spam folder - configure SPF/DKIM/DMARC for production

Partial success (some channels fail, others succeed):

- This is expected behavior with the partial success model
- Check resp.Failed[] for endpoints that failed and resp.Sent[] for successes
- resp.Error contains a summary of failures
- Individual endpoint failures do not block other deliveries
- Success=true means at least one endpoint delivered successfully
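A hypothetical sketch of consuming the partial-success fields; the SendResponse shape is assumed from the field names above (Sent, Failed, Success, Error):

```go
package main

import "fmt"

// SendResponse sketches the documented partial-success result shape.
type SendResponse struct {
	Sent    []string // endpoints that delivered successfully
	Failed  []string // endpoints that failed
	Success bool     // true when at least one endpoint delivered
	Error   string   // summary of failures, if any
}

// summarize applies the documented rule: individual failures do not block
// other deliveries, and Success reflects "at least one delivered".
func summarize(sent, failed []string) SendResponse {
	resp := SendResponse{Sent: sent, Failed: failed, Success: len(sent) > 0}
	if len(failed) > 0 {
		resp.Error = fmt.Sprintf("%d endpoint(s) failed: %v", len(failed), failed)
	}
	return resp
}

func main() {
	r := summarize([]string{"email", "slack-ops"}, []string{"teams-infra"})
	fmt.Println(r.Success, r.Error)
}
```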

Digest notifications empty or missing items:

- Verify Items array is populated in SendDigestRequest
- Each DigestItem needs TaskID and Description at minimum
- Check digest_window setting if items seem to be batched incorrectly

Custom template errors:

- Templates are parsed and validated at config load time
- Template execution errors prevent notification delivery
- Use {{json .Field}} for all string interpolation to prevent JSON injection
- Check template syntax matches Go text/template format
- Verify all referenced fields exist in TemplateData

Slack/Teams formatting issues:

- Slack format uses Block Kit (verify workspace supports it)
- Teams format uses Adaptive Card v1.4 (verify connector version)
- Discord embeds have 4096 character limit for description
- PagerDuty requires routing_key in Metadata map

Test notifications:

- Use 'notify test <webhook-name>' to test a specific webhook
- Use 'notify test email <address>' to test email delivery
- Use 'notify list' to see all configured endpoints

Security

Security considerations for notification delivery:

Webhook URL protection:

Webhook URLs are marked as sensitive in configuration and are not exposed
in config dumps or diagnostic output. Use HTTPS URLs for production
deployments; HTTP is allowed only for internal or testing endpoints.

Authentication headers:

Webhook headers (Authorization, API keys) are stored in configuration.
Headers are sent with every webhook request to the endpoint. Consider
using environment variable references for secrets in production.

Custom template safety:

Templates are parsed and validated at config load time to catch syntax
errors early. Always use {{json .Field}} for string interpolation in
custom templates to prevent JSON injection attacks. Template execution
errors are returned and the notification is not sent (fail-safe).

HTML content handling:

Email body text is HTML-escaped when auto-converting plain text to HTML.
The HTMLBody field content is sent as-is without sanitization - callers
are responsible for ensuring HTML content is safe.
The render module handles template escaping for branded email templates.
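The plain-text fallback described above can be sketched with Go's standard html package (illustrative, not the module's code):

```go
package main

import (
	"fmt"
	"html"
)

// wrapPlainText sketches the documented fallback: when no rendered template
// is available, plain text is HTML-escaped and wrapped in <pre> tags so it
// displays safely and preserves formatting.
func wrapPlainText(body string) string {
	return "<pre>" + html.EscapeString(body) + "</pre>"
}

func main() {
	fmt.Println(wrapPlainText(`alert: <script>x()</script> & more`))
}
```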

Credential exposure prevention:

SMTP credentials are handled by the smtp module (never exposed by notify).
Webhook URLs with embedded tokens (Slack, Teams) are redacted in logs.
Error messages from failed webhook deliveries do not include full URLs.

Relationships

Module dependencies and interactions:

  • smtp: Email delivery backend. Notify sends emails through the SMTP module for
all email notifications. SMTP configuration ([smtp] section) determines
the mail server, credentials, and encryption mode.
  • UI templates: Email template rendering. Notify produces branded HTML
email templates via the render module.
If template rendering is unavailable, the plain-text fallback is used.
  • Admin CLI: Exposes notify CLI commands (list, health, test) for management
and diagnostics.
  • mcp: Notify operations available as MCP tools for AI-assisted operations.
LLM and bastion AI assistant can send notifications through MCP tools.
  • config: All notification settings are hot-reloadable. Webhook URLs, formats,
headers, timeouts, and email enabled flag can be changed without restart.
  • telemetry: Structured logging for notification delivery with endpoint name,
delivery status, and latency. Metrics for send counts and failures.
  • Rate limiting: Callers should apply rate limiting in HTTP handlers
to prevent notification flooding. Notify module itself does not rate limit.
  • Various callers: Any module can send notifications cluster-wide. Common
callers include authentication modules (login anomalies), certificate
management (renewal notifications), and the bastion AI assistant.

SMTP Email Delivery

Email delivery via SMTP with SSL/STARTTLS encryption, templated emails, localization, and attachments

Overview

The SMTP module handles all outbound email operations for HexonGateway. It is invoked by other modules that need to send email (OTP authentication, certificate renewal notifications, passkey expiration reminders, and generic email delivery).

Core capabilities:

  • Generic email sending with HTML and plain text multipart content
  • OTP (One-Time Password) emails for authentication flows
  • Certificate renewal notification emails with cert and CA bundle attached
  • Passkey expiration reminder emails with re-enrollment link
  • Health checks for SMTP server connectivity verification
  • Multi-part email composition with file attachments
  • Three encryption modes: SSL (port 465), STARTTLS (port 587), plain (port 25)
  • Multi-language email localization (en, es, fr, zh, ca)
  • Template rendering with branded HTML email templates
  • User language preference lookup for automatic localization
  • RFC 5321 compliant address validation (local ≤ 64, domain ≤ 255 chars)
  • RFC 5322 compliant headers (Date, Message-ID on every email)
  • RFC 8255 Content-Language header on templated emails
  • Message-ID in structured logs (success + failure) for MTA correlation

Localization priority for templated emails:

1. Language field explicitly set in the request
2. User preference from stored preferences
3. Default fallback to "en" (English)

Supported languages: en (English), es (Spanish), fr (French), zh (Chinese), ca (Catalan)
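The three-step priority above can be sketched as a fallback chain; the supported-set check is an assumption about how unsupported codes are skipped:

```go
package main

import "fmt"

// pickLanguage applies the documented priority: explicit request field,
// then stored user preference, then the "en" default.
func pickLanguage(requested, userPref string) string {
	supported := map[string]bool{"en": true, "es": true, "fr": true, "zh": true, "ca": true}
	for _, lang := range []string{requested, userPref} {
		if supported[lang] {
			return lang
		}
	}
	return "en" // default fallback
}

func main() {
	fmt.Println(pickLanguage("", "fr")) // user preference wins when request is empty
	fmt.Println(pickLanguage("de", "")) // unsupported code falls back to English
}
```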

Config

SMTP configuration under [smtp] section:

[smtp]

host = "smtp.gmail.com" # SMTP server hostname (required)
port = 587 # SMTP server port (required)
encryption = "starttls" # Encryption mode: "ssl", "starttls", or "none"
user = "noreply@example.com" # SMTP authentication username
password = "app-specific-password" # SMTP authentication password (sensitive)
from = "noreply@example.com" # Sender email address (From header)
reply_to = "support@example.com" # Reply-To header address (optional)
name = "HexonAuth" # Sender display name (optional)
skip_tls = false # Skip TLS certificate verification (default: false)

skip_tls: Disables server certificate validation for SSL and STARTTLS modes.

Logs a WARN on every send when enabled. Use only when the SMTP server presents
an untrusted or hostname-mismatched certificate. NOT recommended for production.

Encryption modes:

ssl (port 465): Direct TLS connection from the start.
starttls (port 587): Plain connection upgraded to TLS. Recommended.
none (port 25): Unencrypted. Not recommended for production.
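The mode-to-port conventions above can be expressed as a small lookup; this is a sketch of the documented conventions, not the module's connection code:

```go
package main

import "fmt"

// modeDefaults returns the conventional port for each documented encryption
// mode and whether the connection is TLS from the first byte.
func modeDefaults(mode string) (port int, tlsFromStart bool, err error) {
	switch mode {
	case "ssl":
		return 465, true, nil // direct TLS connection from the start
	case "starttls":
		return 587, false, nil // plain connection upgraded via STARTTLS
	case "none":
		return 25, false, nil // unencrypted; not recommended for production
	default:
		return 0, false, fmt.Errorf("unknown encryption mode %q", mode)
	}
}

func main() {
	p, tls, _ := modeDefaults("starttls")
	fmt.Println(p, tls)
}
```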

Common SMTP provider configurations:

Gmail: host = "smtp.gmail.com", port = 587, encryption = "starttls"
(requires App Passwords with 2FA enabled)
SendGrid: host = "smtp.sendgrid.net", port = 587, encryption = "starttls"
user = "apikey", password = "<sendgrid-api-key>"
AWS SES: host = "email-smtp.<region>.amazonaws.com", port = 587
Mailgun: host = "smtp.mailgun.org", port = 587, encryption = "starttls"

Hot-reloadable: all SMTP settings (host, port, encryption, credentials). Cold (restart required): none.

Troubleshooting

Common symptoms and diagnostic steps:

SMTP connection failures:

- Check SMTP health: 'smtp health' tests connectivity and authentication
- Verify host and port match encryption mode (SSL=465, STARTTLS=587)
- Firewall blocking outbound: verify server can reach SMTP host:port
- Network probe: 'net tcp <smtp-host>:<port> --tls' for SSL
- DNS resolution: 'dns test <smtp-host>' to verify hostname resolves

Authentication failures:

- Gmail: requires App Passwords (regular password won't work with 2FA)
- SendGrid: user must be literal string "apikey", password is the API key
- AWS SES: IAM credentials, not root account credentials
- Check: 'config show smtp' to verify configuration (password redacted)

Emails not being received:

- Check spam/junk folder at recipient mail provider
- Verify from address matches authenticated user or authorized alias
- Configure SPF, DKIM, and DMARC DNS records for sending domain
- Test delivery: 'smtp test <to-address>' sends a test message
- Check: 'notify health' for notification system status

OTP emails delayed or missing:

- SMTP latency: 200-1000ms is normal, check 'smtp health'
- OTP code expired before email arrived: check OTP validity window
- Rate limiting by SMTP provider: check provider dashboard

Passkey expiration emails not sent:

- Expired passkeys intentionally do not receive reminder emails
- Verify DaysRemaining is positive (zero or negative triggers no email)

TLS certificate verification failures:

- "STARTTLS failed: tls: failed to verify certificate" or "TLS dial failed"
- Common cause: SMTP relay hostname differs from certificate CN/SAN
(e.g., smtp.company.com forwards to smtp.gmail.com)
- Temporary fix: set skip_tls = true in [smtp] config (logs WARN per send)
- Proper fix: configure the SMTP server with a valid certificate matching its hostname
- Check: 'net tls <smtp-host>:<port>' to inspect the certificate chain

Template rendering errors:

- Missing locale: unsupported language falls back to English
- Template rendering failures prevent email send (no fallback)

Address validation failures (RFC 5321):

- "local part exceeds 64 character limit": email local part too long
- "domain part exceeds 255 character limit": email domain too long
- These limits are per RFC 5321 §4.5.3.1

Correlating email delivery with MTA logs:

- Every send logs Message-ID (success and failure)
- Use Message-ID to trace through relay MTAs, bounce reports, DMARC feedback
- Search gateway logs: 'logs search message_id=<value>'

Relationships

Module dependencies and interactions:

  • OTP authentication: Delivers one-time passwords for email-based auth.
  • Certificate management: Sends renewal notification emails with cert
and CA bundle attachments.
  • WebAuthn/Passkey: Sends passkey expiration reminder emails.
  • Notification service: Uses SMTP for email channel delivery alongside
webhooks for multi-channel routing.
  • Directory: User full name lookup for personalized email greetings.
  • Localization: Localized email text loaded from locale files.
  • Configuration: Reads [smtp] TOML section. All settings hot-reloadable.
  • Admin CLI: 'smtp health' and 'smtp test' commands for diagnostics.

Telemetry & Logging

Structured logging with OTLP export, per-module log levels, audit class, ring buffer queries, and trace correlation

Overview

The telemetry module provides structured logging with key-value pairs, multiple output targets, and cross-module trace correlation for cluster-wide observability.

Core capabilities:

  • Structured logging with key-value pairs and fluent builder API
  • Six log levels: TRACE, DEBUG, INFO, WARN, ERROR, FATAL
  • AUDIT log class: bypasses level filtering for security events
  • Per-module log level configuration (override global level per module)
  • OTLP gRPC log export to OpenTelemetry-compatible collectors
  • Trace ID correlation across modules (128-bit hex IDs per request)
  • In-memory ring buffer for admin CLI log queries
  • JSON and human-readable output formats
  • Security context builder for auth-related log entries

Output modes:

stdout: Structured logs written to standard output (default)
otlp: Logs exported via gRPC to an OpenTelemetry collector
both: Simultaneous stdout and OTLP export

OTLP export includes:

- timestamp, severity, body (message), module attribute
- service.name, service.version, environment, host.name, host.ip
- Native OTLP TraceId field for trace-to-log correlation
- Batched async export via SDK log processor

Ring buffer:

Configurable in-memory buffer (default 10,000 entries) for admin CLI log
queries ('logs tail', 'logs search'). Provides instant access to recent
logs without external log aggregation. Set to 0 to disable.

Config

Configuration under [telemetry] section:

[telemetry]

log_level = "info" # Global: trace|debug|info|warn|error|fatal
log_format = "json" # Output format: "json" or "human"
output = "stdout" # Output target: "stdout", "otlp", or "both"
otlp_endpoint = "otel-collector:4317" # Required when output is "otlp" or "both"
log_buffer_size = 10000 # Ring buffer entries for log queries (0 = disabled)
audit = true # Audit class: always display security events regardless of log_level

[telemetry.module_levels]

oidc = "debug" # Per-module override (module name = level)
webauthn = "info"
ikev2 = "trace"

OTLP endpoint format:

"host:port" - Plain gRPC connection
"http://host:port" - Insecure gRPC (http:// stripped, WithInsecure applied)
"https://host:port" - TLS gRPC connection

Compatible collectors: Grafana Alloy, OpenTelemetry Collector, Datadog Agent, Splunk OTel Collector, any OTLP/gRPC compatible receiver.

If the OTLP endpoint is unreachable at startup, the system falls back to stdout and logs a warning. gRPC connections are lazy (connect on first export).

Audit class:

When audit = true (default), log entries marked with AsAudit() bypass level
filtering. Security events (SFTP ops, SSH connections, admin commands, TLS
protection) are always visible even when log_level is set to "error".
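The bypass rule can be sketched as a filter predicate; the integer level encoding here is an assumption (TRACE lowest, FATAL highest):

```go
package main

import "fmt"

// shouldEmit applies the documented filtering rule: entries marked as audit
// bypass the level threshold whenever the audit class is enabled; all other
// entries must meet the configured level.
func shouldEmit(entryLevel, threshold int, isAudit, auditEnabled bool) bool {
	if isAudit && auditEnabled {
		return true // security events are always visible
	}
	return entryLevel >= threshold
}

func main() {
	const (
		levelInfo  = 2
		levelError = 4
	)
	fmt.Println(shouldEmit(levelInfo, levelError, false, true)) // filtered out
	fmt.Println(shouldEmit(levelInfo, levelError, true, true))  // audit bypass
}
```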

Hot-reloadable: log_level, module_levels, log_format. Cold (restart required): output, otlp_endpoint, log_buffer_size, audit.

Troubleshooting

Common symptoms and diagnostic steps:

Logs not appearing in OTLP collector:

- Verify output is set to "otlp" or "both" in [telemetry]
- Check otlp_endpoint format (host:port, no trailing slash)
- Network connectivity: 'net tcp <collector-host>:<port>'
- Collector may reject due to resource limits or auth requirements
- Startup fallback: if endpoint was unreachable at startup, logs go to stdout
- Check: 'logs tail' to verify logs are being generated locally

Per-module log level not working:

- Verify [telemetry.module_levels] has exact module name (case-sensitive)
- Module names use dot notation: "oidc", "ikev2.dns", "identity.scim"
- A per-module level takes effect only when it is more verbose (lower severity) than the global level
- Check: 'config show telemetry' to verify active configuration

Ring buffer queries returning no results:

- Verify log_buffer_size > 0 (0 disables the ring buffer)
- Buffer is in-memory only; cleared on restart
- 'logs tail' shows most recent entries
- 'logs search <keyword>' filters by content
- Buffer wraps around: oldest entries are overwritten when full

Log format issues:

- "json": structured key-value JSON (recommended for log aggregation)
- "human": colored, readable format (recommended for development)
- Trace IDs: full 128-bit hex in JSON, truncated 8-char in human format

High log volume impacting performance:

- Raise global log_level to "warn" or "error"
- Use per-module levels to keep verbose logging only where needed
- OTLP batched export is async and does not block request processing
- Ring buffer size: reduce log_buffer_size if memory is a concern

Relationships

Module dependencies and interactions:

  • All modules: Every module in the system uses telemetry for structured
logging with trace correlation.
  • Admin CLI: 'logs tail', 'logs search', 'logs stats', 'logs anomalies',
'logs patterns' commands query the ring buffer.
  • Configuration: Reads [telemetry] section. Log level and format are
hot-reloadable without restart.
  • Cluster: Each node maintains its own ring buffer. Admin CLI log queries
fan out to all nodes and merge results.