DNS Resolution
Custom DNS resolvers with DNSSEC validation, distributed caching, health checks, and adaptive selection
Overview
The DNS module provides secure DNS resolution services for all Hexon components including proxy, bastion, cluster discovery, and VPN.
Capabilities:
- Custom DNS resolvers with automatic failover and circuit breaker pattern
- DNSSEC validation in two modes: resolver-trust (fast) and full cryptographic (secure)
- Distributed DNS caching via the memory storage module (local reads, broadcast writes)
- Lookup coalescing — concurrent requests for the same hostname share a single upstream lookup (avoids duplicate queries and narrows the window for cache poisoning)
- Hostname validation to block DNS injection attacks (null bytes, CRLF)
- IPv4 preference when both A and AAAA records are available
- CNAME flattening with configurable depth limit (default 16, per RFC 1034)
- DNS-over-TLS (DoT) support for encrypted transport (RFC 7858)
- Adaptive resolver selection using epsilon-greedy algorithm (20-40% lower latency)
- Health checking with exponential backoff and automatic system DNS fallback
- Typed DNS queries for 30+ record types (A, AAAA, CAA, TLSA, SRV, MX, etc.)
- Context propagation for request cancellation and graceful shutdown
- TTL sanitization to prevent integer overflow attacks (capped at 1 week)
Operations:
- Resolve: DNS resolution with optional DNSSEC, caching, and resolver selection
- ValidateHostname: RFC-compliant hostname validation against injection attacks
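A minimal sketch of the hostname checks listed above (null bytes, CRLF injection, RFC 1035 length limits). The function name and exact rules are illustrative — the real ValidateHostname operation likely enforces more (label character set, IDN handling):

```go
package main

import (
	"fmt"
	"strings"
)

// validateHostname is a minimal sketch of the documented checks: null
// bytes, CRLF injection, and RFC 1035 length limits (253 chars total,
// 63 per label). Illustrative only — not the module's actual code.
func validateHostname(host string) error {
	if host == "" || len(host) > 253 {
		return fmt.Errorf("hostname length out of range")
	}
	if strings.ContainsAny(host, "\x00\r\n") {
		return fmt.Errorf("hostname contains control characters")
	}
	for _, label := range strings.Split(strings.TrimSuffix(host, "."), ".") {
		if len(label) == 0 || len(label) > 63 {
			return fmt.Errorf("label length out of range: %q", label)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateHostname("example.com"))             // <nil>
	fmt.Println(validateHostname("evil.com\r\nX-Injected:")) // rejected
}
```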
Config
Core configuration under [dns]:
[dns]
timeout = 5                       # DNS query timeout in seconds (default: 5)
cache_ttl = 300                   # Default cache TTL in seconds (default: 300)
cache_override = false            # Ignore DNS server TTL, always use cache_ttl (default: false)
resolvers = ["1.1.1.1:53", "8.8.8.8:53", "9.9.9.9:53"]  # DNS resolvers (default: cluster.cluster_dns_resolvers)
flatten_cname = true              # Follow CNAMEs to final A/AAAA records (default: true)
max_cname_depth = 16              # Max CNAME chain depth to prevent loops (default: 16)

DNSSEC settings:
dnssec_full_validation = false    # Full cryptographic RRSIG/DNSKEY validation (default: false)
dnssec_strict = false             # Fail if zone is not DNSSEC-signed (default: false)

DNS-over-TLS (DoT):
dot_enabled = false               # Enable DNS-over-TLS transport (default: false)
dot_port = 853                    # DoT port per RFC 7858 (default: 853)
dot_verify_server_cert = true     # Verify resolver TLS certificate (default: true)

Health checking:
health_check_enabled = true       # Enable resolver health monitoring (default: true)
health_check_interval = 30        # Health check interval in seconds (default: 30)
health_failure_threshold = 2      # Consecutive failures before marking unhealthy (default: 2)
health_check_query = "google.com" # Domain used for health check probes (default: "google.com")

Adaptive resolver selection (epsilon-greedy ML):
adaptive_selector_enabled = true      # Enable adaptive resolver selection (default: true)
adaptive_exploration_rate = 0.10      # Exploration rate 0.0-1.0 (default: 0.10 = 10%)
adaptive_smoothing_factor = 0.3       # EMA smoothing factor for latency tracking (default: 0.3)
adaptive_min_sample_size = 100        # Queries before switching from learning to intelligent mode (default: 100)
adaptive_load_balance_enabled = true  # Penalize recently-used resolvers to spread load (default: true)

Resolver architecture — three separate resolver pools:
dns.resolvers                  # Infrastructure resolvers (health-checked, used by all modules)
cluster.cluster_dns_resolvers  # Cluster discovery resolvers (fallback if dns.resolvers unset)
proxy.dns.resolvers            # Proxy-specific override (must be subset of dns.resolvers)

Per-route proxy DNS overrides in [[proxy.mapping]]:
dnssec = true                   # Override global DNSSEC setting for this route
dns_resolvers = ["10.0.0.1:53"] # Override resolvers for this route (must be in dns.resolvers)

TTL precedence (cache_override=false): DNS server TTL > dns.cache_ttl > 300s default.
TTL precedence (cache_override=true): dns.cache_ttl > 300s default.
TTL bounds: minimum 1 second, maximum 604800 seconds (1 week).
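The TTL bounds above can be sketched as a simple clamp (illustrative helper — the function name is ours, not the module's API):

```go
package main

import "fmt"

// sanitizeTTL applies the documented TTL rules: zero falls back to the
// 300s default, and non-zero values are clamped to [1s, 604800s].
func sanitizeTTL(ttlSeconds int64) int64 {
	const (
		defaultTTL int64 = 300
		minTTL     int64 = 1
		maxTTL     int64 = 604800 // 1 week
	)
	switch {
	case ttlSeconds == 0:
		return defaultTTL
	case ttlSeconds < minTTL:
		return minTTL
	case ttlSeconds > maxTTL:
		return maxTTL
	}
	return ttlSeconds
}

func main() {
	fmt.Println(sanitizeTTL(0), sanitizeTTL(-5), sanitizeTTL(99999999), sanitizeTTL(60))
	// 300 1 604800 60
}
```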
Cache key format: "dns_cache:{hostname}:{resolver_hash}" (128-bit SHA-256 hash). Cache reads are local (no network). Cache writes broadcast to the cluster (fire-and-forget).
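A sketch of the key derivation. We assume here that the 128-bit resolver hash is the first 16 bytes of a SHA-256 over the sorted resolver list — the real hash input and truncation may differ:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// cacheKey sketches the documented "dns_cache:{hostname}:{resolver_hash}"
// format. Assumption: the hash covers the sorted resolver list and is
// truncated to 128 bits (32 hex chars).
func cacheKey(hostname string, resolvers []string) string {
	sorted := append([]string(nil), resolvers...)
	sort.Strings(sorted) // order-independent: same resolvers, same key
	sum := sha256.Sum256([]byte(strings.Join(sorted, ",")))
	return fmt.Sprintf("dns_cache:%s:%s", hostname, hex.EncodeToString(sum[:16]))
}

func main() {
	fmt.Println(cacheKey("example.com", []string{"8.8.8.8:53", "1.1.1.1:53"}))
}
```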
Hot-reloadable: resolvers, DNSSEC settings, cache TTL, health check parameters, adaptive settings. Cold (restart required): dot_enabled, dot_port.
Troubleshooting
Common symptoms and diagnostic steps:
DNS resolution failures:
- Check resolver health: 'dns resolvers' shows status, latency, and failure counts
- Test specific hostname: 'dns test <hostname>' performs live resolution
- All resolvers unhealthy: module falls back to system DNS (/etc/resolv.conf)
- Resolver filtered out: proxy resolvers must be a subset of dns.resolvers
- Cross-subsystem check: 'diagnose domain <hostname>' tests DNS + proxy + TLS together

DNSSEC validation errors:
- Zone not signed: set dnssec_strict=false to allow unsigned zones (default)
- Resolver-trust mode: a compromised resolver can fake the AD bit — use dnssec_full_validation=true
- Full validation slow: first query ~200ms (chain of trust), cached queries ~50ms
- Clock skew: DNSSEC signatures have validity windows — ensure NTP is running
- Check validation: 'dns test <hostname> --dnssec' shows validation result and mode
- Strict mode blocking: dnssec_strict=true rejects all unsigned zones — check per-route override

Slow DNS resolution:
- Check cache hit rate: 'dns cache' shows hit/miss ratio and entry count
- High cache miss: increase cache_ttl or set cache_override=true for static backends
- Resolver latency: 'dns resolvers' shows per-resolver average latency (EMA)
- Adaptive selector: 'dns adaptive' shows resolver scores and selection distribution
- Learning phase: first 100 queries use round-robin — performance improves after
- CNAME chains: deep chains add latency per hop — check with 'dns test <hostname>'

All resolvers down (circuit breaker tripped):
- Health checker marks a resolver unhealthy after 2 consecutive failures (configurable)
- Backoff schedule: 30s, 1m, 2m, 4m, 8m, 15m (max)
- System DNS fallback activates automatically when all custom resolvers fail
- Recovery is automatic — the resolver returns to the pool when a health check succeeds
- Force re-check: 'dns health --reset' clears backoff timers
- Check: 'dns resolvers' shows healthy/unhealthy status and next retry time

Cache poisoning concerns:
- Lookup coalescing: concurrent requests for the same hostname share a single lookup result
- Per-hostname locking prevents race conditions (no global bottleneck)
- Enable DNSSEC (dnssec_full_validation=true) for cryptographic validation
- Use DoT (dot_enabled=true) to encrypt DNS transport against snooping

CNAME resolution issues:
- CNAME not followed: check flatten_cname=true (default)
- CNAME loop detected: max_cname_depth exceeded (default 16) — check DNS zone config
- CNAME + ACL: ACL checks use the original hostname, not the CNAME target (prevents bypass)
- Metrics: dns.cname_resolutions_total tracks success and depth_exceeded counts

DoT connection failures:
- Port blocked: DoT uses port 853 (RFC 7858) — verify firewall rules
- Certificate error: set dot_verify_server_cert=false to diagnose (re-enable after)
- Non-standard port: module warns if dot_port is not 853

502/503 from proxy due to DNS:
- DNSSEC failure blocks the connection (no system DNS fallback, for security)
- DNS infrastructure failure falls back to system DNS (for availability)
- Fix: set dnssec=false on specific proxy routes for unsigned internal zones
- Verify: 'dns test <backend-hostname> --dnssec' to check DNSSEC status

Interpreting tool output:
'dns health':
- Healthy: Status=healthy, healthy resolvers = total resolvers
- Degraded: healthy < total — some resolvers failing, but DNS still works
- Down: healthy=0 — all resolvers failed, system DNS fallback active
- Action: Degraded/Down → 'dns resolvers' for per-resolver breakdown

'dns resolvers':
- Healthy: Status=healthy, Latency < 50ms, Score > 100
- Degraded: Status=unhealthy with BackoffUntil timestamp — resolver in circuit breaker
- Learning: Score near 100 with low QueryCount — adaptive selector still calibrating (normal)
- Action: all unhealthy → check network connectivity to resolver IPs, verify port 53/853 open

'dns test <hostname>':
- Success: IPs returned, TTL shown, DNSSEC=valid (if enabled)
- DNSSEC failure: DNSSEC=invalid — zone is unsigned or signatures expired
- No results: hostname does not resolve — check DNS zone configuration
- Action: DNSSEC failure + proxy 502 → set dnssec=false on that proxy route

Architecture
Resolution flow:
- Resolve request arrives (from proxy, bastion, VPN, or discovery)
- Hostname validation: RFC compliance check, injection prevention (null bytes, CRLF, length)
- Cache lookup: local memory read for "dns_cache:{hostname}:{resolver_hash}"
- If cache hit: return cached IPs immediately (no network call)
- If cache miss: acquire per-hostname lock (coalescing for concurrent requests)
- Resolver selection: adaptive selector picks best resolver (or round-robin during learning)
- Health filter: only healthy resolvers considered (circuit breaker pattern)
- DNS query: send query via UDP (or DoT if enabled) with configured timeout
- DNSSEC validation (if enabled):
  a. Resolver-trust mode: check the AD bit in the response
  b. Full validation: verify RRSIG signatures, validate the DNSKEY chain to the root trust anchor
- CNAME handling: if CNAME response and flatten_cname=true, recursively resolve target
- IPv4 preference: sort results with A records before AAAA records
- TTL extraction: from DNS response (DNSSEC/custom resolver) or use configured default
- TTL sanitization: clamp to [1s, 604800s], zero defaults to 300s
- Cache store: broadcast write to cluster memory (fire-and-forget, best-effort)
- Release per-hostname lock, waiting callers receive same result
- Return ResolveResponse with IPs, TTL, cached flag, DNSSEC validity
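The resolver-selection step above uses the epsilon-greedy scoring formula detailed in the next subsection. A sketch of that formula and the EMA latency update (field names and units are ours; timeout_pct is assumed to be a 0.0-1.0 fraction):

```go
package main

import "fmt"

// resolverStats and score sketch the documented epsilon-greedy scoring
// formula. Illustrative types, not the module's real API.
type resolverStats struct {
	successRate  float64 // 0.0-1.0
	avgLatencyMs float64 // EMA-smoothed average latency
	timeoutPct   float64 // fraction of queries that timed out
	consecFails  float64 // consecutive failures
	recentlyUsed bool    // used within the last second
}

// score = 100 + success_rate*50 - avg_latency_ms/10 - timeout_pct*30
//         - consecutive_failures*20 - recently_used*10
func score(s resolverStats) float64 {
	sc := 100 + s.successRate*50 - s.avgLatencyMs/10 - s.timeoutPct*30 - s.consecFails*20
	if s.recentlyUsed {
		sc -= 10 // load-balancing penalty
	}
	return sc
}

// updateLatency applies the documented EMA with smoothing factor 0.3:
// new_avg = 0.3*sample + 0.7*old_avg
func updateLatency(oldAvg, sampleMs float64) float64 {
	return 0.3*sampleMs + 0.7*oldAvg
}

func main() {
	s := resolverStats{successRate: 1.0, avgLatencyMs: 20}
	fmt.Println(score(s))              // 100 + 50 - 2 = 148
	fmt.Println(updateLatency(20, 50)) // ≈ 29
}
```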
Adaptive resolver selection (epsilon-greedy):
Learning phase (first 100 queries): round-robin across all healthy resolvers
Intelligent phase: 90% exploitation (best score), 10% exploration (random)
Score = 100 + (success_rate * 50) - (avg_latency_ms / 10) - (timeout_pct * 30) - (consecutive_failures * 20) - (recently_used * 10)
Latency tracked via EMA: new_avg = 0.3 * sample + 0.7 * old_avg
Load balancing penalty: -10 points if resolver used within the last 1 second

Health checker circuit breaker:
Healthy: failure_count = 0, available for selection
Unhealthy: failure_count >= threshold (default 2), excluded from selection
Backoff: 30s -> 1m -> 2m -> 4m -> 8m -> 15m (max)
Recovery: a single successful health check returns the resolver to healthy state
System DNS fallback: automatic when ALL custom resolvers are unhealthy
Memory cleanup: resolver sync removes stale entries on config reload

DNSSEC full validation chain:
1. Query resolver with DO bit set
2. Extract RRSIG from response
3. Fetch DNSKEY for target zone (cached with TTL)
4. Verify RRSIG signature using DNSKEY (RSA/SHA-256, ECDSA P-256, Ed25519)
5. Fetch DS record from parent zone
6. Verify DNSKEY hash matches DS record
7. Recurse up to root zone
8. Validate root DNSKEY against hardcoded IANA trust anchor (KSK 20326)
9. Validate NSEC/NSEC3 for authenticated denial of existence

Distributed caching via memory module:
Read path: local-only (no network, no quorum)
Write path: broadcast to all cluster nodes (fire-and-forget)
Key format: "dns_cache:{hostname}:{sha256_hash_of_resolvers}" (collision-resistant)
Eviction: TTL-based (respects DNS TTL or configured override)
Coalescing: per-hostname mutex prevents concurrent duplicate lookups

Metrics emitted:
dns.resolve_total (tags: status, cached, dnssec)
dns.resolve_latency_ms (histogram)
dns.cache_hit_total / dns.cache_miss_total
dns.health_check_total (tags: resolver, status)
dns.adaptive_resolver_selected (tags: resolver, reason)
dns.resolver_score (gauge, tags: resolver)
dns.resolver_avg_latency_ms (gauge, tags: resolver)
dns.cname_resolutions_total (tags: status)

Relationships
Module dependencies and interactions:
- proxy: Backend hostname resolution for all proxy routes. Uses [dns] configuration by default. Per-route overrides via dnssec and dns_resolvers fields in [[proxy.mapping]]. DNSSEC validation failure blocks the connection (no system DNS fallback — prevents downgrade). DNS infrastructure failure falls back to system DNS (availability).
- bastion: SSH connection and port forwarding hostname resolution. Uses [dns] configuration directly (no bastion-specific overrides). DNSSEC protects against SSH destination poisoning.
- discovery: Cluster peer discovery via DNS SRV records. Uses [dns] configuration for resolver settings. Critical for cluster formation and membership.
- vpn (IKEv2): VPN client DNS resolution and split DNS support for tunneled queries. Uses [dns] configuration for upstream resolver selection.
- acme: ACME challenge validation uses typed DNS queries (CAA record checking per RFC 8659). SERVFAIL handling distinguishes "no records" from "DNS infrastructure error" for security.
- memory: Distributed DNS cache storage. Local reads (fast), broadcast writes (best-effort). No quorum required — cache is opportunistic, falls back to fresh lookup on miss.
- config: Reads [dns] and [cluster] TOML sections. Hot-reload updates resolvers, DNSSEC settings, cache parameters, health check configuration, and adaptive selection tuning. Resolver sync cleans up stale resolver state on reload (memory leak prevention).
- metrics (telemetry): Emits counters, histograms, and gauges for resolution, caching, health checks, and adaptive selection. Enables monitoring dashboards and alerting.

Kubernetes CRD Configuration
Kubernetes-native configuration via Custom Resource Definitions with bootstrap reconciliation, live watching, and status feedback
Overview
HexonGateway supports Kubernetes-native configuration through Custom Resource Definitions (CRDs). When running in Kubernetes, operators can manage gateway configuration using standard kubectl commands instead of (or alongside) TOML files.
The system defines 55 CRD types covering every configuration section:
- Service, cluster, telemetry, health, DNS, SMTP, filesystem, memory
- Proxy mappings, connection pools, TCP proxy, forward proxy, subrequest
- Authentication: OIDC clients, SAML service providers, auth flows, signup flows
- Identity: LDAP, OIDC providers, SCIM providers
- Protection: WAF config, WAF rules, firewall rules/aliases, rate limiting
- Infrastructure: VPN, bastion, SQL bastion, SSH certificates, port forwarding
- Certificates: ACME CA server, ACME client
- Operations: admin, MCP, LLM, playbooks, webhooks, SPIFFE, RADIUS
- Observability: log intelligence, notifications

CRDs are optional — the gateway runs identically on VMs, Docker, or Kubernetes using TOML configuration. CRDs provide a Kubernetes-native alternative that integrates with GitOps tools like ArgoCD and Flux.
All CRDs belong to the config.hexon.io API group with v1alpha1 version. Namespaced scope — instances live in the hexon-system namespace by default.
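Config names are mapped to Kubernetes resource names by sanitization (detailed under "Resource naming" below). A sketch of the documented rules — edge cases such as dots or leading/trailing dashes may be handled differently by the real codegen:

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeName sketches the documented mapping: lowercase, every
// non-alphanumeric character (spaces, underscores, slashes, ...)
// becomes a dash, capped at 253 chars. Illustrative only.
func sanitizeName(name string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(name) {
		if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') || r == '-' {
			b.WriteRune(r)
		} else {
			b.WriteByte('-')
		}
	}
	s := b.String()
	if len(s) > 253 {
		s = s[:253]
	}
	return s
}

func main() {
	fmt.Println(sanitizeName("Kubernetes / Production")) // kubernetes---production
}
```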
Config
CRD Installation:
kubectl apply -f https://registry.hexon.io/crds/latest/hexon-crds.yaml

Or a specific version:

kubectl apply -f https://registry.hexon.io/crds/0.9.1/hexon-crds.yaml

Individual CRDs can also be installed:

kubectl apply -f https://registry.hexon.io/crds/0.9.1/hexonproxies.yaml

CRD Lifecycle:
1. Bootstrap: On first start, the cluster leader creates CRD instances from the running config (TOML + env overrides + defaults merged). Each instance is labeled config.hexon.io/origin: bootstrap.
2. Pruning: Bootstrap-owned array CRDs no longer in config are automatically deleted, along with their companion Secrets. Operator-owned CRDs are never touched. This ensures TOML deletions propagate to Kubernetes.
3. Watching: Informers watch for CRD changes via the Kubernetes API. Changes are debounced (500ms window) to batch rapid edits.
4. Apply: CRD spec is converted to the internal config struct, validated, and applied atomically. Config reload callbacks fire for all modules.
5. Status: Each CRD instance gets status conditions reflecting apply success/failure.

Example — create a proxy mapping:
apiVersion: config.hexon.io/v1alpha1
kind: HexonProxy
metadata:
  name: dashboard
  namespace: hexon-system
spec:
  hostname: dashboard.example.com
  target: http://dashboard-service:3000
  auth_type: oidc

Sensitive fields (SecretKeyRef):
Sensitive config fields (certificates, private keys, passwords, API secrets, RADIUS shared secrets, OIDC client secrets) are never stored in CRD specs. Instead, they are stored in companion Kubernetes Secrets and referenced via SecretKeyRef entries in the CRD spec:

spec:
  apiKey:
    name: hexon-hexonproxies-dashboard  # Secret name
    key: apiKey                         # Key within the Secret

- Empty sensitive fields (e.g., no custom certificate) produce no Secret. The field stays empty and the gateway uses its default (e.g., wildcard cert).
- Non-empty fields are stored in a companion Secret named hexon-<plural>-<instance> (e.g., hexon-hexonproxies-dashboard).
- Operators can reference any Secret they create — not limited to the bootstrap naming convention.
- RBAC: The gateway pod needs get/list/create/update/delete on core Secrets.

Ownership model:
- Bootstrap-created CRDs and companion Secrets have label: config.hexon.io/origin: bootstrap
- Remove the label to "take ownership" — bootstrap will no longer overwrite
- Operator-created CRDs (no label) are never modified or deleted by bootstrap
- Bootstrap-owned array CRDs removed from config are pruned on next restart

Singleton vs Array CRDs:
- Singleton: one instance named "default" (e.g., HexonClusterConfig, HexonDNSConfig)
- Array: multiple instances, name derived from config (e.g., HexonProxy per mapping)

Resource naming:
- K8s resource names are sanitized from config names (lowercased, spaces/underscores to dashes, special chars to dashes, max 253 chars). Example: config app "Kubernetes / Production" becomes resource name "kubernetes---production".
- The original config name is preserved in the CRD spec (e.g., spec.app for proxies).
- The "crd show" command accepts either the K8s resource name or the original config name.
- Use "crd list <kind>" to see both resource names and config names side by side.

Status conditions:
Every CRD instance reports an "Applied" condition:

Applied=True reason=ConfigValid — config applied successfully
Applied=False reason=ApplyError — config apply failed
Applied=False reason=ConversionError — CRD-to-config conversion failed

Check status: kubectl get hexonproxies -o wide
The "Applied" printer column shows the current phase (Ready/Error).

Troubleshooting
CRD instances not being created on startup:
- Only the cluster leader runs bootstrap reconciliation
- Check logs for "bootstrap reconciliation complete" message
- Verify CRD definitions are installed: kubectl get crd | grep hexon

CRD changes not applying:
- Changes are debounced with a 500ms window — wait briefly
- Check status conditions: kubectl describe <crd-kind> <name>
- Look for Applied=False with reason and message
- Verify RBAC: the gateway pod needs get/list/watch/create/update/patch permissions on all config.hexon.io resources and their status subresources

Status shows Applied=False reason=ConversionError:
- The CRD spec doesn't match the expected config structure
- Check field names match TOML keys (snake_case in spec)
- Verify enum values are valid (e.g., auth_type must be a recognized method)

Bootstrap keeps overwriting my changes:
- Remove the config.hexon.io/origin label from the CRD instance: kubectl label hexonproxy <name> config.hexon.io/origin-
- Once the label is removed, bootstrap treats it as operator-owned and skips it
- Do the same for companion Secrets if you want to manage them independently

Sensitive field shows empty after CRD apply:
- Check the companion Secret exists: kubectl get secret hexon-<plural>-<name>
- Verify the Secret has the expected key: kubectl get secret <name> -o jsonpath='{.data}'
- Check RBAC allows Secret read: the gateway pod needs get on core/v1 secrets
- If the Secret was manually deleted, restart the gateway to recreate it via bootstrap

CRD still exists after removing mapping from TOML:
- Bootstrap prunes only CRDs with the config.hexon.io/origin: bootstrap label
- If the label was removed (operator-owned), delete it manually: kubectl delete hexonproxy <name>

Config export for migration:
- Use the admin CLI: config export
- Exports running config as multi-document YAML CRD manifests
- Filter by section: config export proxy
- JSON format: config export --format=json
- Only available when running in Kubernetes

Relationships
Module dependencies and interactions:
- Configuration system: CRD changes are applied to the same config store used by TOML and environment variables. All modules see changes via the standard config reload mechanism. CRDs have the same precedence as TOML — environment variables still override.
- Cluster coordination: Bootstrap reconciliation runs on the cluster leader only. Config changes from CRDs propagate to all nodes via the standard config reload broadcast (NATS-based).
- Admin CLI: The "config export" command generates CRD YAML from running config, enabling migration from TOML to Kubernetes-native management. When using "config export --apply", companion Secrets are created for sensitive fields (without the bootstrap label — operator-owned). The "config show" and "config describe" commands work regardless of config source.
- Helm chart: CRDs are distributed separately from the Helm chart. Install CRDs first, then deploy the chart. This avoids Helm's CRD lifecycle limitations (no update on upgrade, deletion on uninstall).
- CI/CD integration: CRD manifests are published to registry.hexon.io/crds/ with versioning. Compatible with ArgoCD, Flux, and other GitOps tools. The all-in-one bundle (hexon-crds.yaml) contains all 55 CRD definitions.
- Codegen tool: CRD YAML manifests are generated from Go struct tags using the build tool (build-crd.sh). OpenAPI v3 schemas include validation constraints derived from struct tags (required, enum, min, max, default, desc).

AI Assistant
Built-in AI-powered natural language interface for gateway operations via the bastion shell
Overview
The AI assistant enables natural language interaction with all gateway admin tools through the bastion shell’s “ai” command. It shares the same tool set and execution path as MCP, ensuring identical tool visibility, read/write enforcement, metrics, and audit logging.
Capabilities:
Tool execution - Runs any admin CLI command via an agentic loop. The AI reads tool results, reasons about them, and decides what to run next. Read-only commands execute automatically. Write operations pause for interactive operator approval in the SSH session.
Multi-provider support - Works with Anthropic (Claude), OpenAI (GPT-4), Azure OpenAI, Google Gemini, and Ollama/vLLM for local models. Provider auto-detected from the API URL or set explicitly.
Conversation context - Maintains per-session conversation history so follow-up questions build on prior answers. Operators can set session context hints, and the AI sees recent shell commands for awareness.
Background monitoring - The schedule_task tool runs commands periodically in the background. Results appear between shell prompts. Operators manage tasks with "task list" and "task stop".
Inline monitoring loops - The sleep tool pauses the AI within its reasoning loop, then resumes with full context. Enables "check health, wait 30s, check again, compare, report changes" patterns. Each sleep extends the tool-calling budget so monitoring does not fight the per-query round limit. Governed by max_sleep_duration (default 5m per call) and max_sleeps_per_query (default 60 iterations). Ctrl+C interrupts immediately.
Cluster knowledge base - Persistent cross-session memory for operational insights and rules. The AI learns from investigations and applies that knowledge in future sessions.
Prompt caching - The Anthropic provider supports prompt caching (5m or 1h TTL) to reduce token costs on repeated interactions.

Configuration: [llm] section with api_url, api_key, model, required_groups. Enable in bastion with [bastion] use_llm = true. Per-user custom instructions via moduledata or config.
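A minimal configuration sketch based on the keys named in this section. Values are placeholders, and the exact section placement of the limit keys is an assumption:

```toml
[llm]
api_url = "https://api.anthropic.com"  # provider auto-detected from the URL
api_key = "<api-key>"                  # placeholder — use a real secret
model = "<model-name>"                 # placeholder
required_groups = ["ops-admins"]       # placeholder group name
# Safety limits referenced below (documented defaults, assumed to live here):
# max_tool_rounds = 15
# max_write_ops_per_query = 3

[bastion]
use_llm = true                         # expose the "ai" command in the shell
```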
Safety
Multiple layers prevent runaway AI behavior:
Tool round limit - max_tool_rounds (default 15) caps the number of reasoning cycles per query. Sleep calls extend this budget so monitoring loops get additional rounds.
Write operation limit - max_write_ops_per_query (default 3) caps mutations per query. The AI cannot retry failing write commands with slight variations.
Sleep guardrails - max_sleep_duration (default 5m) caps individual pauses. max_sleeps_per_query (default 60) caps total iterations. Token cost on each wake-up naturally limits runaway loops.
Failed operation dedup - Commands that fail are tracked by operation key. The AI cannot re-execute the same failing command.
RBAC - required_groups restricts which operators can use AI features. The allowed_commands whitelist limits which tools the AI can call.
Interactive approval - Write operations prompt the operator for y/n confirmation in the SSH session before execution.
Audit trail - All AI interactions logged with distributed tracing. Sensitive data redacted by default.
Rate limiting - Per-user query rate limit (default 10/1m) prevents excessive API usage.

Security
Multiple defense layers protect the AI assistant:
RBAC - required_groups restricts which operators can use AI features. Only operators in the configured groups can access the "ai" command.
Command whitelist - allowed_commands limits which tools the AI can call. Operators cannot override this from within the AI session.
Write protection - Read-only commands execute automatically. Write operations pause for interactive operator approval (y/n) in the SSH session before execution. Cannot be overridden by the AI.
Rate limiting - Per-user query rate limit (default 10/1m) prevents excessive API usage and token cost accumulation.
Audit trail - All AI interactions logged with distributed tracing. Sensitive data redacted by default (redact_sensitive = true).
Runaway prevention - Tool round limit (default 15), write operation limit (default 3), sleep guardrails (5m per call, 60 iterations max), and failed operation dedup all prevent excessive token consumption.

Troubleshooting
Common symptoms and diagnostic steps:
AI command not available in bastion:
- Verify [bastion] use_llm = true in config
- Verify [llm] section is configured with api_url, api_key, model
- Check required_groups: operator must be in one of the listed groups
- Check: 'config show llm' to verify configuration

AI returns errors or empty responses:
- Check API connectivity: verify api_url is reachable from the gateway
- Check API key validity: invalid keys produce authentication errors
- Check provider detection: auto-detect uses the api_url hostname; set provider explicitly if using a proxy or non-standard endpoint
- Check: 'logs search llm --since=5m' for API errors

Write operations not being approved:
- Write ops require an interactive SSH session (not available via MCP)
- Operator must respond y/n to the approval prompt
- max_write_ops_per_query (default 3) may be exhausted for the query
- Check the allowed_commands whitelist if specific commands are blocked

AI stops responding mid-conversation:
- max_tool_rounds (default 15) reached: increase if needed for complex queries, but be aware of token cost implications
- Sleep monitoring loop: max_sleeps_per_query (default 60) may be exhausted. Ctrl+C interrupts immediately
- Check: 'logs search llm "round limit"' for limit violations

Background tasks not running:
- 'task list' shows scheduled tasks and their status
- 'task stop <id>' to cancel a misbehaving task
- Only read-only commands can be scheduled as background tasks

High API costs:
- Enable prompt caching for the Anthropic provider (cache_ttl setting)
- Reduce max_tool_rounds to limit reasoning cycles
- Review max_sleeps_per_query for monitoring loops
- Check per-user rate limits (default 10 queries/minute)

Relationships
Cross-subsystem interactions:
- Admin CLI: Single source of truth for all tools. The AI calls the same command handlers available via MCP and the bastion shell.
- MCP: Shares system instructions, tool definitions, and response formatting. Both interfaces use the same execution path.
- Bastion shell: Hosts the "ai" command and interactive AI mode. Manages conversation history, approval prompts, and background task lifecycle.
- Cluster knowledge: Memory entries (insights and rules) stored in cluster-wide distributed storage with configurable TTL.
- Admin CLI commands: diagnose, health, proxy, sessions, certs, vpn, dns, directory, config, and 30+ more — all available as AI tools.

Notification Service
Unified notification routing to email and webhooks with template-driven payloads
Overview
The notify module handles all outbound notification delivery including single event notifications, digest notifications, and health checks for configured endpoints. It routes messages to email and/or webhooks simultaneously with template-driven payload formatting.
Core capabilities:
- Multi-channel routing: email, Slack, Teams, Discord, PagerDuty, custom webhooks
- Single event notifications (Send) with subject, body, and severity
- Digest notifications (SendDigest) batching multiple results into one message
- Five builtin webhook payload formats: generic, slack, teams, discord, pagerduty
- Custom Go text/template payloads with json, severityColor, severityEmoji helpers
- Partial success model: Success=true if at least one endpoint delivers
- Branded HTML email templates rendered via the render module
- Plain text fallback when render module is unavailable
- Targeted routing: empty Webhook sends to all, "email" for email only, or a specific webhook name for single-target delivery
- Health checking for all configured notification endpoints
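The targeted-routing capability (detailed under "Routing logic" below) can be sketched as a simple dispatch. Function and type names here are illustrative, not the module's real API:

```go
package main

import "fmt"

// routeTargets sketches the documented Webhook-field routing: empty means
// broadcast to email plus all webhooks, "email" means email only, and
// anything else names exactly one configured webhook (case-sensitive).
func routeTargets(webhook string, configured []string) []string {
	switch webhook {
	case "":
		return append([]string{"email"}, configured...) // broadcast
	case "email":
		return []string{"email"}
	default:
		for _, name := range configured {
			if name == webhook { // exact, case-sensitive match
				return []string{name}
			}
		}
		return nil // unknown webhook name: no delivery target
	}
}

func main() {
	hooks := []string{"slack-ops", "teams-infra"}
	fmt.Println(routeTargets("", hooks))          // [email slack-ops teams-infra]
	fmt.Println(routeTargets("slack-ops", hooks)) // [slack-ops]
}
```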
Routing logic for the Webhook field:
- "" (empty): broadcast to all enabled channels (email + all webhooks) - "email": send to email channel only (requires To field) - "<name>": send to the named webhook only (e.g., "slack-ops")Email delivery chain:
1. Notify module requests email rendering with template + data
2. Render module loads the appropriate notification and digest templates
3. Rendered HTML and plain text forwarded to SMTP module as multipart
4. Fallback: if render unavailable, plain text auto-wrapped in <pre> tags

Webhook payload formats:
- generic: flat JSON with subject, body, severity, username, hostname, timestamp
- slack: Block Kit with header, severity emoji, code block body, context footer
- teams: Adaptive Card v1.4 with TextBlock, FactSet, monospace body
- discord: Embed with severity color mapping, code block body, footer
- pagerduty: Events API v2 with routing_key from Metadata, severity mapping

Config
Notification configuration under [notify] section:
[notify]
digest_window = "5m"  # Window for batching digest items

[notify.email]
enabled = true  # Enable email notifications (uses [smtp] config)

[[notify.webhooks]]
name = "slack-ops" # Webhook name (used for targeted routing) url = "https://hooks.slack.com/services/T00/B00/XXX" # Webhook endpoint URL format = "slack" # Payload format: generic, slack, teams, discord, pagerduty timeout = "10s" # Request timeout (default: 10s)[[notify.webhooks]]
name = "teams-infra" url = "https://outlook.office.com/webhook/XXX" format = "teams" timeout = "15s"[[notify.webhooks]]
name = "pagerduty-critical" url = "https://events.pagerduty.com/v2/enqueue" format = "pagerduty"[[notify.webhooks]]
name = "custom-endpoint" url = "https://api.internal/alerts" format = "generic" # Base format (overridden by body_template) content_type = "application/json" # Custom content type body_template = '{"alert": "{{json .Subject}}", "detail": "{{json .Body}}"}' [notify.webhooks.headers] Authorization = "Bearer token123" # Custom headers sent with every requestTemplate functions available in custom body_template:
{{json .Field}} - JSON-escape a string (quotes, backslashes, control chars)
{{severityColor .Sev}} - Map severity to Discord embed color (int)
{{severityEmoji .Sev}} - Map severity to Slack emoji string
{{severityPD .Sev}} - Map severity to PagerDuty severity level

Template variables (TemplateData fields):
.Subject, .Body, .Severity, .Username, .Hostname, .Timestamp, .Metadata (map[string]string), .Items (digest), .ItemCount (digest)

Email template variables (passed to render module):
Subject, Body, Severity, SeverityLabel, Username, Hostname, Timestamp, Disclaimer, Items (digest), ItemCount (digest)

The HTMLBody field on SendRequest bypasses template rendering entirely, allowing callers to provide custom HTML content. Email requires the To field to be set (recipient address).
Hot-reloadable: webhook URLs, formats, headers, timeouts, email enabled. Cold (restart required): none (all notify config is hot-reloadable).
Troubleshooting
Common symptoms and diagnostic steps:
Webhook not receiving notifications:
- Run 'notify health' to check connectivity to all endpoints
- Verify webhook URL is correct and accessible from the Hexon server
- Check webhook name matches exactly (case-sensitive) when using targeted routing
- Verify format is one of: generic, slack, teams, discord, pagerduty
- Check timeout setting (default 10s) is sufficient for the endpoint
- For custom endpoints, verify content_type and body_template are valid
- Check custom headers (Authorization, API keys) are correct

Email notifications not being delivered:
- Verify [notify.email] enabled = true
- Check SMTP module health: 'smtp health'
- Verify To field is set on the SendRequest
- Check render module is available for template rendering
- If render unavailable, plain text fallback should still work
- Check spam folder - configure SPF/DKIM/DMARC for production

Partial success (some channels fail, others succeed):
- This is expected behavior with the partial success model
- Check resp.Failed[] for endpoints that failed and resp.Sent[] for successes
- resp.Error contains a summary of failures
- Individual endpoint failures do not block other deliveries
- Success=true means at least one endpoint delivered successfully

Digest notifications empty or missing items:
- Verify Items array is populated in SendDigestRequest
- Each DigestItem needs TaskID and Description at minimum
- Check digest_window setting if items seem to be batched incorrectly

Custom template errors:
- Templates are parsed and validated at config load time
- Template execution errors prevent notification delivery
- Use {{json .Field}} for all string interpolation to prevent JSON injection
- Check template syntax matches Go text/template format
- Verify all referenced fields exist in TemplateData

Slack/Teams formatting issues:
- Slack format uses Block Kit (verify workspace supports it)
- Teams format uses Adaptive Card v1.4 (verify connector version)
- Discord embeds have 4096 character limit for description
- PagerDuty requires routing_key in Metadata map

Test notifications:
- Use 'notify test <webhook-name>' to test a specific webhook
- Use 'notify test email <address>' to test email delivery
- Use 'notify list' to see all configured endpoints

Security
Security considerations for notification delivery:
Webhook URL protection:
Webhook URLs are marked as sensitive in configuration and are not exposed in config dumps or diagnostic output. Only HTTPS URLs are recommended for production deployments. HTTP is allowed for internal or testing endpoints.

Authentication headers:
Webhook headers (Authorization, API keys) are stored in configuration and sent with every webhook request to the endpoint. Consider using environment variable references for secrets in production.

Custom template safety:
Templates are parsed and validated at config load time to catch syntax errors early. Always use {{json .Field}} for string interpolation in custom templates to prevent JSON injection attacks. Template execution errors are returned and the notification is not sent (fail-safe).

HTML content handling:
Email body text is HTML-escaped when auto-converting plain text to HTML. The HTMLBody field content is sent as-is without sanitization - callers are responsible for ensuring HTML content is safe. The render module handles template escaping for branded email templates.

Credential exposure prevention:
SMTP credentials are handled by the smtp module (never exposed by notify). Webhook URLs with embedded tokens (Slack, Teams) are redacted in logs. Error messages from failed webhook deliveries do not include full URLs.

Relationships
Module dependencies and interactions:
- smtp: Email delivery backend. Notify sends emails through the SMTP module for all email notifications. SMTP configuration ([smtp] section) determines the mail server, credentials, and encryption mode.
- UI templates: Email template rendering. Notify renders branded HTML email templates cluster-wide. If template rendering is unavailable, plain text fallback is used.
- Admin CLI: Exposes notify CLI commands (list, health, test) for management and diagnostics.
- mcp: Notify operations available as MCP tools for AI-assisted operations. LLM and bastion AI assistant can send notifications through MCP tools.
- config: All notification settings are hot-reloadable. Webhook URLs, formats, headers, timeouts, and email enabled flag can be changed without restart.
- telemetry: Structured logging for notification delivery with endpoint name, delivery status, and latency. Metrics for send counts and failures.
- Rate limiting: Callers should apply rate limiting in HTTP handlers to prevent notification flooding. Notify module itself does not rate limit.
- Various callers: Any module can send notifications cluster-wide. Common callers include authentication modules (login anomalies), certificate management (renewal notifications), and the bastion AI assistant.

SMTP Email Delivery
Email delivery via SMTP with SSL/STARTTLS encryption, templated emails, localization, and attachments
Overview
The SMTP module handles all outbound email operations for HexonGateway. It is invoked by other modules that need to send email (OTP authentication, certificate renewal notifications, passkey expiration reminders, and generic email delivery).
Core capabilities:
- Generic email sending with HTML and plain text multipart content
- OTP (One-Time Password) emails for authentication flows
- Certificate renewal notification emails with cert and CA bundle attached
- Passkey expiration reminder emails with re-enrollment link
- Health checks for SMTP server connectivity verification
- Multi-part email composition with file attachments
- Three encryption modes: SSL (port 465), STARTTLS (port 587), plain (port 25)
- Multi-language email localization (en, es, fr, zh, ca)
- Template rendering with branded HTML email templates
- User language preference lookup for automatic localization
- RFC 5321 compliant address validation (local ≤ 64, domain ≤ 255 chars)
- RFC 5322 compliant headers (Date, Message-ID on every email)
- RFC 8255 Content-Language header on templated emails
- Message-ID in structured logs (success + failure) for MTA correlation
Localization priority for templated emails:
1. Language field explicitly set in the request
2. User preference from stored preferences
3. Default fallback to "en" (English)

Supported languages: en (English), es (Spanish), fr (French), zh (Chinese), ca (Catalan)
Config
SMTP configuration under [smtp] section:
[smtp]
host = "smtp.gmail.com" # SMTP server hostname (required) port = 587 # SMTP server port (required) encryption = "starttls" # Encryption mode: "ssl", "starttls", or "none" user = "noreply@example.com" # SMTP authentication username password = "app-specific-password" # SMTP authentication password (sensitive) from = "noreply@example.com" # Sender email address (From header) reply_to = "support@example.com" # Reply-To header address (optional) name = "HexonAuth" # Sender display name (optional) skip_tls = false # Skip TLS certificate verification (default: false)skip_tls: Disables server certificate validation for SSL and STARTTLS modes.
Logs a WARN on every send when enabled. Use only when the SMTP server presents an untrusted or hostname-mismatched certificate. NOT recommended for production.

Encryption modes:
ssl (port 465): Direct TLS connection from the start.
starttls (port 587): Plain connection upgraded to TLS. Recommended.
none (port 25): Unencrypted. Not recommended for production.

Common SMTP provider configurations:
Gmail:    host = "smtp.gmail.com", port = 587, encryption = "starttls" (requires App Passwords with 2FA enabled)
SendGrid: host = "smtp.sendgrid.net", port = 587, encryption = "starttls", user = "apikey", password = "<sendgrid-api-key>"
AWS SES:  host = "email-smtp.<region>.amazonaws.com", port = 587
Mailgun:  host = "smtp.mailgun.org", port = 587, encryption = "starttls"

Hot-reloadable: all SMTP settings (host, port, encryption, credentials). Cold (restart required): none.
Troubleshooting
Common symptoms and diagnostic steps:
SMTP connection failures:
- Check SMTP health: 'smtp health' tests connectivity and authentication
- Verify host and port match encryption mode (SSL=465, STARTTLS=587)
- Firewall blocking outbound: verify server can reach SMTP host:port
- Network probe: 'net tcp <smtp-host>:<port> --tls' for SSL
- DNS resolution: 'dns test <smtp-host>' to verify hostname resolves

Authentication failures:
- Gmail: requires App Passwords (regular password won't work with 2FA)
- SendGrid: user must be literal string "apikey", password is the API key
- AWS SES: IAM credentials, not root account credentials
- Check: 'config show smtp' to verify configuration (password redacted)

Emails not being received:
- Check spam/junk folder at recipient mail provider
- Verify from address matches authenticated user or authorized alias
- Configure SPF, DKIM, and DMARC DNS records for sending domain
- Test delivery: 'smtp test <to-address>' sends a test message
- Check: 'notify health' for notification system status

OTP emails delayed or missing:
- SMTP latency: 200-1000ms is normal, check 'smtp health'
- OTP code expired before email arrived: check OTP validity window
- Rate limiting by SMTP provider: check provider dashboard

Passkey expiration emails not sent:
- Expired passkeys intentionally do not receive reminder emails
- Verify DaysRemaining is positive (zero or negative triggers no email)

TLS certificate verification failures:
- "STARTTLS failed: tls: failed to verify certificate" or "TLS dial failed" - Common cause: SMTP relay hostname differs from certificate CN/SAN (e.g., smtp.company.com forwards to smtp.gmail.com) - Temporary fix: set skip_tls = true in [smtp] config (logs WARN per send) - Proper fix: configure the SMTP server with a valid certificate matching its hostname - Check: 'net tls <smtp-host>:<port>' to inspect the certificate chainTemplate rendering errors:
- Missing locale: unsupported language falls back to English
- Template rendering failures prevent email send (no fallback)

Address validation failures (RFC 5321):
- "local part exceeds 64 character limit": email local part too long - "domain part exceeds 255 character limit": email domain too long - These limits are per RFC 5321 §4.5.3.1Correlating email delivery with MTA logs:
- Every send logs Message-ID (success and failure)
- Use Message-ID to trace through relay MTAs, bounce reports, DMARC feedback
- Search gateway logs: 'logs search message_id=<value>'

Relationships
Module dependencies and interactions:
- OTP authentication: Delivers one-time passwords for email-based auth.
- Certificate management: Sends renewal notification emails with cert and CA bundle attachments.
- WebAuthn/Passkey: Sends passkey expiration reminder emails.
- Notification service: Uses SMTP for email channel delivery alongside webhooks for multi-channel routing.
- Directory: User full name lookup for personalized email greetings.
- Localization: Localized email text loaded from locale files.
- Configuration: Reads [smtp] TOML section. All settings hot-reloadable.
- Admin CLI: 'smtp health' and 'smtp test' commands for diagnostics.
Telemetry & Logging
Structured logging with OTLP export, per-module log levels, audit class, ring buffer queries, and trace correlation
Overview
The telemetry module provides structured logging with key-value pairs, multiple output targets, and cross-module trace correlation for cluster-wide observability.
Core capabilities:
- Structured logging with key-value pairs and fluent builder API
- Six log levels: TRACE, DEBUG, INFO, WARN, ERROR, FATAL
- AUDIT log class: bypasses level filtering for security events
- Per-module log level configuration (override global level per module)
- OTLP gRPC log export to OpenTelemetry-compatible collectors
- Trace ID correlation across modules (128-bit hex IDs per request)
- In-memory ring buffer for admin CLI log queries
- JSON and human-readable output formats
- Security context builder for auth-related log entries
Output modes:
stdout: Structured logs written to standard output (default)
otlp:   Logs exported via gRPC to an OpenTelemetry collector
both:   Simultaneous stdout and OTLP export

OTLP export includes:
- timestamp, severity, body (message), module attribute
- service.name, service.version, environment, host.name, host.ip
- Native OTLP TraceId field for trace-to-log correlation
- Batched async export via SDK log processor

Ring buffer:
Configurable in-memory buffer (default 10,000 entries) for admin CLI log queries ('logs tail', 'logs search'). Provides instant access to recent logs without external log aggregation. Set to 0 to disable.

Config
Configuration under [telemetry] section:
[telemetry]
log_level = "info" # Global: trace|debug|info|warn|error|fatal log_format = "json" # Output format: "json" or "human" output = "stdout" # Output target: "stdout", "otlp", or "both" otlp_endpoint = "otel-collector:4317" # Required when output is "otlp" or "both" log_buffer_size = 10000 # Ring buffer entries for log queries (0 = disabled) audit = true # Audit class: always display security events regardless of log_level[telemetry.module_levels]
oidc = "debug" # Per-module override (module name = level) webauthn = "info" ikev2 = "trace"OTLP endpoint format:
"host:port" - Plain gRPC connection "http://host:port" - Insecure gRPC (http:// stripped, WithInsecure applied) "https://host:port" - TLS gRPC connectionCompatible collectors: Grafana Alloy, OpenTelemetry Collector, Datadog Agent, Splunk OTel Collector, any OTLP/gRPC compatible receiver.
If the OTLP endpoint is unreachable at startup, the system falls back to stdout and logs a warning. gRPC connections are lazy (connect on first export).
Audit class:
When audit = true (default), log entries marked with AsAudit() bypass level filtering. Security events (SFTP ops, SSH connections, admin commands, TLS protection) are always visible even when log_level is set to "error".

Hot-reloadable: log_level, module_levels, log_format. Cold (restart required): output, otlp_endpoint, log_buffer_size, audit.
Troubleshooting
Common symptoms and diagnostic steps:
Logs not appearing in OTLP collector:
- Verify output is set to "otlp" or "both" in [telemetry]
- Check otlp_endpoint format (host:port, no trailing slash)
- Network connectivity: 'net tcp <collector-host>:<port>'
- Collector may reject due to resource limits or auth requirements
- Startup fallback: if endpoint was unreachable at startup, logs go to stdout
- Check: 'logs tail' to verify logs are being generated locally

Per-module log level not working:
- Verify [telemetry.module_levels] has exact module name (case-sensitive)
- Module names use dot notation: "oidc", "ikev2.dns", "identity.scim"
- Per-module level must be lower priority than global to have effect
- Check: 'config show telemetry' to verify active configuration

Ring buffer queries returning no results:
- Verify log_buffer_size > 0 (0 disables the ring buffer)
- Buffer is in-memory only; cleared on restart
- 'logs tail' shows most recent entries
- 'logs search <keyword>' filters by content
- Buffer wraps around: oldest entries are overwritten when full

Log format issues:
- "json": structured key-value JSON (recommended for log aggregation) - "human": colored, readable format (recommended for development) - Trace IDs: full 128-bit hex in JSON, truncated 8-char in human formatHigh log volume impacting performance:
- Raise global log_level to "warn" or "error"
- Use per-module levels to keep verbose logging only where needed
- OTLP batched export is async and does not block request processing
- Ring buffer size: reduce log_buffer_size if memory is a concern

Relationships
Module dependencies and interactions:
- All modules: Every module in the system uses telemetry for structured logging with trace correlation.
- Admin CLI: 'logs tail', 'logs search', 'logs stats', 'logs anomalies', 'logs patterns' commands query the ring buffer.
- Configuration: Reads [telemetry] section. Log level and format are hot-reloadable without restart.
- Cluster: Each node maintains its own ring buffer. Admin CLI log queries fan out to all nodes and merge results.