Certificates & PKI
Certificate Management
Unified TLS certificate lifecycle across internal ACME CA, external ACME clients, and static PEM sources
Overview
Certificate Management is the central coordination layer for all TLS certificates in HexonGateway. It unifies three distinct certificate sources into a single storage and distribution model:
- Internal ACME CA: Issues certificates for internal services using a built-in RFC 8555 compliant Certificate Authority. Supports http-01, dns-01, and tls-alpn-01 challenges with OCSP and CRL distribution.
- External ACME Client: Obtains certificates from Let’s Encrypt or any external ACME-compliant CA. Handles automatic renewal, bootstrap fallback, and cluster-wide distribution.
- Static PEM: Certificates loaded directly from configuration files (tls_cert/tls_key). Highest priority source, used when pre-provisioned certificates are available.
Regardless of source, all certificates flow through the certificate manager for unified storage, caching, and distribution. The TLS handshake retrieves certificates from the local in-memory cache with sub-microsecond latency, while cluster-wide consistency is maintained via broadcast operations.
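The handshake-time lookup against the local cache can be pictured with a small sketch. It assumes the three-tier domain matching this page describes (exact match, then a single-level wildcard, then the default certificate). The function name `select_certificate` and the use of plain strings in place of parsed certificates are illustrative assumptions, not HexonGateway's actual API.

```python
from typing import Optional


def select_certificate(server_name: str,
                       cache: dict[str, str],
                       default: Optional[str] = None) -> Optional[str]:
    """Resolve an SNI name: exact match, then one-level wildcard, then default.

    `cache` maps a domain (or a "*.parent" wildcard) to a certificate handle.
    Names are compared as given, without case folding.
    """
    if server_name in cache:                  # tier 1: exact domain match
        return cache[server_name]
    _, dot, parent = server_name.partition(".")
    if dot and f"*.{parent}" in cache:        # tier 2: wildcard, one label only
        return cache[f"*.{parent}"]
    return default                            # tier 3: default certificate


cache = {"api.example.com": "api-cert", "*.example.com": "wildcard-cert"}
print(select_certificate("api.example.com", cache, "default-cert"))
print(select_certificate("www.example.com", cache, "default-cert"))
print(select_certificate("sub.api.example.com", cache, "default-cert"))
```

Because only the first label is replaced by `*`, a nested name such as sub.api.example.com falls through to the default rather than matching *.example.com.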
Certificate selection during TLS handshake follows a three-tier priority:
exact domain match > wildcard match (*.example.com) > default certificate
Storage layers:
1. Distributed cache: keyed by domain with TTL matching certificate expiry
2. Local cache: parsed TLS certificates held in-memory for zero-latency TLS serving (reads are local-only, writes broadcast cluster-wide)
Validation on storage:
- Maximum PEM sizes enforced (256KB cert, 32KB key)
- Certificate chain length limited to 10
- Domain length capped at 253 characters (RFC 1035)
- Date range and domain-SAN match validation
Config
Certificate management is an infrastructure module — it has no dedicated configuration section. Certificate sources are configured via other modules:
Static certificates:
    [service]
    tls_cert = "/path/to/cert.pem"  # Static TLS certificate
    tls_key = "/path/to/key.pem"  # Static TLS private key
Per-proxy mapping certificates:
    [[proxy.mapping]]
    hostname = "api.example.com"
    tls_cert = "/path/to/api-cert.pem"  # Per-route certificate
    tls_key = "/path/to/api-key.pem"
Internal ACME CA (automatic):
    [acme]
    enabled = true  # Enables internal certificate issuance
External ACME Client (automatic):
    [acme_client]
    enabled = true  # Obtains certs from Let's Encrypt or similar
AutoTLS (automatic wildcard):
    [service]
    auto_tls = true  # Wildcard cert via internal ACME CA
Certificate selection priority during TLS handshake:
1. Static PEM (from tls_cert/tls_key or per-mapping config)
2. ACME-issued certificate (internal or external)
3. Default certificate (service-level or AutoTLS wildcard)
Domain matching priority:
exact match > wildcard match (*.example.com) > default certificate
Validation limits (enforced on all certificate storage):
- Maximum certificate PEM size: 256KB
- Maximum private key PEM size: 32KB
- Maximum chain length: 10 certificates
- Maximum domain name length: 253 characters (RFC 1035)
- Date range validation: NotBefore <= now <= NotAfter
Troubleshooting
Common symptoms and diagnostic steps:
Certificate not found for domain:
- Check 'certs list' for all managed certificates
- Check 'certs show <domain>' for specific domain
- Verify domain matches exactly (case-sensitive) or has wildcard match
- Check certificate source: static (config), ACME (internal CA), or ACME client
- If ACME: check 'autotls status' and 'certs acme' for issuance status
TLS handshake using wrong certificate:
- Check priority: static > ACME > default
- Per-mapping certificates override service-level defaults
- 'diagnose domain <hostname>' shows which certificate is being served
- Wildcard certificates only match one level (*.example.com matches api.example.com but NOT sub.api.example.com)
Certificate expired or expiring soon:
- Check 'certs list' for expiration dates
- ACME certificates renew automatically before expiry
- Static certificates must be replaced manually and config reloaded
- If auto-renewal failed: check ACME CA health ('health components')
- Check 'logs search certmanager --level=warn' for expiration warnings
Certificate not propagating across cluster:
- Writes are broadcast to all nodes; check cluster health
- Check 'cluster status' for quorum and node connectivity
- Each node maintains a local cache; propagation is near-instant
- If a node missed the broadcast: restart triggers fresh certificate load
Invalid certificate PEM:
- Check PEM encoding: must be valid base64 with proper headers
- Chain must include intermediates (server cert + intermediate CAs)
- Key must match the certificate's public key
- File path must be readable by the gateway process
Metrics for monitoring:
- certmanager_certificates_total: total certificates in cache
- certmanager_set_total: certificate store operations by source
- certmanager_get_total: cache lookups (hit=true/false)
- certmanager_expired_total: certificates that expired from cache
Relationships
Child modules:
- certificates.acme: Internal ACME CA server for issuing certificates to internal services. Acts as the certificate authority that clients can request certificates from.
- certificates.acmeclient: ACME protocol client that obtains certificates from Let's Encrypt or any ACME-compliant CA (including the internal CA). Handles renewal scheduling, bootstrap fallback, and cluster distribution.
Key dependents:
- TLS server: Retrieves certificates for TLS handshake callbacks.
- Proxy: Per-mapping TLS certificates for SNI-based routing. Invalid or missing certificates prevent routes from mounting.
- AutoTLS: Uses the internal ACME CA to issue wildcard certificates, then stores them through the certificate manager for cluster-wide availability.
Infrastructure dependencies:
- Distributed storage: Certificate cache with TTL-based expiration.
- Cluster broadcast: Operations for cluster-wide certificate updates.
- Configuration: Certificate source selection and TLS parameters.
ACME CA Server
RFC 8555 compliant internal ACME Certificate Authority with OCSP, CRL, and deterministic DNS challenges
Overview
The ACME module implements a full RFC 8555 ACME Certificate Authority server for automated certificate issuance within internal infrastructure. It is compatible with standard ACME clients including certbot, cert-manager, Caddy, and Traefik.
Core capabilities:
- Full RFC 8555 ACME protocol compliance
- Stateless accounts derived from JWK thumbprint (no account database)
- All three challenge types: http-01, dns-01, and tls-alpn-01
- IP address certificates via RFC 8738
- CAA checking via RFC 8659 with domain hierarchy walk-up
- OCSP responder (RFC 6960) with caching for real-time certificate status
- CRL distribution (RFC 5280) rebuilt on each revocation
- Deterministic DNS challenges for internal domains without DNS API
- UUID v4 certificate serial numbers (122 random bits, so collisions are practically impossible)
- Cluster-ready with distributed storage across all nodes
- Comprehensive multi-dimensional rate limiting (7 dimensions)
- Saga pattern for atomic distributed updates (TOCTOU prevention)
- Optimistic concurrency control for rate limit counters
- Threshold ECDSA CA: when acme_ca_threshold=true, the CA private key exists only as distributed threshold shares (GG18 DKG). Fail-closed: no certs until DKG completes.
- CA certificate rotation is automatic, managed by AutoTLS renewalLoop or ACME client renewal scheduler. Operators do NOT need to set calendar reminders for CA cert expiry. Only investigate if 'health components' shows CA warnings or 'certs list' shows expiring certs.
HTTP endpoints under /acme (configurable prefix):
    GET  /acme/directory       -> ACME directory with all endpoint URLs
    HEAD /acme/new-nonce       -> Anti-replay nonce
    POST /acme/new-account     -> Create/lookup account
    POST /acme/new-order       -> Create certificate order
    POST /acme/order/{id}      -> Order status
    POST /acme/authz/{id}      -> Authorization status
    POST /acme/challenge/{id}  -> Challenge response
    POST /acme/finalize/{id}   -> Finalize order with CSR
    POST /acme/cert/{id}       -> Download certificate (PEM)
    POST /acme/revoke-cert     -> Revoke certificate
    GET  /acme/ca-certs        -> CA certificate bundle (PEM)
    GET  /acme/crl             -> CRL (DER-encoded)
    GET  /acme/ocsp/{req}      -> OCSP check (GET)
    POST /acme/ocsp            -> OCSP check (POST)
Storage model:
- Volatile: in-memory cache for fast access (orders, authorizations, challenges, nonces, OCSP)
- Persistent: NATS JetStream KV for durability (certificates, CRL, serial index)
- All persistent data encrypted at rest (key derived from cluster_key)
- Startup: certificates loaded from persistent storage into memory cache
Cluster behavior:
- Write operations (create order, finalize, revoke): Replicated with quorum
- Read operations (get directory, get order, get cert): Local only
- Validation operations (validate challenge, check CAA): Local with external calls
- Nonces: created across all nodes; validated and consumed locally
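Since accounts are identified by JWK thumbprint and the challenge responses are built from the same value, it can help to see how a standard ACME client arrives at them. The sketch below follows RFC 7638 (thumbprint) and RFC 8555 (key authorization and dns-01 TXT value). The function names are illustrative, the example key is a placeholder rather than a real public key, and how the CA maps thumbprints to accounts internally is not shown here.

```python
import base64
import hashlib
import json


def b64url(data: bytes) -> str:
    """base64url without padding, as used throughout ACME."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


# Members that enter the thumbprint for each key type (RFC 7638).
_REQUIRED = {"EC": ("crv", "kty", "x", "y"), "RSA": ("e", "kty", "n")}


def jwk_thumbprint(jwk: dict) -> str:
    """SHA-256 over the canonical JSON of the required members only."""
    members = {name: jwk[name] for name in _REQUIRED[jwk["kty"]]}
    canonical = json.dumps(members, sort_keys=True, separators=(",", ":"))
    return b64url(hashlib.sha256(canonical.encode("utf-8")).digest())


def key_authorization(token: str, jwk: dict) -> str:
    """Body served for http-01: {token}.{thumbprint}."""
    return f"{token}.{jwk_thumbprint(jwk)}"


def dns01_txt_value(token: str, jwk: dict) -> str:
    """TXT value for _acme-challenge.{domain}: base64url(SHA256(keyAuthorization))."""
    digest = hashlib.sha256(key_authorization(token, jwk).encode("ascii")).digest()
    return b64url(digest)


# Placeholder key material for illustration only.
example_jwk = {"kty": "EC", "crv": "P-256", "x": "placeholder-x", "y": "placeholder-y"}
print(key_authorization("example-token", example_jwk))
print(dns01_txt_value("example-token", example_jwk))
```

Extra JWK members such as `kid` and the ordering of members do not affect the thumbprint, which is what allows the same key to map to the same account on every node.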
Config
ACME CA configuration in hexon.toml under [acme]:
    [acme]
    enabled = true  # Enable ACME CA server
    path_prefix = "/acme"  # URL prefix for ACME endpoints (default: /acme)
    external_url = ""  # External URL (derived from hostname if not set)
    # Access control
    allowed_cidrs = ["10.0.0.0/8"]  # Restrict ACME API to specific networks (optional)
    allowed_identifiers = ["*.internal.example.com"]  # Domain patterns (optional, wildcards)
    # Challenge configuration
    challenges_enabled = ["http-01"]  # Enabled challenge types (default: http-01 only)
    challenge_validity = "15m"  # Challenge validity period (default: 15m)
    nonce_validity = "15m"  # Nonce validity period (default: 15m)
    # Certificate parameters
    max_validity = "2160h"  # Maximum certificate validity (default: 90 days)
    default_validity = "2160h"  # Default certificate validity (default: 90 days)
    max_san_count = 100  # Maximum SANs per certificate (default: 100)
    enable_ip_identifiers = true  # Enable IP address identifiers (RFC 8738)
    # CAA checking (RFC 8659)
    caa_checking = false  # Enable CAA record checking (default: false)
    caa_identifiers = ["acme.example.com"]  # CAA identifiers for this CA
    # OCSP Responder (RFC 6960)
    ocsp_enabled = true  # Enable OCSP responder (default: true)
    ocsp_cache_ttl = "5m"  # OCSP response cache TTL (default: 5m)
    ocsp_cidrs = ["0.0.0.0/0"]  # Allowed CIDRs for OCSP (default: all)
    # CRL Distribution (RFC 5280)
    crl_enabled = true  # Enable CRL endpoint (default: true)
    crl_cidrs = ["0.0.0.0/0"]  # Allowed CIDRs for CRL (default: all)
    crl_next_update = "48h"  # CRL NextUpdate offset (default: 48h)
    # Deterministic DNS for internal domains
    dns_deterministic = false  # Enable deterministic DNS challenges
    dns_deterministic_cidrs = ["10.0.0.0/8"]  # Allowed CIDRs for deterministic DNS
    # Legacy rate limits (simple)
    rate_limit_orders_per_ip = 50  # Orders per IP per hour
    rate_limit_certs_per_domain = 50  # Certs per domain per week
    # Comprehensive rate limits
    [acme.rate_limits]
    enabled = true
    orders_per_account = 5000  # Max orders per account per window
    orders_per_account_window = "3h"
    certs_per_domain = 500  # Max certs per eTLD+1 domain per window
    certs_per_domain_window = "168h"  # 1 week
    certs_per_exact_set = 50  # Max certs per exact domain set
    certs_per_exact_set_window = "168h"
    auth_failures_per_domain = 50  # Max auth failures per domain per window
    auth_failures_window = "1h"
    orders_per_ip = 1000  # Max orders per IP per window
    orders_per_ip_window = "1h"
    failed_finalizations_per_order = 10
    min_order_interval = "100ms"  # Minimum time between orders per account
    buffer_percent = 10  # Warning threshold at 90% of limit
Safe defaults with just enabled = true:
- http-01 challenge enabled, CAA checking disabled
- 90-day certificate validity, 15-minute challenge/nonce validity
- 100 SANs maximum, IP identifiers enabled
- OCSP responder enabled (5-minute cache)
- CRL distribution enabled (48-hour NextUpdate)
- No CIDR or domain restrictions
Hot-reloadable: rate limits, allowed_cidrs, allowed_identifiers, OCSP/CRL settings. Cold (restart required): enabled, path_prefix, challenges_enabled.
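Two details of the rate limit settings are easy to misread: buffer_percent sets a warning threshold below the limit, and (as the rateLimited troubleshooting notes on this page state) IPv6 clients are counted per /64 prefix. A minimal sketch of both calculations follows. The function names are hypothetical, and integer rounding of the threshold is an assumption, not documented behavior.

```python
import ipaddress


def rate_limit_key(client_ip: str) -> str:
    """Collapse an IPv6 client to its /64 prefix; leave IPv4 addresses as-is."""
    addr = ipaddress.ip_address(client_ip)
    if addr.version == 6:
        # strict=False masks the host bits instead of rejecting them.
        return str(ipaddress.ip_network(f"{addr}/64", strict=False))
    return str(addr)


def warning_threshold(limit: int, buffer_percent: int) -> int:
    """Count at which a warning would be raised (90% of limit for buffer 10)."""
    return limit * (100 - buffer_percent) // 100


print(rate_limit_key("2001:db8:1:2:3:4:5:6"))   # 2001:db8:1:2::/64
print(rate_limit_key("203.0.113.7"))            # 203.0.113.7
print(warning_threshold(500, 10))               # 450
```

Two addresses in the same /64 therefore share one orders_per_ip counter, which matters when many internal hosts sit behind a single IPv6 subnet.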
Troubleshooting
Common symptoms and diagnostic steps:
“badNonce” errors from ACME clients:
- Nonce expired: increase nonce_validity (default 15m)
- Nonce already consumed: client must retry with fresh nonce from Replay-Nonce header
- Clock skew between cluster nodes: verify NTP synchronization
- Single-use enforcement: each nonce valid for exactly one request
“connection” errors during http-01 challenge:
- Firewall blocking port 80 from ACME server to client
- Client not serving challenge response at /.well-known/acme-challenge/{token}
- Wrong content at challenge URL (must be {token}.{thumbprint})
- Validation timeout: client has 30 seconds total (2s initial delay, 5 retries)
“dns” errors during dns-01 challenge:
- TXT record _acme-challenge.{domain} not propagated yet
- Wrong record value (must be base64url(SHA256(keyAuthorization)))
- DNS TTL too high, stale cached record
- DNS module must be enabled and healthy
“caa” errors during certificate issuance:
- Domain has CAA records that do not include this CA's identifier
- SERVFAIL on CAA lookup denies issuance per RFC 8659 (mandatory)
- Add CA identifier to domain CAA records or disable caa_checking
- CAA queries always bypass DNS cache for fresh data
“unauthorized” errors:
- Account key mismatch between request JWK and order account
- Order belongs to a different account thumbprint
- Certificate revocation attempted by non-owner without certificate key
“rejectedIdentifier” errors:
- Domain not matching allowed_identifiers patterns
- IP identifier requested but enable_ip_identifiers = false
“rateLimited” errors:
- Check which rate limit dimension was hit (logged at WARN level)
- Rate limits use fail-open design: errors do not block operations
- IPv6 addresses normalized to /64 prefix for rate limiting
- When both legacy and comprehensive rate limits configured, both enforced
OCSP responder returning “unknown” status:
- Serial number not recognized by this CA
- Certificate not loaded into memory on startup (check startup logs)
- Persistent storage lookup failed (check NATS JetStream health)
CRL endpoint returning empty or stale CRL:
- CRL rebuilt only on certificate revocation (not periodically)
- Check crl_enabled = true
- Verify persistent storage (NATS JetStream KV) is healthy
- Lazy load: first access after restart may be slower
Certificate storage issues:
- Memory storage: check distributed memory cache module health
- Persistent storage: check NATS JetStream KV connectivity
- Key naming: NATS KV does not allow ":" in keys, uses "/" separator
- Encryption: all persistent data encrypted with AES-256-GCM
- Startup loading: expired certificates are skipped during reload
Challenge validation timing out:
- Initial delay allows client time to set up challenge response
- Multiple retry attempts with backoff before giving up
- Total validation timeout is 30 seconds
- tls-alpn-01: verify client serves ALPN protocol "acme-tls/1" on port 443
Verify CA threshold signing works:
Run 'hexdcall threshold test' to trigger a test signing ceremony. Shows per-node participation, latency, and signature verification. Use '--trace' for phase-level timing and per-node message counts.
Security
Security model and hardening:
Account security:
Stateless accounts derived from JWK thumbprint. No account credentials stored server-side. Account key compromise allows certificate issuance for any allowed domain. Consider key rotation procedures for high-security deployments.
Challenge validation:
- http-01: validates web server access on port 80 (follows up to 10 redirects)
- dns-01: validates DNS control via TXT record at _acme-challenge.{domain}
- tls-alpn-01: validates TLS server access via acmeIdentifier extension (OID 1.3.6.1.5.5.7.1.31)
All challenges have short validity (default 15 minutes) and are single-use.
Certificate security:
UUID v4 serial numbers prevent collision attacks. CAA checking (when enabled) prevents unauthorized issuance per RFC 8659. CIDR restrictions limit API access. Rate limiting prevents abuse across 7 dimensions.
Nonce security:
Cryptographically random, single-use, short validity (15 minutes default). Consumed immediately on use. Prevents replay attacks per RFC 8555.
Deterministic DNS security boundary:
Token derived from cluster_key and domain name. Domain must resolve to IP within dns_deterministic_cidrs. Only for internal domains where DNS API is unavailable.
Persistent storage encryption:
All certificate data encrypted at rest with keys derived from cluster_key. Defense-in-depth on top of transport encryption. Private keys never stored in plaintext.
Distributed consistency:
Write operations use transactional patterns to prevent race conditions. Write operations require quorum consensus. Rate limits fail-open for availability.
Threshold CA key protection:
When acme_ca_threshold=true, the ACME CA private key never exists in full on any node. Generated via distributed key generation, exists only as shares. Signing requires a quorum of nodes. Shares encrypted at rest with keys derived from cluster_key. After initial key generation, membership changes use resharing (no re-generation). Fail-closed: no certificates issued until key generation completes.
OCSP/CRL security:
OCSP responses signed with CA key, cached for performance (configurable TTL). Cache invalidated immediately on certificate revocation. CRL signed with CA key, rebuilt on each revocation (not periodic). Both endpoints support CIDR-based access control.
Relationships
Module dependencies and interactions:
- certmanager: Issued certificates can be stored via certmanager for TLS serving. ACME CA is the issuer; certmanager is the consumer for cluster-wide distribution.
- autotls: AutoTLS uses the internal ACME CA for wildcard certificate issuance. When auto_tls = true, ACME is automatically enabled.
- acmeclient: The ACME client module can point at this internal CA as its ACME server, creating a fully internal PKI without external dependencies.
- dns: Used for dns-01 challenge validation (TXT record queries), CAA record checking (typed DNS lookup with "CAA" query type), and deterministic DNS token validation. SERVFAIL on CAA lookup must deny issuance per RFC 8659.
- Distributed cache: Primary storage for orders, authorizations, challenges, nonces, OCSP cache, and CRL. Distributed across cluster nodes.
- Persistent storage: Durable storage for certificates, serial number index, and CRL. Uses NATS JetStream KV with encryption at rest.
- config: Hot-reload of rate limits, allowed CIDRs, OCSP/CRL settings.
- telemetry: Structured logging and Prometheus metrics for orders, challenges, certificates, OCSP, CRL, and rate limit events.
- Rate limiting: ACME implements its own rate limiting layer (7 dimensions) independent of the global rate limiter. Both legacy and comprehensive limits can be enforced simultaneously (defense in depth).
ACME Client
Automatic TLS certificate management via Let’s Encrypt or ACME-compliant CAs with cluster-wide distribution
Overview
The ACME client module obtains and manages TLS certificates from Let’s Encrypt or any ACME-compliant Certificate Authority (including Hexon’s internal ACME CA). It handles certificate issuance, automatic renewal, cluster-wide distribution, and bootstrap fallback for high-availability deployments.
Core capabilities:
- Automatic certificate issuance via ACME protocol (RFC 8555)
- HTTP-01 challenge with dynamic port 80 listener (only during verification)
- Cluster-wide certificate distribution via persistent KV watch (NATS JetStream KV)
- Encrypted persistent storage (AES-256-GCM with cluster_key domain separation)
- Automatic renewal with configurable threshold (default: 30 days before expiry)
- ACME Renewal Information (ARI) support (RFC 9773) for optimal renewal windows
- Bootstrap fallback: self-signed temporary certificate when ACME fails on startup
- Recovery mechanism: exponential backoff retry after bootstrap fallback
- Smart startup with leader detection for cluster-wide deduplication
- Wildcard certificate coverage detection to avoid redundant issuance
- Client-side rate limiting to avoid hitting CA limits
- Startup readiness integration to prevent HTTPS binding without valid certificate
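Wildcard coverage detection, listed among the capabilities above, amounts to filtering out domains that an existing certificate already serves. The sketch below assumes the coverage rules stated in this page's troubleshooting notes: *.example.com covers api.example.com, but not the apex example.com and not nested names. The helper names `covered_by` and `domains_needing_issuance` are illustrative, not the module's real functions.

```python
def covered_by(domain: str, cert_names: list[str]) -> bool:
    """True if a certificate whose SANs are `cert_names` already covers `domain`."""
    if domain in cert_names:
        return True
    _, dot, parent = domain.partition(".")
    # A wildcard stands in for exactly one leading label.
    return bool(dot) and f"*.{parent}" in cert_names


def domains_needing_issuance(proxy_domains: list[str],
                             static_names: list[str]) -> list[str]:
    """Domains that would still need an ACME order, in their original order."""
    return [d for d in proxy_domains if not covered_by(d, static_names)]


static_names = ["*.example.com"]
wanted = ["api.example.com", "example.com", "sub.api.example.com"]
print(domains_needing_issuance(wanted, static_names))
# ['example.com', 'sub.api.example.com']
```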
Certificate modes (dual mode):
- Static: tls_cert/tls_key in config used directly (highest priority)
- ACME: acme_client.enabled = true, certificates managed automatically
Both can coexist. Static wildcards suppress redundant ACME issuance.
Dynamic port 80 challenge listener architecture:
1. Certificate issuance starts -> challenge listener started on ALL nodes
2. All nodes start port 80 listener -> Wait for quorum confirmation
3. ACME challenge tokens stored in distributed memory cache (cluster-wide)
4. ACME server validates -> Any node can respond to challenge
5. Certificate issued -> challenge listener stopped on ALL nodes
6. All nodes stop port 80 listener
Port 80 exposed only during brief verification window (~30 seconds).
Cluster coordination:
1. Leader node performs ACME protocol exchange
2. Certificate saved to Persistent Storage (encrypted)
3. Persistent KV watch automatically syncs to all cluster nodes
4. All nodes update in-memory TLS configuration via watch handler
No manual certificate distribution needed.
Storage model:
- Persistent: NATS JetStream KV with AES-256-GCM encryption
- Keys: account, cert/{base64url(domain)}, issuance/{base64url(domain)}
- Watch pattern: "cert/*" for automatic cluster sync
- Domain encoding: base64url (RFC 4648 without padding) for special characters
Config
ACME client configuration in hexon.toml under [acme_client]:
    [acme_client]
    enabled = true  # Enable ACME client
    email = "admin@example.com"  # Contact email for CA notifications
    accept_tos = true  # Accept CA terms of service (required)
    reset = false  # Delete all ACME data on startup (default: false)
    # Certificate parameters
    key_type = "ecdsa256"  # Key type: ecdsa256, ecdsa384, rsa2048, rsa4096
    renewal_threshold_hours = 720  # Renew when fewer than N hours remain (default: 30 days)
    renewal_check_interval = "6h"  # How often to check for renewals (default: 6h)
    auto_proxy_domains = true  # Auto-issue certs for proxy mapping domains
    # Challenge configuration
    challenge_port = 80  # Port for HTTP-01 challenge listener (default: 80)
    # Bootstrap fallback
    allow_bootstrap_fallback = true  # Use self-signed cert if ACME fails (default: true)
    startup_timeout = "60s"  # Max time for ACME on startup (default: 60s)
    startup_retries = 3  # Retries within timeout before fallback (default: 3)
    # ACME Renewal Information (ARI) - RFC 9773
    ari_enabled = true  # Fetch optimal renewal windows from CA (default: true)
    ari_check_interval = "6h"  # How often to refresh ARI data (default: 6h)
    # Client-side rate limits (avoid hitting CA limits)
    [acme_client.rate_limits]
    enabled = true  # Enable rate limit tracking (default: true)
    orders_per_account = 300  # Max orders per account per window (default: 300)
    orders_window = "3h"  # Orders window (default: 3h)
    certs_per_domain = 50  # Max certs per domain per window (default: 50)
    certs_per_domain_window = "168h"  # 7 days (default: 168h)
    buffer_percent = 10  # Safety margin before limit (default: 10%)
TLS certificate source priority:
1. Static certificate (tls_cert + tls_key) -- highest priority
2. AutoTLS (auto_tls = true)
3. ACME client (acme_client.enabled = true)
4. Error -- no TLS configured
Bootstrap certificate characteristics (when ACME fails):
- CN: HEXON-BOOTSTRAP-{hostname}, O: HexonGateway
- Validity: 7 days, Key: ECDSA P-256, SANs: configured hostname only
- NOT persisted -- regenerated on each startup if needed
Recovery schedule after bootstrap fallback:
1 minute -> 5 minutes -> 15 minutes -> 30 minutes -> 1 hour -> normal cycle (6h)
Hot-reloadable: renewal_threshold_hours, renewal_check_interval, rate limits. Cold (restart required): enabled, email, accept_tos, key_type, challenge_port.
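The renewal threshold and the recovery schedule are both simple time calculations. The sketch below assumes a strict "fewer than N hours remain" comparison and a retry counter that starts at 0 after the bootstrap fallback. The names `needs_renewal` and `recovery_delay` are hypothetical, and ARI-suggested windows are not modeled.

```python
from datetime import datetime, timedelta, timezone

# Documented recovery schedule after bootstrap fallback, then the normal 6h cycle.
RECOVERY_SCHEDULE = [
    timedelta(minutes=1),
    timedelta(minutes=5),
    timedelta(minutes=15),
    timedelta(minutes=30),
    timedelta(hours=1),
]
NORMAL_CYCLE = timedelta(hours=6)


def needs_renewal(not_after: datetime, now: datetime,
                  renewal_threshold_hours: int = 720) -> bool:
    """Renew when fewer than `renewal_threshold_hours` remain before expiry."""
    return not_after - now < timedelta(hours=renewal_threshold_hours)


def recovery_delay(attempt: int) -> timedelta:
    """Wait before recovery attempt number `attempt` (0-based)."""
    if attempt < len(RECOVERY_SCHEDULE):
        return RECOVERY_SCHEDULE[attempt]
    return NORMAL_CYCLE


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(needs_renewal(now + timedelta(days=29), now))   # True
print(needs_renewal(now + timedelta(days=31), now))   # False
print([str(recovery_delay(i)) for i in range(7)])
```

With the defaults, a 90-day certificate becomes eligible for renewal on day 60, and the 6h check interval bounds how late that eligibility is noticed.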
Troubleshooting
Common symptoms and diagnostic steps:
Certificate not issued on startup:
- Check if leader exists: issuance uses leader-only scheduling which requires a leader node
- Smart retry loop polls for leader with exponential backoff
- Verify startup_timeout (default 60s) and startup_retries (default 3)
- If no leader within timeout and allow_bootstrap_fallback = true, bootstrap cert used
- Check logs for "unknown UUID" errors (indicates leader-only scheduling called without leader)
Using bootstrap certificate (self-signed):
- Bootstrap certificate in use indicates ACME failure on startup
- Recovery routine runs with exponential backoff (1m, 5m, 15m, 30m, 1h, then 6h)
- Check if ACME directory is reachable from the node
- Verify DNS resolution for the configured domain
- Check ACME account creation succeeded (email, accept_tos required)
HTTP-01 challenge failing:
- Port 80 must be accessible from the ACME CA server to any cluster node
- Verify port 80 is not already in use (challenge_port config)
- Challenge listener is dynamic: only active during issuance (~30 seconds)
- Check distributed memory cache health: challenge tokens stored cluster-wide
- Challenge tokens have short TTL (5 minutes)
- Path traversal protection on challenge token validation
Certificate not renewing:
- Check renewal_threshold_hours (default 720 = 30 days)
- Verify renewal_check_interval schedule (default 6h)
- Renewal checks run via the internal scheduler automatically
- ARI-suggested renewal windows may differ from threshold-based renewal
- Check rate limits: client-side tracking prevents exceeding CA limits
Certificate not appearing on all nodes:
- Persistent KV watch subscribes to "cert/*" for automatic sync
- Check NATS JetStream KV connectivity on all nodes
- WatchEventPut: decrypt and install; WatchEventDelete: remove from cache
- Encryption key mismatch: all nodes must share same cluster_key
- Check AES-256-GCM decryption errors in logs
Rate limiting issues:
- Let's Encrypt limits: 300 orders/3h, 50 certs/domain/7d, 5 exact set/7d
- Client-side tracking uses memory module with TTLs matching CA windows
- HTTP 429 with Retry-After: short waits retried immediately, long waits deferred
- ARI-suggested renewals exempt from some rate limits
- buffer_percent (default 10%) triggers warning at 90% of limit
Wildcard coverage preventing ACME issuance:
- Static wildcard cert (e.g., *.example.com) suppresses ACME for covered domains
- *.example.com covers api.example.com but NOT example.com (apex)
- *.example.com does NOT cover sub.api.example.com (nested subdomain)
- Check if domain is covered by existing static certificate
HexonReady timeout:
- ACME client registers a readiness check for certificate availability
- HexonReady polls every 500ms with 2-minute timeout
- If timeout: either ACME failed and bootstrap disabled, or leader unavailable
- Check certificatesReady atomic flag in module state
Metrics for diagnosis:
- acmeclient_issuance_started_total, acmeclient_issuance_success_total
- acmeclient_issuance_failed_total (labels: domain, error_type)
- acmeclient_renewal_checks_total, acmeclient_certificates_expiring
- acmeclient_challenges_served_total (labels: status)
- acmeclient_certificates_loaded (gauge), acmeclient_certificate_days_until_expiry
Interpreting tool output:
'certs list':
- Healthy: All certs Status=OK, Days Left > 30
- Warning: Days Left < 30; renewal should happen automatically, check 'certs acme'
- Expiring: Days Left < 7; urgent, check ACME client health immediately
- Bootstrap: Source=bootstrap; self-signed temporary cert, ACME failed on startup
'certs acme list':
- Healthy: All domains show Status=valid with reasonable expiry
- Pending: Status=pending; issuance in progress or waiting for challenge
- Failed: Status=failed with error; check challenge port 80 accessibility
- Action: Failed → 'logs search acmeclient' for issuance error details
'autotls status':
- Healthy: Certificate valid, Days Left > 30, auto-renewal scheduled
- Renewing: Renewal in progress; certificate will update automatically
- Failed: Renewal failed; check internal ACME CA health with 'health components'
Security
Security model and hardening:
Private key protection:
All sensitive data encrypted at rest using AES-256-GCM. Key derived from cluster_key with module-specific domain separation ("acmeclient"). Defense-in-depth on top of NATS transport encryption. Private keys never stored in plaintext in persistent storage.
Challenge listener security:
Port 80 only exposed during brief verification window (~30 seconds). Challenge tokens have short TTL (5 minutes) in distributed memory cache. Token validation prevents path traversal attacks. Challenge listener management restricted to internal operations only.
Cluster synchronization security:
Automatic encrypted sync across all nodes (no manual distribution needed). All nodes must share the same cluster_key for decryption. NATS JetStream uses TLS encryption in transit.
Bootstrap certificate limitations:
Self-signed, not trusted by any external client. 7-day validity, ECDSA P-256 key. NOT persisted -- cannot be accidentally used long-term. Recovery routine continuously attempts real ACME certificate.
Access control:
Certificate issuance and renewal are restricted to internal scheduler and admin commands. Challenge listener management is internal only (during issuance). Certificate status queries are available to all services and admin commands.
Relationships
Module dependencies and interactions:
- Certificate manager: Primary consumer. ACME client stores issued certificates and distributes them cluster-wide via the certificate manager.
- TLS server: Retrieves certificates via certificate manager for TLS handshakes. Checks for ACME-managed certificates when no static TLS cert configured.
- Internal ACME CA: Can be configured as the ACME server endpoint, creating a fully internal PKI. The ACME client is a standard client -- it works with any RFC 8555 compliant CA.
- AutoTLS: Alternative for internal-only deployments. AutoTLS uses the internal CA directly. Priority: static > autotls > acmeclient.
- Distributed memory: Challenge tokens stored cluster-wide for any-node validation. Rate limit counters tracked with CA-matched TTLs.
- Persistent storage: Durable certificate storage (encrypted, NATS JetStream KV). Automatic cluster distribution via watch on certificate keys.
- Cluster coordination: Leader detection for deduplication of certificate issuance. Readiness check for startup sequencing.
- Config: Certificate mode selection (static vs ACME), renewal parameters, rate limit configuration. Hot-reload for renewal settings.
- Telemetry: Prometheus metrics for issuance, renewal, challenges, and certificate state. Structured logging for all ACME protocol interactions.
- Scheduler: Renewal checks run on configurable interval (default 6h). ARI check runs on separate interval (default 6h).
AutoTLS Certificate Management
Automatic wildcard TLS certificates from internal ACME CA with deterministic key derivation for cluster consistency
Overview
AutoTLS provides fully automatic TLS certificate management for internal domains. When enabled (auto_tls = true), it issues wildcard certificates (e.g., *.example.com) from the cluster’s internal ACME CA — no external dependencies, no manual intervention.
Core capabilities:
- Fully automatic certificate issuance and renewal — zero operator intervention required
- Wildcard certificates covering all subdomains of the configured hostname
- Deterministic key derivation (HKDF from cluster_key) for cluster-wide SPKI consistency
- Configurable renewal cycles (default: 30 days) and validity (default: 60 days)
- Seamless key rotation on each renewal cycle (new HKDF-derived key per cycle)
- Background renewal loop with automatic retry on failure
- Hostname change detection with automatic re-issuance
IMPORTANT — Automatic renewal:
Certificate renewal is fully automatic. The renewal loop runs continuously in the background, sleeping until the next cycle boundary (deterministic mode) or until the renewal threshold (random mode). When the timer fires, a new certificate is automatically issued and installed. There is NO manual rotation step. Operators do NOT need to set calendar reminders or monitor expiry dates for routine certificate management. The system handles it. Only investigate if 'autotls status' shows unexpected errors or if certificates are not renewing (check logs for issuance failures).
Each node operates independently:
1. Derives deterministic ECDSA P-256 private key from cluster_key and current cycle
2. Signs wildcard certificate via internal CA
3. Stores certificate and sets as default for SNI fallback
4. Background loop sleeps until next cycle boundary, then repeats
No cluster coordination needed: deterministic derivation from cluster_key means all nodes produce certificates with the same public key.
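The "sleep until next cycle boundary" step can be illustrated with a short sketch. The source does not say how HexonGateway numbers cycles, so the scheme below (whole renewal periods elapsed since the Unix epoch) and both function names are assumptions made purely for illustration.

```python
SECONDS_PER_DAY = 86_400


def current_cycle(now: float, renewal_days: int = 30) -> int:
    """Hypothetical cycle index: whole renewal periods elapsed since the Unix epoch."""
    return int(now // (renewal_days * SECONDS_PER_DAY))


def seconds_until_next_cycle(now: float, renewal_days: int = 30) -> float:
    """Time left until the next cycle boundary under the same assumed numbering."""
    period = renewal_days * SECONDS_PER_DAY
    next_boundary = (current_cycle(now, renewal_days) + 1) * period
    return next_boundary - now
```

Under a scheme like this, every node with a reasonably synchronized clock arrives at the same cycle number on its own, which is what lets the nodes agree on key material without exchanging messages.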
Deterministic keys
Deterministic key derivation — what it means and why it is secure:
IMPORTANT — “deterministic” refers to KEY MATERIAL ONLY, not signatures. ECDSA signatures still use standard randomness for nonce generation. This is NOT “deterministic signing” in the RFC 6979 sense.
How it works:
Private keys are derived deterministically from cluster_key, the certificate SANs, and a cycle counter. The same inputs always produce the same key. Each renewal cycle increments the counter, producing new key material.
Why deterministic keys:
1. SPKI pinning: The public key is identical across all cluster nodes, enabling external clients to pin the certificate
2. Cluster consistency: No coordination needed, since all nodes derive the same key
3. Reproducibility: Certificate can be re-derived after node restart without fetching state from other nodes
What remains random:
- ECDSA signature nonces use standard randomness, so certificate bytes may differ between nodes, but the public key is identical
- The certificate is signed with full cryptographic randomness; only the subject key pair is deterministic
Security properties:
- Key material entropy comes from cluster_key (256-bit minimum)
- Cryptographic domain separation ensures different inputs produce independent keys
- Each cycle produces an independent key: compromise of one cycle's key does not reveal other cycles' keys
- Without cluster_key (single-node mode), keys are fully random
When the admin AI or operators see “deterministic” in AutoTLS context, it means deterministic KEY DERIVATION for cluster consistency, not reduced randomness. The cryptographic security is equivalent to random key generation.
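The derivation described above can be sketched with HKDF-SHA256 (RFC 5869) built from the standard library. The source states only that keys come from cluster_key, the SANs, and a cycle counter via HKDF; the info-string layout, the empty salt, the 48-byte output, and the reduction to a P-256 scalar are all assumptions for illustration, not the product's actual scheme.

```python
import hashlib
import hmac

# Order of the NIST P-256 group; a valid private scalar d satisfies 1 <= d < P256_ORDER.
P256_ORDER = 0xFFFFFFFF00000000FFFFFFFFFFFFFFFFBCE6FAADA7179E84F3B9CAC2FC632551


def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int) -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract a PRK, then expand to `length` bytes."""
    prk = hmac.new(salt or b"\x00" * 32, ikm, hashlib.sha256).digest()
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_private_scalar(cluster_key: bytes, sans: list[str], cycle: int) -> int:
    """Hypothetical derivation: same (cluster_key, SANs, cycle) gives the same scalar."""
    # The info string binds the output to the SAN set and the cycle (domain separation).
    info = b"autotls-key|" + ",".join(sorted(sans)).encode() + b"|" + str(cycle).encode()
    # 48 bytes (64 bits more than the group order) keeps the reduction bias negligible.
    seed = hkdf_sha256(cluster_key, b"", info, 48)
    return int.from_bytes(seed, "big") % (P256_ORDER - 1) + 1
```

Two nodes calling derive_private_scalar with the same inputs obtain the same scalar, and therefore the same public key, while a different cycle value yields an unrelated scalar. Turning the scalar into a usable ECDSA key object would be left to a cryptography library.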
Config
AutoTLS configuration in hexon.toml under [service]:
[service]
hostname = "access.corp.internal"
auto_tls = true
# auto_tls_renewal = 30   # Renewal cycle in days (default: 30, range: 20-525)
# auto_tls_validity = 60  # Certificate validity in days (default: 60, range: 30-790)
Certificate timing:
Renewal cycle: how often a new certificate is issued (default: 30 days)
Validity period: how long each certificate is valid (default: 60 days)
Overlap: validity - renewal = 30 days of dual-certificate coverage
Constraint: overlap must be 20%-80% of validity (auto-adjusted if not)
When auto_tls = true:
- Internal ACME CA is automatically enabled
- ACME CA endpoints available for trust anchoring (/acme/ca-bundle)
- Static TLS certificates (tls_cert/tls_key) take priority if defined
Certificate details:
Key: ECDSA P-256 (deterministically derived when cluster_key set, random otherwise)
Serial: Deterministic (derived from SANs and cycle for consistency)
SANs: *.{base_domain} + {hostname}
CN: HEXON-AUTOTLS-*.{base_domain}
Hot-reloadable: auto_tls_renewal, auto_tls_validity (via renewalLoop detection).
Cold (restart required): auto_tls, hostname.
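The certificate timing arithmetic described in this section (overlap = validity - renewal, constrained to 20%-80% of validity) can be checked with a small sketch. The function names are hypothetical. The source says out-of-range settings are auto-adjusted but not how, so this sketch only reports whether a pair of values satisfies the constraint.

```python
def overlap_days(renewal_days: int = 30, validity_days: int = 60) -> int:
    """Days during which the outgoing and incoming certificates are both valid."""
    return validity_days - renewal_days


def overlap_within_bounds(renewal_days: int, validity_days: int) -> bool:
    """True when the overlap is between 20% and 80% of the validity period."""
    overlap = overlap_days(renewal_days, validity_days)
    # Integer arithmetic avoids floating-point edge cases at exactly 20% or 80%.
    return 20 * validity_days <= 100 * overlap <= 80 * validity_days
```

With the defaults, the overlap is 30 days, which is 50% of the 60-day validity and therefore inside the allowed band.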
Troubleshooting
Common symptoms and diagnostic steps:
Certificate not issued on startup:
- Check if internal ACME CA is healthy: 'health components'
- Verify cluster_key is set (required for deterministic mode)
- Check logs for certificate signing errors
- If startup fails, the renewal loop retries automatically
Certificate not renewing:
- Renewal is fully automatic, so check if the renewal loop is running
- Check 'autotls status' for current certificate state and days left
- Check logs for "Renewing AutoTLS certificate" messages
Hostname changed but old certificate still served:
- renewalLoop detects hostname changes and re-issues automatically
- Old certificate expires naturally (validity period)
- Force immediate renewal: 'autotls renew' (admin command)
SPKI pin changed unexpectedly:
- Pin changes on each renewal cycle (new HKDF-derived key)
- Update pinned hashes after each renewal cycle
- Pin rotation window = certificate overlap period (default: 30 days)
- Use both current and next pin for seamless rotation
Trust not established:
- Clients must trust the internal CA root certificate
- CA bundle available at /acme/ca-bundle (HTTPS endpoint)
- Add to system trust store: update-ca-certificates (Linux) or Keychain (macOS)
Interpreting ‘autotls status’ tool output:
Healthy: Certificate valid, Days Left > renewal_threshold
Renewing: Background loop triggered renewal (automatic, no action needed)
Failed: Check logs for certificate signing errors
Disabled: auto_tls = false in config
Relationships
Module dependencies and interactions:
- acme (internal CA): AutoTLS uses the internal ACME CA for certificate signing. When auto_tls = true, ACME CA is automatically enabled. Certificate signing is local: no network round-trip, no ACME protocol overhead.
- certmanager: AutoTLS stores issued certificates in the certificate manager. The certificate manager handles TLS handshake certificate selection (exact match > wildcard > default).
- config: Reads hostname, auto_tls, auto_tls_renewal, auto_tls_validity from [service] section. Detects hostname changes for automatic re-issuance.
- server: Server’s TLS handshake retrieves certificates from certmanager. Priority: static (tls_cert) > AutoTLS > ACME client > error.
- acmeclient: Alternative to AutoTLS for external CA-signed certificates. Both can coexist; static certs take highest priority.
SPIFFE Workload Identity
SPIFFE workload identity certificate issuance via modified ACME profile with pre-authenticated JWK thumbprint matching
Overview
The SPIFFE service implements HTTP handlers for workload identity certificate issuance using a modified ACME (RFC 8555) protocol. Unlike standard ACME which requires domain validation challenges (HTTP-01, DNS-01, TLS-ALPN-01), the SPIFFE profile uses pre-authenticated workloads with JWK thumbprint matching for zero-challenge certificate issuance.
Key capabilities:
- Pre-registration: Workloads configured with public keys in TOML config
- No challenges: Authorization based on JWK thumbprint matching (RFC 7638)
- CIDR enforcement: Per-workload and global IP restrictions, re-validated at every operation
- Rate limiting: Per-workload sliding window (1 hour) with eventual consistency
- Short-lived certificates: Workload-specific TTL (default 24h, max configurable)
- SPIFFE URI SAN: spiffe://{hostname}/workload/{identity}
- AllowedPeers extension: Custom OID (1.3.6.1.4.1.64753.1.1) with JSON peer list
- CRL and OCSP integration: Certificates include CRL Distribution Point and OCSP responder URL
- Hot-reload: New/removed/modified workloads applied without restart
- Workload snapshots: Orders capture config at creation time for zero-downtime updates
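The JWK thumbprint matching listed above follows RFC 7638: keep only the required key members, serialize them as JSON with sorted names and no whitespace, hash with SHA-256, and base64url-encode without padding. The sketch below starts from raw P-256 coordinates rather than a PEM file so that it needs only the standard library. The helper names are hypothetical and this is not the gateway's own code.

```python
import base64
import hashlib
import json


def b64url(data: bytes) -> str:
    """Base64url without padding, as used throughout JOSE."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def ec_p256_jwk(x: int, y: int) -> dict:
    """Build the public JWK for a P-256 point; coordinates are fixed-width 32 bytes."""
    return {
        "kty": "EC",
        "crv": "P-256",
        "x": b64url(x.to_bytes(32, "big")),
        "y": b64url(y.to_bytes(32, "big")),
    }


def jwk_thumbprint(jwk: dict) -> str:
    """RFC 7638 thumbprint: SHA-256 over the canonical JSON of the required members."""
    required = {"EC": ("crv", "kty", "x", "y"), "RSA": ("e", "kty", "n")}[jwk["kty"]]
    canonical = json.dumps(
        {name: jwk[name] for name in required},
        sort_keys=True,
        separators=(",", ":"),
    )
    return b64url(hashlib.sha256(canonical.encode("utf-8")).digest())
```

Because optional members such as kid or use are dropped before hashing, the same key always yields the same thumbprint regardless of how the client decorates its JWK.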
ACME endpoints (default prefix /acme/spiffe):
GET   /directory      ACME directory with endpoint URLs (public)
GET   /bundle         CA trust bundle in PEM format (public)
GET   /tos            Terms of Service (public)
HEAD  /new-nonce      Get replay nonce for JWS requests
POST  /new-account    Create or retrieve SPIFFE account (JWS required)
POST  /new-order      Create certificate order, auto-approved (JWS required)
POST  /order/{id}     Get order status via POST-as-GET (JWS required)
POST  /finalize/{id}  Submit CSR to finalize order (JWS required)
POST  /cert/{id}      Download issued certificate via POST-as-GET (JWS required)
POST  /revoke-cert    Revoke a certificate (JWS required)
Certificate features:
- SPIFFE URI SAN: spiffe://{hostname}/workload/{identity}
- Extended Key Usage: Server Authentication + Client Authentication
- AllowedPeers X.509 extension with authorized peer SPIFFE IDs
- CRL Distribution Point and OCSP Responder URLs embedded
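As an illustration of the identity fields above, the following sketch formats the SPIFFE URI SAN and a candidate AllowedPeers payload. The URI template comes from the text. The payload shape (a JSON array of full peer SPIFFE IDs) and the function names are assumptions, since the source says only that the extension carries a JSON peer list.

```python
import json

ALLOWED_PEERS_OID = "1.3.6.1.4.1.64753.1.1"  # custom OID quoted in the capability list


def spiffe_id(hostname: str, identity: str) -> str:
    """Format the URI SAN: spiffe://{hostname}/workload/{identity}."""
    return f"spiffe://{hostname}/workload/{identity}"


def allowed_peers_payload(hostname: str, peers: list[str]) -> bytes:
    """Hypothetical extension value: a JSON array of peer SPIFFE IDs (shape assumed)."""
    return json.dumps([spiffe_id(hostname, peer) for peer in peers]).encode("utf-8")
```

For example, spiffe_id("access.corp.internal", "api-backend") produces spiffe://access.corp.internal/workload/api-backend.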
Config
Core configuration under [spiffe] and [[spiffe.workloads]]:
[spiffe]
enabled = true                        # Enable SPIFFE workload identity service
path_prefix = "/acme/spiffe"          # HTTP endpoint prefix (default: /acme/spiffe)
allowed_cidrs = ["10.0.0.0/8"]        # Global CIDR allowlist for all workloads
default_ttl = "24h"                   # Default certificate TTL
max_ttl = "168h"                      # Maximum certificate TTL (7 days)
rate_limit_per_workload = 100         # Max certificates per workload per hour
order_timeout = "1h"                  # Order expiration timeout
allowed_key_algorithms = ["EC-P256", "EC-P384", "RSA-2048"]  # Allowed CSR key algorithms

[[spiffe.workloads]]
identity = "api-backend"              # Workload identity name (used in SPIFFE URI)
account_public_key = "-----BEGIN..."  # PEM-encoded public key for JWK thumbprint matching
sans = ["api.example.com"]            # Allowed DNS SANs for this workload
allowed_peers = ["frontend", "db"]    # Peer SPIFFE IDs embedded in AllowedPeers extension
allowed_cidrs = ["10.0.1.0/24"]       # Per-workload CIDR restriction (optional, narrows global)
ttl = "4h"                            # Per-workload TTL override (optional, must be <= max_ttl)
JWK thumbprint computation (RFC 7638):
1. Parse DER-encoded public key from account_public_key PEM
2. Convert to canonical JWK format (lexicographically sorted fields)
3. SHA-256 hash of UTF-8 encoded JWK JSON
4. Base64url encode the hash (no padding)
Workload authenticates by signing JWS requests with matching private key.
JWS verification requirements for authenticated endpoints:
- Algorithm: ES256 (ECDSA P-256) or RS256/RS384/RS512 (RSA 2048+)
- URL field must match request URL
- Nonce: single-use replay protection via Replay-Nonce header
- Signature verified against account public key
CSR validation rules:
- Maximum size: 64KB
- CSR must be self-signed (signature verified)
- SANs must match order identifiers exactly
- Key algorithm must be in allowed_key_algorithms
- RSA keys require minimum 2048 bits
Hot-reload behavior:
New workloads: immediately available for account creation and issuance
Removed workloads: in-flight orders (v2) complete via snapshot; new orders blocked
Modified workloads: TTL/CIDR/SAN/peer changes apply to new orders only
Public key changes: old thumbprint orphaned; new account required with new key
Cluster storage:
Accounts: cluster-wide storage with 90-day TTL, quorum required
Orders: cluster-wide storage with order_timeout + 10min buffer, quorum required
Certificates: cluster-wide storage with configurable TTL, quorum required
Rate limits: cluster-wide storage with 2-hour TTL, best-effort eventual consistency
Troubleshooting
Common symptoms and diagnostic steps:
Account creation fails with “unauthorized”:
- JWK thumbprint does not match any configured workload public key
- Verify thumbprint: 'step crypto jwk thumbprint workload-pub.pem'
- Check config: 'spiffe workloads' to list configured workloads
- CIDR mismatch: client IP not in global or per-workload allowed_cidrs
- Check CIDR: 'spiffe check <workload-id>' for workload details
Order creation fails:
- Account deactivated: cannot create new orders
- Rate limit exceeded: per-workload sliding window (1 hour) hit
- SAN validation: requested identifiers not in workload's sans list
- Check status: 'spiffe status' for overall SPIFFE health
Certificate finalization fails:
- CSR too large: maximum 64KB
- CSR signature invalid: CSR must be self-signed
- SAN mismatch: CSR SANs must match order identifiers exactly
- Key algorithm not allowed: check allowed_key_algorithms in config
- RSA key too small: minimum 2048 bits required
- Rate limit re-check: limit may have been reached between order and finalize
- Order expired: check order_timeout setting
Certificate retrieval returns error:
- CIDR re-validation: client IP checked again at retrieval time
- Order not yet valid: certificate issuance is asynchronous (semaphore-limited)
- Issuance queue full: check 'metrics prometheus spiffe_issuance_queue' for queue depth
CIDR enforcement issues:
- Global [spiffe].allowed_cidrs applies to ALL requests
- Per-workload allowed_cidrs narrows the global allowlist
- CIDR is re-validated at every operation (account, order, finalize, retrieve, revoke)
- Check specific IP: 'geo lookup <ip>' for network details
Rate limiting unexpectedly blocking:
- Eventual consistency: under high concurrent load from multiple cluster nodes, limits may be exceeded by up to the number of concurrent requests
- Sliding window is 1 hour; check 'metrics prometheus spiffe_ratelimit' for current usage
- Fail-open: if rate limit state is unavailable, requests are allowed
Nonce errors (badNonce):
- Nonces are single-use; retry with fresh nonce from Replay-Nonce response header
- All error responses include a new Replay-Nonce header for immediate retry
Certificate not trusted by peers:
- Trust bundle: GET /acme/spiffe/bundle returns PEM-encoded CA chain
- AllowedPeers: verify peer SPIFFE ID is in allowed_peers list
- OCSP/CRL: check 'certs ocsp' and 'certs crl' for responder status
Integration with cert-manager:
- Use SPIFFE ACME directory URL as ClusterIssuer server
- No challenge solvers needed (SPIFFE auto-approves)
- Account key secret must match configured workload public key
- cert-manager may require empty solvers list or dummy solver
Relationships
Module dependencies and interactions:
- acme: SPIFFE uses the internal ACME CA for certificate signing. The CA signing operation runs asynchronously after order finalization with a configurable concurrency semaphore (default: 50). CA signing latency tracked via spiffe_ca_signing_duration_ms metric.
- x509: Certificates include SPIFFE URI SAN, AllowedPeers custom extension, and standard Extended Key Usage (serverAuth + clientAuth). CRL Distribution Point and OCSP Responder URLs are embedded in issued certificates.
- sessions: No direct dependency. SPIFFE uses JWS-based authentication (not session cookies). Each request is independently authenticated via JWK thumbprint matching.
- directory: No direct dependency. Workload identity is managed through TOML configuration, not the directory module.
- storage: Accounts, orders, certificates, and rate limits stored cluster-wide via distributed storage with quorum requirements. Rate limits use eventual consistency.
- hotreload: Configuration changes detected via file watcher or SIGHUP. Workload snapshots (v2 orders) enable zero-downtime updates during rolling config changes.
- loadbalancer: No direct dependency. SPIFFE handles its own rate limiting via per-workload sliding window counters in cluster storage.
- geoaccess: No direct dependency. CIDR enforcement is built into the SPIFFE module itself using per-workload and global allowed_cidrs configuration.
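The built-in CIDR enforcement mentioned above, where a per-workload list narrows the global allowlist, can be sketched with Python's ipaddress module. The function name is hypothetical, and the handling of empty lists (an empty global list denies everything, an empty per-workload list adds no restriction) is an assumption that the source does not confirm.

```python
import ipaddress


def cidr_allowed(client_ip: str, global_cidrs: list[str], workload_cidrs: list[str]) -> bool:
    """Allow only if the IP is in the global allowlist and, when set, the workload's list."""
    ip = ipaddress.ip_address(client_ip)

    def in_any(cidrs: list[str]) -> bool:
        # Membership is False when the address and network IP versions differ.
        return any(ip in ipaddress.ip_network(cidr) for cidr in cidrs)

    if not in_any(global_cidrs):
        return False
    # An empty per-workload list is treated as "no extra restriction" (assumption).
    return not workload_cidrs or in_any(workload_cidrs)
```

Since the text states that CIDR is re-validated at every operation, a check of this kind would run on account, order, finalize, retrieve, and revoke requests alike, not only at account creation.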