
Observability

Shardlyn exposes Prometheus-format metrics for monitoring the control plane, agents, and running instances.

Metrics Architecture

┌─────────────────┐     scrape      ┌────────────────┐
│   Control Plane │ ───────────────→│   Prometheus   │
│   :8080/metrics │                 │                │
└─────────────────┘                 └───────┬────────┘
                                            │
┌─────────────────┐     scrape              │
│      Agent      │ ──────────────────→     │
│   :9100/metrics │                         │
└─────────────────┘                         │
                                            ▼
                                   ┌─────────────────┐
                                   │     Grafana     │
                                   │   Dashboards    │
                                   └─────────────────┘
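
On each scrape, Prometheus issues an HTTP GET against the target's /metrics path and parses the text exposition format it gets back. The following is a minimal Python sketch of that step, which can be handy for checking an endpoint by hand. It assumes plain HTTP and a reachable host; `parse_samples` and `scrape` are illustrative names rather than part of Shardlyn, and the parser deliberately skips lines with timestamps and does not handle escaped quotes in label values.

```python
import re
import urllib.request

# One sample line: metric name, optional {labels}, then the value.
# Simplification: lines carrying a trailing timestamp do not match and are skipped.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_samples(text):
    """Parse Prometheus text exposition format into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # blank, HELP, or TYPE line
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # unsupported syntax for this sketch
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def scrape(host, port, timeout=5.0):
    """Fetch and parse /metrics from one target (assumes plain HTTP)."""
    url = f"http://{host}:{port}/metrics"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_samples(response.read().decode("utf-8"))
```

For example, `scrape("localhost", 8080)` would return the control plane samples, assuming the control plane listens on the local machine.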

What to Monitor

Control Plane

  • Request volume: HTTP requests per second by endpoint
  • Latency: Response time percentiles (p50, p95, p99)
  • Error rate: 4xx and 5xx responses
  • Active WebSocket connections: Console and log streams

Nodes

  • Health status: Healthy, unhealthy, offline
  • Heartbeat timing: Last seen, heartbeat latency
  • Resource utilization: CPU, memory, disk usage

Instances

  • State distribution: Running, stopped, error counts
  • State transitions: Creates, stops, deletes per minute
  • Failure rates: Error states and recovery time

Dashboard Quick View

The Shardlyn Dashboard provides at-a-glance metrics:

Key Metrics Displayed

  • Total Nodes: Healthy vs total count
  • Total Instances: Running vs total count
  • System Status: Quick health indicators
  • Recent Activity: Latest instance deployments and state changes

Instance Resource Meters

Each running instance shows:

  • CPU usage percentage with visual meter
  • Memory usage vs limit
  • Network I/O activity
  • Uptime duration

Prometheus Metrics

Control Plane Metrics (/metrics)

# HTTP request metrics
shardlyn_http_requests_total{method, path, status}
shardlyn_http_request_duration_seconds{method, path}

# Provisioning metrics
shardlyn_provision_requests_total{provider, status}
shardlyn_provision_duration_seconds{provider}

# WebSocket metrics
shardlyn_websocket_connections_active{type}
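
The request counter is the usual basis for an error-rate figure: take its value at two points in time and divide the increase in 5xx responses by the increase in all responses. The Python sketch below shows that arithmetic. It assumes the `status` label holds the numeric HTTP status code as a string (for example "500"), which the metric listing does not state, and the function names are illustrative. Counter resets, such as after a control plane restart, are not handled.

```python
from collections import defaultdict


def totals_by_status(samples):
    """Sum shardlyn_http_requests_total values per status label."""
    totals = defaultdict(float)
    for labels, value in samples:
        totals[labels["status"]] += value
    return totals


def error_ratio(before, after):
    """Fraction of requests between two scrapes that returned a 5xx status.

    `before` and `after` are lists of (labels, value) pairs for
    shardlyn_http_requests_total, taken at two points in time.
    """
    start, end = totals_by_status(before), totals_by_status(after)
    increase = {status: end[status] - start.get(status, 0.0) for status in end}
    total = sum(increase.values())
    if total <= 0:
        return 0.0  # no traffic in the window
    errors = sum(v for status, v in increase.items() if status.startswith("5"))
    return errors / total
```

In Prometheus itself, the same ratio would typically be written with `rate()` over a time window instead of two manual snapshots.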

Agent Metrics (:9100/metrics)

# Container metrics
shardlyn_container_cpu_usage_percent{instance_id}
shardlyn_container_memory_usage_bytes{instance_id}
shardlyn_container_memory_limit_bytes{instance_id}

# Node metrics
shardlyn_node_cpu_usage_percent
shardlyn_node_memory_usage_bytes
shardlyn_node_disk_usage_bytes

# Heartbeat metrics
shardlyn_heartbeat_duration_seconds
shardlyn_heartbeat_errors_total
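
The container memory gauges are most useful in combination: usage divided by limit, joined on `instance_id`, gives the "memory usage vs limit" figure per instance. A small Python sketch of that join follows. The function name is illustrative, and treating a missing or zero limit as "no limit set" is an assumption, not documented agent behavior.

```python
def memory_utilization(usage_bytes, limit_bytes):
    """Return {instance_id: usage / limit} for instances with a positive limit.

    Both arguments map instance_id to the latest gauge value, taken from
    shardlyn_container_memory_usage_bytes and
    shardlyn_container_memory_limit_bytes respectively.
    """
    ratios = {}
    for instance_id, used in usage_bytes.items():
        limit = limit_bytes.get(instance_id, 0)
        if limit > 0:  # assumption: missing or zero limit means no limit is set
            ratios[instance_id] = used / limit
    return ratios
```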

Grafana Dashboards

Import the included dashboards from deploy/grafana/:

  1. Shardlyn Overview: High-level system health
  2. Nodes: Per-node resource utilization
  3. Instances: Runtime workload performance metrics
  4. Control Plane: API performance and errors

Alerting Recommendations

Configure alerts for:

  • Node offline: No heartbeat for > 5 minutes
  • High error rate: > 1% of responses are 5xx
  • Instance failures: Error state > 5 minutes
  • Resource exhaustion: CPU/memory > 90% sustained
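
The thresholds above can be sketched as a simple check in Python. This is only an illustration of the conditions, not how Shardlyn or Prometheus evaluates alerts: the `Snapshot` type, its field names, and the alert names are hypothetical, and the "sustained" qualifier for resource exhaustion is not modeled because no duration is specified.

```python
from dataclasses import dataclass

NODE_OFFLINE_AFTER_S = 5 * 60    # no heartbeat for > 5 minutes
MAX_5XX_RATIO = 0.01             # > 1% 5xx responses
INSTANCE_ERROR_AFTER_S = 5 * 60  # error state > 5 minutes
MAX_RESOURCE_RATIO = 0.90        # CPU/memory > 90%


@dataclass
class Snapshot:
    """Hypothetical summary of current state; field names are illustrative."""
    seconds_since_heartbeat: float
    error_ratio_5xx: float
    seconds_in_error_state: float
    cpu_ratio: float
    memory_ratio: float


def firing_alerts(s):
    """Return the names of the recommended alerts whose condition holds."""
    alerts = []
    if s.seconds_since_heartbeat > NODE_OFFLINE_AFTER_S:
        alerts.append("node_offline")
    if s.error_ratio_5xx > MAX_5XX_RATIO:
        alerts.append("high_error_rate")
    if s.seconds_in_error_state > INSTANCE_ERROR_AFTER_S:
        alerts.append("instance_failure")
    if max(s.cpu_ratio, s.memory_ratio) > MAX_RESOURCE_RATIO:
        alerts.append("resource_exhaustion")
    return alerts
```

In a real deployment these conditions would normally live in Prometheus alerting rules, where a `for:` duration can express how long a condition must hold before the alert fires.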

Log Settings (Admin)

Configure log collection in Settings > Logs:

  • Retention period: How long to keep logs
  • Log level: Verbosity of control plane logs
  • External forwarding: Send to external log aggregator
