Observability

Shardlyn exposes Prometheus metrics for monitoring the control plane, agents, and running instances.

Metrics Architecture

┌─────────────────┐     scrape      ┌────────────────┐
│   Control Plane │ ───────────────→│   Prometheus   │
│   :8080/metrics │                 │                │
└─────────────────┘                 └───────┬────────┘
                                            │
┌─────────────────┐     scrape              │
│      Agent      │ ──────────────────→     │
│   :9100/metrics │                         │
└─────────────────┘                         │
                                            ▼
                                   ┌─────────────────┐
                                   │     Grafana     │
                                   │   Dashboards    │
                                   └─────────────────┘
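
To make the scrape paths in the diagram concrete, here is a minimal sketch of what a scraper does on each cycle. The ports come from the diagram. The host names, the use of plain HTTP, and the helper names `metrics_urls` and `scrape` are assumptions for illustration, not part of Shardlyn.

```python
from urllib.request import urlopen

CONTROL_PLANE_PORT = 8080  # from the diagram above
AGENT_PORT = 9100          # from the diagram above


def metrics_urls(control_plane_host, agent_hosts):
    """Return the /metrics URLs a scraper would poll (plain HTTP is an assumption)."""
    urls = [f"http://{control_plane_host}:{CONTROL_PLANE_PORT}/metrics"]
    urls += [f"http://{host}:{AGENT_PORT}/metrics" for host in agent_hosts]
    return urls


def scrape(url, timeout=5.0):
    """Fetch one endpoint and return the raw metrics text."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

In practice Prometheus does this polling itself; the sketch is only useful for ad hoc checks that an endpoint is reachable.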

What to Monitor

Control Plane

Request volume: HTTP requests per second by endpoint
Latency: Response time percentiles (p50, p95, p99)
Error rate: 4xx and 5xx responses
Active WebSocket connections: Console and log streams
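
As a rough illustration of the latency percentiles listed above, the sketch below computes p50, p95, and p99 from raw response times using the nearest-rank method. The function name and the idea of holding raw samples in memory are assumptions; a Prometheus setup would normally estimate percentiles from aggregated data instead.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def latency_summary(samples):
    """Return the three percentiles named in the list above."""
    return {f"p{p}": percentile(samples, p) for p in (50, 95, 99)}
```

Tail percentiles such as p99 need enough samples to be meaningful; with only a handful of requests, p95 and p99 collapse to the maximum.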

Nodes

Health status: Healthy, unhealthy, offline
Heartbeat timing: Last seen, heartbeat latency
Resource utilization: CPU, memory, disk usage

Instances

State distribution: Running, stopped, error counts
State transitions: Creates, stops, deletes per minute
Failure rates: Error states and recovery time

Dashboard Quick View

The Shardlyn Dashboard provides at-a-glance metrics.

Key Metrics Displayed

Total Nodes: Healthy vs total count
Total Instances: Running vs total count
System Status: Quick health indicators
Recent Activity: Latest instance deployments and state changes

Instance Resource Meters

Each running instance shows:

CPU usage percentage with visual meter
Memory usage vs limit
Network I/O activity
Uptime duration

Prometheus Metrics

Control Plane Metrics (`/metrics`)

# HTTP request metrics
shardlyn_http_requests_total{method, path, status}
shardlyn_http_request_duration_seconds{method, path}

# Provisioning metrics
shardlyn_provision_requests_total{provider, status}
shardlyn_provision_duration_seconds{provider}

# WebSocket metrics
shardlyn_websocket_connections_active{type}
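
The request counter above can be turned into a 5xx error ratio. The following is a minimal sketch, assuming the endpoint serves the standard Prometheus text format and that `shardlyn_http_requests_total` is a cumulative counter. The parser handles only simple sample lines, and the names `parse_samples` and `server_error_ratio` are hypothetical.

```python
import re

# Matches simple sample lines such as: name{a="x",b="y"} 12
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (name, labels, value) per sample line; comments and blanks are skipped."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = SAMPLE_RE.match(line)
        if m is None:
            continue
        labels = dict(LABEL_RE.findall(m.group("labels") or ""))
        yield m.group("name"), labels, float(m.group("value"))


def server_error_ratio(text):
    """Fraction of counted requests whose status label starts with '5'."""
    total = errors = 0.0
    for name, labels, value in parse_samples(text):
        if name != "shardlyn_http_requests_total":
            continue
        total += value
        if labels.get("status", "").startswith("5"):
            errors += value
    return errors / total if total else 0.0
```

Because counters accumulate over the life of the process, a single scrape gives a lifetime ratio. For a recent-window ratio, subtract the values of an earlier scrape from a later one before dividing.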

Agent Metrics (`:9100/metrics`)

# Container metrics
shardlyn_container_cpu_usage_percent{instance_id}
shardlyn_container_memory_usage_bytes{instance_id}
shardlyn_container_memory_limit_bytes{instance_id}

# Node metrics
shardlyn_node_cpu_usage_percent
shardlyn_node_memory_usage_bytes
shardlyn_node_disk_usage_bytes

# Heartbeat metrics
shardlyn_heartbeat_duration_seconds
shardlyn_heartbeat_errors_total
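
The two container memory gauges above can be combined into a per-instance utilization figure, which is what "memory usage vs limit" amounts to. This is a sketch under assumptions: samples arrive as already-parsed `(name, labels, value)` tuples, a limit of zero is treated as "no limit", and `memory_utilization` is a hypothetical helper.

```python
USAGE = "shardlyn_container_memory_usage_bytes"
LIMIT = "shardlyn_container_memory_limit_bytes"


def memory_utilization(samples):
    """Map instance_id to usage/limit from (name, labels, value) tuples."""
    usage, limit = {}, {}
    for name, labels, value in samples:
        instance_id = labels.get("instance_id")
        if instance_id is None:
            continue
        if name == USAGE:
            usage[instance_id] = value
        elif name == LIMIT:
            limit[instance_id] = value
    # Instances with a missing or zero limit are skipped (assumed unlimited).
    return {
        iid: usage[iid] / limit[iid]
        for iid in usage
        if limit.get(iid, 0) > 0
    }
```

A result of `0.9` or higher for an instance corresponds to the 90% memory threshold suggested under Alerting Recommendations.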

Grafana Dashboards

Import the included dashboards from `deploy/grafana/`:

Shardlyn Overview: High-level system health
Nodes: Per-node resource utilization
Instances: Runtime workload performance metrics
Control Plane: API performance and errors

Alerting Recommendations

Configure alerts for:

Node offline: No heartbeat for more than 5 minutes
High error rate: More than 1% of responses are 5xx
Instance failures: Instance in error state for more than 5 minutes
Resource exhaustion: CPU or memory usage sustained above 90%
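
The thresholds above can be expressed as a small check, shown here as a sketch rather than a real alerting setup. The `Snapshot` fields and alert names are hypothetical, and the "sustained" condition is not modeled: this evaluates a single point in time, whereas a real rule would require the condition to hold over a window.

```python
from dataclasses import dataclass

FIVE_MINUTES = 5 * 60  # seconds


@dataclass
class Snapshot:
    """Hypothetical point-in-time inputs; field names are illustrative."""
    seconds_since_heartbeat: float
    requests_total: float
    requests_5xx: float
    seconds_in_error_state: float
    cpu_percent: float
    memory_percent: float


def firing_alerts(s):
    """Return the names of the recommended alerts that would fire for this snapshot."""
    alerts = []
    if s.seconds_since_heartbeat > FIVE_MINUTES:
        alerts.append("node_offline")
    if s.requests_total > 0 and s.requests_5xx / s.requests_total > 0.01:
        alerts.append("high_error_rate")
    if s.seconds_in_error_state > FIVE_MINUTES:
        alerts.append("instance_failure")
    if s.cpu_percent > 90 or s.memory_percent > 90:
        alerts.append("resource_exhaustion")
    return alerts
```

In a Prometheus deployment these conditions would normally live in alerting rules evaluated by the server, with a duration attached to each rule to cover the "for more than 5 minutes" and "sustained" wording.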

Log Settings (Admin)

Configure log collection in Settings > Logs:

Retention period: How long to keep logs
Log level: Verbosity of control plane logs
External forwarding: Send to external log aggregator

Next Steps

Metrics Reference — Full list of Prometheus metrics exposed by the control plane and agents
Alerting & Notifications — Define alert rules based on metrics thresholds
Architecture — Understand how the control plane, agents, and monitoring fit together
