Observability
Shardlyn exposes Prometheus metrics for monitoring the control plane, agents, and running instances.
Metrics Architecture
┌─────────────────┐     scrape      ┌────────────────┐
│  Control Plane  │ ───────────────→│   Prometheus   │
│  :8080/metrics  │                 │                │
└─────────────────┘                 └───────┬────────┘
                                            │
┌─────────────────┐     scrape              │
│      Agent      │ ──────────────────────→ │
│  :9100/metrics  │                         │
└─────────────────┘                         │
                                            ▼
                                   ┌─────────────────┐
                                   │     Grafana     │
                                   │   Dashboards    │
                                   └─────────────────┘
What to Monitor
Control Plane
- Request volume: HTTP requests per second by endpoint
- Latency: Response time percentiles (p50, p95, p99)
- Error rate: 4xx and 5xx responses
- Active WebSocket connections: Console and log streams
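As a rough illustration of how request volume and error rate can be derived from a cumulative counter such as `shardlyn_http_requests_total`, the sketch below compares two readings taken some seconds apart. The `Snapshot` type and `summarize` function are hypothetical helpers, not part of Shardlyn, and the sketch assumes the counter has already been summed over `method` and `path` so that only the status code remains as a key.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical container: counter values read at one point in time."""
    timestamp: float            # seconds since the epoch
    counts: dict[str, float]    # HTTP status code -> cumulative request count


def _increase(prev: float, curr: float) -> float:
    # Counters only go up; a drop means the process restarted, so the
    # current value is everything counted since the reset.
    return curr - prev if curr >= prev else curr


def summarize(prev: Snapshot, curr: Snapshot) -> dict[str, float]:
    """Requests per second and 4xx/5xx ratios between two snapshots."""
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        raise ValueError("snapshots must be in chronological order")

    total = 0.0
    by_class = {"4xx": 0.0, "5xx": 0.0}
    for status, value in curr.counts.items():
        inc = _increase(prev.counts.get(status, 0.0), value)
        total += inc
        status_class = status[0] + "xx"   # "404" -> "4xx"
        if status_class in by_class:
            by_class[status_class] += inc

    return {
        "requests_per_second": total / elapsed,
        "ratio_4xx": by_class["4xx"] / total if total else 0.0,
        "ratio_5xx": by_class["5xx"] / total if total else 0.0,
    }
```

In a Prometheus setup the same arithmetic is normally done by the query layer rather than by hand; the sketch only shows what the numbers mean.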
Nodes
- Health status: Healthy, unhealthy, offline
- Heartbeat timing: Last seen, heartbeat latency
- Resource utilization: CPU, memory, disk usage
Instances
- State distribution: Running, stopped, error counts
- State transitions: Creates, stops, deletes per minute
- Failure rates: Error states and recovery time
Dashboard Quick View
The Shardlyn Dashboard provides at-a-glance metrics:
Key Metrics Displayed
- Total Nodes: Healthy vs total count
- Total Instances: Running vs total count
- System Status: Quick health indicators
- Recent Activity: Latest instance deployments and state changes
Instance Resource Meters
Each running instance shows:
- CPU usage percentage with visual meter
- Memory usage vs limit
- Network I/O activity
- Uptime duration
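The memory meter can be thought of as usage divided by limit. The sketch below shows one way such a value could be computed from the agent's `shardlyn_container_memory_usage_bytes` and `shardlyn_container_memory_limit_bytes` metrics and drawn as a text meter. Both function names are hypothetical, and treating a non-positive limit as "no limit configured" is an assumption, not documented dashboard behavior.

```python
from typing import Optional


def memory_percent(usage_bytes: float, limit_bytes: float) -> Optional[float]:
    """Memory usage as a percentage of the limit, or None if no limit is set."""
    if limit_bytes <= 0:   # assumption: zero or negative means "unlimited"
        return None
    return 100.0 * usage_bytes / limit_bytes


def render_meter(percent: float, width: int = 20) -> str:
    """Draw a fixed-width text meter; the bar is clamped to 0-100%."""
    clamped = min(max(percent, 0.0), 100.0)
    filled = round(width * clamped / 100.0)
    return "[" + "#" * filled + "-" * (width - filled) + f"] {percent:5.1f}%"
```

For example, `render_meter(memory_percent(256 * 1024**2, 1024**3))` describes an instance using 256 MiB of a 1 GiB limit.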
Prometheus Metrics
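Since Prometheus scrapes both endpoints, they presumably serve the standard Prometheus text exposition format. As a quick way to inspect an endpoint without a Prometheus server, the sketch below fetches and parses that format using only the Python standard library. `parse_metrics` and `scrape` are hypothetical helpers; the parser is deliberately minimal (it leaves escape sequences in label values untouched and ignores timestamps) and is not a substitute for a full client library.

```python
import re
from urllib.request import urlopen

# Metric name, optional {labels}, then the sample value.
# A trailing timestamp, if present, is ignored.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> list[tuple[str, dict[str, str], float]]:
    """Parse exposition text into (name, labels, value) samples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # blanks, # HELP, # TYPE
            continue
        match = _SAMPLE.match(line)
        if match is None:                      # skip lines we cannot parse
            continue
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def scrape(url: str, timeout: float = 5.0) -> list[tuple[str, dict[str, str], float]]:
    """Fetch a /metrics endpoint and return its parsed samples."""
    with urlopen(url, timeout=timeout) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

Assuming a control plane reachable on localhost, `scrape("http://localhost:8080/metrics")` would return its samples; an agent would be read the same way on port 9100.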
Control Plane Metrics (/metrics)
# HTTP request metrics
shardlyn_http_requests_total{method, path, status}
shardlyn_http_request_duration_seconds{method, path}
# Provisioning metrics
shardlyn_provision_requests_total{provider, status}
shardlyn_provision_duration_seconds{provider}
# WebSocket metrics
shardlyn_websocket_connections_active{type}
Agent Metrics (:9100/metrics)
# Container metrics
shardlyn_container_cpu_usage_percent{instance_id}
shardlyn_container_memory_usage_bytes{instance_id}
shardlyn_container_memory_limit_bytes{instance_id}
# Node metrics
shardlyn_node_cpu_usage_percent
shardlyn_node_memory_usage_bytes
shardlyn_node_disk_usage_bytes
# Heartbeat metrics
shardlyn_heartbeat_duration_seconds
shardlyn_heartbeat_errors_total
Grafana Dashboards
Import the included dashboards from deploy/grafana/:
- Shardlyn Overview: High-level system health
- Nodes: Per-node resource utilization
- Instances: Runtime workload performance metrics
- Control Plane: API performance and errors
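Dashboards can be imported through the Grafana UI, but scripting the import may be convenient for repeatable setups. The sketch below posts each file to Grafana's `/api/dashboards/db` HTTP endpoint. It assumes the files in `deploy/grafana/` are plain dashboard JSON models with a `.json` extension and that a Grafana API token is available; `build_import_request` and `import_dashboards` are hypothetical helpers, not tooling shipped with Shardlyn.

```python
import json
from pathlib import Path
from urllib.request import Request, urlopen


def build_import_request(grafana_url: str, token: str, dashboard_path: Path) -> Request:
    """Build the POST request that creates or updates one dashboard."""
    dashboard = json.loads(dashboard_path.read_text(encoding="utf-8"))
    # Clear the numeric id so Grafana matches on uid or creates a new dashboard.
    dashboard["id"] = None
    payload = {"dashboard": dashboard, "overwrite": True}
    return Request(
        url=grafana_url.rstrip("/") + "/api/dashboards/db",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def import_dashboards(grafana_url: str, token: str,
                      directory: str = "deploy/grafana") -> list[str]:
    """Import every *.json file in the directory; return the imported names."""
    imported = []
    for path in sorted(Path(directory).glob("*.json")):
        with urlopen(build_import_request(grafana_url, token, path)) as response:
            response.read()   # urlopen raises HTTPError on a non-2xx status
        imported.append(path.name)
    return imported
```

Dashboards exported "for external sharing" carry datasource input placeholders that this endpoint does not resolve, so such files may still need the UI import flow.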
Alerting Recommendations
Configure alerts for:
- Node offline: No heartbeat for > 5 minutes
- High error rate: 5xx responses exceed 1% of all requests
- Instance failures: Error state > 5 minutes
- Resource exhaustion: CPU/memory > 90% sustained
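Several of these conditions depend on a threshold being exceeded for a period of time rather than at a single instant. The sketch below illustrates that "sustained" logic, similar in spirit to the `for` clause of a Prometheus alerting rule. `SustainedAlert` and `node_offline` are hypothetical names, and the 300-second hold used for resource exhaustion in the usage note is an assumption, since no duration is stated for that alert.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SustainedAlert:
    """Fires once a condition has held continuously for `hold_seconds`."""
    name: str
    hold_seconds: float
    _since: Optional[float] = field(default=None, repr=False)

    def update(self, breached: bool, now: float) -> bool:
        """Record one observation; return True if the alert is firing."""
        if not breached:
            self._since = None      # condition cleared, restart the clock
            return False
        if self._since is None:
            self._since = now       # first observation of the breach
        return now - self._since >= self.hold_seconds


def node_offline(last_heartbeat: float, now: float,
                 max_silence: float = 300.0) -> bool:
    """True when no heartbeat has been seen for more than `max_silence` seconds."""
    return now - last_heartbeat > max_silence
```

For resource exhaustion, one could create `SustainedAlert("resource-exhaustion", hold_seconds=300)` and call `update(cpu_percent > 90.0, now)` on every sample; a single reading below the threshold resets the timer.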
Log Settings (Admin)
Configure log collection in Settings > Logs:
- Retention period: How long to keep logs
- Log level: Verbosity of control plane logs
- External forwarding: Send to external log aggregator
Next Steps
- Metrics Reference — Full list of Prometheus metrics exposed by the control plane and agents
- Alerting & Notifications — Define alert rules based on metrics thresholds
- Architecture — Understand how the control plane, agents, and monitoring fit together