Available on the Enterprise tier. If you don’t have access to the GCP Runner Terraform module, contact your sales representative.
The alert definitions and dashboards described here live in the `gitpod-gcp-terraform` repository under the `monitoring/` directory and are designed to work with the Prometheus metrics your runner already exposes.
Before using these alerts and dashboards, you need to configure metrics collection on your runner. See Monitoring and Metrics for setup instructions.
## Prerequisites
- A deployed GCP Runner with metrics collection enabled
- A Grafana instance (or compatible alerting system) connected to your Prometheus data source
- Access to the `gitpod-gcp-terraform` repository
## Dashboard
The repository includes a Grafana dashboard at `monitoring/dashboards/gitpod-runner-overview.json`.
### What it covers
| Section | What it shows |
|---|---|
| Version & Replicas | Runner version tracking and replica count |
| Health Status | Health checks and active instance states by lifecycle stage |
| GCP Runner Kit Interface | Environment operation durations, function calls, error rates |
| GCP API Operations | API request metrics, success rates, error rates, latency heatmaps |
| KV Store Operations | Redis/key-value store operation rates and durations |
| PubSub Operations | Message processing, acknowledgments, connection health |
| Environment Operations | Compute environment operation rates and durations |
| System Metrics | Host-level CPU, memory, disk usage, and disk I/O |
| WRI | Workspace Runtime Interface performance metrics |
The dashboard includes template variables (`$project_id`, `$region`, `$runner_name`, `$instance`) so you can filter by deployment.
### Import the dashboard
- In Grafana, go to Dashboards → Import
- Upload `gitpod-runner-overview.json` from the repository
- Select your Prometheus data source
- Configure the template variables to match your deployment
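If you manage Grafana declaratively, the same JSON file can also be loaded through Grafana's file-based dashboard provisioning instead of the UI import. A minimal sketch; the provisioning path, provider name, and dashboard directory below are assumptions, not part of the repository:

```yaml
# /etc/grafana/provisioning/dashboards/gitpod-runner.yaml (hypothetical path)
apiVersion: 1
providers:
  - name: gitpod-runner          # arbitrary provider name
    type: file
    options:
      # Copy gitpod-runner-overview.json from the repository into this
      # directory; Grafana loads it on the next provisioning scan.
      path: /var/lib/grafana/dashboards
```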
## Alerts
The repository includes 19 alert definitions at `monitoring/alerts/`, each in its own folder with an `alert.yaml` (Grafana-compatible alert rule) and a `runbook.md` (troubleshooting steps).
### Alert overview
#### Critical — immediate response required
These indicate a service outage or severe degradation.

| Alert | Condition | Impact |
|---|---|---|
| Service Down | Runner or auth proxy up metric is 0 for >1 min | Complete outage — users cannot create or manage environments |
| High Error Rate | >10% of environment operations failing over 5 min | Users experiencing environment creation failures |
| High Latency | 95th percentile operation time >5 min | Slow environment operations |
| Goroutine Panics | Application panics detected | Potential service instability |
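To make the conditions above concrete, here is the Service Down check sketched as a Prometheus-style alerting rule. This is an illustration rather than the repository's actual `alert.yaml` (which uses Grafana's rule format), and the `job` label is an assumption to be matched against your scrape config:

```yaml
groups:
  - name: gitpod-runner-critical
    rules:
      - alert: ServiceDown
        # Fires when the runner's (or auth proxy's) `up` series has been 0
        # for more than one minute, i.e. the scrape target is unreachable.
        expr: up{job="gitpod-runner"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Gitpod GCP runner is down; environment operations are unavailable"
```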
#### High — prompt attention required
These indicate degraded performance or functionality.

| Alert | Condition |
|---|---|
| API Rate Limiting | Hitting GCP API rate limits |
| PubSub Backlog | >1000 unprocessed messages |
| PubSub Connection Health | PubSub connectivity issues |
| Circuit Breaker Open | Circuit breaker protecting system from cascading failures |
| Redis Connection Issues | Redis connectivity problems |
#### Medium — monitor and track
These indicate reduced capacity or resource constraints.

| Alert | Condition |
|---|---|
| High CPU Usage | CPU usage >80% for extended period |
| High Memory Usage | Memory usage >80% for extended period |
| High Disk Usage | Disk usage >85% for extended period |
| Network Connection Health | Network connectivity issues |
| Network Errors | High rate of network errors |
| Registry Health | Container registry connectivity issues |
| Zone Capacity Issues | GCP zone unavailable or at capacity |
| Quota Exceeded | GCP resource quotas hit limits |
#### Info — optimization opportunities
| Alert | Condition |
|---|---|
| High Process Memory Usage | Process memory usage >1GB for extended period |
| High Goroutine Count | >1000 active goroutines |
### Import alerts into Grafana
Each alert folder contains an `alert.yaml` file:
- In Grafana, go to Alerting → Alert Rules
- Click Import
- Upload the `alert.yaml` from the alert folder you want (e.g., `service-down/alert.yaml`)
- Configure notification channels for the alert’s severity level
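If you prefer file-based provisioning over UI import, Grafana can also load alert rules from its provisioning directory. A sketch under the assumption that the repository's `alert.yaml` files follow, or can be adapted to, Grafana's alert-provisioning schema; the path, folder, and group names here are placeholders:

```yaml
# /etc/grafana/provisioning/alerting/gitpod-runner.yaml (hypothetical path)
apiVersion: 1
groups:
  - orgId: 1
    name: gitpod-runner-alerts    # evaluation group name, arbitrary
    folder: Gitpod Runner         # Grafana folder the rules appear under
    interval: 1m                  # evaluation interval
    rules: []                     # paste the rule entries from each alert.yaml here
```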
### Customize thresholds
The default thresholds work for most deployments. Adjust them based on your scale:

- Smaller deployments may need lower thresholds to catch issues earlier
- Larger deployments may need higher thresholds to reduce noise
- Development environments may want less sensitive alerts
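For example, a smaller deployment might tighten the High Error Rate threshold from 10% to 5% to catch failures earlier. The metric names below are illustrative placeholders, not the runner's actual series names; edit the equivalent expression in the relevant `alert.yaml`:

```yaml
- alert: HighErrorRate
  # Default fires when >10% of environment operations fail over 5 minutes;
  # 0.05 lowers that to 5%. Metric names are hypothetical.
  expr: |
    sum(rate(environment_operations_errors_total[5m]))
      /
    sum(rate(environment_operations_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
```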
## Runbooks
Each alert folder also contains a `runbook.md` with step-by-step troubleshooting instructions. Before using a runbook, set up the environment variables it references.

Runbooks use `gcloud compute ssh` commands to inspect the runner instance and include resolution steps and escalation procedures.
## Notification channels
Configure notification channels in Grafana based on alert severity:

| Severity | Suggested channels |
|---|---|
| Critical | PagerDuty, SMS, phone |
| High | Slack, email |
| Medium | Email, ticket creation |
| Info | Email, dashboard review |
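If your alerts flow through Prometheus Alertmanager rather than Grafana contact points, the severity mapping above can be expressed as a routing tree. The receiver names below are placeholders for integrations you define yourself:

```yaml
route:
  receiver: email-team            # default for medium/info severities
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-oncall  # pages the on-call engineer
    - matchers:
        - severity="high"
      receiver: slack-alerts      # posts to an alerts channel
receivers:
  - name: email-team
  - name: pagerduty-oncall
  - name: slack-alerts
```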
## Next steps
- Monitoring and Metrics — Enable metrics collection and see all available metrics
- Troubleshooting GCP Runners — Diagnose common runner issues
- `monitoring/` on GitHub — Browse alert definitions and dashboard source