Available on the Enterprise tier. If you don’t have access to the GCP Runner Terraform module, contact your sales representative.
Ona provides pre-built Grafana alerts and a dashboard for GCP Runners. These live in the gitpod-gcp-terraform repository under the monitoring/ directory and are designed to work with the Prometheus metrics your runner already exposes.

Prerequisites

Before using these alerts and dashboards, configure metrics collection on your runner. See Monitoring and Metrics for setup instructions.

Dashboard

The repository includes a Grafana dashboard at monitoring/dashboards/gitpod-runner-overview.json.

What it covers

| Section | What it shows |
| --- | --- |
| Version & Replicas | Runner version tracking and replica count |
| Health Status | Health checks and active instance states by lifecycle stage |
| GCP Runner Kit Interface | Environment operation durations, function calls, error rates |
| GCP API Operations | API request metrics, success rates, error rates, latency heatmaps |
| KV Store Operations | Redis/key-value store operation rates and durations |
| PubSub Operations | Message processing, acknowledgments, connection health |
| Environment Operations | Compute environment operation rates and durations |
| System Metrics | Host-level CPU, memory, disk usage, and disk I/O |
| WRI | Workspace Runtime Interface performance metrics |

The dashboard uses template variables ($project_id, $region, $runner_name, $instance) so you can filter by deployment.

Import the dashboard

  1. In Grafana, go to Dashboards → Import
  2. Upload gitpod-runner-overview.json from the repository
  3. Select your Prometheus data source
  4. Configure the template variables to match your deployment

Alerts

The repository includes 19 alert definitions at monitoring/alerts/, each in its own folder with an alert.yaml (Grafana-compatible alert rule) and a runbook.md (troubleshooting steps).
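
Because every alert folder is expected to hold both files, a quick local check can confirm that a checkout is complete before you import anything. The sketch below is illustrative only: `check_alert_folders` is a hypothetical helper, not something shipped in the repository, and it assumes you point it at the monitoring/alerts directory of a local clone.

```shell
# Hypothetical helper (not part of the repository): report alert folders
# that lack an alert.yaml or a runbook.md.
# Usage: check_alert_folders path/to/monitoring/alerts
check_alert_folders() {
  alerts_dir=$1
  incomplete=0
  for folder in "$alerts_dir"/*/; do
    # Skip the unexpanded glob when the directory has no subfolders.
    [ -d "$folder" ] || continue
    for required in alert.yaml runbook.md; do
      if [ ! -f "$folder$required" ]; then
        echo "missing $required in $folder" >&2
        incomplete=$((incomplete + 1))
      fi
    done
  done
  # Succeed only when no file is missing.
  [ "$incomplete" -eq 0 ]
}
```

Running `check_alert_folders monitoring/alerts` from the repository root prints one line per missing file and returns a non-zero status if any folder is incomplete.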

Alert overview

Critical — immediate response required

These indicate a service outage or severe degradation.

| Alert | Condition | Impact |
| --- | --- | --- |
| Service Down | Runner or auth proxy up metric is 0 for >1 min | Complete outage: users cannot create or manage environments |
| High Error Rate | >10% of environment operations failing over 5 min | Users experiencing environment creation failures |
| High Latency | 95th percentile operation time >5 min | Slow environment operations |
| Goroutine Panics | Application panics detected | Potential service instability |
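
To make the High Error Rate condition concrete, the sketch below checks whether a failure count crosses the 10% threshold for a single 5-minute window. It is a worked illustration of the arithmetic only, not the rule expression used in the alert's alert.yaml; the function name and its two inputs are hypothetical.

```shell
# Hypothetical helper: succeed when more than 10% of operations failed.
# Arguments: number of failed operations, total operations in the window.
error_rate_exceeded() {
  failed=$1
  total=$2
  # No operations means there is no error rate to evaluate.
  [ "$total" -gt 0 ] || return 1
  # Integer form of failed / total > 0.10, avoiding floating point.
  [ $((failed * 100)) -gt $((total * 10)) ]
}

# 12 failures out of 100 operations is 12%, which is above the threshold.
if error_rate_exceeded 12 100; then
  echo "High Error Rate would fire"
fi
```

The comparison is strict, so exactly 10 failures out of 100 does not cross the threshold.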

High — prompt attention required

These indicate degraded performance or functionality.

| Alert | Condition |
| --- | --- |
| API Rate Limiting | Hitting GCP API rate limits |
| PubSub Backlog | >1000 unprocessed messages |
| PubSub Connection Health | PubSub connectivity issues |
| Circuit Breaker Open | Circuit breaker protecting system from cascading failures |
| Redis Connection Issues | Redis connectivity problems |

Medium — monitor and track

These indicate reduced capacity or resource constraints.

| Alert | Condition |
| --- | --- |
| High CPU Usage | CPU usage >80% for extended period |
| High Memory Usage | Memory usage >80% for extended period |
| High Disk Usage | Disk usage >85% for extended period |
| Network Connection Health | Network connectivity issues |
| Network Errors | High rate of network errors |
| Registry Health | Container registry connectivity issues |
| Zone Capacity Issues | GCP zone unavailable or at capacity |
| Quota Exceeded | GCP resource quotas hit limits |

Info — optimization opportunities

| Alert | Condition |
| --- | --- |
| High Process Memory Usage | Process memory usage >1 GB for extended period |
| High Goroutine Count | >1000 active goroutines |

Import alerts into Grafana

Each alert folder contains an alert.yaml file:
  1. In Grafana, go to Alerting → Alert Rules
  2. Click Import
  3. Upload the alert.yaml from the alert folder you want (e.g., service-down/alert.yaml)
  4. Configure notification channels for the alert’s severity level

Customize thresholds

The default thresholds work for most deployments. Adjust them based on your scale:
  • Smaller deployments may need lower thresholds to catch issues earlier
  • Larger deployments may need higher thresholds to reduce noise
  • Development environments may want less sensitive alerts

Runbooks

Each alert folder also contains a runbook.md with step-by-step troubleshooting instructions. Before using a runbook, set these environment variables:

    export PROJECT_ID="your-gcp-project-id"
    export REGION="your-region"          # e.g., us-central1
    export RUNNER_ID="your-runner-id"    # from your Terraform configuration

The runbooks use gcloud compute ssh commands to inspect the runner instance and include resolution steps and escalation procedures.
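
Since the runbook commands depend on PROJECT_ID, REGION, and RUNNER_ID, a small guard can catch an unset value before any command runs. This is a minimal sketch, assuming a POSIX shell; `check_runbook_env` is a hypothetical name and is not provided by the runbooks themselves.

```shell
# Hypothetical preflight check: confirm the runbook variables are set.
check_runbook_env() {
  missing=0
  for name in PROJECT_ID REGION RUNNER_ID; do
    # Look up the variable whose name is stored in $name
    # (POSIX sh has no ${!name} indirection, so eval is used).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Calling `check_runbook_env || exit 1` at the top of a troubleshooting session stops early with a message naming each variable that is still empty.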

Notification channels

Configure notification channels in Grafana based on alert severity:

| Severity | Suggested channels |
| --- | --- |
| Critical | PagerDuty, SMS, phone |
| High | Slack, email |
| Medium | Email, ticket creation |
| Info | Email, dashboard review |
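
If you script any part of your alert handling, the same mapping can be written down as a lookup. The sketch below simply encodes the suggestions in the table; the lowercase severity labels and channel identifiers are assumptions made for illustration, and Grafana itself does its routing through its own notification settings rather than through a function like this.

```shell
# Hypothetical mapping from alert severity to the suggested channels.
# Severity labels and channel names are illustrative, not Grafana values.
channels_for_severity() {
  case "$1" in
    critical) echo "pagerduty sms phone" ;;
    high)     echo "slack email" ;;
    medium)   echo "email ticket" ;;
    info)     echo "email dashboard-review" ;;
    *)        echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

An unrecognized severity returns a non-zero status, so a typo in a label fails loudly instead of silently dropping a notification.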
