Available on the Enterprise tier. If you don’t have access to the GCP Runner Terraform module, contact your sales representative.
The alert definitions and dashboards described here live in the `gitpod-gcp-terraform` repository under the `monitoring/` directory and are designed to work with the Prometheus metrics your runner already exposes.
Before using these alerts and dashboards, you need to configure metrics collection on your runner. See Monitoring and Metrics for setup instructions.
## Prerequisites
- A deployed GCP Runner with metrics collection enabled
- A Grafana instance (or compatible alerting system) connected to your Prometheus data source
- Access to the `gitpod-gcp-terraform` repository
## Dashboard
The repository includes a Grafana dashboard at `monitoring/dashboards/gitpod-runner-overview.json`.
### What it covers
| Section | What it shows |
|---|---|
| Version & Replicas | Runner version tracking and replica count |
| Health Status | Health checks and active instance states by lifecycle stage |
| GCP Runner Kit Interface | Environment operation durations, function calls, error rates |
| GCP API Operations | API request metrics, success rates, error rates, latency heatmaps |
| KV Store Operations | Redis/key-value store operation rates and durations |
| PubSub Operations | Message processing, acknowledgments, connection health |
| Environment Operations | Compute environment operation rates and durations |
| System Metrics | Host-level CPU, memory, disk usage, and disk I/O |
| WRI | Workspace Runtime Interface performance metrics |
The dashboard includes template variables (`$project_id`, `$region`, `$runner_name`, `$instance`) so you can filter by deployment.
### Import the dashboard
- In Grafana, go to Dashboards → Import
- Upload `gitpod-runner-overview.json` from the repository
- Select your Prometheus data source
- Configure the template variables to match your deployment
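If you manage Grafana declaratively, the same JSON file can also be loaded through Grafana's file-based dashboard provisioning instead of the UI import. A minimal sketch; the provisioning path, provider name, and dashboard directory below are assumptions, not part of the repository:

```yaml
# /etc/grafana/provisioning/dashboards/gitpod-runner.yaml (hypothetical path)
apiVersion: 1
providers:
  - name: gitpod-runner          # arbitrary provider name
    type: file
    options:
      # Copy gitpod-runner-overview.json from the repository into this
      # directory; Grafana loads it on the next provisioning scan.
      path: /var/lib/grafana/dashboards
```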
## Alerts
The repository includes 19 alert definitions at `monitoring/alerts/`, each in its own folder with an `alert.yaml` (Grafana-compatible alert rule) and a `runbook.md` (troubleshooting steps).
### Alert overview
#### Critical — immediate response required
These indicate a service outage or severe degradation.

| Alert | Condition | Impact |
|---|---|---|
| Service Down | Runner or auth proxy up metric is 0 for >1 min | Complete outage — users cannot create or manage environments |
| High Error Rate | >10% of environment operations failing over 5 min | Users experiencing environment creation failures |
| High Latency | 95th percentile operation time >5 min | Slow environment operations |
| Goroutine Panics | Application panics detected | Potential service instability |
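To make the conditions above concrete, here is the Service Down check sketched as a Prometheus-style alerting rule. This is an illustration rather than the repository's actual `alert.yaml` (which uses Grafana's rule format), and the `job` label is an assumption to be matched against your scrape config:

```yaml
groups:
  - name: gitpod-runner-critical
    rules:
      - alert: ServiceDown
        # Fires when the runner's (or auth proxy's) `up` series has been 0
        # for more than one minute, i.e. the scrape target is unreachable.
        expr: up{job="gitpod-runner"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Gitpod GCP runner is down; environment operations are unavailable"
```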
#### High — prompt attention required
These indicate degraded performance or functionality.

| Alert | Condition |
|---|---|
| API Rate Limiting | Hitting GCP API rate limits |
| PubSub Backlog | >1000 unprocessed messages |
| PubSub Connection Health | PubSub connectivity issues |
| Circuit Breaker Open | Circuit breaker protecting system from cascading failures |
| Redis Connection Issues | Redis connectivity problems |
#### Medium — monitor and track
These indicate reduced capacity or resource constraints.

| Alert | Condition |
|---|---|
| High CPU Usage | CPU usage >80% for extended period |
| High Memory Usage | Memory usage >80% for extended period |
| High Disk Usage | Disk usage >85% for extended period |
| Network Connection Health | Network connectivity issues |
| Network Errors | High rate of network errors |
| Registry Health | Container registry connectivity issues |
| Zone Capacity Issues | GCP zone unavailable or at capacity |
| Quota Exceeded | GCP resource quotas hit limits |
#### Info — optimization opportunities
| Alert | Condition |
|---|---|
| High Process Memory Usage | Process memory usage >1GB for extended period |
| High Goroutine Count | >1000 active goroutines |
### Import alerts into Grafana
Each alert folder contains an `alert.yaml` file:
- In Grafana, go to Alerting → Alert Rules
- Click Import
- Upload the `alert.yaml` from the alert folder you want (e.g., `service-down/alert.yaml`)
- Configure notification channels for the alert’s severity level
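If you prefer file-based provisioning over UI import, Grafana can also load alert rules from its provisioning directory. A sketch under the assumption that the repository's `alert.yaml` files follow, or can be adapted to, Grafana's alert-provisioning schema; the path, folder, and group names here are placeholders:

```yaml
# /etc/grafana/provisioning/alerting/gitpod-runner.yaml (hypothetical path)
apiVersion: 1
groups:
  - orgId: 1
    name: gitpod-runner-alerts    # evaluation group name, arbitrary
    folder: Gitpod Runner         # Grafana folder the rules appear under
    interval: 1m                  # evaluation interval
    rules: []                     # paste the rule entries from each alert.yaml here
```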
### Customize thresholds
The default thresholds work for most deployments. Adjust them based on your scale:

- Smaller deployments may need lower thresholds to catch issues earlier
- Larger deployments may need higher thresholds to reduce noise
- Development environments may want less sensitive alerts
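For example, a smaller deployment might tighten the High Error Rate threshold from 10% to 5% to catch failures earlier. The metric names below are illustrative placeholders, not the runner's actual series names; edit the equivalent expression in the relevant `alert.yaml`:

```yaml
- alert: HighErrorRate
  # Default fires when >10% of environment operations fail over 5 minutes;
  # 0.05 lowers that to 5%. Metric names are hypothetical.
  expr: |
    sum(rate(environment_operations_errors_total[5m]))
      /
    sum(rate(environment_operations_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
```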
## Runbooks
Each alert folder also contains a `runbook.md` with step-by-step troubleshooting instructions. Before using a runbook, set up the environment variables it references.

Runbooks use `gcloud compute ssh` commands to inspect the runner instance and include resolution steps and escalation procedures.
## Notification channels
Configure notification channels in Grafana based on alert severity:

| Severity | Suggested channels |
|---|---|
| Critical | PagerDuty, SMS, phone |
| High | Slack, email |
| Medium | Email, ticket creation |
| Info | Email, dashboard review |
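If your alerts flow through Prometheus Alertmanager rather than Grafana contact points, the severity mapping above can be expressed as a routing tree. The receiver names below are placeholders for integrations you define yourself:

```yaml
route:
  receiver: email-team            # default for medium/info severities
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-oncall  # pages the on-call engineer
    - matchers:
        - severity="high"
      receiver: slack-alerts      # posts to an alerts channel
receivers:
  - name: email-team
  - name: pagerduty-oncall
  - name: slack-alerts
```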
## Next steps
- Monitoring and Metrics — Enable metrics collection and see all available metrics
- Troubleshooting GCP Runners — Diagnose common runner issues
- `monitoring/` on GitHub — Browse alert definitions and dashboard source