Troubleshooting AWS runners

If you encounter any issues while setting up or operating a Runner, please follow these steps:

1. Review the common problems.
2. If the issue persists, reach out to support.

Contacting Support

To start a support chat, use the bubble icon located in the bottom right corner of the application. When contacting support, please include the following information:

- Any error messages and relevant screenshots.
- Runner ID, Runner version, and AWS region.
- Runner logs.

Copy Runner ID and Version

1. Navigate to Settings > Runners.
2. Locate your Runner card.
3. Click ... in the top right corner and select Copy ID.
4. The Runner Version is displayed as the last item in the menu.

Find CloudFormation Stack

1. Navigate to Settings > Runners.
2. Open the Runner card to find the Stack Name, URL, and region.

Retrieve Runner Logs (ECS Task Logs)

You can adjust the log level of your Runner from the Runner Configuration section to get more detailed logs for troubleshooting. See AWS Runner setup for log level configuration options.

Using ECS Console

To view the logs for the Runner using the ECS console:

1. Navigate to the AWS ECS console.
2. Locate the cluster by the stack name.
3. Select the service associated with the Runner.
4. Go to the Tasks tab and find the most recent failed or active task.
5. Click the task ID to open the details.
6. Check the Logs tab or find the CloudWatch log stream.

Note that each task has two log groups: one for the Runner itself and another for Prometheus (monitoring). Use the Runner log group for troubleshooting.

Using AWS CLI

To look up the cluster name and task ID using the AWS CLI, run the following commands:

To list all clusters and find your cluster name by the stack name:

aws ecs list-clusters

To list tasks in a specific cluster and find your task ID:

aws ecs list-tasks --cluster <cluster-name>

Once you have the cluster name and task ID, describe the task to see its status, stop reason, and container details:

aws ecs describe-tasks --cluster <cluster-name> --tasks <task-id>
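
The describe-tasks output contains task metadata, not log lines. The following is a minimal sketch for finding where the log lines live, assuming the AWS CLI is configured with credentials for the Runner account and that the Runner's containers use the awslogs log driver. The function name runner_log_groups is hypothetical and only used for illustration.

```shell
# Print the CloudWatch log groups used by the containers of one ECS task.
# Assumption: the task definition configures the awslogs log driver.
runner_log_groups() {
  cluster="$1"
  task_id="$2"
  # Resolve the task definition that the task was started from.
  task_def="$(aws ecs describe-tasks \
    --cluster "$cluster" --tasks "$task_id" \
    --query 'tasks[0].taskDefinitionArn' --output text)" || return 1
  # Print the awslogs-group option of every container, one per line.
  aws ecs describe-task-definition \
    --task-definition "$task_def" \
    --query 'taskDefinition.containerDefinitions[].logConfiguration.options."awslogs-group"' \
    --output text | tr '\t' '\n'
}
```

For example, runner_log_groups <cluster-name> <task-id> prints one log group per line. With AWS CLI v2, aws logs tail <log-group-name> --since 1h --follow then streams recent lines from the group you pick.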

Monitoring and Metrics

If you have configured metrics collection, your monitoring system will receive Runner metrics. For information on configuring metrics collection, see AWS Runner setup.

Common Problems

Network misconfigurations are the most frequent cause of installation issues. Refer to the infrastructure prerequisites to confirm that all requirements are met. The sections below list common problems and how to diagnose them.

CloudFormation Stack Fails

Symptoms:
- Stack Event Status: ROLLBACK_COMPLETE or ROLLBACK_IN_PROGRESS due to missing VPC, availability zones, or subnets.
- Stack Event Status Reasons:
  - Parameter validation failed: parameter value for EC2RunnerInstancesSubnet does not exist.
  - Parameter validation failed: parameter value for parameter name EC2RunnerInstancesSubnet does not exist.
  - Parameter validation failed: parameter value for parameter name EC2RunnerAzs does not exist.
Diagnostics:
- On the initial page of the CloudFormation stack creation, ensure you select a VPC, at least one availability zone, and a subnet.
- Choose subnets across multiple availability zones for fault tolerance.
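
To read the status reasons without clicking through the console, the stack events can be filtered from the AWS CLI. This is a sketch under the assumption that the stack still exists (for example in ROLLBACK_COMPLETE); the function name stack_failure_reasons is hypothetical.

```shell
# Print the logical resource ID and status reason of every failed event in a stack.
stack_failure_reasons() {
  stack_name="$1"
  aws cloudformation describe-stack-events \
    --stack-name "$stack_name" \
    --query "StackEvents[?contains(ResourceStatus, 'FAILED')].[LogicalResourceId, ResourceStatusReason]" \
    --output text
}
```

Call it as stack_failure_reasons <stack-name>. If the stack has already been deleted, pass the full stack ID instead of the name, because deleted stacks can only be looked up by ID.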

Runner Task Fails

Symptoms:
- Stack Event Status: CREATE_FAILED or ROLLBACK_IN_PROGRESS because the Runner task fails to launch or is stuck in a pending state.
- Stack Event Status Reason: ECS Deployment Circuit Breaker was triggered.
- Runner task fails initialization with errors such as ResourceInitializationError: ....
- Secrets Manager or other AWS services are inaccessible to the Runner.
- The Runner cannot pull container images or resolve DNS queries.
Diagnostics:
- Verify that the VPC has an Internet Gateway or NAT Gateway configured.
- Update the route tables to direct public subnets to the Internet Gateway and private subnets to the NAT Gateway.
- For private subnets, add VPC endpoints for services like Secrets Manager, S3, and ECR.
- Confirm that security groups allow outbound traffic to the required services.
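
When the circuit breaker triggers, the stop reason recorded on the failed tasks usually narrows down which of the checks above applies. A minimal sketch, assuming a configured AWS CLI; the function name stopped_task_reasons is hypothetical. ECS keeps stopped tasks only for a limited time, so run it soon after the failure.

```shell
# Print the ARN, stop code, and stop reason of recently stopped tasks in a cluster.
stopped_task_reasons() {
  cluster="$1"
  task_arns="$(aws ecs list-tasks --cluster "$cluster" \
    --desired-status STOPPED --query 'taskArns' --output text)" || return 1
  if [ -z "$task_arns" ] || [ "$task_arns" = "None" ]; then
    echo "no stopped tasks found in $cluster"
    return 0
  fi
  # $task_arns is intentionally unquoted so that each ARN becomes its own argument.
  # describe-tasks accepts at most 100 tasks per call.
  aws ecs describe-tasks --cluster "$cluster" --tasks $task_arns \
    --query 'tasks[].[taskArn, stopCode, stoppedReason]' --output text
}
```

Call it as stopped_task_reasons <cluster-name>. A stop reason beginning with ResourceInitializationError typically points at image pulls or secret retrieval, which is where the routing and VPC endpoint checks apply.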

Instance Type Not Available

If you encounter an error stating that the requested instance type is unavailable in a specific availability zone (e.g., “The selected instance type m6i.xlarge is not available in the automatically assigned zone us-east-1e”), this is often due to regional or zone-specific availability constraints within AWS.

Some zones, like us-east-1d and us-east-1e, have been reported to experience resource shortages more frequently. If possible, avoid using these zones exclusively and instead install your runners across multiple zones or regions.

Here’s how you can address this:

Install a Runner in a Different Region:
- Some instance types may be unavailable in certain regions or zones due to resource constraints. Refer to AWS instance type availability for details. If necessary, install the Runner in a different AWS region that supports your preferred instance type.
Select Multiple Availability Zones:
- When installing a Runner using the AWS CloudFormation Stack, ensure that you select multiple subnets. For example, instead of restricting your Environment to only the subnet corresponding to us-east-1e, include subnets corresponding to us-east-1a and us-east-1b zones to improve availability.
  - You can also update the existing stack parameters.
Use an Alternate Instance Type:
- If the desired instance type (e.g., m6i.xlarge) is unavailable, consider using a different instance type, such as c5.xlarge, which may have better availability.
- To update, create a new Environment class using the alternate instance type and disable the existing class.
Retry Later:
- Instance availability can be transient. If none of the above options resolve the issue, wait and try again later, as AWS resources might become available after a brief period.
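
Before choosing subnets or an alternate instance type, you can check which zones offer a type at all. A sketch assuming a configured AWS CLI; the function name zones_offering_instance_type is hypothetical.

```shell
# List the availability zones in a region that offer a given instance type.
zones_offering_instance_type() {
  instance_type="$1"
  region="$2"
  aws ec2 describe-instance-type-offerings \
    --region "$region" \
    --location-type availability-zone \
    --filters "Name=instance-type,Values=$instance_type" \
    --query 'InstanceTypeOfferings[].Location' \
    --output text | tr '\t' '\n' | sort
}
```

For example, zones_offering_instance_type m6i.xlarge us-east-1 prints one zone name per line. An offering only means the zone supports the type; it does not guarantee that capacity is available at launch time.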

Unexpected Costs

Symptoms:
- You notice unexpected charges in your AWS bill that you believe are related to the Runner infrastructure.
- You continue receiving bills for resources even after deleting a Runner.
Diagnostics:
- Use the Controls for Managing Costs guide to investigate the specific AWS resources contributing to the charges.
- After deleting a runner, verify that the associated CloudFormation stack has been fully deleted. Additionally, check for any residual resources such as EC2 instances or EBS volumes associated with Environment IDs, and manually delete them if necessary to avoid ongoing costs.
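
One way to look for residual storage is to list EBS volumes that are no longer attached to any instance. This is a sketch, not a cleanup tool: it assumes a configured AWS CLI, the function name unattached_volumes is hypothetical, and the result can include volumes unrelated to the Runner, so review tags before deleting anything.

```shell
# List EBS volumes in the "available" state, meaning not attached to any instance.
unattached_volumes() {
  region="$1"
  aws ec2 describe-volumes --region "$region" \
    --filters Name=status,Values=available \
    --query 'Volumes[].[VolumeId, Size, CreateTime]' --output text
}
```

Call it as unattached_volumes <region>. Each output line shows the volume ID, its size in GiB, and its creation time.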

AWS SSM Access Requirements

Symptoms:
- New Environments fail to start with error message: AWS account policy blocks ssm:SendCommand, which is required for starting Environments. See our docs for details on how to resolve this: https://www.gitpod.io/docs/ona/runners/aws/troubleshooting-runners#aws-ssm-access-requirements
- Runner is marked as degraded with the above error message
- Devcontainer build cache credentials cannot be set/refreshed, resulting in slower startup times
Diagnostics:
- Ona Environments require AWS Systems Manager (SSM) access to properly initialize and manage development Environments.
- The ssm:SendCommand permission is used to send the initial Environment configuration and refresh devcontainer build cache credentials in Environments, and ssm:GetCommandInvocation to verify the result.
- These permissions can be blocked by Service Control Policies (SCPs) at the AWS account level.
- Check if your AWS account has Service Control Policies (SCPs) that might be blocking SSM access. The Runner role (containing gitpodflexrunnerrole) must be able to run these commands against EC2 instances in the account.
- Test if SSM access is working by attempting to send a command to an EC2 instance or starting a new Environment.
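
The manual SSM test from the last item can be sketched as follows. Assumptions: a configured AWS CLI, a running Linux instance in the same account that is managed by SSM, and the AWS-managed AWS-RunShellScript document. The function names are hypothetical. The test runs with your own credentials rather than the Runner role, so it only shows whether an account-wide policy blocks the calls.

```shell
# Send a harmless command to one instance through SSM and print the command ID.
ssm_smoke_test() {
  instance_id="$1"
  aws ssm send-command \
    --instance-ids "$instance_id" \
    --document-name "AWS-RunShellScript" \
    --parameters 'commands=["echo ssm-ok"]' \
    --query 'Command.CommandId' --output text
}

# Print the status of a previously sent command on that instance.
ssm_command_status() {
  command_id="$1"
  instance_id="$2"
  aws ssm get-command-invocation \
    --command-id "$command_id" --instance-id "$instance_id" \
    --query 'Status' --output text
}
```

Run ssm_smoke_test <instance-id> first, then pass the printed command ID to ssm_command_status <command-id> <instance-id>. An access-denied error from either call suggests that a policy is blocking ssm:SendCommand or ssm:GetCommandInvocation.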

Resolution:

Contact your AWS administrator to review the current SCP that’s blocking SSM access.

Request an exception for the Ona Runner’s IAM role to allow:

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Effect": "Allow",
			"Action": ["ssm:SendCommand", "ssm:GetCommandInvocation"],
			"Resource": [
				"arn:aws:ec2:*:*:instance/*",
				"arn:aws:ssm:*:*:command/*"
			]
		}
	]
}

Alternatively, if your existing policy denies the permission for all accounts, add an exception for your Ona Runner account:

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Effect": "Deny",
			"Action": ["ssm:SendCommand", "ssm:GetCommandInvocation"],
			"Resource": [
				"arn:aws:ec2:*:*:instance/*",
				"arn:aws:ssm:*:*:command/*"
			],
			"Condition": {
				"StringNotEquals": {
					"aws:PrincipalAccount": [
						"<GITPOD_RUNNER_AWS_ACCOUNT_ID>"
					]
				}
			}
		}
	]
}

Security Note:
- The SSM commands are only used for Environment initialization and configuration.
- They are sent over encrypted channels and follow AWS security best practices.

Network Connectivity Issues

If you experience connectivity issues with your AWS Runner, follow these troubleshooting steps to diagnose and resolve common networking problems.

Common Network Issues

Work through the following checks:

Verify security group configurations
- Ensure port 29222 is open for SSH access to development Environments
- Check that outbound rules allow HTTPS traffic to required endpoints
- Verify internal communication on port 22999 is allowed
Check route table configurations
- Confirm routes to internet gateway (for public subnets) or NAT gateway (for private subnets)
- Verify default routes are properly configured
Validate network ACL settings
- Ensure Network ACLs aren’t blocking required traffic
- Check both inbound and outbound rules
Confirm DNS resolution is working
- Test DNS resolution for app.gitpod.io and *.us01.gitpod.dev
- Verify VPC DNS resolution and DNS hostnames are enabled
Test connectivity to Ona services
- From an EC2 instance in your Runner’s subnet, test connectivity to required endpoints
- Use tools like curl or telnet to verify connectivity
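
The VPC DNS settings from the checklist above can be read from the AWS CLI. A minimal sketch, assuming a configured AWS CLI; the function name vpc_dns_settings is hypothetical.

```shell
# Print whether DNS resolution and DNS hostnames are enabled for a VPC.
vpc_dns_settings() {
  vpc_id="$1"
  support="$(aws ec2 describe-vpc-attribute --vpc-id "$vpc_id" \
    --attribute enableDnsSupport \
    --query 'EnableDnsSupport.Value' --output text)" || return 1
  hostnames="$(aws ec2 describe-vpc-attribute --vpc-id "$vpc_id" \
    --attribute enableDnsHostnames \
    --query 'EnableDnsHostnames.Value' --output text)" || return 1
  echo "enableDnsSupport=$support enableDnsHostnames=$hostnames"
}
```

Call it as vpc_dns_settings <vpc-id>. Both values should be reported as True for the DNS checks in this section to pass.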

Health Endpoint Connectivity Test

For Enterprise Runners, test the health endpoint to verify network connectivity and load balancer functionality:

# Test health endpoint connectivity (returns HTTP 200 on success)
curl -v https://<your-domain>/_health

Replace <your-domain> with the domain name you configured during setup. A successful response returns an HTTP 200 status code, indicating that:

- DNS resolution is working correctly
- Load balancer is accessible from your network
- SSL/TLS certificate is properly configured
- Basic network connectivity is established

If this test fails, check:

- DNS configuration and propagation
- Security group rules allowing HTTPS traffic
- Load balancer health and target group status
- SSL certificate validity and domain matching

Required Endpoints Connectivity Test

Test connectivity to these critical endpoints from your Runner’s subnet:

# Test HTTPS connectivity to Ona services
curl -I https://app.gitpod.io
curl -I https://api.us01.gitpod.dev

# Test connectivity to AWS services
curl -I https://public.ecr.aws
curl -I https://s3.amazonaws.com
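
The same checks can be run in one pass so that a blocked endpoint stands out. This is a sketch that assumes curl is installed on the test instance; the function name check_endpoints and the 10-second timeout are illustrative choices.

```shell
# Print the HTTP status code returned by each endpoint.
# curl reports 000 when no HTTP response was received (DNS failure, timeout, blocked traffic).
check_endpoints() {
  for url in "$@"; do
    code="$(curl -s -o /dev/null -I --max-time 10 -w '%{http_code}' "$url")"
    echo "$code $url"
  done
}
```

For example, check_endpoints https://app.gitpod.io https://api.us01.gitpod.dev https://public.ecr.aws https://s3.amazonaws.com prints one line per endpoint. Any HTTP status code shows that the endpoint is reachable at the network level, while 000 indicates a connectivity problem.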

Restarting the Runner After Networking Changes

After applying networking changes (such as security group updates, route table modifications, or VPC endpoint configurations), you may need to restart the Runner ECS task to ensure the changes take effect.

Using the AWS Console

1. Navigate to the AWS ECS console.
2. In the left sidebar, click Clusters.
3. Locate and click on the cluster with your stack name (found in Settings > Runners in Ona).
4. In the Services tab, click on the service associated with your Runner.
5. Click the Update button.
6. In the Deployment configuration section, check the box for Force new deployment.
7. Click Update at the bottom of the page.

ECS then starts a new task with the updated networking configuration and gracefully stops the old one.

Using AWS CLI

You can also restart the Runner using the AWS CLI:

# Get your cluster name and service name
aws ecs list-clusters
aws ecs list-services --cluster YOUR_CLUSTER_NAME

# Force a new deployment
aws ecs update-service \
  --cluster YOUR_CLUSTER_NAME \
  --service YOUR_SERVICE_NAME \
  --force-new-deployment
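
The update-service call returns as soon as the deployment is requested. To block until the new task is running, it can be combined with the ECS waiter. A minimal sketch, assuming a configured AWS CLI; the function name restart_and_wait is hypothetical.

```shell
# Force a new deployment and block until the service reports a steady state.
restart_and_wait() {
  cluster="$1"
  service="$2"
  aws ecs update-service --cluster "$cluster" --service "$service" \
    --force-new-deployment --query 'service.serviceName' --output text || return 1
  # The waiter polls the service periodically and exits non-zero if it
  # does not become stable within its retry limit.
  aws ecs wait services-stable --cluster "$cluster" --services "$service"
}
```

Call it as restart_and_wait YOUR_CLUSTER_NAME YOUR_SERVICE_NAME. A non-zero exit status means the service did not stabilize in time, in which case check the stopped tasks and their logs.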

Verification Steps

After making networking changes and restarting the Runner:

Check Runner status in Ona
- Go to Settings > Runners in your Ona dashboard
- Verify the Runner shows as “Connected”
Test Environment creation
- Create a new Environment using the Runner
- Verify the Environment starts successfully
Monitor CloudWatch logs
- Check ECS task logs for any connectivity errors
- Look for successful connections to Ona services
