- Review the common problems.
- If the issue persists, reach out to support.
Contacting Support
To start a support chat, use the bubble icon located in the bottom right corner of the application. When contacting support, please include the following information:
- Any error messages and relevant screenshots.
- Runner ID, Version, and AWS Region.
- Runner Logs.
Copy Runner ID and Version
- Navigate to Settings > Runners.
- Locate your Runner card.
- Click ... in the top right corner and select Copy ID.
- The Runner Version is displayed as the last item in the menu.
Find CloudFormation Stack
- Navigate to Settings > Runners.
- Open the Runner card to find the Stack Name, URL, and region.
Retrieve Runner Logs (ECS Task Logs)
You can adjust the log level of your Runner from the Runner Configuration section to get more detailed logs for troubleshooting. See Enterprise AWS Runner setup for log level configuration options.
Using ECS Console
To view the logs for the Runner using the ECS console:
- Navigate to the AWS ECS console.
- Locate the cluster by the stack name.
- Select the service associated with the Runner.
- Go to the Tasks tab and find the most recent failed or active task.
- Click the task ID to open the details.
- Check the Logs tab or find the CloudWatch log stream.
Note that each task has two log groups: one for the Runner itself and another for Prometheus (monitoring). Use the Runner's log group for troubleshooting.
Using AWS CLI
To look up the cluster name and task ID using the AWS CLI, use the following commands:
- To list all clusters and find your cluster name by the stack name:
- To list tasks in a specific cluster and find your task ID:
- Once you have the cluster name and task ID, you can view the logs for the Runner:
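The three lookups above can be sketched with standard AWS CLI calls. This is a minimal sketch, not the exact commands from the original guide: the function names are illustrative, the log group name is not stated in this guide and must be taken from the task's Logs tab or CloudWatch, and `aws logs tail` assumes AWS CLI v2.

```shell
# Sketch only: thin wrappers around standard AWS CLI calls.
# Function names are illustrative, not part of the product.

# Print all ECS cluster ARNs; look for the one containing your stack name.
list_clusters() {
  aws ecs list-clusters --query 'clusterArns[]' --output text
}

# Print task ARNs in a cluster. The task ID is the last segment of each ARN.
# $1 = cluster name, $2 = desired status (RUNNING or STOPPED, default RUNNING)
list_tasks() {
  aws ecs list-tasks --cluster "$1" --desired-status "${2:-RUNNING}" \
    --query 'taskArns[]' --output text
}

# Print recent events from a CloudWatch log group (assumes AWS CLI v2).
# $1 = log group name (assumption: copied from the task's Logs tab)
# $2 = how far back to read (default 1h)
tail_logs() {
  aws logs tail "$1" --since "${2:-1h}"
}
```

For example, `list_tasks <cluster-name> STOPPED` lists recently stopped tasks, which is usually where a failed Runner task shows up.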
Monitoring and Metrics
If you have configured metrics collection, your monitoring system will receive Runner metrics. For information on configuring metrics collection, see Enterprise AWS Runner setup.
Common Problems
Network misconfigurations are the most frequent cause of installation issues. Please refer to the infrastructure prerequisites to ensure all requirements are met. Below are common problems along with their diagnostics.
CloudFormation Stack Fails
- Symptoms:
  - Stack Event Status: ROLLBACK_COMPLETE or ROLLBACK_IN_PROGRESS due to missing VPC, availability zones, or subnets.
  - Stack Event Status Reasons:
    - Parameter validation failed: parameter value for EC2RunnerInstancesSubnet does not exist.
    - Parameter validation failed: parameter value for parameter name EC2RunnerInstancesSubnet does not exist.
    - Parameter validation failed: parameter value for parameter name EC2RunnerAzs does not exist.
- Diagnostics:
  - On the initial page of the CloudFormation stack creation, ensure you select a VPC, at least one availability zone, and a subnet.
  - Choose subnets across multiple availability zones for fault tolerance.
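To confirm which status reason a failed stack reported, and which availability zone each chosen subnet lives in, the checks above can be sketched with the AWS CLI. The function names are illustrative; the stack name and subnet IDs are whatever you entered during stack creation.

```shell
# Sketch only: inspect a failed stack and the subnets passed to it.

# Print every stack event that carries a status reason.
# $1 = CloudFormation stack name
stack_event_reasons() {
  aws cloudformation describe-stack-events --stack-name "$1" \
    --query 'StackEvents[?ResourceStatusReason].[Timestamp,LogicalResourceId,ResourceStatus,ResourceStatusReason]' \
    --output table
}

# Print the VPC and availability zone of each subnet.
# An error here means a subnet ID does not exist in the current region.
# $@ = one or more subnet IDs
subnet_zones() {
  aws ec2 describe-subnets --subnet-ids "$@" \
    --query 'Subnets[].[SubnetId,VpcId,AvailabilityZone]' --output table
}
```

If `subnet_zones` lists the same availability zone for every subnet, the stack has no fault tolerance across zones.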
Runner Task Fails
- Symptoms:
  - Stack Event Status: CREATE_FAILED or ROLLBACK_IN_PROGRESS because the Runner task fails to launch or is stuck in a pending state.
  - Stack Event Status Reason: ECS Deployment Circuit Breaker was triggered.
  - Runner task fails initialization with errors such as ResourceInitializationError: ...
  - Secrets Manager or other AWS services are inaccessible to the Runner.
  - The Runner cannot pull container images or resolve DNS queries.
- Diagnostics:
  - Verify that the VPC has an Internet Gateway or NAT Gateway configured.
  - Update the route tables to direct public subnets to the Internet Gateway and private subnets to the NAT Gateway.
  - For private subnets, add VPC endpoints for services like Secrets Manager, S3, and ECR.
  - Confirm that security groups allow outbound traffic to the required services.
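The diagnostics above can be sketched as AWS CLI lookups. This is a hedged sketch with illustrative function names; the cluster name, task ID, subnet ID, and VPC ID are values from your own account.

```shell
# Sketch only: gather evidence for a Runner task that fails to start.

# Print why ECS stopped a task.
# $1 = cluster name, $2 = task ID
task_stop_reason() {
  aws ecs describe-tasks --cluster "$1" --tasks "$2" \
    --query 'tasks[].[stopCode,stoppedReason]' --output text
}

# Print the routes of the route table explicitly associated with a subnet.
# An empty result usually means the subnet falls back to the VPC's main route table.
# $1 = subnet ID
subnet_routes() {
  aws ec2 describe-route-tables \
    --filters "Name=association.subnet-id,Values=$1" \
    --query 'RouteTables[].Routes[].[DestinationCidrBlock,GatewayId,NatGatewayId]' \
    --output table
}

# Print the VPC endpoints that exist in a VPC and their state.
# $1 = VPC ID
vpc_endpoints() {
  aws ec2 describe-vpc-endpoints --filters "Name=vpc-id,Values=$1" \
    --query 'VpcEndpoints[].[ServiceName,State]' --output table
}
```

A default route (0.0.0.0/0) pointing at an Internet Gateway or NAT Gateway in the `subnet_routes` output is the thing to look for.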
Instance Type Not Available
If you encounter an error stating that the requested instance type is unavailable in a specific availability zone (e.g., “The selected instance type m6i.xlarge is not available in the automatically assigned zone us-east-1e”), this is often due to regional or zone-specific availability constraints within AWS.
Some zones, like us-east-1d and us-east-1e, have been reported to experience resource shortages more frequently. If possible, avoid using these zones exclusively and instead install your runners across multiple zones or regions.
Here’s how you can address this:
- Install a Runner to a Different Region:
  - Some instance types may be unavailable in certain regions or zones due to resource constraints. Refer to AWS instance type availability for details. If necessary, install Runners in a different AWS region that supports your preferred instance type.
- Select Multiple Availability Zones:
  - When installing a Runner using the AWS CloudFormation Stack, ensure that you select multiple subnets. For example, instead of restricting your Environment to only the subnet corresponding to us-east-1e, include subnets corresponding to the us-east-1a and us-east-1b zones to improve availability.
  - You can also update the existing stack parameters.
- Use an Alternate Instance Type:
  - If the desired instance type (e.g., m6i.xlarge) is unavailable, consider using a different instance type, such as c5.xlarge, which may have better availability.
  - To update, create a new Environment class using the alternate instance type and disable the existing class.
- Retry Later:
  - Instance availability can be transient. If none of the above options resolve the issue, wait and try again later, as AWS resources might become available after a brief period.
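Before picking zones or an alternate instance type, it can help to see where a type is offered at all. The sketch below uses the EC2 instance type offerings API through the AWS CLI; the function name is illustrative. Note that an offering only says the type exists in a zone, not that capacity is available right now.

```shell
# Sketch only: list the availability zones in a region that offer an instance type.
# $1 = instance type (e.g. m6i.xlarge), $2 = region (e.g. us-east-1)
zones_offering_type() {
  aws ec2 describe-instance-type-offerings \
    --location-type availability-zone \
    --filters "Name=instance-type,Values=$1" \
    --region "$2" \
    --query 'InstanceTypeOfferings[].Location' --output text
}
```

For example, `zones_offering_type m6i.xlarge us-east-1` prints the zones to prefer when selecting subnets for the stack.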
Unexpected Costs
- Symptoms:
  - You notice unexpected charges in your AWS bill that you believe are related to the Runner infrastructure.
  - You continue receiving bills for resources even after deleting a Runner.
- Diagnostics:
  - Use the Controls for Managing Costs guide to investigate the specific AWS resources contributing to the charges.
  - After deleting a Runner, verify that the associated CloudFormation stack has been fully deleted. Additionally, check for any residual resources such as EC2 instances or EBS volumes associated with Environment IDs, and manually delete them if necessary to avoid ongoing costs.
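The post-deletion checks can be sketched with the AWS CLI as follows. The function names are illustrative. How Environment IDs appear on EC2 instances and EBS volumes (for example in tags or names) is not stated in this guide, so the sketch lists candidates broadly and leaves the matching to you.

```shell
# Sketch only: look for resources that may still incur cost after deleting a Runner.

# Print the stack status. Once the stack is fully deleted, this call fails
# with a "does not exist" error, which is the desired outcome.
# $1 = CloudFormation stack name
stack_status() {
  aws cloudformation describe-stacks --stack-name "$1" \
    --query 'Stacks[].StackStatus' --output text
}

# List running EC2 instances in a region.
# $1 = region
running_instances() {
  aws ec2 describe-instances --region "$1" \
    --filters Name=instance-state-name,Values=running \
    --query 'Reservations[].Instances[].[InstanceId,InstanceType,LaunchTime]' \
    --output table
}

# List EBS volumes that are not attached to any instance.
# $1 = region
unattached_volumes() {
  aws ec2 describe-volumes --region "$1" \
    --filters Name=status,Values=available \
    --query 'Volumes[].[VolumeId,Size,CreateTime]' --output table
}
```

Review each listed resource before deleting it, since these filters are not limited to Runner-owned resources.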
AWS SSM Access Requirements
- Symptoms:
  - New Environments fail to start with error message: AWS account policy blocks ssm:SendCommand, which is required for starting Environments. See our docs for details on how to resolve this: https://www.gitpod.io/docs/ona/runners/aws/troubleshooting-runners#aws-ssm-access-requirements
  - Runner is marked as degraded with the above error message
  - Devcontainer build cache credentials cannot be set/refreshed, resulting in slower startup times
- Diagnostics:
  - Ona Environments require AWS Systems Manager (SSM) access to properly initialize and manage development Environments.
  - The ssm:SendCommand permission is used to send the initial Environment configuration and refresh devcontainer build cache credentials in Environments, and ssm:GetCommandInvocation to verify the result.
  - These permissions can be blocked by Service Control Policies (SCPs) at the AWS account level.
  - Check if your AWS account has Service Control Policies (SCPs) that might be blocking SSM access. The Runner role (containing gitpodflexrunnerrole) must be able to run these commands against EC2 instances in the account.
  - Test if SSM access is working by attempting to send a command to an EC2 instance or starting a new Environment.
- Resolution:
  - Contact your AWS administrator to review the current SCP that’s blocking SSM access.
  - Request an exception for the Ona Runner’s IAM role to allow ssm:SendCommand and ssm:GetCommandInvocation.
  - Alternatively, if your existing policy denies the permission for all accounts, add an exception for your Ona Runner account.
- Security Note:
  - The SSM commands are only used for Environment initialization and configuration.
  - They are sent over encrypted channels and follow AWS security best practices.
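The SSM test mentioned in the diagnostics can be sketched with the AWS CLI. This is a minimal sketch under assumptions: the function names are illustrative, the target instance must be managed by SSM, and the calls should be made with credentials that are subject to the same SCPs as the Runner role for the result to be meaningful. The document name AWS-RunShellScript is a standard AWS-provided SSM document.

```shell
# Sketch only: check that ssm:SendCommand and ssm:GetCommandInvocation are allowed.

# Send a harmless command and print its command ID.
# An AccessDenied error here suggests a policy is blocking ssm:SendCommand.
# $1 = EC2 instance ID
send_test_command() {
  aws ssm send-command --instance-ids "$1" \
    --document-name AWS-RunShellScript \
    --parameters 'commands=["echo ssm-ok"]' \
    --query 'Command.CommandId' --output text
}

# Print the status and output of a previously sent command.
# $1 = command ID, $2 = EC2 instance ID
get_command_result() {
  aws ssm get-command-invocation --command-id "$1" --instance-id "$2" \
    --query '[Status,StandardOutputContent]' --output text
}
```

Run `send_test_command <instance-id>` first, then pass the printed command ID to `get_command_result`.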
Network Connectivity Issues
If you experience connectivity issues with your AWS Runner, follow these troubleshooting steps to diagnose and resolve common networking problems.
Common Network Issues
If you experience connectivity issues:
- Verify security group configurations
  - Ensure port 29222 is open for SSH access to development Environments
  - Check that outbound rules allow HTTPS traffic to required endpoints
  - Verify internal communication on port 22999 is allowed
- Check route table configurations
  - Confirm routes to internet gateway (for public subnets) or NAT gateway (for private subnets)
  - Verify default routes are properly configured
- Validate network ACL settings
  - Ensure Network ACLs aren’t blocking required traffic
  - Check both inbound and outbound rules
- Confirm DNS resolution is working
  - Test DNS resolution for app.gitpod.io and *.us01.gitpod.dev
  - Verify VPC DNS resolution and DNS hostnames are enabled
- Test connectivity to Ona services
  - From an EC2 instance in your Runner’s subnet, test connectivity to required endpoints
  - Use tools like curl or telnet to verify connectivity
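The DNS and connectivity checks above can be sketched as two small helpers to run from an EC2 instance in the Runner's subnet. The function names are illustrative, and the sketch assumes nslookup and curl are installed on that instance. A wildcard such as *.us01.gitpod.dev cannot be resolved directly, so test a concrete hostname under it.

```shell
# Sketch only: basic DNS and HTTPS reachability checks.

# Report whether a hostname resolves.
# $1 = hostname, e.g. app.gitpod.io
check_dns() {
  if nslookup "$1" >/dev/null 2>&1; then
    echo "dns ok: $1"
  else
    echo "dns FAILED: $1"
  fi
}

# Report whether an HTTPS endpoint answers at all. Any HTTP status counts as
# reachable, because receiving one means DNS, routing, and TLS all worked.
# $1 = URL, e.g. https://app.gitpod.io
check_https() {
  # curl prints 000 as the status code when no response was received.
  code=$(curl -sS -o /dev/null --max-time 10 -w '%{http_code}' "$1") || code=000
  if [ "$code" != "000" ]; then
    echo "reachable ($code): $1"
  else
    echo "UNREACHABLE: $1"
  fi
}
```

For example, `check_dns app.gitpod.io` followed by `check_https https://app.gitpod.io` separates a DNS problem from a routing or firewall problem.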
Health Endpoint Connectivity Test
For Enterprise Runners, test the health endpoint to verify network connectivity and load balancer functionality.
Replace <your-domain> with your actual domain name configured during setup. A successful response returns an HTTP 200 status code, indicating that:
- DNS resolution is working correctly
- Load balancer is accessible from your network
- SSL/TLS certificate is properly configured
- Basic network connectivity is established
If the health check fails, check:
- DNS configuration and propagation
- Security group rules allowing HTTPS traffic
- Load balancer health and target group status
- SSL certificate validity and domain matching
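A health check along these lines can be sketched with curl. The health endpoint path is not given in this guide, so the function takes the full URL as an argument; the function name and the <health-path> placeholder are assumptions, and the real path should come from your Enterprise Runner setup documentation.

```shell
# Sketch only: request a health endpoint and report the outcome.
# $1 = full health endpoint URL (path is an assumption; see your setup docs)
check_health() {
  rc=0
  status=$(curl -sS -o /dev/null --max-time 10 -w '%{http_code}' "$1") || rc=$?
  if [ "$rc" -eq 0 ] && [ "$status" = "200" ]; then
    echo "healthy"
  else
    # Common curl exit codes: 6 = DNS resolution failed, 7 = connection failed,
    # 28 = timed out, 60 = TLS certificate could not be verified.
    echo "unhealthy: curl exit $rc, HTTP $status"
    return 1
  fi
}
```

Called as `check_health "https://<your-domain>/<health-path>"`, the curl exit code in the failure message points at which item of the checklist above to look at first.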
Required Endpoints Connectivity Test
Test connectivity to these critical endpoints from your Runner’s subnet:
Restarting the Runner After Networking Changes
After applying networking changes (such as security group updates, route table modifications, or VPC endpoint configurations), you may need to restart the Runner ECS task to ensure the changes take effect.
Using the AWS Console
- Navigate to the AWS ECS console
- In the left sidebar, click Clusters
- Locate and click on the cluster with your stack name (found in Settings > Runners in Ona)
- In the Services tab, click on the service associated with your Runner
- Click the Update button
- In the Deployment configuration section, check the box for Force new deployment
- Click Update at the bottom of the page
- ECS will start a new task with the updated networking configuration and gracefully stop the old one
Using AWS CLI
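You can also restart the Runner using the AWS CLI by running aws ecs update-service with the --force-new-deployment flag against the Runner's cluster and service.
Verification Steps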
After making networking changes and restarting the Runner:
- Check Runner status in Ona
  - Go to Settings > Runners in your Ona dashboard
  - Verify the Runner shows as “Connected”
- Test Environment creation
  - Create a new Environment using the Runner
  - Verify the Environment starts successfully
- Monitor CloudWatch logs
  - Check ECS task logs for any connectivity errors
  - Look for successful connections to Ona services
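The restart and the ECS side of the verification can be sketched with the AWS CLI. This is a hedged sketch with illustrative function names; the cluster is the one named after your stack, and the service name can be found with the first helper.

```shell
# Sketch only: force a new Runner deployment and wait for it to settle.

# Print the service ARNs in a cluster, to find the Runner's service name.
# $1 = cluster name
list_services() {
  aws ecs list-services --cluster "$1" --query 'serviceArns[]' --output text
}

# Force a new deployment, the CLI equivalent of "Force new deployment" in the console.
# $1 = cluster name, $2 = service name
restart_runner() {
  aws ecs update-service --cluster "$1" --service "$2" --force-new-deployment \
    --query 'service.deployments[].[id,status,rolloutState]' --output text
}

# Block until the service reports a steady state, or fail when the waiter times out.
# $1 = cluster name, $2 = service name
wait_until_stable() {
  aws ecs wait services-stable --cluster "$1" --services "$2"
}

# Print the service status with running and desired task counts.
# $1 = cluster name, $2 = service name
runner_task_counts() {
  aws ecs describe-services --cluster "$1" --services "$2" \
    --query 'services[].[status,runningCount,desiredCount]' --output text
}
```

A typical sequence is `restart_runner`, then `wait_until_stable`, then `runner_task_counts`, before checking the Runner status in Ona.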