Be able to set resource limits/requests per service container for the self-hosted container runner
In a self-hosted scenario using the container runner (Kubernetes setup) with service containers, it would be great to be able to specify different resource limits/requests per service container instead of inheriting these specs from the first container. The current limitation causes a lot of resource waste.
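For illustration, this is the kind of per-container granularity a plain Kubernetes pod spec already allows; the container names, images, and values below are hypothetical and sketch the requested behaviour, not an existing container-runner feature:

```yaml
# Hypothetical task pod: each container declares its own resources
# rather than inheriting the primary container's spec.
apiVersion: v1
kind: Pod
metadata:
  name: example-task-pod        # hypothetical name
spec:
  containers:
    - name: primary             # the main job container
      image: cimg/base:stable
      resources:
        requests: { cpu: "2", memory: 4Gi }
        limits:   { cpu: "2", memory: 4Gi }
    - name: postgres            # service container with its own, smaller spec
      image: postgres:15
      resources:
        requests: { cpu: 250m, memory: 512Mi }
        limits:   { cpu: 500m, memory: 1Gi }
```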
Label the task pod with the job name
It would be helpful to add a label with the real job name to the pod that is running the task, to easily reference which pod is running which task. These labels could then show up in our observability platform, letting us aggregate across the same job type. Right now we can only do that by resource class.
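As a sketch, such a label on the task pod metadata might look like the following; the label key `circleci.com/job-name` is a hypothetical example, not an existing container-agent label:

```yaml
# Hypothetical: a job-name label on the task pod metadata.
apiVersion: v1
kind: Pod
metadata:
  name: example-task-pod
  labels:
    app.kubernetes.io/name: container-agent
    circleci.com/job-name: build-and-test   # hypothetical key carrying the real job name
```

Observability agents that ingest pod labels as tags could then group and filter task pods by job name rather than only by resource class.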
Develop a better way for Container Runner to clean up failed task pods
In some cases, container runner fails to clean up dead task pods before the GC loop takes effect. It would be great if the team could explore other options, so that the GC loop serves only as a fallback and cluster resource needs decrease.
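Until something like that exists, one possible interim workaround is a scheduled sweep of Failed pods in the runner namespace. This is only a sketch; the namespace, schedule, service account, and image are all assumptions:

```yaml
# Hypothetical CronJob that deletes Failed pods ahead of the GC loop.
# The service account needs RBAC permission to list/delete pods.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: failed-task-pod-sweeper
  namespace: circleci-runner            # hypothetical runner namespace
spec:
  schedule: "*/10 * * * *"              # every 10 minutes
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: pod-sweeper
          restartPolicy: Never
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest
              args:
                - delete
                - pods
                - --field-selector=status.phase==Failed
                - -n
                - circleci-runner
```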
Open-source the self-hosted runner agents
An ask to open-source the self-hosted runner agents so the community can work with the team on any issues.
Support auto-scaling for self-hosted runners
Automatic scaling to handle changing workloads (AWS Auto Scaling groups, Kubernetes, etc.).
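As a rough illustration of the Kubernetes-native primitives this could build on, here is a standard HorizontalPodAutoscaler. The `machine-runner` Deployment name is hypothetical, since today's runners do not expose a scale target like this:

```yaml
# Hypothetical sketch: CPU-based autoscaling of a runner Deployment.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: runner-autoscaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: machine-runner        # hypothetical runner Deployment
  minReplicas: 1
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```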
Allow customization of log formats for Container Runner
With the self-hosted Container Runner, we noted that the logs emitted from the container-agent are hard for monitoring services (e.g., DataDog) to parse and aggregate. It would be nice to be able to configure the following:
- include the log level attribute (currently not shown)
- allow for customization of the format (e.g., a space-separated key=value format)
- toggle colour output
The attached screenshot shows the logs I retrieved from the container-agent pod.
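A minimal sketch of what such configuration could look like as Helm-style values; none of these keys exist in the container-agent chart today:

```yaml
# Hypothetical logging knobs illustrating the ask.
agent:
  logging:
    level: info        # include the level attribute in each log line
    format: logfmt     # e.g., space-separated key=value pairs
    color: false       # disable ANSI colour output for log shippers
```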
Add delay and retry mechanism for Container Runner
In some cases the container agent may try to schedule the pod before a node is alive and healthy. It would be great to add a small delay and/or retries to the container agent to allow the node to fully come up.
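Sketched as Helm-style values, the requested knobs might look like this; these keys are hypothetical and do not exist in the container-agent chart today:

```yaml
# Hypothetical scheduling knobs illustrating the ask.
agent:
  scheduling:
    initialDelay: 15s   # wait before the first scheduling attempt
    retries: 5          # re-attempt scheduling while nodes come up
    retryBackoff: 10s   # delay between attempts
```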
Increase the timeout of livenessProbe and readinessProbe
To help with network congestion and throughput, it would be great if we could increase these probe timeouts to better tolerate networks that are congested or take time to spin up.
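These are standard Kubernetes probe fields; the endpoint is hypothetical and the values only illustrate more generous settings for congested networks:

```yaml
# Standard livenessProbe shape with raised timeouts.
livenessProbe:
  httpGet:
    path: /healthz            # hypothetical endpoint
    port: 8080
  initialDelaySeconds: 30
  timeoutSeconds: 10          # Kubernetes default is 1s; raised for slow networks
  periodSeconds: 20
  failureThreshold: 6
```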
Add ReadinessProbe to the task pods
Since we are already using a livenessProbe, it would be great if a readinessProbe could be added as well, to make sure the task pod has fully come up in case of network issues, etc.
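For reference, this is the standard Kubernetes readinessProbe shape; the endpoint and values are illustrative of what a task pod could expose, not an existing container-agent setting:

```yaml
# Standard readinessProbe shape; gates traffic until the pod is fully up.
readinessProbe:
  httpGet:
    path: /ready              # hypothetical endpoint
    port: 8080
  initialDelaySeconds: 10
  timeoutSeconds: 10
  periodSeconds: 10
  failureThreshold: 3
```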