Support automatic retry of failed builds
under review
Aaron
It would be great if CircleCI allowed you to configure an automatic retry of a build upon failure. Ideally, you would be able to specify the max number of times you would like it to retry.
CCI-I-935
Fernando Abreu
When rerunning a specific job, what would you expect to happen if other jobs in the workflow have failed?
Would you still expect the tagged job to rerun on its own?
If so, could you share your use case?
Timo Sand
Fernando Abreu My initial thought is that if the Workflow is already in a Failing state, then retrying wouldn't be necessary as it would just spend credits on something that would possibly get restarted soon manually.
Grandi Andrea
Fernando Abreu if a workflow has 10 jobs and 2 are failing, I expect a global setting for the workflow (max_retries=5), and both failing jobs will be retried up to 5 times until they pass. Simple as that.
Fernando Abreu
Would you be interested in having finer control over reruns at the step level?
For example:
jobs:
  test-job:
    ...
    steps:
      - checkout
      # other steps
      - run:
          name: Run Tests
          max-retries: 3
          command: ...
      # other steps
In this example, the Run Tests step would be retried up to 3 times before failing, while other steps would not be retried if they failed. This would allow you to define retries only for steps known to be flaky.
Would this be useful to you?
Nathan Duckett
Fernando Abreu This would work well for our use case.
We also sometimes have issues when initializing a job using multiple containers in an executor. Being able to retry at the job level would also solve this.
Timo Sand
Fernando Abreu This is exactly what we are looking for
Grandi Andrea
Fernando Abreu this is exactly what we are asking 🙂
David Shepherd
Fernando Abreu Yeah this would be perfect for us.
A retry-after: argument would also be useful.
For extra context: this would be most useful for us for steps which include some networking that occasionally fails (e.g. an npm install, docker pull, github API query, etc). It's generally possible to configure individual network requests to retry on failures, but it would be much easier to be able to configure this on the step itself.
Brian Carp
This would suit our use case as well.
Fernando Abreu
David Shepherd Thanks, that’s great to hear!
Can you share a bit more about how you'd use retry-after?
David Shepherd
Fernando Abreu Yeah, so it depends what your default retry time would be.
For flakes caused by purely random local failures then you'd probably want no retry delay at all.
For networking related flakes, like an npm or github api being down, maybe you want a retry delay of a minute or so to give the API time to stabilise. Retrying immediately would be quite likely to just fail again.
Although, maybe it's sufficient to have a retry delay of around 30 seconds for everything?
But either way supporting retries at all is like 99% of the value here.
J. Casalino
Fernando Abreu Yes. It would be good to set a timeout or delay between retries as well. Use case: Transient network conditions or busy servers caused a failure. Let's either progressively back off between retries or allow a user to configure something like "wait n seconds before retry" setting for that step.
Fernando Abreu
That makes sense—thanks for the additional context! It's definitely something we'll consider.
Cody Smith
Fernando Abreu Our use case is retrying the whole job in case of worker failure, so we can use spot VMs to save on cost, so wouldn't need task level retries.
Fernando Abreu
Cody Smith thanks for sharing your use case!
Nathan Fish
under review
Would folks expect to see this work as a job configuration option or would they rather see it as a project setting and CircleCI tries to auto-retry based on some inspection of the job?
Maxime Lapointe
Nathan Fish Configuration would be more beneficial for us. We'd like different kinds of backoff options and retry limits, and also the possibility of outputting metadata during/after retries (logging, further triggers in the workflow DAG, etc.)
Grandi Andrea
Nathan Fish both would be fine. In case of config, something like RETRY_ON_FAILURE=true and RETRY_MAX_ATTEMPTS=5 could work. Same for Project settings.
The benefit of having it in the config is that in many cases devs could change these values with a PR.
Iain Beeston
Nathan Fish Per configuration would be best for me. For me, most jobs aren't flakey and there's no point in retrying. If I could turn on auto retry for specific jobs I could have fine-grained control
J. Casalino
Nathan Fish Per configuration option would also be my preference; I would not want to set it as a global option for the entire server. I would want to set the backoff and retry limits as Maxime Lapointe mentioned.
Cody Smith
Nathan Fish Job config option would make the most sense to me. I've got some jobs that run on a flaky resource class, while others use one of the reliable/builtin resource classes. So only the former need retries configured.
Mark Gibaud
Nathan Fish Per configuration. Really only for niche/specific jobs (like flakey tests) so job scope makes sense.
Nick Venenga
Nathan Fish would be helpful to configure "retry failed tests" on jobs with some kind of limit like attempts=3
Nathan Fish
While maybe not exactly what folks are looking for here, would https://circleci.canny.io/api-feature-requests/p/allow-re-run-failed-tests-to-be-triggered-via-api at least give you some flexibility to create your own automation rules for this? I'm thinking it would be handy for known flaky tests and such.
Grandi Andrea
Nathan Fish hi, honestly not much.
To know that a CI job failed, I would have to build a dedicated service which either polls your API (something neither of us wants) or listens for a webhook you send and acts on it (by parsing the response, getting the ID of the job which failed, calling an API, etc.).
It would be much easier for you to know which jobs failed and just retrigger them (if the user has allowed such an option), without asking users to implement their own service.
Lud
Grandi Andrea 100% agree. I don't want to provision a server, maintain custom code, figure out access in a secure way, etc.
Daniel Janicek
Nathan Fish I was looking for this feature today too. I think the ask (at least from my perspective) is more of a convenience/ease-of-use thing. Auto-retries are totally possible right now through either on_fail or a retry.sh script (which we have) to retry individual commands. It would make things easier though if I didn't have to sprinkle retry.sh across every command that might fail with a transient network issue.
It would be neat to have things auto-retry whenever a job fails with an HTTP 503 (or related) response anywhere in the workflow. IDK how realistic that is if CircleCI only sees the script exit code, but it would be cool!
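For readers unfamiliar with the wrapper approach mentioned above, here is a minimal Python sketch of the idea. It is not the retry.sh script from this comment (which isn't shown in the thread) and not a CircleCI feature; the function name run_with_retries and its parameters are hypothetical. Like the exit-code caveat above, it can only see whether the command exited non-zero, not why.

```python
import subprocess
import sys
import time


def run_with_retries(cmd, max_retries=3, delay_seconds=0.0):
    """Run cmd, re-running it whenever it exits with a non-zero code.

    Hypothetical helper for illustration. max_retries counts re-runs after
    the first attempt, so the command runs at most max_retries + 1 times.
    Returns a tuple of (final exit code, number of attempts made).
    """
    attempts = 0
    while True:
        attempts += 1
        result = subprocess.run(cmd)
        # Stop on success, or once the retry budget is used up.
        if result.returncode == 0 or attempts > max_retries:
            return result.returncode, attempts
        time.sleep(delay_seconds)


if __name__ == "__main__":
    # Stand-in for a flaky command such as an npm install or docker pull.
    code, attempts = run_with_retries([sys.executable, "-c", "print('ok')"])
    print(code, attempts)
```

Every command that might hit a transient failure has to be wrapped this way, which is the chore a built-in option would remove.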
J. Casalino
+1, we also have occasional transient failures which cause the workflow to fail when a simple retry would resolve it. There's no reason to involve a human when a script can do the same thing. I would envision two settings for this: 1. Max number of retries, and 2. The delay between retries. If the delay is not specified, a progressive backoff should occur between retries (e.g. wait 1 minute after first failure, 2 minutes after second, 4 minutes after third, etc).
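The two settings described above could behave roughly like the following Python sketch. This is only an illustration of the proposal, not anything CircleCI provides; backoff_delays and its parameter names are hypothetical, and the doubling schedule is taken from the 1, 2, 4 minute example in the comment.

```python
def backoff_delays(max_retries, delay_seconds=None, base_seconds=60):
    """Return the wait (in seconds) before each retry.

    Hypothetical helper for illustration. If delay_seconds is given, every
    retry waits that fixed amount. Otherwise the wait starts at base_seconds
    and doubles after each failure (progressive backoff).
    """
    if delay_seconds is not None:
        return [delay_seconds] * max_retries
    return [base_seconds * 2 ** i for i in range(max_retries)]


# No delay specified: progressive backoff of 1, 2, then 4 minutes.
print(backoff_delays(3))
# Fixed delay specified: wait 30 seconds before every retry.
print(backoff_delays(3, delay_seconds=30))
```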
Craig Hawkes
Agree, being able to automate the rerun of a single job if it fails would be useful.
It's possible to do this via the UI, and makes so much more sense to have it as an auto option.
One note - it would need to wait for all other jobs to complete, and then rerun all failed jobs
So basically a way to automate the option provided in the UI if there are any failed jobs (maybe optionally based on which jobs failed?).
Andrei R
Allowing automatic retries of failed jobs and workflows is good for CircleCI business. More retries = more compute time spent = more $$$
Conner Babb
+1, this would be a great feature
Lester DeKay
Please!