Support automatic retry of failed builds | Feature Requests

under review

Aaron

It would be great if CircleCI allowed you to configure an automatic retry of a build upon failure. Ideally, you would be able to specify the max number of times you would like it to retry. 
Taken from: https://discuss.circleci.com/t/support-auto-retrying-of-failed-builds/13332/6
CCI-I-935

March 8, 2019

Simon McManis

Fernando Abreu Can you confirm the exact behavior? Specifically, will it re-run only the failed jobs, or the entire workflow? If it’s only the failed jobs, would it be accurate to say this functions the same as manually clicking “Re-run workflow from failed” in the UI?

Fernando Abreu

Simon McManis that's correct, it would be the same, just without the manual intervention

Paul Chaplin

Fernando Abreu This looks like it could be very useful for us.

Would it use "Rerun failed tests" in preference to "Rerun workflow from failed" when that's been set up, or always fully re-run the failed job regardless? (We'd prefer just to re-run failed tests.)

Fernando Abreu

Paul Chaplin it will use "Rerun workflow from failed".

Fernando Abreu

This is what workflow retries will look like. Thoughts?

J. Casalino

Fernando Abreu Looks good to me

Cody Smith

Fernando Abreu Will the "Start (UTC)" column reflect the start time of the attempt? Also maybe the "Trigger event" could reference the original trigger (a.k.a. first domino), e.g. Fixed component "run-ci"? Otherwise LGTM.

Fernando Abreu

Cody Smith Yes, it will reflect the time of the retried workflow.

Thanks for the feedback! The idea of referencing the original trigger makes a lot of sense. We won’t include it in the first release, but we’ll definitely consider it for a future one.

Fernando Abreu

Merged in a post:

Retry a build if it fails due to the 10 minute timeout

Joseph Emison

Occasionally, we have a build fail on the 10-minute timeout because of some kind of network failure in the container (e.g., trying to pull down packages from the internet). On the occasion that a build fails because of the 10-minute timeout, we would like to have the build re-run automatically to see if running it again clears the error. As it stands right now, developers have to push new code to retrigger the build because the build must succeed to merge the PR, and the failure won't clear otherwise.

CCI-I-533

May 13, 2025

Fernando Abreu

When rerunning a specific job, what would you expect to happen if other jobs in the workflow have failed?
Would you still expect the tagged job to rerun on its own?
If so, could you share your use case?

Timo Sand

Fernando Abreu My initial thought is that if the Workflow is already in a Failing state, then retrying wouldn't be necessary as it would just spend credits on something that would possibly get restarted soon manually.

Andrea Grandi

Fernando Abreu if a workflow has 10 jobs and 2 are failing, I expect a global setting for the workflow (max_retries=5), and both failing jobs would be retried up to 5 times until they pass. Simple as that.

Fernando Abreu

Would you be interested in having finer control over reruns at the step level?

For example:

jobs:
  test-job:
    ...
    steps:
      - checkout
      # other steps
      - run:
          name: Run Tests
          max-retries: 3
          command: ...
      # other steps

In this example, the Run Tests step would be retried up to 3 times before failing, while other steps would not be retried if they failed. This would allow you to define retries only for steps known to be flaky.

Would this be useful to you?

nathan.duckett@sailthru.com

Fernando Abreu This would work well for our use case.

We also sometimes have issues when initializing a job using multiple containers in an executor. Being able to retry at the job level would also solve this.

Timo Sand

Fernando Abreu This is exactly what we are looking for

Andrea Grandi

Fernando Abreu this is exactly what we are asking 🙂

david@wave.com

Fernando Abreu Yeah this would be perfect for us.
A retry-after: argument would also be useful.
For extra context: this would be most useful for us for steps which include some networking that occasionally fails (e.g. an npm install, docker pull, github API query, etc). It's generally possible to configure individual network requests to retry on failures, but it would be much easier to be able to configure this on the step itself.

brian.carp@customink.com

This would suit our use case as well.

Fernando Abreu

david@wave.com Thanks, that’s great to hear!

Can you share a bit more about how you'd use retry-after?

david@wave.com

Fernando Abreu Yeah, so it depends what your default retry time would be. 
For flakes caused by purely random local failures then you'd probably want no retry delay at all.
For networking related flakes, like an npm or github api being down, maybe you want a retry delay of a minute or so to give the API time to stabilise. Retrying immediately would be quite likely to just fail again.
Although, maybe it's sufficient to have a retry delay of around 30 seconds for everything?
But either way supporting retries at all is like 99% of the value here.

J. Casalino

Fernando Abreu Yes. It would be good to set a timeout or delay between retries as well. Use case: Transient network conditions or busy servers caused a failure. Let's either progressively back off between retries or allow a user to configure something like "wait n seconds before retry" setting for that step.

Fernando Abreu

That makes sense—thanks for the additional context! It's definitely something we'll consider.

Cody Smith

Fernando Abreu Our use case is retrying the whole job in case of worker failure, so we can use spot VMs to save on cost, so wouldn't need task level retries.

Fernando Abreu

Cody Smith thanks for sharing your use case!

Nathan Fish

marked this post as under review

Would folks expect to see this work as a job configuration option or would they rather see it as a project setting and CircleCI tries to auto-retry based on some inspection of the job?

Maxime Lapointe

Nathan Fish Configuration would be more beneficial for us. We'd want different kinds of backoff options and retry limits, and also the possibility of outputting metadata during/after retries (logging, further triggers in the workflow DAG, etc.).

Andrea Grandi

Nathan Fish both would be fine. In case of config, something like RETRY_ON_FAILURE=true and RETRY_MAX_ATTEMPTS=5 could work. Same for Project settings.

The benefit of being in the config is that in many cases devs could change these values with a PR.

iain@iainbeeston.com

Nathan Fish Per configuration would be best for me. For me, most jobs aren't flakey and there's no point in retrying. If I could turn on auto retry for specific jobs I could have fine-grained control

J. Casalino

Nathan Fish Per configuration option would also be my preference; I would not want to set it as a global option for the entire server. I would want to set the backoff and retry limits as Maxime Lapointe mentioned.

Cody Smith

Nathan Fish Job config option would make the most sense to me. I've got some jobs that run on a flaky resource class, while others use one of the reliable/builtin resource classes. So only the former need retries configured.

Mark Gibaud

Nathan Fish Per configuration. Really only for niche/specific jobs (like flakey tests) so job scope makes sense.

Nick Venenga

Nathan Fish would be helpful to configure "retry failed tests" on jobs with some kind of limit like attempts=3

Nathan Fish

While this may not be exactly what folks are looking for here, would https://circleci.canny.io/api-feature-requests/p/allow-re-run-failed-tests-to-be-triggered-via-api at least give you some flexibility to create your own automation rules for this? I'm thinking it would be handy for known flaky tests and such.

Andrea Grandi

Nathan Fish hi, honestly not much. 
To know that a CI job failed I would have to build a dedicated service which either polls your API (something neither of us wants) or listens for some webhook you send and acts on it (by parsing the response, getting the ID of the job which failed, calling an API, etc.).
It would be much easier for you to know which jobs failed and just retrigger them (if the user has allowed such an option), without asking users to implement their own service.

Lud

Andrea Grandi 100% agree. I don't want to provision a server, maintain custom code, figure out access in a secure way, etc.

Daniel Janicek

Nathan Fish I was looking for this feature today too. I think the ask (at least from my perspective) is more of a convenience/ease-of-use thing. Auto-retries are totally possible right now through either on_fail or a script (we have a retry.sh script to retry individual commands). It would make things easier though if I didn't have to sprinkle retry.sh across every command that might fail with a transient network issue.

It would be neat to have things auto-retry whenever a job fails with an HTTP 503 (or related) response anywhere in the workflow. IDK how realistic that is if CircleCI only sees the script exit code, but it would be cool!
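
For readers without such a script, here is a minimal sketch of a command-level retry wrapper. It is written in Python rather than shell and is not the retry.sh mentioned above; the function name run_with_retries and its parameters are made up for illustration. Like the point about exit codes, it only looks at the command's exit status, so it cannot tell a 503 apart from any other failure.

```python
import subprocess
import time


def run_with_retries(command, max_attempts=3, delay_seconds=0.0):
    """Run `command` until it exits with 0 or `max_attempts` is used up.

    `command` is an argument list such as ["npm", "install"].
    Returns the exit code of the last attempt.
    (Hypothetical helper, not part of any CircleCI tooling.)
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    returncode = 1
    for attempt in range(1, max_attempts + 1):
        # Only the exit code is inspected; output is passed through untouched.
        returncode = subprocess.run(command).returncode
        if returncode == 0:
            break
        if attempt < max_attempts:
            # Wait before the next attempt, but not after the final one.
            time.sleep(delay_seconds)
    return returncode
```

A step could then call something like run_with_retries(["npm", "install"], max_attempts=3, delay_seconds=30) and exit with the returned code, which is exactly the per-command sprinkling that a built-in option would make unnecessary.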

J. Casalino

+1, we also have occasional transient failures which cause the workflow to fail when a simple retry would resolve it. There's no reason to involve a human when a script can do the same thing. I would envision two settings for this: 1. Max number of retries, and 2. The delay between retries. If the delay is not specified, a progressive backoff should occur between retries (e.g. wait 1 minute after first failure, 2 minutes after second, 4 minutes after third, etc).
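
The two settings described above could be sketched roughly as follows. This is only an illustration of the proposed behaviour, not anything CircleCI has announced; the function name backoff_delays and the parameter names are assumptions, and the 60-second base simply mirrors the "1 minute, 2 minutes, 4 minutes" example.

```python
def backoff_delays(max_retries, delay_seconds=None, base_seconds=60):
    """Return the wait, in seconds, before each retry.

    If `delay_seconds` is given, every retry waits that long.
    Otherwise the wait doubles each time, starting at `base_seconds`.
    (Hypothetical helper illustrating the two proposed settings.)
    """
    if max_retries < 0:
        raise ValueError("max_retries must not be negative")
    if delay_seconds is not None:
        # Fixed delay configured by the user.
        return [delay_seconds] * max_retries
    # Progressive backoff: base, 2 * base, 4 * base, ...
    return [base_seconds * 2 ** i for i in range(max_retries)]
```

With the defaults, backoff_delays(3) gives waits of 60, 120 and 240 seconds, while backoff_delays(3, delay_seconds=30) waits 30 seconds before every retry.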

Craig Hawkes

Agree, being able to automate the rerun of a single job if it fails would be useful.
It's possible to do this via the UI, and it makes so much more sense to have it as an automatic option.
One note: it would need to wait for all other jobs to complete, and then rerun all failed jobs,
so basically a way to automate the option provided in the UI if there are any failed jobs (maybe optionally based on which jobs failed?).
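
The "wait for everything to finish, then rerun only what failed" rule can be sketched as a small decision function. The status strings, the function name jobs_to_rerun and the retry budget parameters are all assumptions made for illustration; they are not taken from the CircleCI API.

```python
# Assumed status names; a real system may use different ones.
TERMINAL_STATUSES = {"success", "failed", "canceled"}


def jobs_to_rerun(job_statuses, attempts_used, max_retries):
    """Decide which jobs an automatic "rerun from failed" would restart.

    `job_statuses` maps job name to its current status.
    Returns an empty list while any job is still running, or once the
    retry budget is spent; otherwise the names of the failed jobs.
    """
    if any(status not in TERMINAL_STATUSES for status in job_statuses.values()):
        # Some job has not finished yet, so it is too early to rerun.
        return []
    if attempts_used >= max_retries:
        return []
    return sorted(name for name, status in job_statuses.items() if status == "failed")
```

For a workflow where build passed and test and lint failed, jobs_to_rerun({"build": "success", "test": "failed", "lint": "failed"}, 0, 5) would return ["lint", "test"], and nothing would be rerun while any job was still in progress.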

Andrei Railean

Allowing automatic retries of failed jobs and workflows is good for CircleCI business. More retries = more compute time spent = more $$$
