Rerun failed Parallel Runs

under review

Justin Houck

We have parallelism set to 28 to run our tests. If a single parallel run fails, we currently have to rerun the entire job. It would be great to just re-run the parallel run that failed.

CCI-I-1329

January 22, 2020

Minjun Seong

Hi folks, I'm a product manager at CircleCI. I would love to loop back in to better understand the failure scenarios you're experiencing. Could you elaborate on what causes your parallel nodes to fail and why the "rerun failed test" feature as mentioned by Sebastian does not (or does) solve your pain point?

Ben Brinckerhoff

Minjun Seong In our case, we see failures where, for example, network issues prevent libraries from being downloaded or code from being checked out. The machine fails before the step that runs specs, so we end up in a situation where:
* There are no failed specs
* But the job has failed because some containers failed at a previous step

Daniele Formichelli

Minjun Seong "rerun failed test" requires some additional setup to work, and in any case all the machines are started even if all of their tests passed.
Instead, if I run "rerun failed", I would expect the parallel machines that did not fail not to be rerun, as if each were a separate job (so only the failed ones would be rerun).
This would also be a more general solution, since it would apply to non-test workflows too.

Liam Sharp

Minjun Seong In our case we have 25 machines running tests in parallel that all individually do various setup tasks that can fail, and when they do it's very early on. Each node takes about 3 mins to set up, and another 20 to run the actual tests. In those 3 mins there are various network-related tasks that can randomly fail: downloading previous test results, downloading Java, setting up Docker (the most common failure) and installing Python requirements. I'd say we face this issue a few times a week.

Nick Chursin

https://ideas.circleci.com/cloud-feature-requests/p/test-rerun-ui-button-on-job-dashboard

Sebastian Lerner

Hi folks, we have a feature currently in Closed Preview (beta) that we believe addresses the root of the issue in this post: https://circleci.com/docs/rerun-failed-tests-only/. We're hoping to make the feature available to all in an Open Preview soon. You can start making the minor updates to your config.yml now to be prepared for when the feature is available to all users, details in the docs linked above.

If you have a use case that requires you to rerun the full parallel run instead of just running the failed tests, I'd love to hear about it.

Gregory Haddow

Sebastian Lerner: would love to give this a try. We currently have our own workaround for this, which still requires us to rerun all of the tests assigned to an individual test node in a parallel job. I am interested in seeing how this will work with our pipeline, but I am concerned the partial run may be problematic for coverage and other test metrics. Is there an option that duplicates our own behaviour, which is to rerun only the failed nodes while still running all tests assigned to each of those nodes?

Roman Ivanov

Sebastian Lerner: please activate this for me or my company too.

Donald Tyler

Sebastian Lerner: thanks for the update. This feature addresses some of the scenarios that this feature request is needed for, but not all.
A node can fail for reasons other than a failed test, including but not limited to:
* Problems with third party services, e.g. pulling images from GCR, Docker Hub, etc
* An issue with CircleCI's infrastructure
So we would still like the ability to explicitly request that a certain node within the parallel job be rerun from scratch, not just the failed tests.

Bastian Krol

Sebastian Lerner: I have set up our tests as described in the docs (https://github.com/instana/nodejs/pull/779), but I don't see the "Rerun failed tests only" option. I assume it is still in closed beta? Is there any chance you could activate this for our account (github/instana) as well?

Sebastian Lerner

Hey folks, an FYI that the "rerun failed tests only" functionality is now available to any CircleCI user. Feel free to reach out if there are any questions. https://discuss.circleci.com/t/product-launch-re-run-failed-tests-only-circleci-tests-run/47775/51

Donald Tyler thanks for clarifying, makes total sense. This is something that we're evaluating how to enable, it is unfortunately not trivial.

Matt Rubin

Please add this. It's a huge product deficiency and pain point in our pre-merge testing.

Liam Sharp

Totally agree. We're using 20 machines on a job that takes about 15 mins. If 1 of the parallel runs fails (due to some flakiness out of our control), rerunning just that parallel run vs all 20 is the difference between £0.15 and £3 in terms of credits, so maybe this is why this hasn't been addressed yet.
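
As a rough check of the figures above, here is a minimal sketch that assumes credit cost is split evenly across nodes. That linear scaling is an assumption for illustration, not a statement about how CircleCI bills. The function name and the use of pence are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: cost of rerunning only some nodes, in pence.
# Assumes the full job cost is split evenly across nodes (linear scaling).
# Usage: rerun_cost_pence FULL_COST_PENCE TOTAL_NODES NODES_TO_RERUN
rerun_cost_pence() {
  local full_cost=$1 total_nodes=$2 rerun_nodes=$3
  echo $(( full_cost * rerun_nodes / total_nodes ))
}

# Figures from the comment: 300 pence (£3) for all 20 nodes, 1 node failed.
rerun_cost_pence 300 20 1   # prints 15, i.e. £0.15
```

Under that assumption, rerunning a single failed node costs one twentieth of rerunning the whole job.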

Jeff Fairley

Hi everyone. I just wanted to share that I've been using a job matrix rather than parallelism for the stated issues. (screenshot attached)

Using <<parameters.index>> in my jobs has been a great functional equivalent to $CIRCLE_NODE_INDEX.

If you need the typical environment variables provided with parallelism (maybe for the circleci tests CLI command), they can be provided like so:

echo 'export CIRCLE_NODE_INDEX=<<parameters.index>>' >> $BASH_ENV
echo 'export CIRCLE_NODE_TOTAL=<<parameters.total>>' >> $BASH_ENV
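
To illustrate what those two variables make possible, here is a minimal sketch of a step that picks one node's share of test files round-robin from the index and total. The function name files_for_node and the spec file names are hypothetical, and in practice the circleci tests CLI would normally do the splitting. This only shows the idea in plain bash.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the files assigned to one node, round-robin.
# Usage: files_for_node INDEX TOTAL FILE...
files_for_node() {
  local index=$1 total=$2
  shift 2
  local i=0 f
  for f in "$@"; do
    # File number i goes to the node whose index equals i modulo total.
    if [ $(( i % total )) -eq "$index" ]; then
      printf '%s\n' "$f"
    fi
    i=$(( i + 1 ))
  done
}

# In a matrix job these would be exported from the matrix parameters as shown
# above; the defaults here exist only so the sketch runs standalone.
CIRCLE_NODE_INDEX=${CIRCLE_NODE_INDEX:-0}
CIRCLE_NODE_TOTAL=${CIRCLE_NODE_TOTAL:-1}

# Placeholder file names, not taken from any real project.
files_for_node "$CIRCLE_NODE_INDEX" "$CIRCLE_NODE_TOTAL" \
  a_spec.rb b_spec.rb c_spec.rb d_spec.rb
```

Because each matrix entry is its own job, a failed entry can be rerun on its own while still receiving the same index and therefore the same files.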

I hope this helps others, and I hope CircleCI implements individual parallel job restart soon!

William Tait

+1 Thanks for the update Dawit. It would be great to hear from the CircleCI team about some plans that address the original feature request (specifically for parallel runs)

Dawit Gebregziabher

Hi everyone, thanks for the feedback. We understand this is an issue and we have been exploring solutions. We are currently working on a feature to help your jobs fail fast in the event of a failure in your test suite. This will save time and credits and should be available in Preview soon. I'll be following up here with updates. 
In the future, we plan to explore solutions like rerunning only failed tests and running failed tests first. While these features won't necessarily be scoped at the parallel run level, the goal is to improve overall test suite run efficiency. 
As always, your feedback is very important to us so please continue to upvote and comment here with your feedback and questions!

Jake Cozart

Dawit Gebregziabher: For what it's worth... we really just want the ability to re-run any job step (failed or succeeded). Our use case is that we use CircleCI to kick off deployments (it spins up a machine with the build number and talks to AWS to kick off code deploy / monitor for success). Occasionally when deploying there is a manual step (migrating the database). Once that is completed, we would like to just kick off that step again to finish the deploy. Or it would be nice to kick off a build step on a completed run to roll back software if needed.

We can already do this with SSH but it leaves a machine running in the background that we have to kill in CircleCI. If we could simply re-run a step without SSH that would be amazing!

Mike LaRocca

Dawit Gebregziabher: Yeah, this solution worries me a bit because you are limiting data gathering. Fail fast means you only get signal on a specific % of your tests. I want to know the whole picture but only want to follow up on what needs to be followed up on.

I'm sure it's an opt-in feature but not sure it really addresses the root cause at all (aside from cost savings)

Chang Wang

Dawit Gebregziabher: This doesn't quite help our situation.
* We unfortunately have some flakey e2e tests
* Different parts of the test might flake on each run
* We would like to rerun only the failed parallel runs and prevent the successful ones from rerunning

Donald Tyler

Dawit Gebregziabher: Thanks for the info, but unfortunately I am in the same situation as Chang Wang. We want to retry flaky tests, which failing fast won't help with.

jake.y.scott@gmail.com

+1 can you all please do the right thing here? This results in a colossal waste of time and energy (both in terms of human energy and electrical). I don't want to have to manually implement retryable parallel jobs using a dynamic config.

ismail.jattioui1@gmail.com

+ 1 pls

ismail.jattioui1@gmail.com

Barbara Nichols hi, are there any updates on this pls?
