Adding retry after config and formatted logging#707
Conversation
|
The rationale behind that change is the following: We are running our agents on top of EKS + kube2iam, and I'm seeing an occasional issue where runners can't seem to perform properly their API calls. The runner shows a suspicious error when that occurs: When it happens, the runner usually reconnects fine. However, it occasionally happens at the first call, during configuration. If that happens, the configuration fails but the entrypoint script carries through without handling that situation with an early exit. The runner then enters the Note: the configuration is correct, the RUNNER_* environment variables are all present, and this runner has worked before so this is not a setup issue. What happens next is annoying: the runner is provisioned from an autoscaling perspective, but is not busy. When the horizontal runner autoscaler (PercentBusy-based to limit the number of API calls) kicks in, it finds that the runner is doing nothing so scales down, even though lots of jobs are queuing. We are actively working in debugging the issue and why occasional calls are being dropped, but in the meantime other community members could benefit from a simple retry at configuration time, given that if it fails the runner cannot be used. |
|
I'm not sure whether the tests are failing because I don't have access to this secret: Or whether the secret is missing altogether 😅 |
|
@sledigabel Hey! Thanks for the PR. I believe this is a great addition, but on the other hand, I have a concern about how I could quickly test this. For anything written in Go, we have unit tests written in Go. Would it be possible to write some unit test for this |
|
Thank you very much for your review @mumoshu! I'll look into what we can do surrounding testing, thanks for raising this. Quick question, are the diff printouts here still needed? They seem to come from this PR a while back as the upstream runner was modified to allow ephemerality, not sure if it's still desired from a logging perspective? I'm missing the history on this so not sure. |
@sledigabel Good question! Yes, I believe that's still a nice-to-have, as the upstream runsvc.sh still does not support |
|
Added a new commit with unit tests for I browsed a few unit test framework for bash but I found nothing suitable as the entrypoint isn't broken down in functions, so ended up organising something in bash. The unit tests are simulating a run for entrypoint. It creates some Unfortunately the entrypoint.sh contains some sections that are not Testing includes for now:
Also tested the entrypoint.sh on a real runner, still works as expected. |
There was a problem hiding this comment.
RUNNER_HOME helps with the mocking. Defaults to /runner as before.
There was a problem hiding this comment.
We can't do this during unit testing, so had to add a conditional here ... not sure how to improve it but open to suggestions.
|
Will add a github action for this shortly. |
783a7f7 to
fabc68e
Compare
There was a problem hiding this comment.
This is hard to mock, so had to add a conditional here.
There was a problem hiding this comment.
same as line 73, This is hard to mock, so had to add a conditional here.
There was a problem hiding this comment.
Ah, good catch!
I would rather make this a standard feature behind a variable like RUNNER_CREATE_EXTERNALS_DIR_FROM so that L141 to L144 would look like:
RUNNER_CREATE_EXTERNALS_DIR_FROM=${RUNNER_CREATE_EXTERNALS_DIR_FROM:-./externalstmp}
if [ ! -z "${RUNNER_CREATE_EXTERNALS_DIR_FROM}" ]; then
mkdir ./externals
# Hack due to the DinD volumes
mv ${RUNNER_CREATE_EXTERNALS_DIR_FROM}/* ./externals/
fi
Similarly, runsvc.sh and RunnerService.js patching can be behind a variable so that it looks like:
RUNNER_PATCH_SERVICE_FILES=${RUNNER_PATCH_SERVICE_FILES:-true}
if [ ! -z "${RUNNER_PATCH_SERVICE_FILES}" ]; then
for f in runsvc.sh RunnerService.js; do
diff {bin,patched}/${f} || :
sudo mv bin/${f}{,.bak}
sudo mv {patched,bin}/${f}
done
fi
RUNNER_CREATE_EXTERNALS_DIR_FROM and RUNNER_PATCH_SERVICE_FILES would be unnecessary in unit tests so I expect you to unset those variables in tests.
There was a problem hiding this comment.
since the runner is actually not installed during the unit tests, i had to skip that one too.
There was a problem hiding this comment.
This is the global unit test script
|
@mumoshu when you have a chance could you please have a look? |
Adding a basic retry loop during configuration. If configuration fails, the runner will just straight into a retry loop and will continuously fail until it dies after a while. This change will retry 10 times and will exit if the configuration wasn't successful. Also, changed the logging format, adding a bit of color in the event of success or failure.
The unit tests are simulating a run for entrypoint. It creates some dummy config.sh and runsvc.sh and makes sure the logic behind entrypoint.sh is correct. Unfortunately the entrypoint.sh contains some sections that are not mockable so I had to put some logic in there too. Testing includes for now: - the normal scenario - the normal non-ephemeral scenario - the configuration failure scenario Also tested the entrypoint.sh on a real runner, still works as expected.
... and adding safety mechanism in UNITTEST handling.
011b532 to
0d3bdab
Compare
| if [ -z "${UNITTEST:-}" ]; then | ||
| sudo chown -R runner:docker ${RUNNER_HOME} | ||
| mv /runnertmp/* ${RUNNER_HOME}/ | ||
| fi |
There was a problem hiding this comment.
@sledigabel Hey! Thanks for all your efforts. I like the general direction we're heading in this PR.
And this part (L73-76) is the only blocker before merging for me.
Can we somehow remove this unit-test-only conditional? I generally prefer not to have test code leaking into the implementation.
My quick idea would be to inject all the dependencies (the user and the group to be chown-ed, the directory which contains initial data for the RUNNER_HOME dir):
| if [ -z "${UNITTEST:-}" ]; then | |
| sudo chown -R runner:docker ${RUNNER_HOME} | |
| mv /runnertmp/* ${RUNNER_HOME}/ | |
| fi | |
| sudo chown -R ${RUNNER_USER}:${DOCKER_USER} ${RUNNER_HOME} | |
| mv ${RUNNER_INIT}/* ${RUNNER_HOME}/ | |
| fi |
You'd also need to have defaults for these new variables, perhaps it would look like:
RUNNER_USER=${RUNNER_USER:-runner}
DOCKER_USER=${DOCKER_USER:-docker}
RUNNER_INIT=${RUNNER_INIT:-/runnertmp}
|
Apologies on the delay on this, I'll post an updated PR later today |
|
@mumoshu I continued working on this, however I stumbled upon the problem of the sudo chown -R ${RUNNER_USER}:${DOCKER_USER} ${RUNNER_HOME}The issue with sudo is that it would require sudo rights to run the tests continuously. It's ok in the SaaS runners but it's not necessarily on personal laptops. I would suggest two courses of actions:
There is also the case of the last bit of the setup: I set this up behind the $UNITTEST variable, are you happy with that setup? Let me know what you think. |
|
@sledigabel Thanks a lot for your patience and efforts!
That would work! But I took some time to think about it and realized that I'd rather use something like It should default to In unit tests it should be unset with e.g.
Ah, thanks for confirming! I overlooked it. Yeah, I'd prefer making it look like a standard feature if possible. I've added my thoughts at https://github.com/actions-runner-controller/actions-runner-controller/pull/707/files#r695291063. WDYT? |
|
Thanks @mumoshu it all looks good to me, I'll amend the PR tonight |
|
@sledigabel Hey! Thanks again for your contribution. Mainly due to my delay, this took too much time to settle. I'm merging this as-is now. Then we can reduce potential conflicts with other changes, and we can still add any necessary changes on top later. |
|
@mumoshu sorry it took so long to get back with a more decent management around the $UNITTEST, i'll prepare another PR to get rid of it |
This was something that was missing in actions#707. Adding a new test to make sure the ephemeral feature flag from upstream is set up correctly by the script.
This was something that was missing in #707. Adding a new test to make sure the ephemeral feature flag from upstream is set up correctly by the script.

Adding a basic retry loop during configuration. If configuration fails,
the runner will just straight into a retry loop and will continuously
fail until it dies after a while.
This change will retry 10 times and will exit if the configuration
wasn't successful.
Also, changed the logging format, adding a bit of color in the event of
success or failure.