Adding retry after config and formatted logging by sledigabel · Pull Request #707 · actions/actions-runner-controller

sledigabel · 2021-07-27T14:17:30Z

Adding a basic retry loop during configuration. If configuration fails,
the runner will just straight into a retry loop and will continuously
fail until it dies after a while.

This change will retry 10 times and will exit if the configuration
wasn't successful.

Also, changed the logging format, adding a bit of color in the event of
success or failure.

sledigabel · 2021-07-27T14:26:53Z

The rationale behind that change is the following:

We are running our agents on top of EKS + kube2iam, and I'm seeing an occasional issue where runners can't seem to perform properly their API calls. The runner shows a suspicious error when that occurs:

Cannot assign requested address

When it happens, the runner usually reconnects fine.

However, it occasionally happens at the first call, during configuration. If that happens, the configuration fails but the entrypoint script carries through without handling that situation with an early exit. The runner then enters the runsvc.sh and loops infinitely since it's not configured:


--------------------------------------------------------------------------------
| ____ _ _ _ _ _ _ _ _ |
| / ___(_) |_| | | |_ _| |__ / \ ___| |_(_) ___ _ __ ___ |
| | | _| | __| |_| | | | | '_ \ / _ \ / __| __| |/ _ \| '_ \/ __| |
| | |_| | | |_| _ | |_| | |_) | / ___ \ (__| |_| | (_) | | | \__ \ |
| \____|_|\__|_| |_|\__,_|_.__/ /_/ \_\___|\__|_|\___/|_| |_|___/ |
| |
| Self-hosted runner registration |
| |
--------------------------------------------------------------------------------

# Authentication

√ Connected to GitHub

# Runner Registration

A runner exists with the same name
√ Successfully replaced the runner
Cannot assign requested address
16c16
< ./externals/node12/bin/node ./bin/RunnerService.js &
---
> ./externals/node12/bin/node ./bin/RunnerService.js $* &
20c20
< wait $PID
---
> wait $PID
\ No newline at end of file
19c19
< var runService = function () {
---
> var runService = function() {
23c23
< if (!stopping) {
---
> if(!stopping) {
27c27
< listener = childProcess.spawn(listenerExePath, ['run'], { env: process.env });
---
> listener = childProcess.spawn(listenerExePath, ['run'].concat(process.argv.slice(3)), { env: process.env });
30c30
< listener = childProcess.spawn(listenerExePath, ['run', '--startuptype', 'service'], { env: process.env });
---
> listener = childProcess.spawn(listenerExePath, ['run', '--startuptype', 'service'].concat(process.argv.slice(2)), { env: process.env });
33c33
< console.log(`Started listener process, pid: ${listener.pid}`);
---
> console.log('Started listener process');
43,46d42
< listener.on("error", (err) => {
< console.log(`Runner listener fail to start with error ${err.message}`);
< });
<
64c60
< if (!stopping) {
---
> if(!stopping) {
69c65
< } catch (ex) {
---
> } catch(ex) {
78c74
< var gracefulShutdown = function (code) {
---
> var gracefulShutdown = function(code) {
95c91
< });
---
> });
\ No newline at end of file
.path=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/runner/.local/bin
Starting Runner listener with startup type: service
Started listener process
Started running service
An error occurred: Not configured
Runner listener exited with error code 2
Runner listener exit with retryable error, re-launch runner in 5 seconds.
Starting Runner listener with startup type: service
Started listener process
An error occurred: Not configured
Runner listener exited with error code 2
Runner listener exit with retryable error, re-launch runner in 5 seconds.
Starting Runner listener with startup type: service
Started listener process
An error occurred: Not configured
Runner listener exited with error code 2
Runner listener exit with retryable error, re-launch runner in 5 seconds.
Starting Runner listener with startup type: service
Started listener process
An error occurred: Not configured
Runner listener exited with error code 2
Runner listener exit with retryable error, re-launch runner in 5 seconds.
Starting Runner listener with startup type: service
Started listener process
An error occurred: Not configured
Runner listener exited with error code 2
Runner listener exit with retryable error, re-launch runner in 5 seconds.
Starting Runner listener with startup type: service
Started listener process
An error occurred: Not configured
Runner listener exited with error code 2
Runner listener exit with retryable error, re-launch runner in 5 seconds.
Starting Runner listener with startup type: service
Started listener process
An error occurred: Not configured
Runner listener exited with error code 2
Runner listener exit with retryable error, re-launch runner in 5 seconds.
Starting Runner listener with startup type: service
Started listener process
An error occurred: Not configured
Runner listener exited with error code 2
Runner listener exit with retryable error, re-launch runner in 5 seconds.
Starting Runner listener with startup type: service
Started listener process
An error occurred: Not configured
Runner listener exited with error code 2
Runner listener exit with retryable error, re-launch runner in 5 seconds.
Starting Runner listener with startup type: service
Started listener process
An error occurred: Not configured

Note: the configuration is correct, the RUNNER_* environment variables are all present, and this runner has worked before so this is not a setup issue.

What happens next is annoying: the runner is provisioned from an autoscaling perspective, but is not busy. When the horizontal runner autoscaler (PercentBusy-based to limit the number of API calls) kicks in, it finds that the runner is doing nothing so scales down, even though lots of jobs are queuing.

We are actively working in debugging the issue and why occasional calls are being dropped, but in the meantime other community members could benefit from a simple retry at configuration time, given that if it fails the runner cannot be used.

sledigabel · 2021-07-27T14:49:45Z

I'm not sure whether the tests are failing because I don't have access to this secret:
https://github.com/actions-runner-controller/actions-runner-controller/blob/master/.github/workflows/build-and-release-runners.yml#L40

Or whether the secret is missing altogether 😅

mumoshu · 2021-07-30T11:00:31Z

@sledigabel Hey! Thanks for the PR. I believe this is a great addition, but on the other hand, I have a concern about how I could quickly test this. For anything written in Go, we have unit tests written in Go.

Would it be possible to write some unit test for this entrypoint.sh retrying logic? If so, what would be your idea for the implementation?

sledigabel · 2021-07-30T15:11:08Z

Thank you very much for your review @mumoshu!

I'll look into what we can do surrounding testing, thanks for raising this.

Quick question, are the diff printouts here still needed? They seem to come from this PR a while back as the upstream runner was modified to allow ephemerality, not sure if it's still desired from a logging perspective? I'm missing the history on this so not sure.

mumoshu · 2021-07-31T05:02:17Z

Quick question, are the diff printouts here still needed? They seem to come from this PR a while back as the upstream runner was modified to allow ephemerality, not sure if it's still desired from a logging perspective? I'm missing the history on this so not sure.

@sledigabel Good question! Yes, I believe that's still a nice-to-have, as the upstream runsvc.sh still does not support --once yet and we aren't sure how long this situation persists. Without that diff log, we have no way to notice what changes are made to it by the upstream in which circumstance.

sledigabel · 2021-08-02T09:20:16Z

Added a new commit with unit tests for entrypoint.sh.

I browsed a few unit test framework for bash but I found nothing suitable as the entrypoint isn't broken down in functions, so ended up organising something in bash.

The unit tests are simulating a run for entrypoint. It creates some
dummy config.sh and runsvc.sh and makes sure the logic behind
entrypoint.sh is correct.

Unfortunately the entrypoint.sh contains some sections that are not
mockable so I had to put some logic in there too.

Testing includes for now:

the normal scenario
the normal non-ephemeral scenario
the configuration failure scenario

Also tested the entrypoint.sh on a real runner, still works as expected.

sledigabel · 2021-08-02T09:22:58Z

RUNNER_HOME helps with the mocking. Defaults to /runner as before.

sledigabel · 2021-08-02T09:23:50Z

We can't do this during unit testing, so had to add a conditional here ... not sure how to improve it but open to suggestions.

sledigabel · 2021-08-02T09:26:52Z

Will add a github action for this shortly.

sledigabel · 2021-08-02T10:01:33Z

This is hard to mock, so had to add a conditional here.

sledigabel · 2021-08-02T10:01:47Z

same as line 73, This is hard to mock, so had to add a conditional here.

Ah, good catch!

I would rather make this a standard feature behind a variable like RUNNER_CREATE_EXTERNALS_DIR_FROM so that L141 to L144 would look like:

RUNNER_CREATE_EXTERNALS_DIR_FROM=${RUNNER_CREATE_EXTERNALS_DIR_FROM:-./externalstmp} if [ ! -z "${RUNNER_CREATE_EXTERNALS_DIR_FROM}" ]; then mkdir ./externals # Hack due to the DinD volumes mv ${RUNNER_CREATE_EXTERNALS_DIR_FROM}/* ./externals/ fi

Similarly, runsvc.sh and RunnerService.js patching can be behind a variable so that it looks like:

RUNNER_PATCH_SERVICE_FILES=${RUNNER_PATCH_SERVICE_FILES:-true} if [ ! -z "${RUNNER_PATCH_SERVICE_FILES}" ]; then for f in runsvc.sh RunnerService.js; do diff {bin,patched}/${f} || : sudo mv bin/${f}{,.bak} sudo mv {patched,bin}/${f} done fi

RUNNER_CREATE_EXTERNALS_DIR_FROM and RUNNER_PATCH_SERVICE_FILES would be unnecessary in unit tests so I expect you to unset those variables in tests.

sledigabel · 2021-08-02T10:02:22Z

since the runner is actually not installed during the unit tests, i had to skip that one too.

sledigabel · 2021-08-02T10:03:03Z

This is the global unit test script

sledigabel · 2021-08-06T08:39:35Z

@mumoshu when you have a chance could you please have a look?
It's considerably more code with the unit testing, but it works for me in prod if that can be of any help :-)

Adding a basic retry loop during configuration. If configuration fails, the runner will just straight into a retry loop and will continuously fail until it dies after a while. This change will retry 10 times and will exit if the configuration wasn't successful. Also, changed the logging format, adding a bit of color in the event of success or failure.

The unit tests are simulating a run for entrypoint. It creates some dummy config.sh and runsvc.sh and makes sure the logic behind entrypoint.sh is correct. Unfortunately the entrypoint.sh contains some sections that are not mockable so I had to put some logic in there too. Testing includes for now: - the normal scenario - the normal non-ephemeral scenario - the configuration failure scenario Also tested the entrypoint.sh on a real runner, still works as expected.

... and adding safety mechanism in UNITTEST handling.

sledigabel · 2021-08-11T11:44:16Z

@mumoshu I was going to write some unit tests specifically for RUNNER_FEATURE_FLAG_EPHEMERAL. Since #724 the tests now pass too.
Are you ok generally with the direction this PR is going?

mumoshu · 2021-08-12T02:39:13Z

+if [ -z "${UNITTEST:-}" ]; then
+  sudo chown -R runner:docker ${RUNNER_HOME}
+  mv /runnertmp/* ${RUNNER_HOME}/
+fi


@sledigabel Hey! Thanks for all your efforts. I like the general direction we're heading in this PR.

And this part (L73-76) is the only blocker before merging for me.

Can we somehow remove this unit-test-only conditional? I generally prefer not to have test code leaking into the implementation.

My quick idea would be to inject all the dependencies (the user and the group to be chown-ed, the directory which contains initial data for the RUNNER_HOME dir):

Suggested change

if [ -z "${UNITTEST:-}" ]; then

sudo chown -R runner:docker ${RUNNER_HOME}

mv /runnertmp/* ${RUNNER_HOME}/

fi

sudo chown -R ${RUNNER_USER}:${DOCKER_USER} ${RUNNER_HOME}

mv ${RUNNER_INIT}/* ${RUNNER_HOME}/

fi

You'd also need to have defaults for these new variables, perhaps it would look like:

RUNNER_USER=${RUNNER_USER:-runner} DOCKER_USER=${DOCKER_USER:-docker} RUNNER_INIT=${RUNNER_INIT:-/runnertmp}

sledigabel · 2021-08-17T09:55:34Z

Apologies on the delay on this, I'll post an updated PR later today

sledigabel · 2021-08-20T20:24:39Z

@mumoshu I continued working on this, however I stumbled upon the problem of the sudo line 74:

sudo chown -R ${RUNNER_USER}:${DOCKER_USER} ${RUNNER_HOME}

The issue with sudo is that it would require sudo rights to run the tests continuously. It's ok in the SaaS runners but it's not necessarily on personal laptops. I would suggest two courses of actions:

we run line 74 only if RUNNER_USER and DOCKER_USER is defined (that would be my preferred option, we would set it to empty during the unit tests)
we force the sudo and accept it can only be run in environments where sudo rights are granted.

There is also the case of the last bit of the setup:
https://github.com/actions-runner-controller/actions-runner-controller/pull/707/files#diff-79187d0921cc40dbb8cc87debf9261df16066b0c4dd267dc4329312745ae3a98R141

I set this up behind the $UNITTEST variable, are you happy with that setup?

Let me know what you think.

mumoshu · 2021-08-24T23:54:55Z

@sledigabel Thanks a lot for your patience and efforts!

we run line 74 only if RUNNER_USER and DOCKER_USER is defined (that would be my preferred option, we would set it to empty during the unit tests)

That would work! But I took some time to think about it and realized that I'd rather use something like RUNNER_SUDO so that you can control whether to prepend sudo in various places. So the snippet would look like:

${RUNNER_SUDO} chown -R ${RUNNER_USER}:${DOCKER_USER} ${RUNNER_HOME}

It should default to sudo in a standard setup so the definition would look like

RUNNER_SUDO=${RUNNER_SUDO:-sudo}

In unit tests it should be unset with e.g. RUNNER_SUDO=.

There is also the case of the last bit of the setup:
https://github.com/actions-runner-controller/actions-runner-controller/pull/707/files#diff-79187d0921cc40dbb8cc87debf9261df16066b0c4dd267dc4329312745ae3a98R141

I set this up behind the $UNITTEST variable, are you happy with that setup?

Ah, thanks for confirming! I overlooked it. Yeah, I'd prefer making it look like a standard feature if possible.

I've added my thoughts at https://github.com/actions-runner-controller/actions-runner-controller/pull/707/files#r695291063. WDYT?

sledigabel · 2021-08-25T11:50:14Z

Thanks @mumoshu it all looks good to me, I'll amend the PR tonight

mumoshu · 2021-09-05T23:50:52Z

@sledigabel Hey! Thanks again for your contribution. Mainly due to my delay, this took too much time to settle. I'm merging this as-is now. Then we can reduce potential conflicts with other changes, and we can still add any necessary changes on top later.

sledigabel · 2021-09-09T13:53:06Z

@mumoshu sorry it took so long to get back with a more decent management around the $UNITTEST, i'll prepare another PR to get rid of it

This was something that was missing in actions#707. Adding a new test to make sure the ephemeral feature flag from upstream is set up correctly by the script.

This was something that was missing in #707. Adding a new test to make sure the ephemeral feature flag from upstream is set up correctly by the script.

sledigabel force-pushed the retry_config branch from 970ce6c to 9522f78 Compare July 30, 2021 15:14

sledigabel commented Aug 2, 2021

View reviewed changes

sledigabel force-pushed the retry_config branch 2 times, most recently from 783a7f7 to fabc68e Compare August 2, 2021 09:59

sledigabel commented Aug 2, 2021

View reviewed changes

Comment thread runner/entrypoint.sh Outdated

Copy link
Copy Markdown

Contributor Author

sledigabel Aug 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is hard to mock, so had to add a conditional here.

sledigabel commented Aug 2, 2021

View reviewed changes

sledigabel added 3 commits August 11, 2021 12:31

Adding github action for entrypoint unit test

0d3bdab

... and adding safety mechanism in UNITTEST handling.

sledigabel force-pushed the retry_config branch from 011b532 to 0d3bdab Compare August 11, 2021 11:34

mumoshu reviewed Aug 12, 2021

View reviewed changes

mumoshu merged commit a98729b into actions:master Sep 5, 2021

sledigabel mentioned this pull request Sep 13, 2021

Adding unit test for ephemeral feature flag #808

Merged

mumoshu pushed a commit that referenced this pull request Sep 14, 2021

Adding unit test for ephemeral feature flag

bf35c51

This was something that was missing in #707. Adding a new test to make sure the ephemeral feature flag from upstream is set up correctly by the script.

Conversation

sledigabel commented Jul 27, 2021

Uh oh!

sledigabel commented Jul 27, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sledigabel commented Jul 27, 2021

Uh oh!

mumoshu commented Jul 30, 2021

Uh oh!

sledigabel commented Jul 30, 2021

Uh oh!

mumoshu commented Jul 31, 2021

Uh oh!

sledigabel commented Aug 2, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sledigabel commented Aug 2, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sledigabel commented Aug 6, 2021

Uh oh!

sledigabel commented Aug 11, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sledigabel commented Aug 17, 2021

Uh oh!

sledigabel commented Aug 20, 2021

Uh oh!

mumoshu commented Aug 24, 2021

Uh oh!

sledigabel commented Aug 25, 2021

Uh oh!

mumoshu commented Sep 5, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sledigabel commented Sep 9, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

sledigabel commented Jul 27, 2021 •

edited

Loading

mumoshu commented Sep 5, 2021 •

edited

Loading