Skip to content

Adding retry after config and formatted logging#707

Merged
mumoshu merged 3 commits intoactions:masterfrom
sledigabel:retry_config
Sep 5, 2021
Merged

Adding retry after config and formatted logging#707
mumoshu merged 3 commits intoactions:masterfrom
sledigabel:retry_config

Conversation

@sledigabel
Copy link
Copy Markdown
Contributor

Adding a basic retry loop during configuration. If configuration fails,
the runner will just straight into a retry loop and will continuously
fail until it dies after a while.

This change will retry 10 times and will exit if the configuration
wasn't successful.

Also, changed the logging format, adding a bit of color in the event of
success or failure.

@sledigabel
Copy link
Copy Markdown
Contributor Author

sledigabel commented Jul 27, 2021

The rationale behind that change is the following:

We are running our agents on top of EKS + kube2iam, and I'm seeing an occasional issue where runners can't seem to perform properly their API calls. The runner shows a suspicious error when that occurs:

Cannot assign requested address

When it happens, the runner usually reconnects fine.

However, it occasionally happens at the first call, during configuration. If that happens, the configuration fails but the entrypoint script carries through without handling that situation with an early exit. The runner then enters the runsvc.sh and loops infinitely since it's not configured:


--------------------------------------------------------------------------------
| ____ _ _ _ _ _ _ _ _ |
| / ___(_) |_| | | |_ _| |__ / \ ___| |_(_) ___ _ __ ___ |
| | | _| | __| |_| | | | | '_ \ / _ \ / __| __| |/ _ \| '_ \/ __| |
| | |_| | | |_| _ | |_| | |_) | / ___ \ (__| |_| | (_) | | | \__ \ |
| \____|_|\__|_| |_|\__,_|_.__/ /_/ \_\___|\__|_|\___/|_| |_|___/ |
| |
| Self-hosted runner registration |
| |
--------------------------------------------------------------------------------

# Authentication

√ Connected to GitHub

# Runner Registration

A runner exists with the same name
√ Successfully replaced the runner
Cannot assign requested address
16c16
< ./externals/node12/bin/node ./bin/RunnerService.js &
---
> ./externals/node12/bin/node ./bin/RunnerService.js $* &
20c20
< wait $PID
---
> wait $PID
\ No newline at end of file
19c19
< var runService = function () {
---
> var runService = function() {
23c23
< if (!stopping) {
---
> if(!stopping) {
27c27
< listener = childProcess.spawn(listenerExePath, ['run'], { env: process.env });
---
> listener = childProcess.spawn(listenerExePath, ['run'].concat(process.argv.slice(3)), { env: process.env });
30c30
< listener = childProcess.spawn(listenerExePath, ['run', '--startuptype', 'service'], { env: process.env });
---
> listener = childProcess.spawn(listenerExePath, ['run', '--startuptype', 'service'].concat(process.argv.slice(2)), { env: process.env });
33c33
< console.log(`Started listener process, pid: ${listener.pid}`);
---
> console.log('Started listener process');
43,46d42
< listener.on("error", (err) => {
< console.log(`Runner listener fail to start with error ${err.message}`);
< });
<
64c60
< if (!stopping) {
---
> if(!stopping) {
69c65
< } catch (ex) {
---
> } catch(ex) {
78c74
< var gracefulShutdown = function (code) {
---
> var gracefulShutdown = function(code) {
95c91
< });
---
> });
\ No newline at end of file
.path=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/runner/.local/bin
Starting Runner listener with startup type: service
Started listener process
Started running service
An error occurred: Not configured
Runner listener exited with error code 2
Runner listener exit with retryable error, re-launch runner in 5 seconds.
Starting Runner listener with startup type: service
Started listener process
An error occurred: Not configured
Runner listener exited with error code 2
Runner listener exit with retryable error, re-launch runner in 5 seconds.
Starting Runner listener with startup type: service
Started listener process
An error occurred: Not configured
Runner listener exited with error code 2
Runner listener exit with retryable error, re-launch runner in 5 seconds.
Starting Runner listener with startup type: service
Started listener process
An error occurred: Not configured
Runner listener exited with error code 2
Runner listener exit with retryable error, re-launch runner in 5 seconds.
Starting Runner listener with startup type: service
Started listener process
An error occurred: Not configured
Runner listener exited with error code 2
Runner listener exit with retryable error, re-launch runner in 5 seconds.
Starting Runner listener with startup type: service
Started listener process
An error occurred: Not configured
Runner listener exited with error code 2
Runner listener exit with retryable error, re-launch runner in 5 seconds.
Starting Runner listener with startup type: service
Started listener process
An error occurred: Not configured
Runner listener exited with error code 2
Runner listener exit with retryable error, re-launch runner in 5 seconds.
Starting Runner listener with startup type: service
Started listener process
An error occurred: Not configured
Runner listener exited with error code 2
Runner listener exit with retryable error, re-launch runner in 5 seconds.
Starting Runner listener with startup type: service
Started listener process
An error occurred: Not configured
Runner listener exited with error code 2
Runner listener exit with retryable error, re-launch runner in 5 seconds.
Starting Runner listener with startup type: service
Started listener process
An error occurred: Not configured

Note: the configuration is correct, the RUNNER_* environment variables are all present, and this runner has worked before so this is not a setup issue.

What happens next is annoying: the runner is provisioned from an autoscaling perspective, but is not busy. When the horizontal runner autoscaler (PercentBusy-based to limit the number of API calls) kicks in, it finds that the runner is doing nothing so scales down, even though lots of jobs are queuing.

We are actively working in debugging the issue and why occasional calls are being dropped, but in the meantime other community members could benefit from a simple retry at configuration time, given that if it fails the runner cannot be used.

@sledigabel
Copy link
Copy Markdown
Contributor Author

I'm not sure whether the tests are failing because I don't have access to this secret:
https://github.com/actions-runner-controller/actions-runner-controller/blob/master/.github/workflows/build-and-release-runners.yml#L40

Or whether the secret is missing altogether 😅

@mumoshu
Copy link
Copy Markdown
Collaborator

mumoshu commented Jul 30, 2021

@sledigabel Hey! Thanks for the PR. I believe this is a great addition, but on the other hand, I have a concern about how I could quickly test this. For anything written in Go, we have unit tests written in Go.

Would it be possible to write some unit test for this entrypoint.sh retrying logic? If so, what would be your idea for the implementation?

@sledigabel
Copy link
Copy Markdown
Contributor Author

Thank you very much for your review @mumoshu!

I'll look into what we can do surrounding testing, thanks for raising this.

Quick question, are the diff printouts here still needed? They seem to come from this PR a while back as the upstream runner was modified to allow ephemerality, not sure if it's still desired from a logging perspective? I'm missing the history on this so not sure.

@mumoshu
Copy link
Copy Markdown
Collaborator

mumoshu commented Jul 31, 2021

Quick question, are the diff printouts here still needed? They seem to come from this PR a while back as the upstream runner was modified to allow ephemerality, not sure if it's still desired from a logging perspective? I'm missing the history on this so not sure.

@sledigabel Good question! Yes, I believe that's still a nice-to-have, as the upstream runsvc.sh still does not support --once yet and we aren't sure how long this situation persists. Without that diff log, we have no way to notice what changes are made to it by the upstream in which circumstance.

@sledigabel
Copy link
Copy Markdown
Contributor Author

Added a new commit with unit tests for entrypoint.sh.

I browsed a few unit test framework for bash but I found nothing suitable as the entrypoint isn't broken down in functions, so ended up organising something in bash.

The unit tests are simulating a run for entrypoint. It creates some
dummy config.sh and runsvc.sh and makes sure the logic behind
entrypoint.sh is correct.

Unfortunately the entrypoint.sh contains some sections that are not
mockable so I had to put some logic in there too.

Testing includes for now:

  • the normal scenario
  • the normal non-ephemeral scenario
  • the configuration failure scenario

Also tested the entrypoint.sh on a real runner, still works as expected.

image

Comment thread runner/entrypoint.sh Outdated
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RUNNER_HOME helps with the mocking. Defaults to /runner as before.

Comment thread runner/entrypoint.sh Outdated
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't do this during unit testing, so had to add a conditional here ... not sure how to improve it but open to suggestions.

@sledigabel
Copy link
Copy Markdown
Contributor Author

Will add a github action for this shortly.

@sledigabel sledigabel force-pushed the retry_config branch 2 times, most recently from 783a7f7 to fabc68e Compare August 2, 2021 09:59
Comment thread runner/entrypoint.sh Outdated
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is hard to mock, so had to add a conditional here.

Comment thread runner/entrypoint.sh Outdated
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as line 73, This is hard to mock, so had to add a conditional here.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good catch!

I would rather make this a standard feature behind a variable like RUNNER_CREATE_EXTERNALS_DIR_FROM so that L141 to L144 would look like:

RUNNER_CREATE_EXTERNALS_DIR_FROM=${RUNNER_CREATE_EXTERNALS_DIR_FROM:-./externalstmp}
if [ ! -z "${RUNNER_CREATE_EXTERNALS_DIR_FROM}" ]; then
  mkdir ./externals
  # Hack due to the DinD volumes
  mv ${RUNNER_CREATE_EXTERNALS_DIR_FROM}/* ./externals/
fi

Similarly, runsvc.sh and RunnerService.js patching can be behind a variable so that it looks like:

RUNNER_PATCH_SERVICE_FILES=${RUNNER_PATCH_SERVICE_FILES:-true}
if [ ! -z "${RUNNER_PATCH_SERVICE_FILES}" ]; then
  for f in runsvc.sh RunnerService.js; do
    diff {bin,patched}/${f} || :
    sudo mv bin/${f}{,.bak}
    sudo mv {patched,bin}/${f}
  done
fi

RUNNER_CREATE_EXTERNALS_DIR_FROM and RUNNER_PATCH_SERVICE_FILES would be unnecessary in unit tests so I expect you to unset those variables in tests.

Comment thread runner/entrypoint.sh Outdated
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since the runner is actually not installed during the unit tests, i had to skip that one too.

Comment thread test/entrypoint/entrypoint_unittest.sh Outdated
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the global unit test script

@sledigabel
Copy link
Copy Markdown
Contributor Author

@mumoshu when you have a chance could you please have a look?
It's considerably more code with the unit testing, but it works for me in prod if that can be of any help :-)

Adding a basic retry loop during configuration. If configuration fails,
the runner will just straight into a retry loop and will continuously
fail until it dies after a while.

This change will retry 10 times and will exit if the configuration
wasn't successful.

Also, changed the logging format, adding a bit of color in the event of
success or failure.
The unit tests are simulating a run for entrypoint. It creates some
dummy config.sh and runsvc.sh and makes sure the logic behind
entrypoint.sh is correct.

Unfortunately the entrypoint.sh contains some sections that are not
mockable so I had to put some logic in there too.

Testing includes for now:
- the normal scenario
- the normal non-ephemeral scenario
- the configuration failure scenario

Also tested the entrypoint.sh on a real runner, still works as expected.
... and adding safety mechanism in UNITTEST handling.
@sledigabel
Copy link
Copy Markdown
Contributor Author

@mumoshu I was going to write some unit tests specifically for RUNNER_FEATURE_FLAG_EPHEMERAL. Since #724 the tests now pass too.
Are you ok generally with the direction this PR is going?

Comment thread runner/entrypoint.sh
Comment on lines +73 to +76
if [ -z "${UNITTEST:-}" ]; then
sudo chown -R runner:docker ${RUNNER_HOME}
mv /runnertmp/* ${RUNNER_HOME}/
fi
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sledigabel Hey! Thanks for all your efforts. I like the general direction we're heading in this PR.

And this part (L73-76) is the only blocker before merging for me.

Can we somehow remove this unit-test-only conditional? I generally prefer not to have test code leaking into the implementation.

My quick idea would be to inject all the dependencies (the user and the group to be chown-ed, the directory which contains initial data for the RUNNER_HOME dir):

Suggested change
if [ -z "${UNITTEST:-}" ]; then
sudo chown -R runner:docker ${RUNNER_HOME}
mv /runnertmp/* ${RUNNER_HOME}/
fi
sudo chown -R ${RUNNER_USER}:${DOCKER_USER} ${RUNNER_HOME}
mv ${RUNNER_INIT}/* ${RUNNER_HOME}/
fi

You'd also need to have defaults for these new variables, perhaps it would look like:

RUNNER_USER=${RUNNER_USER:-runner}
DOCKER_USER=${DOCKER_USER:-docker}
RUNNER_INIT=${RUNNER_INIT:-/runnertmp}

@sledigabel
Copy link
Copy Markdown
Contributor Author

Apologies on the delay on this, I'll post an updated PR later today

@sledigabel
Copy link
Copy Markdown
Contributor Author

@mumoshu I continued working on this, however I stumbled upon the problem of the sudo line 74:

sudo chown -R ${RUNNER_USER}:${DOCKER_USER} ${RUNNER_HOME}

The issue with sudo is that it would require sudo rights to run the tests continuously. It's ok in the SaaS runners but it's not necessarily on personal laptops. I would suggest two courses of actions:

  1. we run line 74 only if RUNNER_USER and DOCKER_USER is defined (that would be my preferred option, we would set it to empty during the unit tests)
  2. we force the sudo and accept it can only be run in environments where sudo rights are granted.

There is also the case of the last bit of the setup:
https://github.com/actions-runner-controller/actions-runner-controller/pull/707/files#diff-79187d0921cc40dbb8cc87debf9261df16066b0c4dd267dc4329312745ae3a98R141

I set this up behind the $UNITTEST variable, are you happy with that setup?

Let me know what you think.

@mumoshu
Copy link
Copy Markdown
Collaborator

mumoshu commented Aug 24, 2021

@sledigabel Thanks a lot for your patience and efforts!

we run line 74 only if RUNNER_USER and DOCKER_USER is defined (that would be my preferred option, we would set it to empty during the unit tests)

That would work! But I took some time to think about it and realized that I'd rather use something like RUNNER_SUDO so that you can control whether to prepend sudo in various places. So the snippet would look like:

${RUNNER_SUDO} chown -R ${RUNNER_USER}:${DOCKER_USER} ${RUNNER_HOME}

It should default to sudo in a standard setup so the definition would look like

RUNNER_SUDO=${RUNNER_SUDO:-sudo}

In unit tests it should be unset with e.g. RUNNER_SUDO=.

There is also the case of the last bit of the setup:
https://github.com/actions-runner-controller/actions-runner-controller/pull/707/files#diff-79187d0921cc40dbb8cc87debf9261df16066b0c4dd267dc4329312745ae3a98R141

I set this up behind the $UNITTEST variable, are you happy with that setup?

Ah, thanks for confirming! I overlooked it. Yeah, I'd prefer making it look like a standard feature if possible.

I've added my thoughts at https://github.com/actions-runner-controller/actions-runner-controller/pull/707/files#r695291063. WDYT?

@sledigabel
Copy link
Copy Markdown
Contributor Author

Thanks @mumoshu it all looks good to me, I'll amend the PR tonight

@mumoshu
Copy link
Copy Markdown
Collaborator

mumoshu commented Sep 5, 2021

@sledigabel Hey! Thanks again for your contribution. Mainly due to my delay, this took too much time to settle. I'm merging this as-is now. Then we can reduce potential conflicts with other changes, and we can still add any necessary changes on top later.

@mumoshu mumoshu merged commit a98729b into actions:master Sep 5, 2021
@sledigabel
Copy link
Copy Markdown
Contributor Author

@mumoshu sorry it took so long to get back with a more decent management around the $UNITTEST, i'll prepare another PR to get rid of it

sledigabel added a commit to sledigabel/actions-runner-controller that referenced this pull request Sep 13, 2021
This was something that was missing in actions#707.
Adding a new test to make sure the ephemeral feature flag from upstream
is set up correctly by the script.
mumoshu pushed a commit that referenced this pull request Sep 14, 2021
This was something that was missing in #707.
Adding a new test to make sure the ephemeral feature flag from upstream
is set up correctly by the script.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants