Kubernetes should use a scheduled cleaner to ensure all related resources are cleaned up

This is Part 2 of this proposal. Find part 1 at #4184 (closed)

Summary

If a Runner is abruptly shut down, it doesn't get the chance to clean up. After a restart, the Runner is unaware of previously created resources. This issue proposes moving this burden away from GitLab Runner into a separate garbage collector/cleaner.

Proposal

In an external project, create a cleaner that is run by a Kubernetes CronJob every X amount of time. The cleaner's job would be to clean up any stale resources marked with a specific annotation.

Specification:

A cleaner that runs on a cron schedule every X amount of time:
1. Search for pods annotated with cleaner.kubernetes.gitlab.org/ttl: 'X', where X is the maximum age of the pod in seconds, for example 3600 (1hr).
2. Get the pod, whatever its state (running, failed, initializing, and so on), and compare its creation time against the ttl. If the pod is older than the ttl, delete it.
Have GitLab Runner specify pod_annotations inside of the config.toml:
1. If the timeout of a job is 3 hours, users can update the config.toml to specify cleaner.kubernetes.gitlab.org/ttl: '11700' (3hrs15min). The extra 15min are there in case the pod deletion requested by GitLab Runner takes a long time.
2. Users are also able to override annotations
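
The arithmetic behind the ttl value can be sketched as follows. This is only an illustration: the helper names are hypothetical, the 15-minute leeway is the figure used in this proposal, and the `[runners.kubernetes.pod_annotations]` table name is an assumption that should be checked against the GitLab Runner documentation for the version in use.

```python
# Annotation name taken from this proposal.
ANNOTATION_KEY = "cleaner.kubernetes.gitlab.org/ttl"


def ttl_annotation_value(job_timeout_seconds: int, leeway_seconds: int = 15 * 60) -> str:
    """Return the ttl in seconds as a string (annotation values are strings).

    Hypothetical helper: ttl = job timeout + leeway for slow pod deletion.
    """
    if job_timeout_seconds < 0 or leeway_seconds < 0:
        raise ValueError("timeout and leeway must be non-negative")
    return str(job_timeout_seconds + leeway_seconds)


def render_pod_annotations(ttl: str) -> str:
    """Render a config.toml fragment (table name is an assumption to verify)."""
    return (
        "[runners.kubernetes.pod_annotations]\n"
        f'  "{ANNOTATION_KEY}" = "{ttl}"\n'
    )


if __name__ == "__main__":
    # A 3 hour job timeout plus 15 minutes of leeway gives 11700 seconds.
    print(render_pod_annotations(ttl_annotation_value(3 * 60 * 60)))
```

Keeping the leeway as a separate parameter makes it easy to tune for clusters where pod deletion is slow.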

Example:

The cleaner runs and finds a pod that was created 3hrs20min ago with the annotation cleaner.kubernetes.gitlab.org/ttl: '11700'. It deletes the pod.
The cleaner runs and finds a pod that was created 1hr ago with the annotation cleaner.kubernetes.gitlab.org/ttl: '11700'. It ignores the pod.
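
The two cases above can be checked with a minimal sketch of the expiry rule. It works on plain dictionaries shaped like the pod objects returned by the Kubernetes API (`metadata.annotations`, `metadata.creationTimestamp`) and does not talk to a cluster. The function names are hypothetical, and skipping pods with a malformed ttl instead of deleting them is an assumption, not something the proposal specifies.

```python
from datetime import datetime, timedelta, timezone

# Annotation name taken from this proposal.
TTL_ANNOTATION = "cleaner.kubernetes.gitlab.org/ttl"


def parse_creation_timestamp(value: str) -> datetime:
    # Kubernetes serializes creationTimestamp in UTC, e.g. "2021-08-31T12:00:00Z".
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def is_expired(pod: dict, now: datetime) -> bool:
    """True when the pod carries the ttl annotation and is older than the ttl."""
    metadata = pod.get("metadata", {})
    ttl = (metadata.get("annotations") or {}).get(TTL_ANNOTATION)
    if ttl is None:
        return False  # pod is not managed by the cleaner
    try:
        ttl_seconds = int(ttl)
    except ValueError:
        return False  # assumption: ignore malformed values rather than delete
    created = parse_creation_timestamp(metadata["creationTimestamp"])
    # The pod state is deliberately not inspected: only its age matters.
    return now - created > timedelta(seconds=ttl_seconds)


def select_expired(pods: list, now: datetime) -> list:
    """Return the names of the pods the cleaner would delete."""
    return [p["metadata"]["name"] for p in pods if is_expired(p, now)]


if __name__ == "__main__":
    now = datetime(2021, 8, 31, 12, 0, 0, tzinfo=timezone.utc)

    def pod(name: str, age: timedelta) -> dict:
        created = (now - age).strftime("%Y-%m-%dT%H:%M:%SZ")
        return {"metadata": {"name": name, "creationTimestamp": created,
                             "annotations": {TTL_ANNOTATION: "11700"}}}

    pods = [pod("old", timedelta(hours=3, minutes=20)), pod("young", timedelta(hours=1))]
    print(select_expired(pods, now))  # ['old']
```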

Specification document:

https://gitlab.com/azzsteve/podgc/-/blob/3a4fea99acf31386a2a3156a3c4bf5d761057c35/SPEC.md

Distribution:

Helm: Most of the Kubernetes executor users use the Helm Chart. We could easily add a CronJob resource to the templates, which would be created and managed by Helm. With a few configuration options, such as the cron job image and the cron job expression, we should be good to go.
Runner Operator: The Operator, just like the Helm chart, can deploy and manage the CronJob.
Others: For other deployments we can provide a simple CronJob yaml in the docs, which users can use to manage it themselves.
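
As a rough idea of what such a CronJob could look like, the sketch below builds a manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. Everything specific here is a placeholder: the `pod-cleaner` name, the image, the schedule and the service account are hypothetical, and `batch/v1` assumes a cluster where CronJob is available under that API version (older clusters use `batch/v1beta1`).

```python
import json


def cleaner_cronjob(image: str, schedule: str, name: str = "pod-cleaner") -> dict:
    """Build a CronJob manifest for the cleaner (all values are placeholders)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,  # standard cron expression
            "concurrencyPolicy": "Forbid",  # do not start a run while one is active
            "jobTemplate": {
                "spec": {
                    "template": {
                        "spec": {
                            # Hypothetical service account; it would need
                            # permission to list and delete pods.
                            "serviceAccountName": name,
                            "restartPolicy": "Never",
                            "containers": [{"name": name, "image": image}],
                        }
                    }
                }
            },
        },
    }


if __name__ == "__main__":
    manifest = cleaner_cronjob("registry.example.com/pod-cleaner:latest", "*/15 * * * *")
    print(json.dumps(manifest, indent=2))
```

The image and the schedule are the two values a Helm chart or Operator would most likely expose as configuration options.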

Future iterations

The above specification is a great first iteration. Future improvements might or might not include:

Automatically setting the ttl annotation to be the same (with a few minutes/seconds of leeway) as the job's duration.
Terminate pods through liveness probes and let them wait for cleanup in a failed state. This way they will consume fewer resources.

Edited Aug 31, 2021 by Darren Eastman