Geo multi-node deployment upgrade: investigate order when upgrading non-deploy nodes

The zero-downtime upgrade instructions for multi-node Geo deployments currently do not specify an order for upgrading non-deploy (non-Gitaly) nodes. This issue investigates how that order affects downtime, measured as 500 errors, failed readiness checks, and failed end-to-end tests. Specifically, check whether upgrading Sidekiq nodes before Rails web nodes reduces or eliminates the errors seen when the opposite order is used.

This investigation is separate from looking at downtime during reconfigure and hot reload of Web nodes, which is covered here.

Background: During an upgrade of a multi-node Geo deployment from 12.10.12 to 13.0.10, with nodes upgraded one by one, we observed 500 errors after one of the two Rails nodes on the Primary site had been upgraded and reconfigured, but before the one online Sidekiq node was upgraded (the other Sidekiq node was the deploy node and was not handling requests). We did not observe failures in the readiness checks.

The 500 failures were related to creating a project via API:

 Failure/Error:
       @project = Resource::Project.fabricate_via_api! do |project|
         project.name = 'project'
       end
     
     QA::Resource::ApiFabricator::InternalServerError:
       Failed to GET http://gjsl9-primary.gogitlab.ml/api/v4/groups/looping-pipeline?private_token=[****] - (500)

For the Secondary site upgrade, after upgrading the Gitaly node and the deploy node, we upgraded the online Sidekiq node first and then the Rails nodes in tandem. We did not observe any 500 errors or readiness check failures.

These observations do not prove causation, but they prompted this issue to explore the question further.
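
One way to make the comparison between upgrade orders less anecdotal is to probe the site at a fixed interval during each upgrade step and tally the responses per step. The following is a minimal sketch under stated assumptions, not the tooling used in the run described above: the `probe` and `summarize` helpers and the phase labels are hypothetical, and the probed URL is left to the caller (for example the site's readiness check or an API route exercised by the end-to-end tests).

```python
import urllib.error
import urllib.request
from collections import Counter, defaultdict


def probe(url, timeout=5.0):
    """Return the HTTP status code for url, or 0 if no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError but still carry a status code.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar.
        return 0


def summarize(samples):
    """Tally (phase, status) samples into per-phase buckets.

    The phase is a free-form label chosen by the operator (hypothetical
    examples: "rails-1 upgraded", "sidekiq upgraded").
    """
    summary = defaultdict(Counter)
    for phase, status in samples:
        if status == 0:
            bucket = "unreachable"
        elif status >= 500:
            bucket = "server_error"
        elif status >= 400:
            bucket = "client_error"
        else:
            bucket = "ok"
        summary[phase][bucket] += 1
    return {phase: dict(counts) for phase, counts in summary.items()}
```

In use, an operator would append `(phase, probe(url))` to a list in a loop while each node is upgraded, then compare the `server_error` counts from `summarize` between a Rails-first run and a Sidekiq-first run. A single run per order would still be weak evidence, so repeating each order several times is advisable.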

If the order of upgrading non-deploy nodes does impact downtime, we should update our instructions accordingly.

Edited Jul 14, 2020 by Jennifer Louie