What is a Merge Queue?
When your team wants to ensure all tests pass before changes are merged into main/master/trunk, it helps to have an automated process that triggers those tests and blocks the merge until they pass. Many Pull Request systems can do this for you with a required check that the tests must pass before merging. There isn’t much reason to consider a merge queue unless the tests take a long time to run. Waiting 5 minutes between updating a pull request and merging it is usually fine, but waiting an hour can become a problem, especially when there are a lot of other people working on the same code base.
A Merge Queue can help with this by letting you submit your change as ready to merge; it is merged automatically once the tests pass. It also forms a FIFO queue, so if you’re second in the queue your tests don’t start until the prior changes have been merged. There’s no point running tests for anyone later in the queue, since an earlier merge might introduce conflicts or other changes that turn a passing situation into a failing one.
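To make the FIFO behaviour concrete, here is a minimal sketch of the idea in Python. The `run_tests` and `merge` callables are hypothetical stand-ins for real CI and version-control integrations (in our case Jenkins and BitBucket); the point is only the queue discipline: one entry is tested at a time, in submission order.

```python
from collections import deque

class MergeQueue:
    """Illustrative FIFO merge queue; not the production implementation."""

    def __init__(self, run_tests, merge):
        self.queue = deque()
        self.run_tests = run_tests  # stand-in: trigger CI, return True on pass
        self.merge = merge          # stand-in: merge the branch into main

    def submit(self, branch):
        """Mark a change as ready to merge by appending it to the queue."""
        self.queue.append(branch)

    def process(self):
        """Test and merge each branch strictly in submission order."""
        merged, rejected = [], []
        while self.queue:
            branch = self.queue.popleft()  # only the head is ever being tested
            if self.run_tests(branch):
                self.merge(branch)
                merged.append(branch)
            else:
                rejected.append(branch)    # failed entries drop out; the next proceeds
        return merged, rejected
```

A failed entry simply drops out of the queue and the next change gets its turn, which is what keeps one broken change from blocking everyone behind it.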
Lesson one: You probably don’t need to make your own Merge Queue.
I was not easily convinced that our team needed one. However, the slow tests weren’t the only issue we had. People would get lost trying to resolve conflicts and block others from merging for hours or even days. Before committing to creating a Merge Queue I wanted to consider all the other options first. Can we make our tests run faster? Do we really need all the tests run and passing before merging? And, of course, is there existing software we can buy or adopt for the Merge Queue?
In the end we couldn’t find one that would work with our BitBucket server and Jenkins for tests. Even something as simple as adding Slack to our mix of technologies might have changed things. But I do plan to replace our Merge Queue with the service being built now at GitHub. What we created was always intended to be replaced; in part or as a whole.
Lesson two: Adoption will be challenging despite early interest.
We did a well-planned rollout and tested every scenario we could imagine. I was really impressed by just how much we were able to test automatically, but we also ran test scenarios manually to double-check things. And when a manual test found a failure, we’d add a new automated test. After all that we were ready to introduce the tool to the team. At first the people working on it used it themselves while carefully negotiating the system established for manual merges: someone would take the ‘merging pig’ (pictured below) to their desk until the merge had completed.
When we asked the wider team to start using the tool, only about a third did; the rest kept doing the old manual method. It took a lot of one-on-one follow-ups to find out why. Mostly they didn’t want to learn a new system and hadn’t taken the time to see how simple we had made it. We also had to reassure people about how it worked, despite explaining things in full many times during its development. How would it handle conflicts? Would it mess up my branch? Does it support submodules? All things we had already resolved.
Eventually we had convinced everyone to use it except one person, who was still pushing changes in manually for “quick fixes”. I had to force even him to use it and restrict his access to master, because his quick fixes would break merges in progress. Before doing a merge we would check that neither branch had changed between when the tests started and when they finished, so his quick fix meant a painful delay for someone else.
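That safety check can be sketched in a few lines. This is an illustration, not our actual code: record the tip commits of both branches when the tests start, and refuse to merge if either has moved by the time they finish. The SHA strings here are made up; in practice the values would come from something like `git rev-parse <branch>`.

```python
def safe_to_merge(tips_at_start, tips_at_finish):
    """Return True only if neither branch tip moved while the tests ran.

    Both arguments map branch name -> tip commit SHA, captured when the
    test run started and when it finished. A direct push to either the
    source branch or the target (master) invalidates the run.
    """
    return all(
        tips_at_start[branch] == tips_at_finish[branch]
        for branch in ("source", "target")
    )
```

If the check fails, the queue entry goes back for a fresh test run rather than merging a result that no longer reflects reality.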
Lesson Three: Speed still matters.
During development we needed to test our Merge Queue without waiting an hour for a result, so we would often set up special environments with quick tests. Once deployed, merges would still typically take an hour each, and on many occasions the queue was long enough that the final change would go through late at night. But even when things were quiet the hour wait was still frustrating. And many people wanted to run the tests before presenting their changes for review, so there was still pressure to speed things up. It’s an ongoing struggle: we try to make things faster, but new features keep adding testing load to the system. Ultimately the only change that can have a big impact on the time is to re-architect the system into smaller parts or change what is expected of our testing.
Lesson Four: The education on how it works continues for months.
Even after people had been using the system for months, we’d still get questions from senior engineers about why their merge failed. All the information was there, but with the Merge Queue running in Jenkins and triggering their tests in Jenkins, it can take a while to dig down to the specific failure. A mix of impatience and false assumptions would leave people stumped as to why things failed for them. It has eased off now, but there was a long period of pointing people in the right direction.
Lesson Five: Welcome the end.
The Merge Queue has been a massive benefit. We got it up and running only shortly before COVID forced everyone to work from home; passing around a stuffed pig clearly would not have worked for us. We also no longer hear of people stuck for days trying to resolve conflicts. And it helped us stay productive as the team nearly doubled in size! But the end is on the horizon.
We will move from BitBucket to GitHub, and GitHub will soon make their Merge Queue available for general use. I imagine the switch will bring its own challenges, but it’ll mean we can stop maintaining and supporting a tool that is not our core business. I’m sure there will also be some resistance to learning a new system again. Maybe we’ll release the tool as open source and let it find a new life with those that still have that niche need.
What internal tools are critical to your team?