Git subtrees for Perforce users


For many years, I was a happy Perforce user. Despite clearly not fitting their precise model, I had a three-user license which allowed me and my bots to work on my code base. I have a number of pretty complex projects, which often have overlapping code, and I took advantage of Perforce's evolving code-sharing mechanisms: initially by using a single repository with workspaces that included code from different locations, and later by moving to the more powerful (and, in my mind, more easily understood) streams paradigm.

Moving slowly to Git

I generally follow an active trunk (master) strategy where new development is done on the trunk and branches are used to pull off releases. Generally speaking, I tend to develop like I've got a team even when it's just me.

As Perforce evolved, they tried to expand their offerings to include more git-like capabilities, in particular having local copies of your repositories. Obviously, this carried benefits and disadvantages very similar to those of git, but since lightweight branching and offline use were becoming more important, the Perforce model needed to adapt.

It was through this lens that I started to move my repositories over to git. Frankly, I'd come up with adaptations which allowed me to make use of a central server system without compromising mobility (in particular, for years my primary server ran on my laptop, with a mirror running on my CI server), but over time the effort to maintain the synchronized servers became more significant. Perforce has a nice git front end for their server (git-fusion) which allows you to read and write Perforce repos using a git interface, giving you most of the advantages of both systems, but it falls down right where I tended to need it, on handling incorporated streams. The git submodule overlay was tedious and finicky at best and eventually convinced me to move fully to git.

At some other point in time, I'll discuss gitolite, which is what I'm using as a git server right now, but that won't be important for the rest of this story.

Adoption of git in my complex code base

Initially, I thought I was going to be able to use git-fusion and submodules to pull my Perforce streams out of the repositories and keep things in sync, but that proved problematic. Maybe it was the number of files or the interesting merge tactics I'd experimented with, but in the end, I found that my large, complex repositories were best exported without Perforce's help with the submodules. The result, though, was that I had huge, monolithic git repositories, each of which contained the complete histories of all of my submodules.

You'll note that I'm already using the word 'submodule' here, and that's not an accident. I tend to develop my shared code modules with their own projects, their own tests, and separate versioning and branching. With Perforce, this minimized the amount of duplication in the code base, and kept the system clean. Further, it led to my creation of a CI environment that would test all of my modules individually as well as inside the larger projects. All told, I'm fond of it.

So, as I moved to a git lifestyle, I followed instructions by others on how to split my repositories using git filter-branch and successfully teased out the submodules that I shared between my major projects. It took a bit of trial and error, but it also proved a great experiment into the general forgiveness of git. All told, not having to commit to a central server makes it a lot easier to verify your full operation before you make a big mistake.
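The exact commands I ran aren't recorded here, so what follows is only a minimal sketch of the kind of split described above, assuming the shared module lives in a single subdirectory. The names App and SharedLib are hypothetical stand-ins, and the sketch builds a tiny throwaway repository so it can be run safely. It uses the --subdirectory-filter option of git filter-branch on a clone, which is what makes the experiment forgiving: the original repository is never touched. (Current Git documentation recommends the separate git filter-repo tool over filter-branch for new work.)

```shell
set -e

# Throwaway identity and workspace so the sketch runs anywhere.
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip filter-branch's startup warning
work=$(mktemp -d)

# A stand-in monolith: an app plus a shared module (hypothetical names).
git init -q "$work/monolith"
cd "$work/monolith"
mkdir -p App SharedLib
echo "app v1" > App/main.txt
echo "lib v1" > SharedLib/lib.txt
git add App SharedLib
git commit -q -m "Initial import"
echo "lib v2" > SharedLib/lib.txt
git commit -q -am "Update shared library"
echo "app v2" > App/main.txt
git commit -q -am "Update app only"

# Work on a clone so the original history survives a botched split.
git clone -q "$work/monolith" "$work/sharedlib-split"
cd "$work/sharedlib-split"

# Rewrite the current branch so SharedLib/ becomes the repository root.
# Commits that never touched SharedLib/ are dropped from the result.
git filter-branch --subdirectory-filter SharedLib HEAD >/dev/null

split_commits=$(git rev-list --count HEAD)
split_files=$(git ls-files)
echo "commits kept: $split_commits"
echo "files at root: $split_files"
```

The rewritten clone can then be pushed to a new, empty repository and brought back into the larger projects as a submodule.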

Git submodules and subtrees

When I started working my code into git, I'd read a number of pieces on the various advantages and disadvantages of git submodules and git subtrees. The basic feel of most of these articles was "Why you should never use [insert the other technique here]". Generally speaking, I didn't see a lot of benefit to the use of subtrees. The vast majority of the complaints seemed to be that submodules were a pain (they can be complex), and they're difficult to deal with if you need to make a lot of changes to other people's code (OPC). In my case, my submodules were almost exclusively internal. Thus, the OPC issue wasn't a big deal.

Until I got to a library that I like to festoon with my own framework. In particular, I'm talking about the venerable GDAL, a widely used library for raster and vector I/O used by much of the GIS industry. My macOS framework is a pretty large and complex beast, with multiple subcomponents (GDAL has a variety of optional libraries) and some private modifications. When using it in Perforce, I'd used a multi-stream system: an //import/gdal repo that carried an exact copy of the GDAL source and a //GDALFramework stream that was used to build the framework. The latter was imported into my application workspace using the stream functions. To update, I'd grab the latest GDAL (which at the beginning was using svn) and check it in to //import/gdal, merge that into //GDALFramework, fix any requirements or compilation/test errors, check that in, and finally update the stream in my main application. It wasn't horrific, but getting to the point where I could run my application tests against it was a long track, so I didn't update as frequently as I'd have liked.

After I moved to git, my framework remained an intrinsic part of my source, along with its included gdal library. That worked fine, as long as I didn't need to update anything, but the cross-stream import branch information was long out of date, and I couldn't reasonably import a new version of GDAL without a lot of care. Enter git subtrees.

I had considered using GDAL as a git submodule and just forking it for the few changes I would need to make. That would work reasonably well in theory, but the build environment that I use isn't the same as the standard environment, and that meant that I couldn't reasonably test the code before committing into master, which I deem a no-no (OK, I could, but it would mean coordinating separate submodules with a set of special branches, which is doable, but a pain in the neck). By using git subtrees instead, the imported code remains tightly coupled with the surrounding code, but isolated in an appropriate subdirectory. I can still submit changes back to the GDAL repo when appropriate, but I also get to fully test the code before I put it onto my master branch.

Subtrees are easy to work with, especially in comparison to submodules. As long as you're willing to commit to forward momentum in lock-step with your updated subtrees, the mechanism works great. Add a subtree in a subdirectory:

git subtree add --prefix GDALFramework/gdal gdal v2.4.2 --squash

and you've got the code you want, right where you asked for it. Explaining that line a bit: it takes the code from the gdal remote (which I added using git remote add) at the v2.4.2 tag, places it in the subdirectory GDALFramework/gdal, and uses the --squash option to collapse the upstream history into a single commit rather than bringing every upstream commit into the local repo. Once that's done, you can make changes to your heart's content, and merge in changes from the original repo by doing:

git subtree pull --prefix GDALFramework/gdal gdal <branch id> --squash

where <branch id> is whatever branch or commit id you want to update to. Do your tests, make your changes, verify that everything is working and you're good. Commit your changes when you're ready.
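To make the add-then-pull cycle concrete without touching a real project, here is a small sandbox sketch. Everything in it is a stand-in: the local repository named upstream plays the role of GDAL, upstream-lib plays the role of the gdal remote, and Framework/lib plays the role of GDALFramework/gdal. It assumes git subtree is installed (it ships with most Git distributions, but as a contrib command).

```shell
set -e

# Throwaway identity and workspace so the sketch runs anywhere.
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"
export GIT_MERGE_AUTOEDIT=no   # never open an editor for merge commits
work=$(mktemp -d)

# Stand-in for the upstream library (the role GDAL plays in the text).
git init -q "$work/upstream"
cd "$work/upstream"
echo "core v1" > core.txt
git add core.txt
git commit -q -m "Upstream release 1"
git tag v1.0
upstream_branch=$(git symbolic-ref --short HEAD)

# Stand-in for the application repository.
git init -q "$work/app"
cd "$work/app"
echo "app" > app.txt
git add app.txt
git commit -q -m "App skeleton"

# Name the upstream once, then import the tagged release under a prefix.
git remote add upstream-lib "$work/upstream"
git subtree add --prefix Framework/lib upstream-lib v1.0 --squash

# A private modification inside the imported directory.
echo "local tweak" > Framework/lib/local-patch.txt
git add Framework/lib/local-patch.txt
git commit -q -m "Add local patch to imported library"

# Upstream moves on after the import.
cd "$work/upstream"
echo "core v2" > core.txt
git commit -q -am "Upstream release 2"

# Merge the newer upstream state into the same prefix.
cd "$work/app"
git subtree pull --prefix Framework/lib upstream-lib "$upstream_branch" --squash

core_now=$(cat Framework/lib/core.txt)
echo "core.txt after pull: $core_now"
ls Framework/lib
```

In this sandbox the pull brings in the upstream change while the private file stays in place. In a real project the two sides can conflict, in which case the pull stops like any other merge and you resolve it before committing.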

When you've got something that you want to contribute back to the community, you will need to fork the original repository, add your fork as a remote (mygdalfork below), and push the subtree to it so that you can prepare your changes for assimilation:

git subtree push --prefix=GDALFramework/gdal mygdalfork master
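As a rough illustration of what that push produces, the sandbox sketch below uses a bare local clone as a stand-in for a hosted fork. The names upstream-lib, myfork, proposed-fix and Framework/lib are all hypothetical. The point it tries to show is that only the contents of the prefix directory, re-rooted at the top level, arrive on the fork's branch, while the surrounding application files stay behind.

```shell
set -e

# Throwaway identity and workspace so the sketch runs anywhere.
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"
work=$(mktemp -d)

# Stand-in for the original upstream project.
git init -q "$work/upstream"
cd "$work/upstream"
echo "core v1" > core.txt
git add core.txt
git commit -q -m "Upstream release 1"
upstream_branch=$(git symbolic-ref --short HEAD)

# A bare clone plays the role of your fork on a hosting service.
git clone -q --bare "$work/upstream" "$work/fork.git"

# The application repository, with the library imported as a subtree.
git init -q "$work/app"
cd "$work/app"
echo "app" > app.txt
git add app.txt
git commit -q -m "App skeleton"
git remote add upstream-lib "$work/upstream"
git subtree add --prefix Framework/lib upstream-lib "$upstream_branch" --squash

# A change worth sending back, committed in the application repository.
echo "fix" > Framework/lib/fix.txt
git add Framework/lib/fix.txt
git commit -q -m "Fix something in the imported library"

# Split out the history of the prefix and push it to a branch on the fork.
git remote add myfork "$work/fork.git"
git subtree push --prefix=Framework/lib myfork proposed-fix

pushed_files=$(git --git-dir="$work/fork.git" ls-tree --name-only proposed-fix)
echo "$pushed_files"
```

Pushing to a dedicated branch on the fork, rather than straight to its main branch, leaves room to review the result before proposing it upstream.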

I'm happy with my decision to use a combination of submodules and subtrees. I'm sure that either method could be used for the purposes I'm using the other method for, but I find the distinction is useful. In particular, I can easily experiment with code branches between my apps using submodules, and I can work easily with code from others which needs some adaptation using subtrees.