I wasn't originally going to write this up on the blog, but since we've just finished our transition from our old backup software (BRU, no link) to Bacula community edition, and since it's World Backup Day, it seemed like it would make sense.
As many of you are likely aware, ClueTrust hosts equipment at a top-tier datacenter for providing services to our datacenter customers and our software customers alike.
From Retrospect to BRU
Since a lot of the data that exists at the datacenter is not easily replaceable, we've had on-site backups since early in our operation of the racks of servers. At the time that we started our backup journey, that meant an LTO library hooked up to an Apple Xserve (RIP) via Fibre Channel. Because of those particulars (Macintosh-based server, clients of various types, tape library), we had an extremely limited choice of backup software, and BRU was basically it. At the time (December, 2006) we had been through a couple of tumultuous years (literally, starting in December of 2003) evaluating various versions of BRU while they got their Mac OS X Server ducks in a row, and while Retrospect (our previous backup software provider) didn't even seem interested in the Macintosh market.
The journey with BRU was always a bit strained (I won't go into it here, but I believe we're both happier to be out of that dysfunctional relationship). My expectations for their responsiveness and customer orientation were rarely met, and although there was a lot of work on the Mac OS X platform in 2003-2010, the release cadence for BRU Server from that point on seemed to grind to a near-halt. With that said, we never lost a file with BRU, the backups were always readable, and the format was simple enough to give us confidence that even if an archive became corrupted, we could retrieve most of the data from it.
Taking it to the cloud
By 2014, our preferred method of sending backups off-site no longer required me to take my car to the datacenter and pull tapes out of the rack. Instead, we were moving to an offsite storage mechanism that used "cloud storage". In our case, that meant AWS Glacier.
There was no direct support for Glacier (or any other off-site backup mechanism) built into BRU, but it did have a disk-to-disk-to-tape model that could be run without the "to-tape" part, which led to my creating a bespoke Python solution for uploading our archives to Glacier. I would not recommend this to most people: the process is a bit arduous, and maintaining your own critical backup software is risky if you don't have the discipline to test it regularly (especially when you don't control the server).
The solution we put together took advantage of the mostly self-contained nature of the BRU archives to ship the data (encrypted after the fact, but otherwise unchanged) to Glacier.
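The original script isn't something I'll publish here, but to give a flavor of the sort of plumbing such a tool needs: Glacier requires every upload to carry a SHA-256 "tree hash" computed over 1 MiB chunks of the archive, which you have to implement yourself (or pull from a library). A minimal sketch of that documented algorithm:

```python
import hashlib

CHUNK = 1024 * 1024  # Glacier tree hashes are computed over 1 MiB chunks


def tree_hash(data: bytes) -> str:
    """Compute the SHA-256 tree hash Glacier requires for archive uploads."""
    # Hash each 1 MiB chunk individually (an empty archive hashes as one empty chunk).
    hashes = [hashlib.sha256(data[i:i + CHUNK]).digest()
              for i in range(0, len(data), CHUNK)] or [hashlib.sha256(b"").digest()]
    # Combine digests pairwise until a single root digest remains.
    while len(hashes) > 1:
        paired = []
        for i in range(0, len(hashes), 2):
            if i + 1 < len(hashes):
                paired.append(hashlib.sha256(hashes[i] + hashes[i + 1]).digest())
            else:
                paired.append(hashes[i])  # odd digest is carried up unchanged
        hashes = paired
    return hashes[0].hex()
```

For data under 1 MiB this degenerates to a plain SHA-256, which makes it easy to sanity-check against `hashlib` directly.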
By 2015, as I mentioned in SmartOS, Postfix and IPv6, we were in the process of shutting down our Xserves and replacing them with SmartOS. Although BRU Client worked fine on the Solaris variant, we were never able to get the licensing module to work with SmartOS, despite attempts to work with the BRU engineers. As such, we ended up running our backup server in a sub-optimal configuration: a KVM-based Ubuntu environment with a raw disk partition for scratch. Obviously, this would have been much better if we'd been able to run on SmartOS with a LOFS partition taking direct advantage of ZFS, but that was something we were never able to achieve.
Since certain catalog data wasn't readily extracted from the per-machine archives, I re-engineered our custom solution in 2015 to make sure that we were storing all of the salient metadata (type of backup, date, machine) in a way that would be more easily addressed. This allowed us to find and remove old incrementals and so forth in Glacier.
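The exact schema we used isn't important, but the idea is simple: since Glacier only hands back an opaque archive ID and whatever description string you stored with it, all of the metadata you'll need later (backup type, date, machine) has to be packed into that description up front. A sketch of that idea, with illustrative field names rather than our actual schema:

```python
import json
from datetime import date, datetime


def archive_description(machine: str, level: str, when: date) -> str:
    """Encode the metadata needed to manage an archive later (machine,
    backup level, date) into a single description string of the kind
    Glacier returns in vault inventory listings."""
    return json.dumps(
        {"machine": machine,
         "level": level,            # e.g. "full" or "incremental"
         "date": when.isoformat()},
        sort_keys=True)


def parse_description(desc: str) -> dict:
    """Decode a description string back into usable metadata."""
    meta = json.loads(desc)
    meta["date"] = datetime.strptime(meta["date"], "%Y-%m-%d").date()
    return meta
```

With the metadata recoverable from an inventory job alone, finding and deleting stale incrementals becomes a matter of filtering on `level` and `date` rather than fetching archives.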
So, at this point we had a custom off-site storage solution, hand-baked encryption, and a backup server running in a KVM machine instead of directly on the OS. Not an optimal solution, especially when we had little hope that our chosen OS would be moving forward in BRU-land. Things were working, but keeping them working took a lot of effort.
Heading into the future
As 2020 dawned, Rob and I were working on a number of datacenter initiatives, including moving to a new SmartOS hardware platform and establishing beachheads in some other locations. As part of this, I was looking at what options we had for self-hosting our off-site backups. Glacier wasn't hideously expensive (and its price per byte decreases occasionally), but if we were going to have off-site hardware, why not put our off-site backups there?
The prospect of multiple sites also started me thinking about our choice of a commercial software solution. BRU wasn't unreasonably priced, but a second server would mean a separate instance, and that would mean a separate license. We could run the backups over the internet from our other datacenter(s), but that would be a weird configuration and likely not a performant one.
At this point, it hit me that it was time to evaluate a solution that met our 2020 needs, not our 2003 needs. As such, the requirements were:
- Open Source solution (if possible)
- Support for a wide variety of operating systems, including SmartOS, Linux, and macOS
- Well-documented storage format
- Classic Full, Incremental (optionally Differential) backups
- Off-site cloud storage with compatibility with open-source storage solutions
- Built-in public-private key encryption (preferably e2e from the device being backed up)
- Built-in transport encryption and positive identification
- Zero client trust required
- Easy scripted installations of client and server
- Flexible and scriptable configuration
I looked around at a number of solutions, including the eventual winner, Bacula, stalwarts such as Amanda, and a ton of other, younger solutions. Many of the newer ones were either cloud-first or cloud-required, often trusted the client too much (such as handing cloud credentials to the client), and almost none of them had old-school multi-level backups, instead going for the much more modern, Time-Machiney approach of a perpetually fresh backup.
I'm a big fan of Time Machine on macOS, but it's not the only backup I choose to use, and if I'm going to have a single backup mechanism, it's not going to be one where the loss of some long-term, incrementally-updated database results in sadness. As it stands, I've watched multiple times in the last decade as my Time Machine backups became corrupt or needed to be moved off of older hardware. It's an extremely convenient capability, but it's also brittle.
The choice: Bacula
So, after all that looking around, I turned my sights on Bacula as the leading contender.
- It has an open source version (yes, there have been some issues in the past with the update cadence of the open source version, which led to a fork named bareos)
- OpenSolaris is a supported platform, as are all of our other required operating systems
- There is storage format documentation
- Backups are of the traditional Full, Incremental, Differential variety (although it also supports creating new synthetic Full backups)
- Recent versions directly support S3-compatible off-site storage (including Minio, and with Minio's help, Backblaze)
- Encryption is end-to-end (except for attributes¹) and uses public-private key encryption with optional multiple keys
- TLS transport encryption and unique passwords for identifying each component to each other component for positive identification
- Clients are not in control and not allowed to contact the director directly
- Installation from source, or a binary package (available for some platforms directly from their website) is simple and easily scripted
- Configuration parameters are all stored in text files which can be scripted easily
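To illustrate the last few points, here is a sketch of what the client-side (file daemon) configuration looks like; the names, paths, and password are placeholders, but the directives are Bacula's:

```
# bacula-fd.conf (client side) -- illustrative names and paths
FileDaemon {
  Name = myhost-fd
  # End-to-end encryption: data is encrypted on the client before transport
  PKI Signatures = Yes
  PKI Encryption = Yes
  PKI Keypair    = "/opt/bacula/etc/myhost-fd.pem"   # this client's key + cert
  PKI Master Key = "/opt/bacula/etc/master.cert"     # optional escrow/recovery key
}

Director {
  Name = main-dir
  Password = "per-component-shared-secret"           # positive identification
  # Transport encryption between components
  TLS Enable  = yes
  TLS Require = yes
  TLS CA Certificate File = "/opt/bacula/etc/ca.pem"
  TLS Certificate = "/opt/bacula/etc/myhost-fd.cert"
  TLS Key         = "/opt/bacula/etc/myhost-fd.key"
}
```

Because this is all plain text, generating per-host variants of it from a script or template is trivial, which matters when you're installing on 18 machines.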
All told, it hit all of our specifications and came in at a great price ($0), with available commercial support if necessary and fully open source code.
Testing went well and I was able to script the building and packaging process as well as the installation process on both the client and server end without difficulty.
In fact, one of the side-effects of a free solution is that we're now able to run a complete test setup which mirrors our production setup and allows for easy validation of configuration changes and upgrades.
There has been some difficulty with the built-in cloud support, but at least some of that was owing to my problems getting the Minio-Backblaze gateway going. Now that that's functional, things seem to be working better. In addition, the mechanism for uploading data to Backblaze (or S3, or Minio) is straightforward enough that uploading manually using rclone and then downloading through Bacula's restore process was completely successful.
By the way, performance has been excellent. It's not extremely fast when dealing with large numbers of very small files (presumably due to file-attribute overhead), but it is highly performant on large files, and even the small-file performance is acceptable. Because of the text-based configuration, I've been able to do quite a bit of experimentation, and our nightly incremental backup across 18 different machines finishes in 8 minutes. Obviously, a full takes substantially longer, but through the use of separate "tape changers" we're able to keep the administratively-separate data apart while still running concurrent backups.
¹ Bacula encrypts the data, but not attributes such as filenames, dates, modes, owners, etc. Although the contents of your backups are protected, the metadata can often be just as important as the data in the file itself. As such, this calls for some kind of further encryption if you are sending the data offsite for third-party storage.