Bacula 6 months on


It's been about six months since I originally wrote Welcome Bacula, describing our transition to Bacula from our previous solution (and a bit of history even before that). If you haven't read it, it might be worth a look first.

Although it hasn't been quite 6 months since I wrote that first piece, it's now been over 6 months since we started using Bacula. The results have been extremely good:

  • Performance has been excellent
  • The backup mechanism has been highly reliable
  • Locally-cached cloud backups propagate to the cloud easily (and reliably)
  • Pre-transmission compression and encryption have improved performance and security
  • Text-based configuration files have improved automation of clients and servers

Performance

I'm going to start with performance, which has been an unexpected (and uncontested) win in comparison to our previous solution. Nightly incremental backups are finishing in 10-12 minutes after backing up a bit over 3GB in 2500 files across 17 machines (remember that a lot of the systems we have don't store persistent data). Weekly differentials take a few minutes longer and tend to contain about double the number of files and volume of data. Full backups take around 21 hours, backing up 350GB in 3 million files (including local storage and the push to the offsite storage).

With our previous solution, we carved out a 10 hour window for full backups for each of 4 backup sets covering 13 systems (the ease of automation in Bacula has resulted in our backing up a few more machines), and about an hour a day for each of the 4 backup sets. They didn't always take that long, but running the backup sets in parallel was not a good idea™. Full backups took about 41 hours (total, not running in parallel) to back up about 325GB. These backups were compressed, but not encrypted. In addition, the push to offsite storage was a separate operation and itself took a substantial amount of time (including requiring an encryption step). Generally, we could expect full backups to be completed, encrypted, and replicated to offsite within 48-72 hours of the start of the cycle.
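To put the two full-backup figures side by side, here is a quick back-of-the-envelope calculation using only the numbers quoted above. The function name is purely illustrative. Keep in mind that the Bacula figure includes encryption and the offsite push while the old 41-hour figure does not, so if anything this understates the difference.

```python
def throughput_gb_per_hour(gigabytes: float, hours: float) -> float:
    """Average throughput of a backup run in GB per hour."""
    return gigabytes / hours


# Figures quoted in the text above.
bacula = throughput_gb_per_hour(350, 21)    # includes the offsite push
previous = throughput_gb_per_hour(325, 41)  # excludes encryption and offsite push
speedup = bacula / previous

print(f"Bacula:   {bacula:.1f} GB/h")
print(f"Previous: {previous:.1f} GB/h")
print(f"Ratio:    {speedup:.1f}x")
```

That works out to roughly 16.7 GB/h against roughly 7.9 GB/h, or a bit over twice the throughput.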

Reliability

We've had excellent reliability out of Bacula. Error messages are delivered in a timely fashion (mostly via email) and status information during the job is readily accessible. The previous solution never had any reliability problems (that we were aware of), although getting real-time status information was always a bit of a chore. Its GUI (something Bacula does not have) was underwhelming and vague, while its CLI was too machine-friendly. In this case, Bacula aligns much better with our needs and desires. I'm very comfortable with CLIs and although GUIs are nice when they're done really well, for a facility like this I'd rather have a good CLI any day.

Cloud experience

I need to give the caveat here that we don't use what most people would call "the cloud" these days, as we have enough geographic diversity that we replicate to our own equipment in another data center. However, the concepts are the same, and since we're using an S3-compatible storage mechanism, I think the comparison to S3 or B2 is reasonable.

Bacula uses a fairly intelligent cloud cache which uploads backups in chunks as they are completed. I'm still not entirely certain whether this stops the backup process in order to upload or whether it uploads in parallel. Given that the backup isn't considered finished until the cloud send has been attempted, it doesn't make much difference to us. You'll note that I said "has been attempted". In the event that the cloud send fails, the backup continues and an error is logged. You can attempt to upload the parts later if they haven't made it to the cloud by the time the backup finishes.

It's worth noting that the cloud backups are just simple syncs of the directories from the cache, so you can actually use any mechanism you like to send them off site. However, using the built-in drivers also allows the system to pull the backups (piecemeal as required) during a restore, which is a nice feature if you're constrained on bandwidth, whether by network capacity or by pricing.
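Since the offsite copy is just a mirror of the cache directories, working out what still needs to go offsite is a simple set difference. The following is a minimal sketch, not Bacula's own logic: it assumes a cache layout of one directory per volume containing part.N files, and the function name, the volume name, and the idea of passing in a set of remote keys (however you obtain that listing) are all hypothetical.

```python
from pathlib import Path


def parts_to_upload(cache_dir: Path, remote_keys: set[str]) -> list[str]:
    """Return 'volume/part.N' keys that exist in the local cache but not offsite.

    Assumes the cache holds one directory per volume, each containing part.N
    files. Matching is by name only; a real sync would also compare sizes or
    checksums to catch partial uploads.
    """
    pending = []
    for part in sorted(cache_dir.glob("*/part.*")):
        key = f"{part.parent.name}/{part.name}"
        if part.is_file() and key not in remote_keys:
            pending.append(key)
    return pending


if __name__ == "__main__":
    import tempfile

    # Build a throwaway cache with one hypothetical volume and two parts.
    with tempfile.TemporaryDirectory() as tmp:
        volume = Path(tmp) / "Vol-0001"
        volume.mkdir()
        (volume / "part.0").write_bytes(b"label")
        (volume / "part.1").write_bytes(b"data")
        print(parts_to_upload(Path(tmp), {"Vol-0001/part.0"}))
```

The resulting list can then be handed to whatever transfer tool you prefer.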

I'll note here that the part.0 of the backup (inside of the "volume" directory) is the label and is required for the automatic pull to work. If that's deleted for some reason and you need to restore from a volume that's completely offline, you'll need to at least pull the part.0 file back to the storage server's cache to get the automatic pull to work.
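Because a missing part.0 only tends to get noticed at restore time, it can be worth checking for it ahead of time. Here is a small sketch of such a check, again assuming one directory per volume in the storage server's cache; the function name and the example cache path are made up for illustration.

```python
from pathlib import Path


def volumes_missing_label(cache_dir: Path) -> list[str]:
    """Return the names of volume directories in the cache that lack part.0.

    Assumes every directory directly under cache_dir is a volume.
    """
    return sorted(
        volume.name
        for volume in cache_dir.iterdir()
        if volume.is_dir() and not (volume / "part.0").is_file()
    )


if __name__ == "__main__":
    # Hypothetical cache location; adjust to wherever your cache lives.
    cache = Path("/var/lib/bacula/cloud-cache")
    if cache.is_dir():
        for name in volumes_missing_label(cache):
            print(f"{name}: part.0 missing, fetch it before restoring")
```

Any volume this reports would need its part.0 pulled back into the cache before the automatic pull can work for it.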

Automating configuration

As should be clear from the rest of this blog, Rob and I use Ansible for building basically everything we run. In many ways, the most significant advantage of the move to Bacula was being able to automate the configuration of both clients and servers without difficulty. As such, we have test and production environments and we're able to validate new versions and configuration ideas when we need to.
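To illustrate why text-based configuration lends itself to automation, here is a minimal sketch of rendering a Director-side Client resource from per-host variables. This is not our actual Ansible role: Python's string.Template stands in for Ansible's Jinja2 templates, and the host name, address, catalog name, and password are placeholders.

```python
from string import Template

# A Director-side Client resource. The directives are standard Bacula ones;
# every value filled in below is a placeholder.
CLIENT_TEMPLATE = Template("""\
Client {
  Name = ${name}-fd
  Address = ${address}
  FDPort = 9102
  Catalog = ${catalog}
  Password = "${password}"
}
""")


def render_client(name: str, address: str, catalog: str, password: str) -> str:
    """Render one Client resource as text, ready to write into a config file."""
    return CLIENT_TEMPLATE.substitute(
        name=name, address=address, catalog=catalog, password=password
    )


if __name__ == "__main__":
    print(render_client("web01", "web01.example.com", "MyCatalog", "changeme"))
```

Generating one such stanza per host from an inventory is what makes it cheap to keep test and production in step, and to pick up new machines as they appear.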

Conclusion

All told, as with any good migration of a long-standing system, the main take-away is: I wish I'd done this sooner. Bacula may be too fiddly for some people, but our environment is complex and highly automated. As such, we constrained the fiddling mostly to our initial configuration and have been able to craft a solution that is well suited to our environment and needs.