So much LDAP, so little time

The background

Many years ago, all of my systems were pets. I tried to make them easier to manage by standardizing on a single operating system (MacOS X Server at the time) and used management tools that were part of that suite.

As time moved forward, Apple decided to concentrate on the iPhone instead of the Xserve as the next big thing and reduced their efforts on the server front. First, the hardware platform I was using (Xserve) disappeared, and then MacOS X Server started taking big hits in functionality.

Meanwhile, Rob and I were moving our systems to be much more like cattle than pets. We had standardized on SmartOS systems for running lightweight zones and on Ansible for reducing the overhead of rebuilding systems. As far as we could, we moved all information on the servers, except for data that was rapidly changing or under user control, into a highly reproducible configuration management system. That let us try out new versions, run tests (some automated), and keep everything up to date by nuking and paving the servers and rebuilding them from scratch each quarter or so.

The primary directory management system for MacOS X is LDAP. The implementation, called Open Directory, was compatible with open-source and closed-source LDAP servers and scaled pretty well in large environments. In small environments, the GUI was good enough that it was not painful.

As we moved to SmartOS, though, we lost that built-in integration, and I had to create a set of Ansible roles to connect SmartOS to LDAP so that we could keep using the same servers (originally) and, later, replicas of them based on OpenLDAP. This all worked pretty well, except that the LDAP servers themselves were pretty hand-tuned.

State of affairs, pre-pandemic

It took me a few years of spare time to get all of my systems under management, and the last few systems to go were a set of LDAP servers that ran multiple domains' worth of user configuration for mail systems that I was running. These LDAP servers ran in an HA configuration and generally worked really well. They also had securely hashed passwords, which I couldn't un-hash and which used an algorithm incompatible with our selected OS (this becomes important later).

As a part of my final push to get things under control, I decided to finally bite the bullet and move the LDAP servers forward to the latest releases and get them on the automation train. It required serious care to make sure that everything worked correctly, but our habit of running separate instances for testing helped markedly in finding problems with the system before taking it live.

I'll note here that I did have some other systems that served mail to other groups of users that were built more recently. These used standard password hashes and were uncomplicated by the use of LDAP.

Quarantine Administration

During the end of March and the beginning of April, I finally got the testing systems running to my satisfaction. It seemed like the transition to production would be easy enough, since the systems using the servers hardly ever changed, users generally weren't resetting their passwords, and besides, there was no self-serve interface for that function anyway.

On the appointed day, I took one of the production servers offline (leaving it in a dormant state that could be resuscitated quickly) and brought up one of the new servers in the HA configuration, with the database of the previous production server. All seemed fine on the servers I was testing on, but then I noticed a small number of users were having trouble logging in.

I looked at the LDAP logs and there were no suspicious entries, but I noted in the database itself that the users having trouble had no password entries whatsoever. I rolled back the servers, but unfortunately the new data had already been propagated to the other server. Rebuilding from the most recent backup had the same problem. As far as I can tell, a small number of users had something in their password records that was failing the LDAP data dump that I had used to reload things. Unfortunately, this was the same dump I was using to rebuild the database, and that meant the old passwords were effectively gone.
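A sanity check along these lines could catch this kind of failure before a dump is loaded anywhere. This is a minimal sketch, not what was actually run: it assumes the dump is plain LDIF text, that user entries carry the posixAccount object class, and that passwords live in the userPassword attribute. The function names and the sample data are made up for illustration, and an older Open Directory schema may well use different classes or attributes.

```python
import base64


def parse_ldif(text):
    """Parse LDIF text into a list of {attribute: [values]} dicts.

    Minimal sketch: handles folded lines, comments, and base64 ("::") values,
    but not the "attr:< url" form or change records.
    """
    # Unfold continuation lines: a line starting with one space continues
    # the previous non-blank line.
    unfolded = []
    for line in text.splitlines():
        if line.startswith(" ") and unfolded and unfolded[-1]:
            unfolded[-1] += line[1:]
        else:
            unfolded.append(line)

    entries, current = [], {}
    for line in unfolded + [""]:  # trailing "" flushes the last entry
        if not line.strip():
            if current:
                entries.append(current)
                current = {}
            continue
        if line.startswith("#"):
            continue
        name, sep, value = line.partition(":")
        if not sep:
            continue
        if value.startswith(":"):
            # "attr:: <base64>" form
            value = base64.b64decode(value[1:].strip()).decode("utf-8", "replace")
        else:
            value = value.strip()
        current.setdefault(name.lower(), []).append(value)
    return entries


def users_missing_password(entries, user_class="posixAccount"):
    """Return DNs of user entries with no non-empty userPassword value.

    user_class is an assumption; adjust it to whatever marks a user
    in the schema being checked.
    """
    missing = []
    for entry in entries:
        if "dn" not in entry:
            continue  # e.g. the "version: 1" header block
        classes = {c.lower() for c in entry.get("objectclass", [])}
        if user_class.lower() not in classes:
            continue
        if not any(entry.get("userpassword", [])):
            missing.append(entry["dn"][0])
    return missing


# Hypothetical sample dump: alice has a (fake) hash, bob has none.
SAMPLE = """\
version: 1

dn: uid=alice,ou=people,dc=example,dc=com
objectClass: posixAccount
uid: alice
userPassword:: e1NTSEF9ZmFrZQ==

dn: uid=bob,ou=people,dc=example,dc=com
objectClass: posixAccount
uid: bob

dn: ou=people,dc=example,dc=com
objectClass: organizationalUnit
ou: people
"""

if __name__ == "__main__":
    for dn in users_missing_password(parse_ldif(SAMPLE)):
        print("no password:", dn)
```

Running a check like this against both the old database's dump and the freshly loaded one, and comparing the two lists, would show whether entries lost their passwords somewhere in the dump-and-reload step.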

The absence of complexity

Although it's pretty clear that the problem was one of bad backups, I had no idea why the backups were bad. It might have been something about the ancient MacOS X Server LDAP schema that I'd pulled forward, or some change in the underlying configuration, but at this point I didn't have time to figure it out. I needed to get the users back up and running.

Here is where I made a fine choice, with some prodding by Rob. Seeing that I was going to need to reset passwords anyway for a number of users, I reached out to my user base and requested new hashed passwords from them. But, since I was going to have to request these passwords, I broadened the scope and got new hashes from everyone. This meant that I could remove the complication of LDAP from my systems.

At the end of the day, I had everybody back up and running and a set of systems that are even easier to operate than they were before.

I don't want anyone to take away from this that LDAP is bad. It isn't; there are places where it's definitely the right solution. However, a small datacenter application with a slowly changing userbase and people with habits of good password hygiene is not one of those.

Let this serve as a reminder that we should always be open to making the right big changes when given the opportunity.