SSH probe bad behaviors and the sshd settings that make them worse

Over the past few weeks, RS and I have noticed an increasing number of unexplained failures when logging in over SSH, both manually and from automated tools. On most of the servers it was just an inconvenience, but as the failures became more frequent, they became a significant issue for some of our automation scripts.

However, the most significant effects were on our git server, which only allows access over SSH. Not surprisingly, this server gets frequent, short-duration connections from our CI server as well as from individual git clients checking their status against their remotes. The sudden spate of failures there is what caused us to look into the issue.

Looking at the machine in question, it was clear that the usual SSH background probing was going on, but that the bot probing us was getting confused and leaving its connections open. This was causing a build-up of ESTABLISHED connections, which was running up against the default limit of 10 on SmartOS. Individually, these were not coming in very quickly (probably to avoid triggering banning software), but since they were taking a very long time to transition from ESTABLISHED to CLOSED, they were taking up slots in the table of 10 and causing additional inbound connections (from our legitimate users) to be unceremoniously shut down.
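A build-up like this can be watched by tallying connection states on the SSH port from netstat -an output. The following is only a rough sketch, not the tooling we actually used: the function name count_states is made up for illustration, and it assumes the local address is the first address-and-port field on each line and the TCP state is the last field, which holds for the Solaris-style and Linux-style layouts but may need adjusting elsewhere.

```python
import re
from collections import Counter

# TCP state names as printed by common netstat implementations.
TCP_STATES = {
    "LISTEN", "SYN_SENT", "SYN_RCVD", "SYN_RECV", "ESTABLISHED",
    "FIN_WAIT_1", "FIN_WAIT_2", "CLOSE_WAIT", "CLOSING",
    "LAST_ACK", "TIME_WAIT", "CLOSED",
}

# Matches "addr.port" (Solaris style) or "addr:port" (Linux style).
ADDR_PORT = re.compile(r"^.+[.:](\d+)$")


def count_states(netstat_text, port=22):
    """Count TCP states for connections whose local port is `port`.

    Hypothetical helper: assumes the first address-like field on a line
    is the local address and the last field is the TCP state.
    """
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        if not fields or fields[-1] not in TCP_STATES:
            continue  # header, blank line, or non-TCP entry
        local = next((f for f in fields if ADDR_PORT.match(f)), None)
        if local is None:
            continue
        if int(ADDR_PORT.match(local).group(1)) == port:
            counts[fields[-1]] += 1
    return counts
```

Feeding it the captured output of netstat -an (for example via subprocess.run with capture_output=True and text=True) and comparing the ESTABLISHED count against the limit of 10 gives a quick indication of how close the server is to refusing new connections.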

Upon further investigation of the sshd_config file, I noted that LoginGraceTime in SmartOS is set to a default value of 600 seconds, so each connection had 10 minutes to wait around until a successful login occurred or until it was disconnected. That seems a bit long even if you're allowing password authentication, but in our key-only environment it is extreme, and so we cut it back to 8 seconds.
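To check a batch of machines for an overly generous grace period, something like the following sketch could be used. It is an illustration under assumptions rather than our actual audit script: the helper names parse_time and login_grace_seconds are invented here, the parser only handles plain keyword-and-value lines, and it relies on the sshd_config conventions that keywords are case-insensitive, the first value found wins, and times may carry s, m, h, d, or w suffixes.

```python
import re

# Seconds per unit for sshd_config time values; no suffix means seconds.
_UNITS = {"": 1, "s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def parse_time(value):
    """Convert an sshd_config time value such as '600', '10m' or '1h30m' to seconds."""
    value = value.lower()
    parts = re.findall(r"(\d+)([smhdw]?)", value)
    # Reject values with leftover characters the pattern did not consume.
    if not parts or "".join(n + u for n, u in parts) != value:
        raise ValueError(f"unrecognized time value: {value!r}")
    return sum(int(n) * _UNITS[u] for n, u in parts)


def login_grace_seconds(config_text):
    """Return LoginGraceTime in seconds, or None if the file does not set it.

    Hypothetical helper: ignores comment lines and returns the first
    LoginGraceTime entry it finds.
    """
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.replace("=", " ").split()
        if len(fields) >= 2 and fields[0].lower() == "logingracetime":
            return parse_time(fields[1])
    return None
```

A return value of None means the file leaves the setting to the built-in default, which is exactly the case that caught us out, so an audit should treat it as worth a look rather than as a pass.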

Interestingly, this left a fair number of connections stuck in FIN_WAIT_1, the state in which our side has sent its FIN and is still waiting for it to be acknowledged. That is an indication that the bot probing us was just abandoning the connection on its end and moving on, without sending anything to our side. That might just be a crash on their side, or it might be a tuned bot. The distinction would be hard to make, but fortunately the sshd process quits immediately after closing its side of the connection when LoginGraceTime expires, so the result is a sufficient reduction in the number of outstanding sshd processes in ESTABLISHED, which fixed our immediate problem.

It was a bit of a surprise that this hasn't bitten us on other machines, but considering that most of our automations completely destroy and rebuild the VM over SSH, the likelihood that a significant number of abandoned connections build up during the build process is pretty low.

We will be updating our standard sshd configuration to take care of this in the next build cycle, and I've individually updated the machines that have seen this problem repeatedly.