Over the past few weeks, RS and I have noticed an increasing number of unexplained failures logging in over SSH with both manual and automated means. On most of the servers it was just an inconvenience, but as it started to become more frequent, it became a significant issue for some of our automation scripts.
However, the most significant effect was on our git server, which only allows access over ssh. Not surprisingly, this server gets frequent, short-duration connections from our CI server as well as from individual git clients checking their status against their remotes. As such, the sudden spate of failures caused us to look into the issue.
Upon looking at the machine in question, it was clear that the usual ssh background probing was going on, but that the bot probing us was getting confused and leaving its connections open. This was causing a build-up of ESTABLISHED connections, which was running up against the default limit of 10 on SmartOS. Individually, these were not coming in very quickly (probably in order to avoid triggering banning software), but since they were taking a very long time to transition from ESTABLISHED to CLOSED, they were taking up space in the table of 10 and causing additional inbound connections (from our legitimate users) to be unceremoniously shut down.
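A build-up like this can be spotted by tallying TCP states for the ssh port. The following is a minimal sketch, not the tooling we actually used: it assumes Solaris-style `netstat -an` output (address and port joined by a dot, state in the last column), and both the function name `tally_ssh_states` and the sample rows are made up for illustration.

```python
from collections import Counter


def tally_ssh_states(netstat_text, port=22):
    """Count TCP states for sockets whose local port equals `port`.

    Assumes Solaris-style rows such as:
        10.0.0.5.22  203.0.113.7.51122  64240 0 128872 0 ESTABLISHED
    """
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # blank lines and short rows carry no TCP state
        local, state = fields[0], fields[-1]
        # The port is whatever follows the last dot of the local address.
        if local.rpartition(".")[2] != str(port):
            continue  # header rows, separators and other ports land here
        counts[state] += 1
    return counts


# Hypothetical sample output, using documentation-only IP ranges.
SAMPLE = """
   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q    State
-------------------- -------------------- ----- ------ ----- ------ -----------
      *.22                 *.*                0      0 128000      0 LISTEN
10.0.0.5.22          203.0.113.7.51122    64240      0 128872      0 ESTABLISHED
10.0.0.5.22          203.0.113.9.40210    64240      0 128872      0 ESTABLISHED
10.0.0.5.22          198.51.100.4.33100   64240      0 128872      0 TIME_WAIT
10.0.0.5.80          198.51.100.4.33101   64240      0 128872      0 ESTABLISHED
"""

if __name__ == "__main__":
    for state, count in sorted(tally_ssh_states(SAMPLE).items()):
        print(state, count)
```

A count of ESTABLISHED sockets on port 22 that sits near the limit, with few matching logged-in sessions, would point to the kind of abandoned connections described above.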
Upon further investigation of the sshd_config file, I noted that LoginGraceTime in SmartOS is set to a default value of 600 seconds, so each connection had 10 minutes to wait around until a successful login occurred or until it was disconnected. That seems a bit long even if you're allowing password authentication, but in our key-only environment it is extreme, and so we cut it back to 8 seconds.
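To see why the grace time matters so much, a rough back-of-the-envelope model helps. The sketch below uses Little's law (average occupancy = arrival rate × time held) and assumes that every abandoned probe holds its slot for the full grace period; the function names are my own and the figures are only the ones quoted above (a limit of 10, grace times of 600 and 8 seconds).

```python
def slots_held(probes_per_minute, grace_seconds):
    """Average number of slots occupied by abandoned probes (Little's law)."""
    return probes_per_minute * grace_seconds / 60.0


def probe_rate_to_fill(limit, grace_seconds):
    """Probes per minute needed to keep `limit` slots occupied on average."""
    return limit * 60.0 / grace_seconds


if __name__ == "__main__":
    # With a 600 second grace time, one abandoned probe per minute is enough
    # to keep all 10 slots busy.
    print(probe_rate_to_fill(10, 600))  # 1.0
    # With an 8 second grace time, the same bot would need 75 probes per minute.
    print(probe_rate_to_fill(10, 8))    # 75.0
```

Under these assumptions, a bot slow enough to stay under the radar of banning software can still exhaust the table at 600 seconds, while at 8 seconds it would need a rate that is much harder to hide.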
Interestingly, this left a fair number of connections stuck in a closing state, a sign that the bot probing us was just terminating the connection on its end and moving on, without sending any indication to our side. That might just be a crash on their side, or it might be a tuned bot. The distinction would be hard to make, but fortunately the sshd process quits immediately after closing its side of the connection when LoginGraceTime expires, so the result is a sufficient reduction in the number of outstanding sshd processes in ESTABLISHED, which fixed our immediate problem.
It was a bit of a surprise that this hasn't bitten us on other machines, but considering that most of our automations completely destroy and rebuild the VM over ssh, the likelihood that a significant number of abandoned connections build up during the build process is pretty low.
We will be updating our standard sshd configuration to take care of this in the next build cycle, and I've individually updated the machines that have seen this problem repeatedly.
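For checking that a given machine has picked up the change, something along these lines could work. It is only a sketch with hypothetical helper names (`parse_time`, `login_grace_time`): it reads the text of an sshd_config, assumes keyword and value are separated by whitespace, takes the first LoginGraceTime it finds (OpenSSH documents that the first value obtained for a keyword is used), and stops at the first Match block.

```python
import re

# Seconds per unit for sshd_config time values (no suffix means seconds).
_UNITS = {"": 1, "s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def parse_time(value):
    """Convert a time value such as '600', '10m' or '1h30m' to seconds."""
    text = value.lower()
    parts = re.findall(r"(\d+)([smhdw]?)", text)
    # Reject anything the pattern did not consume completely.
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"unrecognised time value: {value!r}")
    return sum(int(n) * _UNITS[u] for n, u in parts)


def login_grace_time(config_text):
    """Return the global LoginGraceTime in seconds, or None if it is unset."""
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        keyword = fields[0].lower()  # keywords are case-insensitive
        if keyword == "match":
            break  # only global settings are considered in this sketch
        if keyword == "logingracetime" and len(fields) > 1:
            return parse_time(fields[1])
    return None


if __name__ == "__main__":
    sample = "# hypothetical config\nPort 22\nLoginGraceTime 8\n"
    print(login_grace_time(sample))  # 8
```

A return value of None means the file does not set the option, in which case the platform default (600 on the SmartOS machines described here) applies.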