ASSTR Down - News

Found this and thought I’d share it with you guys, since it explains what’s up with ASSTR (and my Broken_Arrow site on ASSTR):

The short story is that our main server unexpectedly crashed earlier this week.  This is the machine that hosts all the ASSM admin stuff, the FTP server, and the author management stuff.

We have a piece of equipment that lets us remotely reboot (power cycle) our servers in case something like this happens, but we can no longer access that system for some unknown reason.

We also have a secondary web server that is supposed to act as a hot (ready-to-go) standby for the primary, but that machine died a few years ago and we have yet to diagnose and repair it.

We contacted our ISP to reboot the server and rewire the network connection to the remotely controllable power strip.  The latter effort failed, and the former effort resulted in the server coming back up briefly and then crashing again shortly thereafter.

Our other admin used a standby server to bring the web site back up, but apparently the backup he used was quite old.  I’m not sure if we have a more recent backup.  We have a server that’s supposed to perform daily automated backups, but neither of us monitors that server regularly, so it’s possible it stopped working years ago and we never noticed.

Essentially we have experienced a series of hardware issues and failures over the years, and we have managed to work around them up until this point.  Presumably the hard drive in the primary server is still good, so that, worst case, we can get the data off of it even if we can’t immediately get the server itself operational again.

The reason we haven’t repaired the hardware failures is primarily gross inconvenience and, to a lesser degree, cost.  The hardware is located in the Bay Area, but none of the admins live anywhere near there.  We chose that location due to low cost and reliability, but a trip across the country to do in-person repairs requires a plane ticket, hotel, car rental, and usually a vacation day from work.  We used to try to get out every year or two, but our most recent trip was several years ago.  Our ISP can help to an extent, but there’s a limit to how much diagnosing and repairing they can do.

At this point we’re not sure why the main server kept crashing.  It could be anything from a dead CPU fan causing the machine to overheat to a hacker figuring out a remote exploit that is causing the operating system to crash.

Much like upgrading the site’s software, we have hopes and dreams of moving the site into a cloud-based architecture, but this is something that will take a bit of time and effort to pull off.  If/when we get it in the cloud, we should be immune from most of these kinds of hardware failures.

The good news here is that we probably did not suffer any data loss; we just need to get the main server fixed (again, unsure if the problem is hardware or software), or we need to pull the data off the hard drive and move all the services from the old server to a new one.

Unlike with many of our other systems, we get alerted automatically when the web site goes down, so we were made aware of the server problem as soon as it occurred.  We’ve been working on it ever since, but the challenges of day jobs and not being physically present with the hardware create a difficult situation.

I will try to keep everyone updated via the newsgroups.

Rey del Sexo