Sunday, August 19, 2007

Beginning of the longest day.

I have no idea how I am going to finish everything I need to do today.

There is one main rule I stick with at work, and almost never push aside. If a problem comes up in a radmind loadset, FIND OUT WHAT CAUSED IT, and FIGURE OUT HOW TO AVOID THE PROBLEM, or HOW TO FIX IT.

Friday morning, I disregarded that self-imposed rule, for the sake of expediency in getting the new Intel Macs ready for the school year.

You see, Adobe After Effects CS3 was not playing well. It would not hold the serial number once the loadset was applied to the machines, neither on the MacPros nor on the older G5s, and neither in the admin account nor in the user account. One of my attempts at fixing it actually made it impossible to log in to the MacPro. When I tried to reset the password (thinking it might have been wiped out during testing), I got an error that no users were available on that volume.

Oh crap.

Time was wasting, though, so I ended up just reloading the machine using NetRestore, and reverted to an earlier version of After Effects that I already had working earlier this summer. It was only one machine, after all.

I applied the fix, and it worked perfectly on both types of machine.

Whew.

I had one more software package to install - Maya 8.5. I sat at my master machine, a G5, and created the loadset. I applied it to all of the G5s, six at a time, as it was a large package, and I didn't want to overload the server and slow things down to a crawl. I then moved to the MacPros, and began applying it there.

Here is where I made the mistake of forgetting that I was dealing with two different types of machines.

Friday at 5:40 pm, thinking everything was perfect, I went to the MacPros to start the apply on the last six machines, with the aim of going home and just finishing up on Monday (they would have needed just a reboot to be ready for users).

I sat down at one of the finished machines, to test the overload. I was pretty confident it had worked correctly.

I couldn't log in.

I couldn't log in on any account at all.

The damn bug had been replicated, in the worst possible way.

I quickly stopped the last six from running, and ran to the server to remove that loadset from that command file. I ran radmind again on those six, and managed to salvage them.

But the other 14 will have to be reloaded today. They take 26 minutes to reload. Normally, I have two firewire drives, but I loaned one to my boss. I have one drive, and 14 machines. This was to be my last day off for the next month, between work and the faire.
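To put a number on that, here is a rough back-of-the-envelope sketch. It assumes each firewire drive reloads one machine at a time and ignores the time spent moving the drive between machines; the function name is made up purely for illustration.

```python
import math

def total_reload_minutes(machines: int, minutes_per_reload: int, drives: int) -> int:
    """Estimate total wall-clock minutes to reload every machine.

    Assumption: each drive handles one machine at a time, so the
    machines are processed in rounds of `drives` machines each.
    """
    rounds = math.ceil(machines / drives)
    return rounds * minutes_per_reload

# Figures from the post: 14 machines, 26 minutes per reload.
one_drive = total_reload_minutes(14, 26, drives=1)
two_drives = total_reload_minutes(14, 26, drives=2)

for label, minutes in (("one drive", one_drive), ("two drives", two_drives)):
    hours, mins = divmod(minutes, 60)
    print(f"{label}: {minutes} min (about {hours} h {mins} min)")
```

Under those assumptions, one drive works out to roughly six hours of reloading, against about three hours with both drives.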

This is the price you pay for not tracking down why something failed, because you think you don't have enough time. If I had spent even an hour figuring out what files were causing the problem in the first failed machine, or if I had tested one machine after applying that loadset (which I normally do, but I honestly thought there would be no problems with the Maya loadset), I wouldn't have to go to work today, on the worst possible day, to reload more than half a lab.

I am writing this out in the hope that it will remind me to never, ever make that mistake again.
