I used to buy Dell machines with the expectation that if there was a problem with them I was essentially on my own. For the price, that was OK and I bought them when that thinking was appropriate.
Very early Tuesday morning one of my inbound mail servers (a Dell PowerEdge 840) stopped responding completely. Since its failure wasn’t a critical problem (there are backups for everything that machine runs) I didn’t get to it until about 6pm Tuesday. It turns out the power supply had failed (the PS fan wouldn’t turn on and the PS put out no power at all). This particular server has been on constantly since new for just shy of 2 years. On Saturday it was off for the first time (due to an extended power outage) for 12 hours. I guess it got used to being on and failed in protest of being turned off.
Not in the mood for phone support hell, I tried Dell’s on-line support chat around 9pm. In 9 minutes I had a dispatch and case number and a power supply on the way. That’s a tiny bit slower than the same conversation would have taken with HP, but it was the simple exchange I’m accustomed to with HP and not so much with Dell: “my power supply is dead, I’ve already tested it, it’s shot, send me a replacement”… “OK, it’s on the way”. So I’m happy; maybe I’ll start buying more Dell servers for semi-critical infrastructure.
Wednesday at 9:40pm I get an email from the support rep I had chatted with the previous day saying “the power supply is back-ordered, you should get it *next* Wednesday”. I reply that “a week for a power supply is a joke, HP would have one to me next day or the day after at the latest”. He responds, “Understood, I’ve noted your comments in my notes”. At this point I’m never again buying a Dell machine for any server system that isn’t highly redundant. The next day at work I spend $10k on some HP equipment that I had been considering buying from Dell.
Shortly after buying the HP equipment, Purolator shows up with my replacement power supply from Dell. It’s a refurb (as HP’s would be after 30 or 90 days), but it works and the server seems to be OK.
So, now I’m not sure if I’ll be buying Dell for non-redundant systems. It seems their support is getting up to par with HP but they have no idea about when you’ll actually get replacement parts (unless they do and complaining made it show up faster). I’ll likely keep buying Dell for grid computing clusters simply on a price basis. A failure of a cluster node is far less important than price and operating cost. Anything else though… I have no idea about Dell… I know HP is safe.
November 15th, 2008
Scotiabank’s ScotiaOnline online banking application has a major flaw. If you schedule a bill to be paid in the future (it’ll be listed under “Pending Transactions”) and then delete that pending transaction after 6pm on the day it is to occur, it’ll still be processed as if you hadn’t deleted it at all. This makes it clear that the database you see isn’t the one they use to actually process transactions, and that the two can fall out of sync (many hours out of sync, in fact).
I found this out on Friday when I deleted a pending transaction… a Visa bill payment for a few thousand dollars that was to come from my line of credit. Since I had the cash, I deleted the pending transaction and paid the bill from my chequing account instead. The next day I found that the deleted pending transaction went through anyway… so Visa (well, TD Bank) got another few thousand dollars from me (I paid off my entire balance twice). Thanks Scotiabank… I didn’t have a use for a few grand anyway. Yet another reason why I probably shouldn’t bank at Scotiabank.
May 25th, 2008
One thing I’ve forgotten to do, twice now, before attempting to run a job on Hadoop clusters that I’ve set up in a hurry to demo something is to make sure that all of the nodes are registered in DNS, or at least have entries in their hosts files for every node (datanodes, the namenode, the master and all the slaves) in the cluster.
If you start out with the master not knowing what name the slaves’ IPs map to, the slaves won’t be able to connect to the master, even if you use IPs in the conf/slaves file. This seems silly to me, but that’s the way it is, at least as of 0.14.4. You’ll discover and fix this first. The slaves will then connect and the first level of links will start to work in the master’s web interface.
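A quick way to catch this before starting a job is to check, on the master, that each slave IP maps back to a name. What follows is only a rough sketch using Python’s standard `socket` module; the function names and the IPs in the usage comment are made up for illustration, and it says nothing about how Hadoop itself does its lookups.

```python
import socket


def reverse_lookup(ip, lookup=socket.gethostbyaddr):
    """Return the primary hostname that ip maps to, or None if there is no mapping.

    `lookup` is injectable so the check can be exercised without a real resolver.
    """
    try:
        # gethostbyaddr returns (hostname, aliaslist, ipaddrlist)
        return lookup(ip)[0]
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError
        return None


def unmapped_ips(ips, lookup=socket.gethostbyaddr):
    """Return the IPs from `ips` that this machine cannot map to a name."""
    return [ip for ip in ips if reverse_lookup(ip, lookup) is None]


# Example use on the master (the IPs are placeholders, not from a real cluster):
#   print(unmapped_ips(["192.168.1.11", "192.168.1.12"]))
# An empty list means every slave IP maps to some name on this machine.
```

Anything this returns needs a DNS record or a hosts file entry on the machine where you ran it.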
Now the TaskTrackers on the slave nodes will successfully run tasks and will probably complete the map stage. If the nodes have varying performance levels or your data isn’t well distributed on your HDFS file system, the map stage may appear to hang (or repeat the same percentage(s) over and over). If you make it through the map stage, the reduce stage will fail to complete, for the same reason the map stage may fail, if you didn’t also configure each of the slave nodes to know the names of all of the other slave nodes. As soon as a slave node needs data off of another datanode (for a map task, a reduce task, etc.) it’ll face the same problem it initially had in contacting the master node before you configured DNS for it, only this time with the other slave nodes.
So… make sure that all machines involved in the cluster know the hostname (and its IP) of every other machine in the cluster. Configuring just the master to know all the slaves’ names/IPs, or just all of the slaves to know the master’s name/IP, will not work; you need both, even if you haven’t used a single hostname in either of the conf/masters or conf/slaves files.
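If you’re going the hosts file route, one way to keep every machine consistent is to generate the same entries for all of them from a single list and then verify, on each node, that every name resolves to the address you expect. Here’s a minimal sketch of that idea in Python; the `CLUSTER` names and addresses are placeholders I made up, and the helper names are hypothetical.

```python
import socket

# Hypothetical cluster layout; replace with your own names and addresses.
CLUSTER = {
    "master": "192.168.1.10",
    "slave1": "192.168.1.11",
    "slave2": "192.168.1.12",
}


def hosts_lines(cluster):
    """Format one hosts-file line ("IP<TAB>hostname") per node, sorted by name."""
    return [f"{ip}\t{name}" for name, ip in sorted(cluster.items())]


def mismatches(cluster, resolve=socket.gethostbyname):
    """Return {name: resolved_ip_or_None} for names that don't resolve as expected.

    Run this on every machine in the cluster, not just the master.
    `resolve` is injectable so the check can be tested without real DNS.
    """
    bad = {}
    for name, expected in cluster.items():
        try:
            got = resolve(name)
        except OSError:
            got = None  # the name did not resolve at all
        if got != expected:
            bad[name] = got
    return bad


# Example use:
#   print("\n".join(hosts_lines(CLUSTER)))  # lines to append to each hosts file
#   print(mismatches(CLUSTER))              # {} means this node resolves every name
```

The point of running `mismatches` on every node is exactly the lesson above: the master resolving all of the slaves isn’t enough if the slaves can’t resolve each other.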
February 8th, 2008