February 8th, 2008
One thing I’ve forgotten to do, twice now, before running a job on a Hadoop cluster that I’ve set up in a hurry for a demo is to make sure that every node is registered in DNS, or at least has entries in its hosts file for every other node in the cluster (the datanodes, the namenode, the master and all the slaves).
If you start out with the master not knowing which names the slaves’ IPs map to, the slaves won’t be able to connect to the master, even if you use IPs in the conf/slaves file. This seems silly to me, but that’s the way it is, at least as of 0.14.4. You’ll discover and fix this first. The slaves will then connect, and the first level of links will start to work in the master’s web interface.
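A quick way to catch this before starting the daemons is to ask the master whether it can map each slave IP back to a name. The following is a minimal sketch, not anything Hadoop ships: the function name `ips_without_names` and the `reverse_lookup` parameter are made up for illustration, and it assumes a reverse lookup through Python's standard resolver is a fair stand-in for the lookup Hadoop performs.

```python
import socket


def ips_without_names(ips, reverse_lookup=socket.gethostbyaddr):
    """Return the IPs for which a reverse lookup fails on this machine.

    `reverse_lookup` defaults to the system resolver (DNS plus the hosts
    file); it is a parameter only so the check can be tried offline.
    """
    missing = []
    for ip in ips:
        try:
            reverse_lookup(ip)
        except OSError:
            # socket.herror and socket.gaierror are both OSError subclasses
            missing.append(ip)
    return missing
```

Run on the master, something like `ips_without_names(open("conf/slaves").read().split())` would list the slave addresses that still need a DNS or hosts-file entry, assuming conf/slaves contains one IP per line.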
Now the TaskTrackers on the slave nodes will successfully run tasks and will probably complete the map stage. If the nodes have varying performance levels or your data isn’t well distributed across HDFS, the map stage may appear to hang (or repeat the same percentages over and over). Even if you make it through the map stage, the reduce stage will fail to complete, for the same reason the map stage may fail, unless you have also configured each slave node to know the names of all the other slave nodes. As soon as a slave node needs data from another datanode (for a map task or a reduce task), it will hit the same problem it originally had contacting the master before you configured DNS for it, only this time with the other slave nodes.
So… make sure that every machine in the cluster knows the hostname (and its IP) of every other machine in the cluster. Configuring just the master to know all the slaves’ names/IPs, or just the slaves to know the master’s name/IP, will not work. You need both, even if you haven’t used a single hostname in either the conf/masters or conf/slaves files.
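If you go the hosts-file route, the rule above can be checked mechanically on each machine. Here is a rough sketch under a few assumptions: the helper `missing_hosts_entries` is hypothetical, the hostnames and addresses in the usage note are invented, and the parser only handles the common "IP name [alias ...]" line format with `#` comments.

```python
def missing_hosts_entries(hosts_text, cluster):
    """Return the sorted names in `cluster` that `hosts_text` does not
    map to the expected IP.

    hosts_text: contents of a hosts file ("IP name [alias ...]" per line).
    cluster:    dict of hostname -> expected IP, one entry per machine.
    """
    found = {}
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            found.setdefault(name, ip)  # keep the first mapping seen for a name
    return sorted(name for name, ip in cluster.items() if found.get(name) != ip)
```

With a made-up cluster such as `{"master": "10.0.0.10", "slave1": "10.0.0.11", "slave2": "10.0.0.12"}`, running this against the hosts file on every node (master and slaves alike) should return an empty list everywhere before you submit a job.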