5. Set Up Slave Nodes
Get your network cables out. Install Linux on the first non-head
node. Follow these steps for each non-head node.
5.1. Base Linux Install
Going with my example node names and IP addresses, this is what I
chose during setup:
Workstation
auto partition
remove all partitions on system
use LILO as the boot loader
put boot loader on the MBR
host name wolf01
ip address 192.168.0.101
add the user "wolf"
same password as on all other nodes
NO firewall
The ONLY package installed: network servers. Un-select all other
packages.
It doesn't matter what else you choose; this is the minimum that
you need. Why fill the box up with non-essential software you will never
use? My research has been concentrated on finding that minimal
configuration to get up and running.
Here's another very important point: when you move on to an
automated install and config, you really will NEVER log in to the box.
Only during setup and install do I type anything directly on the
box.
5.2. Hardware
When the computer starts up, it will complain if it does not have
a keyboard connected. I was not able to modify the BIOS, because I had
older discarded boxes with no documentation, so I just connected a
"fake" keyboard.
I am in the computer industry, and see hundreds of keyboards come
and go, and some occasionally end up in the garbage. I get the old dead
keyboard out of the garbage, remove JUST the cord with the tiny circuit
board up there in the corner, where the num lock and caps lock lights
are. Then I plug the cord in, and the computer thinks it has a complete
keyboard without incident.
Again, you would be better off modifying your BIOS settings, if you
are able to. This is just a trick to use in case you don't have the BIOS
setup program.
5.3. Post Install Commands
After your newly installed box reboots, log on as root again,
and...
Up to this point, we are pretty much the same as the head node. I
do NOT do the modification of the exports file.
Also, do NOT add this line to the .bash_profile:
5.4. SSH On Slave Nodes
Recall that on the head node, we created a file "authorized_keys".
Copy that file, created on your head node, to the ~/.ssh directory on
the slave nodes. The HEAD node will log on to all the SLAVE
nodes.
The requirement, as stated in the LAM user manual, is that there
should be no interaction required when logging in from the head to any
of the slaves. So, copying the public key from the head node into each
slave node, in the file "authorized_keys", tells each slave that "wolf
user on wolf00 is allowed to log on here without any password; we know
it is safe."
However, you may recall that the documentation states that the
first time you log on, it will ask for confirmation. So only once, after
doing the above configuration, go back to the head node and type
"ssh wolfnn", where "wolfnn" is the name of your newly configured slave
node. It will ask you for confirmation, and you simply answer "yes" to
it, and that will be the last time you will have to interact.
Prove it by logging off, and then ssh back to that node, and it
should just immediately log you in, with no dialog whatsoever.
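The key copy and the no-interaction check described above can be
sketched as follows. This is a minimal sketch, not necessarily the exact
procedure used on the example cluster: the function name
install_head_key and its arguments are hypothetical, and it assumes
OpenSSH, whose default StrictModes setting ignores authorized_keys when
~/.ssh or the file itself is writable by other users.

```shell
#!/bin/sh
# Hypothetical helper: install the head node's authorized_keys file into a
# slave user's .ssh directory with permissions that OpenSSH accepts.
#   $1 = path to the authorized_keys file copied from the head node
#   $2 = home directory of the "wolf" user on the slave
install_head_key() {
    src=$1
    home=$2
    mkdir -p "$home/.ssh" || return 1
    chmod 700 "$home/.ssh" || return 1
    cp "$src" "$home/.ssh/authorized_keys" || return 1
    chmod 600 "$home/.ssh/authorized_keys"
}

# Example use on a slave, after copying the file over from wolf00:
#   install_head_key /tmp/authorized_keys /home/wolf
#
# Then, from the head node, answer the one-time host confirmation:
#   ssh wolf01
# Afterwards, confirm that no interaction is needed. BatchMode makes ssh
# fail instead of prompting, so a zero exit status means a silent login:
#   ssh -o BatchMode=yes wolf01 true && echo "wolf01 ok"
```

The /tmp/authorized_keys path is only an illustration; any location
the file was copied to will do.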
5.5. NFS Settings On Slave Nodes
As root, enter these commands:
cat >> /etc/fstab
wolf00:/mnt/wolf /mnt/wolf nfs rw,hard,intr 0 0
<control d>
What we did here was tell the slave node to mount, automatically at
boot, the directory that we exported in the /etc/exports file on the
head node. NFS is discussed further later in this document.
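Because "cat >>" appends blindly, running it twice leaves a duplicate
line in /etc/fstab. A minimal sketch of a repeatable version follows;
the function name add_wolf_mount is hypothetical, and the mkdir and
mount commands in the comments are assumptions about what a slave node
still needs, not steps taken from this document.

```shell
#!/bin/sh
# Hypothetical helper: append the NFS mount line to an fstab file only if
# that exact line is not already present.
#   $1 = fstab file to edit (normally /etc/fstab)
add_wolf_mount() {
    fstab=$1
    entry='wolf00:/mnt/wolf /mnt/wolf nfs rw,hard,intr 0 0'
    # -x matches the whole line, -F treats the entry as a fixed string.
    if grep -qxF "$entry" "$fstab" 2>/dev/null; then
        return 0
    fi
    printf '%s\n' "$entry" >> "$fstab"
}

# Example use, as root on a slave node:
#   add_wolf_mount /etc/fstab
#   mkdir -p /mnt/wolf    # assumption: the mount point must exist
#   mount /mnt/wolf       # mount now instead of waiting for a reboot
```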
5.6. Lilo Modifications On Slave Nodes
Then modify /etc/lilo.conf.
The 2nd line of this file says
Modify that line to say:
After it is modified, we invoke the changes. You type
"/sbin/lilo", and it will display back "added linux *" to confirm that
it took the changes you made to the lilo.conf file:
Why do I do this lilo modification? If you have been researching
Beowulf on the web, and understand everything I have done so far, you
may wonder, "I don't remember reading anything about lilo.conf."
All my Beowulf nodes share a single power strip. I turn on the
power strip, and every box on the cluster starts up immediately. As the
startup procedure progresses, it mounts file systems. Seeing that the
non-head nodes mount the shared directory from the head node, they all
will have to wait a little bit until the head node is up, with NFS ready
to go. So I make each slave node wait 2 minutes in the lilo step.
Meanwhile, the head node comes up and makes the shared directory
available. By then, the slave nodes finally start booting up because
lilo has waited 2 minutes.
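How a 2-minute wait translates into a lilo.conf value can be sketched
as below. This assumes the line being changed is LILO's timeout option,
which lilo.conf expresses in tenths of a second; the helper name
lilo_tenths is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: convert a wait in seconds to the tenths-of-a-second
# units that lilo.conf uses for its timeout option (assumed here to be the
# option edited on the slave nodes).
lilo_tenths() {
    echo $(( $1 * 10 ))
}

# A 2-minute wait is 120 seconds:
echo "timeout=$(lilo_tenths 120)"    # prints timeout=1200
```

If the head node on your cluster takes longer than that to get NFS
running, pass a larger number of seconds to the helper.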