Rocks Cluster Setup
From NCEAS Knowledge Base
Steps for setting up the Rocks 5.0 software running on the NCEAS Microcluster
Installation of Frontend
- Download and burn the Rocks 5.0 i386 DVD
- Boot from the DVD and enter 'frontend' as the boot option to start the install
- Click on the local rolls 'CD/DVD-based rolls' button
- Choose the area51, base, bio, ganglia, hpc, java, kernel, os, sge, and web-server rolls for installation; add the backup roll if upgrading
- Enter information for the cluster
- FQHN: catbert.nceas.ucsb.edu
- Name: NCEAS MicroCluster
- CO: UCSB
- CL: Santa Barbara
- CS: California
- CC: US
- Contact: whoever is setting this up...
- URL: http://catbert.nceas.ucsb.edu
- Lat/Long: ?
- Enter information for eth0
- IP address: 10.1.1.1
- Netmask: 255.0.0.0
- Enter information for eth1
- IP address: 128.111.220.156
- Netmask: 255.255.255.0
- Enter Misc network settings
- Gateway: 128.111.220.1
- DNS: 128.111.220.18,128.111.220.16,128.111.1.2
- Enter the root password
- Enter the time zone and time server
- Partition the disk: use Auto for new installs and Manual for upgrades
- Follow the instructions for loading the rolls; the installation then starts
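A mistyped gateway or netmask on the network screens is easy to miss until the frontend cannot reach the outside network. As a minimal sketch (the helper names `ip_to_int` and `same_subnet` are made up for this example and are not part of Rocks), the following checks that the gateway entered above lies on the eth1 subnet:

```shell
# Convert a dotted-quad IPv4 address to an integer.
# The body runs in a subshell so the IFS change does not leak out.
ip_to_int() (
    IFS=.
    set -- $1    # split the address on dots into $1..$4
    echo $(( $1 * 16777216 + $2 * 65536 + $3 * 256 + $4 ))
)

# Succeed if ADDRESS and GATEWAY share a network under NETMASK.
same_subnet() {
    addr=$(ip_to_int "$1")
    gw=$(ip_to_int "$2")
    mask=$(ip_to_int "$3")
    [ $(( $addr & $mask )) -eq $(( $gw & $mask )) ]
}

# Values taken from the eth1 and gateway entries above.
if same_subnet 128.111.220.156 128.111.220.1 255.255.255.0; then
    echo "gateway is on the eth1 subnet"
else
    echo "gateway is NOT on the eth1 subnet"
fi
```

The same check fails for the private eth0 address (10.1.1.1 with netmask 255.0.0.0), which is expected: the gateway belongs to the public interface only.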
Post-install Frontend Configuration
- modify iptables to allow viewing ganglia from hosts other than 127.0.0.1
- edit /etc/sysconfig/iptables (uncomment the line to allow http port 80 access)
- note: using other utilities to modify the firewall such as 'setup' or 'lokkit' will break Rocks
- restrict viewing of web server to specific addresses by editing /etc/httpd/conf.d/rocks.conf
- install extra software
- download RPMs, copy them to /home/install/contrib/5.0/i386/RPMS/
- copy /home/install/site-profiles/5.0/nodes/skeleton.xml to /home/install/site-profiles/5.0/nodes/extend-compute.xml
- add package names to extend-compute.xml
- rebuild kickstart image: cd /home/install && rocks-dist dist
- rebuild nodes: cluster-fork /boot/kickstart/cluster-kickstart
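Package names go into extend-compute.xml as `<package>` elements inside the file's `<kickstart>` element, one per package. The sketch below only prints those lines so they can be pasted into the file. The function name `package_entries` and the package names are hypothetical and chosen for illustration.

```shell
# Print one <package> element per package name given as an argument.
package_entries() {
    for pkg in "$@"; do
        printf '<package>%s</package>\n' "$pkg"
    done
}

# Hypothetical package names, for illustration only.
package_entries gdal proj
```

Use the base package name rather than the RPM file name. If in doubt, `rpm -qp --queryformat '%{NAME}\n' file.rpm` prints the name stored in a downloaded RPM.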
Node Installation
Node Prep
- go into the bios
- enable PXE booting
- set boot order to CDROM → PXE → HDD
- disable hyperthreading
- some Optiplex GX240 machines hang when starting the anaconda installer. For these machines, disable the floppy drive.
Frontend Prep
- Log in to the Frontend and run "insert-ethers"
- Boot each node once it has been configured as described in "Node Prep", and watch the nodes appear in the insert-ethers status box as they are installed
- Quit insert-ethers
General Admin Tasks
Rebuilding Nodes
- run cluster-fork /boot/kickstart/cluster-kickstart as root from the Frontend to cause all nodes to reboot and load a new system image.
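When only a few nodes need a new image, the same kickstart script can be run on them individually over ssh instead of through cluster-fork. This is a sketch under assumptions: the function name `rebuild_nodes` and the `DRY_RUN` switch are invented for this example, and compute-0-1 is a placeholder node name. By default it only prints the commands it would run.

```shell
# Rebuild only the named nodes.
# Prints the commands unless DRY_RUN=0 is set (an assumption of this sketch).
rebuild_nodes() {
    for node in "$@"; do
        if [ "${DRY_RUN:-1}" = "0" ]; then
            ssh "$node" /boot/kickstart/cluster-kickstart
        else
            echo "ssh $node /boot/kickstart/cluster-kickstart"
        fi
    done
}

# Preview the rebuild of two nodes.
rebuild_nodes compute-0-0 compute-0-1
```

Run it as root on the Frontend with `DRY_RUN=0` once the printed commands look right.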
Forcibly Rebuilding Nodes
Nodes should restart successfully after any kind of reboot, but with the 5.0 release hard reboots appear to hang on many of our nodes. Either option below forces a hung node to reinstall.
Option 1:
- Log in to the Frontend and set the troubled node to "reinstall on boot" status
% rocks set host pxeboot compute-0-0 action=install
- Boot the node
Option 2:
- Boot the troubled nodes from the Rocks boot CD/DVD (a copy is typically left sitting on the nodes)
- The boot process takes 5-10 minutes, after which the CD will eject. Remove the CD from the tray so the nodes do not get stuck in a loop
- Approximately 30 minutes later the node should be available on the status page, http://catbert.nceas.ucsb.edu/ganglia/
Adding Users
- Run useradd username and set up the user as usual
- Run rocks sync users to sync the account across all nodes
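The two steps can be wrapped so the sync is never forgotten. This is a minimal sketch, not an NCEAS script: `run`, `add_cluster_user`, the `DRY_RUN` switch, and the username jdoe are all assumptions made for the example. By default it only prints the commands.

```shell
# Run a command, or just print it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Create the account on the Frontend, then push it to all nodes.
add_cluster_user() {
    user=$1
    if [ -z "$user" ]; then
        echo "usage: add_cluster_user USERNAME" >&2
        return 1
    fi
    run useradd "$user" || return 1   # stop if the account cannot be created
    run rocks sync users
}

# Preview the commands for a hypothetical user.
add_cluster_user jdoe
```

Any other per-user setup, such as setting a password, still has to be done before or after the sync.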
Replacing a Node
note: this doesn't work
- Run the following as root to replace compute-0-20
% rocks remove host compute-0-20
% rocks sync config
% insert-ethers --replace=compute-0-20
Sync Ganglia with the Node List
Ganglia sometimes gets out of sync with the node list. Restarting it brings it back in sync.
% service gmetad stop
% service gmond stop
% service gmond start
% service gmetad start
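The restart order matters here: gmetad is stopped first and started last, so it is never polling while gmond is down. A small sketch that keeps the order in one place follows. The function name `restart_ganglia` and the `DRY_RUN` switch are assumptions of this example, and by default it only prints the commands.

```shell
# Restart Ganglia in the order used above: stop gmetad, then gmond;
# start gmond, then gmetad. Prints the commands unless DRY_RUN=0 is set.
restart_ganglia() {
    for step in "gmetad stop" "gmond stop" "gmond start" "gmetad start"; do
        if [ "${DRY_RUN:-1}" = "0" ]; then
            # $step is left unquoted on purpose so it splits into
            # the service name and the action.
            service $step || return 1
        else
            echo "service $step"
        fi
    done
}

# Preview the sequence.
restart_ganglia
```

Run it as root on the Frontend with `DRY_RUN=0` to perform the restart.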
