Rocks Cluster Setup

From NCEAS Knowledge Base

Steps for setting up the Rocks 5.0 software running on the NCEAS Microcluster

Installation of Frontend

  1. Download and burn the Rocks 5.0 i386 DVD
  2. Boot from the DVD and enter 'frontend' at the boot prompt to start the install
  3. Click on the local rolls 'CD/DVD-based rolls' button
  4. Choose the area51, base, bio, ganglia, hpc, java, kernel, os, sge, and web-server rolls for installation; add the backup roll if upgrading
  5. Enter information for the cluster
    • FQHN: catbert.nceas.ucsb.edu
    • Name: NCEAS MicroCluster
    • CO: UCSB
    • CL: Santa Barbara
    • CS: California
    • CC: US
    • Contact: whoever is setting this up...
    • URL: http://catbert.nceas.ucsb.edu
    • Lat/Long: ?
  6. Enter information for eth0
    • IP address: 10.1.1.1
    • Netmask: 255.0.0.0
  7. Enter information for eth1
    • IP address: 128.111.220.156
    • Netmask: 255.255.255.0
  8. Enter Misc network settings
    • Gateway: 128.111.220.1
    • DNS: 128.111.220.18,128.111.220.16,128.111.1.2
  9. Enter root password
  10. Enter the time zone and time server
  11. Partition the disk: choose Auto for new installs, manual for upgrades
  12. Follow the instructions for loading the rolls; installation will then start

Post-install Frontend Configuration

  1. modify iptables to allow viewing ganglia from hosts other than 127.0.0.1
    • edit /etc/sysconfig/iptables (uncomment the line to allow http port 80 access)
    • note: using other utilities to modify the firewall such as 'setup' or 'lokkit' will break Rocks
  2. restrict viewing of web server to specific addresses by editing /etc/httpd/conf.d/rocks.conf
  3. install extra software
    • download RPMs, copy them to /home/install/contrib/5.0/i386/RPMS/
    • copy /home/install/site-profiles/5.0/nodes/skeleton.xml to /home/install/site-profiles/5.0/nodes/extend-compute.xml
    • add package names to extend-compute.xml
    • rebuild kickstart image: cd /home/install && rocks-dist dist
    • rebuild nodes: cluster-fork /boot/kickstart/cluster-kickstart

Node Installation

Node Prep

  • go into the bios
    • enable PXE booting
    • set boot order to CDROM → PXE → HDD
    • disable hyperthreading
  • some Optiplex GX240 machines hang when starting the anaconda installer. For these machines, disable the floppy drive.

Frontend Prep

  • Login to the Frontend and execute "insert-ethers"
  • boot each node once it has been configured according to "Node Prep", and watch the nodes appear in the insert-ethers status box as they are installed
  • Quit insert-ethers

General Admin Tasks

Rebuilding Nodes

  • run cluster-fork /boot/kickstart/cluster-kickstart as root from the Frontend to cause all nodes to reboot and load a new system image.

Forcibly Rebuilding Nodes

Nodes should restart successfully after any kind of reboot. With the 5.0 release, however, hard reboots appear to hang on many of our nodes. Either option below forces such a node to reinstall.

Option 1:

  • Log in to the Frontend and set the troubled node to "reinstall on boot" status
% rocks set host pxeboot compute-0-0 action=install
  • Boot the node
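When several nodes hang at once, the Option 1 command can be wrapped in a loop. The sketch below reuses only the rocks set host pxeboot command shown above; the function name, the DRY_RUN switch, and any node name other than compute-0-0 are hypothetical.

```shell
#!/bin/sh
# Sketch: flag several nodes as "reinstall on boot" in one call.
# DRY_RUN and mark_for_reinstall are illustrative names, not Rocks tools.
DRY_RUN="${DRY_RUN:-1}"

mark_for_reinstall() {
    if [ "$#" -eq 0 ]; then
        echo "usage: mark_for_reinstall NODE [NODE...]" >&2
        return 2
    fi
    for node in "$@"; do
        if [ "$DRY_RUN" = "1" ]; then
            # Dry run: show the command instead of running it.
            echo "+ rocks set host pxeboot $node action=install"
        else
            rocks set host pxeboot "$node" action=install || return 1
        fi
    done
}

# Example call: mark_for_reinstall compute-0-0 compute-0-1
```

Each node still has to be booted by hand afterwards, as in the last step of Option 1.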

Option 2:

  • Boot troubled nodes using the Rocks boot CD/DVD (typically left sitting on the nodes)
  • The boot process takes 5-10 minutes, after which the CD will eject. Remove the CD from the tray so the node is not stuck in a loop
  • Approximately 30 minutes later the node should be available on the status page, http://catbert.nceas.ucsb.edu/ganglia/

Adding Users

  • Run useradd username and set up the user as usual
  • Run rocks sync users to sync the account across all nodes
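When adding more than one account, the sync only needs to run once, after the last useradd. A minimal sketch, assuming the two commands above are all that is needed; the function name and the DRY_RUN switch are illustrative, and passwords or other account settings still have to be set separately:

```shell
#!/bin/sh
# Sketch: create several accounts, then sync them to the nodes once.
# DRY_RUN, run, and add_cluster_users are illustrative names.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

add_cluster_users() {
    for user in "$@"; do
        run useradd "$user" || return 1
    done
    # One sync covers every account created above.
    run rocks sync users
}

# Example call: add_cluster_users alice bob   (user names are placeholders)
```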

Replacing a Node

Note: this procedure does not work.

  • Run the following as root to replace compute-0-20
% rocks remove host compute-0-20
% rocks sync config
% insert-ethers --replace=compute-0-20

Sync Ganglia with the Node List

Ganglia sometimes gets out of sync with the node list; restarting it brings it back in sync.

% service gmetad stop
% service gmond stop
% service gmond start
% service gmetad start
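The four commands above can be kept in one function so they always run in the same order: gmetad is stopped first and started last, with gmond restarted in between. This is a sketch only; restart_ganglia, run, and DRY_RUN are illustrative names.

```shell
#!/bin/sh
# Sketch: restart Ganglia in the order used on this page.
# DRY_RUN, run, and restart_ganglia are illustrative names.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

restart_ganglia() {
    run service gmetad stop
    run service gmond stop
    run service gmond start
    run service gmetad start
}
```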