CADI Compute Cluster Administration

CADI Cluster Members

The CADI compute cluster consists of:

  • (2) Dell PowerEdge 1435s (acl-primary, acl-storage)
  • (8) Dell PowerEdge 860s with Pentium D processors (acl-cadi-pentd-[1-8])
  • (9) Dell PowerEdge 1950s with Intel Xeon processors (acl-cadi-xeon-[1-9])

These systems are on their own private 192.168.0 network. acl-primary serves as the gateway.

Head Node (acl-primary)

acl-primary serves these roles:

  1. NAT gateway/firewall services between the 43.129 network and the 192.168.0 network
  2. TFTP boot services
  3. DHCP services (needed only for Kickstart installations)
  4. Red Hat Kickstart services
  5. Cluster System Administration
  6. Batch system (torque) head node
  7. Scratch space storage
  8. Head node for running distributed scripts
  9. Backups for configuration files and shared space.

NAT Gateway/Firewall

acl-primary has two network interfaces: eth0 has the public-facing 128.205.43.0 address, while eth1 has the internal 192.168.0.0 address. Name resolution, as defined by /etc/host.conf, consults the /etc/hosts file first and then DNS. The compute cluster nodes exist only within the internal hosts file namespace. The /etc/hosts file is copied down to the compute nodes as part of the hourly cron updates that are pushed to all machines.

Network Address Translation (NAT) places private IP subnetworks behind one or a small pool of public IP addresses, masquerading all requests to one source rather than several. The Linux kernel has built-in NAT functionality through the Netfilter kernel subsystem.

The Netfilter subsystem provides stateful or stateless packet filtering as well as NAT and IP masquerading services. It can also mangle IP header information for advanced routing and connection state management. Netfilter is controlled using the iptables tool.

To activate the iptables service...

[root@myServer ~ ] # service iptables restart
[root@myServer ~ ] # chkconfig --level 345 iptables on

By default, the IPv4 policy in Red Hat Enterprise Linux kernels disables support for IP forwarding. This prevents machines that run Red Hat Enterprise Linux from functioning as dedicated edge routers. To enable IP forwarding, edit /etc/sysctl.conf and change net.ipv4.ip_forward = 0 to read net.ipv4.ip_forward = 1. To apply the change to the sysctl.conf file, use:

[root@myServer ~ ] # sysctl -p /etc/sysctl.conf
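After editing, the value can be read back from the file as a quick sanity check. The following is a minimal sketch and not part of the original setup: the function name ip_forward_setting is hypothetical, and it only parses the file (it does not query the running kernel).

```shell
# Hypothetical helper: print the last net.ipv4.ip_forward value found in a
# sysctl.conf-style file (a later entry overrides an earlier one).
ip_forward_setting() {
    conf="${1:-/etc/sysctl.conf}"
    # Drop comment lines, strip all whitespace, then print what follows "=".
    sed -n -e '/^[[:space:]]*#/d' \
           -e 's/[[:space:]]//g' \
           -e 's/^net\.ipv4\.ip_forward=//p' "$conf" | tail -n 1
}
```

Calling ip_forward_setting with no argument reads /etc/sysctl.conf. To confirm what the running kernel is using, cat /proc/sys/net/ipv4/ip_forward should print 1 once forwarding is enabled.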

The rules are stored in /etc/sysconfig/iptables. The rules needed to enable IP Masquerading and NATing are...

[root@myServer ~ ] # iptables -A FORWARD -i eth1 -j ACCEPT
[root@myServer ~ ] # iptables -A FORWARD -o eth1 -j ACCEPT
[root@myServer ~ ] # iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
[root@myServer ~ ] # service iptables save
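
One way to confirm that the saved rule set contains the rules above is to search the saved file for them. This is a rough sketch that assumes the file uses the usual iptables-save layout (one "-A CHAIN ..." line per rule); the function name has_nat_rules is made up for this example.

```shell
# Hypothetical helper: succeed only if a saved rules file contains the
# MASQUERADE rule on eth0 and a FORWARD ACCEPT rule for traffic entering eth1.
has_nat_rules() {
    rules="${1:-/etc/sysconfig/iptables}"
    grep -Eq -- '^-A POSTROUTING .*-o eth0 .*-j MASQUERADE' "$rules" &&
    grep -Eq -- '^-A FORWARD .*-i eth1 .*-j ACCEPT' "$rules"
}
```

For example, has_nat_rules && echo "NAT rules are saved". The rules currently loaded in the kernel can be listed with iptables -t nat -L POSTROUTING -n -v.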

TFTP

TFTPD is needed in conjunction with Red Hat Kickstart services. TFTP uses the User Datagram Protocol (UDP) and provides no security features. It is often used by servers to boot diskless workstations, X-terminals, and routers. tftp is an xinetd-based service; start it with the following commands:

/sbin/chkconfig --level 345 xinetd on
/sbin/chkconfig --level 345 tftp on

The TFTP root is located in /tftpboot. The boot images are stored in /tftpboot/linux-install/RHEL. The configuration files are stored in /tftpboot/linux-install/pxelinux.cfg. Each configuration file must be named with the hexadecimal form of the client's IP address for the boot loader program to locate it.
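
The hexadecimal file name can be worked out with printf. The sketch below is illustrative only; the function name ip_to_pxe_hex is an assumption, and it performs no validation of the address it is given.

```shell
# Hypothetical helper: convert a dotted-quad IPv4 address into the eight
# uppercase hex digits used as a pxelinux.cfg file name.
ip_to_pxe_hex() {
    # Split the address on dots into four positional parameters.
    old_ifs=$IFS; IFS=.
    set -- $1
    IFS=$old_ifs
    printf '%02X%02X%02X%02X\n' "$1" "$2" "$3" "$4"
}

ip_to_pxe_hex 192.168.0.11   # prints C0A8000B
```

With this naming, the configuration file for 192.168.0.11 would be /tftpboot/linux-install/pxelinux.cfg/C0A8000B.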

DHCP

DHCP manages allocation of IP addresses based on MAC (hardware) address. Because a server typically needs a static IP address, we use DHCP only for PXE boot support.

The dhcpd.conf file is located in /etc. Changes need to be made here. Whenever changes are made, DHCP should be restarted with /etc/init.d/dhcpd restart. A sample configuration file is located in /usr/share/doc/dhcp-/dhcpd.conf.sample. Log messages are written to /var/log/messages.

Because DHCP should only be used for PXE booting, all new clients can go into the group { section. An entry will look similar to:

        host acl-cadi-xeon-2 {
            hardware ethernet 00:19:b9:f2:af:c1;
            fixed-address 192.168.0.11;
        }
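
When several clients need entries, generating them avoids typing mistakes. This is a minimal sketch under the assumption that each client needs only the three lines shown above; the function name dhcp_host_entry is hypothetical, and the output still has to be pasted into the group section by hand.

```shell
# Hypothetical helper: print a dhcpd.conf host entry for one PXE client.
#   $1 = host name, $2 = MAC address, $3 = fixed IP address
dhcp_host_entry() {
    name=$1 mac=$2 ip=$3
    printf 'host %s {\n' "$name"
    printf '    hardware ethernet %s;\n' "$mac"
    printf '    fixed-address %s;\n' "$ip"
    printf '}\n'
}

# Reproduces the entry shown above.
dhcp_host_entry acl-cadi-xeon-2 00:19:b9:f2:af:c1 192.168.0.11
```

After adding the generated entry to dhcpd.conf, restart dhcpd so the change takes effect.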

Because the DHCP server should only be started on eth1, configure the DHCP server to start only on that device. In /etc/sysconfig/dhcpd, add the name of the interface to the list of DHCPDARGS:

# Command line options here 
DHCPDARGS=eth1

Kickstart

Kickstart is Red Hat's take on remote installation. The following occurs...

  1. A client is set to PXE boot. On the Dell PowerEdges this is done by pressing “F12” on boot. Some older clients may need to be booted from a CD that has the PXE boot client on it.
  2. The client sends a DHCP request for an IP address. DHCP responds with an address (as specified in /etc/dhcpd.conf). The DHCP directives specify which TFTP server and which bootloader program should be used:

    next-server 192.168.0.1;               #tftp server
    filename "linux-install/pxelinux.0";   #bootloader program

    The filename is relative to /tftpboot on the server.
  3. The bootloader program looks for a configuration file named for the hexadecimal form of the IP address in /tftpboot/linux-install/pxelinux.cfg.
  4. Based on the configuration file, the installer program then uses an “answer” file to tailor the install.
    default RHEL5
    
    label RHEL5
        kernel RHEL5/vmlinux-rhel5-as_64
        append initrd=RHEL5/initrd-rhel5-as_64.img ramdisk_size=8192 ks=nfs:192.168.0.1:/export/kickstart/cadi-pe860.cfg ksdevice=eth0
    
    This specifies the boot images, the kernel, and the location of the kickstart answer file. The install media ISOs are located in /export.
  5. The answer file is located in /export/kickstart and takes care of things such as the root password, graphics settings, partitioning information, and install keys.
  6. The last thing the answer file does is call a finish script. The script and all needed files are found in /export/local_tree. The finish script is a shell script that is easily customizable. It presets accounts, firewall settings, ssh keys, and groups, and registers the machines on the Red Hat Network.
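
Steps 3 and 4 can be sketched as a small helper that writes the per-client configuration file under its hexadecimal name. This is an illustration only: the function name write_pxe_config and its arguments are assumptions, and the stanza is copied from the example in step 4 with only the answer file name made variable.

```shell
# Hypothetical helper: write a pxelinux configuration file for one client.
#   $1 = config directory (e.g. /tftpboot/linux-install/pxelinux.cfg)
#   $2 = client IP address
#   $3 = kickstart answer file name under /export/kickstart
write_pxe_config() {
    cfgdir=$1 ip=$2 ksfile=$3
    # The file name is the client IP as eight uppercase hex digits.
    hexname=$(printf '%02X%02X%02X%02X' $(printf '%s' "$ip" | tr '.' ' '))
    cat > "$cfgdir/$hexname" <<EOF
default RHEL5

label RHEL5
    kernel RHEL5/vmlinux-rhel5-as_64
    append initrd=RHEL5/initrd-rhel5-as_64.img ramdisk_size=8192 ks=nfs:192.168.0.1:/export/kickstart/$ksfile ksdevice=eth0
EOF
    # Report which file was written.
    printf '%s\n' "$cfgdir/$hexname"
}
```

For example, write_pxe_config /tftpboot/linux-install/pxelinux.cfg 192.168.0.11 cadi-pe860.cfg would write a file named C0A8000B in that directory.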

Cluster System Administration

acl-primary is the central place where accounts, passwords, groups, hosts, and system administrator settings should be changed. Changes should be made using the standard tools such as vipw (for accounts and passwords), visudo (for changes to sudoers), and vi (for changes to the groups and hosts files). Every hour, the /var/local/adm/update_nis script checks whether these files have changed and, if so, copies them with scp to all of the cluster and storage machines.
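
The change check could work along the following lines. This is not the actual update_nis script, whose contents are not shown here; it is a hedged sketch of one possible approach, with a hypothetical function name (file_changed) and a hypothetical directory for the recorded checksums.

```shell
# Hypothetical sketch: succeed if FILE differs from the checksum recorded on
# the previous run, and record the new checksum in STAMPDIR.
#   $1 = file to check, $2 = directory holding the recorded checksums
file_changed() {
    file=$1 stampdir=$2
    stamp="$stampdir/$(basename "$file").cksum"
    new=$(cksum < "$file")
    old=$(cat "$stamp" 2>/dev/null || true)
    if [ "$new" = "$old" ]; then
        return 1                      # unchanged since the last run
    fi
    printf '%s\n' "$new" > "$stamp"   # remember the new state
}

# Example use (commented out so nothing is copied by accident; the stamp
# directory and the node list are assumptions):
# for f in /etc/passwd /etc/group /etc/hosts /etc/sudoers; do
#     file_changed "$f" /var/local/adm/stamps || continue
#     for node in acl-storage acl-cadi-xeon-1; do
#         scp -p "$f" "root@$node:$f"
#     done
# done
```

Comparing checksums rather than timestamps means a file that is rewritten with identical contents is not pushed again.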

Batch system (torque) head node

See Torque page.

Scratch space storage

acl-primary exports a filesystem called /local, which is mounted by all machines in the cluster. This space is 320 GB. It is a Linux logical volume spread across two physical drives (part of one 250 GB drive, and a second 250 GB drive). This space should be treated as scratch space and as a way to distribute any needed applications.

Distributed software

Several scripts run changes from acl-primary against all the nodes in the cluster. These scripts run updates on every node, reboot, power down, and push down configuration changes. Find the distributed scripts in:

acl-primary:/var/local/adm/distributed_scripts/
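
The general pattern behind such scripts is a loop over the node names. The sketch below is not one of the existing scripts; it assumes passwordless ssh as root from acl-primary, covers only the compute nodes listed at the top of this page, and uses the hypothetical function names cluster_nodes and run_on_all.

```shell
# Print the compute node names, one per line, following the naming scheme
# acl-cadi-pentd-[1-8] and acl-cadi-xeon-[1-9].
cluster_nodes() {
    i=1; while [ "$i" -le 8 ]; do echo "acl-cadi-pentd-$i"; i=$((i + 1)); done
    i=1; while [ "$i" -le 9 ]; do echo "acl-cadi-xeon-$i"; i=$((i + 1)); done
}

# Hypothetical helper: run one command on every compute node over ssh,
# reporting failures without stopping the loop.
run_on_all() {
    for node in $(cluster_nodes); do
        # -n keeps ssh from reading stdin; BatchMode avoids password prompts.
        ssh -n -o BatchMode=yes "root@$node" "$@" || echo "FAILED: $node" >&2
    done
}
```

For example, run_on_all uptime would print the load on each node and name any node that could not be reached.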

When locally installing software on all nodes, model the new install script on these existing ksh scripts:

acl-primary:/var/local/adm/distributed_scripts/blcr/install_blcr
acl-primary:/var/local/adm/distributed_scripts/blcr/install_torque_client

References

  1. https://wiki.cse.buffalo.edu/services/content/berkeley-lab-checkpointres...
  2. https://wiki.cse.buffalo.edu/services/content/torque
  3. http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/en-US/RHEL51...
  4. http://www.linuxtopia.org/online_books/rhel5/rhel5_administration/rhel5_...
  5. http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5.2/html/Insta...
  6. http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5.2/html/Deplo...
  7. http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5.2/html/Insta....
  8. http://en-US/Red_Hat_Enterprise_Linux/5.2/html/Deployment_Guide/index.html