UPS Setup
Soon to come.
Raid Setup
The disk setup consists of 14 drives (0-13) and one hot spare (14). Two cold spares are kept in the server rack. The 14 drives are configured as a RAID5 device with round-robin (distributed) parity, which provides redundancy against a single disk failure. Builds take roughly 3 hours.
Upon failure, a hot spare will step in right away and begin rebuilding. See "Drive Failures" for more details.
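As a rough capacity check (a back-of-the-envelope sketch only; the 500 GB figure is borrowed from the console example further below and the actual drive size may differ), round-robin parity in RAID5 costs one drive's worth of space:

# Usable space of an N-drive RAID5 is (N - 1) drives; one drive's worth
# of capacity is consumed by the distributed parity.
MEMBERS=14        # drives 0-13; the hot spare (14) is not counted
DRIVE_GB=500      # assumed drive size, taken from the example output below
echo "usable capacity: $(( (MEMBERS - 1) * DRIVE_GB )) GB"   # prints 6500 GB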
Drive Failures
A failure-checking script is located at /var/local/adm/cec-chk-coraid.sh. Every half hour, this script checks the output of console commands against known-good values. If a deviation is detected, the script emails the cse-cadi-admin mail alias.
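A minimal sketch of the kind of check this script performs is shown below. It is an illustration only, not the contents of cec-chk-coraid.sh: the get_shelf_status helper stands in for however the real script reads the shelf console, and the known-good snapshot path is an assumption.

#!/bin/sh
# Illustrative half-hourly check (run from cron, e.g. */30 * * * *).
# get_shelf_status is a placeholder for the real console query; the
# known-good file path below is made up for this sketch.
KNOWN_GOOD=/var/local/adm/coraid-known-good.txt
ALERT_ALIAS=cse-cadi-admin
CURRENT=$(mktemp) || exit 1

get_shelf_status > "$CURRENT"          # capture current console status output

# Any deviation from the known-good snapshot triggers a mail to the alias.
if ! diff -u "$KNOWN_GOOD" "$CURRENT" > "$CURRENT.diff"; then
    mail -s "coraid status deviation" "$ALERT_ALIAS" < "$CURRENT.diff"
fi
rm -f "$CURRENT" "$CURRENT.diff"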
After a drive has been rebuilt, the system should be halted at the earliest convenient time, as these drives are not hot-swappable. Once the system has been halted, it can be safely shut down by holding the power button. Slide out the faulted drive, mount a new cold spare in its place, and re-insert it into the chassis. The newly inserted drive can then be designated as the new hot spare using the spare command. Note that once a hot spare has been used, it is no longer considered a stand-in: it remains part of the lblade even after a new drive is slid in. If it is desired that the hot spare be kept as disk 14, the fail command can be used; this simulates another disk failure and forces a rebuild onto the newly inserted drive (after that drive has itself been made a hot spare). Alternatively, the replace command can be used to force a rebuild onto the newly inserted drive directly:
replace lun.part.drive shelf.slot
Example:
SR shelf 1> list l
0 500.108GB offline
0.0 500.108GB raid1 degraded
0.0.0 normal 500.108GB 1.0
0.0.1 failed 500.108GB 1.2
SR shelf 1> replace 0.0.1 1.1
SR shelf 1> list l
0 500.108GB offline
0.0 500.108GB raid1 recovering,degraded 0.05%
0.0.0 normal 500.108GB 1.0
0.0.1 replaced 500.108GB 1.1
SR shelf 1> fail 0.0.1
SR shelf 1> list l
0 500.108GB offline
0.0 500.108GB raid1 degraded
0.0.0 normal 500.108GB 1.0
0.0.1 failed 500.108GB 1.1
SR shelf 1> replace 0.0.1 1.1
SR shelf 1> list l
0 500.108GB offline
0.0 500.108GB raid1 recovering,degraded 0.14%
0.0.0 normal 500.108GB 1.0
0.0.1 replaced 500.108GB 1.1
SR shelf 1>
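If the goal were instead to return the freshly inserted drive to hot-spare duty, the spare command mentioned above would be used; the exact syntax (spare shelf.slot) and the slot number below are assumptions for illustration, matching the shelf.slot form used by replace:

SR shelf 1> spare 1.2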
Backups
Currently, both acl-storage and acl-primary are backed up in full using CIT's TSM tape backup system and facilities. Documentation, installation and restore instructions, SLAs, and FAQs can all be found at https://ubfs.buffalo.edu/ubfs/collab/cit/tks_public/TSM/index.shtml.
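For quick reference, the standard TSM client commands for checking and restoring backed-up files look roughly like the following (assuming the dsmc client is installed and configured on acl-storage and acl-primary; the paths shown are placeholders, not actual filesystem layouts):

# Run a manual incremental backup of the configured filesystems.
dsmc incremental

# List backed-up versions under a directory tree (path is a placeholder).
dsmc query backup "/export/home/*" -subdir=yes

# Restore a single file to its original location (path is a placeholder).
dsmc restore "/export/home/someuser/somefile"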