This is mostly common sense for any SAN, but it helps to have it written down somewhere.
Symptoms:
Databases down or slow.
Server response slow
Server down or showing SCSI errors
HBA errors and timeouts, with event log entries such as:
Cluster disk resource Disk M:: is corrupt. Running ChkDsk /F to repair problems.
The file system structure on the disk is corrupt and unusable. Please run the chkdsk utility on the volume D:.
The device, \Device\Scsi\ql23001, did not respond within the timeout period.
An error was detected on device \Device\Harddisk1\DR1 during a paging operation.
Cluster resource Disk X: timed out. If the pending timeout is too short for this resource, then you should consider increasing the pending timeout value.
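
As a rough aid for spotting these symptoms, the sketch below scans lines of an exported event log (plain text) for the messages listed above. The pattern names and the functions `classify_line` and `summarize` are made up for illustration, and the patterns are matched loosely because drive letters and device names will differ per server. Treat it as a starting point, not a complete detector.

```python
import re

# Loose patterns derived from the example messages above.
# The keys are hypothetical labels chosen for this sketch.
SYMPTOM_PATTERNS = {
    "cluster_disk_corrupt": re.compile(r"Cluster disk resource .* is corrupt", re.I),
    "filesystem_corrupt": re.compile(r"file system structure on the disk is corrupt", re.I),
    "device_timeout": re.compile(r"did not respond within the timeout period", re.I),
    "paging_error": re.compile(r"error was detected on device .* during a paging operation", re.I),
    "cluster_resource_timeout": re.compile(r"Cluster resource .* timed out", re.I),
}


def classify_line(line):
    """Return the labels of every symptom pattern found in one log line."""
    return [name for name, pattern in SYMPTOM_PATTERNS.items() if pattern.search(line)]


def summarize(lines):
    """Count how many lines match each symptom pattern."""
    counts = {name: 0 for name in SYMPTOM_PATTERNS}
    for line in lines:
        for name in classify_line(line):
            counts[name] += 1
    return counts
```

Feeding `summarize` the lines of a saved log gives a quick per-symptom count. Several device timeouts alongside cluster disk errors on more than one server would point at the SAN rather than at a single host.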
Recovery
- Back up any data that can be accessed
- Shut down all servers that attach to the SAN
- Idle the switch
- Log into the SAN switch and back up the switch config; ensure the config settings have been saved from memory
- Save any log info to a text file (particularly any errors)
- Save Zone configs (critical)
- Ensure SAN switch traffic is minimal, if any (shutting down the servers attached to the switch should accomplish this)
- Shut down SAN switch
- Log into SAN Disk Array
- Ensure no sync jobs, array rebuilds, etc. are active (unless the jobs are hung)
i. Note: Large SAN jobs can take hours to days to perform and are SAN memory intensive, but usually only slow down the SAN. These could be the cause. Find the knucklehead who started the jobs…
- Save the log file and array configs
- Restart the SAN (from the SAN management); never power-cycle unless the SAN management is unresponsive (power-cycling can cause data loss, especially if the controller battery has failed)
- If the array is not visible after restart, you may have to “import” the array (an array sync may occur after restart)
- Restart the SAN switch
- Restart the affected servers
- Run chkdsk as necessary
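
The order of the steps above matters: servers go down before the switch, and on the way back up the array comes first, then the switch, then the servers. The sketch below is one possible way to keep that order honest during an outage. The `Checklist` class and the step wording are illustrative only (a condensed form of the list above), and nothing here talks to real hardware.

```python
# Condensed version of the recovery steps, in the order given above.
RECOVERY_STEPS = [
    "Back up any data that can be accessed",
    "Shut down all servers attached to the SAN",
    "Save switch config, logs, and zone configs",
    "Shut down the SAN switch",
    "Check the array for active jobs; save array logs and configs",
    "Restart the SAN from SAN management",
    "Restart the SAN switch",
    "Restart the affected servers",
    "Run chkdsk as necessary",
]


class Checklist:
    """Tracks progress and refuses to mark steps done out of order."""

    def __init__(self, steps):
        self.steps = list(steps)
        self.done = 0  # number of steps completed so far

    def next_step(self):
        """Return the next pending step, or None when everything is done."""
        return self.steps[self.done] if self.done < len(self.steps) else None

    def complete(self, step):
        """Mark `step` done; raise ValueError if it is not the next one."""
        expected = self.next_step()
        if step != expected:
            raise ValueError(f"Out of order: expected {expected!r}, got {step!r}")
        self.done += 1
```

For example, after completing only the first step, calling `complete("Restart the SAN switch")` raises `ValueError`, a reminder that the servers and the switch have to be shut down and the array restarted first.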