Sun SBus Fibre Channel Notes:

Below is a core dump from fixing scully after a Fibre Channel disk went crazy. Since then I've had pretty good luck by running:

fsck -o f

on each of the partitions, then mounting them.
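
With many partitions, the same two steps can be wrapped in a small helper. This is only a sketch, not the exact commands used on scully: the names run, fsck_and_mount and DRYRUN are made up for illustration, it uses fsck -o f (the UFS option that forces a check even when the file system is marked clean), and it assumes the mount point has an entry in /etc/vfstab so that mount can find the device. By default it only prints the commands; set DRYRUN=0 to actually run them.

```shell
#!/bin/sh
# Sketch only: run, fsck_and_mount and DRYRUN are illustrative names.
DRYRUN=${DRYRUN:-1}

# run CMD ARGS...: print the command; execute it only when DRYRUN=0.
run() {
    echo "+ $*"
    if [ "$DRYRUN" = "0" ]; then
        "$@"
    fi
}

# fsck_and_mount RAWDEV MOUNTPOINT
# Force-check one partition, then mount it only if fsck succeeded.
fsck_and_mount() {
    run fsck -o f "$1" && run mount "$2"
}

# Example (the one device/mount point pair named in these notes):
# fsck_and_mount /dev/rdsk/c1t22d0s6 /export/data183
```

Because the mount is chained with &&, a partition that fails its check is left unmounted for a closer look.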

Should the need arise, there are two spare SBus FC cards: a new one in the box on the shelf at the north end of the computer room below the tapes, and the other in crux.

 

Let me take this opportunity to redocument scully and go over what I've done

on this problem.

When I came in Wed. morning, I hooked the vt420 up to the serial A port on

the back of scully using a DB25 gender changer and a DB25 to RJwhatever

plug. I couldn't even get a login prompt so I hit the F5 key to get to the

ok prompt and did a

sync

to reboot. scully came up fine and I logged in expecting to find the /

partition full. This was not the case so I started scrolling through

/var/adm/messages. Before I could get very far scully locked up again, so I

did F5 and this time I powered down scully and the bottom three racks of

fibre channel disks that scully serves. Note that each rack has TWO power

switches and both must be turned off. After the power had been off for a

minute I started powering on the disks. I started at the bottom, hit both

power switches and waited about 15 seconds before moving on to the next.

When all the disks had spun up (the lights on the front of the rack stopped

flashing) I powered on scully and it came up fine. When I logged in the

first thing I did was run format to check if all the disks showed up, and

they did. Then I was going through /var/adm/messages when it locked up yet

again. This time I noticed two things: first, there were no obvious errors

in /var/adm/messages, and second, the access light on one of the disks (second from

the left, third rack up) was going crazy. I did F5-sync again and this time

hit F5 just as it started booting to get to the ok prompt. This time I

did a

boot -s

to boot in single user mode. In single user mode the machine was stable so I

was able to go through /var/adm/messages carefully. I found no panics except

those associated with the F5-sync and no disk errors, except for a single,

nonfatal SCSI transport error. So I commented out all mounts of the fibre

channel drives from /etc/vfstab (they are /dev/dsk/c1t*) and did ^D to boot

multiuser. After I was sure scully was stable in multiuser I started

mounting the disks one at a time. First I uncommented the

/export/dataNNN line in vfstab. Since the mount points are all shared

(exported) I needed to do an

unshare /export/dataNNN

then

mount /export/dataNNN

then I executed the appropriate line in /etc/dfs/dfstab to reshare the

directory. I mounted all the disks in the lower two racks and the first and

third in the upper rack (skipping the suspicious drive) and scully was still

fine so I got greedy. I assumed the suspicious drive was really the problem

so I uncommented the lines for the remaining drives and rebooted,

expecting scully to come back up with all drives but the suspicious one

mounted. This did not happen, at least right away. scully wouldn't boot,

even into single-user, because the vfstab had been corrupted. When this

happens the thing to do is boot off the OS CD, fsck then mount the root

partition on some generic mount point e.g. /mnt, and fix the offending file.

The problem is scully doesn't have an internal CD drive so it must be booted

off colossus (which is how I initially installed the OS). The procedure is

as follows:

put the Solaris OS CD in the CDROM drive of colossus

do:

ls /cdrom/sol_7_599_sparc_sun_srvr/s0

to make sure the CD is properly spun up. If there aren't any files there

eject the CD and reinsert it (this happens sometimes).

Then share the CD. There is a line, commented out, in /etc/dfs/dfstab to do

that:

#share -F nfs -o ro,anon=0 /cdrom/sol_7_599_sparc_sun_srvr/s0

Then at the ok prompt on scully do a:

boot net

you should get 3 to 5 lines like this:

Timeout waiting for ARP/RARP packet...

then it should start to boot. This is imperfect technology however so

sometimes you need to do it more than once, after rebooting colossus. There

is lots of info on this in the Solaris Advanced Installation Guide that

comes with each copy of the OS.
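
The colossus side of the procedure above (check that the CD is really mounted, then share it) can be sketched as a pair of shell functions. This is an illustration under assumptions, not a tested install script: cd_ready, share_cd and DRYRUN are made-up names, while the CD path and the share options are the ones quoted above. With DRYRUN=1 (the default) the share command is only printed.

```shell
#!/bin/sh
# Sketch only: cd_ready, share_cd and DRYRUN are illustrative names.
DRYRUN=${DRYRUN:-1}
CD=${CD:-/cdrom/sol_7_599_sparc_sun_srvr/s0}   # path quoted in the notes

# cd_ready DIR: succeed if DIR exists and ls shows at least one entry.
cd_ready() {
    [ -d "$1" ] || return 1
    entries=`ls "$1" 2>/dev/null`
    [ -n "$entries" ]
}

# share_cd DIR: share DIR read-only over NFS if the CD looks mounted.
share_cd() {
    if cd_ready "$1"; then
        echo "+ share -F nfs -o ro,anon=0 $1"
        if [ "$DRYRUN" = "0" ]; then
            share -F nfs -o ro,anon=0 "$1"
        fi
    else
        echo "$1 looks empty; eject and reinsert the CD" >&2
        return 1
    fi
}

# Example, run as root on colossus:
# share_cd "$CD"
```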

Once it boots, the OS install starts. I've found you need to answer the

first couple questions then do a bunch of ^C's and you can get to a sh

prompt. Once I did that it was simply a matter of:

fsck /dev/rdsk/c0t0d0s0

mount /dev/dsk/c0t0d0s0 /mnt

cd /mnt/etc

cp vfstab_nofcal vfstab

reboot

and scully came up multiuser with no fibre channel drives mounted. I quickly

mounted the 18 drives that worked earlier and scully remained stable. Now I

figured it was not the "suspicious" drive causing the problems so I mounted

it and things remained good. Then I used format to test the remaining drives

before I mounted them. Unfortunately, all the drives passed the format =>

analyze => read test, but when I mounted /export/data183 scully froze soon

after, so I now think that is the problem. As of now all the drives on

scully except data183 (c1t22d0s6) are mounted. And they all lived happily ever after.

The End.
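
The vfstab_nofcal file used in the recovery is simply a vfstab with the Fibre Channel mounts commented out, so it is worth keeping one around before trouble starts. A minimal sketch of one way to generate it, assuming (as noted above) that every Fibre Channel entry starts with /dev/dsk/c1t; the function name comment_out_fc is made up for illustration:

```shell
#!/bin/sh
# Sketch only: comment_out_fc is an illustrative name.
# Reads a vfstab on stdin and writes it to stdout, prefixing with '#'
# every line that starts with /dev/dsk/c1t (the Fibre Channel drives).
# All other lines pass through unchanged.
comment_out_fc() {
    sed 's|^/dev/dsk/c1t|#&|'
}

# Example: write to a separate file rather than editing vfstab in place.
# comment_out_fc < /etc/vfstab > /etc/vfstab_nofcal
```

Writing to a separate file leaves the live vfstab untouched, so a bad edit cannot keep the machine from booting.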

---------------------------------------------------------------

James Wilson | phone: (805) 893-7366

Systems Group | FAX: (805) 893-2578

ICESS | email: jwilson@icess.ucsb.edu

University of California |

Santa Barbara, CA 93106-3060 |