Friday, October 14, 2005

Problem replacing disk in StorEdge T3

At work we have a T3 where all disks are configured for RAID5. One of
the disks has failed, which means that accessing the data on the T3 is
really slow.

When I inserted the replacement disk, it seemed to be put into use
automatically (proc list showed some progress), but then it failed with
a 0D status (see vol stat, fru stat etc. below).

I noticed that the disk is not exactly the same model as the others;
could this be the reason? It is a proper replacement disk bought from
Sun with the proper bracket and everything, so it should work, shouldn't
it? What can I do
to fix this?

--
- Erlend Leganger

T300 Release 1.17b 2001/05/31 17:47:22
Copyright (C) 1997-2001 Sun Microsystems, Inc.
All Rights Reserved.

bigdaddy:/:<1>vol stat

v0            u1d1   u1d2   u1d3   u1d4   u1d5   u1d6   u1d7   u1d8   u1d9
mounted       0D     0      0      0      0      0      0      0      0

bigdaddy:/:<2>fru list
ID TYPE VENDOR MODEL REVISION SERIAL
------ ----------------- ----------- ----------- -------- --------
u1ctr controller card SLR-MI 375-0084-02- 0210 022813
u1d1 disk drive SEAGATE ST336605FSUN A338 3FP0H63D
u1d2 disk drive SEAGATE ST336704FSUN A42D 3CD0VFBL
u1d3 disk drive SEAGATE ST336704FSUN A42D 3CD0T89W
u1d4 disk drive SEAGATE ST336704FSUN A42D 3CD0VCZ4
u1d5 disk drive SEAGATE ST336704FSUN A42D 3CD0VF5L
u1d6 disk drive SEAGATE ST336704FSUN A42D 3CD0TG33
u1d7 disk drive SEAGATE ST336704FSUN A42D 3CD0TT8G
u1d8 disk drive SEAGATE ST336704FSUN A42D 3CD0VD4T
u1d9 disk drive SEAGATE ST336704FSUN A42D 3CD0TXQF
u1l1 loop card SLR-MI 375-0085-01- 5.02 Flash 033179
u1l2 loop card SLR-MI 375-0085-01- 5.02 Flash 030038
u1pcu1 power/cooling unit TECTROL-CAN 300-1454-01( 0000 028800
u1pcu2 power/cooling unit TECTROL-CAN 300-1454-01( 0000 028799
u1mpn mid plane SLR-MI 370-3990-01- 0000 021282
bigdaddy:/:<3>fru stat
CTLR STATUS STATE ROLE PARTNER TEMP
------ ------- ---------- ---------- ------- ----
u1ctr ready enabled master - 30.5

DISK STATUS STATE ROLE PORT1 PORT2 TEMP VOLUME
------ ------- ---------- ---------- --------- --------- ---- ------
u1d1 ready disabled data disk ready ready 30 v0
u1d2 ready enabled data disk ready ready 33 v0
u1d3 ready enabled data disk ready ready 34 v0
u1d4 ready enabled data disk ready ready 32 v0
u1d5 ready enabled data disk ready ready 33 v0
u1d6 ready enabled data disk ready ready 33 v0
u1d7 ready enabled data disk ready ready 36 v0
u1d8 ready enabled data disk ready ready 32 v0
u1d9 ready enabled data disk ready ready 32 v0

LOOP STATUS STATE MODE CABLE1 CABLE2 TEMP
------ ------- ---------- ------- --------- --------- ----
u1l1 ready enabled master - - 27.0
u1l2 ready enabled slave - - 27.5

POWER   STATUS  STATE     SOURCE OUTPUT BATTERY TEMP   FAN1   FAN2
------  ------- --------- ------ ------ ------- ------ ------ ------
u1pcu1  ready   enabled   line   normal fault   normal normal normal
u1pcu2  ready   enabled   line   normal fault   normal normal normal
bigdaddy:/:<4>exit
Connection closed by foreign host.
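
For anyone who wants to scan this kind of output automatically, here is a minimal Python sketch. It assumes the whitespace-separated column order seen in the `fru stat` output above (state in the third column, battery in the sixth column of the power rows); the function name `find_problems` and the sample constant are made up for illustration and are not part of any Sun tool.

```python
import re

# Sample rows copied from the `fru stat` output above (abbreviated).
FRU_STAT = """\
u1d1 ready disabled data disk ready ready 30 v0
u1d2 ready enabled data disk ready ready 33 v0
u1pcu1 ready enabled line normal fault normal normal normal
u1pcu2 ready enabled line normal fault normal normal normal
"""

DISK_ID = re.compile(r"^u\d+d\d+$")    # e.g. u1d1
PCU_ID = re.compile(r"^u\d+pcu\d+$")   # e.g. u1pcu1


def find_problems(text):
    """Return (fru_id, reason) pairs for disks that are not enabled
    and power/cooling units whose battery column is not 'normal'."""
    problems = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # blank line, header, or separator
        fru, state = fields[0], fields[2]
        if DISK_ID.match(fru) and state != "enabled":
            problems.append((fru, "state " + state))
        elif PCU_ID.match(fru) and len(fields) > 5 and fields[5] != "normal":
            # Assumed column order: id, status, state, source, output, battery
            problems.append((fru, "battery " + fields[5]))
    return problems


for fru, reason in find_problems(FRU_STAT):
    print(fru, reason)
```

On the sample rows this flags u1d1 (disabled) and both PCU batteries (fault). It only reports what `fru stat` itself shows, so it would not catch a problem that appears only in the syslog.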

Reply

> to fix this?

I'd say, complain to Sun.
Searching Google, I found the documentation from Seagate. Among other
things, it lists this:
ST336605: 29,549 cyl / 4 heads / 71,687,371 data blocks
ST336704: 14,100 cyl / 12 heads / 71,687,369 data blocks
I don't know whether these differences are a problem in this case. Sun
should be able to tell...
Maybe the issue can be fixed with a firmware update on the new drive (or
on all the old ones)?
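
To put the Seagate numbers in perspective, here is a small Python sketch comparing the two data block counts quoted above. The only inputs are those two figures; the assumption that a replacement drive needs at least as many data blocks as the drive it replaces is general RAID reasoning, not something confirmed for the T3.

```python
# Data block counts as quoted from the Seagate documentation above.
BLOCKS = {
    "ST336605": 71_687_371,  # replacement drive (u1d1)
    "ST336704": 71_687_369,  # original drives (u1d2..u1d9)
}


def block_difference(replacement, original):
    """Return how many more data blocks the replacement has than the original."""
    return BLOCKS[replacement] - BLOCKS[original]


diff = block_difference("ST336605", "ST336704")
print("replacement minus original:", diff, "blocks")
print("replacement at least as large:", diff >= 0)
```

The replacement comes out two blocks larger, so capacity alone does not look like the obstacle here. The different cylinder and head geometry is a separate question that this does not answer.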

You need to take a look at the syslog file right after the rebuild
fails. There should be more information in there. I have had this
happen before where the rebuild fails because of a read error on
another disk...

Reply

1) Your boot firmware is very old.
2) Your disk firmware is way out of date.
3) Both the batteries in your PCUs are expired.

The latest boot firmware is 1.18.04 and you're at 1.17b.
That's at least 3 years out-of-date!

The latest disk firmware for the ST336605FSUN is A838
The latest disk firmware for the ST336704FSUN is AE26
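
As a rough cross-check, the installed revisions from the `fru list` output in the original post can be compared against the two latest revisions named above. This is only a Python sketch: the names `LATEST`, `INSTALLED` and `outdated` are made up for illustration, and the comparison is a plain inequality because the ordering scheme of the revision strings is not known here.

```python
# Latest disk firmware revisions as stated in the reply above.
LATEST = {"ST336605FSUN": "A838", "ST336704FSUN": "AE26"}

# (id, model, revision) taken from the `fru list` output in the original post.
INSTALLED = [("u1d1", "ST336605FSUN", "A338")] + [
    ("u1d%d" % n, "ST336704FSUN", "A42D") for n in range(2, 10)
]


def outdated(installed, latest):
    """Return the drives whose revision differs from the latest known one."""
    return [
        (fru, model, rev, latest[model])
        for fru, model, rev in installed
        if model in latest and rev != latest[model]
    ]


for fru, model, rev, wanted in outdated(INSTALLED, LATEST):
    print("%s %s: installed %s, latest %s" % (fru, model, rev, wanted))
```

With the data from this thread, all nine drives are reported as differing from the latest revision.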

If you're lucky you'll be able to recover. The 'proc list' command will show
whether the new disk is being reconstructed. Otherwise, hopefully you have a
way to back up the data. If so, you can get the batteries replaced, upgrade
all the firmware and reinitialize the volume and restore the data.

Reply

> another disk...

Thanks for the tip. I have now learnt that the disk should be OK, so I
will try this again tomorrow and watch the syslog as you suggest. I will
be back with the result.

--
- Erlend Leganger

Reply

> all the firmware and reinitialize the volume and restore the data.

I guess this is what happens when you have a device that works OK, you
just forget about it... The batteries have been replaced though, we had
ordered them in.

I was able to copy the data from the T3 to other disk areas on the
server, so I'm OK with the files (I also have a backup on tape made
before it failed). I haven't RTFM yet, but are there any tips I should
be aware of when upgrading boot and disk firmware? What to do first?
Where do I get hold of the firmware updates?

> ordered them in.

You have to do more than just replace the batteries or the T3 won't know
anything has changed. Commands need to be run to reset the dates back to
zero so the errors will go away.

This InfoDoc should explain the procedures:

http://www.sunshack.org/data/sh/2.1/infoserver.central/data/syshbk/co...

Also, per Sun, the batteries should now last 3 years instead of 2 years.

In the same patch you would use to upgrade the boot and disk firmware:

http://sunsolve.sun.com/pub-cgi/pdownload.pl?target=109115-17&method=h

there is a T3extender program that will run commands to set the battery
expiration life to 36 months instead of 24 months.

> another disk...

You were 100% correct. The warning light was lit on disk u1d1, so this
disk was replaced and a rebuild was attempted. The rebuild failed after a
while, with a note of multiple disk errors in the syslog; it seems that
u1d4 has a problem as well. I was fooled by vol stat only showing an error
on u1d1. I will check the syslog more carefully in the future.

--
- Erlend Leganger

> http://sunsolve.sun.com/pub-cgi/pdownload.pl?target=109115-17&method=h

Excellent, thank you. I need to wait for my second replacement disk, but
after reading up on the patch installation method, it doesn't seem too
difficult to do.

> there is a T3extender program that will run commands to set the battery
> expiration life to 36 months instead of 24 months.

I had a look at the T3extender program code and decided that using
this patch is extreme overkill (creating a long perl script and even
including perl itself in the patch) for such a small job: I only ran two
".id write blife <pcu> 36" commands, which seem to do the trick (see
below). Of course, if you have a room full of racks fully populated with
T3s, the script would be handy...

--
- Erlend Leganger

bigdaddy:/:<48>
bigdaddy:/:<48>id read u1pcu1
Revision : 0000
Manufacture Week : 00442000
Battery Install Week : 00412005
Battery Life Used : 0 days, 2 hours
Battery Life Span : 730 days, 12 hours
Serial Number : 028800
Battery Warranty Date: 20051010082149
Battery Internal Flag: 0x00000000
Vendor ID : TECTROL-CAN
Model ID : 300-1454-01(50)
bigdaddy:/:<49>
bigdaddy:/:<49>id read u1pcu2
Revision : 0000
Manufacture Week : 00442000
Battery Install Week : 00412005
Battery Life Used : 0 days, 2 hours
Battery Life Span : 730 days, 12 hours
Serial Number : 028799
Battery Warranty Date: 20051010082152
Battery Internal Flag: 0x00000000
Vendor ID : TECTROL-CAN
Model ID : 300-1454-01(50)
bigdaddy:/:<50>
bigdaddy:/:<50>
bigdaddy:/:<50>.id write blife u1pcu1 36
bigdaddy:/:<51>.id write blife u1pcu2 36
bigdaddy:/:<52>
bigdaddy:/:<52>
bigdaddy:/:<52>id read u1pcu1
Revision : 0000
Manufacture Week : 00442000
Battery Install Week : 00412005
Battery Life Used : 0 days, 2 hours
Battery Life Span : 1095 days, 18 hours
Serial Number : 028800
Battery Warranty Date: 20051010082149
Battery Internal Flag: 0x00000000
Vendor ID : TECTROL-CAN
Model ID : 300-1454-01(50)
bigdaddy:/:<53>
bigdaddy:/:<53>
bigdaddy:/:<53>id read u1pcu2
Revision : 0000
Manufacture Week : 00442000
Battery Install Week : 00412005
Battery Life Used : 0 days, 2 hours
Battery Life Span : 1095 days, 18 hours
Serial Number : 028799
Battery Warranty Date: 20051010082152
Battery Internal Flag: 0x00000000
Vendor ID : TECTROL-CAN
Model ID : 300-1454-01(50)
bigdaddy:/:<54>
bigdaddy:/:<54>
bigdaddy:/:<54>
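
A note on the numbers in the transcript above: the reported Battery Life Span values are consistent with the T3 converting months to days using an average month of 365.25/12 days. That is an inference from the output shown here, not documented behaviour. A small Python check of the arithmetic:

```python
HOURS_PER_YEAR = 8766  # 365.25 days * 24 hours (assumed average year)


def life_span(months):
    """Convert a battery life in months to (days, hours),
    assuming an average month of 365.25 / 12 days."""
    total_hours = months * HOURS_PER_YEAR // 12
    return divmod(total_hours, 24)


for months in (24, 36):
    days, hours = life_span(months)
    print("%d months -> %d days, %d hours" % (months, days, hours))
```

This gives 730 days, 12 hours for 24 months and 1095 days, 18 hours for 36 months, matching the values before and after the two ".id write blife" commands.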
