Summary
This document outlines IBM recommendations on obtaining high availability of
an IBM server in a RAID environment. The IBM SCSI-2 Fast/Wide
PCI-Bus RAID Adapter and the IBM Fast/Wide Streaming Adapter/A
are covered.
NOTE: This document is intended for use by system administrators.
Date: December 1997
Preventive Measures to Help Obtain High Availability
IBM recommends the following precautions in order to help obtain high
availability of the RAID subsystem:
- Define a Hot Spare
- Install NetFinity Manager
- Data Scrub Drives Weekly
- Apply All Updates
- Install and Use RAID Administration and Monitoring Utilities
- Ensure Current Backup of RAID Configuration is Available
- Have a RAID Configuration Utility Diskette Available
Define a Hot Spare
Defining a hot
spare drive minimizes the length of time a server operates
with degraded performance when a defunct drive occurs. The hot
spare also allows the "inconsistent" drive to be easily
recognized in the event of a multiple defunct drive failure
such that recovery procedures require much less technical
expertise. The section below explains this advantage in
greater detail:
Hot Spare Advantages
When a system has a drive that
becomes defunct, data is not written to this DDD drive, but
data is written to the other drives in the array. Therefore
that DDD becomes "inconsistent" with the rest of the drives in
the array. When multiple drives appear DDD, the first and most
critical task is correctly identifying the "inconsistent" drive.
The "inconsistent" drive must be the last drive replaced since
it requires rebuilding (and, if truly defective, may need
physical replacement). If the "inconsistent" drive is software
replaced (see Software Replace vs. Physical Replace) first
when a multiple DDD failure occurs, the "inconsistent" data
will be used to rebuild another drive. This eventually
corrupts the other drives (and data) on the
system. However, when an HSP is defined, you are protected
from rebuilding another drive from an "inconsistent" drive.
This is because of the way the RAID adapter marks the states
of drives. When a system has a defined HSP, as soon as the HSP
takes over for the DDD drive, the RAID Adapter marks the DDD
drive in its configuration as the HSP drive. The adapter does
not visually change the status of the drive to HSP. Yet if you
perform a software replace or physical replace, the RAID
Adapter starts the drive and changes the DDD state to HSP. The
RAID Adapter does not allow this drive to be brought back to
ONL status.
When the HSP takes over for the DDD drive,
the HSP is rebuilt to replace the DDD drive. During the
rebuilding of the HSP drive, it appears in the OFL state. The
OFL state changes to ONL once this drive is completely rebuilt
and fully operational for the DDD drive. The DDD drive remains
DDD.
If an HSP is not defined, or multiple drives appear
DDD before the HSP is completely rebuilt, then this protection
does not apply. You must read the RAID log to determine the
"inconsistent" drive. Then for the IBM SCSI-2 F/W PCI-Bus RAID
Adapter and the IBM F/W Streaming RAID Adapter/A, you must
ensure that the software replace option is selected on each
drive bay in the correct order such that the "inconsistent"
drive is brought online last and rebuilt.
If an HSP
drive was defined but did not complete the rebuild, then it is
much easier to identify the "inconsistent" drive. The
"inconsistent" drive will remain in OFL status.
When
multiple drives appear defunct, as long as the logical drive
is not in the OFL state, the user may select the Replace
option to change the state of any of the DDD drives. Order
does not matter with logical drives in the CRT state because
the "inconsistent" drive will appear as OFL or DDD to the
user. If the logical drive is in the OFL state, the user may
attempt to recover by identifying the "inconsistent" drive,
software replacing all drives except the "inconsistent" drive,
and then rebuilding the "inconsistent" drive.
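The rule in the paragraph above can be sketched in Python. This is only an illustration, not an IBM utility: the function name `recovery_steps`, the use of bay numbers to identify drives, and the tuple representation of each step are assumptions made for the example.

```python
def recovery_steps(logical_state, ddd_bays, inconsistent_bay=None):
    """Return an ordered list of (action, bay) steps for DDD drives.

    logical_state    -- state of the logical drive, e.g. "CRT" or "OFL"
    ddd_bays         -- bays whose drives are currently marked DDD
    inconsistent_bay -- bay of the "inconsistent" drive (needed for OFL)
    """
    if logical_state != "OFL":
        # Logical drive not offline: order does not matter.
        return [("replace", bay) for bay in ddd_bays]

    if inconsistent_bay not in ddd_bays:
        raise ValueError("identify the inconsistent drive before recovering")

    # Software replace every drive except the inconsistent one ...
    steps = [("replace", bay) for bay in ddd_bays if bay != inconsistent_bay]
    # ... and rebuild the inconsistent drive last.
    steps.append(("rebuild", inconsistent_bay))
    return steps


for action, bay in recovery_steps("OFL", [2, 5, 6], inconsistent_bay=5):
    print(action, "bay", bay)
```

The sketch refuses to produce a plan for an offline logical drive unless the "inconsistent" drive has been identified, which mirrors the requirement that this drive be handled last.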
Install and Use NetFinity Manager
You should install NetFinity Manager 5.0
or greater in order to monitor the RAID array remotely.
NetFinity Manager can be used to schedule synchronization to
occur at any time of the day, so synchronization of the RAID
array can be scheduled for non-peak hours and will not require
user input to get things started. With NetFinity services
installed at the server, and the NetFinity manager installed
on a workstation, the RAID array can be monitored, and even
synchronized, from a remote location. The system can also be
configured to send alert messages regarding the RAID subsystem
over the network to the workstation. You can even set up
NetFinity Manager to page someone, e.g., the network
administrator or a service technician, if a certain alert
condition is reached. NetFinity Manager can also perform many
other functions such as monitoring processor utilization,
critical file monitoring and detecting installed software
across the network. NetFinity is also used to capture PFA
alerts from hard files and then send system alerts to the
appropriate parties. In order to use Netfinity 5.0 to schedule
data scrubbing, please download NF50RAID.EXE from
http://www.us.pc.ibm.com/files.html. This file contains
updated Netfinity program files which are required for
scheduling data scrubbing on controllers with the write policy
set to write-back cache. When installed with the NetFinity
Manager code the following operating systems are affected:
OS/2, WINNT, and WIN95.
Data Scrub Drives Weekly
One of the best ways to
recognize potential disk problems in advance and correct them
before a failure occurs is to Data Scrub. Sector media errors
can be identified and corrected simply by forcing all data
sectors in the array to be accessed through Data Scrubbing.
Data Scrubbing checks all data sectors in the array and should
be performed weekly. With the IBM F/W Streaming RAID Adapter/A
and the IBM SCSI-2 F/W PCI-Bus RAID Adapter, an easy process
used to accomplish Data Scrubbing is synchronization. Data
Scrubbing will force all sectors of the drives contained in
the array to be read in the background while allowing
concurrent user disk activity. NetFinity Manager 5.0 can be
obtained at no additional charge by customers that have an IBM
server that ships with ServerGuide.
Apply All Updates
You should apply
all updates regarding RAID. Check the IBM Server web site at:
http://www.us.pc.ibm.com/server/server.html or call the
HelpCenter for up-to-date information.
Install and Use RAID Administration and Monitoring Utilities
The RAID administration
utility alerts the user via the speaker and display if a drive
becomes DDD or if a Predictive Failure Analysis (PFA) alert
occurs. PFA support on disk drives recognizes potentially bad
drives, and alerts systems administrators allowing them to
replace the unit before a catastrophic drive failure. The PFA
alert prompts you to replace the drive before actual failure,
so that an HSP is always present.
The RAID
administration utility monitors RAID operations, displaying
results on the RAID Administration Screen and logging these
RAID events to a file. You can specify whether you want the
utility to save the file to a diskette drive, local hard
drive, or network hard drive; however, the recommended policy
is to use the diskette drive or network drive. This practice
makes it easier to recover from situations where the operating
system is not accessible due to the failure. The logs
themselves are required to recover data from systems when
multiple DDD drives occur. The logs also provide essential
RAID history for that server when troubleshooting and
isolating the defective part in cases where it is not the
drive that is defective.
Ensure Current Backup of RAID Configuration is Available
You should always have a
current backup of the RAID configuration; any time the array
changes, you should make another backup. To create this
backup, select Backup Config. to Diskette under Advanced
Functions on the Main Menu of the RAID Configuration Diskette.
You are prompted to enter a filename; the default is CONFIG.
IBM recommends that you provide a unique name and back up to a
different diskette each time. A unique name ensures that a
good backup is not inadvertently overwritten, and a different
diskette allows you to write-protect the diskette and keep it
in a safe place. NetFinity 4.0 or above also allows you to
back up the configuration under the RAID manager.
Have a RAID Configuration Utility Diskette Available
Having a copy of the
RAID Configuration Utility Diskette is crucial when working on
a RAID system. Ensure that you always have a RAID
Configuration Utility Diskette available in close proximity to
all RAID systems. Due to possible changes of drive states, the
backup RAID configuration stored on the diskette may differ
from the current working RAID configuration.
Recovery Procedures for DDD Drives
This section provides you with procedures
for recovering from many different DDD scenarios. Topics
include:
- An Overview of Drive Replacement
- Using and Understanding the RAID Administration Log
- First Actions to be Performed On Service Call with DDD Drives
- Recovery Procedures When HSP is Present at Time of Failure
- Recovery Procedures When HSP is Not Present at Time of Failure
- Recovery From RAID Adapter Failure
An Overview of Drive Replacement
For the IBM SCSI-2 F/W PCI-Bus RAID
Adapter and the IBM F/W Streaming RAID Adapter/A, you perform
drive replacement via the RAID Configuration Utility. To
begin, select Replace Drive and Rebuild Drive options under
Replace/Rebuild on the RAID Configuration Main Menu. With this
action, the RAID Adapter sends a start unit command to the
drive. Once the drive starts successfully, the drive state
changes from defunct (DDD) to either hot-spare (HSP) or
offline (OFL). The drive state is HSP if an HSP has already
taken over for this DDD drive. The drive state is OFL if no
HSP drive was present when the drive went DDD. The logical
drive will be in a critical state and a rebuild is necessary
to bring this drive into the array as online (ONL). Once the
rebuild completes successfully, the logical drive indicates
OKY status.
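The drive state changes described above can be summarized in a short Python sketch. It is a simplified model for illustration only; the function names and the boolean parameter `hsp_took_over` are assumptions, and the sketch does not talk to any adapter.

```python
def state_after_replace(hsp_took_over):
    """State of a DDD drive once the start unit command succeeds.

    hsp_took_over -- True if a hot spare already took over for this drive.
    """
    # HSP had taken over: the old drive becomes the new hot spare.
    # No HSP was present: the drive goes offline and awaits a rebuild.
    return "HSP" if hsp_took_over else "OFL"


def state_after_rebuild(state):
    """State of a drive after a successful rebuild."""
    if state != "OFL":
        raise ValueError("only an OFL drive is rebuilt into the array")
    return "ONL"


print(state_after_replace(hsp_took_over=True))
print(state_after_rebuild(state_after_replace(hsp_took_over=False)))
```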
Software Replace vs. Physical Replace
When the RAID Adapter
communicates with the hardfile and receives an unexpected
response, the adapter will mark the drive defunct in order to
avoid any potential data loss. For example, this could occur
in the event of a power loss to any of the components in the
SCSI RAID subsystem. In this case, the RAID adapter will err
on the side of safety and will no longer write to that drive
although the drive may not be defective in
any way.
Different circumstances warrant either a
software replace or a physical replace, as discussed in the
following bullets:
- Using a software replace is recommended
to try to recover data when multiple DDD drives occur. In this
situation, you may lose data on drives that are not actually
defective if you run a normal rebuild process.
WARNING: IF YOU USE THE WRONG ORDER WHEN YOU
ATTEMPT A SOFTWARE REPLACE, DATA CORRUPTION RESULTS.
- You can perform a software replace for a DDD drive when a hot spare
(HSP) is not present in the system. In this situation, the
software replace requires a rebuild of the drive. During the
rebuild, all sectors of the drive are rebuilt. Therefore,
the drive is tested very well. If a rebuild of the drive
completes successfully, the drive does not need to be
physically replaced.
- If a DDD drive has been replaced by an HSP, you should physically
replace the DDD drive. Under these circumstances, a software
replace will only send a start unit command to the drive. If
the unit starts successfully, then the drive is seen as good
by the RAID Adapter. Just restarting a drive does not
sufficiently test the drive. Therefore, the drive should be
physically replaced to ensure a good HSP drive is present in
the system.
Using and Understanding the RAID Administration Log
Being able to read the RAID log produced by
the RAID administration and monitoring utilities is a very
important part of recovering an array when one or more drives
are marked DDD. From the RAID log, you can determine in what
order drives went DDD, and, if multiple drives are DDD, which
one is the "inconsistent" drive. The RAID log is created by
either running the RAID Administration program or Netfinity
RAID Manager. RAID Administration Program can be obtained from
the Configuration Diskette which contains the device drivers
under the specific operating system subdirectory. The diskette
is available on the IBM website
http://www.us.pc.ibm.com/files.html. Search on "RAID." The
following is an excerpt from a RAID log created by the RAID
administration utility:
RAID Log
28 January 1997, 11:23:38 |
28 January 1997, 11:23:38 |
28 January 1997, 13:03:30 | Adp 0: Drv at ch 1 bay 2 is defunct.
28 January 1997, 13:03:40 | Adp 0: Drv at ch 1 bay 2 is not auto replaced.
The original configuration was:
- bay 2: HSP
- bay 4: ONL
- bay 5: ONL
- bay 6: ONL
The first
two lines of the RAID log show that the drive in bay 5 was
marked DDD and auto replaced by the HSP drive in bay 2. At a
later point in time after the rebuild to bay 2 was successful,
bay 2 was marked DDD. Because there was no HSP drive defined
(bay 5 had been neither physically nor software replaced, so
it was still DDD), bay 2 was not auto replaced, so the array
remains in the critical state until a replacement drive is
added. Using the time stamps on this RAID log, you can tell
the exact times the apparent drive failures occurred. You can
use this information to rebuild the array properly when
multiple DDD drives occur at the same time.
In the
current status interpreted by the RAID log, the drive in bay 2
is the "inconsistent" drive, and you must physically replace
it. If more drives are DDD but are not listed in the RAID log
because the server has trapped (OS/2 or NT) or the volume was
dismounted (NetWare), then you need to software replace those
drives before replacing the drive in bay 2, because the other
drives contain the correct information to rebuild the
"inconsistent" drive, assuming no other error has arisen on
those drives.
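Reading the defunct events out of a RAID log can be sketched in Python as follows. This is a hedged illustration: it assumes each log entry sits on one line in the form "timestamp | message", with the timestamp and message wording shown in the excerpt above. The real log file layout may differ, and the names `DEFUNCT` and `defunct_events` are invented for the example.

```python
import re
from datetime import datetime

# Assumed one-line entry layout: "<day> <month> <year>, <time> | <message>"
DEFUNCT = re.compile(
    r"^(?P<when>\d{1,2} \w+ \d{4}, \d{2}:\d{2}:\d{2}) \| "
    r"Adp (?P<adapter>\d+): Drv at ch (?P<channel>\d+) "
    r"bay (?P<bay>\d+) is defunct\.$"
)


def defunct_events(lines):
    """Return (time, adapter, channel, bay) tuples, oldest first."""
    events = []
    for line in lines:
        match = DEFUNCT.match(line.strip())
        if match:
            # Month names are parsed with the default (English) locale.
            when = datetime.strptime(match["when"], "%d %B %Y, %H:%M:%S")
            events.append((when, int(match["adapter"]),
                           int(match["channel"]), int(match["bay"])))
    return sorted(events)


sample = [
    "28 January 1997, 13:03:30 | Adp 0: Drv at ch 1 bay 2 is defunct.",
    "28 January 1997, 13:03:40 | Adp 0: Drv at ch 1 bay 2 is not auto replaced.",
]
for when, adapter, channel, bay in defunct_events(sample):
    print(when.isoformat(), "adapter", adapter, "channel", channel, "bay", bay)
```

Sorting by time stamp gives the order in which drives went DDD. Any DDD drive that does not appear in the result went defunct after logging stopped and therefore failed last.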
Before you perform any actions on the
hardware, use NetFinity, the RAID administration program, or
the RAID configuration program to fill in the attached
template at the end of this document with the current status
of all the drives, both internal and external. This template
provides a three-channel diagram to accommodate all types of
IBM RAID Adapters.
For the F/W Streaming RAID
Adapter/A and SCSI-2 Fast/Wide PCI-Bus RAID Adapter, if power
is lost or another drive is marked DDD during a rebuild
operation, the rebuild fails and the drive being rebuilt
remains in the OFL state. If you are working with systems that
have these adapters, do not perform any operations on the OFL
drive until all other DDD drives are changed back to either
ONL or HSP. This is because the OFL drive is "inconsistent"
with the rest of the array and requires a rebuild operation.
If you do not rebuild the drive, then data will be corrupted.
If you accidentally select an OFL drive to rebuild while other
drives in the array besides the HSP are DDD, then the rebuild
fails and the OFL drive becomes DDD. In a case such as this,
if you have not noted which drive was OFL, then you are no longer
able to tell which drive was the original OFL, or
"inconsistent", drive. The best way to ensure data is rebuilt
successfully is to perform the following two steps:
1. Do not perform any operations on an OFL drive until all DDD
drives have changed back to either ONL or HSP.
2. Write down which drive is OFL so that you have a note of the
"inconsistent" drive. This ensures that you will be able to
determine the "inconsistent" drive in case you inadvertently
cause it to go DDD.
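The first of the two steps above amounts to a simple check on the drive states recorded in the template. The Python sketch below illustrates it; the function name `safe_to_rebuild_ofl` and the bay-to-state dictionary are assumptions made for the example.

```python
def safe_to_rebuild_ofl(drive_states):
    """Return True if an OFL drive may be rebuilt.

    drive_states -- dict mapping bay number to a state string
                    ("ONL", "HSP", "OFL", or "DDD").
    """
    states = drive_states.values()
    has_ofl = any(state == "OFL" for state in states)
    # No drive may still be DDD when the OFL drive is rebuilt.
    no_ddd = all(state != "DDD" for state in states)
    return has_ofl and no_ddd


print(safe_to_rebuild_ofl({2: "OFL", 4: "ONL", 5: "DDD", 6: "ONL"}))
print(safe_to_rebuild_ofl({2: "OFL", 4: "ONL", 5: "HSP", 6: "ONL"}))
```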
First Actions to be Performed On Service Call with DDD Drives
1. Pull the RAID Administration Log created
by the RAID Administration Program or Netfinity Manager. The RAID
Administration Program can be obtained from the Configuration
Diskette which contains the device drivers under the specific
operating system subdirectory. The diskette is available on
the IBM website http://www.us.pc.ibm.com/files.html. Search on
"RAID."
2. From reading the RAID Administration Log or
Netfinity Manager log, determine whether an HSP drive was
present in the system. The log will indicate that a
drive in a specific bay went DDD. Then, if a hot spare was
present, it will indicate that a drive in a specific bay
auto-replaced the DDD drive.
3. View the Drive
Information under Options in the RAID Administration Program or
under Netfinity RAID Manager to determine if any errors were
recorded against the DDD drive.
NOTE: Do not reboot the system
because these error counters initialize to zero when the
system is rebooted.
Hard Errors - The number of
SCSI I/O processor errors that occurred on the drive since the
Device Error Table was last cleared. It also indicates if the
drive exceeded Predictive Failure Analysis (PFA)
threshold.
Action: Contact your support
representative for further problem
determination.
Soft Errors - The number of SCSI
Check Condition status messages returned from the Drive
(except Unit Attention) since the Device Error Table was last
cleared.
Action:
- If HSP is present, follow procedures in the next section: Recovery
Procedures When HSP is Present at Time of Failure.
- If HSP is not present, follow procedures in the section: Recovery
Procedures When HSP is Not Present at Time of Failure.
Miscellaneous Errors - The number of other
errors (such as selection timeout, unexpected bus free, or
SCSI phase error) that occurred on the drive since the Device
Error Table was last cleared.
Action: Ensure cabling
and connectors are seated properly. If a backplane is used, ensure
the backplane is not bowed, causing poor drive connection. If there
are no problems with the cable, backplane, etc., determine whether
an HSP drive is present and follow the appropriate Recovery
Procedures listed below, but do not software replace the drive.
Physically replace the drive.
Parity Errors -
The number of parity errors that occurred on the SCSI bus
since the Device Error Table was last
cleared.
Action: Check to ensure the SCSI bus is
properly terminated with only one active terminator, placed at
the end of the SCSI chain. If a backplane is the last device
on the chain, then the backplane terminates the bus as long as
no cable is plugged into the daisy-chained connector on the
backplane.
PFA Error - Predictive Failure
Analysis
Action: Determine whether HSP drive is
present or not and follow appropriate Recovery Procedures
listed below, but do not software replace the drive.
Physically replace the drive.
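The error categories and their actions can be kept as a small lookup table, sketched here in Python. The action texts paraphrase this section; the short keys ("hard", "soft", and so on), the `ACTIONS` table, and the `recommended_actions` function are assumptions made for illustration and do not correspond to any field names in the RAID utilities.

```python
# Paraphrased actions for each error counter in Drive Information.
ACTIONS = {
    "hard": "Contact your support representative.",
    "soft": "Follow the recovery procedure that matches HSP presence.",
    "misc": "Check cabling and backplane; physically replace the drive.",
    "parity": "Check SCSI bus termination.",
    "pfa": "Physically replace the drive; do not software replace it.",
}


def recommended_actions(counters):
    """Return the actions for every counter with a non-zero value.

    counters -- dict mapping a category key to its error count,
                as read manually from Drive Information.
    """
    return [ACTIONS[name] for name in ACTIONS if counters.get(name, 0) > 0]


for action in recommended_actions({"soft": 0, "parity": 3, "pfa": 1}):
    print(action)
```

Record the counters before any reboot, since they are reset to zero when the system restarts.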
Recovery Procedures When HSP is Present at Time of Failure
The following instructions apply
to the IBM SCSI-2 Fast/Wide PCI-Bus RAID Adapter and IBM
Fast/Wide Streaming Adapter/A.
One DDD Drive, No OFL
Follow the steps below to bring the DDD drive back
to HSP status if the following items are true:
- Only one drive is marked DDD and the rest are ONL.
- The RAID logical drive status is OKY because an HSP is present
in the system. Either the HSP drive is the hard drive that
went DDD or the HSP has already automatically taken over for
the DDD drive and has been rebuilt successfully.
- There are no drives with an OFL status.
Once you
verify the conditions above through either the RAID
administration log or the RAID administration utility, perform
the following steps to bring the DDD drive back to HSP status.
1. Physically replace the hard drive in the DDD bay
with a new one of the same capacity or greater.
2. With a RAID-1 or RAID-5 array, the operating system is still
functional at this point. Use either NetFinity or the RAID
administration utility to bring the drive back to HSP status.
With the RAID administration utility, open the options menu
and select Replace Drive.
3. When you see the prompt to
select the DDD drive, highlight the drive you just replaced
and press Enter.
4. The RAID adapter issues a start
unit command to the drive. Once the drive successfully spins
up, the RAID adapter changes the drive's status to HSP and
saves the new configuration.
5. If you see an "Error in
starting drive" message, reinsert cables, the hard drive,
etc., to verify these are connected properly, then go to step
2. If the error persists, go to step 1.
6. If the
error still occurs with a known good hard drive, then
troubleshoot to determine the defective part, which may be a
cable, back plane, RAID adapter, etc. Once you have replaced
the defective part so that there is a good connection between
RAID adapter and hard drive, go to step 2.
Two DDD Drives, No OFL
If the
system has two DDD drives, and a defined hot spare existed
prior to the drive failures, then the system should still be
up and running as long as the logical drives are configured as
RAID-5 or RAID-1. If the system is still running, then one of
the DDD drives becomes HSP when you replace it. Perform the
following steps to bring the logical drive back to ONL status.
(Because the operating system is functional, this procedure
assumes you are using the RAID administration utility within
the operating system to recover.):
1. Physically
replace both drives that are marked DDD.
2. Once you
replace both drives, select the options menu of the RAID
administration utility. Choose Replace Drive, highlight the
first DDD drive, and press Enter. You receive a message
confirming that the drive is starting. After that, one of two
things happens:
- The drive starts the rebuild process; when complete, the drive
changes to ONL.
OR
- The drive becomes HSP. This happens if the actual hot-spare
drive that was previously defined is defective, or a
different drive was marked DDD and the hot spare
successfully rebuilt the data before the second drive went
down.
You can check which one occurs by viewing the RAID log.
3. Repeat step 2 for the second DDD drive.
More than 2 DDD Drives, No OFL
In this scenario, the operating system is no longer functional.
Therefore, you must boot to the RAID Option Diskette to
recover the array. It is extremely important to confirm that
either the RAID administration utility or NetFinity Manager
has been running prior to the drives being marked defunct. If
so, the utility or NetFinity Manager has logged the sequence
of DDD events to a log file either on a diskette or on a local
or network drive. With this file, you can view the log file on
another machine to determine the "inconsistent" drive. When
you know which drive is "inconsistent", you can attempt to
recover data.
NOTE: The previous paragraph
states "attempt to recover" because once you lose more than
one drive in a set of RAID-5 or RAID-1 logical drives, loss of
data is definitely a possibility. The steps below guide you
through a recovery, if at all possible.
1. View the
RAID log on another machine and write down the order in which
the drives went defunct.
2. Boot to the RAID
configuration diskette, and select View Configuration. Make
sure that the template contains the correct information for
the status of all drives, not just those listed in the RAID
log.
3. Using the RAID configuration utility, select
Replace Drive and choose a DDD drive that is not listed in the
RAID log. Repeat this step until the only DDD drives remaining
are those indicated in the RAID log file.
NOTE: The drives marked DDD that are
not listed in the RAID log are the last ones to go defunct.
You must recover these drives first so that the information
from them can be used to rebuild the original drive that
failed (the "inconsistent" drive). If you do not replace the
"inconsistent" drive last, then the system uses it to rebuild
the last drive that went defunct, resulting in corrupted data.
Therefore, it is extremely important to perform step 3
carefully.
4. Select Replace Drive and then
select the last drive to go defunct according to the log file.
Repeat this step until you have replaced all drives in the
correct order. One of the drives should appear as OFL and one
should appear as HSP; the rest appear as ONL.
5. Select Rebuild and highlight the DDD drive.
6. If the rebuild completes successfully, reboot to the operating
system. If it does not complete successfully, go to step 7.
At this point, run non-destructive RAID diagnostics
individually on each drive. Run these diagnostics individually
to ensure that you do not get more than one drive that goes
defunct at a time. If a drive does go DDD, physically replace
that drive and run a replace/rebuild procedure. This verifies
that you remove all defective drives from the system, if any
exist.
7. If the rebuild process fails, then perform
these steps:
a. Exit to the RAID Main Menu.
b. Select Drive Information and view the error counts for
each of the hard drives to determine which drive has errors.
c. If the errors occurred on the drive being rebuilt,
then physically replace this drive. Select Replace. The status
of the drive changes from DDD to OFL. Attempt the rebuild
process again. If it completes successfully, go to Step 6.
If the drive still fails the rebuild process, then
verify that the drives being rebuilt from do not have any
errors. If they have no errors, then you should be able to
rebuild the data. Check cable connections to the drive being
rebuilt - it is possible that you replaced a defective drive
with another defective drive.
When errors occur on the
drives that you are rebuilding from, the adapter continues to
rebuild all information except that contained in the
unrecoverable defective sector. If the unrecoverable sector
was in the data area of the disk, then naturally some data has
been lost. There is no method at this time for determining
whether the errors are in a data or non-data area of the disk.
Users must inspect their personal files to determine this.
To recover the portion of the data that was rebuilt,
perform the following steps after the "Rebuild Failure"
message:
1. If a backup configuration is available,
restore the backup configuration.
2. If a backup
configuration is not available, write down the information you
can retrieve by selecting the View Configuration option.
Delete the array and manually create it to match this
configuration information. Perform this step carefully, for if
you deviate in any way from the original configuration, then
you will lose all data. NOTE: Do not Initialize this logical
drive.
3. Have all users verify their personal files to
ensure their data is good. Keep in mind that some files may be
corrupt due to rebuild errors.
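The replacement order used in steps 3 and 4 of this section can be sketched in Python. This is an illustration under stated assumptions, not an IBM tool: the function name `replace_order` is invented, drives are identified by bay number, and `logged_bays` is assumed to list the bays in the order the RAID log recorded them going defunct, oldest first.

```python
def replace_order(ddd_bays, logged_bays):
    """Return the order in which DDD drives should be replaced.

    ddd_bays    -- all bays currently marked DDD
    logged_bays -- bays recorded as defunct in the RAID log, oldest first
    """
    # Drives missing from the log went defunct last, so they go first.
    unlogged = [bay for bay in ddd_bays if bay not in logged_bays]
    # Then the logged drives, most recent failure first. The first drive
    # to fail (the "inconsistent" drive) therefore ends up last.
    logged_newest_first = [bay for bay in reversed(logged_bays)
                           if bay in ddd_bays]
    return unlogged + logged_newest_first


print(replace_order(ddd_bays=[2, 4, 5, 6], logged_bays=[5, 2]))
```

In this example bay 5 failed first according to the log, so it is placed at the end of the list and is the drive that gets rebuilt.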
One or More DDD Drives, and One OFL Drive
Follow the same basic steps as those
listed in the above section to recover your data. When a drive
is marked OFL, that means that it is spinning but
"inconsistent" with the rest of the array. Usually when a
drive is marked OFL, the data on it is being rebuilt from the
remaining drives in the array. If the server loses power, or
if another drive goes DDD during a rebuild, then the drive
being rebuilt remains OFL. In this case, you have to boot the
machine to the RAID Configuration Diskette and then follow the
procedure in the previous section. Make sure that the OFL
drive is the last drive to be software replaced. The offline
drive is the "inconsistent" drive, and it requires a
rebuilding process.
NOTE: Data corruption occurs if the OFL drive is used to
rebuild another drive.
Recovery Procedures When HSP is Not Present at Time of Failure
For the IBM SCSI-2 Fast/Wide PCI-Bus
RAID Adapter and IBM Fast/Wide Streaming Adapter/A, use the
following instructions.
One DDD Drive, No OFL
Follow these steps to bring the
DDD drive back to the ONL state if the following items are
true:
- Only one drive is marked DDD and the rest are ONL.
- There are no drives with an OFL status.
Once the
conditions above are verified through either the RAID
administration log or the RAID administration utility, perform
the following steps to bring the DDD drive back to ONL status.
1. If the drive has never been marked DDD before, proceed to step
3 to software replace the drive using the RAID Administration
Program or Netfinity RAID Manager.
NOTE: Refer to the "Software Replace vs.
Physical Replace" section of this paper to understand the
differences between software and physical
replacement.
2. If the drive has been marked DDD before,
proceed to step 7.
3. With a RAID-1 or RAID-5 array,
the operating system will be functional. Use either NetFinity
or the RAID administration utility within the operating system
to bring the drive back to ONL status. With the RAID
administration utility, open the Options menu and select
Rebuild Drive.
4. When you see the prompt to select the
DDD drive, highlight the drive you just replaced and press
Enter.
5. The RAID adapter issues a start unit command
to the drive. You receive a message confirming that the drive
is starting. The drive then begins the rebuild process. Once
the drive completes this process, the drive's status changes
to ONL.
6. If you see an "Error in starting drive"
message, reinsert the cables, hard drive, etc., to verify
there is a good connection, then go to step 3. If the error
persists, go to step 7.
7. Physically replace the hard
drive in the DDD bay with a new one of the same or greater
capacity and go to step 3.
8. If the error still occurs
with a known good hard file, then troubleshoot to determine if
the cable, back plane, RAID adapter, etc., is defective.
NOTE: RAID Adapter
should not be replaced unless Hard Errors are reported under
Drive Information with RAID Administration Options Menu or
Netfinity RAID Manager.
Once you have replaced the
defective part so that there is a good connection between the
adapter and hard drive, go to step 3.
Two DDD Drives, No OFL
In this
case, with no defined hot spare drive, the server more
than likely trapped (under OS/2 and NT), or the volume was
dismounted (under NetWare). To attempt to resolve this
scenario, you must examine the RAID log generated by the RAID
Administration Utility and follow the steps below:
1. Boot to the RAID configuration utility for your RAID adapter.
2. Select Replace Drive. Highlight the drive marked
DDD last by the RAID adapter and press enter. The drive spins
up and changes from DDD to ONL status.
WARNING: IF YOU USE THE WRONG ORDER
WHEN YOU SELECT SET DEVICE STATE TO CHANGE DRIVE'S STATE TO
ONL, DATA CORRUPTION RESULTS. SEE NOTE BELOW TO DETERMINE LAST
DRIVE MARKED DDD BY THE RAID ADAPTER
NOTE: Refer to "Using and
Understanding the RAID Administration Log" section of this
document, for details on obtaining and interpreting the RAID
log. If only one drive is recorded in the RAID log because the
RAID adapter was not able to log the defunct drive before the
operating system went down, then the last drive that went
defunct is the drive that is not recorded in the RAID log. If
two drives are recorded in the RAID log, then the last drive
to go defunct is the second drive listed in the log - the
drive with the most recent time stamp.
3. If the drive
has been marked DDD before, proceed to step 9.
4. Proceed to step 5 to software replace the remaining DDD drive
using the RAID Administration Program or Netfinity RAID
Manager.
NOTE: Refer
to the "Software Replace vs. Physical Replace" section of this
paper to understand the differences between software and physical
replacement.
5. With a RAID-1 or RAID-5 array, the
operating system will be functional. Use either NetFinity or
the RAID administration utility within the operating system to
bring the drive back to ONL status. With the RAID
administration utility, open the Options menu and select
Rebuild Drive.
6. When you see the prompt to select the
DDD drive, highlight the drive you just replaced and press
Enter.
7. The RAID adapter issues a start unit command
to the drive. You receive a message confirming that the drive
is starting. The drive then begins the rebuild process. Once
the drive completes this process, the drive's status changes
to ONL.
8. If you see an "Error in starting drive"
message, reinsert the cables, hard drive, etc., to verify
there is a good connection, then go to step 5. If the error
persists, go to step 9.
9. Physically replace the hard
drive in the DDD bay with a new one of the same or greater
capacity and go to step 5.
10. If the error still
occurs with a known good hard file, then troubleshoot to
determine if the cable, back plane, RAID adapter, etc., is
defective.
NOTE:
The RAID adapter should not be replaced unless Hard
Errors are reported under Drive Information in the RAID
Administration Options menu or in Netfinity RAID
Manager.
Once you have replaced the defective part so
that there is a good connection between the adapter and hard
drive, go to step 3.
11. If software replacement brings
all drives back ONL and makes the system operational, carefully
inspect all cables and connectors to ensure that neither the cable
nor the backplane is defective. Check all backplane connectors and
ensure that the backplane is not bowed. When multiple drives are marked
defunct, the communication channel (cable or backplane) is often
the cause of the failure. If the backplane is
bowed, drives and backplane connectors may not seat properly,
causing a bad connection. Also, with hot-swap
drives that are removed frequently, connectors can become
damaged if too much force is exerted.
12. If the
rebuild completes successfully, then perform the following
steps to ensure that all drives are good: Run
non-destructive RAID diagnostics individually on each drive.
Run the diagnostics individually to ensure that you do not
have more than one drive that can become defunct at a time. If
a drive does become DDD, physically replace that drive and run
a rebuild process on the new drive. This verifies that all
defective drives are removed from the system, if any
exist.
If the REBUILD process fails, then perform the
following steps:
a. Exit to the RAID Main Menu.
b. Select Drive Information and view the error counters for each
of the hard files to find out which drive had errors. Refer to
"First Actions to be Performed on Service Call With DDD
Drives" for descriptions of the various errors and the
appropriate action.
c. If the errors occur on the drive
being rebuilt, then physically replace this drive and select
Rebuild again. The drive's status changes from DDD to RBL and
the rebuild process begins. If this process completes
successfully, go to Step 5.
If the rebuild still fails,
verify that the drives being rebuilt
from do not have any errors. If they have no errors, then you
should be able to rebuild the data. Check the cable connections to
the drive being rebuilt. It is possible that you replaced a
defective drive with another defective drive.
- When
errors occur on the drives that you are rebuilding from, the
adapter continues to rebuild all information except that
contained in the unrecoverable defective sector. If the
unrecoverable sector was in the data area of the disk, then
naturally some data has been lost. There is no method at this
time for determining whether the errors are in a data or
non-data area of the disk. Users must inspect their personal
files to determine this.
To recover the portion of the
data that was rebuilt, perform the following steps after the
"Rebuild Failure" message:
1. If a backup
configuration is available, restore the backup configuration.
2. If a backup configuration is not available, write
down the information you can retrieve by selecting the View
Configuration option. Delete the array and manually create it
to match this configuration information. Perform this step
carefully, because any deviation from the original
configuration causes the loss of all data. NOTE: Do not
Initialize this logical drive.
3. Have all users verify
their personal files to ensure their data is good. Keep in
mind that some files may be corrupt due to rebuild
errors.
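The rule given in the NOTE under step 2 for finding the last drive marked DDD can be sketched as follows. This is a minimal illustration, not an IBM utility: the drive labels (such as "ch1-bay2") and the `(timestamp, drive)` tuples are hypothetical stand-ins for entries copied by hand from the RAID log.

```python
from datetime import datetime

def last_defunct_drive(ddd_drives, log_entries):
    """Return the drive that went defunct last in a two-DDD failure.

    ddd_drives:  labels of the two drives currently shown as DDD.
    log_entries: (timestamp, drive) tuples transcribed from the RAID log
                 (hypothetical format; the real log is read by hand).
    """
    if len(ddd_drives) != 2:
        raise ValueError("this sketch covers the two-DDD case only")
    # Keep the most recent logged time stamp for each DDD drive.
    logged = {}
    for ts, drive in log_entries:
        if drive in ddd_drives:
            logged[drive] = max(ts, logged.get(drive, ts))
    if len(logged) == 2:
        # Both failures were logged: the most recent time stamp wins.
        return max(logged, key=logged.get)
    if len(logged) == 1:
        # The adapter could not log the second failure before the
        # operating system went down, so the unlogged drive failed last.
        return next(d for d in ddd_drives if d not in logged)
    raise ValueError("no DDD drive appears in the log; order is unknown")
```

The drive returned here is the one to bring back ONL with Replace Drive in step 2; the other DDD drive is the "inconsistent" one that needs a rebuild.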
More than 2 DDD Drives, No OFL To
attempt to recover, perform the following:
1. View the RAID
log and write down the order in which the drives went defunct.
2. Boot to the RAID Configuration Diskette and select
View Configuration. Make sure that the template contains the
correct information for the status of all drives, not just
those listed in the RAID log.
3. Using the RAID
configuration utility, select Replace Drive and choose a DDD
drive not listed in the RAID log. Change the state of this
drive to ONL. Perform this step until the only DDD drives
remaining are those indicated in the RAID log.
WARNING: IF YOU USE THE WRONG ORDER WHEN YOU
SELECT SET DEVICE STATE TO CHANGE DRIVES' STATES TO ONL, DATA
CORRUPTION RESULTS. ENSURE THAT YOU CHANGE TO ONL ONLY THE DEVICE
STATES OF DRIVES NOT LISTED AS DDD IN THE RAID LOG. THE FIRST
DRIVE THAT WENT DEFUNCT REQUIRES REBUILDING, SO IT MUST BE
REPLACED LAST.
NOTE: Refer to the "Using and Understanding
the RAID Administration Log" section of this document for
details on obtaining and interpreting the RAID log. Refer to
the "Software Replace vs. Physical Replace" section of this paper
to understand the differences between software and physical
replacement.
4. Follow the same procedure used to
recover from two DDD drives, as outlined in the previous
section.
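The ordering rule for more than two DDD drives can be sketched in a few lines. This is only an illustration under stated assumptions: the drive labels are hypothetical, and the list of logged drives is assumed to have been copied by hand from the RAID log, oldest failure first.

```python
def plan_multi_ddd_recovery(ddd_drives, logged_in_failure_order):
    """Split DDD drives into those set ONL first and those left for step 4.

    ddd_drives: labels of all drives currently shown as DDD.
    logged_in_failure_order: drive labels copied from the RAID log,
        oldest failure first (hypothetical hand-transcribed list).
    Returns (set_online_first, still_logged, rebuild_last).
    """
    logged = [d for d in logged_in_failure_order if d in ddd_drives]
    # Step 3: DDD drives never recorded in the log are set to ONL first.
    set_online_first = [d for d in ddd_drives if d not in logged]
    # The first drive that went defunct needs a rebuild, so it goes last.
    rebuild_last = logged[0] if logged else None
    return set_online_first, logged, rebuild_last
```

For example, with four DDD drives of which only "b3" and then "b1" appear in the log, the sketch returns `["b2", "b4"]` as the drives to set ONL first and "b3" as the drive to handle last.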
Recovery from RAID
Adapter Failure When a RAID adapter failure
occurs, you must replace the RAID adapter and then place the
new RAID configuration onto the RAID adapter. For the IBM
SCSI-2 Fast/Wide PCI-Bus RAID Adapter and IBM Fast/Wide
Streaming Adapter/A, there are two ways to restore the RAID
configuration:
1. If you have the most recent backup
of the current RAID configuration, then perform the following
steps:
a. Boot to the RAID Option Diskette.
b. Select Advanced Functions.
c. Select Restore RAID Configuration.
d. Enter the backup configuration filename and press Enter.
e. The RAID adapter saves the new configuration.
2. If no
backup of the current RAID configuration is available, then
perform one of the following steps:
- Remove
the NVRAM module from the old RAID adapter and place it on
the new RAID adapter. Boot the system. At the Mismatch
Configuration screen, you see a prompt to choose the
configuration to use. Select the NVRAM
configuration.
OR
- Manually configure the new RAID adapter via the
Create/Delete Array menu of the RAID Option Diskette Main
Menu. To do this, you must know what the configuration was
when the RAID adapter failed. You can determine the
configuration by reading the RAID log. In addition,
understanding how the array was originally configured before
the DDD failures will help you complete this step.
NOTE: Do not Initialize this logical
drive.
NOTE: When you have a defined hot spare
and the RAID log is not available, remember that the hot spare
becomes part of the array as soon as the first drive is marked
defunct. The initial drive that went defunct is DDD in the
configuration and is no longer part of the array. However, the
hot-spare drive, until it is completely rebuilt, is marked as
write only in the configuration. If the configuration is lost,
then you must remember that the hot spare may or may not have
completed rebuilding. Therefore, take this into account when
replacing RAID adapters where the NVRAM is also corrupted, the
known state of the array is uncertain, the RAID log is not
available, drives are DDD, or a hot spare was
defined.
1. Manually define the array according to your
best estimate, including the original HSP drive as part of the
array. You include the HSP drive because other drives were
defunct besides the HSP. Therefore, the HSP has most likely
taken over for the original drive.
2. Before booting
to the configuration, pull the original HSP drive and mark it
as defunct. This ensures the logical drive is running in the
CRT state. This in turn avoids problems if the HSP had
not completed rebuilding.
NOTE: The
information above is to help guide you to make the best
choices when servicing RAID problems. However, there will be
times when data is not recoverable.
Drive Template As mentioned in the
section titled "Using and Understanding the RAID
Administration Log," you may find this template useful to
record the status of drives as you begin the troubleshooting
process.
Channel 1 | Channel 2 | Channel 3 |
          | .         | .         |
          | .         | .         |
          | .         | .         |
          | .         | .         |
          | .         | .         |
          | .         | .         |
          | .         | .         |
Definitions
Array In
the RAID environment, data is striped across multiple physical
hard drives. The array is defined as the set of hard drives
included in the data striping. The largest number of physical
hard drives that you can define in one array is eight.
Data Scrubbing Data Scrubbing forces all
data sectors in a logical drive to be accessed so that sector
media errors are identified and corrected at the disk level
using disk ECC information if possible, or at the array level
using RAID parity information if necessary. For a high level
of data protection, Data Scrubbing should be performed
weekly.
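The array-level correction mentioned above relies on the fact that, in a parity-protected stripe, any one block equals the XOR of all the others. The sketch below illustrates that idea only; it is not the adapter's firmware logic, and the in-memory `bytes` blocks are a hypothetical stand-in for disk sectors.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def recover_sector(stripe, bad_index):
    """Rebuild one unreadable block of a stripe from the others.

    stripe: list of equal-length bytes objects (data blocks plus the
            parity block); the entry at bad_index is ignored.
    """
    survivors = [blk for i, blk in enumerate(stripe) if i != bad_index]
    return xor_blocks(survivors)
```

Because the recovery needs every other block in the stripe to be readable, scrubbing regularly helps find media errors while the redundancy to correct them is still intact.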
Logical Drive The array specifies
which drives should be included in the striping of data. Each
array is subdivided into one or more logical drives. The
logical drives specify the following:
- The
number and size of the physical drives as seen by the
operating system. The operating system sees each defined
logical drive as a physical drive.
- The RAID
level. When a logical drive is defined, its RAID level (0,
1, or 5) is also defined.
RAID-0 RAID level 0 stripes the data across
all of the drives of the array. RAID-0 offers substantial
speed enhancement, but provides for no data redundancy.
Therefore, a defective hard disk within the array results in
loss of data in the logical drive assigned level 0, but only
in that logical drive.
RAID-1 RAID level 1
provides an enhanced feature for disk mirroring that stripes
data as well as copies of the data across all the drives of
the array. The first stripe is the data stripe, and the second
stripe is the mirror (copy) of the first data stripe. The data
in the mirror stripe is written on another drive. Because data
is mirrored, the capacity of the logical drive when assigned
level 1 is 50% of the physical capacity of the grouping of
hard disk drives in the array.
RAID-5 RAID
level 5 stripes data and parity across all drives of the
array. When a disk array is assigned RAID-5, the capacity of
the logical drive is reduced by one physical drive size
because of parity storage. The parity is spread across all
drives in the array. If one drive fails, the data can be
rebuilt. If more than one drive fails, but no more than one of the
drives is actually defective, then data may not be lost. You
can use a process called software replacement on the
non-defective hard drives.
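The capacity rules in the RAID-0, RAID-1, and RAID-5 definitions above can be worked through with a short calculation. This is a sketch that assumes equal-size drives and a single logical drive spanning the whole array; the function name and the megabyte unit are illustrative choices, and the three-drive minimum for RAID-5 is general RAID practice rather than something stated in this paper.

```python
def logical_capacity(raid_level, drive_count, drive_size_mb):
    """Usable capacity in MB for one logical drive spanning the array."""
    if not 1 <= drive_count <= 8:
        raise ValueError("an array holds at most eight physical drives")
    if raid_level == 0:
        return drive_count * drive_size_mb        # striping only, no redundancy
    if raid_level == 1:
        return drive_count * drive_size_mb // 2   # 50% holds the mirror copies
    if raid_level == 5:
        if drive_count < 3:
            raise ValueError("RAID-5 needs at least three drives")
        return (drive_count - 1) * drive_size_mb  # one drive's worth of parity
    raise ValueError("supported RAID levels are 0, 1, and 5")
```

For instance, four 4510 MB drives give 18040 MB at RAID-0, 9020 MB at RAID-1, and 13530 MB at RAID-5.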
Software
Replace A Software Replace of a hardfile means that
the hardfile is brought back into service without being physically
replaced in the system. A drive
may have been marked defunct but brought back online using the
RAID Administration program. The drive is rebuilt without
having been physically replaced. This is possible because, when
the RAID Adapter communicates with the hardfile and receives
an unexpected response, the adapter marks the drive
defunct in order to avoid any potential data loss, even if the
drive itself is not defective.
Synchronization Synchronization reads all
the data bits of the entire logical drive, calculates the
parity bit for the data, compares the calculated parity with
the existing parity, and updates the existing parity if
inconsistent.
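The read, calculate, compare, and update cycle described above can be sketched as follows. This is a simplified illustration, not the adapter's implementation: stripes are modeled as in-memory lists of `bytes`, and the last block of each stripe is treated as parity, whereas a real RAID-5 array spreads parity across all drives.

```python
def parity_of(data_blocks):
    """XOR the data blocks of one stripe to get its parity block."""
    parity = bytearray(len(data_blocks[0]))
    for block in data_blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

def synchronize(stripes):
    """Check parity on each stripe and rewrite it when inconsistent.

    stripes: list of stripes; each stripe is a list of equal-length
             bytes objects with the parity block last (a simplification).
    Returns the number of stripes whose parity was updated.
    """
    updated = 0
    for stripe in stripes:
        expected = parity_of(stripe[:-1])
        if stripe[-1] != expected:
            stripe[-1] = expected
            updated += 1
    return updated
```

Running the sketch a second time on the same stripes reports zero updates, which is the condition synchronization is meant to establish.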
The following definitions describe the
logical drive states for the IBM SCSI-2 F/W PCI-Bus RAID
Adapter and the IBM F/W Streaming RAID Adapter/A:
CRITICAL (CRT) This is the status for
RAID-1 and RAID-5 arrays where the system is running in
degraded mode because one drive is DDD. If another drive goes
DDD, the array will be OFL and the operating system will not
be operational.
GOOD For the IBM SCSI-2 F/W
PCI-Bus RAID Adapter and the IBM F/W Streaming RAID Adapter/A,
the logical drive status is GOOD when all drives in the array
are ONLINE and fully operational. The adapters also assign
device states to physical drives. The following definitions
describe these device states:
DDD The RAID
adapter marks an ONL or OFL drive defunct, changing its status
to DDD and removing power from the drive, when one of the
following conditions occurs:
- The drive does not respond
to commands by a certain timeout value.
- The drive
exceeds the number of allowed busy status responses as
specified by the RAID adapter firmware.
- A reassign
failure or two successive failures in verification occur when
the RAID adapter tries to recover from a media error reported
from the drive.
NOTE: Media error recovery and
conditions under which a drive is marked defunct in the
recovery process vary slightly depending upon the specific
RAID adapter.
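The three conditions above can be read as a simple decision rule, sketched below. The timeout and busy-count limits are hypothetical placeholders, since the real values are set by the RAID adapter firmware and are not given in this document; the `DriveObservation` fields are likewise illustrative names.

```python
from dataclasses import dataclass

# Hypothetical limits for illustration only; the real values are set by
# the RAID adapter firmware and are not given in this document.
COMMAND_TIMEOUT_S = 30.0
MAX_BUSY_RESPONSES = 8

@dataclass
class DriveObservation:
    seconds_without_response: float
    busy_responses: int
    reassign_failed: bool
    successive_verify_failures: int

def defunct_reason(obs):
    """Return why a drive would be marked DDD, or None if it would not."""
    if obs.seconds_without_response > COMMAND_TIMEOUT_S:
        return "command timeout"
    if obs.busy_responses > MAX_BUSY_RESPONSES:
        return "too many busy responses"
    if obs.reassign_failed or obs.successive_verify_failures >= 2:
        return "media error recovery failed"
    return None
```

Only the last condition points directly at the drive's media; the first two can just as easily result from a cable or backplane problem, which is why a drive marked DDD is not necessarily defective.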
FMT Format; the drive is being
reformatted.
HSP A hot-spare (HSP) drive is
a drive designated to be a replacement for the first DDD drive
that occurs. The state of the drive appears as HSP. When a DDD
drive occurs and an HSP is defined, the hot-spare drive takes
over for the drive that appears as DDD. The HSP drive is
rebuilt to be identical to the DDD drive. During the
rebuilding of the HSP drive, this drive changes to the OFL
state. The OFL state changes to ONL once the drive is
completely rebuilt and fully operating in place of the DDD drive.
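The hot-spare state changes described above (HSP, then OFL during the rebuild, then ONL) can be traced with a small sketch. The dictionary of drive labels and states is a hypothetical model for illustration; it does not represent the adapter's configuration format.

```python
def fail_drive(states, failed):
    """Mark `failed` DDD and start a rebuild on a hot spare, if any.

    states: dict mapping a hypothetical drive label to its device state.
    Returns the label of the hot spare that took over, or None.
    """
    states[failed] = "DDD"
    for drive, state in states.items():
        if state == "HSP":
            states[drive] = "OFL"  # rebuilding; does not yet hold full data
            return drive
    return None

def rebuild_complete(states, drive):
    """Bring a drive online once its rebuild has finished."""
    if states.get(drive) != "OFL":
        raise ValueError("drive is not rebuilding")
    states[drive] = "ONL"
```

With one hot spare defined, the first failure starts a rebuild on the spare; a later failure finds no HSP left and returns None, so the array stays degraded until a drive is replaced.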
OFL Offline; a good drive that replaces a
defunct drive in a RAID level 1 or level 5 array. This drive
is associated with the array, but does not contain any data.
Drive status remains OFL during the rebuild
phase.
ONL Online; a drive that the RAID adapter
detects as installed, operational, and configured into an
array appears in this state.
PFA The
firmware of a hard drive uses algorithms to track the error
rates on the drive. The drive alerts the user with a
Predictive Failure Analysis (PFA) alert via the RAID
administration utility and NetFinity when degradation of drive
performance (read/write errors) is detected. When a PFA alert
occurs, physical replacement of the drive is
recommended.
RDY RDY appears as the status
of a drive that the RAID adapter detects as installed and spun
up, but not configured in an
array.
UFM Unformatted; a drive that requires
a low-level formatting before it can be used in an array. You
can start the low-level format by selecting Format Drive from
the RAID Configuration Main Menu.
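The relationship between the device states of the physical drives and the logical drive states (GOOD, CRT, OFL) defined above can be summarized in a short sketch. It rests on two assumptions that are labeled in the code: any member drive that is not ONL counts as missing, and a RAID-0 logical drive goes offline as soon as one member is missing because it has no redundancy.

```python
def logical_drive_status(raid_level, device_states):
    """Derive GOOD / CRT / OFL for a logical drive from the device
    states of its member drives, e.g. ["ONL", "ONL", "DDD"]."""
    # Assumption: a member in any state other than ONL (DDD, or OFL
    # while it is being rebuilt) is not yet contributing valid data.
    missing = sum(1 for state in device_states if state != "ONL")
    if missing == 0:
        return "GOOD"
    # RAID-1 and RAID-5 tolerate one missing drive in degraded mode.
    if raid_level in (1, 5) and missing == 1:
        return "CRT"
    # Assumption: RAID-0 has no redundancy, so one missing drive is
    # enough to take the logical drive offline.
    return "OFL"
```

Under this model a RAID-5 logical drive stays CRT for the whole time a replacement or hot-spare drive is rebuilding, and returns to GOOD only when that drive reaches ONL.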
Additional Information
Web
Sites IBM maintains extensive and timely information on
the world wide web. Visit the following sites for more
information on IBM servers and other IBM products. These
sources contain product information, performance data, and
technical literature.
IBM Home Page .................... http://www.ibm.com
IBM PSG Home page ................ http://www.pc.ibm.com
IBM PSG Server Home page ......... http://www.pc.ibm.com/us/server/server.html
IBM PSG Company Support .......... http://www.pc.ibm.com/us/support.html
TechConnect Program .............. http://www.pc.ibm.com/techconnect/
File repositories ................ http://www.pc.ibm.com/us/files.html or ftp://ftp.pcco.ibm.com
White
Papers The following White Papers pertain to RAID and
hardfile technologies. These provide procedures for ensuring
the highest protection and availability of customer data and
are viewable on-line in PDF format at:
http://www.pc.ibm.com/techconnect/tech/resource.html. From
this site select "White Papers".
1. Using IBM RAID Adapters to Avoid Data Loss.
2. High Availability Using the IBM ServeRAID Adapter.
3. Understanding Hard Disk Drive Media Defects.
Notice International Business
Machines Corporation 1997. All rights
reserved.
References in this publication to IBM
products, programs or services do not imply that IBM intends
to make these available in all countries in which IBM
operates. Any reference to an IBM product, program, or service
is not intended to state or imply that only IBM's product,
program, or service may be used. Any functional equivalent
program that does not infringe any of IBM's intellectual
property rights may be used instead of the IBM product,
program or service.
Information in this paper was
developed in conjunction with use of the equipment specified,
and is limited in application to those specific hardware and
software products and levels.
IBM may have patents or
pending patent applications covering subject matter in this
document. The furnishing of this document does not give you
any license to these patents. You can send license inquiries,
in writing, to the IBM Director of Licensing, IBM Corporation,
500 Columbus Avenue, Thornwood, NY 10594 USA.
The
information contained in this document has not been submitted
to any formal IBM test and is distributed AS IS WITHOUT
WARRANTIES OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT
LIMITED TO IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS
FOR A PARTICULAR PURPOSE. The information about non-IBM
(VENDOR) products in this manual has been supplied by the
vendor and IBM assumes no responsibility for its accuracy or
completeness. The use of this information or the
implementation of any of these techniques is a customer
responsibility and depends on the customer's ability to
evaluate and integrate them into the customer's operational
environment. While each item may have been reviewed by IBM for
accuracy in a specific situation, there is no guarantee that
the same or similar results will be obtained elsewhere.
Customers attempting to adapt these techniques to their own
environments do so at their own risk. This publication could
include technical inaccuracies or typographical errors.
Changes are periodically made to the information herein. IBM
may make improvements and/or changes in the product(s) and/or
the program(s) described in this publication at any time.
The following terms are trademarks or registered
trademarks of the International Business Machines Corporation
in the United States and/or other countries.
OS/2®
NetFinity®
Microsoft, Windows, Windows NT, and the
Windows logo are registered trademarks of Microsoft
Corporation. UNIX is a registered trademark in the United
States and other countries licensed exclusively through X/Open
Company Limited.
Other company, product, and service
names may be trademarks or service marks of others.