1. Introduction
2. About RAID
3. Why Is RAID Important?
INTRODUCTION
RAID is an acronym for Redundant Array of Independent Disks:
Redundant means that part of the disks’ storage capacity is used to store check data that can be used to recover user data if a disk containing it should fail.
Array means that a collection of disks is managed by control software that presents their capacity to applications as a set of coordinated virtual disks. In host-based arrays, the control software runs in a host computer. In controller-based arrays, the control software runs in a disk controller.
Independent means that the disks are perfectly normal disks that could function independently of each other.
Disks means that the storage devices comprising the array are on-line storage. In particular, unlike most tapes, disk write operations specify precisely which blocks are to be written, so that a write operation can be repeated if it fails.
ABOUT RAID
The basic idea of RAID was to combine multiple small, inexpensive disk drives into an array of disk drives which yields performance exceeding that of a Single Large Expensive Drive (SLED). Additionally, this array of drives appears to the computer as a single logical storage unit or drive.
The Mean Time Between Failures (MTBF) of the array, assuming the drives fail independently, will be equal to the MTBF of an individual drive divided by the number of drives in the array. Because of this, the MTBF of an array of drives would be too low for many application requirements. However, disk arrays can be made fault-tolerant by redundantly storing information in various ways. Five types of array architectures, RAID-1 through RAID-5, were defined by the Berkeley paper, each providing disk fault-tolerance and each offering different trade-offs in features and performance. In addition to these five redundant array architectures, it has become popular to refer to a non-redundant array of disk drives as a RAID-0 array.
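The MTBF rule above is simple arithmetic, and a short sketch makes the effect of adding drives concrete. The function name and the 500,000-hour figure are illustrative assumptions, not values from any particular drive specification.

```python
def array_mtbf_hours(drive_mtbf_hours: float, drive_count: int) -> float:
    """MTBF of a non-redundant array: single-drive MTBF divided by drive count.

    Assumes identical drives that fail independently (hypothetical helper).
    """
    if drive_count < 1:
        raise ValueError("an array needs at least one drive")
    return drive_mtbf_hours / drive_count


# Illustrative figure only: a drive rated at 500,000 hours.
single_drive = 500_000
for drives in (1, 5, 10):
    print(drives, "drives:", array_mtbf_hours(single_drive, drives), "hours")
```

Ten such drives in a non-redundant array give an array MTBF one tenth that of a single drive, which is why the redundant architectures described next matter.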
Why Is RAID Important?
As the storage industry becomes increasingly independent of the computer system industry, storage alternatives are becoming more complex. System administrators, as well as managers who make storage purchase and configuration decisions, need to understand on-line storage alternatives. Awareness of what RAID can and cannot do for them helps managers make informed decisions about on-line storage alternatives. Users of networked personal computers may also be concerned about the quality of the storage service provided by their data servers.
Why use RAID?
Typically, RAID is used in large file servers, transaction or application servers, where data accessibility is critical, and fault tolerance is required. Today, RAID is also being used in desktop systems for CAD, multimedia editing and playback, where higher transfer rates are needed.
Disk Striping
Fundamental to RAID technology is striping. This is a method of combining multiple drives into one logical storage unit. Striping partitions the storage space of each drive into stripes, which can be as small as one sector (512 bytes) or as large as several megabytes. These stripes are then interleaved in a rotating sequence, so that the combined space is composed alternately of stripes from each drive. The specific type of operating environment determines whether large or small stripes should be used.
Most operating systems today support concurrent disk I/O operations across multiple drives. However, in order to maximize throughput for the disk subsystem, the I/O load must be balanced across all the drives so that each drive can be kept busy as much as possible. In a multiple drive system without striping, the disk I/O load is never perfectly balanced. Some drives will contain data files that are frequently accessed and some drives will rarely be accessed.
By striping the drives in the array with stripes large enough so that each record falls entirely within one stripe, most records can be evenly distributed across all drives. This keeps all drives in the array busy during heavy load situations. This situation allows all drives to work concurrently on different I/O operations, and thus maximize the number of simultaneous I/O operations that can be performed by the array.
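The rotating interleave described above can be sketched as a mapping from a logical block number to a drive and a block on that drive. This is a minimal illustration: the function name, the block-based units, and the 4-drive, 8-block-stripe layout are assumptions chosen for the example, not the layout of any specific controller.

```python
from typing import Tuple


def locate_block(logical_block: int, stripe_blocks: int, drive_count: int) -> Tuple[int, int]:
    """Map a logical block to (drive index, block index on that drive).

    Stripes are assigned to drives in rotating order: stripe 0 on drive 0,
    stripe 1 on drive 1, and so on, wrapping around after the last drive.
    """
    stripe_index, offset = divmod(logical_block, stripe_blocks)
    drive = stripe_index % drive_count   # which drive holds this stripe
    row = stripe_index // drive_count    # how many stripes precede it on that drive
    return drive, row * stripe_blocks + offset


# Hypothetical array: 4 drives, stripes of 8 blocks each.
for block in (0, 8, 31, 32, 45):
    print(block, "->", locate_block(block, stripe_blocks=8, drive_count=4))
```

Consecutive stripes land on consecutive drives, so independent requests for different records tend to fall on different drives and can be serviced concurrently.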
Definition of RAID Levels
RAID 0 is typically defined as a group of striped disk drives without parity or data redundancy. RAID 0 arrays can be configured with large stripes for multi-user environments or small stripes for single-user systems that access long sequential records. RAID 0 arrays deliver the best data storage efficiency and performance of any array type. The disadvantage is that if one drive in a RAID 0 array fails, the entire array fails.
RAID 1
RAID 1, also known as disk mirroring, is simply a pair of disk drives that store duplicate data but appear to the computer as a single drive. Although striping is not used within a single mirrored drive pair, multiple RAID 1 arrays can be striped together to create a single large array consisting of pairs of mirrored drives. All writes must go to both drives of a mirrored pair so that the information on the drives is kept identical. However, each individual drive can perform simultaneous, independent read operations. Mirroring can thus double the read performance of a single non-mirrored drive, while the write performance is unchanged. RAID 1 delivers the best performance of any redundant array type. In addition, there is less performance degradation during drive failure than in RAID 5 arrays.
RAID 2
RAID 2 arrays sector-stripe data across groups of drives, with some drives assigned to store ECC information. Because all disk drives today embed ECC information within each sector, RAID 2 offers no significant advantages over other RAID architectures and is not supported by Adaptec RAID controllers.
RAID 3
RAID 3, as with RAID 2, sector-stripes data across groups of drives, but one drive in the group is dedicated to storing parity information. RAID 3 relies on the embedded ECC in each sector for error detection. In the case of drive failure, data recovery is accomplished by calculating the exclusive OR (XOR) of the information recorded on the remaining drives. Records typically span all drives, which optimizes the disk transfer rate. Because each I/O request accesses every drive in the array, RAID 3 arrays can satisfy only one I/O request at a time. RAID 3 delivers the best performance for single-user, single-tasking environments with long records. Synchronized-spindle drives are required for RAID 3 arrays in order to avoid performance degradation with short records. Because RAID 5 arrays with small stripes can yield similar performance to RAID 3 arrays, RAID 3 is not supported by Adaptec RAID controllers.
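The XOR recovery step described above can be shown in a few lines. This is a sketch under simple assumptions: each drive's contribution is modeled as an equal-length byte string, and the helper name xor_blocks is hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of equal-length byte strings."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data drives plus one parity drive (contents are illustrative).
data = [b"RAID", b"disk", b"data"]
parity = xor_blocks(data)

# Suppose the second data drive fails: XOR the survivors with the parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data[1])
```

Because XOR is its own inverse, the same routine both computes parity and regenerates the contents of any single missing drive.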
RAID 4
RAID 4 is identical to RAID 3 except that large stripes are used, so that records can be read from any individual drive in the array (except the parity drive). This allows read operations to be overlapped. However, since all write operations must update the parity drive, they cannot be overlapped. This architecture offers no significant advantages over other RAID levels and is not supported by Adaptec RAID controllers.
RAID 5
RAID 5, sometimes called a Rotating Parity Array, avoids the write bottleneck caused by the single dedicated parity drive of RAID 4. Under RAID 5, parity information is distributed across all the drives. Since there is no dedicated parity drive, all drives contain data and read operations can be overlapped on every drive in the array. Write operations will typically access one data drive and one parity drive. However, because different records store their parity on different drives, write operations can usually be overlapped.
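To see why rotating the parity lets writes overlap, consider a sketch that reports which drives a small write touches. The rotation rule used here (parity moves one drive to the left on each successive stripe row) is just one possible layout, assumed for illustration; real controllers may rotate parity differently, and both function names are hypothetical.

```python
def parity_drive(stripe_row: int, drive_count: int) -> int:
    """Drive holding parity for a stripe row (assumed rotation scheme)."""
    return (drive_count - 1 - stripe_row) % drive_count


def drives_touched_by_write(stripe_row: int, data_slot: int, drive_count: int) -> set:
    """Drives accessed by a small write: one data drive plus the row's parity drive.

    data_slot is the position (0 .. drive_count - 2) of the data stripe in the row.
    """
    p = parity_drive(stripe_row, drive_count)
    data_drives = [d for d in range(drive_count) if d != p]
    return {data_drives[data_slot], p}


# Hypothetical 4-drive array.
write_a = drives_touched_by_write(stripe_row=0, data_slot=0, drive_count=4)
write_b = drives_touched_by_write(stripe_row=1, data_slot=1, drive_count=4)
print(write_a, write_b, "can overlap:", write_a.isdisjoint(write_b))
```

Two writes to different rows often touch disjoint pairs of drives and can proceed in parallel, whereas under RAID 4 every write would contend for the same parity drive.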
STANDARD RAID TYPES: THEIR ADVANTAGES AND DISADVANTAGES
RAID 0: Also known as 'striping', this is technically not a RAID level since it provides no fault tolerance. Data is written in blocks across multiple drives, so one drive can be writing (or reading) a block while the next is seeking the next block. The advantages of striping are the higher access rate and full utilization of the array capacity. The disadvantage is that there is no fault tolerance: if one drive fails, the entire contents of the array become inaccessible.
RAID 1: Mirroring provides redundancy by writing twice - once to each drive. If one drive fails, the other contains an exact duplicate of the data and the controller can switch to using the mirror drive with no lapse in user accessibility. The disadvantages of mirroring are no improvement in data access speed, and higher cost, since twice the number of drives is required (50% capacity utilization).
RAID 3: RAID level 3 stripes data across multiple drives, with an additional drive dedicated to parity, for error correction/recovery. RAID 3 is not found on all controllers.
RAID 5: RAID level 5 is the most popular configuration, providing striping as well
Which RAID level should I use?
The PAC (Performance Availability Capacity) strategy is one method of assessing which RAID level is most appropriate. Performance is how quickly the data can be accessed. Availability refers to fault tolerance (if a drive fails the data is still available). Capacity refers to how efficient the data storage is (how many drives are required for a given array size).
RAID 0 has the best performance and capacity, but the lowest availability (no fault tolerance). If one drive fails, the entire array fails because part of the data is missing with no way to recover it other than restoring from a backup.
RAID 1 has the highest availability but lowest capacity, since twice the number of drives are required. Performance is roughly the same as for a single drive, although in some instances the dual write may be somewhat slower.
RAID 0+1 offers some performance improvements by striping, then mirroring the striped array, but capacity is low since the mirror requires a duplicate set of drives.
RAID 5 has moderate benefits in all three areas, so it ranks roughly in the middle. Read performance can be as fast as RAID 0, but write performance is slower, since the parity information must be calculated and written along with the data. Capacity is higher than for RAID 1 but not as high as with striping, since the array uses additional space for the parity information. Availability is high with RAID 5 because of the fault tolerance - if a drive fails, the missing data is recalculated from the remaining operational drives.
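The capacity leg of the PAC comparison can be made concrete with a small calculation. This sketch assumes equal-sized drives, treats RAID 5 as spending one drive's worth of space on parity, and uses a hypothetical function name and level labels.

```python
def usable_drives(level: str, drive_count: int) -> int:
    """Number of drives' worth of usable capacity for a given RAID level."""
    if level == "0":
        return drive_count          # striping only, no redundancy
    if level in ("1", "0+1"):
        if drive_count % 2:
            raise ValueError("mirroring needs an even number of drives")
        return drive_count // 2     # every drive has a duplicate
    if level == "5":
        if drive_count < 3:
            raise ValueError("RAID 5 needs at least three drives")
        return drive_count - 1      # one drive's worth of parity
    raise ValueError("unsupported level: " + level)


# Compare capacity efficiency for a hypothetical 8-drive array.
for level in ("0", "0+1", "5"):
    usable = usable_drives(level, 8)
    print("RAID", level, "->", usable, "of 8 drives usable",
          "({:.1%})".format(usable / 8))
```

The ordering matches the discussion above: striping uses all of the raw capacity, mirroring half of it, and RAID 5 falls in between.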
RAS (Reliability, Availability, Serviceability)
RAS Definitions
Let's examine the three main considerations for evaluating a RAID storage solution from a data availability standpoint: reliability, availability, and serviceability.
Reliability
Reliability describes when, or how often, you can expect the item in question to fail. Typically expressed as Mean Time Between Failures (MTBF), this metric is used to quantify hardware component failures that follow an exponential failure distribution. For instance, disk drive manufacturers claim MTBFs of 300,000 to 800,000 or more hours. Those disk drive MTBFs sound good, but that's only part of the picture. What is stated on a specification sheet may represent the average of the population, not your drive in particular. Your drive's environment may not be optimal, either because a fan in the server packaging is not running optimally, or because your system experiences a power surge that cripples your disk drive. The manufacturer may specify theoretical, not operational MTBFs, where theoretical MTBF specifications are derived from mathematical models of empirical field data of the individual drive components. Theoretical MTBFs do not account for failures due to drive infancy, manufacturing-induced defects, drive returns in which the failure cannot be repeated (i.e., NTFs - No Trouble Founds), and damage due to improper handling.
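To put an MTBF figure in perspective, the exponential model mentioned above can be turned into a probability of failure over a given period. This is a sketch under idealized assumptions: a constant failure rate, independent drives, and an illustrative 500,000-hour MTBF; as noted above, real drives can fare worse than this model suggests. The function name is hypothetical.

```python
import math

HOURS_PER_YEAR = 24 * 365


def failure_probability(mtbf_hours: float, hours: float, drive_count: int = 1) -> float:
    """Probability that at least one of drive_count drives fails within `hours`.

    Exponential model: P = 1 - exp(-drive_count * hours / MTBF).
    """
    return 1.0 - math.exp(-drive_count * hours / mtbf_hours)


one_drive = failure_probability(500_000, HOURS_PER_YEAR)
ten_drives = failure_probability(500_000, HOURS_PER_YEAR, drive_count=10)
print("one drive, one year:  {:.1%}".format(one_drive))
print("ten drives, one year: {:.1%}".format(ten_drives))
```

Even with a generous MTBF, the chance that some drive in a multi-drive system fails within a year is far from negligible, which is the motivation for the protection measures discussed next.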
Your disk drives or other components of your system will eventually fail. If a critical drive fails, such as a boot drive or a drive containing payroll information, your entire organization may be affected. Software is just as likely to fail as hardware.
So, how do you protect your critical data? Implementing RAID technology, either software or hardware-based, is a logical first step in protecting your data from disk drive failures. RAID technology should be deployed on any server or workstation where the cost of lost data or downtime warrants it. But that's just the beginning. There are other availability and serviceability features that you should examine to determine the optimum RAID solution for your environment.
Availability
Data availability is defined as having your data accessible at all times. There are two components to data availability: data integrity and fault tolerance.
Data integrity: Data integrity means getting the correct data, every time. Most RAID solutions offer dynamic sector repair, where sectors that are defective due to soft media errors are repaired on the fly. The real differentiating factor is the amount of error correction and error detection code provided. Software-based RAID typically relies on a standard SCSI bus for data integrity protection, where it can detect 1-bit errors but has no ability to correct any errors. Hardware-based RAID solutions usually contain more robust code. For instance, Adaptec's hardware-based RAID solutions not only detect 4-bit errors, but also correct 1-bit errors on the entire data path from the storage media to the host system bus.
Fault tolerance: Fault tolerance is defined as maintaining data availability in the event of one or more failures in the system. The most common method of achieving fault tolerance on servers and workstations today is RAID technology. Each RAID level offers different tradeoffs on performance, cost, and availability, and as such, it may be appropriate to use different RAID levels for different applications - even on the same server or workstation. RAID 0 (i.e., striping) should only be used in high performance applications that can afford downtime and/or lost data. Critical files in which an outage would severely cripple business activities, such as boot drives, would best be protected using RAID 1 (i.e., mirroring), or for even better performance, RAID 0/1 (mirrored striping). Most applications can best be protected by RAID 5 (striped parity), which offers the best balance between performance, cost and availability.
Drive hot swap is defined as the ability to pull out and replace a drive while the system is running and data is being accessed. With warm swap, you must first pause activity on the SCSI bus before removing the drive. More sophisticated hardware-based RAID solutions also offer the option of either dedicating spares to each array, or using a pool of spares for all arrays to draw on. Using dedicated spares on the most critical applications eliminates contention for spares in the event of multiple drive failures. Pooling spares is a more cost-effective method of data availability that is appropriate for less critical applications.
The next level of fault tolerance protection is to add redundancy to non-disk drive components. The downside is that it significantly increases the cost of your configuration. Typical areas to add redundancy include packaging, such as extra fans, dual I/O paths from the server to the disk drive (i.e., redundant controllers), and multiple servers. Using an Uninterruptible Power Supply (UPS) is also a good idea. Some software-based RAID solutions support disk duplexing, a form of mirroring (RAID 1) using redundant controllers where each disk drive is attached to a separate controller, thereby eliminating the controller as the single point of failure. The hardware-based RAID equivalent solution is called active-active controller failover, available only on more expensive, high-end external RAID controllers such as Data General's CLARiiON series.
Server redundancy is most cost-effectively achieved through clustering, such as that offered on Microsoft's Windows NT Server 4.0 Enterprise Edition or many Unix and mainframe computer systems. With clustering, multiple servers access the same storage. In the event of a server failure, data on the disk drives can still be accessed using other servers in the cluster. Hardware-based external RAID controllers are typically used to provide RAID protection for the disk drives in a clustered environment. In very high-end mission-critical applications, remote mirroring (RAID 1) software such as that offered by Compaq/Digital's OpenVMS is employed to mirror data to a remote site for disaster protection. Such configurations are very expensive, because the entire server configuration is duplicated at an offsite location.
Serviceability
Serviceability describes how fast and easy it is, in the event of a failure, to detect and isolate the failure, repair or replace the failed component, and reset the application or operating system. Serviceability also includes preventive maintenance features that help you monitor and replace marginal components before they fail. S.M.A.R.T. and SAF-TE are two standards that have emerged in recent years that should be employed on any serious RAID implementation. Configurations supporting the S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) standard monitor disk drives and report any out-of-threshold conditions that may signify a potential failure to the array card or server management software, permitting you to replace the drive before it fails.
Configurations supporting SAF-TE (SCSI Accessed Fault-Tolerant Enclosure) monitor and report enclosure conditions to array or server management software, assisting in alerting and isolating enclosure-related failures. In either case, you need to check not only that the disk drives are S.M.A.R.T.-compliant or the enclosure is SAF-TE-compliant, but also that the RAID card's management software and operating system support these standards. Many software- and hardware-based RAID solutions support S.M.A.R.T. and SAF-TE. However, just as there are many different vendor implementations of SCSI drives, there are many different implementations of SAF-TE enclosures, all of which need to be tested for compatibility to ensure that enclosure-related events are properly reported and interpreted by the card and RAID management software.
With Microsoft NT software-based RAID, drive and enclosure events are reported via SNMP to the general management log, a log that contains storage- as well as server- and network-related events. The system manager can then employ a filter to view only storage-related events. Each storage installation can only be monitored locally on each server, so the system manager must physically "make the rounds" to monitor each RAID installation. Many hardware-based RAID solutions offer RAID management software specifically designed not only to configure and manage RAID arrays but also to report storage-related events.
The more sophisticated of these RAID management software packages categorize errors and events by severity, for example with color-coded alerts highlighted in yellow for a potential problem and red for an actual component failure. Some even e-mail, fax or page the system manager in the event of alerts requiring immediate attention, greatly increasing the system manager's ability to detect problems and decreasing the time it takes to bring the storage subsystem back up to full operational capability. Others allow you to manage, monitor and in some instances repair all hardware-based RAID installations from a single station, even remotely.
MANAGEMENT OF DATA
With the explosion of on-line data, the cost of managing that data has escalated as well. For every dollar spent on initial storage purchase, various estimates calculate that another $5 to $7 is spent managing the storage. These figures include the cost of installing, configuring, monitoring, and optimizing the on-line storage for performance, as well as backing up, restoring, and archiving the data. For smaller businesses and IT sites that can't afford a dedicated or sophisticated IT staff but need to protect their valuable data, storage management ease of use is of paramount importance. Let's examine the manageability issue from two aspects: how easy it is to install and configure a software- or hardware-based RAID, and how easy it is to monitor and proactively manage the RAID installation.
Configuring RAID
There can be significant differences between RAID solutions in both the ease of configuring arrays and the degree to which you can tune your arrays for optimum performance or functionality. Does the solution offer a streamlined configuration "wizard" that uses default settings to help first-time users get up and running quickly? For more sophisticated users, advanced features like variable stripe depths, spare allocation (either dedicated or global) and setting drive reconstruction priorities (low/medium/high) become important differentiators. Windows NT RAID software uses one stripe depth - 64 kB, based on research that concludes most applications achieve optimal performance with stripe depths between 64 kB and 128 kB. Windows NT does not offer spare allocation because failed drives are manually replaced by the system manager. In contrast, hardware-based RAID solutions typically offer a variety of stripe depth options, such as 8, 16, 32, 64 and 128 kB. More sophisticated hardware solutions also offer spare allocation and priority settings on drive reconstruction.
Managing RAID
As discussed in the previous section on serviceability, some of the key differences in managing software- and hardware-based RAID solutions center on the ease of identifying and reporting errors. Hardware-based solutions typically offer more sophisticated management software features such as alerts color-coded by severity, e-mail, fax or pager notification of errors, and remote management of multiple RAID installations. But this is just the beginning. Graphical user interfaces (GUIs) that employ a Windows-like look and feel with pull-down menus, property tabs, physical and logical views in drill-down Windows® Explorer-type tree structures, and detailed views can make a huge difference in the ease of managing your storage. Not all RAID solutions offer GUIs.
Unlike software-based solutions, hardware-based RAID solutions allow monitoring and management of RAID configurations on multiple operating systems such as Windows NT and Novell Netware. The ability of hardware-based solutions to remotely manage RAID storage means that you can initialize new arrays and reactivate offline arrays without ever leaving your desk. More sophisticated hardware-based RAID management implementations support preventive maintenance activities, not only monitoring card, drive, and enclosure fan and temperature status but also testing hot spares, verifying parity information, and reconstructing the information on a failed drive. Some even allow you to schedule these activities, thus eliminating the need for manual intervention and minimizing impact on server performance. Another distinguishing feature among RAID management implementations is the ability to poll servers, networks and non-RAID configurations, so that downtime conditions are more quickly detected and isolated.
Performance Considerations
Running benchmarks in a controlled environment, such as the Ziff-Davis WinBench® 97 benchmark whose results appear in the AAA®-131CA PCI Array Card Series report, is a useful method for comparing performance. This report concludes that Adaptec's hardware-based RAID solutions demonstrate a consistent performance advantage over NT software-based RAID. Certain applications such as NASTRAN, Adobe AfterEffects, Adobe Photoshop and AutoCAD may see a significant performance improvement due to card-based caching used on AAA-131CA workstation cards because of more efficient cache flush operations, reduced disk drive head thrashing, fewer cache misses, and more writes at memory rather than disk speeds. For a more comprehensive discussion on the benefits of card-based caching, see Performance Benefits of a Caching RAID Coprocessor in PC NT Workstations. But just as your car mileage may vary from EPA mileage ratings, performance on your RAID storage will vary based on your system configuration and application environment. Whether the performance differences are enough to warrant selecting a hardware-based solution is the tricky part. However, since most applications can be characterized as being CPU-bound, I/O-bound, or a mixture of both, an empirical discussion of software- and hardware-based RAID solutions may be helpful in determining which solution is best for you.
CPU Bound Applications
The argument in favor of hardware-based RAID in CPU-bound applications is a straightforward one. Offloading RAID 5 parity calculations and RAID 1 secondary writes to a separate hardware-based RAID co-processor reduces CPU interrupts, freeing the main CPU to perform other compute-intensive functions. I/O traffic on the main PCI bus is reduced, so that other activities such as network traffic can be processed more efficiently. The performance advantages of hardware-based RAID are especially pronounced when RAID 5 data sets are operating in degraded mode (i.e., a drive in the array has failed), because both read and write requests require parity calculations, significantly increasing CPU interrupts and I/O traffic.
I/O Bound Applications
In I/O-bound applications, the differences between software- and hardware-based RAID are less apparent. Clearly, if disk drives are the bottleneck, whether the parity calculations are performed in the main CPU or RAID co-processor will make little difference in overall system performance. However, there are some situations where hardware-based RAID may be advantageous. You could see a significant improvement in mirrored drive (RAID 1) performance if you implement striping and mirroring (RAID 0/1), not available on Windows NT or Netware software-based RAID implementations. With RAID 0/1, not only could your application experience improved read and write times due to simultaneous multiple drive accesses, but also more consistent and predictable performance due to the load balancing effect of RAID 0. If your application is already I/O-bound, a failed drive in a RAID 5 data set can have a paralyzing effect on system performance. Hardware-based RAID solutions that support automatic failed drive detection with hot spare replacement can significantly reduce the amount of time your application is running in degraded mode, because the application does not have to wait until you physically replace the failed drive. Hardware-based RAID solutions that allow you to set the priority (low/medium/high) for array reconstruction give you control over the tradeoffs you are willing to make between overall system performance and availability. Hardware-based RAID solutions can improve system boot time and operating system performance by striping the operating system files, a feature not supported on NT's software-based RAID implementations.
COST
Clearly, the up-front costs of software-based RAID are hard to beat. For independent software RAID packages, there's just the cost of a software license and software installation. There are no acquisition costs for operating systems supporting embedded RAID, and since you're installing the operating system software anyway, the incremental installation costs are zero. Getting something for free is easy to cost-justify to management, and basic RAID protection is better than no protection at all.
Most common sources of problems with RAID?
The most common source of a wide variety of symptoms, including data loss, is incorrect cabling and termination. Use good quality cables which are no longer than absolutely necessary. Provide good active termination at the end of the SCSI bus. Use a separate terminating plug rather than using the termination on the hard drive, to avoid loss of termination if that drive fails or is removed. Resource conflicts are another common source of problems. Incorrect interrupts or no interrupt assigned, onboard controller chips which have not been disabled when not in use, non-compliant motherboards, and so on, can all result in system lockups, installation aborts, and generally erratic behavior. Drive mismatches can cause problems with the array and its performance. Ideally, all the drives in the array should be identical, including the firmware version on the drive itself, since different versions of the drive firmware code can result in differences in access speed, queuing algorithms, and head movement optimization. Most disk manufacturers will provide firmware updates if multiple versions have been released over the life of a particular model.