2005-09-04

Fake RAID (FRAID) sucks even more at RAID-5

Well, this keeps coming up, so it's time for a blog.

Fake RAID (FRAID) is not hardware RAID, it never will be

I continue to loathe Fake RAID (FRAID) implementations. I regularly run into discussions from both end-users and even MCSEs with servers who love FRAID. They think it's a cheap way to heaven and redundancy. And they have the CPU utilization to prove it (or so they think -- as we'll discuss)!

For those that don't know, Fake RAID (FRAID) is extremely popular because it requires *0* additional hardware. It's not hardware RAID, because it uses your main CPU -- host RAID (not host adapter RAID, which is an intelligent RAID card). Your main CPU does all RAID functionality at all times.

When in the BIOS, you use the 16-bit Int13h disk services that have been added to an ATA channel. This is what turns an ATA controller into a FRAID controller. In fact, many "regular" ATA cards could be turned into a FRAID controller with a simple BIOS flash (maybe with a jumper trace added or pull-down resistor) because ATA cards and their FRAID versions are *0* different in hardware.

Once the 32-bit/64-bit OS loads, the FRAID driver is required. The FRAID driver is both an interface and, more importantly, the RAID logic. The RAID logic is typically licensed from a 3rd party**, meaning it's proprietary** and different vendors/cards have slightly varying versions. It means the driver is a bloated mess of CPU commands to do software RAID. All data must travel up to the CPU, instead of going by direct memory access (DMA) from memory to I/O directly. It's not the CPU or XOR instruction that loads the system, it's all the load, store and other duplication in the system interconnect.

**NOTE: Hence why Linux GPL drivers are virtually impossible, and even though a GPL FRAID logic exists (ataraid.c), the vendor interface drivers (hptraid.c, pdcraid.c, silraid.c, etc...) are never well-aligned with various card implementations that vary by release.

ICH7R/MCP-04, RAID-5 goes mainstream with 15MBps writes (yeah, it sucks hard!)

RAID-5 is absolutely detrimental in software. New benchmarks at GamePC clearly show how bad it gets with the new Intel I/O Controller Hub 7 RAID (ICH7R) and nVidia Media and Communication Processor 04 (MCP-04) peripheral controllers. Everyone wants to talk read performance, but they don't like to talk about write or -- gasp -- rebuild performance. Well, just looking at write performance for a single write that is 1GB or less:
GamePC RAID-5 Page 9

Now this operation is PURE DESKTOP! One large file copy of sizes no larger than 1GB. That's not even putting a dent in the memory, and it's only one operation. And in the ICH7R and MCP-04, the SATA channels are on a DEDICATED 250MBps PCI-Express x1 channel (PCIe x1). But even then ... the result?

15MBps! Welcome back to i486-era Programmed I/O (PIO)!

That absolutely sucks. You have disks today of 50-80+MBps, and you can't even break old Programmed I/O (PIO) Mode 4 or Mode 5 (16.6MBps or 22MBps, respectively) performance. In fact, that's basically what the problem is. Instead of pushing the data stream via direct memory access (DMA) transfers from memory to I/O, without bothering the CPU, the FRAID driver is doing programmed I/O (PIO) through the CPU. The FRAID driver has turned your DMA capable drives into CPU PIO driven devices -- quite technically! As any ATA storage benchmark shows, it's very, very difficult to get more than 15-20MBps with PIO today -- because the CPU interconnect is completely saturated with operations that it was NEVER intended for.
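To put those figures side by side, here is a quick back-of-the-envelope calculation in Python. The rates are the ones quoted above; treating 1GB as 1000MB and using 60MBps as a stand-in for a "50-80+MBps" disk are my own simplifications, and the function names are made up for this sketch.

```python
# Rates quoted in the text, in MBps. The 60 MBps disk figure is an assumed
# midpoint of the "50-80+MBps" range, not a measured value.
RATES_MBPS = {
    "FRAID RAID-5 write (GamePC)": 15.0,
    "PIO Mode 4": 16.6,
    "single modern disk (assumed)": 60.0,
    "PCIe x1 link": 250.0,
}


def seconds_to_write(size_mb: float, rate_mbps: float) -> float:
    """Time to move size_mb megabytes at a sustained rate_mbps."""
    if rate_mbps <= 0:
        raise ValueError("rate must be positive")
    return size_mb / rate_mbps


def link_utilization(rate_mbps: float, link_mbps: float) -> float:
    """Fraction of a link's bandwidth that a given transfer rate uses."""
    if link_mbps <= 0:
        raise ValueError("link bandwidth must be positive")
    return rate_mbps / link_mbps


if __name__ == "__main__":
    # Treat the 1GB test file as 1000 MB (an approximation).
    for name, rate in RATES_MBPS.items():
        print(f"{name:30s} {seconds_to_write(1000, rate):6.1f} s per 1GB")
    used = link_utilization(15.0, 250.0)
    print(f"FRAID RAID-5 uses {used:.0%} of the PCIe x1 link")
```

In other words, a 1GB copy that one plain disk could absorb in well under half a minute takes over a minute on the FRAID array, while the link it sits on is almost idle.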

Back in the days of a non-superscalar i486 that could barely push more than 133MBps, and not even that close before synchronous timing, it was fine to do 8-16MBps for that period's Enhanced Small Disk Interface (ESDI), the father of Integrated Drive Electronics (IDE). So seeing a limitation back to 15-20MBps over the 250MBps PCIe x1 interface that the SATA channels of the ICH7R/MCP-04 use did not surprise me one bit -- because it matches the expected PIO Mode 4 performance of IDE, even when just writing for a single desktop process. The PC CPU-memory interconnect is not an I/O processor. It never has been, it never will be. Yes, Opteron's partial mesh of 2x DDR and 2-3x HyperTransport tunnels per CPU helps, but it's still not an I/O design meant to be a storage host and service host in one.

It's the same problem in using a PC for a router or network switch. Your CPU is well away from the Network Interface Card (NIC) through layers of interconnect, overhead and the fact that your CPU is a software driven processor. A network router or switch is an Application Specific Integrated Circuit (ASIC) or I/O Processor (IOP) that taps the network interfaces directly, processes frames/packets without much separation between it and the raw device. Specification wise, your CPU should be a much, much faster router/switch than a little, dedicated hardware device -- but it's not. Hence why your CPU cannot match a "storage switch" or "buffering storage controller" any better than it can a "network switch" or "buffering router."

You'll also note in the same article that write performance, still for a single write, is shown in the second graph:
GamePC RAID-5 Page 10 (see second graph for writes)

Here's where even the partially software-based Broadcom cannot compete with the Intel IOP331 (superscalar XScale I/O Processor) Areca ARC-1110. Now GamePC, in its continued ignorance, thinks it's a cache-based reason. They even (at the end of the article) say the Areca product is overpriced and don't know why. Well, duh, it's not just some "dumb" ATA channels with software -- it's a true, locally intelligent, off-loading RAID card.

And this is just DESKTOP performance. On a server, the multiple I/O requests would TRASH FRAID or even software RAID in queuing -- rendering the host system into a role that is primarily dedicated as a storage device. Much like putting a PC as a network switch and/or router would be -- quickly detrimental and not fit for the role.

POST NOTE: Software RAID will never cut it, unless that's all your system does (storage)

Which brings me to my final point. I do NOT call Software RAID done at the OS level FRAID. FRAID is Fake RAID done in a "dumb" ATA controller because ... why? ... the vendor can. And 90+% of consumers will believe it is hardware RAID. In fact, I often recommend Logical Disk Manager (LDM) on NT5+ (2000+) and Linux Logical Volume Manager (LVM) and/or Multi-Disk (MD) instead for higher performance. But in the end, whether FRAID or software RAID, it's not hardware RAID, even if software RAID isn't as bad as FRAID.

Now there's no end to Linux administrators who swear by Multi-Disk (MD) over Hardware RAID. In both cases, they don't look at all the facts, and make statements about hardware RAID that were NOT valid even 5 years ago. I get tired of these people, because they think I'm some fool who hasn't been deploying both Linux MD and intelligent RAID hardware solutions for 7+ years. Most of them are still just trying to get MD off-the-ground, or have run into their first MD "hiccup" or, worse yet, the (pun)myraid(pun) of issues of software storage layer upon software storage layer (a major issue of RACE CONDITIONS in the Linux kernel).

Intelligent Hardware RAID Sucks Fallacy #1:
I can move my disks between hardware

This is a typical answer from someone who has only used FRAID, or maybe an old i960 controller from DPT that is now dead and beyond its end-of-life. Most of these administrators have never dealt with the small changes in LVM/MD in various Linux versions. I have. I really HATE it when some small "layer" in the Linux kernel turns my LVM or MD volume into bits of no organization. So I really HATE it when I'm given this totally nonfactual statement. I've yet to meet someone who has moved MD volumes between systems of 3+.

3Ware, on the other hand, has maintained 5+ years of volume upward compatibility -- from the 5000 series to the latest 9000 series. I have no problem taking volumes from older devices to newer. Heck, I've even taken a RAID-10 volume from a newer 7500-4LP series to an older 6400 because RAID-10 has not changed since the 6.9 firmware, even in the 7.x firmware. It just works.

Case-in-point: There are hardware RAID vendors with extremely poor Linux history (e.g., Adaptec) and those with very good Linux history (3Ware) and those with a fairly good history (e.g., Symbios/LSI) and those that are now dead (e.g., DPT now Adaptec = crap, Mylex now LSI = good). I have stuck with 3Ware and Mylex/LSI with great results.

Intelligent Hardware RAID Sucks Fallacy #2:
Hardware RAID is slower

Now this is more directed at intelligent hardware RAID, and my absolute favorite! Major OEMs like Dell continue to sell 10 YEAR OLD hardware RAID designs with the Intel i960/IOP30x series. These are old, slow designs that can't break 50MBps with RAID-5. I stopped using them 6+ years ago, when I moved to the 3Ware Escalade 5000 (and, subsequently, the 6000/7000 series shortly afterwards), and to the Mylex eXtremeRAID 1100/2000 (DAC960) as well. People complain about cost, but 3Ware wasn't that expensive at all for ATA, and Mylex was the way-to-go if you were deploying SCSI (where disk cost is the biggest issue).

Every single time -- EVERY SINGLE TIME -- I get people talking about i960 solutions. I didn't use them 6+ years ago, so STOP USING THEM AS EXAMPLES! And start by stopping your purchases with OEMs that still sell that crap. ;->

I've heard all the excuses, and they are NOT valid with my PROVEN use of specific products. Most people do NOT "do their homework" and that's their problem. I did my homework long ago, and my clients have reaped the benefits from it. And that includes 5+ years of volume support, no messy issues with multiple layers in the kernel and upward compatibility of volumes with new devices -- for a few hundred bucks, and BETTER performance. It's well worth the reduction in headaches.

Intelligent Hardware RAID Sucks Fallacy #3:
How can RAID ASICs/IOPs compete with a modern CPU's performance?

If this was true, why don't we just use PCs instead of dedicated Ethernet switches and routers? We don't because even a "slow" 100-1,000MHz MIPS or XScale embedded microprocessor/microcontroller (uP/uC) or core in an Application Specific Integrated Circuit (ASIC) is designed to push data around, whereas a PC is designed to process data. The interconnect is everything.

The 10+GBps of CPU-memory interconnect is NOT designed for I/O! I'd much rather have a 1-3GBps I/O Processor (IOP) interconnect or 2-4GBps switch fabric that is designed to push data around, replicating (RAID-1), striping (RAID-0/3/4/5) and XOR'ing (RAID-3/4/5) in-line with my data coming over the I/O -- than pushing redundant and multiple copies up through a CPU-memory interconnect.
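For anyone who hasn't seen what "striping" and "XOR'ing" actually mean, here is a minimal sketch in Python. It is purely illustrative: the function names are mine, it works on in-memory byte strings rather than disks, and it is not how any particular controller or driver implements it. The XOR property itself (parity of all blocks, and XOR of the survivors with parity to recover a lost block) is the standard RAID-3/4/5 technique.

```python
from functools import reduce


def stripe(data: bytes, n_disks: int, chunk: int) -> list:
    """RAID-0 style: deal fixed-size chunks round-robin across n_disks."""
    disks = [[] for _ in range(n_disks)]
    for offset in range(0, len(data), chunk):
        disks[(offset // chunk) % n_disks].append(data[offset:offset + chunk])
    return disks


def parity(blocks: list) -> bytes:
    """XOR equal-length blocks byte by byte (RAID-3/4/5 parity)."""
    if not blocks or any(len(b) != len(blocks[0]) for b in blocks):
        raise ValueError("need one or more blocks of equal length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def rebuild(surviving: list, parity_block: bytes) -> bytes:
    """Recover the single missing block from the survivors plus parity."""
    return parity(surviving + [parity_block])


if __name__ == "__main__":
    data_blocks = [b"AAAA", b"BBBB", b"CCCC"]
    p = parity(data_blocks)
    # Pretend the disk holding the middle block died.
    print(rebuild([data_blocks[0], data_blocks[2]], p))
```

The math is trivial, which is the point: the cost is not the XOR, it is where the bytes have to travel to get XOR'ed.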

You see, in a RAID-5 write to a hardware RAID device, the data stream goes directly from memory over the PCI[e|-X] bus to the intelligent storage card. That card then handles all caching/buffering for duplication/XOR directly to the channels locally. If you use software, then that data has to go from memory up to the CPU for duplication/XOR -- it's not the CPU processing that kills it, it's the redundant data streams that eat up your I/O. Now that is then pushed back a second time (be it the duplication or parity) to memory before being committed to disk. And if a disk read is required for verification and other operations (such as during a rebuild) -- forget it! Your system is TOAST with load!
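A crude way to count what that costs is sketched below in Python. This is a deliberately simplified model and every part of it is an assumption of mine: it only covers a full-stripe write, it ignores CPU caches, read-modify-write on partial stripes and command overhead, and the function names are hypothetical. It just tallies how many bytes cross the host's memory interconnect in each case.

```python
def hardware_raid5_host_bytes(data_bytes: float) -> float:
    """Host-side traffic when the card computes parity itself.

    Assumed model: the data is DMA'd from memory to the card exactly once;
    parity is generated and fanned out on the card, off the host interconnect.
    """
    return data_bytes


def software_raid5_host_bytes(data_bytes: float, n_disks: int) -> float:
    """Host-side traffic when the CPU computes parity (full-stripe write).

    Assumed model, three passes over the memory interconnect:
      1. the CPU reads all the data from memory to XOR it,
      2. the CPU writes the parity block(s) back to memory,
      3. data plus parity are then transferred from memory out to the disks.
    """
    if n_disks < 3:
        raise ValueError("RAID-5 needs at least 3 disks")
    parity_bytes = data_bytes / (n_disks - 1)  # one parity chunk per stripe
    cpu_reads = data_bytes
    cpu_writes = parity_bytes
    transfer_out = data_bytes + parity_bytes
    return cpu_reads + cpu_writes + transfer_out


if __name__ == "__main__":
    size_mb = 1000  # the ~1GB desktop copy, treated as 1000 MB
    for disks in (3, 4, 8):
        hw = hardware_raid5_host_bytes(size_mb)
        sw = software_raid5_host_bytes(size_mb, disks)
        print(f"{disks} disks: hardware {hw:.0f} MB, software {sw:.0f} MB "
              f"({sw / hw:.2f}x) across the host interconnect")
```

Even in this best case the software path moves well over twice the data through the host, and that is before any reads for partial-stripe updates or a rebuild are added on top.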

Linux people are the worst to discuss this with because the Linux kernel has POOR utilities for measuring interconnect I/O. The only thing Linux can do is stat the amount of I/O services used by the CPU. Although it's a good way to detect how much I/O the CPU is directing, it doesn't tell you when the interconnect of the system -- especially a non "front-side bottleneck" design -- is being HOSED by redundant I/O streams. In fact, your CPU utilization can actually go down because the CPU is STARVED by the data transfers -- although the performance will clearly show on writes.
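As an illustration of the kind of number Linux does hand you, here is a small Python sketch that pulls the iowait share out of the aggregate "cpu" line of /proc/stat. The field order (user, nice, system, idle, iowait, irq, softirq, steal, ...) is the documented one; the function name and the sample line are mine. Note what it measures: time the CPU spent idle waiting on I/O, which is a CPU-side view and says nothing about how busy the interconnect itself is.

```python
def iowait_fraction(cpu_line: str) -> float:
    """Return iowait ticks as a fraction of total ticks for a /proc/stat cpu line.

    Expected layout: "cpu user nice system idle iowait irq softirq steal ...".
    Only the first eight counters are summed, because the guest counters that
    follow are already included in user/nice.
    """
    fields = cpu_line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    ticks = [int(value) for value in fields[1:]]
    if len(ticks) < 5:
        raise ValueError("line has no iowait field")
    total = sum(ticks[:8])
    if total == 0:
        raise ValueError("no ticks recorded")
    return ticks[4] / total


if __name__ == "__main__":
    # Hypothetical sample line; on a Linux box you would read the first
    # line of /proc/stat instead.
    sample = "cpu 100 0 50 800 50 0 0 0 0 0"
    print(f"iowait: {iowait_fraction(sample):.1%}")
```

These counters are cumulative since boot, so in practice you would sample twice and take the difference over the interval.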

The ONLY time software RAID is useful is when you have a DEDICATED storage device. That means all the device is doing is being a storage device. Your services are on ANOTHER host. So the CPU can be dedicated to those operations. On a system that is both storage and service, hardware RAID is always the best choice. E.g., 3Ware cards will keep not only I/O down, but keep the traffic off of your CPU interconnect (regardless of the 3% or less CPU utilization that 3Ware maintains just in overhead). And they will queue up a massive number of requests.

If you are both storage and services on one host, truly consider not putting the I/O burden on your CPU interconnect, and paying a few hundred bucks to save yourself headaches. Especially when it comes to volumes, etc... Especially given the management tools that 3Ware has provided, and continues to provide, in services like 3DM2.

Still don't believe me? Then why is Intel starting to put IOP33x on server mainboards?

Intel is moving to address the issue by starting to put its superscalar XScale I/O Processor (IOP) into new server mainboards, and possibly into future I/O Controller Hub (ICH) chip designs for servers themselves. The idea is to off-load I/O operations onto a processor that is dedicated to such functionality, and not tie up the Memory Controller Hub (MCH) with redundant operations that trouble the CPU with redundant copies/processes. The queuing, buffering and other operations that can be eliminated are a major bonus.

There is nothing worse than tying up a service host that is supposed to be servicing and operating on data with I/O operations that can be handled much better, much closer to the actual storage and its buffer. Intel realizes this, and the advantage its server mainboards can gain with an IOP on-board, or even in the ICH itself. But the ICH7R isn't it, and it probably will never be on a desktop mainboard anyway -- even though that's what I see 80% of sysadmins and even some "fly-by-night" system integrators still use for servers.

Comments:

PAStheLoD said...

Very insightful and informative post. I've been totally decieved by the FRAID on my nForce4 MB. RAID-5 was überSLOW, and there was no real explanation except "crap by design", now I know in fact,this is the case.

RAID-0 is better, however, very far from "ideal" :)

4 SATA-II drives should give more than 100 MBps and they do on HW RAID.

And I'm in even bigger misery when I try to simply create a partition on a FRAID volume.. no go. :|

Will said...

Very good article.

Only thing I noticed amiss was your spelling of "fallacy."
