2005-09-25

MicroATX Power: nVidia C51 (GeForce 61x0 + nForce 4x0)

Well, in case you haven't heard, nVidia has finally released its long awaited C51 nForce 400 series with integrated GeForce 61x0 graphics processor units (GPU). These devices are designed explicitly for the MicroATX form-factor, providing quite a punch in a "small enough" form-factor out-of-the-box (see my prior commentary on the "Small Enough" Form-factor PC).

It's a 2-chip HyperTransport solution: your choice of the 6100 or 6150 GPU paired with the nForce 410 or 430 peripheral logic. AnandTech has the skinny.

The nVidia 6150/430 combination offers HDTV out, including 1080p (16:9 1920x1080@60Hz), as well as all other, standard HDTV modes -- 1080i, 720p and 480p. This makes it an ideal HDTV set-top solution for both today's and the future's HDTVs. Adding the HDTV3000 with a Linux/MyTV install should give you a $500 (including software) set-top solution that might not be a bad gaming solution either. 30fps should be possible at 720p (16:9 1280x720@60Hz) gaming resolutions in UT2K4, Far Cry and most other titles. The exceptions are the latest games (e.g., Doom3, Battlefield2, etc...), where you will probably have to drop the resolution further (maybe 480p, 16:9 720x480@60Hz).
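To put the "drop the resolution" trade-off in numbers, here is a minimal sketch that compares the per-frame pixel counts of the modes listed above. It assumes GPU load scales roughly with pixel count, which is only a rule of thumb, and the function names are made up for illustration.

```python
# Pixel counts for the HDTV modes quoted in the text (width x height).
MODES = {
    "1080p": (1920, 1080),
    "720p": (1280, 720),
    "480p": (720, 480),
}


def pixels(mode):
    """Pixels per frame for a named mode."""
    width, height = MODES[mode]
    return width * height


def relative_load(mode, baseline="720p"):
    """Pixel count of `mode` relative to `baseline` (a rough proxy for fill cost)."""
    return pixels(mode) / pixels(baseline)


if __name__ == "__main__":
    for name in MODES:
        print(f"{name}: {pixels(name):,} pixels, {relative_load(name):.3f}x of 720p")
```

By this measure 1080p pushes a bit more than twice the pixels of 720p, and 480p well under half, which is why dropping to 480p is a plausible escape hatch for the newest titles.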

AnandTech offers a first-look review of the Biostar TForce 6100-939, which is the entry-level 6100+410 combination and is now available at Newegg.COM for $79. Note this does not have HDTV out, and the performance will be slightly slower, although it's no slouch.

The entry-level $75 Biostar TForce 6100-939 lacks GbE and uses the slightly slower GeForce 6100, but it's still a great set-top compared to anything Intel or even ATI offers, and a faster NIC (via the PCIe x1 slot) and/or faster GPU (via the PCIe x16 slot) are always upgrade options.

UPDATE: NewEgg has just added an even cheaper $61 Socket-754 entry, the Biostar GeForce 6100-M7 to the line-up. For all intents and purposes, it's basically the same feature set as the previously mentioned Socket-939 version, but offers a lower, total entry point since it uses Socket-754 processors (such as the Sempron 64).

Foxconn also offers a pair of C51 solutions, the 6100+410 combination in the Foxconn 6100K8MA-RS that is pre-ordering for $80+ and the 6150+430 combination in the Foxconn 6150K8MA-8EKRS that is pre-ordering for $95+. One disappointment in the Foxconn solutions is the lack of a PCIe x1 slot, which prevents storage or network upgrades. But I guess some might like the three (3) 32-bit/5V PCI slot options instead of only two (2)? Personally, I'd much rather have the PCIe x1 slot option at the cost of a PCI slot.

2005-09-23

Travla C156 -- 1.2G/512M/40/DVD in 7x7x2.7"

For some prototypes, we're using the x86 ViA EPIA MII (Mini-ITX with CardBus/CompactFlash) solution. The Travla C156 is a 7" x 7" x 2.7" enclosure that fits the 6.7" x 6.7" Mini-ITX board, while having not only enough clearance for the CardBus/CompactFlash at the rear, but also a slimline CD/DVD and a 2.5" notebook hard drive. Power is provided via an internal 12V DC to 3.3/5/12V ATX 60W unit, with an external AC/DC adapter (NOTE: we like having 12V DC, since our portable/standalone power is far more efficient without having to convert to AC only to go back to DC).

Specifications:
- Travla C156 Mini-ITX 7" x 7" x 2.7" enclosure
Includes 60W 12V@5A DC/DC PS (internal) and 100-240V@1.8 AC/DC Adapter (external)
- ViA EPIA MII12000 Mini-ITX 6.7" x 6.7" mainboard with 1.2GHz ViA C3
MII series includes CardBus and CompactFlash slots
NOTE: The MII6000 with a 600MHz ViA C3 is a passively cooled option (although harder to find)
- 512MiB DDR266 (PC2100) SDRAM low-profile
- 40GB 4200rpm 2.5" notebook hard drive
- 24x CD-ROM (or 8x DVD-ROM) slimline optical drive

Front view in comparison to the Chenming 118 (MicroATX/ATX 11" x 14" x 9"):


Rear view in comparison to the Chenming 118:


Close-up rear view of on-mainboard MII capabilities, including our proprietary, ad hoc, de-centralized, self-configuring 2.4GHz wireless technology (1.5-6Mbps up to several miles). Yes, that's a handle at the top that the antenna is wrapped through (making the C156 enclosure very easy to carry):


We're gearing up to support Hurricane Rita. We are increasingly using our own prototypes for field units as our existing Hurricane Katrina support efforts have most of our existing capabilities deployed. Thanks go to Casetronic Engineering in California for providing ITX components on such short notice. The ViA MII with CardBus and CompactFlash slots already mounted is a Godsend in a general x86 platform.

Without any other communication capabilities in the aftermath of a storm, we can deploy these boxen and build square miles of communication capabilities usable by VoIP phones, notebook computers, etc... They can be vehicle mounted (which is why 12V DC is ideal ;-), and work while traveling at 200+ mph. They are decentralized, which means if several units are not in range of other units, they will continue to communicate with each other ad hoc -- and automatically renegotiate when they do come back into range of other units.

Eventually we'll use a read-only CompactFlash boot (instead of the 2.5" HD) in a NEMA-certified enclosure, and finally a non-x86 solution for more "bang for the buck." But for now, these Mini-ITX solutions are great for ~$500 and build upon our x86 prototypes.

2005-09-11

Linux Servers: Eccentric Practices for Disk Slicing

This is a "follow-up" blog from my previous blog entry on filesystem fundamentals.

This blog focuses entirely on Linux servers and my "best" (some possibly "eccentric") practices when it comes to Disk Slicing (commonly referred to, but not totally accurately, as "disk partitioning" -- there is a reason for the legacy UNIX terminology, long story), with some basic hardware, system and other service assumptions.

Hardware Assumptions and Considerations

I assume you are either using "raw" disks, or you are using true, intelligent hardware RAID like a 3Ware Escalade or LSI MegaRAID "X" (XScale-based) product. In all examples, I will use "sda, sdb, etc..." (and LVM/LVM2 volume groups, vg, and logical volumes, lv), but they could be applicable to "hda, hdc, etc..." as well if you are using ATA, SATA (SATA channels with non-SCSI module drivers), etc...

The reason why I do NOT recommend software RAID is simple: there have been nagging issues with the layers of filesystem abstraction in the kernel, and it gets NO BETTER in kernel 2.6. The Multi Disk (MD) driver is not bad, but it has its issues, and those have varied over time -- whereas 3Ware has given me a 6+ year track record of perfection, including upward compatibility. In addition to various race conditions, MD provides little "automation" when it comes to recovery -- which can be a KEY issue. Some think the answer is in FRAID (Fake/Free RAID -- regular ATA controllers with a "trick BIOS" and 100% software driver; 99% of $50 cards and nearly ALL mainboard ATA RAID are FRAID), but that actually makes things worse for Linux (long story).

MD with Logical Volume Management (LVM) is a major issue, as LVM has lots of race conditions (especially kernel 2.6 with LVM2 and the Device Mapper, DM). I only use the flexibility of LVM for disk slice management -- e.g., just for adding new volumes without rebooting (where slicing the "raw" hda/sda disk device would not work because a reboot is required to re-read the BIOS/DOS disk label aka "partition table"). I don't use LVM for anything else, except for the extremely rare resize (although I would like to utilize LVM2+DM for snapshots should the race conditions ever be worked out).

Because in the end, a measly $100 gets you a great chunk of "peace of mind" with a 3Ware 7006-2 (ATA) or 8006-2 (SATA). 4-channel cards are well under $300 and 12-channel cards can be had for under $500 -- adding extremely little to the overall disk cost (let alone overall server cost). It's damn well worth it, especially with 3DM/3DM2 monitoring, let alone the true hardware ASIC, 0 wait state SRAM, etc...

Lastly, the hardware aspects of these "best practices" assume you are using local storage (raw disk or intelligent, hardware RAID), with up to 2, internal RAID controllers (possibly on different PCI-X channels, striped across each volume), and not going to a SAN or external disk. In SAN cases, the layout is up to the external subsystem, and other considerations (including performance) can vary wildly -- and actually change nearly ALL considerations (because a new set of criteria applies). In fact, I will blog on SAN (iSCSI, FC-AL and even SAS) if I get enough requests.

System v. Data Volumes

On servers, I refer to filesystems in terms of "system" or "data" volumes. This is for reasons of both recoverability and performance. You do not have to follow these guidelines, although it is always good to know how different filesystems work. E.g., try to avoid putting swap, /tmp and /var on RAID-5.

A system volume on my servers is ALWAYS its own disk, or its own RAID-1 (2+ disk mirroring), RAID-1e (2+ disk stripe+mirror simultaneously in hardware) or RAID-10 (4+ disk stripe+mirror simultaneously in hardware). Filesystems such as swap, /tmp and /var are small write PITAs that will bog down RAID-5 rather quickly. Any temporary, spool or other filesystem that has lots and lots of small writes performs nowhere near as well on RAID-5 as on RAID-1, 1e or 10.

A data volume on my servers can be part of the system volume's disk, RAID array (especially for systems with only 2 or 4 disks) and even its volume group. But in the case of 2 "raw" disks (non-RAID) or at least 5 (total for RAID) disks, I like to make it a separate disk/array, including a separate volume group. In the case of a separate RAID array/volume, RAID-4 (if available) is best when your service block size is large (e.g., 32+KiB like NFS), the files being written are typically large, etc... RAID-5 is better when reads drastically outnumber writes, the writes are smaller and more random, etc...

But note that RAID-10 can and does beat RAID-5 even at reads in many applications, when done by intelligent, load-balancing hardware RAID (which can read from mirrors independently), so choosing RAID-5 over RAID-10 becomes more a consideration of "disk efficiency." Such is the case with 8+ total disks. If lots and lots of writes are prevalent in a data volume, consider RAID-10 instead of RAID-5, especially for only a few disks (like 4 total -- just share System and Data in a RAID-10 volume).
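The "disk efficiency" argument is easy to put in numbers. The sketch below uses the textbook capacity figures (2-way mirroring for RAID-1/1e/10, one disk's worth of parity for RAID-4/5); those are my assumptions for illustration, not figures from any particular controller, and the function names are hypothetical.

```python
def usable_disks(level, disks):
    """Disks' worth of usable capacity for `disks` equal-size members.

    Assumes 2-way mirroring for RAID-1/1e/10 and one disk's worth of
    parity for RAID-4/5 (textbook figures, not vendor specifications).
    """
    level = str(level).lower()
    if level == "1":
        if disks != 2:
            raise ValueError("this sketch only models a 2-disk RAID-1 mirror")
        return 1.0
    if level == "1e":
        if disks < 2:
            raise ValueError("RAID-1e needs at least 2 disks")
        return disks / 2
    if level == "10":
        if disks < 4 or disks % 2:
            raise ValueError("RAID-10 needs an even number of disks, 4 or more")
        return disks / 2
    if level in ("4", "5"):
        if disks < 3:
            raise ValueError("RAID-4/5 needs at least 3 disks")
        return float(disks - 1)
    raise ValueError(f"unknown RAID level: {level!r}")


def efficiency(level, disks):
    """Fraction of the raw capacity that ends up usable."""
    return usable_disks(level, disks) / disks


if __name__ == "__main__":
    for n in (4, 8, 12):
        print(f"{n} disks: RAID-10 {efficiency('10', n):.0%}, "
              f"RAID-5 {efficiency('5', n):.0%}")
```

With 4 disks the gap is half versus three quarters of the raw space; with 8 or more, RAID-5 keeps most of the raw capacity while RAID-10 stays at half, which is where the efficiency trade-off really starts to bite.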

Here are some common implementations I have used:
  • 2 "raw" disks: Typically separate System and Data Volumes, or Unified on one disk, using the other for an rsync'd mirror or software (MD) RAID-1

  • 2 disk hardware RAID: Unified RAID-1/1e for System and Data Volume

  • 4 disk hardware RAID: Unified RAID-10 for System and Data Volume

  • 5-7 disk hardware RAID: 2-disk RAID-1/1e System Volume, 3-5 disk RAID-4/5 Data Volume

  • 8-12 disk hardware RAID: 4-disk RAID-10 System Volume, 4-8 disk RAID-4/5 Data Volume

  • 16-24 disk 2xhardware RAID: Stripe (RAID-0, max performance) or Mirror (RAID-1, card redundancy) across separate 2-disk RAID-1 System Volume, 2x8-10 disk RAID-4/5 Data Volume

The System Volume (sda/vg00/lv## used in examples)

System volumes are obvious at first, although some service and other seemingly clear "data" may actually be a better fit (recovery/performance-wise) for the "System Volume" than the "Data Volume" (next section).

The elementary system volumes are / (root), swap, /tmp, /var and /usr.

Optional system volumes may include /usr/local and /opt, although most of the time, picking either /usr/local or /opt and symlinking one to the other is doable. I almost always make /opt a symlink to /usr/local, which is a system volume, and the exception is when I have "commercial vendor" software that goes into /opt that is large (e.g., Oracle).

SIDE NOTE: About the only other time I have ever made a separate /opt filesystem is when I'm NFS mounting a "pre-built" /usr/local for the specific OS/version (e.g., binsrv:/usr/local.glibc23 programs is mounted as /usr/local on a Linux system with GLIBC 2.3 libraries). But in the case of a server, I would NEVER do this as a server should never be an NFS client (and I only sparingly do any automounting of NFS shares on any server -- and NEVER, absolutely NEVER, do I put NFS mounts in /etc/fstab on a server).

On newer Linux Filesystem Hierarchy Standard (FHS) version 2.3+ implementations, there is a new top-level directory for service data called "/srv" instead of using select portions of the "/var" directory. While this is a data volume, many services house temporary files there, and it might belong more on a system volume. In reality, most of the time, on a server, you're going to make subdirectories of /var and, on newer systems, /srv for your specific applications, and those can go on the "Data Volume". E.g., /srv/www, /var/lib/mysql, etc...

With that said, this is how I typically slice the System Volume:
  sda1  1-16GB  reserved (type 6h, only 250MB allocated)
  sda2  1-16GB  reserved (type 83h)
  sda3  1-16GB  / (type 83h)
  sda4  all     pv00 (type 8Eh)
  ---- vg00 ----
  lv00  swap        1-16GB  (type 82h)
  lv01  /tmp        1-16GB  (type 83h)
  lv02  /var        1-16GB  (type 83h)
  lv03  /srv        1-64GB  (type 83h)
  lv04  /usr        4-64GB  (type 83h)
  lv05  /usr/local  4-64GB  (type 83h)
  ---- optional ----
  lv##  /opt        4-64GB  (type 83h)
  lv##  /var/lib    4-64GB  (type 83h)
  lv##  /var/lib/*  4-64GB  (type 83h)
  lv##  /var/log    4-64GB  (type 83h)
  lv##  /var/spool  4-64GB  (type 83h)
  ---- vg reserve ----
  10-30% of vg00 for future lv

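A layout like this can be turned into LVM commands mechanically. The following is only a sketch: it builds the command lines as strings (it does not run anything), the function name and the example sizes are hypothetical, and it enforces the 10% lower bound of the reserve as a sanity check. One caveat to keep in mind: as far as I know, LVM reads the "M" suffix on -L as MiB, so the sizes are not strictly decimal.

```python
def lvm_commands(pv_device, vg_name, vg_size_mb, volumes, min_reserve=0.10):
    """Build pvcreate/vgcreate/lvcreate command lines for one volume group.

    `volumes` is a list of (lv_name, size_mb) tuples. Raises ValueError if
    the plan would leave less than `min_reserve` of the group unallocated.
    """
    allocated = sum(size_mb for _, size_mb in volumes)
    reserve = 1 - allocated / vg_size_mb
    if reserve < min_reserve:
        raise ValueError(
            f"only {reserve:.0%} of {vg_name} left free, want {min_reserve:.0%}+"
        )
    commands = [f"pvcreate {pv_device}", f"vgcreate {vg_name} {pv_device}"]
    for lv_name, size_mb in volumes:
        commands.append(f"lvcreate -L {size_mb}M -n {lv_name} {vg_name}")
    return commands


if __name__ == "__main__":
    # Hypothetical 60000M physical volume on sda4, with same-size slices.
    plan = [
        ("lv00", 4000),  # swap
        ("lv01", 4000),  # /tmp
        ("lv02", 4000),  # /var
        ("lv03", 4000),  # /srv
        ("lv04", 8000),  # /usr
        ("lv05", 8000),  # /usr/local
    ]
    for line in lvm_commands("/dev/sda4", "vg00", 60000, plan):
        print(line)
```

Printing the commands instead of executing them keeps the plan reviewable (and printable, for the day someone trashes the partition table).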
Now for all those "Fine Print" bullets:
  • I'm big on symmetry and consistency. Although it's less of an issue now with LVM, I liked having slices that could be used (as well as reserving a slice of the same size, no longer needed with LVM). This has saved my bacon so many times when someone screwed up a partition table (e.g., another admin accidentally ran "fdisk" and we didn't know until the next boot weeks or even months later ;-). By using the same size slices, even if I don't have a print out of the partition table and/or LVM assignments, I can quickly find the "boundaries." This is also why I don't "get fancy" -- I use 2000M, 4000M and 8000M almost entirely for the smaller "elementary" systems, and 16000M and 32000M for the larger ones. No attempts to dork with 8096M or 8G or whatever binary (base 2) sizes -- always MB, always a decimal (base 10) multiple. It's the most consistent practice I have come to year after year, especially with how disk vendors do geometry (decimal, inconsistent geometry, etc...).
  • Slice 1 (physical/legacy PC BIOS Partition 1) is reserved. Being that not all servers have floppy/optical drives, I keep a "dd" image of a 250MB DR-DOS 7.03 install around for firmware updates and other vendor diagnostics that require DOS. Although there is 1-16GB between the beginning of /dev/sda1 and /dev/sda2, I make it exactly 250MB (cylinder start 1, end 31, for 255 heads, 63 sectors/track) so I can plop down this dd image and target GRUB at it with a chainloader command.
  • Slice 2 (physical/legacy PC BIOS Partition 2) is also reserved. This is in case I need to install a "helper/recovery" Linux install on the same system. This might be the case where my root (/) filesystem is corrupted or, more real-world yet, the system has been compromised and I've yanked the plug the second I've realized this. I can take the entire system off-line, but install a new OS and new MBR and boot it directly on the array, while not touching anything else. Now it's easy to look around the other filesystems mounted read-only -- although in the case of a compromise, a dd of the full filesystems to another system over the network might be better (which is still made "easier" with the local "helper/recovery" install).
  • Slice 3 (physical/legacy PC BIOS Partition 3) is my root (/) filesystem and always my root filesystem. This is my "known quantity" of ANY server I have. There is always a legacy PC BIOS/DOS Disk Label (Partition Table) with its 4 slices (the primary partitions), and the 3rd slice is always a legacy BIOS primary /dev/sda3 for root (/). Because of my use of hardware RAID, I never need a separate /boot, and disk geometry isn't an issue until I hit beyond 137GB (128GiB).
  • Slice 4 (physical/legacy PC BIOS Partition 4) is my Logical Volume Manager (LVM) partition. On older systems, or if you believe LVM is not ideal, then use a type 0Fh (Extended, LBA -- Extended Partition beyond the 1024 cylinder limit, so using Logical Block Addressing). For LVM, I do a 1:1 physical volume (pv) to volume group (vg) approach, so /dev/sda4 is pv00, which is also vg00. If you want to play with some of the new LVM2+DM (Device Mapper) and use multiple pv's in a single vg00 for software spanning, striping, mirroring, etc..., feel free, but I won't be caught dead doing it. And if you choose to use the traditional PC BIOS/DOS slicing where you slice a legacy PC BIOS/DOS "Extended" Disk Label, instead of the logical volumes (lv), you'll have legacy PC BIOS/DOS "Logical" Slices of /dev/sda5, /dev/sda6, etc... as appropriate.
  • I prefer swap, /tmp, /var and, on newer systems, /srv of equal sizes to /. This is a legacy approach I used, prior to my adoption of LVM with resizing (which I still am very hesitant to trust, especially kernel 2.6/LVM2). If you are not using LVM, but sticking with the Extended slice, then create an extra Logical slice in case you need to copy a filesystem over, or add another of the same size.
  • So if you have applications with temporary or service data under /var and/or /srv that have very, very large size requirements, they should be separated out as separate filesystems. For me, the "golden rule" is an application that could possibly use 1GB in the "worst case scenario" -- that means it's ripe to overfill your /var (or /srv) and affect other things. Whether it goes on the "System Volume" or the "Data Volume" depends, but it shouldn't be left in the /var (or /srv) filesystems.
  • Limiting impact on and of /var is extremely important. So many arbitrary temporary files get created under /var from so many arbitrary services, system operations, etc... For example, after just a few YUM (or other) package management updates, the RPM cache can start bloating and start eating away at other things in /var. In other words, keep /var limited, and segment out anything that should not be affected by other things in /var or should not affect /var (such as databases, mail, print, proxy, other spools, etc...).
  • The static binary filesystems of /usr and /usr/local each should be 4x the full install of all your applications for each. I've found I've never needed more than 16GB (with rare exception, see next bullet), although 4GB/each is probably enough for anyone.
  • Again, I rarely create a separate /opt filesystem, and typically symlink it to /usr/local. If I'm installing 3rd party, binary-only, "commercial vendor" software, then I'm more inclined to keep my "custom /usr/local/bin" away from "/opt[/blah]/bin".
  • Another, good option for servers that are "part-time" database or other service application is to make a separate /var/lib, which is where many "service data" types are located. If your server's primary role is a database server (where the database comes with the OS, so its data is in /var/lib or /srv), then you SHOULD make it a SEPARATE filesystem, possibly in the Data Volume. But if it is -- again, "Golden Rule" -- never going to break 1GB ever -- then a separate /var/lib is probably a good idea, and on the "System Volume" (of RAID-1, 1e or 10).
  • Some administrators like a separate /var/log, and if Internet servers are the target, then you can really, really enhance security by making /var/log separate. That includes filesystem mount options on what can be changed, executed, accessed, etc... on /var/log, as well as remounting /var/log as read-only (assuming you use lsof immediately to identify what is writing to it and kill them so you can) when you need to immediately look at "weird circumstances" without yanking the plug on the Internet server (especially when you're remote via SSH).
  • Lastly, if you're doing any mail, print, Squid or other spooling, you NEED a /var/spool directory, period. In fact, if you are doing serious Mail services, or are a dedicated print server and definitely if you are a Proxy server running Squid, not only should you make such dedicated filesystems (e.g., /var/spool/mail, /var/spool/squid, etc...), but you should NOT put them on a RAID-5 Data Volume, but the RAID-1, 1e, 10 System Volume for sheer performance reasons.
There are probably 100 different considerations for the System Volume I have forgotten here, but these are the major "best practices" that I've come up with over the years. I'll probably add one or two more comments/filesystems/bullets in the coming weeks as I remember them.
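A few of the numbers in the bullets above are pure geometry arithmetic, so here is a small sketch for checking them. It assumes classic 512-byte sectors and the 255 heads / 63 sectors-per-track geometry quoted for the DOS slice; the helper names are made up for illustration. It also ignores details such as the first track normally being left for the MBR, so real slice sizes come out slightly smaller.

```python
SECTOR_BYTES = 512  # classic sector size, assumed throughout


def chs_bytes(cylinders, heads=255, sectors_per_track=63):
    """Bytes covered by a whole number of cylinders in CHS geometry."""
    return cylinders * heads * sectors_per_track * SECTOR_BYTES


def lba28_limit_bytes():
    """Largest capacity addressable with 28-bit LBA sector numbers."""
    return (2 ** 28) * SECTOR_BYTES


def decimal_mb(megabytes):
    """Decimal (base 10) megabytes to bytes, the unit used for slice sizes."""
    return megabytes * 10 ** 6


if __name__ == "__main__":
    dos = chs_bytes(31)
    print(f"cylinders 1-31: {dos:,} bytes "
          f"({dos / 10**6:.1f} MB, {dos / 2**20:.1f} MiB)")
    limit = lba28_limit_bytes()
    print(f"28-bit LBA limit: {limit:,} bytes "
          f"({limit / 10**9:.1f} GB, {limit / 2**30:.0f} GiB)")
    print(f"8000M slice: {decimal_mb(8000):,} bytes vs 8 GiB: {8 * 2**30:,} bytes")
```

Cylinders 1 through 31 come out at roughly 255 decimal MB, which is where the 250MB DOS slice fits, and the last line shows how far apart "8000M" and "8G" really are, which is one more reason to pick a single convention and stick with it.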

The Data Volume (sdb/vg01/lv1## used in examples)

Data volumes -- not just user, but service -- are also obvious, but there are some data type filesystems that you might not consider. Again, although some filesystems are clearly "data," they may actually be a better fit (recovery/performance-wise) for the "System Volume" than the "Data Volume."

In keeping with the prior nomenclature, the disk is /dev/sdb and vg01 for the Data Volume, from /dev/sda and vg00 for the System Volume. But more differentiating is the move to lv1##, a "1" prefix. This is just my idea (feel free to assume it's eccentric and useless) -- the last two Logical Volume (lv) digits always count from 00 to 01 to 02 regardless of the pv/vg I'm on, but there is a "prefix" for Data Volumes. This means I could possibly have more than one Data Volume, and the "X" in lvX## in the filesystem device always tells me what Volume Group it is in. If it doesn't have that prefix, then it's the System Volume (which is always vg00, and filesystems start counting from lv00).
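The lvX## convention is simple enough to express as a tiny parser. This is just a sketch of the naming rule as described above (optional one-digit prefix for the volume group, then a two-digit counter); the function name is hypothetical.

```python
import re

# Optional single-digit volume group prefix, then a two-digit counter.
_LV_NAME = re.compile(r"^lv(\d?)(\d{2})$")


def parse_lv(name):
    """Split an lv## / lvX## name into (volume group, index in that group)."""
    match = _LV_NAME.match(name)
    if not match:
        raise ValueError(f"not an lv## or lvX## name: {name!r}")
    prefix, index = match.groups()
    # No prefix means the System Volume, which is always vg00.
    vg_number = int(prefix) if prefix else 0
    return f"vg{vg_number:02d}", int(index)


if __name__ == "__main__":
    for lv in ("lv00", "lv05", "lv100", "lv102"):
        print(lv, "->", parse_lv(lv))
```

So lv05 is the sixth filesystem on the System Volume (vg00), while lv102 is the third filesystem on the first Data Volume (vg01).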

With that all said, there is actually only 1 essential Data Volume, /export/systemname. If this filesystem will never be exported via NFS, then I symlink /home/systemname to it. If it is unlikely that you will ever have more than -- again, Golden Rule -- 1GB of user home directory data, then you can probably just leave /home on the root (/) filesystem. But for countless recovery, security and other reasons, I recommend you always create a separate /home filesystem, and on a LAN (where network filesystems are in use), I highly recommend it be /export/systemname. Do this even if you have no plans for NFS or another network filesystem protocol, and symlink it to /home/systemname.

Now if your server is a network fileserver, then I recommend you actually create at least two (2) /export/systemname filesystems. I covered why in my other blog on filesystem fundamentals -- in case you have to fsck one user data volume, you can still bring up the other. And then there's the standard localization of corruption, etc... I typically name the second "home" filesystem /export/systemname2 -- e.g., on a file server named "bssrv", I would have at least a "/export/bssrv" and a "/export/bssrv2." Note, I like to start numbering the second on-ward volumes at "2" instead of "0" or "1" -- reserving those two numbers for inserting another filesystem (possibly on a different Data Volume), "just in case" (possibly as a symlink to another volume, or countless other, eccentric operations I've done in the past).

But more "real world" on a file server is breaking them down by department/usage/users. E.g., /export/accounting, /export/engineering, /export/marketing, etc... Groups, filesystem types, filesystem sizes/creation/usage (important for fragmentation considerations), etc... tend to be similar for those in the same department, using the same applications, etc... But probably the "common denominator" of doing things by department is security -- not just Groups or even for Discretionary Access Controls (DACs, traditional UNIX as well as newer Extended Attributes like Access Control Lists, ACLs), but more for Mandatory Access Controls (MACs, like Extended Attributes such as SELinux, or various alternatives). I could literally write a book on this (and I just might at some point ;-), which is why I feel strongly about it being a "best practice."

Which is why we definitely use /export for user home directories and do not symlink it to /home on a LAN NFS file server -- because our NIS/LDAP Automounter maps will mount those /export filesystems into /home. I.e., /home becomes the root for an automounted subdirectory (typically the standard auto.home/auto_home map -- but I won't go any deeper). There is no performance loss in accessing the local /export/systemname from the /home/systemname NFS mount because it occurs over loopback (resulting in direct inode access at the kernel level).

With that all said, this is how I typically slice the Data Volume:
  sdb1  1-16GB  reserved (type 6h, only 250MB allocated)
  sdb2  1-16GB  reserved (type 83h)
  sdb3  1-16GB  reserved (type 83h)
  sdb4  all     pv01 (type 8Eh)
  ---- vg01 ----
  lv100  /export/(name)   16+GB  (type 83h)
  ---- optional ----
  lv1##  /expafs/(name)*  16+GB  (type 83h)
  lv1##  /export/(name)2  16+GB  (type 83h)
  lv1##  /export/(name)*  16+GB  (type 83h)
  lv1##  /export/(dept)   16+GB  (type 83h)
  lv1##  /export/temp*    16+GB  (type 83h)
  lv1##  /srv/ftp         16+GB  (type 83h)
  lv1##  /srv/www         16+GB  (type 83h)
  lv1##  /var/lib/ldap    16+GB  (type 83h)
  lv1##  /var/lib/*sql    16+GB  (type 83h)
  lv1##  /var/spool/*     16+GB  (type 83h)
  ---- vg reserve ----
  10-30% of vg01 for future lv

Now for all those "Fine Print" bullets:
  • Slices 1-3 (physical/legacy PC BIOS Partitions 1-3): I'm sure by now you think I'm a chronic disk space waster. I mean, even if you think I might have "some justification" for reserving a little space at the beginning of the "System Volume" for a DOS boot and/or possible "Helper/Recovery" installation of Linux, you think it's totally ludicrous to do the same for the separate (when it's not unified) "Data Volume" too, right? Well, there are 2 "real world" scenarios why I still reserve 3-48GB on the Data Volume too.
    1. Major Version Upgrades, Simultaneous Boot. Let's say I'm going to do a major version upgrade of a server (let's say from Red Hat Enterprise Linux 3 to Red Hat Enterprise Linux 4, or even Fedora Core 1 to Fedora Core for that matter). I can either attempt to "upgrade" the live "System Volume" filesystems (which is not really supported in RHEL, although supported in RHL/FC), or I can make use of that extra space -- those extra, BOOTABLE "primary partitions" (hint, hint) on the "Data Volume." If I have enough space to fit into /dev/sda2 (on the System Volume), then I might not need to do it. But many times, I need 2-3 filesystems (of over 10GB free) to do it, so I just install the new version into 2-3 primary partitions on the Data Volume -- all while not only having a "way to go back" to the previous version (using the existing System Volume) if things get bad, but also keeping all the configuration as a reference (again, still on the System Volume). And once I'm done, I'll reformat filesystems in the "System Volume" and copy over the new filesystems. If I'm really, really confident on the changes of the upgrade, I'll actually copy the existing version over to the "Data Volume" first, and then install the new version directly onto the "System Volume" -- still leaving the old version bootable, and all its config files. Yes, there's not a 1:1 slice/filesystem relationship, but by using cd /(fs); find . -mount | cpio -pmdv /newroot/(fs) I can copy over the complete trees from any set of filesystem mounts to any new set of filesystem mounts (under /newroot).
    2. "We've Been Compromised, Yank the System Volume". This is not what I like to do, because I ALWAYS want to "pull the plug" on ALL volumes -- including DATA -- when I discover a compromise. But in a situation where the data has just gotta be up ASAP, I can still "pull the plug," then remove the "System Volume" as-is, then install another OS in a single root (/) of /dev/sdb3 which is on the Data Volume (possibly using /dev/sdb2 as /var, maybe /dev/sdb1 for something else, like /var/log) in record time. Yes, there's still a chance that a specific user has been compromised (e.g., the user's public SSH key has been changed to match a black hat hacker's keyset, so the second SSH is re-enabled, the hacker's back in). But since most data volumes are mounted "noexec" (yours are, correct?), the compromise is on those system volumes. And now I have the System Volume off-line, untouched, unmodified, while the Data Volume is still there, using the spare primary partitions to boot and hold the system I just loaded "clean."
  • Again, there should always be at least an /export/systemname (possibly symlinked to /home/systemname), although for file servers, you should avoid just creating many /export/systemname# filesystems -- while also not creating the proverbial "all eggs in one basket" as I've preached before. In a nutshell, the /export/systemname[#] convention is what I use for creating local, exportable data filesystems on UNIX workstations, cluster nodes, etc... where there are cross-automounts of each other, as well as by UNIX desktops (accessing workstation, cluster, etc... storage). For LAN file servers, you're typically serving up data to different departments and, again, those departments have similar needs, usage, criteria, resulting security, resulting fragmentation, resulting application, etc... Build your filesystem names (for all your servers) around those departments, including any "shared" directories for a department underneath (e.g., /export/dept/project/XYZ with a "catch-all" of /export/dept/temp -- I explicitly use temp because I want users to realize there is no "permanent catch-all" directory for arbitrary projects -- usage, security, groups, etc... need to be defined when they are formalized).
  • For Andrew Filesystem (AFS), my source filesystems that house the virtualized AFS volumes (for those that don't know, you do not share "local files" out via AFS -- AFS is a virtualized filesystem that is more like a set of "database files" that you can't directly read) have their own tree. Since /afs is where they are typically mounted, I call the local filesystems on AFS file servers /expafs (for exported filesystems via afs). Again, the /expafs/systemname* filesystems are NEVER exported themselves, but putting them in a different tree PREVENTS me from accidentally exporting them or otherwise making them available via another protocol (which would not only do nothing, but possibly cause corruption).
  • The next, /export/temp*, should be obvious for file servers -- it's a temporary share area for everyone. And just like my prior comments on why you shouldn't call it something "more permanent," the idea here is to keep people from creating sprawling repositories that are not well defined, not well secured, etc... Formal directories and/or even separate filesystems should be set up for projects, etc... as necessary. In most cases, this is department level. When it is not, then /export/temp* filesystems could hold temporary work until decisions are made on how to control access (either by department, multiple departments, cross-department access, etc...).
  • Now we get to the more "application-specific" directories that are more for application servers, as we've done little but discuss LAN file servers prior. /srv/ftp (or /var/ftp) and /srv/www (or /var/www) should be obvious: they localize FTP and/or HTTP service data. I canNOT recommend this highly enough, as Apache and most FTP daemons set EXTENSIVE security defaults that can be well enforced by segmented filesystems for their service data. E.g., you can use settings (and many Apache builds come with them by default) at the service level (not rule/option level) that prevent symlinks from crossing filesystem boundaries (even if you're allowing symlinks). chroot jails and other newer, better developments can take advantage of these too.
  • Likewise, database and other data management services (e.g., LDAP directory) should be segmented off. This is especially the case when they are located under the /var filesystem. Although some of these services are being moved into /srv because they are "more persistent" data than the traditional "variable/temporary" files in /var, most are still under /var/lib or somewhere in /var. You do NOT want to put your critical database, LDAP or other services at the risk of corruption, fragmentation, etc... with all other (and always extensive/excessive) /var usage. And you can typically improve security as well with not only mount options, but service/application defaults or options that prevent access from crossing filesystem boundaries.
  • Lastly, the more persistent spool data can have filesystems on the Data Volume. The more excessive the writes are, and the more the spool directories are used for temporary files rather than retention of information, the more they should go on the System Volume instead. But for longer term mail (pop3, imap), printing (e.g., PDF, form, other print spool-based document generation), etc..., these are clearly longer-term data stores that aren't just areas to temporarily write to.
As with the System Volume, there are probably 100 different considerations for the Data Volume I have forgotten here, but these are the major "best practices" that I've come up with over the years. I'll probably add a good half-dozen, significant "real world" Data Volume notes/filesystems/applications in the coming weeks as I remember them.
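
To make the "segmented filesystems" point from the /srv/ftp and /srv/www item a bit more concrete, here is a minimal sketch of how you could audit a service tree for symlinks that escape onto another filesystem. It is NOT how Apache or any FTP daemon enforces anything; it is just a hypothetical helper (the function names are mine) that compares device IDs, which only tells you something useful if the service data really does sit on its own filesystem.

```python
import os


def crosses_filesystem(link_path):
    """Return True if link_path is a symlink whose target lives on a
    different filesystem (device) than the directory holding the link."""
    if not os.path.islink(link_path):
        return False
    link_dir = os.path.dirname(os.path.abspath(link_path))
    # st_dev identifies the device (filesystem) a path resides on.
    link_dev = os.stat(link_dir).st_dev
    try:
        target_dev = os.stat(link_path).st_dev  # os.stat follows the symlink
    except OSError:
        return False  # dangling link: there is no target to compare
    return link_dev != target_dev


def find_crossing_symlinks(root):
    """Walk root (without descending into symlinked directories) and
    collect every symlink that points onto another filesystem."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if crosses_filesystem(path):
                hits.append(path)
    return hits
```

Run something like find_crossing_symlinks("/srv/www") on a box where /srv/www is its own filesystem, and anything it returns is a link that a "don't cross filesystem boundaries" policy would want to know about.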

2005-09-04

Fake RAID (FRAID) sucks even more at RAID-5

Well, this keeps coming up, so it's time for a blog.

Fake RAID (FRAID) is not hardware RAID, it never will be

I continue to loathe Fake RAID (FRAID) implementations. I regularly run into discussions from both end-users and even MCSEs with servers who love FRAID. They think it's a cheap way to heaven and redundancy. And they have the CPU utilization to prove it (or so they think -- as we'll discuss)!

For those that don't know, Fake RAID (FRAID) is extremely popular because it requires *0* additional hardware. It's not hardware RAID, because it uses your main CPU -- host RAID (not host adapter RAID, which is an intelligent RAID card). Your main CPU does all RAID functionality at all times.

When in the BIOS, you use the 16-bit Int13h disk services that have been added to an ATA channel. This is what turns an ATA controller into a FRAID controller. In fact, many "regular" ATA cards could be turned into a FRAID controller with a simple BIOS flash (maybe with a jumper trace added or pull-down resistor) because ATA cards and their FRAID versions are *0* different in hardware.

Once the 32-bit/64-bit OS loads, the FRAID driver is required. The FRAID driver is both an interface and, more importantly, the RAID logic. The RAID logic is typically licensed from a 3rd party**, meaning it's proprietary** and different vendors/cards have slightly varying versions. It means the driver is a bloated mess of CPU commands to do software RAID. All data must travel up the CPU, instead of direct memory access (DMA) from memory to I/O directly. It's not the CPU or xor instruction that loads the system, it's all the load, store and other duplication in the system interconnect.

**NOTE: Hence why Linux GPL drivers are virtually impossible, and even though a GPL FRAID logic exists (ataraid.c), the vendor interface drivers (hptraid.c, pdcraid.c, silraid.c, etc...) are never well-aligned with various card implementations that vary by release.

ICH7R/MCP-04, RAID-5 goes mainstream with 15MBps writes (yeah, it sucks hard!)

RAID-5 is absolutely detrimental in software. New benchmarks at GamePC clearly show how bad it gets with the new Intel I/O Controller Hub 7 RAID (ICH7R) and nVidia Media and Communication Processor 04 (MCP-04) peripheral controllers. Everyone wants to talk read performance, but they don't like to talk about write or -- gasp -- rebuild performance. Well, just looking at write performance for a single write that is 1GB or less:
GamePC RAID-5 Page 9

Now this operation is PURE DESKTOP! One large file copy of sizes no larger than 1GB. That's not even putting a dent in the memory, and it's only one operation. And in the ICH7R and MCP-04, the SATA channels are on a DEDICATED 250MBps PCI-Express x1 channel (PCIe x1). But even then ... the result?

15MBps! Welcome back to i486-era Programmed I/O (PIO)!

That absolutely sucks. You have disks today of 50-80+MBps, and you can't even break old Programmed I/O (PIO) Mode 4 or Mode 5 (16.6MBps or 22MBps, respectively) performance. In fact, that's basically what the problem is. Instead of pushing the data stream via direct memory access (DMA) transfers from memory to I/O, without bothering the CPU, the FRAID driver is doing programmed I/O (PIO) through the CPU. The FRAID driver has turned your DMA capable drives into CPU PIO driven devices -- quite technically! As any ATA storage benchmark shows, it's very, very difficult to get more than 15-20MBps with PIO today -- because the CPU interconnect is completely saturated with operations that it was NEVER intended for.
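
To put those numbers side by side, here's a bit of back-of-the-envelope arithmetic as a sketch. The rates are the ones quoted in this post (15MBps from the GamePC result, 16.6MBps for PIO Mode 4, 50-80MBps for a current disk, 250MBps for PCIe x1); the decimal 1GB = 1000MB conversion and the function names are my own assumptions for illustration.

```python
MB_PER_GB = 1000  # assumption: decimal units, to keep the arithmetic simple

# Sustained rates in MBps, all taken from the figures quoted in the text.
rates_mbps = {
    "FRAID RAID-5 write (GamePC result)": 15.0,
    "PIO Mode 4": 16.6,
    "single disk, low end": 50.0,
    "single disk, high end": 80.0,
    "PCIe x1 link": 250.0,
}


def seconds_to_copy(size_mb, rate_mbps):
    """Time to move size_mb megabytes at a sustained rate_mbps."""
    return size_mb / rate_mbps


def report(size_mb=MB_PER_GB):
    """One line per rate: how long a copy of size_mb would take."""
    return [
        f"{label}: {seconds_to_copy(size_mb, rate):.1f} s"
        for label, rate in rates_mbps.items()
    ]


if __name__ == "__main__":
    for line in report():
        print(line)
```

A 1GB copy at 15MBps takes roughly 67 seconds, against 20 seconds for a single 50MBps disk and 4 seconds if the PCIe x1 link were the only limit. The link is nowhere near the bottleneck.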

Back in the days of a non-superscalar i486 that could barely push more than 133MBps, and not even that close before synchronous timing, it was fine to do 8-16MBps for that period's Enhanced Small Device Interface (ESDI), the father of Integrated Drive Electronics (IDE). So now seeing a limitation back to 15-20MBps over the 250MBps PCIe x1 interface that the SATA channels of the ICH7R/MCP-04 use, did not surprise me one bit -- because they match the expected PIO mode 4 performance of IDE, even just writing a single desktop process. The PC CPU-memory interconnect is not an I/O processor. It never has been, it never will be. Yes, Opteron's partial mesh of 2x DDR and 2-3x HyperTransport tunnels per CPU helps, but it's still not an I/O design intended to be a storage host and service host in one.

It's the same problem in using a PC for a router or network switch. Your CPU is well away from the Network Interface Card (NIC) through layers of interconnect, overhead and the fact that your CPU is a software driven processor. A network router or switch is an Application Specific Integrated Circuit (ASIC) or I/O Processor (IOP) that taps the network interfaces directly, processes frames/packets without much separation between it and the raw device. Specification wise, your CPU should be a much, much faster router/switch than a little, dedicated hardware device -- but it's not. Hence why your CPU cannot match a "storage switch" or "buffering storage controller" any better than it can a "network switch" or "buffering router."

You'll also note in the same article the write performance still at a single write (second graph):
GamePC RAID-5 Page 10 (see second graph for writes). Here's where even the partially software-based Broadcom cannot compete with the Intel IOP331 (superscalar XScale I/O Processor) Areca ARC-1110. Now GamePC, in its continued ignorance, thinks it's a cache-based reason. They even (at the end of the article) say the Areca product is overpriced and don't know why. Well, duh, it's not just some "dumb" ATA channels with software -- it's a true, locally intelligent, off-loading RAID card.

And this is just DESKTOP performance. On a server, the multiple I/O requests would TRASH FRAID or even software RAID in queuing -- rendering the host system into a role that is primarily dedicated as a storage device. Much like putting a PC as a network switch and/or router would be -- quickly detrimental and not fit for the role.

POST NOTE: Software RAID will never cut it, unless that's all your system does (storage)

Which brings me to my final point. I do NOT call Software RAID done at the OS level FRAID. FRAID is Fake RAID done in a "dumb" ATA controller because ... why? ... the vendor can. And 90+% of consumers will believe it is hardware RAID. In fact, I often recommend Logical Disk Manager (LDM) on NT5+ (2000+) and Linux Logical Volume Manager (LVM) and/or Multi-Disk (MD) instead for higher performance. But in the end, whether FRAID or software RAID, it's not hardware RAID, even if software RAID isn't as bad as FRAID.

Now there's no end to Linux administrators who swear by Multi-Disk (MD) over Hardware RAID. In both cases, they don't look at all the facts, and make statements about hardware RAID that were NOT valid even 5 years ago. I get tired of these people, because they think I'm some fool who hasn't been deploying both Linux MD and intelligent RAID hardware solutions for 7+ years. Most of them are still just trying to get MD off-the-ground, or have run into their first MD "hiccup" or, worse yet, the (pun)myraid(pun) of issues of software storage layer upon software storage layer (a major issue of RACE CONDITIONS in the Linux kernel).

Intelligent Hardware RAID Sucks Fallacy #1:
I can move my disks between hardware

This is a typical answer from someone who has only used FRAID, or maybe an old i960 controller from DPT that is now dead and beyond its end-of-life. Most of these administrators have never dealt with the small changes in LVM/MD in various Linux versions. I have. I really HATE it when some small "layer" in the Linux kernel turns my LVM or MD volume into bits of no organization. So I really HATE it when I'm given this totally unfactual statement. I've yet to meet someone who has moved MD volumes between systems of 3+.

3Ware, on the other hand, has maintained 5+ years of volume upward compatibility -- from the 5000 series to the latest 9000 series. I have no problem taking volumes from older devices to newer. Heck, I've even taken a RAID-10 volume from a newer 7500-4LP series to an older 6400 because RAID-10 has not changed since the 6.9 firmware, even in the 7.x firmware. It just works.

Case-in-point: There are hardware RAID vendors with extremely poor Linux history (e.g., Adaptec) and those with very good Linux history (3Ware) and those with a fairly good history (e.g., Symbios/LSI) and those that are now dead (e.g., DPT now Adaptec = crap, Mylex now LSI = good). I have stuck with 3Ware and Mylex/LSI with great results.

Intelligent Hardware RAID Sucks Fallacy #2:
Hardware RAID is slower

Now this is more directed at intelligent hardware RAID, and my absolute favorite! Major OEMs like Dell continue to sell 10 YEAR OLD hardware RAID designs with the Intel i960/IOP30x series. These are old, slow designs that can't break 50MBps with RAID-5. I stopped using them 6+ years ago, when I moved to 3Ware Escalade 5000 (and, subsequently, 6000/7000 series shortly afterwards) and Mylex eXtremeRAID 1100/2000 (DAC960) as well. People complain about cost, but 3Ware wasn't that expensive at all for ATA, and Mylex was the way-to-go if you were deploying SCSI (where disk cost is the biggest issue).

Every single time -- EVERY SINGLE TIME -- I get people talking about i960 solutions. I didn't use them 6+ years ago, so STOP USING THEM AS EXAMPLES! And start by stopping your purchases with OEMs that still sell that crap. ;->

I've heard all the excuses, and they are NOT valid with my PROVEN use of specific products. Most people do NOT "do their homework" and that's their problem. I did my homework long ago, and my clients have reaped the benefits from it. And that includes 5+ years of volume support, no messy issues with multiple layers in the kernel and upward compatibility of volumes with new devices -- for a few hundred bucks, and BETTER performance. It's well worth the reduction in headaches.

Intelligent Hardware RAID Sucks Fallacy #3:
How can RAID ASICs/IOPs compete with a modern CPU's performance?

If this was true, why don't we just use PCs instead of dedicated Ethernet switches and routers? We don't because even a "slow" 100-1,000MHz MIPS or XScale embedded microprocessor/microcontroller (uP/uC) or core in an Application Specific Integrated Circuit (ASIC) is designed to push data around, whereas a PC is designed to process data. The interconnect is everything.

The 10+GBps of CPU-memory interconnect is NOT designed for I/O! I'd much rather have a 1-3GBps I/O Processor (IOP) interconnect or 2-4GBps switch fabric that is designed to push data around, replicating (RAID-1), striping (RAID-0/3/4/5) and XOR'ing (RAID-3/4/5) in-line with my data coming over the I/O -- than pushing redundant and multiple copies up through a CPU-memory interconnect.
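
For anyone fuzzy on what "XOR'ing" means for RAID-3/4/5, here is a minimal sketch of the parity math itself. It is a toy, assuming one stripe of equal-sized blocks with the parity block always last; real RAID-5 rotates parity across the disks, and none of these function names come from any actual driver or firmware.

```python
def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    if not blocks:
        raise ValueError("need at least one block")
    size = len(blocks[0])
    if any(len(b) != size for b in blocks):
        raise ValueError("all blocks must be the same length")
    out = bytearray(size)
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def make_stripe(data_blocks):
    """Return the data blocks plus one parity block (stored last here)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild_missing(stripe, missing_index):
    """Recover one lost block by XOR'ing every surviving block."""
    survivors = [b for i, b in enumerate(stripe) if i != missing_index]
    return xor_blocks(survivors)


if __name__ == "__main__":
    stripe = make_stripe([b"AAAA", b"BBBB", b"CCCC"])
    print(rebuild_missing(stripe, 1))  # recovers the second data block
```

The math is trivial, which is the point: the XOR is cheap, it's where the bytes have to travel to get XOR'ed that costs you.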

You see. In a RAID-5 write to a hardware RAID device, the data stream goes directly from memory over the PCI[e|-X] bus to the intelligent storage card. That card then handles all caching/buffering for duplication/XOR directly to the channels locally. If you use software, then that data has to go from memory up the CPU for duplication/XOR -- it's not the CPU processing that kills it, it's the redundant data streams that eat up your I/O. Now that is then pushed back a second time (be it the duplication or parity) to memory before being committed to disk. And if a disk read is required for verification and other operations (such as during a rebuild) -- forget it! Your system is TOAST with load!
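
Here is a rough accounting of that data path as a sketch. It is a toy model of the description above, NOT a measurement: it assumes a full-stripe write, counts every trip across the host interconnect equally, and ignores caching, read-modify-write and rebuilds (which only make the software case worse). The function names and the model itself are my own assumptions.

```python
def hardware_raid5_host_bytes(data_blocks, block_size):
    """Host-side traffic when an intelligent card does the XOR:
    the payload crosses from memory to the card once."""
    return data_blocks * block_size


def software_raid5_host_bytes(data_blocks, block_size):
    """Host-side traffic when the CPU does the XOR (toy model):
    payload up to the CPU, parity back to memory, then payload
    plus parity out to the 'dumb' controller."""
    payload = data_blocks * block_size
    parity = block_size                 # one parity block per stripe
    cpu_reads = payload                 # memory -> CPU for the XOR
    parity_writeback = parity           # CPU -> memory
    to_controller = payload + parity    # memory -> disk channels
    return cpu_reads + parity_writeback + to_controller


if __name__ == "__main__":
    blocks, size = 3, 64 * 1024  # hypothetical 3 data disks, 64KB chunks
    hw = hardware_raid5_host_bytes(blocks, size)
    sw = software_raid5_host_bytes(blocks, size)
    print(f"hardware: {hw} bytes, software: {sw} bytes, ratio {sw / hw:.2f}x")
```

Under these assumptions a 3+1 stripe moves 8 blocks' worth of data over the host interconnect in software versus 3 in hardware. The exact ratio depends on the model, but the direction does not.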

Linux people are the worst to discuss this with because the Linux kernel has POOR utilities for measuring interconnect I/O. The only thing Linux can do is stat the amount of I/O services used by the CPU. Although it's a good way to detect how much I/O the CPU is directing, it doesn't tell you when the interconnect of the system -- especially a non "front-side bottleneck" design -- is being HOSED by redundant I/O streams. In fact, your CPU utilization can actually go down because the CPU is STARVED by the data transfers -- although the performance will clearly show on writes.
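
For reference, this is roughly what that CPU-side view looks like if you pull it yourself. A minimal sketch, assuming the usual field order of the aggregate "cpu" line in /proc/stat (user, nice, system, idle, iowait, irq, softirq, steal); the helper names are mine. Note what it does NOT give you: any idea of how saturated the interconnect is, which is exactly the complaint above.

```python
import time

# Assumed field order of the aggregate "cpu" line in /proc/stat.
FIELD_NAMES = ["user", "nice", "system", "idle",
               "iowait", "irq", "softirq", "steal"]


def parse_cpu_line(line):
    """Turn a /proc/stat 'cpu ...' line into a dict of tick counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in fields[1:1 + len(FIELD_NAMES)]]
    return dict(zip(FIELD_NAMES, values))


def iowait_percent(before, after):
    """Share of elapsed CPU ticks spent in iowait between two samples."""
    deltas = {name: after[name] - before[name] for name in before}
    total = sum(deltas.values())
    return 100.0 * deltas.get("iowait", 0) / total if total else 0.0


def read_cpu_counters(path="/proc/stat"):
    """Read the first line of /proc/stat (Linux only)."""
    with open(path) as f:
        return parse_cpu_line(f.readline())


if __name__ == "__main__":
    first = read_cpu_counters()
    time.sleep(1)
    second = read_cpu_counters()
    print(f"iowait over the last second: {iowait_percent(first, second):.1f}%")
```

A low number here during a heavy software RAID-5 write proves nothing about the interconnect. Compare it against actual write throughput before drawing conclusions.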

The ONLY time software RAID is useful is when you have a DEDICATED storage device. That means all the device is doing is being a storage device. Your services are on ANOTHER host. So the CPU can be dedicated to those operations. On a system that is both storage and service, hardware RAID is always the best choice. E.g., 3Ware cards will keep not only I/O down, but keep the traffic off of your CPU interconnect (regardless of the 3% or less CPU utilization that 3Ware maintains just in overhead). And they will queue up a massive number of requests.

If you are both storage and services on one host, truly consider not putting the I/O burden on your CPU interconnect, and paying a few hundred bucks to save yourself headaches. Especially when it comes to volumes, etc... Especially given the management tools that 3Ware has provided, and continues to provide, in services like 3DM2.

Still don't believe me? Then why is Intel starting to put IOP33x on server mainboards?

Intel is moving to address the issue by starting to put its superscalar XScale I/O Processor (IOP) into new server mainboards, and possibly into future I/O Controller Hub (ICH) chip designs for servers themselves. The idea is to off-load I/O operations onto a processor that is dedicated to such functionality, and not tie up the Memory Controller Hub (MCH) with redundant operations that trouble the CPU with redundant copies/processes. The queuing, buffering and other operations that can be eliminated are a major bonus.

There is nothing worse than tying up a service host that is supposed to be servicing and operating on data with I/O operations that can be handled much better, much closer to the actual storage and its buffer. Intel realizes this, and the advantage its server mainboards can gain with an IOP processor on-board, or even in the ICH itself. But the ICH7R isn't it, and it probably will never be on a desktop mainboard anyway -- even though that's what I see 80% of sysadmins and even some "fly-by-night" system integrators still use for servers.