2005-08-28

Filesystem Fundamentals and Practices

[ UPDATED 8/28 with more XFS Comments/Clarifications ]
There's not a week that goes by without someone questioning filesystems, layouts or fragmentation in Linux deployments on at least one list. I try to tailor my recommendations to various considerations, but I often get overridden by comments that do not respect the context I try to make mine in. So here is my "superpost" on the matter, and I will leave everyone else to debate.

Overview:
- Windows Deprogramming (the Windows You Don't Know)
- Traditional UNIX Mindset (Why It Still Works After 30 Years)
- My Professional Linux Practices (Even More, Why?)
- Addendum: XFS List Comments/Clarifications

Windows Deprogramming (the Windows You Don't Know)

A lot of posts I see on Linux lists are from typical Windows users new to UNIX. We'll get to my "UNIX Mindset" comments in a second, but it is important to understand that the overwhelming majority of Windows filesystem issues are Windows-only. There are many aspects where past Windows experience, even extensive Windows Server administration experience, is wholly inapplicable to UNIX. Filesystems are the biggest of these differences. IBM-Microsoft systems have always used a File Allocation Table (FAT) approach, whereas UNIX uses an inode approach, as we'll discuss -- the two are night and day.

Compounding this is that most MCSEs are oblivious to most of the issues with Windows filesystem design, especially those of the New Technology Filesystem (NTFS), that can make it very problematic. Coming from OS/2 in the early '90s, I warned my contacts at Microsoft of the dangers in early NTFS "false security" and other changes made from OS/2's High Performance Filesystem (HPFS), not that HPFS was an ideal design either. I recently did a series of presentations on low-level UNIX and Windows interoperability that covered the inherent design issues of NTFS, among other disk considerations:
- Low-Level Interoperability Part 1
- Low-Level Interoperability Part 2
- Low-Level Interoperability Part 3

But to start, let's go over the history of issues with Microsoft's filesystem designs.

- Microsoft (MS) DOS 1.0

MS-DOS 1.0 was a direct (and illegal) port of Digital Research's CP/M from the 8080 to 8088. There is a long history on that (MS bought it from Seattle Computer Products, the original pirates, for $50,000, which IBM later settled out-of-court with DR for $800,000). But the limitations of CP/M were clear: no directories, only 1,024 files in the filesystem, and filesystem reference was by drive letter (e.g., A:, B:, C:).

The File Allocation Table (FAT) approach was simple, but effective. The filesystem was a simple set of sectors, with two (2) file allocation tables, one original, one backup. The allocation tables tracked the allocation of sectors. If a file took up only one sector, then the FAT entry for that sector would be marked as the end of the file. If the file took up more than one sector, then the initial FAT entry would note the next sector of the file. Each file is thus a chain of entries in the FAT, each referencing the next sector.
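To make the chain idea concrete, here is a minimal sketch in Python. It is a toy model only: the table is a plain list, and the EOF and FREE markers are my own illustrative stand-ins, not the real on-disk FAT encodings.

```python
# Toy model of a FAT: index = sector number, value = next sector in the chain.
# EOF and FREE are illustrative sentinels, not the real on-disk FAT12 values.
EOF = -1
FREE = None

def follow_chain(fat, first_sector):
    """Return the sectors that make up a file, in chain order."""
    sectors = []
    seen = set()
    current = first_sector
    while current != EOF:
        # A loop, or a link into an unallocated entry, means a broken chain.
        if current in seen or fat[current] is FREE:
            raise ValueError(f"broken chain at sector {current}")
        seen.add(current)
        sectors.append(current)
        current = fat[current]
    return sectors

# A 10-sector volume holding two files.
fat = [FREE] * 10
fat[2], fat[3], fat[7] = 3, 7, EOF   # file A: 2 -> 3 -> 7
fat[4] = EOF                         # file B: a single sector

file_a = follow_chain(fat, 2)
file_b = follow_chain(fat, 4)
print(file_a, file_b)
```

Note that the directory entry only has to remember the first sector; everything else about the file's location lives in the table.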

The FAT references were 12-bit, allowing up to 4,096 sectors to be addressed. With sector sizes of 512, 1,024 or 2,048 bytes, FAT12 could handle up to an 8MiB device. With up to 1,024 filenames, the FAT of a FAT12 only took up 1.5KiB (12,288 bits) of space.

- MS-DOS 2.1

MS-DOS 2.1 finally introduced the concept of a directory. To do this Microsoft "borrowed" source code from Santa Cruz Operation (SCO) Xenix, a port of UNIX source code to the PC and its 8088 processor. Microsoft helped found and fund SCO in 1978, before the United States broke up the monopoly of AT&T on the telephone infrastructure of the US, so UNIX was a non-commercial endeavor with source code available from AT&T (as well as popular derivatives such as that of the University of California at Berkeley). As anyone who has been around Microsoft and Linux a long time knows, Windows currently has more original UNIX/SCO source code from before the AT&T v. UCB settlement in 1993 (and the creation of the "UNIX(R)-free" 4.4BSDLite), than any free UNIX or UNIX-like system such as Linux (which only uses the UCB-owned 4.4BSDLite code).

One issue Microsoft ran into was the forward slash (/) for directory names: CP/M and, therefore, MS-DOS 1.0 used it for command line options (e.g., /?). The workaround was to use the backslash (\), which is where the drive letter-backslash (e.g., C:\) comes from. Most multi-user operating systems, including UNIX, did not use drive letters, and used forward slashes, since the 1960s -- pre-dating this decision by Microsoft in 1982 by almost 2 decades. I rather tire of people new to UNIX/Linux who complain that UNIX/Linux didn't follow Microsoft's so-called "lead."

MS-DOS 2.1 also introduced a 16-bit FAT, now allowing up to 32MiB filesystems to exist with a 512 byte sector size typical of fixed disks at the time.

- MS-DOS 3.31

Largely a major set of contributions by Compaq, MS-DOS 3.31 further expanded some 16-bit FAT features. One was the allocation of multiple sectors into a single block allocation unit. So instead of assigning a single sector per FAT entry, a block or "cluster" of sectors could be used, increasing the limit beyond 32MiB. Up to 64 sectors could be used for one (1) 32KiB "cluster," raising the maximum filesystem size of FAT16 to 2,048MiB (2GiB).
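The size limits quoted so far all fall out of the same multiplication, which is easy to check. This sketch is an idealized upper bound only: it assumes every possible FAT entry addresses one allocation unit, whereas a real FAT reserves a few entry values. The function name is mine, for illustration.

```python
def max_volume_bytes(fat_entry_bits, bytes_per_unit):
    """Idealized bound: each possible FAT entry addresses one allocation unit."""
    return (2 ** fat_entry_bits) * bytes_per_unit

MiB = 1024 ** 2
GiB = 1024 ** 3

fat12_max = max_volume_bytes(12, 2048)              # 12-bit FAT, 2,048-byte sectors
fat16_sector_max = max_volume_bytes(16, 512)        # 16-bit FAT, one 512-byte sector per entry
fat16_cluster_max = max_volume_bytes(16, 64 * 512)  # 16-bit FAT, 64 sectors per 32KiB cluster

print(fat12_max // MiB, "MiB")          # 8 MiB
print(fat16_sector_max // MiB, "MiB")   # 32 MiB
print(fat16_cluster_max // GiB, "GiB")  # 2 GiB
```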

- MS NT 3.1 and the New Technology Filesystem (NTFS)

Windows NT 3.1 hit beta test in 1992, over 2 years before MS-DOS 7 and Windows 4 were formally recognized and bundled into a product codenamed "Chicago" and known a year after that as "Windows 95" on the shelf. The original intent and purpose was to make MS-DOS 6.22 and Windows 3.11 the last generation of the 386Enhanced (i.e., Real86 DOS that constantly shunted in and out of Protected386 to run Win16/32s applications). Windows NT 3.1 borrowed heavily from OS/2, including its High Performance Filesystem (HPFS), as well as benefited from a mass exodus of talent from Digital Equipment Corporation's (DEC's) Virtual Memory System (VMS) team (almost resulting in a lawsuit until Digital and Microsoft came to an agreement on support of NT for their new Alpha processor, the OEM with the most MS NT certified professionals, etc...).

As detailed in the links on my low-level interoperability presentations, NTFS does a lot of things for "false security" that cause massive compatibility issues with NT itself. But NTFS is, in essence, a modified version of FAT. It still uses a FAT design, but has far fewer limitations (e.g., no more 8.3 limitations) and uses a more intelligent approach. One improvement is that the FATs are located closer to the middle of the filesystem, to reduce seek times (FAT filesystems allocate them at the start). And there are now formal approaches to discover which copies of the FAT are correct when they differ. Lastly, like HPFS, NTFS marks and forces filesystem integrity checks when the system is not properly shut down and the filesystem taken off-line (and uses the same CHKDSK.EXE program, although radically different than the legacy DOS program of the same name). NTFS one-ups HPFS by adding journaling, which reduces the recovery time required for bringing the filesystem on-line as consistent.

FAT16 was only supported in NT 3.x and even 4.0, although an NT-only 64KiB cluster size was an option that allowed up to 4GiB FAT16 filesystems to be created. This was due to the fact that through even NT 4.0, the installer could not install directly on NTFS, and could require up to 4GiB of space. There were also installer and boot-time issues with non-PC architectures (which used the ARC firmware -- long, long story).

- MS-DOS 7.0 (Windows 95, 95A)

MS-DOS 7.0 is at the heart of Windows 95 and 95A (OEM Service Release 1). It is still FAT, but adds hidden files for indexing long filenames. Ironically enough, MS-DOS 7.0 did not use the OS/2-NT functions for filesystem support, but when the 386Enhanced Windows 4.0 kernel loaded, it extended the existing DOS Interrupt 20-3Fh (typically 21h) services for long filename operations.

This means that until 1999, the filesystem mechanisms in all releases of Microsoft Windows NT and Windows 95/98 differed entirely. Programs had to be written for both, and many were not -- and this was just the tip of the iceberg for Windows NT v. 95/98 compatibility. 100% -- ALL of Microsoft's own programs -- FAILED their own "Designed for Windows 95 and NT" logo program, which caused Microsoft to scrap its own certification program because of such issues. But that is another, long, long story.

- MS-DOS 7.1 (Windows 95B on-ward)

In 1996 Microsoft released OEM Service Release 2 (OSR2), also known as Windows 95B, with a new 32-bit FAT design. This allowed filesystem sizes up to the 137GB (128GiB) limitation of the 28-bit Advanced Technology Attachment (ATA) specification -- commonly found in Integrated Drive Electronics (IDE) storage. FAT32 offers a few advantages, including storage of long filenames in the FAT32 design directly (instead of hidden index files), but no consistency checking or other benefits.

Windows 95 OSR2.5 (95C), Windows 98, 98 OSR1 and 98 OSR2 aka "Second Edition" (SE) all used the same MS-DOS 7.1 core.

Windows Millennium Edition (ME) was a Microsoft experiment to remove a lot of the legacy DOS 20-3Fh services and various interface options to force its own software application developers and independent software vendors (ISVs) to stop using the legacy DOS interfaces and start using the native NT/Win32 filesystem interfaces (among others). It was an utter failure as it did little to force change, all while destroying compatibility.

- MS NT5 (Windows 2000)

Windows 2000 was released in early 1999 and was the first NT release to finally support a subset of the legacy "Chicago" interfaces first introduced in 1994, 5 years earlier. By then, it was too little, too late, but it did help speed Windows 2000's adoption in corporations where Windows NT had been adopted far less due to compatibility issues with most Windows software. Also introduced with Windows NT 5 aka "Windows 2000" was the Logical Disk Manager (LDM) disk label (partition table format), which replaces the legacy BIOS/DOS disk label (partition table format) of primary/extended/logical (NOTE: it actually looks like a single primary partition of type 42h).

Most of the benefits of the LDM can be found in my interoperability presentation. Use of the BIOS/DOS disk label is quickly becoming a serious compatibility issue in MS NT5.1 (Windows XP) Service Pack 2 and greater, and newer ATA 48-bit addressing and legacy DOS/NT compatibility are conflicting. Again, see my interoperability presentation.

- Issues With the FAT Filesystem Design

There were some serious design flaws to FAT that still plague FAT even in MS-DOS 7 (95/98/Me) and NT5 (200x/XP) today, as well as NTFS itself.

When it comes to FAT, one is the two (2) FAT copies. Although the use of two copies was designed for redundancy, when the two FAT copies differ, there is no way to know which one is correct. Some 3rd party tools attempt to do so, but they can and often do pick the wrong one.

Another issue is the simplistic design of the chain of FAT entries. A lost chain -- whereby a single FAT entry is incorrect -- results in the rest of the file being lost. A related issue is the cross-linked chain -- whereby two FAT entries point to the same next sector. Even if a copy is made for each, it means one chain is now incorrect.
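Both failure modes are easy to show on a toy table. The following is a minimal sketch, not how CHKDSK or any real tool works: the table is a Python list, the "directory" is a dict of first sectors, and the EOF and FREE markers are illustrative assumptions of mine.

```python
EOF = -1      # illustrative end-of-chain marker
FREE = None   # illustrative free-entry marker

def check_fat(fat, directory):
    """directory maps a file name to its first sector.

    Returns (cross_linked, lost): sectors claimed by more than one file,
    and sectors marked in use that no file's chain reaches.
    """
    owners = {}
    for name, first in directory.items():
        current, seen = first, set()
        while current != EOF and current not in seen:
            if fat[current] is FREE:
                break  # chain runs into an unallocated entry; stop here
            seen.add(current)
            owners.setdefault(current, []).append(name)
            current = fat[current]
    cross_linked = {s: names for s, names in owners.items() if len(names) > 1}
    in_use = {s for s, nxt in enumerate(fat) if nxt is not FREE}
    lost = sorted(in_use - set(owners))
    return cross_linked, lost

fat = [FREE] * 10
fat[2], fat[3], fat[7] = 3, 7, EOF   # file A: 2 -> 3 -> 7
fat[4] = 3                           # file B: 4 -> 3, cross-linked into A
fat[8], fat[9] = 9, EOF              # 8 -> 9: allocated, but no file points here

cross_linked, lost = check_fat(fat, {"A": 2, "B": 4})
print(cross_linked, lost)
```

Notice that the table alone cannot say whether sector 3 really belongs to A or to B, which is exactly why a repair tool has to guess.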

But probably the worst issue with FAT, which still plagues NT5 (200x/XP), is the lack of any filesystem integrity whatsoever. If the system crashes, there is no way for FAT to report that it was not taken off-line correctly and left "consistent." Which means that when the system boots up, it does not know if the FAT filesystem is consistent. Although the CHKDSK.EXE program was introduced in later versions to do run-time integrity checks of the FAT filesystem, it still does it on a live FAT filesystem, and is rarely perfect. Microsoft later bundled SCANDISK.EXE, based on a license of Norton Disk Doctor (NDD.EXE), which had better recovery logic.

Most people think the lack of filesystem journaling (i.e., recording transactions and ensuring they are completed, somewhat like the "Atomicity" -- the first part of ACID in a good database design -- in a filesystem) is the issue, but it's actually not at all. Even non-journaling filesystems in UNIX have at least a mechanism to not only ensure consistency, but force a check to make the filesystem consistent. All of Microsoft's FAT filesystems, even in NT5 (200x/XP), never force the user to fully check a filesystem for consistency before mounting. In UNIX, we only allow filesystems to be mounted "read-only," if at all, until they are checked for consistency -- so no changes can occur on a possibly inconsistent filesystem.

Because of this, the FAT filesystem is often left in a far worse state -- with a FAT table with errors, which means new files written can and do get lost. Running SCANDISK.EXE during boot helps mitigate some risk, but by the time SCANDISK.EXE runs, some processes have typically written to the filesystem. Ironically enough, the lack of a "read-only" mount in not only DOS, but even NT5 (200x/XP) today, is the root cause of this issue that affects even the New Technology Filesystem (NTFS) as well, unlike almost every UNIX filesystem design.

In other words, even NT systems have no concept of a "read-only" mount. During start-up, NT expects everything but the "System" volume (the "System" volume is BIOS fixed disk 80h with the MBR, NTLDR, BOOT.INI, optional NTBOOTDD.SYS 3rd party disk driver, etc...) to be read/write, including the "Boot" volume (the "Boot" volume, which ironically comes after the "System" volume/stage, is the volume with \WINNT or \WINDOWS). This is due to the fact that many NT services expect the filesystem to be writeable during boot, including before any filesystem integrity check and/or journal replay of NTFS is made -- which could result in corruption.

Adding insult to injury, NTFS is very, very aggressive in its journal playback. Unless forced by explicit user option, NTFS often replays its journal. In rare, but eventually probable cases over extended usage and time, NTFS will self-destruct and leave itself unable to recover from a manual CHKDSK. Therefore, it is important that NT system administrators regularly force a manual CHKDSK at next boot to enforce regular, full filesystem integrity checks and minimize the chance of a future, improper journal replay.

- Fragmentation: Why FAT Sucks

Filesystem fragmentation -- the state in which files are not contiguous blocks, but allocations all over a filesystem which reduces performance as the disk must read a file from different parts of a disk -- is virtually unheard of in the UNIX world, and we will discuss why. But for now, let's look at the reasons why disk fragmentation occurs in Windows, starting with DOS/FAT and moving to NT.

DOS/FAT uses a simple "first available" allocation scheme. When DOS/FAT needs to allocate a new file, it scans from the beginning of the FAT table for the first available FAT entry. When it finds one, it uses it. If the file needs more than one "cluster," it looks for the next, which may not be the immediately next entry, and often is not. This instantly results in fragmentation, and it has not been improved in even the latest MS-DOS 7.1 code in Windows 95B through ME. NT's implementation of FAT also remains little changed. Through even MS NT 5.1 (Windows XP), FAT is still largely a "first available" allocation scheme.
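A small simulation shows how quickly "first available" fragments a file. This is a sketch of the scheme as described above, not of any actual DOS or NT code; the table is just a list of in-use flags and the function names are my own.

```python
def allocate_first_available(fat, clusters_needed):
    """Claim the first free entries found scanning from the start of the table."""
    claimed = []
    for index, in_use in enumerate(fat):
        if not in_use:
            fat[index] = True
            claimed.append(index)
            if len(claimed) == clusters_needed:
                return claimed
    raise OSError("volume full")

def free(fat, clusters):
    """Mark the given clusters as available again (file deleted)."""
    for index in clusters:
        fat[index] = False

def fragments(clusters):
    """Count the contiguous runs in a list of cluster numbers."""
    return 1 + sum(1 for a, b in zip(clusters, clusters[1:]) if b != a + 1)

fat = [False] * 16
a = allocate_first_available(fat, 3)   # clusters 0-2
b = allocate_first_available(fat, 3)   # clusters 3-5
free(fat, a)                           # delete the first file, leaving a 3-cluster hole
d = allocate_first_available(fat, 5)   # fills the hole, then jumps past file b

print(d, fragments(d))
```

One delete and one write are enough to split the new file in two, even though the volume has plenty of contiguous free space further along.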

NTFS is a bit better, but also a bit worse. In addition to locating the FATs more centric to the disk, NTFS separates directory and file entries to speed up directory indexing. As an additional note, the directory stores also carry meta-data, which is why SIDs and other NT-installation/registry-specific meta-data are tied to the directory entries and should NEVER BE MODIFIED on any NT installation except the one that created the NTFS filesystem (see my low-level interoperability presentation). At the same time, the separation of directory blocks results in a serious race-condition for run-time fragmentation tools, and required extra code to implement (and was not included as standard in Windows NT's defragmentation -- although later versions licensed the code from a 3rd party that solved the problem). Fragmentation of the directory entries is why even when an NTFS filesystem might be contiguous, file references can still be very sluggish -- typically much slower than the worst UNIX inode design.

But these are just the FAT design issues. Windows itself is far worse.

- Fragmentation: Why Windows Really Sucks

The legacy "single drive letter" and "everything goes everywhere" approach of Windows is its ultimate bane from a fragmentation, let alone reliability, standpoint. There is no strict separation of key boot-time, core operating system, user software binaries, temporary files and, ultimately, user data. Although many FAILED and often CONFLICTING approaches (especially between the "Chicago" and NT groups inside of Microsoft -- the former winning over the latter out of sheer numbers) have come and gone, there is still no strict separation of files to segment critical files from not-so-critical, as well as keep fragmentation from occurring.

Some of Windows' key issues:

- Dynamic Pagefile

Microsoft uses a dynamic pagefile by default, which quickly fragments into several sections spread over a disk. A quick workaround is to pre-allocate a large, static area early on. Even legacy Windows NT was typically smart enough to allocate a large, contiguous block, as long as there was one -- which is why it should be done early.

- Temporary, Log and Other Small Files: Less Reliability, More Fragmentation

This is my personal favorite: there is absolutely NO STANDARD in Windows for temporary files. Although environment variables pointing at C:\TMP and C:\TEMP are typically used, or C:\WINDOWS\TEMP is often the default in newer versions of Windows, these are on the C: drive. And that's before even looking at the various Windows registry, log and other system created files that regularly occur, also on the Windows "Boot" Volume (the volume where \WINNT or \WINDOWS is located, typically C:).

This means that temporary files -- the absolute worst thing for a filesystem -- are shared with everything else. Temporary files are often small files that eat up random places of the disk (as some are deleted quickly while others are left behind). As we'll talk about later, even though filesystem designs with extents (Microsoft does not offer one) can mitigate the fragmentation issues of filesystems that have both large, static files and small, temporary files, extents cause additional overhead that can quickly be self-defeating. As such, temporary files should always be on their own filesystem to prevent fragmentation -- let alone that their continuous creation and deletion exponentially increases the chance that an incomplete write during an improper shutdown (such as a system hang) can affect other files on the filesystem.

In all versions of Microsoft Windows, YOU WILL ALWAYS HAVE THE MAJOR RISK OF THE EXTENSIVE NUMBER OF TEMPORARY FILES ON YOUR *CRUCIAL* "BOOT" VOLUME BEING CONSTANTLY WRITTEN/DELETED -- IT IS UNAVOIDABLE, IT IS THE LEGACY OF WINDOWS AND CANNOT BE CHANGED, AND WILL NOT BE CHANGED.

Then there are the issues with the defaults of C:\My Documents and Settings (previously also C:\WINNT\PROFILES, among other, conflicting NT v. Chicago non-sense) and C:\Program Files. These are almost always defaulted to the same volume as the Boot volume as well (hence C: in most cases), and no matter what re-allocation, programs just want to write to those default locations.

Lastly, there is the continuing reality (and major security issue) that Windows programs want to write to the same directory where they are located. This is known as the "startup directory" in the program's settings. All Windows programs -- including the purest of Win32 applications -- have a "startup directory" and most programs -- using the Microsoft created Visual Studio loader code -- often assume they can write to that directory. Although the registry is being used by more and more programs as standard, along with C:\My Documents and Settings or the profile of a user, there still exists a real lack of standards. In almost every case, it goes back to the history of Microsoft's own Visual Studio and other development products -- conflicting back'n forth between NT v. Chicago and countless other, almost professionally laughable non-sense that makes up the core of almost every Windows application, all introduced by Microsoft itself (and not by ISVs, whom Microsoft likes to blame).

The result is that there is absolutely NO SEPARATION OF SMALL, LARGE, TEMPORARY, STATIC, DYNAMIC OR OTHER FILE TYPES IN WINDOWS, NOR WILL THERE EVER BE IN THE NEAR (POSSIBLY FAR) FUTURE. Fragmentation is a fact of life for Windows, and it's not going to be solved because both the legacy and current issues with Windows today are the continuing issue.

- The Little Known NT5+ (2000+) Hack: Anchors

One thing I always love to test MCSEs on is if they know about Anchors. Anchors were finally introduced in Windows NT5 (2000) in 1999, nearly 7 years after early Windows NT 3.1 Beta Testers suggested that Microsoft would be well advised to "mount everything under C:." Well, Anchors do just that, bring the concept of mounting other NTFS filesystems inside of another at a specified directory, in NT5+ (2000+).

That way, one can mount a C:\My Documents and Settings that is actually located on another filesystem. Maybe a C:\Temp that is also its own filesystem, assuming a system-wide set of variables point all programs at it -- instead of defaulting to C:\WINNT\TEMP or C:\WINDOWS\TEMP (God I want to [virtually, of course] shoot someone at Microsoft for defaulting to put temporary files in the HEART of the OS directory!). And best of all, I can locate the C:\Program Files on its own, static filesystem, separate from all the writing, overwriting, deleting and general filesystem "clusterfun" of the System and/or Boot volume (typically C:\).

The problem? Anchors still break everything. Unlike in the inode UNIX world where multiple filesystems are everyday life, today's Win32 -- which is a set of DOS Int21h function hacks with only partially followed Win32 functions by programs built with Visual Studio itself -- often results in programs breaking when the allocation units of C:\something aren't actually on the original C:\ filesystem. So Anchors are still limited as an option. I typically only use them on NTFS filesystems dedicated to data -- i.e., filesystems I'm sharing out via SMB to other systems, where the access is not by a local program.

But I figured it was important to note Anchors none-the-less, even though they should have been introduced 7 years ago (although that probably wouldn't have reduced their compatibility nightmare).


Traditional UNIX Mindset (Why It Still Works After 30 Years)

UNIX has been around a long, long time, and much of UNIX's design was lessons learned from other "time share" (multiuser -- something "hacked on" to Windows NT by a company named Citrix, not Microsoft itself) operating systems. Microsoft was not interested in these designs because Microsoft never designed an OS itself -- DOS' limitations come from its predecessor, and NT largely inherits the limitations of OS/2, itself already based on DOS'isms, from before OS/2 gained JFS (circa 1995, which we'll discuss).

- inode filesystems

UNIX systems use inode filesystems. Each filesystem entry, typically a directory or file, has an inode that stores both meta-data, and points to the data blocks. A key difference and mindshift from a FAT design is that FAT has a dedicated allocation table with a 1:1 reference to data blocks -- whereas inode filesystems actually use two different data block types, the data blocks and the inode blocks that point to them. FAT uses a dedicated allocation table of all possible blocks that could be allocated, inodes do not -- in fact, some filesystems (that pre-allocate inodes) could "run out of inodes" when a filesystem contains lots of small files and there is not a 1:1 inode to data block (e.g., run "df" and "df -i" and note the actual data blocks and inodes used).
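The "df" versus "df -i" comparison can also be made programmatically. This is a minimal sketch using Python's os.statvfs on a UNIX-like system; the usage() helper and its dictionary keys are my own naming, and some filesystems that allocate inodes dynamically may report zero for the inode counts.

```python
import os

def usage(path="/"):
    """Return block and inode counts for the filesystem holding path."""
    st = os.statvfs(path)
    return {
        "blocks_total": st.f_blocks,   # what "df" reports (in f_frsize units)
        "blocks_free": st.f_bfree,
        "inodes_total": st.f_files,    # what "df -i" reports
        "inodes_free": st.f_ffree,
    }

info = usage("/")
print(info)
```

A filesystem full of tiny files can show plenty of free blocks while inodes_free heads toward zero, which is the "run out of inodes" case.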

Pretty much every data block in an inode filesystem has an inode pointer using it, or reserving it (although designs differ), except the rare Superblocks. The Superblocks contain the core filesystem information (basic filesystem values, location of key inodes, free blocks, etc...) and are only a few kilobytes in size (typically one data block, 4KiB is commonplace), and several redundant copies are spread all over the disk (typically at fixed locations that are easy for experienced administrators to find in case a filesystem can and should be mounted with an alternate superblock).

Which is better, FAT or inode? It depends. Taking the "extra features" that inodes offer over FAT (which we'll discuss later), reliability can be a pro/con thing.

A FAT filesystem makes it easy to check for free blocks. Inode filesystems are more arbitrary, and a filesystem consistency check is always recommended on a regular basis to ensure the number of free blocks, as well as lists of blocks that are available or have been freed, are consistent. A FAT filesystem also makes it much easier to check for cross-linked files, whereas all inodes need to be inspected to see if multiple data blocks are referenced by different inodes (although there is an interesting bonus to this, as we'll discuss). Inode operations definitely make checks longer and more involved, although there are bonuses to this in consistency (which we'll discuss).

A big one to start is actually a surprise to many. Most Windows users assume that the "root inode" makes an inode filesystem more susceptible to corruption than a FAT filesystem, because it points to everything else. In recovery, it's actually the opposite, a major benefit. FAT filesystems separate the allocation entries (in the fixed FAT) from the directory references (in special data blocks), which means there are 2 different points where a failure could destroy the same data. Again, this is because the FAT design comes from MS-DOS 1.0 before directories were added in MS-DOS 2.1. Although NTFS improves somewhat on this separation, by allocating directories separate from files, it's still 2 points where either can cause the same, severe damage. Anyone who has had even a CHKDSK on an NTFS filesystem result in unknown "FILE####.CHK" files that no one knows about has experienced this issue first hand.

In an inode filesystem, if directory links are severed between the root inode, or any parent directory inodes below, at least the inodes below that inode are now their own tree. This is because inodes store both the directory tree and pointers to data blocks in one structure, the inode itself. If a filesystem integrity check results in a portion of the tree being "severed," that portion typically shows up as its own, self-contained tree under the typical "lost+found" directory -- names, directories, subdirectories, etc... intact. It all depends on the locality of the corruption or other fixed inconsistency, but if there were only a few points of actual "corruption," inode filesystems tend to be much easier to "piece back together" than FAT designs as a result of the "reference and allocation information as one" inode design.

As an additional benefit, the superblocks also keep a set of "reality check" values -- allocated data blocks, inodes used, inodes free (if pre-allocated), free data blocks, etc... These are regularly checked on boot against other values, and far more interrogated during a full filesystem integrity check (fsck). In fact, the CHKDSK used for even NTFS is not nearly as interrogating as an inode filesystem fsck because, again, the separation of allocation from reference in a FAT design affords little in the way of allowing a good transposition. FAT is the FAT, and if things are corrupted in the FAT itself, the day gets really bad. Inode filesystems at least have their own, self-contained reference lists -- directories, subdirectories, files and pointers to data blocks.

Lastly, in FAT filesystems, two allocations that point to the same data block are a cross-linked file. In an inode filesystem, this is a "hard link." Although Microsoft has been trying to come up with hacks to implement similar in NTFS, they are still not nearly as useful (and are very dangerous in NTFS). Although hard links sometimes introduce unforeseen issues in rare cases, they are typically very useful for many things.

NOTE: UNIX filesystems also define a meta-data file/directory reference that can cross filesystems as a symbolic link (symlink). Symlinks are far more useful than Windows .lnk files, and far, far more transparent as well. I will not go into the benefits and issues of hard links and symlinks, just know they work differently, and inode filesystems offer greater flexibility that results in their usage (as well as a history that accommodates their existence).
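For those who have never seen the difference, here is a minimal sketch using Python's standard os module on a UNIX-like system. It works in a throwaway temporary directory; the file names are arbitrary choices of mine.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    original = os.path.join(d, "original.txt")
    hard = os.path.join(d, "hard.txt")
    soft = os.path.join(d, "soft.txt")

    with open(original, "w") as f:
        f.write("data")

    os.link(original, hard)      # hard link: a second name for the same inode
    os.symlink(original, soft)   # symlink: its own inode, which stores a path

    same_inode = os.stat(original).st_ino == os.stat(hard).st_ino
    link_count = os.stat(original).st_nlink
    symlink_has_own_inode = os.lstat(soft).st_ino != os.stat(original).st_ino

    os.remove(original)          # drop one name; the data survives via the hard link
    with open(hard) as f:
        survived = f.read()
    dangling = not os.path.exists(soft)   # the symlink now points at nothing

print(same_inode, link_count, symlink_has_own_inode, survived, dangling)
```

The hard link is indistinguishable from the "original" name, while the symlink is only a pointer by path, which is also why it can cross filesystems and the hard link cannot.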

- UNIX Filesystem Hierarchy

The first thing many Windows users really "hate" about any UNIX/Linux system is the filesystem layout. The reality is that the filesystem hierarchy of UNIX and UNIX-like systems, although slightly varying between implementations, is far, far superior to the "free-form Windows" layout. In reading my initial "Windows Deprogramming" that basically shows Windows has NEVER had any notion of any layout strategy, combined with this discussion that will quickly show UNIX has always, you'll quickly come to appreciate UNIX and you should have an epiphany (if you already haven't).

In sticking with Linux, let's look at the Filesystem Hierarchy Standard (FHS), ignoring virtual filesystems like /dev, /proc, etc... that are not physical:

- System Directories Required for Boot/Maintenance
/bin
/boot
/etc
/lib
/sbin

- Temporary, Log, Variable Files
swap
/tmp
/var

- User/Service Programs
/usr

- User/Service Data
/home
/srv

The absolute hierarchy required to boot a Linux system into a maintenance mode is /bin (elementary programs), /boot (boot-time files), /etc (system configuration), /lib (kernel modules/drivers, core libraries) and /sbin (system/superuser programs). Pretty much everything needed is in these directories, and other than /etc, they are completely static in nature. So the root (/) filesystem contains these directories at a "bare minimum." Linux can and does mount this filesystem "read-only" at boot, so basic programs can be used to check the system (including fixing the root filesystem if need be) before anything else.

Now we get to those nasty temporary/log/variable filesystems. The dedicated swap (swapfile) filesystem is commonplace in any UNIX flavor, and Linux is no different (although you can use a "swap file," it is strongly discouraged and most installers do not even offer it as an option). /tmp (temporary) is the standard location in almost every UNIX flavor where almost all programs assume they can write (and is typically UNIX permissions 1777 -- all access with "sticky bit," so only the creator of a file/directory can modify/delete by default). UNIX programs have absolutely no concept of a "default" directory, and most will only use /tmp or the user's home directory when they need to write. /var (variable) is probably the biggest and most troubling filesystem of all, because it is where all log files, spool directories, and [temporary] user and service files (/srv is used in FHS 2.3+ for service data files) go. These filesystems are almost always separated out from all others because they are so heavily modified, often with small, temporary files.
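The 1777 permissions on /tmp are easy to decode with Python's standard stat module. This is just an illustrative sketch; describe_mode() is a helper name I made up, and it only looks at the two bits relevant here.

```python
import stat

def describe_mode(mode):
    """Report whether a directory mode is world-writable and has the sticky bit."""
    return {
        "octal": oct(stat.S_IMODE(mode)),
        "world_writable": bool(mode & stat.S_IWOTH),
        "sticky": bool(mode & stat.S_ISVTX),
    }

tmp_like = describe_mode(0o1777)   # the typical /tmp mode
plain = describe_mode(0o755)       # an ordinary directory, for contrast

print(tmp_like)
print(plain)
```

On a live system you could feed it the real thing with describe_mode(os.stat("/tmp").st_mode), after importing os; what comes back depends on how that system is configured.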

Now we get to /usr, which holds, by far, the largest collection of files and has the largest space requirements of a standard Linux installation -- all static and largely unchanging (except for patches or additions) -- /usr/bin, /usr/lib, /usr/share, /usr/sbin, /usr/X11, etc... -- stuff that is not required to boot, but is used after boot.

Next we have the user/service data files -- traditionally /home (or subdirectories of home) and, now for service data files in FHS 2.3+, /srv (previously and traditionally different portions of /var, like /var/www, /var/lib, etc..., possibly /home/www, /home/lib, etc... before that). These are a mishmash of small and large, dynamic and static files and directories depending on usage.

So, what we have are:
- The core, "avoid changing this stuff because we need to boot/fix things" set of files
- The temporary, "this stuff changes all the time and shouldn't mix with others" set of files
- The static, "the unchanging meat of the OS, only updated, patched and added" set of files
- The data, "changes for different types of usage, which can vary" set of files

This is UNIX at its finest, strict separation of boot, temporary, programs and data.
It makes it so damn easy to not only localize corruption, but inhibit fragmentation.

- Where UNIX goes even further

Now let's assume you are going to segment your UNIX filesystems into at least the four I listed above (which I'll detail further in the next section). What else does UNIX offer in this strategy that Windows does not when you do?

- Reserving usage
- Localizing security
- Localizing the unexpected

Most UNIX/Linux filesystems do many things to reserve usage of a filesystem. A big one is the common 2-10% (Ext2/3 use 5% by default) reservation of a filesystem. When a filesystem reaches 90-98% full (95% full on Ext2/3 by default), the kernel will prevent any further writing to the filesystem by anyone but root. Neither regular users nor most processes run as root, so in practice the filesystem stops accepting writes at 90-98%. At first this seems foolish and, in fact, many people complain about it, but it is for one very big reason -- fragmentation.

Fragmentation exponentially increases as a filesystem fills up. This reservation is a long taught, long learned lesson for UNIX/Linux administrators that should be very respected. Anyone who has filled up a Windows server volume should appreciate this given how poorly a Windows server performs afterwards, which is the same problem a UNIX server would suffer if it allowed it too. But unlike Windows servers, almost all UNIX/Linux distributions and filesystems (with a few, notable exceptions) enact this reservation -- to combat sudden and horrendous fragmentation that occurs as a filesystem becomes nearly full.
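
To put numbers on that reservation, here is a small arithmetic sketch. The helper name reserved_mb and the 4GB example size are my own assumptions for illustration; the tune2fs lines are left as comments because they need root and a real device (the device path shown is a placeholder).

```shell
# reserved_mb SIZE_MB PERCENT -> megabytes held back for root (integer math)
reserved_mb() {
    echo $(( $1 * $2 / 100 ))
}

# A 4GB (4096MB) /var at the Ext2/3 default of 5%:
reserved_mb 4096 5      # 204
# The same filesystem at the 2% and 10% ends of the common range:
reserved_mb 4096 2      # 81
reserved_mb 4096 10     # 409

# To inspect or change the real value (needs root; the device is a placeholder):
#   tune2fs -l /dev/vg00/lv04 | grep -i 'reserved block count'
#   tune2fs -m 5 /dev/vg00/lv04
```

A couple hundred megabytes on a 4GB filesystem is a small price for keeping the allocator out of the "nearly full" zone.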

Regarding localization of security, most UNIX filesystems have extensive mount-time options, including the ability to prevent programs from executing or from accessing specific capabilities. In addition, many well-designed UNIX applications default to not allowing access across filesystem boundaries (e.g., if /srv/www is a separate filesystem, the Apache web server will not allow access outside of /srv/www by default without a specific override). Segmentation and security go hand-in-hand when you wish to not allow one service to affect any other service, which is why many services (databases, web servers, print/spool servers, mail servers, etc...) use segmented filesystems for their user/service data and/or variable log/temporary files.
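
As a rough sketch of those mount-time options, the helper below checks whether a mount point carries a given option. The function name has_mount_opt is hypothetical, the fstab lines in the comments use placeholder devices, and the options shown (nodev, nosuid, noexec) are the standard Linux mount options; which ones suit a given filesystem is your call.

```shell
# has_mount_opt MOUNTPOINT OPTION [TABLE]
# Succeeds if MOUNTPOINT is listed in TABLE (default /proc/mounts) with OPTION set.
has_mount_opt() {
    awk -v mp="$1" -v opt="$2" '
        $2 == mp {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) if (opts[i] == opt) found = 1
        }
        END { exit found ? 0 : 1 }
    ' "${3:-/proc/mounts}"
}

# Example /etc/fstab lines for locked-down temporary and service filesystems
# (devices and mount points are placeholders):
#   /dev/vg00/lv03  /tmp      ext3  defaults,nodev,nosuid,noexec  1 2
#   /dev/vg00/lv05  /srv/www  ext3  defaults,nodev,nosuid         1 2

for opt in nodev nosuid noexec; do
    if has_mount_opt /tmp "$opt"; then
        echo "/tmp: $opt is set"
    else
        echo "/tmp: $opt is NOT set"
    fi
done
```

None of this is possible per-directory on "one big root (/)"; the options apply to a whole mounted filesystem, which is one more argument for segmenting.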

In continuing those compounding thoughts, I hinted at further reliability -- expect the unexpected -- localize for the unexpected.

The goal is keeping filesystems, especially the root (/) filesystem, consistent, largely unchanged, and reliable. It doesn't seem to "sink in" at first to most administrators, even some seasoned Linux administrators, but after years of UNIX/Linux exposure, you learn to quickly appreciate the existence of a "small" root (/) filesystem versus "one big" one. The common instinct is to move towards a "single C: drive" coming from Windows, or after a UNIX/Linux administrator has one filesystem "fill up" on them, but understand that filesystem localization is the best advice I can give anyone.

If something is already "out of control" and going to fill up a single, segmented filesystem, giving it "more room" to go "out of control" is not only not solving the problem, but any time afforded in "staving off" the eventual "out of room" event is going to be offset by the additional files and mess created. Simply put, I have yet to see where not segmenting out /var was a good idea -- and have explicitly caught several people who created "one big root (/)" only to see a rogue process adversely affect other server data, or at least cost a good 2-20 hours of "clean up time" as a result (possibly affecting server performance too).

Lastly, as a UNIX administrator you learn that the very nature of UNIX filesystems is (most often) the rule of conservatism. In other words, UNIX journal replays, filesystem integrity checks and other automated processes often don't want to automatically fix things if there is a chance that data will be lost -- quite the opposite of an NTFS journal replay that avoids the time of a CHKDSK to its own demise when it should have done a full CHKDSK. In many cases, UNIX systems will require you to do a full fsck/repair on a filesystem off-line (non-mounted) before it will let you continue, which means the larger the filesystem, the more time it will take to do so.

By segmenting UNIX filesystems, you can not only reduce the time when a filesystem needs to be checked by making it smaller (because when a journal misplay occurs, or a full fsck is required, it is typically only one filesystem of the entire lot), but if it is going to be an extended check, you can bring up the rest of the system without that one filesystem. E.g., I never make just one data filesystem, I make at least two. That way, if a full fsck is required on one, I can bring the server up and let half my users work while the other half waits 15-60 minutes, instead of having all my users wait on the 30-120 minute fsck required on one, big data filesystem.

- "But I Just Gotta Defragment My UNIX/Linux Filesystem!"

Okay, some of you are just so programmed that even though you appreciate and even believe that UNIX/Linux filesystems need to be checked less, you just want to defrag for maximum performance. So what program do you use? Well, it depends on the filesystem. Instead of going into a huge HOWTO on every filesystem, I am going to cover this under the "best practices" for the two Linux filesystems I deploy. I would rather give correct info/recommendations for those two, based on my experience and in that context, than to try to tell you what to do for any, arbitrary setup.

My Professional Linux Practices (and Even More, Why?)

Okay, now you think I'm going to get into Ext3 v. ReiserFS v. etc... Right? Not exactly. I'm not here to debate the merits of Linux filesystems, I'm here to tell you why I deploy the Linux filesystems I do, in what way, to what end and to -- MOST IMPORTANTLY -- mitigate risk to my systems' uptime and their data. Nothing else matters to me and I have no problem if you disagree with me, but at least respect my comments as calculated and proven for myself and my clients.

- Volume Management

Regardless of OS, I use Volume Management on the PC. Although some RISC/UNIX platforms have good disk labels (aka partition table formats) that are well-designed for their architectures, the massive issue with the PC and the legacy BIOS/DOS disk label (aka partition table format) using primary, extended/logical slices (partitions) is the fact that it is at the mercy of varying/conflicting disk geometry issues and has no means to store meta-data for volume information.

This means when I deploy Windows NT5+ (2000+), I configure a slice of type 42h for a Logical Disk Manager (LDM) Disk Label (aka "Dynamic Disk"). When I deploy Linux 2.4 or 2.6, I configure a slice of type 8Eh for Logical Volume Manager (LVM) -- version 2.6 using LVM2. I almost always do this regardless of whether or not I'm doing RAID, snapshots, etc..., I do it for flexibility. I leave it up to individual sysadmins to decide for themselves, but I encourage you to learn LDM and LVM/LVM2 rather than avoid them, because there are sound reasons for using them.

In fact, the common physical volume (pv), volume group (vg), logical volume (lv) 3-level approach in Linux's LVM is basically ubiquitous across a host of UNIX flavors and their various platforms. Learning the elementary terminology, and how to do basic, harmless operations like allocating new space, is highly recommended. If you don't know your way around any UNIX LVM, then learn it so you are ready to deal with most implementations.
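
The pv/vg/lv levels map onto a handful of commands. The sketch below only prints them unless DRY_RUN=0 is set explicitly, since they need root and will happily write to whatever device you name. The run wrapper, the device /dev/sda3 and the names vg00/lv04 are assumptions for illustration; the LVM commands themselves are the standard Linux ones.

```shell
# Print each command; only execute it when DRY_RUN=0 is set explicitly.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

PV=/dev/sda3      # placeholder: a slice of type 8Eh
VG=vg00           # placeholder volume group name
LV=lv04           # placeholder logical volume name

run pvcreate "$PV"                  # 1. mark the slice as a physical volume
run vgcreate "$VG" "$PV"            # 2. pool it into a volume group
run lvcreate -L 4G -n "$LV" "$VG"   # 3. carve out a 4GB logical volume
run lvextend -L +4G "/dev/$VG/$LV"  # later: allocate 4GB more to that volume
run vgdisplay "$VG"                 # check how much free space remains
```

Note that lvextend only grows the logical volume; the filesystem on top still has to be grown separately with that filesystem's own tool.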

- Segmented Filesystems

There is a lot of debate on this, but based on my previous comments, I will not change my mind. I do agree that creating too small of a filesystem is a bad thing, and I regularly run into it on existing systems that I wish I could easily change. In fact, prior to LVM in Linux, I adopted an "equal size" arrangement/approach that many have mirrored. In a nutshell, I never make any filesystem smaller than system memory, I make the essential filesystems of equal size, and I make any subsequent support filesystems a multiple of that essential size.

For example, I consider the "bare minimum" Linux filesystem segmentation to be:
/
swap
/tmp
/var

I absolutely and positively will not tolerate /tmp or /var living on the root (/) filesystem. If I am absolutely hurting for space, I will symlink /tmp -> /var/tmp, because I like to avoid putting even /tmp on the root filesystem. When implementing these "bare minimum" Linux filesystems, I assume /home will be mounted remotely. If not, then a separate /home filesystem is of great consideration, although there are one or two workarounds (see /usr/local below).

Regarding size, I typically try to stick with the maximum size of the largest common removable media for each. The reason why is because that's the size of a typical OS install (and the maximum I could expect all updates for an existing install to be), and typically a small multiple of the common system memory. About 5 years ago, this was a CD-R, so I typically used 0.5GB or 1GB. Now this is the DVD-R, so I have typically been using 4GB or 8GB. And I always make sure these sizes are never less than the amount of system memory -- so raise them if they are not.
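
The sizing rule boils down to "media size, unless system memory is bigger." A tiny sketch, with essential_size_gb being a made-up helper name and the sizes given in whole GB:

```shell
# essential_size_gb MEDIA_GB RAM_GB -> size in GB for each of /, swap, /tmp, /var
essential_size_gb() {
    if [ "$2" -gt "$1" ]; then echo "$2"; else echo "$1"; fi
}

essential_size_gb 4 2    # 4: a DVD-R sized filesystem already exceeds 2GB of RAM
essential_size_gb 4 6    # 6: raised to match 6GB of RAM

# System memory in whole GB, rounded up, read from /proc/meminfo (Linux only):
#   awk '/^MemTotal:/ { print int(($2 + 1048575) / 1048576) }' /proc/meminfo
```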

Next comes the big enchilada, the /usr filesystem for static user/service binaries.
/usr

In a system where my space is limited, I will do a smaller install and just use the root (/) filesystem to store /usr. But when I have space, I make /usr fairly large -- typically 4-8x the amount of space that the OS will put in /usr. These days, for a full Linux install, that is roughly 16GB or 32GB. Over the life of the system, with updates and even a number of concurrent additions of necessary distro-provided (or an associated repository) packages added later, this is more than enough. Any other accommodation should be done with a separate filesystem (see /opt or /usr/local below).

Next we are left with the user and service data of the system. The size of this will vary, and they may or may not need to be separated out.
Workstation Optional (choose 1): /home, /export/local, /export/(systemname)
Workstation Optional: /srv
Server Standard (2+): /export/(systemname)(#)
Server Standard: /srv

For workstations, things vary.

Regarding local /home, if my users are going to mount data over a remote NFS mount and not require any local storage, then a /home filesystem is optional. If there will be no regular NFS mount, I will often create a /home filesystem. If NFS mounts to other systems will occur frequently and/or local storage is required, I actually like to create a /export/local. If I know local data will be regularly shared out via NFS or SMB on a specific LAN, I create /export/(systemname). In the case of the latter two, I will either symlink /export/* to /home/*, or NFS export /export/local or /export/(systemname) and locally mount into /home/local or /home/(systemname). In most cases, the (systemname):/export/local is possibly in the local network's NIS/LDAP automounter maps and done automagically anyway. It all depends.

Regarding /srv, if the system is a workstation/desktop, then there's probably little need for /srv, or it should be the same size as the essential filesystems listed previously (/, swap, /tmp, /var), just for storing a few services (like maybe a quick FTP-SSL or other data access option). In most cases, you can forget all about /srv on a workstation/desktop.

For servers, things change drastically.

If the system is a data file server, then I create at least 2 /export/(systemname)(suffix#) filesystems -- at least 2 for the reasons I previously explained -- for user data.

For service data, /srv is the new FHS 2.3+ location, although older systems might be /var/lib, /var/www, etc... The base /srv should be at least as big as the essential filesystems, if not as big as a data filesystem if the server is providing a lot of different services. If the server has a primary role as a mail, web, file and other service, then I like to separate those out for both localization and security reasons.
Server Optional: /srv/ftp, /srv/www, /srv/...

Furthermore, if the server is a mail, print or other spooling service, then additional /var/spool, /var/mail and other /var/* subdirectories should be created as appropriate:
Server Optional: /var/spool, /var/mail, /var/...

[ NOTE: You should consider an IMAP directory a "service data" directory and not a "service temporary/spool." IMAP directories, assuming mbox is used, is like a collection of large files that are running. This is impor

Far, far, FAR too often do I see Samba servers handling print operations rendered useless, bringing down the whole network, because the /var/spool directory was on root (/), or possibly on a small /var that is now holding up any other services using /var.

Lastly, we are left with the "Optional/Local" filesystems. These are for directory trees that are not part of the standard distribution packages or locations. They are commonly /usr/local and/or /opt, and have their own root-like structure underneath with bin, etc, lib, sbin, var, etc... -- especially for src (e.g., /usr/local/src). They should be used sparingly! But if there are a lot of customizations going on, they are a good idea -- and should probably be equal in size to at least /usr, possibly a data filesystem if need be. Unless a standard 3rd party app wants /opt all to itself, I symlink /opt to /usr/local. In a few cases, I keep both, and /usr/local is actually an NFS mount to a common, shared tree among systems of the same release/version (e.g., appserv:/usr/local.SunOS5.9 -> /usr/local on a Solaris 9 system).
Optional: /usr/local (commonly symlink /opt -> /usr/local)

So to summarize ...
The "essential" filesystems (typically 4-8GB today, 1-2GB bare minimum):
/, swap, /tmp, /var
The "binary" filesystem (typically 16-32GB today, 4-8GB bare minimum):
/usr
The "data" filesystem(s) (typically at least the size of /usr, if not bigger):
/home, /export/local or /export/(systemname)[#]
The "service data" filesystem(s) (from as small as "essential" to as large as data):
/srv (workstation optional)
/srv[/ftp,/www,etc...] (application-specific services only)
The "service temporary" filesystem(s) (typically same as /usr):
/var[/lib,/mail,/spool, etc...] (spooling/some application-specific services only)
The "optional/local" filesystem:
/opt [ -> /usr/local ] (rarely its own filesystem, only when 3rd party dictates)
/usr/local (same as /usr, can host all added files in most cases, most 3rd party)

Spares: If you do not use LVM, I recommend at least one primary/logical slice reservation per "same size." If you use LVM, reserve a good 10-25% of the volume group for future need.

- Why I Do Not Deploy ReiserFS and JFS

Let me first say that I highly respect every Linux filesystem development lead, from Steve Best (JFS), to Hans Reiser (ReiserFS) to Nathan Scott (XFS) to Stephen Tweedie (Ext3). Each is fairly good at explaining their focus, results and advantages without too much of the nonsense I see from most users engaged in "filesystem pissing contests." Over the years I've gotten a few facts wrong myself, but I've come to the same preferences over and over via my methodical usage and approaches.

It was because SuSE shipped ReiserFS as standard that I could not consider SuSE, and even representatives from SuSE recommended that I not explore ReiserFS. Why? Because I had an engineering network that used NFS, and there was no other network filesystem that could give the type of access and push the kind of data we needed. Nothing against Hans Reiser and ReiserFS, and I've actually only been impressed by his ideas and implementation -- including the meta-data journaling approach, as well as other innovative features of ReiserFS 3 as well as work on ReiserFS 4.

ReiserFS continues to build a revolutionary filesystem that lacks the traditional UNIX inode layout and interfaces, which is why ReiserFS lacks a lot of kernel feature compatibility: the Linux Virtual Filesystem (VFS) layer cannot abstract all of these features for a filesystem that just isn't of the same, traditional design. This prevents me from using ReiserFS. As an additional consideration, by his own admission, Hans Reiser has stated that filesystems should be redesigned every 5 years. As much as I've seen ReiserFS handle dynamic changes without incident, as much as I've never seen ReiserFS make a journal misplay, the fact remains that with a continually fluid design, or significant changes on a regular basis, the off-line tools continue to lag the on-line kernel implementation. So while I might be okay as long as a ReiserFS filesystem is matched against the proper kernel, the second ReiserFS (properly) does not trust its journal replay, I'm at the mercy of the off-line tools. And so far, I've had horrendous luck when that happens.

In OS/2 Warp, IBM began a new, revolutionary filesystem design. Being that Microsoft no longer had access to IBM's technologies and code (legally since 1993, after 1993 is a long story), a radical replacement for HPFS was devised. The result was the IBM Journaling Filesystem (JFS) and it was extremely innovative. IBM spent the next few years porting JFS to its AIX UNIX operating system, added all the traditional inode structures and kernel filesystem support necessary and expected by a set of standard UNIX interfaces. By 2001, the job was completed and JFS2 was born.

Naturally the JFS2 port would have and should have been the foundation for the Linux port -- even if it started in 1999 before completion on AIX. But as contracts would have it, IBM had a Non-Compete Clause in their Project Monterey (64-bit UNIX) agreement with SCO which prevented IBM from porting code from AIX (Monterey for IBM Power is known as AIX 5L), so IBM ported JFS from OS/2. This meant that IBM had to re-create, "clean room," all those interfaces they had spent 4 years doing for AIX. As such, by 2001, when JFS was considered "production quality" for Linux, it lacked almost all major feature support for Linux -- quotas, NFS, etc...

To this day, JFS still suffers from some compatibility issues with standard Linux VFS features, although its fairly static design and traditional layout does make it more compatible than most ReiserFS developments. In any case, it has been a non-consideration for myself, even if others have deployed it to great success.

- My Experience with Ext3 and XFS

I adopted Ext3 in early 2000 for kernel 2.2 when it was still only the "[full data] journaling" mode. It was little more than a simple "double-buffer" commit. It was easily converted to Ext2, as well as back, and it did the job to drastically reduce fsck times. Probably the biggest sell for Ext3 was the ability to drop into a full fsck when necessary -- something that saved me dearly when a physical disk error occurred (and my RAID card firmware and driver were not compatible -- long story). To use a trusted fsck of 10 years on a filesystem whose structure had not changed in the same period of time was convincing enough.

Since then, I have only trusted my "essential" filesystems to Ext3 without reservation. I have never lost an Ext3 filesystem, and I have had no unexpected data loss with either "journal" or "ordered writes" mode. I purposely avoid "write back" mode because of its inherent risk of affecting files that are not being modified. With newer directory indexing features, I find the performance of Ext3 to be more than adequate for filesystems under 100GB. It should be noted that I purposely avoid using Ext3 on filesystems greater than 1TB (even though newer versions support up to 8.8TB/8TiB).

The Ext3 base feature set -- full NFS compatibility, most other, standard Linux features in mid-to-late 2.4 (quotas, POSIX EAs/ACLs, etc...) -- was sufficient for most operations, especially in the early days of Ext3 back in kernel 2.2.

Unlike JFS, XFS was a direct port from Irix to Linux. Unlike any other filesystem, XFS brought a lot of hefty requirements that prevented it from being in the stock kernel. The good news is these capabilities were ported into kernel 2.5, and most other filesystems now benefit. Other than some paging features tied to Irix that had to be written for Linux, XFS was a clean implementation on Linux. And that included the wealth of features that were standard in XFS. This included full extended attributes (EA) in the inode itself, a feature still lacking from most other Linux filesystems (let alone most other UNIX filesystems) that hack on a hidden file. And best of all, like Ext2/3, the structure had remained unchanged from its traditional UNIX design since the mid-'90s, despite all its advanced features. There was even Linux quota support for XFS before Ext3, while NFS compatibility is just as good (among other standard Linux kernel features).

XFS uses both extents (which JFS also does) and delayed allocation (which ReiserFS 4 also does) to combat fragmentation. This makes XFS ideal for filesystems where files, both large and small, could be written. In traditional filesystems, the combination of lots of large and small files causes all sorts of allocation issues that typically increase fragmentation. Delayed allocation helps pack smaller files better, but cannot do the same for large files. Extents help separate small and large files into their own allocation areas of the disk, but small files are not always packed well. Only the combination of both delayed allocation and a proven extents strategy -- which XFS was designed for and implemented on Irix from day 1 in the mid-'90s, now ported 100% to Linux -- gives the best of both worlds. Now there are limitations to the combination of delayed allocation and extents. Most of them have to do with the additional overhead, which will be covered later with regards to fragmentation.

But the major, key differentiation of XFS is that it is built upon its existing, proven, stable structure on Irix. That included the full suite of off-line tools with 5+ years deployment -- xfs_repair, xfsdump/xfsrestore, xfs_growfs, etc... The off-line repair tool was very trusted. The dump/restore pair not only captured any EA/ACL info, thanks to its native storage directly in the inode**, but could also be safely run against a mounted XFS filesystem without requiring a snapshot or other volume management "freeze" (unlike Ext3). It already had the ability to be grown, managed, reorganized (defragmenter), etc... with a suite of tools that already existed, not ones that were being promised for future development.

[ **PROFESSIONAL NOTE: Ironically enough, XFS was ready for SELinux before Ext3 (its XATTR format is a fully supported XFS inode EA type), which begs the question of what Red Hat is waiting for?!?!?! XFS is a perfect complement to Ext3 to address its deficiencies in data and larger filesystem deployments. ]

For data filesystems, I was sold on XFS and started using it immediately. I tested XFS for other filesystems as well, but quickly stopped considering it, both because its performance on "temporary" filesystems was not optimal and because I had two /var filesystems get hit by the XFS 1.0 bug. The bug was an oversight in the design of the one additional requirement for the Linux port, the paging facility that was previously tied to Irix -- something that has long been fixed and is now trusted (especially in 2.6 where the paging facilities are part of the stock kernel code).

- Specific Practices for Ext3 and XFS

Now even though Ext3 and XFS work quite well for myself "out-of-the-box" (with a few distro exceptions/workarounds), there are still some specific practices and recommendations I utilize for each.

Ext3 gets the call for all "essential" filesystems -- /, /tmp, /var. Its static nature means I can read it with almost any boot disk (although I try to stay with the distro's recovery CD/mode). I also use it for all temporary filesystems, including mail, spool and most service directories that are 32GB or smaller.

The only issue of major concern with Ext3 is the pre-allocation of inodes. The ratio of blocks to inodes is typically 8-16 or so (one inode for every 32-64KB on the typical filesystem with 4KB blocks). On the /var filesystem, or another temporary filesystem with lots of small files -- possibly a mail or news spooler (although not nearly as much in the case of mail these last few years with MS-TNEF flying around ;-), this is not ideal. It is very often the case that a "df -i" will show inode usage at twice the percentage of block usage -- although newer logging defaults in most distributions/services are not nearly as bad as of late. So using the "-i" or "-T" option to "mke2fs" is recommended when creating Ext3 /var and /var/spool filesystems. E.g. (1:1 inode-to-data block assuming a default data block is 4KB):
# mke2fs -i 4096 -j -L var /dev/vg00/lv04
# mke2fs -j -L var -T news /dev/vg00/lv04

See "man 8 mke2fs" for more information.
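
To get a feel for what the bytes-per-inode setting means, here is a back-of-the-envelope sketch. The helper name inode_count is hypothetical and the result is only an approximation of what mke2fs would actually allocate (it rounds per block group), so treat the numbers as ballpark figures.

```shell
# inode_count SIZE_MB BYTES_PER_INODE -> approximate inodes mke2fs would create
inode_count() {
    echo $(( $1 * 1024 * 1024 / $2 ))
}

inode_count 4096 32768   # 131072  (one inode per 32KB on a 4GB filesystem)
inode_count 4096 4096    # 1048576 (mke2fs -i 4096, one inode per 4KB block)

# On a live system, compare block usage against inode usage:
#   df -h /var
#   df -i /var
```

If "df -i" shows the inode percentage racing ahead of the block percentage, the filesystem was created with too few inodes for its workload, and the only real fix on Ext3 is to recreate it with a smaller bytes-per-inode value.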

XFS, on the other hand, dynamically allocates inodes (just like JFS and ReiserFS) so the number of inodes is not an issue. Furthermore, XFS uses advanced packing techniques so data can be stored directly in its own inode (instead of using a data block) when small enough, as well as other usage reduction approaches (the most of any Linux journaling filesystem design).

However, I typically deploy XFS on user data filesystems, and the rare, large service directory (e.g., database, IMAP spool, etc...). On user data filesystems, I typically wish to take full advantage of Extended Attributes (EAs) like Quotas, ACLs, SELinux, etc... support. The default 256 byte size of an XFS inode is not ideally suited for storing POSIX ACLs, as less than 64 bytes are typically left for EAs. Should an inode need more space, a full data block (typically 4KB) would be allocated, which is not always ideal, plus it means not all of the meta-data is stored in a single inode. So when using ACLs and/or SELinux, increasing the inode size in XFS to 512 bytes (possibly 1024 bytes when using both heavily) is recommended, at only a small disk penalty overall (a tad more noticeable with 1024 bytes). The option to use a larger inode size when creating an XFS filesystem is "-i size=value" such as follows:
# mkfs.xfs -i size=512 -L engr_unclass /dev/vg01/lv01
# mkfs.xfs -i size=1024 -L engr_secret /dev/vg02/lv01

I absolutely cannot live without xfsdump when it comes to filesystem backup. Instead of having to deal with various backups of hidden file ACL and SELinux information in other filesystems, that information comes over in the inode itself during an xfsdump (again, Red Hat why don't you support XFS for these key enterprise features, removing the need for further hacks to/for Ext3 once and for all?!?!?!). And it was designed to be run on a mounted XFS filesystem, taking away the need to make a snapshot or other freeze-in-time/off-line-equivalent technique (other than for databases -- which is another, non-filesystem related issue). The existence of xfs_copy is also a nice off-shoot utility for quick cloning of an existing XFS filesystem that might not be the same size (unlike dd), without losing all of the ACL and SELinux information in the inode meta-data (that would not be preserved with a find/cpio, tar, etc... command). I've done one xfs_growfs without incident atop of a logical volume -- pretty quick and straight-forward -- all while the filesystem was mounted too. ;->
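
For reference, a typical dump/restore/grow sequence looks roughly like the following. As written it only prints the commands unless DRY_RUN=0 is set, since they need root and real devices. The run wrapper, the mount point, dump file, labels and logical volume are all placeholders of mine; the xfsdump, xfsrestore, lvextend and xfs_growfs options used are the standard ones.

```shell
# Print each command; only execute it when DRY_RUN=0 is set explicitly.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

FS=/export/engr1                    # placeholder: a mounted XFS filesystem
DUMP=/backup/engr1.level0.xfsdump   # placeholder dump file
LV=/dev/vg01/lv01                   # placeholder logical volume under $FS

# Level 0 (full) dump of the mounted filesystem; -L and -M supply the session
# and media labels up front so xfsdump does not stop to prompt for them.
run xfsdump -l 0 -L engr1_full -M engr1_media -f "$DUMP" "$FS"

# Restore that dump into an existing directory, EAs/ACLs included.
run xfsrestore -f "$DUMP" /mnt/restore

# Grow the logical volume, then grow the mounted filesystem to fill it.
run lvextend -L +8G "$LV"
run xfs_growfs "$FS"
```

Higher dump levels (-l 1 through -l 9) give incrementals against the last lower-level dump, which is how I would layer a weekly full with nightly incrementals.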

- Defragmenting Ext3 and/or XFS

Defragmenting Ext2/3 has basically one rule, don't do it. Although the [e2]defrag utility exists, it always seems to lag the Ext2/3 developments, severely. For the most part, it typically can't hurt to try [e2]defrag -- if an attribute is detected that it doesn't support (such as journaling -- requiring Ext3 to be converted down to Ext2), then it will fail to run. Some guides suggest disabling attributes with "tune2fs" until it runs, but that is a huge mistake -- those attributes are set for a reason.

[e2]defrag must be run on an off-line filesystem. But at this time, I cannot recommend it. I typically limit my usage of Ext2/3 filesystems to filesystems under 100GB. Although I use them for temporary filesystems which fragment heavily (e.g., /tmp, /var, /var/spool/mail, etc...), I localize that fragmentation by appropriately segmenting those filesystems. That seems to limit the degradation.

XFS was designed from the start to completely eliminate fragmentation. The combination of extents -- by which small files and large files are allocated in completely different areas of the filesystem to prevent packing issues -- along with delayed allocation -- which ensures small files are packed well and not merely allocated "first free block" -- prevents nearly all fragmentation. While some people advocate XFS because it does avoid such fragmentation, it should never be used for filesystems with lots of small files -- especially not lots of small, temporary files with lots of writes. In those cases, the overhead of extents and delayed allocation completely negates the benefits of reduced fragmentation.

In other words, it's probably better to segment /tmp and /var out as separate Ext3 filesystems, which do quick indexing and writes/deletes (even if with more fragmentation), than to bog down the root (/) filesystem with the overhead of writes/deletes to /tmp and /var subdirectories on XFS. Especially since there can be boot-time considerations with XFS (the filesystem does not offer a "bootstrap" at the beginning of a slice/partition -- so the boot loader must be in the MBR), and it's always good to not put root (/) at the mercy of continuous writes/deletes on /tmp and /var files anyway. In general, every attempt of mine to use XFS for /tmp, /var or other temporary filesystem with heavy small file writes/deletions has been less than ideal compared to Ext3.

With that said, SGI did come out with a filesystem reorganizer (xfs_fsr) tool after a few, rare applications did show significant fragmentation over time (such as large files that grow regularly). Like nearly all of XFS' tool suite, the filesystem reorganizer works directly with XFS' journaled implementation on-line while the filesystem is mounted. By default, the reorganizer runs for 7200 seconds (2 hours), user settable with the "-t" option. It makes as many passes as it can in that time, with each pass attempting to reorganize the 10% worst fragmented files on each filesystem. With no options, it attempts to run on all mounted XFS filesystems (i.e., /etc/mtab), although a filesystem specific list can be passed. Options also exist to pick up where a previous run left off.

For more information on the XFS filesystem reorganizer (xfs_fsr), see "man 8 xfs_fsr".
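
As a rough illustration of the invocation described above, here is a minimal Python sketch that only assembles an xfs_fsr command line; it does not run anything. The helper name build_fsr_command and the example mount points are hypothetical, made up for this post. The "-t" (run time in seconds) and "-v" (verbose) options are the ones documented in the man page; check your own version before relying on anything else.

```python
import shlex

DEFAULT_RUNTIME_SECONDS = 7200  # xfs_fsr's documented default run time (2 hours)


def build_fsr_command(mount_points=(), runtime_seconds=DEFAULT_RUNTIME_SECONDS, verbose=False):
    """Assemble (but do not run) an xfs_fsr command line.

    Hypothetical helper for illustration only. With no mount points,
    xfs_fsr works through every mounted XFS filesystem on its own.
    """
    if runtime_seconds <= 0:
        raise ValueError("runtime_seconds must be positive")
    command = ["xfs_fsr", "-t", str(runtime_seconds)]
    if verbose:
        command.append("-v")
    command.extend(mount_points)
    return command


# Example: a one-hour, verbose run limited to two made-up mount points.
example = build_fsr_command(["/export/data", "/home"], runtime_seconds=3600, verbose=True)
print(shlex.join(example))
```

Handing the resulting list to something like subprocess.run (as root) would actually start the reorganizer, which is why the sketch stops at printing it.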

Addendum: XFS List Comments/Clarifications

Some comments came up from people on the SGI Linux XFS list about issues with XFS, so I wanted to repost what I posted there about my XFS roll-outs. In most cases, the issues are race conditions that have little to do with XFS and also affect Ext3, but they often stem from the backport of XFS to 2.4 in the stock kernel (and not SGI's releases for Red Hat Linux 7). I could also go into many details on the layers upon layers of storage/filesystem code that are quickly getting out of control (and that I would argue are based on poor beliefs about, or limited deployments of, "good" hardware RAID).

Notes/clarifications on my specific XFS deployments ...

1. Kernel 2.4

I have _never_ used the XFS backport to kernel 2.4. Frankly, I don't trust it. Not because of XFS, but because of kernel 2.4, and because it doesn't come directly from SGI, tested and blessed.

I have only used the official XFS releases for kernel 2.4, largely XFS 1.2 for Red Hat Linux 7.x, with limited use of 1.3 on Red Hat Linux 9. In fact, I kept deploying only Red Hat Linux 7.3 and, to a lesser extent, Red Hat Linux 9 with XFS until late last year (once Fedora Core 3 came out), tapping FedoraLegacy.ORG for updates. Again, this means I'm back at kernel 2.4.20 -- and I really never trusted newer 2.4 kernels anyway! I have not had the NFS issues others have complained about, and I've leaned heavily on xfsdump for backups including ACL information, as well as on quota support.

2. Kernel 2.6

With Fedora Core 3, I started deploying XFS on kernel 2.6, but I don't put my faith in it yet with 4K stacks. I was very disappointed when Red Hat forked Red Hat Enterprise Linux 4 development and did not bring XFS over. I think it was a huge mistake to not put in the effort to see XFS ready for 4K stacks (NOTE: 4K stacks are something I do agree with Red Hat on doing). Red Hat could offer a lot to XFS if they had to maintain it equally with Ext3 under RHEL. Again, I will assert it is in their best interest to do so. With Fedora Core 3 I have quotas, NFS, ACLs and, now, SELinux, but it is not as tested and proven as my old XFS 1.2 deployments on Red Hat Linux 7.x (and I assume I'll start running into stock kernel implementations soon enough).

Fedora Core 3 should be supported until mid-December when Fedora Core 5 is currently planned for Test2. Reality will probably dictate that FC5T2 slip to early next year -- and even then, with Fedora Core 3's stability and popularity, I see FedoraLegacy continuing to support it for some time (unlike Fedora Core 2 or 4).

It should also be noted that CentOS (a 1:1 rebuild of RHEL from SRPM) also offers XFS in its CentOS Plus (packages that are different than stock RHEL) kernels. The CentOS 4 Plus kernel basically seems to be the same as Fedora Core 3, with XFS from the stock 2.6 kernel implementation. While I trust it more than the 2.4 backport (even though that is now in the later, stock 2.4 kernels), I still can't trust it as much as the prior, official SGI XFS releases that had SGI's blessing on Red Hat Linux 7.x and Red Hat Linux 9.

3. LVM/MD Usage

I limit my use of LVM to volume slicing. Let me start by saying that I'm a huge fan of volume management. I use both LVM and LVM2 for flexible, on-line additions/modifications of logical volumes. In a nutshell, I largely use it to slice my disks with more flexibility -- reserving space, creating new volumes as necessary and the occasional expansion (although I typically try to stick to new mounts/symlinks).

But with that said, let it be known that I don't trust LVM and especially not LVM2 with snapshots, more complex resizing and definitely not any RAID operations. I do not trust DeviceMapper (DM) with either LVM2 or EVMS right now. Why? All I keep reading about is race condition after race condition after race condition. And in each case, it's not limited to XFS.

When it comes to MD, I really avoid it. I always have. I've seen a lot of people talk about how software RAID is better, faster, etc... I've seen people state that it allows them to use different disk controllers and other hardware, and not be tied to a vendor. They also claim it's more flexible and gives them more options. While I believe they are sincere, I can quickly and easily point out they are not comparing software RAID to solid, proven hardware RAID products from select vendors. I've just had a different set of hardware RAID experiences.

First off, I've limited myself to only 3Ware and select LSI Logic (including former Mylex) products over the last 5 years. 3Ware uses an ASIC-driven "storage switch" and I have only deployed LSI Logic (and former Mylex) products that are XScale (which is based on StrongARM). These are very, very high performing -- able to move a lot of data with not only little CPU overhead, but more importantly, without the extensive use and duplication of data streams through the CPU-memory interconnect. I.e., it's not the XORs that get you, but the duplicated data streams tying up the interconnect that data services could be using. It's the same reason why hardware switches/routers are better networking equipment than PCs -- these "storage switch - I/O processors" are the same. Their on-board RAID intelligence is self-contained, meaning their drivers are simple, GPL block drivers. Even Intel is moving to put its XScale I/O Processors (IOP) on Xeon mainboards, possibly in the I/O Controller Hub (ICH) directly -- to off-load these unnecessary operations for today's network/storage (RAID, layer 2/3/4 frames/packets/transports, iSCSI overhead, etc...) from the CPU-memory interconnect, which is not designed for them (and only unnecessarily duplicates data streams, taking time away from actual data processing).
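
To put rough numbers on the duplicated-data-stream argument, here is a simplified Python model of how many bytes cross the host CPU-memory interconnect for a single logical write. It is a sketch under my own assumptions, not a measurement: the function names are made up, software RAID-1 is assumed to send one full copy per mirror, a software RAID-5 small write is assumed to be a read-modify-write (read old data and old parity, write new data and new parity), and a hardware controller is assumed to receive the data exactly once. Caching and full-stripe writes are ignored.

```python
def software_raid1_bytes(write_bytes, mirrors=2):
    """The host sends one full copy of the data to each mirror."""
    return write_bytes * mirrors


def software_raid5_small_write_bytes(write_bytes):
    """Read-modify-write: read old data + old parity, write new data + new parity."""
    reads = 2 * write_bytes
    writes = 2 * write_bytes
    return reads + writes


def hardware_raid_bytes(write_bytes):
    """The host hands the data to the controller once; the rest stays on the card."""
    return write_bytes


block = 64 * 1024  # an arbitrary 64 KiB write
for label, crossed in [
    ("software RAID-1 (2 mirrors)", software_raid1_bytes(block)),
    ("software RAID-5 small write", software_raid5_small_write_bytes(block)),
    ("hardware RAID (any level)", hardware_raid_bytes(block)),
]:
    print(f"{label:30s} {crossed // 1024:4d} KiB across the host interconnect")
```

The XOR itself never shows up in this model, which is the point: what differs between the cases is how many times the same data is dragged through the host.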

Secondly, I've also had excellent "forward product" volume compatibility -- especially with 3Ware across 3+ generations over 5 years, with full support moving from older to newer, far, far better and longer than MD (let alone LVM/LVM2). And many people have never seen 3Ware's 3DM/3DM2 tools for administration and monitoring; they are much easier to deploy and have saved my butt in several cases. LSI's tools are getting better too. So it is this abstraction of RAID into hardware that removes the multiple layers that often cause the "race conditions" between LVM-MD and other kernel-level operations. This is not just an issue for XFS, not just an issue for Ext3 and Linux in general, but many other OSes as well. Which is why I have been deploying XFS for a long time, provided I "do my homework," alongside Ext3. All the issues I've heard about off-list have surrounded configurations that are an issue with Ext3 as well -- not limited to XFS at all.

4. RHEL 64-bit with 4+GiB mem and 4+TB disk calls for XFS

I'm starting to see the potential for some system integration projects that will involve data volumes of 4+TB. In all my recent Opteron 2xx/8xx integration projects, I have put in no less than 1GiB DIMMs per DDR channel, which means a minimum of 4GiB for Opteron 2xx (4 DDR channels) at a premium of only $100-150 over 2GiB. Opteron is the commodity 2-4 way server solution for just about everything now. And while I know the 4+TB data volume on a file server is not a staple for Red Hat, which is catering to grid computing clusters, web servers or possibly Oracle SQL databases using "raw" slices (instead of filesystems), they are still the "flagship" carrier of any distribution when it comes to file services with NFS (or NFS+SMB) in my book.

From all I've read, x86-64 in PAE mode using 52-bit register (48-bit "Long Mode") uses 4K pages, although I do note 2M and 4M pages as well. Again, I'll agree with Red Hat that 4K stacks are probably the correct move for x86/x86-64 in the VM. All the arguments I've seen that attempt to discredit it not only disagree with what I've seen on Red Hat's plate, but also ignore the actual work Red Hat has put forth on 4K stacks (for the future benefit of all). So I won't even touch that issue. My argument is that Red Hat needs to add XFS to its plate for RHEL 5, including ensuring reliability on 4K stack kernels.

I don't see how Red Hat can offer a solution that scales to these data volume needs if it continues to offer Ext3 -- let alone the continued issues of lack of features. It's almost like Red Hat is two-faced when it discredits (appropriately) ReiserFS and JFS for lack of both standard kernel interfaces and user-space support, then turns around and not only acts like, but many Red Hat developers flat out state that XFS does not offer anything that Ext3 does not. Beyond just the big scalability difference, I don't know how Red Hat can push SELinux and other filesystem extended attributes (EAs) when they don't offer a way to back it up on Ext3 -- while XFS does! And that's before we even touch the fact that XFS can do _live_ operations for dumping, copying, defragmentation (file reorganizing), etc...

Red Hat can choose to ignore us system integrators and lose a lot of business. In fact, I'm really getting to the point I'm half-way serious about getting some investors to build a new enterprise distribution and offer Service Level Agreements (SLAs). The distribution would always be based on a fork of the 2nd or 3rd Fedora Core release -- as I believe very, very strongly in the 1-2-3 x 6-month (although it's turning more into the 1-2 x 9-month) release model that Red Hat has followed over 15 releases since Red Hat Linux 4.0 that results in the "best balance of feature adoption v. stability" by the 3rd release. I typically do agree with Red Hat's design decisions at the core -- but not the end-focus of Red Hat Enterprise Linux as of late for companies more than willing to pay $3,000 for Advanced Server.

In fact, I'm 100% in agreement with Sun's technical analysis when they say that Red Hat is not addressing the storage/filesystem aspects (among other things) -- especially the layers upon layers of LVM, LVM2-DM, MD, etc... God knows SuSE is not by supporting ReiserFS (which has always made SuSE a non-consideration for traditional UNIX shops with large data warehousing, NFS services, etc...) because only XFS can offer the same features and compatibility that you'd get out of a traditional UNIX platform (before you disagree, please read up on what ReiserFS has issues with -- traditional UNIX interface compatibility, off-line tools/support, etc... are very important). So if I had to implement a commodity 2 or 4-way Opteron 2xx/8xx solution today with multi-TB volumes, I would go Solaris 10, not RHEL 4. If Red Hat decides to put forth the effort on XFS for RHEL 5, then I would most likely change that recommendation (and very much want to do so).

2005-08-27

Intel's Continued Marketing Evolution

As an engineer, I had real hope for Intel. I had long theorized that Intel's project codenamed "Yamhill" to bring the AMD x86-64 instruction set to the Pentium Pro (i686) series of processors was actually a 2-part endeavor. I had believed the 2nd more complete effort would result in a new series of innovative Intel products that would challenge AMD to match Intel in intertwined multi-core, cross-core scheduling (true threading) as well as PAE36/52 virtualization atop of a new, true 64-bit architecture.

Oh how could I be so assuming and so DEAD WRONG.

Intel is a company that stopped designing x86 in 1994 and has A) spent the last 12 years marketing an aging set of designs while it B) slapped on lossy math pipelines, C) extended stages with clock marketing and D) only maintained marketshare thanks to the combination of 1) Tier-1 distribution control and 2) the raw dollars it can put into its leading-edge fabs resulting in a 9-12 month packaging technology lead over AMD. And as Intel Developer Forum (IDF) 2005 draws to a close, make no mistake, INTEL IS NOT CHANGING.

Intel has no innovative designs or strategies left for hardware, other than packaging -- which Centrino has proved to be very, very profitable, even if not innovative at all -- and could only be considered marginally evolutionary. Everything is going to be the same as it has always been for the last 12 years for Intel -- reuse what it can, extend what it must, leverage software hacks and software-based solutions, and never -- ABSOLUTELY NEVER -- redesign its aging x86 architecture.

To summarize, I will define this as "Intel's 5-step program" for the next 5+ years ...


Step 1: No new x86 = x86 "Refit Attempt #2"

The last, full x86 design by Intel was 1994, the Pentium Pro (i686). All current Pentium products hold lineage to this design, which itself was largely a set of "lessons learned" from the massive set of flaws in the 1992 Pentium (i586), the first superscalar x86 from Intel (although not quite the first superscalar x86 -- NexGen, now AMD, actually holds that title with the Nx586). By 1997, Intel believed the future was Explicitly Parallel Instruction Computing (EPIC) and Branch Predication, replacing the x86 architecture, traditional out-of-order execution, register renaming and branch prediction logic with compile-time optimizations.

By 2000, Intel realized it had made a colossal mistake, and compile-time optimizations for EPIC/Predication could not replace the real need for traditional out-of-order execution, register renaming and, most of all, branch prediction. Itanium flopped and Itanium2 offered little in the way of competition compared to its now 6 year-old i686 designs in the then current Pentium 3. A quick, 18-month refit known as the NetBurst architecture resulted in the Pentium 4, using largely long, staggered stages in the pipes for much higher clock speeds, as well as the addition of more extensions and "lossy math" pipelines dedicated to them. The result was a power-hungry architecture, 50% slower MHz for MHz than the P3, that benefited from a few interconnect tweaks.

Intel now regrets that decision, and is now moving back to the last, true i686 design in the Pentium 3. But instead of designing a new architecture, it is just respinning the existing i686/P3 design -- and the result is little different than the Socket-479 Pentium M. There are a few improvements in the staging (hovering around 14 stages on average, little changed from the P3), the ability to issue work to 4 pipes simultaneously, and newer interconnect technologies for DDR2 memory. Again, only an 18-24 month refit, instead of a full, true 36-48 month redesign.

Most notably absent in the new architecture announcements is HyperThreading. As I have repeated myself time and time again, HyperThreading is a hack that is only applicable to the Pentium 4. With pipes as long as 40 stages, the Pentium 4 spends a lot of time doing absolutely nothing. HyperThreading is a simple hardware hack that lets the OS schedule processes on the CPU as if it were two processors in an attempt to make use of those unused stages, as well as to mitigate the ultimate of stalls in the P4 -- the branch mispredict -- by limiting what would otherwise be a complete flush of the P4 to only those stages of the thread where the branch mispredict occurred. It works well for P4, with the added context switching only reducing performance by 5% or less, and often increasing performance by more than 20% in many cases.
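
For a feel of why pipe length matters so much here, below is a textbook first-order model in Python of what branch mispredicts cost. It is a sketch, not a benchmark: the branch fraction and mispredict rate are values I made up for illustration, and equating the flush penalty with the full pipeline depth is a simplification that overstates real hardware. The 14 and 40 stage figures are the ones used in this post.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty_cycles):
    """First-order model: every mispredicted branch costs one full flush."""
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty_cycles


def stall_share(base_cpi, branch_fraction, mispredict_rate, flush_penalty_cycles):
    """Fraction of all cycles lost to mispredict flushes."""
    total = effective_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty_cycles)
    return (total - base_cpi) / total


# Illustrative (assumed) workload: 20% branches, 5% of them mispredicted.
BRANCH_FRACTION = 0.20
MISPREDICT_RATE = 0.05

for stages in (14, 40):
    cpi = effective_cpi(1.0, BRANCH_FRACTION, MISPREDICT_RATE, stages)
    idle = stall_share(1.0, BRANCH_FRACTION, MISPREDICT_RATE, stages)
    print(f"{stages:2d} stages: CPI {cpi:.2f}, {idle:.0%} of cycles lost to flushes")
```

Those lost cycles are the idle slots a second hardware thread can fill, which is why the trick pays off on a very deep pipe and has much less to work with on a short one.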

By going back to a tighter, more efficient design with only 14 stages, new out-of-order optimizations and other capabilities, HyperThreading is totally inapplicable to Intel's new processor. When Intel uses the term "Multithreading" in the future, they will be using the term the same as AMD -- threading multiple strings of instructions over multiple cores, not the same core. The concept of doing threading on the same core was designed explicitly for, and dies very much with, the grossly inefficient P4 architecture.

Step 2: Tier-1 commodity volume = Integration, not innovation

If it has not become obvious by now, if you are anyone but a Tier-1 original equipment manufacturer (OEM) that does nothing but PCs, Intel is not a partner you want. Intel has shifted more and more focus to high volume, lock-stock'n barrel system designs of little difference. Now more than ever there is virtually no difference between a Dell, Gateway or other major Tier-1 PC product -- they are Intel designed, Intel integrated, Intel specified and everything short of shipped from Intel itself boxes.

This is an excellent strategy for Intel, more and more integration at a constant, guaranteed cost and, more importantly, guaranteed profit margin per unit. Tier-1 OEMs don't incur any R&D costs, and reap the direct margins as well, doing little more than marketing and service -- and even then Intel has R&D money to throw at them as well. Since this is where 80% of Americans get their PCs, it's an avenue that is not likely to change anytime soon. So there is little incentive for Intel to open up their products to 2nd or 3rd party designers -- they want to sell one integrated product for all.

Not surprisingly, Intel's new commodity products will start to integrate the entire memory controller hub (MCH) and graphics processing unit (GPU) into a single package -- in essence, the northbridge on-CPU. This is different than AMD's approach in the Athlon 64/Opteron (see the next step), and more akin to what Cyrix / National Semiconductor did with the Geode, although that was aimed more at embedded (and which AMD has licensed), or even non-x86 processors like the Sun UltraSPARC "i" (e.g., IIi, IIIi, etc...) products that integrate the memory and I/O control into a single, uniprocessor-only design.

So there will soon be two types of Intel desktop processors. One with everything integrated for desktops and general consumer usage, and another for more enthusiast or workstation users.

If I may be so bold, it would not surprise me if the sheer volume of the former quickly outstrips the volume of the latter. Especially since the volume of the more enthusiast/workstation user is going to AMD and its "2nd/3rd party inclusion" of major designers of the US, Europe and Taiwan more and more. Intel might not mind one bit giving up this segment because it will almost always be a sub-25% marketshare, whereas AMD cannot break into the 75%+ marketshare of heavy integration.

Which may leave a very permanent mark on the industry with Intel continuing to dominate in volume and at the Tier-1, while AMD caters to more flexible designs and performance. In fact, we're already starting to see this where AMD no longer sells on cost, but on quality (to myself included), with pricing matching no longer a rule for AMD. Which brings me to the server.

Step 3: GTL here to stay = bridging, bridging and more bridging

Probably the most pitiful aspect of Intel's strategy is the utter failure to introduce a commodity systems interconnect, and its continued reliance on the late '80s designed Gunning Transceiver Logic (GTL) of legacy IBM PC/AT signals. The Advanced GTL Plus (AGTL+) "bus" is just that, a bunch of wires that share all the same controls -- again, tied largely to hardwired IBM PC/AT signals -- with no more than two components talking at the same time. Anytime Intel has to add another component to the bus, it bridges that component, so it now ties up the bus if it needs to talk to anything else. Whether Pentium or Itanium, this is Intel's approach -- and to find otherwise is to go with a costly proprietary design from HP, IBM, SGI or another.

More recently the simple bus design through a single memory controller hub (MCH) has become a challenge for Intel with dual-core processors, requiring additional bridging. So it is not surprising to hear that Intel is now introducing MCH designs with two independent connections for two processors. At first it might seem like an innovation, but it's really just an evolution of the additional bridge logic that was required internal to the processor for dual-core. Plus, without a full redesign of the processor with a real systems interconnect, there is only so much bridging that could be done inside of the CPU before diminishing returns resulted.

Make no mistake, the two-processor MCH AGTL+ is not even at the same redesign level that AMD's original adoption of the up to 16-port Alpha EV6 "crossbar switch" was. In other words, the performance of two processors on this new MCH has more to do with signaling improvements than with an actual crossbar approach like that of the 32-bit Athlon MP or Alpha 21264, even though those are quite aged now. And Intel is still very, very far from anything like the partial mesh used in Athlon 64 / Opteron, with glueless, non-uniform memory architecture (NUMA) to each CPU, as well as tunneled HyperTransport for multiple inter-CPU and inter-I/O access.

Other than proprietary designs like the few from HP, Intel and SGI, AMD has won the commodity 2-8 way (currently 4-16 core) battle, especially for low-cost, Infiniband-connected supercomputing clusters where AMD has a price and performance lead of over 2x.

Step 4: XScale to the rescue = I/O Processors in the chipset

About the only "cool thing" I have noted in Intel's server designs is something I have argued for a long while. It will be interesting if AMD comes up with something similar directly in a HyperTransport I/O tunnel, but it's definitely an area where Intel has actually done well. I noted that in Intel's commodity dual-processor designs it is now starting to embed an IOP332 (PCI-X/PCIe 500-1000+MHz XScale I/O Processor) on the mainboard, possibly into the I/O Controller Hub (ICH) aka "southbridge" itself. So, what does this afford Intel?

In a nutshell, instead of network/storage controllers either being "dumb" and relying on the host CPU/memory for software-based processing (i.e., lots of redundant, inefficient data streams), or being "expensive" ($500+) with their own intelligence on-board, the chipset can offer some direct intelligence of its own. This intelligence works at the chipset, without bothering the CPU, using main memory for buffering. So instead of pushing all data from the disks directly up to the CPU and affecting other service loads just to do a RAID-1 mirror or RAID-3/4/5/6 XOR operation, or when dealing with network layer 2/3/4 frame/packet/transport resolution for general network services, possibly iSCSI, etc..., the chipset can do this directly.

It may seem like a small, insignificant addition to the chipset, but the 500-1000+MHz superscalar microcontroller resource in an embedded XScale offers a lot of off-loading of traditional I/O services that can process such network/storage data streams directly and save a good 2-5x as much load on and/or duplication in the host CPU interconnect, which was _never_ built for such operations (but processing data). I can personally see a lot of drivers for many OSes that can now start taking advantage of cheaper network/storage hardware but giving the same performance and reduced CPU load of products that cost 3-5x as much.
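
For anyone who has not looked at what the parity work actually is, here is a minimal Python sketch of single-parity XOR as used in RAID-3/4/5: the parity block is the byte-wise XOR of the data blocks in a stripe, and any one lost block can be rebuilt by XORing the survivors with the parity. The helper name xor_blocks and the tiny 4-byte "blocks" are mine, purely for illustration; a real controller does this over full stripes in hardware or firmware.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must be the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


data = [b"AAAA", b"BBBB", b"CCCC"]  # three data blocks in one stripe
parity = xor_blocks(data)           # what gets written to the parity disk

# Lose the middle block, then rebuild it from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```

The arithmetic is trivial; the cost on a host-based setup is that every one of those blocks has to be pulled through the CPU-memory interconnect to do it, which is exactly what an on-chipset I/O processor avoids.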

It might be the one thing that could cause me to reconsider an Intel server purchase over an AMD one -- although only if Intel's new EM64T processors have an I/O MMU so it removes the need to use I/O performance-killing "bounce buffers" with more than 1-4GiB of memory.

Step 5: Virtualization is for software = software-based hardware products

And, alas, Intel's failure is complete. As much as Intel loves to bash Sun as proprietary -- no matter how ironic that statement is with SPARC being a documented IEEE standard available for license under "fair and non-discriminatory" terms, and Itanium offering no such option -- Intel seems to be following Sun's playbook. Instead of coming out with a real, hardware-based virtualization option (one I seriously hope AMD does in the next 2 years -- and I suspect they will), Intel is going completely to software-based virtualization for loads over a network. If this sounds exactly like a replay of Sun's recent virtualization moves, it is.

The larger question is if Intel is going to be a virtualization enabler for Microsoft Virtual Server (fka Virtual PC), EMC VMWare ESX/GSX, or possibly a competitor in the long run? I mean, if Intel is offering NO hardware virtualization features, and doing everything in software, at what point is Intel needed -- other than maybe for access to design/interface information by Microsoft, VMWare, etc...? It's clear that without a serious x86 redesign/rethink, all Intel can offer is multi-threading over multi-core in the future -- no different than AMD (except that AMD might do it in hardware, which will be interesting if they do ;-).

My past assumptions of what Intel might be up to have now fallen completely dead. Intel's plans are to continue to reuse what they designed over a decade ago, leverage their 17+ fabs (~6 leading-edge) compared to AMD's 4 (and only 1 leading-edge) to maintain its 9-12 month packaging technology lead, and build a new Pentium world of software, not hardware options.

Because hardware innovation is dead at Intel. It was dead long ago and with the failure of IA-64 (which Digital Semiconductor predicted back in the mid-'90s), it is going to be dead in the future at Intel too.

2005-08-24

New, Intelligent PCIe Storage Controllers ...

Until recently, I knew of only one intelligent PCI-Express (PCIe) storage controller.

LSI Logic MegaRAID 320-2E

  • PCIe x8 slot
  • Intel IOP332 X-Scale (internal PCI-X to PCIe x8 bridge)
  • Two (2) Ultra320 SCSI (U320) Channels
  • 128+MB of SDRAM buffer
  • Street Cost: ~$600
  • Linux Support: Yes, Open Source (GPL?)

Now it appears that more and more of the Intel IOP33x driven solutions are appearing.

Promise SuperTrak EX8350
  • PCIe x4 slot
  • Intel IOP333 X-Scale (internal PCI-X to PCIe bridge)
  • Eight (8) SerialATA (SATA) Channels (true SATA-IO 3GHz capable???)
  • 128+MB of SDRAM buffer
  • Street Cost: sub-$400
  • Linux Support: Yes, Open Source (GPL?)

Intel RAID Controller SRCU42E
  • PCIe x8 slot
  • Intel IOP332 X-Scale (internal PCI-X to PCIe bridge)
  • Two (2) U320 Channels
  • 128+MB of SDRAM buffer
  • Street Cost: over-$650
  • Linux Support: Yes (Open Source?)

Tekram Areca ARC-12x0 series (no vendor link? link down?)
  • PCIe x8 (and PCI-X versions too, be wary)
  • Intel IOP332 X-Scale (internal PCI-X to PCIe bridge)
  • Four (4=1210) to Sixteen (16=1260) SATA Channels (true SATA-IO 3GHz capable???)
  • 128+MB of SDRAM (256MB is standard on 16 channel)
  • Street Cost: over $500 for 1210 to over $1200 for 1260
  • Linux Support: Yes (Open Source?)

I'm still getting more information, but it seems all of these products are Intel IOP33x based. I can safely assume at least the block drivers are similar (just like the old i960/IOP30x used Intelligent I/O, I2O), and given the intelligence on-board, Linux support is probably GPL (although driver-to-IOP33x reliability might vary at this stage). How much there is in user-space tools, dynamic resizing, etc... for Linux might be a different story though. But it's still a good start, especially the sub-$400 Promise solution for good 8-channel SATA in a PCIe x4 design, a slot that even some consumer boards are now shipping with.

I'm still waiting on a complete, 1-chip solution using the Broadcom BCM8603 to bring end-device costs down to sub-$300. It offers Serial Attached SCSI (SAS) and Serial ATA (SATA), as well as native PCI-X and PCIe arbitration (and can optionally bridge them for embedded designs). There still seem to be no lower-cost 2 and 4 channel solutions in the PCIe world equivalent to the costs of the $125 3Ware Escalade 8006-2 and $275 3Ware Escalade 8506-4LP for 64-bit PCI/PCI-X.

Serial Storage Is the Future ...

The storage world is going serial. One thing I'm regularly having trouble with is pointing out to some "die-hards" that Serial Attached SCSI (SAS) IS SCSI-2! The only difference is that the interface is different, but the drive, protocol, etc... are the same!

- ASICs are the name of the game

SCSI is typically implemented with a host adapter. This host adapter handles queuing, targeting, transfer setup and other details of the SCSI bus it controls. In legacy SCSI, this is one or more parallel SCSI devices that share the same bus, possibly with two or more busses (never more than 3 or 4) per host adapter. In a typical, high throughput SCSI implementation, you limit the number of drives to only 3 or 4 maximum per bus so the bus isn't saturated with contending transfers (even though wide SCSI allows up to 15 devices, plus the host adapter).

Serial changes the game. Instead of a single bus shared by all devices, serial busses are typically implemented with only 2 points per disc -- one for the host adapter, one for the end device. This is because serial only requires a few lines -- typically only 2 or 4 lines, plus ground. So a single, typical Application Specific Integrated Circuit (ASIC) can easily handle 4, 8 or even 16 devices, in place of a traditional parallel device where 25, 40, 50 or 68 conductors are used. This concept of an ASIC "switch" for I/O is not unheard of in the networking space, or even in some storage devices (e.g., 3Ware's Escalade ASIC+SRAM "Storage Switch" design for its ATA RAID controllers).

So now instead of controlling a parallel bus that everyone contends for, I can directly drive devices independently from my host adapter.
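
The difference between a contended bus and independently driven devices is easy to sketch. The Python below is a deliberately crude model under my own assumptions: the function names are made up, each drive is assumed to stream at a steady 60MBps (a hypothetical figure inside the range of disk transfer rates discussed in this post), the shared bus is a 320MBps U320 bus, and each serial link is a 300MBps channel. It ignores arbitration overhead and whatever the host slot itself can carry.

```python
def shared_bus_throughput(bus_mb_per_s, drive_rates_mb_per_s):
    """All drives contend for one bus, so the bus caps the total."""
    return min(bus_mb_per_s, sum(drive_rates_mb_per_s))


def point_to_point_throughput(link_mb_per_s, drive_rates_mb_per_s):
    """Each drive has its own link, so only the per-link cap applies."""
    return sum(min(link_mb_per_s, rate) for rate in drive_rates_mb_per_s)


DRIVE_RATE = 60  # assumed sustained MBps per drive, for illustration only

for count in (4, 8, 15):
    drives = [DRIVE_RATE] * count
    shared = shared_bus_throughput(320, drives)
    serial = point_to_point_throughput(300, drives)
    print(f"{count:2d} drives: shared bus {shared:4d} MBps, point-to-point {serial:4d} MBps")
```

With only a handful of drives the two come out the same, which matches the old rule of keeping 3 or 4 drives per parallel bus; past that point the shared bus flat-lines while the point-to-point total keeps climbing.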

- Keeping the advantages of SCSI

Serial Attached SCSI (SAS) is SCSI-2 protocol. SCSI devices are easily made SAS without much difficulty. There is still sector remapping**, host adapter queuing, etc... These are not commodity options in other interfaces, even if some lower-cost SCSI drives use the same commodity disks (or some other interfaces use enterprise disks -- e.g., Western Digital's Raptor series are SATA versions of Hitachi's 36, 73 and 146GB SCSI/FibreChannel products).

[ **NOTE: Many intelligent ATA RAID controllers, like 3Ware, reserve parts of the ATA disk for sector remapping when they use a block volume -- i.e. RAID-0, RAID-10, RAID-5. RAID-1 mirroring on 3Ware is the only time it uses the "raw" (non-volume) disk. ]

One thing I regularly point out as an advantage of SCSI, and now SAS, over Serial ATA (SATA) is that command queuing is done at the host adapter level. In ATA with Native Command Queuing (NCQ), the queuing is still done at the per Integrated Drive Electronics (IDE) level. That means a SCSI host adapter can queue up operations for any drives it controls, whereas NCQ just means the host OS can queue up operations for each individual drive separately. And even then NCQ is still not well arbitrated by ATA controllers between the host OS and end IDE.

Some might point to the Advanced Host Controller Interface (AHCI) of ATA. Understand AHCI is a _software_ organization standard so you can target up to 32 ATA devices as a single, functional unit. It is not hardware, but done on the host system -- i.e., software. Queuing, c/o NCQ, is still done on a per end-IDE device basis, because ATA is dumb. It's just a bus arbitrator between the host system and end IDE device -- a few switches, a few timing registers, etc...

- And then adding some

SAS goes beyond what SCSI can do. Instead of being limited to 12m of an entire Low Voltage Differential (LVD) SCSI bus for all devices, with minimum spacing between devices, I can now have up to 8m per SAS twisted pair cable.

Furthermore, because it does connections on a point-to-point basis, I can plug in different devices that may not operate the same as all the other disks on the host adapter. So it is also backward compatible with 1.5GHz/150MBps SATA-I/II** as well as 3GHz/300MBps SATA-IO**. Of course, there are length limitations to use SATA (1m typical spec), but it's still an option. So nearly all SAS host adapters can do SATA for free.

[ **NOTE: SATA-II is now a _marketing_ term much like USB 2.0 is. You have to have a SATA-IO drive for 300MBps, just like you have to have an EHCI controller for 480Mbps/60MBps with USB 2.0. I haven't checked to see if SATA-IO requires a twisted pair cable, but the original SATA committee expected 3 and 6GHz signaling to require it. This might be why they have created the SATA-IO spec, whereas vendors are claiming and shipping SATA-II with only a 1.5GHz capable cable/EMF and no considerations in the logic for using a twisted pair cable. ]

Third, RAID-0 (striping), 1 (mirroring) and a simultaneous combination (sometimes called RAID-10 or RAID-1e[nhanced]) only cost a little extra overhead in logic. The host adapter is already a "storage switch" of 4, 8 or even 16 channels, so it can mirror, stripe or otherwise distribute data easily between channels. Most first generation SAS devices offer these integrated, transparent "hardware RAID" functions, possibly with software RAID-0 or even RAID-4/5/6 across multiple cards.

Lastly, trunking SAS channels is an option. This is the new "killer app" of SAS -- trunking 4 or even 8 lines into 1.2 or 2.4GBps to the next hub. Although the distance is not nearly as far as FibreChannel or iSCSI (SCSI over IP), it is far cheaper than FibreChannel and far less overhead than iSCSI, while being as fast as FibreChannel or even faster than most iSCSI. In a nutshell, SAS is a great solution for multi-targetable storage in the same data closet / server room, without having to shell out for FibreChannel or deal with the inefficiency/overhead of iSCSI.
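The trunk figures above are just per-line rate times line count. Here is a minimal sketch of that arithmetic, assuming every line in the trunk runs at the 300MBps rate quoted earlier for 3GHz signaling and ignoring any protocol overhead. The function name is made up for illustration.

```python
# Assumed per-line rate in MBps (the 3GHz/300MBps figure quoted earlier).
LINE_RATE_MBPS = 300


def trunk_bandwidth_gbps(lines, line_rate_mbps=LINE_RATE_MBPS):
    """Aggregate bandwidth of a trunk of identical lines, in GBps (1GBps = 1000MBps)."""
    return lines * line_rate_mbps / 1000


# The two trunk widths mentioned in the text.
for lines in (4, 8):
    print(f"{lines} lines -> {trunk_bandwidth_gbps(lines):.1f}GBps")
```

This is raw signaling math only; what a given hub or host adapter actually sustains will be lower.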

- Which is faster? SATA (ATA) or SAS (SCSI)?

Well, in a nutshell, the protocol is not the root issue. Interface speed, which has *0* to do with data transfer rate (DTR) of the disk itself, is the main consideration. Most commodity capacities (160, 200, 250, 300, 320, 400, 500GB) can't break 80MBps yet, and most enterprise capacities (36, 73, 146GB) are about 50MBps -- individually. Most vendor specification sheets list the maximum internal DTR in Mbps (divide by 8 for MBps).
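Since spec sheets quote megabits and everything else here is in megabytes, the conversion is worth spelling out once. A tiny sketch follows; the 640Mbps input is a made-up spec-sheet number for illustration, not taken from any vendor's sheet.

```python
def mbps_to_mbytes_per_sec(megabits_per_second):
    """Convert a spec-sheet rate in Mbps (megabits) to MBps (megabytes)."""
    return megabits_per_second / 8


# Hypothetical spec-sheet figure: a maximum internal DTR listed as 640Mbps.
listed_mbps = 640
print(f"{listed_mbps}Mbps = {mbps_to_mbytes_per_sec(listed_mbps):.0f}MBps")
```

So a drive listed at 640Mbps is right at the 80MBps ceiling mentioned above for commodity capacities.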

Yes, this means that 10,000rpm and even 15,000rpm "enterprise" spindle disks typically have _lower_ DTRs than more "commodity" 7,200rpm disks because they have greatly reduced capacities. Their spindles cannot overcome the fact that more data is swept out per rotation by the higher density commodity disks. Now this is, of course, assuming a continuous, linear transfer. The more seeks, the more quickly a higher spindle speed can and does make a difference (even on single user workstations).

Although DTR _does_ become a consideration when your bus is _parallel_ with _multiple_ devices. I.e., a Low Voltage Differential (LVD) [parallel] SCSI bus like Ultra2/80 (80MBps), Ultra3/160 (160MBps) and Ultra4/320 (320MBps) has to share all that DTR across all the devices on a channel. Hence why SAS is looking really good these days as even Ultra5/640 (640MBps) doesn't solve the root problem! Serial is the future.
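To put rough numbers on the shared-bus problem, here is a small sketch that counts how many drives it takes before their combined sustained DTR meets or exceeds the bus rate. It assumes the "about 50MBps" enterprise figure from above for every drive, all drives streaming at once, and no bus overhead; the function name is mine.

```python
import math

# Bus rates in MBps, as listed in the text.
BUSES = {"Ultra2/80": 80, "Ultra3/160": 160, "Ultra4/320": 320}

# Assumed sustained DTR per drive in MBps (the "about 50MBps" figure above).
DRIVE_DTR_MBPS = 50


def drives_to_saturate(bus_mbps, drive_mbps=DRIVE_DTR_MBPS):
    """Smallest drive count whose combined sustained DTR meets or exceeds the bus rate."""
    return math.ceil(bus_mbps / drive_mbps)


for name, rate in BUSES.items():
    print(f"{name}: saturated by {drives_to_saturate(rate)} drives streaming at once")
```

Under those assumptions even Ultra4/320 is full at 7 drives, while a point-to-point serial link never has to split its rate between drives in the first place.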

But looking back at interface considerations ...

ATA (including SATA) is a _dumb_ bus arbitrator between PCI[-X|e] and the Integrated Drive Electronics (IDE). ATA is dead _dumb_ and other than some registers for bus timing/configuration, it's the system memory/CPU talking to the drive. ATA provides dead _dumb_ block I/O without any blocking. That's great for 1 drive at 1 operation, such as typical desktop usage -- especially in the latest densities where DTR is absolute, and seek is of minimal consideration.

SCSI has its own _hardware_ host adapter with intelligent management and queuing, plus a full command set. SCSI host adapters are already half-way to a full, intelligent hardware RAID design. The second you start queuing a lot of operations, SCSI wins. ATA can't service requests at all; it relies on the system CPU/OS. This is especially true with higher spindle rates, which are typically scarce in the ATA world (with only a few exceptions, like the WD Raptor SATA version of the Hitachi 10k SCSI/FC/SAS series, but no 15k).

Note that the _dumb_ nature of ATA-IDE is why ATAPI (ATA Packet Interface) was required for non-simple block transfers like most optical drives require. But even then, ATAPI is done in software, between the system memory/CPU and the end-drive. It's still not intelligent, it just adds some commands for the end-device at the host system/OS level.

Again, ATA with NCQ may now add queuing, but it only does for _individual_ drives. That means it's great for a desktop or even a workstation with 1 drive, but once you start adding drives, then NCQ loses its benefits. SCSI host adapters queue for _all_ drives, not just 1, and they can better balance I/O requests, especially in a RAID configuration (although an intelligent ATA RAID card can do the same -- see RAID levels below).

Straight Just a Bunch of Disks (JBoD) really depends on the application, and ATA is typically all you need today. Things change once you start talking about an intelligent ATA RAID controller. Now you have ATA with intelligence, queuing, SRAM (non-blocking) or DRAM (buffering).

- What's good for what RAID levels?

For RAID-0, 1 and 10 (simultaneous RAID-0 and 1 in hardware), ATA with a non-blocking ASIC and SRAM is most ideal. That's 3Ware's legacy Escalade design (pre-9000 series), using the direct I/O of ATA. You have non-blocking end-to-end -- from the ASIC+SRAM to the storage interface, especially with today's commodity disk densities.

As I noted above, some of the new generation of SAS host adapters come with RAID-0, 1 and 10 "for free." They are a consideration as well because they too are doing "non-blocking I/O" for their channels (which can be SATA as well as SAS). Especially with today's commodity disk densities.

For RAID-3, a non-blocking ASIC and SRAM, plus a little DRAM for extra XOR buffer, is also ideal -- especially when the width of the bus matches the data channel (not including parity). That's the NetCell SR3x00 (32-bit -- 2 drive + parity) and SR5x00 (64-bit -- 4 drive + parity). ATA is still ideal because it's direct I/O, and RAID-3 is not a blocked I/O (unlike RAID-0, 4 and 5).

For RAID-4 or RAID-5, you're now going to blocks of (typically) 32KB striped, with dedicated parity (RAID-4) or striped parity (RAID-5). Now you want a microcontroller with lots of buffer (DRAM). ATA or SCSI doesn't matter -- the I/O isn't direct, so non-blocking is useless. Furthermore, SCSI can have lots of benefits with its higher spindles for response time (especially for RAID-5), let alone other features (like sector remapping standard -- although a few intelligent ATA RAID controllers reserve ATA sectors as well).
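For anyone who has not seen the parity math behind RAID-4/5, here is a toy sketch. It uses 4-byte blocks instead of 32KB ones, and the RAID-5 rotation shown is just one possible layout I picked for illustration (real controllers differ). The function names are made up.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_disk(stripe, n_disks, level):
    """Index of the disk holding parity for a given stripe number."""
    if level == 4:
        return n_disks - 1                      # dedicated parity disk
    if level == 5:
        return (n_disks - 1 - stripe) % n_disks  # parity rotates (one possible layout)
    raise ValueError("only RAID-4 and RAID-5 are sketched here")


# One stripe: three toy data blocks plus their parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Lose the second block, then rebuild it from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
print("rebuilt matches lost block:", rebuilt == data[1])

# Where parity lives for the first five stripes on a 4-disk array.
print("RAID-4:", [parity_disk(s, 4, 4) for s in range(5)])
print("RAID-5:", [parity_disk(s, 4, 5) for s in range(5)])
```

Every write to a stripe means recomputing that XOR, which is why buffer and a capable microcontroller matter more here than whether the disks are ATA or SCSI.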

In the end, the future is Serial Attached SCSI (SAS). Almost _all_ new intelligent RAID controllers being designed are SAS because they also do SATA. SAS is basically an intelligent host, point-to-point SATA with SCSI-2 atop. It's basically like talking about the difference between the quality of an ASIC in Ethernet hardware, only now the concentrator is the storage controller -- a storage switch.

2005-08-21

Secure Shell (SSH) Service Do's ...

With Secure Shell (SSH) probes and attacks increasing on the Internet, more and more UNIX/Linux users have experienced attempted, and even successful, compromises of their systems. For those running SSH Services, especially facing the Internet, here is my list of "Do's":

1. Disable root access
2. Enable only SSH protocol version 2
3. Enable "AllowUsers" (OpenSSH) option
4. Run on an alternate port
5. Enable only "PubkeyAuthentication" option

#1 should be a no brainer, never allow root access. Almost every script kiddie these days is testing for root, and the last thing you want is to leave root access open in SSH. If you feel you must allow root access, then do it from select IPs and, more ideally, only with "PubkeyAuthentication". In most cases, a root access requirement is more of a result of poor system/network configuration than anything. E.g., for typical configuration management, the root user of a system should "pull" operations/files from a non-user account on another system, instead of allowing a system to "push" operations/files to another using its root account.

#2 is also a no brainer, do not allow SSH protocol version 1 anymore, period. If you are relying on old clients, they are a major risk and their removal should be top priority. Use modern SSH version 2 clients.

#3 is a major risk mitigation by only allowing select users. In addition to taking away root with #1, take away all other common UNIX usernames and only allow specific, enabled users. Under OpenSSH, this is the "AllowUsers" directive, which automatically disallows anyone who is not on the list.

#4 is clearly a "security through obscurity" move, but it does fend off 99.9% of script kiddies these days. Move your SSH away from port 22.

#5 is a top recommendation, disable any other authentication except Public Key Authentication, including disabling password authentication. Most administrators do not implement this because not everyone keeps their keys on them. But with USB dongles and USB systems commonplace, it is far less difficult to assign SSH RSA/DSA keys to users than it was just a few years ago. Almost every major SSH hole targets password authentication, and SSH servers that implement only Public Key Authentication have not been susceptible to most attacks to date.
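If you want a quick way to check a box against the five "Do's", here is a rough sketch that reads sshd_config text and reports on each one. It uses OpenSSH's directive names (PermitRootLogin, Protocol, AllowUsers, Port, PubkeyAuthentication, PasswordAuthentication). The sample config, port 2222 and the users alice and bob are made up for illustration, and the checks are deliberately simplistic: they ignore Match blocks, Include files and other authentication options, and they treat a missing directive as "not done" rather than guessing at your version's defaults. Newer OpenSSH releases dropped protocol 1 altogether, so check #2 only makes sense on older servers.

```python
# Hypothetical example config; the port and usernames are placeholders.
SAMPLE_CONFIG = """\
# sshd_config fragment
Port 2222
Protocol 2
PermitRootLogin no
AllowUsers alice bob
PubkeyAuthentication yes
PasswordAuthentication no
"""


def parse_sshd_config(text):
    """Return {lowercased keyword: value}, keeping the first occurrence of each keyword."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comment lines
        parts = line.split(None, 1)
        if len(parts) == 2:
            settings.setdefault(parts[0].lower(), parts[1].strip())
    return settings


def audit(text):
    """Map each of the five Do's to True/False for the given config text."""
    s = parse_sshd_config(text)
    return {
        "1 root login disabled": s.get("permitrootlogin", "").lower() == "no",
        "2 protocol 2 only": s.get("protocol", "") == "2",
        "3 AllowUsers set": "allowusers" in s,
        "4 non-default port": s.get("port", "22") != "22",
        "5 public key only": (
            s.get("passwordauthentication", "").lower() == "no"
            and s.get("pubkeyauthentication", "yes").lower() == "yes"
        ),
    }


for check, ok in audit(SAMPLE_CONFIG).items():
    print(f"{'ok  ' if ok else 'FAIL'} {check}")
```

Treat a clean report as a starting point, not proof; the man page for your installed sshd is the authority on what each directive does.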

2005-08-18

We DO Use Our Constitution!

- The Latest Joke

If you haven't heard it, then let me cue you in. With all the Freedoms we have lost as of late in the US, many people are joking that the new Iraqi Constitution could be solved by giving them ours ... because we're not using it. While it does make for an amusing joke, especially in wake of 14 of the 16 portions of the Patriot Act becoming permanent, it's actually not very accurate.

Because our Constitution as an organization of government is about SEPARATION, NOT FREEDOM. But, ironically, it is that separation that helps keep our freedom, even if we seem to lose it at times. Because the Constitution can be Amended, as it was with the original Ten Amendments in the Bill of Rights (the part that supposedly guaranteed far more freedom), and it can change. But THE SEPARATIONS HAVE NOT CHANGED!

- Why the Separation Of Powers?

Well, it all begins with the fact that WE DIDN'T TRUST ONE ANOTHER! Seriously, I look at other countries, their governments and one thing that scares me about some is that the legislative branch selects the leaders of the executive branch! Yikes! And you think we have it bad in the US if the two parties control both the Congress and that 1600 address?! Not even close!

As I always say, regardless of what you think about the current status of "Freedom" in the US, there are two things the US has that NO OTHER COUNTRY has ever had together:
1. Strict separation of legislative and executive, plus a judicial that is appointed by one, reviewed by another
2. No military leader has ever made policy, and military leaders have always bowed to their commander-in-chief, who was a civilian

Now #2 is outside the scope of this discussion, but #1 is a key, key importance. Especially for the next part.

- The Great Compromise

Sometimes I wonder if we teach our children too much about "freedom," and not enough about "questioning authority." Oh wait, I guess that would be self-destructive for teachers, so I guess I can understand why we don't. ;-) But seriously now, governments that are set up to question themselves work best, because in reality, they really DO NOT TRUST EACH OTHER in design!

The US' "Great Compromise" was because the small states felt like they would be overruled by the larger states in representative size. At the same time, you can't have small states having an equal say when they have 1/10 the population of other, larger states. The result was our two-house Congress -- an upper-house (Senate) of states with equal representation, and a lower house (House of Representatives) of states with population-based representation.

Heck, even back then, the US government did not trust its citizens, and erected a more representative form of government. Not only were Senators sent by the state government, not its people (this was changed later, but it is still an important note when it comes to elections -- but that's another story), but the Electoral College was erected as well. The Electoral College has a purpose, to ensure that one portion of the nation cannot dominate another. E.g., today it is very easy for the urban populace in some states to dominate the more rural voters.

Which all ties back into the "Great Compromise" from over 200 years ago, the large cannot dominate the small. Pleasingly both ironic and coincidental is how the distrust we had of each other over 200 years ago, though no longer an issue, is at the heart of an institution that today still prevents some dominance of one over another. In fact, it's important that any "Great Compromise" continue to focus on "inclusion."

A great example of the FAILURE of "inclusion" was the creation of the Confederate States of America. Ignoring the revisionist history focus on "slavery" (which was important to many in the North, especially as of 1863), the primary driver and nationalistic justification for the separation of the Southern States was the feeling that they did not have equal representation. The Electoral College, from their viewpoint, had failed them.

[ SIDE NOTE: There were even a minority of Southern leaders who believed slavery should be abolished, which would remove much of the popular support from the Northern invasion. Of course, that did not take hold. ]

Now I am NOT justifying what the South did any more than I would justify Sunnis who advocate boycotts of voting, the Constitution, a new government, etc... (let alone terrorism). But if you want to drain the will from a minority of people who do not want to be part of a united nation, you include them as best as you can. And the way you do that is by putting their DISTRUST in PLAIN VIEW and working in a way to GUARANTEE that inclusion EQUALLY.

We'll come back to that in a bit ... now for some more history ...

- The Folly of Europe

Okay, now I'm going to dump on Europeans for a bit (sorry). Whether you want to talk about the former, conquered nations (like the ever more numerous Balkan states) or you want to talk about former colonies (like India, which is now modern-day India and Pakistan), one thing Europeans seem to be so good at is building smaller nations out of nationalistic groups. Instead of getting people to work together, they'd rather just separate nations into nationalistic entities -- which means those new nations will just go to war with the old nation sometime in the future (and typically near-future at that).

If the United States has stood for one thing, it is that people can live together in peace, and have radically different views. I don't need "diversity training" to tell myself that our nation survives on the fact that we argue, disagree, and generally think people in our country have stupid, idiotic and just plain wrong ideas from each other's viewpoints. I see it every day! And you know what? I LOVE IT! Why? Because it means we're still questioning ourselves, each other, everything we have built, because we know it's not perfect, and we could do better.

I'm tired of people just saying "X needs a homeland" or "Y needs their own government." Get over it, learn to accept the fact that you have different beliefs than your neighbor and tolerate his like he tolerates yours. Because if you get a homeland or nation of your own, then you will just go off on your own with your brethren as a separate nation, and collectively conclude you don't like the other people of the other nation existing anyway. So it actually INCREASES VIOLENCE, it does NOT decrease it!

That's the problem. Segregation and separation just breed ignorant nationalism. When you include different ideas by not segregating and separating people who differ, you learn tolerance, you learn to work together. You do this by erecting separation in a unified federal government, which then and quite naturally binds people of diversity over time. But it only works as long as you recognize each minority and try to accommodate -- INCLUSION is everything, but you have to be REALISTIC. Again, even our "Great Compromise" in the design of our Congress, which defines the make-up of our Electoral College today, was about A REAL COMPROMISE BASED ON DISTRUST!

- It's This Simple: Build the Iraqi Constitution on NO TRUST!

Three major peoples who don't trust each other. The Kurds even want the right to separate in eight (8) years and build their own inner-sphere of self-nationalism. I'm sure the Turks know all too well that is not a good idea. And if I'm the American President, I say "tough, you're going to learn to be Iraqis with everyone else." Especially since it's all-too-easy to predict a Turkish-Kurd war, and it's better to keep the Kurds part of Iraq than separate.

Then you have the minority Sunnis who fear what the Shiites could do as a majority. As much as some of the Sunnis might deserve it, it can't happen. But as much as the Shiite leaders might guarantee inclusion and equality, the populace might override them in pure attrition. That is not going to simply "go away" with some Constitution, it's going to take time. The goodwill in unification is not shared by all, and any Constitution drafted in the hope of such will flounder REGARDLESS OF WHAT LEADERS SAY OR DO!

So here's the deal, you build a three (3) House Legislative branch in Iraq. That way the Sunnis feel they have equal say, while the Shiites have their own, and the Kurds can feel like they are ruling themselves too. All 3 Houses must agree to pass a bill as law, just like in the US' 2 House system. No exceptions. This is inclusion built around the fact that people don't trust one another. Stupid, self-interested bills will not see law, and only good bills that all houses can agree on after compromising will become law.

- A Government Based on Differences? That Doesn't Sound Like Unification?!

Now I'm sure people are questioning, "What's the purpose of a unified Iraq if the government itself is based on separation?" It's not based on separation. It's based on unifying people on the REALITY OF DISTRUST. There is a lot of separate nationalism in Iraq, especially from the Kurds, but also between the Sunnis and Shiites. You can't ignore that and it's better to address it than to hope for some "I love you" Barney song to rip through the nation today. Even if the overwhelming majority of leaders continue to believe in unification, all it takes is a few rogue leaders to kill it.

If anything, in recent history (not even looking farther back), learn from India. It is worse to let a minority people build a new country out of distrust than to accommodate that minority in a unified nation, even if it seems separate. Over time the walls will come down, but it won't happen overnight. You have to erect your nation around the understanding that there IS DISTRUST, by guaranteeing NO ONE MAJORITY CAN OVERRULE A MINORITY! Just as the original purpose of the "Great Compromise" of the US has long been lost, it still serves today in the Electoral College, even if for a totally different purpose.

With a three (3) House system in Iraq, the Kurdish, Sunni and Shiite representatives can concern themselves with the needs of their people, but at the same time, are still unified as a single nation. They will work together when things are worth doing, and reconsider when they are not. It won't be simple, it won't be easy, but sticking with a "separation of powers" is the best move, and that separation is built on distrust. Because over time, the separation will REMOVE THE DISTRUST AS THE HOUSES WORK TOGETHER!

- What About the Religious Leaders? Make 'em Judges!

That's even easier. Common Law is just one of many things Americans stole from the British, and who says Iraqi Common Law must match Anglo-American? I think Spiritual Muslim leaders are the best Iraqis to take Iraqi law and interpret it in the tradition of Iraqi views. And for Americans who might differ, need I remind them that despite all of the arguments surrounding the Ten Commandments as of late, the 100% legal reality is that ALL MAJOR ANGLO-AMERICAN COMMON LAW DOES ORIGINATE from documents like the Ten Commandments. Hence why the Ten Commandments are not considered religious UNTIL they are displayed in a religious context (and not the historical context, with other documents).

So the Islamic faith and the rich Muslim traditions very much have a place and have a right to shape the future of Iraq. In fact I think they should, because they best represent a diversity that most Americans are not exposed to, and we cannot even think of having them start to accumulate in a document like the Iraqi Constitution. So I think the first thing the new, three (3) House Iraqi Congress should do is appoint the first Judicial body. Each House will appoint several candidates, and once the other two (2) Houses approve three (3) candidates as Judges from each House, you will have the first, nine (9) Judge Iraqi "Supreme Council." And these leaders will serve until death, or until they decide to step down.

The balance is this. Just like the US Supreme Court, the Iraqi Supreme Council must judge under the Constitution, when applying those existing Laws and Traditions. It's a delicate balance, but if you empower intelligent, respected Muslim leaders to shape the new Iraq in a mesh of traditional views with the reality of unification, I think you might just be surprised at the results! And it would not surprise me one bit if there is at least one nomination of a woman, especially if she was a conservative Muslim herself (and respected for her practice of tradition).

- I Could Be Dead Wrong

And maybe I'm just an ignorant American. But if there is one thing I realize, it's that our US Constitution of separation and balances is alive and well. We might not like the laws, we might feel we've lost our freedoms, but I believe in the balance that results, the laws that are struck down as Unconstitutional, the leaders who step out as well as out-of-bounds to be returned to where they should be. It's not perfect, I get upset all-the-time -- but we stay unified, while we still don't trust each other.

I think that's a pretty universal way to deal with any humanity as a federalized government if you ask me.

2005-08-16

Microsoft's Next Target: Adobe

This is hardly news, but the actual plan is just starting to come out. This is yet another "we buy the 3rd best product/company on the market, tie it to Windows, then market the hell out of it." This is how Microsoft extends its dominance through the distribution channel, which is where 90% of consumers get all of their products. Until Firefox, over 90% of consumers used Spyglass Explorer, more commonly known as Internet Explorer now. In the case of Adobe, the company Microsoft purchased back in 2003 was Creature House, largely for their Expression software.

Renamed "Microsoft Acrylic," it ties in perfectly with their plans for the "Avalon" presentation system of Windows Vista (NT 6.0, codenamed "Longhorn," client). Although the first rendition of the Windows Graphics Foundation (WGF) 1.x which "Avalon" uses is based on aging DirectX 9.x and nowhere near the capabilities of Apple's QuartzExtreme, FreeDesktop.org's Cairo or Sun's Looking Glass, all based on OpenGL, the idea is eventually to get there with WGF 2.0 based on work-in-progress DirectX 10 (no longer to be named such) due in late 2007. As such, all future Microsoft developments are targeting Avalon.

This means a seamless experience from 2D/3D print off-screen to presentation on-screen, including use of and storage in the graphics framebuffer, instead of the aged set of CPU-memory driven, overlapping pages of the legacy Graphics Device Interface (GDI). For those already running MacOS X with QuartzExtreme, possibly the Xgl Server or maybe Sun's Looking Glass preview on Linux, you've already discovered how powerful this presentation can be for 0 overhead (because it's done on your idling graphics processing unit, GPU, and its memory framebuffer). Again, it won't be until WGF 2.0 before Microsoft gets to the same level of capability, and the WGF 1.1 that will ship in Windows Vista will be rather limited and a resource hog (unlike QuartzExtreme). But Microsoft is still planning for the future.

Integration with Microsoft Office 12 (aka 2005) will be limited, but you can be sure that Acrylic is a preview of how Microsoft Office 13 for Vista will be presentation-wise. This includes finally bringing to Windows a unified equivalent of what MacOS X and, more recently on the leading-edge, Linux users have had in the OpenGL-Postscript-SVG combination. As such, Windows could truly be considered a "graphics desktop," and Microsoft will be right there with a leading-edge application in Acrylic. Thus leaving Adobe to ponder if they could have prevented their demise better.

Maybe Adobe should have supported MacOS X, or possibly even Linux, better? Especially given the fact that Linux now dominates the Computer Generated Imagery (CGI) world. In fact, when Disney finally put through the effort to get Photoshop running under WINE (WINdows Emulator) on Linux because the only application they were running on 30,000 Windows desktops was Photoshop (everything else, like Maya, was already running on Linux), it should have been a "wake up call" to Adobe. But now Adobe might be moving too little, too late, even though we won't see the end-game for Adobe for another 3-4 years when their bottom-line starts getting stretched.

I mean, should Microsoft actually deliver on the promise of WGF 2.0, which all their application developments seem to be based on, it's going to be a tough sell for Adobe, at least on Windows. So that leaves Adobe to maintain a staple on MacOS X (possible) and Linux -- especially Linux which has been adopted by the CGI world NOT for cost considerations, but capability. Again, Disney should have been the "wake up call" years ago that there are hundreds of thousands of people willing to pay big money for professional graphics suites on Linux.

Related eWeek Article

2005-08-09

"Small Enough" Form-Factor PC

UPDATED 2005Oct26, 2005Sep29

- Engineer-level Integration


Unknown to most users, in the last few years, the semiconductor industry has gone through an explosion of commodity peripheral logic (e.g., ports, audio, network, drive, etc... interfaces) whereby it is getting easier for engineers to take a logic core and add various peripheral components to it into a single chip at a much lower price point (in volume designs, as always). As such, more and more PC mainboards are coming with more and more peripherals on-mainboard, if not inside the chip(sets) themselves, at a sub-$100 price point.

- Newer Interconnects Now Balance Integrated Value Versus Performance

New, dedicated device peripheral buses like PCI-Express (PCIe), as well as logic directly connected on the system interconnect (like AMD HyperTransport) have also helped with performance and reduced I/O device contention -- all while bringing cost down at the same time. Today, thanks largely to PCIe, many desktop PCs now have more I/O throughput on even the non-video slots and integrated peripherals than server mainboards did just a few years back. So the need for expansion in a desktop solution is relatively limited for nominal desktop uses.

[ SIDE NOTE: Admittedly, there is still a dearth of quality PCIe storage and other controllers. They seem to be limited to PCIe x1 and x4 NIC (although excellent, server-quality PCIe x4 NICs do exist) and cheap ATA/SATA PCIe x1 controllers. The lack of PCIe x4 and x8 products (which still work well in PCIe x1 slots), especially higher-end storage and audio controllers, is currently hampering both server and workstation adoption of PCIe other than for video. But newer ICs should be changing that in the next 6-9 months. ]

- Intel Sheer Volume Commodity Pricing at Major Tier-1 PC OEMs

Probably the hallmark vendor of integration in mass volume is Intel, with its newer FlexATX form-factor that is the staple of major tier-1 PC OEMs who have 0 of their own R&D and leave everything to Intel and Microsoft subsidy. Users get a fairly small case design with integrated graphics, networking, I/O ports and a few expansion slots at an extremely affordable price from Dell, Gateway 2000 and many others short of Tier-1 OEMs who are also R&D Powerhouses like HP and IBM (who has even shed their PC division now).

[ SIDE NOTE: Intel has also introduced a new form-factor to replace both FlexATX OEM and MicroATX at the assembler/whitebox in BTX that better addresses cooling with a mid-front intake fan and a mainboard orientation flip that does not trap heat under the AGP/PCI/PCIe PCB like current ATX designs do. But this has failed to catch on as it is more for non-gaming/expansion designs, doesn't do dual-processor, AMD is sticking with ATX and some whitebox products have flipped the ATX mainboard orientation on their own. ]

- Not-So-Commodity: Proprietary Small Form-Factor (SFF) PC for Enthusiasts

More known to users is the fact that there have been a number of vendors introducing Small Form-Factor (SFF) PC systems. From ViA there is the ITX form-factor (~6" x 6") mainboard, built around its low-power C3 processors (same lineage as the AMD/Cyrix/NS Geode). And from some mainboard vendors, there have been their own SFF solutions with their own, proprietary mainboard design. In every case, these systems are not cheap, although some "bare bone" designs help defray costs as they are spread over the many additional components included (for $300-500). Then again people are typically not buying them for cheap, but for portability, so there's no reason for some not to "cash in" on the market.

- The Growing Whitebox SFF Solutions Pool

Of course, other manufacturers outside of the mainboard vendors were taking note too. Antec, with their Aria product, is just one of many PC enclosure/accessory vendors now offering their own enclosure and power supply solutions. But instead of offering a proprietary form-factor in a mainboard, they often take the standard MicroATX (~9.6"x9.6") form-factor. MicroATX is basically ATX with "3 slots chopped off", reducing the size by almost 3" in length, and typically keeping the depth of the mainboard to 9.6" (and typically less). This gives enthusiasts more of a "choice" in mainboard solution, almost to the same expansion as a full ATX, while still delivering a fairly small form-factor -- especially with a MicroATX power supply or an even smaller, proprietary one.

- Power and Cooling Still Far Too Neglected In SFF

Unfortunately, most FlexATX, proprietary SFF and even most MicroATX form-factor solutions still neglect the issues of power and cooling for general purpose use. Now this is typically by design in ViA's ITX for its low-power C3 products, and most FlexATX solutions are low-cost Intel OEM products with on-board video sucking up little juice other than the P4 processor. But more and more SFF products in the tier-2/whitebox market are being marketed to the "LAN Party" crowd of gamers and enthusiasts who use some of the latest processors and video cards that need lots of juice and lots of cooling.

Although a few proprietary SFF products are shipping 80mm, 92mm and even 120mm fans, they typically use a proprietary power supply form-factor that is meager and non-upgradable. And looking at the MicroATX form-factor, it was designed to be a thin desktop or mini-tower. The result is that most MicroATX power supplies are typically not of any great power (although I have found and used a 460W MicroATX 24-pin+4-pin PS, though it's very long and probably too long for some cases), and have only 40mm or 60mm fan options.

- The Holy Grail? MicroATX mainboard plus standard ATX power supply?

Which brings me to the new crop of "Small Enough" enclosures that might be ideally suited not only for the performance/enthusiast self-assembling PC user, but possibly everyone who wants a tier-2/whitebox product half the size of some ATX products today. Such products are the Chenming 118 series (affordable), its Aspire X-QPack brother (also affordable) or a new OEM design being sported by SilverStone in its SG01 (very expensive, unfortunately) and likely other vendors to come (typically 1/2 the price of SilverStone).

The typical design calls for a "deep cube" with a front of around 10-11" x 8-9" and a depth of about 14-15". The front is clearly much "taller" than a desktop and "wider" than a mini-tower FlexATX or MicroATX, and its height is about the same as the "width" of an ATX tower. But at only 10-11" across, it is almost a cube and much, much less than the typical 15-17" "height" of an ATX tower, and most ATX cases are also 20"+ deep! Given the size and weight, especially in an aluminum material (~10lbs.), a built-in handle is often included for easy porting. But that's just the beginning. The "efficiency" of the design is clear when the system is looked at internally.
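To put rough numbers on the size difference, here is a minimal sketch that compares enclosure volumes. It assumes the midpoint of each range quoted above (10.5" x 8.5" x 14.5" for the "deep cube"; 16" tall and 20" deep for the ATX tower), and it assumes the ATX tower is 8.5" wide, since the text only says the cube's height matches an ATX tower's width. Real cases vary, so treat the result as a ballpark only.

```python
def volume_cubic_inches(width, height, depth):
    """Outer volume of a box-shaped enclosure, in cubic inches."""
    return width * height * depth

# Midpoints of the ranges quoted in the text (an assumption, not measurements).
small_enough = volume_cubic_inches(width=10.5, height=8.5, depth=14.5)

# ATX tower: 15-17" tall, 20"+ deep; the 8.5" width is assumed.
atx_tower = volume_cubic_inches(width=8.5, height=16.0, depth=20.0)

ratio = small_enough / atx_tower

print(f"Small Enough cube: {small_enough:.0f} cubic inches")
print(f"Typical ATX tower: {atx_tower:.0f} cubic inches")
print(f"Cube is about {ratio:.0%} of the ATX tower's volume")
```

With those assumed midpoints the cube comes out at a little under half the volume of the tower, which lines up with the "half the size of some ATX products" claim.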

- Chenming/Aspire v. SilverStone: Power and Cooling

Instead of putting the power supply "next to" one side of the board, it is put above either the CPU area (8" high versions) or the slot area of the MicroATX (9" high versions). And best of all, it's a FULL ATX power supply -- meaning power is never an issue and almost every option is available (except for maybe a really deep ATX PS on 14" deep models). Whatever you can get in an ATX size power supply is now fully usable on your board, allowing the latest CPUs and video cards to be used with adequate power.

Chenming 118 series (11" x 9" x 14") with 120mm exhaust fan and split +12V 300W 24-pin+P4 ATX 2.0 power supply for under $80.

Aspire X-QPack series (11" x 9" x 14") with 120mm exhaust fan and legacy 420W 20-pin+P4 ATX 1.0 power supply for around $80.

SilverStone SG01 series (10" x 8" x 15") with 80mm exhaust, 60mm internal for HDs and no power supply (expensive, but OEM models at half-price should come out soon).

The other area then has a large outtake fan, even 120mm in some models (like the Chenming/Aspire). Ironically enough, the bigger the fan, the slower it has to spin to move the same airflow, so the sound can be around 30dB, much lower than most SFF. Between the PS' fan(s) and the outtake fan, the airflow is adequate for cooling even today's highest-end CPUs and video cards (possibly even SLI).
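The "bigger fan, slower spin" point can be illustrated with the standard fan affinity laws, under which airflow for geometrically similar fans scales with rotational speed times the cube of the diameter. The sketch below is an idealization: it assumes the two fans share the same blade design and ignores static pressure, hub size and case restrictions, and the function name is only illustrative.

```python
def relative_speed_for_same_airflow(small_fan_mm, large_fan_mm):
    """Fraction of the small fan's RPM that the large fan needs
    to move the same airflow, using airflow ~ speed * diameter**3
    (fan affinity law for geometrically similar fans)."""
    return (small_fan_mm / large_fan_mm) ** 3

# A 120mm fan compared against an 80mm fan and a 60mm fan.
ratio_120_vs_80 = relative_speed_for_same_airflow(80, 120)
ratio_120_vs_60 = relative_speed_for_same_airflow(60, 120)

print(f"120mm vs 80mm: {ratio_120_vs_80:.0%} of the RPM")
print(f"120mm vs 60mm: {ratio_120_vs_60:.0%} of the RPM")
```

Under this idealized scaling, a 120mm fan needs roughly 30% of the speed of an 80mm fan for the same airflow. Real fans will not match that exactly, but it shows why the larger fan can run so much quieter.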

UPDATE (2005Oct26): AnandTech (full AnandTech article on the SilverStone SG01B) on the cooling and sound of the Chenming 118 / Aspire X-QPack design versus the SilverStone SG01:
Component Temperatures

Component       SG01         X-QPack
Exhaust Air     33.7 C       28.3 C
CPU             34 / 47 C    32 / 44 C
GPU             53 / 69 C    49 / 64 C
HDD             29 C         30 C
Northbridge     36 C         34 C
Power Supply    32 C         32 C




- Chenming/Aspire v. SilverStone: Expansion

Next comes what I call the "Good Enough for 95%+" expansion. There are typically two (2) exposed 5.25" drive bays. Since they sit at the front and top of the 8-9" high case, this leaves a good 4.5" of clearance for the CPU fansink and other mainboard components. The 5.25" drives are never in front of the four (4) slot area (the left 3-4" facing the case from the front, hence why bays will either be centered or right-offset), and that area is typically open (for up to 8-10"+ long cards). If the case is 9" high, there might be one (1) exposed 3.5" drive bay as well. For internal storage, two (2) 3.5" drives are typical, either side mounted in the front, with inlet holes that pull air over them (and directly back to the outtake fan behind), or maybe in a sideways mounted component area that makes access easier, possibly with its own fan.

This gives you an optical drive and one other 5.25" bay option (removable drive, audio/multi-function, etc...), and up to 2 hard drives for redundant or striped (performance) storage, as well as just enough I/O expansion slots for just about any combination (possibly even SLI video in the future?) given today's on-board peripherals. With a large exhaust fan and a full ATX power supply with its additional fan(s) and well-placed intake slits (an important note in a small volume case), that's plenty of power and cooling for even the most serious of gamers. It's a heck of a lot more portable and practical than the typical 7-10 bay ATX case that people go to by default, which leaves them with a lot of empty space (and 2-3x the weight, even when using aluminum).

Standard Components, Full Size, No Compromise! AnandTech puts in a full size ATI Radeon XT series video card which fits under the full size ATX power supply in the Chenming/Aspire design, but the 14" depth doesn't cramp the full size 5.25" optical drive either.



Commodity breeds cost effectiveness, and for the tier-2/whitebox assembler/consumer, this might be "The Holy Grail" for PCs -- from desktops to LAN Parties to, and I know I'm going to get slaughtered for even suggesting it, "SOHO servers" or "small, application-specific servers" (be it using 2 internal drives mirrored, or even a 3-bay, 5.25" x 2U internal enclosure). The 10-11" x 8-9" x 14-15" MicroATX+ATX-PS "box" form-factor can and should be the "killer enclosure" that everyone should be embracing IMHO.

- Missing-In-Action: The "Powerful" MicroATX Mainboards?

UPDATE (2005Sep29): See new nVidia C51 (GeForce 61x0 / nForce 4x0) designed for MicroATX

Which begs the question, where are the powerful MicroATX mainboards? To date, I have one of the very few nForce4 Standard mainboards from Foxconn. It would be nice to see some nForce4 Ultra and maybe even an SLI product (2 PCIe16, 1 PCIe1 and 1 PCI is definitely doable) in a MicroATX. I guess the problem is that until these form-factors catch on, and I believe they will, mainboard manufacturers don't see a market given the typical non-enthusiast/gamer deployment of MicroATX where power and cooling are limiting factors.

Foxconn nForce4 series MicroATX, one of the few "Power" MicroATX solutions bringing the latest in PCIe x16 and Socket-939 Athlon 64/FX/X2 power with all other peripherals built-in (4xSATA, 2xATA, 10xUSB2.0, 1xIEEE1394, 5.1 Audio) to the new "Small Enough" Form-Factor PC at less than $80.

Frankly, I think the sooner people embrace this "Small Enough" form-factor, the happier everyone will be. The expansion is typical and "just right" for 90%+ of consumers, the cooling and airflow are perfectly adequate and efficient (especially with a slower spinning 120mm fan at ~30dB) and there is no compromise when you can always put in a full ATX power supply. That is especially true when the Chenming 118 with a split +12V rail 300W ATX2.0 power supply (for newer, 24-pin+P4 mainboards) and the Aspire X-QPack with an older 420W ATX1.0 power supply (for older, 20-pin+P4 MicroATX mainboards) run only about $80, no more costly than a similar ATX case with a power supply of similar quality.

The Chenming 118 and its standard 300W ATX2.0 powers my nForce4 with an Athlon64, 100W+ sucking GeForce 6800GT, two hard drives, a floppy+card reader and an all format DVD recorder/rewriter. The X-QPack and its standard 420W ATX1.0 powers my wife's Asus ViA KM400 with an Athlon XP2600+ (Model 6 which sucks more power than my Athlon64), also a GeForce 6800GT (AGP), two hard drives, a floppy+card reader and also an all format DVD recorder/rewriter.

Mitsumi FA404 series (1.44MB floppy + 8-in-1 USB2.0 card reader) fits in the Chenming/Aspire's exposed 3.5x1" bay for only about $20.

The units weigh less than 20lbs. fully loaded and are easily portable via their plastic front-handle. Unless I need something with dual-processor, I don't see any reason to assemble anything but these "Small Enough" Form-Factor designs, although I would like more choice in mainboards. Hopefully popularity will dictate this change.

2005-08-02

Only Nixon could go to China ...

As someone who agrees about 80% with the US Libertarian party, 40% with the US Republican party and 20% with the US Democrat party, I really avoid most politics -- both partisan and non-partisan. Most of the US TV media provides me with 2 sides, and I don't agree with either. I found my blood pressure went down 20 points when I stopped watching the TV news years ago. The US radio media isn't much better most of the time; the only difference is that they can't show me flashy graphics and video that contradict what they are saying (God I hate it when I see that, especially when the agenda comes through on why they did that).

So occasionally I hit Google News and read up on the largely foreign newspapers. It's always good to see what other nations think of Americans, and to get a good chuckle when they get it wrong and, even more scary, when a few get it far more correct than our own TV journalists. That goes double because I don't like GWB and didn't vote for him in 2004 (because the Libertarians finally put up someone I would vote for -- although I voted for him in 2000 because the economy was going to crap according to Clinton's own quarterly OMB figures, something other people didn't read). And he's pretty much suffering the same political nightmare that LBJ did during Vietnam -- appease Congressmen with entitlement spending while the 1-2 expense whammy of "guns'n butter" is killing our future. Ask not what your country can do for you, but ask what your Congressman can get for you (or him/her) -- same deal some 35 years later.

Now I did catch the fact that Bush used a recess appointment to appoint Bolton to the UN. It's not the first time a recess has been used for an ambassador, but it is the first time for the US ambassador to the UN. I haven't been a big fan of the partisan politics because, after all, even Democrat President LBJ was able to appoint Robert McNamara -- of "Robert McNamara's War" fame in Vietnam (self-admitted, even at the time) -- but GWB has been trashed for Wolfowitz, who is another "whiz kid" of McNamara's level. The only difference 37 years later is that there is no Soviet Union and the TV media questions everything except itself.

My more recent, personal favorite is the revisionist history of Iraq, not only by the President, but by Clinton lovers who forget that the World Trade Center was bombed in 1993 and that Clinton stayed out of at least one genocide (Rwanda), and who cite evidence that Bush was planning to invade Iraq even before 9/11 -- forgetting that a lot of the facts people are finding are from the 1998-1999 timeframe when Clinton almost did too. God I love that one personally, along with the 2 Executive Orders signed by Clinton in 1998 (that GWB used after 9/11, not the Patriot Act because it hadn't been invented yet!). There's enough to pin on GWB without demonizing him for everything in the last 12 years.

But now this Bolton appointment got me thinking. You take a man who has criticized, harshly, one entity over years, and you now put him in a higher position where he is forced to tackle the very entity that he says doesn't work, is not worth our bother, etc... Sounds like Richard Nixon if you ask me -- criticizing China all through the earlier years of his career was what he built his own reputation on. And what did he do when he took a higher office? Sign an alliance in case of nuclear attack, and open up one of the biggest markets in the world (even if later lack of enforcement of Intellectual Property, IP, has caused most of our woes).

Only Nixon could go to China. So from my point of view, Bolton might be the ideal man for the job. Especially given the shake-up that the UN needs, much like the Chinese needed over 30 years ago too. He's got 18 months to prove he is.