Filesystem Fundamentals and Practices
Overview:
- Windows Deprogramming (the Windows You Don't Know)
- Traditional UNIX Mindset (Why It Still Works After 30 Years)
- My Professional Linux Practices (Even More, Why?)
Windows Deprogramming (the Windows You Don't Know)
A lot of posts I see on Linux lists are from typical Windows users new to UNIX. We'll get to my "UNIX Mindset" comments in a second, but it is important to understand that the overwhelming majority of Windows filesystem issues are Windows-only. There are many aspects where past Windows experience, even extensive Windows Server administration experience, is wholly inapplicable to UNIX. Filesystems are the biggest of these differences. IBM-Microsoft systems have always used a File Allocation Table (FAT) approach, whereas UNIX uses an inode approach, as we'll discuss -- the two are night and day.
Compounding the problem, most MCSEs are oblivious to most of the issues with Windows filesystem design, especially those of the New Technology Filesystem (NTFS), that can make it very problematic. Coming from OS/2 in the early '90s, I warned my contacts at Microsoft of the dangers in early NTFS "false security" and other changes made from OS/2's High Performance Filesystem (HPFS), not that HPFS was an ideal design either. I recently did a series of presentations on low-level UNIX and Windows interoperability that covered the inherent design issues of NTFS, among other disk considerations:
- Low-Level Interoperability Part 1
- Low-Level Interoperability Part 2
- Low-Level Interoperability Part 3
But to start, let's go over the history of issues with Microsoft's filesystem designs.
- Microsoft (MS) DOS 1.0
MS-DOS 1.0 was a direct (and illegal) port of Digital Research's CP/M from the 8080 to 8088. There is a long history on that (MS bought it from Seattle Computer Products, the original pirates, for $50,000, which IBM later settled out-of-court with DR for $800,000). But the limitations of CP/M were clear: no directories, only 1,024 files in the filesystem, and filesystem reference was by drive letter (e.g., A:, B:, C:).
The File Allocation Table (FAT) approach was simple, but effective. The filesystem was a simple set of sectors, with two (2) file allocation tables, one original, one backup. The allocation tables were to track allocation of sectors. If a file was allocated space, if it only took up one sector, then the relative FAT entry for that sector would be noted as the end of the file. If the file took up more than one sector, then the initial FAT entry would note the next sector of the file. Each file is a chain of entries in the FAT referencing the next sector.
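The chain-following idea described above can be sketched in a few lines of Python. This is only an illustration, not the on-disk format: the list-based table, the EOF and FREE markers, and the function name file_sectors are all assumptions made for the example (real FAT12 uses reserved numeric values for those markers).

```python
# Hypothetical in-memory FAT: index = sector number, value = next sector.
EOF = -1      # assumed end-of-file marker (real FAT12 uses reserved values)
FREE = None   # assumed free-entry marker

def file_sectors(fat, first_sector):
    """Follow the FAT chain from a file's first sector to its end marker."""
    chain = []
    sector = first_sector
    while sector != EOF:
        chain.append(sector)
        sector = fat[sector]   # each entry names the file's next sector
    return chain

# A 3-sector file starting at sector 2, and a 1-sector file at sector 3.
fat = [FREE, FREE, 5, EOF, FREE, 6, EOF, FREE]
print(file_sectors(fat, 2))  # [2, 5, 6]
print(file_sectors(fat, 3))  # [3]
```

Note that the directory (or, in MS-DOS 1.0, the flat file list) only has to remember the first sector; everything else about the file's layout lives in the table.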
The FAT references were 12-bit, allowing up to 4,096 sectors to be addressed. With sector sizes of 512, 1,024 or 2,048 bytes, FAT12 could handle up to an 8MiB device. With up to 1,024 filenames, the FAT of a FAT12 only took up 1.5KiB (12,288 bits) of space.
- MS-DOS 2.1
MS-DOS 2.1 finally introduced the concept of a directory. To do this Microsoft "borrowed" source code from Santa Cruz Operation (SCO) Xenix, a port of UNIX source code to the PC and its 8088 processor. Microsoft helped found and fund SCO in 1978, before the United States broke up the monopoly of AT&T on the telephone infrastructure of the US, so UNIX was a non-commercial endeavor with source code available from AT&T (as well as popular derivatives such as that of the University of California at Berkeley). As anyone who has been around Microsoft and Linux a long time knows, Windows currently has more original UNIX/SCO source code from before the AT&T v. UCB settlement in 1993 (and the creation of the "UNIX(R)-free" 4.4BSDLite), than any free UNIX or UNIX-like system such as Linux (which only uses the UCB-owned 4.4BSDLite code).
One issue Microsoft ran into was the forward slash (/) for directory names: CP/M and, therefore, MS-DOS 1.0 used it for command line options (e.g., /?). The workaround was to use the backslash (\), which is where the drive letter-backslash (e.g., C:\) comes from. Most multi-user operating systems, including UNIX, did not use drive letters, and had used forward slashes since the 1960s -- pre-dating this decision by Microsoft in 1982 by almost 2 decades. I rather tire of people new to UNIX/Linux who complain that UNIX/Linux didn't follow Microsoft's so-called "lead."
MS-DOS 2.1 also introduced a 16-bit FAT, now allowing up to 32MiB filesystems to exist with a 512 byte sector size typical of fixed disks at the time.
- MS-DOS 3.31
Largely a major set of contributions by Compaq, MS-DOS 3.31 further expanded some 16-bit FAT features. One was the allocation of multiple sectors into a single block allocation unit. So instead of assigning a single sector per FAT entry, a block or "cluster" of sectors could be used, increasing the limit beyond 32MiB. Up to 64 sectors could be used for one (1) 32KiB "cluster," raising the maximum filesystem size of FAT16 to 2,048MiB (2GiB).
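The size limits quoted so far all fall out of the same multiplication: number of FAT entries, times sector size, times sectors per cluster. A quick sketch follows, with the caveat that the helper name fat_capacity is made up for this example and that it ignores the handful of FAT entry values real implementations reserve, so these are upper bounds.

```python
def fat_capacity(entry_bits, sector_bytes, sectors_per_cluster=1):
    """Upper bound on filesystem size: one allocation unit per FAT entry."""
    return (2 ** entry_bits) * sector_bytes * sectors_per_cluster

MiB = 1024 ** 2
GiB = 1024 ** 3
print(fat_capacity(12, 2048) // MiB)      # 8   (FAT12, 2,048-byte sectors)
print(fat_capacity(16, 512) // MiB)       # 32  (FAT16, one sector per entry)
print(fat_capacity(16, 512, 64) // GiB)   # 2   (FAT16, 32KiB clusters)
```

The trade-off is that every file, however small, now consumes at least one whole cluster.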
- MS NT 3.1 and the New Technology Filesystem (NTFS)
Windows NT 3.1 hit beta test in 1992, over 2 years before MS-DOS 7 and Windows 4 were formally recognized and bundled into a product codenamed "Chicago" and known a year after that as "Windows 95" on the shelf. The original intent and purpose was to make MS-DOS 6.22 and Windows 3.11 the last generation of the 386Enhanced (i.e., Real86 DOS that constantly shunted in and out of Protected386 to run Win16/32s applications). Windows NT 3.1 borrowed heavily from OS/2, including its High Performance Filesystem (HPFS), as well as benefited from a mass exodus of talent from Digital Equipment Corporation's (DEC's) Virtual Memory System (VMS) team (almost resulting in a lawsuit until Digital and Microsoft came to an agreement on support of NT for their new Alpha processor, the OEM with the most MS NT certified professionals, etc...).
As detailed in the links on my low-level interoperability presentations, NTFS does a lot of things for "false security" that cause massive compatibility issues with NT itself. But NTFS is, in essence, a modified version of FAT. It still uses a FAT design, but has far fewer limitations (e.g., no more 8.3 limitations) and uses a more intelligent approach. One improvement is that the FATs are located closer to the middle of the filesystem, to reduce seek times (FAT filesystems allocate them at the start). And there are now formal approaches to discover which copies of the FAT are correct when they differ. Lastly, like HPFS, NTFS marks and forces filesystem integrity checks when the system is not properly shut down and the filesystem taken off-line (and uses the same CHKDSK.EXE program, although radically different than the legacy DOS program of the same name). NTFS one-ups HPFS by adding journaling, which reduces the recovery time required for bringing the filesystem on-line as consistent.
Of the FAT variants, only FAT16 was supported in NT 3.x and even 4.0, although an NT-only 64KiB cluster size was an option that allowed up to 4GiB FAT16 filesystems to be created. This was due to the fact that through even NT 4.0, the installer could not install directly on NTFS, and could require up to 4GiB of space. There were also installer and boot-time issues with non-PC architectures (which used the ARC firmware -- long, long story).
- MS-DOS 7.0 (Windows 95, 95A)
MS-DOS 7.0 is at the heart of Windows 95 and 95A (OEM Service Release 1). It is still FAT, but adds hidden files for indexing long filenames. Ironically enough, MS-DOS 7.0 did not use the OS/2-NT functions for filesystem support, but when the 386Enhanced Windows 4.0 kernel loaded, it extended the existing DOS Interrupt 20-3Fh (typically 21h) services for long filename operations.
This means that until 1999, the filesystem mechanisms in all releases of Microsoft Windows NT and Windows 95/98 differed entirely. Programs had to be written for both, and many were not -- and this was just the tip of the iceberg for Windows NT v. 95/98 compatibility. 100% -- ALL of Microsoft's own programs -- FAILED their own "Designed for Windows 95 and NT" logo program, which caused Microsoft to scrap its own certification program because of such issues. But that is another long, long story.
- MS-DOS 7.1 (Windows 95B on-ward)
In 1996 Microsoft released OEM Service Release 2 (OSR2), also known as Windows 95B, with a new 32-bit FAT design. This allowed filesystem sizes up to the 137GB (128GiB) limitation of the 28-bit Advanced Technology Attachment (ATA) specification -- commonly found in Integrated Drive Electronics (IDE) storage. FAT32 offers a few advantages including storage of long filenames in the FAT32 design directly (instead of hidden index files), but no consistency checking or other benefits.
Windows 95 OSR2.5 (95C), Windows 98, 98 OSR1 and 98 OSR2 aka "Second Edition" (SE) all used the same MS-DOS 7.1 core.
Windows Millennium Edition (ME) was a Microsoft experiment to remove a lot of the legacy DOS 20-3Fh services and various interface options to force its own software application developers and independent software vendors (ISVs) to stop using the legacy DOS interfaces and start using the native NT/Win32 filesystem interfaces (among others). It was an utter failure as it did little to force change, all while destroying compatibility.
- MS NT5 (Windows 2000)
Windows 2000 was released in early 1999 and was the first NT release to finally support a subset of the legacy "Chicago" interfaces first introduced in 1994, 5 years earlier. By then, it was too little, too late, but it did help speed Windows 2000's adoption in corporations where Windows NT had been adopted far less due to compatibility issues with most Windows software. Also introduced with Windows NT 5 aka "Windows 2000" was the Logical Disk Manager (LDM) disk label (partition table format), which replaces the legacy BIOS/DOS disk label (partition table format) of primary/extended/logical (NOTE: it actually looks like a single primary partition of type 42h).
Most of the benefits of the LDM can be found in my interoperability presentation. Use of the BIOS/DOS disk label is quickly becoming a serious compatibility issue in MS NT5.1 (Windows XP) Service Pack 2 and greater, and newer ATA 48-bit addressing and legacy DOS/NT compatibility are conflicting. Again, see my interoperability presentation.
- Issues With the FAT Filesystem Design
There were some serious design flaws to FAT that still plague FAT even in MS-DOS 7 (95/98/Me) and NT5 (200x/XP) today, as well as NTFS itself.
When it comes to FAT, one is the two (2) FAT copies. Although the use of two copies was designed for redundancy, when the two FAT copies differ, there is no way to know which one is correct. Some 3rd party tools attempt to do so, but they can and often do pick the wrong one.
Another issue is the simplistic design of the chain of FAT entries. A lost chain -- whereby a single FAT entry is incorrect -- results in the rest of the file being lost. A related issue is the cross-linked chain -- whereby two FAT entries point to the same next sector. Even if a copy is made for each, it means one chain is now incorrect.
But probably the worst issue with FAT, which still plagues NT5 (200x/XP), is the lack of any filesystem integrity whatsoever. If the system crashes, there is no way for FAT to report that it was not taken off-line correctly and left "consistent." Which means that when the system boots up, it does not know if the FAT filesystem is consistent. Although the CHKDSK.EXE program was introduced in later versions to do run-time integrity checks of the FAT filesystem, it still does so on a live FAT filesystem, and is rarely perfect. Microsoft later bundled SCANDISK.EXE, based on a license of Norton Disk Doctor (NDD.EXE), which had better recovery logic.
Most people think the lack of filesystem journaling (i.e., recording transactions and ensuring they are completed, somewhat like the "Atomicity" -- the first part of ACID in a good database design -- in a filesystem) is the issue, but it's actually not at all. That is because even non-journaling filesystems in UNIX have at least a mechanism to not only ensure consistency, but force a check to make the filesystem consistent. All of Microsoft's FAT filesystems, even in NT5 (200x/XP), never force the user to fully check a filesystem for consistency before mounting. In UNIX, we only allow filesystems to be mounted "read-only," if at all, until they are checked for consistency -- so no changes could occur on a possibly inconsistent filesystem.
Because of this, the FAT filesystem is often left in a far worse state -- with a FAT table with errors, which means new files written can and do get lost. Running SCANDISK.EXE during boot helps mitigate some risk, but by the time SCANDISK.EXE runs, some processes have typically written to the filesystem. Ironically enough, the lack of a "read-only" mount in not only DOS, but even NT5 (200x/XP) today, is the root cause of this issue that affects even the New Technology Filesystem (NTFS) as well, unlike almost every UNIX filesystem design.
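To make the cross-linked case concrete, here is a minimal sketch of how a checker might spot it: walk every file's chain and flag any sector claimed by more than one file. The dictionary-based table, the EOF marker, the file names and the function name find_cross_links are all assumptions for illustration, not how any real CHKDSK or SCANDISK is written.

```python
EOF = -1  # assumed end-of-chain marker for this sketch

def find_cross_links(fat, file_starts):
    """Return {sector: [files]} for sectors claimed by more than one chain."""
    claims = {}
    for name, start in file_starts.items():
        sector, seen = start, set()
        while sector != EOF and sector not in seen:   # 'seen' guards against loops
            seen.add(sector)
            claims.setdefault(sector, set()).add(name)
            sector = fat[sector]
    return {s: sorted(names) for s, names in claims.items() if len(names) > 1}

# B.TXT's entry at sector 7 wrongly points into the middle of A.TXT's chain.
fat = {2: 3, 3: 4, 4: EOF, 7: 3}
starts = {"A.TXT": 2, "B.TXT": 7}
print(find_cross_links(fat, starts))
# {3: ['A.TXT', 'B.TXT'], 4: ['A.TXT', 'B.TXT']}
```

The checker can see that sectors 3 and 4 are claimed twice, but nothing in the table says which file they really belong to.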
In other words, even NT systems have no concept of a "read-only" mount. During start-up, NT expects everything but the "System" volume (the "System" volume is BIOS fixed disk 80h with the MBR, NTLDR, BOOT.INI, optional NTBOOTDD.SYS 3rd party disk driver, etc...) to be read/write, including the "Boot" volume (the "Boot" volume, which ironically comes after the "System" volume/stage, is the volume with \WINNT or \WINDOWS). This is due to the fact that many NT services expect the filesystem to be writeable during boot, including before any filesystem integrity check and/or journal replay of NTFS is made -- which could result in corruption.
Adding insult to injury, NTFS is very, very aggressive in its journal playback. Unless a full check is forced by explicit user option, NTFS often just replays its journal. In rare, but eventually probable cases over extended usage and time, NTFS will self-destruct and leave itself unable to recover from a manual CHKDSK. Therefore, it is important that NT system administrators regularly force a manual CHKDSK at next boot to enforce regular, full filesystem integrity checks and minimize the chance of a future, improper journal replay.
- Fragmentation: Why FAT Sucks
Filesystem fragmentation -- the state in which files are not contiguous blocks, but allocations all over a filesystem which reduces performance as the disk must read a file from different parts of a disk -- is virtually unheard of in the UNIX world, and we will discuss why. But for now, let's look at the reasons why disk fragmentation occurs in Windows, starting with DOS/FAT and moving to NT.
DOS/FAT uses a simple "first available" allocation scheme. When DOS/FAT needs to allocate a new file, it scans from the beginning of the FAT table for the first available FAT entry. When it finds one, it uses it. If the file needs more than one "cluster," it looks for the next, which may not be the immediately next entry, and often is not. This instantly results in fragmentation, and it has not been improved in even the latest MS-DOS 7.1 code in Windows 95B through ME. NT's implementation of FAT also remains little changed. Through even MS NT 5.1 (Windows XP), FAT is still largely a "first available" allocation scheme.
NTFS is a bit better, but also a bit worse. In addition to locating the FATs closer to the center of the disk, NTFS separates directory and file entries to speed up directory indexing. As an additional note, the directory stores also carry meta-data, which is why SIDs and other NT-installation/registry-specific meta-data are tied to the directory entries and should NEVER BE MODIFIED on any NT installation except the one that created the NTFS filesystem (see my low-level interoperability presentation). At the same time, the separation of directory blocks results in a serious race-condition for run-time defragmentation tools, and required extra code to implement (and was not included as standard in Windows NT's defragmentation -- although later versions licensed the code from a 3rd party that solved the problem). Fragmentation of the directory entries is why even when an NTFS filesystem might be contiguous, file references can still be very sluggish -- typically much slower than the worst UNIX inode design.
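A toy simulation shows how quickly "first available" fragments a file once a few deletions have punched holes in the table. This is a simplified model under my own assumptions (a boolean free map, made-up function names), not DOS's actual allocator code.

```python
def allocate_first_available(free_map, clusters_needed):
    """Claim the lowest-numbered free clusters, contiguous or not."""
    got = []
    for i, is_free in enumerate(free_map):
        if is_free:
            free_map[i] = False
            got.append(i)
            if len(got) == clusters_needed:
                return got
    raise OSError("filesystem full")

def fragments(clusters):
    """Count the contiguous runs in an allocation."""
    return 1 + sum(1 for a, b in zip(clusters, clusters[1:]) if b != a + 1)

free_map = [True] * 12
a = allocate_first_available(free_map, 2)   # [0, 1]
b = allocate_first_available(free_map, 2)   # [2, 3]
c = allocate_first_available(free_map, 2)   # [4, 5]
for i in a + c:                             # delete files a and c
    free_map[i] = True
d = allocate_first_available(free_map, 4)
print(d, fragments(d))                      # [0, 1, 4, 5] 2
```

Clusters 6 through 11 were free and contiguous the whole time, yet the new 4-cluster file is split in two because the scan always starts from the front.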
But these are just the FAT design issues. Windows itself is far worse.
- Fragmentation: Why Windows Really Sucks
The legacy "single drive letter" and "everything goes everywhere" approach of Windows is its ultimate bane from a fragmentation, let alone reliability, standpoint. There is no strict separation of key boot-time, core operating system, user software binaries, temporary files and, ultimately, user data. Although many FAILED and often CONFLICTING approaches (especially between the "Chicago" and NT groups inside of Microsoft -- the former winning over the latter out of sheer numbers) have come and gone, there is still no strict separation of files to segment critical files from not-so-critical ones, or to keep fragmentation from occurring.
Some of Windows' key issues:
- Dynamic Pagefile
Microsoft uses a dynamic pagefile by default, which quickly fragments into several sections spread over a disk. A quick workaround is to pre-allocate a large, static area early on. Even legacy Windows NT was typically smart enough to allocate a large, contiguous block, as long as there was one -- which is why it should be done early.
- Temporary, Log and Other Small Files: Less Reliability, More Fragmentation
This is my personal favorite: there is absolutely NO STANDARD in Windows for temporary files. Although the environment variables typically point to C:\TMP or C:\TEMP, or C:\WINDOWS\TEMP is often the default in newer versions of Windows, these are on the C: drive. And that's before even looking at the various Windows registry, log and other system created files that are regularly written, also on the Windows "Boot" Volume (the volume where \WINNT or \WINDOWS is located, typically C:).
This means that temporary files -- the absolute worst thing for a filesystem -- are shared with everything else. Temporary files are often small files that eat up random places of the disk (as some are deleted quickly while others are left behind). As we'll talk about later, although filesystem designs with extents (Microsoft does not offer one) can mitigate the fragmentation issues of filesystems that have both large, static files and small, temporary files, extents cause additional overhead that can quickly be self-defeating. As such, temporary files should always be on their own filesystem to prevent fragmentation -- let alone that their continuous creation and deletion exponentially increases the probability that an incomplete write during an improper shutdown (such as a system hang) affects other files on the filesystem.
In all versions of Microsoft Windows, YOU WILL ALWAYS HAVE THE MAJOR RISK OF THE EXTENSIVE NUMBER OF TEMPORARY FILES ON YOUR *CRUCIAL* "BOOT" VOLUME BEING CONSTANTLY WRITTEN/DELETED -- IT IS UNAVOIDABLE, IT IS THE LEGACY OF WINDOWS AND CANNOT BE CHANGED, AND WILL NOT BE CHANGED.
Then there are the issues with the defaults of C:\Documents and Settings (previously also C:\WINNT\PROFILES, among other, conflicting NT v. Chicago non-sense) and C:\Program Files. These almost always default to the same volume as the Boot volume as well (hence C: in most cases), and no matter what re-allocation is made, programs just want to write to those default locations.
Lastly, there is the continuing reality (and major security issue) that Windows programs want to write to the same directory where they are located. This is known as the "startup directory" in the program's settings. All Windows programs -- including the purest of Win32 applications -- have a "startup directory" and most programs -- using the Microsoft created Visual Studio loader code -- often assume they can write to that directory. Although the registry is being used by more and more programs as standard, along with C:\Documents and Settings or the profile of a user, there still exists a real lack of standards. In almost every case, it goes back to the history of Microsoft's own Visual Studio and other development products -- conflicting back'n forth between NT v. Chicago and countless other, almost professionally laughable non-sense that makes up the core of almost every Windows application, all introduced by Microsoft itself (and not by the ISVs Microsoft likes to blame).
The result is that there is absolutely NO SEPARATION OF SMALL, LARGE, TEMPORARY, STATIC, DYNAMIC OR OTHER FILE TYPES IN WINDOWS, NOR WILL THERE EVER BE IN THE NEAR (POSSIBLY FAR) FUTURE. Fragmentation is a fact of life for Windows, and it's not going to be solved because both the legacy and current issues with Windows today are the continuing issue.
- The Little Known NT5+ (2000+) Hack: Anchors
One thing I always love to test MCSEs on is if they know about Anchors. Anchors were finally introduced in Windows NT5 (2000) in 1999, nearly 7 years after early Windows NT 3.1 Beta Testers suggested that Microsoft would be well advised to "mount everything under C:." Well, Anchors do just that, bring the concept of mounting other NTFS filesystems inside of another at a specified directory, in NT5+ (2000+).
That way, one can mount a C:\Documents and Settings that is actually located on another filesystem. Maybe a C:\Temp that is also its own filesystem, assuming a system-wide set of variables point all programs at it -- instead of defaulting to C:\WINNT\TEMP or C:\WINDOWS\TEMP (God I want to [virtually, of course] shoot someone at Microsoft for defaulting to put temporary files in the HEART of the OS directory!). And best of all, I can locate the C:\Program Files on its own, static filesystem, separate from all the writing, overwriting, deleting and general filesystem "clusterfun" of the System and/or Boot volume (typically C:\).
The problem? Anchors still break everything. Unlike in the inode UNIX world where multiple filesystems are everyday life, today's Win32 -- which is a set of DOS Int21h function hacks with only partially followed Win32 functions by programs built with Visual Studio itself -- often results in programs breaking when the allocation units of C:\something aren't actually on the original C:\ filesystem. So Anchors are still limited as an option. I typically only use them on NTFS filesystems dedicated to data -- i.e., filesystems I'm sharing out via SMB to other systems, where the access is not by a local program.
But I figured it was important to note Anchors none-the-less, even though they should have been introduced 7 years ago (although that probably wouldn't have reduced their compatibility nightmare).
Traditional UNIX Mindset (Why It Still Works After 30 Years)
- inode filesystems
UNIX systems use inode filesystems. Each filesystem entry, typically a directory or file, has an inode that stores both meta-data, and points to the data blocks. A key difference and mindshift from a FAT design is that FAT has a dedicated allocation table with a 1:1 reference to data blocks -- whereas inode filesystems actually use two different data block types, the data blocks and the inode blocks that point to them. FAT uses a dedicated allocation table of all possible blocks that could be allocated, inodes do not -- in fact, some filesystems (that pre-allocate inodes) could "run out of inodes" when a filesystem contains lots of small files and there is not a 1:1 inode to data block (e.g., run "df" and "df -i" and note the actual data blocks and inodes used).
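The "df" versus "df -i" comparison can also be made from a script. The sketch below uses Python's os.statvfs (UNIX-only); the function name usage and the dictionary keys are my own choices for the example, and some filesystems that allocate inodes dynamically may report zero for the inode totals.

```python
import os

def usage(path="/"):
    """Report block and inode usage for the filesystem holding `path`."""
    st = os.statvfs(path)
    return {
        "blocks_total": st.f_blocks,   # what "df" counts
        "blocks_free": st.f_bfree,
        "inodes_total": st.f_files,    # what "df -i" counts
        "inodes_free": st.f_ffree,
    }

info = usage("/")
print(info)
```

A filesystem full of tiny files can show plenty of free blocks while inodes_free approaches zero, which is the "run out of inodes" case mentioned above.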
Pretty much every data block in an inode filesystem has an inode pointer using it, or reserving it (although designs differ), except the rare Superblocks. The Superblocks contain the core filesystem information (basic filesystem values, location of key inodes, free blocks, etc...) and take up only a few kilobytes (typically one data block; 4KiB is commonplace), and several redundant copies are spread all over the disk (typically at fixed locations that are easy for experienced administrators to find in case a filesystem can and should be mounted with an alternate superblock).
Which is better, FAT or inode? It depends. Taking the "extra features" that inodes offer over FAT (which we'll discuss later), reliability can be a pro/con thing.
A FAT filesystem makes it easy to check for free blocks. Inode filesystems are more arbitrary, and a filesystem consistency check is always recommended on a regular basis to ensure the number of free blocks, as well as lists of blocks that are available or have been freed, are consistent. A FAT filesystem also makes it much easier to check for cross-linked files, whereas all inodes need to be inspected to see if the same data blocks are referenced by different inodes (although there is an interesting bonus to this, as we'll discuss). Inode operations definitely make checks longer and more involved, although there are bonuses to this in consistency (which we'll discuss).
A big one to start is actually a surprise to many. Most Windows users assume that the "root inode" makes an inode filesystem more susceptible to corruption than a FAT filesystem, because it points to everything else. In recovery, it's actually the opposite: a major benefit. FAT filesystems separate the allocation entries (in the fixed FAT) from the directory references (in special data blocks) which means there are 2 different points where a failure could destroy the same data. Again, this is because the FAT design comes from MS-DOS 1.0 before directories were added in MS-DOS 2.1. Although NTFS improves somewhat on this separation, by allocating directories separate from files, it's still 2 points where either can cause the same, severe damage. Anyone who has had even a CHKDSK on a NTFS filesystem result in unknown "FILE####.CHK" files that no one knows about has experienced this issue first hand.
In an inode filesystem, if directory links are severed between the root inode, or any parent directory inodes below, at least the inodes below that inode are now their own tree. This is because inodes store both the directory tree and pointers to data blocks in one structure, the inode itself. If a filesystem integrity check results in a portion of the tree being "severed," that portion of the tree typically shows up as its own, self-contained tree under the typical "lost+found" directory -- names, directories, subdirectories, etc... intact. It all depends on the locality of the corruption or other fixed inconsistency, but if there were only a few points of actual "corruption," inode filesystems tend to be much easier to "piece back together" than FAT designs as a result of the "reference and allocation information as one" inode design.
As an additional benefit, the superblocks also keep a set of "reality check" values -- allocated data blocks, inodes used, inodes free (if pre-allocated), free data blocks, etc... These are regularly checked on boot against other values, and far more thoroughly interrogated during a full filesystem integrity check (fsck). In fact, the CHKDSK used for even NTFS is not nearly as interrogating as an inode filesystem fsck because, again, the separation of allocation from reference in a FAT design affords little in the way of allowing a good transposition. FAT is the FAT, and if things are corrupted in the FAT itself, the day gets really bad. Inode filesystems at least have their own, self-contained reference lists -- directories, subdirectories, files and pointers to data blocks.
Lastly, in FAT filesystems, two allocations that point to the same data block are a cross-linked file. In an inode filesystem, this is a "hard link." Although Microsoft has been trying to come up with hacks to implement something similar in NTFS, they are still not nearly as useful (and are very dangerous in NTFS). Although hard links sometimes introduce unforeseen issues in rare cases, they are typically very useful for many things.
NOTE: UNIX filesystems also define a meta-data file/directory reference that can cross filesystems as a symbolic link (symlink). Symlinks are far more useful than Windows .lnk files, and far, far more transparent as well. I will not go into the benefits and issues of hard links and symlinks, just know they work differently, and inode filesystems offer greater flexibility that results in their usage (as well as a history that accommodates their existence).
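For the curious, the difference between the two link types is easy to see from a script on any UNIX system. This is just a demonstration in a scratch directory using standard os calls; the file names are made up, and it assumes the temporary directory lives on a filesystem that supports both link types.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    original = os.path.join(d, "original.txt")
    hard = os.path.join(d, "hard.txt")
    soft = os.path.join(d, "soft.txt")
    with open(original, "w") as f:
        f.write("data\n")

    os.link(original, hard)      # second name for the same inode
    os.symlink(original, soft)   # separate inode that merely stores a path

    same_inode = os.stat(original).st_ino == os.stat(hard).st_ino
    link_count = os.stat(original).st_nlink
    soft_is_link = os.path.islink(soft)

    os.remove(original)          # drop one name; the inode and data remain
    with open(hard) as f:
        hard_readable = f.read() == "data\n"
    soft_dangling = not os.path.exists(soft)   # symlink now points at nothing

print(same_inode, link_count, soft_is_link, hard_readable, soft_dangling)
```

The hard link survives removal of the original name because both names reference the same inode, while the symlink only remembers a path that no longer exists.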
- UNIX Filesystem Hierarchy
The first thing many Windows users really "hate" about any UNIX/Linux system is the filesystem layout. The reality is that the filesystem hierarchy of UNIX and UNIX-like systems, although slightly varying between implementations, is far, far superior to the "free-form Windows" layout. After reading my initial "Windows Deprogramming," which basically shows Windows has NEVER had any notion of any layout strategy, combined with this discussion, which will quickly show UNIX always has, you'll come to appreciate UNIX and you should have an epiphany (if you haven't already).
In sticking with Linux, let's look at the Filesystem Hierarchy Standard (FHS), ignoring virtual filesystems like /dev, /proc, etc... that are not physical:
- System Directories Required for Boot/Maintenance
/bin
/boot
/etc
/lib
/sbin
- Temporary, Log, Variable Files
swap
/tmp
/var
- User/Service Programs
/usr
- User/Service Data
/home
/srv
The absolute hierarchy required to boot a Linux system into a maintenance mode is /bin (elementary programs), /boot (boot-time files), /etc (system configuration), /lib (kernel modules/drivers, core libraries) and /sbin (system/superuser programs). Pretty much everything needed is in these directories, and other than /etc, they are completely static in nature. So the root (/) filesystem contains these directories at a "bare minimum." Linux can and does mount this filesystem "read-only" at boot, so basic programs can be used to check the system (including fixing the root filesystem if need be) before anything else.
Now we get to those nasty temporary/log/variable filesystems. The dedicated swap (swapfile) filesystem is commonplace in any UNIX flavor, and Linux is no different (although you can use a "swap file," it is strongly discouraged and most installers do not even offer it as an option). /tmp (temporary) is the standard location in almost every UNIX flavor where almost all programs assume they can write (and is typically UNIX permissions 1777 -- all access with "sticky bit," so only the creator of a file/directory can modify/delete by default). UNIX programs have absolutely no concept of a "default" directory, and most will only use /tmp or the user's home directory when they need to write. /var (variable) is probably the biggest and most troubling filesystem of all, because that is where all log files, spool directories, and [temporary] user and service files (/srv is used in FHS 2.3+ for service data files) go. These filesystems are almost always separated out from all others because they are so heavily modified, often with small, temporary files.
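The 1777 permissions mentioned for /tmp can be checked from a script. The sketch below is illustrative only: describe_mode is a made-up helper name, and it exercises a scratch directory that it sets to 1777 itself, so the result does not depend on how any particular system's /tmp is configured.

```python
import os
import stat
import tempfile

def describe_mode(path):
    """Return (octal permission bits, sticky-bit flag) for a path."""
    mode = os.stat(path).st_mode
    return oct(stat.S_IMODE(mode)), bool(mode & stat.S_ISVTX)

with tempfile.TemporaryDirectory() as scratch:
    os.chmod(scratch, 0o1777)          # world-writable plus sticky bit
    result = describe_mode(scratch)
    print(result)                      # ('0o1777', True)
```

Calling describe_mode("/tmp") on a real system should typically give the same answer; if the sticky bit is missing there, any user can delete any other user's temporary files.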
Now we get to /usr, which has a wealth and, by far, the largest collection of files and space requirements of a standard Linux installation -- all static and largely unchanging (except for patches or additions) -- /usr/bin, /usr/lib, /usr/share, /usr/sbin, /usr/X11, etc... -- stuff that is not required to boot, but is used after boot.
Next we have the user/service data files -- traditionally /home (or subdirectories of home) and, now for service data files in FHS 2.3+, /srv (previously and traditionally different portions of /var, like /var/www, /var/lib, etc..., possibly /home/www, /home/lib, etc... before that). These are a mish-mash of small and large, dynamic and static files and directories depending on usage.
So, what we have are:
- The core, "avoid changing this stuff because we need to boot/fix things" set of files
- The temporary, "this stuff changes all the time and shouldn't mix with others" set of files
- The static, "the unchanging meat of the OS, only updated, patched and added" set of files
- The data, "changes for different types of usage, which can vary" set of files
This is UNIX at its finest, strict separation of boot, temporary, programs and data.
It makes it so damn easy to not only localize corruption, but inhibit fragmentation.
- Where UNIX goes even further
Now let's assume you are going to segment your UNIX filesystems into at least the four I listed above (which I'll detail further in the next section). What else does UNIX offer in this strategy that Windows does not when you do?
- Reserving usage
- Localizing security
- Localizing the unexpected
Most UNIX/Linux filesystems do many things to reserve usage of a filesystem. A big one is the common 2-10% (Ext2/3 use 5% by default) reservation of a filesystem. When a filesystem reaches 90-98% full (95% full on Ext2/3 by default), the kernel will prevent any further writing to the filesystem by anyone but root. Neither regular users nor most processes run as root, so the disk effectively stops allowing writes at 90-98%. At first this seems foolish and, in fact, many people complain about it, but it is there for one very big reason -- fragmentation.
Fragmentation increases sharply as a filesystem fills up. This reservation is a long-taught, long-learned lesson for UNIX/Linux administrators that should be respected. Anyone who has filled up a Windows server volume should appreciate this given how poorly a Windows server performs afterwards, which is the same problem a UNIX server would suffer if it allowed it too. But unlike Windows servers, almost all UNIX/Linux distributions and filesystems (with a few, notable exceptions) enact this reservation -- to combat the sudden and horrendous fragmentation that occurs as a filesystem becomes nearly full.
Regarding localization of security, most UNIX filesystems have extensive mount-time options, including the ability to prevent programs from executing or accessing specific capabilities. In addition, many well-designed UNIX applications default to not allowing access across filesystem boundaries (e.g., if /srv/www is a separate filesystem, the Apache web server will not allow access outside of /srv/www by default without a specific override). Segmentation and security go hand-in-hand when you wish to not allow one service to affect any other service, which is why many services (databases, web servers, print/spool servers, mail servers, etc...) use segmented filesystems for their user/service data and/or variable log/temporary files.
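To put numbers on the reservation, here is a minimal shell sketch of the arithmetic. The function name usable_mb is made up for illustration, the 8GB size is just an example, and the tune2fs line appears only as a comment with a placeholder device.

```shell
# usable_mb SIZE_MB RESERVED_PCT
# Prints how many MB non-root users can fill before the kernel
# refuses further writes (integer arithmetic, rounds down).
usable_mb() {
    size_mb=$1
    reserved_pct=$2
    echo $(( size_mb * (100 - reserved_pct) / 100 ))
}

# An 8GB filesystem with the Ext2/3 default reservation of 5%:
usable_mb 8192 5     # prints 7782

# The percentage itself is adjusted with tune2fs -m on a real
# filesystem (placeholder device, shown for reference only):
#   tune2fs -m 5 /dev/vg00/lv04
```

The remaining ~410MB in this example is what root still has to work with when a runaway process has filled the filesystem for everyone else.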
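As a rough illustration of checking those mount-time options, the sketch below reads lines in the /proc/mounts layout (device, mount point, type, options) and prints every mount point that lacks a given option. The function name missing_opt and the sample lines are made up for the example; noexec, nosuid and nodev are the standard mount options for restricting execution, setuid binaries and device nodes.

```shell
# missing_opt OPTION
# Reads /proc/mounts-style lines on stdin and prints each mount
# point whose comma-separated option list does not contain OPTION.
missing_opt() {
    want=$1
    while read -r _dev mnt _type opts _rest; do
        case ",$opts," in
            *",$want,"*) ;;          # option present, nothing to report
            *) echo "$mnt" ;;        # option missing
        esac
    done
}

# Sample input (hypothetical devices): /tmp is locked down, /var is not.
printf '%s\n' \
    '/dev/vg00/lv02 /tmp ext3 rw,noexec,nosuid,nodev 0 0' \
    '/dev/vg00/lv03 /var ext3 rw 0 0' | missing_opt noexec
# prints /var
```

On a live Linux system the same function could be fed from /proc/mounts directly (missing_opt noexec < /proc/mounts). Only segmented filesystems can carry different options, which is the point of the paragraph above.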
In continuing those compounding thoughts, I hinted at further reliability -- expect the unexpected -- localize for the unexpected.
The goal is keeping filesystems, especially the root (/) filesystem, consistent, largely unchanged, and reliable. It doesn't seem to "sink in" at first to most administrators, even some seasoned Linux administrators, but after years of UNIX/Linux exposure, you learn to quickly appreciate the existence of a "small" root (/) filesystem versus "one big" one. The common instinct is to move towards a "single C: drive" when coming from Windows, or after a UNIX/Linux administrator has one filesystem "fill up" on them, but understand that filesystem localization is the best advice I can give anyone.
If something is already "out of control" and going to fill up a single, segmented filesystem, giving it "more room" to go "out of control" is not only not solving the problem, but any time afforded in "staving off" the eventual "out of room" event is going to be offset by the additional files and mess created. Simply put, I have yet to see where not segmenting out /var was a good idea -- and have explicitly caught several people who created "one big root (/)" only to see a rogue process adversely affect other server data, or at least cost a good 2-20 hours of "clean up time" as a result (possibly affecting server performance too).
Lastly, as a UNIX administrator, you will find that the very nature of UNIX filesystems is (most often) the rule of conservatism. In other words, UNIX journal replays, filesystem integrity checks and other automated processes often don't want to automatically fix things if there is a chance that data will be lost -- quite the opposite of an NTFS journal replay that skips the time of a full CHKDSK, to its own demise, when it should have done one. In many cases, UNIX systems will require you to do a full fsck/repair on a filesystem off-line (non-mounted) before it will let you continue, which means the larger the filesystem, the more time it will take to do so.
By segmenting UNIX filesystems, you can not only reduce the time when a filesystem needs to be checked by making it smaller (because when a journal misplay occurs, or a full fsck is required, it is typically only one filesystem of the entire lot), but if it is going to be an extended check, you can bring up the rest of the system without that one filesystem. E.g., I never make just one data filesystem, I make at least two. That way, if a full fsck is required on one, I can bring the server up and let half my users work while the other half waits 15-60 minutes, instead of having all my users wait on the 30-120 minute fsck required on one, big data filesystem.
- "But I Just Gotta Defragment My UNIX/Linux Filesystem!"
Okay, some of you are just so programmed that even though you appreciate and even believe that UNIX/Linux filesystems need to be checked less, you just want to defrag for maximum performance. So what program do you use? Well, it depends on the filesystem. Instead of going into a huge HOWTO on every filesystem, I am going to cover this under the "best practices" for the two Linux filesystems I deploy. I would rather give correct info/recommendations for those two, based on my experience and in that context, than to try to tell you what to do for any, arbitrary setup.
My Professional Linux Practices (and Even More, Why?)
- Volume Management
Regardless of OS, I use Volume Management on the PC. Although some RISC/UNIX platforms have good disk labels (aka partition table formats) that are well-designed for their architectures, the massive issue with the PC's legacy BIOS/DOS disk label, with its primary and extended/logical slices (partitions), is that it is at the mercy of varying/conflicting disk geometry issues and has no means to store meta-data for volume information.
This means when I deploy Windows NT5+ (2000+), I configure a slice of type 42h for a Logical Disk Manager (LDM) Disk Label (aka "Dynamic Disk"). When I deploy Linux 2.4 or 2.6, I configure a slice of type 8Eh for Logical Volume Manager (LVM) -- with 2.6 using LVM2. I almost always do this regardless of whether or not I'm doing RAID, snapshots, etc... -- I do it for flexibility. I leave it up to individual sysadmins to decide for themselves, but I encourage you to learn LDM and LVM/LVM2 rather than avoid them, because there are sound reasons for using them.
In fact, the common physical volume (pv), volume group (vg), logical volume (lv) 3-level approach in Linux's LVM is basically ubiquitous across a host of UNIX flavors and their various platforms. Learning the elementary terminology, and how to do basic, harmless operations like allocating new space, is highly recommended. If you don't know your way around any UNIX LVM, then learn it so you are ready to deal with most implementations.
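For readers new to the pv/vg/lv layering, here is a small dry-run sketch that only prints the three Linux LVM commands in order rather than executing them. The helper names run and lvm_plan, the slice /dev/sda2, and the names vg00/lv04 are placeholders for illustration, not a recommendation for any particular system.

```shell
# Dry run: print each command instead of executing it.
# Nothing here touches a disk.
run() { echo "+ $*"; }

# lvm_plan PV VG LV SIZE
# Shows the three layers: physical volume -> volume group -> logical volume.
lvm_plan() {
    pv=$1; vg=$2; lv=$3; size=$4
    run pvcreate "$pv"                       # label the slice as a physical volume
    run vgcreate "$vg" "$pv"                 # pool it into a volume group
    run lvcreate -L "$size" -n "$lv" "$vg"   # carve a logical volume out of the pool
}

lvm_plan /dev/sda2 vg00 lv04 4G
# prints:
# + pvcreate /dev/sda2
# + vgcreate vg00 /dev/sda2
# + lvcreate -L 4G -n lv04 vg00
```

Reviewing the printed plan first is a harmless way to learn the terminology before running anything as root.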
- Segmented Filesystems
There is a lot of debate on this, but based on my previous comments, I will not change my mind. I do agree that creating too small of a filesystem is a bad thing, and I regularly run into it on existing systems that I wish I could easily change. In fact, prior to LVM in Linux, I adopted an "equal size" arrangement/approach that many have mirrored. In a nutshell, I never make any filesystem smaller than system memory, I make the essential filesystems of equal size, and I make any subsequent support filesystems a multiple of that essential size.
For example, I consider the "bare minimum" Linux filesystem segmentation to be:
/
swap
/tmp
/var
I absolutely and positively will not tolerate /tmp and /var on anything else. If I am absolutely hurting for space, I will symlink /tmp -> /var/tmp. I like to avoid putting even /tmp on the root filesystem. When implementing these "bare minimum" Linux filesystems, I assume /home will be mounted remotely. If not, then a separate /home filesystem is of great consideration, although there are one or two workarounds (see /usr/local below).
Regarding size, I typically try to stick with the maximum size of the largest common removable media for each. The reason why is because that's the size of a typical OS install (and the maximum I could expect all updates for an existing install to be), and typically a small multiple of the common system memory. About 5 years ago, this was a CD-R, so I typically used 0.5GB or 1GB. Now this is the DVD-R, so I have typically been using 4GB or 8GB. And I always make sure these sizes are never less than the amount of system memory -- so raise them if they are not.
Next comes the big enchilada, the /usr filesystem for static user/service binaries.
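The "never less than system memory" rule is simple enough to express as a one-line check. This is only a sketch of that rule; the function name essential_size_gb is made up, and whole gigabytes are assumed for simplicity.

```shell
# essential_size_gb MEDIA_GB MEM_GB
# Start from the removable-media size and raise it to the amount
# of system memory if memory is larger.
essential_size_gb() {
    media_gb=$1
    mem_gb=$2
    if [ "$mem_gb" -gt "$media_gb" ]; then
        echo "$mem_gb"
    else
        echo "$media_gb"
    fi
}

essential_size_gb 4 2     # prints 4  (DVD-R size already exceeds memory)
essential_size_gb 4 16    # prints 16 (raised to match system memory)
```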
/usr
In a system where my space is limited, I will do a smaller install and just use the root (/) filesystem to store /usr. But when I have space, I make /usr fairly large -- typically 4-8x the amount of space that the OS will put in /usr. These days, for a full Linux install, that is roughly 16GB or 32GB. Over the life of the system, with updates and even a number of distro-provided (or associated repository) packages added later, this is more than enough. Any other accommodation should be done with a separate filesystem (see /opt or /usr/local below).
Next we are left with the user and service data of the system. The size of this will vary, and they may or may not need to be separated out.
Workstation Optional (choose 1): /home, /export/local, /export/(systemname)
Workstation Optional: /srv
Server Standard (2+): /export/(systemname)(#)
Server Standard: /srv
For workstations, things vary.
Regarding local /home, if my users are going to mount data over a remote NFS mount and not require any local storage, then a /home filesystem is optional. If there will be no regular NFS mount, I will often create a /home directory. If NFS mounts to other systems will occur frequently and/or local storage is required, I actually like to create a /export/local. If I know local data will be regularly shared out via NFS or SMB on a specific LAN, I create /export/(systemname). In the case of the latter two, I will either symlink /export/* to /home/*, or NFS export /export/local or /export/(systemname) and locally mount into /home/local or /home/(systemname). In most cases, the (systemname):/export/local is possibly in the local network's NIS/LDAP automounter maps and done automagically anyway. It all depends.
Regarding /srv, if the system is a workstation/desktop, then there's probably little need for /srv, or it should be the same size as the essential filesystems (/, swap, /tmp, /var) described previously, just for storing a few services (like maybe a quick FTP-SSL or other data access option). In most cases, you can forget all about /srv on a workstation/desktop.
For servers, things change drastically.
If the system is a data file server, then I create at least 2 /export/(systemname)(suffix#) filesystems -- at least 2 for the reasons I previously explained -- for user data.
For service data, /srv is the new FHS 2.3+ location, although older systems might be /var/lib, /var/www, etc... The base /srv should be at least as big as the essential filesystems, if not as big as a data filesystem if the server is providing a lot of different services. If the server has a primary role as a mail, web, file and other service, then I like to separate those out for both localization and security reasons.
Server Optional: /srv/ftp, /srv/www, /srv/...
Furthermore, if the server is a mail, print or other spooling service, then additional /var/spool, /var/mail and other /var/* subdirectories should be created as appropriate:
Server Optional: /var/spool, /var/mail, /var/...
[ NOTE: You should consider an IMAP directory a "service data" directory and not a "service temporary/spool." IMAP directories, assuming mbox is used, are like a collection of large files that are running. This is impor
Far, far, FAR too often do I see Samba servers handling print operations rendered useless, bringing down the whole network, because the /var/spool directory was on root (/), or possibly on a small /var that is now holding up any other services using /var.
Lastly, we are left with the "Optional/Local" filesystems. These are for directory trees that are not part of the standard distribution packages or locations. They are commonly /usr/local and/or /opt, and have their own root-like structure underneath with bin, etc, lib, sbin, var, etc... -- especially for src (e.g., /usr/local/src). They should be used sparingly! But if there are a lot of customizations going on, they are a good idea -- and should probably be equal in size to at least /usr, possibly a data filesystem if need be. Unless a standard 3rd party app wants /opt all to itself, I symlink /opt to /usr/local. In a few cases, I keep both, and /usr/local is actually an NFS mount to a common, shared tree among systems of the same release/version (e.g., appserv:/usr/local.SunOS5.9 -> /usr/local on a Solaris 9 system).
Optional: /usr/local (commonly symlink /opt -> /usr/local)
The "essential" filesystems (typically 4-8GB today, 1-2GB bare minimum):
/, swap, /tmp, /var
The "binary" filesystem (typically 16-32GB today, 4-8GB bare minimum):
/usr
The "data" filesystem(s) (typically at least /usr, if not bigger):
/home, /export/local or /export/(systemname)[#]
The "service data" filesystem(s) (from as small as "essential" to as large as data):
/srv (workstation optional)
/srv[/ftp,/www,etc...] (application-specific services only)
The "service temporary" filesystem(s) (typically same as /usr):
/var[/lib,/mail,/spool, etc...] (spooling/some application-specific services only)
The "optional/local" filesystem:
/opt [ -> /usr/local ] (rarely its own filesystem, only when 3rd party dictates)
/usr/local (same as /usr, can host all added files in most cases, most 3rd party)
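As a quick way to see the "equal size plus multiples" arrangement, the sketch below prints sizes for the essential and binary filesystems from a single essential size. The function name print_layout is made up, and the 4x multiplier for /usr is simply my reading of the 4-8GB versus 16-32GB figures in the summary, so treat it as an assumption rather than a fixed rule.

```shell
# print_layout ESSENTIAL_GB
# Prints one "mountpoint<TAB>size" line per filesystem.
print_layout() {
    e=$1
    usr=$(( e * 4 ))          # assumption: /usr is 4x the essential size
    for fs in / swap /tmp /var; do
        printf '%s\t%sGB\n' "$fs" "$e"
    done
    printf '%s\t%sGB\n' /usr "$usr"
}

print_layout 4
```

With an essential size of 4GB this lists /, swap, /tmp and /var at 4GB each and /usr at 16GB. Data and service filesystems are left out because their sizes depend entirely on usage.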
- Why I Do Not Deploy ReiserFS and JFS
Let me first say that I highly respect every Linux filesystem development lead, from Steve Best (JFS), to Hans Reiser (ReiserFS) to Nathan Scott (XFS) to Stephen Tweedie (Ext3). Each is fairly good at explaining their focus, results and advantages without too much of the nonsense I see from most users engaged in "filesystem pissing contests." Over the years I've gotten a few facts wrong myself, but I've come to the same preferences over and over via my methodical usage and approaches.
It was because SuSE shipped ReiserFS as standard that I could not consider SuSE, and even representatives from SuSE recommended that I not explore ReiserFS. Why? Because I had an engineering network that used NFS, and there was no other network filesystem that could give the type of access and push the kind of data we needed. Nothing against Hans Reiser and ReiserFS, and I've actually only been impressed by his ideas and implementation -- including the meta-data journaling approach, other innovative features of ReiserFS 3, and the work on ReiserFS 4.
ReiserFS continues to build a revolutionary filesystem that lacks the traditional UNIX inode layout and interfaces, which is why ReiserFS lacks a lot of kernel feature compatibility; the Linux Virtual Filesystem (VFS) layer cannot abstract all of these features for a filesystem that just isn't of the same, traditional design. This prevents me from using ReiserFS. As an additional consideration, by his own admission, Hans Reiser has stated that filesystems should be redesigned every 5 years. As much as I've seen ReiserFS handle dynamic changes without incident, as much as I've never seen ReiserFS make a journal misplay, the fact remains that with a continually fluid design, or significant changes on a regular basis, the off-line tools continue to lag the on-line kernel implementation. So while I might be okay as long as a ReiserFS filesystem is matched against the proper kernel, the second ReiserFS properly does not trust its journal replay, I'm at the mercy of the off-line tools. And so far, I've had horrendous luck when that happens.
In OS/2 Warp, IBM began a new, revolutionary filesystem design. Being that Microsoft no longer had access to IBM's technologies and code (legally since 1993, after 1993 is a long story), a radical replacement for HPFS was devised. The result was the IBM Journaling Filesystem (JFS) and it was extremely innovative. IBM spent the next few years porting JFS to its AIX UNIX operating system, adding all the traditional inode structures and kernel filesystem support necessary and expected by a set of standard UNIX interfaces. By 2001, the job was completed and JFS2 was born.
Naturally the JFS2 port would have and should have been the foundation for the Linux port -- even if it started in 1999 before completion on AIX. But as contracts would have it, IBM had a Non-Compete Clause in their Project Monterey (64-bit UNIX) agreement with SCO which prevented IBM from porting code from AIX (Monterey for IBM Power is known as AIX 5L), so IBM ported JFS from OS/2. This meant that IBM had to re-create, "clean room," all those interfaces they had spent 4 years doing for AIX. As such, by 2001, when JFS was considered "production quality" for Linux, it lacked almost all major feature support for Linux -- quotas, NFS, etc...
To this day, JFS still suffers from some compatibility issues with standard Linux VFS features, although its fairly static design and traditional layout does make it more compatible than most ReiserFS developments. In any case, it has been a non-consideration for myself, even if others have deployed it to great success.
- My Experience with Ext3 and XFS
I adopted Ext3 in early 2000 for kernel 2.2 when it was still only the "[full data] journaling" mode. It was little more than a simple "double-buffer" commit. It was easily converted to Ext2, as well as back, and it did the job to drastically reduce fsck times. Probably the biggest sell for Ext3 was the ability to drop into a full fsck when necessary -- something that saved me dearly when a physical disk error occurred (and my RAID card firmware and driver were not compatible -- long story). To use a trusted fsck of 10 years on a filesystem whose structure had not changed in the same period of time was convincing enough.
Since then, I have only trusted my "essential" filesystems to Ext3 without reservation. I have never lost an Ext3 filesystem, and I have had no unexpected data loss with either "journal" or "ordered writes" mode. I purposely avoid "write back" mode due to its inherent risk of affecting files that are not being modified. With newer directory indexing features, I find the performance of Ext3 to be more than adequate for filesystems under 100GB. It should be noted that I purposely avoid using Ext3 on filesystems greater than 1TB (even though newer versions support up to 8.8TB/8TiB).
The Ext3 base feature set -- full NFS compatibility plus most other standard Linux features in mid-to-late 2.4 (quotas, POSIX EAs/ACLs, etc...) -- was sufficient for most operations, especially in the early days of Ext3 back in kernel 2.2.
Unlike JFS, XFS was a direct port from Irix to Linux. Unlike any other filesystem, XFS brought a lot of hefty requirements that prevented it from being in the stock kernel. The good news is these capabilities were ported into kernel 2.5, and most other filesystems now benefit. Other than some paging features tied to Irix that had to be written for Linux, XFS was a clean implementation on Linux. And that included the wealth of features that were standard in XFS. This included full extended attributes (EA) in the inode itself, a feature still lacking from most other Linux filesystems (let alone most other UNIX filesystems) that hack it on with a hidden file. And best of all, like Ext2/3, the structure had remained unchanged from its traditional UNIX design since the mid-'90s, despite all its advanced features. There was even Linux quota support for XFS before Ext3, while NFS compatibility is just as good (among other standard Linux kernel features).
XFS uses both extents (which JFS also does) and delayed allocation (which ReiserFS 4 also does) to combat fragmentation. This makes XFS ideal for filesystems where files, both large and small, could be written. In traditional filesystems, the combination of lots of large and small files causes all sorts of allocation issues that typically increase fragmentation. Delayed allocation helps pack smaller files better, but cannot do the same for large files. Extents help separate small and large files into their own allocation areas of the disk, but small files are not always packed well. Only the combination of both delayed allocation and a proven extents strategy -- which XFS was designed for and implemented on Irix from day 1 in the mid-'90s, now ported 100% to Linux -- gives the best of both worlds. Now there are limitations to the combination of delayed allocation and extents. Most of it has to do with its additional overhead, which will be covered later with regards to fragmentation.
But the major, key differentiation of XFS is that it is built upon its existing, proven, stable structure on Irix. That included the full suite of off-line tools with 5+ years of deployment -- xfs_repair, xfsdump/xfsrestore, xfs_growfs, etc... The off-line repair tool was very trusted. The dump/restore not only benefited from the native storage of any EA/ACL info directly in the inode**, but it could be safely run against a mounted XFS filesystem and did not require a snapshot or other volume management "freeze" (unlike Ext3). It already had the ability to be grown, managed, reorganized (defragmented), etc... with an existing suite of tools, not ones that were merely being promised for future development.
[ **PROFESSIONAL NOTE: Ironically enough, XFS was ready for SELinux before Ext3 (its XATTR format is a fully supported XFS inode EA type), which raises the question of what Red Hat is waiting for?!?!?! XFS is a perfect complement to Ext3 to address its deficiencies in data and larger filesystem deployments. ]
For data filesystems, I was sold on XFS and started using it immediately. I tested XFS for other filesystems as well, but quickly stopped considering it after the performance of "temporary" filesystems proved less than optimal and I had two /var filesystems get hit by the XFS 1.0 bug. The bug was an oversight in the design of the one additional requirement for the Linux port, the paging facility that was previously tied to Irix -- something that has been long fixed and is now trusted (especially in 2.6 where the paging facilities are part of the stock kernel code).
- Specific Practices for Ext3 and XFS
Now even though Ext3 and XFS work quite well for myself "out-of-the-box" (with a few distro exceptions/workarounds), there are still some specific practices and recommendations I utilize for each.
Ext3 gets the call for all "essential" filesystems -- /, /tmp, /var. Its static nature means I can read it with almost any boot disk (although I try to stay with the distro's recovery CD/mode). I also use it for all temporary filesystems, including mail, spool and most service directories that are 32GB or smaller.
The only issue of major concern with Ext3 is the pre-allocation of inodes. The ratio of blocks to inodes is typically 8-16 or so (one inode for every 32-64KB on the typical filesystem with 4KB blocks). On the /var filesystem, or another temporary filesystem with lots of small files -- possibly a mail or news spooler (although not nearly as much in the case of mail these last few years with MS-TNEF flying around ;-), this is not ideal. It is very often the case that a "df -i" will show proportionally twice as many inodes used as blocks -- although newer logging defaults in most distributions/services are not nearly as bad as of late. So using the "-i" or "-T" option to "mke2fs" is recommended when creating Ext3 /var and /var/spool filesystems. E.g. (1:1 inode-to-data block assuming a default data block is 4KB):
# mke2fs -i 4096 -j -L var /dev/vg00/lv04
# mke2fs -j -L var -T news /dev/vg00/lv04
See "man 8 mke2fs" for more information.
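To get a feel for what the bytes-per-inode setting means in practice, here is a small arithmetic sketch. The function names are made up, sizes are taken in MB and KB to keep the integers small, and the 128-byte on-disk inode size is my assumption about the traditional Ext2/3 default, so adjust it if your filesystem differs.

```shell
# inode_count SIZE_MB KB_PER_INODE
# Number of inodes pre-allocated for a filesystem of SIZE_MB when
# one inode is created per KB_PER_INODE kilobytes of space.
inode_count() {
    echo $(( $1 * 1024 / $2 ))
}

# inode_table_mb INODES
# Space consumed by the inode tables, assuming 128-byte inodes.
inode_table_mb() {
    echo $(( $1 * 128 / 1048576 ))
}

inode_count 4096 32       # 4GB, one inode per 32KB -> prints 131072
inode_count 4096 4        # 4GB, one inode per 4KB  -> prints 1048576
inode_table_mb 1048576    # cost of the 1:1 layout  -> prints 128
```

Under these assumptions the 1:1 layout on a 4GB /var spends about 128MB on inode tables instead of 16MB, which is a small price compared to running out of inodes while blocks are still free.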
XFS, on the other hand, dynamically allocates inodes (just like JFS and ReiserFS) so the number of inodes is not an issue. Furthermore, XFS uses advanced packing techniques so data can be stored directly in its own inode (instead of using a data block) when small enough, as well as other usage reduction approaches (the most of any Linux journaling filesystem design).
However, I typically deploy XFS on user data filesystems, and the rare, large service directory (e.g., database, IMAP spool, etc...). On user data filesystems, I typically wish to take full advantage of Extended Attributes (EAs) like Quotas, ACLs, SELinux, etc... support. The default 256 byte size of an XFS inode is not ideally suited for storing POSIX ACLs, as less than 64 bytes are typically left for EAs. Should an inode need more space, a full data block (typically 4KB) would be allocated, which is not always ideal, plus it means not all of the meta-data is stored in a single inode. So when using ACLs and/or SELinux, increasing the inode size in XFS to 512 bytes (possibly 1024 bytes when using both heavily) is recommended, at only a small disk penalty overall (a tad more noticeable with 1024 bytes). The option to use a larger inode size when creating an XFS filesystem is "-i size=value" such as follows:
# mkfs.xfs -i size=512 -L engr_unclass /dev/vg01/lv01
# mkfs.xfs -i size=1024 -L engr_secret /dev/vg02/lv01
I absolutely cannot live without xfsdump when it comes to filesystem backup. Instead of having to deal with various backups of hidden file ACL and SELinux information in other filesystems, that information comes over in the inode itself during an xfsdump (again, Red Hat why don't you support XFS for these key enterprise features, removing the need for further hacks to/for Ext3 once and for all?!?!?!). And it was designed to be run on a mounted XFS filesystem, taking away the need to make a snapshot or other freeze-in-time/off-line-equivalent technique (other than for databases -- which is another, non-filesystem related issue). The existence of xfs_copy is also a nice off-shoot utility for quick cloning of an existing XFS filesystem that might not be the same size (unlike dd), without losing all of the ACL and SELinux information in the inode meta-data (that would not be preserved with a find/cpio, tar, etc... command). I've done one xfs_growfs without incident atop of a logical volume -- pretty quick and straight-forward -- all while the filesystem was mounted too. ;->
- Defragmenting Ext3 and/or XFS
Defragmenting Ext2/3 has basically one rule, don't do it. Although the [e2]defrag utility exists, it always seems to lag the Ext2/3 developments, severely. For the most part, it typically can't hurt to try [e2]defrag -- if an attribute is detected that it doesn't support (such as journaling -- requiring Ext3 to be converted down to Ext2), then it will fail to run. Some guides suggest disabling attributes with "tune2fs" until it runs, but that is a huge mistake -- those attributes are set for a reason.
[e2]defrag must be run on an off-line filesystem. But at this time, I cannot recommend it. I typically limit my usage of Ext2/3 filesystems to filesystems under 100GB. Although I use them for temporary filesystems which fragment heavily (e.g., /tmp, /var, /var/spool/mail, etc...), I localize that fragmentation by appropriately segmenting those filesystems. That seems to limit the degradation.
XFS was designed off-the-bat to completely eliminate fragmentation. The combination of extents -- by which small files and large files are allocated in completely different areas of the filesystem to prevent packing issues -- along with delayed allocation -- which ensures small files are packed well and not merely allocated "first free block" -- prevents nearly all fragmentation. While some people advocate XFS because it does avoid such fragmentation, it should never be used for filesystems with lots of small files -- especially not lots of small, temporary files with lots of writes. In those cases, the overhead of extents and delayed allocation completely negates the benefits of reduced fragmentation.
In other words, it's probably better to segment /tmp and /var out as separate Ext3 filesystems, which do quick indexing and writes/deletes (even if with more fragmentation), than to bog down the root (/) filesystem with the overhead of writes/deletes to /tmp and /var subdirectories on XFS. Especially since there can be boot-time considerations with XFS (the filesystem does not offer a "bootstrap" at the beginning of a slice/partition -- so boot must be in the MBR), and it's always good to not put root (/) at the mercy of continuous writes/deletes on /tmp and /var files anyway. In general, every attempt of mine to use XFS for /tmp, /var or other temporary filesystem with heavy small file writes/deletions has been less than ideal compared to Ext3.
With that said, SGI did come out with a filesystem reorganizer (xfs_fsr) tool after a few, rare applications did show significant fragmentation over time (such as large files that grow regularly). Like nearly all of XFS' toolsuite, the filesystem reorganizer works directly with XFS' journaled implementation on-line while the filesystem is mounted. By default, the reorganizer runs for 7200 seconds (2 hours), user settable with the "-t" option. It makes as many passes as the time allows, with each pass attempting to reorganize the 10% worst fragmented files on each filesystem. With no options, it attempts to run on all mounted XFS filesystems (i.e., /etc/mtab), although a filesystem specific list can be passed. Options also exist to pick up where a previous run left off.
For more information on the XFS filesystem reorganizer (xfs_fsr), see "man 8 xfs_fsr".
Addendum: XFS List Comments/Clarifications
Some comments came up from people on the SGI Linux XFS list about issues with XFS, so I wanted to repost what I posted there that addressed my XFS roll-outs. In most cases, the issues are race conditions that have little to do with XFS and also affect Ext3, but are often based on the backport of XFS to 2.4 in the stock kernel (and not SGI's releases for Red Hat Linux 7). I could also go into many details on the layers upon layers of storage/filesystem that are quickly getting out of control (and I would argue are based on poor beliefs/limited deployments of "good" hardware RAID).
1. Kernel 2.4
I have _never_ used the XFS backport to kernel 2.4. Frankly, I don't trust it. Not because of XFS, but because of kernel 2.4, and because it doesn't come directly from SGI, tested and blessed.
2. Kernel 2.6
With Fedora Core 3, I started deploying XFS on kernel 2.6, but I don't put my faith in it yet with 4K stacks. I was very disappointed when Red Hat forked Red Hat Enterprise Linux 4 development and did not bring XFS over. I think it was a huge mistake to not put in the effort to see XFS ready for 4K stacks (NOTE: 4K stacks are something I do agree with Red Hat on doing). Red Hat could offer a lot to XFS if they had to maintain it equally with Ext3 under RHEL. Again, I will assert it is in their best interest to do so. With Fedora Core 3 I have quotas, NFS, ACLs and, now, SELinux, but it is not as tested and proven as my old XFS 1.2 deployments on Red Hat Linux 7.x (and I assume I'll start running into stock kernel implementations soon enough).
It should also be noted that CentOS (a 1:1 rebuild of RHEL from SRPM) also offers XFS in its CentOS Plus (packages that are different than stock RHEL) kernels. The CentOS 4 Plus kernel basically seems to be the same as Fedora Core 3, XFS from the stock 2.6 kernel implementation. While I trust it more than the 2.4 backport, even though it's now in the later, stock 2.4 kernels, I still can't trust it as much as the prior, official SGI XFS releases that had their blessings on Red Hat Linux 7.x and Red Hat Linux 9.
3. LVM/MD Usage
I limit my use of LVM to volume slicing. Let me start by saying that I'm a huge fan of volume management. I use both LVM and LVM2 for flexible, on-line additions/modifications of logical volumes. In a nutshell, I largely use it to slice my disks with more flexibility -- reserving space, creating new volumes as necessary and the occasional expansion (although I typically try to stick to new mounts/symlinks).
But with that said, let it be known that I don't trust LVM, and especially not LVM2, with snapshots, more complex resizing and definitely not any RAID operations. I do not trust DeviceMapper (DM) with either LVM2 or EVMS right now. Why? All I keep reading about is race condition after race condition after race condition. And in each case, it's not limited to XFS.
First off, I've limited myself to only 3Ware and select LSI Logic (including former Mylex) products over the last 5 years. 3Ware uses an ASIC-driven "storage switch," and I have only deployed LSI Logic (and former Mylex) products that are XScale-based (XScale itself being based on StrongARM). These are very, very high performing -- able to move a lot of data not only with little CPU overhead but, more importantly, without the extensive use and duplication of data streams through the CPU-memory interconnect. I.e., it's not the XORs that get you, but the duplicated data streams tying up the interconnect that data services could be using. It's the same reason why hardware switches/routers are better networking equipment than PCs -- these "storage switch - I/O processors" are the same. Their on-board RAID intelligence is self-contained, meaning their drivers are simple, GPL block drivers. Even Intel is moving to put its XScale I/O Processors (IOP) directly on Xeon mainboards, possibly in the I/O Controller Hub (ICH), to off-load today's network/storage operations (RAID, layer 2/3/4 frames/packets/transports, iSCSI overhead, etc...) from the CPU-memory interconnect, which is not designed for them (and where they only unnecessarily duplicate data streams, taking time away from actual data processing).
Secondly, I've also had excellent "forward product" volume compatibility -- especially with 3Ware, across 3+ generations over 5 years, with full support moving from older to newer products. That is far, far better and longer than MD (let alone LVM/LVM2). And many people have never seen 3Ware's 3DM/3DM2 tools for administration and monitoring; they are much easier to deploy and have saved my butt in several cases. LSI's tools are getting better too. So it is this abstraction of RAID into hardware that removes the multiple layers that often cause the "race conditions" between LVM-MD and other kernel-level operations. This is not just an issue for XFS, nor just an issue for Ext3 and Linux in general, but for many other OSes as well. Which is why I have been deploying XFS for a long time, provided I "do my homework," alongside Ext3. All the issues I've heard about off-list have surrounded configurations that are an issue with Ext3 as well -- not limited to XFS at all.
4. RHEL 64-bit with 4+GiB mem and 4+TB disk calls for XFS
I'm starting to see the potential for some system integration projects that will involve data volumes of 4+TB. In all my recent Opteron 2xx/8xx integration projects, I have put in no less than 1GiB DIMMs per DDR channel, which means a minimum of 4GiB for Opteron 2xx (4 DDR channels) at a premium of only $100-150 over 2GiB. Opteron is the commodity 2-4 way server solution for just about everything now. And while I know the 4+TB data volume on a file server is not a staple for Red Hat, who is catering to grid computing clusters, web servers or possibly Oracle SQL databases using "raw" slices (instead of filesystems), they are still the "flagship" carrier of any distribution when it comes to file services with NFS (or NFS+SMB) in my book.
From all I've read, x86-64 in PAE mode using 52-bit register (48-bit "Long Mode") is 4K pages, although I do note 2M and 4M pages as well. Again, I'll agree with Red Hat that 4K stacks are probably the correct move for x86/x86-64 in the VM. All the arguments I've seen that attempt to discredit it are at odds not only with what I've seen on Red Hat's plate, but also with the actual work Red Hat has put forth on 4K stacks (for the future benefit of all). So I won't even touch that issue. My argument is that Red Hat needs to add XFS to its plate for RHEL 5, including ensuring reliability on 4K stack kernels.
I don't see how Red Hat can offer a solution that scales to these data volume needs if it continues to offer only Ext3 -- let alone the continued lack of features. It's almost like Red Hat is two-faced: it (appropriately) discredits ReiserFS and JFS for lacking both standard kernel interfaces and user-space support, then turns around and acts as if XFS offers nothing that Ext3 does not. In fact, many Red Hat developers flat out state it. Beyond just the big scalability difference, I don't know how Red Hat can push SELinux and other filesystem extended attributes (EAs) when they don't offer a way to back them up on Ext3 -- while XFS does! And that's before we even touch the fact that XFS can do _live_ operations for dumping, copying, defragmentation (file reorganizing), etc...
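To make the extended attribute point concrete, here is a minimal Python sketch of the extra per-file state a backup tool has to capture beyond file contents and permissions. It is only an illustration, not how xfsdump or any Red Hat tool works. The helper name read_xattrs is hypothetical; os.listxattr and os.getxattr are from the Python standard library and exist only on Linux. On an SELinux system, the label normally shows up under the security.selinux attribute.

```python
import os
import tempfile


def read_xattrs(path):
    """Return {name: value} for every extended attribute on path.

    Hypothetical helper for illustration. Returns an empty dict when the
    platform or filesystem does not support extended attributes, or when
    the path cannot be read.
    """
    if not hasattr(os, "listxattr"):  # the os.*xattr calls are Linux-only
        return {}
    attrs = {}
    try:
        names = os.listxattr(path, follow_symlinks=False)
    except OSError:
        return {}
    for name in names:
        try:
            attrs[name] = os.getxattr(path, name, follow_symlinks=False)
        except OSError:
            # Attribute vanished or is not readable by this user; skip it.
            continue
    return attrs


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        try:
            # user.* attributes need filesystem support; ignore failure.
            os.setxattr(tmp.name, "user.comment", b"example")
        except (AttributeError, OSError):
            pass
        for name, value in read_xattrs(tmp.name).items():
            print(name, value)
```

A backup that copies only data, ownership and mode bits silently drops everything this function returns, which is exactly the problem for SELinux labels and ACLs.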
Red Hat can choose to ignore us system integrators and lose a lot of business. In fact, I'm really getting to the point where I'm half-way serious about getting some investors to build a new enterprise distribution and offer Service Level Agreements (SLAs). The distribution would always be based on a fork of the 2nd or 3rd Fedora Core release -- as I believe very, very strongly in the 1-2-3 x 6-month (although it's turning more into the 1-2 x 9-month) release model that Red Hat has followed over 15 releases since Red Hat Linux 4.0, which results in the "best balance of feature adoption v. stability" by the 3rd release. I typically do agree with Red Hat's design decisions at the core -- but not the end-focus of Red Hat Enterprise Linux as of late for companies more than willing to pay $3,000 for Advanced Server.
In fact, I'm in 100% agreement with Sun's technical analysis when they say that Red Hat is not addressing the storage/filesystem aspects (among other things) -- especially the layers upon layers of LVM, LVM2-DM, MD, etc... God knows SuSE is not addressing them by supporting ReiserFS (which has always made SuSE a non-consideration for traditional UNIX shops with large data warehousing, NFS services, etc...), because only XFS can offer the same features and compatibility that you'd get out of a traditional UNIX platform (before you disagree, please read up on what ReiserFS has issues with -- traditional UNIX interface compatibility, off-line tools/support, etc... are very important). So if I had to implement a commodity 2 or 4-way Opteron 2xx/8xx solution today with multi-TB volumes, I would go Solaris 10, not RHEL 4. If Red Hat decides to put forth the effort on XFS for RHEL 5, then I would most likely change that recommendation (and very much want to do so).
19 comments:
Hi Bryan and thanks for a brill article, my Linux learning has moved many inodes ;-)
Excellent article, very informative.
Thanks a lot,
Regards,
Sean
Hey Brian, love the article. As someone who's seen NTFS go belly up more than once, it's nice to understand some of the reasons why. Bleh. Hope to see you at one of the LUG meetings this year.
Bryan,
Congratulations, great article, almost 5 years later, and it is still true to every word.
And RedHat didn't ship XFS with RHEL 5.3 yet! O_O