2005-12-28

Windows XP Embedded Gotchas

With virtually no help whatsoever from either the Microsoft Developer Network (MSDN) or Microsoft's microsoft.public.windowsxp.embedded newsgroup, I have finally tamed Windows XP Embedded (XPE) with Service Pack 2 (SP2).

Some potential gotchas and how to deal with them ...

- Throw the Embedded Studio Help out the (pun)Windows(pun)

Seriously. Do not follow the step-by-step help in building your first XP Embedded target in Microsoft Windows Embedded Studio from its included help files. They are full of incorrect approaches and steps -- things that will cause you days of grief. We will cover several.

Unfortunately, the MSDN pages mirror much of the Embedded Studio help files. The more 3rd party/independent pages you can find for help, the better. The newsgroup is of limited help, but I have to admit I haven't been around it long enough. I did note, though, several people openly complaining that Microsoft can't get its act together in comparison to Linux when it comes to embedded.

- TAP a "pure" XPProSP2 installation

Do not run the 16-bit version of Target Analyzer (TA.EXE) under DOS. Do not run the 32-bit version of Target Analyzer (TAP.EXE) under a Windows PE (or BartPE) boot. Do not run TAP on an XP Home or Pro installation that has been updated with drivers or other patches. Ignore just about every clueless MSDN or other site suggestion I've seen (which is many, many, many).

Install Windows XP Professional with Service Pack 2 (XPProSP2) on the system, and do not update any drivers. Leave them as-is. That is the absolute best way to generate the most "pure" Devices.PMQ file for use in the next recommendation. I have had nothing but issues with TAP when run under a Windows PE environment (let alone TA in a DOS one).

Yes, still do this even if you have 3, 4, 5+ "Unknown Devices" under Device Manager in XPProSP2, or if you normally update drivers. As long as the XPProSP2 installation can boot on its own (meaning the storage driver and other core components work), this is the "pure" Devices.PMQ file you want.

Do not worry about having all or the latest drivers installed in XPProSP2 for Target Analyzer. The PCI vendor/ID information will result in updated drivers in the Component Database being used when you build a target. In fact, I've found that sometimes updating the driver in XPProSP2 (e.g., the ViA Ultra ATA Bus Master driver) and then seemingly importing its "equivalent" SLD component into the Component Database (e.g., from ViAEmbedded.COM) does not work as it should. It's far better to stick with the "older" driver under XPProSP2 when TAP runs, than what you have updated in your Component Database.

Regarding missing drivers and, therefore, devices missed by TAP that don't make it into your Devices.PMQ and, subsequently, your SLD file: just manually add them to your target. The First Boot Agent (FBA) will automatically configure any missing devices as long as their components were included in the target. So just because they weren't set up under XPProSP2 when TAP was run doesn't mean they won't be available in XPE -- because that's the job of the FBA, to find all hardware and set up such devices.

To summarize, the focus of the Devices.PMQ is to get a full set of "pure" components that will let XPE boot. Running the 32-bit version of Target Analyzer (TAP) under a "pure" (no added drivers/software) Windows XP Professional with Service Pack 2 (XPProSP2) is the best recommendation I can make given 50+ hours of trial'n error with 3 different SBC/Mini-ITX systems.

- Build a SLD for the "pure" Devices.PMQ

One of the stupidest things that the Embedded Studio help has you do is import your Devices.PMQ file (from your TAP run) directly into Target Designer. Never, ever do this! You will almost always get an incomplete, unbootable XPE image build. Always work with Source Level Definition (SLD) aka "component" files.

That means you want to launch Component Designer and import your Devices.PMQ under it as a new .SLD "component" file. Of course this means you have to then import it into your Component Database via Component Database Manager, but that's a small, extra step.

I'm no expert on XPE yet, but I think I know why importing a Devices.PMQ directly into Target Designer fails to build a bootable image. I got STOP 0x0000007B error after STOP 0x0000007B error within seconds of first boot (well before the First Boot Agent, FBA, could run). I think it's because Target Designer makes too many assumptions about which dependencies are already resolved from the Devices.PMQ. Whereas if I add a .SLD "component", it seems to add/suggest all the necessary components during the dependency check.

Besides, making a .SLD "component" file for your board from the TAP run (and its Devices.PMQ) is much "cleaner" anyway.

- "Fixed Disk" CompactFlash boot requirement

This is one of the most grossly overlooked aspects of XP Embedded, and virtually ignored in Microsoft's own documentation. Unlike DOS (including Windows 9x) and Linux, the NT loader (NTLDR) has some very picky requirements on CompactFlash boot.

First off, the more well-known fact is the reality that the CompactFlash must be connected to an ATA channel. I knew this from the get-go, and made sure I had SBC/Mini-ITX systems with either ATA-to-CompactFlash logic on-PCB, or an external ATA-to-CompactFlash adapter.

Secondly, and far less well understood, is the fact that the CompactFlash must appear as a "fixed disk" with a "partition" and not as a "removable" device. I use the terms "partition" and "removable" explicitly as they would appear under NT5 (2000/XP)'s DiskPart utility. Boot Windows XP, Windows PE or BartPE and run DiskPart to see the format type.

There is a lot of commentary out there that an ATA-to-CompactFlash adapter removes this requirement. This is completely false. It's not a geometry, ATA format or other such issue -- it's the way the CompactFlash device itself presents. DOS and Linux have no issue with this. But for Windows XP, the NTLDR absolutely does not want to boot from anything that identifies itself as a "removable" device.
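
One place where a device declares itself "removable" or "fixed" is word 0 of the ATA IDENTIFY DEVICE data, where bit 7 is the removable-media flag. The sketch below only decodes that bit. The function name is made up for illustration, the two sample values are the ones the CompactFlash specification is understood to use (0x848A as the default CompactFlash signature, 0x044A as a fixed-disk alternative), and it is an assumption that this bit is what XP keys on. Check your own card's data before relying on it.

```python
# Bit 7 of IDENTIFY DEVICE word 0 is the ATA "removable media" flag.
REMOVABLE_MEDIA_BIT = 0x0080

def reports_removable(identify_word0: int) -> bool:
    """Return True if IDENTIFY word 0 has the removable-media bit set."""
    return bool(identify_word0 & REMOVABLE_MEDIA_BIT)

# Sample values (assumed from the CompactFlash specification; verify for your card).
for word0 in (0x848A, 0x044A):
    kind = "removable" if reports_removable(word0) else "fixed disk"
    print(f"word 0 = 0x{word0:04X} -> {kind}")
```

A "-F" style industrial card would be expected to land in the "fixed disk" branch, which is the case DiskPart reports as a fixed disk.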

The reasons for this stem from countless, legacy NT design flaws and other ways to prevent (or at least inhibit) copying of an NT system that still plague NT5+ (2000/XP). I won't go into the extensive technical reasons. But even if you use an alternative master boot record (MBR) and even some disk translating bootstrap, there appears to be no way around the issue.

In a nutshell, the "fixed disk" format is incompatible with "normal consumer usage," so you cannot find such cards anywhere but from a few CompactFlash vendors as a specialized OEM/Industrial part. And even then, it's typically an added P/N for the "fixed" configuration. E.g., SimpleTech's part numbers are typically of the form SLCFxxxJ-F (the "-F" suffix meaning "fixed" configuration).

The good news is that these devices typically come from the factory with the bootstrap already set up for NTLDR. So it's just a matter of copying over the file tree. No need to run BOOTPREP or any other utility. I mean, and side note/rant here, isn't it ironic that you have to use a set of 16-bit DOS utilities like BOOTPREP, which is always an issue when supporting a 32-bit OS like NT that may differ in geometry, hardware access, etc...?

2005-12-17

Note-To-Self: Check Every Part For Failure

I spent a good 2 hours last night, and then I spent another 2 hours again today, debugging a new PC assembly. It wasn't the first PC I assembled this week, and I used the exact same parts as the earlier unit (ASRock GF6100+nF410 S754, S754 Sempron 64 2800+). But every time I powered on the system, it just hung at the POST (power on self test) after announcing memory.

I scratched my head. I unplugged what I thought was everything -- floppy, DVD, header cables, etc... I eventually even re-cannibalized the first assembly to test with a different memory and power supply -- the two biggest offenders. I even threw a 500W ATX 2.0 PS at it. Nothing worked. So then I cannibalized the first system's CPU. Still no go. I literally tried every combination I could think of. So I finally concluded it had to be the mainboard.

Well, I went out to a local PC shop, something I don't like to do given the prices versus mail order. But this was a gift that I needed to finish by tonight, as I'm going to be out-of-town starting the 21st. Luckily I found a really good deal on a S939 Athlon 64 3000+, and a so-so price on a matching Gigabyte S939 GF6100+nF430 mainboard.

So with a completely different mainboard+CPU combination, I reassembled and still got the exact same result! What was hanging the system at POST? I mean, it powered up! It started everything. But just hung there -- just like before! What the devil was wrong?

Well, I really looked at the board, and started removing every single header I could find -- even LEDs. Bam! It fully POST'd and responded to the keyboard. So then I went back, one by one, and finally found the culprit.

The simple USB 2.0 header from a Mitsumi F404 Floppy+8-in-1 card reader. A damn $20 part caused me to run out and buy another mainboard+CPU on-the-spot. Damn me for not considering that little USB header -- I just never realized it could totally halt the POST.

So I cannibalized yet another system where I had a Mitsumi F404 in use and replaced it. Sure enough, no more problems. Damn, I had avoided yanking any cables on the Mitsumi F404 because it was hard to get to, and had stupidly concluded that a single USB 2.0 header could not bring down the system. Boy was that a poor assumption!

I would be lying if I said I didn't mind USB -- I absolutely hate it for its ultra-simplistic and system-crippling design. There is no reason a header should hold up the system POST, but guess what? It very well can! It wasn't a polarity or other issue; I had the cable alignment correct. The actual, defective end-device -- an 8-in-1 card reader attached to the USB header -- was the culprit. It literally hung the system entirely at the POST!

Just FYI -- don't be assuming like I was (I know most of you aren't). This one really bites.

2005-12-15

Linux on nVidia C51/NV44 (nForce 4x0/GeForce 61x0)

Just FYI, short blog entry here. I installed Fedora Core 4 x86_64 on a nVidia C51/NV44 (nForce 4x0/GeForce 61x0) chipset system. There are a couple of issues.

PCI IDs Change

One is the fact that the PCI IDs are changed from earlier nForce4 chipsets (bad nVidia!). That means the installer doesn't autodetect the "sata_nv" driver. You'll need to manually select the driver (near the bottom).

If the installer detects another storage device -- possibly a SCSI or USB storage device -- then that could cause issues if your BIOS is set to boot SATA first and/or you wish it to be /dev/sda. The workaround is to use the "noprobe" installer boot prompt option, and manually select the "sata_nv" driver. Yes, this means you'll be without a lot in the installer.

I guess in nVidia's defense, Fedora Core 4 is over 7 months old, and pre-dates any C51 info. Although the nForce4 was largely PCI ID compatible with the nForce3 and even nForce2 -- so I'm not sure why nVidia had to change things up.
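
The autodetection failure comes down to a table lookup: a driver ships with a list of PCI vendor/device ID pairs it claims, and hardware with an ID that is not in the list goes unclaimed even when the driver would work fine. Here is a minimal sketch of that idea. The vendor ID 0x10DE is nVidia's, but the device IDs, the table and the driver label are placeholders for illustration, not the real contents of any kernel.

```python
NVIDIA = 0x10DE  # nVidia's PCI vendor ID

# Hypothetical ID table, as an older installer kernel might carry it.
# Device IDs and the driver label are illustrative placeholders.
OLD_DRIVER_TABLE = {
    (NVIDIA, 0x0054): "nforce_sata_driver",
    (NVIDIA, 0x0055): "nforce_sata_driver",
}

def find_driver(vendor, device, table):
    """Return the driver that claims (vendor, device), or None if unknown."""
    return table.get((vendor, device))

print(find_driver(NVIDIA, 0x0054, OLD_DRIVER_TABLE))  # ID the table knows
print(find_driver(NVIDIA, 0x0266, OLD_DRIVER_TABLE))  # newer ID, not in the table
```

Selecting the driver by hand in the installer bypasses the lookup, which is why the manual route works even though autodetection does not.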

Forcedeth Revision

Another thing I ran into is the seemingly slight variations in the 10/100[/1000] MAC. On Fedora Core 4's installer/initial 2.6.11 kernel, it didn't take to the GPL "forcedeth." However, the new nForce 1.0-0310 drivers on nVidia's site do provide a "nvnet" driver that works.

Fortunately, once I upgraded to its updated 2.6.14 kernel, the "forcedeth" driver does work. This could also be related to PCI IDs (again, bad nVidia, bad!).

Cool AMD64 GRUB Splashscreen

I wasn't installing Fedora Core 4 x86_64 for any end-user usage, but on a pair of nVidia GeForce 6100 chipset mainboards with the AMD Sempron 64 2800+ that will be running Windows XP (32-bit, of course -- 64-bit XP is a nightmare, long story). Putting an 8GB partition at the end of their 250GB disks means I can have a diagnostic/recovery partition.

The text-only Fedora with all the tools I'd want runs about 600MB. That leaves a good 6GB for a GNU Parted virtual disk image of the C: drive, which is a 32GB FAT32 (the rest is D:). I always recommend a 32GB or smaller C: drive with Windows XP -- dual-boot or not (unless you go NTFS -- then you should go Dynamic Disk as well, long story).

To hide the fact that Linux is dual-booting on the system, I label the Linux boot as "AMD64 Diag OS" and put it to this splashscreen ...


The same site has several other GRUB splashscreens available.

2005-12-01

Budget Uniprocessor PC Servers (WIP)

[ Work-in-Progress (WIP) ]

Not a week goes by on a list without someone attempting to use a desktop mainboard and/or solution as a server. When I try to push them in the right direction, I'm confronted with a belief that it's "too costly" to do otherwise. The reality is that it is no more costly to do it today than it was 5 years ago -- spending another $100 more in one or two places can give you a 3-5x server performance increase!

- I/O Segmentation and Throughput

Over 90% of uniprocessor servers deployed are still utilizing i486-era technology -- yes, mid-'90s technology -- as the cornerstone of their I/O bottlenecks. All storage, network and other I/O is going over a legacy 32-bit @ 33MHz PCI bus. This was not even viable just 5 years ago, let alone today, and segmentation of storage and network I/O is not optional in today's environments with just 2 disks and 1 gigabit controller.
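
To put rough numbers on "2 disks and 1 gigabit controller" sharing one legacy bus, here is a back-of-the-envelope sketch. The bus peak is simply width times clock. The 60MB/s sustained rate per disk is an assumed figure for illustration, not a measurement, and real buses deliver less than their theoretical peak.

```python
def pci_peak_mb_per_s(width_bits, clock_mhz):
    """Theoretical peak of a parallel PCI bus: bytes per transfer x million transfers/s."""
    return width_bits / 8 * clock_mhz

bus = pci_peak_mb_per_s(32, 33.33)   # legacy shared PCI bus
DISK_MB_PER_S = 60                   # assumed sustained rate per disk
GBE_MB_PER_S = 1000 / 8              # 1Gbps in MB/s, one direction only

demand = 2 * DISK_MB_PER_S + GBE_MB_PER_S
print(f"bus peak: {bus:.0f} MB/s")
print(f"demand  : {demand:.0f} MB/s")
print(f"oversubscribed by {demand / bus:.1f}x")
```

Even with these generous assumptions the shared bus is asked for well more than it can carry, before any other device on it gets a turn.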

So first and foremost, you need a mainboard that segments storage and I/O. Although some of the new PCI-Express (PCIe) desktop mainboards are tempting at their commodity prices, it doesn't take much more to deliver a powerful, far better server solution. Remember, other than the video card slot, most desktop PCIe mainboards only have a PCIe x1 slot -- and that is easily saturated by 2 hard drives, or 1 GbE controller.

It's best to find a quality mainboard solution with multiple PCIe x4 and x8 slots and one or more PCI-X channels, depending on your storage and networking solution. It doesn't matter if your processor delivers 6.4GBps of memory throughput or 8.0GBps of I/O front-side if your I/O is talking to and from memory at a measly 0.125GBps.

- It All Begins With Storage

Storage is the major latency in a uniprocessor PC server. A common belief is that more memory solves all storage issues, but the reality is that storage latency -- due to insufficient I/O bandwidth -- can be the bigger killer, especially when committing large quantities of data on a regular basis as the memory flushes data to disk. About the only time it doesn't matter is when over 99% of your operations are reads -- because a write to disk at only 0.125GBps is still over 50x slower than a write to memory at 6.4GBps -- so anything that writes regularly to disk (even if only 5 or 10% of the time) at such a bottleneck is going to adversely affect performance (before we even look at disk-to/from-network).
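
A quick way to see how hard a small share of disk writes hits is a time-weighted average of the two rates. This is a deliberately crude model: it assumes the traffic is strictly serialized, with a given fraction of the bytes going to disk at 0.125GBps and the rest staying in memory at 6.4GBps, and it ignores caching, queuing and overlap. The function name is made up for the sketch.

```python
def effective_gbps(disk_fraction, disk_gbps=0.125, mem_gbps=6.4):
    """Average throughput when a fraction of the bytes moved must go to disk.

    Time-weighted (harmonic) mean: total bytes divided by total time.
    """
    time_per_gb = disk_fraction / disk_gbps + (1 - disk_fraction) / mem_gbps
    return 1 / time_per_gb

print(f"memory vs. disk path: {6.4 / 0.125:.1f}x")
for fraction in (0.01, 0.05, 0.10):
    print(f"{fraction:4.0%} to disk -> {effective_gbps(fraction):.2f} GBps effective")
```

Under this model, sending only 5% of the bytes through the slow path already cuts the effective rate to well under a third of the memory rate.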

A key starting point is to identify what your storage throughput requirements are, as much as storage amount needed. If you are feeding a GbE network connection, that means you at least need a dedicated 0.125GBps PCIe x1 channel for storage. In this regard, only the latest nForce4 and Intel i9x5 chipsets barely satisfy. In reality, reading from more than 1 disk is going to saturate the channel, and it would be ideal to spend less than half of one second bursting over 0.1GBps to memory before bursting the other half of a second from memory to the network controller (or vice-versa).

So if you are considering any RAID array for sustained 0.1GBps performance, you want to look for a 0.5-1.0GBps PCIe x4/x8 or 0.5-1.0GBps 64-bit PCI or PCI-X controller. Never, never put a quality PCI-X or 64-bit PCI controller in a 32-bit@33MHz "shared" PCI slot or you will be thrashing a lot of I/O.

- Intelligent Storage Controllers Cost Money (But Not Always a Lot)

A natural instinct of most cost-conscious IT people is to use a low-cost FRAID (Fake RAID) PCI or PCIe x1 storage controller. These are nothing more than software RAID controllers, only the software is in the driver, and is probably worse than your OS' own software RAID logic. It would be far better to just use the mainboard's SATA channels on a dedicated PCIe x1 channel with the OS' mirroring/striping/etc... than to use one of these cards. Don't let the offer of RAID-5 on newer FRAID controllers fool you -- they are not hardware RAID.

There are few PCIe storage controllers available. Notable ones include the:
  • LSI Logic 320-2E -- 2-channel U320 SCSI PCIe x8 storage controller using an Intel IOP332 X-Scale controller
  • Areca ARC-1210/1220/1230/1260 -- 4, 8, 12 and 16-channel SATA PCIe x4/x8 storage controllers using an Intel IOP333 X-Scale controller
Unfortunately, SCSI is not-so-commodity at the storage device, so most will not want to invest in the cost of the LSI Logic 320-2E. The entry-level Areca ARC-1210 starts at $400 for just a 4-channel, so it too starts to push the boundaries of cost. Unless massive storage is desired, in which case the cost of the 8-channel Areca ARC-1220 is much less noticeable next to the disk costs, this too is not an ideal cost.

For the most part, the 2-channel SATA 3Ware Escalade 8006-2 and 4-channel SATA 3Ware Escalade 8506-4LP are far more cost effective at around $125 and $250, respectively. In RAID-0, 1, 10 (fastest) and 5 (most efficient), the 3Ware product is very flexible. And its 64-bit ASIC RAID has proven to be very reliable in the 4 years of the 7000/8000 product's existence -- as have its user-space tools (especially under Linux). They are 64-bit@66MHz universal 3.3/5V cards, so they work in PCI-X slots (at 66MHz). The 64-bit ASIC RAID is also ideal for non-blocking SATA I/O, and has full hot-swap support unlike most FRAID cards that rely on the OS (long story).
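
Reading "most efficient" as capacity efficiency (an interpretation on this sketch's part), the trade-off between the RAID levels on a 4-port card can be tallied as below. The function is illustrative only and assumes equal-size disks; it says nothing about speed.

```python
def usable_disks(level, n):
    """Disks' worth of usable capacity for n equal disks at a given RAID level."""
    if level == "0":
        return n                  # striping, no redundancy
    if level == "1":
        if n != 2:
            raise ValueError("RAID-1 here means a 2-disk mirror")
        return 1
    if level == "10":
        if n < 4 or n % 2:
            raise ValueError("RAID-10 needs an even number of disks, at least 4")
        return n // 2             # mirrored pairs, then striped
    if level == "5":
        if n < 3:
            raise ValueError("RAID-5 needs at least 3 disks")
        return n - 1              # one disk's worth of parity
    raise ValueError(f"unknown RAID level: {level}")

for level in ("0", "10", "5"):
    usable = usable_disks(level, 4)
    print(f"RAID-{level:<2} on 4 disks: {usable} usable ({usable / 4:.0%})")
```

Among the redundant levels, RAID-5 keeps the most capacity and RAID-10 gives up half of it, which matches the "most efficient" label in the text.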

For those concerned about the "reliability" of SATA disks versus SCSI, please read the following sidebars from my 2005 September storage article, as well as my blog article on Serial Attached SCSI (SAS):
Once you have the storage controller chosen, and know all of its limitations, it's time to move on to the mainboard, which must handle this storage controller.

- The Mainboard Confuses as Chipsets Have Changed

Probably the most confusing aspect of mainboards these days is the chips -- not so much chipsets -- involved. First off, in the AMD Athlon 64/Opteron platforms, you can mix'n match different HyperTransport tunnels and bridges for whatever you'd like. Secondly, and still far more commonly, people don't look to anything but Intel or other desktop chipsets when looking at Intel processors.

So let's get some rules out so you're aware of them.
  • Intel has not designed, and never will design, a good server chipset -- thank God for ServerWorks (and the resulting E7500/7200 series)
Luckily for OEMs and, in the last 5 years, resellers as well, ServerWorks (formerly Reliance Computer Corporation, RCC, now owned by Broadcom) has designed Intel's latest chipsets. For those of us who deployed Pentium III and (P3-based) Xeon, the ServerWorks ServerSet IIIHE and LE chipsets with their single (or even multiple) 64-bit PCI bridges got the call. ServerWorks produced a good chipset for the Pentium 4 and (P4-based) Xeon, the Grand Champion (GC) series. The GC provides the basis for the Intel E7500 series, designed by ServerWorks, with whom Intel has cross-licensed.

More on the uniprocessor front, Intel has introduced a lower-cost chipset in the E7200, also based on ServerWorks designs. There are 3 E7200 chipsets to be aware of ...
  • Intel E7210: Socket-478 (P4) or Socket-603/604 (P4-Xeon), DDR SDRAM, PCI-X 1.0 (1GBps)
  • Intel E7221: LGA-775 (P4-Prescott) or Socket-604 (Prescott-Xeon), DDR2 SDRAM, PCI-X 1.0 (1GBps), PCIe x8 (1GBps)
  • Intel E7230: LGA-775 "dual core" (Pentium D), DDR2 SDRAM, PCI-X 1.0 (1GBps), PCIe x8 (1GBps)
The E7210 is still a good buy, even with older Socket-478 processors and DDR SDRAM. The single PCI-X 1.0 channel means that storage and NIC might be sharing the same bus -- but it's still typically a 64-bit @ 100MHz (0.75GBps) or at least 64-bit @ 66MHz (0.5GBps) and 4-6x as fast as a shared, 32-bit @ 33MHz PCI bus. Going with a Socket-603/604 P4-Xeon processor would definitely result in better server performance as the memory channels are true dual-interleaved (Socket-478 is only marketed to be, when it is actually not).
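
The multiples over legacy PCI fall straight out of bus width times clock. A small sketch using nominal clock values follows; the GBps figures quoted in the text are rounded, so the exact MB/s numbers differ slightly, and these are theoretical peaks rather than measured throughput.

```python
def pci_peak_mb_per_s(width_bits, clock_mhz):
    """Theoretical peak: bytes per transfer x million transfers per second."""
    return width_bits / 8 * clock_mhz

legacy = pci_peak_mb_per_s(32, 33)   # shared 32-bit @ 33MHz PCI
for width, clock in ((64, 66), (64, 100), (64, 133)):
    peak = pci_peak_mb_per_s(width, clock)
    print(f"{width}-bit @ {clock}MHz: {peak:.0f} MB/s, {peak / legacy:.1f}x legacy PCI")
```

The 66MHz and 100MHz rows give the roughly 4x and 6x figures, and a full 133MHz PCI-X slot would be about 8x.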

The main difference between the E7221 and E7230 is additional logic support for the Pentium D (dual core). Otherwise, E7221 makes a fine solution. The combination of a PCIe x4 or x8 slot and a PCI-X 1.0 slot means segmented network (typically PCIe x4) and storage (typically PCI-X or 64-bit@66MHz PCI) at a low price-point.

There are a few, low-cost E7230 boards out there that just by-pass the PCI-X slot altogether, drastically cutting down on mainboard traces (hence the cost savings). Remember that a physical PCIe x4 slot is not necessarily a PCIe x4 slot electrically -- and there are only so many PCIe channels, so read the manual. Quite often the PCIe x4 slot next to that PCIe x8 slot on a low-cost E7230 mainboard is only a PCIe x1 electrically. That is not good when the on-board GbE networking is rather pathetic (like an Intel 82541 connected to the legacy 0.125GBps PCI bus) and you actually need that slot.
  • AMD HyperTransport+NUMA is the Server Performance King Now -- even on Uniprocessor
First thing to remember about AMD is that there is no such thing as a "Front Side Bus" (FSB) anymore as there are multiple entries into the CPU -- as few as 2, and typically 3-5, for Athlon64/Opteron. Second thing to remember is that there is no such thing as a "chipset" anymore, as there can be multiple chips from different vendors.

E.g., nVidia nForce Pro 2000 and AMD8000 series chips can and are often "tunneled" on mainboards, with some chips connecting to different processors. The Tyan S2895 Thunder K8WE connects an nVidia nForce Pro 2200 and AMD8131 to one processor with a PCIe x16 slot (completely x16 electrically) and two PCI-X channels (two slots on one channel, one slot on another), while the other processor has a nForce Pro 2050 with its own, full PCIe x16 slot (again, completely x16 electrically). Each nForce has its own GbE port, 4x SATA channels connected to one PCIe x1, etc...

In addition to putting 2 local, glueless, full 184-pin (384 trace) DDR channels directly on each processor resulting in the Non-Uniform Memory Architecture (NUMA), AMD uses a bi-directional system I/O interconnect known as HyperTransport. Because of its bi-directional nature, the "standard" PCI-X HyperTransport tunnel has dual PCI-X busses. The AMD8131 is a dual PCI-X 1.0 (1GBps) tunnel and the rarer AMD8132 is a dual PCI-X 2.0 (2GBps) tunnel. With so many traces for 2 full busses (and potentially up to 10 PCI-X slots @ 66MHz, although typically it's only 2-4 at 66-133MHz), it is very, very difficult to find a uniprocessor mainboard with the costly AMD8131 and all its traces. Some high-end dual and quad processor systems even use 2 (HP DL585) or even 3 (Sun Fire V40z) AMD8131 chips for 6 PCI-X channels (yikes!).

Broadcom's ServerWorks division is now producing chips for AMD Opterons. This includes the very cost-effective HT1000 for Socket-939/940 Opteron 100 series. The HT1000 provides a single PCI-X channel, putting it on par with the Intel E7210 as a $200+ mainboard uniprocessor PC server solution. One such mainboard with the HT1000 is the SuperMicro H8SSL-i which is becoming popular with 1U system integrators. The on-board BCM5704 delivers server-quality dual-GbE ports, and then there is a PCI-X slot available for a storage controller.

ServerWorks also produces the HT2000 which adds another PCI-X channel and a PCIe x8 slot to the HT1000 for dual or quad processor systems.

- Overlooking the Gigabit Ethernet (GbE), Not All are Created Equal

Probably the most overlooked component in a server, let alone network infrastructure, is how to properly deploy Gigabit Ethernet (GbE). Although I could spend an entire blog (or even book!) on how to properly deploy GbE services, here's a breakdown of the most important requirements.
  1. Absolute: 802.3x support
  2. Very High: Jumbo Frame support (up to 16KiB)
  3. Very High: Large Packet buffer (64+KiB)
  4. Other considerations: 802.1Q VLAN, 802.3ad Link Aggregation
#1 is absolute. It is the direct result of GbE's commoditization, as more and more cheap GbE hardware is deployed. Your networking equipment and all of your nodes must support 802.3x. 802.3x is flow control, and it allows a switch or node to tell the other end of the communication to slow or stop communicating while the processing catches up. Using the standard 1500 byte Ethernet frame, a saturated GbE link carries roughly 83,000 frames per second in each direction (1Gbps divided by 12,000 bits per frame). With most standard desktop and even many low-cost server NICs having a measly 2-8KiB SRAM buffer total, that's barely enough to handle a few frames -- let alone if it's connected to the legacy 0.125GBps PCI bus shared by everything else!
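
To see how little slack a small on-card buffer gives, this sketch counts how many 1500 byte frames fit and how quickly line-rate traffic would fill the buffer if nothing drained it. It ignores preamble, headers and inter-frame gaps, and the helper names are made up for illustration.

```python
LINE_RATE_BYTES_PER_S = 1_000_000_000 / 8   # GbE, one direction
FRAME_BYTES = 1500                          # standard frame size used in the text

def frames_that_fit(buffer_kib):
    """Whole 1500 byte frames that fit in a buffer of the given size."""
    return buffer_kib * 1024 // FRAME_BYTES

def microseconds_to_fill(buffer_kib):
    """Time for line-rate traffic to fill the buffer if nothing drains it."""
    return buffer_kib * 1024 / LINE_RATE_BYTES_PER_S * 1e6

for kib in (2, 8, 64):
    print(f"{kib:>2} KiB: {frames_that_fit(kib)} frames, "
          f"full in {microseconds_to_fill(kib):.1f} us")
```

A 2KiB buffer holds a single frame and an 8KiB buffer only five, so the host has mere tens of microseconds to pull data off the card before frames are dropped.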

Without flow control, the talkers keep talking and the receivers keep sending out "resend please" messages -- especially at layer-4 UDP/IP and TCP/IP -- if the NIC can't keep up, or at least push it to the memory for processing. So the hit quickly becomes like old "collisions" in the days of shared Ethernet -- exponential! In the early days of GbE when 2,048KiB SRAM caches were commonplace on NICs, this was rare (I remember gasping when the NetGear GA620 1000Base-SX card only had 512KiB of SRAM -- 1/4th typical!). But now, most desktop cards barely have a 2KiB SRAM buffer -- enough just to handle 1 packet.
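
For the curious, the "slow down" message in 802.3x is a tiny MAC control frame called PAUSE. The sketch below builds one by hand to show how little is in it: a reserved multicast destination, EtherType 0x8808, opcode 0x0001 and a 16-bit pause time in units of 512 bit times. The source MAC is a made-up locally administered address, and the frame is only constructed, not sent.

```python
import struct

PAUSE_DEST_MAC = bytes.fromhex("0180c2000001")  # reserved MAC control multicast address
MAC_CONTROL_ETHERTYPE = 0x8808
PAUSE_OPCODE = 0x0001

def build_pause_frame(src_mac: bytes, pause_quanta: int) -> bytes:
    """Build an 802.3x PAUSE frame (without the trailing FCS).

    pause_quanta is in units of 512 bit times; 0 means "resume now".
    """
    if len(src_mac) != 6:
        raise ValueError("src_mac must be 6 bytes")
    if not 0 <= pause_quanta <= 0xFFFF:
        raise ValueError("pause_quanta must fit in 16 bits")
    header = PAUSE_DEST_MAC + src_mac + struct.pack("!H", MAC_CONTROL_ETHERTYPE)
    payload = struct.pack("!HH", PAUSE_OPCODE, pause_quanta)
    # Pad to the 60 byte minimum frame size (64 bytes once the FCS is added).
    return (header + payload).ljust(60, b"\x00")

frame = build_pause_frame(bytes.fromhex("020000000001"), 0xFFFF)
print(len(frame), frame[:18].hex())
```

At GbE speed one quantum is 512ns, so the maximum value of 0xFFFF asks the sender to hold off for about 33.5ms unless a later PAUSE with a value of 0 releases it sooner.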

#2 is really a bare minimum for a server card. If the card is capable of 16KiB Jumbo Frames, then it should have at least 16KiB SRAM cache for TX/RX. Many will have a split TX/RX SRAM design -- possibly more on the RX (receive 2-8x) versus TX (transmit) to cache incoming packets. We'll talk about what Jumbo Frames are later. Some just have split 16+16KiB, and then have a secondary, unified SRAM buffer which is #3. Here 48-96KiB/per-port is typical for single-chip server NICs today. Older NICs will have 512+KiB off chip (and cost more).

#4 brings in a number of things. First off, Jumbo Frames. Using 9000 byte Jumbo Frames reduces the number of frames at GbE line rate to roughly 14,000 per second in each direction, one-sixth as many as with 1500 byte frames and 6x as manageable. It is recommended that any "out-of-band"/dedicated server-to-server network use Jumbo Frames for performance. You'll easily get the best performance. If you can, try to use Jumbo Frames everywhere, but then that means either A) you have to have all nodes using Jumbo Frames, or B) you have to set up 802.1Q VLANs and route. "B" is possible if you already have a Layer-3 switch and it handles auto-VLAN setup (based on frame size). It's far more complicated without a Layer-3 switch, unless you physically segment the 1500 byte frame network from the 9000 byte frame network. Again, except for maybe a dedicated, "out-of-band" network (e.g., server backup switch/cards), this is not feasible.
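
The frame-rate arithmetic is easy to get wrong because the link speed is in bits and frame sizes are in bytes, so here is the calculation spelled out. It is a rough sketch that counts only the frame size named and ignores preamble, inter-frame gap and header overhead, all of which lower the real figures slightly.

```python
GBE_BITS_PER_S = 1_000_000_000

def frames_per_second(frame_bytes):
    """Frames per second in one direction at GbE line rate (overhead ignored)."""
    return GBE_BITS_PER_S / (frame_bytes * 8)

standard = frames_per_second(1500)
jumbo = frames_per_second(9000)
print(f"1500 byte frames: {standard:,.0f}/s")
print(f"9000 byte frames: {jumbo:,.0f}/s")
print(f"reduction: {standard / jumbo:.0f}x")
```

Whatever the absolute numbers, the ratio is what matters to the NIC and the host: six times fewer frames means six times fewer interrupts and buffer hand-offs for the same data.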

802.3ad Link Aggregation is also ideal so both switches and end-nodes can use multiple links -- both for performance and failover. It's far more ideal than old 802.1d Spanning Tree, and most OSes or drivers support it now. E.g., in Linux, a generic 802.3ad driver exists in the kernel for any cards that support it (even across vendors). Under Windows, the card vendor provides 802.3ad support for use with its own cards (although it may or may not work with competitors' cards).