2005-12-01

Budget Uniprocessor PC Servers (WIP)

[ Work-in-Progress (WIP) ]

There's not a week that goes by on a list where someone doesn't attempt to use a desktop mainboard and/or solution as a server. When I try to push them in the right direction, I'm confronted with a belief that it's "too costly" to do otherwise. The reality is that it is no more costly to do it today than it was 5 years ago -- spending another $100 in one or two places can give you a 3-5x server performance increase!

- I/O Segmentation and Throughput

Over 90% of uniprocessor servers deployed are still utilizing i486-era technology -- yes, mid-'90s technology -- as the cornerstone of their I/O, and it is their bottleneck. All storage, network and other I/O is going over a legacy 32-bit @ 33MHz PCI bus. This was not even viable just 5 years ago, let alone today, and segmentation of storage and network I/O is not optional in today's environments, even with just 2 disks and 1 gigabit controller.

So first and foremost, you need a mainboard that segments storage and network I/O. Although some of the new PCI-Express (PCIe) desktop mainboards are tempting at their commodity prices, it doesn't take much more to deliver a powerful, far better server solution. Remember, other than the video card slot, most desktop PCIe mainboards only have a PCIe x1 slot -- and that is easily saturated by 2 hard drives, or 1 GbE controller.

It's best to find a quality mainboard solution with multiple PCIe x4 and x8 slots and one or more PCI-X channels, depending on your storage and networking solution. It doesn't matter if your processor delivers 6.4GBps of memory throughput or 8.0GBps of I/O front-side if your I/O is talking to and from memory at a measly 0.125GBps.

- It All Begins With Storage

Storage is the major source of latency in a uniprocessor PC server. A common belief is that more memory solves all storage issues, but the reality is that storage latency, due to insufficient I/O bandwidth, can be the bigger killer -- especially when committing large quantities of data on a regular basis, as the memory flushes data to disk. About the only time it doesn't matter is when over 99% of your operations are reads. A write to disk at only 0.125GBps is still over 50x slower than a write to memory at 6.4GBps, so anything that writes regularly to disk (even if only 5 or 10% of the time) at such a bottleneck is going to adversely affect performance (before we even look at disk-to/from-network).
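To put a rough number on that, here is a minimal sketch that treats the workload as a mix of bytes that go to memory at 6.4GBps and bytes that must go to disk over the 0.125GBps bus. The function name and the simple "time per byte" model are my own assumptions for illustration; real systems overlap and cache I/O, so treat the result as a ballpark only.

```python
def effective_throughput(slow_fraction, slow_gbps, fast_gbps):
    """Average GBps when slow_fraction of all bytes take the slow path."""
    if not 0.0 <= slow_fraction <= 1.0:
        raise ValueError("slow_fraction must be between 0 and 1")
    # Time to move one GB: part of it at the slow rate, the rest at the fast rate.
    seconds_per_gb = slow_fraction / slow_gbps + (1.0 - slow_fraction) / fast_gbps
    return 1.0 / seconds_per_gb


for percent in (1, 5, 10):
    rate = effective_throughput(percent / 100, slow_gbps=0.125, fast_gbps=6.4)
    print(f"{percent:2d}% of bytes to disk: {rate:.2f} GBps effective")
```

Under this model, sending even 5% of the bytes over the legacy bus drags the average from 6.4GBps to under 2GBps.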

A key starting point is to identify what your storage throughput requirements are, as much as storage amount needed. If you are feeding a GbE network connection, that means you at least need a dedicated 0.125GBps PCIe x1 channel for storage. In this regard, only the latest nForce4 and Intel i9x5 chipsets barely satisfy. In reality, reading from more than 1 disk is going to saturate the channel, and it would be ideal to spend less than half of one second bursting over 0.1GBps to memory before bursting the other half of a second from memory to the network controller (or vice-versa).

So if you are considering any RAID array for sustained 0.1GBps performance, you want to look for a 0.5-1.0GBps PCIe x4/x8 or 0.5-1.0GBps 64-bit PCI or PCI-X controller. Never, never put a quality PCI-X or 64-bit PCI controller in a 32-bit@33MHz "shared" PCI slot or you will be thrashing a lot of I/O.
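The reason segmentation matters can be sketched with simple arithmetic: a stream served from disk to the network crosses a bus twice, once into memory and once out to the NIC. The sketch below is only an illustration under that assumption; the bus names and the helper functions are hypothetical, and the peak figures are the ones quoted in this article.

```python
def bus_loads(stream_gbps, storage_bus, network_bus):
    """Load each bus carries when a stream is staged through memory.

    Every byte crosses the storage bus once (disk to memory) and the
    network bus once (memory to NIC). If both names refer to the same
    bus, that bus carries the stream twice.
    """
    loads = {}
    for bus in (storage_bus, network_bus):
        loads[bus] = loads.get(bus, 0.0) + stream_gbps
    return loads


def utilization(stream_gbps, storage_bus, network_bus, capacity_gbps):
    """Fraction of each bus's peak bandwidth consumed by the stream."""
    loads = bus_loads(stream_gbps, storage_bus, network_bus)
    return {bus: load / capacity_gbps[bus] for bus, load in loads.items()}


# Peak GBps per bus as quoted in the article; the names are illustrative.
capacity = {"legacy-pci": 0.125, "pci-x-66": 0.5, "nic-channel": 0.125}

shared = utilization(0.1, "legacy-pci", "legacy-pci", capacity)
split = utilization(0.1, "pci-x-66", "nic-channel", capacity)

for name, result in (("shared", shared), ("segmented", split)):
    for bus, used in result.items():
        print(f"{name:9s} {bus:11s} {used:.0%} of peak")
```

With everything on the shared legacy bus, a 0.1GBps stream asks for 160% of what the bus can deliver; with storage and network segmented, neither bus is oversubscribed.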

- Intelligent Storage Controllers Cost Money (But Not Always a Lot)

A natural instinct of most cost-conscious IT people is to use a low-cost FRAID (Fake RAID) PCI or PCIe x1 storage controller. These are nothing more than software RAID controllers, only the software is in the driver, and is probably worse than your OS' own software RAID logic. It would be far better to just use the mainboard's SATA channels on a dedicated PCIe x1 channel with the OS' mirroring/striping/etc... than to use one of these cards. Don't let the offer of RAID-5 on newer FRAID controllers fool you -- they are not hardware RAID.

There are few PCIe storage controllers available; notable ones include the:
  • LSI Logic 320-2E -- 2-channel U320 SCSI PCIe x8 storage controller using an Intel IOP332 X-Scale controller
  • Areca ARC-1210/1220/1230/1260 -- 4, 8, 12 and 16-channel SATA PCIe x4/x8 storage controllers using an Intel IOP333 X-Scale controller
Unfortunately, SCSI is not-so-commodity at the storage device, so most will not want to invest in the cost of the LSI Logic 320-2E. The entry-level Areca ARC-1210 starts at $400 for just a 4-channel, so it too starts to push the boundaries of cost. Unless massive storage is desired, in which case the cost of the 8-channel Areca ARC-1220 is much less noticeable after disk costs, this too is not an ideal cost.

For the most part, the 2-channel SATA 3Ware Escalade 8006-2 and 4-channel SATA 3Ware Escalade 8506-4LP are far more cost effective at around $125 and $250, respectively. In RAID-0, 1, 10 (fastest) and 5 (most efficient), the 3Ware product is very flexible. And its 64-bit ASIC RAID has proven to be very reliable in the 4 years of the 7000/8000 product's existence, as have its user-space tools (especially under Linux). They are 64-bit@66MHz universal 3.3/5V cards, so they work in PCI-X slots (at 66MHz). The 64-bit ASIC RAID is also ideal for non-blocking SATA I/O, and has full hot-swap support unlike most FRAID cards that rely on the OS (long story).
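For anyone weighing those RAID levels on a 4-channel card, here is a small sketch of how much usable capacity each level leaves, which is what "most efficient" refers to among the redundant levels. It assumes equal-size disks and the textbook layouts (one disk's worth of parity for RAID-5, mirrored pairs for RAID-10); the function name is mine, not anything from the 3Ware tools.

```python
def usable_disks(level, n_disks):
    """Disks' worth of usable capacity for an array of equal-size disks."""
    if level == 0 and n_disks >= 2:
        return n_disks            # striping only, no redundancy
    if level == 1 and n_disks == 2:
        return 1                  # two-disk mirror
    if level == 10 and n_disks >= 4 and n_disks % 2 == 0:
        return n_disks // 2       # stripe across mirrored pairs
    if level == 5 and n_disks >= 3:
        return n_disks - 1        # one disk's worth of parity
    raise ValueError(f"RAID-{level} is not valid with {n_disks} disks")


for level in (0, 10, 5):
    usable = usable_disks(level, 4)
    print(f"RAID-{level:<2d} on 4 disks: {usable} usable ({usable / 4:.0%})")
```

On four disks, RAID-10 keeps half the raw capacity while RAID-5 keeps three quarters, at the price of parity calculations on every write.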

For those concerned about the "reliability" of SATA disks versus SCSI, please read the sidebars from my 2005 September storage article, as well as my blog article on Serial Attached SCSI (SAS).

Once you have the storage controller chosen, and know all of its limitations, it's time to move on to the mainboard, which must handle this storage controller.

- The Mainboard Confuses as Chipsets Have Changed

Probably the most confusing aspect of mainboards these days is the chips -- not so much chipsets -- involved. First off, in the AMD Athlon 64/Opteron platforms, you can mix'n match different HyperTransport tunnels and bridges for whatever you'd like. Secondly, and still far more commonly, people don't look at anything but Intel or other desktop chipsets when looking at Intel processors.

So let's get some rules out so you're aware of them.
  • Intel has not designed, and never will design, a good server chipset -- thank God for ServerWorks (and the resulting E7500/7200 series)
Luckily for OEMs and, in the last 5 years, resellers as well, ServerWorks (formerly Reliance Computer Corporation, RCC, now owned by Broadcom) has designed Intel's latest chipsets. For those of us who deployed Pentium III and (P3-based) Xeon, the ServerWorks ServerSet IIIHE and LE chipsets with their single (or even multiple) 64-bit PCI bridges got the call. ServerWorks produced a good chipset for the Pentium 4 and (P4-based) Xeon, the Grand Champion (GC) series. The GC provides the basis for the Intel E7500 series, designed by ServerWorks, with whom Intel has cross-licensed.

More on the uniprocessor front, Intel has introduced a lower-cost chipset in the E7200, also based on ServerWorks designs. There are 3 E7200 chipsets to be aware of ...
  • Intel E7210: Socket-478 (P4) or Socket-603/604 (P4-Xeon), DDR SDRAM, PCI-X 1.0 (1GBps)
  • Intel E7221: LGA-775 (P4-Prescott) or Socket-604 (Prescott-Xeon), DDR2 SDRAM, PCI-X 1.0 (1GBps), PCIe x8 (1GBps)
  • Intel E7230: LGA-775 "dual core" (Pentium D), DDR2 SDRAM, PCI-X 1.0 (1GBps), PCIe x8 (1GBps)
The E7210 is still a good buy, even with older Socket-478 processors and DDR SDRAM. The single PCI-X 1.0 channel means that storage and NIC might be sharing the same bus -- but it's still typically a 64-bit @ 100MHz (0.75GBps) or at least 64-bit @ 66MHz (0.5GBps) and 4-6x as fast as a shared, 32-bit @ 33MHz PCI bus. Going with a Socket-603/604 P4-Xeon processor would definitely result in better server performance as the memory channels are true dual-interleaved (Socket-478 is only marketed to be, when it is actually not).
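The ratios above fall straight out of bus width times clock. A quick sketch, assuming one transfer per clock and the nominal clock rates (33.33, 66.66, 100 and 133.33MHz); the function name is mine. It reports decimal MB/s, which is why the figures come out slightly above the rounded GBps numbers used in this article.

```python
def parallel_bus_mbps(width_bits, clock_mhz):
    """Peak MB/s (10^6 bytes/s) of a parallel bus moving one word per clock."""
    return width_bits / 8 * clock_mhz


legacy = parallel_bus_mbps(32, 33.33)

buses = [
    ("32-bit @ 33MHz PCI", 32, 33.33),
    ("64-bit @ 66MHz PCI/PCI-X", 64, 66.66),
    ("64-bit @ 100MHz PCI-X", 64, 100.0),
    ("64-bit @ 133MHz PCI-X", 64, 133.33),
]

for name, width, clock in buses:
    peak = parallel_bus_mbps(width, clock)
    print(f"{name:26s} {peak:7.1f} MB/s  ({peak / legacy:.1f}x legacy PCI)")
```

These are peak numbers for the whole bus; every device on a shared bus splits them, which is the point of giving storage and network their own channels.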

The main difference between the E7221 and E7230 is additional logic support for the Pentium D (dual core). Otherwise, E7221 makes a fine solution. The combination of a PCIe x4 or x8 slot and a PCI-X 1.0 slot means segmented network (typically PCIe x4) and storage (typically PCI-X or 64-bit@66MHz PCI) at a low price-point.

There are a few low-cost E7230 boards out there that just bypass the PCI-X slot altogether, drastically cutting down on mainboard traces (hence the cost savings). Remember that a physical PCIe x4 slot is not necessarily a PCIe x4 slot electrically -- and there are only so many PCIe channels, so read the manual. Quite often the PCIe x4 slot next to that PCIe x8 slot on a low-cost E7230 mainboard is only a PCIe x1 electrically. That's not good when the on-board GbE networking is rather pathetic (like an Intel 82541 connected to the legacy 0.125GBps PCI bus) and you actually need that slot.
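If the board is already in hand and running Linux, one way to catch a physical x4 slot that is only wired x1 is to compare the link width a card negotiated against the width it is capable of. The sketch below assumes a reasonably recent kernel that exposes the `current_link_width` and `max_link_width` attributes under /sys/bus/pci/devices (older kernels do not, and `lspci -vv` shows the same information as LnkCap/LnkSta); the function names are my own.

```python
from pathlib import Path


def read_link_widths(sysfs_root="/sys/bus/pci/devices"):
    """Return {device_address: (current_width, max_width)} for PCIe devices."""
    widths = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return widths  # not Linux, or sysfs is not mounted
    for dev in sorted(root.iterdir()):
        try:
            current = int((dev / "current_link_width").read_text().strip())
            maximum = int((dev / "max_link_width").read_text().strip())
        except (OSError, ValueError):
            continue  # plain PCI device, or attribute not available
        widths[dev.name] = (current, maximum)
    return widths


def downgraded(widths):
    """Devices whose negotiated width is narrower than the card supports."""
    return {addr: w for addr, w in widths.items() if 0 < w[0] < w[1]}


for addr, (current, maximum) in downgraded(read_link_widths()).items():
    print(f"{addr}: running x{current}, card supports x{maximum}")
```

A x4 controller that shows up here as "running x1" is sitting in exactly the kind of slot described above.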
  • AMD HyperTransport+NUMA is the Server Performance King Now -- even on Uniprocessor
First thing to remember about AMD is that there is no such thing as a "Front Side Bus" (FSB) anymore as there are multiple entries into the CPU -- as little as 2 to the typical 3-5 for Athlon64/Opteron. Second thing to remember is that there is no such thing as a "chipset" anymore, as there can be multiple chips from different vendors.

E.g., nVidia nForce Pro 2000 and AMD8000 series chips can be, and often are, "tunneled" on mainboards, with some chips connecting to different processors. The Tyan S2895 Thunder K8WE connects an nVidia nForce Pro 2200 and AMD8131 to one processor with a PCIe x16 slot (completely x16 electrically) and two PCI-X channels (two slots on one channel, one slot on another), while the other processor has a nForce Pro 2050 with its own, full PCIe x16 slot (again, completely x16 electrically). Each nForce has its own GbE port, 4x SATA channels connected to one PCIe x1, etc...

In addition to putting 2 local, glueless, full 184-pin (384 trace) DDR channels directly on each processor resulting in the Non-Uniform Memory Architecture (NUMA), AMD uses a bi-directional system I/O interconnect known as HyperTransport. Because of its bi-directional nature, the "standard" PCI-X HyperTransport tunnel has dual PCI-X busses. The AMD8131 is a dual PCI-X 1.0 (1GBps) tunnel and the rarer AMD8132 is a dual PCI-X 2.0 (2GBps) tunnel. With so many traces for 2 full busses (and potentially up to 10 PCI-X slots @ 66MHz, although typically it's only 2-4 at 66-133MHz), it is very, very difficult to find a uniprocessor mainboard with the costly AMD8131 and all its traces. Some high-end dual and quad processor systems even use 2 (HP DL585) or even 3 (Sun Fire V40z) AMD8131 chips for 6 PCI-X channels (yikes!).

Broadcom's ServerWorks division is now producing chips for AMD Opterons. This includes the very cost-effective HT1000 for Socket-939/940 Opteron 100 series. The HT1000 provides a single PCI-X channel, putting it on par with the Intel E7210 as a $200+ mainboard uniprocessor PC server solution. One such mainboard with the HT1000 is the SuperMicro H8SSL-i which is becoming popular with 1U system integrators. The on-board BCM5704 delivers server-quality dual-GbE ports, and then there is a PCI-X slot available for a storage controller.

ServerWorks also produces the HT2000 which adds another PCI-X channel and a PCIe x8 slot to the HT1000 for dual or quad processor systems.

- Overlooking the Gigabit Ethernet (GbE), Not All are Created Equal

Probably the most overlooked component in a server, let alone network infrastructure, is how to properly deploy Gigabit Ethernet (GbE). Although I could spend an entire blog (or even book!) on how to properly deploy GbE services, here's a breakdown of the most important requirements.
  1. Absolute: 802.3x support
  2. Very High: Jumbo Frame support (up to 16KiB)
  3. Very High: Large Packet buffer (64+KiB)
  4. Other considerations: 802.1Q VLAN, 802.3ad Link Aggregation
#1 is absolute. It is the direct result of GbE's commoditization, as more and more cheap GbE hardware is deployed. Your networking equipment and all of your nodes must support 802.3x. 802.3x is flow control, and it allows a switch or node to tell the other end of the communication to slow or stop communicating while the processing catches up. Using the standard 1500 byte Ethernet frame, there are roughly 81,000 frames per second going in each direction through your GbE NIC at line rate. With most standard desktop and even many low-cost server NICs having a measly 2-8KiB SRAM buffer total, that's barely enough to handle a few frames -- let alone if it's connected to the legacy 0.125GBps PCI bus shared by everything else!

Without flow control, the talkers keep talking and the receivers keep sending out "resend please" messages -- especially at layer-4 UDP/IP and TCP/IP -- if the NIC can't keep up, or at least push it to the memory for processing. So the hit quickly becomes like old "collisions" in the days of shared Ethernet -- exponential! In the early days of GbE when 2,048KiB SRAM caches were commonplace on NICs, this was rare (I remember gasping when the NetGear GA620 1000Base-SX card only had 512KiB of SRAM -- 1/4th typical!). But now, most desktop cards barely have a 2KiB SRAM buffer -- enough just to handle 1 packet.

#2 is really a bare minimum for a server card. If the card is capable of 16KiB Jumbo Frames, then it should have at least 16KiB SRAM cache for TX/RX. Many will have a split TX/RX SRAM design -- possibly more on the RX (receive 2-8x) versus TX (transmit) to cache incoming packets. We'll talk about what Jumbo Frames are later. Some just have split 16+16KiB, and then have a secondary, unified SRAM buffer which is #3. Here 48-96KiB per port is typical for single-chip server NICs today. Older NICs will have 512+KiB off chip (and cost more).
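To get a feel for why those buffer sizes matter, here is a rough sketch of how long a NIC buffer lasts if the host cannot drain it while GbE traffic arrives at line rate, and how many whole frames it can hold. It ignores framing overhead and any concurrent draining, so it is a worst-case illustration; the function names are mine.

```python
def buffer_fill_time_us(buffer_kib, line_rate_bps=1_000_000_000):
    """Microseconds of line-rate traffic a buffer absorbs before overflowing."""
    buffer_bits = buffer_kib * 1024 * 8
    return buffer_bits / line_rate_bps * 1_000_000


def frames_in_buffer(buffer_kib, frame_bytes):
    """Whole frames of the given size that fit in the buffer."""
    return (buffer_kib * 1024) // frame_bytes


for kib in (2, 8, 64, 2048):
    print(
        f"{kib:5d}KiB buffer: {buffer_fill_time_us(kib):9.1f} us at GbE, "
        f"{frames_in_buffer(kib, 1500):4d} x 1500-byte frames, "
        f"{frames_in_buffer(kib, 9000):3d} x 9000-byte frames"
    )
```

A 2KiB buffer holds a single standard frame and overflows in well under 20 microseconds, which is why flow control and a fast path to memory are not optional with such cards.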

#4 brings in a number of things. First off, Jumbo Frames. Using 9000 byte Jumbo Frames reduces the number of frames at GbE line rate from roughly 81,000/second to roughly 14,000/second, 6x as manageable. It is recommended that any "out-of-band"/dedicated server-to-server network use Jumbo Frames; you'll easily get the best performance. If you can, try to use Jumbo Frames everywhere, but then that means either A) you have to have all nodes using Jumbo Frames, or B) you have to set up 802.1Q VLANs and route. "B" is possible if you already have a Layer-3 switch and it handles auto-VLAN setup (based on frame size). It's far more complicated without a Layer-3 switch, unless you physically segment the 1500 byte frame network from the 9000 byte frame network. Again, except for maybe a dedicated, "out-of-band" network (e.g., server backup switch/cards), this is not feasible.
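The frame-rate arithmetic is easy to check. This sketch assumes untagged Ethernet with 38 bytes of per-frame overhead on the wire (preamble and start delimiter, header, FCS and inter-frame gap) and counts one direction of a full-duplex GbE link; the constant and function names are mine.

```python
# Per-frame bytes on the wire beyond the payload:
# 8 preamble/SFD + 14 header + 4 FCS + 12 inter-frame gap.
WIRE_OVERHEAD_BYTES = 38


def frames_per_second(payload_bytes, link_bps=1_000_000_000):
    """Maximum frames per second in one direction at line rate."""
    bits_per_frame = (payload_bytes + WIRE_OVERHEAD_BYTES) * 8
    return link_bps / bits_per_frame


standard = frames_per_second(1500)
jumbo = frames_per_second(9000)

print(f"1500-byte frames: {standard:,.0f}/second")
print(f"9000-byte frames: {jumbo:,.0f}/second")
print(f"reduction: {standard / jumbo:.1f}x")
```

Every one of those frames is an interrupt or descriptor the NIC, bus and driver have to deal with, so cutting the count by roughly 6x is where the Jumbo Frame benefit comes from.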

802.3ad Link Aggregation is also ideal so both switches and end-nodes can use multiple links -- both for performance and failover. It's far more ideal than old 802.1d Spanning Tree, and most OSes or drivers support it now. I.e., in Linux, a generic 802.3ad driver exists in the kernel for any cards that support it (even across vendors). Under Windows, the card vendor provides 802.3ad support for use with its own cards (although it may or may not work with competitors' cards).

