2007-03-07

"Common Sense" Disaster Recovery Fundamentals

Most people in IT get away with not having a disaster recovery plan every year. Why? Because they don't have a disaster. Of those that do have a disaster, over two-thirds of their employers do not recover. Don't be a statistic -- or rather, don't let your employer be a statistic.

There are basic fundamentals to disaster recovery. They are:
1. The 3 basic levels of data redundancy
2. The time crunch, especially recovery time
3. Test restores are an ongoing duty

I've heard arguments of budget, simplicity and countless other reasons why various IT support departments fail to have a disaster recovery plan. I've also seen many, many false disaster recovery setups. They all make me shake my head, especially when disaster recovery can be very inexpensive (and even free) in most cases.

1. The 3 basic levels of data redundancy

There are 3 basic levels of data redundancy:
A. On-line: immediate recovery of data on the system itself
B. Near-line: near-immediate recovery from another system on the network
C. Off-line: disaster recovery when you don't have the system, let alone the network

On-line includes more than redundant power and disk, which merely save you time and avoid downtime (and save money, which is a good reason to pay an extra $100 for a 2nd disk in even desktop systems these days). Today servers should use filesystem snapshots so accidentally deleted files can be recovered by users without bothering system administrators. It's worth the time and effort to implement filesystem snapshots on servers, or even to use the free VMware Server product to run servers as guest OSes and take snapshots of the entire system (which is great for backups too). There are countless options here to help you, and you don't need to buy a NetApp hardware solution to get them.

Near-line means disk on the network, recoverable within a short period of time, and it also gives you a way to back up quickly in the first place. This could be as simple as a server that receives rsync updates from systems every night, keeping rotating backup sets or using filesystem snapshots to do the same. Ideally this is also your off-line backup server, as it can put data to tape or ruggedized disk 24x7, and not merely during the backup window. Near-line systems can be as simple as an extra $500 PC on the network with two very large disks in a RAID-1 configuration. Again, there are countless options here to help you, and you don't have to go out and buy a full-up Virtual Tape Library (VTL) product to get them (my linked article dissects VTLs, which really aren't much different from what you can do on your own).
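To make the nightly rsync pull with rotating backup sets concrete, here is a minimal sketch in Python. It only builds the rsync command line and works out which old sets to prune; it does not run anything. The host name, the paths, the one-dated-directory-per-night layout and the 14-set retention are all my assumptions for illustration. The rsync options used (-a, --delete, --link-dest) are standard ones, but check them against your own rsync version before relying on this.

```python
from datetime import date
from pathlib import PurePosixPath


def build_rsync_command(host, source_dir, backup_root, today, previous=None):
    """Return the rsync argument list for one nightly pull from `host`.

    Assumed layout (hypothetical): <backup_root>/<host>/<YYYY-MM-DD>/
    """
    host_root = PurePosixPath(backup_root) / host
    dest = host_root / today.isoformat()
    cmd = ["rsync", "-a", "--delete"]
    if previous is not None:
        # Files unchanged since the previous set are hard-linked to it,
        # so every dated directory looks complete but uses little disk.
        cmd.append(f"--link-dest={host_root / previous.isoformat()}")
    cmd += [f"{host}:{source_dir}/", f"{dest}/"]
    return cmd


def snapshots_to_prune(snapshot_dates, keep=14):
    """Return the dates of backup sets older than the newest `keep` sets."""
    ordered = sorted(snapshot_dates)
    return ordered[:-keep] if len(ordered) > keep else []
```

The returned list can be handed to subprocess.run() from a nightly cron job on the near-line server. Because the near-line server pulls from the other systems, none of them need write access to the backups.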

Off-line can mean a combination of things, but it usually means tape or ruggedized disk, and it means only periodically. When you have a near-line system, you only need to off-line every week or two for off-site disaster recovery -- that's it! As such, off-line does not mean a 3.5" disk in a hot-swap bay or on a USB or IEEE1394 FireWire connection. 3.5" disk is not meant to be thrown around. Consider at least a 2.5" disk -- which takes 20-50x the shock -- and is as low as $0.75/GB today (160GB for $120), and which is even being used more and more in an "enclosed cartridge" by various vendors. There are even hot-swap SATA bays now that fit four (4) 2.5" drives in a 5.25" bay. Of course, if you're putting TBs of data off-line, just invest $500 in an LTO-1 (or later) tape drive, which will quickly pay for itself and has an extensive upgrade path through multi-TB cartridges. In either case, these "off-line" options always go in your "near-line" server, not your main servers that are up 24x7, so if they cause an issue (e.g., when removing 2.5" disks), no one is down!

In case you're not seeing the repeated theme here: near-line disk and off-line ruggedized disk or tape complement each other, and you need both. You want near-line disk so you can easily back up and restore your actual systems within the backup window, and off-line media so you can restore after a disaster. People dismiss tape because they implement it incorrectly -- you should never back up directly to tape from end-systems. Instead, back up to disk on a near-line system during your backup window, and then off-line to tape from that near-line system at your leisure. That solves the alleged "tape problem," which isn't a "tape problem" at all, but people failing to remove the disadvantage of tape (linear access) so that only its advantages over 3.5" disk (cartridge life, portability, etc...) remain.

If you can't afford tape or don't have enough data to justify its cost, then get some 2.5" drives, which are designed for the torture of being moved around in laptops. Don't look to external 3.5" drives, especially not via USB or FireWire, as they are often not designed for hot-plugging with servers (despite marketing to the contrary), and can bring down your servers (especially overnight when there is a "bus disconnect"). Use 3.5" drives as they were designed, as fixed disks, in a near-line server as above. This is a much better solution -- often costing less than $500 with a PC and two (2) redundant disks -- than a few external drives that aren't redundant and aren't as fast over the external buses either.

As countless people have thanked me for pointing out (among others who put forth the same), recognizing that near-line is the answer for nearly all restores -- because companies typically don't have a disaster event -- is why you don't need (let alone want) to use removable disk for off-line. You only need off-line for that chance of a disaster, so you only need to off-line every 2 weeks or so. Use commodity 3.5" disk in its natural, fixed configuration on the network as your staple, near-line solution, and then complement that with infrequent off-line to either tape or ruggedized disk from that near-line server, which can go up and down, unlike your main servers (which you don't want to be hot-plugging things on).

2. The time crunch, especially recovery time

Time is money. The time crunch after any disaster -- whether it's a single server meltdown or the destruction of a full office -- is money down the drain, tied linearly to the duration of the downtime. Consider the following concepts:
A. Consistency/Reproducibility
B. Boot to Restore
C. Network Reconfiguration

Consistency and reproducibility are the staples of configuration management in any environment. "One-off" systems can be a real PITA for configuration management, especially in enterprises with multiple servers. When you go to install any capability, try to do as many systems as possible at the same time in the same configuration -- and consider buying 1 extra system for immediate replacement or spare parts -- especially if the systems will be in use for more than 2 years, as older parts will no longer be available. And at least make it so you can install the next system the exact same way as you did earlier -- be it cloning (often required for Windows), or formal/proper package management (easy to do with modern Linux distros), etc... Another option today, which is free, is to always install your servers as guest OSes in VMware -- which can be moved to any system running VMware Player/Server -- a la "reinstall" in 15 minutes.

Put these images on bootable DVDs and make several, extra copies to go with both your near-line systems and off-line media. That way you're ready to bring any type of system back-to-life when you need it. Again, the more you standardize your installs, the easier it is to consistently reproduce them.
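One cheap way to check that a rebuilt system really matches your standard install is to compare package lists. The sketch below is only an illustration under my own assumptions: it expects each list as plain text with one "name version" pair per line (on an RPM-based distro something like rpm -qa with a --queryformat string can produce that), and the function names are made up for this example.

```python
def parse_package_list(text):
    """Parse lines of 'name version' into a dict; malformed lines are skipped."""
    packages = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            packages[parts[0]] = parts[1]
    return packages


def find_drift(reference, rebuilt):
    """Return (missing, extra, changed) package names between two systems."""
    missing = sorted(set(reference) - set(rebuilt))
    extra = sorted(set(rebuilt) - set(reference))
    changed = sorted(
        name
        for name in set(reference) & set(rebuilt)
        if reference[name] != rebuilt[name]
    )
    return missing, extra, changed
```

If all three lists come back empty, the rebuilt system has the same packages at the same versions as the reference. It says nothing about configuration files, so treat it as a first sanity check only.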

Boot to restore time is yet another reality. The longer you can't bring a server back to life, the more users are down and the more it costs your company. Your bootable DVDs with your system images should have a way to reinstall those images. It's worth paying for 1 license of a professional Windows recovery system for each Windows server. And with Linux, you should know how to recover any Linux system from a "rescue" or other CD, and merging that boot with the images is the best thing you can do. And, again, if you really don't want to deal with much, just having the images as VMware guests makes it a piece of cake, as you completely separate the hardware aspects from the OS run-time, and any VMware Player or Server will get you back up and running.

In all cases, again, your system images are useless or a major PITA if you don't have a way to boot them and get them back on the system -- especially when it might not be the original hardware.

Network reconfiguration is the last aspect people overlook. I don't know how many times I've seen a router or switch fail, and no one backed up the firmware. To make things worse, you need to have not merely network topology documentation, but an extensive document listing all the services required on your network -- especially when a major disaster takes down your office. Your desktops and servers could be ready, but select network devices or networking services could be the massive "reverse engineering effort" that keeps them from running.

This is especially true since sysadmins have a nasty habit of looking at networking separately, possibly even outsourcing its support. Make sure you get that vital reconfiguration information from your ISP, network support, etc..., and put copies on your restore DVDs.

3. Test restores are an ongoing duty

Lastly, and definitely not least, test restores are not optional. Specifically:
A. On-line: Check options daily
B. Near-line: Daily logs, check restore before you off-line
C. Off-line: As you off-line, as well as when you rotate, from near-line

On-line restores, such as filesystem snapshots, will pretty much be checked by your users. That's the power of filesystem snapshots and letting your users help themselves: your users will let you know if they don't work. And if you are using VMware or other virtualization snapshots, every now and then test your VMware snapshot from the previous day on a spare system (if you have one -- you should if you have at least 3 servers), separate from the network.

Near-line restores will also be fairly natural. Check your logs daily to see which nightly pulls from systems didn't complete. Otherwise you really only need to check one other time: when you off-line. If and when you are going to put data to off-line tape or ruggedized disk, you want to check that the backup was complete before you do. This is also where you want to make use of that spare server, or at least some spare PC.
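The daily log check is easy to automate. As a rough sketch, assume your backup job records the time each system's last nightly pull completed (how you record that is up to you; the dictionary below and the 26-hour threshold are hypothetical). Then flagging the systems that missed a night is a few lines of Python:

```python
from datetime import datetime, timedelta


def find_stale_hosts(last_success, now, max_age=timedelta(hours=26)):
    """Return hosts whose last completed pull is older than `max_age`.

    `last_success` maps host name -> datetime of its last completed pull,
    or None if the host has never completed one.
    """
    stale = []
    for host, finished in sorted(last_success.items()):
        if finished is None or now - finished > max_age:
            stale.append(host)
    return stale
```

Mail yourself the returned list every morning and an empty message means every pull completed. The threshold is a bit over 24 hours so a pull that runs slightly late one night doesn't raise a false alarm.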

Off-line is really the only time anything is a PITA. You at least want to use "verify after write" to your tape or ruggedized disk. People who incorrectly back up directly to tape, instead of using a near-line disk solution, often turn this off because the backup won't finish in their backup window. That's just more reason to get a near-line solution, because you can off-line data for off-site storage at any time, from any recent backup that's on your near-line server! That's why near-line is your staple backup/restore solution, especially when complemented by off-line!

Furthermore, you need to periodically check your previous off-line backups. I normally recommend 2 off-line checks a month, or at least 1 per month, and it's normally a real pain. People loathe this, but that's where near-line comes to the rescue yet again. Most people restore to the original server, because they don't have separate near-line and back up directly to off-line (or, worse yet, to a removable/portable disk right on the server). With near-line, you can instead directly compare the off-line media to the near-line server's storage. You can even restore to the near-line server, not bothering your main servers. And you could even store file lists and checksums of your off-line backups on your near-line server, which gives you a quick "sanity check" of whether a prior off-line has gone bad or not.
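Here is a minimal sketch of that file list and checksum idea in Python. It assumes the off-line media can be mounted or restored to a directory on the near-line server so both sides can be walked as ordinary files; the function names and the choice of SHA-256 are mine, not a standard tool.

```python
import hashlib
import os


def build_manifest(root):
    """Map each file's path (relative to `root`) to its SHA-256 hex digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(full, "rb") as handle:
                # Read in 1 MiB chunks so large backup files fit in memory.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            manifest[os.path.relpath(full, root)] = digest.hexdigest()
    return manifest


def compare_manifests(expected, actual):
    """Return (missing, mismatched) paths, judged against `expected`."""
    missing = sorted(p for p in expected if p not in actual)
    mismatched = sorted(
        p for p in expected if p in actual and expected[p] != actual[p]
    )
    return missing, mismatched
```

Build the manifest from the near-line copy at the moment you off-line, and keep it on the near-line server (for example as a JSON file). On a later check, build a manifest from the off-line media and compare: two empty lists mean the media still matches what you wrote.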

Conclusions

This article is about "common sense." When people start talking about proper backup, they think big bucks or big efforts. No, that's so far from the truth it's not even funny. The "common sense" is all in the restoration. And it's often very cheap to do (even for small businesses) and saves sysadmins countless hours compared to the common (and poor) procedures in place.

For the most part this is realizing that a near-line backup server is the key. It gives you the fastest 3.5" disk performance available, with the disks in their natural, fixed, "on" (not stored) environment, right on your network, ready to use. You don't need every backup to be removed from your office, especially since that's only an issue if and when you have a disaster.

From there you only need to do an off-line backup every 2 weeks or so, and most companies only retain data for longer-term storage on a monthly basis. That's where the near-line backup server is ideal again -- because it has backups of all your systems, you can then put them to off-line media (tape or ruggedized disk) at your leisure, and not under the constraints of the backup window. It also gives you a system to test restores from older, off-line backup media, without bothering the servers.

Comments:

Viral said...

About 80% of businesses fail after a major data disaster
happens to them. We have already seen this recently
in the UK, when major floods caused various IT and non-IT
companies to lose their data and go out of business.

To avoid this, any business, either SMB or enterprise, must
have a disaster recovery plan. Bare Metal Recovery is one
technology which is available in the market, but not all
people know about it. Check this out at

www.unitrends.co.uk

They are the originator of the term Bare Metal. Using this
technology one can restore OS and data very quickly.
