Archive for the ‘backups’ Category

Outsource IT, Part III
Posted by Steve Kinman on March 5th 2008

Third in a series of three! In other words, you won’t have to read this stuff anymore after this one, and I will get back to the fun ones. I might try to make this one fun along the way. I left off last time discussing some of the financial and technical reasons to outsource your servers. This post is geared toward some ideas floating around in my head about what would make good examples of outsourcing.

You have to step back and look at it from a different angle. If you aren’t ready to outsource the whole farm just yet, you can go about it in a couple of different ways. One, you can outsource your sandbox, development, and/or test environment. We all know that with SAS 70 and SOX you have to have all of these (or most of them anyway), and outsourcing might be a good way of getting them in place. The cool thing about outsourcing any or all of those is that you have a pristine environment, and if it does get polluted somehow you can just reload the OS quickly and painlessly and try to tear it up again. Outsourced servers are great for this type of scenario. You can even get a few servers and carve them up virtually and have even more toys to play with. Now, you can just go buy new servers and have this in house, but when they break or become obsolete you get to buy more. With an outsourced model you can buy 1 or 100 and have them for 1 month or 2 years; it’s up to you, your needs, and your budget. You can add hardware or memory, change the OS daily, and pay for the license by the month instead of having to buy it outright as you do when you buy your own servers. I personally believe this is a really good way to get acclimated to outsourcing and test the waters, both for yourself and for your boss. You always have to make sure they are OK with the way you are doing things. Well, sometimes anyway.

Another option is outsourcing production. Some bosses out in the world aren’t ready for this yet, but they will be. They like keeping their data close by and having multiple copies and instances and USB keys with copies on them, etc. That’s just the nature of data. Now we all know that you can have the same if not more redundancy in the outsourced model too; it is just hard to explain to them sometimes. I have to give them credit. Think about all the data in the world and how much of it we need to use every day. If folks like them didn’t demand that we techies keep it safe, the world might have a bad day. I know I would. I use tons of data every day (might be a fun blog).

If you decide to outsource dev/test or production, you have the ability to scale quickly and as needed. Not being bogged down by hardware lead times, accounts payable, the receiving dock, and all the other worries that come with buying hardware is a liberating feeling. I know what you are thinking. I have been over this side of it a few times, so I will just leave it at that, but the numbers and today’s technology make it all come together and make good business sense.

Outsource IT!

 
That’s Smart
Posted by William Francis on January 30th 2008

My grandmother used to say an ounce of prevention is worth a pound of cure. Usually this was her polite way of telling me to pick my skateboard up off the stairs before she stepped on it and broke her neck or to put a sheet of newspaper over her antique kitchen table before I began refueling my model airplane. All very sound advice looking back. And now here I find myself repeating the same adage some twenty years later in the context of predicting mechanical drive failure. An ounce of prevention is worth a pound of cure.

Hard disk drive manufacturers recognized both the reality and the advantages of being able to predict normal hard disk failures associated with drive degradation back in the 1990s. This led a number of leading hard disk makers to collaborate on a standard which eventually became known as SMART. This acronym stands for Self-Monitoring, Analysis and Reporting Technology, and when used properly it is a formidable weapon in any system administrator’s arsenal.

The basic concept is that firmware on the hard disk itself will record and report key “attributes” of that drive which when monitored and analyzed over time can be used to predict and avoid catastrophic hard disk failures. Anyone who has been around computers for more than a day knows the terrible feeling that manifests in the pit of your stomach when it becomes apparent that your server or workstation will not boot because the hard disk has cratered. Luckily, we ALL of course back up our hard drives daily! Right?

All kidding aside, even with a recent backup, just the task of restoring and getting your system back in working order is a serious hassle, and it’s not something you get the luxury of scheduling if the machine is critical to operations and failed in the middle of your work day or, worse yet, the middle of your beauty sleep. That is where SMART comes in. When properly used, SMART data can give “clues” that a drive is reaching a failure point before it actually fails. This in turn means you can schedule a drive cloning and replacement within your next regular maintenance window. Really, aside from a hard disk that lasts forever, what more could an administrator ask for?

SMART drive data has been described as a jigsaw puzzle. That’s because it takes monitoring a myriad of data points consistently over time to be able to put together a picture of your hard disk health. The idea is that an administrator regularly records and analyzes characteristics about the installed spinning media and looks for early warning signs that something is going wrong. While different drives have different data points, some of the key and most common attributes are:

  • head flying height
  • data throughput performance
  • spin-up time
  • re-allocated sector count
  • seek error rate
  • seek time performance
  • spin retry count
  • drive calibration retry count

These items are considered typical drive health indicators and should be base-lined at drive installation and then monitored for significant degradation. While the experts still disagree on the exact value of SMART data analysis, I have seen sources that claim at least 30% of drive failures can be detected some 60 days prior to the actual failure through the monitoring of SMART data.
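
As a rough illustration of the baseline-then-monitor idea, here is a minimal Python sketch. The attribute names, the sample readings, and the 10% tolerance are all made up for the example, and it assumes that a higher raw value is worse, which suits counters like re-allocated sectors and retry counts but not every attribute a drive reports. How you actually pull the numbers off the drive is left out on purpose.

```python
# Hypothetical attribute names and readings, for illustration only.
# Real drives report vendor-specific attributes and scales.

def degraded_attributes(baseline, current, tolerance=0.10):
    """Return attributes whose raw value grew more than `tolerance` over baseline.

    Assumes higher raw values are worse (true for error and retry counters,
    not for every SMART attribute).
    """
    flagged = {}
    for name, base in baseline.items():
        now = current.get(name)
        if now is None:
            continue  # attribute missing from this reading, nothing to compare
        allowed = base * (1 + tolerance)
        # A counter that started at 0 is flagged on any increase.
        if now > allowed:
            flagged[name] = (base, now)
    return flagged


# Readings recorded at drive installation (the baseline) and today.
baseline = {
    "reallocated_sector_count": 0,
    "spin_retry_count": 0,
    "calibration_retry_count": 2,
}
current = {
    "reallocated_sector_count": 12,
    "spin_retry_count": 0,
    "calibration_retry_count": 2,
}

flagged = degraded_attributes(baseline, current)
for name, (base, now) in flagged.items():
    print(f"{name}: baseline {base}, now {now}")
```

With these sample readings only the re-allocated sector count is flagged. In practice you would record a reading on a schedule and look at the trend over time rather than a single comparison.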

Of course not all drive failures can be predicted. Plus some failures are caused by factors other than drive degradation. Consider drives damaged by power surges or drives that are dropped in shipping as good examples of drive failures that cannot normally be detected through SMART monitoring. However in my humble opinion even one hard disk failure prevented over the course of my career is something to celebrate–unless you happen to own stock in McNeil Consumer Healthcare, a.k.a. the distributors of Tylenol!

So what does this have to do with SoftLayer? Well, I am certainly not claiming that SoftLayer is going to predict all your hard drive disasters and that you therefore have no reason to back up your data. In fact, I recommend not just backing it up but backing it up in geographically disparate locations (did I mention we have data centers in Dallas and Seattle?). What I do mean to share is that technologies like SMART data are just one of the many ways SoftLayer is currently investigating to improve what is already the best hosting company in the business.

I should know. I was tasked with writing the low-level software to extract this data. That’s right. SoftLayer has engineers working at the application layer, down at the device driver layer, and everywhere in between. If that doesn’t give you a warm fuzzy about your hosting company, I don’t know what will.

 
Backups
Posted by Sam Fleitman on October 31st 2007

“Ah - I don’t need backups.”
“Too busy to do backups - I’ll get to that later.”
“Backups? It costs too much.”
“I don’t need backups - MTBF of a Raptor is 1.2 Million hours.”
“Oops - I forgot about doing backups.”

Backups are one of the most commonly forgotten tasks of a system administrator. In some cases, they are never implemented. In other cases, they are implemented but not maintained. In other cases, they are implemented with a great backup and recovery plan - but the system usage or requirements change and the backups are not altered to compensate.

A hard drive really is a fairly reliable piece of IT equipment. The WD 150GB Raptor has a rating of 1.2 Million hours MTBF. With that kind of mean time between failures, you would think that you would never have to worry about a hard drive failing. How willing are you to take that chance? What if you double your odds by setting up two drives in a RAID 1 configuration? Now can you afford to take that chance? How willing are you to gamble with your data?

What if one of your system administrators accidentally deletes the wrong file? Maybe it’s your apache config file. Maybe it’s a piece of code you have been working on all day. Or, maybe your server gets compromised and you now have unknown trojans and back doors on your server. Now what do you do?

Working in a datacenter with thousands of servers, there are thousands and thousands of hard drives. When you see that many hard drives in production, you are naturally going to see some of them fail. I have seen small drives fail, large drives fail, and I have even seen RAID 1 mirrors completely fail beyond recovery. Is it bad hardware? Nope. Is it Murphy’s Law? Nope. It’s the laws of physics. Moving parts create heat and friction. Heat and friction cause failures. No piece of IT equipment is immune to failure.

That 1.2 million hours MTBF looks pretty impressive. For a round number, let’s say there are 15,000 drives in the SL datacenter. 1,200,000 hours / 15,000 drives = 80 hours. That means that, on average, one hard drive somewhere in the SL datacenter could be expected to fail every 80 hours. Now how impressive is that number?
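
The same arithmetic can be written out as a short Python sketch. It uses the numbers above and assumes, as a simplification, that every drive fails independently at the constant rate implied by its MTBF rating; real fleets rarely behave that neatly. The function names are just for this example.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def hours_between_fleet_failures(mtbf_hours, drive_count):
    """Average hours between failures somewhere in a fleet of identical drives."""
    return mtbf_hours / drive_count


def expected_failures_per_year(mtbf_hours, drive_count):
    """Average number of drive failures per year across the whole fleet."""
    return drive_count * HOURS_PER_YEAR / mtbf_hours


interval = hours_between_fleet_failures(1_200_000, 15_000)
per_year = expected_failures_per_year(1_200_000, 15_000)
print(f"{interval:.0f} hours between failures, about {per_year:.1f} failures per year")
```

Under those assumptions, 80 hours between failures works out to roughly 110 failed drives a year across 15,000 drives.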

Ultimately, regardless of the levels of redundancy you implement, there is always a chance of a failure - hardware or human - that results in data loss. The question is - how important is that data to you? In the event of a catastrophic failure, are you willing to just perform an OS reload and start from scratch? Or, if a file is deleted and unrecoverable, are you willing to start over on your project? And lastly, how much downtime can you afford to endure?

Regardless of how much redundancy you can build into your infrastructure with the likes of load balancers, RAID arrays, active/passive servers, hot spares, etc, you should always have a good plan for doing backups as well as checking and maintaining those backups.
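
One simple way to check a backup is to compare a checksum of the backup copy against the original. The Python sketch below does that with SHA-256; the file names are invented for the demo, and it only covers a single file that has not changed since it was backed up. A matching checksum tells you the copy is intact, not that you can actually restore from it, so a periodic test restore is still worth doing.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(source_path, backup_path):
    """True if the backup copy is byte-for-byte identical to the source."""
    return sha256_of(source_path) == sha256_of(backup_path)


# Demo with throwaway files; the names are made up for illustration.
with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "httpd.conf")
    good_copy = os.path.join(workdir, "httpd.conf.bak")
    bad_copy = os.path.join(workdir, "httpd.conf.truncated")

    with open(source, "wb") as handle:
        handle.write(b"Listen 80\nServerName example.test\n")
    with open(good_copy, "wb") as handle:
        handle.write(b"Listen 80\nServerName example.test\n")
    with open(bad_copy, "wb") as handle:
        handle.write(b"Listen 80\n")  # simulates a truncated backup

    good_ok = backup_matches(source, good_copy)
    bad_ok = backup_matches(source, bad_copy)

print("good copy matches:", good_ok)
print("truncated copy matches:", bad_ok)
```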

Have you checked your backups lately?

 










 
 
Copyright © SoftLayer Technologies, All Rights Reserved.