Hardwhere?
Posted by Brad Lewis on February 27th 2008

It’s a fact — all software ends up relying on a piece of hardware at some point. And hardware can fail. But the secret is to create redundancy to minimize the impact if hardware does fail.
RAID arrays, load balancers, redundant power supplies, cloud computing - the list goes on. And we support them all. Many of these options are not mandatory, but I wish they were! That’s where the customer comes in – it is critical to understand the value of the application and data sitting on the hardware and set a redundancy and recovery plan that fits.
Keep your DATA safe:

  • RAID - For starters, *everyone* should have RAID 1, 5, or 10. This keeps your server online in the event of a drive failure.

The best approach – RAID 10 all the way. You get the benefits of a RAID 0 (striping across 2 drives so you get the data almost twice as fast) and the security of RAID 1 (mirroring data on 2 separate drives) all rolled into one. I think every server should have this as a default.
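To put rough numbers on the RAID levels mentioned above, here is a small Python sketch of usable capacity and how many drive failures each level is guaranteed to survive. It assumes identical drives, the textbook definitions of each level, and two-way mirrors for RAID 10. The function name and the 150 GB drive size are purely illustrative and not taken from any provisioning tool.

```python
def raid_summary(level, drives, drive_size_gb):
    """Return (usable_gb, failures_always_survived) for identical drives.

    Assumptions (illustrative only): textbook RAID definitions, and
    RAID 10 built from two-way mirrored pairs.
    """
    if level == 1:
        if drives < 2:
            raise ValueError("RAID 1 needs at least 2 drives")
        # Every drive holds a full copy of the data.
        return drive_size_gb, drives - 1
    if level == 5:
        if drives < 3:
            raise ValueError("RAID 5 needs at least 3 drives")
        # One drive's worth of space goes to parity.
        return (drives - 1) * drive_size_gb, 1
    if level == 10:
        if drives < 4 or drives % 2 != 0:
            raise ValueError("RAID 10 needs an even number of drives, at least 4")
        # Half the drives mirror the other half; a second failure is only
        # survivable if it lands in a different mirrored pair.
        return (drives // 2) * drive_size_gb, 1
    raise ValueError(f"unsupported RAID level: {level}")


for level, drives in [(1, 2), (5, 4), (10, 4)]:
    usable, survived = raid_summary(level, drives, 150)
    print(f"RAID {level} with {drives} x 150 GB: {usable} GB usable, "
          f"survives at least {survived} drive failure(s)")
```

Note the trade-off this makes visible: with four drives, RAID 5 gives more usable space than RAID 10, while RAID 10 buys the striping-plus-mirroring behavior described above.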

  • Separate Backups – EVault Backup, iSCSI Storage, FTP/NAS Storage, your own NAS server or just a different server. Lose data just once (or have the ability to recover it painlessly) and these will pay for themselves. Remember, hardware failure is not the only way you can lose data – hackers, software failures, and human error will always be a risk.

StorageLayer. Use it or lose it.

Going further:

  • Redundant servers in different locations – spread your servers out across different datacenters and use a load balancer. Nothing is safer than a duplicate server 1000’s of miles away. That’s why we have invested in a second data center – to keep your data and business safe.

Check ‘em out in our Services > Network Services section.

The future:

  • Solid state drives – aww yeah baby. They are coming.

Solid state drives are just that – a drive with no moving parts. No more platters or read/write heads. I mean come on, hard drives are essentially using the same basics that old record players use. CDs use this technology too. And you see where those went (can you say iPod? I prefer my iPod touch. I have never had an iPod until now so I skipped right to the new fancy pants model. Can you tell I just got it?).
Check out these comparison tests of solid state drives vs. conventional ones:

  • Faster, faster, faster! – Processors, memory, drives, network – everything is getting much faster. And part of that speed comes from redundancy (dual and quad core processors, dual and quad processor motherboards). See? Redundancy is the way of the future!

We have 4 Intel Xeon quad-core Tigertown processors on one motherboard. That’s 16 cores on one server! Shazam!

  • Robot DC patrol sharks – yep. Got the plans on my desk right now. But I can’t take all the credit; Josh R. suggested this one. I just make things happen.

I work to keep all of our hardware running in tip top condition. But I look at the bigger picture when it comes to hardware – how to completely eliminate the impact of any hardware issue. That’s why I suggest all the redundancies listed above. While I can reduce the probability of hardware issues with testing, monitoring of firmware updates, proper handling procedures, choosing quality components, etc., redundancy is the ultimate way to make hardware invisible.
Hardwhere?, if you will.

 
Spares at the ready
Posted by Sam Fleitman on February 11th 2008

In Steve’s last post he talked about the logic of outsourcing. The rationale included the cost of redundant internet connections, the cost of the server, UPS, small AC, etc. He covers a lot of good reasons to get the server out of the broom closet and into a real datacenter. However, I would like to add one more often-overlooked component to that argument: the Spares Kit.

Let’s say that you do purchase your own server and you set it up in the broom closet (or a real datacenter for that matter) and you get the necessary power, cooling and internet connectivity for it. What about spare parts?

If you lose a hard drive on that server, do you have a spare one available for replacement? Maybe so - that’s a common part with mechanical features that is liable to fail - so you might have that covered. Not only do you have a spare drive, the server is configured with some level of RAID so you’re probably well covered there.

What if that RAID card fails? It happens - and it happens with all different brands of cards.

What about RAM? Do you keep a spare RAM DIMM handy or if you see failures on one stick, do you just plan to remove it and run with less RAM until you can get more on site? The application might run slower because it’s memory starved or because now your memory is not interleaved - but that might be a risk you are willing to take.

How about a power supply? Do you keep an extra one of those handy? Maybe you keep a spare. Or, you have dual power supplies. Are those power supplies plugged into separate power strips on separate circuits backed up by separate UPSs?

What if the NIC on the motherboard gets flaky or goes out completely? Do you keep a spare motherboard handy?

If you rely on out-of-band management of your server via an IPMI, Lights Out or DRAC card - what happens if that card goes bad while you’re on vacation?

Even if you have all the necessary spare parts for your server, or you have multiple servers in a load-balanced configuration inside the broom closet, what happens if you lose your switch or your load balancer or your router or your… What happens if that little AC you purchased shuts down on Friday night and the broom closet heats up all weekend until the server overheats? Do you have temperature sensors in the closet that are configured to send you an alert - so that now you have to drive back to the office to empty the water pail of the spot cooler?

You might think that some of these scenarios are a bit far-fetched but I can certainly assure you that they’re not. At SoftLayer, we have spares of everything. We maintain hundreds of servers in inventory at all times, we maintain a completely stocked inventory room full of critical components, and we staff it all 24/7 and back it all up with a 4-hour SLA.

Some people do have all of their bases covered. Some people are willing to take a chance. But even if you convince your employer that it’s OK to take those chances, how do you think the boss will respond when something actually happens and critical services are offline?

 
Backups
Posted by Sam Fleitman on October 31st 2007

“Ah - I don’t need backups.”
“Too busy to do backups - I’ll get to that later.”
“Backups? It costs too much.”
“I don’t need backups - MTBF of a Raptor is 1.2 Million hours.”
“Oops - I forgot about doing backups.”

Backups are one of the most commonly forgotten tasks of a system administrator. In some cases, they are never implemented. In other cases, they are implemented but not maintained. In still other cases, they are implemented with a great backup and recovery plan - but the system usage or requirements change and the backups are not altered to compensate.

A hard drive really is a fairly reliable piece of IT equipment. The WD 150GB Raptor has a rating of 1.2 Million hours MTBF. With that kind of mean time between failures, you would think that you would never have to worry about a hard drive failing. How willing are you to take that chance? What if you double your odds by setting up two drives in a RAID 1 configuration? Now can you afford to take that chance? How willing are you to gamble with your data?
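For a rough feel for what a 1.2 million hour MTBF means for a single drive, the Python sketch below converts it into a probability of failure within one year. It assumes a constant failure rate (the exponential model). For the mirror it also assumes the two drives fail independently and that a dead drive is never replaced. Those are simplifying assumptions for illustration, not vendor figures, and the function names are made up for this example.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def annual_failure_probability(mtbf_hours):
    """Chance that one drive fails within a year, assuming a constant failure rate."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def mirror_loss_probability(mtbf_hours):
    """Chance that both drives of a RAID 1 pair fail within the same year.

    Assumes independent failures and no replacement of the first failed
    drive; correlated failures (heat, power, same batch) break this assumption.
    """
    single = annual_failure_probability(mtbf_hours)
    return single * single


single = annual_failure_probability(1_200_000)
mirror = mirror_loss_probability(1_200_000)
print(f"single drive: {single:.4%} per year")
print(f"RAID 1 pair:  {mirror:.6%} per year")
```

Under these assumptions a single drive comes out well under one percent per year, and the mirror far lower. Small is still not zero, and neither number says anything about deleted files or compromised servers.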

What if one of your system administrators accidentally deletes the wrong file? Maybe it’s your Apache config file. Maybe it’s a piece of code you have been working on all day. Or, maybe your server gets compromised and you now have unknown trojans and back doors on your server. Now what do you do?

In a datacenter with thousands of servers, there are thousands and thousands of hard drives. When you see that many hard drives in production, you are naturally going to see some of them fail. I have seen small drives fail, large drives fail, and I have even seen RAID 1 mirrors completely fail beyond recovery. Is it bad hardware? Nope. Is it Murphy’s Law? Nope. It’s the laws of physics. Moving parts create heat and friction. Heat and friction cause failures. No piece of IT equipment is immune to failure.

That 1.2 million hours MTBF looks pretty impressive. For a round number, let’s say there are 15,000 drives in the SL datacenter. 1,200,000 hours / 15,000 drives = 80 hours. That means that, on average, you can expect one hard drive somewhere in the SL datacenter to fail about every 80 hours. Now how impressive is that number?
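The same arithmetic as a short Python sketch, in case you want to plug in your own fleet size. It treats MTBF as an average over many identical, independently failing drives, which is a simplification. The function names are hypothetical, and the 15,000 drive count is the round number used above, not an inventory figure.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def hours_between_fleet_failures(mtbf_hours, drive_count):
    """Average time between drive failures somewhere in the fleet."""
    return mtbf_hours / drive_count


def expected_failures_per_year(mtbf_hours, drive_count):
    """Average number of drive failures per year across the fleet."""
    return drive_count * HOURS_PER_YEAR / mtbf_hours


print(hours_between_fleet_failures(1_200_000, 15_000))  # 80.0
print(expected_failures_per_year(1_200_000, 15_000))    # 109.5
```

Put another way, a fleet that size should expect a failed drive every three to four days, or a little over a hundred a year.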

Ultimately, regardless of the levels of redundancy you implement, there is always a chance of a failure - hardware or human - that results in data loss. The question is - how important is that data to you? In the event of a catastrophic failure, are you willing to just perform an OS reload and start from scratch? Or, if a file is deleted and unrecoverable, are you willing to start over on your project? And lastly, how much downtime can you afford to endure?

Regardless of how much redundancy you can build into your infrastructure with the likes of load balancers, RAID arrays, active/passive servers, hot spares, etc., you should always have a good plan for doing backups as well as checking and maintaining those backups.
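As one way to "check" a backup, here is a minimal Python sketch that compares every file under a source directory with its copy in a backup directory using SHA-256 checksums. It assumes the backup is a plain mirrored directory tree that you can read directly. It is not how EVault or any other backup product verifies data, and the function names are made up for this example.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_backup_problems(source_dir, backup_dir):
    """List source files that are missing from, or differ in, the backup copy."""
    source_dir, backup_dir = Path(source_dir), Path(backup_dir)
    problems = []
    for source_file in sorted(source_dir.rglob("*")):
        if not source_file.is_file():
            continue
        relative = source_file.relative_to(source_dir)
        backup_file = backup_dir / relative
        if not backup_file.is_file():
            problems.append(f"missing: {relative.as_posix()}")
        elif sha256_of(source_file) != sha256_of(backup_file):
            problems.append(f"differs: {relative.as_posix()}")
    return problems
```

A "differs" result on a file that changed after the last backup ran is expected, so a check like this is most useful right after a backup completes. The real test of any backup is still an actual restore.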

Have you checked your backups lately?

 










 
 
Copyright © SoftLayer Technologies, All Rights Reserved.