|
|
|
|
|
|
|
Posts Tagged ‘hardware’
By Brad Lewis on Wednesday, February 27th, 2008
It’s a fact — all software ends up relying on a piece of hardware at some point. And hardware can fail. But the secret is to create redundancy to minimize the impact if hardware does fail.
RAIDS, load balancers, redundant power supplies, cloud computing – the list goes on. And we support them all. Many of these options are not mandatory, but I wish they were! That’s where the customer comes in – it is critical to understand the value of the application and data sitting on the hardware and set a redundancy and recovery plan that fits.
Keep your DATA safe:
- RAID – For starters *everyone* should have a RAID 1, 5, or 10. This keeps your server online in the event of a drive failure.
The best approach – RAID 10 all the way. You get the benefits of a RAID 0 (striping across 2 drives so you get the data almost twice as fast) and the security of RAID 1 (mirroring data on 2 separate drives) all rolled into one. I think every server should have this as a default.
- Separate Backups – EVault Backup, ISCSI Storage, FTP/NAS Storage, your own NAS server or just a different server. Lose data just once (or have the ability to recover it painlessly) and these will pay for themselves. Remember, hardware is not the only way in which you can lose data -– hackers, software failures, and human error will always be a risk.
StorageLayer. Use it or lose it.
Going further:
- Redundant servers in different locations – spread your servers out across different datacenters and use a load balancer. Nothing is safer than a duplicate server 1000’s of miles away. That’s why we have invested in a second data center – to keep your data and business safe.
Check ‘em out in our Services > Network Services section.
The future:
- Solid state drives – aww yeah baby. They are coming.
Solid state drives are just that – a drive with no moving parts. No more platters or read/write heads. I mean come on, hard drives are essentially using the same basics that old record players use. CD’s use this technology too. And you see where those went (can you say iPod? I prefer my iPod touch. I have never had an iPod until now so I skipped right to the new fancy pants model. Can you tell I just got it?).
Check out these comparison tests of solid state drives vs. conventional ones:
- Faster, faster, faster! –- Processors, memory, drives, network — everything is getting much faster. And in part by redundancy (dual and quad core processors, dual and quad processor motherboards). See? Redundancy is the way of the future!
We have 4 Intel Xeon Quadcore Tigertown processors on one motherboard. That’s 16 processors on one server! Shazam!
- Robot DC patrol sharks – yep. Got the plans on my desk right now. But I can’t take all the credit, Josh R. suggested this one, I just make things happen.
I work to keep all of our hardware running in tip top condition. But I look at the bigger picture when it comes to hardware – how to completely eliminate the impact of any hardware issue. That’s why I suggest all the redundancies listed above. While I can reduce the probability of hardware issues with testing, monitoring of firmware updates, proper handling procedures, choosing quality components, etc., redundancy is the ultimate solution to invisible hardware.
Hardwhere?, if you will.
Tags: backups, hardware, innovation, ISCSI, RAID, redundancy, solid state, Technology Posted in News | 1 Comment »
By Sam Fleitman on Monday, February 11th, 2008
In Steve’s last post he talked about the logic of outsourcing. The rationale included the cost of redundant internet connections, the cost of the server, UPS, small AC, etc. He covers a lot of good reasons to get the server out of the broom closet and into a real datacenter. However, I would like to add one more often over looked component to that argument: the Spares Kit.
Let’s say that you do purchase your own server and you set it up in the broom closet (or a real datacenter for that matter) and you get the necessary power, cooling and internet connectivity for it. What about spare parts?
If you lose a hard drive on that server, do you have a spare one available for replacement? Maybe so – that’s a common part with mechanical features that is liable to fail – so you might have that covered. Not only do you have a spare drive, the server is configured with some level of RAID so you’re probably well covered there.
What if that RAID card fails? It happens – and it happens with all different brands of cards.
What about RAM? Do you keep a spare RAM DIMM handy or if you see failures on one stick, do you just plan to remove it and run with less RAM until you can get more on site? The application might run slower because it’s memory starved or because now your memory is not interleaved – but that might be a risk you are willing to take.
How about a power supply? Do you keep an extra one of those handy? Maybe you keep a spare. Or, you have dual power supplies. Are those power supplies plugged into separate power strips on separate circuits backed up by separate UPSs?
What if the NIC on the motherboard gets flaky or goes out completely? Do you keep a spare motherboard handy?
If you rely on out of band management of your server via an IPMI, Lights Out or DRAC card – what happens if that card goes bad while you’re on vacation?
Even if you have all necessary spare parts for your server or you have multiple servers in a load balanced configuration inside the broom closet; what happens if you lose your switch or your load balancer or your router or your… What happens if that little AC you purchased shuts down on Friday night and the broom closet heats up all weekend until the server overheats? Do you have temperature sensors in the closet that are configured to send you an alert – so that now you have to drive back to the office to empty the water pail of the spot cooler?
You might think that some of these scenarios are a bit far fetched but I can certainly assure you that they’re not. At SoftLayer, we have spares of everything. We maintain hundreds of servers in inventory at all times, we maintain a completely stocked inventory room full of critical components, and we staff it all 24/7 and back it all up with a 4 hour SLA.
Some people do have all of their bases covered. Some people are willing to take a chance, and even if you convince your employer that it’s ok to take those chances, how do you think the boss will respond when something actually happens and critical services are offline?
Tags: backups, costs, datacenter, failures, hardware, memory, redundancy, spares Posted in News | Comments Off
By William Francis on Wednesday, January 30th, 2008
My grandmother used to say an ounce of prevention is worth a pound of cure. Usually this was her polite way of telling me to pick my skateboard up off the stairs before she stepped on it and broke her neck or to put a sheet of newspaper over her antique kitchen table before I began refueling my model airplane. All very sound advice looking back. And now here I find myself repeating the same adage some twenty years later in the context of predicting mechanical drive failure. An ounce of prevention is worth a pound of cure.
Hard disk drive manufacturers recognized both the reality and the advantages of being able to predict normal hard disk failures associated with drive degradation sometime around 2003. This led a number of leading hard disk makers to collaborate on a standard which eventually became known as SMART. This acronym stands for Self-Monitoring, Analysis and Reporting Technology and when used properly is a formidable weapon in any system administrator’s arsenal.
The basic concept is that firmware on the hard disk itself will record and report key “attributes” of that drive which when monitored and analyzed over time can be used to predict and avoid catastrophic hard disk failures. Anyone who has been around computers for more than a day knows the terrible feeling that manifests in the pit of your stomach when it becomes apparent that your server or workstation will not boot because the hard disk has cratered. Luckily, we ALL of course back up our hard drives daily! Right?
All kidding aside even with a recent back up just the task of restoring and getting your system back in working order is a serious hassle and it’s not something you get the luxury of scheduling if the machine is critical to operations and failed in the middle of your work day or worse yet, the middle of your beauty sleep. That is where SMART comes in. When properly used SMART data can give “clues” that a drive is reaching a failure point–prior to it failing. This in turns means you can schedule a drive cloning and replacement within your next regular maintenance window. Really aside from a hard disk that lasts forever what more could an administrator ask for?
SMART drive data has been described as a jigsaw puzzle. That’s because it takes monitoring a myriad of data points consistently over time to be able to put together a picture of your hard disk health. The idea is that an administrator regularly records and analyzes characteristics about the installed spinning media and looks for early warning signs that something is going wrong. While different drives have different data points, some of the key and most common attributes are:
- head flying height
- data throughput performance
- spin-up time
- re-allocated sector count
- seek error rate
- seek time performance
- spin try recount
- drive calibration retry count
These items are considered typical drive health indicators and should be base-lined at drive installation and then monitored for significant degradation. While the experts still disagree on the exact value of SMART data analysis, I have seen sources that claim at least 30% of drive failures can be detected some 60 days prior to the actual failure through the monitoring of SMART data.
Of course not all drive failures can be predicted. Plus some failures are caused by factors other than drive degradation. Consider drives damaged by power surges or drives that are dropped in shipping as good examples of drive failures that cannot normally be detected through SMART monitoring. However in my humble opinion even one hard disk failure prevented over the course of my career is something to celebrate–unless you happen to own stock in McNeil Consumer Healthcare, a.k.a. the distributors of Tylenol!
So what does this have to do with SoftLayer? Well I am certainly not claiming that SoftLayer is going to predict all your hard drive disasters so there is no reason for you to back up your data. In fact, I recommend not just backing it up but backing it up in geographically disparate locations (did I mention we have data centers in Dallas and Seattle?). What I do mean to share is that technologies like SMART data are just one of the many ways SoftLayer is currently investigating to improve what is already the best hosting company in the business.
I should know. I was tasked with writing the low-level software to extract this data. That’s right. SoftLayer has engineers working at the application layer, down at the device driver layer, and everywhere in between. If that doesn’t give you a warm fuzzy about your hosting company, I don’t know what will.
Tags: application, drive failures, drivers, fault tolerance, hardware, OSI, prevention Posted in Business, backups | Comments Off
By Sam Fleitman on Wednesday, October 31st, 2007
“ah – I don’t need backups.”
“Too busy to do backups – I’ll get to that later.”
“Backups? It costs too much.”
“I don’t need backups – MTBF of a Raptor is 1.2 Million hours.”
“Oops – I forgot about doing backups.”
Backups are one of the most commonly forgotten tasks of a system administrator. In some cases, they are never implemented. In other cases, they are implemented but not maintained. In other cases, they are implemented with a great backup and recovery plan – but the system usage or requirements change and the backups are not altered to compensate.
A hard drive really is a fairly reliable piece of IT equipment. The WD 150GB Raptor has a rating of 1.2 Million hours MTBF. With that kind of mean time between failures, you would think that you would never have to worry about a hard drive failing. How willing are you to take that chance? What if you double your odds by setting up two drives in a RAID 1 configuration? Now can you afford to take that chance? How willing are you to gamble with your data?
What if one of your system administrators accidentally deletes the wrong file? Maybe it’s your apache config file. Maybe it’s a piece of code you have been working on all day. Or, maybe your server gets compromised and you now have unknown trojans and back doors on your server. Now what do you do?
Working in a datacenter with thousands of servers, there are thousands and thousands of hard drives. When you see that many hard drives in production, you are naturally going to see some of them fail. I have seen small drives fail, large drives fail, and I have even seen RAID 1 mirrors completely fail beyond recovery. Is it bad hardware? Nope. Is it Murphy’s Law? Nope. It’s the laws of physics. Moving parts create heat and friction. Heat and friction cause failures. No piece of IT equipment is immune to failure.
That 1.2 million hours MTBF looks pretty impressive. For a round number, let’s say there are 15,000 drives in the SL datacenter. 1,200,000 hours / 15,000 drives = 80 hours. That means that every 80 hours, one hard drive in the SL datacenter could potentially fail. Now how impressive is that number?
Ultimately, regardless of the levels of redundancy you implement, there is always a chance of a failure – hardware or human – that results in data loss. The question is – how important is that data to you? In the event of a catastrophic failure, are you willing to just perform an OS reload and start from scratch? Or, if a file is deleted and unrecoverable, are you willing to start over on your project? And lastly, how much downtime can you afford to endure?
Regardless of how much redundancy you can build into your infrastructure with the likes of load balancers, RAID arrays, active/passive servers, hot spares, etc, you should always have a good plan for doing backups as well as checking and maintaining those backups.
Have you checked your backups lately?
Tags: backups, evault, hardware, IT, mtbf, nas, physics, SoftLayer Posted in Technology, backups | Comments Off
|
|
|
|
|
|
|
|
|
|
|
|
|
|