Availability measures

Allowable lack of service for a given percentage availability, assuming the service is required continuously: 24 hours a day, 7 days a week, 365 days a year.

Percentage    Per Year    Per Week    Per Day
99            87.6h       1.68h       14.4m
99.9          8.76h       10m         86s
99.99         52m         1m          8s
99.999        5m          6s          <1s

Things are even tighter if you consider only the working week. I assume this is 42 hours (a quarter of a week, or 6 hours a day), as that makes the maths easier.

Percentage    Per Year    Per Week    Per Day
99            21.9h       25m         3.6m
99.9          2.2h        2.6m        21s
99.99         13m         15s         2s
99.999        75s         <1s         <1s
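
If you want to reproduce these figures, the arithmetic is just the unavailable fraction multiplied by the length of the period. Here is a minimal Python sketch that prints both tables in seconds; it assumes 8760 hours a year for the 24x7 case and, for the working-time case, a 42-hour week, a 6-hour day and 52 working weeks a year, so the odd figure may round slightly differently from the tables above.

    def downtime_seconds(availability_percent, period_hours):
        """Allowable downtime, in seconds, for a period of the given length."""
        unavailable_fraction = 1.0 - availability_percent / 100.0
        return period_hours * 3600.0 * unavailable_fraction

    # 24x7: 8760 hours/year, 168 hours/week, 24 hours/day.
    # Working time: 42 hours/week, 6 hours/day, 52 weeks assumed per year.
    PERIODS = {
        "24x7":         {"per year": 8760,    "per week": 168, "per day": 24},
        "working time": {"per year": 52 * 42, "per week": 42,  "per day": 6},
    }

    for label, periods in PERIODS.items():
        print(label)
        for pct in (99, 99.9, 99.99, 99.999):
            budgets = ", ".join(f"{name}: {downtime_seconds(pct, hours):,.0f}s"
                                for name, hours in periods.items())
            print(f"  {pct}%  {budgets}")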

The advantage of the working-time figures is that they might be easier to keep to: you have the other three-quarters of the week in which downtime doesn't count against you, so it may be possible to limp through the day and repair the system out of hours.

Commentary

Note that getting 99.999% availability is hard. It essentially precludes any maintenance work, and even if you set up failover it's almost impossible to complete a failover within the time allowed. Think how long a reboot takes! Or how long spanning tree takes to converge.

I regard 99.99% availability as achievable, with careful administration, sound architectural design, and full control of external influences. And that's without special hardware or software. It does require protected power and some level of redundant disk.

In my experience, most organizations would be ecstatic with 99.9%. Many would be lucky to get 99%.

It's also about expectations, and I suspect this is the cause of a number of problems with targets being set both too low and too high. Experience of an unreliable client sets user expectations very low indeed, so that continual service downtime is often tolerated when it simply shouldn't be. The trouble is that moderate reliability isn't good enough either: most users I know get used to systems being up, and reset their expectations unreasonably high.

Also remember that if you have a service dependent on a number of systems or components, you have to combine the availability figures of the individual parts to get the availability of the whole. This makes your overall availability drop very quickly.
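
As a rough sketch (assuming every component is required and they fail independently), the figures simply multiply, which shows how quickly the combined figure falls:

    def combined_availability(availabilities):
        """Overall availability of a service that needs every component."""
        total = 1.0
        for a in availabilities:
            total *= a
        return total

    # Five components, each at 99.9%, give only about 99.5% overall.
    print(f"{combined_availability([0.999] * 5):.4%}")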

Reliability

Remember that availability and reliability aren't the same! Availability is a measure of the fraction of the time that a system is available. Reliability is a measure of the time between individual failures. An unreliable system might have high availability if the downtime associated with each failure is small; likewise a reliable system might have low availability if failures have large amounts of associated downtime.
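
One common way to relate the two is the steady-state formula availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to repair. A quick sketch of the two cases just described; the numbers are made up purely for illustration:

    def availability(mtbf_hours, mttr_hours):
        """Steady-state availability from mean time between failures and mean time to repair."""
        return mtbf_hours / (mtbf_hours + mttr_hours)

    # An unreliable system that fails weekly but recovers in a minute...
    print(f"{availability(mtbf_hours=168, mttr_hours=1 / 60):.4%}")   # about 99.99%
    # ...versus a reliable system that fails yearly but takes a day to repair.
    print(f"{availability(mtbf_hours=8760, mttr_hours=24):.4%}")      # about 99.73%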

(I suspect that users prefer little failures. They don't mind an excuse to take a short break or have a cup of coffee. But they get very frustrated if downtime reaches any significant period, even if it doesn't happen very often.)

