Customer Calls Reliability 360; Addressing Future Challenges @ the Speed of Light
The Misconception of Reliability and Availability
After a 10-year reliability journey that has taken me across the appliance, power generation, aerospace, component placement, electronics design, wireless, and finally cable industries, I would like to share my astonishing discovery with friends and foes alike. My focus will be on the cable industry, where my recent study reveals some very interesting things.
My dear friends, I wish I could tell you the result was positive or even somewhat encouraging. For starters, there seems to be a HUGE misunderstanding of reliability versus availability. This confusion has become the Achilles’ heel of many industries, including the cable industry. Thus far, studies show the cable industry spends tens of billions of dollars yearly to maintain an availability of 99.98% or less. The question is why. I would like to explore this uncharted territory with you via a little study I conducted for nearly a year in this industry.
Let’s start with a graph I call the “P. Menard Comparative plot” of the Reliability and Availability for cable’s hybrid fiber-coax (HFC) network. Readers familiar with HFC network operation know this is the most vital sector for a Multiple System Operator (MSO).
It is the pipe that aggregates the downstream and upstream services transmitted back and forth between the customers and the head-ends (HE). This is truly the workhorse of the cable industry, and by far the most strenuous and costly part to operate. Without further delay, let’s take a look at the two curves below:
The graph above illustrates the underlying cost of operation of the HFC plant. In this N+6 configuration analysis servicing between 750 and 1000 potential customers, one can see the huge difference between the system’s availability and its reliability. This simulation was based purely on field data collected on the component reliability of the network elements (amplifiers, LEs, taps, coaxial cable, optical fiber, optical electronics, etc.) to validate the performance of the HFC network. The blue line depicts the availability performance, and the red line shows the overall system reliability curve. To say the two curves are diverging is an understatement. To everyone’s surprise, we end up with a system reliability around 60% the first year and around 10% the fifth year, while the availability remains constant around 98.96%.
What does 60% first-year and 10% fifth-year reliability mean? To explain this, I would like to use a little example. Suppose you are a maintenance manager running an HFC plant with 100 nodes. You should expect about 40 of those nodes to have failed by the end of the first year and about 90 by the end of the fifth year of operation (all requiring a truck roll). Any system with a 90% failure rate within its first 5 years of operation is a BAD SYSTEM.
To my amazement, many involved in the development and the maintenance of HFC networks find this quite typical. The term used is “the cost of doing business.” The focus is definitely on the availability of the network, not the reliability. HFC technicians are extremely proficient at swapping equipment and restoring service, primarily because they have to do so regularly; it is an accepted evil. These truck rolls are extremely costly and tie up key resources in firefighting mode, so innovation and continuous improvement suffer. The bottom line is that network operation costs are escalating, increasing 10-20% yearly.
This high cost of ownership is expensive to the cable operator and the customers; no wonder my cable bill is SO HIGH.
Without getting too technical, I would like to draw the reader’s attention to the specific difference between system reliability and system availability.
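The 100-node arithmetic above can be checked in a few lines. The sketch below is only an illustration, not the simulation behind the plot: it treats the reliability value as the fraction of nodes that survive without a failure, and the function name `nodes_with_failure` is hypothetical. The second half assumes a constant failure rate (an exponential model), which is an assumption added here and not something the field data is known to confirm.

```python
import math

def nodes_with_failure(node_count, reliability):
    """Expected number of nodes that have failed at least once."""
    return node_count * (1.0 - reliability)

NODES = 100
year_1 = nodes_with_failure(NODES, 0.60)
year_5 = nodes_with_failure(NODES, 0.10)
print(f"Failed by end of year 1: {year_1:.0f} of {NODES}")
print(f"Failed by end of year 5: {year_5:.0f} of {NODES}")

# Assumption: constant failure rate, so R(t) = exp(-rate * t).
rate_per_year = -math.log(0.60)
reliability_year_5 = math.exp(-rate_per_year * 5)
print(f"Implied failure rate: {rate_per_year:.2f} per node-year")
print(f"Projected 5-year reliability: {reliability_year_5:.1%}")
```

Under the constant-rate assumption, the 60% first-year figure projects to roughly 8% at year five, which is in the same range as the 10% read from the curve.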
First of all, let’s define Reliability.
A consideration of reliability is the backbone to any good business strategy and prospect for growth. Reliability is the probability that an item will perform its intended functions without failure for a specified interval under stated conditions.
There are several key words in the above definition that require clarification. The first is the word, probability. Probability is a ratio, and in reliability it is the number of successes divided by the number of attempts. This means it is a numeric value, derived from an exact calculation. It is not based on opinion, speculation, “seat of the pants”, rule of thumb, etc. The latter are only guesses or hopes, and business cannot be based on such elusive measures.
The second important term is “performs its intended functions”. This suggests that the functions an item, element or system needs to perform have been identified and agreed upon. Many times, an item is deemed unreliable even though it performs the functions that have been identified for it. The problem is that not all the functions needing to be performed are identified. For example, let’s say a part has a certain mass that dampens the vibration of an assembly. A decision is made to reduce the mass of the item in order to reduce its cost. If the function of dampening vibration is not identified, then the change may go through: the item by itself performs its intended function, but the assembly may fail due to increased vibration.
The third important term in the definition of reliability is “without failure”. This implies a failure has been defined. In some circumstances, this may be self-evident (smoked board, won’t turn on, etc.). In others, a certain amount of degraded performance over time may be acceptable. In the case of an amplifier or line extender (LE), the amount of gain may degrade over time, but as long as the customer does not experience picture degradation, color issues, etc., it may not be considered a failure. Defining the threshold, that is, the amount of degradation or drift that is acceptable, is sometimes difficult but very important.
The fourth item is “for a specified interval”. Again, this is not an ambiguous statement like “a long time,” but is a specific number and should be in units of measure relevant to the part. A specified interval of five years does not mean much to a part. The specified interval must be translated into relevant terms applicable to the part, element or system such as: hours of activation, hours of operation, number of cycles, etc., for it to be meaningful.
The last term to be considered is “under stated conditions.” This means that the environment the item, element or system operates in must be completely defined. Temperatures, temperature cycles, pressures, pressure cycles, corrosives and contaminants (e.g. household cleansers), maintenance items, and vibration must all be defined for an item to be robust in all operating conditions. This is the requirement most misunderstood by 99.99% of so-called reliability engineers. The engineer needs to use the worst-case temperature, both cold and hot, as the gauge for temperature stress factors, and must also account for elevation, vibration, dust, etc. Through my study, I found 99% of the vendors omit the temperature factor from their reliability calculation by setting the pi-T factor = 1: “BAD PRACTICE.” In addition, an amplifier designed for, let’s say, Georgia’s temperatures will not operate properly in places like Arizona or Nevada and other extreme environments.
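To show why setting the temperature factor to 1 hides real stress, here is a minimal sketch of an Arrhenius-type temperature factor. Everything numeric in it is an assumption for illustration: the 0.7 eV activation energy, the 40 °C reference, and the two operating temperatures are placeholders, not measured values for any amplifier or location. Handbook prediction methods define their own part-specific temperature factors, so this is the general shape of the calculation rather than any vendor’s method.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann constant in eV/K

def temperature_factor(temp_c, ref_temp_c=40.0, activation_energy_ev=0.7):
    """Arrhenius failure-rate multiplier relative to the reference temperature."""
    temp_k = temp_c + 273.15
    ref_k = ref_temp_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / ref_k - 1.0 / temp_k)
    return math.exp(exponent)

# Hypothetical worst-case internal temperatures for two deployments.
for label, temp_c in [("milder site", 55.0), ("hot desert site", 75.0)]:
    print(f"{label}: {temp_c:.0f} C -> factor {temperature_factor(temp_c):.1f}")
```

A factor of 1 corresponds only to operation at the reference temperature; any hotter worst case raises the predicted failure rate, and the increase grows quickly with temperature.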
When developing a technical requirement for an item, element or system, all five terms in the definition of reliability must be addressed. The reliability paragraph in a specification file should:
1. Call out a probability; for example, 0.95
2. Define all functions of the item, element or system. (It could refer to a different paragraph in the specification file where the functional requirements are already stated).
3. Define what a failure is and what is not; for example, failure to operate when commanded to, or greater than 20% change in resistance.
4. Define the specified interval or mission duration; for example, 1,000 hours energized or 900,000 cycles (note: you must then adequately define a cycle).
5. Define the stated conditions; for example, 50 degrees Celsius energized and 25 degrees Celsius when not energized.
All of the terms above are necessary for a thorough reliability specification requirement.
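The five elements above can be captured as a small record so that a missing item is caught early. This is only a sketch: the class name `ReliabilityRequirement`, its field names, and the example values are hypothetical, loosely borrowed from the examples in the list.

```python
from dataclasses import dataclass

@dataclass
class ReliabilityRequirement:
    probability: float        # 1. required probability of success, e.g. 0.95
    functions: list           # 2. functions the item must perform
    failure_definition: str   # 3. what counts as a failure
    interval: float           # 4. mission duration
    interval_unit: str        #    unit relevant to the part (hours, cycles)
    conditions: str           # 5. stated operating conditions

    def missing_elements(self):
        """Return the names of elements that are absent or invalid."""
        problems = []
        if not 0.0 < self.probability <= 1.0:
            problems.append("probability")
        if not self.functions:
            problems.append("functions")
        if not self.failure_definition.strip():
            problems.append("failure_definition")
        if self.interval <= 0 or not self.interval_unit.strip():
            problems.append("interval")
        if not self.conditions.strip():
            problems.append("conditions")
        return problems

requirement = ReliabilityRequirement(
    probability=0.95,
    functions=["operate when commanded"],
    failure_definition="greater than 20% change in resistance",
    interval=1000,
    interval_unit="hours energized",
    conditions="",  # deliberately left blank to show the check
)
print(requirement.missing_elements())
```

Here the check reports `conditions` as missing, which is exactly the element the article says is most often overlooked.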
To illustrate the process stated above, I would like to review the performance of two very well-known IP switches. The first one is a graph of an IP switch considered best in its class.
This graph illustrates the performance of a very well designed and reliable IP switch. In layman’s terms, this device is so reliable that it will experience nearly zero failures over 5 years of operation. Although this device has superior reliability, the sales team for this product finds it difficult to sell to the cable industry. Why? The answer is simple: the people in charge make their decisions based on immediate rewards (lowest cost and vendor relations) without taking into account the COST OF OWNERSHIP and COST OF OPERATION.
Now, let’s analyze a similar IP switching product line that offers a better initial price but exhibits nearly 4 times higher COST OF OWNERSHIP and COST OF OPERATION than the prior.
It is clear that this product line is not comparable to the first product analyzed. At year 5, its survival rate is about 38%, compared to the nearly 98% survival rate of the first product. However, to my amazement, the less reliable product is the leading IP switching choice for decision makers looking to make an immediate impact on their short-term strategy without regard to long-term business needs. This mentality needs to shift quickly in order to hold original equipment manufacturers (OEMs) accountable for their poor performance. Consumers must demand quality and reliability.
Now, let’s turn our examination to that of Availability by answering these 2 questions:
1) What is Availability?
2) Where does the responsibility for availability lie?
What is Availability?
My definition for availability is as follows: Availability is the probability that an item, element or system is good and ready to go when needed. For example, I expect my car to start under all conditions (hot or cold, wet or dry) and to take me back and forth to my routine destinations day in and day out. When I take it to the mechanic for 2 to 3 hours of routine maintenance (oil and tire changes, fluid checks, etc.), my car is not available (so we have to shave a little bit off the 99.999% availability metric set forth by the manufacturers). That downtime is not a reliability hit, however, since routine maintenance is scheduled downtime called for by the manufacturer. On the other hand, if I have a transmission problem, both availability and reliability take a hit: it is a failure, not planned maintenance, and my car is unavailable to me during the failure and its repair.
Many complex systems that are prone to failure include multiple schemes of redundancy (standby, dual processing, etc.), but this can be expensive. To examine such systems, a technique called Markov modeling can be used to simulate the overall system availability.
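As a rough illustration of the Markov approach, the sketch below solves the steady state of a three-state model for two identical units in active redundancy. The assumptions are added here for the example: constant failure and repair rates, independent repair of each unit, and service that stays up as long as one unit works. The 10,000-hour MTBF and 4-hour repair time are placeholders.

```python
def redundant_pair_availability(mtbf_hours, repair_hours):
    """Steady-state availability of two parallel units via a birth-death Markov chain.

    States count failed units: 0, 1, or 2. The system is down only in state 2.
    """
    failure_rate = 1.0 / mtbf_hours
    repair_rate = 1.0 / repair_hours

    # Balance equations, expressed relative to the state-0 probability.
    weight_0 = 1.0
    # 0 -> 1 at 2 * failure_rate, 1 -> 0 at repair_rate
    weight_1 = weight_0 * (2 * failure_rate) / repair_rate
    # 1 -> 2 at failure_rate, 2 -> 1 at 2 * repair_rate
    weight_2 = weight_1 * failure_rate / (2 * repair_rate)
    total = weight_0 + weight_1 + weight_2

    return 1.0 - weight_2 / total

single = 10_000 / (10_000 + 4)
pair = redundant_pair_availability(mtbf_hours=10_000, repair_hours=4)
print(f"Single unit availability:    {single:.6f}")
print(f"Redundant pair availability: {pair:.9f}")
```

With independent repair, the result matches the familiar shortcut of one minus the square of a single unit’s unavailability. The Markov formulation earns its keep when that independence does not hold, for example with a single repair crew or a standby unit.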
In its simplest form, availability is a function of the mean time between failures (MTBF) and the mean time to repair (written here as MTBR; more commonly abbreviated MTTR), as illustrated in the simple equation below:
A (t) = MTBF / (MTBF + MTBR)
Note: For A (t) to be large, MTBF must be large relative to MTBR. The time between failures must therefore be long, and the time to repair must be short.
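A quick way to make the equation concrete is to convert availability into downtime per year. The sketch below uses the equation as written; the 2,000-hour MTBF and 4-hour repair time are hypothetical inputs, and a year is taken as 8,760 hours.

```python
HOURS_PER_YEAR = 8760

def availability(mtbf_hours, repair_hours):
    """Availability from mean time between failures and the repair-time term."""
    return mtbf_hours / (mtbf_hours + repair_hours)

def annual_downtime_hours(avail):
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR

# Hypothetical unit: fails every 2,000 hours on average, takes 4 hours to restore.
print(f"A = {availability(2000, 4):.4%}")

# Availability figures quoted in this article.
for percent in (98.96, 99.98, 99.999):
    hours = annual_downtime_hours(percent / 100)
    print(f"{percent}% -> {hours:.2f} hours of downtime per year")
```

On those figures, 98.96% corresponds to about 91 hours of downtime a year, while 99.999% allows roughly five minutes.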
Where does the responsibility for Availability lie?
Availability is a shared metric that needs careful attention from all parties involved (vendors, engineering, network operations, and field installation). In a great organization that values continuous improvement, here’s how the process goes: The engineering team works with the vendors to translate their system requirements into technical specifications. Upon clear agreement, the vendor develops and delivers robust and reliable products that meet the targeted MTBF (use my guideline above for defining reliability requirements). The operations team works with engineering throughout the design, testing and validation phases to identify failure characteristics, troubleshooting guides, and corrective actions in order to minimize overall system downtime, thus shortening the MTBR to support the customers’ and contractual targets (Availability = 99.999%). The field team coordinates the builds with the operations and engineering teams per the vendor’s recommendations to reduce infant mortality and strenuous operation.
As you can see, availability is a subset of reliability; it requires careful input from all parties and clearly defined contractual agreements up front to reduce strenuous operation and to build long-term value for both the customer and the business. Without this clear focus, system operation runs day in and day out in firefighting mode, which is costly and time consuming. Firefighting ties up key resources in non-productive activities and hurts the company’s bottom line.
In closing:
We are all witnesses to the downfall of the big three automobile industry moguls and the greatest financial tsunami of all time due to a lack of due-diligence and risk assessment. Risk calculations are not easy, but necessary. The skills required are not something you can get over-the-counter, via a book, or some clever tool. This is an extremely disciplined skill that requires years of crafting by connecting the dots and staying abreast of this fast changing world. For me, a reliability engineer is an engineer on steroids; meaning that you have to be able to reverse engineer a particular design while finding ways to improve it based on other core principles [material properties, design for six sigma (DFSS), design for reliability (DFR), LEAN, etc.].
Stale ideology, rigid and outdated guidelines, bureaucracy, the good old boy system, and yesteryear’s glory days will not propel your company forward. I believe we are at a crossroads in America where everyone has to play their part. Toyota handily destroyed 3 pillars of American industry by sticking to a long-term strategy and focusing on core fundamentals. If I’m not mistaken, they consider long term to be 100 years. Today’s American companies focus quarter to quarter and have very vague visions and missions. If making money is your mission, I can pretty much guarantee your company will not be around very long.
Philimar Menard is the Chief Technology Officer of Q&R Consulting Firm Inc., which specializes in design for six sigma (DFSS), design for reliability (DFR), continuous improvement, and Lean 6Sigma (LSS). He helps companies propel into the next generation by offering sound solutions to availability and reliability problems that guarantee continuity of operation and engineering designs done right the first time.
Contact him at philimar.menard@qrcfi.com.