Reliability Analysis

FREE WEBINAR – September 2, 2015
Host: Ops A La Carte
Speaker: John Cooper, Senior Reliability Engineer and Instructor
Time: 12:00pm-1:00pm Pacific Time

Whether you are in a startup company pressed to meet a demanding schedule, or a reliability engineer offering services to a startup, it's important to understand how startup companies operate and what their needs are with regard to product reliability.

Many startups today are fast-paced, driven by young college graduates, and often funded through crowdfunding. They may be sponsored by one of several well-known venture capital incubators.

One of the challenges in modern reliability engineering is to help management and engineers understand the value and process of reliability engineering. In an electronics company, there may be challenges in getting people to understand tools such as HALT or FMEA. In a crowdfunded startup, the challenge is an order of magnitude more difficult: the risk of failure is much higher when you consider the consequences of products that fail soon after shipment.

A startup company is under pressure to meet goals such as the following:
1) Quicker: product development time must be shorter – time to market is much more critical.
2) Better: products must offer higher performance and better quality and reliability (as demonstrated by smartphones, for example).
3) Cheaper: products must have a lower cost of ownership, including warranty cost.

We will examine the special reliability needs of startup companies and consider which tools make the most sense – tools such as HALT, where results come in days, not weeks. In startups, decisions are made in real time and meetings are short; there is no "long term".

Join us in this webinar, and share your questions or concerns.

Sept 10, 2014: Free Webinar 12pm-1pm PDT

Using a Warranty Event Cost Model to Drive Warranty Cost Reduction Strategy
Speaker: Robert Mueller
Date: Wednesday, Sept 10, 2014 (12pm-1pm PDT)
Space is limited. Reserve your Webinar seat now at:

Many organizations have become excellent at tracking their products' field failures and at determining component reliability for current products from field repair data. These component reliability and product field failure rates now need to be converted to fully loaded warranty costs. Only when the cost of a component's failure is expressed in fully burdened dollars can the relative importance and priority of mitigating that failure mode best be evaluated. Put another way, when funding reliability improvement efforts, management listens most attentively to dollars, not rates. Developing a simple, easily applied warranty event cost model is a central capability of reliability improvement engineering programs.
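As an illustration, a minimal sketch of such an event cost model in Python. All cost figures and rates below are invented examples, and the burden categories (parts, labor, logistics, overhead) are a simplification of what a real organization would track:

```python
def warranty_event_cost(parts, labor_hours, labor_rate, logistics, overhead):
    """Fully loaded ('burdened') cost of one warranty event (example inputs)."""
    return parts + labor_hours * labor_rate + logistics + overhead

def annual_warranty_cost(units_in_field, annual_failure_rate, event_cost):
    """Expected yearly warranty spend for one failure mode."""
    return units_in_field * annual_failure_rate * event_cost

# Invented figures: a $12 part, 30 minutes of labor at $80/hr,
# $15 shipping/logistics, $8 administrative overhead.
event = warranty_event_cost(parts=12.0, labor_hours=0.5, labor_rate=80.0,
                            logistics=15.0, overhead=8.0)

# 100,000 units in the field failing at 2% per year for this mode.
yearly = annual_warranty_cost(units_in_field=100_000,
                              annual_failure_rate=0.02,
                              event_cost=event)
```

Expressed this way, two failure modes with similar rates can have very different dollar impacts, which is exactly the comparison management responds to.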

In this webinar you will learn about the warranty event model and how it easily transforms frequency of field failures into fully loaded warranty or service costs. You will learn how you can easily compare the fully loaded warranty costs of various component field failures and warranty event types. Further, you will learn how to apply the model to help prioritize reliability improvement initiatives based on both its impact on manufacturing costs and on its likely field/service cost reductions. Finally, the nasty little hidden secrets of software related warranty events will be exposed.

About the Speaker (Bob Mueller, MS, CQE): Bob is a product development professional with 30+ years of technical and management experience in software-intensive product development, as well as in R/D process and quality systems development, including extensive consulting experience with cross-functional product development teams and senior management. After receiving his M.S. in Physics in 1973, Bob joined Hewlett-Packard in Cupertino, CA, in IC process development. In the three decades before leaving HP, he held numerous positions in R/D, R/D management, and technical consulting, including management positions in the computer, analytical, and healthcare business units and in HP's internal engineering consulting organization. For the past decade Bob has focused on agile development methodologies and practices that drive software reliability, improved software support operations, warranty chain management, and customer satisfaction. Bob is a senior member of the ASQ and a Certified Quality Engineer (CQE). He is the chair of the training working groups for the Institute of Warranty Chain Management and has taught many courses at local colleges.

Aug 13: Free Webinar 12-1pm PDT
Register Here

Title: Matching Design FMEA Styles to Program Objectives and Personal Engineering Culture
Speaker: Adam Bahret
Date: Wednesday, August 13, 12-1pm PDT

There are many styles of Design Failure Mode and Effects Analysis (FMEA) practiced in industry today, and it can be difficult for an organization to know which best matches its program objectives and culture. Overcoming reluctance to participate (FMEAs are time consuming) and finding a way to fold the corrective actions into a fast-moving development program are also difficult; many FMEAs yield only a fraction of their potential benefit because of these challenges. This webinar will discuss the FMEA types and how to overcome these engagement and implementation challenges.

Physical acceleration means that operating a unit at higher stress levels (e.g., higher temperature, voltage, humidity, or duty cycle) should produce the same failures that would occur at typical-use stresses, except that they are expected to occur much sooner.

Failures may be due to mechanical fatigue, corrosion, chemical reaction, diffusion, migration, etc. These are the same causes of failure seen under normal stress conditions; the only difference is the time scale (the time to failure).

When there is true acceleration, changing stress is equivalent to transforming the time scale used to record when failures occur. The transformations commonly used are linear, which means that time-to-fail at high stress just has to be multiplied by a constant (the Acceleration Factor or AF) to obtain the equivalent time-to-fail at use stress.

For many engineers, this is where the biggest challenges arise: what is the preferred model? How do I fit the model to the conditions, materials, and physical attributes of the unit under test (UUT)?

Many knowledgeable users default to the Arrhenius equation, but it is not always the preferred model, and care must be taken to apply the proper one.
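To make the acceleration-factor idea concrete, here is a minimal sketch of the Arrhenius temperature model described above. The activation energy, temperatures, and stress duration are made-up example values, and real applications require justifying that the Arrhenius model applies to the failure mechanism at hand:

```python
import math

BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K

def arrhenius_af(ea_ev, t_use_c, t_stress_c):
    """Arrhenius acceleration factor between a use and a stress temperature.

    AF = exp( (Ea/k) * (1/T_use - 1/T_stress) ), temperatures in kelvin.
    """
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV) * (1.0 / t_use_k - 1.0 / t_stress_k))

# Illustrative values: Ea = 0.7 eV, 55 C use temperature, 125 C test temperature.
af = arrhenius_af(0.7, 55.0, 125.0)

# Linear time transformation: 1000 stress-hours are equivalent to
# AF * 1000 hours at the use condition.
equiv_hours = 1000.0 * af
```

Note how sensitive the result is to the assumed activation energy; this is one reason blindly applying Arrhenius can mislead.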

What are the experiences with other models?

What are the success stories?

Reliability Curves

Plot of Observed Failure Rate with Infant Mortality, Constant, and Wear Out Failure Rates

Mechanical failures can occur at any point in a product's life cycle and can be divided into infant mortality, constant failure rate, and wear-out failures. As shown in the diagram, a plot of failure rate as a function of time, the individual curves of these three classes of failure mechanisms sum together to form the classic bathtub curve of observed failure rate.
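The summing of the three curves can be sketched with Weibull hazard functions: shape below 1 gives a declining (infant mortality) rate, shape of 1 a constant rate, and shape above 1 an increasing (wear-out) rate. The shape and scale parameters here are illustrative choices, not values from the text:

```python
def weibull_hazard(t, beta, eta):
    """Weibull hazard rate h(t) = (beta/eta) * (t/eta)**(beta - 1)."""
    return (beta / eta) * (t / eta) ** (beta - 1)

def bathtub_hazard(t):
    """Observed failure rate: sum of infant-mortality, constant, and
    wear-out hazards (illustrative parameters)."""
    infant = weibull_hazard(t, beta=0.5, eta=200.0)      # declining
    constant = weibull_hazard(t, beta=1.0, eta=10000.0)  # flat
    wearout = weibull_hazard(t, beta=4.0, eta=5000.0)    # rising
    return infant + constant + wearout
```

Evaluating `bathtub_hazard` at early, middle, and late times reproduces the high-low-high shape of the bathtub curve.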


Infant Mortality

Infant mortality occurs early during product use. The failure rate declines as a function of time, so reliability actually increases until a point is reached where the constant failure rate becomes dominant and the infant rate becomes negligible. The Weibull Distribution is used to model the infant mortality period. A wear-in or burn-in period may be used to screen out defective units. Infant mortality is typically caused by defects in manufacturing, handling, and storage. Examples of these causes are:

  • workmanship and assembly errors
    • misalignment of belts, pulleys, gears, shafts and bearings
    • over-tightening stresses parts and causes excessive friction
    • under-tightening leaves parts loose, free to fall off or vibrate on shafts
  • parts are out of tolerance from design specifications
    • rubbing contact of moving parts
    • loose fits lead to vibration and galling
    • excess friction between mating, moving parts
  • excess flash from molding
    • similar problems to tolerance issues above
    • flash breaks off, contaminating system with debris
  • damaged parts from improper processing and handling
    • cracks from damage propagate under stress leading to failure
    • over-heating changes material properties
    • solvents and residues lead to stress corrosion cracking
    • environmental conditions cause swelling and warping

Constant Failure Rate

Most of the product life is spent in the random failure state, which has a constant failure rate. The reliability with a constant failure rate is predicted using the exponential function. Failures are caused by mechanisms inherent in the design.
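A minimal sketch of the exponential reliability calculation for the constant-failure-rate phase (the failure rate and mission time below are illustrative numbers):

```python
import math

def reliability_exponential(t_hours, failure_rate_per_hour):
    """R(t) = exp(-lambda * t) for a constant failure rate lambda."""
    return math.exp(-failure_rate_per_hour * t_hours)

# Illustrative: lambda = 2 failures per million hours,
# evaluated over one year of continuous operation (8760 hours).
lam = 2.0e-6
r_one_year = reliability_exponential(8760.0, lam)

# For a constant failure rate, MTBF is simply 1/lambda.
mtbf_hours = 1.0 / lam
```

The memoryless property of the exponential model is what justifies the statement that parts in this phase carry no memory of previous use.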

It should be noted that a preventative maintenance program during the constant failure rate phase can actually reduce reliability by reintroducing infant mortality into the system. The scheduled replacement of parts presents an opportunity for errors in workmanship as well as adding the possibility of failure of the parts themselves.

This has led to the current practice of Reliability Centered Maintenance. RCM determines the maintenance requirements of individual components to replace only those components which actually need replacing while monitoring the condition of all components which are prone to wear and eventual failure. Not all components in a system follow the bathtub curve. Reliability centered maintenance identifies the reliability curve for a component and provides an applicable maintenance strategy to match.

Wear Out

The wear-out phase precedes the end of the component or product life. At this point, the probability of failure increases with time. While the parts have no memory of previous use during the random failure phase, yielding a constant failure rate, when they enter wear out, the cumulative effects of previous use are expressed as a continually increasing failure rate. The normal distribution is often used to model wear out. Weibull may also be used to approximate this and every other period. Scheduled preventative maintenance of replacing parts entering the wear-out phase can improve reliability of the overall system.
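A sketch of the normal (Gaussian) time-to-failure model mentioned above, showing that its hazard rate keeps rising through the wear-out region. The mean life and standard deviation are invented example values:

```python
import math

def normal_hazard(t, mu, sigma):
    """Hazard rate h(t) = f(t) / R(t) for a normal time-to-failure model.

    f is the normal density and R the survival function, computed
    here from math.erfc so no external libraries are needed.
    """
    z = (t - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    sf = 0.5 * math.erfc(z / math.sqrt(2.0))
    return pdf / sf

# Illustrative wear-out model: mean life 1000 hours, sigma 100 hours.
# The hazard increases monotonically as cumulative damage accrues.
early = normal_hazard(900.0, 1000.0, 100.0)
mid = normal_hazard(1000.0, 1000.0, 100.0)
late = normal_hazard(1100.0, 1000.0, 100.0)
```

The steadily increasing hazard is what motivates the scheduled replacement of parts entering wear-out.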

Examples of failure mechanisms in wear out are:

  • Fatigue – constant cycle of stress wears out material
  • Corrosion – steady loss of material over time leads to failure
  • Wear – material loss and deformation, especially loss of protective coatings
  • Thermal cycling – not only fatigue, but change in chemical properties, alloyed metals can migrate to grain boundaries, changing properties
  • Radiation – Ultraviolet, X-ray, nuclear bombardment in environment changes molecular structure of materials

The FMEA method is well known today; many guides, articles, and standards have been written about it. Much less has been written about the links between FMEAs and other processes. Are these links important?

To build a really strong FMEA approach we should consider the links (outputs and inputs) between FMEAs. In most cases we develop and produce products within a supply chain: our product is part of an upper-level system or customer application, and it consists of components co-developed or produced by our suppliers. We sit between the customer domain and the supplier domain, and each domain has its own FMEA (the application FMEA and the supplier FMEAs). The key point is to create interfaces between the domain FMEAs. When we do not attend to these interfaces, we develop "our product," not the "customer's product," and we can fail in the customer application. There are many real examples where a product failed because the voice of the customer was missing.

We should keep this approach in mind and transfer the Voice of the Customer to our suppliers. It is important to set up an FMEA communication platform between all supply chain entities, with the same risk evaluation criteria (Severity, Occurrence, Detection). The objective of this approach is the identification of all risks and their failure cause/failure effect chains, from the supplier through us to the customer. It can give us a complete view of what happens if a parameter of a supplier component fails in our domain, and what the failure effect will be on the customer application.






Another aspect of a more robust FMEA approach is the links between FMEAs and other company processes within our own domain. The benefit of such an approach is a view of FMEAs from various functional perspectives. The following items describe how these processes can empower FMEAs, and how FMEAs can empower other processes.




Requirement Management – to understand what the customer really needs, how the application works, and what the failure effects and their severity on the customer application can be. This is the basic input for an FMEA.

Quality Planning – FMEA is part of the quality planning process, and other quality tools and methods depend on it, such as the control plan, measurement system analysis, process capability analysis, and verification and validation planning.

Risk Management – FMEA is a source of product and process risks, which have to be evaluated from other risk perspectives such as financial effect, project timing, product portfolio, and technology roadmap.

Supplier Management – FMEA is a good communication platform for discussing component failures and their effects on the customer's system. The customer learns from suppliers, and the suppliers learn from the customer.

Continual Improvement – FMEA is a good source for defining potential product or process improvement projects based on the highest risks.

Reliability Engineering – FMEA is an integral part of product reliability analysis. It helps engineers understand failure mechanisms and is a good source for reliability test planning and post-test failure analysis.

Change Management – when any change in a process or product is planned, it should be analyzed with the support of FMEAs.

Problem Solving – FMEA can be a good reference for a team to learn from past failures. New failure events should be added to the FMEA.

It is quite a complex task to manage all these links. But when we think about them, our FMEAs can bring us more interesting results than before: FMEA will no longer be a separate method, but an integral part of our company processes. There are tools available to help a company manage all these links.


Many people have heard of, or are familiar with various reliability prediction methods like MIL-HDBK-217, Telcordia SR332, etc.  These standardized handbook methods have widespread use in industry.  They are primarily applicable when making the assumption that the component failure rate is constant (at the bottom portion of the bathtub curve) and are thus generally applicable to most electronic components.  However, caution should be taken when using these prediction methods because there may be components for which this assumption is not correct including some electronic parts like electrolytic capacitors.  There are some handbooks that deal with mechanical parts but they also generally view the failure rates as constant for the time period of interest.

A client I worked with recently had a reliability goal and wanted a reliability prediction to verify that the goal was achievable. As their component parts list was reviewed, it became obvious that they had numerous parts subject to mechanical wear, such as an LCD touch screen and cable connectors. For the electronic parts the goal could be achieved, but the components subject to wear also had to be evaluated and integrated into the analysis.

It then becomes necessary to deal individually with the components that will experience wear and to determine whether they are apt to wear out within the reliability goal period of interest (or product lifetime). If it can be shown that wear-out occurs beyond the expected life of the product, then there is no problem. This determination can be made through testing or other analysis methods. If the component is likely to wear out within the expected product life, then decisions must be made about a maintenance strategy and the potential impact on warranty.

What has been your experience in performing predictions when you have components that can wear out?

The adoption of the functional safety standards IEC 61508 and ISO 26262 in the European Union breathed new life into the slowly fading activity of reliability prediction. Both reliability prediction and reliability demonstration are now key parts of many product development programs; however, despite their similar-sounding names, the two have little in common, including the results they generate.

While reliability prediction is an analytical activity, often based on a mathematical combination of the reliabilities of the parts or components comprising the system, reliability demonstration is based on product testing and is statistically driven by the test sample size. Therefore the results obtained can differ drastically. For example, a predicted system failure rate of 30 FIT (30 failures per billion (10^9) hours) would correspond to a 10-year reliability of 99.87% (assuming 12 hours of operation per day). In order to demonstrate this kind of reliability with 50% confidence (50% confidence is considered low in most industries), one would need to successfully test 533 parts (based on the binomial distribution) to the equivalent of 10 years of field life. Needless to say, a test sample of this size is prohibitive in most industries. For example, in automotive electronics a test sample size of 23 is quite common, which roughly corresponds to 97% reliability with 50% confidence.
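The arithmetic in this example can be reproduced with the standard zero-failure (success-run) binomial formula. This is a sketch under the stated assumptions (constant failure rate, 12 hours/day operation), not the author's exact calculation:

```python
import math

FIT = 1.0e-9  # one FIT = one failure per billion hours

def mission_reliability(fit_rate, hours):
    """R(t) = exp(-lambda * t), assuming a constant failure rate."""
    return math.exp(-fit_rate * FIT * hours)

def zero_failure_sample_size(reliability, confidence):
    """Success-run formula: smallest n with 1 - reliability**n >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(reliability))

def demonstrated_reliability(n_samples, confidence):
    """Reliability demonstrable with n zero-failure samples at given confidence."""
    return (1.0 - confidence) ** (1.0 / n_samples)

hours_10yr = 10 * 365 * 12                        # 10 years at 12 hours/day
r_pred = mission_reliability(30.0, hours_10yr)    # ~0.9987 for 30 FIT
n_demo = zero_failure_sample_size(0.9987, 0.50)   # parts needed at 50% confidence
r_23 = demonstrated_reliability(23, 0.50)         # what 23 samples can support
```

Running the sketch reproduces the gap described in the text: the predicted 99.87% would demand on the order of 533 zero-failure test units, while a 23-unit test supports only about 97%.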

The natural question is: how do you reconcile the numbers obtained from reliability prediction with the numbers you can support as part of reliability demonstration?

The answer is: I don’t believe that you can.

You can make an argument that reliability demonstration produces lower-bound estimates. Additionally, the test often addresses higher-percentile severity users, so the demonstrated reliability for the whole product population will likely be higher. However, in most cases the gap will remain too wide to close. This is something that reliability engineers, design teams, and, most importantly, customers need to be aware of and be able to deal with as part of the product development reality.

What does the audience think? We’d love to hear your opinions on this.

Andre Kleyner

Robust Design & Reliability

I delivered a webinar recently describing the differences and similarities between robust design (RD) activities and reliability engineering (RE) activities in hardware product development. A survey of several hundred attendees indicated a diversity of opinions: about half the participants indicated they did not differentiate at all between the two methodologies, approximately 20% indicated they did differentiate between them, and about 30% indicated that they did not know.

I was quite surprised at the result, especially since the participants included working quality engineers, reliability engineers, engineering directors, system engineers, and others. Somewhere along the way, the differences and similarities between the two seem to have become muddled. Below I have collected just twelve of the many ways in which the activities differ:


RD1: Focus on design transfer functions and ideal function development.

RE1: Focus on design dysfunction: failure modes, failure times, mechanisms of failure.

RD2: Engineering focus; empirical models, generic models, statistics.

RE2: Mechanistic understanding; physical models, science-oriented approach.

RD3: Optimization of input-output functions, with a verification testing requirement.

RE3: Characterization of natural phenomena, with root cause analysis and countermeasure decisions.

RD4: Orthogonal array testing, design of experiments planning.

RE4: Life tests, accelerated life tests, highly accelerated tests, accelerated degradation tests, survival methods.

RD5: Multitudes of control, noise, and signal factor combinations for reducing sensitivity to noise and amplifying sensitivity to signal.

RE5: Single-factor testing, some multifactor testing; fixed design with noise factors, acceleration factors.

RD6: Actively change design parameters to improve insensitivity to noise factors and sensitivity to signal factors.

RE6: Design-Build-Test-Fix cycles for reliability growth.

RD7: Failure inspection only with verification testing of improved functions.

RE7: Design out failure mechanisms, reduce variation in product strength, and reduce the effect of usage/environment.

RD8: Synergy with axiomatic design methodology, including ideal design and simpler design.

RE8: Simplify design complexity for reliability improvement; reuse reliable hardware.

RD9: Hierarchy of quantitative design limits, including functional limits, spec limits, control limits, adjustment limits.

RE9: Identify and increase design margins; HALT and HASS testing to flush out design weaknesses. Temperature and vibration stressors predominate.

RD10: Measurement system and response selection are paramount.

RE10: Time-to-failure quantitative measurements supported by analytic methods.

RD11: Ideal function development for energy-related measures.

RE11: Fitting distributions to stochastic failure time data; time compression by stress application.

RD12: Compound noise factors as the largest stress; reduce variability to noise factors by exploiting interactions between noise and control factors, and between signal and noise factors.

RE12: HALT and HASS highly accelerated testing to reveal design vulnerabilities and expand margins; root cause exploration and mitigation.


There are many other differences, of course, but this list should start the conversation. I invite readers to submit their own opinions and lists of differences (and similarities).

Louis LaVallee

Sr. Reliability Consultant

Ops a la Carte



"How Reliable Is Your Product: 50 Ways to Improve Product Reliability" just celebrated its 1 year anniversary of being published.

We are pleased to announce the Mandarin translation of the book.

You can view the first three chapters at 50 Ways to Improve Reliability – in Mandarin (提高产品可靠性的50种方法). It will be available as an ebook in January 2012.