Aron Rolnitzky

Many people are familiar with standardized reliability prediction handbook methods such as MIL-HDBK-217, Telcordia SR332, etc., which are widely used in industry.  These methods assume that the component failure rate is constant (the bottom portion of the bathtub curve) and are thus generally applicable to most electronic components.  However, caution should be taken when using these prediction methods because there are components for which this assumption does not hold, including some electronic parts such as electrolytic capacitors.  There are some handbooks that deal with mechanical parts, but they also generally treat failure rates as constant over the time period of interest.
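
To illustrate how these handbook methods roll up part failure rates under the constant failure rate (exponential) assumption, here is a minimal parts-count style sketch in Python.  The part types, quantities, and FIT values are purely hypothetical and are not taken from any handbook.

```python
import math

# Minimal parts-count sketch under the constant failure-rate assumption.
# All FIT values and quantities below are illustrative, not handbook data.
parts_fit = {            # failures per 10^9 hours (FIT)
    "microcontroller": 25.0,
    "ceramic_capacitor": 1.5,
    "resistor": 0.5,
    "connector": 10.0,
}
quantities = {"microcontroller": 1, "ceramic_capacitor": 40,
              "resistor": 120, "connector": 4}

# Series assumption: the system failure rate is the sum of part failure rates.
lambda_system = sum(parts_fit[p] * quantities[p] for p in parts_fit) / 1e9  # per hour
mtbf_hours = 1.0 / lambda_system

mission_hours = 5 * 8760          # e.g., a 5-year reliability goal
reliability = math.exp(-lambda_system * mission_hours)

print(f"System failure rate: {lambda_system:.3e} /h")
print(f"MTBF: {mtbf_hours:,.0f} h")
print(f"R({mission_hours} h) = {reliability:.4f}")
```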

I recently worked with a client who had a reliability goal to achieve and wanted a reliability prediction to verify that the goal was achievable.  As their component parts list was reviewed, it became obvious that they had numerous parts subject to mechanical wear, such as an LCD touch screen, cable connectors, etc.  For the electronic parts the goal could be achieved, but the components subject to wear also had to be evaluated and integrated into the analysis.

Such components must then be addressed individually to determine whether they are likely to wear out within the reliability goal period of interest (or product lifetime).  If it can be shown that wear-out occurs beyond the expected life of the product, then there is no problem.  This determination can be made through testing or other analysis methods.  If the component is likely to wear out within the expected product life, then decisions must be made regarding a maintenance strategy and the potential impact on warranty.
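
One simple way to make this determination for a wear-out part is to model it with a two-parameter Weibull distribution and ask how likely it is to fail within the product life.  Below is a minimal sketch assuming hypothetical shape and characteristic-life values; in practice these would come from test data or supplier information.

```python
import math

# Check whether a wear-out part is likely to fail within the product life,
# assuming a two-parameter Weibull model with hypothetical parameters.
beta = 2.5             # beta > 1 indicates a wear-out (increasing hazard) mechanism
eta_hours = 60_000     # characteristic life (63.2% of units failed by this age)

product_life_hours = 5 * 8760   # e.g., a 5-year expected product life

# Weibull CDF: probability of failure by the end of product life.
prob_wear_out = 1.0 - math.exp(-(product_life_hours / eta_hours) ** beta)

print(f"P(wear-out within product life) = {prob_wear_out:.2%}")
if prob_wear_out > 0.05:   # arbitrary illustrative threshold
    print("Wear-out likely within life: plan a maintenance/warranty strategy.")
else:
    print("Wear-out expected beyond product life: no action needed.")
```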

What has been your experience in performing predictions when you have components that can wear out?

When performing various reliability tasks, non-repairable systems or products are treated differently from repairable systems or products.  Some of the tools that are used for one type are not applicable to the other.   Obviously, at some level, repairable systems are composed of non-repairable parts.   Examples of non-repairable systems would be “one-shot” devices like light bulbs or more complex devices like pacemakers.  Examples of repairable systems are computers, automobiles, and airplanes.

 

What is unique about repairable systems?  Availability becomes a key measure of importance.  In simple terms, availability is the percentage of time that the product or system is able to perform its required functions.  When the required functions cannot be performed because a failure has occurred, the system must be repaired to restore the functionality.  This is where another measure, maintainability, impacts the system availability.  The faster the system can be repaired, the greater the availability to the customer.  For systems that require high reliability or availability, redundancy can improve the design.  However, repairable systems will benefit significantly more than non-repairable systems when using redundancy.
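
To make the relationship concrete, the usual steady-state (inherent) availability calculation is simply MTBF divided by MTBF plus MTTR.  A small sketch with illustrative numbers shows how cutting the repair time raises availability without changing the reliability at all.

```python
# Steady-state (inherent) availability: A = MTBF / (MTBF + MTTR).
# The numbers below are illustrative only.
mtbf_hours = 2_000.0   # mean time between failures
mttr_hours = 4.0       # mean time to repair (a maintainability measure)

availability = mtbf_hours / (mtbf_hours + mttr_hours)
print(f"Availability = {availability:.4%}")   # about 99.80%

# Halving the repair time improves availability with no change to MTBF.
print(f"With MTTR = {mttr_hours / 2} h: "
      f"{mtbf_hours / (mtbf_hours + mttr_hours / 2):.4%}")
```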

 

Common metrics used in measuring system types are shown in the table below.

METRIC          | NON-REPAIRABLE                             | REPAIRABLE
----------------|--------------------------------------------|-------------------------------------------------
Time to Failure | MTTF, Time to First Failure, Hazard Rate   | MTBF, Time to First Failure, ROCOF/Failure Rate
Probability     | Reliability                                | Availability (Reliability)
Maintainability | N/A                                        | Maintainability, Downtime
Warranty        | Product replacement within warranty period | Part/product replacement within warranty period

The table below compares some additional areas of non-repairable systems and repairable systems.

NON-REPAIRABLE                                                                                      | REPAIRABLE
----------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------
Discarded (recycled?) upon failure                                                                  | Restored to operating condition without replacing the entire system
Lifetime is a random variable described by a single time to failure                                 | Lifetime is the age of the system or the total hours of operation
Group of systems: lifetimes assumed independent and identically distributed (from the same population) | Random variables of interest are the times between failures and the number of failures at a particular age
Failure rate is the hazard rate of a lifetime distribution – a property of a time to failure        | Failure rate is the rate of occurrence of failures (ROCOF) – a property of a sequence of failure times

 

Reliability modeling is usually more complex for repairable systems.  Often, methods like Markov models (chains) are required to adequately model repairable systems, as opposed to the simple series block diagram methods used for non-repairable systems.
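
As an illustration of what a Markov model looks like, here is a minimal sketch of a two-state (up/down) chain for a single repairable unit, plus a three-state chain for an active 1-out-of-2 redundant pair.  The failure and repair rates are illustrative, and independent repair of each failed unit is assumed.

```python
import numpy as np

# Illustrative constant failure and repair rates.
lam = 1.0 / 2000.0    # failure rate (per hour)
mu = 1.0 / 4.0        # repair rate (per hour)

# Single repairable unit, two states: up and down.
# Steady-state balance: P_up * lam = P_down * mu, with P_up + P_down = 1.
p_up = mu / (lam + mu)
print(f"Steady-state availability (single unit): {p_up:.5f}")

# Active 1-out-of-2 redundant pair, assuming each failed unit is repaired
# independently.  States: 0 = both up, 1 = one up, 2 = both down.
Q = np.array([
    [-2 * lam,        2 * lam,      0.0],
    [      mu,  -(lam + mu),        lam],
    [     0.0,        2 * mu,  -2 * mu],
])

# Solve pi @ Q = 0 with sum(pi) = 1 (stationary distribution).
A = np.vstack([Q.T, np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(f"Steady-state availability (1-out-of-2): {pi[0] + pi[1]:.7f}")
```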

In the area of monitoring or analysis, the following table compares methods for both types of systems.

METHOD                            | NON-REPAIRABLE                                 | REPAIRABLE
----------------------------------|------------------------------------------------|------------------------------------
Weibull                           | Useful method (single failure modes only)      | Not used at system level
Reliability Growth (Duane, AMSAA) | Usually not used                               | Used during development testing
Mean Cumulative Function (MCF)    | Usually not used                               | Useful method (non-parametric)
Event Series (Point Processes)    | HPP (for random, constant average rate events) | NHPP (parametric method) – complex
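
To illustrate the MCF row in the table above, here is a minimal non-parametric sketch: pool the repair ages from a small fleet and let the mean cumulative function step up by 1/n at each event.  The fleet data are made up, and every system is assumed to be observed over the same interval (no staggered censoring).

```python
import numpy as np

# Illustrative repair ages (hours) for a small fleet of repairable systems.
fleet_failure_ages = [
    [150, 800, 2100],        # system 1 repair events
    [400, 1900],             # system 2
    [90, 650, 1200, 2600],   # system 3
]
n_systems = len(fleet_failure_ages)

# Pool and sort all event ages; the MCF steps up by 1/n at each failure.
ages = np.sort(np.concatenate(fleet_failure_ages))
mcf = np.arange(1, len(ages) + 1) / n_systems

for t, m in zip(ages, mcf):
    print(f"age {t:6.0f} h  ->  MCF = {m:.2f} failures per system")
# A roughly linear MCF suggests a constant ROCOF; upward curvature
# suggests the fleet is deteriorating (wear-out).
```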

 

It is important to understand the type of system being designed and to use the appropriate reliability methods and tools for that system.  This may require some research, but it is important to use the correct methods so that the results are not misleading.

What has been your experience in doing analysis of repairable systems compared to non-repairable systems?

FMEA is a great tool used in many quality, reliability, and risk analysis processes.  It is not a highly sophisticated tool and is certainly not technically complex.  As a reliability tool, the FMEA is extremely effective at identifying the risks of greatest concern and thus focusing design and test activities to eliminate those risks or reduce them to tolerable levels.

Even though there is software available to assist in performing the FMEA, a spreadsheet is often adequate.  Getting the proper team together with the patience to conscientiously fill out the spreadsheet is often a more difficult task.

A typical FMEA process for a design FMEA might be composed of the following steps:

•  Step 1: Review the Process/Design

•  Step 2: Brainstorm potential failure modes

•  Step 3: List potential effects of each failure mode

•  Step 4: Assign a severity rating for each effect

•  Step 5: Assign an occurrence rating for each failure mode

•  Step 6: Assign a detection rating for modes/effects

•  Step 7: Calculate the risk priority numbers (see the sketch after this list)

•  Step 8: Prioritize the failure modes for action

•  Step 9: Take action to eliminate/reduce high-risk items

•  Step 10: Calculate the resulting RPN
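
As a sketch of Steps 7 and 8, the RPN is simply the product of the severity, occurrence, and detection ratings (each typically on a 1–10 scale), and the failure modes are then ranked by RPN.  The failure modes and ratings below are hypothetical.

```python
# Minimal sketch of Steps 7-8: compute Risk Priority Numbers and rank the
# failure modes.  The entries below are hypothetical, not from a real FMEA.
fmea_rows = [
    # (failure mode,               severity, occurrence, detection)
    ("Connector corrosion",               7,          4,         6),
    ("LCD backlight degradation",         5,          6,         3),
    ("Solder joint fatigue",              8,          3,         7),
]

# RPN = Severity x Occurrence x Detection (each rated 1-10).
ranked = sorted(
    ((mode, s * o * d) for mode, s, o, d in fmea_rows),
    key=lambda item: item[1],
    reverse=True,
)
for mode, rpn in ranked:
    print(f"{mode:<30} RPN = {rpn}")
```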

I believe that most of these steps are quite easy to perform, but one that seems to cause a great deal of confusion is Step 6: assigning a detection rating.  To assign a detection rating, the probability of detecting a failure before the effect is realized must be determined.  So, what does that mean?  I have seen a number of different explanations of what “detection” means for an FMEA.  Does it mean detecting a potential failure prior to shipment?  Does it mean detecting that a failure is imminent but prior to its occurrence in the customer use environment (a type of prevention)?  Does it mean detecting the failure after it occurs but prior to it impacting the customer?  Or does it simply mean detecting that a failure has occurred?

Here are some opinions found in an internet search:

  • First, an engineer should look at the current controls of the system that prevent failure modes from occurring or that detect the failure before it reaches the customer. The engineer should then identify testing, analysis, monitoring, and other techniques that can be or have been used on similar systems to detect failures. From these controls an engineer can learn how likely it is for a failure to be identified or detected.
  • The Design Control Detection then allows us to describe how we will test this design and the confidence we have that this test would find any potential failure mode(s) about which we are concerned.
  • Identify process or product related controls for each failure mode and then assign a detection ranking to each control. Detection rankings evaluate the current process controls in place.
  • A control can relate to the failure mode itself, the cause (or mechanism) of failure, or the effects of a failure mode.  To make evaluating controls even more complex, controls can either prevent a failure mode or cause from occurring or detect a failure mode, cause of failure, or effect of failure after it has occurred.
  • Design Control will almost certainly detect a potential cause/mechanism and subsequent failure mode.
  • Identify Current Controls (design or process). Current Controls (design or process) are the mechanisms that prevent the cause of the failure mode from occurring or which detect the failure before it reaches the Customer. The engineer should now identify testing, analysis, monitoring, and other techniques that can or have been used on the same or similar products/processes to detect failures. Each of these controls should be assessed to determine how well it is expected to identify or detect failure modes.
  • Detection is an assessment of the likelihood that the Current Controls (design and process) will detect the Cause of the Failure Mode or the Failure Mode itself, thus preventing it from reaching the Customer.
  • Identify the existing controls that identify and reduce failures.  Controls may be Preventive (designed in) or Detective (found by functional testing, etc.)–Preventive controls are those that help reduce the likelihood that a failure mode or cause will occur (affect occurrence value)–Detective controls are those that find problems that have been designed into the product (assigned detection value).
  • It is your ability to detect the failure when it occurs.
  • Basically, prior to “impending” failure.  The new AIAG FMEA manual has implemented two control columns in an effort to assist in this endeavor.  Preventive controls: essentially, what are you doing to prevent the failure from occurring? This includes such things as SW diagnostics.
    In an automotive application, an ABS lamp activates prior to impending failure to allow you to take the vehicle to the dealership.  Detective controls: essentially, what tests do you have in place that can detect the failure prior to design/process release to the end user?
  • Detection: Detect the Cause/Mechanism or Failure Mode, either by analytical or physical methods, before the item is released to production.
  • FMEA is a mitigation planning tool.  Detection must be relevant to mitigation.
  • Detection is sometimes termed EFFECTIVENESS. It is a numerical subjective estimate of the effectiveness of the controls to prevent or detect the cause or failure mode before the failure reaches the customer.  The assumption is that the cause has occurred.
  • A description of the methods by which occurrence of the failure mode is detected by the operator. The failure detection means, such as visual or audible warning devices, automatic sensing devices, sensing instrumentation or none will be identified. (MIL 1629)
  • The definition of Detection usually depends on the scope of the analysis. Definitions usually fall into one of three categories:

i) Detection during the design & development process

ii) Detection during the manufacturing process

iii) Detection during operation

It’s obvious that there are a number of opinions about what “detection” means in the context of an FMEA.

 

One thing is clear: during the preliminary discussions, prior to beginning the detailed FMEA development, everyone should agree on what “detection” means for the product being addressed.

 

Does anyone have an opinion on this subject?