The Robust Design (RD) methodology for hardware development is discussed. We will compare it with reliability engineering (RE) tools and practices, highlighting differences and similarities. We will present proximity to ideal function for robust design and compare it to physics-of-failure and other reliability modeling and prediction approaches. We will show how measurement selection strongly differentiates RD from reliability engineering methods, how to get the most from each methodology, and the pitfalls of each set of practices. This webinar will be a lead-in to Lou’s symposium classes, DOE and DFSS.
For more info or to register, please Contact Us
The traditional Mahalanobis distance is a generalized distance that can be considered a measure of the degree of similarity (or divergence) in the mean values of different characteristics of a population, taking into account the correlation among the characteristics. It has been used for many years in clustering, classification, and discriminant analysis. The Mahalanobis distance is attributable to Prof. P. C. Mahalanobis, founder of the Indian Statistical Institute some 60 years ago. It has been used for various types of pattern recognition, e.g., inspection systems, face and voice recognition systems, counterfeit detection systems, etc. The figure below displays data published by Fisher (1936) and a cluster analysis, where classification into three predetermined categories is demonstrated.
Another generalized distance most engineers have encountered is the Euclidean distance between two multivariate points p and q. If p = (p1, p2, …, pn) and q = (q1, q2, …, qn) are two points in Euclidean n-space, then the distance from p to q, or from q to p, is given by:

d(p, q) = d(q, p) = √((p1 − q1)² + (p2 − q2)² + … + (pn − qn)²)
No consideration is given to the correlation between characteristics in Euclidean distance calculations.
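For illustration, the two distances can be compared on synthetic correlated data. The reference data, the test point, and the library calls below are illustrative assumptions, not from the original figure:

```python
import numpy as np

# Synthetic reference group: two positively correlated characteristics.
rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
data = rng.multivariate_normal([0.0, 0.0], cov, size=500)

centroid = data.mean(axis=0)
vi = np.linalg.inv(np.cov(data, rowvar=False))  # inverse covariance matrix

point = np.array([1.0, -1.0])  # runs against the positive correlation
diff = point - centroid

euclidean = float(np.sqrt(diff @ diff))         # ignores correlation
mahalanobis = float(np.sqrt(diff @ vi @ diff))  # accounts for correlation
# The Mahalanobis distance flags this point as far more unusual than the
# Euclidean distance suggests, because it violates the correlation structure.
```

The point (1, −1) sits only about 1.4 Euclidean units from the centroid, yet its Mahalanobis distance is roughly twice that, because the reference data say the two characteristics should move together.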
Dr. G. Taguchi of Ohken Associates, Japan, developed an innovative method for determining the generalized distance from the centroid of a reference group (of multivariate data) to a multivariate point. For example, if a doctor had a group of very healthy patients, whose vital characteristics, such as blood pressure, body temperature, skin color, heart rate, and respiration rate, were all considered exemplary, then he could define a Mahalanobis space, a reference space, from those healthy folks, and use its centroid as the zero point and define a unit distance for a continuous degree-of-health scale. If a not-so-healthy person came to the same doctor and the same characteristics were measured, he would have an MHD number much higher than the reference group’s. His MHD number would indicate his generalized distance from the centroid of the healthy group. As time passed, the MHD number for the not-so-healthy patient could increase (or decrease), depending on whether his health were failing or improving, respectively. In general, very healthy people tend to look quite similar, while unhealthy people tend to look quite different from one another (and from the healthy group). In addition, changes in the correlation structure among the unhealthy patients’ characteristics strongly affect their MHD numbers. If a person’s MHD number reached a predetermined high threshold value, for example, the doctor might recommend hospitalization. If the MHD became similar to those of the reference group, the patient could be recommended for simple periodic doctor visits.
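A common way to construct such a reference space, sketched here in the spirit of the Mahalanobis-Taguchi approach, is to scale the squared distance by the number of characteristics so the healthy group averages about 1, the unit distance. The synthetic "healthy" vitals below are stand-ins for real patient data:

```python
import numpy as np

rng = np.random.default_rng(1)
k = 4  # characteristics: blood pressure, temperature, heart rate, respiration

# Synthetic "healthy" reference group: 200 patients with plausible vitals.
healthy = rng.multivariate_normal(np.zeros(k), np.eye(k), size=200)
healthy = healthy * np.array([10.0, 0.5, 8.0, 3.0]) + np.array([120.0, 36.8, 65.0, 14.0])

# Define the Mahalanobis space: centroid, scale, and inverse correlation.
mean = healthy.mean(axis=0)
std = healthy.std(axis=0, ddof=1)
z_ref = (healthy - mean) / std
corr_inv = np.linalg.inv(np.corrcoef(z_ref, rowvar=False))

def mhd(x):
    """Scaled Mahalanobis distance; averages ~1 over the reference group."""
    z = (np.asarray(x) - mean) / std
    return float(z @ corr_inv @ z) / k

avg_ref = np.mean([mhd(row) for row in healthy])  # ~1 by construction

patient = [160.0, 38.5, 95.0, 22.0]  # hypothetical not-so-healthy vitals
```

A healthy patient scores near 1, while the hypothetical not-so-healthy patient scores far above it; tracking that number over repeated visits gives the continuous degree-of-health scale described above.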
From any number of multivariate characteristics measured, it is possible to readily identify those characteristics which are most important (in a Pareto sense). Reducing the cost of measurement is an important consideration for many enterprises. There is usually a subset of measurements which provides all the data necessary to make correct decisions. Strong correlations between measurements make it possible to eliminate measures that add little value. The information contained in a handful of multivariate measurements may be sufficient to identify abnormal conditions.
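As a simple illustration of screening out redundant measurements, consider synthetic data in which one measurement nearly duplicates another; the 0.95 correlation cutoff is an arbitrary assumption for this sketch, not a prescribed value:

```python
import numpy as np

rng = np.random.default_rng(2)
m1 = rng.normal(size=300)                         # measurement 1
m2 = 2.0 * m1 + rng.normal(scale=0.05, size=300)  # near-duplicate of m1
m3 = rng.normal(size=300)                         # independent measurement
X = np.column_stack([m1, m2, m3])

corr = np.abs(np.corrcoef(X, rowvar=False))
keep = []
for j in range(X.shape[1]):
    # Keep a measurement only if it is not nearly redundant with one
    # already kept (|correlation| below the 0.95 cutoff).
    if all(corr[j, i] < 0.95 for i in keep):
        keep.append(j)
# m2 is dropped: m1 already carries essentially the same information.
```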
A medical trend chart of MHD illustrates the relative level of health of a person as a function of time. For example, daily collection of data for a patient, along with daily estimation of MHD, could be used to track overall health improvements (or deteriorations). Increasing trends could be used for prognostics, to initiate preventive countermeasures before a threshold condition is reached. The corrective effect of the countermeasure would be captured in the MHD numbers from the following days. Multivariate process control charts, like Shewhart and CUSUM charts, are similar, but these are based on probabilistic control limits derived from various statistical distribution assumptions. No such assumptions are made with MHD. Rather, consideration of costs is used to set limits.
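A minimal sketch of such a trend alarm, with hypothetical daily MHD values and a cost-based (not distribution-based) threshold:

```python
# Hypothetical daily MHD estimates for one patient (rising trend).
daily_mhd = [0.9, 1.1, 1.4, 1.9, 2.6, 3.4]

# The threshold comes from weighing the cost of intervening early against
# the cost of a missed failure, not from a statistical distribution.
threshold = 3.0

# First day on which a preventive countermeasure would be initiated.
first_alarm_day = next(
    (day for day, d in enumerate(daily_mhd) if d > threshold), None)
```

If the countermeasure works, the MHD values on the following days fall back toward 1, and the chart shows the recovery directly.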
For manufactured products, multivariate measures from testing are typically collected following final assembly. If we assume that the health of a manufactured product is analogous to the health of a patient, we could use similar methods to identify abnormal conditions and calculate a continuous MHD number for the multivariate condition. By collecting a group of manufactured systems, with exemplary performance, a Mahalanobis space could be constructed from the multivariate characteristics. A zero point and unit distance scale would be estimated as before. The system’s health could be diagnosed at t=0, just after assembly, and even later at intervals dictated by a data collection schedule. The manufactured product could easily be classified into normal and abnormal states at t=0, and the product’s tendency to become abnormal could be tracked.
The MHD measure can be utilized for many interesting industrial problems, including fault detection, fault isolation, degradation identification, and prognostics. For example, an air bag deployment decision relies on the ability to first establish a reference space for normal everyday driving, and then to release the air bags when multivariate shock loads and accelerations exceed a threshold value. This is fault detection. Fire alarms should actuate when various fire conditions exist over and above those expected from simple kitchen cooking or cigarette smoking. A multivariate reference space would be collected from normal cooking conditions, and an abnormal fire condition would be declared above some threshold value. The tendency of a high-volume printer to fail, with multivariate sensor data, could be inspected periodically, and a service agent could be dispatched, or an electronic countermeasure applied, before the customer ever noticed. Availability of the printer would be higher without the fault downtime, and customer satisfaction would be higher.
We are pleased to announce that our HALT Calculator is now available as a cloud-based application so you can access it directly whenever you need. We also have a PayPal payment system in place on-line for quicker access.
With the cloud application, we are also offering either pay-per-use (you can order as many use credits as you’d like) or yearly subscriptions. The yearly subscriptions work great when you have your own chamber, when you perform many HALTs a year and would like to run the calculator during HALT planning to set must-meet criteria, or when you are in the middle of a test and trying to determine which failures to fix before continuing testing.
Here is the internet version of our HALT Calculator.
While product reliability has become a major concern to most organizations, many have overlooked developing good reliability specifications. This oversight can result in ambiguous and purposeless reliability testing during the validation phase of product development. Effective reliability testing requires a well-defined reliability specification. After all, the prime objective of a reliability engineering program is to test and assess product reliability.
A common element that is widely ignored yet critical to a sound reliability specification is the definition of equipment failure. Even the most rigorous reliability-testing program is of little use if the product being tested has poorly defined failure parameters. This article discusses the essential requirements for establishing concise and effective reliability specifications and proposes a method to define equipment failure.
Among the requirements often used to specify equipment reliability is “mean time between failures” (MTBF), which is verified during subsystem and system reliability testing. Essential to reliability testing is the development of an agreed-upon definition of equipment failure, and it must be clearly established at the earliest stages of product development. It may seem fairly obvious whether a product has failed or not, but such a definition is quite necessary for a number of different reasons.
One of the most important reasons is that different manufacturers may have different definitions of what sort of behavior actually constitutes a failure. Identical tests performed on the same equipment by different groups may produce radically different results simply because the groups have different definitions of product failure. This can result in reported performance values that differ significantly when, in reality, the products may not. There are cases where one manufacturer has claimed an MTBF of 2000 hours while another manufacturer has reported 100 hours for the same product. This discrepancy may not be due to one having a vastly superior product; rather, it may be due to differences in their definition and assessment of a “failure”.
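The discrepancy is purely arithmetic. A minimal sketch with hypothetical numbers shows how the same equipment history can support both claims:

```python
# Hypothetical: one equipment log of 2000 operating hours with 20 interrupts,
# scored under two different failure definitions.
operating_hours = 2000
interrupt_log = ["adjustment"] * 19 + ["part_replacement"]

# Definition A: only interrupts requiring part replacement count as failures.
failures_a = sum(1 for i in interrupt_log if i == "part_replacement")
mtbf_a = operating_hours / failures_a   # one failure -> MTBF of 2000 hours

# Definition B: every interrupt counts as a failure.
failures_b = len(interrupt_log)
mtbf_b = operating_hours / failures_b   # twenty failures -> MTBF of 100 hours
```

Both vendors can honestly report their number; only the failure definition differs.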
In an effort to normalize reliability criteria, standards such as SEMI E10 have been created to give customers and suppliers in semiconductor manufacturing a guideline for measuring reliability, availability, and maintainability (RAM). SEMI E10 defines an interrupt as the equipment’s inability to perform its intended function due to the occurrence of assists or failures, or:
- Equipment Interrupt = Sum of all Failures + Sum of all Assists
It further defines assists and failures as:
Assist: Any unplanned interruption that occurs during equipment operation where all of the three following conditions apply:
- Equipment operation is resumed through external intervention.
- There is no replacement of a part, other than specified consumables.
- There is no further variation from specifications of equipment operation.
Failure: Any unplanned interruption or variance from the specifications of equipment operation other than assists.
From the above definitions, one may conclude that a failure is defined by the replacement of a part, and assists are any external interventions. In practice, however, many customers view machine performance in terms of the cost of operation. Thus, a customer may not favor equipment that requires frequent external interventions (e.g., equipment adjustments), since they would need to allocate many resources to operate the equipment. For this reason, customers may tend to view equipment adjustments as failures too. Thus, even with standards, a common struggle still exists between suppliers and customers to classify an interrupt.
Equipment Interrupt (Failure/Assist) Classification
This article proposes classifying failures and assists by considering the modes of recovery; that is, to categorize failures and assists by the means by which a customer remedies the problem.
Just as reliability testing must simulate product usage in the field, failures and assists should be profiled the way they are repaired by the customers. In a semiconductor production environment, many customers categorize equipment repair activities by the nature of the interrupts. They have repair policies that allocate recovery actions to machine operators and engineering technicians. This recovery plan is practical because different interrupts require different skill sets for repairs. The plan requires that the customer have a good understanding of the equipment’s operating behavior, so they can accurately staff maintenance and repair personnel.
This recovery-plan technique can be used as a foundation for assist and failure classifications. In other words, assists may be classified as machine-induced interrupts that are recovered by a machine operator, whereas failures are machine-induced interrupts that require skilled technicians for comprehensive troubleshooting and in-depth corrective actions. Deployment of this method also helps determine the cost of equipment ownership; it costs less to resolve an assist that is repaired by a machine operator and more to resolve a failure, since technician involvement is required.
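A minimal sketch of the proposed recovery-mode classification follows; the record fields and example interrupts are illustrative assumptions, not part of SEMI E10 itself:

```python
def classify_interrupt(recovered_by, part_replaced, consumable=False):
    """Classify a machine-induced interrupt by its mode of recovery:
    operator-recoverable interrupts (with at most consumable replacement)
    are assists; everything else is a failure."""
    if recovered_by == "operator" and (not part_replaced or consumable):
        return "assist"
    return "failure"

# Hypothetical interrupt log for one machine.
interrupts = [
    {"recovered_by": "operator", "part_replaced": False},                    # adjustment
    {"recovered_by": "operator", "part_replaced": True, "consumable": True}, # consumable swap
    {"recovered_by": "technician", "part_replaced": True},                   # repair
]

counts = {"assist": 0, "failure": 0}
for i in interrupts:
    counts[classify_interrupt(i["recovered_by"], i["part_replaced"],
                              i.get("consumable", False))] += 1

# Consistent with the SEMI E10 relation: interrupts = failures + assists.
total_interrupts = counts["assist"] + counts["failure"]
```

Tallying assists and failures separately this way also feeds directly into the cost-of-ownership argument: each assist is costed at operator rates, each failure at technician rates.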
Product specifications are no longer limited to just meeting functionality measures (i.e., speed, capacity, range, etc.), because for products with poor reliability that are seldom available for use, functionality measures are meaningless. The reliability specification is the backbone of a reliability program and a prerequisite for reliability testing. Without it, the implementation of a reliability program will be a difficult and frustrating process. A typical equipment reliability specification includes performance indices such as MTBF, and it must always be accompanied by a clear definition of failure. Effective reliability testing hinges heavily on a clear definition of equipment failure. Without this definition as a baseline, any reliability discussion becomes meaningless.
This article stressed the importance of equipment failure as an integral part of reliability specification, and proposed a method for classifying equipment interrupts.
1. SEMI E10-99, “Standard for Definition and Measurement of Equipment Reliability, Availability, and Maintainability (RAM),” SEMI (Semiconductor Equipment and Material International), 805 East Middlefield Road, Mountain View, CA 94043, 1999.
One objective of working in reliability is to minimize Life Cycle Costs (LCC). To do this, a Reliability Engineer must select which reliability tools will be needed and then utilize them properly over the product life cycle. He must also stay on top of the information that is generated, to be certain it is utilized properly during the testing phase.
The following is a string of excerpts from e-mail messages with a potential client who attended the recent “Practical Reliability Testing” webinar. The questions were sent to me since the client is in NY. We have started a dialog around these questions.
Q: We always struggle with sample sizes, quantifying all of the potential accelerating stresses on a product like an optical fiber amplifier, which has hundreds of components. Other issues include the speed of finishing testing while getting product to market with good reliability.
(Several days later) What I really meant by the question was: what if the product you are working on is rather expensive and you are to get only 1 (and if you’re lucky, 2) samples to test?
What do you do when you can’t take the product to failure and you really can’t overstress due to the fact you only have 1?
When you have 1 or 2 samples to do a full qualification as well as reliability, and don’t have field data…
When we think of HALT, we think of HALT Chambers and the stresses they provide – cold temperature, hot temperature, rapid thermal transitions, vibration stress, and combined thermal and vibration environments.
We need to expand our thinking of HALT into meaning any stress that can accelerate finding defects (in other words, to find design weaknesses before your customers find them).
Many other stresses can be used for HALT including: ESD, drop testing, bend testing, water ingress testing, and more. In this presentation, we will explore some of the HALT stresses we can apply in addition to temperature and vibration.
We will also explore some misuses of typical HALT chambers, and how some practitioners believe that just because they are using HALT chambers, they are performing HALT. If you are not performing HALT with the intention of discovering and expanding product margins, you are not performing HALT, no matter what equipment you are using. In this presentation, we will explore some of the common errors HALT practitioners make that take them off the path of performing HALT.
By: Mike Silverman
This blog is part of a presentation to be given by Mike at the annual Accelerated Stress Testing and Reliability (ASTR) Workshop in Denver, Oct 6-8. The presentation will be posted on Ops A La Carte’s website under technical presentations.
To be competitive today, it is imperative that organizations squeeze out waste. In the past, enterprises pursued this with a narrow view – cost reduction efforts.
Today’s global environment offers a greater, all-encompassing opportunity to address factors impacting the total supply chain.
Competitive advantage is the target, and one of the key factors is QUALITY. Supply chain management needs to link quality with other factors like TIMELINESS and COST. Changes in practices need to be put into effect, but to do this the organization needs to know where to employ the appropriate resources.
What are the techniques you’re employing to accomplish this effort? What are the measurements being used to ensure achievement is reached?