Soft Error Rate Count

Posted by Mike Silverman | Prediction Methods,Questions for the Experts? | Monday 21 June 2010 06:46

Moshe Valdman from Israel wrote this question:

In a telecom system we have many memories and FPGAs. Theoretically we should have quite a high failure rate related to “single event upset”. I suspect we indeed have such failures, but these could also be just SW “bugs”. I have a difficult time convincing developers to add ECC, CRC, parity and other means to correct or at least detect such temporary failures. Can you share ideas on how to estimate the actual field failure rate related to SEUs and how to quantify the cost?

Charlie Slayman, our Ops SER Expert wrote back:

Yes, you will have soft errors if you have memory and FPGAs. Typical rates for SRAM and flops vary between 100 and 1000 FIT per Mb or Mflop (1 FIT is one failure per billion device-hours). DRAM rates are much lower, around 100 FIT/Gb. Check out Slide 25 of my presentation at last month’s IEEE SCV Reliability Society meeting, archived at http://www.ewh.ieee.org/r6/scv/rl/archives.htm.

You can use these numbers as a rough rule of thumb to estimate a system soft error rate.  If you actually see field failures significantly higher than this rate, then I would suspect software bugs or signal integrity as the dominant sources of errors.
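As a rough illustration of that rule of thumb, the sketch below adds up per-technology FIT contributions and converts the total into a mean time between upsets and an expected number of upsets per year across a fleet. The board configuration (64 Mb of SRAM and FPGA flops, 8 Gb of DRAM, 1000 units in the field) and the function names are hypothetical; only the FIT ranges come from the answer above. The result is a raw upset rate, and not every upset necessarily shows up as a visible failure.

```python
HOURS_PER_YEAR = 8760
FIT_HOURS = 1e9  # 1 FIT = 1 failure per 10^9 device-hours


def system_soft_error_fit(sram_mbit, sram_fit_per_mbit, dram_gbit, dram_fit_per_gbit):
    """Sum the raw soft error contributions of SRAM/flops and DRAM, in FIT."""
    return sram_mbit * sram_fit_per_mbit + dram_gbit * dram_fit_per_gbit


def fit_to_mtbf_years(fit):
    """Convert a FIT rate into mean time between upsets, in years."""
    return FIT_HOURS / fit / HOURS_PER_YEAR


def expected_upsets_per_year(fit, units):
    """Expected number of upsets per year across a fleet of identical units."""
    return fit / FIT_HOURS * HOURS_PER_YEAR * units


# Hypothetical board: 64 Mb SRAM/flops at 500 FIT/Mb, 8 Gb DRAM at 100 FIT/Gb.
board_fit = system_soft_error_fit(64, 500, 8, 100)
print("Board soft error rate (FIT):", board_fit)
print("Mean time between upsets per board (years):", fit_to_mtbf_years(board_fit))
print("Expected upsets per year, 1000 boards:", expected_upsets_per_year(board_fit, 1000))
```

Comparing the fleet-level number with the upsets actually logged in the field is one way to judge whether observed failures are plausibly soft errors or point to software bugs or signal integrity instead.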

Ironically, high solar activity increases the strength of the earth’s magnetic field which in turn reduces the neutron flux at sea level. Neutrons are the only particles that make it to terrestrial levels with any significant flux.  The solar particles are scattered and deflected in the upper atmosphere. So the soft error rate of terrestrial systems is lower during high solar activity.  (It’s a different story for satellite systems since they are high enough to be hit by energetic ions from the solar storms.) But the modulation is only about 20%, so you won’t see a big change in system error rate.

As far as justifying the use of parity, ECC and CRC, that depends on the design target for system reliability and the components used. At the soft error rates I quoted above, I would find it hard to believe that any form of complex telecom design could meet reasonable reliability targets without some form of ECC on large memory.

Hope this helps.

Charlie

If you have any comments or further insights into this, please respond to the blog.

Supply Chain & Quality: Optimizing It

Posted by Mike Gozzo | Uncategorized | Sunday 13 June 2010 19:46

To be competitive today, it is imperative that organizations squeeze out waste. In the past, enterprises pursued this with a narrow view – cost reduction efforts.

Today’s global environment offers a greater, all-encompassing opportunity to address the factors that affect the total supply chain.

Competitive Advantage is the target and one of the key factors is QUALITY. Supply Chain Management needs to link Quality with other factors like TIMELINESS and COST. Changes in practice need to be put into effect, but to do this the organization needs to know where to employ the appropriate resources.

What are the techniques you’re employing to accomplish this effort? What are the measurements being used to ensure achievement is reached?

Analyzing Data for Repairable Systems

Posted by Greg Larsen | Availability, Maintainability, and Serviceability | Wednesday 5 May 2010 05:47

Many of the “standard” reliability methods are intended for non-repairable systems. That is, when a component, sub-assembly or system fails, it is not repaired and returned to service. The Weibull distribution and other well-known distributions which effectively describe the time to failure assume the failures are “terminal”. That is, the whole system is replaced.

In contrast, repairable systems may fail multiple times during their lifetimes and this results in “recurrent events” in which system components may be repaired or replaced to bring the system back on line. In this case, a single system actually has multiple ages, i.e. components which have been repaired or replaced are “younger” than the rest of the system.

Reliability data comprised of recurrent events should be analyzed differently than time to failure data from non-repairable systems. In particular, it is important to recognize the sequence of the events for individual systems represented in the data. This is done by modeling the cumulative failures (repairs, or costs) versus the system age (time). This model can then be used to predict the total failures (repairs, or costs) at some future point in time.
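One common nonparametric way to build the cumulative-failures-versus-age curve is the mean cumulative function (MCF) estimate, which averages the number of events per system still under observation at each age. The sketch below is a minimal version under stated assumptions: the function name, the input layout (a list of event ages plus an end-of-observation age per system) and the three example histories are all hypothetical, and it is not necessarily the method used in the training mentioned in this post.

```python
def mean_cumulative_function(histories):
    """Estimate the mean cumulative number of events per system versus age.

    histories: list of (event_ages, end_age) tuples, one per system, where
    event_ages are the ages at which repairs occurred and end_age is the age
    at which observation of that system stopped.
    Returns a list of (age, mcf) points, one per event, in age order.
    """
    all_event_ages = sorted(age for ages, _ in histories for age in ages)
    points = []
    cumulative = 0.0
    for age in all_event_ages:
        # Systems still under observation at this age form the risk set.
        at_risk = sum(1 for _, end_age in histories if end_age >= age)
        cumulative += 1.0 / at_risk
        points.append((age, cumulative))
    return points


# Hypothetical fleet of three systems (ages in hours).
fleet = [
    ([100, 400], 500),  # two repairs, observed to 500 h
    ([250], 300),       # one repair, observed to 300 h
    ([], 600),          # no repairs, observed to 600 h
]
for age, mcf in mean_cumulative_function(fleet):
    print(age, round(mcf, 3))
```

Weighting each event by a repair cost instead of 1 gives the mean cumulative cost per system. Predicting beyond the observed ages still requires fitting a parametric model to this curve.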

As an example, warranty data are a collection of recurrent events on many products in the field. Events include repair, replacement and preventive maintenance. Warranty data can be analyzed to estimate the cost of extending the time on a standard factory warranty. The resulting model can be used to estimate such things as cost per unit or number of repairs per unit. This information can then be used to decide whether revenue would increase sufficiently to make a longer warranty period beneficial.

Training and consulting are available for repairable systems applications.

Greg Larsen, MS, CRE
Senior Reliability Consultant
gregl@opsalacarte.com

How to Avoid Being the Next Toyota

Posted by Mike Silverman | Assessing/Planning for Reliability | Sunday 25 April 2010 21:19

Our local IEEE Reliability Society and IEEE Safety Society will host a talk this Tuesday, April 27 on things that companies should do to avoid huge recalls. We will address this from a Design for Excellence (DfX) perspective. See our website at www.opsalacarte.com for more details on this event.

Design for Excellence Symposium

Posted by Mike Silverman | Assessing/Planning for Reliability | Sunday 25 April 2010 21:16

The first of our three DfX Symposia was a big success here in San Jose, California. We even had a number of folks dial in for the webinar version.

Our next DfX Symposium will be in Huntsville, Alabama the week of May 17-21. Check out our website at www.opsalacarte.com for more details.

Outsourcing Revisited – What have we learned?

Posted by Mike Gozzo | Uncategorized | Thursday 11 March 2010 17:05

Several years ago, outsourcing was initiated as the tool for improving competitive posture. Since then, experience gained from many business sectors has provided further insight and new learning on better ways to approach supply chains.

Cost has been the primary driver for offshore outsourcing. Should that change?

What about these questions:
* What is the formulation for deciding on outsourcing?
* What methodology needs to be employed?
* What is the report card and who’s keeping score?

Your comments regarding these questions can offer more insight into how we should proceed in addressing this important activity.

In the Twilight Zone of Humidity Testing

Posted by Andre Kleyner | Assessing/Planning for Reliability,Climatic Testing,Reliability Testing | Thursday 25 February 2010 19:58

After 25 years in the reliability field I am still a bit mystified by humidity testing. Generally, environmental tests can be divided into two major categories. The first is durability tests, where some form of wear-out mechanism causes products to fail. The most common examples of durability tests are vibration and thermal cycling. The second is capability tests, often referred to as overstress tests, with the goal of determining how well the product can resist certain conditions such as high voltage, accidental drop, dust, or others.

It appears that humidity tests belong somewhere in between the two categories. On one hand there are electro-migration, corrosion, dendritic growth, and other failure mechanisms that follow the pattern of wear-out processes. Those failure mechanisms are indeed accelerated by the combined effect of temperature and humidity, and the acceleration models most commonly used to calculate test durations are Peck’s and Eyring. Both models have rather limited applications and varying accuracy, but they are currently the best that reliability science can offer for calculating field-to-test ratios.

On the other hand, humidity often changes the mechanical properties of materials, which can make them more susceptible to failure. For example, the modulus of elasticity of some materials goes down after absorbing moisture, and some plastics become more prone to developing cracks as a result of humidity exposure. Those types of failure mechanisms cannot be described by any known algebraic acceleration model. Despite that, engineers often mistakenly apply such models to calculate test time. The desire to use algebraic acceleration models is very strong due to their relative simplicity and ease of comprehension. The alternative to acceleration models is the use of predetermined tests, for example 164 hours at 85% relative humidity and 85 degrees C. Those tests are often based on historical data rather than on a test rationale or a good understanding of failure mechanisms.
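For the wear-out type mechanisms where an acceleration model is appropriate, Peck’s model is usually written as a power law in relative humidity multiplied by an Arrhenius temperature term. The sketch below computes the resulting acceleration factor and the field time a test would represent. The humidity exponent (2.7), the activation energy (0.79 eV) and the field conditions (30 degrees C, 60% RH) are illustrative assumptions only; real values depend on the specific failure mechanism and should come from data, and the function names are hypothetical.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333e-5  # Boltzmann constant in eV/K


def peck_acceleration_factor(rh_use, temp_use_c, rh_test, temp_test_c,
                             humidity_exponent=2.7, activation_energy_ev=0.79):
    """Acceleration factor of test conditions over use conditions (Peck's model).

    AF = (RH_test / RH_use)^n * exp((Ea / k) * (1/T_use - 1/T_test)),
    with temperatures in kelvin. The defaults for n and Ea are assumed,
    illustrative values, not constants valid for every product.
    """
    t_use_k = temp_use_c + 273.15
    t_test_k = temp_test_c + 273.15
    humidity_term = (rh_test / rh_use) ** humidity_exponent
    temperature_term = math.exp(
        (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use_k - 1.0 / t_test_k)
    )
    return humidity_term * temperature_term


def equivalent_field_hours(test_hours, acceleration_factor):
    """Field exposure time represented by a test of the given duration."""
    return test_hours * acceleration_factor


# Assumed field environment: 30 C / 60% RH; test chamber: 85 C / 85% RH.
af = peck_acceleration_factor(rh_use=60, temp_use_c=30, rh_test=85, temp_test_c=85)
print("Acceleration factor:", af)
print("Field hours represented by a 1000 h test:", equivalent_field_hours(1000, af))
```

The calculation is only meaningful for mechanisms such as corrosion that follow this kind of model; it says nothing about moisture-induced changes in material properties.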

Therefore, before writing a product validation plan involving any humidity testing it is important to answer the following questions:

What are the expected humidity-triggered failure mechanisms for my product?

Are any forms of electro-migration involved?

Will the humidity affect any of the material properties and will it make my product more prone to failures?

What type of humidity test is most appropriate for my products? Steady-state, cyclic, both?

Will any of the possible failure mechanisms be accelerated by a higher humidity/higher temperature combination, or would they remain neutral to it?

Do any of the known acceleration models apply in my case and if not, how do we determine the test duration?

What exactly will my test represent for the life of the product?

Answering these and other questions is critical to a successful validation program involving humidity testing.

Design News: Toyota Problem “Unforeseeable”?

Posted by Edward Smith | Assessing/Planning for Reliability | Monday 15 February 2010 23:39

Design News, January 28, 2010: “Toyota’s sticking gas pedal was an almost-unforeseeable problem, experts say, and the best course of action now is for engineers to ensure that drivers can handle the failure if it happens again.”

Q. I would like to hear how the Reliability Engineering community feels about this.  Was it unforeseeable?  How can we “ensure that drivers can handle the failure” if the failure is unforeseeable? 

Other manufacturers using throttle-by-wire added software that uses a signal from the brake pedal to override the signal from the gas pedal.  Apparently someone foresaw the problem.  

Q. As more and more of the functions of the driver are gradually taken over by computerized equipment, with the long-term goal of eliminating the driver’s participation entirely, how do we assure the “passengers” that all is well? How do we perform Reliability testing holistically?

And finally, though it is not technically related, is the public ready to surrender complete control?

http://www.designnews.com/article/print/446480-Toyota_s_Problem_Was_Unforseeable.php

http://www.nytimes.com/2010/02/08/08webtoyota.html

http://www.washingtonpost.com/wp-dyn/content/article/2010/02/02/AR201002020316

San Jose Mercury News, Sunday, February 14, 2010 Page A1, “Auto electronic controls drawing scrutiny”

Free Design for Reliability Seminar in L.A.

Posted by Mike Silverman | Assessing/Planning for Reliability | Thursday 4 February 2010 20:56

We are providing a FREE 1/2 Day Design for Reliability seminar on Thursday, February 11th at Wyle Labs in El Segundo, California. If you are interested in attending, please let us know and we can provide you more info.

Reliability Conferences – RAMS, MD&M, and CMSE

Posted by Mike Silverman | Uncategorized | Thursday 4 February 2010 20:48

Last week was the annual Reliability, Availability, and Maintainability Symposium (RAMS) conference. We presented 5 papers/tutorials. If you were not able to attend, let us know and we can get you copies of our presentations.

Next week is the Medical Design and Manufacturing (MD&M) show in Anaheim, California. We will be presenting a paper on Monday entitled “Reliability Challenges in the Medical Industry”. Hopefully you can join us. If not, let us know and we can get you a copy of that paper.

Also next week is the Components for Military and Space Electronics CMSE Conference in Los Angeles. We will be presenting a paper on Wednesday entitled “New Fast Methods for Determining MTBF using HALT Results”. Hopefully you can join us. If not, let us know and we can get you a copy of that paper.
