Questions for the Experts?

THEME: Why do this?  An abbreviated view of how to do it

A recent Bloomberg Businessweek article (Dec. 10-16, 2012), Josh Tyrangiel's interview of Apple CEO Tim Cook, noted a key point in business practice and philosophy: “There are always things unknowable – if we are finding zero issues, our performance bar is in the wrong place.”

WHY THINK THIS WAY?  You need to improve – and improvement must be a way of doing business in all areas

  • People, knowledge, & technology/information
    You need to understand your business performance attitude (Change it or Perish)

HOW?  (An abbreviated view)

  • Understand Value Analysis
  • Identify your Competitive Advantage
    (Quality, Availability, Flexibility, Cost)
  • Tool to employ (System Audit)
  • Practice
    Need a champion/advocate
    Establish measurements
    Use a standard (ISO 19011)

How do you compare?  For more information or with questions, contact mikeg@opsalacarte.com


Continuous Improvement (CI) is a driving force in business.  But how do we go about obtaining it from our suppliers?

What do we, as customers, do to communicate our expectations to suppliers – improved quality, delivery service, and cost reduction?

What is done to aid the supplier in the pursuit of CI?  What steps can we take to ensure we are on track to improving our performance through our suppliers’ compliance?

Moshe Valdman from Israel wrote this question:

In a telecom system we have many memories and FPGAs, so theoretically we should have a fairly high failure rate related to “single event upsets” (SEUs).  I suspect we indeed have such failures, but they could also just be software bugs.  I have a difficult time convincing developers to add ECC, CRC, parity, and other means to correct, or at least detect, such temporary failures.  Can you share ideas on how to estimate the actual field failure rate related to SEUs and how to quantify the cost?

Charlie Slayman, our Ops SER Expert, wrote back:

Yes, you will have soft errors if you have memory and FPGAs.  Typical rates for SRAM and flip-flops range from 100 to 1,000 FIT per Mb or Mflop.  DRAM rates are much lower, around 100 FIT/Gb.  Check out slide 25 of my presentation at last month’s IEEE SCV Reliability Society meeting, archived at http://www.ewh.ieee.org/r6/scv/rl/archives.htm.

You can use these numbers as a rough rule of thumb to estimate a system soft error rate.  If you actually see field failure rates significantly higher than this estimate, then I would suspect software bugs or signal integrity as the dominant sources of errors.
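To make that rule of thumb concrete, here is a minimal Python sketch that totals a system’s soft-error FIT from the rates Charlie quotes (100 to 1,000 FIT per Mb of SRAM or per Mflop of logic, about 100 FIT/Gb of DRAM).  The midpoint rate and the component sizes are hypothetical assumptions for illustration, not figures from the question:

    # Rough system soft-error-rate (SER) estimate from per-device FIT rules of thumb.
    # FIT = failures per 1e9 device-hours.  Rates are the ballpark figures quoted
    # above; the component sizes below are hypothetical.

    SRAM_FIT_PER_MB = 500     # assumed midpoint of the quoted 100-1000 FIT/Mb range
    FLOP_FIT_PER_MFLOP = 500  # same range applies per million flip-flops
    DRAM_FIT_PER_GB = 100     # quoted DRAM rate

    def system_ser_fit(sram_mb, flops_m, dram_gb):
        """Sum the soft-error FIT contributions of each memory type."""
        return (sram_mb * SRAM_FIT_PER_MB
                + flops_m * FLOP_FIT_PER_MFLOP
                + dram_gb * DRAM_FIT_PER_GB)

    # Hypothetical telecom card: 64 Mb of FPGA configuration/block RAM,
    # 2 million flip-flops of logic state, 8 Gb of DRAM.
    fit = system_ser_fit(sram_mb=64, flops_m=2, dram_gb=8)
    print("Estimated SER: %.0f FIT (about one soft error every %.0f hours)"
          % (fit, 1e9 / fit))

An estimate like this gives you a baseline to compare field data against: if observed error rates run well above it, software bugs or signal integrity become the prime suspects, as Charlie notes.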

Ironically, high solar activity increases the strength of the earth’s magnetic field, which in turn reduces the neutron flux at sea level.  Neutrons are the only particles that reach terrestrial altitudes with any significant flux; the solar particles are scattered and deflected in the upper atmosphere.  So the soft error rate of terrestrial systems is actually lower during high solar activity.  (It’s a different story for satellite systems, since they are high enough to be hit by energetic ions from solar storms.)  But the modulation is only about 20%, so you won’t see a big change in system error rate.
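To put that 20% figure in perspective, a trivial hedged calculation, using the hypothetical 33,800 FIT estimate from the sketch above and reading the 20% as a drop below baseline at solar maximum (itself an assumption):

    # Effect of the roughly 20% solar-cycle modulation on the estimated SER.
    baseline_fit = 33800.0              # hypothetical estimate from the sketch above
    solar_max_fit = baseline_fit * 0.8  # sea-level neutron flux drops at solar maximum
    print("SER near solar maximum: about %.0f FIT" % solar_max_fit)

As Charlie says, a shift that small is unlikely to stand out against normal statistical scatter in field data.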

As far as justifying the use of parity, ECC, and CRC, that depends on the design target for system reliability and the components used.  At the soft error rates quoted above, I would find it hard to believe that any complex telecom design could meet reasonable reliability targets without some form of ECC on large memory.
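One hedged way to quantify that justification is to compare the unprotected soft-error FIT against the system’s reliability budget, which tells you how much coverage ECC must supply.  The budget and coverage values below are illustrative assumptions only:

    # Sketch: does the design need ECC to meet its soft-error budget?
    # All numbers here are illustrative assumptions.
    raw_fit = 33800.0     # unprotected SER estimate from the first sketch
    target_fit = 100.0    # hypothetical soft-error budget for the system
    ecc_coverage = 0.999  # assumed fraction of events a SEC-DED ECC corrects
                          # (single-bit upsets dominate; exact value is design-specific)

    residual_fit = raw_fit * (1.0 - ecc_coverage)
    print("Required reduction: %.0fx" % (raw_fit / target_fit))
    print("Residual FIT with ECC: %.1f (%s the target)"
          % (residual_fit, "meets" if residual_fit <= target_fit else "misses"))

In this illustration the raw rate misses the budget by more than two orders of magnitude, which is exactly the kind of gap that makes ECC on large memories hard to avoid in a complex telecom design.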

Hope this helps.

Charlie

If you have any comments or further insights into this, please respond on the blog.

If you have a question, you can post it as a comment on this post.