Semiconductor Reliability

As electronic systems become more complex, troubleshooting manufacturing and field failures becomes more difficult. Hard failures (i.e., permanent changes in a device that lead to reproducible failures) are straightforward to isolate and trace to a root cause. The more problematic area is “no trouble found” (NTF), meaning the original failure is difficult, if not impossible, to reproduce.

There are several sources of NTF:

  • Noise – Signal reflections, cross-talk, ground bounce
  • Timing marginality – Variations in rise times and delay times over temperature and voltage
  • Weak cells (memory) – leakage in DRAM, read-write instability in SRAM
  • Soft errors – upsets in memory and logic due to radiation from alpha particles within the IC package or neutrons generated from cosmic rays

Each of these sources of error can be difficult to reproduce. This means that a failure initially observed at the system level may not be reproduced at the IC level. The key to effective failure analysis is to recognize the possible sources and run the appropriate stress tests to reproduce and isolate the failure:

  • Noise and timing marginality are generally much worse at the system level than within the IC.  However, memory chips can be tested with special patterns that highlight these marginality issues.
  • Weak cells – failures will appear at random address locations from chip to chip, but refresh or voltage margin stress testing can be used to highlight the weak cells.
  • Soft errors are difficult to test at the individual chip or system level. Acceleration techniques are required: high altitude testing or neutron beams for cosmic ray soft errors, and sources containing high concentrations of radioactive materials for alpha particle soft errors.
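
As an illustration of pattern-based memory testing, the sketch below runs a March C- sequence against a simulated memory. The source does not name a specific pattern; March C- is simply one widely used choice. The SimulatedMemory class and its stuck-at cells are hypothetical stand-ins for real hardware. A real marginality or weak-cell test would also sweep voltage, timing, or refresh interval, which this simple model does not attempt.

```python
from typing import Dict, List, Optional, Set


class SimulatedMemory:
    """Hypothetical 1-bit-wide memory; stuck_at maps address -> forced value."""

    def __init__(self, size: int, stuck_at: Optional[Dict[int, int]] = None):
        self.size = size
        self.cells = [0] * size
        self.stuck_at = stuck_at or {}

    def write(self, addr: int, bit: int) -> None:
        # A stuck cell ignores the written value.
        self.cells[addr] = self.stuck_at.get(addr, bit)

    def read(self, addr: int) -> int:
        return self.stuck_at.get(addr, self.cells[addr])


def march_c_minus(mem: SimulatedMemory) -> List[int]:
    """Run the six March C- elements and return the failing addresses."""
    failing: Set[int] = set()
    up = range(mem.size)
    down = range(mem.size - 1, -1, -1)

    def element(order, expect: Optional[int], write_value: Optional[int]) -> None:
        for addr in order:
            if expect is not None and mem.read(addr) != expect:
                failing.add(addr)
            if write_value is not None:
                mem.write(addr, write_value)

    element(up, None, 0)    # write 0 everywhere (either order)
    element(up, 0, 1)       # ascending: read 0, write 1
    element(up, 1, 0)       # ascending: read 1, write 0
    element(down, 0, 1)     # descending: read 0, write 1
    element(down, 1, 0)     # descending: read 1, write 0
    element(up, 0, None)    # final read 0 (either order)
    return sorted(failing)
```

For example, calling march_c_minus(SimulatedMemory(16, {5: 1})) reports address 5, because the cell stuck at 1 fails the first read-0 step.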

Knowledge of the characteristics of each of these causes of NTF can then be used to help isolate the most likely source. Further testing at the system or component level with the correct source of stress is necessary to confirm or refute the cause.

Semiconductor manufacturing employs high volume automated production lines using very complex wafer processing technologies requiring tight controls throughout the flow.

Specific physical, optical, chemical and electrical tests are performed at several stages in the manufacturing flow to screen out wafer batches, wafers or single devices that fall outside the tight distribution limits of a given processing step.
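
A minimal sketch of one way such distribution-based screening could work is shown here. It flags readings that fall outside limits derived from the population itself. The choice of a median and interquartile-range estimate, the multiplier k = 6, and the function name screen_outliers are all assumptions made for illustration; the actual limits used on a production line are set per process step and are not given in the text.

```python
import statistics
from typing import List, Sequence


def screen_outliers(readings: Sequence[float], k: float = 6.0) -> List[int]:
    """Return indices of readings outside median +/- k * robust sigma."""
    median = statistics.median(readings)
    q1, _, q3 = statistics.quantiles(readings, n=4)
    # For a normal distribution the interquartile range spans about 1.349 sigma,
    # so this estimate is not inflated by the outliers being hunted.
    robust_sigma = (q3 - q1) / 1.349
    lower = median - k * robust_sigma
    upper = median + k * robust_sigma
    return [i for i, value in enumerate(readings) if not lower <= value <= upper]


# Hypothetical parametric readings (arbitrary units) for ten die on one wafer.
wafer_readings = [1.00, 1.01, 0.99, 1.02, 0.98, 1.00, 1.01, 0.99, 1.00, 1.35]
flagged = screen_outliers(wafer_readings)
```

A robust estimate is used here because a plain mean and standard deviation would be pulled toward the very devices the screen is meant to catch.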

At the end of the wafer processing line, the wafers are tested using special test chips or test structures on the wafers, and all product die are thoroughly tested for functionality.

Finished encapsulated devices are tested again for full functionality and performance against the data sheet specification limits.

New technologies, products and packages are qualified for reliability using special tests and environmental and electrical stresses, both to demonstrate reliability and to estimate the population failure rate and lifetime.
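
To make the failure rate estimate concrete, the sketch below converts the result of a high-temperature life test into a FIT rate (failures per billion device-hours). The text does not say which model is used. The sketch assumes a common approach: Arrhenius temperature acceleration combined with a Poisson upper confidence bound on the failure count. The activation energy, temperatures, sample size and 60% confidence level are illustrative assumptions, not values from the source.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5


def arrhenius_af(ea_ev: float, t_use_c: float, t_stress_c: float) -> float:
    """Acceleration factor between use and stress temperature (degrees C)."""
    t_use = t_use_c + 273.15
    t_stress = t_stress_c + 273.15
    return math.exp(ea_ev / BOLTZMANN_EV_PER_K * (1.0 / t_use - 1.0 / t_stress))


def poisson_upper_bound(failures: int, confidence: float) -> float:
    """Upper bound m on the expected failure count, solving
    P(X <= failures | m) = 1 - confidence by bisection."""
    target = 1.0 - confidence

    def cdf(m: float) -> float:
        term = math.exp(-m)
        total = term
        for k in range(1, failures + 1):
            term *= m / k
            total += term
        return total

    lo, hi = 0.0, 1.0
    while cdf(hi) > target:   # expand until the root is bracketed
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if cdf(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def fit_rate(failures: int, devices: int, hours: float, af: float,
             confidence: float = 0.6) -> float:
    """Upper-bound failure rate in FIT at use conditions."""
    equivalent_device_hours = devices * hours * af
    return poisson_upper_bound(failures, confidence) / equivalent_device_hours * 1e9


# Assumed example: 231 devices, 1000 h at 125 C, zero failures, Ea = 0.7 eV, 55 C use.
af = arrhenius_af(ea_ev=0.7, t_use_c=55.0, t_stress_c=125.0)
fit = fit_rate(failures=0, devices=231, hours=1000.0, af=af)
```

With zero failures the bound reduces to -ln(1 - confidence), so the estimate is driven entirely by the accumulated device-hours and the assumed acceleration factor. That is why the choice of activation energy matters so much in practice.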

All of the above manufacturing tests may turn up devices that do not meet the required test limits. These defective devices are highly valuable because they provide detailed information on the particular failure mechanism(s) causing the defect.

It is imperative to perform detailed electrical and physical characterization of the defective devices, followed by physical failure analysis, including layer-by-layer de-processing of the devices, using appropriate analytical tools such as the optical microscope, SEM (scanning electron microscope), Auger analysis (chemical profiling) and a long list of other highly specialized analytical tools.

Results of the F/A (Failure Analysis) are used to identify the root cause of the failure of a defective device.

This information is the basis for corrective action(s) to improve chip design, manufacturing and production test screens. Once corrective actions are implemented in manufacturing, samples should be checked and tested for the failure mechanisms that the corrective action(s) address.

There are four important sources of information for continuous quality/reliability improvement in the high volume semiconductor industry:

The first is the out-going quality assurance testing of samples of the production device population to a tight acceptable quality level (AQL).
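
The effect of a sampling plan can be sketched with a binomial calculation. For a single sampling plan that draws n devices and accepts the lot if at most c are defective, the function below gives the probability that a lot with a given true defect rate is accepted. The plan values in the example (n = 125, c = 0) are hypothetical and are not taken from any standard's sampling table.

```python
from math import comb


def acceptance_probability(sample_size: int, accept_number: int,
                           defect_rate: float) -> float:
    """Probability of finding at most accept_number defectives in the sample,
    assuming independent devices with the given defect rate."""
    return sum(
        comb(sample_size, k)
        * defect_rate ** k
        * (1.0 - defect_rate) ** (sample_size - k)
        for k in range(accept_number + 1)
    )


# Hypothetical plan: sample 125 devices, reject the lot on any defective.
p_accept_good_lot = acceptance_probability(125, 0, 0.001)   # 0.1% defective
p_accept_bad_lot = acceptance_probability(125, 0, 0.02)     # 2% defective
```

Evaluating the function over a range of defect rates traces out the plan's operating characteristic curve, which shows how sharply a given sample size separates good lots from bad ones.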

The second is on-going reliability monitoring of significant sample sizes of the outgoing finished product. This reliability monitor consists of accelerated environmental and electrical stresses, such as dynamic high temperature burn-in life test, accelerated temperature/humidity stress and temperature cycling.

The third important tool is “LIFO” (Last In First Out). To gain advance quality/reliability information on the product population in the production line, we use the “LIFO” system, in which samples move through the production line at priority speed, ahead of the rest of a production “mother lot”, and are subjected to all of the quality/reliability tests and stresses described above. Lots still on the production line that show problems are put on hold until corrective action(s) are implemented.

All information from the defective devices found in the above tests and monitors is used, after rapid failure and root cause analysis, for immediate corrective action on the production line.

The fourth source, and the most important one for information on failure mechanisms, is defective devices returned from the field by customers. These are carefully analyzed by the methods described above: root cause(s) are identified, corrective action(s) are implemented, and the results are rapidly communicated back to the customers.