Hardware reliability testing FAQ

  • What’s different about hardware reliability testing?
  • When is it most useful?
  • What are the most important things to keep in mind?
  • What should I watch out for?

What’s different about hardware reliability testing vs other types of hardware testing?

Reliability testing, aka endurance or durability testing, runs the part through many hours of use while monitoring the part's performance, sometimes continuously and sometimes periodically.

Unlike design validation, reliability testing usually does not subject the part to a wide range of extreme conditions to look for corner cases. Instead, it subjects the part to “normal” operation in an accelerated fashion. Note that “normal” does not necessarily imply static operating conditions: the part might be cycled through the range of conditions seen in normal use, such as back-and-forth motion for a car seat cushion, or -120 C to +120 C temperature cycling for a satellite part orbiting the earth.
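To make “cycling through normal conditions” concrete for the satellite example, here is a minimal Python sketch of a trapezoidal temperature-cycle setpoint: cold dwell, ramp up, hot dwell, ramp down, repeat. The function name, the 10 C/min ramp rate and the 10-minute dwells are hypothetical values chosen for illustration, not a recommended profile.

```python
def thermal_cycle_setpoint(t_min, t_low=-120.0, t_high=120.0,
                           ramp_c_per_min=10.0, dwell_min=10.0):
    """Return the chamber setpoint (C) at time t_min (minutes).

    Hypothetical trapezoidal profile: cold dwell, ramp up, hot dwell,
    ramp down, then repeat indefinitely.
    """
    ramp_min = (t_high - t_low) / ramp_c_per_min   # duration of one ramp
    period = 2 * (ramp_min + dwell_min)            # duration of one full cycle
    phase = t_min % period                         # time elapsed within the current cycle

    if phase < dwell_min:                          # cold dwell
        return t_low
    phase -= dwell_min
    if phase < ramp_min:                           # ramp up
        return t_low + ramp_c_per_min * phase
    phase -= ramp_min
    if phase < dwell_min:                          # hot dwell
        return t_high
    phase -= dwell_min
    return t_high - ramp_c_per_min * phase         # ramp down
```

Sampling this function at a fixed interval gives the command sequence for a temperature chamber controller; with these assumed values one full cycle takes 68 minutes.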

Parts are often run to failure or to degraded performance to verify whether one design is better or worse than another. See https://www.weibull.com/basics/growth.htm for more on reliability growth. Sometimes the performance measurements are coupled with an analysis of the failure mode(s). Other times, this failure mode and effects analysis (FMEA) is done without reviewing any measurements taken during cycling, which implies that the reliability test system only controls the automated cycling without assessing the part's operation. Since around the early 2000s, however, collecting data during cycling has become nearly universal.
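One common way to quantify reliability growth from run-to-failure data is the Crow-AMSAA (power-law NHPP) model; it is our choice for illustration, and the linked page or your own analysis may use other models. The sketch below computes the standard maximum-likelihood estimates for a time-truncated test, where failure_times are the cumulative test hours at each failure and test_end is the total test time. A shape estimate beta below 1 suggests the failure intensity is decreasing, i.e. reliability is growing. The function name and example numbers are hypothetical.

```python
import math


def crow_amsaa_mle(failure_times, test_end):
    """Maximum-likelihood Crow-AMSAA estimates for a time-truncated test.

    failure_times: cumulative test hours at which each failure occurred.
    test_end: total accumulated test hours T (test stopped at T, not at a failure).
    Returns (beta, lam, instantaneous_mtbf_at_T).
    """
    n = len(failure_times)
    if n == 0:
        raise ValueError("need at least one failure")
    if any(t <= 0 or t > test_end for t in failure_times):
        raise ValueError("failure times must lie in (0, test_end]")

    log_sum = sum(math.log(test_end / t) for t in failure_times)
    if log_sum == 0:
        raise ValueError("all failures at test_end; beta is undefined")

    beta = n / log_sum              # shape: < 1 means failure intensity is falling
    lam = n / test_end ** beta      # scale of the intensity lam * beta * t**(beta - 1)
    intensity_at_end = lam * beta * test_end ** (beta - 1)
    return beta, lam, 1.0 / intensity_at_end
```

For example, failures at 100, 300, 700 and 1500 hours in a 2000-hour test give a beta of about 0.64, which would indicate growth.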

When is hardware more likely to go through substantial reliability testing?

When the part is expensive, or when a failure of the part is expensive.

The cost of failure could be the cost incurred by that failure, for example:

  • equipment destruction,
  • fire,
  • or other catastrophe,

or the cost of warranty and repair of a failed product, for example:

  • handling the return,
  • reduced customer satisfaction,
  • or other market implications.

What are the most important things to keep in mind to do reliability testing well?

  1. Decide how best to mimic actual use while staying within a reasonable budget. For example, if your product extends and retracts, you can likely use a pneumatic actuator rather than a motion-controlled arm. The former is less expensive but perhaps less representative of the motion profile seen in real use. Does that difference matter?
  2. Decide on the smallest set of parameters that still lets you assess the part's performance well. Since reliability growth is demonstrated by longer life between failures or performance degradation, detecting and/or predicting an imminent failure early allows faster assessment and shorter reliability test cycles.
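A simple way to predict an imminent failure, in the spirit of item 2, is to fit a trend to a degradation metric and extrapolate it to a failure threshold. The sketch below uses an ordinary least-squares line; it assumes the metric (for example, a vibration RMS level) rises as the part degrades and that the trend is roughly linear, both of which you would need to check for your part. The function name and threshold are hypothetical.

```python
def predict_crossing(cycles, values, threshold):
    """Extrapolate a least-squares line through (cycles, values) to `threshold`.

    Assumes the degradation metric rises as the part wears and the trend is
    roughly linear. Returns the predicted cycle count, or None if there is no
    upward trend to extrapolate.
    """
    n = len(cycles)
    if n < 2 or n != len(values):
        return None
    mean_x = sum(cycles) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in cycles)
    if sxx == 0:
        return None  # all readings at the same cycle count; no slope
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(cycles, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    if slope <= 0:
        return None  # metric is flat or improving; nothing to extrapolate
    return (threshold - intercept) / slope
```

For readings of 1.0, 1.2, 1.4 and 1.6 at 0, 1000, 2000 and 3000 cycles, a threshold of 2.0 is predicted at roughly 5000 cycles. If degradation accelerates rather than growing linearly, a different model would be needed.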

What should I watch out for?

  1. Watch out for the potentially massive amount of data the test might generate. For example, collecting bearing vibration continuously for a 10,000 hour test could easily generate 100s of GB per channel. Ask yourself whether you need to collect all of it. A common practice combines periodic sampling of data (low-speed data points or high-speed waveforms) with a history buffer, sampled “continuously”, that is saved when the part fails, so you have a detailed record of the events leading up to the failure.
  2. If the test takes longer than you expected, it will tie up measurement equipment longer than planned. So, assume the test will last as long as the system takes to run the part through its expected number of cycles.
  3. Waiting too long to start reliability testing. Don’t wait until the product is about to launch: reliability testing might take several years, as long as the life cycle you expect for the entire system.
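The arithmetic behind items 1 and 2, plus the periodic-sampling-with-history-buffer scheme from item 1, can be sketched in a few lines of Python. The 10 kS/s sample rate and 2-byte samples used below are assumptions for illustration; the class name HistoryRecorder and its parameters are hypothetical, and a real system would write the periodic log and the failure snapshot to disk rather than keep them in memory.

```python
from collections import deque


def raw_data_bytes(test_hours, sample_rate_hz, bytes_per_sample):
    """Raw data volume for one channel sampled continuously for the whole test."""
    return test_hours * 3600 * sample_rate_hz * bytes_per_sample


def estimated_test_hours(expected_cycles, cycles_per_hour):
    """How long the rig (and its measurement equipment) will be occupied."""
    return expected_cycles / cycles_per_hour


class HistoryRecorder:
    """Hypothetical scheme: sparse periodic log plus a rolling history buffer
    that is frozen and kept only when a failure is detected."""

    def __init__(self, buffer_len, log_every):
        self.history = deque(maxlen=buffer_len)  # oldest samples fall off automatically
        self.log_every = log_every               # keep 1 of every `log_every` samples long term
        self.periodic_log = []
        self.failure_snapshot = None
        self._count = 0

    def add_sample(self, sample, failed=False):
        self.history.append(sample)
        if self._count % self.log_every == 0:
            self.periodic_log.append(sample)
        self._count += 1
        if failed and self.failure_snapshot is None:
            # Save the lead-up to the first detected failure.
            self.failure_snapshot = list(self.history)
```

With those assumptions, raw_data_bytes(10_000, 10_000, 2) gives 7.2e11 bytes, about 720 GB for a single channel over a 10,000 hour test, consistent with the “100s of GB” estimate in item 1.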

My part is expensive so I don’t want to break it more often than necessary. What should I consider?

You likely have (and should have) some idea of the failure modes for your part. Consider setting or estimating threshold levels for your measurements, above which the readings indicate performance degradation. Alternatively, you could monitor one or more parameters for rapid change, flagging a rate of change above some threshold. Either kind of threshold breach indicates that you should stop the test and review the condition of the part before continuing.
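The two stop conditions described above, a level threshold and a rate-of-change threshold, might look like this for a single parameter. The function name and limits are hypothetical; with several parameters you would run the same check per channel. Computing the rate from only the last two readings is noisy, so in practice you might smooth or average over a window first.

```python
def check_stop(times_h, values, level_limit, rate_limit_per_h):
    """Return 'level', 'rate', or None for the newest reading of one parameter.

    times_h and values are the readings so far (oldest first). The limits are
    hypothetical and would come from your knowledge of the failure modes.
    """
    if not values:
        return None
    if values[-1] > level_limit:
        return "level"  # absolute degradation threshold exceeded
    if len(values) >= 2:
        dt = times_h[-1] - times_h[-2]
        if dt > 0:
            rate = (values[-1] - values[-2]) / dt
            if abs(rate) > rate_limit_per_h:
                return "rate"  # parameter changing faster than expected
    return None
```

A test loop would call this after each new reading and pause the cycling for inspection whenever it returns something other than None.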

What hardware reliability tests have you implemented?

  1. Tests based on the customer’s FMEA.
  2. Accelerated testing, including HALT/HASS, which runs the part repeatedly with no dead time between cycles, under either static or dynamic environmental conditions.
  3. Pre-processing test results so the customer can perform “reliability growth analysis”.
  4. Measuring part performance while the part is subjected to thermally induced, shock, and vibration fatigue testing.
  5. Continuously cycling the part through a specific profile that simulates actual use, while changing that profile randomly. The profile can cover load, temperature, speed, and so forth.
  6. When parts wear during use, it can be important to adjust the load profile to accommodate stretch and wear in the part, so that the target load profile is matched rather than the part simply following the same position-based trajectory initially applied to it.
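Item 6 can be illustrated with a toy simulation. The part is modeled (hypothetically) as a linear spring whose slack grows slightly every cycle to represent stretch and wear. Holding the original position trajectory lets the peak load drift down, while a simple integral-style correction of the position command, applied once per cycle, keeps the peak load near its target. The model, gain and stretch rate are assumptions for illustration only, not a description of a specific test rig.

```python
def simulate_peak_loads(cycles, target_load=100.0, stiffness=100.0,
                        stretch_per_cycle=0.001, gain=0.5, load_matched=True):
    """Toy model: the part is a linear spring whose slack grows every cycle.

    Returns the peak load reached in each cycle. With load_matched=False the
    position command stays fixed at its initial value.
    """
    slack = 0.0
    position = target_load / stiffness  # initial command that hits the target load
    peak_loads = []
    for _ in range(cycles):
        load = stiffness * max(0.0, position - slack)  # load at the commanded position
        peak_loads.append(load)
        if load_matched:
            # Integral-style correction: move the command by a fraction of the load error.
            position += gain * (target_load - load) / stiffness
        slack += stretch_per_cycle  # stretch/wear accumulated during this cycle
    return peak_loads
```

With the default values, the fixed-position run ends near 50 units of load after 500 cycles, while the load-matched run settles about 0.2 below the target of 100; a larger gain shrinks that steady offset.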