In block diagram form, it likely looks something like this:
The process you want to follow typically looks like this:
- Isolate and separate
- Simplify the input stimulus as much as possible
- Observe the output
- Iterate to zero in on the failure
Isolate and separate
You’re usually observing a symptom, not the root problem itself. Look for clues that point you toward the general region of the problem. It’s okay to be wrong; you’ll get better at this with experience.
From there, divide and conquer. In other words, select intermediate input and output locations to separate the system and create your own little (temporary) subsystem. Note these need to be locations that you’re able to stimulate and observe.
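The divide-and-conquer idea can be sketched in code. This is a minimal illustration, not a recipe for any particular system: it assumes the system can be modeled as a chain of stages, that you have a known-good reference model for each stage, that exactly one stage is faulty, and that the chosen stimuli actually activate the failure. All names here (`run_stages`, `find_faulty_stage`, the example stages) are hypothetical.

```python
from typing import Callable, Sequence

Stage = Callable[[int], int]


def run_stages(stages: Sequence[Stage], value: int, start: int, stop: int) -> int:
    """Inject `value` at the input of stage `start`, observe after stage `stop - 1`."""
    for stage in stages[start:stop]:
        value = stage(value)
    return value


def find_faulty_stage(actual: Sequence[Stage],
                      reference: Sequence[Stage],
                      stimuli: Sequence[int]) -> int:
    """Bisect the chain and return the index of the single faulty stage.

    Assumes exactly one stage in `actual` misbehaves and that at least one
    stimulus exposes it.
    """
    lo, hi = 0, len(actual)          # fault is somewhere in stages[lo:hi]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # Temporary subsystem: stages lo..mid-1, stimulated and observed directly.
        first_half_fails = any(
            run_stages(actual, s, lo, mid) != run_stages(reference, s, lo, mid)
            for s in stimuli
        )
        if first_half_fails:
            hi = mid                 # fault is in the first half
        else:
            lo = mid                 # first half is clean, look in the second
    return lo


# Hypothetical four-stage chain: scale, offset, clamp, invert.
reference = [lambda x: x * 2, lambda x: x + 3, lambda x: min(x, 100), lambda x: -x]
# Same chain, but the offset stage adds 4 instead of 3.
actual = [lambda x: x * 2, lambda x: x + 4, lambda x: min(x, 100), lambda x: -x]

faulty_index = find_faulty_stage(actual, reference, [0, 1, 5])
```

On real hardware the "reference" is usually your expectation of what each intermediate point should look like, and each bisection step costs probing effort, so the split points are chosen by accessibility as much as by position.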
Simplify the input stimulus as much as possible
Your custom test equipment may be working with dynamic, complex, and sometimes noisy signals. To diagnose the problem and find the root cause, simplify the input stimulus as much as possible, meaning just complex enough that the failure still activates. What that looks like depends on what you’re trying to stimulate, but some ideas include:
- Constant/static input voltage away from the noise floor
- Sinusoids or other alternating analog waveforms that produce known outputs
- Counting patterns
- Noise-free digital inputs
- Walking ones
As you inject new stimulus, take care not to damage the system further: keep maximum allowable input levels in mind, be careful not to short circuit anything, etc.
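Two of the digital stimulus ideas above, walking ones and counting patterns, are simple to generate in software. The sketch below is illustrative only: the function names are hypothetical, and the limit check stands in for whatever maximum input levels your hardware actually specifies.

```python
from typing import Iterable, Iterator, List


def walking_ones(width: int) -> Iterator[int]:
    """Yield words with a single 1 bit walking from the LSB to the MSB."""
    for bit in range(width):
        yield 1 << bit


def counting_pattern(width: int) -> Iterator[int]:
    """Yield every value from 0 up to the largest value that fits in `width` bits."""
    for value in range(1 << width):
        yield value


def check_within_limits(values: Iterable[int], minimum: int, maximum: int) -> List[int]:
    """Return the stimulus as a list, refusing any value outside the allowed range.

    `minimum` and `maximum` are assumed to come from the system's documented
    input limits; checking before injection avoids overdriving the input.
    """
    checked = list(values)
    for value in checked:
        if not minimum <= value <= maximum:
            raise ValueError(f"stimulus value {value} outside [{minimum}, {maximum}]")
    return checked


# An 8-bit walking-ones pattern, verified against a hypothetical 0..255 input range.
stimulus = check_within_limits(walking_ones(8), 0, 255)
```

Walking ones tend to expose stuck or shorted data lines, while a counting pattern makes dropped or repeated samples easy to spot in the output.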
Observe the output
This step can range from super easy to pretty involved, depending on how hard it is to access/probe the intermediate output (physically, electrically, or digitally), and on whether you need to trigger the capture of the output based on the timing of the input.
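Triggering a capture on input timing is normally done by an oscilloscope or logic analyzer, but the same idea can be applied afterward to recorded data. The sketch below assumes the input and output were sampled together into two equal-length lists; `capture_on_trigger` and its parameters are hypothetical names used only for illustration.

```python
from typing import List, Optional, Sequence


def capture_on_trigger(input_samples: Sequence[float],
                       output_samples: Sequence[float],
                       threshold: float,
                       window: int,
                       pre: int = 0) -> Optional[List[float]]:
    """Return the output samples around the first rising edge of the input.

    A rising edge is the first index where the input crosses from below
    `threshold` to at or above it. The capture holds `pre` samples before
    the edge and `window` samples starting at the edge. Returns None if the
    input never triggers.
    """
    for i in range(1, len(input_samples)):
        if input_samples[i - 1] < threshold <= input_samples[i]:
            start = max(i - pre, 0)
            return list(output_samples[start:i + window])
    return None


# Input steps high at index 3; capture three output samples from that point.
recorded_input = [0, 0, 0, 1, 1, 1, 0]
recorded_output = [10, 11, 12, 13, 14, 15, 16]
captured = capture_on_trigger(recorded_input, recorded_output, threshold=0.5, window=3)
```

Keeping a few pre-trigger samples (the `pre` argument) is often useful, since the interesting behavior sometimes starts slightly before the event you chose to trigger on.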
Iterate to zero in on the failure
At this point one of two things has likely occurred:
Option 1: You were able to replicate the problem within the temporary sub-system you created. If this occurs, either you’ve narrowed the problem down as far as you need to, or if there are still several components in play, repeat the process starting at the isolate and separate step to continue to narrow down and locate the failure.
Option 2: You weren’t able to replicate the problem within the temporary sub-system you created. There are several possible reasons for this. The most likely are:
- You didn’t activate the failure.
- The failure exists outside of the temporary subsystem you created.
- You weren’t able to observe the intermediate output correctly.
There are several things you can try here. You can:
- Try modifying the stimulus.
- Try changing the location of your temporary inputs and/or outputs.
- Make sure you’re set up to observe the output well.
- Pull in another engineer who’s not close to the problem but has a lot of experience, so you can bounce ideas off them and get a fresh set of eyes on it.
A word about repeatability of the failure
If the initial symptom/problem can be replicated consistently, be thankful. Some failures occur very infrequently. The first step is to be able to reproduce the problem.
A few things to consider if you can’t reproduce the problem off the bat:
- Replicate the stimulus and environmental conditions as closely as possible. Is the input the same? Is the temperature the same?
- Stress the test system a bit. Sometimes operating the test system at its limits can encourage an intermittent failure to occur more frequently.
- Consider adding some detailed error/diagnostic logging to the code. The caution flag here is that because you’re changing the code base, you might actually make the problem less likely to occur by subtly changing the timing. You’re also invalidating your released code, so make sure to branch off the released main line before making changes.
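One common way to limit how much diagnostic logging disturbs timing is to record events into a fixed-size in-memory buffer and only format or write them out after the failure has occurred. The sketch below is one possible approach under that assumption, not a guarantee of timing neutrality; `TraceBuffer` is a hypothetical name.

```python
import time
from collections import deque
from typing import Any, List, Tuple


class TraceBuffer:
    """Fixed-size ring buffer of (timestamp, tag, value) diagnostic events."""

    def __init__(self, capacity: int = 1024) -> None:
        # A deque with maxlen silently discards the oldest event when full.
        self._events: deque = deque(maxlen=capacity)

    def record(self, tag: str, value: Any = None) -> None:
        """Store an event; no string formatting or file I/O happens here."""
        self._events.append((time.perf_counter(), tag, value))

    def dump(self) -> List[Tuple[float, str, Any]]:
        """Return the retained events, oldest first, for later inspection."""
        return list(self._events)


# Record five events into a three-entry buffer; only the last three are kept.
trace = TraceBuffer(capacity=3)
for step in range(5):
    trace.record("step", step)
retained_values = [value for _, _, value in trace.dump()]
```

Even a lightweight buffer like this still changes the code that runs, so it is worth confirming that the failure continues to occur with the instrumentation in place.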