Debug Part 1: General Strategy

Charles Eidsness

Due to the complexity of even the simplest High Speed Digital Designs, no matter how much time and money is spent developing a new product there is always some risk that there will be issues with the design. Unless you're incredibly lucky, at some point in your career you're going to have to deal with design issues (or, more likely, you already have). These issues can be very costly in time, money, and reputation, and if not resolved quickly they can become very stressful. I've had the good fortune (or maybe bad fortune) to be brought into several projects to resolve dozens of issues on a variety of designs. This article collects some of the advice that I've gathered.

"If you can keep your head when all about you are losing theirs and blaming it on you... you'll be a Man, my son!" Kipling

Debugging would be a fun puzzle (kind of like a Columbo movie) if it wasn't for the pressure to resolve the issue. Pressure can come from external and internal sources. In my experience the heaviest pressure comes from within. Everyone wants to do a good job, and most Engineers are at least a little bit obsessive-compulsive (just ask my wife). An open issue has a way of consuming you: you can't think of much else, you have trouble sleeping, and you end up working crazy hours. One of my more eloquent colleagues calls it "Being in the Sh*t." You just have to deal with the pressure you put on yourself, but it's not necessarily a bad thing. External pressure, on the other hand, can be minimized by making sure that everyone (managers, owners, executives, directors, etc.) is getting timely status updates. By providing status updates you can put those not directly involved in the technical aspects of debugging at ease. This could be in the form of daily status meetings, phone calls, emails, or whatever else gets the job done. I recommend reserving at least a half-hour a day for status updates. One word of warning: status meetings can become a double-edged sword by morphing into "debug by committee" sessions. I'll cover that more later, but just keep in mind that the point of a status update should be to update status. If you want to do some brainstorming I recommend separating it from status updating.

Know Thyself Young Grasshopper

Debugging is a great way for a Junior Designer to start his or her career, keeping in mind that, as in all Junior positions, they will need support from good Senior Engineers. In fact I think everyone should start their career debugging other folks' designs. Not only do you develop some great lab skills, but you learn about common and not-so-common issues in ways that you never forget.

The Early Bird Gets the Worm

The cost of resolving an issue goes up exponentially at each design stage, so it is crucial to catch issues early. During the initial design stages it is important to make effective use of simulation, analysis, peer reviews, eval cards, etc. The best place to resolve an issue is before you build any hardware. When you're performing analysis, simulations, and reviews, keep in mind that the worst (most expensive to resolve) issues are the flaky ones that only fail under some specific but rare conditions. If things are going to fail I usually want them to fail spectacularly and reliably, preferably in a way that doesn't involve the fire department, but a little explosion never hurt anyone. I spend a good deal of my development time reducing the risk of encountering issues that are difficult to resolve. For instance, a signal integrity induced failure (unless it's incredibly bad) is going to present itself as a flaky bug, on a few cards, at specific temperatures, running specific firmware. These types of bugs will most likely slip through Design Verification Testing, will hopefully get caught in System Integration Testing, but may even end up at a Customer's Site. They are also very time-consuming to resolve. On the other hand, putting a Tantalum Capacitor in backwards is something you're going to notice right away, and it is easy to resolve. It's not okay to make either mistake, but given the option between the two I will pick the latter. Stated another way, given limited design resources I would rather spend them mitigating flaky issues than mitigating easy-to-resolve issues.

The Let's Just Try Stuff Syndrome

One of the dangers of having daily status meetings (mentioned above) is that you might find you're getting pulled in several directions by well-meaning people. It's important to make sure that everyone has up-to-date information, but in my experience issues are never resolved in meetings. They're resolved in the lab, during conversations between experienced technical people, while reading spec sheets, or in the shower. Solution by committee is very inefficient and seldom effective. One strategy that generally comes out of a "solution by committee" meeting is the "Let's Just Try Stuff" strategy. Trying stuff that you hope will solve the problem without understanding the fundamentals behind the issue is almost always a waste of time. Make sure that if you are performing an experiment you have a clear course of action for each possible outcome. No one ever won the Super Bowl throwing nothing but Hail Marys; you need to have a solid running game. A little bit of Let's Just Try Stuff isn't going to kill you and may provide some insights for starting a more structured problem-solving process, but I've seen many cases where Designers get sucked into months of Let's Just Try Stuff, at the end of which they still don't have a solution and haven't narrowed down the potential cause of the problem. For a lot of work they have produced nothing of value.

Break it Down

I recommend using the opposite of the Let's Just Try Stuff strategy: the time-tested Engineering Problem Solving Strategy, which involves breaking a problem down into small manageable blocks and methodically stepping through the blocks to narrow down the source of the issue. You then break down the suspect block again, over and over, until both the source of the issue and the resolution are evident. For example, let's say you have a USB interface between two cards in a system that is failing at -30 °C. The first step is to figure out whether it's card A's fault, card B's fault, the interconnect, or, if you're unlucky, some combination of the three. The next step is to isolate the failing component or interface on the failing block, and then the failing signal or function, etc. Only in this way can you systematically resolve an issue.
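The first step of the USB example can be sketched as a swap test. This is only a minimal illustration, and it assumes something the text above does not state: that a known-good ("golden") unit is available for each block. The block names and the `link_passes` callback are hypothetical stand-ins for whatever lab procedure actually exercises the link under the failing condition.

```python
from typing import Callable, Dict, List

# Hypothetical block names taken from the USB example in the text.
BLOCKS = ("card_a", "card_b", "interconnect")


def isolate_failing_blocks(link_passes: Callable[[Dict[str, str]], bool]) -> List[str]:
    """Return the blocks that fail when every other block is known-good.

    link_passes takes a setup such as
    {"card_a": "suspect", "card_b": "golden", "interconnect": "golden"}
    and returns True if the link works under the failing condition
    (for example, at cold temperature).
    """
    failing = []
    for block in BLOCKS:
        # Start from an all-golden setup, then swap in a single suspect unit.
        setup = {name: "golden" for name in BLOCKS}
        setup[block] = "suspect"
        if not link_passes(setup):
            failing.append(block)
    return failing


if __name__ == "__main__":
    # Stand-in for a real lab run: pretend only the suspect card B fails.
    def simulated_run(setup: Dict[str, str]) -> bool:
        return setup["card_b"] != "suspect"

    print(isolate_failing_blocks(simulated_run))
```

If the function returns an empty list even though the all-suspect system fails, no single block is at fault on its own, which points to the unlucky "some combination of the three" case. The same loop can then be repeated one level down, with the components or interfaces of the failing block as the new list of blocks.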

Tier One Suspects: Resets, Clocks, and Power

Once you've narrowed the issue down to a specific device or devices, the first place I recommend starting is with Resets, Clocks, and Power. Make sure that you're bringing the devices out of reset in a way that is consistent with what the Spec Sheets recommend, and that there are no glitches on the reset lines. Make sure that the clocks are at the correct frequencies, that the clocks don't have signal integrity issues (including Voh/Vih and Vol/Vil matching), and that the clocks meet all jitter (or phase noise) and duty cycle requirements. Make sure that the Power Supplies are at the correct levels (in fact I usually start with this one), that there isn't an excessive amount of power-supply noise, and that any power-on sequencing requirements are met. By covering these three suspects first you will probably solve about 25% of the issues you'll see. At the least you will have ruled out the usual suspects.
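The level and frequency checks above come down to simple arithmetic, which can be sketched as follows. This is an illustration only: the function names are hypothetical, and the rail names, tolerances, and measurements are made-up numbers, not values from any Spec Sheet. The real limits have to come from the device documentation.

```python
from typing import Dict, List, Tuple


def rail_in_tolerance(measured_v: float, nominal_v: float, tolerance_pct: float) -> bool:
    """True if the measured rail is within +/- tolerance_pct of nominal."""
    return abs(measured_v - nominal_v) <= abs(nominal_v) * tolerance_pct / 100.0


def clock_error_ppm(measured_hz: float, nominal_hz: float) -> float:
    """Frequency error in parts per million relative to nominal."""
    return (measured_hz - nominal_hz) / nominal_hz * 1e6


def rails_out_of_tolerance(rails: Dict[str, Tuple[float, float, float]]) -> List[str]:
    """rails maps a rail name to (measured_v, nominal_v, tolerance_pct)."""
    return [name for name, (meas, nom, tol) in rails.items()
            if not rail_in_tolerance(meas, nom, tol)]


if __name__ == "__main__":
    # Illustrative numbers only, not taken from any datasheet.
    measurements = {"3V3": (3.31, 3.3, 5.0), "1V2": (1.10, 1.2, 5.0)}
    print(rails_out_of_tolerance(measurements))
    print(clock_error_ppm(25_000_500.0, 25_000_000.0))
```

With these example numbers the 1.2 V rail is flagged, since 1.10 V is more than 5% below nominal, and the clock is roughly 20 ppm fast. A static check like this says nothing about noise, glitches, jitter, or sequencing, which still need a scope on the bench.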

Tier Two Suspects: Floating Inputs, Manufacturing Issues, and Signal Integrity

These three suspects should cover about 10% of the remaining possible issues. Make sure that there are no floating inputs on the card. Interrupt Requests are of specific concern, but with CMOS any floating input can cause problems. Make sure that every terminal of every device is soldered down, that there are no shorts, and that the correct components are populated. Actually, you might want to start with the manufacturing check; it's of particular interest if you have one card not working from a batch of working cards (which is why it's important to always have more than one card built at a time). Finally, make sure that you don't have any signal integrity issues. High-speed serial links are of particular interest, so make sure that they all conform to any receiver requirements. Also make sure that Voh/Vih and Vol/Vil work out with a reasonable amount of noise margin.
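The Voh/Vih and Vol/Vil check can be written out explicitly. The high-side noise margin is the driver's minimum Voh minus the receiver's minimum Vih, and the low-side margin is the receiver's maximum Vil minus the driver's maximum Vol. The sketch below assumes static DC levels only; the class and function names are hypothetical, and what counts as a "reasonable" margin is left as a parameter because it depends on the design.

```python
from typing import NamedTuple, Tuple


class DriverLevels(NamedTuple):
    voh_min: float  # lowest voltage the driver guarantees for a logic high
    vol_max: float  # highest voltage the driver guarantees for a logic low


class ReceiverLevels(NamedTuple):
    vih_min: float  # lowest voltage the receiver accepts as a logic high
    vil_max: float  # highest voltage the receiver accepts as a logic low


def noise_margins(driver: DriverLevels, receiver: ReceiverLevels) -> Tuple[float, float]:
    """Return (high-side margin, low-side margin) in volts."""
    high = driver.voh_min - receiver.vih_min
    low = receiver.vil_max - driver.vol_max
    return high, low


def levels_compatible(driver: DriverLevels, receiver: ReceiverLevels,
                      required_margin_v: float) -> bool:
    """True if both margins meet the required margin."""
    high, low = noise_margins(driver, receiver)
    return high >= required_margin_v and low >= required_margin_v


if __name__ == "__main__":
    # Illustrative levels only; take the real ones from the Spec Sheets.
    drv = DriverLevels(voh_min=2.4, vol_max=0.4)
    rcv = ReceiverLevels(vih_min=2.0, vil_max=0.8)
    print(noise_margins(drv, rcv))
    print(levels_compatible(drv, rcv, required_margin_v=0.3))
```

A negative margin on either side means the driver cannot guarantee a level the receiver will recognize, which tends to show up as exactly the kind of flaky, temperature-dependent failure described earlier.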

Tier Three Suspects: Everything Else

Then there's everything else. To resolve these issues I generally work my way systematically through the Device Spec Sheets, ordering the Specs from most to least likely culprit and ensuring that each requirement is met (including sane register settings). Device Errata are also your friend at this stage, as are Device Manufacturers. Once you have the issue narrowed down to a specific device you can lean on the Manufacturer for assistance. Don't expect miracles from anyone, but every so often they've come through for me with some help. Brainstorming with other experienced Designers is also helpful, as is comparing the issue in front of you to other issues you've seen. I plan on including some case studies in Part 2 of this article (when it's complete) that may also be of some help.