The last 24 hours have been an exercise in frustration, sleeplessness, wrong turns, dead ends, and embarrassment. I've been working on a little project that developed a problem, seemingly inexplicably, and I could not find the cause.
This little AVR-based project includes a 16x2 character LCD display. For the last week, it has been working like a champ. I got a lot of fundamentals worked out, and decided to start cleaning up the code. It had gotten to be quite a mess, as I quickly worked through the various building blocks of the overall device, and I needed to make it look more like the final product would.
I would make small changes, compile them and load them onto the device, making sure things worked, or worked the way I wanted them to. Suddenly, the LCD stopped working. I could tell the MCU was doing its job, as the heartbeat LED kept blinking, and I was getting debugging output from the serial console attached. But no LCD.
Since the last change I made was to add a MOSFET so that I could power down the LCD when not in use, I thought perhaps I had damaged the LCD. I removed the new circuitry, and spent some time searching for a similar LCD to try. Found one, popped it in, and it behaved the same way!
Perhaps I had damaged the MCU. Unlikely, because everything else was working perfectly. So I replaced that, flashed it, and tried again. No dice. Now, I was already operating on very little sleep this weekend, and it was after midnight. Had I been firing on all cylinders, I would've abandoned the effort and gone to bed. Or looked for a software problem first. But I had been on such a roll, and I don't back down easily from a challenge like this.
I fired up the oscilloscope and started probing the connections between the MCU and LCD. Those connections were among the first things I had checked, making sure everything was still connected. Since I had tried the new MCU with the old LCD, I thought maybe the LCD was damaging the MCU's pin drivers. So I checked each pin to see if it was changing state.
I found five that weren't! I pored over the code, thinking somewhere I had introduced a change (by accident), that disabled some of the pins. I couldn't find anything. I tried a third MCU. I tried a third LCD. Nothing. Same five pins not working. Then I realized that there must be an internal peripheral on those pins inside the MCU that was overriding the general I/O functionality. Looking at the data sheet, I saw that the JTAG interface lived on those pins, and a vague memory floated up: new MCUs have the JTAG enabled by default.
So, I disabled that, excited that I had figured things out, and tried again.
No luck. Still didn't work. Argghhhhhh!
I gave up. Went to bed (now well past 2 am), got up late the next morning, went to work, came home. Watched an hour of TV, then came in here to figure out the problem.
I decided to revert my source code back to a known-working revision. I was currently on revision 15, and the checkin comments showed the LCD had started working after r7. So, I updated my code to that revision, and tried it. It worked! Praise jeebus!
I tried to see the differences between that code and the latest, but there were too many changes. I updated to the latest code and tried again, just to verify the problem was in the code, and it still worked! WTF? Now, somewhere in here I had made a couple other changes to the code, trying to undo the most recent additions. The latest code had those changes (I had checked them in before reverting), so I was really confused.
But sure enough, it seemed to work. So I put back the code I had just taken out, to try to reproduce the problem. It still worked. WTH? (Ironically, I was now looking for failure, because that would tell me I had figured out the "root cause," as NASA likes to put it.) I could not reproduce the problem.
So, I cleaned up the recent experimentation, checked in the code, and set about to do new work. I wanted to measure the current consumption in various operating modes, so I wired in the ammeter.
Suddenly, the problem reappeared. Argghhhhhh!
I started to think that I wasn't giving the LCD sufficient voltage. The prototype design had a diode from the Vcc to the rest of the circuit, and I thought maybe that little voltage drop was enough to cause problems. That didn't really explain why it had been so reliable up until now, or why it was so unpredictable. I thought maybe the LCD backlight was drawing too much current, and the supply was sagging. Maybe I had accidentally increased its brightness in the code. So I set it to be really dim. No help. I removed the diode. No help.
It finally occurred to me that perhaps I wasn't giving the LCD enough time to get stable power before beginning the initialization sequence. I went looking for the lcdInit() call in the code to add a small delay before it. But I couldn't find it! Somewhere in last night's cleanup, I had deleted the call to initialize the LCD! The cascade of emotion that washed over me was intense. Relief that I had finally tracked down the problem, anger at the hours wasted, regret for the sleepiness I'd felt all day, all stewing in the embarrassment that I hadn't gone about the troubleshooting more methodically.
I put the call back in, and everything worked great. Phew.
But why did it sometimes work and sometimes not?
It turns out, the LCD currently doesn't get shut down when the system goes to sleep or resets. So, when I ran older code, the LCD got initialized. When I loaded newer code, it remained initialized, and so would work. But when I disconnected power in order to insert the ammeter into the circuit, the LCD lost its initialization. And the night before, because I kept pulling power to the circuit deliberately to try to track down the problem, it was losing its initialization every time.
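That sequence is easier to see in a toy model. What follows is only a sketch in plain C that runs on a desktop, not the project's firmware: the Lcd struct and the lcd_power_cycle, lcd_init, mcu_boot, and lcd_shows_text functions are names I made up for illustration, with lcd_init standing in for the real lcdInit() call. The one assumption it encodes is that the LCD keeps its configuration for as long as it has power, no matter what the MCU is doing.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model (hypothetical names): the LCD controller keeps its
 * configuration for as long as it stays powered. */
typedef struct {
    bool powered;
    bool initialized;
} Lcd;

/* Pulling and restoring power wipes the LCD's configuration. */
void lcd_power_cycle(Lcd *lcd)
{
    lcd->powered = true;
    lcd->initialized = false;
}

/* Stand-in for the real lcdInit() call. */
void lcd_init(Lcd *lcd)
{
    assert(lcd->powered); /* initializing an unpowered LCD makes no sense */
    lcd->initialized = true;
}

/* Flashing or resetting the MCU reruns firmware startup, but leaves
 * the LCD's power (and therefore its state) untouched. */
void mcu_boot(Lcd *lcd, bool firmware_calls_init)
{
    if (firmware_calls_init) {
        lcd_init(lcd);
    }
}

/* The display only works if it is powered and has been initialized. */
bool lcd_shows_text(const Lcd *lcd)
{
    return lcd->powered && lcd->initialized;
}
```

Booting with firmware_calls_init set to true (the old revision), then booting again with it set to false (the cleaned-up code) leaves lcd_shows_text returning true, which is the "it still worked!" moment. Call lcd_power_cycle before that second boot, as happened when I wired in the ammeter, and it returns false.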
So many time-consuming wrong turns, so many red herrings. In the end, it was a software problem, one I should have caught much earlier, but because I didn't carefully examine all the changes made between revisions, I didn't notice it.
Hopefully a lesson re-learned.