All was going well.
The service to set up the command was working. Check.
The message was being picked up and processed correctly. Check.
The response was being generated correctly. Check.
Hook it all up together ... and the response wasn't coming back.
To say I was confused is putting it mildly. I had effectively duplicated an existing process ... one that was working perfectly (well, it was after I fixed some minor configuration and environment issues). So what the heck was going on?
I debugged about 15 different ways. Ran it normally and wrote in some logging outputs. I attached to the service process. I attached to the async process. I attached to multiple processes. I injected data. I used our test stub. I hacked the data in the database. I ran unit tests. I ran a console runner.
In final desperation I handed the code over to another dev and asked them to take a look: one shelveset and some database scripts to run against the base code. An hour or so later they came back (actually it may have been earlier, but I'd been called to a meeting so was gone for an hour). Apparently it worked on their machine. No changes. No problems. It magically worked.
Can I scream now?
So the long and the short of it? I have a very good document written with detailed instructions on setting up and debugging this code. I pretty much understand the code on a level I certainly didn't expect to have to know. I have no clue as to why my code is not running on my machine, yet will happily run on someone else's (although I swear this machine hates me, unlike my other old faithful which I tearfully left to a tester because she couldn't handle the RAM upgrade necessary to run my shiny pretties).
Which leads me to my conclusion. Timebox the problem (I probably spent far too much time on it, even if I did solve some other bugs in the process). Grab a partner to give you some fresh perspective. Give someone else the code to look at on a different machine. If you're like me, it may be that your dev environment is the culprit. (That theory is being tested on Monday after I get the code running again on my machine. We are so not trusting the 'it worked on this machine, so that machine can go to prod' theory. I want answers, damn it. If my machine is evil, so be it. She gets reimaged again, for the 2nd time this year, and I pull down the code and the SQL scripts and try it again.)
Some random notes on debugging a frustrating problem:
- Check your permissions on everything. You should be running with the lowest privileges necessary, but different OSes may differ slightly in their permission sets.
- Check that the project has been set to run under the OS correctly. If you are on a 64-bit machine and you are dealing with a 32-bit DLL / COM interop, then the individual project will probably need to run as x86 rather than 'Any CPU'. (The main solution can be Any CPU, but the project that hosts the 32-bit DLL can't be.)
- Use unit tests to verify that each part is working in isolation.
- Run through the code without the debugger attached. Sometimes things behave differently under debug than when they run 'normally' (e.g. timeouts that only show up, or never show up, while you are stepping through).
- Information is power. As a VB6 dev, the easiest way to debug something complex was often to log chunks of data to a file. It's ugly, but it works. The more you know about the state of the data, the easier it is to pinpoint where in the code the problem is.
- Get someone else to help. Your eyes start to gloss over the details you need to be looking at.
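To make the "log chunks of data to a file" note a bit more concrete, here is a minimal sketch in Python (not the language of the project in this post). Everything in it is made up for illustration: the `make_state_logger`, `dump_state` and `process_message` names, the stage labels, and the shape of the message are all assumptions, not the real code.

```python
import json
import logging


def make_state_logger(path):
    """Return a logger that appends one line per checkpoint to `path`."""
    logger = logging.getLogger("state_dump")
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # keep the dumps out of any console logging
    # Drop handlers left over from an earlier call so lines are not duplicated.
    for old in list(logger.handlers):
        logger.removeHandler(old)
        old.close()
    handler = logging.FileHandler(path, mode="a", encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger.addHandler(handler)
    return logger


def dump_state(logger, stage, data):
    """Write the stage name and a JSON snapshot of the data on one line."""
    logger.debug("%s %s", stage, json.dumps(data, sort_keys=True, default=str))


def process_message(logger, message):
    """Hypothetical stand-in for the real pipeline: dump state at each step."""
    dump_state(logger, "received", message)
    response = {"id": message["id"], "status": "ok"}
    dump_state(logger, "response_built", response)
    return response
```

Calling `process_message(make_state_logger("state.log"), {"id": 7, "body": "hi"})` leaves two timestamped lines in `state.log`, one per stage. If a stage never shows up in the file, that tells you roughly where the data went missing.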
Turns out that I was a bit of a nong and had screwed up / missed some dependency injection with Unity. So it's all working now, but boy, what a mess to track down! Lucky one of the guys at work was more cluey than me and got it sorted!
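A missed registration is the sort of thing a startup smoke test might catch: try to resolve every service you need up front and list what fails. The sketch below is a toy container in Python and is not Unity's API; the `Container` class, `find_unresolvable`, and the service names are all hypothetical, and I'm not claiming this is how the real fix was found.

```python
class MissingRegistration(KeyError):
    """Raised when something asks the container for an unregistered name."""


class Container:
    """Toy container: maps a name to a factory that receives the container."""

    def __init__(self):
        self._factories = {}

    def register(self, name, factory):
        self._factories[name] = factory

    def resolve(self, name):
        try:
            factory = self._factories[name]
        except KeyError:
            raise MissingRegistration(name) from None
        return factory(self)


def find_unresolvable(container, required):
    """Try to build each required service; map failures to the missing name."""
    problems = {}
    for name in required:
        try:
            container.resolve(name)
        except MissingRegistration as exc:
            problems[name] = exc.args[0]
    return problems


class ResponseSender:
    """Hypothetical service that needs a queue injected."""

    def __init__(self, queue):
        self.queue = queue


def build_container():
    container = Container()
    container.register("command_service", lambda c: object())
    container.register(
        "response_sender",
        lambda c: ResponseSender(c.resolve("response_queue")),
    )
    # "response_queue" is deliberately left unregistered to show the failure.
    return container
```

With this setup, `find_unresolvable(build_container(), ["command_service", "response_sender"])` reports that `response_sender` cannot be built because `response_queue` was never registered, which is a lot quicker than attaching to three processes to find out.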
