Testing: Coverage Reports Considered Dangerous
Software testing is one of those topics that you don't discuss in bars, along with religion, politics, and logging methodology. Everyone seems to have a different understanding of how testing should be accomplished, and many of us are over-zealous in defending our stances. However, I believe everyone would agree that the real issue surrounding testing methodology is how best to achieve some level of confidence in your software. When written properly, tests can be liberating, allowing you to refactor endlessly (within the constraints of backward compatibility, of course). When written poorly, tests act like a Ziploc bag, locking everything - even bugs - inside an airtight barrier. However, when you're seeking confidence through testing, perhaps the worst thing you can do is to look at a test coverage report. Consider my own recent experience...
Recently, I spent quite a bit of time refactoring Maven's assembly plugin. I'm sorry to say that I was fooled when I started by the siren song of coverage reporting, thinking that I could use this tool to mark my progress in testing the plugin as I went, to make sure I didn't break anything in the process. When I started, the plugin had a suite of tests that were so rigidly constructed that even fixing a simple bug would result in hours of debugging the test cases to find out what assumption had been violated. The plugin itself was an organic mess. Its first implementation had been well-architected, and was fairly sound in its design. However, over time, the design was absorbed into organic chaos, as layer on layer of new features and bug fixes were applied without revisiting the overarching design. The culmination of all of this was a base class which was over 1600 lines long, and which contained substantial duplication of logic...each instance with it's own regional dialect and cultural differences, so to speak.
My idea here was relatively simple: I planned to move the existing tests to an undisclosed location for safe keeping (until the war was over, in a manner of speaking), then construct a completely new suite of tests to focus on small, specific units of the new plugin's implementation. If things got dicey, or I found a way to salvage them, the old tests would still be there waiting. I followed some specific principles for my new unit tests, which I hope to relate more fully in a later post...but for now, suffice it to say that I was following sound practices. As I went, I planned to use the coverage report (in this case, generated by Cobertura, though it's really not their fault) to mark my progress, and let me know how much I had left to do.
The refactoring went well, as did the testing. By the time I was ready to merge my refactor branch back to the trunk, the only real problem I'd had was the time involved - it had taken something like twice as long as I'd budgeted. I had nearly perfect coverage numbers on the parts I felt were critical - not on constructing exceptions where the constructor consists of super( message, error ), mind you - and felt fairly confident that this plugin would almost drop in as a replacement of the old incarnation. As a first test, I attempted to create the assembly for Maven itself, at which point the whole plugin fell apart. After battling several NullPointerExceptions, I finally managed to build this fairly straightforward assembly, but by this time I was badly shaken. I had the coverage numbers, but I was starting to see that they weren't telling me the whole story.
Coverage reporting wasn't showing places where null values were reasonable, but unexpected by the code. This meant that relatively simple configurations could lead to a meltdown. To make things worse, I began to notice that some of the string handling expected by the system wasn't handled properly in the code...archive paths with leading slashes, for instance. Finally, I stopped looking at the report, and wrote a small plugin to help me run a battery of real Maven builds using the freshly built plugin. This enabled me to capture real-world use cases, and test whether they were supported properly.
In the end, the combination of these two approaches resulted in an easy-to-maintain, comprehensive testing strategy. More importantly, I learned quite a bit about the necessity for a composed approach to testing. Unit testing allowed me to catch quite a few niggling little logical errors (let's face it, probably the majority) like a missing '!' or forgetting to assign the result of a formatting method back to the string variable in place of the old value. However, because the unit testing had focused my mind on a tactical level - greatly reinforced by the coverage report I kept gazing at - I forgot to question whether my tests encompassed the full range of real-world values and scenarios. This is where the integration testing saved me. By looking at real example builds, I was able to say with confidence that yes, this plugin would work under those conditions. It also provided a critical side-effect, which is somewhat tangential to this post: integration testing allowed me to capture error reports directly from users, and put them to the test. This provides a quick way to verify a bug - even at a high level - and gives that all-important direction to the debugging effort.
All of the anecdotal points aside, I want to be very clear about one thing: coverage reporting is dangerous. It has a tendency to distract from the use cases that should drive the software development process. It also tends to give the impression that when the code is covered, it is tested and ready for real life. However, just because a line of code has been executed by one (or more) tests, doesn't mean it was exercised with all the weird values that are possible in the real world. It won't tell you that you've forgotten to test the contract on null handling. It won't tell you that you were supposed to check that String for a leading slash, and trim it off if necessary. To achieve real confidence in your tests, you would have to achieve multiple coverage for most of your code, in order to test each line under various conditions. Bumping the progress bar of your coverage report up to account for a line of code which has been touched by one test is a fatally flawed concept, and it can lead to misperceptions on the parts of managers and developers alike. And, since there is no hard-and-fast rule about how many test passes it will take to test any given line of code adequately, the concept of a coverage report is itself fatally flawed.
Technorati Tags: code, darkside, maven, process, smells, testing
Re: Testing: Coverage Reports Considered Dangerous
Fundamentally, if you use coverage to look for gaps in your testing, it's useful, but I wouldn't recommend you go much farther than that; like most statistics, they can be misleading.
Re: Testing: Coverage Reports Considered Dangerous
I realize it's not the right way to do things, but I didn't run the existing tests after the refactor. I could have; I saved them. However, given the nature of the existing tests, I figured it would be a waste of time to do so, for two reasons:
- I couldn't read the existing tests to understand exactly what they were testing.
- The tests were so fragile that any change was breaking them, even if it didn't affect their supposed test sites.
In fact, the quality of these tests was in part derived from the design (or lack thereof) of the plugin code...if we'd paid more attention to refactoring the plugin design as we added features, we probably wouldn't have found ourselves in that situation. Improved testability was one of the biggest motivations for the refactoring.
Since the refactoring, I believe I've managed to recover the actual use-case test coverage back to its previous level, and even improved it substantially. While I can't make that statement with absolute certainty, I don't actually know what we were testing before, so I don't feel like I could ever know this for certain.
Re: Testing: Coverage Reports Considered Dangerous
Would this repeat if someone else now picks up to refactor the plugin? Are your tests readable enough with adequate explanation on assumptions.
The point that i am trying to drive home is that how do we know that the tests written by one person are equally parseable by the second person?
Re: Testing: Coverage Reports Considered Dangerous
You seem to assume that I'm the only one that found the last test suite impenetrable. I should mention that I had already spent days playing with the input configurations of these tests to try to understand what they were testing. In many cases, fixing a bug would cause tests in a completely unrelated part of the code to fail. I also should mention that the testing harness used by the previous suite led to unnecessarily complex configurations, forcing one (if they used it to the exclusion of other testing methods) to test the entire plugin through a single entry-point method.
While I'm not against such functional testing, it should be setup to run the way a user might configure it, with normal user inputs and verifiable outputs. But more importantly, great pains should be taken to isolate the functionality you're testing, to limit the ability of a single change to cause cascading failures in your tests, and obscure the real failure. Also, while I do agree that it's fine to have functional tests that look and act like a real user interaction, it's also critical to have unit tests in place as a first line of defense, to ensure that each part of the system - whether that be methods or classes - is functioning as expected with respect to its own inputs and outputs.
IMO, this is where the principle of Test at the highest level possible is lacking: having only high-level tests makes it really difficult to pinpoint a problem when it happens. This, in turn, makes it much more labor intensive to refactor, fix bugs, or add features. To enable this, you need both high-level tests to verify that unpredicted side effects haven't derailed your efforts, and low-level tests to serve as early warning that you might have broken something.
Also, I'd ask you: if a test doesn't have a relatively easy to understand purpose, how does it help to have it in place?
BTW, my unit test methods describe the functionality they attempt to test, and all configuration for those tests is either in the method, or no more than one class away (in the case of a testing utility). My functional tests take the form of Maven builds which construct assemblies, and they have a list of goals in a goals.txt file, a typical POM and assembly descriptor, and a verification script called verify.bsh. Additionally, each test build is in a directory that attempts to describe the purpose of the test, to enable quick reading of failed tests during the plugin build. Not so hard to understand.

