Microsoft’s David Platt is known to say, “Users don’t care about your program in and of itself. Never have, never will. Your mother might, because you wrote it and she loves you, and then again she might not; but no one else does. Users care only about their own productivity or their own pleasure.” In this light, software can be considered “correct” if it can be shown to:
- Meet the users’ expectations and requirements
- Be deployable and supportable in a manner acceptable to the operations group
- Be maintainable in a manner acceptable to the development group
The purpose of testing can be reduced to verifying that these statements about the program are in fact true. In many ways, these can be considered the Prime Directives of a testing approach.
“System Testing” is one type of testing that can be used to verify this. It is typically done in an environment that closely matches the production environment, and performed by people who will act as the eventual consumers (Users, Administrators, etc.). At this level, only the externally visible aspects of the program are considered, and the tests typically fall into one of three categories:
- Script based Manual Testing
- Exploratory Manual Testing
- Automated Testing
This approach does have some limitations, including:
- Testing cannot begin until the program is in a sufficiently mature state
- Regression testing is time consuming and therefore costly
- Setting up specific scenarios (especially error scenarios) can be very difficult
As a result, other types of testing are introduced to minimize the impact of these limitations. But the question remains: what types of testing are best for different scenarios?
“Functional Testing” breaks the system down into individual elements that map directly to the (wait for it….) Functional Specifications, or for Agile teams, User Stories. The primary advantage of this is that it correlates very well with the concept of “deliver working software” without necessarily having the whole system (or even large parts of it) complete. With the judicious use of Mock objects, we can easily address most of the limitations of system-level testing. A large number of teams focus on this type of testing during each iteration (sprint) and use it as the primary methodology for determining when something is “complete”; the completed items are regularly propagated into the Test/QA environment, where system-type testing can be performed.
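As a minimal sketch of the idea, here is a functional test that exercises one unit of behavior against a hand-rolled mock of a collaborator that does not exist yet. All of the names (PriceService, OrderCalculator) are hypothetical illustrations, and the sketch is in Java rather than C# to keep it self-contained; the pattern translates directly to .NET.

```java
import java.util.List;

// Collaborator the real system would eventually provide.
interface PriceService {
    double priceOf(String sku);
}

// Unit under test, mapped to a single functional requirement:
// "the order total is the sum of the item prices".
class OrderCalculator {
    private final PriceService prices;
    OrderCalculator(PriceService prices) { this.prices = prices; }
    double total(List<String> skus) {
        return skus.stream().mapToDouble(prices::priceOf).sum();
    }
}

public class FunctionalTestSketch {
    public static void main(String[] args) {
        // The mock replaces the real pricing subsystem, so this test
        // can run long before the rest of the system is complete.
        PriceService mock = sku -> sku.equals("A") ? 2.0 : 3.0;
        OrderCalculator calc = new OrderCalculator(mock);
        double total = calc.total(List.of("A", "B"));
        if (total != 5.0) throw new AssertionError("expected 5.0, got " + total);
        System.out.println("total=" + total);
    }
}
```

Because the mock is the only dependency, the test also doubles as a small, readable statement of the functional specification it verifies.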
Before continuing, let us think about what “perfect” testing would be. I propose the following as a working definition: “Perfect testing would ensure that any change made to the software which would cause a violation of any of the Prime Directives is caught by one or more failing tests.” However, we know that nothing is perfect, and even if this is the goal, it cannot be reached even with infinite resources dedicated to testing. What we are left with is a continuum. At one end is a minimal effort expended on testing, with a high probability of issues and their associated costs. At the other end is an extremely high investment in testing, with the goal of minimizing the number and severity of any issues that slip past the tests.
The “sweet spot” is where we have minimized the sum of both the testing costs and the “issue” costs. Of course, the type of system has a major impact on finding this spot. Some of the projects I work on are in Industrial Automation, where serious injury or loss of life is a distinct possibility if certain types of malfunctions occur. Other systems, such as banking and trading, may incur millions of dollars in financial losses in the event of certain defects. What I have repeatedly found working in different domains (vertical markets) is that many of the methodologies from these demanding environments can be applied, with modification, to “normal” business systems where they are not usually considered; and this can be done at minimal cost, often resulting in overall savings.
I have come to the conclusion that while Functional Testing is nearly always a key component, it is not sufficient to minimize the chances of violating a Prime Directive by making a change to the code. This is because Functional testing only captures one of the five “W”s:
- Who – Which piece of code is responsible for a given item
- What – What does the code accomplish
- When – When is the operation actually performed (Aggressive Loading, Lazy Evaluation, etc.)
- Where – Where do the inputs come from, where do the outputs go, and where are there side effects?
- Why – What drove a specific approach? Why were others eliminated?
It is important to remember the qualifier “by making a change to the code”. Initially determining these items is definitely a human activity, and the best way to achieve this is through a solid design methodology with a strong emphasis on code reviews (pair-programming can also be very helpful in many cases). What I am specifically addressing are the changes that occur to a piece of code, for any one of a number of reasons that “seem like a good idea”, but fail to take into account that (presumably) the code was developed the way it was for a specific reason.
As a specific example, I was recently on a project where there was a small collection (never more than 20 items) that had a huge number of replacements (one item being removed, a different item being added) occurring at a very high rate. There were also a fair number of places where the entire collection was iterated over, and a few places where an element was accessed by its “key”. The original implementation was a .NET List&lt;T&gt;, so locating an item for replacement and accessing an element by key were both done by a linear scan. One of the developers was working with code that interacted with the collection and noticed the linear scans. It looked like a good idea to replace the List&lt;T&gt; with a Dictionary&lt;K,T&gt;; because of this change, it was also necessary to add some synchronization (lock) code. The code was reviewed, the change was made, the Functional tests all passed, and even the Application-level Load/performance tests passed.
About two weeks after the code was deployed to production, various performance anomalies were causing a multitude of problems. It took about two days to track the problem back to this change. The initial thought was that the locking caused the problem, but that did not account for all of the issues. The original developer (he had moved to a different team) was contacted, and after some time thinking about it, he remembered the reason he had chosen the List&lt;T&gt; with the linear operations – the overhead of managing the index far outweighed the cost of the sequential scans.
Many will argue that the code should have been commented, or that the load testing should have been enhanced, and neither of these points is incorrect. But we all know that comments are often not read, or become out of date; and, in this particular case, setting up a load test for this specific condition at the application level would have been extremely difficult. Had there been a test at a lower level designed to detect the conditions that caused the original author to choose the specific implementation, this entire episode would have been virtually eliminated.
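A lower-level test of that kind might look like the following sketch: it runs the replacement-heavy workload from the story against both a small list (linear scan) and a keyed map, checks that the two implementations agree, and reports the timings so the trade-off is visible in one focused place. The sizes and names are illustrative, and the sketch is in Java (ArrayList/HashMap standing in for List&lt;T&gt;/Dictionary&lt;K,T&gt;); a real guard test would assert on the measured ratio for the documented workload rather than merely printing it, since absolute timings vary by machine.

```java
import java.util.*;

public class SmallCollectionWorkload {
    static final int SIZE = 20, REPLACEMENTS = 200_000;

    // Linear-scan replacement in a small ArrayList of {key, value} pairs.
    static long runList() {
        List<long[]> items = new ArrayList<>();
        for (int i = 0; i < SIZE; i++) items.add(new long[]{i, 0});
        long start = System.nanoTime();
        for (int r = 0; r < REPLACEMENTS; r++) {
            long key = r % SIZE;
            for (long[] e : items) if (e[0] == key) { e[1] = r; break; }
        }
        return System.nanoTime() - start;
    }

    // Keyed replacement in a HashMap, as in the "improved" version.
    static long runMap() {
        Map<Long, Long> items = new HashMap<>();
        for (long i = 0; i < SIZE; i++) items.put(i, 0L);
        long start = System.nanoTime();
        for (int r = 0; r < REPLACEMENTS; r++) items.put((long) (r % SIZE), (long) r);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long listNs = runList(), mapNs = runMap();
        // Reported, not asserted: the point is to pin the workload down
        // where a would-be "optimizer" will trip over it.
        System.out.println("list=" + listNs + "ns map=" + mapNs + "ns");
        System.out.println("done");
    }
}
```

Had such a test existed with an assertion on the ratio, the Dictionary change would have failed it immediately, and the reasoning behind the original choice would have been preserved in executable form rather than in one developer’s memory.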
This is only one simple example, but the general concept applies in many situations. To address these, I generally recommend the following types of tests be implemented (or at least considered):
“Granular Unit Testing” (which I often refer to as GUT testing). These tests focus tightly on specific code elements at the lowest level practical, typically the individual methods and properties of each class. Heavy use of Mock objects, and even Reflection to set up the desired states, is quite common. In many environments, this type of testing has fallen out of favor, being viewed as “testing the trivially obvious”; yet there is significant value in it. The ability to quickly execute a focused test can reduce the “cycle time” from minutes down to seconds for quick verification of a change. The tests themselves often provide useful information about the amount of work required to set up a given item (often revealing tighter-than-intended coupling), and can also serve as “sample use-cases” for new developers coming up to speed. To reduce the cost of developing these tests, some automation is helpful. Here at Dynamic Concepts Development Corp. we have developed a custom toolkit (hopefully to be released as a product later this year) that makes it very easy to set up the initial conditions and keep a clean separation between the manually created elements and the generated code.
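A minimal sketch of a GUT test follows: one class, one method, with Reflection used to force an internal state that would be tedious to reach through the public API. The Account class and its field name are hypothetical, and the example is in Java; the same approach works in .NET via System.Reflection.

```java
import java.lang.reflect.Field;

class Account {
    private long balance;  // normally mutated only through deposit/withdraw calls
    boolean canWithdraw(long amount) { return amount > 0 && amount <= balance; }
}

public class GutTestSketch {
    public static void main(String[] args) throws Exception {
        Account acct = new Account();
        // Reflection sets up the desired state directly, instead of
        // replaying a long sequence of deposits through the public API.
        Field f = Account.class.getDeclaredField("balance");
        f.setAccessible(true);
        f.setLong(acct, 100);
        // Focused boundary checks on a single method.
        if (!acct.canWithdraw(100)) throw new AssertionError("exact balance refused");
        if (acct.canWithdraw(101)) throw new AssertionError("overdraft allowed");
        System.out.println("GUT checks passed");
    }
}
```

A test this small runs in milliseconds, which is what makes the seconds-long verification cycle described above achievable in practice.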
“Design Rule Check Testing” (commonly referred to as DRC tests). These tests have the greatest degree of separation from the “functional” aspects. The goal here is to minimize the risk that the code has deviated from the intended/desired design. Some “tests” can be accomplished by having a robust set of StyleCop/FxCop/etc. rules. Others may perform more detailed analysis on the code. Items that are often a good target for focus (some specifically for .NET): resource utilization, thread-safety identification (even for applications that are currently single-threaded), and complexity of object graphs (which may have a bigger impact on Garbage Collection than the number of temporary objects created!).
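One way to automate a simple design rule is a reflection-based check that runs as an ordinary test. The sketch below (in Java, with an illustrative marker annotation and sample class) verifies that any class declared immutable actually has only final fields; in .NET the equivalent would typically be an FxCop/analyzer rule or a similar reflection scan.

```java
import java.lang.annotation.*;
import java.lang.reflect.*;

// Hypothetical marker annotation stating a design intent.
@Retention(RetentionPolicy.RUNTIME)
@interface Immutable {}

@Immutable
class Money {
    final long cents;
    final String currency;
    Money(long cents, String currency) { this.cents = cents; this.currency = currency; }
}

public class DrcTestSketch {
    // Design rule: a class marked @Immutable must have only final fields.
    static void checkImmutable(Class<?> c) {
        if (c.getAnnotation(Immutable.class) == null) return;
        for (Field f : c.getDeclaredFields()) {
            if (!Modifier.isFinal(f.getModifiers()))
                throw new AssertionError(c.getName() + "." + f.getName() + " is not final");
        }
    }

    public static void main(String[] args) {
        checkImmutable(Money.class);
        System.out.println("DRC passed");
    }
}
```

The value of expressing the rule as a test is that a later edit which adds a mutable field fails the build immediately, rather than surfacing as a threading bug months later.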
“Negative Testing” (i.e. testing things that don’t happen!). These tests seem to be the rarest of them all, and typically need explanation. As a quick example, consider a class that exposes 4 events. Changing a specific property should raise one of the events. Validating that this event is raised at the appropriate time is likely to be covered in the Functional tests and/or the GUT tests. But most of these tests will not catch a bug where the property change triggers other events in addition to the one that is specified. This can be very significant. At the low end of the impact spectrum, it can create unnecessary processing overhead – especially if the handler[s] are heavy, or there are a large number of handlers registered. At the high end, it may reveal a significant defect in the logic of the code under test which might not be caught under other conditions. As with the GUT tests, it is very helpful to have a tool which can compare the initial state with the final state, as well as register for events which should not fire.
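The event example above can be sketched as follows. All names are hypothetical, and listener lists stand in for .NET events; the key move is registering a handler for the event that must NOT fire and asserting its count stays at zero.

```java
import java.util.*;

class Thermostat {
    // Listener lists standing in for the class's events.
    final List<Runnable> onTargetChanged = new ArrayList<>();
    final List<Runnable> onModeChanged = new ArrayList<>();
    private int target;

    void setTarget(int t) {
        target = t;
        onTargetChanged.forEach(Runnable::run);
        // A buggy edit might also notify onModeChanged here; the
        // negative test below exists to catch exactly that.
    }
}

public class NegativeTestSketch {
    public static void main(String[] args) {
        Thermostat t = new Thermostat();
        int[] fired = new int[2];
        t.onTargetChanged.add(() -> fired[0]++);
        // Register for the event that should never fire on this change.
        t.onModeChanged.add(() -> fired[1]++);

        t.setTarget(72);

        if (fired[0] != 1) throw new AssertionError("expected event fired " + fired[0] + " times");
        if (fired[1] != 0) throw new AssertionError("unrelated event fired");
        System.out.println("negative test passed");
    }
}
```

The positive half of this test duplicates what a Functional or GUT test already covers; the second assertion is the part almost no suite contains, and it is the one that catches the extra-event bug.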
When all of these are put together to form a multi-dimensional approach to testing, I have seen defect rates drop by up to 60%. If you are currently engaged on a greenfield (initial development) project, I strongly recommend this approach. If you are on a brownfield (ongoing maintenance/upgrade) project, then incremental adoption is typically the best bet. When a component is being updated, take the time to make sure that the component (not just the elements you are currently changing) has a solid set of tests before making the change, validate that the planned change will cause failures in the existing tests, and also look for cases where the old code is still capable of passing the relevant test suite. This last step goes a long way toward ensuring that the changes you are making are thoroughly tested.