So I just spent an hour listening to the Herding Code podcast on continuous testing. We were invited to come speak on it, but oddly enough, when you are asked at 0330 to be available at 0730, attendance is unlikely 🙂.
One comment from Remco Mulder (author of nCrunch), however, particularly got to me.
Remco Mulder: “There are tools, continuous test runners, that work really hard to do impact detection and only run the tests impacted by that change. The problem with that approach though is that no tool is perfect. You may very well have the impact detection to the point that you know that every test touching a particular change point and you can say to run just those tests, but that doesn’t account for external dependencies outside the testing environment. What if something changed in the database or some SQL file that’s not part of the solution but happens to exist on the disk somewhere? You know, there is a whole range of things that can cause tests to fail that a test runner may just not be aware of at all.”
Yeah, that would be talking about Mighty Moose. Of course, we weren’t there to actually reply.
Let’s get into the meat of these comments. There is a “major” difference between Mighty Moose and nCrunch. Mighty Moose puts a massive amount of effort into running only the tests that need to be run, through static/dynamic analysis, whereas nCrunch runs all of your tests but attempts to prioritize which to run based on heuristics and dynamic analysis. Mighty Moose has a huge investment in static analysis (>9 man months, it’s a really complicated problem).
The argument put forward by Remco would appear on the surface to make sense, but it is actually, as we will see, quite misleading.
As a side note, you can configure Mighty Moose to run the selected tests first and then run all tests in the background, just like what is being said with prioritization (e.g. static analysis becomes a form of prioritization), which is why I put “major” in quotes: we support prioritization as well.
Before Jumping In/Side Note
Before jumping into things: so my tests are dependent on data in a database and/or some obscure SQL file that sits on disk outside of my solution (who knows, maybe even a USB key that needs to be plugged into the machine). Is this a good thing? I could also make my tests require Eric Cartman to show up personally where they are being run, wearing a pink tutu and carrying an Antonio Banderas love doll, but this is probably not such a good idea. We are basically talking about bad/super brittle tests here.
So many tools out there have been built with the primary use case being the worst possible mess that you can imagine; this case fits firmly in that category. So you are getting occasional failures because you depend on data in a shared database that people change? Stop doing that if it causes pain (then again, how often is it happening?). My guess as well is that such a code base would not survive having tests run in parallel either.
We need to remember as well that the worst case of a test being missed (or badly prioritized, for that matter) is that you find out about it later, not now. This is the trade-off that is at the core of Continuous Testing.
So Mighty Moose (static + dynamic analysis + other heuristics) will, according to the argument given above, fail to successfully find tests that are, for instance, dependent on data in the database. This is true. nCrunch (prioritization based on dynamic analysis + heuristics) would also fail to prioritize these tests in any intelligent way. The problem is that there is not enough information, as it’s external.
OK, so they will be run … eventually, with prioritization. As you keep coding, new tests keep getting prioritized over the rest of those tests in the background, and 10-15 minutes later it tells you that you have a failing test (I mean, we are talking about codebases that are hitting databases during tests, so I can’t imagine these are wonderful code bases to work in, with fast tests :)).
Is this actually that much better than finding out right before you check in (what if it’s 5 minutes? 283 seconds?)? What is the cost of the few minutes of notice? Are there other ways of mitigating this risk? What’s the probability of this happening? What’s the cost of failure? This should sound familiar.
The heart of Continuous Testing is risk mitigation. I do some work up front, such as prioritizing your tests based on time to run/previous probabilities of failure/static analysis/dynamic analysis/etc., to minimize the amount of time that you have to wait before reaching a point where you can continue. I also have a possibility of failure, in which case I also want to minimize the time it takes to find unexpected errors.
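As a rough illustration of that up-front work, here is a minimal sketch of prioritizing tests from a few signals. Everything in it is an assumption made for the example: the `CandidateTest` fields, the `priority` formula and the weight of 10 are hypothetical and are not how Mighty Moose or nCrunch actually score tests.

```python
from dataclasses import dataclass


@dataclass
class CandidateTest:
    name: str
    avg_seconds: float       # measured time to run
    failure_rate: float      # fraction of previous runs that failed
    touched_by_change: bool  # what static/dynamic analysis says about the edit


def priority(test: CandidateTest) -> float:
    """Higher means run sooner. The weights are made up for illustration."""
    impact = 10.0 if test.touched_by_change else 1.0
    # Small floor on failure rate so never-failed tests still get ordered by speed.
    likelihood = test.failure_rate + 0.01
    # Cheap tests that are likely to fail give the most information per second.
    return impact * likelihood / max(test.avg_seconds, 0.001)


def prioritize(tests):
    """Return the tests ordered from most to least urgent."""
    return sorted(tests, key=priority, reverse=True)


if __name__ == "__main__":
    queue = prioritize([
        CandidateTest("slow_db_test", 30.0, 0.0, False),
        CandidateTest("flaky_io_test", 2.0, 0.5, False),
        CandidateTest("unit_test_near_edit", 0.5, 0.1, True),
    ])
    for t in queue:
        print(t.name)
```

Note that a test depending on something external looks like any other untouched test here, which is exactly the missing-information problem described above.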
At any time heuristics can become a burden. I am trying to get you the best situation most of the time (this also changes from code base to code base which makes it more challenging!). The reason for this is that every form of heuristic has a cost associated with it. To instrument code or to do static analysis or to measure timings/previous failures … all of them have a cost to maintain the information to provide the heuristic. The idea is that this upfront cost works out in the long run (sound familiar to pricing strategies?).
Now, what is being discussed is constantly running tests in the background as a beta (think of an alpha-predicting-beta heuristic). The “beta” goes and actually does what the “alpha” (e.g. the heuristics) predicted. Is it a worthwhile endeavour? Is verifying your predictions worth the cost of actually running beta?
Another quote, though, was striking in showing what leads to the fallacious assumption that it is simply better to run all tests.
Remco Mulder: “And you shouldn’t notice any degradation of performance as they are all using background cores.”
A zero cost pricing model would be a no-brainer unless it had zero benefit. Ah, if only it had zero cost, our lives would be so much simpler. Of course, we were talking about bad tests that are using all sorts of wonky external resources that we couldn’t find. My first thought might be that the disk thrashing as they pound my local database or read every file on my hard drive could feasibly be an issue (or as monitoring information on them is collected/merged into my data stores). Also buses can be an issue, as can memory access, locking of resources, network access … loads of things. To presume running tests on a machine has no impact on the machine is insane.
The point is we have a cost. Constantly running tests has a cost on the machine. There are possible benefits as well but, like any of the heuristics, it should be put into a pricing model.
For us, we found that our up-front strategies (alpha) were good enough, and the likelihood of failure after we say “this is what needs to be run” was so low, that we didn’t need a full run (beta) in most situations. We have failure modes, sure, but they just aren’t hitting people often enough to justify the cost. If people were hitting problems and wanted to reduce the cost of failure by finding them sooner, they could go through and turn on our beta.
On all of these decisions for heuristics on tests we need to consider:
Probability of failure
Cost of failure
Cost to mitigate
Other mitigation strategies (e.g.: stop doing that)
There are two key points where these come into play. The first is where “expected errors” (alpha) would occur and the second is where unexpected errors (beta) would occur. If your alpha is good and your probability of failure * cost of failure in beta is low enough, we can sometimes even bypass beta. Conversely, if the cost of beta is low overall (I have a solution with 10 tests), I shouldn’t bother with even having an alpha (e.g. the prediction of beta might be more expensive than actually calculating beta … this gets to be a bit more tricky, as often the information used in predicting beta, say profiled data, has value in its own right).
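The two decisions above can be sketched as a tiny pricing model. This is only an illustration under stated assumptions: the function names are hypothetical, all costs are expressed in the same made-up unit (say, minutes of developer time per change), and it deliberately ignores the side value of the profiling data mentioned above.

```python
def should_run_beta(p_missed_failure: float,
                    cost_of_late_failure: float,
                    cost_of_beta: float) -> bool:
    """Run the full background pass only if the expected cost of a failure
    that alpha missed (probability * cost) exceeds the cost of running beta."""
    return p_missed_failure * cost_of_late_failure > cost_of_beta


def should_bother_with_alpha(cost_of_alpha: float, cost_of_beta: float) -> bool:
    """Predicting which tests to run only pays off if the prediction is
    cheaper than just running everything."""
    return cost_of_alpha < cost_of_beta


if __name__ == "__main__":
    # Good heuristics: alpha rarely misses, so the full run is not worth it.
    print(should_run_beta(0.01, 30.0, 2.0))   # expected cost 0.3 vs 2.0
    # Weak heuristics: alpha misses often, so the full run earns its keep.
    print(should_run_beta(0.2, 30.0, 2.0))    # expected cost about 6 vs 2.0
    # A solution with 10 fast tests: running them all beats analysing them.
    print(should_bother_with_alpha(1.0, 0.1))
```

The numbers in the usage lines are invented; the point is only that the answer changes from code base to code base as the inputs change.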
Sometimes it makes perfect sense to run all tests in the background. In some situations it does not. The ability to run without a beta search on a vast number of projects should be seen not through the eyes of possible failure modes but as a sign of great heuristics (alpha) that alleviate the need for running all tests (beta) in enough circumstances that it becomes feasible to do. Conversely, needing a beta and not being able to run without one is a sign of weak initial (alpha) heuristics.