## Monday, March 29, 2010

### Normal distribution in action: defect distribution modeling and prediction

Preface: to be on the same page, it's recommended to review the following Wikipedia articles:
http://en.wikipedia.org/wiki/Normal_distribution

http://en.wikipedia.org/wiki/Theory_of_errors

http://en.wikipedia.org/wiki/Errors_and_residuals_in_statistics

Though the notion of the normal (Gaussian) distribution looks potentially helpful for s/w project planning and analysis, it's strongly recommended to play with this tool carefully and, what's more important, with a statistically sufficient and significant data set. Yes, there is a constraint: the project should be big enough. However, is it really important to forecast new and persistent defects on a project of 5000 SLOC? Don't think so.

Obviously this theory may help to:
- estimate the volume of uncovered/undiscovered defects, and by this rebuild the test plan to achieve proper test coverage.
- predict the volume and distribution of newly revealed product defects, and by this come up with or numerically adjust the project sign-off date. Such metrics could shift the project release or serve as a starting point for negotiating more resources.
- model efficient test automation (functional and unit) coverage, and by this achieve high ROI on automation.

All you need is to:
- select a sampling variable. In our case it is likely the number of valid reported defects, e.g. weekly or daily, depending on the periods you are going to operate on.
- calculate the mean of this variable
- calculate the variance of this sampling variable
- then build and graph the probability density function (PDF):
$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\; e^{ -\frac{(x-\mu)^2}{2\sigma^2} }$

where $\mu$ is the mean and $\sigma^2$ is the variance calculated in the previous steps.
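
For those who prefer a script to a spreadsheet, the steps above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `weekly_defects` numbers are made up, the helper name `normal_pdf` is mine, and the sample variance is used (a population variance would also be a defensible choice).

```python
import math
import statistics

# Hypothetical weekly counts of valid reported defects (illustrative only).
weekly_defects = [3, 7, 12, 18, 22, 19, 13, 8, 4]

# Steps 2 and 3: mean and (sample) variance of the sampling variable.
mu = statistics.mean(weekly_defects)
variance = statistics.variance(weekly_defects)

def normal_pdf(x, mu, variance):
    """Normal probability density at x for the given mean and variance."""
    return math.exp(-((x - mu) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

# Step 4: tabulate the PDF on a coarse grid; feed these points to any charting tool.
for x in range(0, 31, 5):
    print(x, round(normal_pdf(x, mu, variance), 4))
```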

This graph, as well as the mean and variance calculations, is an easy task using Excel formulas and charts. The resulting graph should be built together with the original sample as a time series (histogram). The shape of the curve itself shows the "normality" of the sample, and overlaying the two graphs shows how far the real state diverges from normality. But don't hurry to make judgments based on this curve; it's just a hint that helps you feel confident about project control. E.g. a curve with an open tail (at the end of the time series) may signal that testing should be prolonged, since statistically there are still undiscovered defects to expect.
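
The overlay described above can also be checked numerically. The sketch below is one possible reading, not the only one: it assumes the bell curve is laid over the time axis, i.e. the week number is the variable and the defect counts act as frequencies. The data is made up, and `expected` is just my name for the per-week values of the fitted curve.

```python
import math
from statistics import NormalDist

# Hypothetical weekly counts of valid reported defects (illustrative only).
weekly_defects = [3, 7, 12, 18, 22, 19, 13, 8, 4]
weeks = range(1, len(weekly_defects) + 1)
total = sum(weekly_defects)

# Assumption: week number is the variable, defect counts are its frequencies.
mu = sum(w * n for w, n in zip(weeks, weekly_defects)) / total
var = sum(n * (w - mu) ** 2 for w, n in zip(weeks, weekly_defects)) / total
curve = NormalDist(mu, math.sqrt(var))

# Count the fitted curve assigns to each week: total times the probability
# mass of a one-week-wide bin centred on that week.
expected = [total * (curve.cdf(w + 0.5) - curve.cdf(w - 0.5)) for w in weeks]

# Side-by-side view: large differences mark where reality diverges from normality.
for w, n, e in zip(weeks, weekly_defects, expected):
    print(f"week {w}: observed {n:3d}, normal {e:6.1f}, diff {n - e:+6.1f}")
```

If the last observed weeks sit well above the curve, or the curve has visibly not come down yet, that is the "open tail" situation mentioned above.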

The next, more advanced application is prediction. To make one you need either to build an ideal normal distribution or to use an existing one. The restored function then gives you the remaining part of the sample (shows the future), so you can say how many defects will be found, for example, week by week, or which functionality needs to be tested with more effort.
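
A rough sketch of such a forecast is shown below. Everything numeric here is an assumption for illustration: the `observed` counts are invented, and the `reference` curve (mean at week 5, standard deviation of 2 weeks) stands in for an "ideal" or previously fitted distribution over the time axis. The idea is to estimate what share of all defects should have been found by now, scale up to a total, and read the remaining weeks off the curve.

```python
from statistics import NormalDist

# Hypothetical counts for the weeks observed so far (weeks 1..6).
observed = [3, 7, 12, 18, 22, 19]

# Assumed reference curve over the time axis (parameters are illustrative).
reference = NormalDist(mu=5.0, sigma=2.0)

weeks_done = len(observed)
found = sum(observed)

# Share of all defects the reference curve expects by the end of the last observed week.
share_found = reference.cdf(weeks_done + 0.5)
estimated_total = found / share_found

def forecast(week):
    """Defects the reference curve assigns to a given future week."""
    return estimated_total * (reference.cdf(week + 0.5) - reference.cdf(week - 0.5))

for week in range(weeks_done + 1, weeks_done + 5):
    print(f"week {week}: ~{forecast(week):.1f} defects expected")
print(f"estimated still undiscovered: ~{estimated_total - found:.1f}")
```

The forecast is only as good as the reference curve, so it makes sense to re-fit it as each new week of data arrives.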

Finally, to estimate the precision of your calculations you have to come up with confidence intervals for your observations.
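
As a minimal sketch, a 95% confidence interval for the mean weekly defect count could be computed like this. The data is again made up, and the interval uses the normal approximation; with only a handful of weeks, a Student's t quantile would give a wider and more honest interval.

```python
import math
import statistics
from statistics import NormalDist

# Hypothetical weekly counts of valid reported defects (illustrative only).
weekly_defects = [3, 7, 12, 18, 22, 19, 13, 8, 4]

n = len(weekly_defects)
mean = statistics.mean(weekly_defects)

# Standard error of the mean, based on the sample standard deviation.
std_err = statistics.stdev(weekly_defects) / math.sqrt(n)

# Two-sided 95% quantile of the standard normal distribution (about 1.96).
z = NormalDist().inv_cdf(0.975)

low, high = mean - z * std_err, mean + z * std_err
print(f"mean = {mean:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```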