TY - GEN
T1 - Robust prediction of fault-proneness by random forests
AU - Guo, Lan
AU - Ma, Yan
AU - Cukic, Bojan
AU - Singh, Harshinder
PY - 2004
Y1 - 2004
AB - Accurate prediction of fault-prone modules (a module is equivalent to a C function or a C++ method) in the software development process enables effective detection and identification of defects. Such prediction models are especially beneficial for large-scale systems, where verification experts need to focus their attention and resources on problem areas in the system under development. This paper presents a novel methodology for predicting fault-prone modules, based on random forests. Random forests are an extension of decision tree learning. Instead of generating one decision tree, this methodology generates hundreds or even thousands of trees using subsets of the training data. The classification decision is obtained by voting. We applied random forests in five case studies based on NASA data sets. The prediction accuracy of the proposed methodology is generally higher than that achieved by logistic regression, discriminant analysis, and the algorithms in two machine learning software packages, WEKA and See5. The difference in performance between the proposed methodology and the other methods is statistically significant. Further, the classification accuracy advantage of random forests over the other methods is more pronounced on larger data sets.
UR - https://www.scopus.com/pages/publications/16244370106
U2 - 10.1109/ISSRE.2004.35
DO - 10.1109/ISSRE.2004.35
M3 - Conference contribution
SN - 0769522157
T3 - Proceedings - International Symposium on Software Reliability Engineering, ISSRE
SP - 417
EP - 428
BT - ISSRE 2004 Proceedings; 15th International Symposium on Software Reliability Engineering
T2 - ISSRE 2004 Proceedings; 15th International Symposium on Software Reliability Engineering
Y2 - 2 November 2004 through 5 November 2004
ER -