Detecting False-passing Products and Mitigating their Impact on Variability Fault Localization in Software Product Lines
In a Software Product Line (SPL) system, variability bugs can cause failures in certain products (buggy products) but not in others. In practice, variability bugs are not always exposed, and buggy products can still pass all of their tests due to ineffective test suites; such products are called false-passing products. The misleading test results of these false-passing products can negatively impact variability fault localization performance. In this paper, we introduce CLAP, a novel approach to detect false-passing products in SPL systems failed by variability bugs. Our key idea is to collect failure indications from the failing products based on their implementation and test quality, and to evaluate these indications for each passing product: the stronger the indications, the more likely the product is false-passing. Specifically, a passing product is considered more likely to be false-passing if it is implemented by a large number of statements that are highly suspicious in the failing products, and if its test suite is of lower quality than the failing products' test suites. We conducted several experiments to evaluate our false-passing product detection approach on a large benchmark of 14,191 false-passing products and 22,555 true-passing products in 823 buggy versions of existing SPL systems. The experimental results show that CLAP effectively detects false-passing and true-passing products, with an average accuracy of more than 90%. Notably, the precision of false-passing product detection by CLAP is up to 96%: among every 10 products predicted as false-passing, more than 9 are correctly detected. Furthermore, we propose two simple and effective methods to mitigate the negative impact of false-passing products on variability fault localization: removing the detected false-passing products and adding tests for them. These methods improve the performance of state-of-the-art variability fault localization techniques by up to 34%.
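To make the idea concrete, the following Python sketch illustrates the two kinds of failure indications described above. It is a minimal, hypothetical illustration rather than CLAP's implementation: the `Product` container, the `false_passing_features` helper, the 0.5 suspiciousness threshold, and the coverage-based notion of test quality are all assumptions made for this example.

```python
# Minimal sketch of CLAP's idea (illustrative, not the authors' code).
# A passing product looks false-passing when (1) many of its statements are
# highly suspicious in the failing products and (2) its test suite is of
# lower quality than the failing products' test suites.
from dataclasses import dataclass

@dataclass
class Product:               # hypothetical container for this example
    statements: set          # ids of the statements composing the product
    test_quality: float      # e.g., statement coverage of its test suite

def false_passing_features(passing, failing, suspiciousness, threshold=0.5):
    """Two failure indications for one passing product; the stronger they
    are, the more likely the product is false-passing. `suspiciousness`
    maps a statement id to its SBFL score measured on the failing products."""
    hot = {s for s, score in suspiciousness.items() if score >= threshold}
    overlap = len(passing.statements & hot) / max(len(passing.statements), 1)
    avg_failing_quality = sum(p.test_quality for p in failing) / len(failing)
    quality_gap = avg_failing_quality - passing.test_quality
    return [overlap, quality_gap]
```

In practice, indications like these form a feature vector per passing product that is fed to a binary classifier, as evaluated in the tables below.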
Dataset overview
| System | #Versions | #Fs | #FPs | #TPs |
|---|---|---|---|---|
| BankAccountTP | 187 | 2,055 | 2,328 | 1,975 |
| Elevator | 41 | 217 | 326 | 195 |
| Email | 69 | 553 | 587 | 723 |
| ExamDB | 77 | 201 | 127 | 288 |
| GPL | 355 | 6,612 | 9,995 | 18,538 |
| ZipMe | 94 | 686 | 828 | 836 |
| Total | 823 | 10,433 | 14,191 | 22,555 |
Note that:
- #Versions: the number of buggy versions
- #Fs: the number of failing products
- #FPs: the number of false-passing products
- #TPs: the number of true-passing products
The dataset can be found here
Empirical results
- Accuracy of the false-passing product detection model (TP = true-passing label, FP = false-passing label); a hedged training sketch follows the table

| Metric | SVM | KNN | Naive Bayes | Logistic Regression | Decision Tree |
|---|---|---|---|---|---|
| Precision (TP / FP) | 88.16% / 94.19% | 90.41% / 89.30% | 88.36% / 90.95% | 88.75% / 92.30% | 90.03% / 92.99% |
| Recall (TP / FP) | 97.09% / 78.36% | 93.97% / 83.46% | 95.25% / 79.18% | 95.99% / 79.81% | 96.26% / 82.30% |
| F1-Score (TP / FP) | 92.41% / 85.55% | 92.16% / 86.28% | 91.68% / 84.66% | 92.23% / 85.60% | 93.04% / 87.32% |
| Accuracy | 90.04% | 90.02% | 89.21% | 89.91% | 91.01% |
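The sketch below shows how such a comparison could be run with scikit-learn; it is a hedged illustration on placeholder data, not the paper's experimental pipeline. The feature matrix `X`, labels `y`, 10-fold cross-validation, and default hyperparameters are all assumptions.

```python
# Hedged sketch: comparing the five classifiers from the table above with
# scikit-learn. Placeholder data and default hyperparameters are assumptions.
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
X = rng.random((200, 4))        # placeholder per-product attribute vectors
y = rng.integers(0, 2, 200)     # placeholder labels: 0 = FP, 1 = TP

classifiers = {
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

for name, clf in classifiers.items():
    pred = cross_val_predict(clf, X, y, cv=10)  # 10-fold CV is an assumption
    print(name)
    print(classification_report(y, pred, target_names=["FP", "TP"]))
```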
- Mitigating the false-passing products' negative impact on fault localization performance. Each cell reports the average rank of the buggy statements (lower is better); a sketch of the ranking metrics and the "Removing FPs" mitigation follows the table

| Ranking metric | VARCOP (Original) | VARCOP (Removing FPs) | VARCOP (Adding tests for FPs) | SBFL (Original) | SBFL (Removing FPs) | SBFL (Adding tests for FPs) |
|---|---|---|---|---|---|---|
| Tarantula | 3.35 | 2.52 | 2.22 | 5.10 | 4.75 | 4.53 |
| Ochiai | 2.39 | 2.23 | 2.28 | 3.00 | 2.77 | 2.86 |
| Op2 | 4.31 | 4.18 | 4.33 | 7.04 | 6.84 | 6.96 |
| Barinel | 3.69 | 2.83 | 2.91 | 5.10 | 4.74 | 4.53 |
| Dstar | 2.55 | 2.14 | 2.19 | 3.06 | 2.91 | 2.98 |
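For reference, the five ranking metrics in the table are standard SBFL formulas that can be computed from each statement's execution spectrum. The sketch below implements them, plus the simpler of the two mitigations ("Removing FPs") as a plain filter; the `is_false_passing` predicate is assumed to come from a detection model such as CLAP.

```python
# Standard SBFL formulas used in the table, computed from a statement's
# spectrum: ef/ep = failing/passing tests that execute the statement,
# nf/np_ = failing/passing tests that do not.
import math

def tarantula(ef, ep, nf, np_):
    f, p = ef + nf, ep + np_        # total failing / passing tests
    if ef == 0 or f == 0:
        return 0.0
    pass_ratio = ep / p if p else 0.0
    return (ef / f) / (ef / f + pass_ratio)

def ochiai(ef, ep, nf, np_):
    d = math.sqrt((ef + nf) * (ef + ep))
    return ef / d if d else 0.0

def op2(ef, ep, nf, np_):
    return ef - ep / (ep + np_ + 1)

def barinel(ef, ep, nf, np_):
    return 1 - ep / (ep + ef) if (ep + ef) else 0.0

def dstar(ef, ep, nf, np_, star=2):
    d = ep + nf
    return float("inf") if d == 0 else ef ** star / d

def remove_false_passing(passing_products, is_false_passing):
    """'Removing FPs' mitigation (sketch): drop products predicted as
    false-passing before localization, so their misleading passing test
    results no longer dilute the suspiciousness scores."""
    return [p for p in passing_products if not is_false_passing(p)]
```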
- Impact of different experimental scenarios
| Metric | System-based | Version-based | Product-based | Within-system |
|---|---|---|---|---|
| Precision (TP / FP) | 87.51% / 89.42% | 88.16% / 94.19% | 87.53% / 94.27% | 88.73% / 96.12% |
| Recall (TP / FP) | 92.16% / 85.83% | 97.09% / 78.36% | 96.97% / 78.26% | 96.29% / 87.02% |
| F1-Score (TP / FP) | 89.15% / 86.83% | 92.41% / 85.55% | 92.01% / 85.52% | 92.21% / 91.16% |
| Accuracy | 88.44% | 90.04% | 89.70% | 92.29% |
- Impact of different training data sizes (the number of systems)
| Metric | 1 system | 2 systems | 3 systems | 4 systems | 5 systems |
|---|---|---|---|---|---|
| Precision (TP / FP) | 92.02% / 74.90% | 96.82% / 68.18% | 95.37% / 77.03% | 90.18% / 79.48% | 91.19% / 82.50% |
| Recall (TP / FP) | 81.88% / 93.19% | 63.07% / 97.44% | 76.90% / 95.40% | 81.33% / 89.10% | 84.51% / 89.95% |
| F1-Score (TP / FP) | 86.65% / 83.05% | 76.38% / 80.23% | 85.14% / 85.24% | 85.53% / 84.02% | 87.72% / 86.06% |
| Accuracy | 82.60% | 78.47% | 85.19% | 84.81% | 86.95% |
- Impact of CLAP's attributes on the false-passing product detection performance; a hedged ablation sketch follows the table

| Metric | Product Implementation | Test Adequacy | Test Effectiveness | All |
|---|---|---|---|---|
| Precision (TP / FP) | 84.69% / 74.80% | 80.45% / 99.07% | 78.74% / 88.50% | 87.47% / 88.29% |
| Recall (TP / FP) | 87.71% / 69.69% | 99.74% / 53.69% | 96.59% / 50.18% | 94.66% / 74.82% |
| F1-Score (TP / FP) | 86.17% / 72.15% | 89.06% / 69.64% | 86.76% / 64.05% | 90.02% / 81.00% |
| Accuracy | 81.52% | 83.92% | 80.64% | 87.71% |
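A hedged sketch of such an ablation is shown below: each attribute group is evaluated alone and then all groups combined. The column-to-group mapping, the placeholder data, and the Decision Tree choice are illustrative assumptions, not CLAP's exact attribute extraction.

```python
# Hedged ablation sketch matching the table above: evaluate the detector on
# each attribute group alone and on all groups combined.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((200, 4))        # placeholder per-product attributes
y = rng.integers(0, 2, 200)     # placeholder labels: 0 = FP, 1 = TP

feature_groups = {
    "Product Implementation": [0, 1],  # e.g., overlap with suspicious statements
    "Test Adequacy": [2],              # e.g., coverage-based attributes
    "Test Effectiveness": [3],         # e.g., mutation-score-based attributes
}
feature_groups["All"] = [0, 1, 2, 3]

for name, cols in feature_groups.items():
    acc = cross_val_score(DecisionTreeClassifier(random_state=0),
                          X[:, cols], y, cv=10).mean()
    print(f"{name}: accuracy = {acc:.2%}")
```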