ARist: An Effective API Argument Recommendation Approach
Learning and remembering how to use APIs is difficult. Several techniques have been proposed to assist developers in using APIs, but most of them focus on recommending the right API methods to call, while very few focus on recommending API arguments. In this paper, we propose ARist, a novel automated argument recommendation approach which suggests arguments by predicting developers’ expectations when they define and use API methods. To implement this idea, ARist combines program analysis (PA), language models (LMs), and several features specialized for the recommendation task which consider the functionality of formal parameters and the positional information of code elements (e.g., variables or method calls) in the given context. In ARist, the LMs and the recommendation features are used to rank the promising candidates identified by PA. Meanwhile, PA restricts the LMs and the features to the set of valid candidates which satisfy the syntax, accessibility, and type-compatibility constraints defined by the programming language in use. Our empirical evaluation on a large dataset of real-world projects shows that ARist improves the state-of-the-art approach by 19% and 18% in top-1 precision and recall for recommending arguments of frequently used libraries. For the general argument recommendation task, i.e., recommending arguments for every method call, ARist outperforms the baseline approaches by up to 125% in top-1 accuracy. Moreover, for newly encountered projects, ARist achieves more than 60% top-3 accuracy when evaluated on a larger dataset. For projects under active development or maintenance, with a personalized LM capturing developers’ coding practice, ARist ranks the expected arguments at the top-1 position in 7 out of 10 requests.
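As a rough illustration of the pipeline described above (PA first builds the set of syntactically valid, accessible, type-compatible candidates for the argument slot, which the specialized features and LMs then rank), here is a minimal sketch. Every type, method, and score in it is a hypothetical placeholder, not ARist’s actual implementation.

```java
import java.util.Comparator;
import java.util.List;

// Sketch of the recommendation pipeline from the abstract:
//   1. program analysis (PA) keeps only valid candidates for the argument slot;
//   2. lightweight features and language models rank those candidates.
// CandidateExpr, ArgumentSlot, and the scoring below are hypothetical placeholders.
public class ArgumentRecommenderSketch {

    record CandidateExpr(String text, Class<?> staticType, boolean accessible) {}

    record ArgumentSlot(Class<?> parameterType, String parameterName, String codeContext) {}

    // Step 1: PA-style filtering by accessibility and type compatibility.
    static List<CandidateExpr> validCandidates(List<CandidateExpr> inScope, ArgumentSlot slot) {
        return inScope.stream()
                .filter(CandidateExpr::accessible)
                .filter(c -> slot.parameterType().isAssignableFrom(c.staticType()))
                .toList();
    }

    // Step 2: ranking by a combined score (stand-in for ARist's features + LMs).
    static List<CandidateExpr> recommend(List<CandidateExpr> inScope, ArgumentSlot slot) {
        return validCandidates(inScope, slot).stream()
                .sorted(Comparator.comparingDouble(
                        (CandidateExpr c) -> combinedScore(c, slot)).reversed())
                .toList();
    }

    // Placeholder score: exact-name match with the formal parameter plus a trivial
    // stand-in for a language-model probability.
    static double combinedScore(CandidateExpr c, ArgumentSlot slot) {
        double nameFeature = c.text().equalsIgnoreCase(slot.parameterName()) ? 1.0 : 0.0;
        double lmStandIn = 1.0 / (1 + c.text().length());
        return nameFeature + lmStandIn;
    }
}
```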
Data.
- List of the 1,000 most-starred projects used in the empirical study (here)
- Small corpus: Eclipse and Netbeans
- Large corpus: 9,271 projects
Source code.
Experimental results.
- Statistics of the dataset
Statistic | Small corpus | Large corpus
---|---|---
#Projects | Eclipse & Netbeans | 9,271
#Files | 53,787 | 961,493
#LOCs | 7,218,637 | 84,236,829
#AR requests | 700,696 | 913,175
- Accuracy Comparison (RQ1)
2.1. Performance of the AR approaches for methods in frequently-used libraries
Project | Metric | ARist Precision | ARist Recall | PARC Precision | PARC Recall | GPT-2 Precision | GPT-2 Recall | SLP Precision | SLP Recall
---|---|---|---|---|---|---|---|---|---
Netbeans | Top-1 | 52.92% | 51.67% | 46.46% | 44.86% | 47.72% | 46.63% | 36.04% | 36.04%
Netbeans | Top-3 | 70.18% | 68.28% | 66.20% | 66.75% | 55.15% | 53.90% | 49.52% | 49.52%
Netbeans | Top-10 | 78.36% | 76.15% | 72.06% | 69.57% | 55.94% | 54.67% | 64.52% | 64.52%
Eclipse | Top-1 | 56.66% | 55.04% | 47.65% | 46.65% | 61.37% | 58.87% | 26.24% | 26.24%
Eclipse | Top-3 | 67.88% | 65.63% | 65.05% | 63.68% | 68.85% | 66.03% | 37.00% | 37.00%
Eclipse | Top-10 | 73.14% | 70.76% | 72.26% | 70.73% | 69.75% | 66.85% | 54.39% | 54.39%
2.2. Comparison on the general AR task
Project | Metric | ARist | GPT-2 | CodeT5 | SLP
---|---|---|---|---|---
Netbeans | Top-1 | 65.15% | 52.63% | 59.97% | 34.91%
Netbeans | Top-3 | 78.16% | 57.69% | 67.16% | 48.10%
Netbeans | Top-5 | 81.10% | 57.87% | 67.57% | 55.02%
Netbeans | Top-10 | 83.53% | 57.88% | 67.60% | 67.20%
Netbeans | MRR | 0.72 | 0.55 | 0.63 | 0.44
Eclipse | Top-1 | 64.19% | 56.53% | 61.20% | 28.52%
Eclipse | Top-3 | 76.29% | 61.89% | 67.21% | 41.60%
Eclipse | Top-5 | 79.23% | 62.09% | 67.53% | 49.46%
Eclipse | Top-10 | 81.65% | 62.10% | 67.54% | 62.67%
Eclipse | MRR | 0.70 | 0.59 | 0.64 | 0.38
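For reference, MRR in the tables above is presumably the standard mean reciprocal rank averaged over the AR requests (an assumption about the exact convention; a request whose expected argument never appears in the ranked list contributes 0):

```latex
% Q: the set of argument-recommendation requests
% rank_i: position of the expected argument in the ranked list for request i
\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}
```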
- Sensitivity analysis
3.1. Top-k accuracy of ARist in different scenarios
Metric | New project | Working project | Maintenance project
---|---|---|---
Top-1 | 53.42% | 69.96% | 74.49% |
Top-3 | 61.50% | 81.14% | 83.23% |
Top-5 | 64.21% | 83.74% | 85.38% |
Top-10 | 67.96% | 85.88% | 87.38% |
MRR | 0.58 | 0.76 | 0.79 |
3.2. ARist’s performance by the expression types of expected arguments
Expression type | Distribution (%) | Top-1 (%) |
---|---|---|
Simple Name | 48.14 | 83.66 |
Method Invocation | 15.19 | 45.51 |
Field Access | 6.09 | 31.01 |
Array Access | 0.74 | 53.26 |
Cast Expr | 0.99 | 18.46 |
String Literal | 10.03 | 98.14 |
Number Literal | 5.06 | 95.66 |
Character Literal | 0.47 | 87.93 |
Type Literal | 0.90 | 81.92 |
Bool Literal | 1.50 | 78.43 |
Null Literal | 0.79 | 84.45 |
Object Creation | 2.09 | 51.96 |
Array Creation | 0.29 | 43.14 |
This Expr | 1.06 | 91.05 |
Super Expr | 0.00 | 0.00 |
Compound Expr | 5.65 | 3.69 |
Lambda Expr | 0.73 | 78.83
Method Reference | 0.28 | 0.56 |
Total | 100.00 | 69.96 |
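The expression-type categories above refer to the syntactic form of the argument at the call site. As an illustration only (the call and variable names are hypothetical, not drawn from the dataset), the following Java snippet shows one argument of most of the categories:

```java
import java.util.List;
import java.util.function.Supplier;

// Illustrative only: each call passes one argument whose expression type
// matches a category from the table above.
public class ArgumentKinds {
    static void use(Object value) { /* sink for the examples */ }

    void examples(List<String> names, int[] counts, StringBuilder sb) {
        use(names);                               // Simple Name
        use(names.size());                        // Method Invocation
        use(System.out);                          // Field Access
        use(counts[0]);                           // Array Access
        use((CharSequence) sb);                   // Cast Expr
        use("hello");                             // String Literal
        use(42);                                  // Number Literal
        use('c');                                 // Character Literal
        use(String.class);                        // Type Literal
        use(true);                                // Bool Literal
        use(null);                                // Null Literal
        use(new StringBuilder());                 // Object Creation
        use(new int[] {1, 2, 3});                 // Array Creation
        use(this);                                // This Expr
        use(names.size() + counts[0]);            // Compound Expr
        use((Supplier<String>) () -> "x");        // Lambda Expr
        use((Supplier<String>) names::toString);  // Method Reference
    }
}
```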
3.3. Impact of Context Length on Performance
Context length | l1 | l2 | l3 | l4 | l5
---|---|---|---|---|---
Top-1 (%) | 62.05 | 65.86 | 66.14 | 67.00 | 67.83 |
MRR | 0.70 | 0.72 | 0.72 | 0.73 | 0.74 |
Run. time (s) | 0.33 | 0.39 | 0.42 | 0.51 | 0.56 |
- Intrinsic Evaluation Results
4.1. Impact of Valid Candidate Identification
Valid Candidate Identification | Top-1 (%) | MRR | Run. time (s)
---|---|---|---
ON | 69.96 | 0.76 | 0.444 |
OFF | 47.50 | 0.51 | 0.809 |
4.2. Impact of Candidate Reduction
Candidate Reduction | Top-1 (%) | MRR | Run. time (s)
---|---|---|---
ON | 69.96 | 0.76 | 0.444 |
OFF | 61.98 | 0.69 | 2.424 |
4.3. Impact of the reduction threshold, RT (see the sketch after the table)
RT | 10 | 20 | 30 | 40 | 50 |
---|---|---|---|---|---|
Top-1 (%) | 63.77 | 64.67 | 65.10 | 65.34 | 65.49 |
Run. time (s) | 0.342 | 0.406 | 0.418 | 0.464 | 0.508 |
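Sections 4.2 and 4.3 suggest a two-stage ranking in which a lightweight scorer first prunes the PA-identified valid candidates down to the top RT before the heavier LM-based ranking. A minimal sketch of that idea follows; Candidate, lightweightScore, and heavyScore are hypothetical placeholders, not ARist’s actual code.

```java
import java.util.Comparator;
import java.util.List;

// Sketch of two-stage candidate ranking with a reduction threshold RT.
public class CandidateReduction {

    record Candidate(String expression) {}

    // Cheap, feature-based score (e.g., lexical similarity to the parameter name).
    static double lightweightScore(Candidate c, String parameterName) {
        return c.expression().toLowerCase().contains(parameterName.toLowerCase()) ? 1.0 : 0.0;
    }

    // Expensive score, standing in for a language-model probability of the candidate in context.
    static double heavyScore(Candidate c, String context) {
        return 1.0 / (1 + Math.abs(c.expression().length() - context.length())); // placeholder
    }

    static List<Candidate> rank(List<Candidate> validCandidates, String parameterName,
                                String context, int reductionThresholdRT) {
        // Stage 1: keep only the top-RT candidates by the lightweight score.
        List<Candidate> reduced = validCandidates.stream()
                .sorted(Comparator.comparingDouble(
                        (Candidate c) -> lightweightScore(c, parameterName)).reversed())
                .limit(reductionThresholdRT)
                .toList();
        // Stage 2: re-rank the reduced set with the heavy (LM-based) scorer.
        return reduced.stream()
                .sorted(Comparator.comparingDouble(
                        (Candidate c) -> heavyScore(c, context)).reversed())
                .toList();
    }
}
```

A larger RT trades running time for accuracy, which matches the trend in the table above.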
4.4. Impact of the heavy-ranking stage
Heavy-ranking model (P_hr) | Top-1 (%) | MRR | Run. time (s)
---|---|---|---|
OFF | 65.37 | 0.72 | 0.125 |
GPT-2 | 70.71 | 0.76 | 0.732 |
CodeT5 | 68.59 | 0.74 | 0.186 |
LSTM | 49.26 | 0.61 | 0.198 |
n-gram | 36.89 | 0.51 | 0.137 |
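The rows above compare different language models plugged in as the heavy ranker P_hr. A minimal sketch of how such a pluggable ranking stage could be wired up; the interface, class names, and scores are hypothetical, not the actual model integrations.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Sketch of a pluggable heavy-ranking stage: any scorer implementing HeavyRanker
// (e.g., a wrapper around GPT-2, CodeT5, an LSTM, or an n-gram model) re-ranks
// the pre-ranked candidates. All names here are hypothetical.
public class HeavyRankingStage {

    interface HeavyRanker {
        // Returns a score for placing `candidate` as the argument in `callContext`.
        double score(String callContext, String candidate);
    }

    // Placeholder ranker: scores candidates from a fixed lookup table.
    static HeavyRanker tableBasedRanker(Map<String, Double> scores) {
        return (context, candidate) -> scores.getOrDefault(candidate, 0.0);
    }

    static List<String> rerank(List<String> preRanked, String callContext, HeavyRanker ranker) {
        return preRanked.stream()
                .sorted(Comparator.comparingDouble(
                        (String c) -> ranker.score(callContext, c)).reversed())
                .toList();
    }

    public static void main(String[] args) {
        HeavyRanker ranker = tableBasedRanker(Map.of("fileName", 0.9, "this", 0.4));
        List<String> ranked = rerank(List.of("this", "fileName", "null"),
                                     "reader.open(", ranker);
        System.out.println(ranked); // [fileName, this, null]
    }
}
```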