ARist

ARist: An Effective API Argument Recommendation Approach

Learning and remembering how to use APIs is difficult. Several techniques have been proposed to assist developers in using APIs. Most existing techniques focus on recommending the right API methods to call, but very few focus on recommending API arguments. In this paper, we propose ARist, a novel automated argument recommendation approach that suggests arguments by predicting developers’ expectations when they define and use API methods. To realize this idea, ARist combines program analysis (PA), language models (LMs), and several features specialized for the recommendation task that consider the functionality of formal parameters and the positional information of code elements (e.g., variables or method calls) in the given context. In ARist, the LMs and the recommendation features are used to rank and suggest the promising candidates identified by PA. Meanwhile, PA restricts the LMs and the features to the set of valid candidates that satisfy the syntax, accessibility, and type-compatibility constraints defined by the programming language in use. Our empirical evaluation on a large dataset of real-world projects shows that ARist improves over the state-of-the-art approach by 19% and 18% in top-1 precision and recall, respectively, for recommending arguments of frequently-used libraries. For the general argument recommendation task, i.e., recommending arguments for every method call, ARist outperforms the baseline approaches by up to 125% in top-1 accuracy. Moreover, for newly-encountered projects, ARist achieves more than 60% top-3 accuracy when evaluated on a larger dataset. For projects under active development or maintenance, with a personalized LM that captures developers’ coding practice, ARist ranks the expected argument at the top-1 position in 7 out of 10 requests.

Data.

  1. List of the 1,000 most-starred projects used in the empirical study: here
  2. Small corpus: Eclipse and Netbeans
  3. Large corpus

Source code.

  1. Identify valid candidates
  2. Reduce candidates
  3. Rank candidates (a minimal sketch of this three-stage pipeline is shown below)
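
A minimal sketch of the three-stage pipeline listed above, under the assumption of a generic candidate representation: program analysis first identifies the expressions that are valid at the argument position, a lightweight scorer reduces them to a short list, and a heavier model (e.g., an LM) re-ranks that short list. All names below are hypothetical placeholders for illustration, not the actual ARist implementation.

```python
# Hypothetical sketch of a three-stage argument-recommendation pipeline;
# class, function, and parameter names are placeholders, not ARist's code.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    expr: str    # e.g., "fileName" or "config.getPath()"
    type_: str   # static type of the expression


def identify_valid_candidates(in_scope: List[Candidate],
                              expected_type: str,
                              type_compatible: Callable[[str, str], bool]) -> List[Candidate]:
    """Step 1 (program analysis): keep only expressions whose type is
    compatible with the formal parameter at the argument position
    (accessibility and syntax checks are omitted for brevity)."""
    return [c for c in in_scope if type_compatible(c.type_, expected_type)]


def reduce_candidates(candidates: List[Candidate],
                      light_score: Callable[[Candidate], float],
                      rt: int = 30) -> List[Candidate]:
    """Step 2: rank with a cheap scorer and keep only the top-RT candidates."""
    return sorted(candidates, key=light_score, reverse=True)[:rt]


def rank_candidates(shortlist: List[Candidate],
                    heavy_score: Callable[[Candidate], float]) -> List[Candidate]:
    """Step 3: re-rank the short list with an expensive scorer such as an LM."""
    return sorted(shortlist, key=heavy_score, reverse=True)
```

Because the expensive scorer only sees the reduced short list, accuracy and running time trade off through the threshold RT; the intrinsic evaluation results (4.1-4.3) below quantify the effect of these stages on both accuracy and running time.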

Experimental results

  1. Statistics of the dataset

                     Small corpus         Large corpus
  #Projects          Eclipse & Netbeans   9,271
  #Files             53,787               961,493
  #LOCs              7,218,637            84,236,829
  #AR requests       700,696              913,175

  2. Accuracy Comparison (RQ1)

2.1. Performance of the AR approaches for methods in frequently-used libraries

Project    Metric   ARist               PARC                GPT-2               SLP
                    Precision  Recall   Precision  Recall   Precision  Recall   Precision  Recall
Netbeans   Top-1    52.92%     51.67%   46.46%     44.86%   47.72%     46.63%   36.04%     36.04%
           Top-3    70.18%     68.28%   66.20%     66.75%   55.15%     53.90%   49.52%     49.52%
           Top-10   78.36%     76.15%   72.06%     69.57%   55.94%     54.67%   64.52%     64.52%
Eclipse    Top-1    56.66%     55.04%   47.65%     46.65%   61.37%     58.87%   26.24%     26.24%
           Top-3    67.88%     65.63%   65.05%     63.68%   68.85%     66.03%   37.00%     37.00%
           Top-10   73.14%     70.76%   72.26%     70.73%   69.75%     66.85%   54.39%     54.39%

2.2. Comparison on the general AR task

Project    Metric   ARist    GPT-2    CodeT5   SLP
Netbeans   Top-1    65.15%   52.63%   59.97%   34.91%
           Top-3    78.16%   57.69%   67.16%   48.10%
           Top-5    81.10%   57.87%   67.57%   55.02%
           Top-10   83.53%   57.88%   67.60%   67.20%
           MRR      0.72     0.55     0.63     0.44
Eclipse    Top-1    64.19%   56.53%   61.20%   28.52%
           Top-3    76.29%   61.89%   67.21%   41.60%
           Top-5    79.23%   62.09%   67.53%   49.46%
           Top-10   81.65%   62.10%   67.54%   62.67%
           MRR      0.70     0.59     0.64     0.38
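
For reference, the top-k accuracy and MRR numbers above can be computed as sketched below, assuming the standard definitions of these metrics: a request counts as a top-k hit if the expected argument appears among the first k suggestions, and MRR averages the reciprocal rank of the expected argument over all requests (counting a miss as 0).

```python
# Sketch of the evaluation metrics, assuming their standard definitions.
from typing import List, Optional


def top_k_accuracy(ranks: List[Optional[int]], k: int) -> float:
    """Fraction of requests whose expected argument is ranked within the top k.
    `ranks` holds the 1-based rank of the expected argument for each request,
    or None when the expected argument is not suggested at all."""
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


def mean_reciprocal_rank(ranks: List[Optional[int]]) -> float:
    """Average of 1/rank over all requests (0 when the argument is missing)."""
    return sum(1.0 / r if r is not None else 0.0 for r in ranks) / len(ranks)


# Example: three requests whose expected arguments are ranked 1st, 4th,
# and not suggested at all.
print(top_k_accuracy([1, 4, None], k=3))   # 1/3 ~= 0.33
print(mean_reciprocal_rank([1, 4, None]))  # (1 + 0.25 + 0) / 3 ~= 0.42
```
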
  3. Sensitivity analysis

3.1. Top-k accuracy of ARist in different scenarios

          New project   Working project   Maintain project
Top-1     53.42%        69.96%            74.49%
Top-3     61.50%        81.14%            83.23%
Top-5     64.21%        83.74%            85.38%
Top-10    67.96%        85.88%            87.38%
MRR       0.58          0.76              0.79

3.2. ARist’s performance by the expression types of expected arguments

Expression type     Distribution (%)   Top-1 (%)
Simple Name         48.14              83.66
Method Invocation   15.19              45.51
Field Access        6.09               31.01
Array Access        0.74               53.26
Cast Expr           0.99               18.46
String Literal      10.03              98.14
Number Literal      5.06               95.66
Character Literal   0.47               87.93
Type Literal        0.90               81.92
Bool Literal        1.50               78.43
Null Literal        0.79               84.45
Object Creation     2.09               51.96
Array Creation      0.29               43.14
This Expr           1.06               91.05
Super Expr          0.00               0.00
Compound Expr       5.65               3.69
Lambda Expr         0.73               78.83
Method Reference    0.28               0.56
Total               100.00             69.96

3.3. Impact of Context Length on Performance

Context length   l1      l2      l3      l4      l5
Top-1 (%)        62.05   65.86   66.14   67.00   67.83
MRR              0.70    0.72    0.72    0.73    0.74
Run. time (s)    0.33    0.39    0.42    0.51    0.56

  4. Intrinsic Evaluation Results

4.1. Impact of Valid Candidate Identification

Valid candidate identification   Top-1 (%)   MRR    Run. time (s)
ON                               69.96       0.76   0.444
OFF                              47.50       0.51   0.809

4.2. Impact of Candidate Reduction

Candidate reduction   Top-1 (%)   MRR    Run. time (s)
ON                    69.96       0.76   0.444
OFF                   61.98       0.69   2.424

4.3. Impact of the reducing threshold, RT

RT               10      20      30      40      50
Top-1 (%)        63.77   64.67   65.10   65.34   65.49
Run. time (s)    0.342   0.406   0.418   0.464   0.508

4.4. Impact of heavy-ranking stage

P_hr     Top-1 (%)   MRR    Run. time (s)
OFF      65.37       0.72   0.125
GPT-2    70.71       0.76   0.732
CodeT5   68.59       0.74   0.186
LSTM     49.26       0.61   0.198
n-gram   36.89       0.51   0.137
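
Table 4.4 varies only the model used in the heavy-ranking stage (P_hr), so that stage acts as a pluggable scorer over the reduced candidate set. The sketch below shows one plausible way to build such a scorer with an off-the-shelf causal LM from Hugging Face Transformers, scoring each candidate by the log-likelihood of completing the call site with it; the model choice, prompt format, and example candidates are illustrative assumptions, not ARist's actual configuration.

```python
# Hypothetical LM-based heavy ranker: scores a candidate argument by the
# log-likelihood a causal LM assigns to it right after the call-site context.
# Model choice and prompt format are illustrative assumptions, not ARist's.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()


def lm_score(context: str, candidate: str) -> float:
    """Sum of token log-probabilities of `candidate` given `context`."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    cand_ids = tokenizer(candidate, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, cand_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Each candidate token is predicted from the position just before it.
    log_probs = torch.log_softmax(logits[0, ctx_ids.size(1) - 1:-1], dim=-1)
    return log_probs.gather(1, cand_ids[0].unsqueeze(1)).sum().item()


# Rank a few hypothetical candidates for an (illustrative) Java call site.
context = "reader = new BufferedReader(new FileReader("
candidates = ["fileName", "path.toString()", "DEFAULT_CONFIG"]
print(sorted(candidates, key=lambda c: lm_score(context, c), reverse=True))
```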