EMPICA

An Empirical Study on Capability of Code Large Language Models in Understanding Code Semantics

Large Language Models for Code (code LLMs) have demonstrated remarkable performance across various software engineering (SE) tasks, which has increased their use in software development. Despite this success, significant concerns remain about the actual capabilities of these models: “whether these models really learn the semantics of code from the training data and leverage the learned knowledge to perform the SE tasks”. To address these concerns, in this paper we introduce EMPICA, a comprehensive framework designed to systematically and empirically evaluate the ability of code LLMs to understand code semantics. Specifically, EMPICA introduces controlled modifications/transformations into the input code and examines the models’ responses. In general, for every SE task, code LLMs must be robust to semantically equivalent code inputs and sensitive to non-equivalent ones: given an input code snippet c and its semantically equivalent variants, a code LLM must produce consistent/equivalent outputs, whereas it is expected to generate different outputs for c and its semantically non-equivalent variants. Our experimental results on three representative code understanding tasks (code summarization, method name prediction, and output prediction) reveal that the robustness and sensitivity of state-of-the-art code LLMs to code transformations vary significantly across tasks and transformation operators. In addition, the code LLMs are more robust to semantic-preserving transformations than they are sensitive to semantic-non-preserving transformations. These results highlight the need to enhance the models’ ability to understand code semantics, especially the sensitivity property.
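
As a minimal illustration of the two properties, the Python sketch below pairs an original snippet with one semantic-preserving (SP) variant (for loop converted to a while loop) and one semantic-non-preserving (SNP) variant (relational operator negated), then scores a set of made-up model answers. The function names, the sample inputs, the hypothetical model answers, and the `agreement_rate` helper are all assumptions made for illustration. They are not taken from the EMPICA code base, and the scores reported in this README may be computed differently (for example, with similarity-based metrics).

```python
from typing import List, Sequence, Tuple


def sum_positive(nums: Sequence[int]) -> int:
    """Original snippet c: sum of the strictly positive numbers."""
    total = 0
    for x in nums:
        if x > 0:
            total += x
    return total


def sum_positive_while(nums: Sequence[int]) -> int:
    """SP variant: the for loop is rewritten as a while loop."""
    total = 0
    i = 0
    while i < len(nums):
        if nums[i] > 0:
            total += nums[i]
        i += 1
    return total


def sum_positive_negated(nums: Sequence[int]) -> int:
    """SNP variant: the relational operator `>` is negated to `<=`."""
    total = 0
    for x in nums:
        if x <= 0:
            total += x
    return total


def agreement_rate(pairs: List[Tuple[str, str]]) -> float:
    """Fraction of (answer_for_original, answer_for_variant) pairs that match."""
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


# Sanity check by execution: the SP variant agrees with the original on every
# input, while the SNP variant disagrees on at least one input.
inputs = [[1, -2, 3], [-1, -1], [], [5]]
assert all(sum_positive(i) == sum_positive_while(i) for i in inputs)
assert any(sum_positive(i) != sum_positive_negated(i) for i in inputs)

# Hypothetical model answers for an output-prediction task
# (made-up strings for illustration, NOT results from the study).
sp_pairs = [("4", "4"), ("0", "0"), ("0", "0"), ("5", "4")]
snp_pairs = [("4", "4"), ("0", "-2"), ("0", "1"), ("5", "0")]

# Assumed scoring: robustness rewards identical answers on SP pairs,
# sensitivity rewards differing answers on SNP pairs.
robustness = agreement_rate(sp_pairs)         # 3 of 4 pairs match -> 0.75
sensitivity = 1 - agreement_rate(snp_pairs)   # 3 of 4 pairs differ -> 0.75
print(f"robustness={robustness:.2f}, sensitivity={sensitivity:.2f}")
```

Under this assumed scoring, a model that ignored the code entirely and always returned the same answer would reach a robustness of 1.0 but a sensitivity of 0.0, which is why the two properties are reported separately.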

The source code for reproducing the experiments can be found here.

Experimental results for HumanEval Benchmark

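  1. Code summarization
    About semantic similarity

| Transformation Type | Category | Subcategory | Java: Code Llama | Java: GPT-3.5 | Java: DeepSeek | Java: MagicCoder | Python: Code Llama | Python: GPT-3.5 | Python: DeepSeek | Python: MagicCoder |
|---|---|---|---|---|---|---|---|---|---|---|
| Robustness to SP Transform. | Control | Convert For/While | 0.93 | 0.95 | 0.95 | 0.94 | 0.90 | 0.92 | 0.92 | 0.93 |
| Robustness to SP Transform. | Control | Flip IfElse | 0.66 | 0.95 | 0.94 | 0.93 | 0.94 | 0.95 | 0.95 | 0.95 |
| Robustness to SP Transform. | Data | Rename variable | 0.82 | 0.93 | 0.88 | 0.87 | 0.84 | 0.81 | 0.85 | 0.85 |
| Robustness to SP Transform. | Data | Reorder parameters | 0.63 | 0.95 | 0.93 | 0.94 | 0.94 | 0.90 | 0.95 | 0.93 |
| Sensitivity to SNP Transform. | Control | Negate relational operator | 0.07 | 0.05 | 0.06 | 0.06 | 0.05 | 0.07 | 0.05 | 0.05 |
| Sensitivity to SNP Transform. | Control | Remove conditional statement | 0.11 | 0.09 | 0.12 | 0.11 | 0.11 | 0.09 | 0.11 | 0.11 |
| Sensitivity to SNP Transform. | Data | Replace arithmetic operator | 0.40 | 0.07 | 0.08 | 0.09 | 0.07 | 0.12 | 0.09 | 0.08 |
| Sensitivity to SNP Transform. | Data | Remove def statement | 0.10 | 0.07 | 0.09 | 0.11 | 0.08 | 0.12 | 0.09 | 0.10 |

About lexical similarity

| Transformation Type | Category | Subcategory | Java: Code Llama | Java: GPT-3.5 | Java: DeepSeek | Java: MagicCoder | Python: Code Llama | Python: GPT-3.5 | Python: DeepSeek | Python: MagicCoder |
|---|---|---|---|---|---|---|---|---|---|---|
| Robustness to SP Transform. | Control | Convert For/While | 0.87 | 0.83 | 0.89 | 0.87 | 0.84 | 0.76 | 0.97 | 0.84 |
| Robustness to SP Transform. | Control | Flip IfElse | 0.71 | 0.86 | 0.86 | 0.87 | 0.90 | 0.83 | 0.90 | 0.90 |
| Robustness to SP Transform. | Data | Rename variable | 0.78 | 0.77 | 0.80 | 0.79 | 0.77 | 0.62 | 0.78 | 0.76 |
| Robustness to SP Transform. | Data | Reorder parameters | 0.70 | 0.83 | 0.85 | 0.86 | 0.87 | 0.77 | 0.88 | 0.87 |
| Sensitivity to SNP Transform. | Control | Negate relational operator | 0.11 | 0.16 | 0.13 | 0.13 | 0.11 | 0.18 | 0.11 | 0.11 |
| Sensitivity to SNP Transform. | Control | Remove conditional statement | 0.17 | 0.27 | 0.20 | 0.20 | 0.21 | 0.29 | 0.20 | 0.21 |
| Sensitivity to SNP Transform. | Data | Replace arithmetic operator | 0.33 | 0.20 | 0.15 | 0.17 | 0.14 | 0.29 | 0.15 | 0.15 |
| Sensitivity to SNP Transform. | Data | Remove def statement | 0.16 | 0.24 | 0.17 | 0.20 | 0.16 | 0.29 | 0.17 | 0.19 |
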
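  2. Method Name Prediction
    About exact match

| Transformation Type | Category | Subcategory | Java: Code Llama | Java: GPT-3.5 | Java: DeepSeek | Java: MagicCoder | Python: Code Llama | Python: GPT-3.5 | Python: DeepSeek | Python: MagicCoder |
|---|---|---|---|---|---|---|---|---|---|---|
| Robustness to SP Transform. | Control | Convert For/While | 0.81 | 0.71 | 0.72 | 0.85 | 0.42 | 0.53 | 0.60 | 0.53 |
| Robustness to SP Transform. | Control | Flip IfElse | 0.59 | 0.46 | 0.80 | 0.83 | 0.47 | 0.41 | 0.53 | 0.67 |
| Robustness to SP Transform. | Data | Rename variable | 0.40 | 0.37 | 0.34 | 0.52 | 0.32 | 0.18 | 0.32 | 0.36 |
| Robustness to SP Transform. | Data | Reorder parameters | 0.56 | 0.59 | 0.72 | 0.63 | 0.53 | 0.49 | 0.66 | 0.66 |
| Sensitivity to SNP Transform. | Control | Negate relational operator | 0.47 | 0.57 | 0.55 | 0.41 | 0.54 | 0.54 | 0.62 | 0.50 |
| Sensitivity to SNP Transform. | Control | Remove conditional statement | 0.65 | 0.65 | 0.61 | 0.62 | 0.70 | 0.54 | 0.72 | 0.67 |
| Sensitivity to SNP Transform. | Data | Replace arithmetic operator | 0.57 | 0.72 | 0.51 | 0.56 | 0.70 | 0.66 | 0.65 | 0.63 |
| Sensitivity to SNP Transform. | Data | Remove def statement | 0.32 | 0.53 | 0.49 | 0.47 | 0.38 | 0.58 | 0.61 | 0.45 |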

###### F1-Score

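| Transformation Type | Category | Subcategory | Java: Code Llama | Java: GPT-3.5 | Java: DeepSeek | Java: MagicCoder | Python: Code Llama | Python: GPT-3.5 | Python: DeepSeek | Python: MagicCoder |
|---|---|---|---|---|---|---|---|---|---|---|
| Robustness to SP Transform. | Control | Convert For/While | 0.90 | 0.89 | 0.85 | 0.93 | 0.67 | 0.80 | 0.82 | 0.77 |
| Robustness to SP Transform. | Control | Flip IfElse | 0.78 | 0.72 | 0.91 | 0.94 | 0.70 | 0.76 | 0.80 | 0.82 |
| Robustness to SP Transform. | Data | Rename variable | 0.61 | 0.72 | 0.62 | 0.74 | 0.55 | 0.44 | 0.62 | 0.63 |
| Robustness to SP Transform. | Data | Reorder parameters | 0.71 | 0.83 | 0.88 | 0.79 | 0.67 | 0.77 | 0.81 | 0.82 |
| Sensitivity to SNP Transform. | Control | Negate relational operator | 0.25 | 0.26 | 0.30 | 0.20 | 0.32 | 0.23 | 0.31 | 0.24 |
| Sensitivity to SNP Transform. | Control | Remove conditional statement | 0.41 | 0.35 | 0.42 | 0.44 | 0.47 | 0.31 | 0.44 | 0.42 |
| Sensitivity to SNP Transform. | Data | Replace arithmetic operator | 0.36 | 0.39 | 0.34 | 0.36 | 0.50 | 0.38 | 0.38 | 0.37 |
| Sensitivity to SNP Transform. | Data | Remove def statement | 0.17 | 0.26 | 0.25 | 0.24 | 0.25 | 0.26 | 0.30 | 0.25 |

  3. Output Prediction

| Transformation Type | Category | Subcategory | Java: Code Llama | Java: GPT-3.5 | Java: DeepSeek | Java: MagicCoder | Python: Code Llama | Python: GPT-3.5 | Python: DeepSeek | Python: MagicCoder |
|---|---|---|---|---|---|---|---|---|---|---|
| Robustness to SP Transform. | Control | Convert For/While | 0.75 | 0.69 | 0.78 | 0.80 | 0.69 | 0.69 | 0.57 | 0.63 |
| Robustness to SP Transform. | Control | Flip IfElse | 0.70 | 0.66 | 0.68 | 0.68 | 0.70 | 0.70 | 0.52 | 0.67 |
| Robustness to SP Transform. | Data | Rename variable | 0.54 | 0.59 | 0.62 | 0.60 | 0.62 | 0.47 | 0.51 | 0.49 |
| Robustness to SP Transform. | Data | Reorder parameters | - | - | - | - | - | - | - | - |
| Sensitivity to SNP Transform. | Control | Negate relational operator | 0.35 | 0.46 | 0.38 | 0.41 | 0.33 | 0.44 | 0.50 | 0.46 |
| Sensitivity to SNP Transform. | Control | Remove conditional statement | 0.43 | 0.49 | 0.52 | 0.57 | 0.48 | 0.52 | 0.62 | 0.61 |
| Sensitivity to SNP Transform. | Data | Replace arithmetic operator | 0.44 | 0.51 | 0.51 | 0.49 | 0.37 | 0.47 | 0.62 | 0.59 |
| Sensitivity to SNP Transform. | Data | Remove def statement | 0.36 | 0.43 | 0.50 | 0.41 | 0.31 | 0.44 | 0.54 | 0.55 |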

The detailed responses of the code LLMs to the original programs and the transformed code of the HumanEval benchmark can be found here.

Experimental results for MBPP Benchmark

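  1. Code summarization
    About semantic similarity

| Transformation Type | Category | Subcategory | Python: Code Llama | Python: GPT-3.5 | Python: DeepSeek | Python: MagicCoder |
|---|---|---|---|---|---|---|
| Robustness to SP Transform. | Control | Convert For/While | 0.90 | 0.93 | 0.93 | 0.91 |
| Robustness to SP Transform. | Control | Flip IfElse | 0.93 | 0.94 | 0.95 | 0.93 |
| Robustness to SP Transform. | Data | Rename variable | 0.79 | 0.81 | 0.80 | 0.81 |
| Robustness to SP Transform. | Data | Reorder parameters | 0.95 | 0.91 | 0.94 | 0.95 |
| Sensitivity to SNP Transform. | Control | Negate relational operator | 0.06 | 0.08 | 0.05 | 0.05 |
| Sensitivity to SNP Transform. | Control | Remove conditional statement | 0.09 | 0.12 | 0.09 | 0.11 |
| Sensitivity to SNP Transform. | Data | Replace arithmetic operator | 0.07 | 0.12 | 0.08 | 0.09 |
| Sensitivity to SNP Transform. | Data | Remove def statement | 0.08 | 0.13 | 0.10 | 0.10 |

About lexical similarity

| Transformation Type | Category | Subcategory | Python: Code Llama | Python: GPT-3.5 | Python: DeepSeek | Python: MagicCoder |
|---|---|---|---|---|---|---|
| Robustness to SP Transform. | Control | Convert For/While | 0.85 | 0.82 | 0.87 | 0.83 |
| Robustness to SP Transform. | Control | Flip IfElse | 0.89 | 0.84 | 0.88 | 0.87 |
| Robustness to SP Transform. | Data | Rename variable | 0.73 | 0.63 | 0.71 | 0.71 |
| Robustness to SP Transform. | Data | Reorder parameters | 0.89 | 0.78 | 0.88 | 0.87 |
| Sensitivity to SNP Transform. | Control | Negate relational operator | 0.11 | 0.18 | 0.11 | 0.12 |
| Sensitivity to SNP Transform. | Control | Remove conditional statement | 0.17 | 0.28 | 0.19 | 0.21 |
| Sensitivity to SNP Transform. | Data | Replace arithmetic operator | 0.13 | 0.24 | 0.14 | 0.16 |
| Sensitivity to SNP Transform. | Data | Remove def statement | 0.14 | 0.31 | 0.19 | 0.20 |
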
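  2. Method Name Prediction
    About exact match

| Transformation Type | Category | Subcategory | Python: Code Llama | Python: GPT-3.5 | Python: DeepSeek | Python: MagicCoder |
|---|---|---|---|---|---|---|
| Robustness to SP Transform. | Control | Convert For/While | 0.52 | 0.64 | 0.62 | 0.78 |
| Robustness to SP Transform. | Control | Flip IfElse | 0.50 | 0.58 | 0.65 | 0.67 |
| Robustness to SP Transform. | Data | Rename variable | 0.20 | 0.16 | 0.16 | 0.21 |
| Robustness to SP Transform. | Data | Reorder parameters | 0.61 | 0.61 | 0.71 | 0.72 |
| Sensitivity to SNP Transform. | Control | Negate relational operator | 0.53 | 0.57 | 0.47 | 0.53 |
| Sensitivity to SNP Transform. | Control | Remove conditional statement | 0.64 | 0.68 | 0.62 | 0.62 |
| Sensitivity to SNP Transform. | Data | Replace arithmetic operator | 0.61 | 0.68 | 0.63 | 0.66 |
| Sensitivity to SNP Transform. | Data | Remove def statement | 0.50 | 0.58 | 0.64 | 0.55 |

About F1-Score

| Transformation Type | Category | Subcategory | Python: Code Llama | Python: GPT-3.5 | Python: DeepSeek | Python: MagicCoder |
|---|---|---|---|---|---|---|
| Robustness to SP Transform. | Control | Convert For/While | 0.70 | 0.85 | 0.83 | 0.87 |
| Robustness to SP Transform. | Control | Flip IfElse | 0.68 | 0.79 | 0.81 | 0.79 |
| Robustness to SP Transform. | Data | Rename variable | 0.36 | 0.42 | 0.38 | 0.36 |
| Robustness to SP Transform. | Data | Reorder parameters | 0.76 | 0.84 | 0.85 | 0.86 |
| Sensitivity to SNP Transform. | Control | Negate relational operator | 0.32 | 0.27 | 0.26 | 0.28 |
| Sensitivity to SNP Transform. | Control | Remove conditional statement | 0.43 | 0.39 | 0.38 | 0.42 |
| Sensitivity to SNP Transform. | Data | Replace arithmetic operator | 0.42 | 0.39 | 0.38 | 0.43 |
| Sensitivity to SNP Transform. | Data | Remove def statement | 0.32 | 0.33 | 0.35 | 0.34 |
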
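  3. Output Prediction

| Transformation Type | Category | Subcategory | Python: Code Llama | Python: GPT-3.5 | Python: DeepSeek | Python: MagicCoder |
|---|---|---|---|---|---|---|
| Robustness to SP Transform. | Control | Convert For/While | 0.67 | 0.82 | 0.53 | 0.47 |
| Robustness to SP Transform. | Control | Flip IfElse | 0.63 | 0.77 | 0.62 | 0.49 |
| Robustness to SP Transform. | Data | Rename variable | 0.46 | 0.63 | 0.40 | 0.31 |
| Robustness to SP Transform. | Data | Reorder parameters | 0.55 | 0.65 | 0.44 | 0.37 |
| Sensitivity to SNP Transform. | Control | Negate relational operator | 0.42 | 0.37 | 0.54 | 0.60 |
| Sensitivity to SNP Transform. | Control | Remove conditional statement | 0.52 | 0.38 | 0.60 | 0.69 |
| Sensitivity to SNP Transform. | Data | Replace arithmetic operator | 0.57 | 0.56 | 0.74 | 0.78 |
| Sensitivity to SNP Transform. | Data | Remove def statement | 0.43 | 0.38 | 0.66 | 0.68 |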

The detailed responses of the code LLMs to the original programs and the transformed code of the MBPP benchmark can be found here.