An Empirical Study on Capability of Code Large Language Models in Understanding Code Semantics
Large Language Models for Code (code LLMs) have demonstrated remarkable performance across various software engineering (SE) tasks, increasing their adoption in software development. Despite this success, significant concerns remain about the actual capabilities of these models: do they really learn the semantics of code from the training data, and do they leverage that knowledge to perform SE tasks? To address these concerns, we introduce EMPICA, a comprehensive framework designed to systematically and empirically evaluate the ability of code LLMs to understand code semantics. EMPICA introduces controlled transformations into the input code and examines the models' responses. In general, for every SE task, code LLMs should be robust to semantically equivalent code inputs and sensitive to non-equivalent ones: given an input code snippet c, they should produce consistent/equivalent outputs for c and its semantically equivalent variants, while producing different outputs for c and its semantically non-equivalent variants. Our experimental results on three representative code understanding tasks (code summarization, method name prediction, and output prediction) reveal that the robustness and sensitivity of state-of-the-art code LLMs to code transformations vary significantly across tasks and transformation operators. In addition, the code LLMs are more robust to semantic-preserving transformations than they are sensitive to semantic-non-preserving ones. These results highlight the need to enhance the models' capability of understanding code semantics, especially the sensitivity property.
The source code for reproducing the experiments can be found here.
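To make the transformation operators in the tables below concrete, here is a minimal illustrative sketch of one semantic-preserving (SP) operator, Convert For/While, and one semantic-non-preserving (SNP) operator, Negate relational operator. This is our own toy example, not EMPICA's actual implementation:

```python
# Toy example of EMPICA-style transformations (illustrative only;
# the framework's actual operators may be implemented differently).

def original(nums):
    """Count the strictly positive numbers in a list."""
    total = 0
    for n in nums:
        if n > 0:
            total += 1
    return total

# SP transformation: Convert For/While.
# Same behavior, different syntax: a robust model should produce an
# equivalent summary/method name/output for this variant.
def sp_variant(nums):
    total = 0
    i = 0
    while i < len(nums):
        if nums[i] > 0:
            total += 1
        i += 1
    return total

# SNP transformation: Negate relational operator.
# The behavior changes (it now counts non-positive numbers), so a
# sensitive model should produce a *different* summary/name/output.
def snp_variant(nums):
    total = 0
    for n in nums:
        if n <= 0:  # ">" negated to "<="
            total += 1
    return total

assert original([1, -2, 3]) == sp_variant([1, -2, 3]) == 2
assert snp_variant([1, -2, 3]) == 1  # semantics diverge
```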
Experimental results for the HumanEval benchmark
- Code summarization
About semantic similarity

| Transformation Type | Category | Subcategory | Code Llama (Java) | GPT-3.5 (Java) | DeepSeek (Java) | MagicCoder (Java) | Code Llama (Python) | GPT-3.5 (Python) | DeepSeek (Python) | MagicCoder (Python) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Robustness to SP Transform. | Control | Convert For/While | 0.93 | 0.95 | 0.95 | 0.94 | 0.90 | 0.92 | 0.92 | 0.93 |
| | Control | Flip IfElse | 0.66 | 0.95 | 0.94 | 0.93 | 0.94 | 0.95 | 0.95 | 0.95 |
| | Data | Rename variable | 0.82 | 0.93 | 0.88 | 0.87 | 0.84 | 0.81 | 0.85 | 0.85 |
| | Data | Reorder parameters | 0.63 | 0.95 | 0.93 | 0.94 | 0.94 | 0.90 | 0.95 | 0.93 |
| Sensitivity to SNP Transform. | Control | Negate relational operator | 0.07 | 0.05 | 0.06 | 0.06 | 0.05 | 0.07 | 0.05 | 0.05 |
| | Control | Remove conditional statement | 0.11 | 0.09 | 0.12 | 0.11 | 0.11 | 0.09 | 0.11 | 0.11 |
| | Data | Replace arithmetic operator | 0.40 | 0.07 | 0.08 | 0.09 | 0.07 | 0.12 | 0.09 | 0.08 |
| | Data | Remove def statement | 0.10 | 0.07 | 0.09 | 0.11 | 0.08 | 0.12 | 0.09 | 0.10 |
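How such semantic-similarity scores can be computed: a common choice is the cosine similarity between sentence embeddings of the summary generated for the original code and the summary generated for the transformed code. The sketch below assumes `sentence-transformers` with the `all-MiniLM-L6-v2` model; the paper's exact metric and embedding model may differ:

```python
# Minimal sketch of embedding-based semantic similarity between two
# generated summaries (assumed metric; not necessarily the one used here).
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(summary_a: str, summary_b: str) -> float:
    emb = encoder.encode([summary_a, summary_b])
    return float(util.cos_sim(emb[0], emb[1]))

# Paraphrases score high, which is why the SP rows above sit near 1.0
# when a model's summaries are stable under transformation.
print(semantic_similarity(
    "Counts the positive numbers in a list.",
    "Returns how many list elements are greater than zero.",
))  # high for paraphrases
```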
About lexical similarity

| Transformation Type | Category | Subcategory | Code Llama (Java) | GPT-3.5 (Java) | DeepSeek (Java) | MagicCoder (Java) | Code Llama (Python) | GPT-3.5 (Python) | DeepSeek (Python) | MagicCoder (Python) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Robustness to SP Transform. | Control | Convert For/While | 0.87 | 0.83 | 0.89 | 0.87 | 0.84 | 0.76 | 0.97 | 0.84 |
| | Control | Flip IfElse | 0.71 | 0.86 | 0.86 | 0.87 | 0.90 | 0.83 | 0.90 | 0.90 |
| | Data | Rename variable | 0.78 | 0.77 | 0.80 | 0.79 | 0.77 | 0.62 | 0.78 | 0.76 |
| | Data | Reorder parameters | 0.70 | 0.83 | 0.85 | 0.86 | 0.87 | 0.77 | 0.88 | 0.87 |
| Sensitivity to SNP Transform. | Control | Negate relational operator | 0.11 | 0.16 | 0.13 | 0.13 | 0.11 | 0.18 | 0.11 | 0.11 |
| | Control | Remove conditional statement | 0.17 | 0.27 | 0.20 | 0.20 | 0.21 | 0.29 | 0.20 | 0.21 |
| | Data | Replace arithmetic operator | 0.33 | 0.20 | 0.15 | 0.17 | 0.14 | 0.29 | 0.15 | 0.15 |
| | Data | Remove def statement | 0.16 | 0.24 | 0.17 | 0.20 | 0.16 | 0.29 | 0.17 | 0.19 |
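Lexical similarity, in contrast, compares the surface tokens of the two summaries, which is why the scores above are generally lower than the semantic ones: paraphrases share meaning but not wording. A minimal sketch, assuming smoothed sentence-level BLEU as the lexical metric (an assumption; the paper may use a different token-overlap measure):

```python
# Minimal sketch of a lexical-similarity score between two summaries,
# using smoothed sentence-level BLEU from NLTK (assumed metric).
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

def lexical_similarity(summary_a: str, summary_b: str) -> float:
    reference = [summary_a.lower().split()]  # BLEU expects a list of references
    hypothesis = summary_b.lower().split()
    return sentence_bleu(reference, hypothesis,
                         smoothing_function=SmoothingFunction().method1)

# Identical wording scores 1.0; paraphrases with little token overlap
# score much lower even when semantically equivalent.
print(lexical_similarity(
    "counts the positive numbers in a list",
    "counts the positive numbers in the given list",
))
```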
- Method Name Prediction
About exact match

| Transformation Type | Category | Subcategory | Code Llama (Java) | GPT-3.5 (Java) | DeepSeek (Java) | MagicCoder (Java) | Code Llama (Python) | GPT-3.5 (Python) | DeepSeek (Python) | MagicCoder (Python) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Robustness to SP Transform. | Control | Convert For/While | 0.81 | 0.71 | 0.72 | 0.85 | 0.42 | 0.53 | 0.60 | 0.53 |
| | Control | Flip IfElse | 0.59 | 0.46 | 0.80 | 0.83 | 0.47 | 0.41 | 0.53 | 0.67 |
| | Data | Rename variable | 0.40 | 0.37 | 0.34 | 0.52 | 0.32 | 0.18 | 0.32 | 0.36 |
| | Data | Reorder parameters | 0.56 | 0.59 | 0.72 | 0.63 | 0.53 | 0.49 | 0.66 | 0.66 |
| Sensitivity to SNP Transform. | Control | Negate relational operator | 0.47 | 0.57 | 0.55 | 0.41 | 0.54 | 0.54 | 0.62 | 0.50 |
| | Control | Remove conditional statement | 0.65 | 0.65 | 0.61 | 0.62 | 0.70 | 0.54 | 0.72 | 0.67 |
| | Data | Replace arithmetic operator | 0.57 | 0.72 | 0.51 | 0.56 | 0.70 | 0.66 | 0.65 | 0.63 |
| | Data | Remove def statement | 0.32 | 0.53 | 0.49 | 0.47 | 0.38 | 0.58 | 0.61 | 0.45 |
About F1-Score

| Transformation Type | Category | Subcategory | Code Llama (Java) | GPT-3.5 (Java) | DeepSeek (Java) | MagicCoder (Java) | Code Llama (Python) | GPT-3.5 (Python) | DeepSeek (Python) | MagicCoder (Python) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Robustness to SP Transform. | Control | Convert For/While | 0.90 | 0.89 | 0.85 | 0.93 | 0.67 | 0.80 | 0.82 | 0.77 |
| | Control | Flip IfElse | 0.78 | 0.72 | 0.91 | 0.94 | 0.70 | 0.76 | 0.80 | 0.82 |
| | Data | Rename variable | 0.61 | 0.72 | 0.62 | 0.74 | 0.55 | 0.44 | 0.62 | 0.63 |
| | Data | Reorder parameters | 0.71 | 0.83 | 0.88 | 0.79 | 0.67 | 0.77 | 0.81 | 0.82 |
| Sensitivity to SNP Transform. | Control | Negate relational operator | 0.25 | 0.26 | 0.30 | 0.20 | 0.32 | 0.23 | 0.31 | 0.24 |
| | Control | Remove conditional statement | 0.41 | 0.35 | 0.42 | 0.44 | 0.47 | 0.31 | 0.44 | 0.42 |
| | Data | Replace arithmetic operator | 0.36 | 0.39 | 0.34 | 0.36 | 0.50 | 0.38 | 0.38 | 0.37 |
| | Data | Remove def statement | 0.17 | 0.26 | 0.25 | 0.24 | 0.25 | 0.26 | 0.30 | 0.25 |
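For reference, here is how the two method-name metrics can be computed. Exact match checks whether the predicted name is identical to the reference, while F1 is usually taken over the name's subtokens. The sketch below assumes camelCase/snake_case subtokenization and set-based overlap, which are common conventions for this task rather than details confirmed by the paper:

```python
# Minimal sketch of exact match and subtoken-level F1 for method name
# prediction (assumed formulation; the exact computation may differ).
import re

def subtokens(name: str) -> list:
    """Split a camelCase or snake_case identifier into lowercase subtokens."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name)
    return [t.lower() for t in re.split(r"[_\s]+", spaced) if t]

def exact_match(pred: str, ref: str) -> float:
    return float(subtokens(pred) == subtokens(ref))

def f1(pred: str, ref: str) -> float:
    p, r = set(subtokens(pred)), set(subtokens(ref))
    common = len(p & r)
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(r)
    return 2 * precision * recall / (precision + recall)

print(exact_match("getUserName", "get_user_name"))  # 1.0 (same subtokens)
print(f1("getName", "getUserName"))                 # 0.8 (partial credit)
```

This is also why the F1 numbers above are consistently higher than exact match: partially correct names still earn credit.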
- Output Prediction

| Transformation Type | Category | Subcategory | Code Llama (Java) | GPT-3.5 (Java) | DeepSeek (Java) | MagicCoder (Java) | Code Llama (Python) | GPT-3.5 (Python) | DeepSeek (Python) | MagicCoder (Python) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Robustness to SP Transform. | Control | Convert For/While | 0.75 | 0.69 | 0.78 | 0.80 | 0.69 | 0.69 | 0.57 | 0.63 |
| | Control | Flip IfElse | 0.70 | 0.66 | 0.68 | 0.68 | 0.70 | 0.70 | 0.52 | 0.67 |
| | Data | Rename variable | 0.54 | 0.59 | 0.62 | 0.60 | 0.62 | 0.47 | 0.51 | 0.49 |
| | Data | Reorder parameters | - | - | - | - | - | - | - | - |
| Sensitivity to SNP Transform. | Control | Negate relational operator | 0.35 | 0.46 | 0.38 | 0.41 | 0.33 | 0.44 | 0.50 | 0.46 |
| | Control | Remove conditional statement | 0.43 | 0.49 | 0.52 | 0.57 | 0.48 | 0.52 | 0.62 | 0.61 |
| | Data | Replace arithmetic operator | 0.44 | 0.51 | 0.51 | 0.49 | 0.37 | 0.47 | 0.62 | 0.59 |
| | Data | Remove def statement | 0.36 | 0.43 | 0.50 | 0.41 | 0.31 | 0.44 | 0.54 | 0.55 |
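For output prediction, the natural reading of these scores is the fraction of variants on which the model's predicted output stays consistent with its prediction on the original code (robustness, SP rows) or changes (sensitivity, SNP rows). A minimal sketch of that aggregation, under our assumption of the formula:

```python
# Assumed aggregation of output-prediction results (illustrative;
# the paper's exact scoring formula is not shown here).

def robustness(pairs):
    """pairs: (output predicted for original, output predicted for SP variant)."""
    return sum(a == b for a, b in pairs) / len(pairs)

def sensitivity(pairs):
    """pairs: (output predicted for original, output predicted for SNP variant)."""
    return sum(a != b for a, b in pairs) / len(pairs)

preds = [("42", "42"), ("[1, 2]", "[1, 2]"), ("True", "False")]
print(robustness(preds))   # ~0.67: consistent on 2 of 3 variants
print(sensitivity(preds))  # ~0.33 (if these were SNP variants)
```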
The detailed responses of the code LLMs for the original and transformed programs of the HumanEval benchmark can be found here.
Experimental results for the MBPP benchmark
- Code summarization
About semantic similarity

| Transformation Type | Category | Subcategory | Code Llama (Python) | GPT-3.5 (Python) | DeepSeek (Python) | MagicCoder (Python) |
| --- | --- | --- | --- | --- | --- | --- |
| Robustness to SP Transform. | Control | Convert For/While | 0.90 | 0.93 | 0.93 | 0.91 |
| | Control | Flip IfElse | 0.93 | 0.94 | 0.95 | 0.93 |
| | Data | Rename variable | 0.79 | 0.81 | 0.80 | 0.81 |
| | Data | Reorder parameters | 0.95 | 0.91 | 0.94 | 0.95 |
| Sensitivity to SNP Transform. | Control | Negate relational operator | 0.06 | 0.08 | 0.05 | 0.05 |
| | Control | Remove conditional statement | 0.09 | 0.12 | 0.09 | 0.11 |
| | Data | Replace arithmetic operator | 0.07 | 0.12 | 0.08 | 0.09 |
| | Data | Remove def statement | 0.08 | 0.13 | 0.10 | 0.10 |
About lexical similarity

| Transformation Type | Category | Subcategory | Code Llama (Python) | GPT-3.5 (Python) | DeepSeek (Python) | MagicCoder (Python) |
| --- | --- | --- | --- | --- | --- | --- |
| Robustness to SP Transform. | Control | Convert For/While | 0.85 | 0.82 | 0.87 | 0.83 |
| | Control | Flip IfElse | 0.89 | 0.84 | 0.88 | 0.87 |
| | Data | Rename variable | 0.73 | 0.63 | 0.71 | 0.71 |
| | Data | Reorder parameters | 0.89 | 0.78 | 0.88 | 0.87 |
| Sensitivity to SNP Transform. | Control | Negate relational operator | 0.11 | 0.18 | 0.11 | 0.12 |
| | Control | Remove conditional statement | 0.17 | 0.28 | 0.19 | 0.21 |
| | Data | Replace arithmetic operator | 0.13 | 0.24 | 0.14 | 0.16 |
| | Data | Remove def statement | 0.14 | 0.31 | 0.19 | 0.20 |
- Method Name Prediction
About exact match

| Transformation Type | Category | Subcategory | Code Llama (Python) | GPT-3.5 (Python) | DeepSeek (Python) | MagicCoder (Python) |
| --- | --- | --- | --- | --- | --- | --- |
| Robustness to SP Transform. | Control | Convert For/While | 0.52 | 0.64 | 0.62 | 0.78 |
| | Control | Flip IfElse | 0.50 | 0.58 | 0.65 | 0.67 |
| | Data | Rename variable | 0.20 | 0.16 | 0.16 | 0.21 |
| | Data | Reorder parameters | 0.61 | 0.61 | 0.71 | 0.72 |
| Sensitivity to SNP Transform. | Control | Negate relational operator | 0.53 | 0.57 | 0.47 | 0.53 |
| | Control | Remove conditional statement | 0.64 | 0.68 | 0.62 | 0.62 |
| | Data | Replace arithmetic operator | 0.61 | 0.68 | 0.63 | 0.66 |
| | Data | Remove def statement | 0.50 | 0.58 | 0.64 | 0.55 |
About F1-Score

| Transformation Type | Category | Subcategory | Code Llama (Python) | GPT-3.5 (Python) | DeepSeek (Python) | MagicCoder (Python) |
| --- | --- | --- | --- | --- | --- | --- |
| Robustness to SP Transform. | Control | Convert For/While | 0.70 | 0.85 | 0.83 | 0.87 |
| | Control | Flip IfElse | 0.68 | 0.79 | 0.81 | 0.79 |
| | Data | Rename variable | 0.36 | 0.42 | 0.38 | 0.36 |
| | Data | Reorder parameters | 0.76 | 0.84 | 0.85 | 0.86 |
| Sensitivity to SNP Transform. | Control | Negate relational operator | 0.32 | 0.27 | 0.26 | 0.28 |
| | Control | Remove conditional statement | 0.43 | 0.39 | 0.38 | 0.42 |
| | Data | Replace arithmetic operator | 0.42 | 0.39 | 0.38 | 0.43 |
| | Data | Remove def statement | 0.32 | 0.33 | 0.35 | 0.34 |
- Output Prediction

| Transformation Type | Category | Subcategory | Code Llama (Python) | GPT-3.5 (Python) | DeepSeek (Python) | MagicCoder (Python) |
| --- | --- | --- | --- | --- | --- | --- |
| Robustness to SP Transform. | Control | Convert For/While | 0.67 | 0.82 | 0.53 | 0.47 |
| | Control | Flip IfElse | 0.63 | 0.77 | 0.62 | 0.49 |
| | Data | Rename variable | 0.46 | 0.63 | 0.40 | 0.31 |
| | Data | Reorder parameters | 0.55 | 0.65 | 0.44 | 0.37 |
| Sensitivity to SNP Transform. | Control | Negate relational operator | 0.42 | 0.37 | 0.54 | 0.60 |
| | Control | Remove conditional statement | 0.52 | 0.38 | 0.60 | 0.69 |
| | Data | Replace arithmetic operator | 0.57 | 0.56 | 0.74 | 0.78 |
| | Data | Remove def statement | 0.43 | 0.38 | 0.66 | 0.68 |
The detailed responses of the code LLMs for the original and transformed programs of the MBPP benchmark can be found here.