An Empirical Study on Capability of Code Large Language Models in Understanding Code Semantics
Large Language Models for Code (code LLMs) have demonstrated remarkable performance across various software engineering (SE) tasks, increasing their adoption in software development. Despite this success, significant concerns remain about the actual capabilities of these models: do they really learn the semantics of code from the training data, and do they leverage that knowledge to perform SE tasks? To address these concerns, we introduce EMPICA, a comprehensive framework designed to systematically and empirically evaluate the ability of code LLMs to understand code semantics. EMPICA introduces controlled transformations into the input code and examines the models' responses. In general, for every SE task, code LLMs should be robust to semantically equivalent code inputs and sensitive to non-equivalent ones: given an input code snippet c, they should produce consistent/equivalent outputs for c and its semantically equivalent variants, while producing different outputs for c and its semantically non-equivalent variants. Our experimental results on three representative code understanding tasks (code summarization, method name prediction, and output prediction) reveal that the robustness and sensitivity of state-of-the-art code LLMs to code transformations vary significantly across tasks and transformation operators. In addition, the code LLMs are more robust to semantic-preserving transformations than they are sensitive to semantic-non-preserving ones. These results highlight the need to enhance the models' capability of understanding code semantics, especially the sensitivity property.
The source code for reproducing the experiments can be found here.
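To make the transformation operators in the tables below concrete, here is a minimal illustrative sketch of one semantic-preserving (SP) operator, Convert For/While, and one semantic-non-preserving (SNP) operator, Negate relational operator. This is our own toy example, not EMPICA's actual implementation:

```python
# Toy example of EMPICA-style transformations (illustrative only;
# the framework's actual operators may be implemented differently).

def original(nums):
    """Count the strictly positive numbers in a list."""
    total = 0
    for n in nums:
        if n > 0:
            total += 1
    return total

# SP transformation: Convert For/While.
# Same behavior, different syntax: a robust model should produce an
# equivalent summary/method name/output for this variant.
def sp_variant(nums):
    total = 0
    i = 0
    while i < len(nums):
        if nums[i] > 0:
            total += 1
        i += 1
    return total

# SNP transformation: Negate relational operator.
# The behavior changes (it now counts non-positive numbers), so a
# sensitive model should produce a *different* summary/name/output.
def snp_variant(nums):
    total = 0
    for n in nums:
        if n <= 0:  # ">" negated to "<="
            total += 1
    return total

assert original([1, -2, 3]) == sp_variant([1, -2, 3]) == 2
assert snp_variant([1, -2, 3]) == 1  # semantics diverge
```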
Experimental results for the HumanEval benchmark
- Code summarization
About semantic similarity

| Transformation Type | Category | Subcategory | Code Llama (Java) | GPT-3.5 (Java) | DeepSeek (Java) | MagicCoder (Java) | Code Llama (Python) | GPT-3.5 (Python) | DeepSeek (Python) | MagicCoder (Python) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Robustness to SP Transform. | Control | Convert For/While | 0.93 | 0.95 | 0.95 | 0.94 | 0.90 | 0.92 | 0.92 | 0.93 |
| | Control | Flip IfElse | 0.66 | 0.95 | 0.94 | 0.93 | 0.94 | 0.95 | 0.95 | 0.95 |
| | Data | Rename variable | 0.82 | 0.93 | 0.88 | 0.87 | 0.84 | 0.81 | 0.85 | 0.85 |
| | Data | Reorder parameters | 0.63 | 0.95 | 0.93 | 0.94 | 0.94 | 0.90 | 0.95 | 0.93 |
| Sensitivity to SNP Transform. | Control | Negate relational operator | 0.07 | 0.05 | 0.06 | 0.06 | 0.05 | 0.07 | 0.05 | 0.05 |
| | Control | Remove conditional statement | 0.11 | 0.09 | 0.12 | 0.11 | 0.11 | 0.09 | 0.11 | 0.11 |
| | Data | Replace arithmetic operator | 0.40 | 0.07 | 0.08 | 0.09 | 0.07 | 0.12 | 0.09 | 0.08 |
| | Data | Remove def statement | 0.10 | 0.07 | 0.09 | 0.11 | 0.08 | 0.12 | 0.09 | 0.10 |
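How such semantic-similarity scores can be computed: a common choice is the cosine similarity between sentence embeddings of the summary generated for the original code and the summary generated for the transformed code. The sketch below assumes `sentence-transformers` with the `all-MiniLM-L6-v2` model; the paper's exact metric and embedding model may differ:

```python
# Minimal sketch of embedding-based semantic similarity between two
# generated summaries (assumed metric; not necessarily the one used here).
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(summary_a: str, summary_b: str) -> float:
    emb = encoder.encode([summary_a, summary_b])
    return float(util.cos_sim(emb[0], emb[1]))

# Paraphrases score high, which is why the SP rows above sit near 1.0
# when a model's summaries are stable under transformation.
print(semantic_similarity(
    "Counts the positive numbers in a list.",
    "Returns how many list elements are greater than zero.",
))  # high for paraphrases
```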
About lexical similarity

| Transformation Type | Category | Subcategory | Code Llama (Java) | GPT-3.5 (Java) | DeepSeek (Java) | MagicCoder (Java) | Code Llama (Python) | GPT-3.5 (Python) | DeepSeek (Python) | MagicCoder (Python) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Robustness to SP Transform. | Control | Convert For/While | 0.87 | 0.83 | 0.89 | 0.87 | 0.84 | 0.76 | 0.97 | 0.84 |
| | Control | Flip IfElse | 0.71 | 0.86 | 0.86 | 0.87 | 0.90 | 0.83 | 0.90 | 0.90 |
| | Data | Rename variable | 0.78 | 0.77 | 0.80 | 0.79 | 0.77 | 0.62 | 0.78 | 0.76 |
| | Data | Reorder parameters | 0.70 | 0.83 | 0.85 | 0.86 | 0.87 | 0.77 | 0.88 | 0.87 |
| Sensitivity to SNP Transform. | Control | Negate relational operator | 0.11 | 0.16 | 0.13 | 0.13 | 0.11 | 0.18 | 0.11 | 0.11 |
| | Control | Remove conditional statement | 0.17 | 0.27 | 0.20 | 0.20 | 0.21 | 0.29 | 0.20 | 0.21 |
| | Data | Replace arithmetic operator | 0.33 | 0.20 | 0.15 | 0.17 | 0.14 | 0.29 | 0.15 | 0.15 |
| | Data | Remove def statement | 0.16 | 0.24 | 0.17 | 0.20 | 0.16 | 0.29 | 0.17 | 0.19 |
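Lexical similarity, in contrast, compares the surface tokens of the two summaries, which is why the scores above are generally lower than the semantic ones: paraphrases share meaning but not wording. A minimal sketch, assuming smoothed sentence-level BLEU as the lexical metric (an assumption; the paper may use a different token-overlap measure):

```python
# Minimal sketch of a lexical-similarity score between two summaries,
# using smoothed sentence-level BLEU from NLTK (assumed metric).
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

def lexical_similarity(summary_a: str, summary_b: str) -> float:
    reference = [summary_a.lower().split()]  # BLEU expects a list of references
    hypothesis = summary_b.lower().split()
    return sentence_bleu(reference, hypothesis,
                         smoothing_function=SmoothingFunction().method1)

# Identical wording scores 1.0; paraphrases with little token overlap
# score much lower even when semantically equivalent.
print(lexical_similarity(
    "counts the positive numbers in a list",
    "counts the positive numbers in the given list",
))
```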
- Method Name Prediction
About exact match

| Transformation Type | Category | Subcategory | Code Llama (Java) | GPT-3.5 (Java) | DeepSeek (Java) | MagicCoder (Java) | Code Llama (Python) | GPT-3.5 (Python) | DeepSeek (Python) | MagicCoder (Python) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Robustness to SP Transform. | Control | Convert For/While | 0.81 | 0.71 | 0.72 | 0.85 | 0.42 | 0.53 | 0.60 | 0.53 |
| | Control | Flip IfElse | 0.59 | 0.46 | 0.80 | 0.83 | 0.47 | 0.41 | 0.53 | 0.67 |
| | Data | Rename variable | 0.40 | 0.37 | 0.34 | 0.52 | 0.32 | 0.18 | 0.32 | 0.36 |
| | Data | Reorder parameters | 0.56 | 0.59 | 0.72 | 0.63 | 0.53 | 0.49 | 0.66 | 0.66 |
| Sensitivity to SNP Transform. | Control | Negate relational operator | 0.47 | 0.57 | 0.55 | 0.41 | 0.54 | 0.54 | 0.62 | 0.50 |
| | Control | Remove conditional statement | 0.65 | 0.65 | 0.61 | 0.62 | 0.70 | 0.54 | 0.72 | 0.67 |
| | Data | Replace arithmetic operator | 0.57 | 0.72 | 0.51 | 0.56 | 0.70 | 0.66 | 0.65 | 0.63 |
| | Data | Remove def statement | 0.32 | 0.53 | 0.49 | 0.47 | 0.38 | 0.58 | 0.61 | 0.45 |
About F1-Score

| Transformation Type | Category | Subcategory | Code Llama (Java) | GPT-3.5 (Java) | DeepSeek (Java) | MagicCoder (Java) | Code Llama (Python) | GPT-3.5 (Python) | DeepSeek (Python) | MagicCoder (Python) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Robustness to SP Transform. | Control | Convert For/While | 0.90 | 0.89 | 0.85 | 0.93 | 0.67 | 0.80 | 0.82 | 0.77 |
| | Control | Flip IfElse | 0.78 | 0.72 | 0.91 | 0.94 | 0.70 | 0.76 | 0.80 | 0.82 |
| | Data | Rename variable | 0.61 | 0.72 | 0.62 | 0.74 | 0.55 | 0.44 | 0.62 | 0.63 |
| | Data | Reorder parameters | 0.71 | 0.83 | 0.88 | 0.79 | 0.67 | 0.77 | 0.81 | 0.82 |
| Sensitivity to SNP Transform. | Control | Negate relational operator | 0.25 | 0.26 | 0.30 | 0.20 | 0.32 | 0.23 | 0.31 | 0.24 |
| | Control | Remove conditional statement | 0.41 | 0.35 | 0.42 | 0.44 | 0.47 | 0.31 | 0.44 | 0.42 |
| | Data | Replace arithmetic operator | 0.36 | 0.39 | 0.34 | 0.36 | 0.50 | 0.38 | 0.38 | 0.37 |
| | Data | Remove def statement | 0.17 | 0.26 | 0.25 | 0.24 | 0.25 | 0.26 | 0.30 | 0.25 |
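For reference, here is how the two method-name metrics can be computed. Exact match checks whether the predicted name is identical to the reference, while F1 is usually taken over the name's subtokens. The sketch below assumes camelCase/snake_case subtokenization and set-based overlap, which are common conventions for this task rather than details confirmed by the paper:

```python
# Minimal sketch of exact match and subtoken-level F1 for method name
# prediction (assumed formulation; the exact computation may differ).
import re

def subtokens(name: str) -> list:
    """Split a camelCase or snake_case identifier into lowercase subtokens."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name)
    return [t.lower() for t in re.split(r"[_\s]+", spaced) if t]

def exact_match(pred: str, ref: str) -> float:
    return float(subtokens(pred) == subtokens(ref))

def f1(pred: str, ref: str) -> float:
    p, r = set(subtokens(pred)), set(subtokens(ref))
    common = len(p & r)
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(r)
    return 2 * precision * recall / (precision + recall)

print(exact_match("getUserName", "get_user_name"))  # 1.0 (same subtokens)
print(f1("getName", "getUserName"))                 # 0.8 (partial credit)
```

This is also why the F1 numbers above are consistently higher than exact match: partially correct names still earn credit.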
- Output Prediction

| Transformation Type | Category | Subcategory | Code Llama (Java) | GPT-3.5 (Java) | DeepSeek (Java) | MagicCoder (Java) | Code Llama (Python) | GPT-3.5 (Python) | DeepSeek (Python) | MagicCoder (Python) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Robustness to SP Transform. | Control | Convert For/While | 0.75 | 0.69 | 0.78 | 0.80 | 0.69 | 0.69 | 0.57 | 0.63 |
| | Control | Flip IfElse | 0.70 | 0.66 | 0.68 | 0.68 | 0.70 | 0.70 | 0.52 | 0.67 |
| | Data | Rename variable | 0.54 | 0.59 | 0.62 | 0.60 | 0.62 | 0.47 | 0.51 | 0.49 |
| | Data | Reorder parameters | - | - | - | - | - | - | - | - |
| Sensitivity to SNP Transform. | Control | Negate relational operator | 0.35 | 0.46 | 0.38 | 0.41 | 0.33 | 0.44 | 0.50 | 0.46 |
| | Control | Remove conditional statement | 0.43 | 0.49 | 0.52 | 0.57 | 0.48 | 0.52 | 0.62 | 0.61 |
| | Data | Replace arithmetic operator | 0.44 | 0.51 | 0.51 | 0.49 | 0.37 | 0.47 | 0.62 | 0.59 |
| | Data | Remove def statement | 0.36 | 0.43 | 0.50 | 0.41 | 0.31 | 0.44 | 0.54 | 0.55 |
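For output prediction, the natural reading of these scores is the fraction of variants on which the model's predicted output stays consistent with its prediction on the original code (robustness, SP rows) or changes (sensitivity, SNP rows). A minimal sketch of that aggregation, under our assumption of the formula:

```python
# Assumed aggregation of output-prediction results (illustrative;
# the paper's exact scoring formula is not shown here).

def robustness(pairs):
    """pairs: (output predicted for original, output predicted for SP variant)."""
    return sum(a == b for a, b in pairs) / len(pairs)

def sensitivity(pairs):
    """pairs: (output predicted for original, output predicted for SNP variant)."""
    return sum(a != b for a, b in pairs) / len(pairs)

preds = [("42", "42"), ("[1, 2]", "[1, 2]"), ("True", "False")]
print(robustness(preds))   # ~0.67: consistent on 2 of 3 variants
print(sensitivity(preds))  # ~0.33 (if these were SNP variants)
```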
The detailed responses of the code LLMs for the original and transformed programs of the HumanEval benchmark can be found here.
Experimental results for the MBPP benchmark
- Code summarization
About semantic similarity

| Transformation Type | Category | Subcategory | Code Llama (Python) | GPT-3.5 (Python) | DeepSeek (Python) | MagicCoder (Python) |
| --- | --- | --- | --- | --- | --- | --- |
| Robustness to SP Transform. | Control | Convert For/While | 0.90 | 0.93 | 0.93 | 0.91 |
| | Control | Flip IfElse | 0.93 | 0.94 | 0.95 | 0.93 |
| | Data | Rename variable | 0.79 | 0.81 | 0.80 | 0.81 |
| | Data | Reorder parameters | 0.95 | 0.91 | 0.94 | 0.95 |
| Sensitivity to SNP Transform. | Control | Negate relational operator | 0.06 | 0.08 | 0.05 | 0.05 |
| | Control | Remove conditional statement | 0.09 | 0.12 | 0.09 | 0.11 |
| | Data | Replace arithmetic operator | 0.07 | 0.12 | 0.08 | 0.09 |
| | Data | Remove def statement | 0.08 | 0.13 | 0.10 | 0.10 |
About lexical similarity

| Transformation Type | Category | Subcategory | Code Llama (Python) | GPT-3.5 (Python) | DeepSeek (Python) | MagicCoder (Python) |
| --- | --- | --- | --- | --- | --- | --- |
| Robustness to SP Transform. | Control | Convert For/While | 0.85 | 0.82 | 0.87 | 0.83 |
| | Control | Flip IfElse | 0.89 | 0.84 | 0.88 | 0.87 |
| | Data | Rename variable | 0.73 | 0.63 | 0.71 | 0.71 |
| | Data | Reorder parameters | 0.89 | 0.78 | 0.88 | 0.87 |
| Sensitivity to SNP Transform. | Control | Negate relational operator | 0.11 | 0.18 | 0.11 | 0.12 |
| | Control | Remove conditional statement | 0.17 | 0.28 | 0.19 | 0.21 |
| | Data | Replace arithmetic operator | 0.13 | 0.24 | 0.14 | 0.16 |
| | Data | Remove def statement | 0.14 | 0.31 | 0.19 | 0.20 |
- Method Name Prediction
About exact match

| Transformation Type | Category | Subcategory | Code Llama (Python) | GPT-3.5 (Python) | DeepSeek (Python) | MagicCoder (Python) |
| --- | --- | --- | --- | --- | --- | --- |
| Robustness to SP Transform. | Control | Convert For/While | 0.52 | 0.64 | 0.62 | 0.78 |
| | Control | Flip IfElse | 0.50 | 0.58 | 0.65 | 0.67 |
| | Data | Rename variable | 0.20 | 0.16 | 0.16 | 0.21 |
| | Data | Reorder parameters | 0.61 | 0.61 | 0.71 | 0.72 |
| Sensitivity to SNP Transform. | Control | Negate relational operator | 0.53 | 0.57 | 0.47 | 0.53 |
| | Control | Remove conditional statement | 0.64 | 0.68 | 0.62 | 0.62 |
| | Data | Replace arithmetic operator | 0.61 | 0.68 | 0.63 | 0.66 |
| | Data | Remove def statement | 0.50 | 0.58 | 0.64 | 0.55 |
About F1-Score

| Transformation Type | Category | Subcategory | Code Llama (Python) | GPT-3.5 (Python) | DeepSeek (Python) | MagicCoder (Python) |
| --- | --- | --- | --- | --- | --- | --- |
| Robustness to SP Transform. | Control | Convert For/While | 0.70 | 0.85 | 0.83 | 0.87 |
| | Control | Flip IfElse | 0.68 | 0.79 | 0.81 | 0.79 |
| | Data | Rename variable | 0.36 | 0.42 | 0.38 | 0.36 |
| | Data | Reorder parameters | 0.76 | 0.84 | 0.85 | 0.86 |
| Sensitivity to SNP Transform. | Control | Negate relational operator | 0.32 | 0.27 | 0.26 | 0.28 |
| | Control | Remove conditional statement | 0.43 | 0.39 | 0.38 | 0.42 |
| | Data | Replace arithmetic operator | 0.42 | 0.39 | 0.38 | 0.43 |
| | Data | Remove def statement | 0.32 | 0.33 | 0.35 | 0.34 |
- Output Prediction

| Transformation Type | Category | Subcategory | Code Llama (Python) | GPT-3.5 (Python) | DeepSeek (Python) | MagicCoder (Python) |
| --- | --- | --- | --- | --- | --- | --- |
| Robustness to SP Transform. | Control | Convert For/While | 0.67 | 0.82 | 0.53 | 0.47 |
| | Control | Flip IfElse | 0.63 | 0.77 | 0.62 | 0.49 |
| | Data | Rename variable | 0.46 | 0.63 | 0.40 | 0.31 |
| | Data | Reorder parameters | 0.55 | 0.65 | 0.44 | 0.37 |
| Sensitivity to SNP Transform. | Control | Negate relational operator | 0.42 | 0.37 | 0.54 | 0.60 |
| | Control | Remove conditional statement | 0.52 | 0.38 | 0.60 | 0.69 |
| | Data | Replace arithmetic operator | 0.57 | 0.56 | 0.74 | 0.78 |
| | Data | Remove def statement | 0.43 | 0.38 | 0.66 | 0.68 |
The detailed responses of the code LLMs for the original and transformed programs of the MBPP benchmark can be found here.