Code Search Smarts: Which Programming Language and Model Work Best With LLM-as-a-Judge For Code Retrieval?

Lucas Roberts¹ and Denisa Roberts²
¹Independent Researcher  ²New York University
Figure: Search results for the heap data structure show evidence of solving the vocabulary problem: two data structures are returned from the repo, a heap and a priority queue. Each has a different implementation but the same API.
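To make the vocabulary problem in the caption concrete, here is a minimal Python sketch (illustrative only, not code from the studied repositories) of two data structures that share the same push/pop API while differing in implementation. A lexical query for "heap" only matches one of them by name, yet both are relevant results.

import heapq

class MinHeap:
    """Array-backed binary heap using the standard-library heapq helpers."""

    def __init__(self):
        self._items = []

    def push(self, value):
        heapq.heappush(self._items, value)

    def pop(self):
        return heapq.heappop(self._items)


class PriorityQueue:
    """Sorted-list implementation exposing the same push/pop interface."""

    def __init__(self):
        self._items = []

    def push(self, value):
        self._items.append(value)
        self._items.sort()          # keep the smallest element at the front

    def pop(self):
        return self._items.pop(0)


if __name__ == "__main__":
    # Both structures answer the same API calls with the same results.
    for ds in (MinHeap(), PriorityQueue()):
        for v in (5, 1, 3):
            ds.push(v)
        print(type(ds).__name__, ds.pop())   # both print 1 first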


Abstract

Code search is an important information retrieval application. Benefits of better code search include faster new-developer onboarding, reduced software maintenance, and ease of understanding for large repositories. Despite improvements in search algorithms and search benchmarks, the domain of code search has lagged behind. One reason is the high cost of human annotation for code queries and answers. While humans may annotate search results in general text QA systems, code annotations require specialized knowledge of a programming language (PL), as well as domain-specific software engineering knowledge.

In this work we study the use of Large Language Models (LLMs) to retrieve code at the level of functions and to generate annotations for code search results. We compare the impact of the retriever representation (sparse vs. semantic), programming language, and LLM by comparing human annotations across several popular languages (C, Java, JavaScript, Go, and Python). We focus on repositories that implement common data structures likely to be implemented in any PL. For the same human annotations, we compare several LLM-as-a-Judge models to evaluate programming language and other affinities between LLMs.

We find that the chosen retriever and PL exhibit affinities that can be leveraged to improve alignment of human and AI relevance determinations, with significant performance implications. We also find differences in representation (sparse vs. semantic) across PLs that impact alignment of human and AI relevance determinations. We propose using transpilers to bootstrap scalable code search benchmark datasets in other PLs, and in a case study we demonstrate that human-AI relevance agreement rates largely match the (worst case) human-human agreement under study. The application code used in this work is publicly available.
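The abstract contrasts sparse and semantic retriever representations over code functions. The sketch below shows what that comparison looks like in practice; the toy corpus, the rank_bm25 package, and the all-MiniLM-L6-v2 sentence-transformers model are illustrative assumptions, not the exact setup used in the study.

# Sketch of sparse (BM25) vs. semantic (embedding) retrieval over functions.
# Requires: pip install rank-bm25 sentence-transformers
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

functions = [
    "def heap_push(heap, item): ...",
    "def priority_queue_insert(pq, item): ...",
    "def binary_search(arr, target): ...",
]
query = "insert an element into a heap"

# Sparse: score by lexical overlap between query terms and code tokens.
tokenized = [doc.lower().split() for doc in functions]
bm25 = BM25Okapi(tokenized)
sparse_scores = bm25.get_scores(query.lower().split())

# Semantic: score by cosine similarity between dense embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = model.encode(functions, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)
semantic_scores = util.cos_sim(query_emb, doc_emb)[0]

for doc, s, d in zip(functions, sparse_scores, semantic_scores):
    print(f"sparse={s:6.2f}  semantic={float(d):6.2f}  {doc}")

Note how the priority-queue function shares no query terms with "heap", so only the semantic representation has a chance of surfacing it, which is exactly the kind of representation effect the paper measures across PLs.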

Research Questions

Do LLM-as-a-judge models prefer certain programming languages when generating relevance annotations?

To what extent does the representation (sparse vs. semantic) matter when using LLM-as-a-judge for generating relevance annotations?

Most existing benchmarks with relevance annotations are in a single programming language. How can we scale the benchmarks to other PLs?
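Related to the first two questions above, the following is a minimal sketch of how an LLM-as-a-judge relevance annotation can be requested for a query/function pair. The prompt wording, the 0-3 graded scale, and the gpt-4o model name are assumptions made for illustration, not the exact protocol from the paper.

# Illustrative LLM-as-a-judge relevance call (not the paper's exact prompt).
# Requires: pip install openai, with OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

def judge_relevance(query: str, code_snippet: str, model: str = "gpt-4o") -> str:
    prompt = (
        "You are judging code search results.\n"
        f"Query: {query}\n"
        f"Candidate function:\n{code_snippet}\n"
        "Rate the relevance of the candidate to the query on a 0-3 scale "
        "(0 = irrelevant, 3 = exactly what was asked for). Reply with the number only."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# Example usage:
# judge_relevance("insert an element into a heap", "def push(self, value): ...")

Running the same judgment across retrievers, PLs, and judge models, and comparing the outputs against human labels, is the kind of experiment the research questions above describe.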

The findings are summarized on slide 16 of the presentation, and some of the supporting evidence is contained in the slides. The paper contains further experimental details and metrics to support the claims.

SIGIR-AP 2025 Slides

FAQs


You mentioned Broder's taxonomy; what is the transactional query here?

Great question. This is a spot where the taxonomy describes user intent, which is not necessarily specified in the query itself. The action to be taken would be something on the page associated with the code result. This could be a file download, a text edit, or something else.

BibTeX


@inproceedings{10.1145/3673791.3769503,
  author    = {Lucas Roberts and Denisa Roberts},
  title     = {Which Programming Language and Model Work Best With LLM-as-a-Judge For Code Retrieval?},
  year      = {2025},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  url       = {https://doi.org/10.1145/3767695.3769503},
  doi       = {10.1145/3767695.3769503},
  booktitle = {Proceedings of the 2025 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region},
  numpages  = {10},
  keywords  = {code search, large language models, relevance, representation},
  location  = {Xi'an, China},
  series    = {SIGIR-AP 2025}
}