Code search is an important information retrieval application. Benefits of better code search include faster onboarding of new developers, reduced software maintenance effort, and easier understanding of large
repositories. Despite improvements in search algorithms and search
benchmarks, the domain of code search has lagged behind. One
reason is the high cost of human annotation for code queries and
answers. While humans may annotate search results in general text
QA systems, code annotations require specialized knowledge of a
programming language (PL), as well as domain-specific software
engineering knowledge.
In this work we study the use of Large
Language Models (LLMs) to retrieve code at the level of functions
and to generate annotations for code search results. We assess the impact of the retriever representation (sparse vs. semantic), programming language, and LLM by comparing human annotations across several popular languages (C, Java, JavaScript, Go, and Python). We focus on repositories that implement common data
structures that are likely to be implemented in any PL. For the same human annotations, we compare several LLM-as-a-Judge models to evaluate programming-language and other affinities between LLMs.
We find that the chosen retriever and PL exhibit affinities that can be leveraged to improve the alignment of human and AI relevance determinations, with significant performance implications. We also find differences in representation (sparse vs. semantic) across PLs that affect this alignment.
We propose using transpilers to bootstrap scalable code search benchmark datasets in other PLs, and in a case study we demonstrate that human-AI relevance agreement rates largely match the (worst-case)
human-human agreement under study. The application code used in this work is available.