Large Codebase Search Benchmarks for Cursor, Windsurf, Claude Code, Copilot, Codex, and Jolt

We benchmarked the code search speed and quality of today’s top AI coding tools on open-source codebases, ranging from 780,000 to 4.7 million lines of code.

Code search sits at the core of every AI coding tool. We cannot stuff entire codebases into an LLM’s context window. Today’s largest 1M-token context windows only fit ~50K lines of code [1], and codebases typically exceed that within months. Even if LLMs had infinite context, their speed and recall accuracy degrade significantly as input length increases [2]. A thorough and precise search is mandatory for fast, accurate LLM responses.
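
As a rough sanity check on that figure, a back-of-envelope calculation lands in the same ballpark. The tokens-per-line value below is an assumption for illustration, not a measured constant:

```python
# Back-of-envelope check of the "~50K lines per 1M tokens" figure.
CONTEXT_WINDOW_TOKENS = 1_000_000   # largest widely available window today
TOKENS_PER_LINE_OF_CODE = 20        # rough assumption; varies by language and style

max_lines = CONTEXT_WINDOW_TOKENS // TOKENS_PER_LINE_OF_CODE
print(f"A 1M-token window holds roughly {max_lines:,} lines of code")  # ~50,000
```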

To measure how coding tools stack up, we benchmarked Cursor, Windsurf, GitHub Copilot, GitHub Copilot Agent, Claude Code, OpenAI Codex, and Jolt. Using six closed pull requests from the Django, Grafana, and Kubernetes codebases, we measured search speed, thoroughness, and accuracy. These are large, complex, and well-known codebases, and their PRs represent real-world programming tasks. Each PR was run four times across all seven tools, resulting in a total of 168 runs.

This benchmark isolates the search layer and ignores code-generation quality. Each tool ultimately calls the same foundation models (Anthropic, Google, or OpenAI), so code generation is largely similar. Real differentiation comes from the user experience and the search. We focused on search here and scored it on three criteria:

  1. Speed: turnaround time to return code context. A faster search means faster responses and tighter iteration loops for both humans and agents.
  2. Thoroughness: does the tool surface every important file?
  3. Accuracy: does it avoid pulling in irrelevant files?

Together, these three metrics capture what makes a code search effective, and an effective search is what ultimately delivers the highest-quality answers.

The most popular code search techniques are text search, vector search, and agentic search. Recently, many AI coding tools have either demoted or abandoned vector search in favor of agentic search. Here’s a snapshot of each method:

User-specified context: An early and lazy approach where users manually specify the relevant files and folders.
Pros:
- Gives the user full control over context
Cons:
- Places an undue burden on users
- Tedious for large codebases, and impossible for unfamiliar ones

Text: A strict or fuzzy text-matching search. It’s fast, but rarely useful on its own in real-world scenarios. Occasionally it’s paired with AST parsing or dependency graphs to pull in interdependent files.
Pros:
- Sub-second latency, negligible cost
Cons:
- No semantic understanding; you must already know symbol names
- Misses indirect references and renamed identifiers

Vector: The default for many RAG workflows. Codebase files (or their chunks) are embedded and stored in a vector database, and a query then retrieves the vectors most similar to the user’s prompt.
Pros:
- Very fast look-ups, < 3 s after indexing
- Cheap at query time
Cons:
- Accuracy degrades quickly on codebases over 50K lines
- Cannot “reason” about planned changes
- Requires separate ingestion and a vector store
- Requires indexing, which can take minutes to hours depending on codebase size and ingest engine

Agentic: An LLM-driven loop that uses ls, find, grep, rg (ripgrep), or similar commands, refining its target set and keywords across iterations (sketched below). It’s more accurate than vector search, but often turns into a game of guess-and-check as the agent exhausts search-term ideas.
Pros:
- Accurate on 10M+ line codebases
- No indexing required
Cons:
- Search time balloons if initial guesses miss; 2-5 min was common during testing
- Expensive due to multiple model calls and high token usage; > $0.80 per search was common during testing
- Requires a terminal to operate, either locally or in a cloud sandbox

Hybrid semantic (Jolt HyperContext): A one-shot pipeline that combines structural analysis (AST + import graphs), text search, LLM semantic understanding, and LLM scoring.
Pros:
- Accurate on 10M+ line codebases
- Fast, 7-15 seconds
- No indexing required
- Cheaper than agentic search
- Scales horizontally, e.g., a 2M line search takes about as long as 10M lines
Cons:
- More expensive than pure text or vector look-ups
- A few seconds slower than text and vector search
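
To make the agentic approach concrete, here is a minimal Python sketch of the guess-and-check loop these tools run. The function names and prompt wording are our own illustration, not any vendor’s implementation; `ask_llm` stands in for whatever foundation-model call the tool makes, and real agents layer on file reading, chunking, and token budgets.

```python
import subprocess
from typing import Callable


def ripgrep_files(pattern: str, repo: str) -> list[str]:
    """Return files in `repo` that contain `pattern` (rg --files-with-matches)."""
    result = subprocess.run(
        ["rg", "--files-with-matches", pattern, repo],
        capture_output=True, text=True,
    )
    return result.stdout.splitlines()


def agentic_search(task: str, repo: str, ask_llm: Callable[[str], str],
                   max_iters: int = 5) -> set[str]:
    """Guess-and-check loop: the LLM proposes search terms, ripgrep tests them,
    and the results feed the next guess. `ask_llm` is any text-in/text-out model call."""
    found: set[str] = set()
    history: list[str] = []
    for _ in range(max_iters):
        history_text = "\n".join(history)
        term = ask_llm(
            f"Task: {task}\n"
            f"Previous searches:\n{history_text}\n"
            "Reply with one new search term likely to locate the relevant code."
        ).strip()
        matches = ripgrep_files(term, repo)
        found.update(matches)
        history.append(f"{term} -> {len(matches)} matching files")
        done = ask_llm(
            f"Files found so far: {sorted(found)}\n"
            "Is this enough context to start the change? Answer yes or no."
        )
        if matches and done.lower().startswith("yes"):
            break
    return found
```

Each iteration costs at least two model calls, which is why missed first guesses translate directly into the multi-minute search times and per-search costs noted above.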

Benchmark methodology and considerations

Codebases and prompts

We ran the benchmark against three open-source repos: Django (780K lines), Grafana (2.8M lines), and Kubernetes (4.7M lines). For each repo, we chose two recently closed pull requests. The PRs had a clear, self-contained feature or bug-fix goal, avoided routine maintenance chores, and crucially, did not explicitly name the files involved. Success depends squarely on each tool’s ability to discover the relevant code through search.

Every pull request was tested with two prompt variants, identical across all tools:

  1. List Files variant: asks the tool to list the files that should be modified for the PR
  2. Codegen variant: asks the tool to generate the code changes for the PR

Each variant was run twice per tool (168 runs in total) to smooth out the inherent randomness of LLMs, giving us a more balanced view of search speed, thoroughness, and accuracy.

Measuring speed

We captured two speed metrics for every run:

  1. Search Time: the interval between submitting the prompt and the end of the search phase. For agentic tools (Cursor, Windsurf, GitHub Copilot, Copilot Agent, Claude Code, Codex), this point occurs when the initial search phase (ls, grep, ripgrep, and similar commands) finishes and the tool begins composing a response. For Jolt, internal telemetry marks when the search completes and generation starts.
  2. Total Response Time: the interval from prompt submission to the final token of the assistant’s reply.

Note: OpenAI Codex starts a sandbox container and clones the repo before searching; that startup time is included in both time metrics.

Search quality scoring

We first graded each tool’s output with the point-based rubric below. For the List Files prompt variant, we scored the list of files the tool planned to change. For the Codegen prompt variant, we scored the actual files the tool modified in its response. In both cases, we checked against the files that were ultimately merged in the PR, which serve as the ground-truth reference.

Primary files are files changed in the PR with the most important code or logic changes. Secondary files also require changes, but are ancillary to the core changes. Documentation, test, and example files are neutral and do not affect the score.

| Type of File | Points |
| --- | --- |
| Primary file correctly listed | +1.0 |
| Secondary file correctly listed | +0.5 |
| Missing primary file | -0.25 |
| Missing secondary file | 0 |
| Unnecessary file | -0.25 |
| Folder path instead of file | -0.5 |
| Hallucinated / non-existent file | -0.5 |

After tallying the raw points for a run, we divide that total by the maximum points possible for that PR and map the resulting percentage to the 0-to-3 scale shown below, making results comparable across PRs of different sizes.

| % of Total Possible Points | Search Score | Interpretation |
| --- | --- | --- |
| 100% | 3 | All required files found; minimal extraneous files |
| 60-99% | 2 | Most required files found; minor misses or noise |
| 40-59% | 1 | Missing critical files |
| < 40% | 0 | Few or no relevant files |

A score of 2 or 3 is a pass. Scores below 2 indicate that the tool’s search failed.
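
For reference, here is a short sketch of how the rubric and the normalization step fit together. The set-based bookkeeping and function names are our own illustration; the point values and thresholds are taken directly from the two tables above.

```python
def raw_points(primary: set[str], secondary: set[str], neutral: set[str],
               returned: set[str], hallucinated: int = 0,
               folder_paths: int = 0) -> float:
    """Apply the point rubric to the files a tool returned for one run."""
    pts = 0.0
    pts += 1.0 * len(primary & returned)       # primary file correctly listed
    pts += 0.5 * len(secondary & returned)     # secondary file correctly listed
    pts -= 0.25 * len(primary - returned)      # missing primary file
    # Missing secondary files score 0; neutral files (docs/tests/examples) are ignored.
    pts -= 0.25 * len(returned - primary - secondary - neutral)  # unnecessary file
    pts -= 0.5 * folder_paths                  # folder path instead of file
    pts -= 0.5 * hallucinated                  # hallucinated / non-existent file
    return pts


def search_score(pts: float, primary: set[str], secondary: set[str]) -> int:
    """Normalize by the PR's maximum possible points and map onto the 0-3 scale."""
    max_pts = 1.0 * len(primary) + 0.5 * len(secondary)
    pct = pts / max_pts
    if pct >= 1.00:
        return 3   # all required files found; minimal extraneous files
    if pct >= 0.60:
        return 2   # most required files found; minor misses or noise
    if pct >= 0.40:
        return 1   # missing critical files
    return 0       # few or no relevant files
```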

Tool settings

The following settings were used for each tool:

| Tool | Tier/Plan | Model | Specific Settings |
| --- | --- | --- | --- |
| Cursor | Pro | Sonnet 4 (Max mode) | Internet use disabled; waited for codebase indexing to finish |
| Windsurf | Pro | Sonnet 4 (BYOK, "bring your own key") | Internet use disabled; waited for codebase indexing to finish |
| GitHub Copilot | Pro | Sonnet 4 | Ask mode for the List Files variant; Edit mode for the Codegen variant; each prompt was run with #codebase |
| GitHub Copilot Agent | Pro | Sonnet 4 | Copilot's "agent" mode |
| Claude Code | BYOK | Sonnet 4 (default model) | Default settings |
| Codex | Pro | codex-1 (default model, based on o3) | Ask mode ("fast mode") for the List Files variant; Code mode for the Codegen variant |

Note: The memory of each tool was cleared between runs to prevent memories from steering or short-circuiting the search process.

Additional considerations and notes

We designed this benchmark to be as objective, transparent, and repeatable as possible. The notes below outline the guardrails we used to keep results fair.

Each run was timed manually when internal telemetry was unavailable. When an agent paused for user confirmation, we replied in under a second and excluded that pause from the timing. To isolate the code search, we stopped the clock once a tool started running unit tests, linting, or self-verification. In a few cases, we had to adjust the prompt to prevent the agent from running tests or scripts.

We marked the search phase as complete when the initial pass of ls, grep, ripgrep, and other search-related commands ended. If the agent later realized it was missing information and launched additional searches, that extra time did not count toward Search Time. Therefore, some reported Search Time values for agentic tools are conservative. These controls aim to give a real-world view of how quickly each tool can surface the code a developer needs.

Results summary

| Tool | Rank | Avg Search Score (0-3) | Avg Search Time (sec) | Avg Total Response Time (sec) | Search Technology |
| --- | --- | --- | --- | --- | --- |
| Jolt | 1 | 🥇 2.38 (pass) | 🥇 15.67 (1x) | 🥇 58.04 (1x) | hybrid semantic (LLM + AST + text) |
| OpenAI Codex | 2 | 🥈 2.21 (pass) | 232.17 (14.8x slower) | 291.92 (5.0x slower) | agentic |
| GitHub Copilot Agent | 3 | 🥉 1.88 (fail) | 103.17 (6.6x slower) | 173.83 (3.0x slower) | agentic + vector |
| Windsurf | 4 | 🥉 1.88 (fail) | 131.04 (8.4x slower) | 197.29 (3.4x slower) | agentic + vector |
| Cursor | 5 | 1.71 (fail) | 🥉 88.88 (5.7x slower) | 🥉 161.88 (2.8x slower) | agentic + vector |
| Claude Code | 6 | 1.50 (fail) | 212.88 (13.6x slower) | 317.75 (5.5x slower) | agentic |
| GitHub Copilot | 7 | 1.21 (fail) | 🥈 49.08 (3.1x slower) | 🥈 101.75 (1.8x slower) | vector, maybe partially agentic |

The detailed results table is at the end of the post.

Jolt's hybrid search was the clear speed leader and surfaced the essential files in most runs. It is the only tool with a passing score and sub-minute average Total Response Time.

OpenAI's Codex nearly matched Jolt on recall, but relied on a fully agentic loop, resulting in average Search Time and Total Response Time that were 14.8x and 5x slower, respectively. Claude Code's speed was similar to Codex, but the search quality was significantly worse, as the tool either missed important files or suggested irrelevant ones.

Several of the agentic tools read files in 100-200 line chunks rather than reading each file in full, which drove up the number of LLM calls and the latency, particularly for Claude Code. Pure vector search (GitHub Copilot's non-agent modes) returned quickly but often missed important files and pulled in irrelevant ones, illustrating the recall and precision limits of vector search.

The hybrid agentic-plus-vector tools, Copilot Agent, Windsurf, and Cursor, produced similar results, likely because they defer the search workflow to the LLM rather than prescribing a process. There is little differentiation between the three, though Copilot Agent had a slight edge.

Code search matters for humans and agents

This benchmark reveals a stark reality: the difference between the fastest and slowest tools is the difference between waiting 1 minute and waiting 5 minutes for a complete response. For a developer submitting 20-30 AI prompts per day, this translates to either 20-30 minutes or 1.5-2.5 hours spent waiting. That's the difference between maintaining a productive flow state and constantly context-switching.

The results also expose a critical trade-off in today's tools. Pure vector search (GitHub Copilot) delivers sub-minute responses but fails to find the right files. Agentic search (Codex, Claude Code) finds more relevant code but takes 3-5 minutes per search. Only Jolt’s hybrid approach delivers both speed and accuracy.

For autonomous agents, these differences compound further. An agent completing a complex PR might perform 50+ searches. With Jolt's 15-second search, that's 12.5 minutes of search time. With Claude Code's 3.5-minute average, it's nearly 3 hours. This directly impacts both infrastructure costs and shipping velocity.

These results demonstrate that Jolt’s hybrid semantic search delivers both the speed and accuracy developers need. Code search has been our core focus. We’ll continue to refine Jolt and deliver the best possible AI coding experience for production codebases. You can try Jolt for free here. Have questions or feedback about this benchmark? Reach us at hello@usejolt.ai.

Full results

Footnotes

  1. https://ai.google.dev/gemini-api/docs/long-context#getting-started-with-long-context

  2. https://arxiv.org/pdf/2502.05167#page=5

Yev Spektor

July 8th, 2025