Custom Fine-Tuned Models in Cursor
A glimpse into how Cursor uses fine-tuned models in Tab and Apply
One thing that makes the current “agent” explosion exciting is the creativity involved in chaining together small, fine-tuned LLMs to carry out a specific task better than simply using the latest state-of-the-art model. Not only can this strategy get the job done better, it’s usually also faster and cheaper, which means the task can be run at a much larger scale.
Listening to the Lex Fridman interview with the Cursor team, I appreciated that the Cursor founders dropped a few interesting tidbits about how they are using custom-tuned LLMs in conjunction with foundation models like Anthropic’s Claude 3.5 Sonnet and OpenAI’s o1. Most companies in the application layer closely guard those secrets (they are one of the few moats in the space), so getting some insight into how an LLM-native company like Cursor is thinking about this is a rare treat. Let’s see what they had to say.
Two Cursor Features
If you haven’t heard of Cursor, it’s an AI-assisted code editor that integrates LLMs natively into the programming workflow. It has two features that work super well and make the programmer’s life much more enjoyable. These are:
Tab: Autocomplete on steroids. Autocomplete when writing emails is now something we take for granted, but here it just works so, so well. Under the hood, the Cursor team employs a couple of tricks: they cache tokens, feed in the most relevant files from the repo, and rank lines by how relevant they are (a rough sketch of this kind of ranking follows these two features). The end result: wherever your cursor is, Cursor will try to predict what goes on that line or the line below, and hitting Tab will accept it.
Composer/Apply: You can type in the chat what changes you want Composer to make to your code, and it will make suggestions and show them as GitHub-style diffs, even across multiple files in your repo. Apply is the act of accepting Composer’s suggestions into your code with a single button, removing whatever Composer decided was unnecessary or stale. You can do this chunk by chunk, or all at once.
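To make the caching-and-ranking idea behind Tab concrete, here is a minimal sketch of how a context builder might score repository snippets against the code around the cursor and pack the best ones into a prompt budget. Everything in it (the overlap-based scoring, the word-count stand-in for tokens, the budget) is my own assumption for illustration, not Cursor’s actual pipeline.

```python
# Hypothetical sketch: rank repo snippets by overlap with the code around the
# cursor, then greedily pack the highest-scoring ones into a fixed prompt budget.
import re
from dataclasses import dataclass


@dataclass
class Snippet:
    path: str
    text: str


def words(text: str) -> set[str]:
    """Crude word-level 'tokenizer', used only for overlap scoring."""
    return set(re.findall(r"\w+", text.lower()))


def relevance(snippet: Snippet, cursor_context: str) -> float:
    """Jaccard overlap between a snippet and the code near the cursor."""
    a, b = words(snippet.text), words(cursor_context)
    return len(a & b) / len(a | b) if a | b else 0.0


def build_prompt(snippets: list[Snippet], cursor_context: str, budget: int = 2000) -> str:
    """Highest-scoring snippets first, until the (word-count) budget runs out."""
    ranked = sorted(snippets, key=lambda s: relevance(s, cursor_context), reverse=True)
    parts, used = [], 0
    for s in ranked:
        cost = len(s.text.split())  # stand-in for a real token count
        if used + cost <= budget:
            parts.append(f"# {s.path}\n{s.text}")
            used += cost
    parts.append(cursor_context)  # the code around the cursor goes last
    return "\n\n".join(parts)


if __name__ == "__main__":
    repo = [
        Snippet("utils.py", "def parse_config(path): ..."),
        Snippet("models.py", "class User: ..."),
    ]
    print(build_prompt(repo, "config = parse_config("))
```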
Lex Interview
At one point in the Lex Fridman interview with the four Cursor founders, Lex asks the team to give some insight into “some of the ML stuff that makes it all work.” Aman Sanger, one of the founders, gives a roughly three-minute answer walking through how they use custom models (15:46 to 18:40 in the video above):
Cursor really works via this ensemble of custom models that we’ve trained alongside the frontier models that are fantastic at the reasoning intense things.
It seems like the terminology is still settling, but personally I like “ensemble of custom models.” People also talk about an “orchestration,” which I also like, since an orchestra is composed of different instruments all playing together. Anthropic might use the term “workflow” to describe this, while others would consider it an agentic workflow.
Tab
To illustrate how this ensemble works in practice, Aman first describes Tab in greater detail:
And so Cursor Tab for example, is a great example of where you can specialize this model to be even better than even frontier models if you look at evals on the task we set it at.
An important part of this process is having reliable benchmarks (evals) and tests to prove that your new thing is actually better than the old thing. Sounds easy, but of course creating an evaluation dataset of problems that are representative of your users’ problems is its own challenge.
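To make that concrete, an eval for a task like Tab can be as simple as a set of held-out completion problems, a pass criterion, and a score per model, something like the toy harness below. The model interface and the exact-match criterion are my assumptions for illustration; Cursor hasn’t said how its evals actually work.

```python
# Hypothetical sketch of a tiny eval harness: run several models over the same
# held-out examples and report how often each produces the expected completion.
from typing import Callable

# A "model" here is just any function from prompt -> completion.
Model = Callable[[str], str]


def passes(completion: str, expected: str) -> bool:
    """Toy pass criterion: exact match after stripping whitespace.
    Real code evals usually run tests or compare ASTs instead."""
    return completion.strip() == expected.strip()


def evaluate(model: Model, examples: list[tuple[str, str]]) -> float:
    """Fraction of examples the model gets right."""
    hits = sum(passes(model(prompt), expected) for prompt, expected in examples)
    return hits / len(examples)


if __name__ == "__main__":
    examples = [
        ("def add(a, b):\n    return ", "a + b"),
        ("for i in range(", "10):"),
    ]
    # Stand-ins for a fine-tuned model and a frontier model.
    fine_tuned: Model = lambda p: "a + b" if "add" in p else "10):"
    frontier: Model = lambda p: "a + b"
    print("fine-tuned:", evaluate(fine_tuned, examples))
    print("frontier:  ", evaluate(frontier, examples))
```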
I am surprised their model beats foundation models at writing code. But I think there are other things behind this decision too, like caching and speculative decoding, which they discuss in the last part of the interview and which I still don’t understand too well.
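For anyone else fuzzy on speculative decoding, the core idea is that a small, fast draft model proposes a few tokens and the large model verifies them all at once, keeping the longest prefix it agrees with, so most tokens only cost a cheap forward pass. Below is a toy sketch of the greedy variant with stand-in models; it illustrates the general technique, not anything specific to Cursor’s setup.

```python
# Toy greedy speculative decoding: a cheap draft model proposes k tokens, the
# expensive target model verifies them, and we keep the prefix they agree on.
from typing import Callable

# A "model" here is any function from a token sequence to its greedy next token.
NextToken = Callable[[list[str]], str]


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: list[str], k: int = 4, max_new: int = 8) -> list[str]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # 1. The draft model proposes k tokens cheaply.
        proposed = []
        for _ in range(k):
            proposed.append(draft(tokens + proposed))
        # 2. The target model checks each proposed position (in practice this is
        #    a single batched forward pass) and we keep the agreed-on prefix.
        accepted = []
        for tok in proposed:
            if target(tokens + accepted) == tok:
                accepted.append(tok)
            else:
                # 3. On the first disagreement, take the target model's token.
                accepted.append(target(tokens + accepted))
                break
        tokens.extend(accepted)
    return tokens


if __name__ == "__main__":
    sentence = "the quick brown fox jumps over the lazy dog".split()

    def target(prefix: list[str]) -> str:   # "large" model: always right
        return sentence[len(prefix)] if len(prefix) < len(sentence) else "<eos>"

    def draft(prefix: list[str]) -> str:    # "small" model: right most of the time
        return target(prefix) if len(prefix) % 5 else "um"

    out = speculative_decode(target, draft, ["the"])
    print(" ".join(t for t in out if t != "<eos>"))
```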
As for what data they used to fine-tune this autocomplete model, we can only speculate. Given that they have a unique prompt, and that fine-tuning involves providing both the input and the output, perhaps part of the benefit comes from the unique way they format that prompt. I really wish he had gone a bit deeper here.
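Purely as speculation, a training example for such an autocomplete model might pair the specially formatted prompt (cached repo context plus the code around the cursor) with the completion the user actually accepted, something shaped like the record below. The field names and tags are invented for illustration.

```python
# Hypothetical shape of one fine-tuning example for an autocomplete model: the
# input is the formatted prompt (repo context + code around the cursor) and the
# target is the completion the user ended up accepting. One JSON object per
# line (JSONL) is the usual format for fine-tuning datasets.
import json

example = {
    "prompt": (
        "<repo_context>\n# utils.py\ndef parse_config(path): ...\n</repo_context>\n"
        "<cursor>\nconfig = parse_config(\n</cursor>"
    ),
    "completion": '"settings.yaml")',
}

print(json.dumps(example))
```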
Apply: diffs
Then Aman gets into Apply, and specifically, the part of Apply that is responsible for creating diffs:
The other domain, which it’s surprising that it requires custom models but it’s necessary and works quite well, is in Apply. …the frontier models are quite good at sketching out plans for code and generating rough sketches of the change, but actually, creating diffs is quite hard for frontier models, for your training models. You try to do this with Sonnet, with o1, any frontier model and it really messes up stupid things like counting line numbers, especially in super, super large files. And so what we’ve done to alleviate this is we let the model sketch out this rough code block that indicates what the change will be and we train a model to then Apply that change to the file.
I love this part because I’ve noticed that Sonnet and other LLMs are really bad at keeping track of line numbers and things, and I assumed it was just one of those things LLMs weren’t that good at.
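Reading between the lines, the division of labor might look something like the sketch below: the frontier model returns a loose code block, and the fine-tuned apply model is asked to rewrite the whole file with that change folded in, so nobody has to produce exact, line-numbered diffs. The prompt format here is entirely my guess; Cursor hasn’t published the actual interface.

```python
# Hypothetical prompt for a fine-tuned "apply" model: hand it the full file plus
# the frontier model's rough sketch and ask for the complete rewritten file,
# rather than asking any model to emit a precise line-numbered diff.
def build_apply_prompt(original_file: str, rough_sketch: str) -> str:
    return (
        "<file>\n" + original_file + "\n</file>\n"
        "<change>\n" + rough_sketch + "\n</change>\n"
        "Rewrite the file with the change applied. Output the full file."
    )


if __name__ == "__main__":
    original = "def greet(name):\n    print('hi', name)"
    sketch = "def greet(name, excited=False):\n    # shout the greeting if excited"
    print(build_apply_prompt(original, sketch))
    # The apply model would return the complete updated file; the editor can
    # then diff that output against the original to render GitHub-style chunks.
```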
Sualeh Asif (another founder) jumps in to add: “contrary to popular perception, [the diff matching in Apply] is not a deterministic algorithm.” Aman continues:
Yeah, I think you see shallow copies of Apply elsewhere and it just breaks most of the time because you think you can try to do some deterministic matching and then it fails at least 40% of the time and that just results in a terrible product experience.
I would not have expected this to be the job of a fine-tuned LLM, given that Apply happens so fast and works so consistently for me when I use Cursor.
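That failure rate makes more sense once you try the deterministic route yourself. A naive approach is to anchor on lines of the sketch that also appear verbatim in the file and splice the sketch in between, roughly as below; it falls apart as soon as the model paraphrases an anchor line, reorders things, or the anchor appears more than once. This is my own strawman reconstruction, not what any particular “shallow copy of Apply” actually does.

```python
# Strawman deterministic "apply": anchor on the sketch's first and last lines
# (hoping they are unchanged context) and splice the sketch in between. It
# breaks the moment the model paraphrases an anchor or the anchor repeats.
def naive_apply(file_text: str, sketch: str) -> str:
    file_lines = file_text.splitlines()
    sketch_lines = sketch.splitlines()
    first, last = sketch_lines[0], sketch_lines[-1]
    try:
        start = file_lines.index(first)       # first occurrence only
        end = file_lines.index(last, start)
    except ValueError:
        raise RuntimeError("anchor not found verbatim: deterministic match failed")
    return "\n".join(file_lines[:start] + sketch_lines + file_lines[end + 1:])


if __name__ == "__main__":
    original = "def greet(name):\n    print('hi', name)\n    return name"
    # Works when the sketch quotes its surrounding lines exactly...
    sketch = "def greet(name):\n    print('HI THERE', name)\n    return name"
    print(naive_apply(original, sketch))
    # ...but fails if the model writes, say, "def greet(name: str):" as context,
    # exactly the kind of drift a fine-tuned apply model can tolerate.
```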
Apply: Writing Code
The final part of Apply is actually writing the code for Composer. Instead of just asking the smartest model to write it, Aman explains the decision to break the task into two steps: ask a big model to make the high-level decisions and a small model to fill in the details:
So one other thing that Apply lets you do is it lets you use fewer tokens with the most intelligent models. This is both expensive in terms of latency for generating all these tokens and cost. So you can give this very, very rough sketch and then have your model models go and implement it because it’s a much easier task to implement this very, very sketched out code. And I think that this regime will continue where you can use smarter and smarter models to do the planning and then maybe the implementation details can be handled by the less intelligent ones. Perhaps you’ll have maybe o1, maybe it’ll be even more capable models given an even higher level plan that is recursively applied by Sonnet and then the Apply model.
The implication for me is that they are using the frontier models for the “hardest” task (the high-level understanding of what approach to take when coding), which makes sense given that big models pick up things from their training data that smaller models cannot. And these models are not something Cursor is going to build or even fine-tune. So where they must, they use the frontier models, but for everything else, outsourcing the work to a custom smaller model seems to be their strategy.
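Putting the pieces together, the plan-then-apply chain could be as simple as the pipeline sketched below: one call to a frontier model for a short high-level plan, then one call to the cheaper apply model per touched file, so most of the generated tokens come from the small model. The two model functions are placeholders I made up; the point is only where the expensive tokens get spent.

```python
# Hypothetical plan-then-apply pipeline: the expensive frontier model writes one
# short high-level sketch, and a cheaper fine-tuned model expands it into
# concrete edits for each file.
from typing import Callable

FrontierModel = Callable[[str], str]    # request -> rough plan / code sketch
ApplyModel = Callable[[str, str], str]  # (file contents, plan) -> updated file


def compose_change(request: str, files: dict[str, str],
                   frontier: FrontierModel, apply: ApplyModel) -> dict[str, str]:
    """One frontier call for the plan, one cheap apply call per touched file."""
    plan = frontier(request + "\n\nFiles: " + ", ".join(files))
    return {path: apply(contents, plan) for path, contents in files.items()}


if __name__ == "__main__":
    files = {"greet.py": "def greet(name):\n    print('hi', name)"}
    # Stand-ins for the real models.
    frontier: FrontierModel = lambda req: "# plan: add an `excited` flag to greet()"
    apply: ApplyModel = lambda src, plan: src.replace(
        "def greet(name):", "def greet(name, excited=False):")
    for path, new_src in compose_change("Make the greeting optionally excited",
                                        files, frontier, apply).items():
        print(f"--- {path} ---\n{new_src}")
```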
Wrapping Up
It’s clear that the Cursor team has quite a hot product on their hands: it’s one of the best LLM-native tools I have used, and that is no accident. Their exploration of custom models and custom workflows shows that they are employing a mixture of strategies, being very careful about exactly where they need the raw power of the foundation models and where they can train their own models to do better. The way they use custom models in Tab and in Apply’s diff handling went against some of my initial expectations of where such models would fit.
I was also super impressed and motivated by what the Cursor team is doing. It’s clear they have a very good command of the technical aspects of this complex, rapidly changing field, but they also show good product sense. I would bet that since DeepSeek’s R1 paper came out, they’ve been working on distilling reasoning into smaller models as well. I’m excited to see what else the Cursor team comes up with.
Of course, I wish there were more details. I'm particularly interested in learning more because I'd like to fine-tune my own LLMs for workflows, both for my day job and a personal language learning project with Alex. So if any of you have any further insight into how Cursor or other companies are chaining together these fine-tuned models, I’d love to hear about it.