Observation 1: Large-context/smart LLMs (GPT-4, Gemini, Claude) can be expensive.
Ideally, you want to send them as few tokens as possible when crafting your prompt, to reduce cost (and possibly also to make them better at answering).
This means you can't just send entire files (which would be faster/more convenient); instead you have to think about which parts of each file are important to your prompt and which are not, and send only the important parts. This takes time/effort.
Observation 2: Local LLMs (Llama 3 8B via Ollama, etc.) and smaller-but-remote ones (Groq, GPT-3.5) are free or much cheaper.
So, what if we could delegate the task of "filtering" what to send in the final prompt to a small/local LLM?
This would work like this:
Pass 1: Extract everything from the prompt that is not a file, i.e. "the question" / "the task" the user wants done, and ask the small LLM to summarize that task.
Pass 2: Go over each file and, for each one, ask the small LLM which parts are relevant to the question/task and which are not. Filter out the irrelevant parts and keep only the relevant ones.
Finally, generate the prompt from the relevant parts only, resulting in a (possibly much) more compact prompt without losing any important information (rough sketch below).
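To make the idea concrete, here is a minimal sketch of those two passes, assuming an OpenAI-compatible endpoint served by a local runtime (e.g. Ollama on localhost:11434); the model name, prompts, and helper names are placeholders, not part of the proposal:

```python
# Two cheap passes with a small/local model to shrink the final prompt.
# Assumes an OpenAI-compatible server (e.g. Ollama at localhost:11434) and a
# local model called "llama3" -- both are placeholder assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
SMALL_MODEL = "llama3"

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=SMALL_MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def summarize_task(user_request: str) -> str:
    # Pass 1: condense the non-file part of the prompt into a short task summary.
    return ask(f"Summarize the following task in a few sentences:\n\n{user_request}")

def filter_file(task_summary: str, path: str, content: str) -> str:
    # Pass 2: keep only the parts of this file that matter for the task.
    return ask(
        f"Task: {task_summary}\n\n"
        f"From the file {path} below, return verbatim only the sections relevant "
        f"to the task and omit everything else.\n\n{content}"
    )

def build_compact_prompt(user_request: str, files: dict[str, str]) -> str:
    task = summarize_task(user_request)
    kept = [f"### {path}\n{filter_file(task, path, content)}"
            for path, content in files.items()]
    # The compact prompt for the big/expensive model: task + relevant snippets only.
    return f"{user_request}\n\nRelevant context:\n\n" + "\n\n".join(kept)
```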
If this works, it would significantly reduce cost without reducing usefulness/accuracy (at the cost of a bit of time to process the initial passes, and a bit of effort to initially set things up).
Just an idea. Sorry for all the noise, I'm presuming you'd rather people give ideas even if you don't end up implementing them, tell me if I need to calm down.
Cheers.
First, it's important to acknowledge that code embeddings and general RAG + reranking approaches are the common solutions to this problem. However, I believe there's a more powerful idea here that current agentic IDEs struggle with, regardless of their RAG implementation.
I've been considering a similar LLM-driven approach. The key factors are cost and scale: some codebases range from ~80K tokens to well beyond 1M tokens, and sending those tokens with every request becomes expensive over time (this could potentially become a paid service if the market demands it).
While smaller models are cheaper, the question becomes whether they could exceed embedding-based approaches in quality. This is where it gets interesting: we could move beyond simple RAG to prompt-reprompting for agents. Instead of just injecting context, we could (rough sketch after this list):
Create detailed task breakdowns
Direct other agents (like Cursor) on how to complete tasks
Specify exactly where to look for relevant information
Provide simplified context summaries for the codebase
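For illustration, the output of such an orchestration pass could be a small structured "briefing" handed to the downstream agent. This is purely a hypothetical shape; the class and field names are mine, not anything Cursor or another agent actually consumes:

```python
# Hypothetical "reprompt" payload produced by a cheap orchestrating model and
# handed to an agentic IDE. The structure and field names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class AgentBriefing:
    task_breakdown: list[str]    # ordered subtasks for the agent to complete
    files_to_inspect: list[str]  # exactly where to look for relevant information
    codebase_summary: str        # simplified context summary of the codebase
    constraints: list[str] = field(default_factory=list)  # e.g. style/API rules

    def to_prompt(self) -> str:
        steps = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(self.task_breakdown))
        prompt = (
            f"Codebase overview:\n{self.codebase_summary}\n\n"
            f"Plan:\n{steps}\n\n"
            f"Start by reading: {', '.join(self.files_to_inspect)}\n"
        )
        if self.constraints:
            prompt += f"Constraints: {'; '.join(self.constraints)}\n"
        return prompt
```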
Gemini is particularly intriguing as a supporting model, with its 1M+ token context window and some cost efficiencies (context caching).
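As a rough sketch of what that could look like with the google-generativeai Python SDK's caching module (the model name, TTL, and `codebase_dump` variable are placeholder assumptions):

```python
# Cache a large codebase dump once, then ask many follow-up questions against
# the cached context. Model name, TTL, and codebase_dump are placeholders.
import datetime
import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key="YOUR_API_KEY")

codebase_dump = open("repo_snapshot.txt").read()  # hypothetical pre-built dump

cache = caching.CachedContent.create(
    model="models/gemini-1.5-flash-001",
    display_name="repo-context",
    contents=[codebase_dump],
    ttl=datetime.timedelta(minutes=30),
)

model = genai.GenerativeModel.from_cached_content(cached_content=cache)
resp = model.generate_content("Which files are relevant to fixing the login bug?")
print(resp.text)
```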
This feature could significantly augment agents like Cursor, making them more efficient and effective.
However, we also need to be practical ($) about the intended use here. Local models mean cost efficiency but may lack the ability to parse large codebases, while using Gemini this way can still rack up costs quickly.
FWIW, a very smart friend of mine in big-tech AI thinks the solution is better embedding reranking and summarization (rather than this type of service).