Combating concept drift in LLM applications
Your LLM application works beautifully at launch. The model understands user preferences, accuracy metrics look great, and feedback is positive. Then, gradually—or sometimes suddenly—performance starts to degrade. Users complain. Metrics slide. What happened?
The answer is concept drift: the underlying relationship between inputs and desired outputs changes over time, even when the statistical distribution of inputs looks similar. In traditional ML, we retrain models on new data. But for LLM applications built with prompt engineering (what Andrej Karpathy calls “Software 3.0”), retraining isn’t the first option. We need a different approach.
The Challenge: Real Concept Drift in Production
I built Noteworthy Differences, an LLM-powered system that analyzes Wikipedia page revisions and predicts which changes are noteworthy enough to notify editors. Noteworthiness is inherently subjective; what matters to one person may not matter to another. The system learns from user feedback to align with individual preferences.
To study concept drift systematically, I became the user. Over four time periods in production, I deliberately became more selective about what I considered noteworthy. I used the app’s feedback feature to collect labeled examples, randomly split into training (60%) and test sets (40%). The result: a real-world dataset showing severe concept drift.
The left panel below tells the story. At time step 1 (development), I labeled 42% of differences as noteworthy. By time step 4, that dropped to just 22%, a 48% relative decrease in the positive class rate. This is textbook concept drift.
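To make the measurement concrete, here is a minimal sketch of how the per-period positive class rate can be computed; the `period` and `noteworthy` field names are assumptions for illustration, not necessarily the schema of the published dataset:

```python
from collections import defaultdict

def positive_class_rate_by_period(feedback_records):
    """Share of 'noteworthy' labels per time period.

    Assumes each record is a dict with a `period` key (1-4) and a boolean
    `noteworthy` label; the real dataset schema may differ.
    """
    counts = defaultdict(lambda: [0, 0])  # period -> [n_positive, n_total]
    for rec in feedback_records:
        counts[rec["period"]][0] += int(rec["noteworthy"])
        counts[rec["period"]][1] += 1
    return {p: pos / total for p, (pos, total) in sorted(counts.items())}
```

A drop from roughly 0.42 at step 1 to 0.22 at step 4 is exactly the drift shown in the left panel.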
The Architecture: Learning from Hard Examples
The system uses a multi-level architecture:
- Two classifier models (heuristic and few-shot) independently analyze each revision
- An AI judge makes the final decision, with alignment text in its prompt
- Disagreements between classifiers identify hard examples (~9% of cases) that need human review
This architecture serves two purposes: it focuses human attention on truly ambiguous cases, and it provides a confidence score based on the level of agreement among models.
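Here is a minimal sketch of that decision flow, assuming placeholder callables for the two classifiers and the judge (not the project’s actual function names):

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    noteworthy: bool
    confidence: float   # higher when the two classifiers agree
    hard_example: bool  # flagged for human review when they disagree

def judge_revision(diff_text, heuristic_clf, few_shot_clf, ai_judge, alignment_text):
    """Sketch of the multi-level decision flow; all three callables are
    placeholders for the real model calls and are assumed to return booleans."""
    h_vote = heuristic_clf(diff_text)
    f_vote = few_shot_clf(diff_text)
    disagree = h_vote != f_vote  # ~9% of cases in this project

    # The AI judge makes the final call; its prompt carries the alignment text.
    final = ai_judge(diff_text, alignment=alignment_text)

    confidence = 1.0 if not disagree else 0.5  # simple agreement-based score
    return Verdict(noteworthy=final, confidence=confidence, hard_example=disagree)
```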
For easier feedback collection, the app has a “🎲 Special Random” button that classifies random pages until a hard example is found. Try it out!
The Solution: Continuous Realignment
The alignment text is generated by summarizing user feedback and incorporating it into the AI judge’s prompt. Importantly, both the realignment and evaluations are automated by Python scripts available in the GitHub repository.
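Conceptually, the realignment step looks something like the sketch below; `summarize_llm`, the prompt wording, and the `summary` field are illustrative stand-ins for what the repository’s scripts actually do:

```python
def realign(feedback_records, summarize_llm):
    """Distill accumulated feedback into fresh alignment text.

    `summarize_llm` stands in for whatever LLM call the repository's script
    uses; the prompt wording and the `summary` field are illustrative only.
    """
    examples = "\n".join(
        f"- {'NOTEWORTHY' if rec['noteworthy'] else 'not noteworthy'}: {rec['summary']}"
        for rec in feedback_records
    )
    prompt = (
        "Summarize this user's noteworthiness preferences as concise guidelines, "
        "based on the labeled revision summaries below:\n" + examples
    )
    return summarize_llm(prompt)  # the result becomes the alignment text in the judge's prompt
```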
During development, the initial alignment achieved 53% accuracy on the test set, a 16-point improvement over the 37% baseline. (These values come from the development cycle documented in the project’s README; averaging over multiple repetitions gives a somewhat lower accuracy for the initial alignment, as shown in the plot.)
But what happens in production when preferences change? I tested two strategies (a sketch of the comparison loop follows the list):
- No realignment: Use the development alignment throughout production (red line in right panel)
- Continuous realignment: Update the alignment at every time step with accumulated feedback (blue line)
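A rough sketch of how the two strategies can be compared, reusing the `realign` sketch above; `ai_judge` and the per-period data layout are assumptions for illustration, not the repository’s actual interfaces:

```python
def evaluate(alignment_text, test_set, ai_judge):
    """Accuracy of the AI judge with a given alignment on one time step's test set."""
    correct = sum(
        ai_judge(rec["diff"], alignment=alignment_text) == rec["noteworthy"]
        for rec in test_set
    )
    return correct / len(test_set)

def run_experiment(periods, dev_alignment, ai_judge, summarize_llm):
    """Compare 'no realignment' against 'continuous realignment' over time steps."""
    static_acc, continuous_acc, seen_feedback = [], [], []
    for period in periods:  # each period: {"train": [...], "test": [...]}
        seen_feedback.extend(period["train"])
        alignment = realign(seen_feedback, summarize_llm)  # continuous strategy
        static_acc.append(evaluate(dev_alignment, period["test"], ai_judge))
        continuous_acc.append(evaluate(alignment, period["test"], ai_judge))
    return static_acc, continuous_acc
```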
The results are striking. The model with no realignment fluctuates between 35% and 57% accuracy, struggling to adapt to shifting preferences. Only continuous realignment maintains high accuracy, reaching 77% by time step 4, 35 points above the static model. That gap is the cost of ignoring concept drift in production LLM applications.
What I Learned
- Concept drift is measurable and consequential. In this experiment, the positive class rate dropped 48% over time. The static model’s accuracy fluctuated wildly in response.
- Continuous monitoring is essential. Without an evaluation harness tracking performance over time, concept drift would have been invisible until users complained.
- One-time alignment isn’t enough. Even a significant improvement during development degrades as user preferences evolve. The system needs to keep learning.
The most important lesson: Prompt engineering can be iterative and data-driven. The realignment process systematically updates prompt instructions through a feedback loop, treating prompts as adaptable components rather than static text.
Implications for LLM Practitioners
In production LLM applications, standing still means falling behind. Iteratively refining prompts with accumulated user feedback isn’t just good practice—it’s the difference between a system that degrades and one that improves with use.
For practitioners building production LLM systems:
- Implement feedback collection from day one
- Build evaluation infrastructure that runs automatically on new data
- Monitor both input distributions and output quality over time (a minimal check is sketched after this list)
- Plan for prompt engineering as a production process, not a one-time development task
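As a starting point, a drift check over a window of labeled feedback might look like the sketch below; the thresholds, field names, and `ai_judge` call are illustrative, not recommendations:

```python
def drift_report(window_records, ai_judge, alignment_text,
                 baseline_positive_rate, rate_tolerance=0.10, accuracy_floor=0.60):
    """Minimal check over one window of labeled feedback.

    Thresholds and field names are illustrative; pick values that match
    your own baseline.
    """
    n = len(window_records)
    positive_rate = sum(r["noteworthy"] for r in window_records) / n
    accuracy = sum(
        ai_judge(r["diff"], alignment=alignment_text) == r["noteworthy"]
        for r in window_records
    ) / n

    alerts = []
    if abs(positive_rate - baseline_positive_rate) > rate_tolerance:
        alerts.append(f"input/label distribution shifted: {positive_rate:.0%} positive")
    if accuracy < accuracy_floor:
        alerts.append(f"output quality below floor: {accuracy:.0%} accuracy")
    return {"positive_rate": positive_rate, "accuracy": accuracy, "alerts": alerts}
```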
The complete implementation, including evaluation and realignment scripts, is available on GitHub. The live app is deployed on 🤗 Hugging Face Spaces. Feedback data is publicly available in this 🤗 Hugging Face Dataset.
Further Reading
Evidently AI: Concept Drift in ML
AWS: Monitoring Drift in Gen AI Applications
Orq.ai: Model vs Data Drift
Fiddler AI: How to Monitor LLMOps Performance