Large language models produce repetitive output when prompted independently across many batches, a phenomenon we term cross-batch mode collapse: the progressive loss of output diversity when a language model is prompted repeatedly without access to its prior generations. We introduce Dynamic Context Evolution (DCE), comprising three mechanisms: (1) verbalized tail sampling, which filters high-probability candidates via model self-assessment; (2) semantic memory, which maintains a persistent embedding index to reject near-duplicates across batches; and (3) adaptive prompt evolution, which reconstructs the generation prompt each batch using memory state and rotating diversity strategies. In experiments across three domains and two model families, DCE achieves 0.0 ± 0.0% collapse versus 5.6 ± 2.0% for naive prompting, while producing 17-18 HDBSCAN clusters per seed versus naive's volatile 2-17. These results hold across sensitivity sweeps and are validated with an independent embedding model, at approximately $0.50 per 1,000 candidates using only standard API calls, with no fine-tuning or custom architectures required.
When you need thousands of synthetic training examples from an LLM, you prompt it in batches. But each API call starts with a blank context window. The model has no memory of what it already generated, so it keeps returning to the same high-probability outputs. We call this cross-batch mode collapse.
Consider a concrete experiment: prompt a model to generate five educational exam questions, clear the conversation, and repeat 200 times. In the first 30 batches, you get genuinely distinct questions spanning diverse topics. By batch 50, the same question structures resurface with superficial variation. By batch 200, 34% of questions in the final 50 batches are near-duplicates of questions from the first 50.
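The near-duplicate rate in that experiment can be measured by comparing embeddings of late-batch questions against early-batch ones. Here is a minimal sketch using toy 2-D vectors in place of real sentence embeddings, with an illustrative 0.9 cosine threshold (not necessarily the paper's exact metric):

```python
import numpy as np

def collapse_rate(early, late, threshold=0.9):
    """Fraction of late-batch items that near-duplicate an early-batch item.

    early, late: arrays of shape (n, d) holding sentence embeddings;
    rows are L2-normalized below so dot products become cosine similarities.
    """
    early = early / np.linalg.norm(early, axis=1, keepdims=True)
    late = late / np.linalg.norm(late, axis=1, keepdims=True)
    sims = late @ early.T  # cosine similarity of every late item to every early item
    return float((sims.max(axis=1) >= threshold).mean())

# Toy vectors standing in for real embeddings: two late items echo
# early ones, one explores genuinely new territory.
early = np.array([[1.0, 0.0], [0.0, 1.0]])
late = np.array([[0.99, 0.05], [0.05, 0.99], [0.7, -0.7]])
print(collapse_rate(early, late))  # 2 of 3 late items count as near-duplicates
```

With real data, the embeddings would come from a sentence-embedding model rather than hand-written vectors.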
This matters because diversity directly determines downstream utility. A classifier trained on 1,000 paraphrases of 50 concepts learns 50 categories. A classifier trained on 1,000 genuinely distinct concepts can learn far richer structure.
DCE gives the model a memory of everything it has already generated, and a prompt that adapts based on what's missing. Three mechanisms run together each batch:
1. Verbalized tail sampling asks the model to rate how obvious each idea is. The model estimates the probability that another LLM, given the same prompt, would produce the same concept. Predictable ideas (like "smart water bottle," which scores 0.45) are filtered out. Only surprising, low-probability ideas are kept (like "shipping containers with walls of compressed agricultural waste," which scores 0.03).
2. Semantic memory stores every accepted idea as an embedding in a persistent vector database. Before a new idea is accepted, the system computes its cosine similarity to all stored ideas. This catches duplicates at the meaning level, not just the word level. "Smart water bottle" and "intelligent hydration vessel" share few words but produce nearly identical embeddings. The memory catches both.
3. Adaptive prompt evolution rewrites the generation prompt every batch based on the current state of memory. It injects three types of context: the most recently accepted ideas (so the model knows what to avoid), the densest regions of the embedding space (so it steers away from overrepresented areas), and the distribution of ideas across categories (so it can target gaps). Four diversity strategies rotate across batches: gap targeting, assumption inversion, cross-industry stimulus, and constraint variation.
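Mechanism (1) reduces to a threshold filter over the model's verbalized probability estimates. A minimal sketch; the 0.10 cutoff is an illustrative assumption (the paper sweeps this threshold), and eliciting the scores from the model is left out:

```python
def vts_filter(scored_ideas, max_prob=0.10):
    """Keep only ideas the model itself rates as unlikely.

    scored_ideas: (idea, p) pairs, where p is the model's verbalized
    estimate that another LLM given the same prompt would produce the
    same concept. max_prob=0.10 is illustrative, not the paper's value.
    """
    return [idea for idea, p in scored_ideas if p <= max_prob]

scored = [
    ("smart water bottle", 0.45),  # obvious: filtered out
    ("shipping containers with walls of compressed agricultural waste", 0.03),
]
print(vts_filter(scored))  # only the low-probability idea survives
```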
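Mechanism (2) can be sketched as a small in-memory embedding index that rejects an idea whose cosine similarity to any stored idea is too high. The 0.85 duplicate threshold and the `toy_embed` stand-in for a real embedding model are illustrative assumptions:

```python
import numpy as np

class SemanticMemory:
    """Persistent index of accepted ideas that rejects near-duplicates."""

    def __init__(self, embed, dup_threshold=0.85):
        self.embed = embed              # any text -> vector function
        self.dup_threshold = dup_threshold
        self.vectors = []               # one unit vector per accepted idea

    def try_add(self, idea):
        v = np.asarray(self.embed(idea), dtype=float)
        v = v / np.linalg.norm(v)
        if any(float(v @ u) >= self.dup_threshold for u in self.vectors):
            return False                # too close in meaning to a stored idea
        self.vectors.append(v)
        return True

# Toy stand-in for a real embedding model: hydration-related ideas map
# to one direction, everything else to another.
def toy_embed(text):
    hydration = any(w in text for w in ("water", "hydration"))
    return [1.0, 0.1] if hydration else [0.1, 1.0]

mem = SemanticMemory(toy_embed)
print(mem.try_add("smart water bottle"))           # True: first of its kind
print(mem.try_add("intelligent hydration vessel")) # False: same meaning, few shared words
```

A production version would swap `toy_embed` for a real model and the list scan for a vector database, but the accept/reject logic is the same.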
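Mechanism (3) amounts to re-templating the generation prompt from memory state each batch. The template wording below is hypothetical; only the three context types and the four rotating strategies come from the description above:

```python
STRATEGIES = [
    "gap targeting",
    "assumption inversion",
    "cross-industry stimulus",
    "constraint variation",
]

def build_prompt(batch_idx, recent_ideas, dense_regions, category_counts):
    """Rebuild the generation prompt from the current memory state.

    Rotates one of four diversity strategies per batch and injects
    recent ideas, overrepresented regions, and underrepresented
    categories. The wording is a hypothetical template.
    """
    strategy = STRATEGIES[batch_idx % len(STRATEGIES)]
    gaps = sorted(category_counts, key=category_counts.get)[:3]
    return (
        f"Diversity strategy for this batch: {strategy}.\n"
        f"Avoid repeating recent ideas: {', '.join(recent_ideas)}.\n"
        f"Steer away from crowded areas: {', '.join(dense_regions)}.\n"
        f"Target underrepresented categories: {', '.join(gaps)}."
    )

print(build_prompt(
    batch_idx=1,
    recent_ideas=["smart water bottle"],
    dense_regions=["IoT gadgets"],
    category_counts={"materials": 12, "logistics": 2, "policy": 5},
))
```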
Early batches prioritize broad exploration. Later batches shift to gap-filling, targeting underrepresented categories. By batch 190, the prompt tracks 47 categories and flags 3 dense regions to avoid.
DCE achieves 0.0% collapse versus 5.6% for naive prompting, tested across three domains (sustainable packaging, educational exam questions, creative writing prompts), two model families (GPT-5-mini and Claude Haiku 4.5), and three random seeds. It consistently produces 17-18 distinct concept clusters where naive prompting produces anywhere from 2 to 17.
A seven-method ablation shows that no single component is sufficient. Deduplication alone prevents near-duplicates from entering the output set, but doesn't stop the model from generating the same ideas every batch. Prompt evolution alone steers the model toward new territory, but can't catch the near-duplicates that slip through. The three mechanisms are individually insufficient but jointly effective.
Results are validated with an independent embedding model (all-MiniLM-L6-v2) and hold across sensitivity sweeps of the VTS threshold and deduplication threshold. The cost is roughly $0.50 per 1,000 candidates using standard API calls. No fine-tuning or custom architectures required.
@article{lingo2026dynamic,
  title={Dynamic Context Evolution for Scalable Synthetic Data Generation},
  author={Lingo, Ryan and Chhajer, Rajeev},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}