Prompt Optimization
Table of Contents
- TextGrad: Automatic “Differentiation” via Text (11 Jun 2024)
- Feedback Descent: Open-Ended Text Optimization via Pairwise Comparison (11 Nov 2025)
- Learning from Contrastive Prompts: Automated Optimization and Adaptation (23 Sep 2024)
- Self-Supervised Prompt Optimization (21 Aug 2025)
TextGrad: Automatic “Differentiation” via Text (11 Jun 2024)
We introduce TEXTGRAD, automatic differentiation via text. Here we use differentiation and gradients as a metaphor for textual feedback from LLMs. In this framework, each AI system is transformed into a computation graph, where variables are inputs and outputs of complex (not necessarily differentiable) function calls. The feedback to the variables (dubbed ‘textual gradients’ [25]) is provided in the form of informative and interpretable natural language criticism describing how a variable should be changed to improve the system.
This is the gradient operator when the forward function is an LLM call. In particular, this function returns natural language feedback such as ‘This prediction can be improved by...’, where the feedback describes how to modify the variable to improve the downstream objective, analogous to gradients in optimization.
There are two classes of optimization problems we explore. In instance optimization, we directly treat a solution to a problem—e.g., a code snippet or a molecule—as an optimization variable. For instance, in Equation 13, we have a code instance that we would like to improve at test time. Our framework produces the gradients for and directly optimizes the code variable. In prompt optimization, the goal is to find a prompt that improves the performance of an LLM across multiple queries for a task. For example, we may want to find a system prompt for an LLM that improves performance on mathematical reasoning questions (see Section 3.3 for examples). In particular, we want the system prompt to generalize, in contrast to instance optimization, where the only goal is to improve the solution for a given query at test time. Crucially, both types of problems can be solved without hand-crafting the framework.
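The forward/backward metaphor can be made concrete with a short sketch. The code below is not the actual `textgrad` library API: `llm` is a hypothetical stand-in for any chat-model call, and the `is_prime` snippet is an invented instance-optimization example.

```python
# Minimal sketch of a TextGrad-style textual-gradient step (NOT the actual
# textgrad library API). `llm` is a hypothetical helper around any chat model.

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

def textual_gradient(variable: str, output: str, objective: str) -> str:
    """Backward pass: ask an LLM how `variable` should change to improve
    `output` with respect to `objective`. The 'gradient' is plain text."""
    return llm(
        f"Objective: {objective}\n"
        f"Variable:\n{variable}\n"
        f"Output it produced:\n{output}\n"
        "Critique the variable: describe concretely how it should be changed "
        "to improve the output, e.g. 'This prediction can be improved by ...'."
    )

def tgd_step(variable: str, gradient: str) -> str:
    """Optimizer step: apply the textual gradient by rewriting the variable."""
    return llm(
        f"Variable:\n{variable}\n"
        f"Feedback:\n{gradient}\n"
        "Rewrite the variable to incorporate the feedback. "
        "Return only the revised variable."
    )

# Instance optimization: improve a single code snippet at test time.
code = "def is_prime(n): return all(n % i for i in range(2, n))"
for _ in range(3):
    output = llm(f"Review this function for correctness:\n{code}")
    grad = textual_gradient(code, output, "the function must be correct for all n >= 0")
    code = tgd_step(code, grad)
```

For prompt optimization, the same loop would instead treat a system prompt as the variable and aggregate gradients across a batch of training queries, so the update generalizes beyond any single instance.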
Feedback Descent: Open-Ended Text Optimization via Pairwise Comparison (11 Nov 2025)
By preserving detailed critiques instead of compressing them to binary preferences, Feedback Descent widens the information bottleneck in preference learning, enabling directed optimization in text space rather than weight space. We show that in-context learning can transform structured feedback into gradient-like directional information, enabling targeted edits. We evaluate Feedback Descent on three diverse domains and find that it outperforms state-of-the-art prompt optimization (GEPA), reinforcement learning methods (GRPO, REINVENT), and even specialized graph-based molecular optimizers.
Unfortunately, existing reinforcement learning frameworks are designed to learn from impoverished supervision signals, typically either scalar rewards or pairwise preference data, where each annotation conveys at most a single bit per pair. These bottlenecks discard information about why one behavior is better and how to improve—information available in environment feedback or easily elicited from humans during annotation [30, 76].
Because free-form feedback does not define a differentiable objective, it cannot directly drive weight updates via backpropagation. Our approach iterates at inference time, using language models to translate accumulated feedback into targeted edits of text artifacts (prompts, code, molecules, JSON configs) that improve a final performance objective, without any weight updates.
Our contributions are threefold. First, we formalize why textual feedback enables dimension-free convergence while zeroth-order methods suffer exponential slowdown with effective dimensionality, identifying when and why structured feedback outperforms scalar rewards. Second, we demonstrate cross-domain generality: Feedback Descent works across three qualitatively distinct domains (visual design, prompt optimization, molecule design) with the same iterative loop. Third, we validate competitive or superior performance versus specialized methods, achieving competitive results with the state of the art in prompt optimization (GEPA).
At iteration t, we prompt the model with the current best artifact x and accumulated feedback R_{t-1} to generate an improved candidate. The prompt instructs the model to address previous critiques while preserving successful elements. These prompts are intentionally minimal: the optimization signal comes from the accumulated feedback rather than heavy prompt engineering.
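A minimal sketch of this loop, assuming a generic `llm` helper; the paper's prompts are intentionally minimal, so the instruction strings below are illustrative rather than the authors' exact wording.

```python
# Sketch of the Feedback Descent loop: detailed critiques are accumulated
# rather than compressed to a single preference bit per comparison.

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

def propose(best: str, feedback: list[str]) -> str:
    """Generate an improved candidate from the current best artifact
    and all accumulated critiques."""
    return llm(
        f"Current best artifact:\n{best}\n\n"
        "Accumulated feedback:\n" + "\n".join(feedback) + "\n\n"
        "Produce a revised artifact that addresses the critiques while "
        "preserving what already works."
    )

def compare(a: str, b: str) -> tuple[str, str]:
    """Pairwise comparison returning the winner AND a detailed critique,
    which is the information that widens the preference bottleneck."""
    verdict = llm(
        f"Artifact A:\n{a}\n\nArtifact B:\n{b}\n\n"
        "Which better serves the task, and why? Begin with 'A' or 'B', "
        "then give a detailed critique of the weaker artifact."
    )
    winner = a if verdict.strip().startswith("A") else b
    return winner, verdict

best, feedback = "initial artifact", []
for t in range(20):
    candidate = propose(best, feedback)
    best, critique = compare(best, candidate)
    feedback.append(critique)  # keep the critique, not just the winner bit
```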
In high-dimensional search, even weak directional cues can accumulate into steady improvement, whereas unguided search scales exponentially worse.
Prompt optimization. We follow the setup of GEPA [3] across four diverse tasks: multi-hop reasoning (HotpotQA; Yang et al. [81]), instruction following (IFBench; Pyatkin et al. [54]), privacy-aware delegation (PUPA; Li et al. [35]), and retrieval-augmented verification (HoVer; Jiang et al. [28]). We evaluate on both open-source (Qwen3-8B; Yang et al. [78]) and proprietary (GPT-4.1 mini) models. For each task, we use the same multi-stage programs from GEPA, where the number of stages differs across datasets, and we jointly optimize the prompts for all stages using Feedback Descent. Optimization is driven by training examples: candidate prompts are updated based on performance on the training set and textual feedback describing which constraints were satisfied or violated. All candidate prompts are scored on validation examples, and the prompt with the highest validation score is selected. We report performance on held-out test examples.
Despite GEPA being specifically engineered for prompt optimization with coordinate descent and Pareto frontier maintenance, Feedback Descent achieves competitive performance with a simpler approach: jointly optimizing all prompts at once via automated textual summaries of pairwise performance differences. We do not claim to present a state-of-the-art prompt optimizer; rather, these results demonstrate that our general-purpose framework remains competitive with specialized methods while requiring minimal domain-specific engineering.
We programmatically generate a textual description of the difference between two prompts. To compare two prompts, we first partition the training set into four quadrants based on outcomes: examples where prompt A succeeds and B fails (A_wins), A fails and B succeeds (B_wins), both fail (tie_fail), or both succeed (tie_success). We then use the same LLM to propose hypotheses about input characteristics (based on the prompt text) and output patterns (based on the response and evaluation feedback), producing around 20 hypotheses for each category. To evaluate whether each hypothesis applies to each example, we use the same LLM as a tagger that outputs binary labels (1 if the hypothesis matches, 0 otherwise) for all hypotheses in a single call per example, processing hundreds of examples in parallel. We then compute lift metrics for each hypothesis-quadrant pair, where lift is the ratio of conditional probability to the base rate (i.e., how much more likely an outcome is given the hypothesis holds). We validate the hypotheses statistically using Fisher’s exact test, and filter for hypotheses that are statistically significant at p < 0.1 with minimum support of 3 examples and lift ≥ 2.0 for A/B wins or ≥ 1.5 for failures. This analysis identifies which input patterns correlate with differential performance and which output characteristics appear when one prompt outperforms the other, providing actionable insights for prompt improvement.
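A sketch of the lift-and-significance filter, assuming the binary hypothesis tags and quadrant labels have already been produced by the tagger. The one-sided test direction and the reading of "support" as hypothesis-quadrant co-occurrence count are assumptions; `scipy.stats.fisher_exact` is used for the test.

```python
# Lift + Fisher filter over hypothesis/quadrant co-occurrence counts.
from scipy.stats import fisher_exact

def significant_hypotheses(tags, quadrants, quadrant, min_lift):
    """tags: {hypothesis: [0/1 per example]}; quadrants: [label per example]."""
    n = len(quadrants)
    in_q = [int(q == quadrant) for q in quadrants]
    base_rate = sum(in_q) / n
    if base_rate == 0:
        return []
    kept = []
    for hyp, t in tags.items():
        if sum(t) == 0:
            continue
        both = sum(a and b for a, b in zip(t, in_q))  # hyp holds, in quadrant
        hyp_only = sum(t) - both                      # hyp holds, elsewhere
        q_only = sum(in_q) - both                     # hyp absent, in quadrant
        neither = n - both - hyp_only - q_only
        lift = (both / sum(t)) / base_rate            # P(quadrant|hyp) / P(quadrant)
        # One-sided test for over-representation (direction is an assumption).
        _, p = fisher_exact([[both, hyp_only], [q_only, neither]],
                            alternative="greater")
        if p < 0.1 and both >= 3 and lift >= min_lift:
            kept.append((hyp, lift, p))
    return kept

# Toy usage: 6 examples, one hypothesis, quadrant labels per example.
tags = {"input mentions dates": [1, 1, 1, 0, 0, 0]}
quadrants = ["A_wins", "A_wins", "A_wins", "tie_success", "tie_fail", "B_wins"]
print(significant_hypotheses(tags, quadrants, "A_wins", min_lift=2.0))
```

Per the thresholds above, win quadrants would be filtered at `min_lift=2.0` and shared failures at `min_lift=1.5`.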
We use the following prompt templates for candidate generation and rationale generation for prompt optimization.
Improve the assistant’s prompt by extracting actionable insights from the data.

## Goal
Create prompts that generalize well beyond the training examples you see here. The patterns below come from a small sample; your output must work on thousands of unseen cases.

## Current Prompts
**Approach A (Baseline):**
```python
{prompt_a_dict}
```
**Approach B (Challenger):**
```python
{prompt_b_dict}
```

## Training Signals
{comparison}

## Prompt Improvement Strategy
**1. Extract Core Insights**
Identify patterns with strong evidence (low p-value, high lift, good support):
- What fundamental strategies distinguish success from failure?
- What misunderstandings or errors repeatedly occur?
- Are there essential facts or constraints the model needs to know?

**2. Avoid Common Pitfalls**
- Redundancy: Don’t say the same thing multiple ways
- Over-specification: Don’t list every possible format, constraint, or edge case
- Defensive bloat: Don’t add uncertainty handling or safety warnings unless critical
- Surface patterns: Look for deep semantic strategies, not superficial formatting rules
- Enumerationitis: Avoid long numbered checklists; prefer flowing prose

**3. Craft Effective Instructions**
- State principles clearly and concisely
- Use specific language when precision matters ("identify the missing fact" vs "analyze the information")
- Keep instructions proportional to task complexity
- Test in your mind: would this help on examples you haven’t seen?

**4. Preserve What Works**
- If baseline is effective and simple, make minimal changes
- Don’t fix what isn’t broken
- Complexity should buy you something measurable

The prompt must be a Python dictionary with the following keys: {module_keys_description}

Output EXACTLY in this format:
```python
{prompt_template}
```
Learning from Contrastive Prompts: Automated Optimization and Adaptation (23 Sep 2024)
Existing methods rely solely on learning from incorrect samples, leading to sub-optimal performance. Additionally, an unexplored challenge in the literature is that prompts effective for prior models may not perform well on newer versions or in different languages. We propose the Learning from Contrastive Prompts (LCP) framework to address these gaps, enhancing both prompt optimization and adaptation. LCP employs contrastive learning to generate effective prompts by analyzing patterns in good and bad prompt examples.
We inject diverse prompts into prompt optimization by generating multiple prompt candidates to explore the prompt space. To overcome the shortcomings of existing methods, we take inspiration from the principle of contrastive learning (Chen et al., 2020), allowing the LLM to contrast good and bad prompts from the generated candidates while learning to improve on error cases.
Given a list of prompts with their corresponding quality scores, we compare a batch of high-quality prompts to a batch of low-quality prompts, drawing conclusions about the patterns that characterize effective prompts.
A problem with generating the reason for each wrong sample and crafting a prompt candidate based on it is that the candidate prompt can be biased towards that sample, making it too specific. To inject some consistency, we select multiple incorrectly predicted samples and summarize the common failure reasons … We add diversity to the generated summary by setting a more creative (higher) temperature and repeating this step multiple times to generate N prompts, referred to as prompt candidates. This approach helps the model explore the prompt space. We use N = 10 based on our experiments.
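A sketch of this candidate-generation step, with a hypothetical `llm` client whose `temperature` controls sampling diversity. Redrawing the batch of failed samples on each repetition is an assumption; the excerpt does not specify whether the samples are resampled.

```python
# Sketch of LCP candidate generation: summarize common failure reasons over
# several wrong samples (not one, to avoid overfitting to a single example),
# then sample N diverse prompt candidates at a creative temperature.
import random

def llm(prompt: str, temperature: float = 0.0) -> str:
    raise NotImplementedError("plug in an LLM client here")

def generate_candidates(current_prompt, wrong_samples, n=10, k=5):
    candidates = []
    for _ in range(n):
        batch = random.sample(wrong_samples, min(k, len(wrong_samples)))
        summary = llm(
            f"Current prompt:\n{current_prompt}\n\nFailed examples:\n"
            + "\n".join(batch)
            + "\n\nSummarize the common reasons these predictions failed.",
            temperature=1.0,  # creative temperature adds diversity
        )
        candidates.append(llm(
            f"Current prompt:\n{current_prompt}\nFailure summary:\n{summary}\n"
            "Write an improved prompt that fixes these failure modes.",
            temperature=1.0,
        ))
    return candidates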
Inspired by contrastive learning, we instruct the LLM to identify the underlying patterns that distinguish good prompts from bad ones. Specifically, we define the top-K prompts as the good prompts and the bottom-K prompts as the bad prompts.
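The contrastive step can then be sketched as follows, reusing a hypothetical `llm` helper; the separator and exact instruction wording are invented for illustration.

```python
# Sketch of the contrastive step: rank scored candidates, then ask the LLM
# to contrast the top-K against the bottom-K and write an improved prompt.

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

def contrast_and_improve(scored_prompts, k=3):
    """scored_prompts: list of (prompt, quality_score) pairs."""
    ranked = sorted(scored_prompts, key=lambda p: p[1], reverse=True)
    good = [p for p, _ in ranked[:k]]    # top-K = good prompts
    bad = [p for p, _ in ranked[-k:]]    # bottom-K = bad prompts
    return llm(
        "Good prompts:\n" + "\n---\n".join(good)
        + "\n\nBad prompts:\n" + "\n---\n".join(bad)
        + "\n\nIdentify the patterns that distinguish the good prompts from "
        "the bad ones, then write a new prompt that follows those patterns."
    )
```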
Self-Supervised Prompt Optimization (21 Aug 2025)
They rely heavily on external references such as ground truth or human annotation, limiting their applicability in real-world scenarios where such data is unavailable or costly to obtain. To address this, we propose Self-Supervised Prompt Optimization (SPO), a cost-efficient framework that discovers effective prompts for both closed- and open-ended tasks without requiring external references. … SPO selects superior prompts through pairwise output comparisons evaluated by an LLM evaluator, followed by an LLM optimizer that aligns outputs with task requirements.
(2) Cost-effective Optimization. SPO optimizes prompts with minimal computational overhead ($0.15 per dataset) and sample requirements (3 samples), significantly reducing resource demands. (3) Extensive Evaluation. As shown in Figure 2, SPO requires only 1.1% to 5.6% of the cost of state-of-the-art methods while maintaining superior performance across both closed and open-ended tasks.
Output vs. Output (OvO): When ground truth is unavailable, we turn to direct output comparison. The core idea behind OvO is that even in the absence of perfect ground truth, comparing outputs generated by different prompts can offer valuable signals about their relative quality.
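A minimal sketch of one OvO iteration, with a hypothetical `llm` helper. The majority-vote acceptance rule is an assumption rather than SPO's exact selection criterion; the paper reports needing only a handful of samples (around 3) per dataset.

```python
# Sketch of an SPO-style OvO step: an LLM optimizer proposes a new prompt,
# and an LLM evaluator compares outputs pairwise, with no ground truth.

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

def ovo_step(best_prompt: str, task: str, samples: list[str]) -> str:
    # Optimizer: propose a prompt better aligned with task requirements.
    candidate = llm(
        f"Task: {task}\nCurrent prompt:\n{best_prompt}\n"
        "Propose an improved prompt."
    )
    wins = 0
    for x in samples:  # only a few samples are needed
        out_best = llm(best_prompt + "\n" + x)
        out_cand = llm(candidate + "\n" + x)
        # Evaluator: pairwise output comparison as the quality signal.
        verdict = llm(
            f"Task: {task}\nInput: {x}\n"
            f"Output 1:\n{out_best}\nOutput 2:\n{out_cand}\n"
            "Which output better satisfies the task requirements? Answer 1 or 2."
        )
        wins += verdict.strip().startswith("2")
    # Accept the candidate only if its outputs win on a majority of samples.
    return candidate if wins > len(samples) / 2 else best_prompt
```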