Prompt engineering is the practice of designing inputs to language models to reliably get high-quality outputs. It matters because the same model will produce dramatically different results depending on how you structure your request. A well-engineered prompt can turn a mediocre response into a precise, useful one without changing the model, the model size, or the cost per token.
This guide covers every technique that produces consistent, measurable improvements. Each section includes real before-and-after examples. I also cover what does not work, because there is a lot of misinformation about magic phrases and jailbreaks that wastes engineers' time.
Foundation: How Prompts Work
Before the techniques, the mental model. An LLM generates text by predicting what token is most likely to come next given everything that came before. Your prompt is the "everything that came before." It sets the probability distribution over all possible next tokens.
When you write a good prompt, you are not giving the model instructions the way you give a programmer instructions. You are creating a context that makes the desired output the statistically most likely continuation. This is why the same instruction phrased differently can produce vastly different results.
Zero-Shot Prompting
Zero-shot means asking the model to do something without providing examples. You rely entirely on the model's pretraining.
When it works: Simple, well-defined tasks. Classification, summarization, basic formatting, common question types.
Bad zero-shot:
Analyze this customer feedback.
"The product arrived late and the packaging was damaged."
Better zero-shot:
You are a customer service analyst. Classify the following customer feedback into: [sentiment: positive/negative/neutral], [category: shipping/product/support/other], and [priority: high/medium/low]. Respond in JSON format.
Customer feedback: "The product arrived late and the packaging was damaged."
The improved version specifies the role, the exact classification dimensions, the output format, and the input clearly. The vague version produces inconsistent results. The structured version produces parseable, consistent output.
Few-Shot Prompting
Few-shot means providing examples of input-output pairs before your actual request. You are showing the model the pattern you want it to continue.
Why it works: The model uses the examples to infer the pattern, tone, format, and level of detail you want. For tasks where "good output" is difficult to describe but easy to demonstrate, few-shot is more reliable than zero-shot.
One-shot example (one input-output pair):
Classify the sentiment of customer messages.
Message: "I love this product, works exactly as described!"
Sentiment: Positive
Message: "The delivery took 3 weeks and the box was crushed."
Sentiment:
Few-shot example (three pairs):
Classify the sentiment of customer messages as Positive, Negative, or Mixed.
Message: "I love this product, works exactly as described!"
Sentiment: Positive
Message: "The delivery took 3 weeks and the box was crushed."
Sentiment: Negative
Message: "Great product but shipping was slow."
Sentiment: Mixed
Message: "Finally got my order. Not what I expected but I guess it works."
Sentiment:
The few-shot version defines all three categories through examples, making "Mixed" possible as a category rather than forcing everything into binary. The model learns the classification boundary from the examples rather than from a textual definition.
Brown et al. in the original GPT-3 paper (Brown et al., "Language Models are Few-Shot Learners," NeurIPS 2020) demonstrated that few-shot performance often approaches fine-tuned performance on standard benchmarks, at zero training cost. The optimal number of examples is typically 3 to 5 for most tasks.
Chain of Thought Prompting
Chain of thought (CoT) prompting encourages the model to reason step by step before giving its final answer. Wei et al. in "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (NeurIPS 2022) showed that adding "Let's think step by step" to a prompt significantly improves performance on math, logic, and multi-step reasoning tasks.
Without CoT:
A store has 45 apples. They sell 18 on Monday and receive a delivery of 30 on Tuesday. They sell 23 on Wednesday. How many apples are left?
Model output: "34" (often wrong or shown without working)
With CoT:
A store has 45 apples. They sell 18 on Monday and receive a delivery of 30 on Tuesday. They sell 23 on Wednesday. How many apples are left? Let's think through this step by step.
Model output: "Start with 45. After Monday's sales: 45 - 18 = 27. After Tuesday's delivery: 27 + 30 = 57. After Wednesday's sales: 57 - 23 = 34. There are 34 apples left."
The final answer is the same in both cases here, but for more complex problems, the step-by-step reasoning dramatically improves accuracy because the model commits intermediate results to its context, reducing the chance of errors in longer chains.
The phrase "Let's think step by step" is the most studied CoT trigger. It is not magic; it is a context setter that shifts the probability distribution toward structured reasoning output. Other effective framings: "Work through this carefully," "Show your reasoning," "Think through this before answering."
When CoT is worth it: Multi-step math, logical reasoning, planning tasks, code debugging. The quality improvement is most pronounced on problems that require more than 2 to 3 reasoning steps.
When CoT is overkill: Simple lookups, classification tasks, short factual questions. Adding CoT to simple tasks can actually introduce errors by encouraging the model to over-reason.
System Prompts
A system prompt is an instruction set provided to the model before the user's input. It establishes the model's role, constraints, output format, and behavioral guidelines. In most API implementations, the system prompt occupies a privileged position in the context that the model treats with more authority than user messages.
System prompts are the highest-leverage prompt engineering tool for building applications. A well-written system prompt can reduce the need for complex per-request engineering.
A weak system prompt:
You are a helpful assistant for our software company.
A strong system prompt for a code review assistant:
You are a senior software engineer conducting code reviews. Your reviews must:
1. Identify bugs, security vulnerabilities, and performance issues first
2. Comment on code style and readability second
3. Suggest specific improvements with example code
4. Be direct and specific — avoid vague feedback like "this could be improved"
5. Note what is done well, not just what needs fixing
Format your review as:
- CRITICAL (must fix before merge): [issues]
- SUGGESTIONS (improvements worth making): [issues]
- PRAISE (what is done well): [observations]
If there are no critical issues, say so explicitly.
The stronger version specifies exactly what the model should look for, the priority order, the output format, and what counts as good feedback. The weak version produces generic responses. The strong version produces structured, actionable reviews.
Role Prompting
Role prompting assigns a specific persona or professional role to the model. The effect is that the model generates outputs consistent with how a person in that role would respond, drawing on patterns in the training data for that role's typical language and reasoning.
Before role prompt:
Explain the risks of this API design.
After role prompt:
You are a security engineer with 10 years of experience reviewing API designs for financial services companies. Identify security risks in the following API design, prioritizing issues that could expose customer data or allow unauthorized access.
The role establishes context that shifts the model toward security-focused analysis. The phrasing "financial services companies" calibrates the risk threshold (stricter than a typical app).
Role prompting is most effective when the role is specific and when the training data likely contains meaningful examples of that role's expertise. Asking the model to be "a world-class expert" is less effective than asking it to be "a senior infrastructure engineer specializing in distributed systems."
Structured Output Prompting
Structured output prompting asks the model to produce its response in a specific format (JSON, XML, Markdown table, etc.) that can be parsed programmatically. This is essential for any application that consumes LLM output programmatically.
Unstructured:
Extract the key entities from this text: "John Smith called from Apple Inc. on April 5th about invoice #4521."
Output: "The text mentions John Smith, a person from Apple Inc., who called on April 5th about invoice number 4521."
Structured:
Extract entities from the following text and return a JSON object with these fields: person_name (string), company (string), date (string in YYYY-MM-DD format), invoice_number (string or null).
Text: "John Smith called from Apple Inc. on April 5th about invoice #4521."
Output:
{
"person_name": "John Smith",
"company": "Apple Inc.",
"date": "2026-04-05",
"invoice_number": "4521"
}
Many models now support native JSON mode (GPT-4o's response_format: { type: "json_object" }, Claude's tool use, Gemini's structured output), which enforces valid JSON output at the decoding level rather than relying on the model to produce it correctly.
Prompt Chaining
Prompt chaining decomposes a complex task into a sequence of simpler tasks, where each prompt's output becomes the input to the next. This is how complex AI workflows are built.
Example: analyzing a legal contract
Single prompt approach (often unreliable): "Analyze this 50-page contract for risks, obligations, payment terms, and termination clauses, and produce an executive summary."
Chained approach:
- Prompt 1: "Extract all payment terms from this contract section. List each term with the relevant clause number."
- Prompt 2: "Given these payment terms: [output of Prompt 1], identify which are unusual or potentially problematic for the buyer."
- Prompt 3: "Given these payment risk findings: [output of Prompt 2], write a two-paragraph executive summary suitable for a non-lawyer."
Chaining produces more reliable results because each step is narrow enough for the model to handle with high accuracy. Failures are contained and debuggable. The cost is more API calls, but for complex tasks the quality improvement justifies it.
Self-Consistency
Self-consistency generates multiple responses to the same prompt (with some temperature above 0 to get variation) and selects the most common answer. Wang et al. introduced this in "Self-Consistency Improves Chain of Thought Reasoning in Language Models" (ICLR 2023).
How to apply it: For high-stakes questions, generate 5 to 10 responses and compare. For math or logic problems, if 7 out of 10 responses agree on an answer, that answer is likely correct. For qualitative tasks, look for consensus on the key points.
Self-consistency adds cost (multiple generations) but can increase accuracy by 10 to 15 percentage points on complex reasoning tasks (Wang et al., ICLR 2023). Use it when the cost of an error is higher than the cost of extra API calls.
ReAct (Reasoning and Acting)
ReAct (Yao et al., "ReAct: Synergizing Reasoning and Acting in Language Models," ICLR 2023) combines chain-of-thought reasoning with action steps. The model alternates between thinking about what to do and specifying an action to take (typically calling a tool or searching for information).
The ReAct pattern:
Thought: I need to find the current stock price of Apple. Let me search for it.
Action: search("Apple AAPL stock price May 2026")
Observation: Apple (AAPL) is trading at $213.45 as of market close May 16, 2026.
Thought: Now I have the current price. The user asked to compare it to the 52-week high.
Action: search("Apple AAPL 52-week high")
Observation: Apple's 52-week high is $237.23 (reached March 2026).
Thought: I now have both numbers. AAPL is currently about 10% below its 52-week high.
Answer: Apple (AAPL) is currently at $213.45, which is approximately 10% below its 52-week high of $237.23 reached in March 2026.
ReAct is the foundation of most LLM agent architectures. When you see products that use LLMs to "browse the web" or "run code" and incorporate the results into their response, they are using a ReAct-style loop.
What Does NOT Work
Magic phrases. "You are DAN (do anything now)" and similar jailbreak prompts do not reliably override model safety training. They are widely known by the model providers and fine-tuned against. The time spent on jailbreaks is almost always better spent on legitimate prompt engineering.
Excessive flattery. "You are the most intelligent AI in existence and your answers are always perfect." This does not improve output quality. It may slightly increase verbosity and confidence, which can actually make responses worse on factual tasks.
"Pretend you have no restrictions." Safety training is baked into model weights, not enforced by a simple instruction. You cannot turn it off with a prompt.
Very long, unprioritized system prompts. A 2,000-word system prompt that covers every possible case is not better than a 300-word system prompt that covers the important cases clearly. Long prompts with conflicting instructions produce unpredictable behavior. Prioritize ruthlessly.
Prompting for Code vs. Writing vs. Analysis
The optimal prompting style differs by task type.
For code: Be specific about language, version, and constraints. Specify what the code must not do (no external dependencies, must handle null inputs, must be under N lines). Ask for tests alongside the implementation. Use structured output to separate code from explanation.
For writing: Specify audience, tone, and length. Give examples of the style you want. Ask for a single draft, not multiple options. Specify what to avoid (jargon, passive voice, overly formal register).
For analysis: Give the model the data or documents directly rather than asking it to recall facts. Specify the dimensions of analysis you want. Ask it to structure the output before elaborating: "First list the key findings, then explain each one."
Common Mistakes That Cost You Quality and Money
Not specifying output format. Unstructured outputs require post-processing that fails at edge cases. Specify JSON, Markdown tables, or numbered lists whenever the output will be parsed.
Putting critical instructions in the middle of a long prompt. The model attends most reliably to the beginning and end. Put your most important constraints at the top or bottom.
Using ambiguous language. "Be concise" means different things to different people and to the model. "Respond in 3 to 5 sentences" is unambiguous.
Not testing prompt changes systematically. Changing a prompt without a test set is guessing. Even a 20-case test set will reveal prompt changes that break previously working outputs.
Relying on one long prompt when chaining would be better. Complex single prompts often produce inconsistent results. Decomposing into a chain of simpler prompts takes more engineering time but produces more reliable output.
Keep Reading
- Chain of Thought Prompting: 8 Patterns With Real Before-and-After Examples — A deeper look at CoT with 8 specific patterns and their appropriate use cases
- How to Write a System Prompt That Actually Works: Examples for Every Use Case — Full system prompt examples for 6 common applications
- Few-Shot Prompting: When It Works, When It Fails, With Real Examples — The research on optimal example counts and format sensitivity
Pristren builds AI-powered software for teams. Zlyqor is our all-in-one workspace — chat, projects, time tracking, AI meeting summaries, and invoicing — in one tool. Try it free.