Prompt Engineering: What Actually Matters vs. Superstition

Prompt engineering has developed its own mythology. People claim that specific phrasings, magic words, or elaborate structures dramatically improve LLM outputs. Much of this is superstition.

After extensive testing with GPT-4, Claude, and other models, here’s what actually matters and what’s mostly noise.

What Definitely Works

Clear task description: Tell the model exactly what you want. Vague prompts get vague responses. Specific prompts get specific responses.

Example:

  • Vague: “Write about AI”
  • Specific: “Write a 500-word explanation of transformer architecture for software engineers with no ML background”

The specific version reliably works better. Not because of any trick, but because it leaves the model less to guess.

Relevant context: Give the model information it needs to complete the task. Don’t make it guess what you meant or what constraints apply.

If you want code in a specific framework, say so. If you want analysis focused on particular aspects, specify them. If there are constraints (length, tone, audience), include them.

Examples when appropriate: For complex or unusual tasks, examples help. Show the model what you want rather than just describing it.

One good example is often worth paragraphs of instructions. Two examples are better for establishing patterns. Beyond three, returns diminish rapidly.

Structured output requests: If you want a specific format (JSON, bullet points, tables), say so explicitly. Models handle structured output well when it’s requested clearly.
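
As an illustration, here’s a minimal sketch of requesting and parsing JSON output using the OpenAI Python SDK; the model name and field schema are placeholders, not recommendations:

    import json
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    prompt = (
        "Extract the product name and price from the text below. "
        "Respond with a single JSON object using exactly these keys: "
        '"name" (string) and "price" (number). Return only the JSON, no prose.\n\n'
        "Text: The Widget Pro is now available for $49.99."
    )

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )

    # In production, validate and retry: models occasionally wrap JSON in prose.
    data = json.loads(response.choices[0].message.content)
    print(data["name"], data["price"])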

What Probably Doesn’t Work

Magic phrases: People claim phrases like “think step by step” or “take a deep breath” improve outputs. Testing shows minimal or no consistent effect.

Some phrases might help for specific models or tasks, but there’s no universal magic wording that makes LLMs dramatically better.

Role-playing prompts: “You are an expert in X” is common advice. Testing shows it rarely makes a meaningful difference to output quality.

Models don’t become more knowledgeable about a topic because you told them they’re an expert. They output the same information with slightly different phrasing.

There might be small stylistic changes, but if you need domain expertise, you need to provide it through context, not through role-playing.

Elaborate prompt templates: Some people use multi-paragraph prompt structures with sections like “Context”, “Task”, “Format”, “Constraints”, etc.

This can be useful for organizing your own thinking, but models don’t specifically benefit from formal structure. A clear, concise prompt works as well as an elaborately formatted one.

Emotional language: Phrases like “this is very important” or “please try your best” don’t affect model performance. Models don’t have emotions or motivation. They’re predicting text.

What Sometimes Works

Chain of thought prompting: Asking models to show reasoning steps can improve performance on complex logical or mathematical tasks.

This works because models that generate intermediate steps are less likely to skip logical leaps. But it’s not magic; it’s just exposing the reasoning process.

For simple tasks, chain of thought adds verbosity without improving quality.
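
To make the distinction concrete, here’s the same question phrased directly and with an explicit request for intermediate steps (the wording is illustrative, not a formula):

    direct = (
        "A project has 3 phases of 6, 9, and 14 days. "
        "If phase 2 slips by 20%, what is the total duration?"
    )

    chain_of_thought = (
        "A project has 3 phases of 6, 9, and 14 days. "
        "If phase 2 slips by 20%, what is the total duration? "
        "Show each intermediate calculation, then state the final "
        "answer on its own line."
    )

    # The second version makes the model spell out 9 * 1.2 = 10.8 before
    # summing (6 + 10.8 + 14 = 30.8), the step where direct answers go wrong.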

Few-shot learning: Providing examples (few-shot) rather than none (zero-shot) helps with unusual tasks or specific formatting requirements.

The improvement varies by task complexity and how clear your instructions are. For straightforward tasks, examples might not help. For complex or unusual tasks, they often do.
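
As a sketch, compare a zero-shot request with a few-shot version of the same task, using chat-style messages; the ticket title convention here is invented for illustration:

    zero_shot = [
        {"role": "user",
         "content": "Convert to our ticket title format: login page crashes on submit"},
    ]

    few_shot = [
        {"role": "user",
         "content": "Convert to our ticket title format: app freezes when uploading photos"},
        {"role": "assistant", "content": "[BUG][mobile] App freezes during photo upload"},
        {"role": "user",
         "content": "Convert to our ticket title format: add dark mode to settings"},
        {"role": "assistant", "content": "[FEATURE][ui] Dark mode toggle in settings"},
        {"role": "user",
         "content": "Convert to our ticket title format: login page crashes on submit"},
    ]

    # Two examples establish the bracket pattern; describing it in prose
    # would take a paragraph and still leave room for drift.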

Iterative refinement: Running multiple prompts to refine output (generate draft, then critique it, then improve it) can produce better results than a single prompt.

This takes more tokens and time but can be worth it for important outputs. For routine tasks, it’s overkill.
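
A rough sketch of that loop, with complete() standing in for whatever model client you use:

    def complete(prompt: str) -> str:
        # Placeholder: wire up your model client (OpenAI, Anthropic, etc.) here.
        raise NotImplementedError

    def refine(task: str, rounds: int = 2) -> str:
        draft = complete(f"Task: {task}\n\nWrite a first draft.")
        for _ in range(rounds):
            critique = complete(
                f"Task: {task}\n\nDraft:\n{draft}\n\n"
                "List the three most important weaknesses of this draft."
            )
            draft = complete(
                f"Task: {task}\n\nDraft:\n{draft}\n\nCritique:\n{critique}\n\n"
                "Rewrite the draft to address the critique."
            )
        return draft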

Model Differences Matter More

Prompt engineering has limited impact compared to model selection. GPT-4 with a mediocre prompt outperforms GPT-3.5 with an optimized prompt for most tasks.

Claude, GPT-4, and Gemini have different strengths. Claude is often better at long-form content and analysis. GPT-4 has strong general capabilities. Gemini has some advantages in multimodal tasks.

Picking the right model for your task matters more than optimizing prompts.

Context Window Utilization

Modern LLMs have large context windows (100k+ tokens). Using this effectively matters.

Good uses of context:

  • Providing relevant documentation
  • Including conversation history
  • Giving examples and specifications
  • Adding domain-specific information

Poor uses of context:

  • Dumping entire codebases hoping the model figures it out
  • Including irrelevant information
  • Exceeding practical context limits (models degrade with very long contexts)

Quality of context matters more than quantity.
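
One way to act on that is to select context against a budget instead of dumping everything in. The word-overlap scoring below is deliberately naive (production systems typically use embeddings or a retriever), but it illustrates the idea:

    def select_context(query: str, docs: list[str], budget_chars: int = 8000) -> str:
        # Rank documents by crude word overlap with the query.
        q_words = set(query.lower().split())
        ranked = sorted(
            docs,
            key=lambda d: len(q_words & set(d.lower().split())),
            reverse=True,
        )
        # Greedily keep the most relevant documents that fit the budget.
        picked, used = [], 0
        for doc in ranked:
            if used + len(doc) <= budget_chars:
                picked.append(doc)
                used += len(doc)
        return "\n\n---\n\n".join(picked)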

Temperature and Parameters

The temperature setting controls output randomness. Lower temperatures (0-0.3) give more deterministic outputs; higher temperatures (0.7-1.0) give more creative variation.

For factual tasks, code generation, and analysis, use low temperature. For creative writing, brainstorming, and ideation, higher temperature can help.

Other parameters (top_p, frequency_penalty, presence_penalty) exist but matter less for most use cases.
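
For example, a sketch using the OpenAI Python SDK; the parameter works similarly in most chat APIs, and the model name is a placeholder:

    from openai import OpenAI

    client = OpenAI()

    def ask(prompt: str, temperature: float) -> str:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
        )
        return response.choices[0].message.content

    # Low temperature for extraction, analysis, and code generation.
    code = ask("Write a Python function that reverses a linked list.", temperature=0.1)

    # Higher temperature when you want variation across runs.
    ideas = ask("Brainstorm ten names for a note-taking app.", temperature=0.9)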

Testing Your Prompts

Don’t rely on anecdotes or one-off tests. Systematic testing reveals what actually improves outputs.

How to test:

  1. Define specific evaluation criteria
  2. Test multiple variations of your prompt
  3. Run each variation multiple times (LLM outputs vary)
  4. Measure results against your criteria
  5. Keep what works, discard what doesn’t

This is more work than copying “best practices” from blogs, but it’s the only way to know what works for your specific use case.
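
Here’s a minimal harness along those lines; the pass/fail criterion, prompt variants, and trial count are placeholders to adapt to your task:

    import json
    from statistics import mean

    def complete(prompt: str) -> str:
        # Placeholder: wire up your model client here.
        raise NotImplementedError

    def passes_criteria(output: str) -> bool:
        # Example criterion: the output parses as JSON with the required keys.
        try:
            return {"name", "price"} <= set(json.loads(output))
        except (ValueError, TypeError):
            return False

    def score_prompt(prompt: str, trials: int = 10) -> float:
        # Run each variant repeatedly, because outputs vary between calls.
        return mean(passes_criteria(complete(prompt)) for _ in range(trials))

    # In practice each variant would also include the source text to process.
    variants = {
        "terse": "Extract the name and price as JSON.",
        "explicit": 'Return only a JSON object with exactly two keys: "name" and "price".',
    }
    for label, prompt in variants.items():
        print(label, score_prompt(prompt))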

Organizations building out their AI strategy often find that systematic prompt testing and optimization delivers better results than following generic best practices.

Production Considerations

In production systems, prompt engineering intersects with MLOps:

Version control: Track prompt versions alongside code. Prompts are part of your system and should be versioned.

Monitoring: Log prompts and outputs. Monitor for quality degradation over time.

Cost optimization: Shorter prompts cost less. If you can achieve the same results with fewer tokens, do so.

Consistency: Deterministic prompts (clear instructions, low temperature) produce more predictable outputs, which matters for production systems.
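
As a sketch of what versioning and logging can look like in practice (the version tag and log fields are illustrative, not a standard):

    import hashlib
    import json
    import time

    PROMPT_VERSION = "summarize-v3"  # bump whenever the template changes
    TEMPLATE = "Summarize the following for a technical audience:\n\n{text}"

    def log_call(prompt: str, output: str, path: str = "prompt_log.jsonl") -> None:
        # Append one JSON record per call for later quality monitoring.
        record = {
            "ts": time.time(),
            "version": PROMPT_VERSION,
            "prompt_sha": hashlib.sha256(prompt.encode()).hexdigest()[:12],
            "prompt": prompt,
            "output": output,
        }
        with open(path, "a") as f:
            f.write(json.dumps(record) + "\n")

    # Usage: prompt = TEMPLATE.format(text=document); log_call(prompt, model_output)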

The Reality

Most prompt engineering advice is either obvious (be clear and specific) or untested folklore (magic phrases and role-playing).

What actually matters:

  • Clear task description
  • Relevant context
  • Appropriate examples
  • Structured output requests when needed
  • Choosing the right model
  • Testing systematically

Everything else is probably superstition or edge cases being generalized into universal rules.

Practical Recommendations

Start simple. Write clear prompts that specify what you want. Test them. Iterate based on results.

Don’t spend hours optimizing prompts before you’ve tested basic approaches. Complexity should be added only when simple approaches fail.

Focus more on model selection, context quality, and systematic testing than on finding magic prompt formulas.

What We’re Not Covering

This post focused on text generation with current LLMs. Prompting for image generation, multimodal models, or specialized models might have different considerations.

Agent-based systems and prompt chaining introduce additional complexity beyond single-prompt interactions. We’ll cover those in future posts.

Resources

For detailed testing of prompt techniques, Anthropic’s prompt engineering guide is research-backed rather than anecdotal.

OpenAI’s documentation on prompt engineering best practices is similarly useful.

Both are vendor documentation, so they focus on their own models, but the principles generally apply broadly.

Next post we’ll cover model selection criteria: how to choose between available models for specific use cases based on actual capabilities rather than marketing claims.