Research: How to Build Truly Expert AI Agents — Comprehensive Guide
A Comprehensive Research Document for WiderWings
Research compiled Feb 27, 2026 by Mark Wings
Sources: Anthropic, OpenAI, DeepLearning.AI (Andrew Ng), Lilian Weng (OpenAI), CrewAI, OpenAI Swarm
Executive Summary
Building expert AI agents is not about adding complexity — it is about the right complexity in the right places. The most successful agent implementations use simple, composable patterns rather than heavy frameworks. This document synthesizes best practices from the leading practitioners in the field across six dimensions: Identity & Persona, Instructions & Prompting, Memory, Tools, Workflow Patterns, and Multi-Agent Orchestration.
Key Takeaway: Start simple. Only add complexity when it demonstrably improves outcomes. A well-prompted single agent with the right tools will outperform a poorly-designed multi-agent system every time.
1. IDENTITY & PERSONA
Why It Matters
Setting a role in the system prompt focuses the model's behavior, tone, and decision-making. Even a single sentence ("You are a helpful coding assistant specializing in Python") makes a measurable difference (Anthropic docs). But truly expert agents go much deeper.
Best Practices
a) Define Role + Goal + Backstory (the CrewAI Model)
CrewAI's agent architecture uses three core identity fields that are now considered best practice:
- Role: Defines function and expertise (e.g., "Senior Frontend Developer specializing in Svelte and Tailwind")
- Goal: The individual objective that guides decision-making (e.g., "Build pixel-perfect, accessible UIs that load in under 2 seconds")
- Backstory: Provides context and personality, enriching interactions (e.g., "You've shipped 50+ production Svelte apps. You have strong opinions about component architecture and hate unnecessary dependencies.")
This trio creates a much stronger behavioral anchor than role alone.
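As a minimal sketch of how the three fields combine into a single behavioral anchor, the `Agent` dataclass below is illustrative only (it is not CrewAI's actual API):

```python
from dataclasses import dataclass

@dataclass
class Agent:
    """Illustrative identity container mirroring CrewAI's three fields."""
    role: str
    goal: str
    backstory: str

    def system_prompt(self) -> str:
        # Fold the three identity fields into one system prompt.
        return (
            f"You are a {self.role}.\n"
            f"Your goal: {self.goal}\n"
            f"Backstory: {self.backstory}"
        )

frontend = Agent(
    role="Senior Frontend Developer specializing in Svelte and Tailwind",
    goal="Build pixel-perfect, accessible UIs that load in under 2 seconds",
    backstory=(
        "You've shipped 50+ production Svelte apps. You have strong opinions "
        "about component architecture and hate unnecessary dependencies."
    ),
)
print(frontend.system_prompt())
```

In real CrewAI these fields are passed to its `Agent` constructor; the point here is simply that role, goal, and backstory each contribute a distinct line of steering.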
b) Personality Should Be Specific, Not Generic
- Bad: "You are a helpful assistant"
- Good: "You are a senior backend engineer who values clean APIs, writes comprehensive error handling, and pushes back on scope creep. You prefer PostgreSQL over NoSQL and always consider migration paths."
- The more specific the personality, the more consistent and expert the outputs
c) Include Anti-Patterns
Explicitly state what the agent should NOT do:
- "Never give generic advice. Always ground recommendations in our specific codebase."
- "Don't ask for permission on routine tasks. Just do them and report what you did."
- "Don't be sycophantic. If the proposed approach is wrong, say so directly."
d) Give Them Opinions
Expert humans have preferences and opinions. Expert agents should too:
- "You prefer functional components over class components"
- "You believe in testing the behavior, not the implementation"
- "When given a choice between speed and correctness, always choose correctness first"
Template for Agent Identity
# SOUL.md - [Agent Name]
## Core Role
[1-2 sentences defining what this agent does]
## Expertise
[Specific technologies, domains, skills — be granular]
## Operating Principles
[5-8 principles that guide behavior, including opinions and preferences]
## Anti-Patterns (What NOT to Do)
[3-5 explicit things to avoid]
## Communication Style
[How they talk: formal/casual, verbose/terse, emoji usage, etc.]
## Boundaries
[What's in scope vs. out of scope for this agent]
2. INSTRUCTIONS & PROMPTING
The Golden Rule (Anthropic)
"Show your prompt to a colleague with minimal context on the task and ask them to follow it. If they'd be confused, Claude will be too."
Key Techniques
a) Be Clear and Direct
- Think of the agent as "a brilliant but new employee who lacks context on your norms and workflows"
- The more precisely you explain what you want, the better the result
- Use numbered steps when order matters
- If you want "above and beyond" behavior, explicitly request it
b) Provide Context/Motivation
- Don't just say WHAT to do — say WHY
- Less effective: "Never use ellipses"
- More effective: "Your response will be read aloud by TTS, so never use ellipses since the TTS engine won't know how to pronounce them"
- The model generalizes from the explanation
c) Use Few-Shot Examples (3-5)
Examples are THE most reliable way to steer output. Make them:
- Relevant: Mirror actual use cases
- Diverse: Cover edge cases
- Structured: Wrap in `<example>` tags so the model distinguishes them from instructions
d) Structure with XML Tags
- Wrap different content types in their own tags: `<instructions>`, `<context>`, `<input>`
- Use consistent, descriptive tag names
- Nest tags for hierarchical content
- This is especially important for complex prompts mixing instructions, context, and examples
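The two techniques above compose naturally. A hedged sketch of prompt assembly (the helper and tag names are conventions, not a required API):

```python
# Assemble a prompt using XML tags to separate instructions, context,
# and few-shot examples so the model can tell them apart.
def build_prompt(instructions: str, context: str, examples: list[str]) -> str:
    example_block = "\n".join(f"<example>\n{e}\n</example>" for e in examples)
    return (
        f"<instructions>\n{instructions}\n</instructions>\n"
        f"<context>\n{context}\n</context>\n"
        f"<examples>\n{example_block}\n</examples>"
    )

prompt = build_prompt(
    instructions="Summarize the ticket in one sentence.",
    context="Internal bug tracker for the payments team.",
    examples=[
        "Input: 'Checkout 500s on retry' -> Output: "
        "'Retrying checkout triggers a server error.'"
    ],
)
print(prompt)
```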
e) Developer vs. User Message Hierarchy (OpenAI)
OpenAI's model spec defines a chain of command:
- Developer messages = system's rules and business logic (like a function definition)
- User messages = inputs and configuration (like function arguments)
- Developer messages take priority
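In the familiar chat-message shape, the chain of command looks like this (a sketch; some APIs use `"system"` where others accept a `"developer"` role):

```python
# Developer/system message carries the rules; user messages carry inputs.
messages = [
    {
        "role": "developer",  # "system" on APIs without a developer role
        "content": "You are a refund assistant. Never promise refunds over $100.",
    },
    {"role": "user", "content": "Can I get a $500 refund?"},
]
```

If the user message conflicts with the developer message, the developer message wins.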
f) Prompt Structure (OpenAI Recommended Order)
- Identity: Purpose, communication style, high-level goals
- Instructions: Rules, what to do and not do, tool usage guidance
- Examples: Possible inputs with desired outputs
- Context: Additional information (private data, etc.) — best near the end
g) Long Context: Put Data at Top, Questions at Bottom
- Place long documents/inputs at the top of the prompt
- Place queries, instructions, and examples below
- Queries at the end can improve response quality by up to 30% (Anthropic testing)
3. MEMORY ARCHITECTURE
The Three Types of Agent Memory (Lilian Weng / OpenAI)
Mapping human memory to AI systems:
| Human Memory | AI Equivalent | Implementation |
|---|---|---|
| Sensory Memory | Embedding representations | Raw input processing |
| Short-Term / Working Memory | In-context learning | The conversation window (finite, bounded by context length) |
| Long-Term Memory | External vector store | Database with fast retrieval (RAG, embeddings, structured storage) |
Best Practices for Agent Memory
a) Tiered Memory System
- Session Memory (Working): Current conversation context. Limited by token window.
- Short-Term Persistent: Daily notes, recent task logs. File-based or database.
- Long-Term Curated: Distilled insights, decisions, lessons learned. Regularly maintained.
- Shared Knowledge Base: Team-wide second brain (like our Second Brain at brain.widerwings.com)
b) Memory Maintenance is Critical
- Raw logs accumulate but lose value over time
- Periodic "memory maintenance" sessions where the agent reviews raw notes and distills key learnings into long-term memory
- This is analogous to how humans sleep and consolidate memories
- Schedule this during heartbeats or low-activity periods
c) Memory Search Before Action
Before answering questions about prior work, the agent should always search its memory first:
- Prevents contradictions
- Prevents duplicate work
- Builds on previous decisions rather than reinventing
d) Structured Memory Saves
When saving to memory, use consistent structure:
- Type/Category (research, decision, lesson, task, etc.)
- Title (searchable)
- Content (the substance)
- Project/Context (what it relates to)
- Importance (for prioritization during retrieval)
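A minimal sketch of such a record as a dataclass (the `MemoryEntry` schema and its field names are hypothetical, matching the list above):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class MemoryEntry:
    """Hypothetical structured record for consistent memory saves."""
    type: str        # research, decision, lesson, task, ...
    title: str       # searchable
    content: str     # the substance
    project: str     # what it relates to
    importance: int  # e.g. 1 (low) to 5 (high), used to rank retrieval
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

entry = MemoryEntry(
    type="decision",
    title="Chose PostgreSQL over NoSQL for the billing service",
    content="Relational constraints and migration paths outweighed flexibility.",
    project="billing",
    importance=4,
)
```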
e) Context Window Management
When conversations get long:
- Summarize older messages to free context space
- Use external memory retrieval instead of keeping everything in-context
- CrewAI's `respect_context_window` flag auto-summarizes when nearing limits
4. TOOLS & TOOL DESIGN
Anthropic's Key Insight
"It is crucial to design toolsets and their documentation clearly and thoughtfully." The agent-computer interface (ACI) is as important as the UI is for human users.
Best Practices
a) Prompt Engineer Your Tools
- Tool names should be clear and descriptive
- Tool descriptions should explain WHEN to use them, not just what they do
- Include examples in tool descriptions
- Define parameter types and constraints explicitly
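These points can be illustrated with a tool definition in the JSON-schema style used by most LLM tool APIs (the `search_memory` tool itself is hypothetical):

```python
# Note the description says WHEN to use the tool and includes an example call,
# and the schema constrains parameter types and ranges explicitly.
search_memory_tool = {
    "name": "search_memory",
    "description": (
        "Search the agent's long-term memory. Use this BEFORE answering any "
        "question about prior work or past decisions. "
        "Example: search_memory(query='postgres migration decision', limit=5)"
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Free-text search query."},
            "limit": {
                "type": "integer",
                "minimum": 1,
                "maximum": 20,
                "description": "Max results to return (default 5).",
            },
        },
        "required": ["query"],
    },
}
```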
b) Right-Size the Toolset
- Too few tools = the agent can't accomplish the task
- Too many tools = the agent gets confused about which to use
- Start small, add tools as needed
- Group related tools logically
c) Tools Should Return Useful Feedback
- Success/failure messages should be informative
- Include enough context for the agent to decide next steps
- Error messages should suggest corrective actions
d) Categorize Tools by Risk Level
- Safe (auto-execute): Read files, search, calculate
- Moderate (log): Write files, make API calls
- High (ask permission): Send emails, delete data, external posts
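A sketch of how such risk gating might be enforced in a dispatcher (the tool names and `Risk` enum are illustrative):

```python
from enum import Enum

class Risk(Enum):
    SAFE = "safe"          # auto-execute
    MODERATE = "moderate"  # execute, but log for audit
    HIGH = "high"          # require human approval first

# Hypothetical mapping from tool name to risk level.
TOOL_RISK = {
    "read_file": Risk.SAFE,
    "write_file": Risk.MODERATE,
    "send_email": Risk.HIGH,
}

def dispatch(tool_name: str, approved: bool = False) -> str:
    # Unknown tools default to HIGH: fail closed, not open.
    risk = TOOL_RISK.get(tool_name, Risk.HIGH)
    if risk is Risk.HIGH and not approved:
        return "blocked: awaiting permission"
    if risk is Risk.MODERATE:
        print(f"audit: executing {tool_name}")
    return "executed"
```

Defaulting unknown tools to the highest risk tier keeps newly added tools safe until someone classifies them.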
5. WORKFLOW PATTERNS
The Anthropic Taxonomy (from "Building Effective Agents")
Anthropic identifies a progression of patterns, from simple to complex. Only increase complexity when it demonstrably improves outcomes.
Pattern 1: Prompt Chaining
- Decompose a task into sequential steps
- Each LLM call processes the output of the previous one
- Add programmatic checks ("gates") at intermediate steps
- Use when: Task cleanly decomposes into fixed subtasks
- Example: Generate marketing copy → translate it → review translation
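The marketing-copy example can be sketched as a chain with one programmatic gate; `call_llm` is a stand-in for a real model call, not any particular SDK:

```python
def call_llm(prompt: str) -> str:
    # Stand-in for a real model call.
    return f"[output for: {prompt}]"

def chain(task: str) -> str:
    copy = call_llm(f"Write marketing copy for: {task}")
    # Gate: a cheap programmatic check before the next (expensive) step.
    if len(copy) < 10:
        raise ValueError("Copy too short; halting the chain.")
    translation = call_llm(f"Translate to German: {copy}")
    review = call_llm(f"Review this translation for accuracy: {translation}")
    return review
```

Each step consumes the previous step's output, and the gate stops the chain early instead of wasting calls on a bad intermediate result.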
Pattern 2: Routing
- Classify input and direct to specialized handler
- Allows separation of concerns and specialized prompts
- Use when: Distinct categories that benefit from separate handling
- Example: Route easy questions to Haiku, hard ones to Opus
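The routing example can be sketched like so (the classifier heuristic and model names are placeholders; in practice classification is itself a cheap LLM call):

```python
def classify(question: str) -> str:
    # Stand-in classifier; real systems would use a small model here.
    hard_markers = ("architecture", "debug", "design")
    return "hard" if any(m in question.lower() for m in hard_markers) else "easy"

# Hypothetical model identifiers: cheap questions go to a small model.
MODEL_BY_ROUTE = {"easy": "claude-haiku", "hard": "claude-opus"}

def route(question: str) -> str:
    return MODEL_BY_ROUTE[classify(question)]
```

The win is separation of concerns: each handler gets its own specialized prompt, and cost tracks difficulty.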
Pattern 3: Parallelization
- Sectioning: Break task into independent subtasks run in parallel
- Voting: Run same task multiple times for diverse outputs
- Use when: Subtasks can be parallelized, or multiple perspectives improve confidence
- Example: One agent processes user query while another screens for inappropriate content
Pattern 4: Orchestrator-Workers
- Central LLM dynamically breaks down tasks and delegates to worker LLMs
- Key difference from parallelization: subtasks aren't pre-defined
- Use when: Can't predict subtasks needed (e.g., complex coding changes)
- Example: Coding products that modify multiple files per task
Pattern 5: Evaluator-Optimizer
- One LLM generates, another evaluates, in a loop
- Use when: Clear evaluation criteria exist and iteration provides measurable value
- Example: Literary translation with iterative refinement
Andrew Ng's Four Agentic Design Patterns
Reflection: LLM examines its own work and improves it. Simple to implement, surprising performance gains. Can be single-agent (self-critique) or multi-agent (generator + critic).
Tool Use: LLM uses web search, code execution, APIs to gather information and take action.
Planning: LLM creates and executes multi-step plans. More unpredictable but powerful for complex tasks.
Multi-Agent Collaboration: Different agents with different roles collaborate. More below.
Critical Insight: Iteration Beats Sophistication
Andrew Ng's HumanEval benchmark results:
- GPT-3.5 zero-shot: 48.1% correct
- GPT-4 zero-shot: 67.0% correct
- GPT-3.5 with agent loop: 95.1% correct
An older model with agentic workflow dramatically outperforms a newer model without one. The workflow architecture matters more than the model.
6. MULTI-AGENT ORCHESTRATION
Why Multiple Agents Work (Even with the Same LLM)
Andrew Ng offers three reasons why prompting the same LLM as different agents outperforms a single agent:
- It empirically works. Ablation studies (e.g., AutoGen paper) confirm superior performance.
- Focus beats breadth. Even with long context windows, LLMs understand focused tasks better. By decomposing into roles, you can optimize each subtask individually.
- Developer abstraction. Multi-agent design gives developers a natural framework for breaking down complex tasks — like splitting a project across team members with different specialties.
The Handoff Pattern (OpenAI Swarm)
A key concept: agents "hand off" conversations to other agents:
- Each agent has: name, model, instructions, tools
- When an agent needs help outside its domain, it transfers the conversation
- The receiving agent gets full conversation history
- Simple, powerful, controllable
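A minimal sketch of the handoff idea (this mirrors Swarm's shape but is not its API; the keyword-matching trigger is a placeholder for a real model decision):

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    instructions: str
    # Handoff targets this agent may transfer the conversation to.
    handoffs: dict[str, "Agent"] = field(default_factory=dict)

def run(agent: Agent, history: list[dict]) -> Agent:
    """Stand-in turn: if the last message hits another domain, hand off."""
    last = history[-1]["content"].lower()
    for domain, target in agent.handoffs.items():
        if domain in last:
            # The receiving agent gets the full conversation history.
            return target
    return agent

billing = Agent("billing", "Handle invoices and refunds.")
triage = Agent("triage", "Classify requests.", handoffs={"refund": billing})

history = [{"role": "user", "content": "I need a refund for last month."}]
active = run(triage, history)
```

Because the whole history transfers with the handoff, the receiving agent needs no re-briefing, which is what keeps the pattern simple and controllable.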
Practical Multi-Agent Design
a) Define Clear Boundaries
Each agent should have:
- A specific domain of expertise
- Clear input/output expectations
- Defined handoff conditions (when to escalate or transfer)
b) Avoid Over-Decomposition
- Don't create 10 agents when 3 will do
- Each agent should have enough scope to be useful independently
- If an agent's job description is one sentence, it's probably too narrow
c) Communication Protocol
- Standardize how agents share information (shared memory, message passing, structured handoffs)
- Define what context transfers with a handoff
- Use a coordinator/orchestrator agent for complex workflows
d) The Manager Analogy
Andrew Ng: "In many companies, managers decide what roles to hire, then how to split complex projects into smaller tasks assigned to employees with different specialties. Using multiple agents is analogous."
7. PRACTICAL RECOMMENDATIONS FOR WIDERWINGS
Based on this research, here's what we should consider for our agent team:
A. Enhance Agent Identity Files
Every agent should have a rich SOUL.md with:
- Specific role, goal, and backstory
- Strong opinions and preferences relevant to their domain
- Anti-patterns (what NOT to do)
- Communication style
- Clear boundaries
B. Add Few-Shot Examples
For each agent's core tasks, include 3-5 examples of ideal outputs:
- Kai: Example component implementations with ideal structure
- Atlas: Example API designs with ideal patterns
- Maya: Example blog posts with ideal tone and SEO structure
- Mark: Example research documents with ideal depth
C. Implement Reflection Loops
For important outputs, have agents self-critique before delivering:
- Code agents: Review own code for bugs, style, edge cases
- Content agents: Review own content for accuracy, tone, completeness
- This can be done within a single agent turn ("Now review what you just wrote...")
D. Strengthen Memory Practices
- All agents should search Second Brain before starting any task
- All agents should save results to Second Brain immediately when done
- Schedule periodic memory maintenance (distill daily notes into long-term insights)
- Standardize memory save format across agents
E. Right-Size Tool Access
- Each agent should have only the tools relevant to their domain
- Avoid giving every agent every tool
- Document tools thoroughly — tool descriptions are prompts too
F. Use the Simplest Pattern That Works
- Default to single-agent execution
- Use prompt chaining for multi-step tasks with clear sequence
- Use orchestrator-workers only for truly dynamic decomposition
- Add reflection when output quality is critical
G. Evaluate and Iterate
- Build evals: define what "good output" looks like for each agent
- Test prompts before deploying
- Pin model versions for consistency
- Measure before and after when making changes
Sources
- Anthropic — "Building Effective Agents" (2024): https://www.anthropic.com/engineering/building-effective-agents
- Anthropic — Claude Prompting Best Practices (2026): https://platform.claude.com/docs/en/build-with-claude/prompt-engineering/claude-prompting-best-practices
- OpenAI — Prompt Engineering Guide (2026): https://developers.openai.com/api/docs/guides/prompt-engineering
- OpenAI — Orchestrating Agents: Routines and Handoffs (Swarm): https://developers.openai.com/cookbook/examples/orchestrating_agents
- Lilian Weng (OpenAI) — LLM Powered Autonomous Agents (2023): https://lilianweng.github.io/posts/2023-06-23-agent/
- Andrew Ng / DeepLearning.AI — Agentic Design Patterns Parts 1-5 (2024): https://www.deeplearning.ai/the-batch/how-agents-can-improve-llm-performance/
- CrewAI — Agent Documentation (2026): https://docs.crewai.com/concepts/agents
This is a living document. As we refine our agent architecture, update with lessons learned.
Created: Fri, Feb 27, 2026, 1:03 PM by mark
Updated: Fri, Feb 27, 2026, 1:03 PM