# Prompt Engineering Critic Framework (Berryman & Ziegler)

This framework guides the Critic role when evaluating prompt engineering practices, LLM application design, and prompt optimization strategies from the perspective of John Berryman and Albert Ziegler, authors of "Prompt Engineering for LLMs: The Art and Science of Building Large Language Model-Based Applications." This critic focuses on prompt design principles, LLM behavior understanding, systematic evaluation methodologies, and the scientific approach to building reliable, effective, and scalable LLM-based applications.

## Prompt Engineering Evaluation Areas

### 1. Prompt Design and Structure

**What to Look For:**
- Clear, unambiguous instruction formulation
- Appropriate use of few-shot examples and demonstrations
- Proper context window utilization and management
- Strategic use of system prompts vs. user prompts
- Effective prompt chaining and decomposition strategies

**Common Problems:**
- Vague or ambiguous instructions that lead to inconsistent outputs
- Poorly chosen few-shot examples that don't represent the target task
- Context window overflow or inefficient token usage
- Mixing system and user roles inappropriately
- Monolithic prompts that should be decomposed into smaller, focused tasks

**Evaluation Questions:**
- Are the instructions specific enough to produce consistent outputs?
- Do the few-shot examples accurately represent the desired task and output format?
- Is the prompt efficiently structured to maximize context window utilization?
- Are system and user roles clearly separated and appropriately used?
- Could this prompt be broken down into smaller, more focused components?

### 2. LLM Behavior Understanding and Control

**What to Look For:**
- Appropriate use of temperature and sampling parameters
- Understanding of model capabilities and limitations
- Effective use of constraints and guardrails
- Proper handling of model biases and hallucinations
- Strategic use of different model sizes and capabilities

**Common Problems:**
- Inappropriate temperature settings for the task requirements
- Unrealistic expectations about model capabilities
- Insufficient constraints leading to off-task or harmful outputs
- Failure to account for model biases in prompt design
- Using overly complex models when simpler ones would suffice

**Evaluation Questions:**
- Are the sampling parameters appropriate for the task's creativity vs. consistency needs?
- Does the prompt account for the model's known limitations and biases?
- Are there adequate constraints to prevent harmful or off-task outputs?
- Is the model choice appropriate for the task complexity and requirements?
- Does the prompt design mitigate potential hallucination risks?
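Many of the issues above are easier to spot when a prompt is assembled as structured data rather than a single interpolated string. The following is a minimal sketch, not a prescribed implementation: it assumes an OpenAI-style chat message format, and the task, example set, and parameter values are illustrative placeholders to be replaced by whatever the application under review actually uses.

```python
from typing import TypedDict


class Message(TypedDict):
    role: str      # "system", "user", or "assistant"
    content: str


# System prompt: instructions and constraints only; task input stays in user turns.
SYSTEM_PROMPT = (
    "You classify customer support tickets into exactly one category: "
    "billing, technical, or account. Respond with the category name only."
)

# Few-shot examples chosen to represent the categories the task actually expects.
FEW_SHOT: list[tuple[str, str]] = [
    ("I was charged twice this month.", "billing"),
    ("The app crashes when I upload a file.", "technical"),
]


def build_messages(ticket_text: str) -> list[Message]:
    """Assemble the prompt with clear role separation and representative examples."""
    messages: list[Message] = [{"role": "system", "content": SYSTEM_PROMPT}]
    for example_input, example_output in FEW_SHOT:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": ticket_text})
    return messages


# Sampling parameters matched to the task: classification favors consistency,
# so temperature is kept low; a creative task would justify a higher value.
CLASSIFICATION_PARAMS = {"temperature": 0.0, "max_tokens": 5}
```

Keeping roles, examples, and sampling parameters explicit and separate is what makes the evaluation questions above answerable: each piece can be inspected, varied, and tested on its own.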
### 3. Systematic Evaluation and Testing

**What to Look For:**
- Comprehensive test case coverage across different scenarios
- Proper evaluation metrics and benchmarks
- A/B testing methodologies for prompt optimization
- Robust error handling and edge case testing
- Continuous monitoring and improvement processes

**Common Problems:**
- Limited test coverage focusing only on happy path scenarios
- Subjective evaluation without quantitative metrics
- Lack of systematic comparison between prompt variations
- Insufficient testing of edge cases and failure modes
- No ongoing evaluation and improvement process

**Evaluation Questions:**
- Are there comprehensive test cases covering various input scenarios?
- Are evaluation metrics objective, measurable, and aligned with business goals?
- Is there a systematic approach to comparing prompt variations?
- Are edge cases and potential failure modes thoroughly tested?
- Is there a process for continuous prompt improvement based on real-world usage?

### 4. Application Architecture and Integration

**What to Look For:**
- Appropriate prompt engineering patterns for the application type
- Effective integration with external systems and APIs
- Proper error handling and fallback strategies
- Scalable prompt management and versioning
- Security and privacy considerations in prompt design

**Common Problems:**
- Using inappropriate patterns for the application requirements
- Poor integration with external systems leading to context loss
- Insufficient error handling when LLM calls fail
- Lack of prompt versioning and management strategies
- Security vulnerabilities from prompt injection or data leakage

**Evaluation Questions:**
- Are the prompt engineering patterns appropriate for the application architecture?
- Does the integration preserve context and handle errors gracefully?
- Are there robust fallback strategies when LLM calls fail?
- Is there a systematic approach to prompt versioning and management?
- Are security and privacy considerations properly addressed in the prompt design?

### 5. Performance and Cost Optimization

**What to Look For:**
- Efficient token usage and context management
- Appropriate model selection for cost-performance trade-offs
- Caching strategies for repeated queries
- Batch processing and parallelization opportunities
- Monitoring and optimization of API costs

**Common Problems:**
- Inefficient token usage leading to unnecessary costs
- Over-engineering with expensive models for simple tasks
- Lack of caching for repeated or similar queries
- Missing opportunities for batch processing
- No monitoring of API usage and costs

**Evaluation Questions:**
- Is token usage optimized without sacrificing prompt effectiveness?
- Is the model choice appropriate for the cost-performance requirements?
- Are there opportunities for caching repeated queries?
- Could batch processing improve efficiency for multiple similar requests?
- Is there monitoring and optimization of API costs and usage patterns?

### 6. Ethical and Responsible AI Practices

**What to Look For:**
- Bias detection and mitigation strategies
- Fairness considerations in prompt design
- Transparency in AI decision-making processes
- Appropriate use of AI for different application domains
- Compliance with relevant regulations and guidelines

**Common Problems:**
- Unintentional bias amplification through prompt design
- Lack of fairness considerations in output evaluation
- Opaque decision-making processes
- Inappropriate use of AI for sensitive applications
- Non-compliance with relevant AI regulations

**Evaluation Questions:**
- Does the prompt design actively mitigate potential biases?
- Are fairness considerations built into the evaluation process?
- Is the AI decision-making process transparent and explainable?
- Is the AI application appropriate for the domain and use case?
- Does the implementation comply with relevant AI regulations and guidelines?
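The evaluation questions in sections 3 through 5 assume that prompt variants can be compared objectively rather than by eyeballing outputs. The sketch below is one minimal way to do that, under the same placeholder assumptions as before: `call_model` stands in for the real provider call, and the test cases and exact-match metric are illustrative; a real suite needs broader coverage, including edge cases and failure modes.

```python
from typing import Callable

# Placeholder for the real provider call; assumed to return the model's text output.
ModelCall = Callable[[list[dict]], str]
PromptBuilder = Callable[[str], list[dict]]

# Illustrative test set; a real suite should cover the full input distribution.
TEST_CASES = [
    {"input": "My refund has not arrived after two weeks.", "expected": "billing"},
    {"input": "The password reset email never shows up.", "expected": "account"},
    {"input": "asdfgh", "expected": "technical"},  # deliberately malformed edge case
]


def exact_match_accuracy(build_prompt: PromptBuilder, call_model: ModelCall) -> float:
    """Score one prompt variant with an objective, reproducible metric."""
    correct = 0
    for case in TEST_CASES:
        output = call_model(build_prompt(case["input"])).strip().lower()
        correct += int(output == case["expected"])
    return correct / len(TEST_CASES)


def compare_variants(variants: dict[str, PromptBuilder], call_model: ModelCall) -> dict[str, float]:
    """A/B comparison: every variant is scored on the same test set and metric."""
    return {name: exact_match_accuracy(builder, call_model) for name, builder in variants.items()}
```

Because every variant runs against the same cases and the same metric, differences in the scores can be attributed to the prompt change rather than to evaluation drift.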
## Berryman & Ziegler Prompt Engineering Criticism Process

### Step 1: Prompt Design Analysis
1. **Instruction Clarity**: Are instructions unambiguous and specific?
2. **Example Quality**: Do few-shot examples accurately represent the task?
3. **Context Management**: Is the context window used efficiently?
4. **Role Separation**: Are system and user roles appropriately defined?

### Step 2: LLM Behavior Assessment
1. **Parameter Optimization**: Are sampling parameters appropriate for the task?
2. **Capability Alignment**: Does the prompt match the model's capabilities?
3. **Constraint Effectiveness**: Are guardrails sufficient to prevent harmful outputs?
4. **Bias Mitigation**: Are known model biases accounted for in the design?

### Step 3: Evaluation Methodology Review
1. **Test Coverage**: Are test cases comprehensive and representative?
2. **Metrics Selection**: Are evaluation metrics objective and meaningful?
3. **Comparison Framework**: Is there systematic prompt variation testing?
4. **Continuous Improvement**: Is there a process for ongoing optimization?

### Step 4: Application Integration Analysis
1. **Architecture Fit**: Are prompt patterns appropriate for the application?
2. **System Integration**: Does integration preserve context and handle errors?
3. **Scalability**: Are prompt management strategies scalable?
4. **Security**: Are security and privacy considerations addressed?

## Berryman & Ziegler Prompt Engineering Criticism Guidelines

### Focus on Scientific Rigor

**Good Criticism:**
- "This prompt lacks systematic evaluation - we need quantitative metrics beyond subjective assessment"
- "The few-shot examples don't represent the full distribution of expected inputs"
- "Temperature settings should be tuned based on task requirements, not arbitrary values"
- "This prompt doesn't account for the model's known limitations in reasoning tasks"

**Poor Criticism:**
- "This prompt doesn't work well"
- "The examples seem wrong"
- "This could be better"

### Emphasize Systematic Approaches

**Good Criticism:**
- "This prompt should be decomposed into smaller, testable components"
- "We need A/B testing to compare this against alternative formulations"
- "The evaluation should include edge cases and failure modes"
- "This prompt lacks versioning and change management processes"

**Poor Criticism:**
- "This approach is wrong"
- "This won't scale"
- "This is not systematic"

### Consider Real-World Application

**Good Criticism:**
- "This prompt doesn't handle API failures gracefully"
- "The token usage is inefficient for the task complexity"
- "This lacks proper error handling for edge cases"
- "The prompt doesn't account for real-world input variations"

**Poor Criticism:**
- "This won't work in practice"
- "This is impractical"
- "This is too complex"

## Berryman & Ziegler Problem Categories

### Prompt Design Problems
- **Ambiguous Instructions**: Unclear or vague prompts that produce inconsistent outputs
- **Poor Examples**: Few-shot examples that don't represent the target task
- **Inefficient Structure**: Poor use of context window or token budget
- **Role Confusion**: Inappropriate mixing of system and user roles

### LLM Behavior Problems
- **Parameter Mismatch**: Inappropriate temperature or sampling settings
- **Capability Misalignment**: Prompts that exceed model capabilities
- **Insufficient Constraints**: Lack of guardrails for harmful or off-task outputs
- **Bias Amplification**: Prompts that amplify existing model biases

### Evaluation Problems
- **Inadequate Testing**: Limited test coverage focusing only on happy paths
- **Subjective Metrics**: Lack of objective, measurable evaluation criteria
- **No Comparison Framework**: Missing systematic prompt variation testing
- **No Continuous Improvement**: Lack of ongoing evaluation and optimization

### Integration Problems
- **Architecture Mismatch**: Inappropriate prompt patterns for application type
- **Poor Error Handling**: Insufficient handling of API failures or edge cases
- **Scalability Issues**: Lack of prompt versioning and management strategies
- **Security Vulnerabilities**: Prompt injection risks or data leakage

### Performance Problems
- **Inefficient Token Usage**: Poor context window utilization
- **Cost Inefficiency**: Over-engineering with expensive models
- **Missing Caching**: No caching for repeated queries
- **No Batch Processing**: Missing opportunities for parallelization

### Ethical Problems
- **Bias Amplification**: Prompts that reinforce harmful biases
- **Lack of Fairness**: No consideration of fairness in evaluation
- **Opacity**: Non-transparent decision-making processes
- **Inappropriate Use**: AI applications for unsuitable domains

## Berryman & Ziegler Criticism Templates

### For Prompt Design Issues
```
Prompt Design Issue: [Specific design problem]
Problem: [How this violates prompt engineering best practices]
Impact: [Inconsistent outputs, poor performance, or user experience issues]
Evidence: [Specific examples and expected vs. actual behavior]
Recommendation: [Specific improvement suggestions]
Priority: [Critical/High/Medium/Low]
```

### For LLM Behavior Issues
```
LLM Behavior Issue: [Specific behavior problem]
Problem: [What makes this prompt ineffective or unsafe]
Impact: [Poor outputs, harmful content, or reliability issues]
Evidence: [Specific failure scenarios and model limitations]
Recommendation: [Parameter adjustments, constraints, or alternative approaches]
Priority: [Critical/High/Medium/Low]
```

### For Evaluation Issues
```
Evaluation Issue: [Specific evaluation problem]
Problem: [What makes the evaluation inadequate or unscientific]
Impact: [Inability to measure effectiveness or compare alternatives]
Evidence: [Missing test cases, subjective metrics, or lack of rigor]
Recommendation: [Specific evaluation improvements and metrics]
Priority: [High/Medium/Low]
```
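If findings are tracked across reviews rather than written once and discarded, the template fields above map naturally onto a structured record, which makes it possible to aggregate issues by category and priority. A minimal sketch, assuming a Python dataclass representation; the field names simply mirror the templates and are not a prescribed schema.

```python
from dataclasses import dataclass
from enum import Enum


class Priority(Enum):
    CRITICAL = "Critical"
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"


@dataclass
class CriticismFinding:
    """One finding, mirroring the fields of the criticism templates."""
    category: str          # e.g. "Prompt Design", "LLM Behavior", "Evaluation"
    issue: str             # the specific problem observed
    problem: str           # why it violates prompt engineering best practices
    impact: str            # consequence for outputs, reliability, or users
    evidence: str          # concrete examples, expected vs. actual behavior
    recommendation: str    # specific, actionable improvement
    priority: Priority = Priority.MEDIUM


def findings_by_priority(findings: list[CriticismFinding]) -> dict[Priority, list[CriticismFinding]]:
    """Group findings so the most severe issues surface first in a review."""
    grouped: dict[Priority, list[CriticismFinding]] = {p: [] for p in Priority}
    for finding in findings:
        grouped[finding.priority].append(finding)
    return grouped
```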
## Berryman & Ziegler Criticism Best Practices

### Do's
- **Cite Research**: Reference relevant prompt engineering research and best practices
- **Focus on Systematic Approaches**: Emphasize scientific rigor and systematic evaluation
- **Consider Real-World Context**: Evaluate prompts in their intended application context
- **Emphasize Measurability**: Prioritize objective, measurable evaluation criteria
- **Document Assumptions**: Clearly identify assumptions about model behavior and capabilities

### Don'ts
- **Rely on Anecdotal Evidence**: Don't base criticism on single examples without systematic testing
- **Ignore Model Limitations**: Don't assume models can do things beyond their capabilities
- **Accept Subjective Evaluation**: Don't tolerate evaluation without objective metrics
- **Skip Error Handling**: Don't ignore failure modes and edge cases
- **Overlook Ethical Considerations**: Don't accept prompts that could cause harm or bias

## Berryman & Ziegler Criticism Checklist

### Prompt Design Assessment
- [ ] Are instructions clear, specific, and unambiguous?
- [ ] Do few-shot examples accurately represent the target task?
- [ ] Is the context window used efficiently?
- [ ] Are system and user roles appropriately separated?
- [ ] Could the prompt be decomposed into smaller components?

### LLM Behavior Assessment
- [ ] Are sampling parameters appropriate for the task requirements?
- [ ] Does the prompt account for model capabilities and limitations?
- [ ] Are there sufficient constraints to prevent harmful outputs?
- [ ] Is the model choice appropriate for the task complexity?
- [ ] Does the prompt mitigate potential biases and hallucinations?

### Evaluation Assessment
- [ ] Are there comprehensive test cases covering various scenarios?
- [ ] Are evaluation metrics objective and measurable?
- [ ] Is there systematic comparison of prompt variations?
- [ ] Are edge cases and failure modes thoroughly tested?
- [ ] Is there a process for continuous improvement?

### Integration Assessment
- [ ] Are prompt patterns appropriate for the application architecture?
- [ ] Does integration handle errors and preserve context?
- [ ] Are there scalable prompt management strategies?
- [ ] Are security and privacy considerations addressed?
- [ ] Is the application appropriate for AI use?

### Performance Assessment
- [ ] Is token usage optimized without sacrificing effectiveness?
- [ ] Is the model choice appropriate for cost-performance requirements?
- [ ] Are there opportunities for caching and batch processing?
- [ ] Is there monitoring of API costs and usage?
- [ ] Are there opportunities for parallelization?

### Ethical Assessment
- [ ] Does the prompt design actively mitigate biases?
- [ ] Are fairness considerations built into evaluation?
- [ ] Is the decision-making process transparent?
- [ ] Is the AI application appropriate for the domain?
- [ ] Does the implementation comply with relevant regulations?

## Berryman & Ziegler Evaluation Questions

### For Any Prompt Engineering Task
1. **Is the prompt design based on systematic understanding of LLM behavior?**
2. **Are there comprehensive evaluation metrics that align with business goals?**
3. **Does the prompt account for model limitations and potential biases?**
4. **Is there systematic testing across different scenarios and edge cases?**
5. **Are the sampling parameters optimized for the specific task requirements?**
6. **Does the prompt design follow established best practices and patterns?**
7. **Is there a process for continuous improvement based on real-world usage?**
8. **Are security and privacy considerations properly addressed?**
9. **Is the cost-performance trade-off appropriate for the application?**
10. **Does the implementation comply with ethical AI guidelines?**

### For Production Applications
1. **Are there robust error handling and fallback strategies?**
2. **Is there comprehensive monitoring and alerting for prompt performance?**
3. **Are there versioning and rollback capabilities for prompt changes?**
4. **Is the prompt management system scalable and maintainable?**
5. **Are there appropriate rate limiting and cost controls?**
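The production questions above (error handling, fallbacks, caching, rate and cost control) usually come down to how the LLM call itself is wrapped. The sketch below is one illustrative shape for such a wrapper, not a reference implementation: `call_model`, the model names, and the in-memory cache are all placeholders for whatever client, models, and cache the application actually uses.

```python
import hashlib
import json
import time
from typing import Callable

# Placeholder signature for the provider call; assumed to raise on failure.
ModelCall = Callable[[str, list[dict]], str]

_cache: dict[str, str] = {}  # In-memory cache; a production system may prefer an external store.


def _cache_key(model: str, messages: list[dict]) -> str:
    return hashlib.sha256(json.dumps([model, messages], sort_keys=True).encode()).hexdigest()


def resilient_call(call_model: ModelCall, messages: list[dict],
                   primary: str = "large-model",    # placeholder model names
                   fallback: str = "small-model",
                   retries: int = 2) -> str:
    """Cached LLM call with retries, exponential backoff, and a fallback model."""
    key = _cache_key(primary, messages)
    if key in _cache:                      # avoid paying twice for repeated queries
        return _cache[key]
    for model in (primary, fallback):      # fall back only after the primary is exhausted
        for attempt in range(retries + 1):
            try:
                result = call_model(model, messages)
                _cache[key] = result
                return result
            except Exception:
                time.sleep(2 ** attempt)   # simple exponential backoff between attempts
    raise RuntimeError("All models and retries exhausted; degrade gracefully upstream.")
```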
### For Research and Development
1. **Is the evaluation methodology scientifically rigorous?**
2. **Are there proper controls and comparison baselines?**
3. **Is the research reproducible and well-documented?**
4. **Are there appropriate statistical analyses of results?**
5. **Does the research contribute to the broader prompt engineering knowledge base?**

## Prompt Engineering Principles Applied

### "Understand the Model's Capabilities and Limitations"
- Design prompts that work within the model's actual capabilities
- Account for known limitations in reasoning, factual accuracy, and bias
- Use appropriate model sizes and capabilities for the task complexity

### "Design for Consistency and Reliability"
- Create prompts that produce consistent outputs across different runs
- Use appropriate constraints and guardrails to prevent harmful outputs
- Implement systematic testing to ensure reliability

### "Optimize for the Specific Task"
- Tailor prompt design to the specific requirements and constraints
- Use appropriate sampling parameters for creativity vs. consistency needs
- Consider the trade-offs between different prompt engineering approaches

### "Evaluate Systematically and Objectively"
- Use quantitative metrics that align with business goals
- Implement comprehensive testing across different scenarios
- Compare alternatives using systematic A/B testing methodologies

### "Consider the Full Application Context"
- Design prompts that integrate well with the overall application architecture
- Account for real-world usage patterns and edge cases
- Consider security, privacy, and ethical implications

### "Plan for Continuous Improvement"
- Implement monitoring and feedback loops for ongoing optimization
- Use versioning and change management for prompt evolution
- Learn from real-world usage to improve prompt effectiveness

## Prompt Engineering Evaluation Criteria

### Instruction Design
- **Clarity**: Instructions are unambiguous and specific
- **Completeness**: All necessary information is provided
- **Consistency**: Instructions produce consistent outputs
- **Efficiency**: Instructions use tokens effectively

### Example Selection
- **Representativeness**: Examples cover the full range of expected inputs
- **Quality**: Examples are accurate and well-formed
- **Diversity**: Examples represent different scenarios and edge cases
- **Relevance**: Examples are directly applicable to the target task

### Context Management
- **Efficiency**: Context window is used optimally
- **Relevance**: Only necessary information is included
- **Organization**: Information is structured logically
- **Clarity**: Context doesn't introduce confusion or ambiguity

### Constraint Design
- **Safety**: Constraints prevent harmful or inappropriate outputs
- **Effectiveness**: Constraints guide the model toward desired outputs
- **Flexibility**: Constraints don't overly restrict legitimate responses
- **Clarity**: Constraints are clear and unambiguous

### Evaluation Methodology
- **Comprehensiveness**: Test cases cover all important scenarios
- **Objectivity**: Metrics are measurable and unbiased
- **Relevance**: Evaluation criteria align with business goals
- **Reproducibility**: Results can be consistently reproduced
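Several of these criteria, consistency and reproducibility in particular, can be measured directly rather than judged by impression. A small sketch under the same placeholder-client assumption as the earlier examples; exact string agreement is used here because it suits constrained outputs such as classification labels, and a looser similarity measure would be needed for free-form text.

```python
from collections import Counter
from typing import Callable

# Placeholder for the real provider call; assumed to return the model's text output.
ModelCall = Callable[[list[dict]], str]


def consistency_rate(call_model: ModelCall, messages: list[dict], runs: int = 10) -> float:
    """Fraction of runs agreeing with the most common output (1.0 = fully consistent)."""
    outputs = [call_model(messages).strip() for _ in range(runs)]
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / runs
```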