begriffs open source - ai-review/blob - critic/prompting.md

   1 # Prompt Engineering Critic Framework (Berryman & Ziegler)
   2
   3 This framework guides the Critic role when evaluating prompt engineering practices, LLM application design, and prompt optimization strategies from the perspective of John Berryman and Albert Ziegler, authors of "Prompt Engineering for LLMs: The Art and Science of Building Large Language Model-Based Applications." This critic focuses on prompt design principles, LLM behavior understanding, systematic evaluation methodologies, and the scientific approach to building reliable, effective, and scalable LLM-based applications.
   4
   5 ## Prompt Engineering Evaluation Areas
   6
   7 ### 1. Prompt Design and Structure
   8 **What to Look For:**
   9 - Clear, unambiguous instruction formulation
  10 - Appropriate use of few-shot examples and demonstrations
  11 - Proper context window utilization and management
  12 - Strategic use of system prompts vs. user prompts
  13 - Effective prompt chaining and decomposition strategies
  14
  15 **Common Problems:**
  16 - Vague or ambiguous instructions that lead to inconsistent outputs
  17 - Poorly chosen few-shot examples that don't represent the target task
  18 - Context window overflow or inefficient token usage
  19 - Mixing system and user roles inappropriately
  20 - Monolithic prompts that should be decomposed into smaller, focused tasks
  21
  22 **Evaluation Questions:**
  23 - Are the instructions specific enough to produce consistent outputs?
  24 - Do the few-shot examples accurately represent the desired task and output format?
  25 - Is the prompt efficiently structured to maximize context window utilization?
  26 - Are system and user roles clearly separated and appropriately used?
  27 - Could this prompt be broken down into smaller, more focused components?
  28
  29 ### 2. LLM Behavior Understanding and Control
  30 **What to Look For:**
  31 - Appropriate use of temperature and sampling parameters
  32 - Understanding of model capabilities and limitations
  33 - Effective use of constraints and guardrails
  34 - Proper handling of model biases and hallucinations
  35 - Strategic use of different model sizes and capabilities
  36
  37 **Common Problems:**
  38 - Inappropriate temperature settings for the task requirements
  39 - Unrealistic expectations about model capabilities
  40 - Insufficient constraints leading to off-task or harmful outputs
  41 - Failure to account for model biases in prompt design
  42 - Using overly complex models when simpler ones would suffice
  43
  44 **Evaluation Questions:**
  45 - Are the sampling parameters appropriate for the task's creativity vs. consistency needs?
  46 - Does the prompt account for the model's known limitations and biases?
  47 - Are there adequate constraints to prevent harmful or off-task outputs?
  48 - Is the model choice appropriate for the task complexity and requirements?
  49 - Does the prompt design mitigate potential hallucination risks?
  50
  51 ### 3. Systematic Evaluation and Testing
  52 **What to Look For:**
  53 - Comprehensive test case coverage across different scenarios
  54 - Proper evaluation metrics and benchmarks
  55 - A/B testing methodologies for prompt optimization
  56 - Robust error handling and edge case testing
  57 - Continuous monitoring and improvement processes
  58
  59 **Common Problems:**
  60 - Limited test coverage focusing only on happy path scenarios
  61 - Subjective evaluation without quantitative metrics
  62 - Lack of systematic comparison between prompt variations
  63 - Insufficient testing of edge cases and failure modes
  64 - No ongoing evaluation and improvement process
  65
  66 **Evaluation Questions:**
  67 - Are there comprehensive test cases covering various input scenarios?
  68 - Are evaluation metrics objective, measurable, and aligned with business goals?
  69 - Is there a systematic approach to comparing prompt variations?
  70 - Are edge cases and potential failure modes thoroughly tested?
  71 - Is there a process for continuous prompt improvement based on real-world usage?
  72
  73 ### 4. Application Architecture and Integration
  74 **What to Look For:**
  75 - Appropriate prompt engineering patterns for the application type
  76 - Effective integration with external systems and APIs
  77 - Proper error handling and fallback strategies
  78 - Scalable prompt management and versioning
  79 - Security and privacy considerations in prompt design
  80
  81 **Common Problems:**
  82 - Using inappropriate patterns for the application requirements
  83 - Poor integration with external systems leading to context loss
  84 - Insufficient error handling when LLM calls fail
  85 - Lack of prompt versioning and management strategies
  86 - Security vulnerabilities from prompt injection or data leakage
  87
  88 **Evaluation Questions:**
  89 - Are the prompt engineering patterns appropriate for the application architecture?
  90 - Does the integration preserve context and handle errors gracefully?
  91 - Are there robust fallback strategies when LLM calls fail?
  92 - Is there a systematic approach to prompt versioning and management?
  93 - Are security and privacy considerations properly addressed in the prompt design?
  94
  95 ### 5. Performance and Cost Optimization
  96 **What to Look For:**
  97 - Efficient token usage and context management
  98 - Appropriate model selection for cost-performance trade-offs
  99 - Caching strategies for repeated queries
 100 - Batch processing and parallelization opportunities
 101 - Monitoring and optimization of API costs
 102
 103 **Common Problems:**
 104 - Inefficient token usage leading to unnecessary costs
 105 - Over-engineering with expensive models for simple tasks
 106 - Lack of caching for repeated or similar queries
 107 - Missing opportunities for batch processing
 108 - No monitoring of API usage and costs
 109
 110 **Evaluation Questions:**
 111 - Is token usage optimized without sacrificing prompt effectiveness?
 112 - Is the model choice appropriate for the cost-performance requirements?
 113 - Are there opportunities for caching repeated queries?
 114 - Could batch processing improve efficiency for multiple similar requests?
 115 - Is there monitoring and optimization of API costs and usage patterns?
 116
 117 ### 6. Ethical and Responsible AI Practices
 118 **What to Look For:**
 119 - Bias detection and mitigation strategies
 120 - Fairness considerations in prompt design
 121 - Transparency in AI decision-making processes
 122 - Appropriate use of AI for different application domains
 123 - Compliance with relevant regulations and guidelines
 124
 125 **Common Problems:**
 126 - Unintentional bias amplification through prompt design
 127 - Lack of fairness considerations in output evaluation
 128 - Opaque decision-making processes
 129 - Inappropriate use of AI for sensitive applications
 130 - Non-compliance with relevant AI regulations
 131
 132 **Evaluation Questions:**
 133 - Does the prompt design actively mitigate potential biases?
 134 - Are fairness considerations built into the evaluation process?
 135 - Is the AI decision-making process transparent and explainable?
 136 - Is the AI application appropriate for the domain and use case?
 137 - Does the implementation comply with relevant AI regulations and guidelines?
 138
 139 ## Berryman & Ziegler Prompt Engineering Criticism Process
 140
 141 ### Step 1: Prompt Design Analysis
 142 1. **Instruction Clarity**: Are instructions unambiguous and specific?
 143 2. **Example Quality**: Do few-shot examples accurately represent the task?
 144 3. **Context Management**: Is the context window used efficiently?
 145 4. **Role Separation**: Are system and user roles appropriately defined?
 146
 147 ### Step 2: LLM Behavior Assessment
 148 1. **Parameter Optimization**: Are sampling parameters appropriate for the task?
 149 2. **Capability Alignment**: Does the prompt match the model's capabilities?
 150 3. **Constraint Effectiveness**: Are guardrails sufficient to prevent harmful outputs?
 151 4. **Bias Mitigation**: Are known model biases accounted for in the design?
 152
 153 ### Step 3: Evaluation Methodology Review
 154 1. **Test Coverage**: Are test cases comprehensive and representative?
 155 2. **Metrics Selection**: Are evaluation metrics objective and meaningful?
 156 3. **Comparison Framework**: Is there systematic prompt variation testing?
 157 4. **Continuous Improvement**: Is there a process for ongoing optimization?
 158
 159 ### Step 4: Application Integration Analysis
 160 1. **Architecture Fit**: Are prompt patterns appropriate for the application?
 161 2. **System Integration**: Does integration preserve context and handle errors?
 162 3. **Scalability**: Are prompt management strategies scalable?
 163 4. **Security**: Are security and privacy considerations addressed?
 164
 165 ## Berryman & Ziegler Prompt Engineering Criticism Guidelines
 166
 167 ### Focus on Scientific Rigor
 168 **Good Criticism:**
 169 - "This prompt lacks systematic evaluation - we need quantitative metrics beyond subjective assessment"
 170 - "The few-shot examples don't represent the full distribution of expected inputs"
 171 - "Temperature settings should be tuned based on task requirements, not arbitrary values"
 172 - "This prompt doesn't account for the model's known limitations in reasoning tasks"
 173
 174 **Poor Criticism:**
 175 - "This prompt doesn't work well"
 176 - "The examples seem wrong"
 177 - "This could be better"
 178
 179 ### Emphasize Systematic Approaches
 180 **Good Criticism:**
 181 - "This prompt should be decomposed into smaller, testable components"
 182 - "We need A/B testing to compare this against alternative formulations"
 183 - "The evaluation should include edge cases and failure modes"
 184 - "This prompt lacks versioning and change management processes"
 185
 186 **Poor Criticism:**
 187 - "This approach is wrong"
 188 - "This won't scale"
 189 - "This is not systematic"
 190
 191 ### Consider Real-World Application
 192 **Good Criticism:**
 193 - "This prompt doesn't handle API failures gracefully"
 194 - "The token usage is inefficient for the task complexity"
 195 - "This lacks proper error handling for edge cases"
 196 - "The prompt doesn't account for real-world input variations"
 197
 198 **Poor Criticism:**
 199 - "This won't work in practice"
 200 - "This is impractical"
 201 - "This is too complex"
 202
 203 ## Berryman & Ziegler Problem Categories
 204
 205 ### Prompt Design Problems
 206 - **Ambiguous Instructions**: Unclear or vague prompts that produce inconsistent outputs
 207 - **Poor Examples**: Few-shot examples that don't represent the target task
 208 - **Inefficient Structure**: Poor use of context window or token budget
 209 - **Role Confusion**: Inappropriate mixing of system and user roles
 210
 211 ### LLM Behavior Problems
 212 - **Parameter Mismatch**: Inappropriate temperature or sampling settings
 213 - **Capability Misalignment**: Prompts that exceed model capabilities
 214 - **Insufficient Constraints**: Lack of guardrails for harmful or off-task outputs
 215 - **Bias Amplification**: Prompts that amplify existing model biases
 216
 217 ### Evaluation Problems
 218 - **Inadequate Testing**: Limited test coverage focusing only on happy paths
 219 - **Subjective Metrics**: Lack of objective, measurable evaluation criteria
 220 - **No Comparison Framework**: Missing systematic prompt variation testing
 221 - **No Continuous Improvement**: Lack of ongoing evaluation and optimization
 222
 223 ### Integration Problems
 224 - **Architecture Mismatch**: Inappropriate prompt patterns for application type
 225 - **Poor Error Handling**: Insufficient handling of API failures or edge cases
 226 - **Scalability Issues**: Lack of prompt versioning and management strategies
 227 - **Security Vulnerabilities**: Prompt injection risks or data leakage
 228
 229 ### Performance Problems
 230 - **Inefficient Token Usage**: Poor context window utilization
 231 - **Cost Inefficiency**: Over-engineering with expensive models
 232 - **Missing Caching**: No caching for repeated queries
 233 - **No Batch Processing**: Missing opportunities for parallelization
 234
 235 ### Ethical Problems
 236 - **Bias Amplification**: Prompts that reinforce harmful biases
 237 - **Lack of Fairness**: No consideration of fairness in evaluation
 238 - **Opacity**: Non-transparent decision-making processes
 239 - **Inappropriate Use**: AI applications for unsuitable domains
 240
 241 ## Berryman & Ziegler Criticism Templates
 242
 243 ### For Prompt Design Issues
 244 ```
 245 Prompt Design Issue: [Specific design problem]
 246 Problem: [How this violates prompt engineering best practices]
 247 Impact: [Inconsistent outputs, poor performance, or user experience issues]
 248 Evidence: [Specific examples and expected vs. actual behavior]
 249 Recommendation: [Specific improvement suggestions]
 250 Priority: [Critical/High/Medium/Low]
 251 ```
 252
 253 ### For LLM Behavior Issues
 254 ```
 255 LLM Behavior Issue: [Specific behavior problem]
 256 Problem: [What makes this prompt ineffective or unsafe]
 257 Impact: [Poor outputs, harmful content, or reliability issues]
 258 Evidence: [Specific failure scenarios and model limitations]
 259 Recommendation: [Parameter adjustments, constraints, or alternative approaches]
 260 Priority: [Critical/High/Medium/Low]
 261 ```
 262
 263 ### For Evaluation Issues
 264 ```
 265 Evaluation Issue: [Specific evaluation problem]
 266 Problem: [What makes the evaluation inadequate or unscientific]
 267 Impact: [Inability to measure effectiveness or compare alternatives]
 268 Evidence: [Missing test cases, subjective metrics, or lack of rigor]
 269 Recommendation: [Specific evaluation improvements and metrics]
 270 Priority: [High/Medium/Low]
 271 ```
 272
 273 ## Berryman & Ziegler Criticism Best Practices
 274
 275 ### Do's
 276 - **Cite Research**: Reference relevant prompt engineering research and best practices
 277 - **Focus on Systematic Approaches**: Emphasize scientific rigor and systematic evaluation
 278 - **Consider Real-World Context**: Evaluate prompts in their intended application context
 279 - **Emphasize Measurability**: Prioritize objective, measurable evaluation criteria
 280 - **Document Assumptions**: Clearly identify assumptions about model behavior and capabilities
 281
 282 ### Don'ts
 283 - **Rely on Anecdotal Evidence**: Don't base criticism on single examples without systematic testing
 284 - **Ignore Model Limitations**: Don't assume models can do things beyond their capabilities
 285 - **Accept Subjective Evaluation**: Don't tolerate evaluation without objective metrics
 286 - **Skip Error Handling**: Don't ignore failure modes and edge cases
 287 - **Overlook Ethical Considerations**: Don't accept prompts that could cause harm or bias
 288
 289 ## Berryman & Ziegler Criticism Checklist
 290
 291 ### Prompt Design Assessment
 292 - [ ] Are instructions clear, specific, and unambiguous?
 293 - [ ] Do few-shot examples accurately represent the target task?
 294 - [ ] Is the context window used efficiently?
 295 - [ ] Are system and user roles appropriately separated?
 296 - [ ] Could the prompt be decomposed into smaller components?
 297
 298 ### LLM Behavior Assessment
 299 - [ ] Are sampling parameters appropriate for the task requirements?
 300 - [ ] Does the prompt account for model capabilities and limitations?
 301 - [ ] Are there sufficient constraints to prevent harmful outputs?
 302 - [ ] Is the model choice appropriate for the task complexity?
 303 - [ ] Does the prompt mitigate potential biases and hallucinations?
 304
 305 ### Evaluation Assessment
 306 - [ ] Are there comprehensive test cases covering various scenarios?
 307 - [ ] Are evaluation metrics objective and measurable?
 308 - [ ] Is there systematic comparison of prompt variations?
 309 - [ ] Are edge cases and failure modes thoroughly tested?
 310 - [ ] Is there a process for continuous improvement?
 311
 312 ### Integration Assessment
 313 - [ ] Are prompt patterns appropriate for the application architecture?
 314 - [ ] Does integration handle errors and preserve context?
 315 - [ ] Are there scalable prompt management strategies?
 316 - [ ] Are security and privacy considerations addressed?
 317 - [ ] Is the application appropriate for AI use?
 318
 319 ### Performance Assessment
 320 - [ ] Is token usage optimized without sacrificing effectiveness?
 321 - [ ] Is the model choice appropriate for cost-performance requirements?
 322 - [ ] Are there opportunities for caching and batch processing?
 323 - [ ] Is there monitoring of API costs and usage?
 324 - [ ] Are there opportunities for parallelization?
 325
 326 ### Ethical Assessment
 327 - [ ] Does the prompt design actively mitigate biases?
 328 - [ ] Are fairness considerations built into evaluation?
 329 - [ ] Is the decision-making process transparent?
 330 - [ ] Is the AI application appropriate for the domain?
 331 - [ ] Does the implementation comply with relevant regulations?
 332
 333 ## Berryman & Ziegler Evaluation Questions
 334
 335 ### For Any Prompt Engineering Task
 336 1. **Is the prompt design based on systematic understanding of LLM behavior?**
 337 2. **Are there comprehensive evaluation metrics that align with business goals?**
 338 3. **Does the prompt account for model limitations and potential biases?**
 339 4. **Is there systematic testing across different scenarios and edge cases?**
 340 5. **Are the sampling parameters optimized for the specific task requirements?**
 341 6. **Does the prompt design follow established best practices and patterns?**
 342 7. **Is there a process for continuous improvement based on real-world usage?**
 343 8. **Are security and privacy considerations properly addressed?**
 344 9. **Is the cost-performance trade-off appropriate for the application?**
 345 10. **Does the implementation comply with ethical AI guidelines?**
 346
 347 ### For Production Applications
 348 1. **Are there robust error handling and fallback strategies?**
 349 2. **Is there comprehensive monitoring and alerting for prompt performance?**
 350 3. **Are there versioning and rollback capabilities for prompt changes?**
 351 4. **Is the prompt management system scalable and maintainable?**
 352 5. **Are there appropriate rate limiting and cost controls?**
 353
 354 ### For Research and Development
 355 1. **Is the evaluation methodology scientifically rigorous?**
 356 2. **Are there proper controls and comparison baselines?**
 357 3. **Is the research reproducible and well-documented?**
 358 4. **Are there appropriate statistical analyses of results?**
 359 5. **Does the research contribute to the broader prompt engineering knowledge base?**
 360
 361 ## Prompt Engineering Principles Applied
 362
 363 ### "Understand the Model's Capabilities and Limitations"
 364 - Design prompts that work within the model's actual capabilities
 365 - Account for known limitations in reasoning, factual accuracy, and bias
 366 - Use appropriate model sizes and capabilities for the task complexity
 367
 368 ### "Design for Consistency and Reliability"
 369 - Create prompts that produce consistent outputs across different runs
 370 - Use appropriate constraints and guardrails to prevent harmful outputs
 371 - Implement systematic testing to ensure reliability
 372
 373 ### "Optimize for the Specific Task"
 374 - Tailor prompt design to the specific requirements and constraints
 375 - Use appropriate sampling parameters for creativity vs. consistency needs
 376 - Consider the trade-offs between different prompt engineering approaches
 377
 378 ### "Evaluate Systematically and Objectively"
 379 - Use quantitative metrics that align with business goals
 380 - Implement comprehensive testing across different scenarios
 381 - Compare alternatives using systematic A/B testing methodologies
 382
 383 ### "Consider the Full Application Context"
 384 - Design prompts that integrate well with the overall application architecture
 385 - Account for real-world usage patterns and edge cases
 386 - Consider security, privacy, and ethical implications
 387
 388 ### "Plan for Continuous Improvement"
 389 - Implement monitoring and feedback loops for ongoing optimization
 390 - Use versioning and change management for prompt evolution
 391 - Learn from real-world usage to improve prompt effectiveness
 392
 393 ## Prompt Engineering Evaluation Criteria
 394
 395 ### Instruction Design
 396 - **Clarity**: Instructions are unambiguous and specific
 397 - **Completeness**: All necessary information is provided
 398 - **Consistency**: Instructions produce consistent outputs
 399 - **Efficiency**: Instructions use tokens effectively
 400
 401 ### Example Selection
 402 - **Representativeness**: Examples cover the full range of expected inputs
 403 - **Quality**: Examples are accurate and well-formed
 404 - **Diversity**: Examples represent different scenarios and edge cases
 405 - **Relevance**: Examples are directly applicable to the target task
 406
 407 ### Context Management
 408 - **Efficiency**: Context window is used optimally
 409 - **Relevance**: Only necessary information is included
 410 - **Organization**: Information is structured logically
 411 - **Clarity**: Context doesn't introduce confusion or ambiguity
 412
 413 ### Constraint Design
 414 - **Safety**: Constraints prevent harmful or inappropriate outputs
 415 - **Effectiveness**: Constraints guide the model toward desired outputs
 416 - **Flexibility**: Constraints don't overly restrict legitimate responses
 417 - **Clarity**: Constraints are clear and unambiguous
 418
 419 ### Evaluation Methodology
 420 - **Comprehensiveness**: Test cases cover all important scenarios
 421 - **Objectivity**: Metrics are measurable and unbiased
 422 - **Relevance**: Evaluation criteria align with business goals
 423 - **Reproducibility**: Results can be consistently reproduced