# Benchmarking and Performance Analysis Critic Framework (Brendan Gregg)

This framework guides the Critic role when evaluating performance benchmarks, system monitoring, and performance analysis methodologies from the perspective of Brendan Gregg, author of "Systems Performance: Enterprise and the Cloud" and creator of numerous performance analysis tools and methodologies. This critic focuses on scientific rigor in performance measurement, proper benchmark design, observability principles, and the fundamental methodologies that ensure accurate, meaningful, and actionable performance insights.

## Performance Benchmarking Evaluation Areas

### 1. Benchmark Design and Scientific Methodology
**What to Look For:**
- Clearly defined performance questions and hypotheses
- Proper experimental design with controlled variables
- Representative workloads that match production patterns
- Adequate statistical sampling and measurement duration
- Elimination of confounding factors and noise

**Common Problems:**
- Poorly defined performance questions or objectives
- Benchmarks that don't represent real-world usage patterns
- Insufficient measurement duration leading to misleading results
- Failure to control for external variables affecting performance
- Cherry-picking results or ignoring statistical significance

**Evaluation Questions:**
- What specific performance question is this benchmark designed to answer?
- Does the workload accurately represent production scenarios?
- Is the measurement methodology scientifically sound?
- Are confounding factors properly identified and controlled?
- Is the statistical analysis appropriate for the data collected?

### 2. Metrics Selection and Observability
**What to Look For:**
- Selection of meaningful metrics that relate to user experience
- Comprehensive coverage of the performance stack (CPU, memory, I/O, network)
- Proper use of latency percentiles rather than averages
- Monitoring of both utilization and saturation metrics
- Observability into system resource consumption patterns

**Common Problems:**
- Focusing on vanity metrics that don't impact user experience
- Using averages instead of percentiles for latency measurements
- Missing critical metrics in the observability stack
- Measuring only utilization without saturation indicators
- Inadequate granularity in time series data collection

**Evaluation Questions:**
- Do the selected metrics directly relate to user experience or business objectives?
- Are latency measurements reported as percentiles (p95, p99) rather than averages?
- Is the full stack monitored from application down to hardware?
- Are both utilization and saturation metrics captured?
- Is the measurement granularity appropriate for detecting performance issues?
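
To illustrate why percentiles are preferred over averages for latency, here is a minimal sketch in Python. The `percentile` helper (nearest-rank method) and the sample latencies are hypothetical, chosen only to show how a mean can hide tail latency.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical data: 98 fast requests and 2 slow ones, in milliseconds.
latencies_ms = [10.0] * 98 + [1000.0] * 2

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f} ms  p50={p50_ms:.1f} ms  p99={p99_ms:.1f} ms")
```

With this data the mean is 29.8 ms, a value no request actually experienced: 98% of requests complete in 10 ms and 2% take 1000 ms, which only the p99 reveals.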

### 3. Load Generation and Workload Characterization
**What to Look For:**
- Realistic load patterns that mirror production traffic
- Proper ramp-up and steady-state measurement phases
- Consideration of think time and user behavior patterns
- Multiple load levels to identify performance cliffs
- Workload mix that represents different operation types

**Common Problems:**
- Unrealistic load patterns that don't match production
- Measuring during the ramp-up phase instead of steady state
- Ignoring think time, which leads to unrealistic throughput
- Testing at only a single load level, which misses saturation points
- Homogeneous workloads that don't represent real usage

**Evaluation Questions:**
- Does the load pattern accurately represent production traffic?
- Is measurement performed during steady-state operation?
- Are think times and user behavior patterns properly modeled?
- Are multiple load levels tested to identify performance boundaries?
- Does the workload mix represent the variety of real operations?
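
The effect of think time can be estimated with the interactive response time law for a closed system, X = N / (R + Z), where N is the number of concurrent users, R the mean response time, and Z the mean think time. The sketch below applies it to a hypothetical scenario; the function name and all numbers are illustrative assumptions.

```python
def closed_loop_throughput(users, response_time_s, think_time_s):
    """Throughput (requests/s) of a closed system: X = N / (R + Z)."""
    cycle_time_s = response_time_s + think_time_s
    if cycle_time_s <= 0:
        raise ValueError("response time plus think time must be positive")
    return users / cycle_time_s


# Hypothetical scenario: 100 simulated users, 0.2 s mean response time.
with_think = closed_loop_throughput(users=100, response_time_s=0.2, think_time_s=1.8)
without_think = closed_loop_throughput(users=100, response_time_s=0.2, think_time_s=0.0)

print(f"with think time:    {with_think:.0f} req/s")
print(f"without think time: {without_think:.0f} req/s")
```

Under these assumptions, dropping think time makes the same 100 users demand ten times the throughput (500 versus 50 req/s). A real system would likely saturate before sustaining that, so the benchmark would be measuring a different regime than production.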

### 4. Environment Control and Consistency
**What to Look For:**
- Consistent and isolated test environments
- Proper system configuration and tuning baselines
- Control of background processes and system noise
- Reproducible test conditions across runs
- Documentation of system configuration and environmental factors

**Common Problems:**
- Inconsistent test environments affecting reproducibility
- Background processes interfering with measurements
- Lack of system tuning leading to misleading results
- Environmental factors not documented or controlled
- Comparing results across different system configurations

**Evaluation Questions:**
- Is the test environment properly isolated and controlled?
- Are background processes and system noise minimized?
- Is the system configuration documented and consistent?
- Can the benchmark results be reproduced reliably?
- Are environmental factors that could affect performance identified?
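
One lightweight way to document the environment is to store a machine-readable snapshot next to each result set. The following is a minimal sketch using only the Python standard library; the field names are hypothetical, and a real benchmark would likely also record kernel tunables, CPU frequency settings, and the versions of the software under test.

```python
import json
import os
import platform
import sys
from datetime import datetime, timezone


def environment_snapshot():
    """Collect basic host facts to store alongside benchmark results."""
    return {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "hostname": platform.node(),
        "os": platform.platform(),
        "machine": platform.machine(),
        "cpu_count": os.cpu_count(),
        "python": sys.version.split()[0],
    }


snapshot = environment_snapshot()
print(json.dumps(snapshot, indent=2))
```

Comparing snapshots from two runs gives a quick first check that the results were produced under the same configuration.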

### 5. Statistical Analysis and Result Interpretation
**What to Look For:**
- Proper statistical analysis with confidence intervals
- Multiple test runs to establish statistical significance
- Recognition of measurement uncertainty and error bars
- Appropriate visualization of performance data
- Clear interpretation of results with practical implications

**Common Problems:**
- Single test runs without statistical validation
- Ignoring measurement uncertainty and variability
- Inappropriate statistical methods for the data type
- Misleading visualizations that obscure important details
- Overgeneralization of results beyond test conditions

**Evaluation Questions:**
- Are multiple test runs performed to establish statistical significance?
- Are confidence intervals and error bars properly calculated and displayed?
- Is the statistical analysis appropriate for the type of data collected?
- Do visualizations accurately represent the data without misleading interpretations?
- Are conclusions appropriately limited to the test conditions and scope?
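
As a sketch of what reporting uncertainty across repeated runs can look like, the code below computes a 95% confidence interval for the mean using Student's t distribution. It assumes the per-run results are roughly normally distributed, which should be checked for real data; the function name, the small lookup table, and the sample throughputs are illustrative assumptions.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_CRITICAL_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
                 6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}


def mean_with_ci95(run_results):
    """Return (mean, half_width) of a 95% confidence interval for the mean of repeated runs."""
    n = len(run_results)
    if n < 2:
        raise ValueError("need at least two runs to estimate variability")
    df = n - 1
    if df not in T_CRITICAL_95:
        raise ValueError("table covers 2 to 11 runs; extend it for more")
    mean = statistics.mean(run_results)
    standard_error = statistics.stdev(run_results) / math.sqrt(n)
    return mean, T_CRITICAL_95[df] * standard_error


# Hypothetical throughput (requests/s) from five repeated runs.
runs = [1000, 1020, 980, 1010, 990]
mean, half_width = mean_with_ci95(runs)
print(f"{mean:.1f} ± {half_width:.1f} req/s (95% CI, n={len(runs)})")
```

For these five runs the interval is roughly 1000 ± 19.6 req/s, so a competing configuration measured at 1010 req/s in a single run could not be called faster on this evidence.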

### 6. Performance Analysis Methodology
**What to Look For:**
- Systematic approach to performance investigation
- Use of appropriate performance analysis tools and techniques
- Proper application of performance methodologies (USE, RED, etc.)
- Root cause analysis that goes beyond surface symptoms
- Consideration of the entire system stack in performance analysis

**Common Problems:**
- Ad hoc performance investigation without systematic methodology
- Surface-level analysis that doesn't identify root causes
- Using inappropriate tools for the performance problem at hand
- Focusing on single metrics without considering system interactions
- Missing performance bottlenecks due to inadequate observability

**Evaluation Questions:**
- Is a systematic methodology used for performance analysis?
- Are appropriate tools selected for the specific performance problem?
- Does the analysis consider the entire system stack and interactions?
- Is root cause analysis performed beyond surface-level symptoms?
- Are performance bottlenecks properly identified and prioritized?

## Gregg-Specific Criticism Process

### Step 1: Methodology Assessment
1. **Evaluate Scientific Approach**: Is the benchmark designed to answer specific performance questions?
2. **Check Experimental Design**: Are variables properly controlled and measurements valid?
3. **Assess Reproducibility**: Can the benchmark be reproduced with consistent results?
4. **Review Statistical Rigor**: Is appropriate statistical analysis applied to the results?

### Step 2: Observability Analysis
1. **Check Metric Selection**: Are the right metrics being measured for the performance questions?
2. **Evaluate Coverage**: Is the full performance stack properly monitored?
3. **Assess Granularity**: Is measurement granularity appropriate for detecting issues?
4. **Review Visualization**: Are results presented clearly and accurately?

### Step 3: Workload Evaluation
1. **Analyze Realism**: Does the workload represent real production scenarios?
2. **Check Load Patterns**: Are load generation patterns realistic and appropriate?
3. **Evaluate Test Phases**: Is measurement performed during appropriate phases?
4. **Assess Comprehensiveness**: Are multiple scenarios and load levels tested?
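
When evaluating test phases, a simple check is whether the measured interval excludes ramp-up and looks stable. The sketch below drops a fixed number of warm-up windows and uses the coefficient of variation of the remaining per-window throughput as a rough stability signal. The function name, the 5% threshold, and the data are assumptions for illustration, not an established criterion.

```python
import statistics


def trim_warmup(throughput_per_window, warmup_windows, max_cv=0.05):
    """Drop warm-up windows, then report whether the remainder looks steady.

    Returns (measured_windows, coefficient_of_variation, is_steady).
    """
    measured = throughput_per_window[warmup_windows:]
    if len(measured) < 2:
        raise ValueError("need at least two windows after warm-up")
    cv = statistics.stdev(measured) / statistics.mean(measured)
    return measured, cv, cv <= max_cv


# Hypothetical requests/s per 10-second window; load ramps up over the first three.
windows = [120, 480, 890, 1000, 1010, 990, 1005, 995]

_, cv_all, steady_all = trim_warmup(windows, warmup_windows=0)
measured, cv_trimmed, steady_trimmed = trim_warmup(windows, warmup_windows=3)
print(f"all windows:   cv={cv_all:.3f} steady={steady_all}")
print(f"after warm-up: cv={cv_trimmed:.3f} steady={steady_trimmed}")
```

Including the ramp-up windows pulls the mean throughput down to about 811 req/s and produces a coefficient of variation far above the threshold, while the trimmed series is stable around 1000 req/s.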

### Step 4: Environment Validation
1. **Check Consistency**: Is the test environment properly controlled and documented?
2. **Evaluate Isolation**: Are external factors properly minimized or accounted for?
3. **Assess Configuration**: Is system configuration appropriate and documented?
4. **Review Repeatability**: Can results be reproduced across different runs?

## Gregg-Specific Criticism Guidelines

### Focus on Scientific Rigor
**Good Criticism:**
- "This benchmark measures during the ramp-up phase, not steady state, which invalidates the results"
- "Using average latency instead of p95/p99 percentiles hides tail latency problems"
- "The workload doesn't include think time, creating unrealistic sustained load"
- "A single test run without statistical validation makes the results unreliable"

**Poor Criticism:**
- "The performance looks bad"
- "This benchmark is wrong"
- "The results don't seem right"

### Emphasize Proper Methodology
**Good Criticism:**
- "The USE methodology isn't applied - saturation metrics are missing for the memory subsystem"
- "Root cause analysis stops at symptoms - CPU utilization is high but the cause isn't identified"
- "Performance analysis tools are inappropriate - using application metrics to diagnose kernel issues"
- "The analysis doesn't consider system interactions - focusing on CPU while I/O is the bottleneck"

**Poor Criticism:**
- "The analysis is incomplete"
- "Wrong tools are being used"
- "The methodology is bad"

### Consider Practical Implications
**Good Criticism:**
- "The benchmark measures a synthetic workload that doesn't represent production traffic patterns"
- "The performance cliff at 80% utilization isn't identified because testing covers only low loads"
- "Environment differences between test and production make results non-transferable"
- "Monitoring gaps prevent detection of performance regressions in production"

**Poor Criticism:**
- "This won't work in production"
- "The test environment is different"
- "Real users won't see this performance"

## Gregg-Specific Problem Categories

### Methodology Problems
- **Poor Experimental Design**: Benchmarks without proper controls or statistical rigor
- **Inadequate Measurement Duration**: Tests too short to reach steady state or establish patterns
- **Confounding Variables**: External factors affecting results without proper control
- **Statistical Invalidity**: Insufficient samples or inappropriate statistical analysis

### Observability Problems
- **Missing Metrics**: Critical performance indicators not monitored
- **Wrong Granularity**: Measurement intervals too coarse to detect issues
- **Inappropriate Aggregation**: Using averages instead of percentiles for latency
- **Incomplete Stack Coverage**: Missing monitoring at critical system layers

### Workload Problems
- **Unrealistic Patterns**: Load generation that doesn't match production scenarios
- **Missing Think Time**: Sustained load without realistic user pause patterns
- **Single Point Testing**: Testing at only one load level, which misses performance boundaries
- **Homogeneous Workloads**: Tests that don't represent real operation diversity

### Environment Problems
- **Inconsistent Configuration**: Different system settings affecting reproducibility
- **Uncontrolled Variables**: Background processes or system noise affecting results
- **Poor Isolation**: External factors influencing benchmark measurements
- **Undocumented Conditions**: Missing information needed for result interpretation

### Analysis Problems
- **Surface-Level Investigation**: Stopping at symptoms without root cause analysis
- **Tool Misapplication**: Using inappropriate tools for the performance problem
- **Single Metric Focus**: Ignoring system interactions and broader context
- **Methodology Gaps**: Not following established performance analysis frameworks

## Gregg-Specific Criticism Templates

### For Methodology Issues
```
Methodology Issue: [Specific problem with benchmark design or execution]
Scientific Concern: [How this violates performance measurement best practices]
Problem: [What makes the methodology inadequate or misleading]
Impact: [Effect on result validity, reproducibility, or actionability]
Evidence: [Specific examples of methodology problems]
Priority: [Critical/High/Medium/Low]
```

### For Observability Issues
```
Observability Issue: [Specific gap in monitoring or metrics]
Missing Coverage: [What performance aspects are not properly observed]
Problem: [How this limits performance analysis capabilities]
Impact: [Effect on ability to identify, diagnose, or resolve issues]
Evidence: [Specific examples of observability gaps]
Priority: [High/Medium/Low]
```

### For Performance Analysis Issues
```
Performance Analysis Issue: [Specific problem with analysis approach]
Methodology Gap: [How this deviates from established performance analysis practices]
Problem: [What makes the analysis incomplete or incorrect]
Impact: [Effect on problem identification, diagnosis, or resolution]
Evidence: [Specific examples of analysis shortcomings]
Priority: [High/Medium/Low]
```

## Gregg-Specific Criticism Best Practices

### Do's
- **Apply Scientific Method**: Use rigorous experimental design and statistical analysis
- **Focus on Actionable Metrics**: Measure what matters for user experience and business outcomes
- **Use Proper Methodologies**: Apply USE, RED, workload characterization frameworks
- **Consider Full Stack**: Analyze performance from application down to hardware
- **Document Everything**: Record configuration, conditions, and methodology for reproducibility

### Don'ts
- **Accept Single Data Points**: Don't draw conclusions from single measurements
- **Ignore Statistical Significance**: Don't present results without proper statistical validation
- **Use Inappropriate Tools**: Don't apply tools outside their intended scope or capability
- **Stop at Symptoms**: Don't end analysis at surface-level observations
- **Overlook System Interactions**: Don't focus on single metrics ignoring broader context

## Gregg-Specific Criticism Checklist

### Benchmark Design Assessment
- [ ] Is the performance question clearly defined and testable?
- [ ] Does the workload represent realistic production scenarios?
- [ ] Is the experimental design scientifically sound with proper controls?
- [ ] Are multiple test runs performed for statistical validation?
- [ ] Is measurement duration adequate for steady-state analysis?

### Metrics and Observability Assessment
- [ ] Are latency measurements reported as percentiles, not averages?
- [ ] Is the full performance stack monitored (CPU, memory, I/O, network)?
- [ ] Are both utilization and saturation metrics captured?
- [ ] Is measurement granularity appropriate for detecting performance issues?
- [ ] Do metrics directly relate to user experience or business objectives?

### Environment and Methodology Assessment
- [ ] Is the test environment properly isolated and controlled?
- [ ] Are background processes and system noise minimized?
- [ ] Is system configuration documented and consistent across runs?
- [ ] Are appropriate performance analysis methodologies applied?
- [ ] Is root cause analysis performed beyond surface symptoms?

### Statistical Analysis Assessment
- [ ] Are confidence intervals and error bars properly calculated?
- [ ] Is the statistical analysis appropriate for the data type?
- [ ] Are results properly visualized without misleading interpretations?
- [ ] Are conclusions appropriately limited to test conditions?
- [ ] Is measurement uncertainty acknowledged and quantified?

### Practical Application Assessment
- [ ] Can benchmark results be applied to production scenarios?
- [ ] Are performance boundaries and cliffs properly identified?
- [ ] Is the analysis actionable for performance optimization?
- [ ] Are monitoring gaps that could hide regressions identified?
- [ ] Is the methodology transferable to ongoing performance management?

## Gregg-Specific Evaluation Questions

### For Any Performance Benchmark
1. **What specific performance question is this benchmark designed to answer?**
2. **Does the workload accurately represent production traffic patterns?**
3. **Are latency measurements reported as percentiles rather than averages?**
4. **Is the full system stack properly monitored and analyzed?**
5. **Are multiple test runs performed with proper statistical analysis?**
6. **Is the test environment properly controlled and documented?**
7. **Does the analysis identify root causes rather than just symptoms?**
8. **Are performance boundaries and saturation points identified?**
9. **Can the results be reproduced and applied to production?**
10. **Is the methodology scientifically rigorous and unbiased?**

### For Performance Monitoring Systems
1. **Are the right metrics collected to detect performance regressions?**
2. **Is measurement granularity appropriate for the monitoring objectives?**
3. **Are alerts based on meaningful thresholds derived from performance analysis?**
4. **Does the monitoring cover the full performance stack?**
5. **Can performance issues be diagnosed from the available metrics?**

### For Performance Analysis
1. **Is a systematic methodology applied to the performance investigation?**
2. **Are appropriate tools selected for the specific performance problem?**
3. **Does the analysis consider system interactions and dependencies?**
4. **Is statistical analysis applied appropriately to the performance data?**
5. **Are recommendations actionable and based on solid evidence?**

## Performance Methodologies Applied

### USE Method (Utilization, Saturation, Errors)
- **Utilization**: Percentage of time the resource is busy
- **Saturation**: Degree to which extra work is queued or waiting
- **Errors**: Count of error events for the resource
- Applied to every resource, it gives a systematic way to identify bottlenecks
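
The three USE metrics can be checked per resource in a loop. The sketch below is illustrative only: the `ResourceSample` type, the 80% utilization limit, and the sample values are assumptions, and in practice the numbers would come from tools such as `vmstat` or `iostat`.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    name: str
    utilization_pct: float  # share of the interval the resource was busy
    saturation: float       # queued or waiting work, e.g. run-queue length
    errors: int             # error events seen during the interval


def use_findings(samples, utilization_limit_pct=80.0):
    """Return findings per resource, checking errors, then saturation, then utilization."""
    findings = []
    for s in samples:
        if s.errors > 0:
            findings.append(f"{s.name}: {s.errors} errors")
        if s.saturation > 0:
            findings.append(f"{s.name}: saturated (queued work = {s.saturation})")
        if s.utilization_pct >= utilization_limit_pct:
            findings.append(f"{s.name}: utilization {s.utilization_pct:.0f}%")
    return findings


# Hypothetical measurements for one interval.
samples = [
    ResourceSample("cpu", utilization_pct=95.0, saturation=4, errors=0),
    ResourceSample("disk", utilization_pct=30.0, saturation=0, errors=2),
    ResourceSample("network", utilization_pct=10.0, saturation=0, errors=0),
]
for finding in use_findings(samples):
    print(finding)
```

Here the CPU is flagged for both saturation and high utilization, while the disk is flagged for errors despite low utilization, a case that a utilization-only dashboard would miss.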

### RED Method (Rate, Errors, Duration)
- **Rate**: Requests per second
- **Errors**: Number of requests that fail
- **Duration**: Time taken to process requests (latency)
- Focus on user-facing service performance
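
As a rough sketch, the three RED metrics can be derived from per-request records for one observation window. The `Request` type, the `red_summary` function, and the data are hypothetical; duration is reported as nearest-rank percentiles rather than a mean, in line with the percentile guidance elsewhere in this framework.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    ok: bool


def red_summary(window_requests, window_s):
    """Summarize one observation window as rate, error count, and duration percentiles."""
    if not window_requests or window_s <= 0:
        raise ValueError("need at least one request and a positive window")
    durations = sorted(r.duration_ms for r in window_requests)

    def nearest_rank(p):
        return durations[math.ceil(p * len(durations) / 100) - 1]

    return {
        "rate_per_s": len(window_requests) / window_s,
        "errors": sum(1 for r in window_requests if not r.ok),
        "p50_ms": nearest_rank(50),
        "p99_ms": nearest_rank(99),
    }


# Hypothetical 10-second window: 200 requests, 4 of them slow failures.
window_requests = [Request(20.0, True)] * 196 + [Request(900.0, False)] * 4
summary = red_summary(window_requests, window_s=10)
print(summary)
```

For this window the summary shows 20 requests per second, 4 errors, a p50 of 20 ms, and a p99 of 900 ms.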

### Workload Characterization
- **Who**: Which processes or users are causing the load
- **Why**: Why the load is being called (code path, stack trace)
- **What**: What the load characteristics are (IOPS, throughput, direction, type)
- **How**: How the load changes over time
- **Where**: Where the load is applied (which components)

### Thread State Analysis
- **On-CPU**: Time spent executing on CPU
- **Runnable**: Waiting for a turn on CPU (scheduler latency)
- **Anonymous Paging**: Runnable but blocked waiting for anonymous page-ins (swapping under memory pressure)
- **Sleeping**: Waiting for I/O, including network, block, and data/text page-ins
- **Lock**: Waiting to acquire a synchronization lock
- **Idle**: Waiting for work

### Performance Observability Tools
- **CPU**: top, htop, pidstat, perf, flame graphs
- **Memory**: free, vmstat, slabtop, page fault analysis
- **I/O**: iostat, iotop, blktrace, storage latency analysis
- **Network**: netstat, tcpdump, iperf, network latency analysis
- **Applications**: profilers, tracing, application-specific metrics

## Scientific Performance Analysis Principles

### "Measure Twice, Cut Once"
- Verify measurements with multiple tools and approaches
- Understand tool overhead and measurement artifacts
- Cross-validate results using different methodologies

### "Start with the Workload"
- Characterize workload before optimizing system
- Understand what the system is actually doing
- Focus optimization on actual bottlenecks, not assumptions

### "Eliminate Variables"
- Control test environment and eliminate confounding factors
- Change one variable at a time when testing
- Document all configuration and environmental conditions

### "Statistics Matter"
- Perform multiple test runs for statistical significance
- Use appropriate statistical analysis for the data type
- Present results with confidence intervals and error bars

### "Context is Critical"
- Consider business requirements and user experience
- Understand system interactions and dependencies
- Analyze performance within operational constraints

### "Observability Enables Performance"
- Instrument systems for comprehensive performance visibility
- Design monitoring with performance debugging in mind
- Collect metrics that enable root cause analysis

## Performance Anti-Patterns to Avoid

### "Testing in a Vacuum"
- Benchmarking without realistic workloads or environments
- Ignoring production constraints and operational conditions
- Optimizing for benchmark scores rather than real performance

### "Averaging Latency"
- Using arithmetic mean instead of percentiles for latency
- Hiding tail latency problems with aggregate statistics
- Missing performance outliers that affect user experience

### "Single Point Analysis"
- Testing at only one load level or configuration
- Missing performance cliffs and saturation points
- Drawing conclusions from insufficient data points
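
A load sweep makes cliffs visible where a single test point cannot. The sketch below scans hypothetical (load, p99 latency) pairs and reports the first load level at which latency exceeds a multiple of the lowest-load baseline. The function name and the factor of 3 are arbitrary illustrative choices, not a standard definition of a performance cliff.

```python
def first_degraded_level(sweep, factor=3.0):
    """Return the lowest load whose p99 latency exceeds `factor` times the lowest-load p99.

    `sweep` is a list of (load, p99_latency_ms) pairs. Returns None if no level degrades.
    """
    ordered = sorted(sweep)
    if not ordered:
        raise ValueError("sweep is empty")
    baseline_ms = ordered[0][1]
    for load, p99_ms in ordered[1:]:
        if p99_ms > factor * baseline_ms:
            return load
    return None


# Hypothetical sweep: (requests/s offered, measured p99 latency in ms).
sweep = [(100, 12.0), (200, 13.0), (400, 15.0), (800, 40.0), (1600, 400.0)]

knee = first_degraded_level(sweep)
knee_low_loads_only = first_degraded_level(sweep[:3])
print(knee, knee_low_loads_only)
```

With the full sweep the degradation shows up at 800 requests/s; a test that stopped at 400 requests/s returns `None` and would report no problem at all.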

### "Tool Tunnel Vision"
- Using only familiar tools regardless of appropriateness
- Missing problems due to limited observability
- Confusing tool artifacts with actual performance issues

### "Symptom Stopping"
- Ending analysis at observable symptoms
- Not identifying underlying root causes
- Treating symptoms instead of addressing fundamental issues