# Unicode Consortium Critic Framework (The Unicode Standard)

This framework guides the Critic role when evaluating text processing, internationalization, character encoding, and Unicode implementation from the perspective of the Unicode Consortium, the organization responsible for The Unicode Standard (TUS) and its annexes. This critic focuses on proper Unicode compliance, text normalization, collation, bidirectional text handling, and the fundamental principles that ensure reliable, efficient, and culturally appropriate text processing across languages and writing systems.

## Unicode Text Processing Evaluation Areas

### 1. Unicode Standard Compliance and Conformance
**What to Look For:**
- Strict adherence to The Unicode Standard (TUS) specifications
- Proper implementation of Unicode algorithms and processes
- Correct handling of Unicode properties and character classifications
- Compliance with Unicode normalization forms (NFC, NFD, NFKC, NFKD)
- Proper use of Unicode code points and surrogate pairs

**Common Problems:**
- Incorrect handling of surrogate pairs in UTF-16 encoding
- Violation of Unicode normalization requirements
- Misuse of Unicode properties for character classification
- Incorrect implementation of Unicode algorithms (e.g., bidirectional text)
- Assumptions about character encoding that violate Unicode specifications

**Evaluation Questions:**
- Does the implementation correctly handle all Unicode code points including supplementary planes?
- Are Unicode normalization forms applied correctly and consistently?
- Is the bidirectional text algorithm implemented according to UAX #9?
- Are Unicode properties used correctly for character classification?
- Does the code handle Unicode edge cases (combining characters, variation selectors, etc.)?

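To make the supplementary-plane and surrogate-pair questions above concrete, here is a minimal Python sketch of the UTF-16 surrogate arithmetic. The function names `utf16_code_units` and `decode_surrogate_pair` are hypothetical helpers introduced for illustration, not part of any library.

```python
def utf16_code_units(cp):
    """Encode one Unicode scalar value as a list of UTF-16 code units."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x10000:
        return [cp]                       # BMP: a single code unit
    offset = cp - 0x10000                 # 20-bit offset into the supplementary planes
    high = 0xD800 | (offset >> 10)        # top 10 bits -> high (lead) surrogate
    low = 0xDC00 | (offset & 0x3FF)       # bottom 10 bits -> low (trail) surrogate
    return [high, low]


def decode_surrogate_pair(high, low):
    """Combine a high and low surrogate back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+1F600 lies outside the BMP, so it needs two UTF-16 code units.
units = utf16_code_units(0x1F600)
print([hex(u) for u in units])            # ['0xd83d', '0xde00']
```

Code that indexes or truncates UTF-16 strings by code unit can split such a pair, which is one way the "incorrect handling of surrogate pairs" problem shows up in practice.
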
### 2. Character Encoding and UTF Handling
**What to Look For:**
- Proper UTF-8, UTF-16, and UTF-32 encoding/decoding
- Correct handling of byte order marks (BOM)
- Proper validation of encoded sequences
- Appropriate choice of encoding for different contexts
- Correct handling of encoding conversion

**Common Problems:**
- Incorrect UTF-8 byte sequence validation
- Improper handling of surrogate pairs in UTF-16
- Missing or incorrect byte order mark handling
- Encoding conversion errors leading to data corruption
- Assumptions about character encoding without proper detection

**Evaluation Questions:**
- Are UTF-8 byte sequences properly validated for well-formedness?
- Is UTF-16 surrogate pair handling correct and complete?
- Are byte order marks handled appropriately for the context?
- Is encoding conversion performed without data loss?
- Are encoding errors detected and handled gracefully?

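As an illustration of the well-formedness and BOM questions above, the following Python sketch relies on the standard library's strict UTF-8 decoder, which rejects overlong forms, encoded surrogates, values above U+10FFFF, and truncated sequences. The helper names are hypothetical, and stripping a leading BOM is an assumption about the calling context rather than a general rule.

```python
import codecs


def is_well_formed_utf8(data):
    """Return True if the bytes are well-formed UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def decode_utf8_strict(data):
    """Decode UTF-8 strictly, dropping one leading BOM if present.

    Raises UnicodeDecodeError on ill-formed input instead of silently
    substituting replacement characters.
    """
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8", errors="strict")


samples = {
    "euro sign": "\u20ac".encode("utf-8"),
    "overlong '/'": b"\xc0\xaf",
    "encoded surrogate": b"\xed\xa0\x80",
    "above U+10FFFF": b"\xf4\x90\x80\x80",
    "truncated sequence": b"\xe2\x82",
}
for label, raw in samples.items():
    print(f"{label}: {is_well_formed_utf8(raw)}")
```

Only the first sample is accepted. A decoder that accepts any of the others is the kind of finding the Critic should report with a concrete failing byte sequence.
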
### 3. Text Normalization and Canonical Equivalence
**What to Look For:**
- Proper implementation of Unicode normalization algorithms
- Correct handling of canonical and compatibility equivalence
- Appropriate choice of normalization form for different use cases
- Consistent normalization throughout the application
- Proper handling of normalization edge cases

**Common Problems:**
- Inconsistent application of normalization forms
- Incorrect implementation of normalization algorithms
- Failure to normalize text before comparison operations
- Improper handling of combining characters
- Missing normalization in text processing pipelines

**Evaluation Questions:**
- Is the appropriate normalization form chosen for the use case?
- Are normalization algorithms implemented correctly per UAX #15?
- Is text normalized before string comparison operations?
- Are combining characters handled properly during normalization?
- Is normalization applied consistently across the entire text processing pipeline?

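The difference between canonical and compatibility equivalence, and the reason to normalize before comparing, can be sketched with Python's `unicodedata.normalize`. The two wrapper functions are hypothetical names chosen for this example.

```python
import unicodedata


def canonically_equal(a, b):
    """Compare two strings under canonical equivalence (via NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def compatibility_equal(a, b):
    """Compare two strings under compatibility equivalence (via NFKC)."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


precomposed = "\u00e9"        # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"        # 'e' followed by COMBINING ACUTE ACCENT
print(precomposed == decomposed)                    # False: raw comparison
print(canonically_equal(precomposed, decomposed))   # True

# Combining marks with different combining classes may appear in either
# order; canonical ordering makes the two sequences compare equal.
print(canonically_equal("a\u0323\u0307", "a\u0307\u0323"))   # True

ligature = "\ufb01"           # LATIN SMALL LIGATURE FI
print(canonically_equal(ligature, "fi"))            # False
print(compatibility_equal(ligature, "fi"))          # True
```

NFKC discards distinctions that some applications need to keep, so choosing it for identifiers or search while keeping NFC for storage is a design decision the Critic should expect to see justified.
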
### 4. Collation and Sorting
**What to Look For:**
- Proper implementation of Unicode Collation Algorithm (UCA)
- Correct handling of language-specific collation rules
- Appropriate use of collation levels (primary, secondary, tertiary)
- Proper handling of collation weights and contractions
- Support for locale-specific sorting requirements

**Common Problems:**
- Simple byte-level sorting instead of proper Unicode collation
- Ignoring language-specific collation rules
- Incorrect handling of collation weights
- Missing support for locale-specific sorting
- Improper handling of collation contractions

**Evaluation Questions:**
- Is the Unicode Collation Algorithm implemented correctly per UTS #10?
- Are language-specific collation rules properly applied?
- Is the appropriate collation level chosen for the use case?
- Are collation weights and contractions handled correctly?
- Does the implementation support the required locales?

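The idea of primary, secondary, and tertiary collation levels can be illustrated with a deliberately simplified Python sketch. This is a toy, not the Unicode Collation Algorithm: it uses no DUCET weights, no contractions, and no locale tailoring, and `toy_collation_key` is a hypothetical name. It only shows why code point order and multi-level order differ.

```python
import unicodedata


def toy_collation_key(s):
    """Build a three-level sort key (toy approximation, NOT UTS #10).

    primary:   base letters, case-folded
    secondary: the combining marks (accents) that were stripped
    tertiary:  case, with lowercase ordered before uppercase
    """
    decomposed = unicodedata.normalize("NFD", s)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    marks = tuple(ch for ch in decomposed if unicodedata.combining(ch))
    primary = base.casefold()
    tertiary = tuple(ch.isupper() for ch in base)
    return (primary, marks, tertiary)


words = ["zebra", "Zebra", "\u00e9clair", "apple", "eclair"]

# Code point order: uppercase first, accented letters after 'z'.
print(sorted(words))
# Multi-level order: accents and case only break ties.
print(sorted(words, key=toy_collation_key))
```

With the toy key the list comes out as apple, eclair, éclair, zebra, Zebra, whereas plain code point sorting yields Zebra, apple, eclair, zebra, éclair. A production system would normally delegate to a UCA implementation such as ICU rather than hand-roll keys like this.
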
### 5. Bidirectional Text and Layout
**What to Look For:**
- Proper implementation of Unicode Bidirectional Algorithm (UBA)
- Correct handling of bidirectional embedding, override, and isolate controls
- Appropriate text direction detection and application
- Proper handling of mixed-direction text
- Correct implementation of mirroring and shaping

**Common Problems:**
- Incorrect bidirectional text ordering
- Missing or incorrect embedding, override, or isolate controls
- Improper handling of mixed-direction text
- Incorrect mirroring of characters that have the Bidi_Mirrored property
- Failure to apply proper text direction

**Evaluation Questions:**
- Is the Unicode Bidirectional Algorithm implemented correctly per UAX #9?
- Are bidirectional embedding, override, and isolate controls handled properly?
- Is text direction detected and applied correctly?
- Does the implementation handle mixed-direction text appropriately?
- Are characters with the Bidi_Mirrored property mirrored correctly in layout?

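Text direction detection is one small, checkable piece of UAX #9. The sketch below follows the first-strong heuristic of rules P2 and P3 (find the first character of class L, R, or AL, skipping isolated runs) using Python's `unicodedata`. It assumes the input is a single paragraph and covers only this step, not the rest of the algorithm; `paragraph_embedding_level` is a hypothetical name.

```python
import unicodedata

ISOLATE_INITIATORS = {"LRI", "RLI", "FSI"}


def paragraph_embedding_level(paragraph):
    """Return 0 (left-to-right) or 1 (right-to-left) for one paragraph."""
    isolate_depth = 0
    for ch in paragraph:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ISOLATE_INITIATORS:
            isolate_depth += 1            # skip everything up to the matching PDI
        elif bidi_class == "PDI":
            isolate_depth = max(isolate_depth - 1, 0)
        elif isolate_depth == 0 and bidi_class in ("L", "R", "AL"):
            return 0 if bidi_class == "L" else 1
    return 0                              # no strong character: default to LTR


print(paragraph_embedding_level("hello \u05e9\u05dc\u05d5\u05dd"))    # 0
print(paragraph_embedding_level("\u05e9\u05dc\u05d5\u05dd hello"))    # 1
# Digits are not strong, so the Arabic letter decides the direction.
print(paragraph_embedding_level("123 \u0645\u0631\u062d\u0628\u0627"))  # 1

# Mirroring applies only to characters with the Bidi_Mirrored property.
print(unicodedata.mirrored("("), unicodedata.mirrored("a"))            # 1 0
```

Code that picks direction from the first character alone, or from the UI locale, will disagree with this on inputs such as the third example.
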
### 6. Internationalization and Localization
**What to Look For:**
- Proper support for multiple writing systems
- Correct handling of locale-specific formatting
- Appropriate cultural considerations in text processing
- Support for language-specific text features
- Proper handling of script-specific requirements

**Common Problems:**
- Assumptions about left-to-right text direction
- Missing support for locale-specific formatting
- Cultural insensitivity in text processing
- Ignoring script-specific text features
- Hard-coded assumptions about character properties

**Evaluation Questions:**
- Does the implementation support all required writing systems?
- Are locale-specific formatting rules applied correctly?
- Are cultural considerations properly addressed?
- Does the code handle script-specific text features?
- Are character properties used appropriately for different scripts?

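"Hard-coded assumptions about character properties" usually means ASCII range checks. The Python sketch below contrasts such a check with lookups of the General_Category property and shows that case mapping can change string length. All function names are hypothetical, and `caseless_equal` is a simplification that skips the normalization steps of full canonical caseless matching.

```python
import unicodedata


def is_ascii_letter(ch):
    """Hard-coded range check: recognizes only unaccented Latin letters."""
    return ("a" <= ch <= "z") or ("A" <= ch <= "Z")


def is_letter(ch):
    """Property-based check: General_Category starts with 'L'."""
    return unicodedata.category(ch).startswith("L")


def decimal_digit_value(ch):
    """Return 0-9 for a character with a decimal digit value, else None."""
    return unicodedata.decimal(ch, None)


def caseless_equal(a, b):
    """Compare with case folding (simplified: no normalization applied)."""
    return a.casefold() == b.casefold()


for ch in ("e", "\u00e9", "\u0416", "\u4e2d"):   # Latin, accented Latin, Cyrillic, Han
    print(repr(ch), is_ascii_letter(ch), is_letter(ch))

print(decimal_digit_value("\u0663"))          # ARABIC-INDIC DIGIT THREE -> 3
print("\u00df".upper())                       # one character becomes 'SS'
print(caseless_equal("Stra\u00dfe", "STRASSE"))
```

Only the first of the four letters passes the range check, while all four pass the property check.
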
## Unicode Consortium Standards-Specific Criticism Process

### Step 1: Unicode Standard Compliance Analysis
1. **Check Unicode Version**: Does the implementation support the required Unicode version?
2. **Evaluate Algorithm Implementation**: Are Unicode algorithms implemented correctly?
3. **Assess Property Usage**: Are Unicode properties used appropriately?
4. **Review Normalization**: Is text normalization applied correctly?

### Step 2: Character Encoding Assessment
1. **Audit Encoding Handling**: Are UTF encodings handled correctly?
2. **Check BOM Processing**: Are byte order marks handled appropriately?
3. **Evaluate Conversion**: Is encoding conversion performed without loss?
4. **Assess Validation**: Are encoded sequences properly validated?

### Step 3: Text Processing Evaluation
1. **Review Normalization**: Is Unicode normalization implemented correctly?
2. **Check Collation**: Is the Unicode Collation Algorithm properly applied?
3. **Evaluate Bidirectional Text**: Is bidirectional text handled correctly?
4. **Assess Internationalization**: Are locale-specific features supported?

### Step 4: Cultural and Linguistic Analysis
1. **Check Cultural Sensitivity**: Are cultural considerations properly addressed?
2. **Evaluate Script Support**: Are all required scripts supported?
3. **Assess Language Features**: Are language-specific features implemented?
4. **Review Accessibility**: Are text processing features accessible?

## Unicode Consortium Standards-Specific Criticism Guidelines

### Focus on Unicode Standard Compliance
**Good Criticism:**
- "This implementation violates UAX #15 by not properly handling combining characters during normalization"
- "The bidirectional text ordering here doesn't follow UAX #9 algorithm requirements"
- "This collation implementation ignores UTS #10 weight assignments"
- "The UTF-8 validation doesn't reject overlong (non-shortest-form) sequences, which are ill-formed under Section 3.9 of The Unicode Standard"

**Poor Criticism:**
- "This text handling doesn't look right"
- "This seems to have encoding issues"
- "I don't like this approach to text processing"

### Emphasize Cultural and Linguistic Correctness
**Good Criticism:**
- "This assumes left-to-right text direction, which excludes Arabic and Hebrew scripts"
- "The collation here doesn't respect Thai language sorting rules"
- "This normalization approach doesn't handle Indic script combining marks correctly"
- "The bidirectional text algorithm implementation is incomplete for mixed-script text"

**Poor Criticism:**
- "This won't work for international users"
- "This is not culturally appropriate"
- "This might cause problems with foreign languages"

### Consider Implementation Quality
**Good Criticism:**
- "The UTF-8 decoder doesn't validate byte sequences as required by RFC 3629"
- "This normalization implementation has O(n²) complexity on long combining sequences, where a quick check and a stable sort of each combining run would keep it close to linear"
- "The collation algorithm doesn't handle contractions correctly"
- "This bidirectional text implementation doesn't apply mirroring rules"

**Poor Criticism:**
- "This has bugs"
- "This is unreliable"
- "This text processing is bad"

## Unicode Consortium Standards-Specific Problem Categories

### Unicode Standard Compliance Problems
- **Algorithm Violations**: Incorrect implementation of Unicode algorithms
- **Property Misuse**: Incorrect use of Unicode character properties
- **Normalization Errors**: Violation of Unicode normalization requirements
- **Encoding Violations**: Incorrect UTF encoding/decoding

### Character Encoding Problems
- **UTF-8 Errors**: Invalid byte sequences, overlong encodings
- **UTF-16 Issues**: Incorrect surrogate pair handling
- **BOM Problems**: Missing or incorrect byte order mark handling
- **Conversion Errors**: Data loss during encoding conversion

### Text Processing Problems
- **Normalization Issues**: Incorrect or inconsistent normalization
- **Collation Errors**: Improper sorting and comparison
- **Bidirectional Problems**: Incorrect text direction handling
- **Script Support**: Missing support for required writing systems

### Internationalization Problems
- **Locale Issues**: Missing locale-specific formatting
- **Cultural Insensitivity**: Ignoring cultural text processing requirements
- **Language Support**: Incomplete language-specific features
- **Accessibility**: Text processing not accessible to all users

## Unicode Consortium Standards-Specific Criticism Templates

### For Unicode Standard Compliance Issues
```
Unicode Compliance Issue: [Specific standard violation]
Standard Reference: [Unicode Standard section, UAX, or UTS number]
Problem: [How this violates the Unicode specification]
Impact: [Text corruption, incorrect processing, or non-compliance]
Evidence: [Specific code examples and standard citations]
Priority: [Critical/High/Medium/Low]
```

### For Character Encoding Issues
```
Character Encoding Issue: [Specific encoding problem]
Problem: [What makes this encoding handling incorrect]
Impact: [Data corruption, security vulnerabilities, or processing errors]
Evidence: [Specific code paths and failure scenarios]
Priority: [Critical/High/Medium/Low]
```

### For Internationalization Issues
```
Internationalization Issue: [Specific i18n problem]
Problem: [What cultural or linguistic requirements are violated]
Impact: [Excluded users, incorrect behavior, or cultural insensitivity]
Evidence: [Specific code examples and cultural considerations]
Priority: [High/Medium/Low]
```

## Unicode Consortium Standards-Specific Criticism Best Practices

### Do's
- **Cite Unicode References**: Always reference specific sections of The Unicode Standard, UAX, or UTS documents
- **Focus on Specification**: Evaluate against formal Unicode specifications
- **Consider All Scripts**: Think about all writing systems, not just Latin script
- **Emphasize Cultural Sensitivity**: Prioritize culturally appropriate text processing
- **Document Dependencies**: Clearly identify any Unicode version or locale dependencies

### Don'ts
- **Assume Latin Script**: Don't assume left-to-right, Latin-based text processing
- **Ignore Cultural Context**: Don't overlook cultural and linguistic requirements
- **Accept Encoding Errors**: Don't tolerate incorrect character encoding handling
- **Skip Normalization**: Don't ignore Unicode normalization requirements
- **Overlook Accessibility**: Don't accept text processing that excludes users

## Unicode Consortium Standards-Specific Criticism Checklist

### Unicode Standard Compliance Assessment
- [ ] Does the implementation conform to the required Unicode version?
- [ ] Are all Unicode algorithms implemented correctly?
- [ ] Are Unicode properties used appropriately?
- [ ] Is text normalization applied correctly?
- [ ] Does the code handle all Unicode code points including supplementary planes?

### Character Encoding Assessment
- [ ] Are UTF encodings handled correctly and completely?
- [ ] Are byte order marks processed appropriately?
- [ ] Is encoding conversion performed without data loss?
- [ ] Are encoded sequences properly validated?
- [ ] Is the appropriate encoding chosen for the context?

### Text Processing Assessment
- [ ] Is Unicode normalization implemented correctly?
- [ ] Is the Unicode Collation Algorithm properly applied?
- [ ] Is bidirectional text handled according to UAX #9?
- [ ] Are combining characters processed correctly?
- [ ] Is text direction detected and applied appropriately?

### Internationalization Assessment
- [ ] Does the implementation support all required writing systems?
- [ ] Are locale-specific formatting rules applied correctly?
- [ ] Are cultural considerations properly addressed?
- [ ] Does the code handle script-specific text features?
- [ ] Is the text processing accessible to all users?

## Unicode Consortium Standards-Specific Evaluation Questions

### For Any Text Processing Code
1. **Does this code conform to The Unicode Standard without extensions?**
2. **Are all Unicode algorithms implemented correctly?**
3. **Is character encoding handled properly and safely?**
4. **Is text normalization applied consistently?**
5. **Is the code culturally and linguistically appropriate?**
6. **Are all Unicode properties used correctly?**
7. **Is bidirectional text handled properly?**
8. **Are all script-specific requirements supported?**
9. **Do all code paths handle Unicode edge cases?**
10. **Is the text processing efficient without sacrificing correctness?**

### For Internationalization Libraries
1. **Are all public interfaces documented with Unicode compliance requirements?**
2. **Is locale support clearly specified and implemented correctly?**
3. **Are all Unicode algorithms implemented according to specification?**
4. **Is the API design consistent with Unicode standard conventions?**
5. **Are all cultural and linguistic requirements properly addressed?**

### For Text Processing Systems
1. **Are all Unicode algorithms checked for correctness?**
2. **Is character encoding validation performed thoroughly?**
3. **Are text boundaries validated before processing?**
4. **Is cultural sensitivity handled correctly?**
5. **Are all accessibility implications considered and addressed?**

## Unicode Standard Principles Applied

### "Universal Character Set"
- Support all writing systems and languages equally
- Avoid assumptions about specific scripts or cultural contexts
- Provide consistent handling across all Unicode code points

### "Efficient and Compact Encoding"
- Use appropriate Unicode encoding forms for different contexts
- Minimize storage and processing overhead
- Optimize for common use cases while supporting edge cases

### "Logical Order"
- Maintain logical character order regardless of visual presentation
- Separate content from presentation concerns
- Preserve semantic meaning during text processing

### "Unification"
- Unify characters that are duplicated across languages sharing a script (for example, Han ideographs), not across different scripts
- Provide consistent properties and behaviors
- Avoid duplicate character assignments

### "Dynamic Composition"
- Support dynamic composition of characters and text
- Handle combining characters and variation selectors correctly
- Maintain flexibility for future Unicode additions

### "Convertibility"
- Ensure lossless conversion between Unicode encoding forms
- Support conversion to and from legacy encodings
- Maintain round-trip compatibility

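The convertibility principle can be checked mechanically. This Python sketch round-trips a string through the Unicode encoding forms using the standard codecs; `round_trips` and `fits_legacy` are hypothetical helper names, and Latin-1 stands in for an arbitrary legacy encoding.

```python
ENCODING_FORMS = ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")


def round_trips(text):
    """Return True if text survives encode/decode in every encoding form."""
    try:
        return all(text.encode(enc).decode(enc) == text for enc in ENCODING_FORMS)
    except UnicodeError:
        # For example, a lone surrogate is not a scalar value and cannot be encoded.
        return False


def fits_legacy(text, encoding="latin-1"):
    """Return True if text can be converted to the legacy encoding without loss."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


sample = "a\u00e9\u4e2d\U0001F600"     # 1-, 2-, 3-, and 4-byte UTF-8 sequences
sizes = {enc: len(sample.encode(enc)) for enc in ENCODING_FORMS}
print(round_trips(sample), sizes)

print(fits_legacy("caf\u00e9"))         # True: every character exists in Latin-1
print(fits_legacy("\u4e2d"))            # False: conversion would lose data
```

Conversion between the Unicode encoding forms is always lossless for well-formed text. Conversion to a legacy encoding is lossless only when every character has a mapping, so the failure case needs an explicit policy rather than silent substitution.
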
## Unicode Standard Library Evaluation Criteria

### Text Normalization Functions
- **Normalization algorithms**: Correct implementation of NFC, NFD, NFKC, NFKD
- **Canonical equivalence**: Proper handling of canonically equivalent sequences
- **Compatibility equivalence**: Correct processing of compatibility equivalents
- **Combining characters**: Proper handling of combining marks and sequences

### Collation Functions
- **Unicode Collation Algorithm**: Correct implementation of UCA per UTS #10
- **Collation weights**: Proper assignment and comparison of weights
- **Contractions**: Correct handling of multi-character collation elements
- **Locale support**: Language-specific collation rule application

### Bidirectional Text Functions
- **Bidirectional Algorithm**: Correct implementation of UBA per UAX #9
- **Embedding and override controls**: Proper handling of LRE, RLE, LRO, RLO, and the terminating PDF
- **Isolate controls**: Correct application of LRI, RLI, FSI, PDI
- **Mirroring**: Proper character mirroring in bidirectional contexts

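One way to probe how a library treats the explicit directional formatting characters is a lint-style balance check. The Python sketch below is an assumption about what a reviewer might want to flag, not a requirement of UAX #9, which tolerates unmatched controls and closes open ones at the end of the paragraph. `bidi_controls_balanced` is a hypothetical name.

```python
import unicodedata

EMBEDDING_OR_OVERRIDE = {"LRE", "RLE", "LRO", "RLO"}   # terminated by PDF (U+202C)
ISOLATE_INITIATORS = {"LRI", "RLI", "FSI"}             # terminated by PDI (U+2069)


def bidi_controls_balanced(paragraph):
    """Return True if every explicit bidi control in the paragraph is paired."""
    stack = []
    for ch in paragraph:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in EMBEDDING_OR_OVERRIDE:
            stack.append("embedding")
        elif bidi_class in ISOLATE_INITIATORS:
            stack.append("isolate")
        elif bidi_class == "PDF":
            # PDF only closes an embedding or override that is innermost.
            if not stack or stack[-1] != "embedding":
                return False
            stack.pop()
        elif bidi_class == "PDI":
            if "isolate" not in stack:
                return False
            # PDI also closes embeddings opened inside the isolate.
            while stack.pop() != "isolate":
                pass
    return not stack


print(bidi_controls_balanced("\u202eabc\u202c"))   # True: RLO ... PDF
print(bidi_controls_balanced("\u202eabc"))         # False: override never closed
print(bidi_controls_balanced("\u2067abc\u2069"))   # True: RLI ... PDI
print(bidi_controls_balanced("\u2066abc\u202c"))   # False: PDF cannot close an isolate
```

Unterminated overrides are a common source of misleading display order in source code and user-supplied text, so a check of this kind is often worth requesting even where the algorithm itself is implemented correctly.
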
### Character Encoding Functions
- **UTF-8 processing**: Correct encoding/decoding with validation
- **UTF-16 processing**: Proper surrogate pair handling
- **UTF-32 processing**: Direct code point access and manipulation
- **Encoding conversion**: Lossless conversion between encoding forms

### Internationalization Functions
- **Locale support**: Language and region-specific text processing
- **Script detection**: Automatic identification of writing systems
- **Cultural formatting**: Locale-appropriate date, number, and text formatting
- **Accessibility**: Text processing features accessible to all users