begriffs open source - pg_scribe/blob - logical-replication-backup-design.md

   1 # Incremental SQL Backup System Using PostgreSQL Logical Replication
   2
   3 **PostgreSQL Version**: This design is based on PostgreSQL 18.0 documentation. While most features (logical replication, event triggers, pg_recvlogical) are available in earlier versions (PostgreSQL 10+), verify specific parameter availability (e.g., `max_slot_wal_keep_size` requires PostgreSQL 13+) for your target version.
   4
   5 ## Executive Summary
   6
   7 This document details the design for a PostgreSQL backup system that produces human-readable, plain SQL incremental backups using logical replication. The system creates backups that remain readable and restorable for 10+ years while supporting online operation and crash safety.
   8
   9 **Design Decision**: Use `pg_recvlogical` with the `decoder_raw` plugin for DML capture, combined with event triggers for DDL tracking and periodic `pg_dumpall --globals-only` for shared objects.
  10
  11 **Why This Works**:
  12 - **Built-in tooling handles complexity**: `pg_recvlogical` provides streaming infrastructure, crash recovery, and position tracking
  13 - **No transformation layer needed**: `decoder_raw` produces production-ready SQL directly
  14 - **Complete coverage**: Event triggers + `pg_dumpall --globals-only` captures all DDL
  15 - **Long-term readability**: Plain SQL format that can be executed years later
  16
  17 **Key Requirements**:
  18 1. **DDL tracking** - Event triggers capture schema changes; `pg_dumpall --globals-only` handles shared objects
  19 2. **Replica identity configuration** - All tables need proper configuration for UPDATE/DELETE
  20 3. **Aggressive monitoring** - Replication slots must be monitored to prevent operational issues
  21 4. **decoder_raw installation** - Third-party plugin must be compiled and installed
  22
  23 ## Architecture Overview
  24
  25 ### High-Level Design
  26
  27 ```
  28 ┌─────────────────────────────────────────────────────────────┐
  29 │                    PostgreSQL Database                       │
  30 │                                                              │
  31 │  ┌────────────────┐         ┌──────────────────┐           │
  32 │  │  Regular Tables│────────▶│  WAL (Write-Ahead│           │
  33 │  │  (DML Changes) │         │       Log)       │           │
  34 │  └────────────────┘         └──────────────────┘           │
  35 │                                      │                       │
  36 │                                      ▼                       │
  37 │                         ┌─────────────────────────┐         │
  38 │                         │ Logical Decoding Process│         │
  39 │                         │  (decoder_raw plugin)   │         │
  40 │                         └─────────────────────────┘         │
  41 │                                      │                       │
  42 └──────────────────────────────────────┼───────────────────────┘
  43                                        │
  44                                        ▼
  45                         ┌─────────────────────────────┐
  46                         │  Replication Slot           │
  47                         │  (Tracks position, durable) │
  48                         └─────────────────────────────┘
  49                                        │
  50                                        ▼
  51                         ┌─────────────────────────────┐
  52                         │    pg_recvlogical Tool     │
  53                         │ (Built-in PostgreSQL util)  │
  54                         └─────────────────────────────┘
  55                                        │
  56                         ┌──────────────┴──────────────┐
  57                         ▼                             ▼
  58           ┌─────────────────────┐      ┌─────────────────────┐
  59           │  Incremental Files  │      │  Full pg_dump       │
  60           │  (SQL Changes)      │      │  (Periodic)         │
  61           │  - 2024-01-01.sql   │      │  - base-2024-01.sql │
  62           │  - 2024-01-02.sql   │      │  - base-2024-02.sql │
  63           │  - ...              │      │  - ...              │
  64           └─────────────────────┘      └─────────────────────┘
  65 ```
  66
  67 ### Core Components
  68
  69 1. **Logical Replication Slot**: Durable position tracker in PostgreSQL
  70 2. **decoder_raw Plugin**: Transforms binary WAL to executable SQL
  71 3. **pg_recvlogical**: Built-in PostgreSQL tool that streams logical decoding output
  72 4. **Base Backup System**: Regular full `pg_dump` backups with `--snapshot` for consistency
  73 5. **Schema Tracking System**: Event triggers + `pg_dumpall --globals-only` for DDL changes
  74
  75 ## How It Works
  76
  77 ### DML Capture via Logical Replication
  78
  79 PostgreSQL's logical replication decodes the Write-Ahead Log (WAL) into logical changes. The `decoder_raw` plugin outputs these directly as executable SQL:
  80
  81 ```sql
  82 BEGIN;
  83 INSERT INTO public.users (id, name, email) VALUES (1, 'Alice', 'alice@example.com');
  84 UPDATE public.users SET name = 'Alice Smith' WHERE id = 1;
  85 DELETE FROM public.orders WHERE id = 42;
  86 COMMIT;
  87 ```
  88
  89 **Key Properties**:
  90 - **Crash-safe**: Replication slots persist position across crashes (positions persisted at checkpoint intervals; after crash, slot may return to earlier LSN causing recent changes to be replayed)
  91 - **Consistent**: Transaction boundaries are preserved
  92 - **Online**: Runs without blocking database operations
  93 - **Idempotent positioning**: Can restart from last known position (clients responsible for handling duplicate messages)
  94
  95 ### DDL Capture via Event Triggers
  96
  97 Logical replication does **not** capture DDL (CREATE TABLE, ALTER TABLE, etc.). We solve this with event triggers:
  98
  99 ```sql
 100 -- Create a table to log DDL changes
 101 CREATE TABLE ddl_history (
 102     id SERIAL PRIMARY KEY,
 103     executed_at TIMESTAMP DEFAULT now(),
 104     ddl_command TEXT,
 105     object_type TEXT,
 106     object_identity TEXT,
 107     query TEXT
 108 );
 109
 110 -- Create event trigger function
 111 CREATE OR REPLACE FUNCTION log_ddl_changes()
 112 RETURNS event_trigger AS $$
 113 DECLARE
 114     obj RECORD;
 115 BEGIN
 116     FOR obj IN SELECT * FROM pg_event_trigger_ddl_commands()
 117     LOOP
 118         INSERT INTO ddl_history (ddl_command, object_type, object_identity, query)
 119         VALUES (obj.command_tag, obj.object_type, obj.object_identity, current_query());
 120     END LOOP;
 121 END;
 122 $$ LANGUAGE plpgsql;
 123
 124 -- Register the event trigger
 125 CREATE EVENT TRIGGER capture_ddl ON ddl_command_end
 126     EXECUTE FUNCTION log_ddl_changes();
 127 ```
 128
 129 **How it works**:
 130 1. Event trigger captures all DDL commands
 131 2. Stores actual query text in `ddl_history` table with timestamps
 132 3. The `ddl_history` table is replicated via logical replication
 133 4. During restore, apply DDL changes in chronological order
 134
 135 **Limitations**:
 136 - Event triggers don't fire for shared objects: databases, roles (role definitions and role memberships), tablespaces, parameter privileges, and ALTER SYSTEM commands
 137 - Solution: Use periodic `pg_dumpall --globals-only` to capture shared objects
 138
 139 ## Implementation Components
 140
 141 ### 1. Initial Setup Script
 142
 143 **Purpose**: Bootstrap the backup system
 144
 145 **Tasks**:
 146 - Install decoder_raw plugin (compile from source)
 147 - Create logical replication slot with snapshot export (see detailed procedure below)
 148 - Capture the exported snapshot identifier
 149 - Take initial base backup using the exported snapshot
 150 - Set up event triggers for DDL capture
 151 - Create initial `pg_dumpall --globals-only` backup
 152 - Configure `REPLICA IDENTITY` on tables without primary keys
 153 - Document PostgreSQL version and installed extensions
 154
 155 **Critical Detail - Initial Snapshot Consistency**:
 156
 157 From PostgreSQL documentation (Section 47.2.5):
 158 > When a new replication slot is created using the streaming replication interface, a snapshot is exported which will show exactly the state of the database after which all changes will be included in the change stream.
 159
 160 This ensures the base backup and incremental stream are perfectly aligned with no gaps or overlaps.
 161
 162 **Step-by-Step Setup Procedure**:
 163
 164 1. **Create replication slot with snapshot export**:
 165    - Use `pg_recvlogical --create-slot` with `--if-not-exists` flag for idempotency
 166    - Command outputs snapshot identifier in format: `snapshot: 00000003-00000001-1`
 167    - Capture this identifier from stdout/stderr (appears on stderr with "snapshot:" prefix)
 168
 169 2. **Extract and validate snapshot identifier**:
 170    - Parse the snapshot identifier from slot creation output
 171    - Verify identifier was successfully captured before proceeding
 172    - Snapshot is only valid during the creating session/connection
 173
 174 3. **Take synchronized base backup**:
 175    - Use `pg_dump --snapshot=<identifier>` with captured snapshot
 176    - This guarantees perfect alignment between base backup and incremental stream
 177    - For large databases (>50GB), consider parallel dump with `-j` and `-Fd` format
 178
 179 4. **Capture globals and metadata**:
 180    - Run `pg_dumpall --globals-only` for roles, tablespaces
 181    - Document PostgreSQL version, extensions, encoding in metadata file
 182
 183 **Critical Considerations**:
 184 - **Snapshot validity**: Exported snapshot only valid until the session that created it disconnects; must use immediately or maintain connection
 185 - **Idempotency**: The `--if-not-exists` flag allows safe re-execution of setup scripts
 186 - **Timing**: Entire sequence (create slot → capture snapshot → run pg_dump) must complete while snapshot remains valid
 187
 188 ### 2. Incremental Backup Collection
 189
 190 **Tool**: Built-in `pg_recvlogical` utility with `decoder_raw` plugin
 191
 192 **Key Configuration**:
 193
 194 Run `pg_recvlogical --start` with the following important parameters:
 195
 196 - **Status interval** (`--status-interval`): Controls how often client reports position back to server (recommended: 10 seconds)
 197   - Lower values (5-10s) allow server to advance slot position and free WAL more quickly
 198   - Too high risks slot lag and WAL accumulation
 199   - Balance between slot health and network overhead
 200
 201 - **Fsync interval** (`--fsync-interval`): Controls disk write safety (recommended: 10 seconds)
 202   - Frequency of forced disk synchronization for crash safety on backup client
 203   - 0 disables fsync (faster but risks data loss if client crashes)
 204   - Higher values improve performance but increase window of potential loss
 205
 206 - **Plugin options** (`--option`): Pass `include_transaction=on` to decoder_raw
 207   - Includes BEGIN/COMMIT statements in output
 208   - Essential for maintaining transaction boundaries during restore
 209
 210 **What pg_recvlogical provides**:
 211 - Streams decoded changes continuously from the replication slot
 212 - Handles connection failures and automatic reconnection
 213 - Tracks LSN positions with status updates to the server
 214 - Supports file rotation via SIGHUP signal (for log rotation without stopping stream)
 215
 216 **What decoder_raw provides**:
 217 - ✅ Schema-qualified table names: `INSERT INTO public.users (...)`
 218 - ✅ Proper column name quoting with `quote_identifier()`
 219 - ✅ Transaction boundaries via `include_transaction=on` option (BEGIN/COMMIT)
 220 - ✅ Intelligent replica identity handling (DEFAULT, INDEX, FULL, NOTHING)
 221 - ✅ Comprehensive data type support (booleans, NaN, Infinity, NULL, strings, bit strings)
 222 - ✅ TOAST optimization (skips unchanged TOAST columns in UPDATEs)
 223 - ✅ Production-quality memory management
 224
 225 **Custom wrapper script tasks**:
 226 - File rotation and timestamping
 227 - Coordinate with monitoring system
 228 - Metadata file generation (PostgreSQL version, extensions, encoding, collation)
 229
 230 ### 3. Periodic Full Backup Script
 231
 232 **Purpose**: Take regular full backups as restore points
 233
 234 **Tasks**:
 235 - Execute `pg_dump` to create full database backup
 236 - Execute `pg_dumpall --globals-only` to capture shared objects (databases, roles, tablespaces)
 237 - Compress old backups to save space
 238 - Implement retention policy (delete old backups)
 239
 240 **Backup Approaches**:
 241
 242 1. **Standard single-threaded backup**:
 243    - Use `pg_dump --file=backup.sql` for plain SQL output
 244    - Simpler, adequate for small to medium databases (<50GB)
 245    - Single connection to database
 246
 247 2. **Parallel backup for large databases**:
 248    - Use `pg_dump --format=directory --jobs=N`
 249    - Can reduce backup time by 50-80% for large databases
 250    - Each worker needs a database connection: requires `max_connections >= N + 1`
 251    - Recommended: 2-4 workers for most systems; up to 8 for high-end hardware
 252    - Most beneficial with many large tables; less effective with few large tables
 253
 254 **When to Use Parallel Dumps**:
 255 - Database size > 50GB: Consider parallel dump
 256 - Database size > 500GB: Strongly recommended
 257 - Test with different worker counts to find optimal performance for your workload
 258 - Monitor CPU and I/O during initial tests
 259
 260 **Compression**:
 261 - Directory format backups can be compressed after creation (tar + gzip/zstd)
 262 - Or use `pg_dump --compress=<method>` for built-in compression
 263
 264 ### 4. Restore Script
 265
 266 **Purpose**: Restore database to latest captured state from base + incremental backups
 267
 268 **Restore Process**:
 269 1. Locate most recent full backup
 270 2. Find all incremental backups since that full backup
 271 3. Create new target database
 272 4. Restore `pg_dumpall --globals-only` (shared objects: roles, tablespaces)
 273 5. Restore full `pg_dump` backup (use `pg_restore` for directory format)
 274 6. Apply DDL changes from `ddl_history` table in chronological order
 275 7. Apply all incremental SQL backups in chronological order
 276 8. Synchronize sequence values using `setval()` with max values from tables
 277 9. Verify data integrity (row counts, application smoke tests)
 278
 279 **Sequence Synchronization**:
 280 - After applying all changes, sequences may be behind actual max values
 281 - Query `information_schema.sequences` to find all sequences
 282 - For each sequence: `SELECT setval('sequence_name', COALESCE(MAX(id_column), 1)) FROM table`
 283 - Can be automated by generating setval queries from schema metadata
 284
 285 **Handling Duplicate Transactions**:
 286
 287 After PostgreSQL crash, replication slot may return to earlier LSN, causing some transactions to be streamed again. The restore process handles this naturally:
 288
 289 - Most SQL operations are idempotent or fail safely:
 290   - INSERT fails on duplicate primary key (acceptable during restore)
 291   - UPDATE reapplies same values (idempotent)
 292   - DELETE succeeds or reports row not found (acceptable)
 293 - Transaction boundaries (BEGIN/COMMIT from `include_transaction=on`) preserve consistency
 294 - Simply apply all incremental files in order; duplicates will be handled correctly
 295
 296 **Alternative: If exact-once semantics required**:
 297 - Track last successfully applied LSN from `pg_replication_slots.confirmed_flush_lsn`
 298 - Use `pg_recvlogical --startpos=<LSN>` to resume from known position during collection
 299 - More complex: requires persistent state management
 300 - Generally unnecessary for backup/restore use case
 301
 302 ### 5. Monitoring and Health Check Script
 303
 304 **Purpose**: Prevent operational issues from inactive replication slots
 305
 306 **Critical Metrics**:
 307 ```sql
 308 -- Check replication slot health
 309 SELECT
 310     slot_name,
 311     slot_type,
 312     database,
 313     active,
 314     pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as replication_lag,
 315     pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) as confirmed_lag
 316 FROM pg_replication_slots
 317 WHERE slot_type = 'logical';
 318 ```
 319
 320 **Alerting**:
 321 - **Critical alert** when `restart_lsn` falls more than 1GB behind
 322 - **Emergency alert** when slot lag exceeds 10GB or age exceeds 24 hours
 323 - **Emergency procedure** documented to drop slot if it threatens database availability
 324
 325 **Available Tools**:
 326 - **Prometheus + postgres_exporter + Grafana**: Open-source monitoring stack
 327 - **pgDash** (https://pgdash.io/): Commercial PostgreSQL monitoring
 328 - **check_postgres**: Nagios/Icinga/Zabbix integration
 329 - **Built-in views**: `pg_replication_slots`, `pg_stat_replication_slots`
 330
 331 ## decoder_raw Plugin Details
 332
 333 **Source**: https://github.com/michaelpq/pg_plugins/tree/main/decoder_raw
 334 **License**: PostgreSQL License (permissive, production-ready)
 335 **Compatibility**: PostgreSQL 9.4+
 336
 337 **Installation**:
 338 ```bash
 339 # Install PostgreSQL development headers
 340 apt-get install postgresql-server-dev-XX  # Debian/Ubuntu
 341 yum install postgresql-devel              # RHEL/CentOS
 342
 343 # Clone and compile
 344 git clone https://github.com/michaelpq/pg_plugins.git
 345 cd pg_plugins/decoder_raw
 346 make
 347 sudo make install
 348
 349 # Verify installation
 350 ls $(pg_config --pkglibdir)/decoder_raw.so
 351 ```
 352
 353 **Why decoder_raw is essential**:
 354 - Eliminates the entire SQL transformation layer
 355 - Handles all data type escaping correctly (strings, NULL, NaN, Infinity, booleans)
 356 - Produces production-ready SQL that can be executed with `psql -f changes.sql`
 357 - Mature codebase with comprehensive test suite
 358
 359 **Known Limitation of decoder_raw Plugin**:
 360 - ⚠️ **TRUNCATE not captured by decoder_raw**: While PostgreSQL logical replication supports TRUNCATE replication (documented in Section 29.8), the decoder_raw plugin does not implement the `truncate_cb` callback to output TRUNCATE statements
 361 - **Workarounds** (all viable):
 362   1. Event triggers CAN and WILL capture TRUNCATE as DDL (stored in `ddl_history` table) - **recommended approach**
 363   2. Use `DELETE FROM table` instead of TRUNCATE (slower but captured by decoder_raw)
 364   3. Contribute `truncate_cb` callback implementation to the decoder_raw project upstream
 365 - **Important**: TRUNCATE operations will be captured and replicated via the event trigger mechanism, so this limitation has a complete workaround within this design
 366
 367 ## Key Challenges and Solutions
 368
 369 ### 1. Replica Identity Required for UPDATE/DELETE
 370
 371 **Problem**: Tables need replica identity for UPDATE/DELETE operations.
 372
 373 From PostgreSQL documentation (Section 29.1.1):
 374 > Tables with a replica identity defined as `NOTHING`, `DEFAULT` without a primary key, or `USING INDEX` with a dropped index, **cannot support UPDATE or DELETE operations**. **Attempting such operations will result in an error on the publisher.**
 375
 376 This means UPDATE/DELETE will **fail on the source database**, not just during restore!
 377
 378 **Solution**: Ensure all tables have one of:
 379 - A primary key (automatic replica identity)
 380 - A unique index configured via `REPLICA IDENTITY USING INDEX index_name`
 381 - Explicit `REPLICA IDENTITY FULL` setting (inefficient, last resort)
 382
 383 **Example**:
 384 ```sql
 385 -- Table without primary key will error on UPDATE/DELETE
 386 CREATE TABLE logs (timestamp TIMESTAMPTZ, message TEXT);
 387
 388 -- Fix: Set replica identity to FULL
 389 ALTER TABLE logs REPLICA IDENTITY FULL;
 390 ```
 391
 392 ### 2. Replication Slots Prevent WAL Cleanup
 393
 394 **Problem**: Inactive replication slots prevent WAL cleanup, leading to:
 395 1. Disk fills up (WAL files not cleaned)
 396 2. Table bloat (VACUUM cannot clean old row versions)
 397 3. **Database shutdown** (transaction ID wraparound)
 398
 399 From PostgreSQL documentation (Section 47.2.2):
 400 > In extreme cases this could cause the database to shut down to prevent transaction ID wraparound.
 401
 402 **Solution**:
 403 - **Monitor slot lag aggressively** (see monitoring section)
 404 - Set `max_slot_wal_keep_size` parameter (PostgreSQL 13+) to limit WAL retention
 405 - Have documented emergency procedure to drop slot if needed
 406 - Consider `pg_replication_slot_advance()` to skip ahead (loses backup coverage)
 407
 408 ### 3. Sequences Are Not Replicated
 409
 410 **Problem**: Sequence values are not captured in logical replication.
 411
 412 **Solution**:
 413 - Use `pg_dump --sequence-data` (enabled by default) in periodic full dumps
 414 - After restore, synchronize sequences:
 415   ```sql
 416   SELECT setval('users_id_seq', (SELECT MAX(id) FROM users));
 417   ```
 418
 419 ### 4. Large Objects Are Not Replicated
 420
 421 **Problem**: PostgreSQL large objects are not captured in logical replication.
 422
 423 **Solution**:
 424 - **Preferred**: Use `BYTEA` columns instead (these ARE replicated)
 425 - **Alternative**: Use `pg_dump --large-objects` in periodic full backups
 426   - Note: Incremental changes to large objects NOT captured between full backups
 427
 428 ### 5. Crash Recovery and Duplicate Handling
 429
 430 **Problem**: After database crash, slot position may roll back, causing duplicate changes.
 431
 432 From PostgreSQL documentation (Section 47.2.2):
 433 > The current position of each slot is persisted only at checkpoint, so in the case of a crash the slot might return to an earlier LSN, which will then cause recent changes to be sent again when the server restarts.
 434
 435 **Solution**: `pg_recvlogical` handles position tracking through status reporting and `--startpos` parameter. However, per PostgreSQL documentation (Section 47.2.2): "Logical decoding clients are responsible for avoiding ill effects from handling the same message more than once."
 436
 437 **Implementation Approaches**:
 438
 439 1. **Idempotent Restore** (Recommended):
 440    - Most SQL operations in backup files are naturally idempotent:
 441      - INSERT will fail on duplicate primary key (acceptable during restore)
 442      - UPDATE will reapply same values (idempotent)
 443      - DELETE will succeed or report row not found (acceptable)
 444    - Transaction boundaries (BEGIN/COMMIT) ensure consistency
 445    - Simpler implementation, no LSN tracking required
 446    - See Restore Script section for implementation
 447
 448 2. **LSN Position Tracking** (Advanced):
 449    - Track `confirmed_flush_lsn` from `pg_replication_slots` view
 450    - Use `--startpos` parameter to resume from known position
 451    - Prevents duplicate transaction replay entirely
 452    - More complex, requires persistent state management
 453    - See Restore Script section for implementation with LSN tracking
 454    - Useful for continuous replication scenarios
 455
 456 **Recommendations**:
 457 - Start with idempotent restore approach (simpler, adequate for most use cases)
 458 - Test crash scenarios with `pg_ctl stop -m immediate` to verify duplicate handling
 459 - Consider LSN tracking only if exact-once semantics are required
 460 - Monitor `confirmed_flush_lsn` lag during normal operations (see Monitoring section)
 461
 462 ### 6. Long-Term Readability
 463
 464 **Challenges**:
 465 - PostgreSQL syntax may change between major versions (rare)
 466 - Extension dependencies may not exist in future systems
 467 - Encoding/collation definitions may change
 468
 469 **Solution**: Include metadata file with each backup:
 470 - PostgreSQL version (full version string)
 471 - All installed extension names and versions
 472 - Database encoding
 473 - Locale and collation settings
 474 - Custom data types and enums
 475
 476 Periodically test restoring old backups on current PostgreSQL versions.
 477
 478 ## Prerequisites and Configuration
 479
 480 ### PostgreSQL Configuration
 481
 482 ```ini
 483 # postgresql.conf
 484
 485 # Required: Set WAL level to logical
 486 wal_level = logical
 487
 488 # Required: Allow at least one replication slot
 489 max_replication_slots = 10
 490
 491 # Recommended: Allow replication connections
 492 max_wal_senders = 10
 493
 494 # Recommended: Keep more WAL for safety
 495 wal_keep_size = 1GB
 496
 497 # Recommended: Limit WAL retention for safety (PostgreSQL 13+)
 498 max_slot_wal_keep_size = 10GB
 499
 500 # Optional: Tune checkpoint frequency to persist slot positions more often
 501 checkpoint_timeout = 5min
 502 ```
 503
 504 ### Client Requirements
 505
 506 - PostgreSQL client utilities installed (`pg_recvlogical`, `pg_dump`, `pg_dumpall`)
 507 - Superuser or role with `REPLICATION` privilege
 508 - Permission to create replication slots
 509 - decoder_raw plugin compiled and installed
 510
 511 ## Operational Procedures
 512
 513 ### Backup Schedule
 514
 515 **Recommended**:
 516 - **Incremental**: Continuously streaming via `pg_recvlogical`
 517 - **Full backup**: Daily at 2 AM
 518 - **Globals backup**: Daily (`pg_dumpall --globals-only`)
 519 - **Metadata export**: Daily (PostgreSQL version, extensions, encoding)
 520
 521 ### Retention Policy
 522
 523 - **Incremental backups**: Keep 7 days
 524 - **Full backups**: Keep 30 days, then one per month for 1 year
 525 - **Monitor disk space**: Alert if backup directory exceeds 80% capacity
 526
 527 ### Disaster Recovery Runbook
 528
 529 1. **Stop application** to prevent new writes during restore
 530 2. **Create new database** (don't overwrite production)
 531 3. **Restore shared objects**: `psql -f globals-YYYY-MM-DD.sql`
 532 4. **Restore full backup**: `psql dbname < base-YYYY-MM-DD.sql` (or `pg_restore` for directory format)
 533 5. **Apply DDL changes**: Extract from `ddl_history` table, apply in chronological order
 534 6. **Apply all incremental backups**: `psql dbname < incremental-*.sql` in chronological order
 535 7. **Sync sequences**: Run `setval()` for all sequences to match table max values
 536 8. **Verify data integrity**: Check row counts, run application smoke tests
 537 9. **Test application** against restored database
 538 10. **Switch over** application to restored database
 539
 540 **See Section 4 (Restore Script)** for detailed procedures including sequence synchronization and duplicate transaction handling.
 541
 542 ## Testing Strategy
 543
 544 ### 1. Basic Functionality Test
 545
 546 ```sql
 547 -- Create test database and setup
 548 CREATE DATABASE backup_test;
 549 \c backup_test
 550
 551 -- Create test table
 552 CREATE TABLE test_users (id SERIAL PRIMARY KEY, name TEXT, created_at TIMESTAMP DEFAULT now());
 553
 554 -- Generate data and schema changes
 555 INSERT INTO test_users (name) VALUES ('Alice'), ('Bob'), ('Charlie');
 556 UPDATE test_users SET name = 'Alice Smith' WHERE id = 1;
 557 DELETE FROM test_users WHERE id = 3;
 558 ALTER TABLE test_users ADD COLUMN email TEXT;
 559 UPDATE test_users SET email = 'alice@example.com' WHERE id = 1;
 560
 561 -- Restore and verify
 562 ```
 563
 564 ### 2. Crash Recovery Test
 565
 566 ```bash
 567 # Start collecting incrementals
 568 # Generate load with pgbench
 569 # Simulate crash: pg_ctl stop -m immediate
 570 # Restart PostgreSQL
 571 # Verify no data loss and duplicates handled correctly
 572 # Restore and verify
 573 ```
 574
 575 ### 3. Long-Term Storage Test
 576
 577 ```bash
 578 # Create backup
 579 # Store backup files
 580 # Wait (or simulate) years passing
 581 # Restore on modern PostgreSQL version
 582 # Verify SQL is still readable and executable
 583 ```
 584
 585 ### 4. Replica Identity Test
 586
 587 ```sql
 588 -- Create table without primary key
 589 CREATE TABLE test_no_pk (col1 TEXT, col2 INT);
 590
 591 -- Attempt UPDATE (should fail with replica identity error)
 592 UPDATE test_no_pk SET col2 = 5 WHERE col1 = 'test';
 593
 594 -- Fix with REPLICA IDENTITY FULL
 595 ALTER TABLE test_no_pk REPLICA IDENTITY FULL;
 596
 597 -- Retry UPDATE (should succeed)
 598 ```
 599
 600 ### 5. TRUNCATE Handling Test
 601
 602 ```sql
 603 -- Create test table
 604 CREATE TABLE test_truncate (id INT);
 605 INSERT INTO test_truncate VALUES (1), (2), (3);
 606
 607 -- Perform TRUNCATE
 608 TRUNCATE test_truncate;
 609
 610 -- Verify: Check if decoder_raw incremental backup captured TRUNCATE
 611 -- Expected: NOT captured by decoder_raw (known plugin limitation)
 612 -- Verify: Check if event trigger captured TRUNCATE as DDL in ddl_history table
 613 -- Expected: SHOULD be captured by event trigger (this is the workaround)
 614 SELECT * FROM ddl_history WHERE ddl_command = 'TRUNCATE TABLE' ORDER BY executed_at DESC LIMIT 1;
 615 ```
 616
 617 ## Performance Considerations
 618
 619 **Write Amplification**:
 620 - WAL must be written (normal)
 621 - WAL must be decoded into logical format (additional CPU)
 622 - Event triggers fire on every DDL operation (minimal overhead)
 623
 624 **Disk I/O**:
 625 - Additional WAL volume retained by replication slots
 626 - More frequent checkpoint I/O if checkpoint_timeout is tuned
 627
 628 **Recommendations**:
 629 - Benchmark overhead on test system with production-like workload
 630 - Monitor CPU usage of WAL sender processes
 631 - Monitor disk usage for WAL and backup directories
 632
 633 ## Next Steps for Proof of Concept
 634
 635 1. **Install decoder_raw**
 636    - Clone pg_plugins repository
 637    - Install PostgreSQL development headers
 638    - Compile and install decoder_raw
 639    - Verify installation
 640
 641 2. **Initial Setup**
 642    - Create replication slot with decoder_raw
 643    - Set up event triggers for DDL capture
 644    - Take initial synchronized base backup
 645
 646 3. **Streaming Collection**
 647    - Test `pg_recvlogical` with decoder_raw
 648    - Verify output is immediately executable SQL
 649    - Test with various data types and operations
 650
 651 4. **DDL Handling**
 652    - Test event trigger captures DDL correctly
 653    - Test `pg_dumpall --globals-only` captures shared objects
 654    - Verify coordination during restore
 655
 656 5. **Monitoring Setup**
 657    - Configure replication slot monitoring
 658    - Set up critical alerting
 659    - Document emergency procedures
 660
 661 6. **Restore Process**
 662    - Build restore scripts
 663    - Test point-in-time recovery
 664    - Verify sequence synchronization
 665
 666 7. **Crash Recovery**
 667    - Test duplicate handling
 668    - Verify LSN tracking
 669
 670 8. **Performance Testing**
 671    - Measure storage overhead
 672    - Measure CPU overhead
 673    - Benchmark restore time
 674
 675 ## References
 676
 677 ### PostgreSQL Documentation
 678 - PostgreSQL Documentation: Chapter 25 - Backup and Restore
 679 - PostgreSQL Documentation: Chapter 29 - Logical Replication
 680 - PostgreSQL Documentation: Chapter 47 - Logical Decoding
 681 - PostgreSQL Documentation: Section 29.8 - Logical Replication Restrictions
 682 - PostgreSQL Documentation: `pg_recvlogical` man page
 683 - PostgreSQL Documentation: `pg_dump` man page
 684 - PostgreSQL Documentation: `pg_dumpall` man page
 685
 686 ### Essential Tools
 687 - **decoder_raw**: SQL output plugin for logical decoding
 688   - Source: https://github.com/michaelpq/pg_plugins/tree/main/decoder_raw
 689   - **CRITICAL COMPONENT**: Eliminates output transformation layer
 690   - License: PostgreSQL License (production-ready)
 691   - Compatibility: PostgreSQL 9.4+
 692
 693 ### Monitoring Tools
 694 - **Prometheus + postgres_exporter + Grafana**: Open-source monitoring stack
 695 - **pgDash**: PostgreSQL monitoring - https://pgdash.io/
 696 - **check_postgres**: Nagios/Icinga/Zabbix integration
 697 - **pg_stat_replication_slots**: Built-in PostgreSQL monitoring view