begriffs open source - pg_scribe/blob - logical-replication-backup-design.md

   1 # Incremental SQL Backup System Using PostgreSQL Logical Replication
   2
   3 ## Executive Summary
   4
   5 This document details a proof-of-concept design for a PostgreSQL backup system that produces human-readable, plain SQL incremental backups using logical replication. The system aims to create backups that remain readable and restorable for 10+ years while supporting online operation and crash safety.
   6
   7 **Key Finding**: This approach is **feasible but significantly more complex than initially assessed**.
   8
   9 **Major Challenges**:
  10 1. **DDL is not captured by logical replication** - requires event triggers (which have gaps for shared objects)
  11 2. **Replica identity enforcement** - tables without proper configuration will error on UPDATE/DELETE at the source
  12 3. **Operational risk** - inactive replication slots can cause database shutdown via transaction wraparound
  13 4. **Data type complexity** - transforming WAL output to portable SQL is non-trivial
  14 5. **Incomplete coverage** - event triggers don't fire for databases, roles, or tablespaces
  15
  16 **Bottom Line**: While technically feasible, this approach requires substantially more engineering effort, operational maturity, and ongoing maintenance than standard PostgreSQL backup solutions (pg_basebackup + WAL archiving). Only pursue this if human-readable SQL backups are a hard requirement (e.g., regulatory compliance).
  17
  18 ## Architecture Overview
  19
  20 ### High-Level Design
  21
  22 ```
  23 ┌─────────────────────────────────────────────────────────────┐
  24 │                    PostgreSQL Database                       │
  25 │                                                              │
  26 │  ┌────────────────┐         ┌──────────────────┐           │
  27 │  │  Regular Tables│────────▶│  WAL (Write-Ahead│           │
  28 │  │  (DML Changes) │         │       Log)       │           │
  29 │  └────────────────┘         └──────────────────┘           │
  30 │                                      │                       │
  31 │                                      ▼                       │
  32 │                         ┌─────────────────────────┐         │
  33 │                         │ Logical Decoding Process│         │
  34 │                         │  (Decodes WAL to SQL)   │         │
  35 │                         └─────────────────────────┘         │
  36 │                                      │                       │
  37 └──────────────────────────────────────┼───────────────────────┘
  38                                        │
  39                                        ▼
  40                         ┌─────────────────────────────┐
  41                         │  Replication Slot           │
  42                         │  (Tracks position, durable) │
  43                         └─────────────────────────────┘
  44                                        │
  45                                        ▼
  46                         ┌─────────────────────────────┐
  47                         │  Backup Collection Script   │
  48                         │  (Python/Bash)              │
  49                         └─────────────────────────────┘
  50                                        │
  51                         ┌──────────────┴──────────────┐
  52                         ▼                             ▼
  53           ┌─────────────────────┐      ┌─────────────────────┐
  54           │  Incremental Files  │      │  Full pg_dump       │
  55           │  (SQL Changes)      │      │  (Periodic)         │
  56           │  - 2024-01-01.sql   │      │  - base-2024-01.sql │
  57           │  - 2024-01-02.sql   │      │  - base-2024-02.sql │
  58           │  - ...              │      │  - ...              │
  59           └─────────────────────┘      └─────────────────────┘
  60 ```
  61
  62 ### Core Components
  63
  64 1. **Logical Replication Slot**: Durable position tracker in PostgreSQL
  65 2. **Output Plugin**: Transforms binary WAL to human-readable format (e.g., `test_decoding`)
  66 3. **Collection Script**: Periodically fetches and stores decoded changes
  67 4. **Base Backup System**: Regular full `pg_dump` backups
  68 5. **Schema Tracking System**: Separate mechanism for DDL changes (**required**)
  69
  70 ## How Logical Replication Works
  71
  72 ### What Logical Replication Provides
  73
  74 PostgreSQL's logical replication decodes the Write-Ahead Log (WAL) into logical changes:
  75
  76 ```sql
  77 -- Example output from test_decoding plugin:
  78 BEGIN 10298
  79 table public.users: INSERT: id[integer]:1 name[text]:'Alice' email[text]:'alice@example.com'
  80 table public.users: UPDATE: id[integer]:1 name[text]:'Alice Smith' email[text]:'alice@example.com'
  81 table public.orders: DELETE: id[integer]:42
  82 COMMIT 10298
  83 ```
  84
  85 This can be transformed into standard SQL:
  86
  87 ```sql
  88 BEGIN;
  89 INSERT INTO public.users (id, name, email) VALUES (1, 'Alice', 'alice@example.com');
  90 UPDATE public.users SET name = 'Alice Smith' WHERE id = 1;
  91 DELETE FROM public.orders WHERE id = 42;
  92 COMMIT;
  93 ```
  94
  95 ### Key Properties
  96
  97 - **Crash-safe**: Replication slots persist position across crashes
  98 - **Consistent**: Transaction boundaries are preserved
  99 - **Online**: Runs without blocking database operations
 100 - **Idempotent positioning**: Can restart from last known position
 101
 102 ## Critical Challenge: DDL is NOT Replicated
 103
 104 ### The Problem
 105
 106 From PostgreSQL documentation (Section 29.8):
 107
 108 > **The database schema and DDL commands are not replicated.** The initial schema can be copied by hand using `pg_dump --schema-only`. Subsequent schema changes would need to be kept in sync manually.
 109
 110 This means logical replication will **NOT** capture:
 111
 112 - `CREATE TABLE`
 113 - `ALTER TABLE` (adding columns, changing types, etc.)
 114 - `DROP TABLE`
 115 - `CREATE INDEX`
 116 - Any other schema modifications
 117
 118 ### Example of the Issue
 119
 120 ```sql
 121 -- Day 1: Base backup taken
 122 CREATE TABLE users (id INT PRIMARY KEY, name TEXT);
 123
 124 -- Day 2: Data changes (CAPTURED by logical replication)
 125 INSERT INTO users VALUES (1, 'Alice');
 126
 127 -- Day 3: Schema change (NOT CAPTURED)
 128 ALTER TABLE users ADD COLUMN email TEXT;
 129
 130 -- Day 4: More data changes (CAPTURED, but in new schema)
 131 UPDATE users SET email = 'alice@example.com' WHERE id = 1;
 132 ```
 133
 134 **Problem**: The incremental backup for Day 4 will contain:
 135 ```sql
 136 UPDATE users SET email = 'alice@example.com' WHERE id = 1;
 137 ```
 138
 139 But if you restore from Day 1 base backup, the `email` column doesn't exist yet, and the UPDATE will fail.
 140
 141 ### Solutions for DDL Tracking
 142
 143 #### Option 1: Event Triggers (Recommended)
 144
 145 PostgreSQL supports **event triggers** that can capture DDL events:
 146
 147 ```sql
 148 -- Create a table to log DDL changes
 149 CREATE TABLE ddl_history (
 150     id SERIAL PRIMARY KEY,
 151     executed_at TIMESTAMP DEFAULT now(),
 152     ddl_command TEXT,
 153     object_type TEXT,
 154     object_identity TEXT,
 155     query TEXT
 156 );
 157
 158 -- Create event trigger function
 159 CREATE OR REPLACE FUNCTION log_ddl_changes()
 160 RETURNS event_trigger AS $$
 161 DECLARE
 162     obj RECORD;
 163 BEGIN
 164     -- Capture DDL command details
 165     FOR obj IN SELECT * FROM pg_event_trigger_ddl_commands()
 166     LOOP
 167         INSERT INTO ddl_history (ddl_command, object_type, object_identity, query)
 168         VALUES (obj.command_tag, obj.object_type, obj.object_identity, current_query());
 169     END LOOP;
 170 END;
 171 $$ LANGUAGE plpgsql;
 172
 173 -- Register the event trigger for DDL commands
 174 CREATE EVENT TRIGGER capture_ddl ON ddl_command_end
 175     EXECUTE FUNCTION log_ddl_changes();
 176 ```
 177
 178 **How it works**:
 179 1. Event trigger captures all DDL commands
 180 2. Stores them in `ddl_history` table with timestamps
 181 3. Your backup script exports this table periodically
 182 4. During restore, apply DDL changes in chronological order before applying DML changes
 183
 184 **Benefits**:
 185 - ✓ Captures all DDL automatically
 186 - ✓ Timestamp-ordered for correct replay
 187 - ✓ The `ddl_history` table itself gets replicated via logical replication
 188
 189 **Limitations**:
 190 - ✗ **Event triggers don't fire for shared objects** - this is a significant gap:
 191   - `CREATE/DROP/ALTER DATABASE` commands are not captured
 192   - `CREATE/DROP/ALTER ROLE` and role membership (`GRANT ROLE`) are not captured
 193   - `CREATE/DROP/ALTER TABLESPACE` commands are not captured
 194   - `ALTER SYSTEM` commands are not captured
 195   - These require a **separate tracking mechanism** or periodic `pg_dumpall --globals-only`
 196 - ✗ Event trigger functions cannot themselves use `CREATE EVENT TRIGGER` (triggers on triggers)
 197 - ✗ The `ddl_history` table's own schema evolution creates a bootstrap problem
 198
 199 #### Option 2: Application-Level Tracking
 200
 201 Track all schema migrations in your application:
 202
 203 ```python
 204 # migration_001.sql
 205 CREATE TABLE users (id INT PRIMARY KEY, name TEXT);
 206
 207 # migration_002.sql
 208 ALTER TABLE users ADD COLUMN email TEXT;
 209 ```
 210
 211 Store these migrations alongside your backups and apply them in order during restore.
 212
 213 **Benefits**:
 214 - ✓ Clean, version-controlled schema evolution
 215 - ✓ Works well if you already use migration tools (Flyway, Liquibase, Alembic)
 216
 217 **Limitations**:
 218 - ✗ Requires discipline - all DDL must go through migration system
 219 - ✗ Ad-hoc DDL changes will be missed
 220
 221 #### Option 3: Periodic Schema Dumps
 222
 223 Take regular `pg_dump --schema-only` snapshots:
 224
 225 ```bash
 226 # Every hour or day
 227 pg_dump --schema-only dbname > schema-2024-01-01-14:00.sql
 228 ```
 229
 230 During restore:
 231 1. Restore base backup
 232 2. Apply schema dumps in order
 233 3. Apply incremental DML changes
 234
 235 **Benefits**:
 236 - ✓ Simple to implement
 237 - ✓ Catches all schema changes
 238
 239 **Limitations**:
 240 - ✗ May capture schema at wrong moment (between related DDL+DML)
 241 - ✗ Harder to determine exact order of schema vs. data changes
 242
 243 ## Things That Work Better Than Expected
 244
 245 Before diving into challenges, it's worth noting what works well:
 246
 247 ### 1. Crash Safety is Excellent
 248
 249 **From documentation (Section 47.2.2)**:
 250 > A logical slot will emit each change just once in normal operation. Slots persist independently of the connection using them and are crash-safe.
 251
 252 The replication slot mechanism is robust and well-tested. After crashes:
 253 - No data loss occurs (all changes are preserved in WAL)
 254 - Position tracking is durable (persisted at checkpoints)
 255 - Only duplicate delivery is possible (easily handled)
 256
 257 ### 2. Transaction Boundaries Are Preserved
 258
 259 Logical decoding respects ACID properties:
 260 - All changes within a transaction are grouped together
 261 - Transactions appear atomically in the change stream
 262 - Transaction ordering is maintained
 263 - This makes restore much simpler than systems that capture individual row changes
 264
 265 ### 3. Online Operation Has No Blocking
 266
 267 Unlike some backup methods:
 268 - No locks are held during logical decoding
 269 - No impact on normal database operations
 270 - Can run continuously without maintenance windows
 271 - Minimal performance impact (though measurable - see Section 8)
 272
 273 ### 4. Exported Snapshots Solve Initial Consistency
 274
 275 PostgreSQL provides built-in snapshot export when creating a replication slot, ensuring the base backup and incremental stream are perfectly aligned with no gaps. This is a solved problem.
 276
 277 ## Other Challenges and Limitations
 278
 279 ### 1. Sequences Are Not Replicated
 280
 281 **Problem**: Sequence values (used by `SERIAL` and `IDENTITY` columns) are not replicated.
 282
 283 **From documentation (Section 29.8)**:
 284 > Sequence data is not replicated. The data in serial or identity columns backed by sequences will of course be replicated as part of the table, but the sequence itself would still show the start value on the subscriber.
 285
 286 **Example**:
 287 ```sql
 288 CREATE TABLE users (id SERIAL PRIMARY KEY, name TEXT);
 289
 290 -- Insert 1000 rows
 291 -- Sequence on source is now at 1001
 292 -- Sequence on restored database is still at 1
 293
 294 -- Next insert will fail with duplicate key error
 295 INSERT INTO users (name) VALUES ('Bob');  -- Tries to use id=1, but id=1 already exists
 296 ```
 297
 298 **Solution**:
 299 Include sequence values in periodic full dumps, or have restore script run:
 300
 301 ```sql
 302 -- After restore, sync all sequences
 303 SELECT setval('users_id_seq', (SELECT MAX(id) FROM users));
 304 ```
 305
 306 ### 2. Large Objects Are Not Replicated
 307
 308 **From documentation (Section 29.8)**:
 309 > Large objects (see Chapter 33) are not replicated. There is no workaround for that, other than storing data in normal tables.
 310
 311 **Solution**: Avoid using PostgreSQL large objects. Use `BYTEA` columns instead, which ARE replicated.
 312
 313 ### 3. Replication Slots Prevent WAL Cleanup
 314
 315 **From documentation (Section 47.2.2)**:
 316 > Replication slots persist across crashes and know nothing about the state of their consumer(s). They will prevent removal of required resources even when there is no connection using them. This consumes storage because neither required WAL nor required rows from the system catalogs can be removed by VACUUM as long as they are required by a replication slot.
 317
 318 **Problem**: If your backup script stops running, the replication slot will:
 319 1. Prevent WAL files from being cleaned up → **disk fills up**
 320 2. Prevent VACUUM from cleaning old row versions → **table bloat**
 321 3. Prevent VACUUM from cleaning old transaction IDs → **database shutdown to prevent transaction ID wraparound**
 322
 323 **From documentation (Section 47.2.2)**:
 324 > In extreme cases this could cause the database to shut down to prevent transaction ID wraparound.
 325
 326 This is **not a theoretical risk** - an inactive replication slot can and will cause production outages.
 327
 328 **Solution**:
 329 - **Monitor slot lag aggressively**:
 330   ```sql
 331   SELECT * FROM pg_replication_slots;
 332   ```
 333 - Set up **critical alerting** when `restart_lsn` falls more than 1GB behind
 334 - Set up **emergency alerting** when slot lag exceeds 10GB or age exceeds 24 hours
 335 - Have a **documented emergency procedure** to drop the slot if it threatens database availability
 336 - Consider using `pg_replication_slot_advance()` to skip ahead if needed (loses incremental backup coverage)
 337 - Use `max_slot_wal_keep_size` parameter (PostgreSQL 13+) to limit how much WAL a slot can retain
 338
 339 ### 4. Slot Position is Only Persisted at Checkpoint
 340
 341 **From documentation (Section 47.2.2)**:
 342 > The current position of each slot is persisted only at checkpoint, so in the case of a crash the slot might return to an earlier LSN, which will then cause recent changes to be sent again when the server restarts.
 343
 344 **Problem**: After a crash, you might receive duplicate changes.
 345
 346 **Example**:
 347 ```
 348 1. Script fetches changes, receives up to LSN 1000
 349 2. Script writes changes to file
 350 3. Database crashes before checkpoint
 351 4. Slot position rolled back to LSN 900
 352 5. Script fetches again, receives LSN 900-1000 again (duplicates!)
 353 ```
 354
 355 **Solution**:
 356 - Track the highest LSN you've successfully written
 357 - Pass that LSN to `pg_logical_slot_get_changes(slot_name, upto_lsn, ...)`
 358 - Or filter duplicates based on transaction ID
 359
 360 ### 5. TRUNCATE Replication Complexity
 361
 362 **From documentation (Section 29.8)**:
 363 > Replication of TRUNCATE commands is supported, but some care must be taken when truncating groups of tables connected by foreign keys.
 364
 365 **Problem**: If you truncate multiple related tables, the incremental backup needs to capture all of them.
 366
 367 **Solution**: Logical replication handles this automatically, but be aware of the behavior.
 368
 369 ### 6. Replica Identity Required for UPDATE/DELETE
 370
 371 For UPDATE and DELETE operations, PostgreSQL needs to identify which rows to modify. This requires a **replica identity**.
 372
 373 **From documentation (Section 29.1.1)**:
 374 > A published table must have a replica identity configured in order to be able to replicate UPDATE and DELETE operations, so that appropriate rows to update or delete can be identified on the subscriber side.
 375
 376 **Replica Identity Modes**:
 377
 378 - **DEFAULT**: Records old values of primary key columns
 379   - This is the default for non-system tables
 380   - **WARNING**: When there is no primary key, behavior is the same as `NOTHING`
 381
 382 - **USING INDEX _index_name_**: Records old values of columns in the specified unique index
 383   - Index must be unique, not partial, not deferrable
 384   - Index columns must be marked `NOT NULL`
 385   - If the index is later dropped, behavior becomes the same as `NOTHING`
 386
 387 - **FULL**: Records old values of all columns in the row
 388   - Allows UPDATE/DELETE without a primary key, but very expensive
 389   - Can use indexes on subscriber for searching, but inefficient without them
 390
 391 - **NOTHING**: Records no information about the old row
 392   - This is the default for system tables
 393   - Only `INSERT` operations are replicated
 394
 395 **Critical Behavior**:
 396
 397 From documentation (Section 29.1.1):
 398 > Tables with a replica identity defined as `NOTHING`, `DEFAULT` without a primary key, or `USING INDEX` with a dropped index, **cannot support UPDATE or DELETE operations** when included in a publication replicating these actions. **Attempting such operations will result in an error on the publisher.**
 399
 400 This means UPDATE/DELETE will **fail on the source database**, not just during restore!
 401
 402 **Example**:
 403 ```sql
 404 -- Table without primary key
 405 CREATE TABLE logs (timestamp TIMESTAMPTZ, message TEXT);
 406
 407 -- By default, replica identity is DEFAULT
 408 -- Since there's no PK, this behaves as NOTHING
 409 -- UPDATE/DELETE operations will ERROR if this table is in a publication!
 410
 411 -- Attempting this will fail:
 412 UPDATE logs SET message = 'Updated' WHERE timestamp < now();
 413 -- ERROR:  cannot update table "logs" because it does not have a replica identity
 414
 415 -- Fix: Set replica identity to FULL
 416 ALTER TABLE logs REPLICA IDENTITY FULL;
 417
 418 -- Now UPDATE/DELETE will work (but be inefficient)
 419 UPDATE logs SET message = 'Updated' WHERE timestamp < now();  -- OK
 420 ```
 421
 422 **Best practice**: Ensure all tables have either:
 423 - A primary key (automatic replica identity), OR
 424 - A unique index with all columns NOT NULL, configured via `REPLICA IDENTITY USING INDEX`, OR
 425 - Explicit `REPLICA IDENTITY FULL` setting (only as a last resort - very inefficient)
 426
 427 ### 7. Long-Term SQL Readability Challenges
 428
 429 The goal of producing SQL that remains readable and executable for 10+ years introduces additional complexities:
 430
 431 **Version Compatibility Issues**:
 432 - PostgreSQL syntax may change between major versions (though rare)
 433 - Data type representations may evolve
 434 - Default behaviors may change (e.g., how NULLs are handled in certain contexts)
 435
 436 **Extension Dependencies**:
 437 - Custom data types from extensions (PostGIS geometries, hstore, etc.) may not exist in future systems
 438 - Extension versions may be incompatible (e.g., PostGIS 2.x vs 4.x)
 439 - Must document all extension names and versions alongside backups
 440
 441 **Encoding and Collation**:
 442 - Character encoding standards may evolve
 443 - Locale/collation definitions may change
 444 - Text comparison behavior may differ across PostgreSQL versions
 445
 446 **Recommendations**:
 447 - Include a metadata file with each backup containing:
 448   - PostgreSQL version (full version string)
 449   - All installed extension names and versions
 450   - Database encoding
 451   - Locale and collation settings
 452   - Custom data types and enums
 453 - Periodically test restoring old backups on current PostgreSQL versions
 454 - Consider limiting use of PostgreSQL-specific SQL features for better forward compatibility
 455 - Prefer standard SQL types over custom types where possible
 456
 457 ### 8. Performance and Overhead
 458
 459 Logical replication introduces measurable overhead:
 460
 461 **Write Amplification**:
 462 - WAL must be written (normal)
 463 - WAL must be decoded into logical format (additional CPU)
 464 - Event triggers fire on every DDL operation (additional overhead)
 465 - The `ddl_history` table itself generates more WAL entries
 466
 467 **Disk I/O**:
 468 - Additional WAL volume retained by replication slots
 469 - More frequent checkpoint I/O if tuned for faster slot position persistence
 470
 471 **Recommendations**:
 472 - Benchmark the overhead on a test system with production-like workload
 473 - Monitor CPU usage of WAL sender processes
 474 - Consider the trade-off: is the overhead worth human-readable backups?
 475
 476 ## Prerequisites and Configuration
 477
 478 ### PostgreSQL Configuration
 479
 480 ```ini
 481 # postgresql.conf
 482
 483 # Required: Set WAL level to logical
 484 wal_level = logical
 485
 486 # Required: Allow at least one replication slot
 487 max_replication_slots = 10
 488
 489 # Recommended: Allow replication connections
 490 max_wal_senders = 10
 491
 492 # Recommended: Keep more WAL for safety
 493 wal_keep_size = 1GB
 494
 495 # Optional: Tune checkpoint frequency to persist slot positions more often
 496 checkpoint_timeout = 5min
 497 ```
 498
 499 ### Client Requirements
 500
 501 - PostgreSQL client libraries (`psycopg2` for Python)
 502 - Superuser or role with `REPLICATION` privilege
 503 - Permission to create replication slots
 504
 505 ## Implementation Components
 506
 507 When implementing this system, the following programs/scripts will be needed:
 508
 509 ### 1. Initial Setup Script
 510
 511 **Purpose**: Bootstrap the backup system
 512
 513 **Tasks**:
 514 - Create logical replication slot with appropriate output plugin
 515 - Export the snapshot created by the slot for consistency
 516 - Take initial base backup using `pg_dump` **using the exported snapshot**
 517 - Set up DDL tracking infrastructure (event triggers and ddl_history table)
 518 - Create supplemental backup for shared objects (`pg_dumpall --globals-only`)
 519 - Set `REPLICA IDENTITY` on tables without primary keys
 520 - Document PostgreSQL version and installed extensions with versions
 521
 522 **Critical Detail - Initial Snapshot Consistency**:
 523
 524 From documentation (Section 47.2.5):
 525 > When a new replication slot is created using the streaming replication interface, a snapshot is exported which will show exactly the state of the database after which all changes will be included in the change stream. This can be used to create a new replica by using SET TRANSACTION SNAPSHOT to read the state of the database at the moment the slot was created.
 526
 527 This ensures the base backup and incremental stream are perfectly aligned with no gaps or overlaps.
 528
 529 ### 2. Incremental Backup Collection Script
 530
 531 **Purpose**: Regularly collect changes from the replication slot
 532 **Tasks**:
 533 - Connect to PostgreSQL and fetch changes from the replication slot
 534 - Track last processed LSN to handle duplicates after crashes
 535 - Transform test_decoding output into clean SQL statements
 536 - Write changes to timestamped SQL files
 537 - Update state file with last processed LSN
 538 - Handle errors and retry logic
 539
 540 ### 3. Periodic Full Backup Script
 541
 542 **Purpose**: Take regular full backups as restore points
 543 **Tasks**:
 544 - Execute `pg_dump` to create full database backup
 545 - Take schema-only dump separately for reference
 546 - Compress old backups to save space
 547 - Implement retention policy (delete old backups)
 548 - Export DDL history table
 549
 550 ### 4. Restore Script
 551
 552 **Purpose**: Restore database from base + incremental backups
 553 **Tasks**:
 554 - Locate most recent full backup
 555 - Find all incremental backups since that full backup
 556 - Recreate target database
 557 - Restore full backup
 558 - Apply incremental backups in chronological order
 559 - Handle point-in-time recovery (stop at specific timestamp)
 560 - Synchronize sequence values to current maximums
 561 - Verify data integrity
 562
 563 ### 5. Output Plugin / Format Transformer
 564
 565 **Purpose**: Convert logical decoding output to clean SQL
 566
 567 **Tasks**:
 568 - Parse output plugin format (test_decoding or pgoutput)
 569 - Generate proper INSERT/UPDATE/DELETE statements
 570 - Handle complex data type conversions and escaping:
 571   - Arrays (e.g., `{1,2,3}`)
 572   - JSON/JSONB values
 573   - BYTEA (binary data)
 574   - Custom/composite types
 575   - NULL vs empty string vs literal 'null'
 576   - Text with quotes, backslashes, and special characters
 577   - Temporal types (timestamp with timezone, intervals)
 578 - Preserve transaction boundaries (BEGIN/COMMIT)
 579 - Format output for human readability
 580 - Ensure output is valid SQL that can be executed years later
 581
 582 **Challenges**:
 583 - The `test_decoding` plugin is primarily for debugging, not production use
 584 - Consider using the `pgoutput` plugin (used by PostgreSQL's built-in logical replication)
 585 - Data type semantics may change across PostgreSQL major versions
 586 - Custom types and extensions (PostGIS, etc.) may not exist in future systems
 587
 588 ### 6. Monitoring and Health Check Script
 589
 590 **Purpose**: Ensure backup system is functioning correctly
 591 **Tasks**:
 592 - Check replication slot health and lag
 593 - Alert if slot is not advancing
 594 - Monitor disk space in backup directories
 595 - Verify backup files are being created
 596 - Test restore procedures periodically
 597
 598 ## Testing Strategy
 599
 600 ### 1. Basic Functionality Test
 601
 602 ```sql
 603 -- Create test database
 604 CREATE DATABASE backup_test;
 605 \c backup_test
 606
 607 -- Run setup script
 608 -- Then create test data
 609 CREATE TABLE test_users (id SERIAL PRIMARY KEY, name TEXT, created_at TIMESTAMP DEFAULT now());
 610
 611 INSERT INTO test_users (name) VALUES ('Alice'), ('Bob'), ('Charlie');
 612
 613 -- Wait for incremental backup to run
 614 -- Add more data
 615 UPDATE test_users SET name = 'Alice Smith' WHERE id = 1;
 616 DELETE FROM test_users WHERE id = 3;
 617
 618 -- Take another incremental
 619 -- Add schema change
 620 ALTER TABLE test_users ADD COLUMN email TEXT;
 621 UPDATE test_users SET email = 'alice@example.com' WHERE id = 1;
 622
 623 -- Now restore and verify all data is present and correct
 624 ```
 625
 626 ### 2. Crash Recovery Test
 627
 628 ```bash
 629 # Start collecting incrementals
 630 # Generate load with pgbench
 631 # Simulate crash with pg_ctl stop -m immediate
 632 # Restart PostgreSQL
 633 # Collect incrementals again - verify no data loss and duplicates are handled
 634 # Restore and verify data matches
 635 ```
 636
 637 ### 3. Long-Term Storage Test
 638
 639 ```bash
 640 # Create backup
 641 # Store backup in time capsule (simulate 10 years)
 642 # On modern system, verify SQL is still readable and executable
 643 ```
 644
 645 ## Operational Procedures
 646
 647 ### Monitoring
 648
 649 Monitor these metrics:
 650
 651 ```sql
 652 -- Check replication slot health
 653 SELECT
 654     slot_name,
 655     slot_type,
 656     database,
 657     active,
 658     pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as replication_lag,
 659     pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) as confirmed_lag
 660 FROM pg_replication_slots
 661 WHERE slot_type = 'logical';
 662
 663 -- Alert if lag > 1GB or slot is inactive for > 1 hour
 664 ```
 665
 666 ### Backup Schedule
 667
 668 **Recommended**:
 669 - **Incremental**: Every 15 minutes (or continuously streaming)
 670 - **Full backup**: Daily at 2 AM
 671 - **Schema-only dump**: Every hour
 672 - **DDL history export**: Every 15 minutes (part of incremental)
 673
 674 ### Retention Policy
 675
 676 - **Incremental backups**: Keep 7 days
 677 - **Full backups**: Keep 30 days, then one per month for 1 year
 678 - **Monitor disk space**: Alert if backup directory exceeds 80% capacity
 679
 680 ### Disaster Recovery Runbook
 681
 682 1. **Stop application** to prevent new writes during restore
 683 2. **Create new database** (don't overwrite production)
 684 3. **Restore latest full backup**
 685 4. **Apply DDL changes** from `ddl_history` in chronological order
 686 5. **Apply incremental backups** in chronological order
 687 6. **Sync sequences** to latest values
 688 7. **Verify data integrity** with checksums or row counts
 689 8. **Test application** against restored database
 690 9. **Switch over** application to restored database
 691
 692 ## Comparison with Alternatives
 693
 694 | Feature | Logical Replication | pg_dump Only | WAL Archiving |
 695 |---------|-------------------|--------------|---------------|
 696 | Incremental | ✓ Yes | ✗ No | ✓ Yes |
 697 | Human-readable | ✓ SQL-like | ✓ SQL | ✗ Binary |
 698 | Online operation | ✓ Yes | ✓ Yes | ✓ Yes |
 699 | Crash-safe position | ✓ Yes | N/A | ✓ Yes |
 700 | Captures DDL | ✗ **No** | ✓ Yes | ✓ Yes |
 701 | Long-term readable | ✓ Yes | ✓ Yes | ✗ No |
 702 | Point-in-time recovery | ✓ Yes | ✗ No | ✓ Yes |
 703 | Storage efficiency | ✓ Good | ✗ Poor | ✓ Excellent |
 704
 705 ## Conclusion
 706
 707 ### Feasibility: YES, but More Complex Than Initially Assessed
 708
 709 The logical replication approach **is feasible** for incremental SQL backups, but the implementation is **more complex than it initially appears**. Key requirements:
 710
 711 1. **Separate DDL tracking** using event triggers (with gaps for shared objects like databases/roles)
 712 2. **Additional tracking** for shared objects via periodic `pg_dumpall --globals-only`
 713 3. **Aggressive monitoring** of replication slot health to prevent production outages
 714 4. **Replica identity configuration** on all tables (or face errors on UPDATE/DELETE)
 715 5. **Sequence synchronization** during restore
 716 6. **Duplicate handling** for crash scenarios
 717 7. **Complex data type handling** in the SQL transformation layer
 718 8. **Version and extension metadata** tracking for long-term readability
 719 9. **Performance overhead** monitoring and capacity planning
 720
 721 ### Primary Limitations
 722
 723 **1. DDL Capture is Incomplete**
 724
 725 While logical replication doesn't capture DDL, event triggers only partially solve this:
 726 - ✓ Captures table/index/function DDL
 727 - ✗ Does NOT capture database creation/deletion
 728 - ✗ Does NOT capture role management and permissions
 729 - ✗ Does NOT capture tablespace changes
 730 - Requires supplemental `pg_dumpall --globals-only` backups
 731
 732 **2. Replica Identity Enforcement is Strict**
 733
 734 Tables without proper replica identity will **cause errors on the source database** for UPDATE/DELETE operations, not just during restore. This requires proactive configuration.
 735
 736 **3. Operational Risk from Inactive Slots**
 737
 738 An inactive replication slot can cause database shutdown via transaction ID wraparound. This is not a theoretical concern - it requires 24/7 monitoring and documented emergency procedures.
 739
 740 ### When This Approach Makes Sense
 741
 742 - You need human-readable backups for compliance/archival (legal requirement)
 743 - You want continuous incremental backups without WAL shipping
 744 - You have **strong operational maturity** and 24/7 monitoring capability
 745 - You can dedicate engineering time to building and maintaining the system
 746 - You can enforce discipline around replica identity and schema changes
 747 - Your database has moderate change velocity (not extremely high-volume OLTP)
 748 - You're willing to accept the performance overhead
 749
 750 ### When to Consider Alternatives
 751
 752 - If you need zero-configuration backup → use `pg_dump` + `pg_basebackup`
 753 - If operational complexity is too high → use WAL archiving + `pg_basebackup` (standard PITR)
 754 - If you lack 24/7 monitoring capability → any approach without replication slots
 755 - If you need guaranteed DDL capture → use WAL archiving (captures everything)
 756 - If performance overhead is unacceptable → use `pg_basebackup` + WAL archiving
 757 - If you need native PostgreSQL tooling only → use continuous archiving
 758
 759 ### Realistic Complexity Assessment
 760
 761 **This is NOT a simple weekend project.** A production-ready implementation requires:
 762
 763 - **Development effort**: 4-8 weeks for initial implementation
 764 - **Testing effort**: 2-4 weeks for comprehensive testing
 765 - **Documentation**: Detailed runbooks for operations team
 766 - **Ongoing maintenance**: Regular updates as PostgreSQL evolves
 767 - **24/7 monitoring**: Critical alerting on replication slot health
 768 - **Operational expertise**: Deep PostgreSQL knowledge required
 769
 770 The original assessment of "feasible with significant caveats" should be updated to "**feasible but significantly more complex than standard backup solutions**".
 771
 772 ### Next Steps for Proof of Concept
 773
 774 1. Implement the setup and collection scripts
 775 2. Add DDL tracking via event triggers
 776 3. Build transformation layer from `test_decoding` output to clean SQL
 777 4. Test crash recovery scenarios
 778 5. Validate restore procedures
 779 6. Measure storage and performance overhead
 780 7. Document operational procedures
 781
 782 ## References
 783
 784 - PostgreSQL Documentation: Chapter 25 - Backup and Restore
 785 - PostgreSQL Documentation: Chapter 29 - Logical Replication
 786 - PostgreSQL Documentation: Chapter 47 - Logical Decoding
 787 - PostgreSQL Documentation: Section 29.8 - Logical Replication Restrictions