# Incremental SQL Backup System Using PostgreSQL Logical Replication

## Executive Summary

This document details a proof-of-concept design for a PostgreSQL backup system that produces human-readable, plain SQL incremental backups using logical replication. The system aims to create backups that remain readable and restorable for 10+ years while supporting online operation and crash safety.

**Key Finding**: This approach is **feasible and significantly aided by built-in PostgreSQL tools**.

**UPDATE**: The built-in `pg_recvlogical` utility provides most of the incremental collection functionality, substantially reducing implementation complexity from initial estimates.

**Major Challenges**:

1. **DDL is not captured by logical replication** - requires event triggers (which have gaps for shared objects)
   - ✅ **Addressed by**: `pg_dumpall --globals-only` for roles/tablespaces, plus a separate dump of database definitions (e.g., `pg_dump --create`)
2. **Replica identity enforcement** - tables without proper configuration will error on UPDATE/DELETE at the source
   - ⚠️ **No tool assistance** - requires manual configuration and enforcement
3. **Operational risk** - inactive replication slots can cause database shutdown via transaction wraparound
   - ⚠️ **Partial assistance** - monitoring views exist (`pg_stat_replication_slots`) but alerting must be custom
4. **Data type complexity** - transforming WAL output to portable SQL is non-trivial
   - ✅ **Simplified by**: `pg_recvlogical` provides structured output from the `test_decoding` plugin
5. **Incomplete coverage** - event triggers don't fire for databases, roles, or tablespaces
   - ✅ **Addressed by**: `pg_dumpall --globals-only` captures roles and tablespaces; database definitions need a separate dump

**Bottom Line**: While technically feasible, this approach requires substantially more engineering effort, operational maturity, and ongoing maintenance than standard PostgreSQL backup solutions (pg_basebackup + WAL archiving). However, the existence of `pg_recvlogical` reduces complexity significantly. Only pursue this if human-readable SQL backups are a hard requirement (e.g., regulatory compliance).

## Architecture Overview

### High-Level Design

```
┌──────────────────────────────────────────────────────────┐
│                   PostgreSQL Database                    │
│                                                          │
│  ┌────────────────┐         ┌──────────────────┐        │
│  │ Regular Tables │────────▶│ WAL (Write-Ahead │        │
│  │ (DML Changes)  │         │      Log)        │        │
│  └────────────────┘         └────────┬─────────┘        │
│                                      ▼                  │
│                     ┌──────────────────────────┐        │
│                     │ Logical Decoding Process │        │
│                     │  (Decodes WAL to SQL)    │        │
│                     └────────────┬─────────────┘        │
└──────────────────────────────────┼──────────────────────┘
                                   ▼
                     ┌─────────────────────────────┐
                     │      Replication Slot       │
                     │ (Tracks position, durable)  │
                     └──────────────┬──────────────┘
                                    ▼
                     ┌─────────────────────────────┐
                     │     pg_recvlogical Tool     │
                     │ (Built-in PostgreSQL util)  │
                     └──────────────┬──────────────┘
                         ┌──────────┴──────────┐
                         ▼                     ▼
             ┌─────────────────────┐ ┌─────────────────────┐
             │  Incremental Files  │ │    Full pg_dump     │
             │   (SQL Changes)     │ │     (Periodic)      │
             │ - 2024-01-01.sql    │ │ - base-2024-01.sql  │
             │ - 2024-01-02.sql    │ │ - base-2024-02.sql  │
             │ - ...               │ │ - ...               │
             └─────────────────────┘ └─────────────────────┘
```

### Core Components

1. **Logical Replication Slot**: Durable position tracker in PostgreSQL
2. **Output Plugin**: Transforms binary WAL to human-readable format (e.g., `test_decoding`)
3. **`pg_recvlogical`**: Built-in PostgreSQL tool that streams logical decoding output
4. **Base Backup System**: Regular full `pg_dump` backups with `--snapshot` for consistency
5. **Schema Tracking System**: Event triggers plus `pg_dumpall --globals-only` for DDL changes
## How Logical Replication Works

### What Logical Replication Provides

PostgreSQL's logical replication decodes the Write-Ahead Log (WAL) into logical changes:

```sql
-- Example output from test_decoding plugin:
BEGIN 10298
table public.users: INSERT: id[integer]:1 name[text]:'Alice' email[text]:'alice@example.com'
table public.users: UPDATE: id[integer]:1 name[text]:'Alice Smith' email[text]:'alice@example.com'
table public.orders: DELETE: id[integer]:42
COMMIT 10298
```

This can be transformed into standard SQL:

```sql
BEGIN;
INSERT INTO public.users (id, name, email) VALUES (1, 'Alice', 'alice@example.com');
UPDATE public.users SET name = 'Alice Smith' WHERE id = 1;
DELETE FROM public.orders WHERE id = 42;
COMMIT;
```

### Key Properties

- **Crash-safe**: Replication slots persist position across crashes
- **Consistent**: Transaction boundaries are preserved
- **Online**: Runs without blocking database operations
- **Resumable**: Can restart from the last confirmed position

## Critical Challenge: DDL is NOT Replicated

### The Problem

From PostgreSQL documentation (Section 29.8):

> **The database schema and DDL commands are not replicated.** The initial schema can be copied by hand using `pg_dump --schema-only`. Subsequent schema changes would need to be kept in sync manually.

This means logical replication will **NOT** capture:

- `CREATE TABLE`
- `ALTER TABLE` (adding columns, changing types, etc.)
- `DROP TABLE`
- `CREATE INDEX`
- Any other schema modifications

### Example of the Issue

```sql
-- Day 1: Base backup taken
CREATE TABLE users (id INT PRIMARY KEY, name TEXT);

-- Day 2: Data changes (CAPTURED by logical replication)
INSERT INTO users VALUES (1, 'Alice');

-- Day 3: Schema change (NOT CAPTURED)
ALTER TABLE users ADD COLUMN email TEXT;

-- Day 4: More data changes (CAPTURED, but in new schema)
UPDATE users SET email = 'alice@example.com' WHERE id = 1;
```

**Problem**: The incremental backup for Day 4 will contain:

```sql
UPDATE users SET email = 'alice@example.com' WHERE id = 1;
```

But if you restore from the Day 1 base backup, the `email` column doesn't exist yet, and the UPDATE will fail.

### Solutions for DDL Tracking

#### Option 1: Event Triggers (Recommended)

PostgreSQL supports **event triggers** that can capture DDL events:

```sql
-- Create a table to log DDL changes
CREATE TABLE ddl_history (
    id SERIAL PRIMARY KEY,
    executed_at TIMESTAMP DEFAULT now(),
    ddl_command TEXT,
    object_type TEXT,
    object_identity TEXT,
    query TEXT
);

-- Create event trigger function
CREATE OR REPLACE FUNCTION log_ddl_changes()
RETURNS event_trigger AS $$
DECLARE
    obj RECORD;
BEGIN
    -- Capture DDL command details
    FOR obj IN SELECT * FROM pg_event_trigger_ddl_commands()
    LOOP
        INSERT INTO ddl_history (ddl_command, object_type, object_identity, query)
        VALUES (obj.command_tag, obj.object_type, obj.object_identity, current_query());
    END LOOP;
END;
$$ LANGUAGE plpgsql;

-- Register the event trigger for DDL commands
CREATE EVENT TRIGGER capture_ddl
    ON ddl_command_end
    EXECUTE FUNCTION log_ddl_changes();
```

**How it works**:

1. Event trigger captures all DDL commands
2. Stores them in `ddl_history` table with timestamps
3. Your backup script exports this table periodically (see the sketch below)
4. During restore, apply DDL changes in chronological order before applying DML changes
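As a concrete illustration of step 3, a minimal export job might look like the following. The backup path, schedule, and database name are placeholders, not part of the design:

```bash
#!/usr/bin/env bash
# Export the ddl_history table alongside each incremental backup so DDL
# can be replayed in order during restore.
set -euo pipefail

BACKUP_DIR=/backups/ddl
STAMP=$(date +%Y-%m-%d-%H%M)

# \copy runs client-side, so the script needs no server filesystem access.
psql --dbname=mydb --quiet \
  --command="\copy (SELECT * FROM ddl_history ORDER BY id) TO '${BACKUP_DIR}/ddl-history-${STAMP}.csv' csv header"
```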
**Benefits**:

- ✓ Captures all DDL automatically
- ✓ Timestamp-ordered for correct replay
- ✓ The `ddl_history` table itself gets replicated via logical replication

**Limitations**:

- ✗ **Event triggers don't fire for shared objects** - this is a significant gap:
  - `CREATE/DROP/ALTER DATABASE` commands are not captured
  - `CREATE/DROP/ALTER ROLE` and role membership (`GRANT ROLE`) are not captured
  - `CREATE/DROP/ALTER TABLESPACE` commands are not captured
  - `ALTER SYSTEM` commands are not captured
  - These require a **separate tracking mechanism**: periodic `pg_dumpall --globals-only` for roles and tablespaces, plus a separate dump of database definitions (e.g., `pg_dump --create` or a full `pg_dumpall`)
- ✗ DDL on event triggers themselves is not captured (`CREATE/ALTER/DROP EVENT TRIGGER` commands do not fire event triggers)
- ✗ The `ddl_history` table's own schema evolution creates a bootstrap problem

#### Option 2: Application-Level Tracking

Track all schema migrations in your application:

```sql
-- migration_001.sql
CREATE TABLE users (id INT PRIMARY KEY, name TEXT);

-- migration_002.sql
ALTER TABLE users ADD COLUMN email TEXT;
```

Store these migrations alongside your backups and apply them in order during restore.

**Benefits**:

- ✓ Clean, version-controlled schema evolution
- ✓ Works well if you already use migration tools (Flyway, Liquibase, Alembic)

**Limitations**:

- ✗ Requires discipline - all DDL must go through the migration system
- ✗ Ad-hoc DDL changes will be missed

#### Option 3: Periodic Schema Dumps

Take regular `pg_dump --schema-only` snapshots:

```bash
# Every hour or day
pg_dump --schema-only dbname > schema-2024-01-01-14:00.sql
```

During restore:

1. Restore base backup
2. Apply schema dumps in order
3. Apply incremental DML changes

**Benefits**:

- ✓ Simple to implement
- ✓ Catches all schema changes

**Limitations**:

- ✗ May capture schema at the wrong moment (between related DDL+DML)
- ✗ Harder to determine exact order of schema vs. data changes

## Things That Work Better Than Expected

Before diving into challenges, it's worth noting what works well:

### 1. Crash Safety is Excellent

**From documentation (Section 47.2.2)**:

> A logical slot will emit each change just once in normal operation. Slots persist independently of the connection using them and are crash-safe.

The replication slot mechanism is robust and well-tested. After crashes:

- No data loss occurs (all changes are preserved in WAL)
- Position tracking is durable (persisted at checkpoints)
- Only duplicate delivery is possible (easily handled)

### 2. Transaction Boundaries Are Preserved

Logical decoding respects ACID properties:

- All changes within a transaction are grouped together
- Transactions appear atomically in the change stream
- Transaction ordering is maintained
- This makes restore much simpler than systems that capture individual row changes

### 3. Online Operation Has No Blocking

Unlike some backup methods:

- No locks are held during logical decoding
- No impact on normal database operations
- Can run continuously without maintenance windows
- Minimal performance impact (though measurable - see Section 8)

### 4. Exported Snapshots Solve Initial Consistency

PostgreSQL provides built-in snapshot export when creating a replication slot, ensuring the base backup and incremental stream are perfectly aligned with no gaps. This is a solved problem.
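The hand-off is scriptable. A minimal sketch follows, assuming a database named `mydb` and a slot named `backup_slot`; note that the exported snapshot is only valid while the session that created the slot stays open, which is why `pg_dump` is launched from inside that psql session via `\!`:

```bash
#!/usr/bin/env bash
set -euo pipefail

# replication=database opens a connection that accepts replication commands.
psql "dbname=mydb replication=database" <<'EOF'
-- CREATE_REPLICATION_SLOT returns slot_name, consistent_point,
-- snapshot_name, output_plugin; \gset stores them as psql variables.
CREATE_REPLICATION_SLOT backup_slot LOGICAL test_decoding \gset
\setenv SNAPSHOT :snapshot_name
\! pg_dump --dbname=mydb --snapshot="$SNAPSHOT" --file=/backups/base/base.sql
EOF
```

The base backup then sees exactly the database state at slot creation, and the change stream continues from the slot's consistent point with no gap or overlap.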
## Other Challenges and Limitations

### 1. Sequences Are Not Replicated

**Problem**: Sequence values (used by `SERIAL` and `IDENTITY` columns) are not replicated.

**From documentation (Section 29.8)**:

> Sequence data is not replicated. The data in serial or identity columns backed by sequences will of course be replicated as part of the table, but the sequence itself would still show the start value on the subscriber.

**Example**:

```sql
CREATE TABLE users (id SERIAL PRIMARY KEY, name TEXT);
-- Insert 1000 rows
-- Sequence on source is now at 1001
-- Sequence on restored database is still at 1

-- Next insert will fail with duplicate key error
INSERT INTO users (name) VALUES ('Bob');
-- Tries to use id=1, but id=1 already exists
```

**Solution**: Rely on periodic full dumps to capture sequence values: `pg_dump` includes them in data dumps by default (recent versions expose this explicitly as the `--sequence-data` option). After restore, synchronize sequences:

```sql
-- After restore, sync all sequences
SELECT setval('users_id_seq', (SELECT MAX(id) FROM users));
```

### 2. Large Objects Are Not Replicated

**From documentation (Section 29.8)**:

> Large objects (see Chapter 33) are not replicated. There is no workaround for that, other than storing data in normal tables.

**Solution**:

- **Preferred**: Avoid using PostgreSQL large objects. Use `BYTEA` columns instead, which ARE replicated.
- **Alternative**: Use `pg_dump --large-objects` (or `-b`) in periodic full backups to capture large objects
- Note: Incremental changes to large objects will NOT be captured between full backups

### 3. Replication Slots Prevent WAL Cleanup

**From documentation (Section 47.2.2)**:

> Replication slots persist across crashes and know nothing about the state of their consumer(s). They will prevent removal of required resources even when there is no connection using them. This consumes storage because neither required WAL nor required rows from the system catalogs can be removed by VACUUM as long as they are required by a replication slot.

**Problem**: If your backup script stops running, the replication slot will:

1. Prevent WAL files from being cleaned up → **disk fills up**
2. Prevent VACUUM from cleaning old row versions → **table bloat**
3. Prevent VACUUM from cleaning old transaction IDs → **database shutdown to prevent transaction ID wraparound**

**From documentation (Section 47.2.2)**:

> In extreme cases this could cause the database to shut down to prevent transaction ID wraparound.

This is **not a theoretical risk** - an inactive replication slot can and will cause production outages.

**Solution**:

- **Monitor slot lag aggressively**:

  ```sql
  SELECT * FROM pg_replication_slots;
  ```

- Set up **critical alerting** when `restart_lsn` falls more than 1GB behind (see the sketch after this list)
- Set up **emergency alerting** when slot lag exceeds 10GB or age exceeds 24 hours
- Have a **documented emergency procedure** to drop the slot if it threatens database availability
- Consider using `pg_replication_slot_advance()` to skip ahead if needed (loses incremental backup coverage)
- Use the `max_slot_wal_keep_size` parameter (PostgreSQL 13+) to limit how much WAL a slot can retain
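Since alerting must be custom, a minimal cron-scheduled check could look like this; the thresholds follow the guidance above, while the slot name, database, and `mail` delivery command are placeholders:

```bash
#!/usr/bin/env bash
# Alert when the backup slot lags past the thresholds above or disappears.
set -euo pipefail

LAG=$(psql --dbname=mydb -At --command="
  SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
  FROM pg_replication_slots
  WHERE slot_name = 'backup_slot'")

alert() { echo "$1" | mail -s 'backup slot alert' oncall@example.com; }

if [ -z "$LAG" ]; then
  alert "CRITICAL: replication slot backup_slot does not exist"
elif [ "$LAG" -gt $((10 * 1024**3)) ]; then
  alert "EMERGENCY: slot lag ${LAG} bytes exceeds 10GB"
elif [ "$LAG" -gt $((1024**3)) ]; then
  alert "CRITICAL: slot lag ${LAG} bytes exceeds 1GB"
fi
```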
### 4. Slot Position is Only Persisted at Checkpoint

**From documentation (Section 47.2.2)**:

> The current position of each slot is persisted only at checkpoint, so in the case of a crash the slot might return to an earlier LSN, which will then cause recent changes to be sent again when the server restarts.

**Problem**: After a crash, you might receive duplicate changes.

**Example**:

```
1. Script fetches changes, receives up to LSN 1000
2. Script writes changes to file
3. Database crashes before checkpoint
4. Slot position rolled back to LSN 900
5. Script fetches again, receives LSN 900-1000 again (duplicates!)
```

**Solution**:

- `pg_recvlogical` handles this automatically through its status reporting mechanism
- It sends periodic status updates back to the server confirming processed positions
- On restart, `pg_recvlogical` can use `--startpos` to specify a starting LSN
- For custom implementations: track the highest LSN processed and pass it to `pg_logical_slot_get_changes(slot_name, upto_lsn, ...)`

### 5. TRUNCATE Replication Complexity

**From documentation (Section 29.8)**:

> Replication of TRUNCATE commands is supported, but some care must be taken when truncating groups of tables connected by foreign keys.

**Problem**: If you truncate multiple related tables, the incremental backup needs to capture all of them.

**Solution**: Logical replication handles this automatically, but be aware of the behavior.

### 6. Replica Identity Required for UPDATE/DELETE

For UPDATE and DELETE operations, PostgreSQL needs to identify which rows to modify. This requires a **replica identity**.

**From documentation (Section 29.1.1)**:

> A published table must have a replica identity configured in order to be able to replicate UPDATE and DELETE operations, so that appropriate rows to update or delete can be identified on the subscriber side.

**Replica Identity Modes**:

- **DEFAULT**: Records old values of primary key columns
  - This is the default for non-system tables
  - **WARNING**: When there is no primary key, behavior is the same as `NOTHING`
- **USING INDEX _index_name_**: Records old values of columns in the specified unique index
  - Index must be unique, not partial, not deferrable
  - Index columns must be marked `NOT NULL`
  - If the index is later dropped, behavior becomes the same as `NOTHING`
- **FULL**: Records old values of all columns in the row
  - Allows UPDATE/DELETE without a primary key, but very expensive
  - Can use indexes on the subscriber for searching, but inefficient without them
- **NOTHING**: Records no information about the old row
  - This is the default for system tables
  - Only `INSERT` operations are replicated

**Critical Behavior**: From documentation (Section 29.1.1):

> Tables with a replica identity defined as `NOTHING`, `DEFAULT` without a primary key, or `USING INDEX` with a dropped index, **cannot support UPDATE or DELETE operations** when included in a publication replicating these actions. **Attempting such operations will result in an error on the publisher.**

This means UPDATE/DELETE will **fail on the source database**, not just during restore!

**Example**:

```sql
-- Table without primary key
CREATE TABLE logs (timestamp TIMESTAMPTZ, message TEXT);

-- By default, replica identity is DEFAULT
-- Since there's no PK, this behaves as NOTHING
-- UPDATE/DELETE operations will ERROR if this table is in a publication!

-- Attempting this will fail:
UPDATE logs SET message = 'Updated' WHERE timestamp < now();
-- ERROR: cannot update table "logs" because it does not have a replica identity and publishes updates

-- Fix: Set replica identity to FULL
ALTER TABLE logs REPLICA IDENTITY FULL;

-- Now UPDATE/DELETE will work (but be inefficient)
UPDATE logs SET message = 'Updated' WHERE timestamp < now();  -- OK
```

**Best practice**: Ensure all tables have either (an audit sketch follows this list):

- A primary key (automatic replica identity), OR
- A unique index with all columns NOT NULL, configured via `REPLICA IDENTITY USING INDEX`, OR
- An explicit `REPLICA IDENTITY FULL` setting (only as a last resort - very inefficient)
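To enforce this proactively, an audit query can list the tables that would error. The sketch below flags user tables whose effective replica identity is `NOTHING`; the catalog columns used (`relreplident`, `indisprimary`) are standard, but treat the query as a starting point (it does not cover the dropped-index case, for example):

```bash
# List user tables on which UPDATE/DELETE would fail once published:
# replica identity NOTHING, or DEFAULT without a primary key.
psql --dbname=mydb --command="
  SELECT n.nspname, c.relname, c.relreplident
  FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
  WHERE c.relkind = 'r'
    AND n.nspname NOT IN ('pg_catalog', 'information_schema')
    AND (c.relreplident = 'n'
         OR (c.relreplident = 'd' AND NOT EXISTS (
               SELECT 1 FROM pg_index i
               WHERE i.indrelid = c.oid AND i.indisprimary)))"
```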
### 7. Long-Term SQL Readability Challenges

The goal of producing SQL that remains readable and executable for 10+ years introduces additional complexities:

**Version Compatibility Issues**:

- PostgreSQL syntax may change between major versions (though rarely)
- Data type representations may evolve
- Default behaviors may change (e.g., how NULLs are handled in certain contexts)

**Extension Dependencies**:

- Custom data types from extensions (PostGIS geometries, hstore, etc.) may not exist in future systems
- Extension versions may be incompatible (e.g., PostGIS 2.x vs 3.x)
- Must document all extension names and versions alongside backups

**Encoding and Collation**:

- Character encoding standards may evolve
- Locale/collation definitions may change
- Text comparison behavior may differ across PostgreSQL versions

**Recommendations**:

- Include a metadata file with each backup containing:
  - PostgreSQL version (full version string)
  - All installed extension names and versions
  - Database encoding
  - Locale and collation settings
  - Custom data types and enums
- Periodically test restoring old backups on current PostgreSQL versions
- Consider limiting use of PostgreSQL-specific SQL features for better forward compatibility
- Prefer standard SQL types over custom types where possible

### 8. Performance and Overhead

Logical replication introduces measurable overhead:

**Write Amplification**:

- WAL must be written (normal)
- WAL must be decoded into logical format (additional CPU)
- Event triggers fire on every DDL operation (additional overhead)
- The `ddl_history` table itself generates more WAL entries

**Disk I/O**:

- Additional WAL volume retained by replication slots
- More frequent checkpoint I/O if tuned for faster slot position persistence

**Recommendations**:

- Benchmark the overhead on a test system with a production-like workload
- Monitor CPU usage of WAL sender processes
- Consider the trade-off: is the overhead worth human-readable backups?

## Prerequisites and Configuration

### PostgreSQL Configuration

```ini
# postgresql.conf

# Required: Set WAL level to logical
wal_level = logical

# Required: Allow at least one replication slot
max_replication_slots = 10

# Recommended: Allow replication connections
max_wal_senders = 10

# Recommended: Keep more WAL for safety
wal_keep_size = 1GB

# Optional: Tune checkpoint frequency to persist slot positions more often
checkpoint_timeout = 5min
```

### Client Requirements

- PostgreSQL client utilities installed (`pg_recvlogical`, `pg_dump`, `pg_dumpall`)
- Superuser or role with the `REPLICATION` privilege
- Permission to create replication slots

## Built-in PostgreSQL Tools That Help

PostgreSQL provides several built-in utilities that address the major challenges of this backup approach:

### `pg_recvlogical` - Logical Decoding Stream Collector

**Purpose**: Streams logical decoding output from a replication slot to files.

**Key Features**:

- Creates/drops logical replication slots (`--create-slot`, `--drop-slot`)
- Streams changes continuously with automatic reconnection (`--start`)
- Supports multiple output plugins (`--plugin=test_decoding`)
- Handles crash recovery with LSN position tracking
- File rotation via SIGHUP signal
- Configurable fsync intervals (`--fsync-interval`)
- Can stop at a specific LSN (`--endpos`)

**Impact**: Eliminates the need to write custom code for streaming infrastructure, connection management, and position tracking.
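For example, daily file rotation could lean on the SIGHUP behavior in the feature list above (assumed here, along with placeholder paths and pidfile):

```bash
#!/usr/bin/env bash
# Daily rotation sketch: rename the active output file, then signal
# pg_recvlogical so it reopens its --file path as a fresh file.
set -euo pipefail

ACTIVE=/backups/incremental/changes.sql
mv "$ACTIVE" "/backups/incremental/changes-$(date +%F).sql"
kill -HUP "$(cat /run/pg_recvlogical.pid)"
```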
### `pg_dump` - Database Backup Utility

**Relevant Options**:

- `--snapshot=name`: Synchronize the dump with a logical replication slot's exported snapshot
- `--schema-only`: Capture only schema for DDL tracking
- Sequence values: included in data dumps by default (recent versions expose this explicitly as `--sequence-data`)
- `--large-objects` / `-b`: Include large objects if needed

**Impact**: Provides a consistent initial backup synchronized with the replication slot.

### `pg_dumpall` - Cluster-wide Backup Utility

**Relevant Options**:

- `--globals-only`: Dump only global objects (roles and tablespaces); database definitions are not included and need a separate dump (e.g., `pg_dump --create` or a full `pg_dumpall`)

**Impact**: Covers the event trigger gap for roles and tablespaces, which cannot be captured via DDL events.

### Monitoring Views

**Relevant Views**:

- `pg_replication_slots`: Monitor slot health, lag, and active status
- `pg_stat_replication`: Track replication connections and status
- `pg_stat_replication_slots`: Statistics on replication slot usage

**Impact**: Provides visibility into replication slot health for critical alerting.

## Implementation Components

When implementing this system, the following programs/scripts will be needed:

### 1. Initial Setup Script

**Purpose**: Bootstrap the backup system

**Tasks**:

- Create a logical replication slot with the appropriate output plugin
- Export the snapshot created by the slot for consistency
- Take the initial base backup using `pg_dump` **using the exported snapshot**
- Set up DDL tracking infrastructure (event triggers and the ddl_history table)
- Create a supplemental backup for shared objects (`pg_dumpall --globals-only`)
- Set `REPLICA IDENTITY` on tables without primary keys
- Document the PostgreSQL version and installed extensions with versions

**Critical Detail - Initial Snapshot Consistency**: From documentation (Section 47.2.5):

> When a new replication slot is created using the streaming replication interface, a snapshot is exported which will show exactly the state of the database after which all changes will be included in the change stream. This can be used to create a new replica by using SET TRANSACTION SNAPSHOT to read the state of the database at the moment the slot was created.

This ensures the base backup and incremental stream are perfectly aligned with no gaps or overlaps.

### 2. Incremental Backup Collection

**Tool**: Use the built-in `pg_recvlogical` utility

**Key Capabilities**:

- Creates and manages logical replication slots
- Streams decoded changes continuously from the replication slot
- Handles connection failures and automatic reconnection
- Tracks LSN positions with status updates to the server
- Supports file rotation via SIGHUP signal
- Can stop at a specific LSN with `--endpos`
- Configurable fsync intervals for crash safety

**Example Usage**:

```bash
# Create slot and start streaming
pg_recvlogical \
  --dbname=mydb \
  --slot=backup_slot \
  --plugin=test_decoding \
  --create-slot \
  --start \
  --file=/backups/incremental/changes.sql \
  --fsync-interval=10
```

**Additional Tasks** (custom wrapper script):

- File rotation and timestamping
- Transform `test_decoding` output into clean, portable SQL
- Handle output format conversion for long-term readability
- Coordinate with the monitoring system

### 3. Periodic Full Backup Script

**Purpose**: Take regular full backups as restore points (see the sketch below)

**Tasks**:

- Execute `pg_dump` to create a full database backup
- Take a schema-only dump separately for reference
- Compress old backups to save space
- Implement a retention policy (delete old backups)
- Export the DDL history table
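A sketch of this nightly job; the file layout, retention windows, and database name are assumptions:

```bash
#!/usr/bin/env bash
set -euo pipefail

STAMP=$(date +%F)
OUT=/backups/full

pg_dump --dbname=mydb --file="$OUT/base-$STAMP.sql"             # full restore point
pg_dump --dbname=mydb --schema-only --file="$OUT/schema-$STAMP.sql"
pg_dumpall --globals-only --file="$OUT/globals-$STAMP.sql"      # roles, tablespaces
psql --dbname=mydb \
  --command="\copy ddl_history TO '$OUT/ddl-history-$STAMP.csv' csv header"

# Compress yesterday's dumps; expire compressed dumps past 30 days.
find "$OUT" -name '*.sql' -mtime +1 -exec gzip {} +
find "$OUT" -name '*.sql.gz' -mtime +30 -delete
```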
### 4. Restore Script

**Purpose**: Restore the database from base + incremental backups

**Tasks**:

- Locate the most recent full backup
- Find all incremental backups since that full backup
- Recreate the target database
- Restore the full backup
- Apply incremental backups in chronological order
- Handle point-in-time recovery (stop at a specific timestamp)
- Synchronize sequence values to current maximums (see the sketch below)
- Verify data integrity
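For the sequence-synchronization step, one workable approach is to generate a `setval()` call per owned sequence from catalog metadata and execute the result. A sketch, assuming the restored database is named `restored_db` (mixed-case identifiers would need `quote_ident()` refinements):

```bash
#!/usr/bin/env bash
# Emit one setval() per SERIAL/IDENTITY-owned sequence, based on the owning
# column's current MAX(), then pipe the generated SQL back into psql.
set -euo pipefail

psql --dbname=restored_db -At <<'SQL' | psql --dbname=restored_db
SELECT format(
  'SELECT setval(%L, COALESCE((SELECT MAX(%I) FROM %I.%I), 1));',
  sn.nspname || '.' || s.relname, a.attname, tn.nspname, t.relname)
FROM pg_class s
JOIN pg_namespace sn ON sn.oid = s.relnamespace
JOIN pg_depend d     ON d.objid = s.oid AND d.deptype IN ('a', 'i')
JOIN pg_class t      ON t.oid = d.refobjid
JOIN pg_namespace tn ON tn.oid = t.relnamespace
JOIN pg_attribute a  ON a.attrelid = t.oid AND a.attnum = d.refobjsubid
WHERE s.relkind = 'S' AND d.refobjsubid > 0;
SQL
```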
### 5. Output Plugin / Format Transformer

**Purpose**: Convert logical decoding output to clean SQL

**Tasks**:

- Parse the output plugin format (test_decoding or pgoutput)
- Generate proper INSERT/UPDATE/DELETE statements
- Handle complex data type conversions and escaping:
  - Arrays (e.g., `{1,2,3}`)
  - JSON/JSONB values
  - BYTEA (binary data)
  - Custom/composite types
  - NULL vs empty string vs literal 'null'
  - Text with quotes, backslashes, and special characters
  - Temporal types (timestamp with timezone, intervals)
- Preserve transaction boundaries (BEGIN/COMMIT)
- Format output for human readability
- Ensure output is valid SQL that can be executed years later

**Challenges**:

- The `test_decoding` plugin is primarily for debugging, not production use
- Consider using the `pgoutput` plugin (used by PostgreSQL's built-in logical replication)
- Data type semantics may change across PostgreSQL major versions
- Custom types and extensions (PostGIS, etc.) may not exist in future systems

### 6. Monitoring and Health Check Script

**Purpose**: Ensure the backup system is functioning correctly

**Tasks**:

- Check replication slot health and lag
- Alert if the slot is not advancing
- Monitor disk space in backup directories
- Verify backup files are being created
- Test restore procedures periodically

## Testing Strategy

### 1. Basic Functionality Test

```sql
-- Create test database
CREATE DATABASE backup_test;
\c backup_test

-- Run setup script
-- Then create test data
CREATE TABLE test_users (id SERIAL PRIMARY KEY, name TEXT, created_at TIMESTAMP DEFAULT now());
INSERT INTO test_users (name) VALUES ('Alice'), ('Bob'), ('Charlie');

-- Wait for incremental backup to run
-- Add more data
UPDATE test_users SET name = 'Alice Smith' WHERE id = 1;
DELETE FROM test_users WHERE id = 3;

-- Take another incremental
-- Add schema change
ALTER TABLE test_users ADD COLUMN email TEXT;
UPDATE test_users SET email = 'alice@example.com' WHERE id = 1;

-- Now restore and verify all data is present and correct
```

### 2. Crash Recovery Test

```bash
# Start collecting incrementals
# Generate load with pgbench
# Simulate crash with pg_ctl stop -m immediate
# Restart PostgreSQL
# Collect incrementals again - verify no data loss and duplicates are handled
# Restore and verify data matches
```

### 3. Long-Term Storage Test

```bash
# Create backup
# Store backup in time capsule (simulate 10 years)
# On modern system, verify SQL is still readable and executable
```

## Operational Procedures

### Monitoring

Monitor these metrics:

```sql
-- Check replication slot health
SELECT
    slot_name,
    slot_type,
    database,
    active,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as replication_lag,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) as confirmed_lag
FROM pg_replication_slots
WHERE slot_type = 'logical';

-- Alert if lag > 1GB or slot is inactive for > 1 hour
```

### Backup Schedule

**Recommended**:

- **Incremental**: Every 15 minutes (or continuously streaming)
- **Full backup**: Daily at 2 AM
- **Schema-only dump**: Every hour
- **DDL history export**: Every 15 minutes (part of incremental)

### Retention Policy

- **Incremental backups**: Keep 7 days
- **Full backups**: Keep 30 days, then one per month for 1 year
- **Monitor disk space**: Alert if the backup directory exceeds 80% capacity

### Disaster Recovery Runbook

1. **Stop application** to prevent new writes during restore
2. **Create new database** (don't overwrite production)
3. **Restore latest full backup**
4. **Apply DDL changes** from `ddl_history` in chronological order
5. **Apply incremental backups** in chronological order
6. **Sync sequences** to latest values
7. **Verify data integrity** with checksums or row counts
8. **Test application** against the restored database
9. **Switch over** the application to the restored database
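The runbook can be mirrored by an orchestration script. The skeleton below assumes the file layout used elsewhere in this document and a hypothetical `sync_sequences.sh` helper implementing the sketch from the Restore Script section:

```bash
#!/usr/bin/env bash
set -euo pipefail

TARGET=restored_db
createdb "$TARGET"                                     # step 2: never overwrite prod

LATEST_FULL=$(ls /backups/full/base-*.sql | sort | tail -n 1)
psql --dbname="$TARGET" --set ON_ERROR_STOP=1 --file="$LATEST_FULL"   # step 3

# Steps 4-5: replay DDL history and incrementals in chronological order.
for f in $(ls /backups/incremental/changes-*.sql | sort); do
  psql --dbname="$TARGET" --set ON_ERROR_STOP=1 --file="$f"
done

./sync_sequences.sh "$TARGET"                          # step 6 (hypothetical helper)
psql --dbname="$TARGET" \
  --command="SELECT relname, n_live_tup FROM pg_stat_user_tables"     # step 7 spot-check
```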
## Comparison with Alternatives

| Feature | Logical Replication | pg_dump Only | WAL Archiving |
|---------|---------------------|--------------|---------------|
| Incremental | ✓ Yes | ✗ No | ✓ Yes |
| Human-readable | ✓ SQL-like | ✓ SQL | ✗ Binary |
| Online operation | ✓ Yes | ✓ Yes | ✓ Yes |
| Crash-safe position | ✓ Yes | N/A | ✓ Yes |
| Captures DDL | ✗ **No** | ✓ Yes | ✓ Yes |
| Long-term readable | ✓ Yes | ✓ Yes | ✗ No |
| Point-in-time recovery | ✓ Yes | ✗ No | ✓ Yes |
| Storage efficiency | ✓ Good | ✗ Poor | ✓ Excellent |

## Conclusion

### Feasibility: YES, but More Complex Than Initially Assessed

The logical replication approach **is feasible** for incremental SQL backups, but the implementation is **more complex than it initially appears**. Key requirements:

1. **Separate DDL tracking** using event triggers (with gaps for shared objects like databases/roles)
2. **Additional tracking** for shared objects via periodic `pg_dumpall --globals-only` (roles and tablespaces), plus database definitions
3. **Aggressive monitoring** of replication slot health to prevent production outages
4. **Replica identity configuration** on all tables (or face errors on UPDATE/DELETE)
5. **Sequence synchronization** during restore
6. **Duplicate handling** for crash scenarios
7. **Complex data type handling** in the SQL transformation layer
8. **Version and extension metadata** tracking for long-term readability
9. **Performance overhead** monitoring and capacity planning

### Primary Limitations

**1. DDL Capture is Incomplete**

While logical replication doesn't capture DDL, event triggers only partially solve this:

- ✓ Captures table/index/function DDL
- ✗ Does NOT capture database creation/deletion
- ✗ Does NOT capture role management and permissions
- ✗ Does NOT capture tablespace changes
- Requires supplemental `pg_dumpall --globals-only` backups (plus separate handling for database definitions)

**2. Replica Identity Enforcement is Strict**

Tables without a proper replica identity will **cause errors on the source database** for UPDATE/DELETE operations, not just during restore. This requires proactive configuration.

**3. Operational Risk from Inactive Slots**

An inactive replication slot can cause database shutdown via transaction ID wraparound. This is not a theoretical concern - it requires 24/7 monitoring and documented emergency procedures.

### When This Approach Makes Sense

- You need human-readable backups for compliance/archival (legal requirement)
- You want continuous incremental backups without WAL shipping
- You have **strong operational maturity** and 24/7 monitoring capability
- You can dedicate engineering time to building and maintaining the system
- You can enforce discipline around replica identity and schema changes
- Your database has moderate change velocity (not extremely high-volume OLTP)
- You're willing to accept the performance overhead

### When to Consider Alternatives

- If you need zero-configuration backup → use `pg_dump` + `pg_basebackup`
- If operational complexity is too high → use WAL archiving + `pg_basebackup` (standard PITR)
- If you lack 24/7 monitoring capability → any approach without replication slots
- If you need guaranteed DDL capture → use WAL archiving (captures everything)
- If performance overhead is unacceptable → use `pg_basebackup` + WAL archiving
- If you need native PostgreSQL tooling only → use continuous archiving

### Realistic Complexity Assessment

**This is NOT a simple weekend project.** A production-ready implementation requires:

- **Development effort**: 2-4 weeks for initial implementation (reduced from 4-8 weeks due to `pg_recvlogical`)
  - Most streaming infrastructure already exists via `pg_recvlogical`
  - Primary development work is output transformation and orchestration
- **Testing effort**: 2-4 weeks for comprehensive testing
- **Documentation**: Detailed runbooks for the operations team
- **Ongoing maintenance**: Regular updates as PostgreSQL evolves
- **24/7 monitoring**: Critical alerting on replication slot health
- **Operational expertise**: Deep PostgreSQL knowledge required

**Key Insight**: The existence of `pg_recvlogical` substantially reduces implementation complexity. The original assessment significantly underestimated the availability of built-in PostgreSQL tools for this use case.

### Next Steps for Proof of Concept

1. **Initial Setup**: Create replication slot and DDL tracking infrastructure
   - Use `pg_recvlogical --create-slot` to create the logical replication slot
   - Set up event triggers for DDL capture
   - Take an initial `pg_dump` with `--snapshot` synchronized to the slot
2. **Streaming Collection**: Test `pg_recvlogical` for continuous streaming
   - Run `pg_recvlogical --start` to begin capturing changes
   - Verify the `test_decoding` output format
3. **Output Transformation**: Build a parser for the `test_decoding` format
   - Transform to clean, portable SQL
   - Handle complex data types (arrays, JSON, bytea, etc.)
4. **DDL Handling**: Implement shared object tracking
   - Schedule periodic `pg_dumpall --globals-only`
   - Coordinate with event trigger output
5. **Restore Process**: Build and test restore scripts
   - Test point-in-time recovery
   - Validate sequence synchronization
6. **Crash Recovery**: Test duplicate handling and LSN tracking
7. **Performance**: Measure storage and CPU overhead
8. **Documentation**: Write operational runbooks and emergency procedures

## References

- PostgreSQL Documentation: Chapter 25 - Backup and Restore
- PostgreSQL Documentation: Chapter 29 - Logical Replication
- PostgreSQL Documentation: Chapter 47 - Logical Decoding
- PostgreSQL Documentation: Section 29.8 - Logical Replication Restrictions
- PostgreSQL Documentation: `pg_recvlogical` man page
- PostgreSQL Documentation: `pg_dump` man page
- PostgreSQL Documentation: `pg_dumpall` man page