# Incremental SQL Backup System Using PostgreSQL Logical Replication

This document details a proof-of-concept design for a PostgreSQL backup system that produces human-readable, plain-SQL incremental backups using logical replication. The system aims to create backups that remain readable and restorable for 10+ years while supporting online operation and crash safety.
**Key Finding**: This approach is **feasible and significantly aided by built-in PostgreSQL tools**. The built-in `pg_recvlogical` utility provides most of the incremental collection functionality, substantially reducing implementation complexity.

**Key Challenges**:

1. **DDL is not captured by logical replication** - requires event triggers (which have gaps for shared objects)
   - Addressed by: `pg_dumpall --globals-only` for roles/tablespaces
2. **Replica identity enforcement** - tables without proper configuration will error on UPDATE/DELETE at the source
   - No tool assistance - requires manual configuration and enforcement
3. **Operational risk** - inactive replication slots can cause database shutdown via transaction ID wraparound
   - Partial assistance - monitoring views exist (`pg_stat_replication_slots`) but alerting must be custom
4. **Data type complexity** - transforming WAL output to portable SQL is non-trivial
   - Simplified by: `pg_recvlogical` provides structured output from the `test_decoding` plugin
5. **Incomplete coverage** - event triggers don't fire for databases, roles, or tablespaces
   - Addressed by: `pg_dumpall --globals-only` captures the shared objects
**Bottom Line**: While technically feasible, this approach requires substantially more engineering effort, operational maturity, and ongoing maintenance than standard PostgreSQL backup solutions (pg_basebackup + WAL archiving). However, the existence of `pg_recvlogical` reduces complexity significantly, and alternative output plugins (`decoder_raw`, `wal2json`) can further simplify SQL transformation. Only pursue this if human-readable SQL backups are a hard requirement (e.g., regulatory compliance).
## Architecture Overview

```
┌──────────────────────────────────────────────────────────────┐
│                     PostgreSQL Database                       │
│                                                               │
│   ┌────────────────┐          ┌──────────────────┐           │
│   │ Regular Tables │─────────▶│ WAL (Write-Ahead │           │
│   │ (DML Changes)  │          │ Log)             │           │
│   └────────────────┘          └──────────────────┘           │
│                                        │                      │
│                                        ▼                      │
│                     ┌─────────────────────────┐               │
│                     │ Logical Decoding Process│               │
│                     │ (Decodes WAL to SQL)    │               │
│                     └────────────┬────────────┘               │
│                                  │                            │
└──────────────────────────────────┼────────────────────────────┘
                                   │
                                   ▼
                     ┌─────────────────────────────┐
                     │  Logical Replication Slot   │
                     │ (Tracks position, durable)  │
                     └──────────────┬──────────────┘
                                    │
                                    ▼
                     ┌─────────────────────────────┐
                     │     pg_recvlogical Tool     │
                     │ (Built-in PostgreSQL util)  │
                     └──────────────┬──────────────┘
                                    │
                     ┌──────────────┴──────────────┐
                     ▼                             ▼
          ┌─────────────────────┐       ┌─────────────────────┐
          │  Incremental Files  │       │    Full pg_dump     │
          │  (SQL Changes)      │       │    (Periodic)       │
          │  - 2024-01-01.sql   │       │  - base-2024-01.sql │
          │  - 2024-01-02.sql   │       │  - base-2024-02.sql │
          └─────────────────────┘       └─────────────────────┘
```
### Key Components

1. **Logical Replication Slot**: Durable position tracker in PostgreSQL
2. **Output Plugin**: Transforms binary WAL into a human-readable format (e.g., `decoder_raw`, `wal2json`, or `test_decoding`)
3. **`pg_recvlogical`**: Built-in PostgreSQL tool that streams logical decoding output
4. **Base Backup System**: Regular full `pg_dump` backups with `--snapshot` for consistency
5. **Schema Tracking System**: Event triggers (or the `postgres_ddl_change_tracker` module) plus `pg_dumpall --globals-only` for DDL changes
## How Logical Replication Works

### What Logical Replication Provides

PostgreSQL's logical replication decodes the Write-Ahead Log (WAL) into logical changes:

```
-- Example output from test_decoding plugin:
table public.users: INSERT: id[integer]:1 name[text]:'Alice' email[text]:'alice@example.com'
table public.users: UPDATE: id[integer]:1 name[text]:'Alice Smith' email[text]:'alice@example.com'
table public.orders: DELETE: id[integer]:42
```

This can be transformed into standard SQL:

```sql
INSERT INTO public.users (id, name, email) VALUES (1, 'Alice', 'alice@example.com');
UPDATE public.users SET name = 'Alice Smith' WHERE id = 1;
DELETE FROM public.orders WHERE id = 42;
```

**Key properties**:

- **Crash-safe**: Replication slots persist position across crashes
- **Consistent**: Transaction boundaries are preserved
- **Online**: Runs without blocking database operations
- **Idempotent positioning**: Can restart from last known position
## Critical Challenge: DDL is NOT Replicated

From PostgreSQL documentation (Section 29.8):

> **The database schema and DDL commands are not replicated.** The initial schema can be copied by hand using `pg_dump --schema-only`. Subsequent schema changes would need to be kept in sync manually.

Logical replication will **NOT** capture:

- `CREATE TABLE` / `DROP TABLE`
- `ALTER TABLE` (adding columns, changing types, etc.)
- `CREATE INDEX` / `DROP INDEX`
- Any other schema modifications
### Example of the Issue

```sql
-- Day 1: Base backup taken
CREATE TABLE users (id INT PRIMARY KEY, name TEXT);

-- Day 2: Data changes (CAPTURED by logical replication)
INSERT INTO users VALUES (1, 'Alice');

-- Day 3: Schema change (NOT CAPTURED)
ALTER TABLE users ADD COLUMN email TEXT;

-- Day 4: More data changes (CAPTURED, but in new schema)
UPDATE users SET email = 'alice@example.com' WHERE id = 1;
```

**Problem**: The incremental backup for Day 4 will contain:

```sql
UPDATE users SET email = 'alice@example.com' WHERE id = 1;
```

But if you restore from the Day 1 base backup, the `email` column doesn't exist yet, and the UPDATE will fail.
### Solutions for DDL Tracking

#### Option 1: Event Triggers with postgres_ddl_change_tracker (Recommended)

PostgreSQL supports **event triggers** that can capture DDL events. For a production-ready implementation, consider the **postgres_ddl_change_tracker** open-source module (https://github.com/i-Cell-Mobilsoft-Open-Source/postgres_ddl_change_tracker), which provides a complete DDL tracking solution.

**Manual Implementation:**

For a custom implementation, you can create event triggers directly:

```sql
-- Create a table to log DDL changes
CREATE TABLE ddl_history (
    id SERIAL PRIMARY KEY,
    executed_at TIMESTAMP DEFAULT now(),
    ddl_command TEXT,
    object_type TEXT,
    object_identity TEXT,
    query TEXT
);

-- Create event trigger function
CREATE OR REPLACE FUNCTION log_ddl_changes()
RETURNS event_trigger AS $$
DECLARE
    obj RECORD;
BEGIN
    -- Capture DDL command details
    FOR obj IN SELECT * FROM pg_event_trigger_ddl_commands()
    LOOP
        INSERT INTO ddl_history (ddl_command, object_type, object_identity, query)
        VALUES (obj.command_tag, obj.object_type, obj.object_identity, current_query());
    END LOOP;
END;
$$ LANGUAGE plpgsql;

-- Register the event trigger for DDL commands
CREATE EVENT TRIGGER capture_ddl ON ddl_command_end
EXECUTE FUNCTION log_ddl_changes();
```
**How it works**:

1. The event trigger captures all DDL commands
2. Stores them in the `ddl_history` table with timestamps
3. Your backup script exports this table periodically (see the sketch below)
4. During restore, apply DDL changes in chronological order before applying DML changes
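For illustration, an export or restore script can read the captured DDL back out in execution order. A minimal sketch, assuming the `ddl_history` table defined above (the cutoff timestamp is a placeholder for the last exported position):

```sql
-- Pull DDL statements executed since the last export, in the order
-- they ran, ready to be replayed ahead of the corresponding DML.
SELECT executed_at, query
FROM ddl_history
WHERE executed_at > '2024-01-01 00:00:00'  -- placeholder cutoff
ORDER BY executed_at, id;
```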
**Advantages**:

- Captures all DDL automatically
- Timestamp-ordered for correct replay
- The `ddl_history` table itself gets replicated via logical replication

**Limitations**:

- **Event triggers don't fire for shared objects** - this is a significant gap:
  - `CREATE/DROP/ALTER DATABASE` commands are not captured
  - `CREATE/DROP/ALTER ROLE` and role membership (`GRANT ROLE`) are not captured
  - `CREATE/DROP/ALTER TABLESPACE` commands are not captured
  - `ALTER SYSTEM` commands are not captured
  - These require a **separate tracking mechanism** or periodic `pg_dumpall --globals-only`
- Event trigger functions cannot themselves use `CREATE EVENT TRIGGER` (no triggers on triggers)
- The `ddl_history` table's own schema evolution creates a bootstrap problem
#### Option 2: Application-Level Tracking

Track all schema migrations in your application:

```sql
-- migration 001
CREATE TABLE users (id INT PRIMARY KEY, name TEXT);

-- migration 002
ALTER TABLE users ADD COLUMN email TEXT;
```

Store these migrations alongside your backups and apply them in order during restore.

**Advantages**:

- Clean, version-controlled schema evolution
- Works well if you already use migration tools (Flyway, Liquibase, Alembic)

**Disadvantages**:

- Requires discipline - all DDL must go through the migration system
- Ad-hoc DDL changes will be missed
#### Option 3: Periodic Schema Dumps

Take regular `pg_dump --schema-only` snapshots:

```bash
pg_dump --schema-only dbname > schema-2024-01-01-1400.sql
```

**Restore process**:

1. Restore base backup
2. Apply schema dumps in order
3. Apply incremental DML changes

**Advantages**:

- Simple to implement
- Catches all schema changes

**Disadvantages**:

- May capture the schema at the wrong moment (between related DDL and DML)
- Harder to determine the exact order of schema vs. data changes
## Other Challenges and Limitations

### 1. Sequences Are Not Replicated

**Problem**: Sequence values (used by `SERIAL` and `IDENTITY` columns) are not replicated.

From PostgreSQL documentation (Section 29.8):

> Sequence data is not replicated. The data in serial or identity columns backed by sequences will of course be replicated as part of the table, but the sequence itself would still show the start value on the subscriber.

```sql
CREATE TABLE users (id SERIAL PRIMARY KEY, name TEXT);
-- ... 1000 rows inserted on the source ...

-- Sequence on source is now at 1001
-- Sequence on restored database is still at 1

-- Next insert will fail with a duplicate key error
INSERT INTO users (name) VALUES ('Bob'); -- Tries to use id=1, but id=1 already exists
```

**Solution**: Periodic full `pg_dump` backups capture sequence values by default (emitted as `setval()` calls). After restore, synchronize sequences:

```sql
-- After restore, sync all sequences
SELECT setval('users_id_seq', (SELECT MAX(id) FROM users));
```
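Setting each sequence by hand doesn't scale. Below is a minimal sketch that resynchronizes every column-owned sequence automatically; it assumes sequences are attached via `OWNED BY` (true for `SERIAL` and `IDENTITY` columns):

```sql
-- Sketch: resynchronize all table-owned sequences after a restore.
DO $$
DECLARE
    r RECORD;
BEGIN
    FOR r IN
        SELECT ns.nspname  AS seq_schema,
               seq.relname AS seq_name,
               tns.nspname AS tbl_schema,
               tab.relname AS tbl_name,
               att.attname AS col_name
        FROM pg_class seq
        JOIN pg_namespace ns  ON ns.oid = seq.relnamespace
        JOIN pg_depend dep    ON dep.objid = seq.oid
                             AND dep.refclassid = 'pg_class'::regclass
                             AND dep.refobjsubid > 0
                             AND dep.deptype IN ('a', 'i')  -- serial / identity
        JOIN pg_class tab     ON tab.oid = dep.refobjid
        JOIN pg_namespace tns ON tns.oid = tab.relnamespace
        JOIN pg_attribute att ON att.attrelid = tab.oid
                             AND att.attnum = dep.refobjsubid
        WHERE seq.relkind = 'S'
    LOOP
        -- MAX(col) of an empty table is NULL, so fall back to 1.
        EXECUTE format(
            'SELECT setval(%L, COALESCE((SELECT MAX(%I) FROM %I.%I), 1))',
            r.seq_schema || '.' || r.seq_name,
            r.col_name, r.tbl_schema, r.tbl_name
        );
    END LOOP;
END $$;
```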
### 2. Large Objects Are Not Replicated

From PostgreSQL documentation (Section 29.8):

> Large objects (see Chapter 33) are not replicated. There is no workaround for that, other than storing data in normal tables.

**Solutions**:

- **Preferred**: Avoid PostgreSQL large objects; use `BYTEA` columns instead, which ARE replicated.
- **Alternative**: Use `pg_dump --large-objects` (or `-b`) in periodic full backups to capture large objects
  - Note: incremental changes to large objects will NOT be captured between full backups
### 3. Replication Slots Prevent WAL Cleanup

From PostgreSQL documentation (Section 47.2.2):

> Replication slots persist across crashes and know nothing about the state of their consumer(s). They will prevent removal of required resources even when there is no connection using them. This consumes storage because neither required WAL nor required rows from the system catalogs can be removed by VACUUM as long as they are required by a replication slot.

**Problem**: If your backup script stops running, the replication slot will:

1. Prevent WAL files from being cleaned up → **disk fills up**
2. Prevent VACUUM from cleaning old row versions → **table bloat**
3. Prevent VACUUM from freezing old transaction IDs → **database shutdown to prevent transaction ID wraparound**

From PostgreSQL documentation (Section 47.2.2):

> In extreme cases this could cause the database to shut down to prevent transaction ID wraparound.

This is **not a theoretical risk** - an inactive replication slot can and will cause production outages.

**Mitigations**:

- **Monitor slot lag aggressively**:

  ```sql
  SELECT * FROM pg_replication_slots;
  ```

- Set up **critical alerting** when `restart_lsn` falls more than 1GB behind the current WAL position
- Set up **emergency alerting** when slot lag exceeds 10GB or slot age exceeds 24 hours
- Have a **documented emergency procedure** to drop the slot if it threatens database availability
- Consider `pg_replication_slot_advance()` to skip ahead if needed (loses incremental backup coverage)
- Use the `max_slot_wal_keep_size` parameter (PostgreSQL 13+) to limit how much WAL a slot can retain (see the sketch below)
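As a concrete safety valve, the WAL cap can be applied cluster-wide. A sketch (the 50GB figure is a placeholder; note that a slot invalidated by the cap breaks the incremental chain and forces a new base backup):

```sql
-- Cap how much WAL any replication slot may retain (PostgreSQL 13+).
-- A slot exceeding the cap is invalidated instead of filling the disk.
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';
SELECT pg_reload_conf();

-- Check for slots invalidated by the cap ('lost' in wal_status):
SELECT slot_name, wal_status FROM pg_replication_slots;
```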
### 4. Slot Position is Only Persisted at Checkpoint

From PostgreSQL documentation (Section 47.2.2):

> The current position of each slot is persisted only at checkpoint, so in the case of a crash the slot might return to an earlier LSN, which will then cause recent changes to be sent again when the server restarts.

**Problem**: After a crash, you might receive duplicate changes.

**Scenario**:

1. Script fetches changes, receives up to LSN 1000
2. Script writes changes to file
3. Database crashes before a checkpoint
4. Slot position is rolled back to LSN 900
5. Script fetches again, receives LSN 900-1000 again (duplicates!)

**Solutions**:

- `pg_recvlogical` handles this automatically through its status-reporting mechanism
- It sends periodic status updates back to the server confirming processed positions
- On restart, `pg_recvlogical` can use `--startpos` to specify a starting LSN
- For custom implementations: track the highest processed LSN and pass it to `pg_logical_slot_get_changes(slot_name, upto_lsn, ...)`; a sketch of the SQL interface follows
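For custom implementations, the SQL-level functions make the at-least-once semantics explicit. A minimal sketch (the slot name `backup_slot` and the LSN literal are placeholders):

```sql
-- Peek without consuming: write the changes durably to disk first.
SELECT lsn, xid, data
FROM pg_logical_slot_peek_changes('backup_slot', NULL, NULL);

-- Once the output is safely persisted, consume up to the recorded LSN,
-- advancing the slot so the same changes are not sent again.
SELECT count(*)
FROM pg_logical_slot_get_changes('backup_slot', '0/1A2B3C4D', NULL);
```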
### 5. TRUNCATE Replication Complexity

From PostgreSQL documentation (Section 29.8):

> Replication of TRUNCATE commands is supported, but some care must be taken when truncating groups of tables connected by foreign keys.

**Problem**: If you truncate multiple related tables (or use `TRUNCATE ... CASCADE`), the incremental backup needs to capture all of them.

**Solution**: Logical decoding emits the TRUNCATE together with the full list of tables affected by the same command, so the grouping is preserved on replay - but verify this behavior in testing.
### 6. Replica Identity Required for UPDATE/DELETE

For UPDATE and DELETE operations, PostgreSQL needs to identify which rows to modify. This requires a **replica identity**.

From PostgreSQL documentation (Section 29.1.1):

> A published table must have a replica identity configured in order to be able to replicate UPDATE and DELETE operations, so that appropriate rows to update or delete can be identified on the subscriber side.

**Replica Identity Modes**:

- **DEFAULT**: Records old values of primary key columns
  - This is the default for non-system tables
  - **WARNING**: When there is no primary key, behavior is the same as `NOTHING`
- **USING INDEX _index_name_**: Records old values of the columns in the specified unique index
  - Index must be unique, not partial, and not deferrable
  - Index columns must be marked `NOT NULL`
  - If the index is later dropped, behavior becomes the same as `NOTHING`
- **FULL**: Records old values of all columns in the row
  - Allows UPDATE/DELETE without a primary key, but is very expensive
  - Can use indexes on the subscriber for searching, but is inefficient without them
- **NOTHING**: Records no information about the old row
  - This is the default for system tables
  - Only `INSERT` operations are replicated

**Critical Behavior**:

From documentation (Section 29.1.1):

> Tables with a replica identity defined as `NOTHING`, `DEFAULT` without a primary key, or `USING INDEX` with a dropped index, **cannot support UPDATE or DELETE operations** when included in a publication replicating these actions. **Attempting such operations will result in an error on the publisher.**

This means UPDATE/DELETE will **fail on the source database**, not just during restore!

```sql
-- Table without a primary key
CREATE TABLE logs (timestamp TIMESTAMPTZ, message TEXT);

-- By default, replica identity is DEFAULT
-- Since there's no PK, this behaves as NOTHING
-- UPDATE/DELETE operations will ERROR if this table is in a publication!

-- Attempting this will fail:
UPDATE logs SET message = 'Updated' WHERE timestamp < now();
-- ERROR: cannot update table "logs" because it does not have a replica identity

-- Fix: Set replica identity to FULL
ALTER TABLE logs REPLICA IDENTITY FULL;

-- Now UPDATE/DELETE will work (but inefficiently)
UPDATE logs SET message = 'Updated' WHERE timestamp < now(); -- OK
```

**Best practice**: Ensure all tables have either:

- A primary key (automatic replica identity), OR
- A unique index on `NOT NULL` columns, configured via `REPLICA IDENTITY USING INDEX`, OR
- An explicit `REPLICA IDENTITY FULL` setting (only as a last resort - very inefficient)
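An audit query can find at-risk tables before they break replication. A sketch against the system catalogs (`relreplident` is `'d'` for DEFAULT and `'n'` for NOTHING; it does not cover the dropped-`USING INDEX` case):

```sql
-- Find user tables whose UPDATE/DELETE would error under a publication:
-- DEFAULT identity without a primary key, or identity NOTHING.
SELECT n.nspname AS schema_name, c.relname AS table_name, c.relreplident
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
  AND (
        c.relreplident = 'n'
     OR (c.relreplident = 'd' AND NOT EXISTS (
            SELECT 1 FROM pg_index i
            WHERE i.indrelid = c.oid AND i.indisprimary))
  );
```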
### 7. Long-Term SQL Readability Challenges

The goal of producing SQL that remains readable and executable for 10+ years introduces additional complexities:

**Version Compatibility Issues**:
- PostgreSQL syntax may change between major versions (though rarely)
- Data type representations may evolve
- Default behaviors may change (e.g., how NULLs are handled in certain contexts)

**Extension Dependencies**:
- Custom data types from extensions (PostGIS geometries, hstore, etc.) may not exist on future systems
- Extension versions may be incompatible (e.g., PostGIS 2.x vs. 3.x)
- Must document all extension names and versions alongside backups

**Encoding and Collation**:
- Character encoding standards may evolve
- Locale/collation definitions may change
- Text comparison behavior may differ across PostgreSQL versions

**Mitigations**:

- Include a metadata file with each backup containing:
  - PostgreSQL version (full version string)
  - All installed extension names and versions
  - Server encoding
  - Locale and collation settings
  - Custom data types and enums
- Periodically test restoring old backups on current PostgreSQL versions
- Consider limiting use of PostgreSQL-specific SQL features for better forward compatibility
- Prefer standard SQL types over custom types where possible
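A sketch of the queries such a metadata file could be generated from (the output format is up to the backup script):

```sql
-- Full server version string
SELECT version();

-- Installed extensions and their versions
SELECT extname, extversion FROM pg_extension ORDER BY extname;

-- Encoding, locale, and collation of the current database
SELECT pg_encoding_to_char(encoding) AS server_encoding, datcollate, datctype
FROM pg_database
WHERE datname = current_database();

-- User-defined enum types and their labels
SELECT t.typname, e.enumlabel
FROM pg_type t
JOIN pg_enum e ON e.enumtypid = t.oid
ORDER BY t.typname, e.enumsortorder;
```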
### 8. Performance and Overhead

Logical replication introduces measurable overhead:

**Write Amplification**:
- WAL must be written (normal)
- WAL must be decoded into logical format (additional CPU)
- Event triggers fire on every DDL operation (additional overhead)
- The `ddl_history` table itself generates more WAL entries

**Storage and I/O**:
- Additional WAL volume retained by replication slots
- More frequent checkpoint I/O if tuned for faster slot-position persistence

**Recommendations**:
- Benchmark the overhead on a test system with a production-like workload
- Monitor CPU usage of WAL sender processes
- Consider the trade-off: is the overhead worth human-readable backups?
## Prerequisites and Configuration

### PostgreSQL Configuration

```
# Required: Set WAL level to logical
wal_level = logical

# Required: Allow at least one replication slot
max_replication_slots = 10

# Recommended: Allow replication connections
max_wal_senders = 10

# Recommended: Keep more WAL for safety
wal_keep_size = 1GB

# Optional: Tune checkpoint frequency to persist slot positions more often
checkpoint_timeout = 5min
```

### Client Requirements

- PostgreSQL client utilities installed (`pg_recvlogical`, `pg_dump`, `pg_dumpall`)
- Superuser or a role with the `REPLICATION` privilege
- Permission to create replication slots
## Built-in PostgreSQL Tools That Help

PostgreSQL provides several built-in utilities that address the major challenges of this backup approach:

### `pg_recvlogical` - Logical Decoding Stream Collector

**Purpose**: Streams logical decoding output from a replication slot to files.

**Key Features**:
- Creates/drops logical replication slots (`--create-slot`, `--drop-slot`)
- Streams changes continuously with automatic reconnection (`--start`)
- Supports multiple output plugins (`--plugin=test_decoding`)
- Handles crash recovery with LSN position tracking
- File rotation via the SIGHUP signal
- Configurable fsync intervals (`--fsync-interval`)
- Can stop at a specific LSN (`--endpos`)

**Impact**: Eliminates the need for custom streaming infrastructure, connection management, and position tracking.

### `pg_dump` - Database Backup Utility

**Relevant Options**:
- `--snapshot=name`: Synchronize the dump with a logical replication slot's exported snapshot
- `--schema-only`: Capture only the schema for DDL tracking
- Sequence values: included by default in data dumps (emitted as `setval()` calls); no extra option is needed
- `--large-objects` / `-b`: Include large objects if needed

**Impact**: Provides a consistent initial backup synchronized with the replication slot.

### `pg_dumpall` - Cluster-wide Backup Utility

**Relevant Options**:
- `--globals-only`: Dump only shared objects (roles and tablespaces; database definitions themselves require a full `pg_dumpall`)

**Impact**: Solves the event trigger gap for shared objects that cannot be captured via DDL events.

### Built-in Monitoring Views

- `pg_replication_slots`: Monitor slot health, lag, and active status
- `pg_stat_replication`: Track replication connections and status
- `pg_stat_replication_slots`: Statistics on logical replication slot usage

**Impact**: Provides visibility into replication slot health for critical alerting.
## Implementation Components

When implementing this system, the following programs/scripts will be needed:

### 1. Initial Setup Script

**Purpose**: Bootstrap the backup system

**Tasks**:
- Create a logical replication slot with the appropriate output plugin
- Export the snapshot created by the slot for consistency
- Take the initial base backup using `pg_dump` **with the exported snapshot**
- Set up DDL tracking infrastructure (event triggers and the `ddl_history` table)
- Create a supplemental backup for shared objects (`pg_dumpall --globals-only`)
- Set `REPLICA IDENTITY` on tables without primary keys
- Document the PostgreSQL version and installed extensions with versions

**Critical Detail - Initial Snapshot Consistency**:

From documentation (Section 47.2.5):

> When a new replication slot is created using the streaming replication interface, a snapshot is exported which will show exactly the state of the database after which all changes will be included in the change stream. This can be used to create a new replica by using SET TRANSACTION SNAPSHOT to read the state of the database at the moment the slot was created.

This ensures the base backup and the incremental stream are perfectly aligned, with no gaps or overlaps.
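A sketch of that snapshot handshake (database name, slot name, and the returned snapshot ID are placeholders; the session that created the slot must stay open until the dump finishes, or the exported snapshot disappears):

```bash
# Session 1 - keep this connection open until pg_dump completes.
# Creating the slot over the replication protocol exports a snapshot:
psql "dbname=mydb replication=database"
#   => CREATE_REPLICATION_SLOT backup_slot LOGICAL test_decoding EXPORT_SNAPSHOT;
#   returns the slot name, consistent point, and a snapshot name
#   such as 00000003-0000001B-1

# Session 2 - dump exactly the state the slot will stream changes after:
pg_dump --snapshot=00000003-0000001B-1 mydb > base-$(date +%F).sql
```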
### 2. Incremental Backup Collection

**Tool**: Use the built-in `pg_recvlogical` utility

**Key Capabilities**:
- Creates and manages logical replication slots
- Streams decoded changes continuously from the replication slot
- Handles connection failures and automatic reconnection
- Tracks LSN positions with status updates to the server
- Supports file rotation via the SIGHUP signal
- Can stop at a specific LSN with `--endpos`
- Configurable fsync intervals for crash safety

**Example Usage** (database and slot names are placeholders; the output plugin is bound to the slot at `--create-slot` time, so each format needs its own slot):

```bash
# Using decoder_raw for direct SQL output (recommended)
pg_recvlogical --dbname=mydb --slot=backup_slot \
    --plugin=decoder_raw \
    --start \
    --file=/backups/incremental/changes.sql \
    --fsync-interval=10

# Alternative: Using wal2json for JSON output
pg_recvlogical --dbname=mydb --slot=backup_slot_json \
    --plugin=wal2json \
    --start \
    --file=/backups/incremental/changes.json \
    --fsync-interval=10

# Alternative: Using test_decoding (requires more transformation)
pg_recvlogical --dbname=mydb --slot=backup_slot_td \
    --plugin=test_decoding \
    --start \
    --file=/backups/incremental/changes.sql \
    --fsync-interval=10
```

**Additional Tasks** (custom wrapper script):
- File rotation and timestamping
- Transform output to clean, portable SQL (if using wal2json or test_decoding)
- Handle output format conversion for long-term readability
- Coordinate with the monitoring system
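For example, a rotation wrapper might rely on the SIGHUP behavior noted above. A minimal sketch (paths and the slot name are placeholders):

```bash
#!/bin/sh
# Sketch: timestamp the current output file, then signal pg_recvlogical
# to reopen its --file target, starting a fresh segment.
mv /backups/incremental/changes.sql \
   "/backups/incremental/changes-$(date +%F-%H%M).sql"
kill -HUP "$(pgrep -f 'pg_recvlogical.*backup_slot')"
```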
### 3. Periodic Full Backup Script

**Purpose**: Take regular full backups as restore points

**Tasks**:
- Execute `pg_dump` to create a full database backup
- Take a schema-only dump separately for reference
- Compress old backups to save space
- Implement a retention policy (delete old backups)
- Export the DDL history table
### 4. Restore Script

**Purpose**: Restore the database from base + incremental backups

**Tasks**:
- Locate the most recent full backup
- Find all incremental backups since that full backup
- Recreate the target database
- Restore the full backup
- Apply incremental backups in chronological order
- Handle point-in-time recovery (stop at a specific timestamp)
- Synchronize sequence values to current maximums
- Verify data integrity
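A skeletal version of such a script (database and file names are placeholders; incremental files are assumed to carry their own BEGIN/COMMIT boundaries):

```bash
#!/bin/sh
set -e
# Recreate the target and load the most recent base dump.
createdb restored_db
psql -v ON_ERROR_STOP=1 -d restored_db -f /backups/full/base-2024-01.sql

# Replay incrementals in chronological (lexicographic) order.
for f in /backups/incremental/changes-*.sql; do
    psql -v ON_ERROR_STOP=1 -d restored_db -f "$f"
done
```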
### 5. Output Plugin / Format Transformer

**Purpose**: Convert logical decoding output to clean SQL

**Recommended Approach**: Use alternative output plugins that simplify SQL transformation:

1. **decoder_raw plugin** (recommended for SQL output)
   - Generates raw SQL queries directly from logical changes
   - Produces executable INSERT/UPDATE/DELETE statements
   - Eliminates the need for complex parsing and transformation
   - Third-party plugin; must be built and installed separately (it is not part of PostgreSQL core)

2. **wal2json plugin** (recommended for structured output)
   - Open source: https://github.com/eulerto/wal2json
   - Outputs changes in JSON format (easier to parse than test_decoding)
   - Widely supported on managed PostgreSQL services (AWS RDS, Google Cloud SQL, Azure)
   - Well-documented with active maintenance
   - Requires JSON-to-SQL transformation but provides structured data

3. **test_decoding plugin** (built-in, but requires more work)
   - Ships with PostgreSQL core (as a contrib module)
   - Primarily designed for debugging and testing, not production
   - Requires a custom parser to transform its text output to SQL
   - Parsing libraries available: pylogicaldecoding (Python), pg-logical-replication (Node.js), LogicalDecode (Java)
**If using test_decoding or wal2json, transformation tasks include**:
- Parse the output plugin format
- Generate proper INSERT/UPDATE/DELETE statements
- Handle complex data type conversions and escaping:
  - Arrays (e.g., `{1,2,3}`)
  - JSON/JSONB values
  - BYTEA (binary data)
  - Custom/composite types
  - NULL vs. empty string vs. the literal string 'null'
  - Text with quotes, backslashes, and special characters
  - Temporal types (timestamp with time zone, intervals)
- Preserve transaction boundaries (BEGIN/COMMIT)
- Format output for human readability
- Ensure output is valid SQL that can be executed years later
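To make the wal2json path concrete, here is a hypothetical change record and the SQL it should become (simplified, abridged format-version-1 output):

```sql
-- wal2json emits one JSON document per transaction, e.g.:
--   {"change":[{"kind":"insert","schema":"public","table":"users",
--               "columnnames":["id","name"],
--               "columntypes":["integer","text"],
--               "columnvalues":[1,"Alice"]}]}
--
-- The transformer must render it as portable, correctly escaped SQL:
INSERT INTO public.users (id, name) VALUES (1, 'Alice');
```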
**Long-term risks**:
- Data type semantics may change across PostgreSQL major versions
- Custom types and extensions (PostGIS, etc.) may not exist on future systems
- Available plugins vary by PostgreSQL distribution and managed service provider
### 6. Monitoring and Health Check Script

**Purpose**: Ensure the backup system is functioning correctly

**Available Tools**:
- **pg_slot_notify / pgslot-notify-bot**: Open-source tool that monitors replication slot growth and sends Slack alerts
- **pgDash** (https://pgdash.io/): Commercial monitoring solution designed specifically for PostgreSQL
- **check_postgres**: Perl script that integrates with Nagios/Icinga/Zabbix for PostgreSQL monitoring
- **Prometheus + postgres_exporter + Grafana**: Complete open-source monitoring stack with visualization
- **Built-in views**: `pg_replication_slots`, `pg_stat_replication`, `pg_stat_replication_slots`

**Tasks**:
- Check replication slot health and lag
- Alert if the slot is not advancing
- Monitor disk space in backup directories
- Verify backup files are being created
- Test restore procedures periodically
## Testing Strategy

### 1. Basic Functionality Test

```sql
-- Create test database
CREATE DATABASE backup_test;

-- (Set up the backup system against backup_test: slot, base backup, DDL tracking)

-- Then create test data
CREATE TABLE test_users (id SERIAL PRIMARY KEY, name TEXT, created_at TIMESTAMP DEFAULT now());
INSERT INTO test_users (name) VALUES ('Alice'), ('Bob'), ('Charlie');

-- Wait for an incremental backup to run
UPDATE test_users SET name = 'Alice Smith' WHERE id = 1;
DELETE FROM test_users WHERE id = 3;

-- Take another incremental, then make a schema change
ALTER TABLE test_users ADD COLUMN email TEXT;
UPDATE test_users SET email = 'alice@example.com' WHERE id = 1;

-- Now restore and verify all data is present and correct
```
### 2. Crash Recovery Test

```bash
# Start collecting incrementals
# Generate load, e.g.: pgbench -c 4 -T 60 backup_test
# Simulate a crash: pg_ctl stop -m immediate
# Collect incrementals again - verify no data loss and that duplicates are handled
# Restore and verify the data matches
```
### 3. Long-Term Storage Test

```bash
# Archive a full backup set (base + incrementals + metadata)
# Store the backup in a "time capsule" (simulate 10 years)
# On a modern system, verify the SQL is still readable and executable
```
## Operational Procedures

### Monitoring

Monitor these metrics:

```sql
-- Check replication slot health
SELECT
    slot_name,
    active,
    restart_lsn,
    confirmed_flush_lsn,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS replication_lag,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS confirmed_lag
FROM pg_replication_slots
WHERE slot_type = 'logical';

-- Alert if lag > 1GB or the slot has been inactive for > 1 hour
```
### Backup Schedule (example)

- **Incremental**: Every 15 minutes (or continuous streaming)
- **Full backup**: Daily at 2 AM
- **Schema-only dump**: Every hour
- **DDL history export**: Every 15 minutes (part of the incremental)

### Retention Policy

- **Incremental backups**: Keep 7 days
- **Full backups**: Keep 30 days, then one per month for 1 year
- **Monitor disk space**: Alert if the backup directory exceeds 80% capacity
### Disaster Recovery Runbook

1. **Stop the application** to prevent new writes during restore
2. **Create a new database** (don't overwrite production)
3. **Restore the latest full backup**
4. **Apply DDL changes** from `ddl_history` in chronological order
5. **Apply incremental backups** in chronological order
6. **Sync sequences** to the latest values
7. **Verify data integrity** with checksums or row counts (see the sketch below)
8. **Test the application** against the restored database
9. **Switch over** the application to the restored database
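A sketch for step 7: run this on both the source and the restored database and diff the output (`n_live_tup` is an estimate, so use exact `count(*)` per table for final sign-off):

```sql
-- Approximate per-table row counts for a fast first-pass comparison
SELECT schemaname, relname, n_live_tup
FROM pg_stat_user_tables
ORDER BY schemaname, relname;
```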
## Comparison with Alternatives

| Feature | Logical Replication | pg_dump Only | WAL Archiving |
|---------|---------------------|--------------|---------------|
| Incremental | ✓ Yes | ✗ No | ✓ Yes |
| Human-readable | ✓ SQL-like | ✓ SQL | ✗ Binary |
| Online operation | ✓ Yes | ✓ Yes | ✓ Yes |
| Crash-safe position | ✓ Yes | N/A | ✓ Yes |
| Captures DDL | ✗ **No** | ✓ Yes | ✓ Yes |
| Long-term readable | ✓ Yes | ✓ Yes | ✗ No |
| Point-in-time recovery | ✓ Yes | ✗ No | ✓ Yes |
| Storage efficiency | ✓ Good | ✗ Poor | ✓ Excellent |
## Conclusion

### Feasibility Assessment

The logical replication approach **is feasible** for incremental SQL backups, but the implementation is **more complex than it initially appears**. Key requirements:

1. **Separate DDL tracking** using event triggers (with gaps for shared objects like databases/roles)
2. **Additional tracking** for shared objects via periodic `pg_dumpall --globals-only`
3. **Aggressive monitoring** of replication slot health to prevent production outages
4. **Replica identity configuration** on all tables (or face errors on UPDATE/DELETE)
5. **Sequence synchronization** during restore
6. **Duplicate handling** for crash scenarios
7. **Complex data type handling** in the SQL transformation layer
8. **Version and extension metadata** tracking for long-term readability
9. **Performance overhead** monitoring and capacity planning
### Primary Limitations

**1. DDL Capture is Incomplete**

Logical replication doesn't capture DDL, and event triggers only partially fill the gap:
- ✓ Captures table/index/function DDL
- ✗ Does NOT capture database creation/deletion
- ✗ Does NOT capture role management and permissions
- ✗ Does NOT capture tablespace changes
- Requires supplemental `pg_dumpall --globals-only` backups

**2. Replica Identity Enforcement is Strict**

Tables without a proper replica identity will **cause errors on the source database** for UPDATE/DELETE operations, not just during restore. This requires proactive configuration.

**3. Operational Risk from Inactive Slots**

An inactive replication slot can cause database shutdown via transaction ID wraparound. This is not a theoretical concern - it requires 24/7 monitoring and documented emergency procedures.
### When This Approach Makes Sense

- You need human-readable backups for compliance/archival (legal requirement)
- You want continuous incremental backups without WAL shipping
- You have **strong operational maturity** and 24/7 monitoring capability
- You can dedicate engineering time to building and maintaining the system
- You can enforce discipline around replica identity and schema changes
- Your database has moderate change velocity (not extremely high-volume OLTP)
- You're willing to accept the performance overhead

### When to Consider Alternatives

- If you need zero-configuration backup → use `pg_dump` + `pg_basebackup`
- If operational complexity is too high → use WAL archiving + `pg_basebackup` (standard PITR)
- If you lack 24/7 monitoring capability → use any approach without replication slots
- If you need guaranteed DDL capture → use WAL archiving (it captures everything)
- If performance overhead is unacceptable → use `pg_basebackup` + WAL archiving
- If you need native PostgreSQL tooling only → use continuous archiving
### Realistic Complexity Assessment

**This is NOT a simple weekend project.** A production-ready implementation requires:

- **Development effort**: 2-4 weeks for initial implementation
  - Most streaming infrastructure already exists via `pg_recvlogical`
  - Primary development work is output transformation and orchestration
- **Testing effort**: 2-4 weeks for comprehensive testing
- **Documentation**: Detailed runbooks for the operations team
- **Ongoing maintenance**: Regular updates as PostgreSQL evolves
- **24/7 monitoring**: Critical alerting on replication slot health
- **Operational expertise**: Deep PostgreSQL knowledge required
### Next Steps for Proof of Concept

1. **Initial Setup**: Create the replication slot and DDL tracking infrastructure
   - Use `pg_recvlogical --create-slot` to create the logical replication slot
   - Install and configure DDL tracking:
     - **Option A**: Use the `postgres_ddl_change_tracker` module (https://github.com/i-Cell-Mobilsoft-Open-Source/postgres_ddl_change_tracker)
     - **Option B**: Set up custom event triggers for DDL capture
   - Take the initial `pg_dump` with `--snapshot` synchronized to the slot
2. **Streaming Collection**: Test `pg_recvlogical` for continuous streaming
   - Choose an output plugin: `decoder_raw` (for SQL), `wal2json` (for JSON), or `test_decoding`
   - Run `pg_recvlogical --start --plugin=<chosen_plugin>` to begin capturing changes
   - Verify the output format and its executability
3. **Output Transformation**: Build the transformation layer (if needed)
   - If using `decoder_raw`: minimal transformation needed (already SQL)
   - If using `wal2json`: build a JSON-to-SQL transformer
   - If using `test_decoding`: build a parser using existing libraries (pylogicaldecoding, pg-logical-replication, etc.)
   - Handle complex data types (arrays, JSON, bytea, etc.)
4. **DDL Handling**: Implement shared object tracking
   - Schedule periodic `pg_dumpall --globals-only`
   - Coordinate with the event trigger output or postgres_ddl_change_tracker
5. **Monitoring Setup**: Configure replication slot monitoring
   - Deploy pg_slot_notify for Slack alerts, or
   - Set up Prometheus + postgres_exporter + Grafana, or
   - Configure check_postgres with Nagios/Zabbix
6. **Restore Process**: Build and test restore scripts
   - Test point-in-time recovery
   - Validate sequence synchronization
7. **Crash Recovery**: Test duplicate handling and LSN tracking
8. **Performance**: Measure storage and CPU overhead
9. **Documentation**: Write operational runbooks and emergency procedures
## References

### PostgreSQL Documentation
- PostgreSQL Documentation: Chapter 25 - Backup and Restore
- PostgreSQL Documentation: Chapter 29 - Logical Replication
- PostgreSQL Documentation: Chapter 47 - Logical Decoding
- PostgreSQL Documentation: Section 29.8 - Logical Replication Restrictions
- PostgreSQL Documentation: `pg_recvlogical` man page
- PostgreSQL Documentation: `pg_dump` man page
- PostgreSQL Documentation: `pg_dumpall` man page

### Third-Party Tools and Plugins
- **wal2json**: JSON output plugin for logical decoding - https://github.com/eulerto/wal2json
- **postgres_ddl_change_tracker**: DDL tracking module using event triggers - https://github.com/i-Cell-Mobilsoft-Open-Source/postgres_ddl_change_tracker
- **decoder_raw**: SQL output plugin for logical decoding (third-party; not shipped with PostgreSQL core)
- **pg_slot_notify**: Replication slot monitoring with Slack integration
- **pgDash**: PostgreSQL monitoring solution - https://pgdash.io/
- **check_postgres**: Monitoring script for Nagios/Icinga/Zabbix integration

### Parsing Libraries
- **pylogicaldecoding**: Python library for parsing test_decoding output
- **pg-logical-replication**: Node.js library for logical replication
- **LogicalDecode**: Java library for parsing logical decoding output