# Incremental SQL Backup System Using PostgreSQL Logical Replication

**PostgreSQL Version**: This design is based on PostgreSQL 18.0 documentation. While most features (logical replication, event triggers, `pg_recvlogical`) are available in earlier versions (PostgreSQL 10+), verify specific parameter availability (e.g., `max_slot_wal_keep_size` requires PostgreSQL 13+) for your target version.

## Executive Summary

This document details the design for a PostgreSQL backup system that produces human-readable, plain SQL incremental backups using logical replication. The system creates backups that remain readable and restorable for 10+ years while supporting online operation and crash safety.

**Design Decision**: Use `pg_recvlogical` with the `wal2sql` plugin for DML capture, combined with event triggers using `pg_logical_emit_message()` for DDL tracking and periodic `pg_dumpall --globals-only` for shared objects.

**Why This Works**:

- **Built-in tooling handles complexity**: `pg_recvlogical` provides streaming infrastructure, crash recovery, and position tracking
- **No transformation layer needed**: `wal2sql` produces production-ready SQL directly
- **Complete coverage**: Event triggers + `pg_logical_emit_message()` + `pg_dumpall --globals-only` capture all DDL at the correct chronological positions
- **Long-term readability**: Plain SQL format that can be executed years later
- **Correct DDL/DML ordering**: DDL messages appear in the replication stream at the exact time of execution

**Key Requirements**:

1. **DDL tracking** - Event triggers emit DDL via `pg_logical_emit_message()`; `pg_dumpall --globals-only` handles shared objects
2. **Replica identity configuration** - All tables need proper configuration for UPDATE/DELETE
3. **Aggressive monitoring** - Replication slots must be monitored to prevent operational issues
4. **wal2sql extension** - ✅ **COMPLETE** - Logical decoding plugin with full DML, DDL, and TRUNCATE support

## Architecture Overview

### High-Level Design

```
┌─────────────────────────────────────────────────────────────┐
│                     PostgreSQL Database                     │
│                                                             │
│   ┌────────────────┐          ┌──────────────────┐          │
│   │ Regular Tables │─────────▶│ WAL (Write-Ahead │          │
│   │ (DML Changes)  │          │ Log)             │          │
│   └────────────────┘          └────────┬─────────┘          │
│                                        ▼                    │
│                     ┌──────────────────────────┐            │
│                     │ Logical Decoding Process │            │
│                     │ (wal2sql plugin)         │            │
│                     └────────────┬─────────────┘            │
└──────────────────────────────────┼──────────────────────────┘
                                   ▼
                    ┌─────────────────────────────┐
                    │ Replication Slot            │
                    │ (Tracks position, durable)  │
                    └──────────────┬──────────────┘
                                   ▼
                    ┌─────────────────────────────┐
                    │ pg_recvlogical Tool         │
                    │ (Built-in PostgreSQL util)  │
                    └──────────────┬──────────────┘
                    ┌──────────────┴──────────────┐
                    ▼                             ▼
        ┌─────────────────────┐       ┌─────────────────────┐
        │ Incremental Files   │       │ Full pg_dump        │
        │ (SQL Changes)       │       │ (Periodic)          │
        │ - 2024-01-01.sql    │       │ - base-2024-01.sql  │
        │ - 2024-01-02.sql    │       │ - base-2024-02.sql  │
        │ - ...               │       │ - ...               │
        └─────────────────────┘       └─────────────────────┘
```

### Core Components

1. **Logical Replication Slot**: Durable position tracker in PostgreSQL
2. **wal2sql Plugin**: ✅ **IMPLEMENTED** - Transforms binary WAL into executable SQL
3. **pg_recvlogical**: Built-in PostgreSQL tool that streams logical decoding output
4. **Base Backup System**: ⏳ **PLANNED** - Regular full `pg_dump` backups with `--snapshot` for consistency
5. **Schema Tracking System**: ⏳ **PLANNED** - Event triggers + `pg_dumpall --globals-only` for DDL changes

## How It Works

### DML Capture via Logical Replication

PostgreSQL's logical replication decodes the Write-Ahead Log (WAL) into logical changes.
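As a concrete illustration of the pipeline above, here is a minimal sketch of the two `pg_recvlogical` invocations involved: one to create the slot bound to the plugin, and one to stream from it. The slot name, database name, and output file name are illustrative; the flags (`--create-slot`, `--start`, `--slot`, `--plugin`, `--file`, `--status-interval`, `--fsync-interval`, `--option`) are standard `pg_recvlogical` options.

```python
# Sketch: assemble the two pg_recvlogical command lines used by the
# collection pipeline. Names below (appdb, pg_scribe_slot, incremental.sql)
# are illustrative placeholders, not fixed by the design.

def create_slot_cmd(dbname="appdb", slot="pg_scribe_slot"):
    # One-time: create the logical replication slot bound to wal2sql.
    return [
        "pg_recvlogical", "-d", dbname,
        "--slot", slot,
        "--create-slot",
        "--plugin", "wal2sql",
    ]

def start_stream_cmd(dbname="appdb", slot="pg_scribe_slot",
                     outfile="incremental.sql"):
    # Continuous: stream decoded SQL to a file, reporting position to the
    # server and fsyncing the output every 10 seconds.
    return [
        "pg_recvlogical", "-d", dbname,
        "--slot", slot,
        "--start",
        "--file", outfile,
        "--status-interval", "10",
        "--fsync-interval", "10",
        "--option", "include_transaction=on",
    ]

if __name__ == "__main__":
    print(" ".join(create_slot_cmd()))
    print(" ".join(start_stream_cmd()))
```

The `include_transaction=on` plugin option is the one described under "Incremental Backup Collection" below; it makes wal2sql emit BEGIN/COMMIT around each decoded transaction.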
The `wal2sql` plugin outputs these changes directly as executable SQL:

```sql
BEGIN;
INSERT INTO public.users (id, name, email) VALUES (1, 'Alice', 'alice@example.com');
UPDATE public.users SET name = 'Alice Smith' WHERE id = 1;
DELETE FROM public.orders WHERE id = 42;
COMMIT;
```

**Key Properties**:

- **Crash-safe**: Replication slots persist position across crashes (positions are persisted at checkpoint intervals; after a crash, the slot may return to an earlier LSN, causing recent changes to be replayed)
- **Consistent**: Transaction boundaries are preserved
- **Online**: Runs without blocking database operations
- **Idempotent positioning**: Can restart from the last known position (clients are responsible for handling duplicate messages)

### DDL Capture via Event Triggers and Logical Messages

Logical replication does **not** capture DDL (CREATE TABLE, ALTER TABLE, etc.). We solve this by emitting DDL commands directly into the logical replication stream using PostgreSQL's `pg_logical_emit_message()` function.

The wal2sql extension (installed via `CREATE EXTENSION wal2sql`) provides an event trigger that automatically captures DDL and emits it into the replication stream with the `ddl` prefix. See `wal2sql/wal2sql--0.1.sql` for the implementation.

**How it works**:

1. The event trigger fires on all DDL commands
2. `pg_logical_emit_message()` writes the DDL into the WAL as a logical decoding message
3. The message appears in the replication stream at its exact chronological position relative to DML
4. The output plugin's `message_cb` callback outputs the DDL as executable SQL
5. Restore is simple: just execute the incremental backup file sequentially

**Example incremental backup output**:

```sql
BEGIN;
INSERT INTO public.users (id, name, email) VALUES (1, 'Alice', 'alice@example.com');
COMMIT;

BEGIN;
-- DDL message appears at the exact time of execution
ALTER TABLE public.users DROP COLUMN email;
COMMIT;

BEGIN;
-- Subsequent DML only references the remaining columns
INSERT INTO public.users (id, name) VALUES (2, 'Bob');
COMMIT;
```

**Key Properties**:

- ✅ **Perfect chronological ordering**: DDL appears exactly when it was executed
- ✅ **Transactional integrity**: The DDL message commits with its transaction
- ✅ **Simple restore**: Execute the backup file sequentially with `psql -f`
- ✅ **PostgreSQL built-in**: `pg_logical_emit_message()` has been available since PostgreSQL 9.6

**Limitations**:

- Event triggers don't fire for shared objects: databases, roles, tablespaces, parameter privileges, and ALTER SYSTEM commands
- Solution: Use periodic `pg_dumpall --globals-only` to capture shared objects

## Database Schema Prerequisites

This section summarizes the **mandatory database requirements** for this backup system. The setup script MUST validate these and fail with clear errors if they are not met.

### Critical Requirements

**1. Replica Identity** (see "Key Challenges" section #1 for details):

- Every table MUST have a PRIMARY KEY, or `REPLICA IDENTITY USING INDEX`, or `REPLICA IDENTITY FULL`
- Without this: UPDATE/DELETE operations **fail on the source database**
- Validation: run the replica identity query under "Pre-Deployment Validation" below

**2. PostgreSQL Configuration**:

- `wal_level = logical` (required)
- `max_replication_slots >= 1` (required)
- `max_wal_senders >= 1` (required)

**3. Event Trigger Installation**:

- Run `CREATE EXTENSION wal2sql;` in every database being backed up
- Without this: DDL changes are NOT captured

### Objects NOT Captured (plan alternatives)

- **Unlogged/temporary tables**: Silently excluded from logical replication
- **Sequence state**: Not replicated incrementally (the restore script automatically synchronizes via `setval()`)
- **Large objects**: Use `BYTEA` instead, or accept the loss (see "Key Challenges" #4)
- **Shared objects**: Captured via `pg_dumpall --globals-only`
- **System columns**: `ctid`, `xmin`, `xmax` will differ after restore

### Additional Constraints

- **REPLICA IDENTITY FULL** + geometric types (`point`, `box`, `circle`, etc.): Use a PRIMARY KEY instead
- **Foreign keys**: TRUNCATE CASCADE fails if related tables are split across different publications
- **Row-level security**: Disable RLS on target tables during restore
- **Triggers**: Disable triggers during restore to avoid duplicate side effects
- **Extensions**: Must be available on the restore target (document in metadata)
- **Collation/encoding**: Must be compatible between source and target

### Pre-Deployment Validation

```sql
-- CRITICAL: Find tables without adequate replica identity
SELECT n.nspname, c.relname, c.relreplident
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
  AND c.relreplident IN ('d', 'n')  -- DEFAULT or NOTHING
  AND NOT EXISTS (SELECT 1 FROM pg_index i
                  WHERE i.indrelid = c.oid AND i.indisprimary);
-- Must return 0 rows

-- CRITICAL: Verify configuration
SHOW wal_level;  -- Must be 'logical'
SELECT count(*) FROM pg_event_trigger
WHERE evtname = 'wal2sql_ddl_trigger';  -- Must be 1

-- WARNING: Check for unlogged tables (won't be backed up)
SELECT n.nspname, c.relname
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND c.relpersistence = 'u'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema');

-- WARNING: Check for large objects (not incrementally backed up)
SELECT count(*) FROM pg_largeobject_metadata;
```

See the "Key Challenges and Solutions" section below for detailed explanations and solutions.

## Implementation Components

### 1. Initial Setup Script

**Purpose**: Bootstrap the backup system

**Status**: ⏳ **PLANNED** - Not yet implemented

**Philosophy**: The setup script validates prerequisites and **fails fast** with clear error messages. It does NOT automatically fix schema issues - human oversight is required for replica identity decisions because of the performance tradeoffs.

**Tasks**:

1. **Validate Prerequisites** (print a report to stdout):
   - **CRITICAL** (must fix before proceeding):
     - Verify `wal_level = logical`, `max_replication_slots >= 1`, `max_wal_senders >= 1`
     - Check that all tables have adequate replica identity (PRIMARY KEY, USING INDEX, or FULL)
   - **Coverage Warnings**:
     - List unlogged tables (will not be backed up)
     - Check for large objects (not incrementally backed up)
   - **If validation fails**: Print a detailed report and EXIT with an error
   - **A human must fix CRITICAL issues** before re-running setup
2. **Setup Replication Infrastructure** (after validation passes):
   - Create the logical replication slot with snapshot export
   - Capture the exported snapshot identifier
   - Set up event triggers via `CREATE EXTENSION wal2sql`
3. **Initial Backup**:
   - Take a synchronized base backup using the exported snapshot (`pg_dump`)
   - Create the initial `pg_dumpall --globals-only` backup
   - Write `pg_scribe_version.txt` to the backup directory (PostgreSQL version, pg_scribe version, setup timestamp)

**Note**: Most metadata (sequences, extensions, replica identity, encoding) is already captured in the `pg_dump` and `pg_dumpall` files. The restore script queries this information directly from the backup files or the restored database as needed.
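The fail-fast validation step above can be sketched as a pure function that classifies findings from the catalog queries into critical errors (abort setup) and warnings (report only). The function name and input shapes are illustrative, not part of the implemented tooling; the rows stand in for results of the Pre-Deployment Validation queries.

```python
# Sketch: fail-fast prerequisite validation for the setup script.
# Inputs mimic rows returned by the catalog queries shown earlier;
# all names here are illustrative assumptions.

def validate_prerequisites(wal_level, tables_without_identity, unlogged_tables):
    """Return (critical, warnings). Setup must abort if critical is non-empty."""
    critical, warnings = [], []

    # CRITICAL: logical decoding requires wal_level = logical.
    if wal_level != "logical":
        critical.append(f"wal_level is '{wal_level}', must be 'logical'")

    # CRITICAL: tables lacking replica identity break UPDATE/DELETE capture.
    for schema, table in tables_without_identity:
        critical.append(
            f"{schema}.{table}: no PRIMARY KEY or REPLICA IDENTITY; "
            "UPDATE/DELETE cannot be decoded")

    # WARNING: unlogged tables are silently excluded from the backup.
    for schema, table in unlogged_tables:
        warnings.append(f"{schema}.{table}: unlogged, will NOT be backed up")

    return critical, warnings


if __name__ == "__main__":
    crit, warn = validate_prerequisites(
        "replica", [("public", "logs")], [("public", "cache")])
    for line in crit:
        print("CRITICAL:", line)
    for line in warn:
        print("WARNING: ", line)
```

Keeping this logic side-effect-free (query first, then classify) makes the "print report and EXIT" behavior trivial to test without a live database.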
**Critical Detail - Initial Snapshot Consistency**:

From PostgreSQL documentation (Section 47.2.5):

> When a new replication slot is created using the streaming replication interface, a snapshot is exported which will show exactly the state of the database after which all changes will be included in the change stream.

This ensures the base backup and incremental stream are perfectly aligned with no gaps or overlaps.

**Step-by-Step Setup Procedure**:

1. **Verify prerequisites**:
   - Check that the backup directory is empty (or create it if it doesn't exist)
   - Verify the replication slot doesn't already exist (fail if it does)
   - Validate PostgreSQL configuration and table replica identity
2. **Create replication slot**:
   - Use the `pg_create_logical_replication_slot()` SQL function
   - Capture the returned LSN position for base backup coordination
   - If creation fails, exit with an error (no cleanup needed yet)
3. **Take synchronized base backup**:
   - Use `pg_dump --file=backup.sql` immediately after slot creation
   - The slot preserves WAL from its creation LSN forward, ensuring no changes are lost
   - Track created files for cleanup if subsequent steps fail
4. **Capture globals and metadata**:
   - Run `pg_dumpall --globals-only` for roles and tablespaces
   - Document PostgreSQL version, extensions, and encoding in a metadata file
   - Track all created files for cleanup if any step fails

**Critical Considerations**:

- **Cleanup on failure**: If any step fails, automatically remove the replication slot and partial backup files
- **Non-idempotent**: Following PostgreSQL conventions (`initdb`, `pg_basebackup`), the operation requires a clean state and does not support re-running
- **Empty directory required**: Like `pg_basebackup`, the backup directory must be empty to prevent accidental overwrites

### 2. Incremental Backup Collection

**Status**: ✅ **IMPLEMENTED** - `pg_scribe --start` command available

**Tool**: Built-in `pg_recvlogical` utility with the `wal2sql` plugin

**Implementation Approach**: The `pg_scribe --start` command is a thin wrapper that:

1. Validates the database connection and replication slot
2. Displays configuration and user information
3. Uses `exec` to **replace itself** with `pg_recvlogical`

This design has several advantages over traditional parent-child process management:

- **No orphaned processes**: The pg_scribe process becomes pg_recvlogical (same PID)
- **Direct signal handling**: Signals (SIGTERM, SIGINT, SIGHUP) go directly to pg_recvlogical
- **Simpler implementation**: No need for signal forwarding or child-process tracking
- **Reliable cleanup**: Test suites and process management tools interact with a single process

**Key Configuration**: The wrapper invokes `pg_recvlogical --start` with the following important parameters:

- **Status interval** (`--status-interval`): Controls how often the client reports its position back to the server (default: 10 seconds)
  - Lower values allow the server to advance the slot position and free WAL more quickly
  - Balance between slot health and network overhead
- **Fsync interval** (`--fsync-interval`): Controls disk-write safety (default: 10 seconds)
  - Frequency of forced disk synchronization for crash safety on the backup client
  - 0 disables fsync (faster, but risks data loss if the client crashes)
  - Higher values improve performance but widen the window of potential loss
- **Plugin options** (`--option`): Pass `include_transaction=on` to wal2sql
  - Includes BEGIN/COMMIT statements in the output
  - Essential for maintaining transaction boundaries during restore

**What pg_recvlogical provides**:

- Streams decoded changes continuously from the replication slot
- Handles connection failures and automatic reconnection
- Tracks LSN positions with status updates to the server
- Supports file rotation via the SIGHUP signal (for log rotation without stopping the stream)

**What wal2sql provides** (✅ **ALL IMPLEMENTED**):

- ✅ Schema-qualified table names: `INSERT INTO public.users (...)`
- ✅ Proper column-name quoting with `quote_identifier()`
- ✅ Transaction boundaries via the `include_transaction=on` option (BEGIN/COMMIT)
- ✅ Intelligent replica identity handling (DEFAULT, INDEX, FULL, NOTHING)
- ✅ Comprehensive data type support (booleans, NaN, Infinity, NULL, strings, bit strings)
- ✅ TOAST optimization (skips unchanged TOAST columns in UPDATEs)
- ✅ Production-quality memory management
- ✅ **`message_cb` callback** to handle DDL messages from `pg_logical_emit_message()`
- ✅ **`truncate_cb` callback** to handle TRUNCATE operations

### 3. Periodic Full Backup Script

**Status**: ⏳ **PLANNED** - Not yet implemented

**Purpose**: Take regular full backups as restore points

**Tasks**:

- Execute `pg_dump --file=backup.sql` to create a full database backup in plain SQL format
- Execute `pg_dumpall --globals-only` to capture shared objects (databases, roles, tablespaces)
- Compress backups to save space
- Implement a retention policy (delete old backups)

**Backup Approach**:

- Use `pg_dump --file=backup.sql` for plain SQL output
- Maintains the human-readable format consistent with the design goals
- Single connection to the database
- Universal compatibility across PostgreSQL versions

**Compression**:

- Use `pg_dump --compress=` for built-in compression (e.g., `--compress=zstd:9`)
- Or compress after creation with external tools (gzip, zstd)
- Compression significantly reduces storage requirements while maintaining recoverability

### 4. Restore Script

**Status**: ⏳ **PLANNED** - Not yet implemented

**Purpose**: Restore the database to the latest captured state from base + incremental backups

**Restore Process**:

1. Locate the most recent full backup and all incremental backups since then
2. Create a new target database
3. Restore the `pg_dumpall --globals-only` output (shared objects: roles, tablespaces)
4. Restore the full `pg_dump` backup using `psql -f backup.sql`
5. Disable triggers to prevent duplicate side effects during replay
6. Apply all incremental SQL backups in chronological order using `psql -f incremental-*.sql`
   - DDL and DML are already interleaved in correct chronological order
7. Automatically synchronize all sequence values to match table max values
8. Re-enable triggers
9. Verify data integrity (row counts, application smoke tests)

**Key Automation**:

- **Sequences**: The script automatically runs `setval()` for all sequences after restore
- **Triggers**: The script automatically disables/re-enables triggers to prevent side effects
- **RLS**: The script handles row-level security if present (disable during restore or use a superuser)

**Handling Duplicate Transactions**: After a PostgreSQL crash, the replication slot may return to an earlier LSN, causing some transactions to be streamed again. The restore process handles this naturally through idempotent operations:

- Most SQL operations are idempotent or fail safely:
  - INSERT fails on a duplicate primary key (acceptable during restore)
  - UPDATE reapplies the same values (idempotent)
  - DELETE succeeds or reports that no row matched (acceptable)
- Transaction boundaries (BEGIN/COMMIT from `include_transaction=on`) preserve consistency
- Simply apply all incremental files in order; duplicates will be handled correctly
- No additional LSN-tracking infrastructure is required

### 5. Monitoring and Health Check Script

**Status**: ⏳ **PLANNED** - Not yet implemented

**Purpose**: Prevent operational issues from inactive replication slots

**Critical Metrics**:

```sql
-- Check replication slot health
SELECT slot_name,
       slot_type,
       database,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS replication_lag,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS confirmed_lag
FROM pg_replication_slots
WHERE slot_type = 'logical';
```

**Alerting**:

- **Critical alert** when `restart_lsn` falls more than 1 GB behind
- **Emergency alert** when slot lag exceeds 10 GB or slot age exceeds 24 hours
- **Emergency procedure** documented to drop the slot if it threatens database availability

**Available Tools**:

- **Prometheus + postgres_exporter + Grafana**: Open-source monitoring stack
- **pgDash** (https://pgdash.io/): Commercial PostgreSQL monitoring
- **check_postgres**: Nagios/Icinga/Zabbix integration
- **Built-in views**: `pg_replication_slots`, `pg_stat_replication_slots`

## wal2sql Plugin Details

**Status**: ✅ **COMPLETE** - All required functionality implemented and tested

**Source**: Forked from decoder_raw (https://github.com/michaelpq/pg_plugins/tree/main/decoder_raw)

**Location**: `wal2sql/` directory

**Compatibility**: PostgreSQL 18.0+

**Installation**:

```bash
cd wal2sql
make && make install
PGUSER=postgres make installcheck  # Run tests
```

**Implemented Features**:

- ✅ INSERT/UPDATE/DELETE statements with proper escaping
- ✅ Transaction boundaries (BEGIN/COMMIT)
- ✅ Replica identity handling (DEFAULT, INDEX, FULL, NOTHING)
- ✅ **`message_cb`** - DDL capture via `pg_logical_emit_message()`
- ✅ **`truncate_cb`** - TRUNCATE statement capture
- ✅ Comprehensive data type support (booleans, NaN, Infinity, NULL, strings)
- ✅ TOAST optimization
- ✅ Full test coverage

## Key Challenges and Solutions

### 1. Replica Identity Required for UPDATE/DELETE

**Problem**: Tables need a replica identity for UPDATE/DELETE operations. From PostgreSQL documentation (Section 29.1.1):

> Tables with a replica identity defined as `NOTHING`, `DEFAULT` without a primary key, or `USING INDEX` with a dropped index, **cannot support UPDATE or DELETE operations**. **Attempting such operations will result in an error on the publisher.**

This means UPDATE/DELETE will **fail on the source database**, not just during restore! (Note: PostgreSQL raises this error for tables included in a publication that publishes updates/deletes; verify the exact behavior for a bare replication slot on your target version. Either way, treat replica identity as a hard requirement.)

**Solution**: Ensure all tables have one of:

- A primary key (automatic replica identity)
- A unique index configured via `REPLICA IDENTITY USING INDEX index_name`
- An explicit `REPLICA IDENTITY FULL` setting (inefficient; last resort)

**Example**:

```sql
-- Table without a primary key will error on UPDATE/DELETE
CREATE TABLE logs (timestamp TIMESTAMPTZ, message TEXT);

-- Fix: Set replica identity to FULL
ALTER TABLE logs REPLICA IDENTITY FULL;
```

### 2. Replication Slots Prevent WAL Cleanup

**Problem**: Inactive replication slots prevent WAL cleanup, leading to:

1. Disk fills up (WAL files are not removed)
2. Table bloat (VACUUM cannot clean old row versions)
3. **Database shutdown** (transaction ID wraparound)

From PostgreSQL documentation (Section 47.2.2):

> In extreme cases this could cause the database to shut down to prevent transaction ID wraparound.

**Solution**:

- **Monitor slot lag aggressively** (see the monitoring section)
- Set the `max_slot_wal_keep_size` parameter (PostgreSQL 13+) to limit WAL retention
- Have a documented emergency procedure to drop the slot if needed
- Consider `pg_replication_slot_advance()` to skip ahead (loses backup coverage)

### 3. Sequences Are Not Replicated

**Problem**: Sequence values are not captured by logical replication.

**Solution**:

- Periodic full dumps capture sequence state (`pg_dump` includes sequence values by default)
- After restore, synchronize sequences:

```sql
SELECT setval('users_id_seq', (SELECT MAX(id) FROM users));
```

### 4. Large Objects Are Not Replicated

**Problem**: PostgreSQL large objects are not captured by logical replication.

**Solution**:

- **Preferred**: Use `BYTEA` columns instead (these ARE replicated)
- **Alternative**: Use `pg_dump --large-objects` in periodic full backups
- Note: Incremental changes to large objects are NOT captured between full backups

### 5. Crash Recovery and Duplicate Handling

**Problem**: After a database crash, the slot position may roll back, causing duplicate changes. From PostgreSQL documentation (Section 47.2.2):

> The current position of each slot is persisted only at checkpoint, so in the case of a crash the slot might return to an earlier LSN, which will then cause recent changes to be sent again when the server restarts.

**Solution**: The restore process handles duplicates naturally through idempotent operations. Per PostgreSQL documentation (Section 47.2.2): "Logical decoding clients are responsible for avoiding ill effects from handling the same message more than once."

**Implementation**:

- Most SQL operations in the backup files are naturally idempotent:
  - INSERT will fail on a duplicate primary key (acceptable during restore)
  - UPDATE will reapply the same values (idempotent)
  - DELETE will succeed or report that no row matched (acceptable)
- Transaction boundaries (BEGIN/COMMIT from `include_transaction=on`) ensure consistency
- Simply apply all incremental files in chronological order
- No additional LSN-tracking infrastructure is required
- See the Restore Script section (Section 4) for implementation details

**Testing**:

- Test crash scenarios with `pg_ctl stop -m immediate` to verify duplicate handling
- Monitor `confirmed_flush_lsn` lag during normal operations (see the Monitoring section)

### 6. Long-Term Readability

**Challenges**:

- PostgreSQL syntax may change between major versions (rare)
- Extension dependencies may not exist on future systems
- Encoding/collation definitions may change

**Solution**: Include a metadata file with each backup:

- PostgreSQL version (full version string)
- All installed extension names and versions
- Database encoding
- Locale and collation settings
- Custom data types and enums

Periodically test restoring old backups on current PostgreSQL versions.

## Prerequisites and Configuration

### PostgreSQL Configuration

```ini
# postgresql.conf

# Required: Set WAL level to logical
wal_level = logical

# Required: At least one replication slot (10 gives headroom)
max_replication_slots = 10

# Recommended: Allow replication connections
max_wal_senders = 10

# Recommended: Keep more WAL for safety
wal_keep_size = 1GB

# Recommended: Limit WAL retention for safety (PostgreSQL 13+)
max_slot_wal_keep_size = 10GB

# Optional: Tune checkpoint frequency to persist slot positions more often
checkpoint_timeout = 5min
```

### Client Requirements

- PostgreSQL client utilities installed (`pg_recvlogical`, `pg_dump`, `pg_dumpall`)
- Superuser, or a role with the `REPLICATION` privilege
- Permission to create replication slots
- wal2sql plugin compiled and installed

## Operational Procedures

### Backup Schedule

**Recommended**:

- **Incremental**: Continuously streaming via `pg_recvlogical`
- **Full backup**: Daily at 2 AM
- **Globals backup**: Daily (`pg_dumpall --globals-only`)
- **Metadata export**: Daily (PostgreSQL version, extensions, encoding)

### Retention Policy

- **Incremental backups**: Keep 7 days
- **Full backups**: Keep 30 days, then one per month for 1 year
- **Monitor disk space**: Alert if the backup directory exceeds 80% capacity

### Disaster Recovery Runbook

1. **Stop the application** to prevent new writes during restore
2. **Create a new database** (don't overwrite production)
3. **Restore shared objects**: `psql -f globals-YYYY-MM-DD.sql`
4. **Restore the full backup**: `psql dbname < base-YYYY-MM-DD.sql`
5. **Apply all incremental backups**: `for f in incremental-*.sql; do psql dbname < "$f"; done`
   - DDL and DML are already interleaved in correct chronological order
6. **Sync sequences**: Run `setval()` for all sequences to match table max values
7. **Verify data integrity**: Check row counts, run application smoke tests
8. **Test the application** against the restored database
9. **Switch over** the application to the restored database

**See Section 4 (Restore Script)** for detailed procedures including sequence synchronization and duplicate-transaction handling.

## Testing Strategy

### 1. Basic Functionality Test

```sql
-- Create test database and setup
CREATE DATABASE backup_test;
\c backup_test

-- Create test table
CREATE TABLE test_users (id SERIAL PRIMARY KEY, name TEXT, created_at TIMESTAMP DEFAULT now());

-- Generate data and schema changes to test DDL/DML ordering
INSERT INTO test_users (name) VALUES ('Alice'), ('Bob'), ('Charlie');
UPDATE test_users SET name = 'Alice Smith' WHERE id = 1;
DELETE FROM test_users WHERE id = 3;

-- Add column - DDL message should appear in the stream here
ALTER TABLE test_users ADD COLUMN email TEXT;

-- Use the new column - should work because the DDL already executed
UPDATE test_users SET email = 'alice@example.com' WHERE id = 1;

-- Drop column - DDL message should appear in the stream here
ALTER TABLE test_users DROP COLUMN created_at;

-- Subsequent inserts should work without the dropped column
INSERT INTO test_users (name, email) VALUES ('David', 'david@example.com');

-- Restore and verify:
-- 1. All operations should replay successfully
-- 2. DML before the column add should not reference the email column
-- 3. DML after the column add should reference the email column
-- 4. DML after the column drop should not reference the created_at column
```

### 2. Crash Recovery Test

```bash
# Start collecting incrementals
# Generate load with pgbench
# Simulate crash: pg_ctl stop -m immediate
# Restart PostgreSQL
# Verify no data loss and that duplicates are handled correctly
# Restore and verify
```

### 3. Long-Term Storage Test

```bash
# Create backup
# Store backup files
# Wait (or simulate) years passing
# Restore on a modern PostgreSQL version
# Verify the SQL is still readable and executable
```

### 4. Replica Identity Test

```sql
-- Create table without a primary key
CREATE TABLE test_no_pk (col1 TEXT, col2 INT);

-- Attempt UPDATE (should fail with a replica identity error)
UPDATE test_no_pk SET col2 = 5 WHERE col1 = 'test';

-- Fix with REPLICA IDENTITY FULL
ALTER TABLE test_no_pk REPLICA IDENTITY FULL;

-- Retry UPDATE (should succeed)
```

### 5. TRUNCATE Handling Test

```sql
-- Create test table
CREATE TABLE test_truncate (id INT);
INSERT INTO test_truncate VALUES (1), (2), (3);

-- Perform TRUNCATE
TRUNCATE test_truncate;

-- Verify: Check whether the wal2sql incremental backup captured the TRUNCATE
-- Expected: ✅ CAPTURED by wal2sql via truncate_cb
-- Look for: TRUNCATE TABLE public.test_truncate;
-- Note: Event triggers do NOT capture TRUNCATE (it's DML, not DDL)

-- Test TRUNCATE with multiple tables (foreign key cascade)
CREATE TABLE parent_table (id INT PRIMARY KEY);
CREATE TABLE child_table (parent_id INT REFERENCES parent_table(id));
INSERT INTO parent_table VALUES (1), (2);
INSERT INTO child_table VALUES (1), (2);

-- TRUNCATE CASCADE should capture both tables
TRUNCATE parent_table, child_table;
-- Expected output: TRUNCATE TABLE public.parent_table, public.child_table;
```

## Performance Considerations

**Write Amplification**:

- WAL must be written (normal)
- WAL must be decoded into logical format (additional CPU)
- Event triggers fire on every DDL operation (minimal overhead)

**Disk I/O**:

- Additional WAL volume retained by replication slots
- More frequent checkpoint I/O if checkpoint_timeout is tuned down

**Recommendations**:

- Benchmark overhead on a test system with a production-like workload
- Monitor CPU usage of WAL sender processes
- Monitor disk usage for the WAL and backup directories

## Implementation Status & Next Steps

### ✅ Completed

1. **wal2sql Plugin** - Fully implemented and tested
   - ✅ Core DML operations (INSERT/UPDATE/DELETE)
   - ✅ `message_cb` for DDL capture
   - ✅ `truncate_cb` for TRUNCATE support
   - ✅ Comprehensive test suite passing

### ⏳ Next Steps

1. **Setup Script** - Create the initial backup environment
   - Create a replication slot with `wal2sql`
   - Set up event triggers for DDL capture using `pg_logical_emit_message()`
   - Take a synchronized base backup with `pg_dump --snapshot`
   - Configure `REPLICA IDENTITY` on tables
2. **Backup Collection Script** - Wrapper around `pg_recvlogical`
   - File rotation and timestamping
   - Metadata generation (PostgreSQL version, extensions, encoding)
   - Error handling and restart logic
3. **Restore Script** - Apply base + incremental backups
   - Locate and order backup files
   - Apply in sequence with error handling
   - Sequence synchronization with `setval()`
4. **Monitoring Script** - Replication slot health checks
   - Track slot lag and WAL retention
   - Alert on dangerous conditions
   - Emergency procedures documentation
5. **Integration Testing** - End-to-end backup and restore
   - DDL/DML interleaving scenarios
   - Crash recovery with `pg_ctl stop -m immediate`
   - Performance benchmarking

## References

### PostgreSQL Documentation

- Chapter 25 - Backup and Restore
- Chapter 29 - Logical Replication
- Chapter 47 - Logical Decoding
- Section 29.8 - Logical Replication Restrictions
- Section 47.6.4.8 - Generic Message Callback (`message_cb`)
- Section 9.28.6 - `pg_logical_emit_message()` function
- `pg_recvlogical` man page
- `pg_dump` man page
- `pg_dumpall` man page

### Essential Tools

- **wal2sql**: ✅ **IMPLEMENTED** - SQL output plugin for logical decoding
  - Location: `wal2sql/` directory in this repository
  - Forked from: https://github.com/michaelpq/pg_plugins/tree/main/decoder_raw
  - License: PostgreSQL License (production-ready)
  - Compatibility: PostgreSQL 18.0+

### Monitoring Tools

- **Prometheus + postgres_exporter + Grafana**: Open-source monitoring stack
- **pgDash**: PostgreSQL monitoring - https://pgdash.io/
- **check_postgres**: Nagios/Icinga/Zabbix integration
- **pg_stat_replication_slots**: Built-in PostgreSQL monitoring view