# Database Differential Backup Tool - Design Document

## 1. Survey of Existing Backup Tools

### Backup Chain Concept

A **backup chain** is a unit consisting of:
- One full backup (base)
- Zero or more dependent incremental backups

When a new successful full backup is taken, it starts a new chain. The previous chain (old base + incrementals) becomes "outdated" and can be cleaned up if desired.

### Terminology Across Tools

- **Duplicity**: "Backup Chain"
- **Amanda**: "Dump Cycle"
- **Bacula/Bareos**: No specific term (uses "Full/Differential/Incremental levels")
- **dump/restore**: Backup levels (0-9)

### Chain-Style Backup Tools

#### Open Source

1. **duplicity** - Tar-based, encrypted, designed for remote/cloud storage
2. **Amanda (Advanced Maryland Automatic Network Disk Archiver)** - Network backup system (1990s), dump cycle concept
3. **Bacula / Bareos** - Enterprise network backup built around a catalog database
4. **GNU tar (--listed-incremental)** - Built-in incremental support
5. **dar (Disk ARchive)** - Enhanced tar replacement
6. **dump/restore** - Classic Unix backup (1970s), filesystem-aware

#### Commercial/Enterprise

- Veritas NetBackup
- Veeam Backup & Replication
- IBM Spectrum Protect (formerly TSM)
- Commvault

---

### File Storage Formats

#### duplicity

**File naming pattern:**
```
duplicity-full.20231215T120000Z.vol1.difftar.gpg
duplicity-inc.20231215T120000Z.to.20231216T120000Z.vol1.difftar.gpg
```

**Format details:**
- Timestamp in ISO 8601 basic format (UTC)
- `.difftar` = tar differential format
- `.gpg` = encrypted (optional)
- Volumes numbered if split: `.vol1`, `.vol2`, etc.
- Time range in incremental filename: `<from>.to.<to>`

**Example directory:**
```
/backups/myserver/
  duplicity-full.20231215T120000Z.vol1.difftar.gpg
  duplicity-inc.20231215T120000Z.to.20231216T120000Z.vol1.difftar.gpg
  duplicity-inc.20231216T120000Z.to.20231217T120000Z.vol1.difftar.gpg
```

---

#### Amanda

**Directory structure (virtual tapes):**
```
/amanda/vtapes/
  slot1/
    00001.DailySet1           # label file
    00002.hostname._var.0     # level 0 dump
    00003.hostname._home.1    # level 1 dump
  slot2/
    00001.DailySet2
    00002.hostname._etc.0
```

**File naming pattern:**
```
slot02-00000001-server-_etc-20231215-0.data
```

**Format components:**
- Slot number
- Sequence number
- Hostname
- Directory path (slashes become underscores)
- Date
- Dump level (0-9)
- `.data` extension

**Key features:**
- Each "slot" represents a virtual tape
- Mimics physical tape behavior
- Each virtual tape must be labeled before use

---

#### Bacula/Bareos

**Simple volume naming:**
```
/bacula/volumes/
  Vol0001
  Vol0002
  Vol0003
  Weekly-0001
  Full-January-2023
```

**Naming configuration:**
- Configured via `LabelFormat` in Pool config
- Example: `LabelFormat = "Vol"` → Vol0001, Vol0002, etc.
- Fully customizable: can use date-based, purpose-based names
- No indication of full/incremental in filename
- Relationships tracked in catalog database (PostgreSQL/MySQL/SQLite)

**Key features:**
- Volume = single file on disk (or tape)
- Catalog database required to understand backup chain
- Flexible naming scheme
- Professional/enterprise oriented

---

#### GNU tar (--listed-incremental)

**Snapshot file + tar archives:**
```
/backups/home/
  backup.snar                     # metadata/snapshot file
  backup-full-2023-12-15.tar.gz   # full backup
  backup-inc-2023-12-16.tar.gz    # incremental
  backup-inc-2023-12-17.tar.gz    # incremental
```

**Snapshot file format (.snar):**
- Text-based metadata file
- Tracks directory timestamps and inode information
- Multiple format versions (0, 1, 2)
- Format 2 is current (modern tar versions)
- Starts with a version header line, followed by the snapshot time (seconds and nanoseconds since epoch)
- Then directory records (NUL-delimited, i.e. ASCII 0)

**Key features:**
- Same `.snar` file used for entire chain
- User chooses tar archive filenames
- `.snar` file must be preserved with backups
- Losing `.snar` file breaks incremental chain

---

#### dar (Disk ARchive)

**Sliced archive format:**
```
/backups/
  backup-full-20231215.1.dar
  backup-full-20231215.2.dar    # additional slices
  backup-inc-20231216.1.dar
  backup-inc-20231217.1.dar
```

**With reference tracking (wrapper scripts):**
```
server_daily_20231215T0352UTC.1.dar
server_daily_20231216T0403UTC_based_on_20231215T0352UTC.1.dar
server_daily_20231217T0403UTC_based_on_20231216T0403UTC.1.dar
```

**Format pattern:**
```
basename.slice_number.dar
```

**Key features:**
- Archives split into "slices" for size management
- User chooses basename
- Reference archive specified via `-A` option
- No distinction between differential/incremental (depends on reference)
- Wrapper utilities add reference info to basename

---

#### dump/restore

**User-specified output files:**
```
/backups/
  home-level0-20231215.dump
  home-level1-20231216.dump
  home-level2-20231217.dump
  var-level0-20231215.dump
  var-level1-20231216.dump
```

**Tracking file:**
```
/etc/dumpdates
```

**dumpdates format (text file):**
```
/dev/sda1 0 Fri Dec 15 03:00:00 2023
/dev/sda1 1 Sat Dec 16 03:00:00 2023
/home 0 Fri Dec 15 04:00:00 2023
```

**Key features:**
- Dump levels: 0 (full) through 9 (incremental)
- Level n backs up changes since the most recent dump at any lower level
- Output filename specified by user with `-f` flag
- `/etc/dumpdates` tracks: filesystem, level, timestamp
- Filesystem-aware (ext2/3/4, UFS)
- Very old (1970s Unix)

---

### Key Observations

#### Filename Standardization

**Self-describing filenames:**
- ✅ **duplicity**: Fully standardized, metadata in filename
- ✅ **Amanda**: Standardized format with metadata
- ✅ **dar** (with wrappers): Can encode reference information

**User-defined or generic:**
- ❌ **Bacula**: Generic names, requires database
- ❌ **GNU tar**: User chooses names, `.snar` tracks metadata
- ❌ **dump**: User chooses names, `/etc/dumpdates` tracks metadata

#### External Metadata Storage

- **duplicity**: None needed (self-contained)
- **Amanda**: None needed (filename contains metadata)
- **Bacula**: SQL database (PostgreSQL/MySQL/SQLite)
- **GNU tar**: `.snar` file (must be preserved)
- **dar**: Optional (can use wrapper naming convention)
- **dump**: `/etc/dumpdates` file

#### Chain Management

Tools differ in how they handle "obsolete chain" cleanup:
- **duplicity**: Has `remove-all-but-n-full` command
- **Amanda**: Tape rotation policy
- **Bacula**: Retention periods in configuration
- **tar/dar/dump**: Manual management by user

---

### Design Considerations for New Tools

Based on this survey:

1. **Self-describing filenames** (duplicity/Amanda model) make tools more robust
   - No external database dependency
   - Easy to inspect backups with standard tools
   - Clear what files belong to which chain

2. **Timestamp in filename** helps with:
   - Sorting and organization
   - Understanding backup history
   - Automated retention policies

3. **Chain/reference encoding** makes relationships explicit:
   - Know which full backup an incremental depends on
   - Can validate chain integrity
   - Simplifies cleanup of complete chains

4. **Trade-offs:**
   - Self-describing = longer filenames
   - Database storage = more flexible queries but adds dependency
   - Separate metadata file = simpler filenames but extra file to manage

---

## 2. Design for Database Differential Backup Tool

### 2.1 Terminology

Based on the survey of existing tools, we establish the following terminology:

#### Base Backup

- **Term chosen**: "base" or "base backup"
- **Alternative terms in other tools**: "full backup" (duplicity, tar, dar), "level 0" (dump/restore)
- **Definition**: A complete backup of the database
- **Rationale**:
  - "Base" is concise and clearly indicates this is the foundation of the chain
  - More specific than "full" for database contexts where we're backing up a database instance
  - Aligns with PostgreSQL's terminology (pg_basebackup)

#### Differential Backup

- **Term chosen**: "differential" or "diff"
- **Alternative terms in other tools**: "incremental" (duplicity, tar, dar), "level 1-9" (dump/restore)
- **Definition**: A backup containing only changes since the last **base backup** in this chain
- **Rationale**:
  - Distinguishes from "incremental" (which typically means changes since the *last backup of any type*)
  - All differentials in a chain reference the same base backup
  - Simpler restore process: base + one differential (vs.
    base + all incrementals in sequence)
  - Standard term in database backup contexts (SQL Server; pgBackRest for PostgreSQL)

#### Backup Chain

- **Term chosen**: "chain" or "backup chain"
- **Alternative terms in other tools**: "dump cycle" (Amanda), no specific term (Bacula, dump/restore)
- **Definition**: A unit consisting of:
  - One base backup
  - Zero or more differential backups that depend on that base
- **Lifecycle**:
  - A new successful base backup starts a new chain
  - The previous chain becomes "outdated" or "superseded"
  - Outdated chains can be retained for historical purposes or cleaned up
- **Rationale**:
  - "Chain" clearly conveys the dependency relationship
  - Widely understood term (duplicity uses this)
  - Simple and intuitive

#### Chain Identifier

- **Term chosen**: "chain ID"
- **Definition**: A unique identifier for a backup chain, typically the timestamp of the base backup
- **Usage**: Links differential backups to their base backup
- **Rationale**:
  - Enables self-describing filenames
  - Allows validation of chain integrity
  - Simplifies automated cleanup of complete chains

---

### Summary Table

| Concept | Our Term | Size | Depends On | Typical Frequency |
|---------|----------|------|------------|-------------------|
| Complete backup | **base** | Large | Nothing | Weekly, monthly |
| Changes since base | **differential** | Small-Medium | Base in same chain | Daily, hourly |
| Base + its differentials | **chain** | N/A | N/A | Per base backup |
| Chain's unique ID | **chain ID** | N/A | N/A | One per chain |

---

### 2.2 Filesystem Layout

#### Design Philosophy

Our tool follows the **opinionated, self-organizing** approach:
- User specifies only the backup directory
- Tool manages all filenames and organization automatically
- Self-describing structure (no external database required)
- Easy to inspect, validate, and clean up with standard filesystem tools

#### Directory Structure

**Recommended layout: Subdirectories per chain**

```
/path/to/backups/                    # User-specified backup directory
  chain-20231215T120000Z/            # Chain directory (named by chain ID)
    base.sql                         # Base backup
    diff-20231216T083000Z.sql        # Differential backup
    diff-20231217T083000Z.sql        # Differential backup
    diff-20231218T083000Z.sql        # Differential backup
  chain-20231222T120000Z/            # New chain (new base started)
    base.sql                         # Base backup
    diff-20231223T083000Z.sql        # Differential backup
  chain-20231229T120000Z/            # Current/latest chain
    base.sql                         # Base backup
```

#### Naming Conventions

**Chain directory:**
```
chain-{TIMESTAMP}/
```
- Format: `chain-` prefix + ISO 8601 timestamp (UTC)
- Timestamp pattern: `YYYYMMDDTHHMMSSZ`
- Example: `chain-20231215T120000Z`
- The timestamp is the **chain ID** (taken from when base backup started)

**Base backup file:**
```
base.sql
```
- Simple, fixed name within each chain directory
- No timestamp needed (directory already identifies the chain)
- Extension: `.sql` (SQL dump format)

**Differential backup files:**
```
diff-{TIMESTAMP}.sql
```
- Format: `diff-` prefix + ISO 8601 timestamp (UTC)
- Timestamp: when this differential backup was taken
- Example: `diff-20231216T083000Z.sql`
- Sorts chronologically with standard `ls` or `sort`

#### Rationale

**Why subdirectories per chain?**

1. **Clear grouping**: Each chain is a self-contained unit
   - Easy to see which backups belong together
   - Obvious dependencies (all diffs depend on base in same directory)

2. **Simple cleanup**: Delete entire chain by removing one directory
   - No need to parse filenames to find all parts of a chain
   - One command removes the whole chain (`rm -r`), though the delete itself is not atomic

3. **Better than flat directory** (duplicity style):
   - Doesn't get cluttered with dozens of chains
   - Easier to navigate and understand visually
   - Chain operations (validate, cleanup) work on directories

4. **Better than generic names** (Bacula style):
   - No external database needed
   - Self-documenting structure
   - Can inspect backups with `ls`, `tree`, etc.

**Why simple filenames within chains?**

1. **Base backup** doesn't need timestamp:
   - Chain directory already encodes the timestamp
   - Avoids redundancy
   - Simpler: just `base.sql`

2. **Differential backups** include timestamp:
   - Need to distinguish multiple differentials
   - Timestamp shows when each differential was taken
   - Sorts chronologically

**Why ISO 8601 timestamps?**
- Unambiguous (includes timezone: Z = UTC)
- Sortable lexicographically
- International standard
- Compact basic format (no dashes or colons)
- Precedent: duplicity, dar wrappers

#### Directory Listing Examples

**Recent backups first:**
```bash
$ ls -lt /path/to/backups/
drwxr-xr-x 2 backup backup 4096 Dec 29 12:00 chain-20231229T120000Z
drwxr-xr-x 2 backup backup 4096 Dec 22 12:00 chain-20231222T120000Z
drwxr-xr-x 2 backup backup 4096 Dec 15 12:00 chain-20231215T120000Z
```

**Contents of a chain:**
```bash
$ ls -lh /path/to/backups/chain-20231222T120000Z/
-rw-r--r-- 1 backup backup 2.1G Dec 22 12:00 base.sql
-rw-r--r-- 1 backup backup  85M Dec 23 08:30 diff-20231223T083000Z.sql
-rw-r--r-- 1 backup backup  92M Dec 24 08:30 diff-20231224T083000Z.sql
-rw-r--r-- 1 backup backup  78M Dec 25 08:30 diff-20231225T083000Z.sql
```

**Tree view:**
```bash
$ tree -h /path/to/backups/
/path/to/backups/
├── chain-20231215T120000Z
│   ├── base.sql [2.0G]
│   ├── diff-20231216T083000Z.sql [80M]
│   └── diff-20231217T083000Z.sql [88M]
├── chain-20231222T120000Z
│   ├── base.sql [2.1G]
│   ├── diff-20231223T083000Z.sql [85M]
│   └── diff-20231224T083000Z.sql [92M]
└── chain-20231229T120000Z
    └── base.sql [2.2G]
```

---

### 2.3 Active vs Sealed Differentials

#### Backup States

When continuously backing up a database, differentials progress through two states:

**Active differential:**
- Currently streaming changes from the database
- Still being written to; incomplete
- Not safe to restore from

**Sealed differential:**
- Completed differential that is no longer being written to
- Immutable and safe to restore
- Result of the "sealing" process

**Sealing:**
- The process of
  finishing an active differential and starting a new one
- Happens periodically (e.g., via log rotation, scheduled checkpoint)
- Makes a differential complete and immutable

#### File Naming

**Active differential:**
```
active.sql
```
- Fixed name within chain directory
- Indicates work in progress
- Only one active differential per chain at any time

**Sealed differentials:**
```
diff-{TIMESTAMP}.sql
```
- When sealed, the active differential is renamed with its timestamp
- Timestamp: when the differential was sealed
- Then a new `active.sql` begins

#### Lifecycle Example

```
# Chain starts with base backup
chain-20231215T120000Z/
  base.sql

# Active differential streams changes
chain-20231215T120000Z/
  base.sql
  active.sql                    # currently streaming

# First seal (e.g., daily rotation)
chain-20231215T120000Z/
  base.sql
  diff-20231216T083000Z.sql     # sealed (renamed from active.sql)
  active.sql                    # new active starts

# Second seal
chain-20231215T120000Z/
  base.sql
  diff-20231216T083000Z.sql     # sealed
  diff-20231217T083000Z.sql     # sealed (renamed from active.sql)
  active.sql                    # new active starts
```

**Directory listing with active backup:**
```bash
$ ls -lh /path/to/backups/chain-20231222T120000Z/
-rw-r--r-- 1 backup backup 2.1G Dec 22 12:00 base.sql
-rw-r--r-- 1 backup backup  85M Dec 23 08:30 diff-20231223T083000Z.sql
-rw-r--r-- 1 backup backup  92M Dec 24 08:30 diff-20231224T083000Z.sql
-rw-r--r-- 1 backup backup  23M Dec 25 06:45 active.sql    # growing in size
```

#### Restore Behavior

**Default restore:**
- Uses: `base.sql` + all sealed differentials (`diff-*.sql`)
- Excludes: `active.sql`
- Restores to: the point when the last differential was sealed

**Rationale:**
- Active differential is incomplete and potentially inconsistent
- Sealed differentials are guaranteed complete and consistent
- Predictable: users know exactly what restore point they get

**To include active differential:**
- Must first seal it (manual operation)
- Then it becomes a regular `diff-{TIMESTAMP}.sql` file
- Now safe to include in restore
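
#### Illustrative Sketch

The sealing and default-restore rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tool's implementation: the function names `seal_active` and `restore_order` are hypothetical, sealing is assumed to be a plain rename inside the chain directory, and starting the next `active.sql` is left to the writer process.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import List, Optional


def seal_active(chain_dir: Path, now: Optional[datetime] = None) -> Path:
    """Rename active.sql to diff-{TIMESTAMP}.sql (hypothetical helper).

    `now` must be timezone-aware; it defaults to the current UTC time.
    """
    now = now or datetime.now(timezone.utc)
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    sealed = chain_dir / f"diff-{stamp}.sql"
    if sealed.exists():
        # Never overwrite an already sealed (immutable) differential.
        raise FileExistsError(sealed)
    (chain_dir / "active.sql").rename(sealed)
    return sealed


def restore_order(chain_dir: Path) -> List[Path]:
    """Return base.sql followed by sealed differentials, oldest first.

    active.sql is excluded because it does not match the diff-*.sql pattern.
    """
    base = chain_dir / "base.sql"
    if not base.is_file():
        raise FileNotFoundError(base)
    # Fixed-width UTC timestamps sort chronologically as plain strings.
    diffs = sorted(chain_dir.glob("diff-*.sql"))
    return [base, *diffs]
```

Because the restore list is derived only from filenames, a restore point can be checked with `ls` before anything is applied, which matches the self-describing layout in section 2.2.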