# Database Differential Backup Tool - Design Document

## 1. Survey of Existing Backup Tools

### Backup Chain Concept

A **backup chain** is a unit consisting of:
- One full backup (base)
- Zero or more dependent incremental backups

When a new successful full backup is taken, it starts a new chain. The previous chain (old base + incrementals) becomes "outdated" and can be cleaned up if desired.

### Terminology Across Tools

- **Duplicity**: "Backup Chain"
- **Amanda**: "Dump Cycle"
- **Bacula/Bareos**: No specific term (uses "Full/Differential/Incremental levels")
- **dump/restore**: Backup levels (0-9)

### Chain-Style Backup Tools

#### Open Source
1. **duplicity** - Tar-based, encrypted, designed for remote/cloud storage
2. **Amanda** (Advanced Maryland Automatic Network Disk Archiver) - Network backup system (1990s), dump cycle concept
3. **Bacula / Bareos** - Enterprise network backup built around a central catalog database
4. **GNU tar (--listed-incremental)** - Built-in incremental support
5. **dar (Disk ARchive)** - Enhanced tar replacement
6. **dump/restore** - Classic Unix backup (1970s), filesystem-aware

#### Commercial/Enterprise
- Veritas NetBackup
- Veeam Backup & Replication
- IBM Spectrum Protect (formerly TSM)
- Commvault

---

### File Storage Formats

#### duplicity

**File naming pattern:**
```
duplicity-full.20231215T120000Z.vol1.difftar.gpg
duplicity-inc.20231215T120000Z.to.20231216T120000Z.vol1.difftar.gpg
```

**Format details:**
- Timestamps use the ISO 8601 basic format in UTC (`YYYYMMDDTHHMMSSZ`)
- `.difftar` = tar differential format
- `.gpg` = encrypted (optional)
- Volumes numbered if split: `.vol1`, `.vol2`, etc.
- Time range in incremental filename: `{from}.to.{until}`
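
Because the metadata lives in the filename, it can be recovered with a regular expression. The following is a minimal sketch whose pattern is inferred only from the two example names above; `parse_volume_name` is a hypothetical helper, and real duplicity archives may use suffixes not covered here.

```python
import re
from typing import Optional

# Pattern inferred from the example names in this document (assumption).
_TS = r"\d{8}T\d{6}Z"
_VOLUME_RE = re.compile(
    rf"^duplicity-(?:full\.(?P<full>{_TS})"
    rf"|inc\.(?P<start>{_TS})\.to\.(?P<end>{_TS}))"
    rf"\.vol(?P<vol>\d+)\.difftar(?:\.gpg)?$"
)


def parse_volume_name(name: str) -> Optional[dict]:
    """Return the metadata encoded in a volume filename, or None if no match."""
    m = _VOLUME_RE.match(name)
    if m is None:
        return None
    volume = int(m.group("vol"))
    if m.group("full"):
        return {"type": "full", "time": m.group("full"), "volume": volume}
    return {
        "type": "inc",
        "start": m.group("start"),
        "end": m.group("end"),
        "volume": volume,
    }
```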

**Example directory:**
```
/backups/myserver/
  duplicity-full.20231215T120000Z.vol1.difftar.gpg
  duplicity-inc.20231215T120000Z.to.20231216T120000Z.vol1.difftar.gpg
  duplicity-inc.20231216T120000Z.to.20231217T120000Z.vol1.difftar.gpg
```

---

#### Amanda

**Directory structure (virtual tapes):**
```
/amanda/vtapes/
  slot1/
    00001.DailySet1                           # label file
    00002.hostname._var.0                     # level 0 dump
    00003.hostname._home.1                    # level 1 dump
  slot2/
    00001.DailySet2
    00002.hostname._etc.0
```

**File naming pattern:**
```
slot02-00000001-server-_etc-20231215-0.data
```

**Format components:**
- Slot number
- Sequence number
- Hostname
- Directory path (slashes become underscores)
- Date
- Dump level (0-9)
- `.data` extension

**Key features:**
- Each "slot" represents a virtual tape
- Mimics physical tape behavior
- Must be labeled before use

---

#### Bacula/Bareos

**Simple volume naming:**
```
/bacula/volumes/
  Vol0001
  Vol0002
  Vol0003
  Weekly-0001
  Full-January-2023
```

**Naming configuration:**
- Configured via `LabelFormat` in Pool config
- Example: `LabelFormat = "Vol"` → Vol0001, Vol0002, etc.
- Fully customizable: can use date-based, purpose-based names
- No indication of full/incremental in filename
- Relationships tracked in catalog database (PostgreSQL/MySQL/SQLite)

**Key features:**
- Volume = single file on disk (or tape)
- Catalog database required to understand backup chain
- Flexible naming scheme
- Professional/enterprise oriented

---

#### GNU tar (--listed-incremental)

**Snapshot file + tar archives:**
```
/backups/home/
  backup.snar                                # metadata/snapshot file
  backup-full-2023-12-15.tar.gz             # full backup
  backup-inc-2023-12-16.tar.gz              # incremental
  backup-inc-2023-12-17.tar.gz              # incremental
```

**Snapshot file format (.snar):**
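- Metadata file written by tar: plain text in formats 0 and 1, NUL-delimited (not a text file) in format 2
- Tracks directory timestamps and device/inode information
- Multiple format versions (0, 1, 2)
- Format 2 is current (modern tar versions)
- Format 2 starts with an identifier line such as `GNU tar-<version>-2`, followed by the dump time as seconds and nanoseconds since the epoch
- Followed by directory records (ASCII 0 delimited)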

**Key features:**
- Same `.snar` file used for entire chain
- User chooses tar archive filenames
- `.snar` file must be preserved with backups
- Losing `.snar` file breaks incremental chain

---

#### dar (Disk ARchive)

**Sliced archive format:**
```
/backups/
  backup-full-20231215.1.dar
  backup-full-20231215.2.dar                # additional slices
  backup-inc-20231216.1.dar
  backup-inc-20231217.1.dar
```

**With reference tracking (wrapper scripts):**
```
server_daily_20231215T0352UTC.1.dar
server_daily_20231216T0403UTC_based_on_20231215T0352UTC.1.dar
server_daily_20231217T0403UTC_based_on_20231216T0403UTC.1.dar
```

**Format pattern:**
```
basename.slice_number.dar
```

**Key features:**
- Archives split into "slices" for size management
- User chooses basename
- Reference archive specified via `-A` option
- No distinction between differential/incremental (depends on reference)
- Wrapper utilities add reference info to basename

---

#### dump/restore

**User-specified output files:**
```
/backups/
  home-level0-20231215.dump
  home-level1-20231216.dump
  home-level2-20231217.dump
  var-level0-20231215.dump
  var-level1-20231216.dump
```

**Tracking file:**
```
/etc/dumpdates
```

**dumpdates format (text file):**
```
/dev/sda1 0 Fri Dec 15 03:00:00 2023
/dev/sda1 1 Sat Dec 16 03:00:00 2023
/home     0 Fri Dec 15 04:00:00 2023
```

**Key features:**
- Dump levels: 0 (full) through 9 (incremental)
- Level n backs up changes since the most recent dump at any lower level (not only level n-1)
- Output filename specified by user with `-f` flag
- `/etc/dumpdates` tracks: filesystem, level, timestamp
- Filesystem-aware (ext2/3/4, UFS)
- Very old (1970s Unix)
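
The level rule is easier to see in code: a new dump is taken relative to the most recent earlier dump whose level is strictly lower. This is a sketch only; the `(level, date)` record shape and the `reference_dump` helper are illustrative assumptions, not the actual `/etc/dumpdates` parser.

```python
from typing import Optional, Sequence, Tuple

# (level, date) pairs for one filesystem, oldest first (illustrative shape).
DumpRecord = Tuple[int, str]


def reference_dump(history: Sequence[DumpRecord], level: int) -> Optional[DumpRecord]:
    """Find the dump that a new dump at `level` would be relative to.

    That is the most recent earlier dump with a strictly lower level.
    A level 0 dump has no reference because it is a full backup.
    """
    for record in reversed(history):
        if record[0] < level:
            return record
    return None


history = [(0, "2023-12-15"), (1, "2023-12-16"), (2, "2023-12-17")]
print(reference_dump(history, 1))  # (0, '2023-12-15')
print(reference_dump(history, 3))  # (2, '2023-12-17')
```

Repeating level 1 every day therefore yields differential-style backups (always relative to the level 0), while increasing the level each day yields incremental-style backups.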

---

### Key Observations

#### Filename Standardization

**Self-describing filenames:**
- ✅ **duplicity**: Fully standardized, metadata in filename
- ✅ **Amanda**: Standardized format with metadata
- ✅ **dar** (with wrappers): Can encode reference information

**User-defined or generic:**
- ❌ **Bacula**: Generic names, requires database
- ❌ **GNU tar**: User chooses names, `.snar` tracks metadata
- ❌ **dump**: User chooses names, `/etc/dumpdates` tracks metadata

#### External Metadata Storage

- **duplicity**: None needed (self-contained)
- **Amanda**: None needed (filename contains metadata)
- **Bacula**: SQL database (PostgreSQL/MySQL/SQLite)
- **GNU tar**: `.snar` file (must be preserved)
- **dar**: Optional (can use wrapper naming convention)
- **dump**: `/etc/dumpdates` file

#### Chain Management

Tools differ in how they handle "obsolete chain" cleanup:

- **duplicity**: Has `remove-all-but-n-full` command
- **Amanda**: Tape rotation policy
- **Bacula**: Retention periods in configuration
- **tar/dar/dump**: Manual management by user

---

### Design Considerations for New Tools

Based on this survey:

1. **Self-describing filenames** (duplicity/Amanda model) make tools more robust
   - No external database dependency
   - Easy to inspect backups with standard tools
   - Clear what files belong to which chain

2. **Timestamp in filename** helps with:
   - Sorting and organization
   - Understanding backup history
   - Automated retention policies

3. **Chain/reference encoding** makes relationships explicit:
   - Know which full backup an incremental depends on
   - Can validate chain integrity
   - Simplifies cleanup of complete chains

4. **Trade-offs:**
   - Self-describing = longer filenames
   - Database storage = more flexible queries but adds dependency
   - Separate metadata file = simpler filenames but extra file to manage

---

## 2. Design for Database Differential Backup Tool

### 2.1 Terminology

Based on the survey of existing tools, we establish the following terminology:

#### Base Backup
- **Term chosen**: "base" or "base backup"
- **Alternative terms in other tools**: "full backup" (duplicity, tar, dar), "level 0" (dump/restore)
- **Definition**: A complete backup of the database
- **Rationale**:
  - "Base" is concise and clearly indicates this is the foundation of the chain
  - More specific than "full" for database contexts where we're backing up a database instance
  - Aligns with PostgreSQL's terminology (pg_basebackup)

#### Differential Backup
- **Term chosen**: "differential" or "diff"
- **Alternative terms in other tools**: "incremental" (duplicity, tar, dar), "level 1-9" (dump/restore)
- **Definition**: A backup containing only changes since the last **base backup** in this chain
- **Rationale**:
  - Distinguishes from "incremental" (which typically means changes since the *last backup of any type*)
  - All differentials in a chain reference the same base backup
  - Simpler restore process: base + one differential (vs. base + all incrementals in sequence)
  - Standard term in database backup contexts (SQL Server; PostgreSQL tooling such as pgBackRest)

#### Backup Chain
- **Term chosen**: "chain" or "backup chain"
- **Alternative terms in other tools**: "dump cycle" (Amanda), no specific term (Bacula, dump/restore)
- **Definition**: A unit consisting of:
  - One base backup
  - Zero or more differential backups that depend on that base
- **Lifecycle**:
  - A new successful base backup starts a new chain
  - The previous chain becomes "outdated" or "superseded"
  - Outdated chains can be retained for historical purposes or cleaned up
- **Rationale**:
  - "Chain" clearly conveys the dependency relationship
  - Widely understood term (duplicity uses this)
  - Simple and intuitive

#### Chain Identifier
- **Term chosen**: "chain ID"
- **Definition**: A unique identifier for a backup chain, typically the timestamp of the base backup
- **Usage**: Links differential backups to their base backup
- **Rationale**:
  - Enables self-describing filenames
  - Allows validation of chain integrity
  - Simplifies automated cleanup of complete chains

---

### Summary Table

| Concept | Our Term | Size | Depends On | Typical Frequency |
|---------|----------|------|------------|-------------------|
| Complete backup | **base** | Large | Nothing | Weekly, monthly |
| Changes since base | **differential** | Small-Medium | Base in same chain | Daily, hourly |
| Base + its differentials | **chain** | N/A | N/A | Per base backup |
| Chain's unique ID | **chain ID** | N/A | N/A | One per chain |

---

### 2.2 Filesystem Layout

#### Design Philosophy

Our tool follows the **opinionated, self-organizing** approach:
- User specifies only the backup directory
- Tool manages all filenames and organization automatically
- Self-describing structure (no external database required)
- Easy to inspect, validate, and clean up with standard filesystem tools

#### Directory Structure

**Recommended layout: Subdirectories per chain**

```
/path/to/backups/                            # User-specified backup directory
  chain-20231215T120000Z/                    # Chain directory (named by chain ID)
    base.sql                                 # Base backup
    diff-20231216T083000Z.sql                # Differential backup
    diff-20231217T083000Z.sql                # Differential backup
    diff-20231218T083000Z.sql                # Differential backup
  chain-20231222T120000Z/                    # New chain (new base started)
    base.sql                                 # Base backup
    diff-20231223T083000Z.sql                # Differential backup
  chain-20231229T120000Z/                    # Current/latest chain
    base.sql                                 # Base backup
```

#### Naming Conventions

**Chain directory:**
```
chain-{TIMESTAMP}/
```
- Format: `chain-` prefix + ISO 8601 timestamp (UTC)
- Timestamp pattern: `YYYYMMDDTHHMMSSZ`
- Example: `chain-20231215T120000Z`
- The timestamp is the **chain ID** (the time at which the base backup started)
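
As a sketch of how the chain ID could be produced and read back, assuming Python's standard `datetime` module; the function names are hypothetical and not part of the tool's defined interface.

```python
from datetime import datetime, timezone

CHAIN_ID_FORMAT = "%Y%m%dT%H%M%SZ"  # YYYYMMDDTHHMMSSZ


def make_chain_id(started_at: datetime) -> str:
    """Format the base backup's start time as a chain ID (always UTC)."""
    if started_at.tzinfo is None:
        raise ValueError("started_at must be timezone-aware")
    return started_at.astimezone(timezone.utc).strftime(CHAIN_ID_FORMAT)


def parse_chain_id(chain_id: str) -> datetime:
    """Inverse of make_chain_id; raises ValueError on malformed input."""
    return datetime.strptime(chain_id, CHAIN_ID_FORMAT).replace(tzinfo=timezone.utc)


def chain_dir_name(chain_id: str) -> str:
    """Directory name for a chain, following the chain-{TIMESTAMP} pattern."""
    return f"chain-{chain_id}"


start = datetime(2023, 12, 15, 12, 0, 0, tzinfo=timezone.utc)
print(chain_dir_name(make_chain_id(start)))  # chain-20231215T120000Z
```

Rejecting naive datetimes avoids silently interpreting a local time as UTC, which would break the ordering of chain directories.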

**Base backup file:**
```
base.sql
```
- Simple, fixed name within each chain directory
- No timestamp needed (directory already identifies the chain)
- Extension: `.sql` (SQL dump format)

**Differential backup files:**
```
diff-{TIMESTAMP}.sql
```
- Format: `diff-` prefix + ISO 8601 timestamp (UTC)
- Timestamp: when this differential backup was taken
- Example: `diff-20231216T083000Z.sql`
- Sorts chronologically with standard `ls` or `sort`
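
Since the layout is self-describing, the set of chains and their differentials can be recovered from a directory scan alone. The sketch below assumes the naming patterns above; `list_chains` is a hypothetical helper, and skipping directories that lack `base.sql` is an assumption about how incomplete chains should be treated.

```python
import re
from pathlib import Path
from typing import Dict, List

CHAIN_RE = re.compile(r"^chain-(\d{8}T\d{6}Z)$")
DIFF_RE = re.compile(r"^diff-(\d{8}T\d{6}Z)\.sql$")


def list_chains(backup_dir: Path) -> Dict[str, List[str]]:
    """Map each chain ID to its differential timestamps, oldest first.

    Entries that do not match chain-{TIMESTAMP}, or that have no base.sql,
    are skipped (assumption: such chains are incomplete).
    """
    chains: Dict[str, List[str]] = {}
    for entry in sorted(backup_dir.iterdir()):
        m = CHAIN_RE.match(entry.name)
        if m is None or not (entry / "base.sql").is_file():
            continue
        matches = (DIFF_RE.match(p.name) for p in entry.iterdir())
        # Fixed-width UTC timestamps sort chronologically as plain strings.
        chains[m.group(1)] = sorted(d.group(1) for d in matches if d is not None)
    return chains
```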

#### Rationale

**Why subdirectories per chain?**

1. **Clear grouping**: Each chain is a self-contained unit
   - Easy to see which backups belong together
   - Obvious dependencies (all diffs depend on base in same directory)

2. **Simple cleanup**: Delete entire chain by removing one directory
   - No need to parse filenames to find all parts of a chain
   - Removal is a single `rm -r` (or a rename followed by a delete, since a recursive delete is not itself atomic)

3. **Better than flat directory** (duplicity style):
   - Doesn't get cluttered with dozens of chains
   - Easier to navigate and understand visually
   - Chain operations (validate, cleanup) work on directories

4. **Better than generic names** (Bacula style):
   - No external database needed
   - Self-documenting structure
   - Can inspect backups with `ls`, `tree`, etc.
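
A retention policy over this layout reduces to sorting directory names and deleting the oldest. The following sketch assumes a "keep the N newest chains" policy, which this document does not specify; `remove_chain`, `prune_chains`, and the `.deleting-` prefix are hypothetical.

```python
import os
import shutil
from pathlib import Path
from typing import List


def remove_chain(chain_dir: Path) -> None:
    """Delete one chain directory.

    A recursive delete is not atomic, so the directory is first renamed
    (atomic on POSIX within one filesystem) to a name that no longer
    matches chain-*, and only then deleted.
    """
    doomed = chain_dir.with_name(".deleting-" + chain_dir.name)
    os.rename(chain_dir, doomed)
    shutil.rmtree(doomed)


def prune_chains(backup_dir: Path, keep: int) -> List[str]:
    """Keep the `keep` newest chains; return the names that were removed."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    chains = sorted(
        p for p in backup_dir.iterdir()
        if p.is_dir() and p.name.startswith("chain-")
    )
    removed = []
    for chain_dir in chains[:-keep]:  # everything except the newest `keep`
        remove_chain(chain_dir)
        removed.append(chain_dir.name)
    return removed
```

Renaming first means a crash during deletion leaves a `.deleting-*` directory behind rather than a chain with a missing base or missing differentials.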

**Why simple filenames within chains?**

1. **Base backup** doesn't need timestamp:
   - Chain directory already encodes the timestamp
   - Avoids redundancy
   - Simpler: just `base.sql`

2. **Differential backups** include timestamp:
   - Need to distinguish multiple differentials
   - Timestamp shows when each differential was taken
   - Sorts chronologically

**Why ISO 8601 timestamps?**

- Unambiguous (includes timezone: Z = UTC)
- Sortable lexicographically
- International standard
- Compact basic format (no dashes in the date, no colons in the time), which also keeps filenames portable
- Precedent: duplicity, dar wrappers

#### Directory Listing Examples

**Recent backups first:**
```bash
$ ls -lt /path/to/backups/
drwxr-xr-x 2 backup backup 4096 Dec 29 12:00 chain-20231229T120000Z
drwxr-xr-x 2 backup backup 4096 Dec 22 12:00 chain-20231222T120000Z
drwxr-xr-x 2 backup backup 4096 Dec 15 12:00 chain-20231215T120000Z
```

**Contents of a chain:**
```bash
$ ls -lh /path/to/backups/chain-20231222T120000Z/
-rw-r--r-- 1 backup backup 2.1G Dec 22 12:00 base.sql
-rw-r--r-- 1 backup backup  85M Dec 23 08:30 diff-20231223T083000Z.sql
-rw-r--r-- 1 backup backup  92M Dec 24 08:30 diff-20231224T083000Z.sql
-rw-r--r-- 1 backup backup  78M Dec 25 08:30 diff-20231225T083000Z.sql
```

**Tree view:**
```bash
$ tree -h /path/to/backups/
/path/to/backups/
├── [4.0K]  chain-20231215T120000Z
│   ├── [2.0G]  base.sql
│   ├── [ 80M]  diff-20231216T083000Z.sql
│   └── [ 88M]  diff-20231217T083000Z.sql
├── [4.0K]  chain-20231222T120000Z
│   ├── [2.1G]  base.sql
│   ├── [ 85M]  diff-20231223T083000Z.sql
│   └── [ 92M]  diff-20231224T083000Z.sql
└── [4.0K]  chain-20231229T120000Z
    └── [2.2G]  base.sql

---

### 2.3 Active vs Sealed Differentials

#### Backup States

When continuously backing up a database, differentials progress through two states:

**Active differential:**
- Currently streaming changes from the database
- Still being written to; incomplete
- Not safe to restore from

**Sealed differential:**
- Completed differential that is no longer being written to
- Immutable and safe to restore
- Result of the "sealing" process

**Sealing:**
- The process of finishing an active differential and starting a new one
- Happens periodically (e.g., via log rotation, scheduled checkpoint)
- Makes a differential complete and immutable

#### File Naming

**Active differential:**
```
active.sql
```
- Fixed name within chain directory
- Indicates work in progress
- Only one active differential per chain at any time

**Sealed differentials:**
```
diff-{TIMESTAMP}.sql
```
- On sealing, `active.sql` is renamed to `diff-{TIMESTAMP}.sql`
- Timestamp: when the differential was sealed
- A new `active.sql` is then started
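
The sealing step can be sketched as a rename followed by creating a fresh active file. This is a minimal illustration, assuming the writer has already flushed and closed `active.sql`; `seal_active` is a hypothetical name, and coordination with the streaming process is out of scope here.

```python
import os
from datetime import datetime, timezone
from pathlib import Path


def seal_active(chain_dir: Path, sealed_at: datetime) -> Path:
    """Rename active.sql to diff-{TIMESTAMP}.sql and start a new active file.

    os.rename is atomic on POSIX when both paths are on one filesystem,
    so the file appears under its sealed name only once it is complete.
    """
    if sealed_at.tzinfo is None:
        raise ValueError("sealed_at must be timezone-aware")
    active = chain_dir / "active.sql"
    stamp = sealed_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    sealed = chain_dir / f"diff-{stamp}.sql"
    if sealed.exists():
        raise FileExistsError(f"{sealed.name} already exists")
    os.rename(active, sealed)
    active.touch()  # new, empty active differential
    return sealed
```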

#### Lifecycle Example

```
# Chain starts with base backup
chain-20231215T120000Z/
  base.sql

# Active differential streams changes
chain-20231215T120000Z/
  base.sql
  active.sql                    # currently streaming

# First seal (e.g., daily rotation)
chain-20231215T120000Z/
  base.sql
  diff-20231216T083000Z.sql    # sealed (renamed from active.sql)
  active.sql                    # new active starts

# Second seal
chain-20231215T120000Z/
  base.sql
  diff-20231216T083000Z.sql    # sealed
  diff-20231217T083000Z.sql    # sealed (renamed from active.sql)
  active.sql                    # new active starts
```

**Directory listing with active backup:**
```bash
$ ls -lh /path/to/backups/chain-20231222T120000Z/
-rw-r--r-- 1 backup backup  23M Dec 25 06:45 active.sql  # growing in size
-rw-r--r-- 1 backup backup 2.1G Dec 22 12:00 base.sql
-rw-r--r-- 1 backup backup  85M Dec 23 08:30 diff-20231223T083000Z.sql
-rw-r--r-- 1 backup backup  92M Dec 24 08:30 diff-20231224T083000Z.sql

#### Restore Behavior

**Default restore:**
- Uses: `base.sql` + all sealed differentials (`diff-*.sql`), applied in timestamp order
- Excludes: `active.sql`
- Restores to: the point when the last differential was sealed

**Rationale:**
- Active differential is incomplete and potentially inconsistent
- Sealed differentials are guaranteed complete and consistent
- Predictable: users know exactly what restore point they get
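
The default selection described above can be sketched as a function that returns the files a restore would read. `restore_plan` is a hypothetical helper; how each file is actually applied to the database is not covered.

```python
import re
from pathlib import Path
from typing import List

DIFF_RE = re.compile(r"^diff-\d{8}T\d{6}Z\.sql$")


def restore_plan(chain_dir: Path) -> List[Path]:
    """Files a default restore would read: base first, then sealed diffs."""
    base = chain_dir / "base.sql"
    if not base.is_file():
        raise FileNotFoundError(f"no base.sql in {chain_dir}")
    # Sorting by name orders the sealed differentials chronologically.
    diffs = sorted(p for p in chain_dir.iterdir() if DIFF_RE.match(p.name))
    return [base, *diffs]  # active.sql never matches DIFF_RE, so it is excluded
```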

**To include active differential:**
- Must first seal it (manual operation)
- Then it becomes a regular `diff-{TIMESTAMP}.sql` file
- Now safe to include in restore