1 # Database Differential Backup Tool - Design Document
3 ## 1. Survey of Existing Backup Tools
5 ### Backup Chain Concept
7 A **backup chain** is a unit consisting of:
8 - One full backup (base)
9 - Zero or more dependent incremental backups
11 When a new successful full backup is taken, it starts a new chain. The previous chain (old base + incrementals) becomes "outdated" and can be cleaned up if desired.
13 ### Terminology Across Tools
15 - **Duplicity**: "Backup Chain"
16 - **Amanda**: "Dump Cycle"
17 - **Bacula/Bareos**: No specific term (uses "Full/Differential/Incremental levels")
18 - **dump/restore**: Backup levels (0-9)
20 ### Chain-Style Backup Tools
1. **duplicity** - Tar-based, encrypted, designed for remote/cloud storage
2. **Amanda** (Advanced Maryland Automatic Network Disk Archiver) - Network backup system (1990s), dump cycle concept
3. **Bacula / Bareos** - Enterprise network backup driven by a catalog database
4. **GNU tar (--listed-incremental)** - Built-in incremental support
5. **dar (Disk ARchive)** - Enhanced tar replacement
6. **dump/restore** - Classic Unix backup (1970s), filesystem-aware
30 #### Commercial/Enterprise
32 - Veeam Backup & Replication
33 - IBM Spectrum Protect (formerly TSM)
38 ### File Storage Formats
#### duplicity

**File naming pattern:**
44 duplicity-full.20231215T120000Z.vol1.difftar.gpg
45 duplicity-inc.20231215T120000Z.to.20231216T120000Z.vol1.difftar.gpg
49 - Timestamp in ISO 8601 format
50 - `.difftar` = tar differential format
51 - `.gpg` = encrypted (optional)
52 - Volumes numbered if split: `.vol1`, `.vol2`, etc.
53 - Time range in incremental filename: `from.to`
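As a rough illustration of how much information these names carry, the following Python sketch parses the two example filenames above. The regular expression is inferred only from those examples, not from duplicity's source, so real archives may use variants it does not cover; `parse_name` is a hypothetical helper.

```python
import re
from datetime import datetime, timezone

# Assumption: pattern derived from the two example names in this document.
_TS = r"\d{8}T\d{6}Z"
_NAME = re.compile(
    rf"^duplicity-(?:full\.(?P<full>{_TS})|inc\.(?P<start>{_TS})\.to\.(?P<end>{_TS}))"
    r"\.vol(?P<vol>\d+)\.difftar(?P<gpg>\.gpg)?$"
)


def _ts(text):
    """Convert a YYYYMMDDTHHMMSSZ string to an aware UTC datetime."""
    return datetime.strptime(text, "%Y%m%dT%H%M%SZ").replace(tzinfo=timezone.utc)


def parse_name(name):
    """Return a dict describing the archive, or None if the name does not match."""
    m = _NAME.match(name)
    if m is None:
        return None
    if m["full"]:
        kind, start, end = "full", None, _ts(m["full"])
    else:
        kind, start, end = "inc", _ts(m["start"]), _ts(m["end"])
    return {
        "kind": kind,
        "start": start,          # None for a full backup
        "end": end,              # time the backup represents
        "volume": int(m["vol"]),
        "encrypted": m["gpg"] is not None,
    }
```

An incremental belongs to a chain when its `start` equals the `end` of the previous archive in that chain, which is what makes the flat directory self-describing.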
55 **Example directory:**
58 duplicity-full.20231215T120000Z.vol1.difftar.gpg
59 duplicity-inc.20231215T120000Z.to.20231216T120000Z.vol1.difftar.gpg
60 duplicity-inc.20231216T120000Z.to.20231217T120000Z.vol1.difftar.gpg
#### Amanda

**Directory structure (virtual tapes):**
71 00001.DailySet1 # label file
72 00002.hostname._var.0 # level 0 dump
73 00003.hostname._home.1 # level 1 dump
79 **File naming pattern:**
81 slot02-00000001-server-_etc-20231215-0.data
84 **Format components:**
88 - Directory path (slashes become underscores)
94 - Each "slot" represents a virtual tape
95 - Mimics physical tape behavior
96 - Must be labeled before use
#### Bacula / Bareos

**Simple volume naming:**
112 **Naming configuration:**
113 - Configured via `LabelFormat` in Pool config
114 - Default: `LabelFormat = "Vol"` → Vol0001, Vol0002, etc.
115 - Fully customizable: can use date-based, purpose-based names
116 - No indication of full/incremental in filename
117 - Relationships tracked in catalog database (PostgreSQL/MySQL/SQLite)
120 - Volume = single file on disk (or tape)
121 - Catalog database required to understand backup chain
122 - Flexible naming scheme
123 - Professional/enterprise oriented
127 #### GNU tar (--listed-incremental)
129 **Snapshot file + tar archives:**
132 backup.snar # metadata/snapshot file
133 backup-full-2023-12-15.tar.gz # full backup
134 backup-inc-2023-12-16.tar.gz # incremental
135 backup-inc-2023-12-17.tar.gz # incremental
**Snapshot file format (.snar):**
- Metadata file maintained by tar, with fields stored as decimal text
- Tracks directory timestamps and device/inode information
- Multiple format versions (0, 1, 2)
- Format 2 is current (modern tar versions)
- Starts with a header line identifying the tar version and format, followed by the dump time as seconds and nanoseconds since the epoch
- Followed by directory records (NUL-delimited, i.e. ASCII 0)
147 - Same `.snar` file used for entire chain
148 - User chooses tar archive filenames
149 - `.snar` file must be preserved with backups
150 - Losing `.snar` file breaks incremental chain
154 #### dar (Disk ARchive)
156 **Sliced archive format:**
159 backup-full-20231215.1.dar
160 backup-full-20231215.2.dar # additional slices
161 backup-inc-20231216.1.dar
162 backup-inc-20231217.1.dar
165 **With reference tracking (wrapper scripts):**
167 server_daily_20231215T0352UTC.1.dar
168 server_daily_20231216T0403UTC_based_on_20231215T0352UTC.1.dar
169 server_daily_20231217T0403UTC_based_on_20231216T0403UTC.1.dar
174 basename.slice_number.dar
178 - Archives split into "slices" for size management
179 - User chooses basename
180 - Reference archive specified via `-A` option
181 - No distinction between differential/incremental (depends on reference)
182 - Wrapper utilities add reference info to basename
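To make the `basename.slice_number.dar` convention concrete, here is a small Python sketch that groups slices by archive and reads the reference out of the `_based_on_` wrapper naming shown above. Both helper names are hypothetical, and the `_based_on_` suffix is the wrapper-script convention from the example, not something dar itself enforces.

```python
import re
from collections import defaultdict

# basename.slice_number.dar, as described above
_SLICE = re.compile(r"^(?P<base>.+)\.(?P<num>\d+)\.dar$")


def group_slices(filenames):
    """Map each archive basename to its sorted list of slice numbers."""
    archives = defaultdict(list)
    for name in filenames:
        m = _SLICE.match(name)
        if m:  # silently skip files that are not dar slices
            archives[m["base"]].append(int(m["num"]))
    return {base: sorted(nums) for base, nums in archives.items()}


def reference_of(basename):
    """Return the timestamp after `_based_on_`, or None if there is none."""
    _, sep, ref = basename.partition("_based_on_")
    return ref if sep else None
```

A basename without a `_based_on_` suffix can be treated as the start of a chain under this convention.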
#### dump/restore

**User-specified output files:**
191 home-level0-20231215.dump
192 home-level1-20231216.dump
193 home-level2-20231217.dump
194 var-level0-20231215.dump
195 var-level1-20231216.dump
**dumpdates format (text file):**

    /dev/sda1 0 Fri Dec 15 03:00:00 2023
    /dev/sda1 1 Sat Dec 16 03:00:00 2023
    /home 0 Fri Dec 15 04:00:00 2023

- Dump levels: 0 (full) through 9 (incremental)
- A level n dump backs up everything changed since the most recent dump at a lower level
- Output filename specified by user with `-f` flag
- `/etc/dumpdates` tracks: filesystem, level, timestamp
- Filesystem-aware (ext2/3/4, UFS)
- Dates back to 1970s Unix
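The level rule is easy to misread, so here is a minimal Python sketch of it: given the recorded dumps for one filesystem, it picks the dump that a new level n dump would be relative to. The function name and the `(level, datetime)` tuple layout are assumptions made for illustration; dump itself reads this information from `/etc/dumpdates`.

```python
from datetime import datetime


def reference_dump(history, level):
    """Pick the dump that a new dump at `level` is taken relative to.

    history: list of (level, datetime) tuples for a single filesystem.
    Returns the most recent entry with a lower level, or None when no
    lower-level dump exists (which is always the case for level 0).
    """
    lower = [entry for entry in history if entry[0] < level]
    if not lower:
        return None
    return max(lower, key=lambda entry: entry[1])
```

For example, with a level 0 on Dec 15 and a level 1 on Dec 16, a new level 2 dump captures changes since the level 1, while a new level 1 dump goes back to the level 0.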
222 #### Filename Standardization
224 **Self-describing filenames:**
225 - ✅ **duplicity**: Fully standardized, metadata in filename
226 - ✅ **Amanda**: Standardized format with metadata
227 - ✅ **dar** (with wrappers): Can encode reference information
229 **User-defined or generic:**
230 - ❌ **Bacula**: Generic names, requires database
231 - ❌ **GNU tar**: User chooses names, `.snar` tracks metadata
232 - ❌ **dump**: User chooses names, `/etc/dumpdates` tracks metadata
234 #### External Metadata Storage
236 - **duplicity**: None needed (self-contained)
237 - **Amanda**: None needed (filename contains metadata)
238 - **Bacula**: SQL database (PostgreSQL/MySQL/SQLite)
239 - **GNU tar**: `.snar` file (must be preserved)
240 - **dar**: Optional (can use wrapper naming convention)
241 - **dump**: `/etc/dumpdates` file
243 #### Chain Management
245 Tools differ in how they handle "obsolete chain" cleanup:
247 - **duplicity**: Has `remove-all-but-n-full` command
248 - **Amanda**: Tape rotation policy
249 - **Bacula**: Retention periods in configuration
250 - **tar/dar/dump**: Manual management by user
254 ### Design Considerations for New Tools
256 Based on this survey:
258 1. **Self-describing filenames** (duplicity/Amanda model) make tools more robust
259 - No external database dependency
260 - Easy to inspect backups with standard tools
261 - Clear what files belong to which chain
263 2. **Timestamp in filename** helps with:
264 - Sorting and organization
265 - Understanding backup history
266 - Automated retention policies
268 3. **Chain/reference encoding** makes relationships explicit:
269 - Know which full backup an incremental depends on
270 - Can validate chain integrity
271 - Simplifies cleanup of complete chains
**Trade-offs:**
- Self-describing names make for longer filenames
- Database storage allows more flexible queries but adds a dependency
- A separate metadata file keeps filenames simple but is an extra file to manage
280 ## 2. Design for Database Differential Backup Tool
### 2.1 Terminology
284 Based on the survey of existing tools, we establish the following terminology:
#### Base Backup
287 - **Term chosen**: "base" or "base backup"
288 - **Alternative terms in other tools**: "full backup" (duplicity, tar, dar), "level 0" (dump/restore)
289 - **Definition**: A complete backup of the database
291 - "Base" is concise and clearly indicates this is the foundation of the chain
292 - More specific than "full" for database contexts where we're backing up a database instance
293 - Aligns with PostgreSQL's terminology (pg_basebackup)
295 #### Differential Backup
296 - **Term chosen**: "differential" or "diff"
297 - **Alternative terms in other tools**: "incremental" (duplicity, tar, dar), "level 1-9" (dump/restore)
298 - **Definition**: A backup containing only changes since the last **base backup** in this chain
- Distinguishes it from "incremental", which typically means changes since the *last backup of any type*
- All differentials in a chain reference the same base backup
- Simpler restore process: base + one differential (vs. base + all incrementals in sequence)
- Established term in database backup tooling (e.g., SQL Server differential backups, pgBackRest for PostgreSQL)
#### Backup Chain
- **Term chosen**: "chain" or "backup chain"
- **Alternative terms in other tools**: "dump cycle" (Amanda), no specific term (Bacula, dump/restore)
- **Definition**: A unit consisting of:
  - One base backup
  - Zero or more differential backups that depend on that base
312 - A new successful base backup starts a new chain
313 - The previous chain becomes "outdated" or "superseded"
314 - Outdated chains can be retained for historical purposes or cleaned up
316 - "Chain" clearly conveys the dependency relationship
317 - Widely understood term (duplicity uses this)
318 - Simple and intuitive
320 #### Chain Identifier
321 - **Term chosen**: "chain ID"
322 - **Definition**: A unique identifier for a backup chain, typically the timestamp of the base backup
323 - **Usage**: Links differential backups to their base backup
325 - Enables self-describing filenames
326 - Allows validation of chain integrity
327 - Simplifies automated cleanup of complete chains

| Concept | Our Term | Size | Depends On | Typical Frequency |
|---------|----------|------|------------|-------------------|
| Complete backup | **base** | Large | Nothing | Weekly, monthly |
| Changes since base | **differential** | Small-Medium | Base in same chain | Daily, hourly |
| Base + its differentials | **chain** | N/A | N/A | Per base backup |
| Chain's unique ID | **chain ID** | N/A | N/A | One per chain |
342 ### 2.2 Filesystem Layout
344 #### Design Philosophy
346 Our tool follows the **opinionated, self-organizing** approach:
347 - User specifies only the backup directory
348 - Tool manages all filenames and organization automatically
349 - Self-describing structure (no external database required)
350 - Easy to inspect, validate, and clean up with standard filesystem tools
352 #### Directory Structure
354 **Recommended layout: Subdirectories per chain**
357 /path/to/backups/ # User-specified backup directory
358 chain-20231215T120000Z/ # Chain directory (named by chain ID)
359 base.sql # Base backup
360 diff-20231216T083000Z.sql # Differential backup
361 diff-20231217T083000Z.sql # Differential backup
362 diff-20231218T083000Z.sql # Differential backup
363 chain-20231222T120000Z/ # New chain (new base started)
364 base.sql # Base backup
365 diff-20231223T083000Z.sql # Differential backup
366 chain-20231229T120000Z/ # Current/latest chain
367 base.sql # Base backup
370 #### Naming Conventions
376 - Format: `chain-` prefix + ISO 8601 timestamp (UTC)
377 - Timestamp pattern: `YYYYMMDDTHHMMSSZ`
378 - Example: `chain-20231215T120000Z`
379 - The timestamp is the **chain ID** (taken from when base backup started)
381 **Base backup file:**
385 - Simple, fixed name within each chain directory
386 - No timestamp needed (directory already identifies the chain)
387 - Extension: `.sql` (SQL dump format)
389 **Differential backup files:**
393 - Format: `diff-` prefix + ISO 8601 timestamp (UTC)
394 - Timestamp: when this differential backup was taken
395 - Example: `diff-20231216T083000Z.sql`
396 - Sorts chronologically with standard `ls` or `sort`
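The naming rules above can be sketched in a few lines of Python. The helper names and the `TS_FORMAT` constant are hypothetical, not part of the tool; the sketch assumes callers pass timezone-aware datetimes and requires Python 3.9 or later for `str.removesuffix`.

```python
from datetime import datetime, timezone

TS_FORMAT = "%Y%m%dT%H%M%SZ"  # YYYYMMDDTHHMMSSZ, always UTC


def chain_dir_name(base_started):
    """Directory name for a chain whose base backup started at `base_started`."""
    return "chain-" + base_started.astimezone(timezone.utc).strftime(TS_FORMAT)


def diff_file_name(taken):
    """File name for a differential taken at `taken`."""
    return "diff-" + taken.astimezone(timezone.utc).strftime(TS_FORMAT) + ".sql"


def parse_timestamp(name):
    """Extract the UTC timestamp from a chain directory or diff file name."""
    stem = name.removesuffix(".sql")
    prefix, _, stamp = stem.partition("-")
    if prefix not in ("chain", "diff"):
        raise ValueError(f"unrecognized name: {name!r}")
    return datetime.strptime(stamp, TS_FORMAT).replace(tzinfo=timezone.utc)
```

Converting to UTC before formatting keeps the `Z` suffix truthful even when the caller's clock is in another timezone.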
400 **Why subdirectories per chain?**
402 1. **Clear grouping**: Each chain is a self-contained unit
403 - Easy to see which backups belong together
404 - Obvious dependencies (all diffs depend on base in same directory)
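2. **Simple cleanup**: Delete entire chain by removing one directory
   - No need to parse filenames to find all parts of a chain
   - One directory operation removes the chain (a recursive delete is not strictly atomic; renaming the directory out of the way first makes the removal appear atomic to readers)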
410 3. **Better than flat directory** (duplicity style):
411 - Doesn't get cluttered with dozens of chains
412 - Easier to navigate and understand visually
413 - Chain operations (validate, cleanup) work on directories
415 4. **Better than generic names** (Bacula style):
416 - No external database needed
417 - Self-documenting structure
418 - Can inspect backups with `ls`, `tree`, etc.
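Because each chain is one directory, retention reduces to listing and deleting directories. The following Python sketch keeps the newest `keep` chains, loosely in the spirit of duplicity's `remove-all-but-n-full`. The function names and the `deleting-` prefix are illustrative assumptions, not part of the design.

```python
import shutil
from pathlib import Path


def list_chains(backup_dir):
    """Chain directories, oldest first (the names sort chronologically)."""
    return sorted(
        p for p in Path(backup_dir).iterdir()
        if p.is_dir() and p.name.startswith("chain-")
    )


def prune_chains(backup_dir, keep):
    """Delete all but the `keep` newest chains and return the removed names."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    removed = []
    for chain in list_chains(backup_dir)[:-keep]:
        # Rename first so a partially deleted chain is never listed as a chain.
        doomed = chain.with_name("deleting-" + chain.name)
        chain.rename(doomed)
        shutil.rmtree(doomed)
        removed.append(chain.name)
    return removed
```

With the three chains from the layout above, `prune_chains("/path/to/backups", keep=2)` would remove only the chain from 15 December.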
420 **Why simple filenames within chains?**
422 1. **Base backup** doesn't need timestamp:
423 - Chain directory already encodes the timestamp
425 - Simpler: just `base.sql`
427 2. **Differential backups** include timestamp:
428 - Need to distinguish multiple differentials
429 - Timestamp shows when each differential was taken
430 - Sorts chronologically
432 **Why ISO 8601 timestamps?**
- Unambiguous (includes timezone: Z = UTC)
- Sortable lexicographically
- International standard
- Compact "basic" format (no dashes in the date, no colons in the time), which also avoids characters that are awkward in filenames
- Precedent: duplicity, dar wrappers
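The "sortable lexicographically" property is what lets plain `ls` or `sort` order backups correctly. A short Python check, using hypothetical helper names, compares a plain string sort against a sort on the parsed timestamps:

```python
from datetime import datetime


def by_name(names):
    """Plain string sort, as `ls` or `sort` would do."""
    return sorted(names)


def by_time(names):
    """Sort by the timestamp parsed out of each diff file name."""
    return sorted(
        names,
        key=lambda n: datetime.strptime(n, "diff-%Y%m%dT%H%M%SZ.sql"),
    )
```

The two orderings agree because every field is fixed-width, zero-padded, and ordered from most to least significant, and because all timestamps share the same UTC offset.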
440 #### Directory Listing Examples
442 **Recent backups first:**
444 $ ls -lt /path/to/backups/
445 drwxr-xr-x 2 backup backup 4096 Dec 29 12:00 chain-20231229T120000Z
446 drwxr-xr-x 2 backup backup 4096 Dec 22 12:00 chain-20231222T120000Z
447 drwxr-xr-x 2 backup backup 4096 Dec 15 12:00 chain-20231215T120000Z
450 **Contents of a chain:**
452 $ ls -lh /path/to/backups/chain-20231222T120000Z/
453 -rw-r--r-- 1 backup backup 2.1G Dec 22 12:00 base.sql
454 -rw-r--r-- 1 backup backup 85M Dec 23 08:30 diff-20231223T083000Z.sql
455 -rw-r--r-- 1 backup backup 92M Dec 24 08:30 diff-20231224T083000Z.sql
456 -rw-r--r-- 1 backup backup 78M Dec 25 08:30 diff-20231225T083000Z.sql
461 $ tree -h /path/to/backups/
463 ├── chain-20231215T120000Z
464 │ ├── base.sql [2.0G]
465 │ ├── diff-20231216T083000Z.sql [80M]
466 │ └── diff-20231217T083000Z.sql [88M]
467 ├── chain-20231222T120000Z
468 │ ├── base.sql [2.1G]
469 │ ├── diff-20231223T083000Z.sql [85M]
470 │ └── diff-20231224T083000Z.sql [92M]
471 └── chain-20231229T120000Z
477 ### 2.3 Active vs Sealed Differentials
481 When continuously backing up a database, differentials progress through two states:
483 **Active differential:**
484 - Currently streaming changes from the database
485 - Still being written to; incomplete
486 - Not safe to restore from
488 **Sealed differential:**
489 - Completed differential that is no longer being written to
490 - Immutable and safe to restore
491 - Result of the "sealing" process
**Sealing:**
- The process of finishing an active differential and starting a new one
- Happens periodically (e.g., via log rotation or a scheduled checkpoint)
- Makes a differential complete and immutable
**Active differential:** `active.sql`
- Fixed name within chain directory
- Indicates work in progress
- Only one active differential per chain at any time

**Sealed differentials:** `diff-{TIMESTAMP}.sql`
- When sealed, the active differential is renamed with its timestamp
- Timestamp: when the differential was sealed
- A new `active.sql` then begins
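A minimal Python sketch of the sealing step, assuming the writer has already stopped appending to `active.sql` and flushed it. `seal_active` is a hypothetical name, and a real implementation would also need to coordinate with the streaming process, which is outside the scope of this sketch.

```python
import os
from datetime import datetime, timezone
from pathlib import Path


def seal_active(chain_dir, now=None):
    """Rename active.sql to diff-<timestamp>.sql and start a new active.sql.

    Returns the path of the sealed differential.
    """
    chain = Path(chain_dir)
    active = chain / "active.sql"
    now = now or datetime.now(timezone.utc)
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    sealed = chain / f"diff-{stamp}.sql"
    if sealed.exists():
        # Two seals within the same second would otherwise overwrite data.
        raise FileExistsError(str(sealed))
    os.rename(active, sealed)  # a single rename within the same directory
    active.touch()             # empty file for the next active differential
    return sealed
```

Using a rename means readers see either `active.sql` or the finished `diff-*.sql` under its final name, never a partially copied file.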
516 #### Lifecycle Example
519 # Chain starts with base backup
520 chain-20231215T120000Z/
523 # Active differential streams changes
524 chain-20231215T120000Z/
526 active.sql # currently streaming
528 # First seal (e.g., daily rotation)
529 chain-20231215T120000Z/
531 diff-20231216T083000Z.sql # sealed (renamed from active.sql)
532 active.sql # new active starts
535 chain-20231215T120000Z/
537 diff-20231216T083000Z.sql # sealed
538 diff-20231217T083000Z.sql # sealed (renamed from active.sql)
539 active.sql # new active starts
542 **Directory listing with active backup:**
544 $ ls -lh /path/to/backups/chain-20231222T120000Z/
545 -rw-r--r-- 1 backup backup 2.1G Dec 22 12:00 base.sql
546 -rw-r--r-- 1 backup backup 85M Dec 23 08:30 diff-20231223T083000Z.sql
547 -rw-r--r-- 1 backup backup 92M Dec 24 08:30 diff-20231224T083000Z.sql
548 -rw-r--r-- 1 backup backup 23M Dec 25 06:45 active.sql # growing in size
551 #### Restore Behavior
- Uses: `base.sql` + all sealed differentials (`diff-*.sql`), applied in timestamp order
- Excludes: `active.sql`
- Restores to: the point at which the most recent differential was sealed
559 - Active differential is incomplete and potentially inconsistent
560 - Sealed differentials are guaranteed complete and consistent
561 - Predictable: users know exactly what restore point they get
563 **To include active differential:**
564 - Must first seal it (manual operation)
565 - Then it becomes a regular `diff-{TIMESTAMP}.sql` file
566 - Now safe to include in restore
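The default restore selection described above can be sketched as a small Python function. `restore_plan` is a hypothetical helper that only decides which files to use and in what order; how the SQL files are actually applied to the database is not covered here.

```python
from pathlib import Path


def restore_plan(chain_dir):
    """Files to apply, in order: base.sql, then sealed differentials by name."""
    chain = Path(chain_dir)
    base = chain / "base.sql"
    if not base.is_file():
        raise FileNotFoundError(f"{chain} has no base.sql")
    # The glob only matches sealed files, so active.sql is excluded.
    diffs = sorted(chain.glob("diff-*.sql"))
    return [base, *diffs]
```

Because `active.sql` never matches the `diff-*.sql` pattern, an unsealed differential cannot end up in a restore by accident.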