IndexedBackend Pattern

SQLite Metadata + Content-Addressed Blob Storage

This document describes the IndexedBackend architecture pattern: separating filesystem metadata (stored in SQLite) from file content (stored as blobs). This enables efficient queries, large file support, and flexible storage backends.

Ecosystem Implementation: The anyfs-indexed crate provides IndexedBackend as a production-ready implementation using local disk blobs. See the Backends Guide for usage. This document covers the underlying design pattern for those building custom implementations (e.g., with S3, cloud storage, or custom blob stores).


Overview

The IndexedBackend pattern separates:

  • Metadata (directory structure, inodes, permissions) → SQLite
  • Content (file bytes) → Content-Addressed Storage (CAS)
┌─────────────────────────────────────────────────────────┐
│            IndexedBackend (pattern)                     │
│  ┌─────────────────────┐    ┌────────────────────────┐  │
│  │   SQLite Metadata   │    │   Blob Store (CAS)     │  │
│  │                     │    │                        │  │
│  │  - inodes           │    │  - content-addressed   │  │
│  │  - dir_entries      │    │  - deduplicated        │  │
│  │  - blob references  │    │  - S3, local, etc.     │  │
│  │  - audit log        │    │                        │  │
│  └─────────────────────┘    └────────────────────────┘  │
└─────────────────────────────────────────────────────────┘

Custom backends can use S3, cloud storage, or other blob stores.
IndexedBackend implements a simpler variant with UUID-named local blobs
(optimized for streaming; see note below on storage models).

Why this pattern?

  • SQLite is great for metadata queries (directory listings, stats, audit)
  • Blob stores scale better for large file content
  • Content-addressing enables deduplication
  • Separating concerns enables independent scaling

Storage Model Variants

Model             | Blob Naming            | Dedup  | Best For
------------------|------------------------|--------|-----------------------------------
Content-Addressed | SHA-256 of content     | ✅ Yes | Cloud/S3, archival, multi-tenant
UUID+Timestamp    | {uuid}-{timestamp}.bin | ❌ No  | Streaming large files, simplicity

IndexedBackend uses UUID+Timestamp naming because:

  • Large files can be streamed without buffering the entire file to compute a hash
  • Write latency is consistent (no hash computation)
  • Simpler garbage collection (delete blob when reference removed)
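
A minimal sketch of that naming scheme, assuming the uuid crate (v4 feature); the helper name and the .bin suffix are illustrative, not anyfs-indexed API:

// Generate a UUID+timestamp blob name. No content hash is required, so the name can be
// chosen before any bytes arrive, which is what makes streaming writes possible.
use std::time::{SystemTime, UNIX_EPOCH};
use uuid::Uuid;

fn new_blob_name() -> String {
    let ts = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock is before the Unix epoch")
        .as_secs();
    format!("{}-{}.bin", Uuid::new_v4(), ts)
}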

Custom implementations may prefer content-addressed storage when:

  • Deduplication is valuable (many users uploading same files)
  • Using cloud blob stores with native CAS support (S3, GCS)
  • Building archival systems where write latency is acceptable

Framework Validation

Do Current Traits Support This?

Yes. The Fs traits define operations, not storage implementation.

Trait Method      | Hybrid Implementation
------------------|------------------------------------
read(path)        | SQLite lookup → blob fetch
write(path, data) | Blob upload → SQLite update
metadata(path)    | SQLite query only
read_dir(path)    | SQLite query only
remove_file(path) | SQLite update (refcount--)
rename(from, to)  | SQLite update only
copy(from, to)    | SQLite update (refcount++)

The traits don’t care where bytes come from - that’s the backend’s business.

Thread Safety

The current design requires &self methods with interior mutability. For a hybrid backend:

pub struct CustomIndexedBackend {
    // SQLite needs single-writer (see "Write Queue" below)
    metadata: Arc<Mutex<Connection>>,

    // Blob store is typically already thread-safe
    blobs: Arc<dyn BlobStore>,

    // Write queue for serializing SQLite writes
    write_tx: mpsc::Sender<WriteCmd>,
}

This aligns with ADR-023 (interior mutability).


Data Model

SQLite Schema

-- Inode table (one row per file/directory/symlink)
CREATE TABLE nodes (
    inode       INTEGER PRIMARY KEY,
    parent      INTEGER NOT NULL,
    name        TEXT NOT NULL,
    node_type   TEXT NOT NULL,  -- 'file', 'dir', 'symlink'
    size        INTEGER NOT NULL DEFAULT 0,
    mode        INTEGER NOT NULL DEFAULT 420,  -- 0o644
    nlink       INTEGER NOT NULL DEFAULT 1,
    blob_id     TEXT,           -- NULL for directories
    symlink_target TEXT,        -- NULL unless symlink
    created_at  INTEGER NOT NULL,
    modified_at INTEGER NOT NULL,
    accessed_at INTEGER NOT NULL,

    UNIQUE(parent, name)
);

-- Root directory (inode 1)
INSERT INTO nodes (inode, parent, name, node_type, size, mode, created_at, modified_at, accessed_at)
VALUES (1, 1, '', 'dir', 0, 493, strftime('%s', 'now'), strftime('%s', 'now'), strftime('%s', 'now'));  -- 493 = 0o755

-- Blob reference tracking (for dedup + GC)
CREATE TABLE blobs (
    blob_id     TEXT PRIMARY KEY,  -- sha256 hex
    size        INTEGER NOT NULL,
    refcount    INTEGER NOT NULL DEFAULT 0,
    created_at  INTEGER NOT NULL
);

-- Audit log (optional but recommended)
CREATE TABLE audit (
    seq         INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp   INTEGER NOT NULL,
    operation   TEXT NOT NULL,
    path        TEXT,
    actor       TEXT,
    details     TEXT  -- JSON
);

-- Indexes
CREATE INDEX idx_nodes_parent ON nodes(parent);
CREATE INDEX idx_nodes_blob ON nodes(blob_id) WHERE blob_id IS NOT NULL;
CREATE INDEX idx_blobs_refcount ON blobs(refcount) WHERE refcount = 0;

Blob Store Interface

use std::io::{Read, Write};

/// Content-addressed blob storage.
pub trait BlobStore: Send + Sync {
    /// Store bytes, returns content hash (blob_id).
    fn put(&self, data: &[u8]) -> Result<String, BlobError>;

    /// Retrieve bytes by content hash.
    fn get(&self, blob_id: &str) -> Result<Vec<u8>, BlobError>;

    /// Check if blob exists.
    fn exists(&self, blob_id: &str) -> Result<bool, BlobError>;

    /// Delete blob (only call after refcount reaches 0).
    fn delete(&self, blob_id: &str) -> Result<(), BlobError>;

    /// Streaming read for large files.
    fn open_read(&self, blob_id: &str) -> Result<Box<dyn Read + Send>, BlobError>;

    /// Streaming write, returns blob_id on completion.
    fn open_write(&self) -> Result<Box<dyn BlobWriter>, BlobError>;
}

pub trait BlobWriter: Write + Send {
    /// Finalize the blob and return its content hash.
    fn finalize(self: Box<Self>) -> Result<String, BlobError>;
}

Implementations could be:

  • LocalCasBackend - local directory with content-addressed files
  • S3BlobStore - S3-compatible object storage
  • MemoryBlobStore - in-memory for testing
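
To make the trait concrete, here is a hedged sketch of the MemoryBlobStore variant. It assumes the sha2 and hex crates and a BlobError::NotFound variant; adapt those details to the real error type.

// In-memory, content-addressed BlobStore for tests. Everything lives in one HashMap
// behind a Mutex, so the store is Send + Sync as the trait requires.
use std::collections::HashMap;
use std::io::{Cursor, Read, Write};
use std::sync::{Arc, Mutex};
use sha2::{Digest, Sha256};

#[derive(Default, Clone)]
pub struct MemoryBlobStore {
    blobs: Arc<Mutex<HashMap<String, Vec<u8>>>>,
}

impl BlobStore for MemoryBlobStore {
    fn put(&self, data: &[u8]) -> Result<String, BlobError> {
        let blob_id = hex::encode(Sha256::digest(data));
        self.blobs.lock().unwrap().insert(blob_id.clone(), data.to_vec());
        Ok(blob_id)
    }

    fn get(&self, blob_id: &str) -> Result<Vec<u8>, BlobError> {
        self.blobs.lock().unwrap().get(blob_id).cloned().ok_or(BlobError::NotFound)
    }

    fn exists(&self, blob_id: &str) -> Result<bool, BlobError> {
        Ok(self.blobs.lock().unwrap().contains_key(blob_id))
    }

    fn delete(&self, blob_id: &str) -> Result<(), BlobError> {
        self.blobs.lock().unwrap().remove(blob_id);
        Ok(())
    }

    fn open_read(&self, blob_id: &str) -> Result<Box<dyn Read + Send>, BlobError> {
        // Everything is buffered in memory anyway, so a Cursor over a copy is enough.
        Ok(Box::new(Cursor::new(self.get(blob_id)?)))
    }

    fn open_write(&self) -> Result<Box<dyn BlobWriter>, BlobError> {
        Ok(Box::new(MemoryBlobWriter { buf: Vec::new(), blobs: Arc::clone(&self.blobs) }))
    }
}

struct MemoryBlobWriter {
    buf: Vec<u8>,
    blobs: Arc<Mutex<HashMap<String, Vec<u8>>>>,
}

impl Write for MemoryBlobWriter {
    fn write(&mut self, buf: &[u8]) -> std::io::Result<usize> {
        self.buf.extend_from_slice(buf);
        Ok(buf.len())
    }

    fn flush(&mut self) -> std::io::Result<()> {
        Ok(())
    }
}

impl BlobWriter for MemoryBlobWriter {
    fn finalize(self: Box<Self>) -> Result<String, BlobError> {
        // Hash once at the end; acceptable for an in-memory test double.
        let blob_id = hex::encode(Sha256::digest(&self.buf));
        self.blobs.lock().unwrap().insert(blob_id.clone(), self.buf);
        Ok(blob_id)
    }
}

Swapping this in for a disk-backed store keeps backend tests hermetic and fast.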

Implementation Sketch

Core Structure

use anyfs_backend::{FsRead, FsWrite, FsDir, FsError, Metadata, ReadDirIter, DirEntry, FileType};
use rusqlite::Connection;
use std::sync::{Arc, Mutex};
use std::path::{Path, PathBuf};
use tokio::sync::{mpsc, oneshot};

pub struct CustomIndexedBackend {
    /// SQLite connection (metadata)
    db: Arc<Mutex<Connection>>,

    /// Content-addressed blob storage
    blobs: Arc<dyn BlobStore>,

    /// Write command queue (single-writer pattern)
    write_tx: mpsc::UnboundedSender<WriteCmd>,

    /// Background writer handle
    _writer_handle: Arc<WriterHandle>,
}

enum WriteCmd {
    Write {
        path: PathBuf,
        blob_id: String,
        size: u64,
        reply: oneshot::Sender<Result<(), FsError>>,
    },
    Remove {
        path: PathBuf,
        reply: oneshot::Sender<Result<(), FsError>>,
    },
    CreateDir {
        path: PathBuf,
        reply: oneshot::Sender<Result<(), FsError>>,
    },
    // ... other write operations
}

Read Operations (Direct)

Read operations can query SQLite and blob store directly (no queue needed):

impl FsRead for CustomIndexedBackend {
    fn read(&self, path: &Path) -> Result<Vec<u8>, FsError> {
        // 1. Query SQLite for blob_id
        let db = self.db.lock().map_err(|_| FsError::Backend("lock poisoned".into()))?;

        let (blob_id, node_type): (Option<String>, String) = db.query_row(
            "SELECT blob_id, node_type FROM nodes WHERE inode = (
                SELECT inode FROM nodes WHERE parent = ? AND name = ?
            )",
            // ... path resolution params
            |row| Ok((row.get(0)?, row.get(1)?)),
        ).map_err(|_| FsError::NotFound { path: path.to_path_buf() })?;

        if node_type != "file" {
            return Err(FsError::NotAFile { path: path.to_path_buf() });
        }

        let blob_id = blob_id.ok_or_else(|| FsError::NotFound { path: path.to_path_buf() })?;

        drop(db);  // Release lock before blob fetch

        // 2. Fetch from blob store
        self.blobs.get(&blob_id)
            .map_err(|e| FsError::Backend(e.to_string()))
    }

    fn exists(&self, path: &Path) -> Result<bool, FsError> {
        let db = self.db.lock().map_err(|_| FsError::Backend("lock poisoned".into()))?;

        // Pure SQLite query
        let exists: bool = db.query_row(
            "SELECT EXISTS(SELECT 1 FROM nodes WHERE parent = ? AND name = ?)",
            // ... params
            |row| row.get(0),
        ).unwrap_or(false);

        Ok(exists)
    }

    fn metadata(&self, path: &Path) -> Result<Metadata, FsError> {
        let db = self.db.lock().map_err(|_| FsError::Backend("lock poisoned".into()))?;

        // Pure SQLite query - no blob store needed
        db.query_row(
            "SELECT node_type, size, mode, nlink, created_at, modified_at, accessed_at, inode
             FROM nodes WHERE parent = ? AND name = ?",
            // ... params
            |row| {
                let node_type: String = row.get(0)?;
                Ok(Metadata {
                    file_type: match node_type.as_str() {
                        "file" => FileType::File,
                        "dir" => FileType::Directory,
                        "symlink" => FileType::Symlink,
                        _ => FileType::File,
                    },
                    size: row.get(1)?,
                    permissions: Some(row.get(2)?),
                    // ... other fields
                })
            },
        ).map_err(|_| FsError::NotFound { path: path.to_path_buf() })
    }

    // ... other FsRead methods
}
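
The "... path resolution params" placeholders above hide the step that turns a path into an inode. A hedged sketch of that lookup, walking (parent, name) pairs from root inode 1 as defined in the schema (the lookup_inode helper is hypothetical):

// Resolve a path to its inode with one indexed query per component.
fn lookup_inode(db: &Connection, path: &Path) -> Result<i64, FsError> {
    use std::path::Component;

    let mut inode: i64 = 1; // root

    for component in path.components() {
        let name = match component {
            Component::RootDir => continue,
            Component::Normal(name) => name.to_string_lossy().into_owned(),
            _ => return Err(FsError::Backend("unsupported path component".into())),
        };

        inode = db
            .query_row(
                "SELECT inode FROM nodes WHERE parent = ?1 AND name = ?2",
                rusqlite::params![inode, name],
                |row| row.get(0),
            )
            .map_err(|_| FsError::NotFound { path: path.to_path_buf() })?;
    }

    Ok(inode)
}

The UNIQUE(parent, name) constraint already gives SQLite an index that serves this query, so each component resolves in a single indexed lookup.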

Write Operations (Two-Phase Commit)

Writes use a two-phase pattern: upload blob first, then commit SQLite:

impl FsWrite for CustomIndexedBackend {
    fn write(&self, path: &Path, data: &[u8]) -> Result<(), FsError> {
        let path = path.to_path_buf();

        // Phase 1: Upload blob (can fail independently)
        let blob_id = self.blobs.put(data)
            .map_err(|e| FsError::Backend(format!("blob upload failed: {}", e)))?;

        // Phase 2: Commit metadata (via write queue)
        let (tx, rx) = oneshot::channel();

        self.write_tx.send(WriteCmd::Write {
            path,
            blob_id,
            size: data.len() as u64,
            reply: tx,
        }).map_err(|_| FsError::Backend("write queue closed".into()))?;

        // Wait for SQLite commit
        rx.blocking_recv()
            .map_err(|_| FsError::Backend("write cancelled".into()))?
    }

    fn remove_file(&self, path: &Path) -> Result<(), FsError> {
        let path = path.to_path_buf();

        // Queue the removal (blob cleanup happens in background via GC)
        let (tx, rx) = oneshot::channel();

        self.write_tx.send(WriteCmd::Remove { path, reply: tx })
            .map_err(|_| FsError::Backend("write queue closed".into()))?;

        rx.blocking_recv()
            .map_err(|_| FsError::Backend("remove cancelled".into()))?
    }

    fn copy(&self, from: &Path, to: &Path) -> Result<(), FsError> {
        // Copy is just a metadata operation - increment refcount, no blob copy!
        let (tx, rx) = oneshot::channel();

        self.write_tx.send(WriteCmd::Copy {
            from: from.to_path_buf(),
            to: to.to_path_buf(),
            reply: tx,
        }).map_err(|_| FsError::Backend("write queue closed".into()))?;

        rx.blocking_recv()
            .map_err(|_| FsError::Backend("copy cancelled".into()))?
    }

    // ... other FsWrite methods
}

Write Queue Worker

The single-writer pattern for SQLite:

async fn write_worker(
    db: Arc<Mutex<Connection>>,
    blobs: Arc<dyn BlobStore>,
    mut rx: mpsc::UnboundedReceiver<WriteCmd>,
) {
    while let Some(cmd) = rx.recv().await {
        let result = {
            let mut db = db.lock().unwrap();

            match cmd {
                WriteCmd::Write { path, blob_id, size, reply } => {
                    // PathBuf does not implement Display; interpolate its display() form below.
                    let path = path.display();
                    let result = db.execute_batch(&format!(r#"
                        BEGIN;

                        -- Upsert blob record
                        INSERT INTO blobs (blob_id, size, refcount, created_at)
                        VALUES ('{blob_id}', {size}, 1, strftime('%s', 'now'))
                        ON CONFLICT(blob_id) DO UPDATE SET refcount = refcount + 1;

                        -- Update or insert node
                        -- (simplified - real impl needs path resolution)

                        -- Audit log
                        INSERT INTO audit (timestamp, operation, path)
                        VALUES (strftime('%s', 'now'), 'write', '{path}');

                        COMMIT;
                    "#));

                    let _ = reply.send(result.map_err(|e| FsError::Backend(e.to_string())));
                }

                WriteCmd::Remove { path, reply } => {
                    // Decrement refcount (GC cleans up when refcount = 0)
                    let result = db.execute_batch(&format!(r#"
                        BEGIN;

                        -- Get blob_id before delete
                        -- Decrement refcount
                        -- Remove node
                        -- Audit log

                        COMMIT;
                    "#));

                    let _ = reply.send(result.map_err(|e| FsError::Backend(e.to_string())));
                }

                // ... other commands
            }
        };
    }
}
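
The struct above holds a write_tx and a _writer_handle, but the wiring is not shown. A hedged sketch of a constructor that opens SQLite, spawns write_worker on an already-running Tokio runtime, and keeps both; the open signature and the WriterHandle definition are assumptions, not anyfs-indexed API:

// Hypothetical wiring: a single background task owns every SQLite mutation.
pub struct WriterHandle(tokio::task::JoinHandle<()>);

impl CustomIndexedBackend {
    pub fn open(
        db_path: impl AsRef<Path>,
        blobs: impl BlobStore + 'static,
    ) -> Result<Self, FsError> {
        let conn = Connection::open(db_path.as_ref())
            .map_err(|e| FsError::Backend(e.to_string()))?;
        let db = Arc::new(Mutex::new(conn));
        let blobs: Arc<dyn BlobStore> = Arc::new(blobs);

        let (write_tx, write_rx) = mpsc::unbounded_channel();

        // Single-writer pattern: the worker is the only code that mutates SQLite.
        let worker = tokio::spawn(write_worker(Arc::clone(&db), Arc::clone(&blobs), write_rx));

        Ok(Self {
            db,
            blobs,
            write_tx,
            _writer_handle: Arc::new(WriterHandle(worker)),
        })
    }
}

This is also the shape the middleware example below assumes when it calls CustomIndexedBackend::open("drive.db", LocalCasBackend::new("./blobs")).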

Deduplication

Content-addressing gives you dedup for free:

impl BlobStore for LocalCasBackend {
    fn put(&self, data: &[u8]) -> Result<String, BlobError> {
        // Hash the content
        let hash = sha256(data);
        let blob_id = hex::encode(hash);

        // Check if already exists
        let blob_path = self.root.join(&blob_id[0..2]).join(&blob_id);

        if blob_path.exists() {
            // Already have this content - dedup!
            return Ok(blob_id);
        }

        // Store new blob
        std::fs::create_dir_all(blob_path.parent().unwrap())?;
        std::fs::write(&blob_path, data)?;

        Ok(blob_id)
    }

    // ... other BlobStore methods (get, exists, delete, open_read, open_write)
}

Dedup in action:

  • User A writes report.pdf (10 MB) → blob abc123, refcount = 1
  • User B writes identical report.pdf → same blob abc123, refcount = 2
  • Physical storage: 10 MB (not 20 MB)
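
The same behavior is easy to spot-check against any content-addressed BlobStore, for example the hypothetical MemoryBlobStore sketched earlier:

// Two identical payloads hash to the same blob_id, so only one blob is stored.
let store = MemoryBlobStore::default();
let first = store.put(b"identical report bytes")?;
let second = store.put(b"identical report bytes")?;
assert_eq!(first, second); // same content hash, same blob, stored once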

Refcount Management

-- On file write (new reference to blob)
UPDATE blobs SET refcount = refcount + 1 WHERE blob_id = ?;

-- On file delete
UPDATE blobs SET refcount = refcount - 1 WHERE blob_id = ?;

-- On copy (no blob copy needed!)
UPDATE blobs SET refcount = refcount + 1 WHERE blob_id = ?;

SQLite Performance

The SQLite metadata database benefits from the same tuning as SqliteBackend:

Setting      | Default     | Purpose                           | Tradeoff
-------------|-------------|-----------------------------------|--------------------------------
journal_mode | WAL         | Concurrent reads during writes    | Creates .wal/.shm files
synchronous  | FULL        | Index integrity on power loss     | Safe default, opt-in to NORMAL
cache_size   | 16 MB       | Smaller cache for metadata-only   | Tune based on index size
busy_timeout | 5000 ms     | Gracefully handle lock contention | Prevents SQLITE_BUSY errors
auto_vacuum  | INCREMENTAL | Reclaim space from deletions      | Gradual space recovery

Why FULL synchronous: Index corruption means paths no longer resolve to blobs—blobs become orphaned and unreachable. Use FULL as the safe default; opt-in to NORMAL only with battery-backed storage or when index can be rebuilt.
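
A hedged sketch of applying those settings when opening the metadata database with rusqlite; the values mirror the table above, the open_tuned name is illustrative, and pragma_update's exact signature varies slightly across rusqlite versions:

// Open the metadata database with the tuning from the table above.
fn open_tuned(path: &Path) -> rusqlite::Result<Connection> {
    let conn = Connection::open(path)?;
    conn.pragma_update(None, "journal_mode", "WAL")?;
    conn.pragma_update(None, "synchronous", "FULL")?;
    conn.pragma_update(None, "cache_size", -16000)?; // negative value = size in KiB (~16 MB)
    conn.pragma_update(None, "busy_timeout", 5000)?; // milliseconds
    // auto_vacuum only takes effect on an empty database or after a VACUUM.
    conn.pragma_update(None, "auto_vacuum", "INCREMENTAL")?;
    Ok(conn)
}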

SQL Indexes (critical):

CREATE INDEX idx_nodes_parent ON nodes(parent);
CREATE INDEX idx_nodes_blob ON nodes(blob_id) WHERE blob_id IS NOT NULL;
CREATE INDEX idx_blobs_refcount ON blobs(refcount) WHERE refcount = 0;

Without proper indexes, path lookups become full table scans—catastrophic for large filesystems.

Connection pooling: 4-8 reader connections for concurrent metadata queries; single writer for updates. See SQLite Operations Guide for detailed patterns.


Garbage Collection

Blobs with refcount = 0 are orphans and can be deleted:

impl CustomIndexedBackend {
    /// Run garbage collection (call periodically or on-demand).
    pub fn gc(&self) -> Result<GcStats, FsError> {
        let db = self.db.lock().map_err(|_| FsError::Backend("lock".into()))?;

        // Find orphaned blobs
        let orphans: Vec<String> = db.prepare(
            "SELECT blob_id FROM blobs WHERE refcount = 0"
        )?.query_map([], |row| row.get(0))?
          .filter_map(|r| r.ok())
          .collect();

        drop(db);

        // Delete from blob store
        let mut deleted = 0;
        for blob_id in &orphans {
            if self.blobs.delete(blob_id).is_ok() {
                deleted += 1;
            }
        }

        // Remove from SQLite
        let db = self.db.lock().unwrap();
        db.execute(
            "DELETE FROM blobs WHERE refcount = 0",
            [],
        )?;

        Ok(GcStats { orphans_found: orphans.len(), blobs_deleted: deleted })
    }
}

GC Safety:

  • Never delete blobs referenced by snapshots
  • Add snapshot_refs table or use refcount that includes snapshot references
  • Run GC in background, not during writes
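
A hedged sketch of that last point, assuming a Tokio runtime and that FsError implements Debug; the interval and the spawn_gc name are arbitrary:

// Run gc() periodically off the async executor so it never blocks reads or writes.
use std::time::Duration;

fn spawn_gc(backend: Arc<CustomIndexedBackend>) -> tokio::task::JoinHandle<()> {
    tokio::spawn(async move {
        let mut ticker = tokio::time::interval(Duration::from_secs(60 * 60));
        loop {
            ticker.tick().await;

            // gc() does blocking SQLite and blob I/O, so hand it to the blocking pool.
            let backend = Arc::clone(&backend);
            if let Ok(Err(e)) = tokio::task::spawn_blocking(move || backend.gc()).await {
                // Not fatal: orphans are simply retried on the next pass.
                eprintln!("gc failed: {e:?}");
            }
        }
    })
}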

Snapshots and Backup

Creating a Snapshot

impl CustomIndexedBackend {
    /// Create a point-in-time snapshot.
    pub fn snapshot(&self, name: &str) -> Result<SnapshotId, FsError> {
        let db = self.db.lock().unwrap();

        db.execute_batch(&format!(r#"
            BEGIN;

            -- Record snapshot
            INSERT INTO snapshots (name, created_at, root_manifest)
            VALUES ('{name}', strftime('%s', 'now'),
                    (SELECT json_group_array(blob_id) FROM blobs WHERE refcount > 0));

            -- Pin all current blobs (prevent GC)
            UPDATE blobs SET refcount = refcount + 1
            WHERE blob_id IN (SELECT blob_id FROM nodes WHERE blob_id IS NOT NULL);

            COMMIT;
        "#))?;

        Ok(SnapshotId(name.to_string()))
    }

    /// Export as a single portable artifact.
    pub fn export(&self, dest: impl AsRef<Path>) -> Result<(), FsError> {
        // 1. SQLite backup API for metadata
        let db = self.db.lock().unwrap();
        // rusqlite's backup API (requires the "backup" feature) copies straight to a file path.
        db.backup(rusqlite::DatabaseName::Main, dest.as_ref().join("metadata.db"), None)?;

        // 2. Copy referenced blobs
        let blob_ids: Vec<String> = db.prepare(
            "SELECT DISTINCT blob_id FROM nodes WHERE blob_id IS NOT NULL"
        )?.query_map([], |row| row.get(0))?
          .filter_map(|r| r.ok())
          .collect();

        drop(db);

        let blobs_dir = dest.as_ref().join("blobs");
        std::fs::create_dir_all(&blobs_dir)?;

        for blob_id in blob_ids {
            let data = self.blobs.get(&blob_id)?;
            std::fs::write(blobs_dir.join(&blob_id), data)?;
        }

        Ok(())
    }
}

Middleware Integration

Middleware works unchanged - it wraps the hybrid backend like any other:

use anyfs::{FileStorage, QuotaLayer, TracingLayer, PathFilterLayer};

let backend = CustomIndexedBackend::open("drive.db", LocalCasBackend::new("./blobs"))?;

// Standard middleware stack
let backend = backend
    .layer(QuotaLayer::builder()
        .max_total_size(50 * 1024 * 1024 * 1024)  // 50 GB
        .build())
    .layer(PathFilterLayer::builder()
        .deny("**/.env")
        .build())
    .layer(TracingLayer::new());

let fs = FileStorage::new(backend);

// Use like any other filesystem
fs.write("/documents/report.pdf", &pdf_bytes)?;

Quota tracking note: QuotaLayer tracks logical size (what users see), not physical size (with dedup). For physical tracking, the backend could expose physical_usage() separately.
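
A hedged sketch of what that physical_usage() accessor could look like; the method is hypothetical and not part of any existing trait:

impl CustomIndexedBackend {
    /// Physical bytes stored: each blob counted once, no matter how many paths reference it.
    /// Logical usage would instead sum nodes.size across all file nodes.
    pub fn physical_usage(&self) -> Result<u64, FsError> {
        let db = self.db.lock().map_err(|_| FsError::Backend("lock poisoned".into()))?;

        let bytes: i64 = db.query_row(
            "SELECT COALESCE(SUM(size), 0) FROM blobs WHERE refcount > 0",
            [],
            |row| row.get(0),
        ).map_err(|e| FsError::Backend(e.to_string()))?;

        Ok(bytes as u64)
    }
}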


Async Considerations

The hybrid pattern benefits significantly from async (ADR-024):

Operation         | Sync Pain       | Async Benefit
------------------|-----------------|----------------------
Blob upload to S3 | Blocks thread   | Concurrent uploads
Multiple reads    | Sequential      | Parallel fetches
Write queue       | blocking_recv() | Native async channel
GC                | Blocks all ops  | Background task

When AsyncFs traits exist (ADR-024), the hybrid backend can use them naturally:

#[async_trait]
impl AsyncFsRead for CustomIndexedBackend {
    async fn read(&self, path: &Path) -> Result<Vec<u8>, FsError> {
        let blob_id = self.lookup_blob_id(path).await?;
        self.blobs.get_async(&blob_id).await  // Non-blocking!
    }
}

Identified Gaps

Areas where the current framework could be enhanced:

Gap                      | Current State              | Recommendation
-------------------------|----------------------------|------------------------------
Two-phase commit pattern | Not documented             | Add to backend guide
Refcount/GC patterns     | Not documented             | Add section
Streaming large files    | open_read/open_write exist | Document chunked patterns
Physical vs logical size | Quota tracks logical only  | Consider PhysicalStats trait
Background tasks (GC)    | No pattern                 | Document spawn pattern
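
For the streaming-large-files gap, a hedged sketch of the chunked pattern on top of the open_write/BlobWriter interface defined earlier; the chunk size is arbitrary and the error plumbing assumes BlobError implements std::error::Error:

// Stream from any reader into the blob store in fixed-size chunks, never buffering the
// whole file. finalize() returns the blob_id (UUID or content hash, depending on the store).
use std::io::{Read, Write};

fn stream_into_blob(
    blobs: &dyn BlobStore,
    mut source: impl Read,
) -> Result<String, Box<dyn std::error::Error>> {
    let mut writer = blobs.open_write()?;
    let mut chunk = vec![0u8; 8 * 1024 * 1024]; // 8 MiB per read

    loop {
        let n = source.read(&mut chunk)?;
        if n == 0 {
            break;
        }
        writer.write_all(&chunk[..n])?;
    }

    writer.finalize().map_err(Into::into)
}

The returned blob_id then goes through the same two-phase commit as write(): queue a WriteCmd so SQLite records the new reference.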

Summary

Framework validation: PASSED

The current AnyFS trait design supports hybrid backends:

  • Traits define operations, not storage
  • Interior mutability allows single-writer patterns
  • Middleware composes unchanged
  • Async strategy (ADR-024) enhances this pattern

Key patterns for hybrid backends:

  1. Single-writer queue for SQLite
  2. Two-phase commit (blob upload → SQLite commit)
  3. Content-addressing for dedup
  4. Refcounting for GC safety
  5. Snapshot pinning for backup safety

This validates that AnyFS is flexible enough for advanced storage architectures while maintaining its simple middleware composition model.