PortaPack Architecture

Overview

PortaPack bundles entire websites (HTML, CSS, JavaScript, images, and fonts) into self-contained HTML files for offline access. This document outlines the architectural components that make up the system.

```mermaid
graph TD
    CLI[CLI Entry Point] --> Options[Options Parser]
    Options --> Core
    API[API Entry Point] --> Core

    subgraph Core ["Core Pipeline"]
        Parser[HTML Parser] --> Extractor[Asset Extractor]
        Extractor --> Minifier[Asset Minifier]
        Minifier --> Packer[HTML Packer]
    end

    subgraph Recursive ["Advanced Features"]
        WebFetcher[Web Fetcher] --> MultipageBundler[Multipage Bundler]
    end

    WebFetcher --> Parser
    Core --> Output[Bundled HTML]
    MultipageBundler --> Output

    subgraph Utilities ["Utilities"]
        Logger[Logger]
        MimeUtils[MIME Utilities]
        BuildTimer[Build Timer]
        Slugify[URL Slugifier]
    end

    Logger -.-> CLI
    Logger -.-> Core
    Logger -.-> Recursive
    MimeUtils -.-> Extractor
    MimeUtils -.-> Parser
    BuildTimer -.-> CLI
    BuildTimer -.-> API
    Slugify -.-> MultipageBundler
```

Entry Points

CLI Interface

The command-line interface runs PortaPack from the terminal:

Component	Purpose
`cli-entry.ts`	Executable entry point with shebang support
`cli.ts`	Main runner that processes args and manages execution
`options.ts`	Parses command-line arguments and normalizes options

API Interface

The programmatic API enables developers to integrate PortaPack into their applications:

Component	Purpose
`index.ts`	Exports public functions like `pack()` with TypeScript types
`types.ts`	Defines shared interfaces and types for the entire system

Core Pipeline

The bundling process follows a clear four-stage pipeline:

1. HTML Parser (`parser.ts`)

The parser reads and analyzes the input HTML:

Uses Cheerio for robust HTML parsing
Identifies linked assets through element attributes (href, src, etc.)
Creates an initial asset list with URLs and inferred types
Handles both local file paths and remote URLs
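
As a rough illustration of the discovery step described above, the sketch below scans an HTML string for `src` and `href` attributes and guesses each asset's type from its tag and file extension. It uses a plain regular expression as a stand-in for Cheerio so that it runs without dependencies; the `Asset` shape and the `discoverAssets` and `inferType` names are hypothetical and are not taken from `parser.ts`.

```typescript
// Hypothetical asset record; the real shape in types.ts may differ.
type AssetType = "css" | "js" | "image" | "font" | "other";

interface Asset {
  url: string;
  type: AssetType;
}

// Guess an asset type from the tag name and the URL's file extension.
function inferType(tag: string, url: string): AssetType {
  const path = url.split(/[?#]/)[0].toLowerCase();
  if (tag === "script" || path.endsWith(".js")) return "js";
  if (path.endsWith(".css")) return "css";
  if (tag === "img" || /\.(png|jpe?g|gif|svg|webp)$/.test(path)) return "image";
  if (/\.(woff2?|ttf|otf)$/.test(path)) return "font";
  return "other";
}

// Regex stand-in for a real HTML parser: finds <link>, <script>, and <img>
// tags and pulls out the value of their href or src attribute.
function discoverAssets(html: string): Asset[] {
  const tagPattern =
    /<(link|script|img)\b[^>]*?\b(?:href|src)\s*=\s*["']([^"']+)["']/gi;
  const assets: Asset[] = [];
  let match: RegExpExecArray | null;
  while ((match = tagPattern.exec(html)) !== null) {
    const tag = match[1].toLowerCase();
    assets.push({ url: match[2], type: inferType(tag, match[2]) });
  }
  return assets;
}

const sampleHtml = `
  <link rel="stylesheet" href="styles/main.css">
  <script src="https://example.com/app.js"></script>
  <img src="logo.png" alt="Logo">
`;
console.log(discoverAssets(sampleHtml));
```

A real parser handles cases this pattern misses, such as unquoted attributes or markup inside comments, which is one reason to prefer a library like Cheerio.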

2. Asset Extractor (`extractor.ts`)

The extractor resolves and fetches all referenced assets:

Resolves relative URLs against the base context
Fetches content for all discovered assets
Recursively extracts nested assets from CSS (@import, url())
Handles protocol-relative URLs and different origins
Provides detailed logging of asset discovery
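
The resolution and CSS scanning steps above can be sketched with the standard `URL` constructor and a simplified regular expression. The function names `resolveAssetUrl` and `extractCssReferences` are illustrative assumptions, not the names used in `extractor.ts`, and the pattern ignores CSS comments and escaped characters.

```typescript
// Resolve a relative or protocol-relative reference against a base URL
// using the standard WHATWG URL constructor.
function resolveAssetUrl(reference: string, baseUrl: string): string {
  return new URL(reference, baseUrl).href;
}

// Collect references from CSS url(...) values and quoted @import rules.
// Simplified: comments and escaped characters are not handled.
function extractCssReferences(css: string): string[] {
  const pattern =
    /url\(\s*["']?([^"')]+)["']?\s*\)|@import\s+["']([^"']+)["']/g;
  const references: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(css)) !== null) {
    const reference = (match[1] ?? match[2] ?? "").trim();
    // Skip data URIs, which are already inline.
    if (reference && !reference.startsWith("data:")) references.push(reference);
  }
  return references;
}

const stylesheetUrl = "https://example.com/css/site.css";
const stylesheet = `@import "theme.css"; body { background: url('../img/bg.png'); }`;
console.log(
  extractCssReferences(stylesheet).map((ref) => resolveAssetUrl(ref, stylesheetUrl)),
);
```

Resolving each nested reference against the URL of the stylesheet that contains it, rather than the page URL, is what keeps relative paths such as `../img/bg.png` pointing at the right file.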

3. Asset Minifier (`minifier.ts`)

The minifier reduces the size of all content:

Minifies HTML using html-minifier-terser
Minifies CSS using clean-css
Minifies JavaScript using terser
Preserves original content if minification fails
Configurable through command-line flags
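
The fallback behavior above (keep the original content when minification fails) can be sketched as a small wrapper. This is a minimal sketch under assumptions: the real minifiers, some of which expose asynchronous APIs, are replaced by toy synchronous functions, and `minifyOrKeep` and its `enabled` parameter are hypothetical names standing in for whatever `minifier.ts` and the command-line flags actually use.

```typescript
// A minifier is modeled here as any function from source text to smaller text.
type Minify = (source: string) => string;

// Run a minifier, but keep the original content if minification is disabled,
// throws, or produces an empty result.
function minifyOrKeep(source: string, minify: Minify, enabled = true): string {
  if (!enabled) return source;
  try {
    const result = minify(source);
    return result.length > 0 ? result : source;
  } catch {
    return source;
  }
}

// Toy stand-ins for real minifiers (assumptions for illustration only).
const collapseWhitespace: Minify = (source) => source.replace(/\s+/g, " ").trim();
const alwaysFails: Minify = () => {
  throw new Error("parse error");
};

console.log(minifyOrKeep("a  {  color: red;  }", collapseWhitespace));
console.log(minifyOrKeep("a  {  color: red;  }", alwaysFails));
```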

4. HTML Packer (`packer.ts`)

The packer combines everything into a single file:

Inlines CSS into <style> tags
Embeds JavaScript into <script> tags
Converts binary assets to data URIs
Handles srcset attributes properly
Ensures a valid HTML structure, including a <base> tag
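
Two of the packing steps above, data URI conversion and `srcset` handling, can be sketched as follows. The helper names `toDataUri` and `rewriteSrcset` are hypothetical, and the `srcset` splitter assumes that the original URLs contain no commas.

```typescript
// Encode raw bytes as a base64 data URI with the given MIME type.
function toDataUri(bytes: Uint8Array, mimeType: string): string {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  return `data:${mimeType};base64,${btoa(binary)}`;
}

// Rewrite each URL in a srcset value while keeping its width or density
// descriptor. Simplified: assumes the original URLs contain no commas.
function rewriteSrcset(srcset: string, replace: (url: string) => string): string {
  return srcset
    .split(",")
    .map((candidate) => {
      const [url, ...descriptor] = candidate.trim().split(/\s+/);
      return [replace(url), ...descriptor].join(" ");
    })
    .join(", ");
}

// Example: embed two tiny "images" whose bytes are made up for illustration.
const fakeImages: Record<string, Uint8Array> = {
  "small.png": new Uint8Array([1, 2, 3]),
  "large.png": new Uint8Array([4, 5, 6]),
};
console.log(
  rewriteSrcset("small.png 1x, large.png 2x", (url) =>
    toDataUri(fakeImages[url], "image/png"),
  ),
);
```

Each candidate in a `srcset` has to be rewritten individually; replacing the attribute with a single data URI would drop the descriptors that let the browser choose between resolutions.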

Advanced Features

Web Fetcher (`web-fetcher.ts`)

For remote content, the web fetcher provides crawling capabilities:

Uses Puppeteer for fully rendered page capture
Crawls websites recursively to specified depth
Restricts crawling to same-origin links by default
Manages browser instances efficiently
Provides detailed logging of the crawl process
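
The crawl strategy above can be sketched as a breadth-first traversal that stops at a depth limit and skips other origins. To stay runnable without Puppeteer, the sketch takes a hypothetical `getLinks` callback in place of a rendered page, and it assumes that a depth of 1 means "the start page only"; the actual depth semantics in `web-fetcher.ts` may differ.

```typescript
// Hypothetical link lookup: returns the links found on a page.
// In a real crawler these would be read from the rendered page.
type GetLinks = (url: string) => string[];

// Breadth-first crawl limited by depth and restricted to the start URL's origin.
function crawl(startUrl: string, maxDepth: number, getLinks: GetLinks): string[] {
  const start = new URL(startUrl);
  const visited = new Set<string>([start.href]);
  let frontier = [start.href];
  for (let depth = 1; depth < maxDepth; depth++) {
    const next: string[] = [];
    for (const page of frontier) {
      for (const link of getLinks(page)) {
        const absolute = new URL(link, page);
        absolute.hash = ""; // fragments point at the same document
        if (absolute.origin !== start.origin || visited.has(absolute.href)) continue;
        visited.add(absolute.href);
        next.push(absolute.href);
      }
    }
    frontier = next;
  }
  return Array.from(visited);
}

// A made-up site graph used only for this example.
const site: Record<string, string[]> = {
  "https://example.com/": ["/about", "https://other.example.org/", "/blog#top"],
  "https://example.com/about": ["/", "/team"],
  "https://example.com/blog": [],
};
const getLinks: GetLinks = (url) => site[url] ?? [];
console.log(crawl("https://example.com/", 2, getLinks));
```

Tracking visited URLs in a set keeps pages that link back to each other from being captured twice.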

Multipage Bundler (`bundler.ts`)

For bundling multiple pages into a single file:

Combines multiple HTML documents into one
Creates a client-side router for navigation
Generates a navigation interface
Uses slugs for routing between pages
Handles page templates and content swapping
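
One way to picture the approach above is a single document that stores each page in a `<template>` element and swaps the active one into view based on the URL hash. The sketch below is an assumption about the general technique, not the markup that `bundler.ts` actually emits; the `Page` shape and the `bundlePages` name are hypothetical, and titles are inserted without escaping for brevity.

```typescript
interface Page {
  slug: string; // safe identifier used for routing
  title: string;
  html: string; // body markup for the page
}

// Build one HTML document holding every page in a <template>, a navigation
// bar, and a tiny hash-based router that swaps the active page into <main>.
function bundlePages(pages: Page[]): string {
  const firstSlug = pages.length > 0 ? pages[0].slug : "";
  const templates = pages
    .map((p) => `<template id="page-${p.slug}">${p.html}</template>`)
    .join("\n");
  const nav = pages.map((p) => `<a href="#${p.slug}">${p.title}</a>`).join(" ");
  const router = `
    function show() {
      var slug = location.hash.slice(1) || ${JSON.stringify(firstSlug)};
      var tpl = document.getElementById("page-" + slug);
      if (tpl) document.querySelector("main").innerHTML = tpl.innerHTML;
    }
    window.addEventListener("hashchange", show);
    show();`;
  return `<!DOCTYPE html>
<html><body>
<nav>${nav}</nav>
<main></main>
${templates}
<script>${router}</script>
</body></html>`;
}

console.log(
  bundlePages([
    { slug: "index", title: "Home", html: "<h1>Home</h1>" },
    { slug: "about", title: "About", html: "<p>About us</p>" },
  ]),
);
```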

Utilities

Logger (`logger.ts`)

Customizable log levels (debug, info, warn, error)
Consistent logging format across the codebase
Optional timestamps and colored output
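
Level-based filtering of the kind listed above can be sketched in a few lines. The `createLogger` factory and its `sink` parameter are hypothetical and do not reflect the actual interface of `logger.ts`; timestamps and colors are omitted.

```typescript
// Levels ordered from most to least verbose.
const LEVELS = ["debug", "info", "warn", "error"] as const;
type Level = (typeof LEVELS)[number];

// Create a logger that drops messages below the configured level.
// The sink defaults to console.log but can be swapped out, e.g. in tests.
function createLogger(minLevel: Level, sink: (line: string) => void = console.log) {
  const threshold = LEVELS.indexOf(minLevel);
  const log = (level: Level, message: string): void => {
    if (LEVELS.indexOf(level) >= threshold) {
      sink(`[${level.toUpperCase()}] ${message}`);
    }
  };
  return { log };
}

const logger = createLogger("info");
logger.log("debug", "resolved 12 asset URLs"); // filtered out
logger.log("warn", "could not fetch font.woff2"); // printed
```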

MIME Utilities (`mime.ts`)

Maps file extensions to correct MIME types
Categorizes assets by type (CSS, JS, image, font)
Provides fallbacks for unknown extensions
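
An extension lookup with a fallback, as described above, might look like the following sketch. The table is deliberately small, and `guessMimeType` is a hypothetical name rather than the function exported by `mime.ts`.

```typescript
// A small illustrative table; a real mapping would cover many more types.
const MIME_BY_EXTENSION: Record<string, string> = {
  ".css": "text/css",
  ".js": "text/javascript",
  ".png": "image/png",
  ".jpg": "image/jpeg",
  ".svg": "image/svg+xml",
  ".woff2": "font/woff2",
};

// Look up a MIME type by file extension, ignoring any query string or
// fragment, and fall back to a generic binary type for unknown extensions.
function guessMimeType(urlOrPath: string): string {
  const path = urlOrPath.split(/[?#]/)[0];
  const dot = path.lastIndexOf(".");
  const extension = dot === -1 ? "" : path.slice(dot).toLowerCase();
  return MIME_BY_EXTENSION[extension] ?? "application/octet-stream";
}

console.log(guessMimeType("https://example.com/css/site.css?v=3"));
console.log(guessMimeType("archive.xyz"));
```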

Build Timer (`meta.ts`)

Tracks build performance metrics
Records asset counts and page counts
Captures output size and build duration
Collects errors and warnings for reporting
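
A collector for the metrics listed above could be sketched like this. The field names in `BuildSummary` and the `createBuildTimer` factory are assumptions made for illustration; `meta.ts` may record different fields. The clock is injectable so that the duration can be tested deterministically.

```typescript
// Hypothetical summary shape; the real metadata fields may differ.
interface BuildSummary {
  assetCount: number;
  pageCount: number;
  outputBytes: number;
  durationMs: number;
  errors: string[];
}

// Start timing immediately and accumulate counts until finish() is called.
function createBuildTimer(now: () => number = Date.now) {
  const startedAt = now();
  let assetCount = 0;
  let pageCount = 0;
  const errors: string[] = [];
  return {
    addAsset(): void {
      assetCount += 1;
    },
    addPage(): void {
      pageCount += 1;
    },
    addError(message: string): void {
      errors.push(message);
    },
    finish(output: string): BuildSummary {
      return {
        assetCount,
        pageCount,
        outputBytes: new TextEncoder().encode(output).length, // UTF-8 size
        durationMs: now() - startedAt,
        errors: errors.slice(),
      };
    },
  };
}

const timer = createBuildTimer();
timer.addPage();
timer.addAsset();
timer.addError("could not fetch https://example.com/missing.png");
console.log(timer.finish("<html></html>"));
```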

URL Slugifier (`slugify.ts`)

Converts URLs to safe HTML IDs
Handles special characters and normalization
Prevents slug collisions in multipage bundles
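
The slug rules above can be sketched as a normalization function plus a small stateful wrapper that appends a numeric suffix on collision. Both `slugifyUrl` and `createSlugger` are hypothetical names, the suffix scheme is an assumption, and percent-escapes in paths are not decoded in this sketch.

```typescript
// Turn a URL's path into a lowercase, hyphen-separated identifier.
// The query string and fragment are ignored; an empty path becomes "index".
function slugifyUrl(url: string): string {
  // The placeholder base lets relative paths parse; it never appears in output.
  const { pathname } = new URL(url, "https://placeholder.invalid");
  const slug = pathname
    .toLowerCase()
    .replace(/\.html?$/, "")
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
  return slug || "index";
}

// Return a slug function that remembers what it has handed out and appends
// -2, -3, ... until the result is unique.
function createSlugger(): (url: string) => string {
  const used = new Set<string>();
  return (url) => {
    const base = slugifyUrl(url);
    let candidate = base;
    for (let n = 2; used.has(candidate); n++) candidate = `${base}-${n}`;
    used.add(candidate);
    return candidate;
  };
}

const toSlug = createSlugger();
console.log(toSlug("https://example.com/docs/Getting_Started.html"));
console.log(toSlug("https://example.com/page?id=1"));
console.log(toSlug("https://example.com/page?id=2")); // same path, so a suffix is added
```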

Asynchronous Processing

PortaPack uses modern async patterns throughout:

Promise-based Pipeline: Each stage returns promises that are awaited
Sequential Processing: Assets are processed one at a time to avoid overwhelming resources
Error Boundaries: Individual asset failures don't break the entire pipeline
Resource Management: Browser instances and file handles are properly closed
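
The sequential processing and error boundary patterns above can be sketched as a loop that awaits each item in turn and records failures instead of rethrowing them. The `Outcome` type, `processSequentially`, and the fake fetcher are all hypothetical and are not taken from the PortaPack source.

```typescript
// Result of processing one item: either a value or a recorded failure.
type Outcome<T> =
  | { ok: true; item: string; value: T }
  | { ok: false; item: string; error: string };

// Process items one at a time. A failure is captured in the result list
// rather than aborting the remaining items.
async function processSequentially<T>(
  items: string[],
  worker: (item: string) => Promise<T>,
): Promise<Outcome<T>[]> {
  const outcomes: Outcome<T>[] = [];
  for (const item of items) {
    try {
      outcomes.push({ ok: true, item, value: await worker(item) });
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);
      outcomes.push({ ok: false, item, error: message });
    }
  }
  return outcomes;
}

// Fake fetcher for illustration: fails for any URL containing "missing".
const fakeFetch = async (url: string): Promise<number> => {
  if (url.indexOf("missing") !== -1) throw new Error(`404 for ${url}`);
  return url.length;
};

processSequentially(["a.css", "missing.png", "b.js"], fakeFetch).then((outcomes) =>
  console.log(outcomes),
);
```

Awaiting inside the loop trades throughput for predictability: only one request is in flight at a time, and a single failed asset leaves the remaining items untouched.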

Build System

PortaPack uses a dual build configuration:

Build Target	Format	Purpose
CLI	CommonJS (.cjs)	Works with Node.js and npx
API	ES Module (.js)	Modern import/export support

TypeScript declarations (.d.ts) are generated for API consumers, and source maps support debugging.

Current Limitations

Script Execution Issues

Inlined scripts with async/defer attributes lose their intended loading behavior
ES Modules with import/export statements may fail after bundling
Script execution order can change, breaking dependencies

Content Limitations

CORS policies may prevent access to some cross-origin resources
Only initially rendered content from SPAs is captured by default
Very large sites produce impractically large HTML files

Technical Constraints

No streaming API or WebSocket support
Service worker capabilities are not preserved
Large sites can cause high memory usage during bundling
Limited support for authenticated content
