Security Model

Dual-Layer Isolation.

Every execution runs in its own VM. If someone escapes the VM, they land in a jailed process with no capabilities, a minimal filesystem, and a syscall filter that kills on violation.

Layer 1: VM isolation via libkrun, Firecracker, Cloud Hypervisor, or QEMU. Separate kernel, memory, process tree.

Layer 2: The VMM process itself is jailed — strict sandbox with syscall filtering, filesystem restrictions, resource limits, and full capability drop.

The test suite has 277+ tests, including 37 security tests (18 adversarial escape, 19 isolation verification) that verify escape attempts are blocked.


VMM Backends

Four backends behind a common trait. Same code, same protocol, same result format — pick the right trade-off per workload.

libkrun (default, macOS + Linux)

Embedded VMM via FFI. No separate process. Transparent socket networking (TSI). The fastest path from code to VM.

~337ms end-to-end

Firecracker (Linux only)

AWS Lambda's VMM. Minimal device model, battle-tested jailer. ext4 block device rootfs. The production hardening choice.

~344ms end-to-end

Cloud Hypervisor (Linux only)

Modern rust-vmm VMM. virtio-fs rootfs via virtiofsd. Snapshot/restore, warm migration, GPU passthrough via VFIO.

~307ms end-to-end

QEMU (Linux only)

Full device emulation. Broadest hardware support. GPU passthrough with OVMF firmware boot. The escape hatch for complex workloads.

~555ms end-to-end

[Spectrum shown on the page: fastest boot at one end, most features at the other.]

hotcell::backend::VmmBackend

/// Pluggable VMM backend trait.
/// Everything above the backend -- OCI pipeline, server,
/// result protocol -- is backend-agnostic.
#[async_trait]
pub trait VmmBackend: Send + Sync {
    /// Run a VM and return the result.
    async fn run(
        &self, config: &VmConfig, worker_bin: &Path,
    ) -> Result<VmResult, HotcellError>;

    /// Run a VM with streaming console output.
    async fn run_streaming(
        &self, config: &VmConfig, worker_bin: &Path,
        tx: broadcast::Sender<StreamEvent>,
    ) -> Result<VmResult, HotcellError>;

    /// Create a persistent VM that outlives a single request.
    async fn create_persistent(
        &self, config: &VmConfig, worker_bin: &Path,
    ) -> Result<Box<dyn PersistentVmHandle>, HotcellError>;

    /// Human-readable backend name.
    fn name(&self) -> &'static str;
}

// Four implementations in separate crates:
// - hotcell_libkrun::LibkrunBackend      (macOS + Linux)
// - hotcell_firecracker::FirecrackerBackend  (Linux only)
// - hotcell_ch::ChBackend                (Linux only)
// - hotcell_qemu::QemuBackend            (Linux only)
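
To illustrate why a common trait keeps everything above the backend unchanged, here is a minimal synchronous sketch. It is not the real hotcell API: `Backend`, `FakeBackend`, `select_backend`, the simplified `VmConfig` and `VmResult`, and the backend name strings are all hypothetical stand-ins, and the real trait is async.

```rust
/// Hypothetical, simplified stand-ins for the real config and result types.
#[derive(Debug, Clone)]
pub struct VmConfig {
    pub image: String,
    pub memory_mib: u32,
}

#[derive(Debug, PartialEq)]
pub struct VmResult {
    pub exit_code: i32,
    pub backend: &'static str,
}

/// Error returned when a backend name is not recognized.
#[derive(Debug, PartialEq)]
pub struct UnknownBackend(pub String);

/// Synchronous stand-in for the async trait shown above.
pub trait Backend {
    fn name(&self) -> &'static str;
    fn run(&self, config: &VmConfig) -> VmResult;
}

/// A fake backend that boots nothing and always reports success.
pub struct FakeBackend(&'static str);

impl Backend for FakeBackend {
    fn name(&self) -> &'static str {
        self.0
    }

    fn run(&self, _config: &VmConfig) -> VmResult {
        VmResult { exit_code: 0, backend: self.0 }
    }
}

/// Pick a backend by name. Callers only ever see `dyn Backend`,
/// so the code that consumes `VmResult` does not change per backend.
pub fn select_backend(name: &str) -> Result<Box<dyn Backend>, UnknownBackend> {
    let known: &'static str = match name {
        "libkrun" => "libkrun",
        "firecracker" => "firecracker",
        "cloud-hypervisor" => "cloud-hypervisor",
        "qemu" => "qemu",
        other => return Err(UnknownBackend(other.to_string())),
    };
    Ok(Box::new(FakeBackend(known)))
}
```

A caller would write `select_backend("qemu")?.run(&config)` and handle the result the same way for every backend.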

libkrun (default)

- Embedded VMM via FFI: the worker calls krun_start_enter()
- virtiofs for rootfs and shared directory access
- TSI (Transparent Socket Impersonation) networking, verified on both platforms
- hotcell-jailer sandboxes the VMM process on Linux

Firecracker (Linux only)

- Separate VMM binary, configured via REST API over Unix socket
- ext4 block device images created from OCI rootfs
- Firecracker's own jailer, battle-tested in AWS Lambda
- Serial console output streamed from file for real-time monitoring

Cloud Hypervisor (Linux only)

- Separate VMM binary, configured via REST API over Unix socket
- virtio-fs rootfs via virtiofsd: host directory mounted directly, no ext4 images
- Manages virtiofsd daemon processes for each virtio-fs mount
- Requires cloud-hypervisor and virtiofsd binaries + KVM

QEMU (Linux only)

- Separate VMM binary, configured via CLI args (no QMP)
- VFIO GPU passthrough: pass physical GPUs to guest VMs for CUDA/ROCm workloads
- virtio-fs rootfs via virtiofsd, the same shared directory model as Cloud Hypervisor
- q35 machine type with IOMMU for interrupt remapping and device isolation


Layer 1: VM Isolation

All Platforms

Kernel Virtualization

Each execution runs inside its own virtual machine with a separate Linux kernel. With libkrun, the kernel is compiled into libkrunfw. With Firecracker and Cloud Hypervisor, a separate vmlinux image is used. Either way, this eliminates the shared-kernel attack surface found in traditional containerization.

Guest Properties

- Own kernel, process table, and memory space
- No access to the host filesystem: the guest sees only its own root filesystem and any explicitly shared directories
- No network access by default: networking must be explicitly enabled per-execution, with optional egress filtering to restrict which hosts the guest can reach
- Configurable resource limits: memory, CPU, and execution timeout per VM
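
The guest properties above amount to a per-execution policy with deny-by-default networking. The sketch below models that idea only; `GuestLimits`, `Network`, `may_reach`, and every default value are hypothetical and not taken from hotcell's configuration format.

```rust
use std::time::Duration;

/// Network policy for one execution (hypothetical model).
#[derive(Debug, Clone, PartialEq)]
pub enum Network {
    /// No networking at all; this is the default.
    Disabled,
    /// Networking enabled with no egress filter.
    Open,
    /// Networking enabled, but only these hosts are reachable.
    AllowList(Vec<String>),
}

/// Per-VM limits. Field names and default values are illustrative.
#[derive(Debug, Clone)]
pub struct GuestLimits {
    pub memory_mib: u32,
    pub vcpus: u8,
    pub timeout: Duration,
    pub network: Network,
}

impl Default for GuestLimits {
    fn default() -> Self {
        GuestLimits {
            memory_mib: 512,                  // assumed default
            vcpus: 1,                         // assumed default
            timeout: Duration::from_secs(30), // assumed default
            network: Network::Disabled,       // "no network access by default"
        }
    }
}

impl GuestLimits {
    /// Would this policy let the guest open a connection to `host`?
    pub fn may_reach(&self, host: &str) -> bool {
        match &self.network {
            Network::Disabled => false,
            Network::Open => true,
            Network::AllowList(hosts) => hosts.iter().any(|h| h == host),
        }
    }
}
```

Making `Disabled` the `Default` means a caller has to opt in to networking explicitly, which mirrors the per-execution opt-in described in the list.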

[Diagram: the VM boundary encloses the guest OS kernel and the user workload; only shared directories cross it.]

Linux Hardening

Layer 2: VMM Process Jail

On Linux, the VMM process is sandboxed before it boots the VM. With the libkrun backend, hotcell-libkrun-worker is sandboxed by hotcell-jailer before it configures libkrun. With the Firecracker backend, Firecracker's own jailer (battle-tested in AWS Lambda) handles sandboxing. With the Cloud Hypervisor backend, Landlock-based sandboxing is planned. The jail steps below describe hotcell-jailer for the libkrun backend.

LINUX ONLY

DEFENSE-IN-DEPTH

The jail executes 14 operations internally, grouped here into 8 categories. Each category removes a class of capability from the process: file descriptors, environment, resources, filesystem visibility, syscall access, and privileges. The steps are ordered so that each one assumes the previous steps might have been bypassed, so no single step is a single point of failure.

Close Inherited FDs

close_range(3, MAX, 0) prevents leaking host file descriptors into the jail.

Clear Environment

The inherited environment is cleared, then a single variable is set: LD_LIBRARY_PATH=/lib, which the dynamic linker needs to find libkrun inside the jail.
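
One way to get the same effect when spawning a child from Rust is `std::process::Command::env_clear` followed by a single `env` call. This is only an analogy: hotcell-jailer may clear the environment in-process instead, and `jailed_command` and `explicit_env` are hypothetical helper names.

```rust
use std::process::Command;

/// Build a command whose child starts with an empty environment
/// plus LD_LIBRARY_PATH=/lib. Nothing is spawned here.
pub fn jailed_command(program: &str) -> Command {
    let mut cmd = Command::new(program);
    cmd.env_clear(); // drop everything inherited from the parent
    cmd.env("LD_LIBRARY_PATH", "/lib"); // the one variable that survives
    cmd
}

/// List the variables explicitly configured on the command,
/// so the result can be inspected without running anything.
pub fn explicit_env(cmd: &Command) -> Vec<(String, Option<String>)> {
    cmd.get_envs()
        .map(|(key, value)| {
            (
                key.to_string_lossy().into_owned(),
                value.map(|v| v.to_string_lossy().into_owned()),
            )
        })
        .collect()
}
```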

Join Cgroup

Dedicated cgroup with memory.max, pids.max (256), and cpu.max limits applied.
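
The cgroup values can be derived with a few lines of arithmetic. This sketch reads the memory rule in Technical Specs as "guest memory plus 256 MiB of headroom, with a 512 MiB floor", which is an interpretation. The function names, the 100 ms CPU period, and the mapping of a CPU count to a quota are assumptions; the `cpu.max` string follows the cgroup v2 "quota period" format.

```rust
const MIB: u64 = 1024 * 1024;
const HEADROOM_MIB: u64 = 256; // VMM overhead on top of guest memory
const FLOOR_MIB: u64 = 512; // never go below this
const PIDS_MAX: u32 = 256; // caps fork bombs
const CPU_PERIOD_US: u64 = 100_000; // assumed period (the cgroup v2 default)

/// Value for memory.max in bytes.
pub fn memory_max_bytes(guest_mib: u64) -> u64 {
    (guest_mib + HEADROOM_MIB).max(FLOOR_MIB) * MIB
}

/// Value for cpu.max: "max <period>" when unlimited,
/// otherwise "<quota> <period>" with quota = cpus * period.
pub fn cpu_max(cpus: Option<u32>) -> String {
    match cpus {
        None => format!("max {}", CPU_PERIOD_US),
        Some(n) => format!("{} {}", u64::from(n) * CPU_PERIOD_US, CPU_PERIOD_US),
    }
}

/// The (file, contents) pairs a jailer would write into the cgroup directory.
pub fn cgroup_writes(guest_mib: u64, cpus: Option<u32>) -> Vec<(&'static str, String)> {
    vec![
        ("memory.max", memory_max_bytes(guest_mib).to_string()),
        ("pids.max", PIDS_MAX.to_string()),
        ("cpu.max", cpu_max(cpus)),
    ]
}
```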

Namespace Isolation

unshare() creates mount, PID, IPC, UTS, and network namespaces (network only when TSI is disabled).
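
The namespace set can be expressed as a bitmask passed to unshare(). The constants below are the values from the Linux headers; `unshare_flags` is a hypothetical helper, and the sketch only computes the mask without making the syscall.

```rust
// CLONE_* values as defined in <linux/sched.h>.
pub const CLONE_NEWNS: u32 = 0x0002_0000; // mount namespace
pub const CLONE_NEWUTS: u32 = 0x0400_0000; // hostname / domain name
pub const CLONE_NEWIPC: u32 = 0x0800_0000; // System V IPC, POSIX queues
pub const CLONE_NEWPID: u32 = 0x2000_0000; // process IDs
pub const CLONE_NEWNET: u32 = 0x4000_0000; // network stack

/// Flags for unshare(): always mount, PID, IPC, and UTS;
/// the network namespace is added only when TSI is disabled.
pub fn unshare_flags(tsi_enabled: bool) -> u32 {
    let mut flags = CLONE_NEWNS | CLONE_NEWPID | CLONE_NEWIPC | CLONE_NEWUTS;
    if !tsi_enabled {
        flags |= CLONE_NEWNET;
    }
    flags
}
```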

pivot_root

New root filesystem via pivot_root(), old root unmounted and removed. Host filesystem entirely invisible.

Landlock Restrictions

Mandatory access control via Landlock ABI v3 (filesystem) with optional v4 network restrictions (Linux 6.7+). The process can only access explicitly listed paths. This step is fatal: if Landlock cannot be enforced, jail setup fails.

Drop Capabilities

Two-phase capability drop: bounding set cleared via PR_CAPBSET_DROP, then all remaining sets (ambient, effective, permitted, inheritable) cleared after setuid to nobody.

Seccomp BPF

Dual BPF filter with SECCOMP_FILTER_FLAG_TSYNC: an audit-log filter records violations, then a Kill-mode filter terminates on any syscall not in the allowlist. Both applied atomically across all threads.
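
The dual-filter behavior can be modeled in user space to make the decision logic concrete. This is a simulation, not BPF: `DualFilter`, `Verdict`, and the syscall names used with it are hypothetical, and in the kernel both filters are evaluated for each syscall rather than one calling the other.

```rust
use std::collections::HashSet;

/// Outcome of evaluating one syscall against the filters.
#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    Allow,
    Kill,
}

/// User-space model of an audit-log filter stacked with a kill-mode filter.
pub struct DualFilter {
    allowlist: HashSet<&'static str>,
    /// Violations recorded by the audit filter.
    pub audit_log: Vec<String>,
}

impl DualFilter {
    pub fn new(allowed: &[&'static str]) -> Self {
        DualFilter {
            allowlist: allowed.iter().copied().collect(),
            audit_log: Vec::new(),
        }
    }

    /// Allowlisted syscalls pass. Anything else is logged, then killed.
    pub fn check(&mut self, syscall: &str) -> Verdict {
        if self.allowlist.contains(syscall) {
            return Verdict::Allow;
        }
        self.audit_log.push(syscall.to_string());
        Verdict::Kill
    }
}
```

An allowlist fails closed: a syscall nobody thought about is denied, whereas a denylist would let it through.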

Jail Filesystem

After pivot_root(), the worker's entire filesystem view is:

/
├── dev/
│   ├── kvm          # bind-mount, for VM creation
│   ├── urandom      # bind-mount, for randomness
│   └── null         # bind-mount
├── lib/             # bind-mount read-only: libkrun.so, libkrunfw.so, libc, ld-linux
├── proc/            # procfs, mounted after pivot_root
├── rootfs/          # bind-mount read-only: the OCI root filesystem
├── shares/          # bind-mount read-write: host shared directories
├── tmp/             # writable, world-writable with sticky bit
├── result/          # writable: result file directory
├── config.json      # read-only: VM configuration
├── console.log      # writable: console output
└── worker           # bind-mount read-only: hotcell-libkrun-worker binary
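
The tree above can be captured as a small table of paths and access modes. In this sketch `Access`, `JAIL_LAYOUT`, and `access_for` are hypothetical names. The /dev and /proc entries are left out because the tree does not state their access mode, so lookups under them return `None`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Access {
    ReadOnly,
    ReadWrite,
}

/// Jail paths with the access mode stated in the tree above.
pub const JAIL_LAYOUT: &[(&str, Access)] = &[
    ("/lib", Access::ReadOnly),
    ("/rootfs", Access::ReadOnly),
    ("/config.json", Access::ReadOnly),
    ("/worker", Access::ReadOnly),
    ("/shares", Access::ReadWrite),
    ("/tmp", Access::ReadWrite),
    ("/result", Access::ReadWrite),
    ("/console.log", Access::ReadWrite),
];

/// Look up the access mode for a path inside the jail.
/// A path matches an entry if it equals the entry or lies beneath it;
/// "/libfoo" does not match "/lib".
pub fn access_for(path: &str) -> Option<Access> {
    for (prefix, access) in JAIL_LAYOUT {
        let beneath = path
            .strip_prefix(*prefix)
            .map_or(false, |rest| rest.starts_with('/'));
        if path == *prefix || beneath {
            return Some(*access);
        }
    }
    None
}
```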


Networking Trade-off

libkrun's TSI (Transparent Socket Impersonation) proxies guest socket calls through the VMM process on the host via vsock. When TSI is enabled, the worker does not unshare the network namespace (CLONE_NEWNET is skipped), socket syscalls are added to the seccomp allowlist, and Landlock network restrictions are skipped. The remaining layers still constrain the process: the mount, PID, IPC, and UTS namespaces, the seccomp allowlist, and the capability drop.
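
The trade-off comes down to three toggles that all follow from one input. The sketch below records them in a struct; `JailPlan` and `plan_for` are hypothetical names, and treating Landlock network rules as dependent on ABI v4 availability is an inference from the "optional" wording.

```rust
/// The network-related jail decisions that depend on TSI.
#[derive(Debug, PartialEq, Eq)]
pub struct JailPlan {
    /// Create a separate network namespace (CLONE_NEWNET).
    pub unshare_network_ns: bool,
    /// Add socket syscalls to the seccomp allowlist.
    pub allow_socket_syscalls: bool,
    /// Apply Landlock network restrictions.
    pub landlock_network_rules: bool,
}

/// Derive the plan from whether TSI is enabled and whether the
/// kernel offers Landlock ABI v4.
pub fn plan_for(tsi_enabled: bool, landlock_v4_available: bool) -> JailPlan {
    JailPlan {
        unshare_network_ns: !tsi_enabled,
        allow_socket_syscalls: tsi_enabled,
        landlock_network_rules: !tsi_enabled && landlock_v4_available,
    }
}
```

Deriving all three from a single flag keeps them from drifting apart, for example allowing socket syscalls while also isolating the network namespace.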

Technical Specs

Item                Value              Notes
Cgroup memory.max   guest + 256 MiB    min 512 MiB
Cgroup pids.max     256                prevents fork bombs
Cgroup cpu.max      unlimited          configurable per-execution
Seccomp Mode        Kill               immediate termination
Seccomp Filters     Dual BPF           audit log + kill, TSYNC
Landlock ABI        v3 required        Linux 6.2+
Landlock Network    v4 optional        Linux 6.7+
Vsock Auth          HMAC-SHA256        prevents cross-VM injection
Hardening Items     20 total           host + guest combined


Security Hardening

20 items across host & guest


Guest Seccomp Filter

Optional hotcell-seccomp binary installs a BPF filter inside the guest, blocking ptrace, mount, unshare, and other privilege-escalation syscalls before the user workload runs.


Vsock HMAC-SHA256 Auth

Each VM receives a unique HMAC-SHA256 token over vsock. Result payloads are authenticated before acceptance, preventing cross-VM result injection attacks.
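
When an authentication tag is checked, the comparison is usually written so that its timing does not reveal how many leading bytes matched. The sketch below shows only that comparison step. It does not compute HMAC-SHA256, `tags_match` is a hypothetical name, and whether hotcell compares tags this way is not stated here.

```rust
/// Compare two authentication tags without an early exit on the
/// first differing byte. Lengths are not secret, so a length
/// mismatch returns immediately.
pub fn tags_match(expected: &[u8], received: &[u8]) -> bool {
    if expected.len() != received.len() {
        return false;
    }
    // OR together the XOR of every byte pair; the result is zero
    // only if all bytes are equal.
    let mut diff = 0u8;
    for (a, b) in expected.iter().zip(received) {
        diff |= a ^ b;
    }
    diff == 0
}
```

This shows the shape of the check only; production code would normally rely on a vetted crypto library for both the HMAC and the comparison.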


Network Egress Filtering

--allow-host restricts guest network access to specific destinations via iptables rules. All other egress is dropped.


Rootfs Integrity

fs-verity and dm-verity provide cryptographic verification of rootfs contents, detecting any tampering of the filesystem image before or during execution.


AppArmor Profile

An AppArmor profile adds kernel-level mandatory access control on top of Landlock and seccomp.


virtiofsd Sandboxing

virtiofsd runs with --sandbox=chroot --seccomp=kill. The daemon is locked to a chroot with its own seccomp kill-mode filter.


Binary Hardening CI

Worker binaries are verified in CI via checksec: Full RELRO, PIE, Stack Canary, and NX are enforced on every build.


OCI Pipeline Security

Image Download & Extraction


Path Traversal Protection

Tar extraction rejects entries containing ../ to prevent directory escape attacks.
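
A check of this kind can be written against path components rather than substrings. The sketch below uses `std::path::Component`. `is_safe_entry` is a hypothetical name, and rejecting absolute and empty paths goes beyond the `../` rule stated above.

```rust
use std::path::{Component, Path};

/// Accept a tar entry path only if it stays inside the extraction root:
/// no `..` components, no absolute paths, and at least one real name.
pub fn is_safe_entry(path: &Path) -> bool {
    let mut saw_name = false;
    for component in path.components() {
        match component {
            Component::Normal(_) => saw_name = true,
            Component::CurDir => {} // a leading "./" is harmless
            // "..", "/", or a Windows prefix could escape the root
            Component::ParentDir | Component::RootDir | Component::Prefix(_) => {
                return false;
            }
        }
    }
    saw_name
}
```

Working on components avoids false positives such as a file literally named `..data`, which a substring test for `..` would reject.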


Symlink Escape Guards

Symlinks are resolved within the rootfs boundary using guest semantics. Absolute symlinks are rebased into the rootfs, not followed on the host.
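
Resolving with guest semantics means an absolute target is interpreted relative to the rootfs, and `..` can never climb above it. The sketch below does this lexically. `resolve_in_rootfs` is a hypothetical name, and it handles a single link target only; it does not follow symlinks that appear in intermediate components, which a full implementation would need to do.

```rust
use std::ffi::OsString;
use std::path::{Component, Path, PathBuf};

/// Map a symlink target to a host path inside `rootfs`.
/// `link_dir` is the guest directory containing the symlink (e.g. "/usr/lib").
pub fn resolve_in_rootfs(rootfs: &Path, link_dir: &Path, target: &Path) -> PathBuf {
    // Guest-side path as a stack of names; an empty stack is the guest "/".
    let mut stack: Vec<OsString> = Vec::new();
    // Walk the link's directory first, then the target. An absolute
    // target resets the stack when its root component is reached.
    for component in link_dir.components().chain(target.components()) {
        match component {
            Component::RootDir | Component::Prefix(_) => stack.clear(),
            Component::CurDir => {}
            // Popping an empty stack is a no-op, so ".." clamps at guest "/".
            Component::ParentDir => {
                stack.pop();
            }
            Component::Normal(name) => stack.push(name.to_os_string()),
        }
    }
    // Rebase the guest path onto the host-side rootfs directory.
    let mut host_path = rootfs.to_path_buf();
    host_path.extend(stack);
    host_path
}
```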


Shell Injection Prevention

guest_tag values are validated to contain only [a-zA-Z0-9_-].
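
The character-class rule translates directly into a byte check. In this sketch `is_valid_guest_tag` is a hypothetical name, and rejecting the empty string is an added assumption, since the pattern above does not say whether an empty tag is allowed.

```rust
/// True if `tag` is non-empty and contains only ASCII letters,
/// digits, underscore, or hyphen: the class [a-zA-Z0-9_-].
pub fn is_valid_guest_tag(tag: &str) -> bool {
    !tag.is_empty()
        && tag
            .bytes()
            .all(|b| b.is_ascii_alphanumeric() || b == b'_' || b == b'-')
}
```

Because spaces, quotes, `$`, `;`, and backticks are all outside the class, a validated tag cannot introduce shell metacharacters wherever it is later interpolated.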


Digest Verification

Layer blob digest verification is delegated to the oci-client library, which checks SHA-256 digests during download.


Temp-File Downloads

Blobs download to random temp files before extraction. This shrinks the substitution window, though atomic rename isn't used yet — partial TOCTOU mitigation, not a full guarantee.


Test Coverage

277+ tests across unit tests, integration tests (real VMs), and adversarial security boundary tests. The jailer is verified working on Linux+KVM with seccomp in Kill mode. TSI networking is verified on both macOS and Linux.

Jailer Escape Tests

Each test simulates an attacker inside the jail attempting a known escape technique. Can they break out of the filesystem? Traverse /proc? Create new namespaces? Regain dropped capabilities? Every escape attempt must be blocked for the build to pass.

Guest Isolation Tests

Real VMs boot with adversarial probe scripts that attempt to observe or reach the host. Tests run across all four backends (libkrun, Firecracker, Cloud Hypervisor, QEMU) and verify hostname, filesystem, process, and network isolation.


End-to-End Verified

Jailed VM boot validated on Linux+KVM with the strictest seccomp mode (Kill). Networking verified on both macOS and Linux through the full sandbox stack. The complete jail sequence runs end-to-end in production configuration.