Dual-Layer Isolation.
Hotcell provides two layers of isolation with pluggable VMM backends. Layer 1 (VM isolation) runs on all supported platforms via libkrun or Firecracker. Layer 2 (VMM process jail) adds defense-in-depth on Linux: even if an attacker escapes the VM, they land in a restricted process with no useful capabilities, a minimal filesystem view, and a tight syscall filter. The jailer has been verified end-to-end on Linux+KVM with seccomp in Kill mode. TSI networking has been verified on both macOS and Linux. The test suite contains 127+ tests, including 28 security boundary tests.
VMM Backends
Hotcell supports pluggable VMM backends behind a common VmmBackend trait. Everything above the backend — OCI pipeline, server, result protocol, streaming — is backend-agnostic. Select per-request via the backend parameter.
/// Pluggable VMM backend trait.
/// Everything above the backend -- OCI pipeline, server,
/// result protocol -- is backend-agnostic.
#[async_trait]
pub trait VmmBackend: Send + Sync {
    /// Prepare the rootfs for this backend (directory or ext4 image).
    async fn prepare_rootfs(
        &self, rootfs_dir: &Path, config: &VmConfig,
    ) -> Result<RootfsHandle, HotcellError>;

    /// Run a VM and return the result.
    async fn run(
        &self, config: &VmConfig, rootfs: &RootfsHandle,
    ) -> Result<VmResult, HotcellError>;

    /// Run a VM with streaming console output.
    async fn run_streaming(
        &self, config: &VmConfig, rootfs: &RootfsHandle,
        tx: mpsc::Sender<StreamEvent>,
    ) -> Result<VmResult, HotcellError>;
}

// Two implementations in separate crates:
// - hotcell_libkrun::LibkrunBackend (macOS + Linux)
// - hotcell_firecracker::FirecrackerBackend (Linux only)
Backend Comparison
libkrun
- Embedded VMM via FFI: the worker calls krun_start_enter()
- virtiofs for rootfs and shared directory access
- TSI (Transparent Socket Impersonation) networking, verified on both platforms
- hotcell-jailer sandboxes the VMM process on Linux
Firecracker
- Separate VMM binary, configured via REST API over Unix socket
- ext4 block device images created from OCI rootfs
- Firecracker's own jailer, battle-tested in AWS Lambda
- Serial console output streamed from file for real-time monitoring
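As an illustration of per-request backend selection with the platform constraints listed above, here is a minimal sketch. The `BackendKind` enum, the `select_backend` function, and the string values "libkrun" and "firecracker" are assumptions made for this example and are not taken from Hotcell's API.

```rust
use std::str::FromStr;

/// Hypothetical enum for the per-request `backend` parameter.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BackendKind {
    Libkrun,
    Firecracker,
}

impl FromStr for BackendKind {
    type Err = String;

    // The accepted strings are assumptions for this sketch.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "libkrun" => Ok(BackendKind::Libkrun),
            "firecracker" => Ok(BackendKind::Firecracker),
            other => Err(format!("unknown backend: {other}")),
        }
    }
}

impl BackendKind {
    /// libkrun runs on macOS and Linux; Firecracker is Linux only.
    pub fn supported_on(self, os: &str) -> bool {
        match self {
            BackendKind::Libkrun => os == "linux" || os == "macos",
            BackendKind::Firecracker => os == "linux",
        }
    }
}

/// Parse the request parameter and reject backends the host cannot run.
pub fn select_backend(param: &str, os: &str) -> Result<BackendKind, String> {
    let kind: BackendKind = param.parse()?;
    if kind.supported_on(os) {
        Ok(kind)
    } else {
        Err(format!("backend {kind:?} is not supported on {os}"))
    }
}
```

A caller could pass `std::env::consts::OS` as the `os` argument so that the check reflects the host it is running on.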
Layer 1: VM Isolation
Kernel Virtualization
Each execution runs inside its own virtual machine with a separate Linux kernel. With libkrun, the kernel is compiled into libkrunfw. With Firecracker, a separate vmlinux image is used. Either way, this eliminates the shared-kernel attack surface found in traditional containerization.
Guest Properties
- Own kernel, process table, and memory space
- No access to the host filesystem (libkrun: rootfs + shared directories via virtio-fs; Firecracker: ext4 block device)
- No network access by default (libkrun: TSI must be explicitly enabled; Firecracker: TAP planned)
- Resource limits via libkrun's rlimit support or Firecracker's machine config
Layer 2: VMM Process Jail
On Linux, the VMM process is sandboxed before it boots the VM. With the libkrun backend, hotcell-libkrun-worker is sandboxed by hotcell-jailer before it configures libkrun. With the Firecracker backend, Firecracker's own jailer (battle-tested in AWS Lambda) handles sandboxing. The jail steps below describe hotcell-jailer for the libkrun backend.
Close Inherited FDs
close_range(3, MAX, 0) prevents leaking host file descriptors into the jail.
Clear Environment
All environment variables are removed. Only LD_LIBRARY_PATH=/lib remains (required for the dynamic linker to find libkrun inside the jail).
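The effect of this step can be sketched with the standard library's `Command` builder. This is only an illustration: `worker_command` and its `worker_path` parameter are hypothetical, and the real jailer may scrub the environment in-process rather than when spawning.

```rust
use std::process::Command;

/// Build a command for the jailed worker with a scrubbed environment.
/// `worker_path` is a hypothetical parameter for this sketch.
pub fn worker_command(worker_path: &str) -> Command {
    let mut cmd = Command::new(worker_path);
    cmd.env_clear() // drop every inherited variable
        .env("LD_LIBRARY_PATH", "/lib"); // keep only what the dynamic linker needs
    cmd
}
```

Starting from an empty environment and adding back a single variable is easier to audit than removing known-dangerous variables one by one.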
Join Cgroup
Dedicated cgroup with memory.max, pids.max (256), and cpu.max limits applied.
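A rough sketch of how such limits map onto cgroup v2 control files follows. The `CgroupLimits` struct and its field names are hypothetical; only the control file names and the pids.max value of 256 come from the description above, and the other values a caller supplies are up to the deployment.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical limits struct for this sketch.
pub struct CgroupLimits {
    pub memory_max_bytes: u64,
    pub pids_max: u32,
    /// CPU quota and period in microseconds, as in cgroup v2 `cpu.max`.
    pub cpu_quota_us: u64,
    pub cpu_period_us: u64,
}

impl CgroupLimits {
    /// (control file, value) pairs in cgroup v2 format.
    pub fn entries(&self) -> Vec<(&'static str, String)> {
        vec![
            ("memory.max", self.memory_max_bytes.to_string()),
            ("pids.max", self.pids_max.to_string()),
            ("cpu.max", format!("{} {}", self.cpu_quota_us, self.cpu_period_us)),
        ]
    }

    /// Write the limits into an existing cgroup directory, then move
    /// `pid` into the cgroup by writing it to `cgroup.procs`.
    pub fn apply(&self, cgroup_dir: &Path, pid: u32) -> io::Result<()> {
        for (file, value) in self.entries() {
            fs::write(cgroup_dir.join(file), value)?;
        }
        fs::write(cgroup_dir.join("cgroup.procs"), pid.to_string())
    }
}
```

Writing the limits before joining the cgroup means the process is never inside the cgroup without its limits in place.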
Namespace Isolation
unshare() creates mount, PID, IPC, UTS, and network namespaces (network only when TSI is disabled).
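The flag set passed to unshare() can be sketched as a pure function of the TSI setting. The constants below are the Linux values from `<linux/sched.h>`; the function name `unshare_flags` and its `tsi_enabled` parameter are assumptions for this example.

```rust
// Linux namespace flags from <linux/sched.h>.
pub const CLONE_NEWNS: i32 = 0x0002_0000; // mount namespace
pub const CLONE_NEWUTS: i32 = 0x0400_0000; // hostname and domain name
pub const CLONE_NEWIPC: i32 = 0x0800_0000; // System V IPC, POSIX message queues
pub const CLONE_NEWPID: i32 = 0x2000_0000; // process IDs
pub const CLONE_NEWNET: i32 = 0x4000_0000; // network stack

/// Compute the unshare() flag mask. The network namespace is only
/// unshared when TSI is disabled, because TSI proxies guest sockets
/// through the VMM process on the host.
pub fn unshare_flags(tsi_enabled: bool) -> i32 {
    let mut flags = CLONE_NEWNS | CLONE_NEWPID | CLONE_NEWIPC | CLONE_NEWUTS;
    if !tsi_enabled {
        flags |= CLONE_NEWNET;
    }
    flags
}
```

Keeping the decision in one small function makes the TSI exception easy to test without actually creating namespaces.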
pivot_root
New root filesystem via pivot_root(), old root unmounted and removed. Host filesystem entirely invisible.
Landlock Restrictions
Mandatory access control via Landlock ABI v3. The process can only access explicitly listed paths. This step is fatal — if Landlock is not enforced, the jail fails.
Drop Capabilities
Two-phase capability drop: bounding set cleared via PR_CAPBSET_DROP, then all remaining sets (ambient, effective, permitted, inheritable) cleared after setuid to nobody.
Seccomp BPF
Allowlist-based BPF filter in Kill mode. Any syscall not in the list triggers immediate process termination.
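The allowlist semantics can be modeled in plain Rust as follows. This is a userspace model of the decision only, not a BPF program, and the syscall names used here are illustrative: the actual contents of Hotcell's allowlist are not given in this document. The `Allowlist` and `Action` types are hypothetical.

```rust
use std::collections::HashSet;

/// Outcome of the filter for one syscall.
#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    Allow,
    KillProcess,
}

/// Hypothetical model of an allowlist filter in Kill mode.
pub struct Allowlist {
    allowed: HashSet<&'static str>,
}

impl Allowlist {
    pub fn new(base: &[&'static str]) -> Self {
        Allowlist { allowed: base.iter().copied().collect() }
    }

    /// With TSI enabled, socket syscalls are added to the allowlist.
    /// The names listed here are examples, not the real list.
    pub fn with_socket_syscalls(mut self) -> Self {
        self.allowed.extend(["socket", "connect", "sendto", "recvfrom"]);
        self
    }

    /// Anything not explicitly listed terminates the process.
    pub fn decide(&self, syscall: &str) -> Action {
        if self.allowed.contains(syscall) {
            Action::Allow
        } else {
            Action::KillProcess
        }
    }
}
```

The default-deny shape matters: a syscall added to the kernel later is blocked automatically, whereas a denylist would silently permit it.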
Jail Filesystem
After pivot_root(), the worker's entire filesystem view is:
/
├── dev/
│   ├── kvm        # bind-mount, for VM creation
│   ├── urandom    # bind-mount, for randomness
│   └── null       # bind-mount
├── lib/           # bind-mount read-only: libkrun.so, libkrunfw.so, libc, ld-linux
├── proc/          # procfs, mounted after pivot_root
├── rootfs/        # bind-mount read-only: the OCI root filesystem
├── shares/        # bind-mount read-write: host shared directories
├── tmp/           # writable, world-writable with sticky bit
├── result/        # writable: result file directory
├── config.json    # read-only: VM configuration
├── console.log    # writable: console output
└── worker         # bind-mount read-only: hotcell-libkrun-worker binary
Networking Trade-off
libkrun's TSI (Transparent Socket Impersonation) proxies guest socket calls through the VMM process on the host via vsock. When TSI is enabled, the worker does not unshare the network namespace (CLONE_NEWNET is skipped), socket syscalls are added to the seccomp allowlist, and Landlock network restrictions are skipped. The remaining layers (namespace isolation, seccomp allowlist, capability drop) still constrain the process.
Technical Specs
OCI Pipeline Security
Path Traversal Protection
Tar extraction rejects entries containing ../ to prevent directory escape attacks.
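A minimal sketch of such a check, using the standard library's path components. The function name `has_parent_dir` is hypothetical, and this shows only the `..` rejection described above, not the full extraction logic.

```rust
use std::path::{Component, Path};

/// Returns true if any component of the entry path is `..`.
/// An extractor would reject such entries before writing anything.
pub fn has_parent_dir(entry: &Path) -> bool {
    entry
        .components()
        .any(|c| matches!(c, Component::ParentDir))
}
```

Checking components rather than searching for the substring `../` avoids false positives on file names that merely contain two dots, such as `notes..txt`.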
Symlink Escape Guards
Symlinks are resolved within the rootfs boundary using guest semantics. Absolute symlinks are rebased into the rootfs, not followed on the host.
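The rebasing idea can be sketched as a purely lexical resolution. This is a simplification under stated assumptions: the function `resolve_in_rootfs` and its parameters are hypothetical, and a real implementation would also have to resolve symlinks in intermediate path components.

```rust
use std::ffi::OsString;
use std::path::{Component, Path, PathBuf};

/// Resolve a symlink target as the guest would see it, then map the
/// result into the rootfs on the host. `link_parent` is the
/// guest-absolute directory that contains the symlink.
pub fn resolve_in_rootfs(rootfs: &Path, link_parent: &Path, target: &Path) -> PathBuf {
    // Components of the resolved guest path; empty means guest "/".
    let mut stack: Vec<OsString> = Vec::new();
    for comp in link_parent.components().chain(target.components()) {
        match comp {
            // An absolute target restarts at the guest root, not the host root.
            Component::RootDir => stack.clear(),
            // ".." at the guest root stays at the guest root.
            Component::ParentDir => {
                stack.pop();
            }
            Component::Normal(name) => stack.push(name.to_os_string()),
            Component::CurDir | Component::Prefix(_) => {}
        }
    }
    let mut out = rootfs.to_path_buf();
    out.extend(stack);
    out
}
```

Because `..` can never pop past the guest root, a target such as `../../../../etc/passwd` resolves to a path inside the rootfs rather than to the host's `/etc/passwd`.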
Shell Injection Prevention
guest_tag values are validated to contain only [a-zA-Z0-9_-].
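A check of this kind might look like the sketch below. The function name is hypothetical, and rejecting the empty string is an assumption of this example rather than something stated above.

```rust
/// Accept only ASCII letters, digits, underscore, and hyphen.
/// Rejecting the empty string is an assumption of this sketch.
pub fn is_valid_guest_tag(tag: &str) -> bool {
    !tag.is_empty()
        && tag
            .bytes()
            .all(|b| b.is_ascii_alphanumeric() || b == b'_' || b == b'-')
}
```

Because the allowed set contains no whitespace, quotes, or shell metacharacters, a valid tag cannot change the meaning of a command line it is interpolated into.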
Digest Verification
Downloaded layer blobs are verified against their SHA-256 digest before use.
TOCTOU-Safe Downloads
Blobs go to random temp files, are verified, then extracted from the same file — no window for substitution.
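One way to get this property is to verify and extract through the same open file handle, so that replacing the path on disk has no effect. The sketch below is an assumption-laden illustration, not Hotcell's implementation: `stage_and_verify` is a hypothetical name, the `verify` closure stands in for the SHA-256 check, the random name comes from the standard library's randomly seeded hasher as a stand-in for a dedicated temp-file crate, and unlinking the path immediately is a choice made for this example (Unix only).

```rust
use std::collections::hash_map::RandomState;
use std::fs::{self, File, OpenOptions};
use std::hash::{BuildHasher, Hasher};
use std::io::{self, Seek, SeekFrom, Write};
use std::path::Path;

/// Stage `blob` in a randomly named temp file under `dir`, verify it
/// through the open handle, and return that same handle for extraction.
pub fn stage_and_verify(
    dir: &Path,
    blob: &[u8],
    verify: impl FnOnce(&mut File) -> io::Result<bool>,
) -> io::Result<File> {
    // Unpredictable file name derived from std's random hasher keys.
    let nonce = RandomState::new().build_hasher().finish();
    let path = dir.join(format!("blob-{nonce:016x}.tmp"));
    // create_new fails if the path already exists, so a pre-planted
    // file or symlink is never opened.
    let mut file = OpenOptions::new()
        .read(true)
        .write(true)
        .create_new(true)
        .open(&path)?;
    // Assumption of this sketch: unlink right away so the data is
    // reachable only through `file` (works on Unix).
    fs::remove_file(&path)?;
    file.write_all(blob)?;
    file.seek(SeekFrom::Start(0))?;
    if !verify(&mut file)? {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "digest mismatch"));
    }
    // Rewind so the caller extracts exactly the bytes that were verified.
    file.seek(SeekFrom::Start(0))?;
    Ok(file)
}
```

The key point is that extraction never reopens the blob by path, so there is no moment between check and use at which a different file could be substituted.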
Test Coverage
127+ tests across unit tests, integration tests (real VMs), and adversarial security boundary tests. The jailer is verified working on Linux+KVM with seccomp in Kill mode. TSI networking is verified on both macOS and Linux.
Security Boundary Tests
Adversarial escape attempts that must fail: filesystem escape, proc traversal, namespace breakout, seccomp bypass (unshare, setns, ptrace, personality, bpf, keyctl, clone with CLONE_NEW*, ioctl TIOCSTI, prctl), capability regain, fork bomb limits, and the full jail sequence.
Guest Isolation Tests
Primitive subsystem verification: cgroup creation and enforcement, capability dropping, namespace + pivot_root isolation, Landlock filesystem restrictions, seccomp filter installation (log and kill modes), and FD close_range.
End-to-End Verified
Jailed VM boot validated on Linux+KVM with seccomp Kill mode. TSI networking verified on both macOS (Hypervisor.framework) and Linux (jailed, seccomp Kill mode). Full sandbox sequence: namespaces, pivot_root, Landlock, seccomp, cgroup, capability drop.