Experimental — Hotcell is under active development and should not be used in production.

Dual-Layer Isolation

Every execution runs in its own VM. If someone escapes the VM, they land in a jailed process with no capabilities, a minimal filesystem, and a syscall filter that kills on violation.

Layer 1: VM isolation via libkrun, Firecracker, Cloud Hypervisor, or QEMU. Separate kernel, memory, process tree.

Layer 2: The VMM process itself is jailed — strict sandbox with syscall filtering, filesystem restrictions, resource limits, and full capability drop.

277+ tests, including 37 security tests (18 adversarial escape + 19 isolation verification) that confirm escape attempts are blocked.


VMM Backends

Four backends behind a common trait. Same code, same protocol, same result format — pick the right trade-off per workload.

Default
macOS + Linux

libkrun

Embedded VMM via FFI. No separate process. Transparent socket networking (TSI). The fastest path from code to VM.

~337ms end-to-end
 
Linux only

Firecracker

AWS Lambda's VMM. Minimal device model, battle-tested jailer. ext4 block device rootfs. The production hardening choice.

~344ms end-to-end
 
Linux only

Cloud Hypervisor

Modern rust-vmm VMM. virtio-fs rootfs via virtiofsd. Snapshot/restore, warm migration, GPU passthrough via VFIO.

~307ms end-to-end
 
Linux only

QEMU

Full device emulation. Broadest hardware support. GPU passthrough with OVMF firmware boot. The escape hatch for complex workloads.

~555ms end-to-end
hotcell::backend::VmmBackend
/// Pluggable VMM backend trait.
/// Everything above the backend -- OCI pipeline, server,
/// result protocol -- is backend-agnostic.
#[async_trait]
pub trait VmmBackend: Send + Sync {
    /// Run a VM and return the result.
    async fn run(
        &self, config: &VmConfig, worker_bin: &Path,
    ) -> Result<VmResult, HotcellError>;

    /// Run a VM with streaming console output.
    async fn run_streaming(
        &self, config: &VmConfig, worker_bin: &Path,
        tx: broadcast::Sender<StreamEvent>,
    ) -> Result<VmResult, HotcellError>;

    /// Create a persistent VM that outlives a single request.
    async fn create_persistent(
        &self, config: &VmConfig, worker_bin: &Path,
    ) -> Result<Box<dyn PersistentVmHandle>, HotcellError>;

    /// Human-readable backend name.
    fn name(&self) -> &'static str;
}

// Four implementations in separate crates:
// - hotcell_libkrun::LibkrunBackend      (macOS + Linux)
// - hotcell_firecracker::FirecrackerBackend  (Linux only)
// - hotcell_ch::ChBackend                (Linux only)
// - hotcell_qemu::QemuBackend            (Linux only)
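The trait-object dispatch above can be sketched in miniature. This is a simplified stand-in, not the real crate API: it is synchronous, and `VmConfig`, the async methods, and the error types are elided; only the `name()` method and the factory pattern are illustrated.

```rust
// Minimal sketch of trait-object dispatch over VMM backends.
// Stand-in types only; the real trait is async and returns VmResult.
trait VmmBackend {
    /// Human-readable backend name (mirrors the real trait's `name()`).
    fn name(&self) -> &'static str;
}

struct LibkrunBackend;
struct QemuBackend;

impl VmmBackend for LibkrunBackend {
    fn name(&self) -> &'static str { "libkrun" }
}
impl VmmBackend for QemuBackend {
    fn name(&self) -> &'static str { "qemu" }
}

/// Pick a backend at runtime; callers only ever see `dyn VmmBackend`,
/// so the OCI pipeline, server, and result protocol stay backend-agnostic.
fn select_backend(kind: &str) -> Box<dyn VmmBackend> {
    match kind {
        "qemu" => Box::new(QemuBackend),
        _ => Box::new(LibkrunBackend), // libkrun is the default
    }
}

fn main() {
    let backend = select_backend("qemu");
    println!("selected: {}", backend.name());
}
```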
Default

libkrun

  • Embedded VMM via FFI — worker calls krun_start_enter()
  • virtiofs for rootfs and shared directory access
  • TSI (Transparent Socket Impersonation) networking — verified on both platforms
  • hotcell-jailer sandboxes the VMM process on Linux
Linux only

Firecracker

  • Separate VMM binary — configured via REST API over Unix socket
  • ext4 block device images created from OCI rootfs
  • Firecracker's own jailer — battle-tested in AWS Lambda
  • Serial console output streamed from file for real-time monitoring
Linux only

Cloud Hypervisor

  • Separate VMM binary — configured via REST API over Unix socket
  • virtio-fs rootfs via virtiofsd — host directory mounted directly, no ext4 images
  • Manages virtiofsd daemon processes for each virtio-fs mount
  • Requires cloud-hypervisor and virtiofsd binaries + KVM
Linux only

QEMU

  • Separate VMM binary — configured via CLI args (no QMP)
  • VFIO GPU passthrough — pass physical GPUs to guest VMs for CUDA/ROCm workloads
  • virtio-fs rootfs via virtiofsd — same shared directory model as Cloud Hypervisor
  • q35 machine type with IOMMU for interrupt remapping and device isolation

Layer 1: VM Isolation

Kernel Virtualization

Each execution runs inside its own virtual machine with a separate Linux kernel. With libkrun, the kernel is compiled into libkrunfw. With Firecracker and Cloud Hypervisor, a separate vmlinux image is used. Either way, this eliminates the shared-kernel attack surface found in traditional containerization.

Guest Properties

  • Own kernel, process table, and memory space
  • No access to the host filesystem — the guest sees only its own root filesystem and any explicitly shared directories
  • No network access by default — networking must be explicitly enabled per-execution, with optional egress filtering to restrict which hosts the guest can reach
  • Configurable resource limits — memory, CPU, and execution timeout per VM
[Diagram: the user workload runs under the guest OS kernel; only explicitly shared directories cross the VM boundary.]

Layer 2: VMM Process Jail

On Linux, the VMM process is sandboxed before it boots the VM. With the libkrun backend, hotcell-libkrun-worker is sandboxed by hotcell-jailer before it configures libkrun. With the Firecracker backend, Firecracker's own jailer (battle-tested in AWS Lambda) handles sandboxing. With the Cloud Hypervisor backend, Landlock-based sandboxing is planned. The jail steps below describe hotcell-jailer for the libkrun backend.

LINUX ONLY
DEFENSE-IN-DEPTH

The jail executes 14 operations internally, grouped here into 8 categories. Each category removes a class of capability from the process: file descriptors, environment, resources, filesystem visibility, syscall access, and privileges. The steps are ordered so that each one assumes the previous steps might have been bypassed — defense-in-depth means no single step is a single point of failure.

01

Close Inherited FDs

close_range(3, MAX, 0) prevents leaking host file descriptors into the jail.

02

Clear Environment

All environment variables are removed. Only LD_LIBRARY_PATH=/lib remains (required for the dynamic linker to find libkrun inside the jail).
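The scrub can be sketched with std::env alone. This is illustrative, not the jailer's actual code; the real step presumably runs in the jailer process before exec.

```rust
use std::env;

/// Remove every environment variable except LD_LIBRARY_PATH, then
/// pin LD_LIBRARY_PATH to /lib so the dynamic linker can locate
/// libkrun inside the jail.
fn scrub_env() {
    let keys: Vec<String> = env::vars().map(|(k, _)| k).collect();
    for k in keys {
        if k != "LD_LIBRARY_PATH" {
            env::remove_var(&k);
        }
    }
    env::set_var("LD_LIBRARY_PATH", "/lib");
}

fn main() {
    scrub_env();
    // Exactly one variable should survive.
    println!("env after scrub: {:?}", env::vars().collect::<Vec<_>>());
}
```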

03

Join Cgroup

Dedicated cgroup with memory.max, pids.max (256), and cpu.max limits applied.

04

Namespace Isolation

unshare() creates mount, PID, IPC, UTS, and network namespaces (network only when TSI is disabled).

05

pivot_root

New root filesystem via pivot_root(), old root unmounted and removed. Host filesystem entirely invisible.

06

Landlock Restrictions

Mandatory access control via Landlock ABI v3 (filesystem) with optional v4 network restrictions (Linux 6.7+). The process can only access explicitly listed paths. This step is fatal — if Landlock is not enforced, the jail fails.

07

Drop Capabilities

Two-phase capability drop: bounding set cleared via PR_CAPBSET_DROP, then all remaining sets (ambient, effective, permitted, inheritable) cleared after setuid to nobody.

08

Seccomp BPF

Dual BPF filter with SECCOMP_FILTER_FLAG_TSYNC: an audit-log filter records violations, then a Kill-mode filter terminates on any syscall not in the allowlist. Both applied atomically across all threads.
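The dual-filter semantics can be modeled conceptually: when several seccomp filters are installed, every filter runs and the most restrictive verdict wins. This is plain Rust, not real BPF; the syscall numbers and the tiny allowlist are illustrative.

```rust
/// Seccomp verdicts, ordered by restrictiveness. A conceptual
/// model of filter composition, not actual BPF bytecode.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone, Copy)]
enum Action { Allow, Log, Kill }

/// The audit filter records (logs) any syscall not on the allowlist.
fn audit_filter(allow: &[u64], nr: u64) -> Action {
    if allow.contains(&nr) { Action::Allow } else { Action::Log }
}

/// The kill filter terminates on any syscall not on the allowlist.
fn kill_filter(allow: &[u64], nr: u64) -> Action {
    if allow.contains(&nr) { Action::Allow } else { Action::Kill }
}

/// With both filters installed (TSYNC applies them to every thread),
/// the kernel takes the most restrictive verdict across all filters.
fn verdict(allow: &[u64], nr: u64) -> Action {
    audit_filter(allow, nr).max(kill_filter(allow, nr))
}

fn main() {
    let allow = [0u64, 1, 60]; // read, write, exit (illustrative x86_64 numbers)
    assert_eq!(verdict(&allow, 1), Action::Allow);
    assert_eq!(verdict(&allow, 59), Action::Kill); // execve: not allowlisted
    println!("conceptual dual-filter model holds");
}
```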

Jail Filesystem

After pivot_root(), the worker's entire filesystem view is:

/
├── dev/
│   ├── kvm          # bind-mount, for VM creation
│   ├── urandom      # bind-mount, for randomness
│   └── null         # bind-mount
├── lib/             # bind-mount read-only: libkrun.so, libkrunfw.so, libc, ld-linux
├── proc/            # procfs, mounted after pivot_root
├── rootfs/          # bind-mount read-only: the OCI root filesystem
├── shares/          # bind-mount read-write: host shared directories
├── tmp/             # writable, world-writable with sticky bit
├── result/          # writable: result file directory
├── config.json      # read-only: VM configuration
├── console.log      # writable: console output
└── worker           # bind-mount read-only: hotcell-libkrun-worker binary

Networking Trade-off

libkrun's TSI (Transparent Socket Impersonation) proxies guest socket calls through the VMM process on the host via vsock. When TSI is enabled, the worker does not unshare the network namespace (CLONE_NEWNET is skipped), socket syscalls are added to the seccomp allowlist, and Landlock network restrictions are skipped. The remaining layers (namespace isolation, seccomp allowlist, capability drop) still constrain the process.

Technical Specs

Cgroup memory.max: guest + 256 MiB (min 512 MiB)
Cgroup pids.max: 256 (prevents fork bombs)
Seccomp mode: Kill (immediate termination)
Landlock: ABI v3 required (Linux 6.2+)
Landlock network: ABI v4 optional (Linux 6.7+)
Cgroup cpu.max: unlimited (configurable per-execution)
Seccomp filters: dual BPF, audit log + kill (TSYNC)
Vsock auth: HMAC-SHA256 (prevents cross-VM injection)
Hardening items: 20 total (host + guest combined)
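The memory.max line implies a small piece of arithmetic; a sketch, with the exact rounding being an assumption:

```rust
/// cgroup memory.max for the jailed VMM process: guest memory
/// plus 256 MiB of VMM overhead, floored at 512 MiB.
fn memory_max_mib(guest_mib: u64) -> u64 {
    (guest_mib + 256).max(512)
}

fn main() {
    assert_eq!(memory_max_mib(128), 512);   // small guest hits the floor
    assert_eq!(memory_max_mib(1024), 1280); // guest + 256 MiB overhead
    println!("memory_max for 1 GiB guest: {} MiB", memory_max_mib(1024));
}
```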

Security Hardening


Guest Seccomp Filter

Optional hotcell-seccomp binary installs a BPF filter inside the guest, blocking ptrace, mount, unshare, and other privilege-escalation syscalls before the user workload runs.


Vsock HMAC-SHA256 Auth

Each VM receives a unique HMAC-SHA256 token over vsock. Result payloads are authenticated before acceptance, preventing cross-VM result injection attacks.
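Computing HMAC-SHA256 requires a crypto crate, but the acceptance step reduces to a constant-time tag comparison; a stdlib-only sketch of that comparison (illustrative, not the project's code):

```rust
/// Compare two authentication tags without an early exit, so
/// response timing does not leak how many leading bytes matched.
fn ct_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    // OR together the XOR of every byte pair; zero iff identical.
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b) {
        diff |= x ^ y;
    }
    diff == 0
}

fn main() {
    assert!(ct_eq(b"expected-tag", b"expected-tag"));
    assert!(!ct_eq(b"expected-tag", b"forged-tag!!"));
    println!("constant-time comparison ok");
}
```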


Network Egress Filtering

--allow-host restricts guest network access to specific destinations via iptables rules. All other egress is dropped.
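An allowlist-then-drop ruleset might be shaped roughly like this; the rule text and chain are illustrative assumptions, not the project's exact output:

```rust
/// Build an allowlist-then-default-deny egress ruleset for a set
/// of destination IPs. Rule shapes are illustrative only.
fn egress_rules(allowed: &[&str]) -> Vec<String> {
    let mut rules: Vec<String> = allowed
        .iter()
        .map(|ip| format!("-A FORWARD -d {} -j ACCEPT", ip))
        .collect();
    // Everything not explicitly accepted is dropped.
    rules.push("-A FORWARD -j DROP".to_string());
    rules
}

fn main() {
    for r in egress_rules(&["93.184.216.34"]) {
        println!("iptables {}", r);
    }
}
```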


Rootfs Integrity

fs-verity and dm-verity provide cryptographic verification of rootfs contents, detecting any tampering of the filesystem image before or during execution.


AppArmor Profile

An AppArmor profile adds kernel-level mandatory access control on top of Landlock and seccomp.


virtiofsd Sandboxing

virtiofsd runs with --sandbox=chroot --seccomp=kill. The daemon is locked to a chroot with its own seccomp kill-mode filter.


Binary Hardening CI

Worker binaries are verified in CI via checksec: Full RELRO, PIE, Stack Canary, and NX are enforced on every build.


OCI Pipeline Security


Path Traversal Protection

Tar extraction rejects entries containing ../ to prevent directory escape attacks.
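One way to implement such a check (the project's actual rule may differ): reject any entry whose path is absolute or contains a `..` component.

```rust
use std::path::{Component, Path};

/// Reject tar entry paths that could escape the extraction root:
/// absolute paths and any `..` component fail validation.
fn safe_entry(path: &str) -> bool {
    let p = Path::new(path);
    !p.is_absolute()
        && p.components()
            .all(|c| matches!(c, Component::Normal(_) | Component::CurDir))
}

fn main() {
    assert!(safe_entry("usr/bin/env"));
    assert!(!safe_entry("../../etc/passwd"));
    assert!(!safe_entry("/etc/passwd"));
    println!("entry validation ok");
}
```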


Symlink Escape Guards

Symlinks are resolved within the rootfs boundary using guest semantics. Absolute symlinks are rebased into the rootfs, not followed on the host.
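The rebasing step can be sketched as pure path manipulation (an illustration of the idea, not the project's implementation):

```rust
use std::path::{Path, PathBuf};

/// Rebase an absolute symlink target into the rootfs, so that
/// e.g. /etc/alternatives/java resolves under the guest root
/// rather than on the host. Relative targets pass through.
fn rebase_symlink(rootfs: &Path, target: &Path) -> PathBuf {
    if target.is_absolute() {
        // Strip the leading `/` and re-join under the rootfs.
        rootfs.join(target.strip_prefix("/").unwrap_or(target))
    } else {
        target.to_path_buf()
    }
}

fn main() {
    let root = Path::new("/var/lib/hotcell/rootfs");
    let rebased = rebase_symlink(root, Path::new("/etc/alternatives/java"));
    println!("rebased: {}", rebased.display());
}
```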


Shell Injection Prevention

guest_tag values are validated to contain only [a-zA-Z0-9_-].
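The character-class check is a few lines of Rust; the non-empty requirement here is my assumption, the rest follows the stated [a-zA-Z0-9_-] rule:

```rust
/// Accept only tags made of [a-zA-Z0-9_-]; anything else could
/// smuggle shell metacharacters into downstream commands.
/// (Rejecting the empty string is an added assumption.)
fn valid_guest_tag(tag: &str) -> bool {
    !tag.is_empty()
        && tag
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '-')
}

fn main() {
    assert!(valid_guest_tag("ubuntu-24_04"));
    assert!(!valid_guest_tag("tag; rm -rf /"));
    println!("tag validation ok");
}
```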


Digest Verification

Layer blob digest verification is delegated to the oci-client library, which checks SHA-256 digests during download.


Temp-File Downloads

Blobs download to random temp files before extraction. This shrinks the substitution window, though atomic rename isn't used yet — partial TOCTOU mitigation, not a full guarantee.


Test Coverage

277+ tests across unit tests, integration tests (real VMs), and adversarial security boundary tests. The jailer is verified on Linux+KVM with seccomp in Kill mode. TSI networking is verified on both macOS and Linux.

18

Jailer Escape Tests

Each test simulates an attacker inside the jail attempting a known escape technique. Can they break out of the filesystem? Traverse /proc? Create new namespaces? Regain dropped capabilities? Every escape attempt must be blocked for the build to pass.

19

Guest Isolation Tests

Real VMs boot with adversarial probe scripts that attempt to observe or reach the host. Tests run across all four backends (libkrun, Firecracker, Cloud Hypervisor, QEMU) and verify hostname, filesystem, process, and network isolation.

E2E

End-to-End Verified

Jailed VM boot validated on Linux+KVM with the strictest seccomp mode (Kill). Networking verified on both macOS and Linux through the full sandbox stack. The complete jail sequence runs end-to-end in production configuration.
