ESXi to Proxmox Migration Plan - UPDATED WITH APPLICATION LAYER

Created: 2025-12-28 | Updated: 2025-12-28 (Application Layer Integration) | Status: Ready for Execution


Executive Summary

Current Situation:

  • 2x ESXi hosts (NUC9i9QNX) with production workloads
  • 1x Proxmox staging server already running (Home Assistant + Frigate with Coral TPU)
  • Complete application stack: Plex, media automation (Radarr/Sonarr/SABnzbd), Pi-hole DNS, Palo Alto firewall
  • Critical External Dependency: NFS storage at 10.1.1.150 (27TB, 21TB used) - ALL media stored here

End Goal:

  • 3-node Proxmox VE cluster with HA capability
  • All applications migrated with zero data loss
  • Hardware passthrough working (iGPU for Plex, Coral TPU for Frigate)
  • NFS mounts reconfigured on all VMs
  • Traefik reverse proxy with Cloudflare SSL working

Key Application Insights:

  • VM Naming Clarified: “platinum” (10.1.1.125) = Plex, “iridium” (10.1.1.126) = Support services
  • No Media Migration Needed: All media on NFS (10.1.1.150), only config/database migrations
  • Docker Stack: Complete media automation on single VM with Traefik reverse proxy
  • Service Dependencies: Mapped complete data flow from user request → Plex streaming
  • Network Management: UniFi Controller on iridium VM manages network infrastructure

Migration Architecture - Application View

End-State Application Distribution

Node 1: proxmox-01 (10.1.1.120)

Production Workloads:

  • Plex Media Server (platinum VM → renamed “plex”)

    • Intel iGPU passthrough for Quick Sync transcoding
    • NFS mount: 10.1.1.150:/volume1/datastore/media → /mnt/media
    • Database: Local on VM (~500MB)
  • Plex Support Services (iridium VM → renamed “plex-support”)

    • Tautulli (Plex monitoring and statistics)
    • Cloudflared (remote access tunnel)
    • UniFi Controller (network management)
    • No iGPU needed (can reclaim passthrough)
  • Media Automation Stack (docker VM → renamed “docker-media”)

    • Radarr, Sonarr, SABnzbd (media management)
    • Traefik reverse proxy (Cloudflare SSL)
    • Ombi, Overseerr (user requests)
    • Portainer, Watchtower, Uptime Kuma
    • NFS mount: 10.1.1.150:/volume1/datastore/media → /mnt/media
  • Pi-hole DNS (pihole VM)

    • Network-wide DNS and ad-blocking
    • Critical: All network clients depend on this
  • Palo Alto Firewall (jarnetfw VM)

    • Inter-VLAN routing
    • NAT, security policies
    • Critical: Network outage if offline

Node 2: proxmox-02 (10.1.1.121)

Production Workloads:

  • Home Assistant + Frigate (home-sec VM, migrated from staging)
    • USB controller passthrough for Coral TPU
    • Docker containers: Home Assistant, Frigate, Mosquitto
    • Local storage for Frigate recordings

Spare Capacity:

  • Room for growth and lab VMs

Node 3: pve-staging (10.1.1.123)

Role: HA Witness + Light Workloads

  • Quorum node for 3-node cluster
  • K8s lab, templates, Docker services

Pre-Migration Decisions Required

1. Hardware Upgrade Timing ⚠️ CRITICAL DECISION

RECOMMENDATION: Install 2TB NVMe drives BEFORE migration (Option A)

  • Clean Proxmox install on larger drives
  • No storage migration later
  • Order drives now, install before Phase 1

2. Proxmox Storage Backend

RECOMMENDATION: ZFS for NUCs (better for VMs, snapshots, replication)

  • Built-in compression and checksums
  • Native VM snapshot support
  • Better for Plex database and Docker volumes

3. VM Renaming

Current naming is confusing because ESXi VM names no longer match hostnames or roles. Recommend renaming during migration:

  • platinum (10.1.1.125) → “plex” (Plex Media Server)
  • iridium (10.1.1.126) → “plex-support” (Tautulli, Cloudflared, UniFi Controller)
  • docker (10.1.1.32) → “docker-media” (Media automation stack)
  • pihole (10.1.1.35) → “pihole” (keep same)
  • jarnetfw (10.1.1.103) → “paloalto-fw” (keep functionality same)

4. VMs to Decommission

Confirm these can be deleted (all currently offline on Host 2):

  • server-2019 (old Blue Iris host - replaced by Frigate)
  • home-security (old VM - replaced by Frigate on Proxmox)
  • xsoar (purpose unknown, offline)
  • win11-sse, win-10 (lab VMs, offline)

Phase 0: Preparation & Backups

0.1: Hardware Preparation

  • Order 2x 2TB WD Blue SN580 NVMe drives
  • Create Proxmox VE 8.x bootable USB installer
  • Prepare external backup storage (USB drive or network backup location)

0.2: Pre-Migration Backups - CRITICAL

Backup Script

Create and run this backup script BEFORE any migration:

#!/bin/bash
# Pre-Migration Backup Script
# Run this BEFORE starting migration

BACKUP_DIR="/tmp/esxi-migration-backup-$(date +%Y%m%d)"
mkdir -p "$BACKUP_DIR"

echo "=== Starting Pre-Migration Backups ==="
echo "Backup directory: $BACKUP_DIR"

# 1. Plex Database Backup
echo "1. Backing up Plex database..."
ssh -i ~/.ssh/esxi_migration_rsa luke@10.1.1.125 \
  "sudo tar czf /tmp/plex-backup.tar.gz \
  '/var/lib/plexmediaserver/Library/Application Support/Plex Media Server/Plug-in Support/Databases/' \
  '/var/lib/plexmediaserver/Library/Application Support/Plex Media Server/Preferences.xml'"
scp -i ~/.ssh/esxi_migration_rsa luke@10.1.1.125:/tmp/plex-backup.tar.gz $BACKUP_DIR/
echo "✓ Plex backup complete"

# 2. Docker Compose + Env Files
echo "2. Backing up Docker compose and configs..."
ssh -i ~/.ssh/esxi_migration_rsa luke@10.1.1.32 \
  "tar czf /tmp/docker-backup.tar.gz /home/luke/docker/"
scp -i ~/.ssh/esxi_migration_rsa luke@10.1.1.32:/tmp/docker-backup.tar.gz $BACKUP_DIR/
echo "✓ Docker backup complete"

# 3. Pi-hole Config
echo "3. Backing up Pi-hole configuration..."
ssh -i ~/.ssh/esxi_migration_rsa luke@10.1.1.35 \
  "sudo tar czf /tmp/pihole-backup.tar.gz /etc/pihole/"
scp -i ~/.ssh/esxi_migration_rsa luke@10.1.1.35:/tmp/pihole-backup.tar.gz $BACKUP_DIR/
echo "✓ Pi-hole backup complete"

# 4. Iridium VM (Tautulli, Cloudflared, UniFi)
echo "4. Backing up iridium services (Tautulli, UniFi, Cloudflared)..."
ssh -i ~/.ssh/esxi_migration_rsa luke@10.1.1.126 \
  "sudo tar czf /tmp/iridium-backup.tar.gz /config /data 2>/dev/null || sudo tar czf /tmp/iridium-backup.tar.gz /opt /etc/cloudflared 2>/dev/null || echo 'Partial backup'"
scp -i ~/.ssh/esxi_migration_rsa luke@10.1.1.126:/tmp/iridium-backup.tar.gz $BACKUP_DIR/
echo "✓ Iridium backup complete (Tautulli, UniFi, Cloudflared configs)"

# 5. Home Assistant Backup
echo "5. Backing up Home Assistant..."
# NOTE: the 'ha' CLI exists only on HA OS/Supervised installs; on a plain
# container install, tar the Home Assistant config directory instead.
ssh -i ~/.ssh/esxi_migration_rsa ubuntu@10.1.1.208 \
  "docker exec homeassistant ha backups new --name pre-migration"
echo "✓ Home Assistant backup created (stored on the HA host - copy it off before decommissioning anything)"

# 6. List all VMs for reference
echo "6. Documenting VM inventory..."
ssh -i ~/.ssh/esxi_migration_rsa root@10.1.1.120 \
  "vim-cmd vmsvc/getallvms" > $BACKUP_DIR/esxi-host1-vms.txt
ssh -i ~/.ssh/esxi_migration_rsa root@10.1.1.121 \
  "vim-cmd vmsvc/getallvms" > $BACKUP_DIR/esxi-host2-vms.txt
echo "✓ VM inventory documented"

# 7. Export ESXi network configs
echo "7. Exporting ESXi network configurations..."
ssh -i ~/.ssh/esxi_migration_rsa root@10.1.1.120 \
  "esxcli network vswitch standard list" > $BACKUP_DIR/esxi-host1-network.txt
ssh -i ~/.ssh/esxi_migration_rsa root@10.1.1.121 \
  "esxcli network vswitch standard list" > $BACKUP_DIR/esxi-host2-network.txt
echo "✓ Network configs exported"

echo ""
echo "=== Backup Complete ==="
echo "All backups saved to: $BACKUP_DIR"
echo ""
echo "Next steps:"
echo "1. Review backups in $BACKUP_DIR"
echo "2. Copy to external storage: cp -r $BACKUP_DIR /path/to/external/drive/"
echo "3. Verify Palo Alto firewall config is exported via web UI"
echo "4. Document current IP addresses and credentials"
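Before copying the directory to external storage, it is worth failing fast on any archive that is missing, zero-byte, or truncated. A minimal verification sketch (the archive names match the ones created by the script above):

```shell
# Flag any backup archive that is missing, empty, or not a valid gzip stream.
verify_backups() {
  local dir=$1 f rc=0
  shift
  for f in "$@"; do
    if [ ! -s "$dir/$f" ]; then
      echo "MISSING/EMPTY: $f"; rc=1
    elif ! gzip -t "$dir/$f" 2>/dev/null; then
      echo "CORRUPT: $f"; rc=1
    else
      echo "OK: $f"
    fi
  done
  return $rc
}

# Example:
# verify_backups "$BACKUP_DIR" plex-backup.tar.gz docker-backup.tar.gz \
#   pihole-backup.tar.gz iridium-backup.tar.gz
```

Run it before and after copying to external storage; a non-zero exit means do not proceed with the migration.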

Palo Alto Firewall Backup (Manual)

  • Log into Palo Alto web UI (https://10.1.1.103)
  • Export configuration: Device → Setup → Operations → Export named configuration
  • Save to $BACKUP_DIR/paloalto-config.xml
  • Screenshot all firewall rules (Policies → Security)
  • Screenshot NAT policies (Policies → NAT)
  • Document interface → VLAN mappings:
    Example mapping:
    ethernet1/1 → VLAN 300 (Public)
    ethernet1/2 → VLAN 50 (Lab)
    ethernet1/3 → VLAN 0 (Management)

NFS Mount Verification

# Verify NFS is accessible from current VMs
ssh -i ~/.ssh/esxi_migration_rsa luke@10.1.1.125 "df -h | grep 10.1.1.150"
ssh -i ~/.ssh/esxi_migration_rsa luke@10.1.1.32 "df -h | grep 10.1.1.150"

# Test NFS from Proxmox staging (verify Proxmox can access NFS)
ssh -i ~/.ssh/esxi_migration_rsa root@10.1.1.123 "mkdir -p /mnt/test && mount -t nfs4 10.1.1.150:/volume1/datastore/media /mnt/test && ls /mnt/test && umount /mnt/test"

0.3: Environment Variables & Credentials Documentation

Create a secure document with:

  • Cloudflare API key (for Traefik)
  • Cloudflare email
  • Domain name (${DOMAINNAME} from docker-compose.yml)
  • SABnzbd Usenet credentials
  • Palo Alto firewall admin credentials
  • All VM root/admin passwords
  • NFS server credentials (if any)

0.4: Copy External Dependencies

  • Download docker-compose.yml from docker VM:
    scp -i ~/.ssh/esxi_migration_rsa luke@10.1.1.32:/home/luke/docker/docker-compose.yml $BACKUP_DIR/
  • Download .env file (if exists):
    scp -i ~/.ssh/esxi_migration_rsa luke@10.1.1.32:/home/luke/docker/.env $BACKUP_DIR/ || echo "No .env file"
  • Download Traefik configs:
    scp -r -i ~/.ssh/esxi_migration_rsa luke@10.1.1.32:/home/luke/docker/traefik/ $BACKUP_DIR/

Duration: 2-3 hours
Validation: Verify all backups exist and are not empty before proceeding


Phase 1: Install Proxmox on Host 2

1.1: Pre-Install Checks

  • All Phase 0 backups completed and verified
  • 2TB NVMe drive physically installed (if doing hardware upgrade)
  • Proxmox USB installer prepared
  • All offline VMs on Host 2 confirmed safe to delete

1.2: Proxmox Installation

Same as original plan - install Proxmox on ghost-esx-02 (10.1.1.121):

  • Hostname: proxmox-02
  • IP: 10.1.1.121/24
  • Gateway: 10.1.1.1
  • Filesystem: ZFS (RAID0) with compression

1.3: Network Configuration

Configure dual 10GbE bond with VLAN-aware bridge (same as original plan)
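For reference, a sketch of what that looks like in /etc/network/interfaces. The NIC names (enp2s0f0/enp2s0f1) are placeholders; check yours with `ip link` before applying:

```text
auto bond0
iface bond0 inet manual
    bond-slaves enp2s0f0 enp2s0f1
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4

auto vmbr0
iface vmbr0 inet static
    address 10.1.1.121/24
    gateway 10.1.1.1
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
```

Note: bond-mode 802.3ad assumes the switch side has a matching LACP port-channel; apply with `ifreload -a` and keep console access handy in case the node drops off the network.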

1.4: Intel iGPU Passthrough Configuration

(Same as original plan - enable IOMMU, load VFIO, blacklist i915)
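For quick reference, the host-side pieces that sketch usually amounts to (file names are conventions, not requirements; adjust to your setup):

```text
# /etc/default/grub -- enable IOMMU, then run: update-grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"

# /etc/modules -- load VFIO at boot
vfio
vfio_iommu_type1
vfio_pci

# /etc/modprobe.d/blacklist-igpu.conf -- keep the host off the iGPU,
# then run: update-initramfs -u && reboot
blacklist i915
```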

1.5: NFS Storage Testing - NEW

Critical: Test NFS mount BEFORE migrating VMs

# On proxmox-02:
# Install NFS client
apt install -y nfs-common

# Test NFS mount
mkdir -p /mnt/nfs-test
mount -t nfs4 10.1.1.150:/volume1/datastore/media /mnt/nfs-test

# Verify mount
df -h | grep 10.1.1.150
ls -lah /mnt/nfs-test

# Check read/write permissions
touch /mnt/nfs-test/proxmox-write-test.txt
rm /mnt/nfs-test/proxmox-write-test.txt

# Unmount test
umount /mnt/nfs-test

Expected Results:

  • NFS mount successful
  • Can read existing media files
  • Can create/delete test files
  • ~27TB total, ~21TB used
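With the share already roughly 78% full (21TB of 27TB), it is worth scripting a capacity check so post-migration downloads do not silently fill the volume. A small parser over the `df -h` line for the mount (the 90% threshold is an arbitrary example):

```shell
# Extract the Use% column (field 5) from a `df -h` line for the NFS mount.
nfs_usage_pct() {
  awk '{ gsub(/%/, "", $5); print $5 }' <<< "$1"
}

# Example:
# line=$(df -h | grep 10.1.1.150)
# [ "$(nfs_usage_pct "$line")" -lt 90 ] || echo "WARNING: NFS volume above 90%"
```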

Validation Checklist:

  • Proxmox web UI accessible at https://10.1.1.121:8006
  • SSH access working
  • Network connectivity (ping 10.1.1.1, 8.8.8.8)
  • Intel iGPU shows as available for passthrough
  • NFS mount working with read/write access
  • Test VM boots successfully

Duration: 3-4 hours


Phase 2: Migrate VMs from Host 1 to Proxmox Host 2

Migration Strategy: Incremental approach, lowest to highest risk

2.1: Test Migration - “pihole” VM (DNS Service) - UPDATED

Why First:

  • Relatively simple (single application, no storage dependencies)
  • Can use DNS fallback (8.8.8.8) during migration
  • Good test of migration process

Pre-Migration:

# Update DHCP to use fallback DNS temporarily (on Palo Alto or DHCP server)
# Add secondary DNS: 8.8.8.8

# Backup Pi-hole config (already done in Phase 0)
# Verify backup exists
ls -lh $BACKUP_DIR/pihole-backup.tar.gz

Migration Steps:

  1. Stop Pi-hole VM on ESXi:

    ssh -i ~/.ssh/esxi_migration_rsa root@10.1.1.120 "vim-cmd vmsvc/power.off <VMID>"
  2. Export VM disk:

    # On ESXi host, clone to a thin, self-contained copy (vmkfstools writes a
    # descriptor file plus a -flat data file):
    ssh -i ~/.ssh/esxi_migration_rsa root@10.1.1.120 "vmkfstools -i /vmfs/volumes/<datastore>/pihole/pihole.vmdk -d thin /vmfs/volumes/<datastore>/pihole/pihole-export.vmdk"
    
    # Copy both export files to Proxmox:
    scp -i ~/.ssh/esxi_migration_rsa "root@10.1.1.120:/vmfs/volumes/<datastore>/pihole/pihole-export*.vmdk" /var/lib/vz/images/
  3. Create VM on Proxmox:

    # On proxmox-02 (omit the VLAN tag for the untagged/native management
    # network - Proxmox only accepts tags 1-4094, so tag=0 is rejected):
    qm create 101 --name pihole --memory 1024 --cores 1 --net0 virtio,bridge=vmbr0
    
    # Import disk (qm importdisk reads the descriptor file)
    qm importdisk 101 /var/lib/vz/images/pihole-export.vmdk local-zfs
    qm set 101 --scsi0 local-zfs:vm-101-disk-0
    qm set 101 --boot order=scsi0
    qm set 101 --ostype l26
  4. Start VM and validate:

    qm start 101
    
    # Wait 30 seconds, then test
    ping 10.1.1.35
    ssh -i ~/.ssh/esxi_migration_rsa luke@10.1.1.35
    
    # Verify Pi-hole running
    ssh -i ~/.ssh/esxi_migration_rsa luke@10.1.1.35 "pihole status"
    
    # Test DNS resolution
    nslookup google.com 10.1.1.35
  5. Revert DHCP to use Pi-hole as primary DNS
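The fixed "wait 30 seconds" in step 4 can be replaced with polling, so validation starts as soon as the service actually answers. A sketch using bash's built-in /dev/tcp (no extra packages needed):

```shell
# Poll a TCP port until it accepts connections or the timeout (seconds) expires.
wait_for_port() {
  local host=$1 port=$2 deadline=$(( $(date +%s) + ${3:-60} ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    (exec 3<> "/dev/tcp/${host}/${port}") 2>/dev/null && return 0
    sleep 2
  done
  return 1
}

# Example: wait up to 120s for Pi-hole's web UI, then run the DNS checks
# wait_for_port 10.1.1.35 80 120 && nslookup google.com 10.1.1.35
```

The same helper works for the later migrations (Tautulli 8181, UniFi 8443, Plex 32400).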

Validation:

  • Pi-hole VM boots on Proxmox
  • Network connectivity (ping gateway, internet)
  • Pi-hole web UI accessible (http://10.1.1.35/admin)
  • DNS queries working
  • Ad blocking functional

Rollback: Restart Pi-hole VM on ESXi if issues
Downtime: ~15-20 minutes
Duration: 1-2 hours with testing


2.2: Migrate “iridium” VM (Plex Support Services) - UPDATED ⭐ MEDIUM PRIORITY

Why Second:

  • Important services but not critical infrastructure
  • Supports Plex (should be migrated before Plex itself)
  • UniFi Controller manages network devices
  • Good complexity test without critical dependencies

Applications:

  • Tautulli (Plex monitoring)
  • Cloudflared (remote access tunnel)
  • UniFi Controller (network management)

Pre-Migration Checklist:

  • Tautulli/UniFi/Cloudflared configs backed up (Phase 0)
  • Document Cloudflare tunnel token
  • Note UniFi devices (will temporarily lose management)

Migration Steps:

  1. Document current configuration:

    # Check what's running:
    ssh -i ~/.ssh/esxi_migration_rsa luke@10.1.1.126 "ps aux | grep -E 'tautulli|cloudflare|unifi' | grep -v grep"
    
    # Check Tautulli port:
    ssh -i ~/.ssh/esxi_migration_rsa luke@10.1.1.126 "curl -I http://localhost:8181"
    
    # Document Cloudflare tunnel token (visible in process list)
  2. Stop services gracefully (if possible):

    ssh -i ~/.ssh/esxi_migration_rsa luke@10.1.1.126 "sudo s6-svc -d /run/s6-rc/servicedirs/svc-tautulli || echo 'Manual stop failed, VM shutdown will stop services'"
  3. Shut down VM on ESXi:

    ssh -i ~/.ssh/esxi_migration_rsa root@10.1.1.120 "vim-cmd vmsvc/power.off <VMID>"
  4. Export VM disk to Proxmox: (Same process as Pi-hole - export VMDK, copy to Proxmox, import)

  5. Create VM on Proxmox:

    # On proxmox-02:
    qm create 104 --name plex-support --memory 8192 --cores 4 --net0 virtio,bridge=vmbr0
    qm importdisk 104 /var/lib/vz/images/iridium.vmdk local-zfs
    qm set 104 --scsi0 local-zfs:vm-104-disk-0
    qm set 104 --boot order=scsi0
    qm set 104 --ostype l26
  6. Start VM and verify services:

    qm start 104
    
    # Wait 60 seconds for boot
    sleep 60
    
    # SSH into VM
    ssh -i ~/.ssh/esxi_migration_rsa luke@10.1.1.126
  7. Validate Tautulli:

    # Check Tautulli is running:
    ssh -i ~/.ssh/esxi_migration_rsa luke@10.1.1.126 "ps aux | grep tautulli | grep -v grep"
    
    # Access Tautulli web UI:
    curl -I http://10.1.1.126:8181
    # Or open in browser: http://10.1.1.126:8181
    
    # Verify Tautulli can connect to Plex (10.1.1.125):
    # Check Tautulli UI → Settings → Plex Media Server
  8. Validate Cloudflared:

    # Check Cloudflared tunnel is running:
    ssh -i ~/.ssh/esxi_migration_rsa luke@10.1.1.126 "ps aux | grep cloudflared | grep -v grep"
    
    # Test remote access (if configured):
    # Try accessing services via Cloudflare tunnel URL
  9. Validate UniFi Controller:

    # Check UniFi Controller is running:
    ssh -i ~/.ssh/esxi_migration_rsa luke@10.1.1.126 "ps aux | grep unifi | grep -v grep"
    
    # Access UniFi Controller web UI:
    # https://10.1.1.126:8443
    
    # Verify UniFi devices are reconnecting:
    # UniFi UI → Devices → Check all devices show "Connected"
    # (May take 2-5 minutes for devices to reconnect)

Validation Checklist:

  • iridium VM boots on Proxmox
  • Network connectivity (ping gateway, internet)
  • Tautulli web UI accessible (http://10.1.1.126:8181)
  • Tautulli can connect to Plex server (10.1.1.125)
  • Cloudflared tunnel running (check process)
  • Remote access working (if configured)
  • UniFi Controller web UI accessible (https://10.1.1.126:8443)
  • UniFi devices reconnected (check controller UI)
  • All network devices managed and healthy

Important Notes:

  • UniFi Devices: Will briefly lose controller connection during migration
  • Devices will auto-reconnect when controller comes back online
  • No network outage (devices continue forwarding traffic)
  • Management features temporarily unavailable during migration

Rollback: Restart iridium VM on ESXi if issues
Downtime: ~30-45 minutes (UniFi management only)
Duration: 1-2 hours with full validation


2.3: Migrate “docker” VM (Media Automation Stack) - UPDATED ⭐ HIGH PRIORITY

Complexity: HIGH - Multiple dependencies (NFS, Traefik, Cloudflare, Docker network)
Risk: Media management offline during migration

Critical Dependencies:

  • NFS mount for media (10.1.1.150)
  • Cloudflare API credentials for Traefik SSL
  • Docker volumes for all container configs
  • Environment variables (.env file)

Pre-Migration Checklist:

  • Docker compose file backed up
  • .env file backed up (contains Cloudflare credentials, domain)
  • All container config volumes backed up
  • NFS mount tested on Proxmox (done in Phase 1.5)
  • Cloudflare API key documented

Migration Steps:

  1. Stop all Docker containers gracefully:

    ssh -i ~/.ssh/esxi_migration_rsa luke@10.1.1.32 "cd /home/luke/docker && docker-compose down"
  2. Shut down VM on ESXi:

    ssh -i ~/.ssh/esxi_migration_rsa root@10.1.1.120 "vim-cmd vmsvc/power.off <VMID>"
  3. Export VM disk to Proxmox: (Same process as Pi-hole - export VMDK, copy to Proxmox, import)

  4. Create VM on Proxmox:

    # On proxmox-02:
    qm create 102 --name docker-media --memory 4096 --cores 2 --net0 virtio,bridge=vmbr0
    qm importdisk 102 /var/lib/vz/images/docker.vmdk local-zfs
    qm set 102 --scsi0 local-zfs:vm-102-disk-0
    qm set 102 --boot order=scsi0
    qm set 102 --ostype l26
  5. Start VM and reconfigure NFS mount:

    qm start 102
    
    # SSH into VM
    ssh -i ~/.ssh/esxi_migration_rsa luke@10.1.1.32
    
    # Verify /etc/fstab has NFS mount:
    cat /etc/fstab | grep 10.1.1.150
    
    # Should show:
    # 10.1.1.150:/volume1/datastore/media /mnt/media nfs4 defaults 0 0
    
    # Mount NFS (should auto-mount from fstab, but verify):
    mount -a
    df -h | grep 10.1.1.150
    
    # Verify media files accessible:
    ls /mnt/media/Movies
    ls /mnt/media/TV
  6. Verify Docker Compose and Environment:

    # Verify docker-compose.yml exists:
    cat /home/luke/docker/docker-compose.yml | head -20
    
    # Verify .env file exists (contains Cloudflare credentials):
    cat /home/luke/docker/.env | grep CLOUDFLARE
    
    # Should show:
    # CLOUDFLARE_EMAIL=your@email.com
    # CLOUDFLARE_API_KEY=your_api_key
    # DOMAINNAME=your.domain.com
  7. Start Docker containers:

    ssh -i ~/.ssh/esxi_migration_rsa luke@10.1.1.32 "cd /home/luke/docker && docker-compose up -d"
    
    # Monitor container startup:
    ssh -i ~/.ssh/esxi_migration_rsa luke@10.1.1.32 "docker ps -a"
    
    # Check logs for any errors:
    ssh -i ~/.ssh/esxi_migration_rsa luke@10.1.1.32 "docker-compose logs -f --tail=50"
  8. Validate ALL services:

    Traefik (Reverse Proxy):

    # Check Traefik is running and has SSL certs:
    ssh -i ~/.ssh/esxi_migration_rsa luke@10.1.1.32 "docker logs traefik 2>&1 | grep -i certificate"
    
    # Verify Traefik web UI accessible:
    curl -k https://traefik.${DOMAINNAME}

    Radarr (Movies):

    # Verify Radarr is accessible:
    curl -I https://cyan.${DOMAINNAME}
    
    # Verify NFS mount visible in Radarr:
    # Access Radarr UI → Settings → Media Management → Root Folders
    # Should show: /mnt/media/Movies

    Sonarr (TV):

    # Verify Sonarr is accessible:
    curl -I https://teal.${DOMAINNAME}
    
    # Verify NFS mount visible in Sonarr:
    # Access Sonarr UI → Settings → Media Management → Root Folders
    # Should show: /mnt/media/TV

    SABnzbd (Downloads):

    # Verify SABnzbd is running:
    curl -I http://10.1.1.32:2099
    
    # Verify download directories:
    # Access SABnzbd UI → Config → Folders
    # Should show: /downloads, /incomplete-downloads

    Portainer:

    # Access Portainer web UI:
    # http://10.1.1.32:9000

    Uptime Kuma:

    curl -I https://status.${DOMAINNAME}
  9. Test Complete Data Flow:

    • Add test movie request in Ombi/Overseerr
    • Verify Radarr picks up request
    • Verify SABnzbd can download (or start download)
    • Verify files saved to NFS mount (/mnt/media/Downloads)
    • Verify Radarr can move to /mnt/media/Movies
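The bare `defaults` entry shown in step 5 works, but NFS mount options matter once downloads write through this mount. A hardened fstab variant worth considering (the options are suggestions, not requirements of the stack):

```text
# /etc/fstab
# hard    -> block I/O until the server responds (no silent corruption of
#            in-flight downloads if the NAS reboots)
# bg      -> retry the mount in the background if the NAS boots slower than the VM
# _netdev -> wait for networking before attempting the mount at boot
10.1.1.150:/volume1/datastore/media  /mnt/media  nfs4  hard,bg,noatime,_netdev  0  0
```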

Validation Checklist:

  • Docker VM boots on Proxmox
  • NFS mount working (/mnt/media accessible)
  • All Docker containers running (docker ps shows all as “Up”)
  • Traefik reverse proxy working (SSL certs valid)
  • Radarr accessible and can see /mnt/media/Movies
  • Sonarr accessible and can see /mnt/media/TV
  • SABnzbd accessible and can download
  • Ombi/Overseerr accessible for user requests
  • Portainer accessible for Docker management
  • Uptime Kuma monitoring working
  • Cloudflare SSL certificates auto-renewing (check Traefik logs)

Rollback: Restart docker VM on ESXi, docker-compose up -d
Downtime: ~30-60 minutes
Duration: 2-3 hours with full validation


2.4: Migrate “platinum” VM (Plex Media Server) - UPDATED ⭐ CRITICAL

Complexity: HIGHEST - Requires iGPU passthrough + NFS mount
Risk: Media streaming offline, hardware transcoding must work

Critical Dependencies:

  • Intel UHD 630 iGPU passthrough (for Quick Sync transcoding)
  • NFS mount for media libraries (10.1.1.150)
  • Plex database (local on VM)

Pre-Migration Checklist:

  • Plex database backed up (Phase 0)
  • iGPU passthrough verified working on Proxmox Host 2 (Phase 1.4)
  • NFS mount tested on Proxmox (Phase 1.5)
  • Current transcoding settings documented

Migration Steps:

  1. Document current Plex settings:

    ssh -i ~/.ssh/esxi_migration_rsa luke@10.1.1.125 "cat '/var/lib/plexmediaserver/Library/Application Support/Plex Media Server/Preferences.xml' | grep -i transcode"
    
    # Note settings:
    # - HardwareAcceleratedCodecs (should be enabled)
    # - TranscoderH264BackgroundPreset
    # - TranscoderTempDirectory
  2. Stop Plex service gracefully:

    ssh -i ~/.ssh/esxi_migration_rsa luke@10.1.1.125 "sudo systemctl stop plexmediaserver"
    
    # Wait 30 seconds for graceful shutdown
    sleep 30
  3. Final Plex database backup:

    ssh -i ~/.ssh/esxi_migration_rsa luke@10.1.1.125 \
      "sudo tar czf /tmp/plex-final-backup.tar.gz \
      '/var/lib/plexmediaserver/Library/Application Support/Plex Media Server/'"
    scp -i ~/.ssh/esxi_migration_rsa luke@10.1.1.125:/tmp/plex-final-backup.tar.gz $BACKUP_DIR/
  4. Shut down VM on ESXi:

    ssh -i ~/.ssh/esxi_migration_rsa root@10.1.1.120 "vim-cmd vmsvc/power.off <VMID>"
  5. Export VM disk to Proxmox: (Same process - export VMDK, copy to Proxmox)

  6. Create VM on Proxmox with iGPU passthrough:

    # On proxmox-02 (untagged management network - Proxmox rejects tag=0):
    qm create 103 --name plex --memory 8192 --cores 4 --net0 virtio,bridge=vmbr0
    qm importdisk 103 /var/lib/vz/images/platinum.vmdk local-zfs
    qm set 103 --scsi0 local-zfs:vm-103-disk-0
    qm set 103 --boot order=scsi0
    qm set 103 --ostype l26
    
    # CRITICAL: Set machine type to q35 (required for PCIe passthrough):
    qm set 103 --machine q35
    
    # Add Intel iGPU passthrough:
    # First, identify iGPU PCI address:
    lspci -nnk | grep -i vga
    # Should show: 00:02.0 VGA compatible controller: Intel Corporation ... [8086:3e9b]
    
    # Add PCI device to VM:
    qm set 103 --hostpci0 00:02.0,pcie=1,rombar=0
  7. Start VM and verify boot:

    qm start 103
    
    # Monitor boot via the Proxmox web console (VM 103 -> Console).
    # (qm terminal 103 only works if the VM has a serial console configured.)
    
    # Wait 60 seconds for boot, then SSH:
    ssh -i ~/.ssh/esxi_migration_rsa luke@10.1.1.125
  8. Verify iGPU is visible in guest OS:

    ssh -i ~/.ssh/esxi_migration_rsa luke@10.1.1.125 "lspci | grep -i vga"
    # Should show: Intel Corporation CoffeeLake-H GT2 [UHD Graphics 630]
    
    # Verify /dev/dri devices exist:
    ssh -i ~/.ssh/esxi_migration_rsa luke@10.1.1.125 "ls -l /dev/dri"
    # Should show: renderD128, card0
    
    # Install/verify vainfo (Intel GPU tools):
    ssh -i ~/.ssh/esxi_migration_rsa luke@10.1.1.125 "sudo apt install -y vainfo"
    
    # Verify Intel Quick Sync capabilities:
    ssh -i ~/.ssh/esxi_migration_rsa luke@10.1.1.125 "vainfo"
    # Should show supported encode/decode profiles (H.264, HEVC, etc.)
  9. Verify NFS mount:

    ssh -i ~/.ssh/esxi_migration_rsa luke@10.1.1.125 "df -h | grep 10.1.1.150"
    # Should show: 10.1.1.150:/volume1/datastore/media mounted on /mnt/media
    
    # Verify media files accessible:
    ssh -i ~/.ssh/esxi_migration_rsa luke@10.1.1.125 "ls /mnt/media/Movies | head -10"
    ssh -i ~/.ssh/esxi_migration_rsa luke@10.1.1.125 "ls /mnt/media/TV | head -10"
  10. Start Plex and verify hardware transcoding:

    ssh -i ~/.ssh/esxi_migration_rsa luke@10.1.1.125 "sudo systemctl start plexmediaserver"
    
    # Wait 30 seconds for Plex to start:
    sleep 30
    
    # Verify Plex is running:
    ssh -i ~/.ssh/esxi_migration_rsa luke@10.1.1.125 "sudo systemctl status plexmediaserver"
    
    # Access Plex web UI:
    # http://10.1.1.125:32400/web
    
    # Verify libraries are visible (media from NFS mount)
  11. Test hardware transcoding:

    • Access Plex web UI
    • Go to Settings → Transcoder
    • Verify “Use hardware acceleration when available” is enabled
    • Verify “Hardware transcoding device” shows Intel iGPU
    • Play a video that requires transcoding (adjust quality to force transcode)
    • Check transcoding session:
      # While video is playing, check transcode session:
      ssh -i ~/.ssh/esxi_migration_rsa luke@10.1.1.125 "ps aux | grep 'Plex Transcoder'"
      # Should show transcoder process
      
      # Verify hardware transcoding in Plex dashboard:
      # Settings → Status → Now Playing
      # Should show "(hw)" next to video codec if using hardware transcode
      
      # Verify low CPU usage (hardware transcoding offloads to iGPU):
      ssh -i ~/.ssh/esxi_migration_rsa luke@10.1.1.125 "top -bn1 | head -20"
      # CPU usage should be <20% during 4K transcode if hw acceleration working
  12. Validate Plex Media Scanner:

    # Trigger manual library scan:
    # Plex UI → Libraries → [Library Name] → Scan Library Files
    
    # Verify scan works (can read NFS mount)
    # Check Plex logs if scan fails:
    ssh -i ~/.ssh/esxi_migration_rsa luke@10.1.1.125 \
      "tail -100 '/var/lib/plexmediaserver/Library/Application Support/Plex Media Server/Logs/Plex Media Scanner.log'"
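The "low CPU usage" check in step 11 can be made objective by summing the transcoder processes' %CPU column from `ps aux` (the 20% guideline above is the threshold to compare against):

```shell
# Sum the %CPU field (column 3 of `ps aux`) across Plex Transcoder processes.
# The [P] bracket trick stops the pipeline's own command line from matching.
transcoder_cpu() {
  awk '/[P]lex Transcoder/ { cpu += $3 } END { printf "%.0f\n", cpu + 0 }'
}

# Example (run while a transcode session is active):
# ssh -i ~/.ssh/esxi_migration_rsa luke@10.1.1.125 "ps aux" | transcoder_cpu
```

A high total here with the "(hw)" indicator missing in the dashboard usually means Plex fell back to software transcoding.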

Validation Checklist:

  • Plex VM boots on Proxmox
  • Intel iGPU visible in guest OS (lspci shows iGPU)
  • /dev/dri devices exist (renderD128, card0)
  • vainfo shows Intel Quick Sync encode/decode profiles
  • NFS mount working (/mnt/media accessible)
  • All Plex libraries visible in UI
  • Plex can scan media files from NFS
  • Hardware transcoding enabled in Plex settings
  • Test transcode session shows “(hw)” indicator
  • CPU usage low during transcode (<20% for 4K)
  • Plex accessible from local network
  • Plex accessible from internet (if remote access configured)

Troubleshooting iGPU Passthrough: If iGPU not visible or hardware transcoding not working:

  1. Verify IOMMU enabled: dmesg | grep -i iommu
  2. Verify i915 driver blacklisted on host: lsmod | grep i915 (should be empty)
  3. Verify vfio-pci driver loaded: lspci -nnk | grep -A3 VGA
  4. Check VM machine type is q35: qm config 103 | grep machine
  5. Check permissions on /dev/dri in guest: ls -l /dev/dri
  6. Add Plex user to render group in guest: sudo usermod -aG render plex
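Those six checks are easy to fold into a single pass/fail sweep on the Proxmox host. A generic helper (the example commands are the ones listed above):

```shell
# Run a command silently and report PASS/FAIL with a label.
run_check() {
  local label=$1; shift
  if "$@" > /dev/null 2>&1; then
    echo "PASS: $label"
  else
    echo "FAIL: $label"
  fi
}

# Example sweep on proxmox-02 (sketch):
# run_check "IOMMU enabled"      sh -c 'dmesg | grep -qi iommu'
# run_check "i915 not loaded"    sh -c '! lsmod | grep -q i915'
# run_check "VM 103 machine=q35" sh -c 'qm config 103 | grep -q q35'
```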

Rollback: Restart Plex VM on ESXi
Downtime: ~1-2 hours
Duration: 3-4 hours with full testing


2.5: Migrate “jarnetfw” VM (Palo Alto Firewall) - UPDATED ⭐ CRITICAL

Complexity: CRITICAL - Network outage affects ALL services
Risk: HIGHEST - Inter-VLAN routing down during migration
Timing: OFF-HOURS / PLANNED MAINTENANCE WINDOW REQUIRED

Critical Requirements:

  • Document ALL interface → VLAN mappings
  • Export Palo Alto configuration
  • Plan communication (network outage window)
  • Keep ESXi Host 1 available for emergency rollback

Pre-Migration Checklist:

  • ALL other VMs successfully migrated to Proxmox (Pi-hole, Docker, Plex)
  • Palo Alto config exported (done in Phase 0)
  • Interface mappings documented
  • Firewall rules screenshot
  • NAT policies screenshot
  • Scheduled maintenance window communicated
  • Rollback plan ready (ESXi Host 1 kept online for 48 hours)

Interface Mapping Documentation (CRITICAL): Before migration, document EXACT interface mappings:

# On Palo Alto (via CLI or web UI), document:
# Interface | VLAN | IP Address | Role
# --------- | ---- | ---------- | ----
# ethernet1/1 | VLAN 300 | 10.1.300.1/24 | Public DMZ
# ethernet1/2 | VLAN 50 | 10.1.50.1/24 | Lab Network
# ethernet1/3 | VLAN 0 | 10.1.1.103/24 | Management
# ethernet1/4 | VLAN 4095 | Trunk | Internal
# (Example - document YOUR actual mappings)
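One gotcha when translating the mapping: ESXi uses VLAN 4095 to mean "trunk all VLANs", but Proxmox only accepts tags 1-4094 (a trunk is expressed by omitting the tag on a VLAN-aware bridge). A tiny validator for the tags you carry over:

```shell
# Return success only for VLAN tags Proxmox will accept (1-4094).
# ESXi's 0 (native) and 4095 (all VLANs) have no direct tag equivalent:
# configure those vNICs without a tag instead.
valid_pve_tag() {
  case $1 in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$1" -ge 1 ] && [ "$1" -le 4094 ]
}

# Example:
# valid_pve_tag 300  && echo "tag=300 ok"
# valid_pve_tag 4095 || echo "4095: use an untagged trunk port instead"
```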

Migration Steps:

  1. Final configuration export:

    # Via Palo Alto web UI:
    # Device → Setup → Operations → Export named configuration snapshot
    # Save to: $BACKUP_DIR/paloalto-final-config-$(date +%Y%m%d).xml
  2. Announce maintenance window:

    • Network outage expected: 30-60 minutes
    • All inter-VLAN traffic will be down
    • Internet access will be down
    • Schedule during lowest usage period
  3. Shut down Palo Alto VM (network outage begins):

    ssh -i ~/.ssh/esxi_migration_rsa root@10.1.1.120 "vim-cmd vmsvc/power.off <VMID>"
    # Network outage begins NOW
  4. Export VM disk to Proxmox: (Same process - but work quickly, network is down)

  5. Create VM on Proxmox with MULTIPLE network interfaces:

    # On proxmox-02 (VMID 104 is already taken by plex-support, so use 105):
    qm create 105 --name paloalto-fw --memory 7168 --cores 4 --ostype other
    qm importdisk 105 /var/lib/vz/images/jarnetfw.vmdk local-zfs
    qm set 105 --scsi0 local-zfs:vm-105-disk-0
    qm set 105 --boot order=scsi0
    
    # CRITICAL: Add network interfaces with correct VLAN tags
    # Match your documented interface mappings!
    
    # Management interface (untagged/native VLAN - Proxmox rejects tag=0):
    qm set 105 --net0 virtio,bridge=vmbr0
    
    # Public interface (VLAN 300):
    qm set 105 --net1 virtio,bridge=vmbr0,tag=300
    
    # Lab interface (VLAN 50):
    qm set 105 --net2 virtio,bridge=vmbr0,tag=50
    
    # Trunk interface (ESXi's VLAN 4095 means "all VLANs"; on a VLAN-aware
    # Proxmox bridge, omit the tag to pass the full trunk):
    qm set 105 --net3 virtio,bridge=vmbr0
    
    # Add more interfaces as needed to match your config
  6. Start Palo Alto VM:

    qm start 105
    
    # Wait 2-3 minutes for Palo Alto to boot (slow boot)
    sleep 180
  7. Verify management interface accessible:

    # Try to ping Palo Alto management IP:
    ping -c 5 10.1.1.103
    
    # Access web UI:
    # https://10.1.1.103
    # Login and verify
  8. Verify ALL interfaces are UP:

    # Via Palo Alto web UI:
    # Network → Interfaces
    # Verify all ethernet interfaces show "up" status
    
    # Verify VLAN assignments match documented mappings
  9. Test inter-VLAN routing:

    # From a client on VLAN 0, ping a device on VLAN 50:
    ping 10.1.50.x
    
    # From a client on VLAN 50, ping internet:
    ping 8.8.8.8
    
    # Test each VLAN can reach:
    # - Other VLANs (if policy allows)
    # - Internet (if NAT configured)
    # - Gateway (Palo Alto interface IP)
  10. Verify firewall rules working:

    # Via Palo Alto web UI:
    # Monitor → Traffic
    # Generate test traffic and verify rules are being hit
    
    # Verify NAT working (if configured):
    # Monitor → Session Browser
    # Check outbound sessions show NAT translation
  11. Verify all migrated VMs can communicate:

    • Plex VM can reach internet (for metadata, posters)
    • Docker VM can reach internet (for container updates, Cloudflare API)
    • Pi-hole can reach internet (for DNS resolution)
    • All VMs can reach NFS server (10.1.1.150)
    • Client devices can reach all services

Validation Checklist:

  • Palo Alto VM boots on Proxmox
  • Management web UI accessible (https://10.1.1.103)
  • ALL network interfaces show “up” status
  • VLAN tags correctly assigned to interfaces
  • Inter-VLAN routing working (test each VLAN pair)
  • Internet access working from all VLANs
  • NAT policies working (if configured)
  • Firewall rules working (monitor traffic logs)
  • All previously migrated VMs still accessible
  • No network errors in Palo Alto logs

Emergency Rollback Plan: If Palo Alto migration fails and network is down:

  1. DO NOT DELETE ESXi HOST 1 YET
  2. Power on ESXi Host 1
  3. Start jarnetfw VM on ESXi
  4. Wait 2-3 minutes for boot
  5. Network should restore
  6. Investigate Proxmox issue before retry

Downtime: ~30-60 minutes (network outage)
Duration: 2-3 hours with full validation
Recommendation: Do NOT proceed to Phase 3 until Palo Alto is stable for 24-48 hours


Phase 3: Install Proxmox on Host 1

Prerequisites:

  • ALL critical VMs running successfully on Proxmox Host 2 for 24-48 hours
  • Plex, Docker, Pi-hole, Palo Alto all stable and validated
  • No network issues
  • No service degradation

Steps: Same as Phase 1, but for ghost-esxi-01 (10.1.1.120)

  • Hostname: proxmox-01
  • IP: 10.1.1.120/24
  • Same network config, iGPU passthrough setup, NFS testing

Duration: 3-4 hours


Phase 4: Create 3-Node Proxmox Cluster

Prerequisites:

  • Both NUCs running Proxmox successfully
  • Stable network connectivity between all 3 nodes
  • All VMs operational on proxmox-02

Steps: (Same as original plan)

  1. Initialize cluster on proxmox-01
  2. Join proxmox-02 to cluster
  3. Join pve-staging to cluster
  4. Verify 3-node cluster
  5. Test VM migration between nodes

Application Considerations:

  • Before migrating VMs between nodes: Stop the VM, migrate, then start
  • For Plex: Test iGPU passthrough on destination node first
  • For Docker: Verify NFS mount on destination node first

Duration: 2-3 hours


Phase 5: Rebalance Workloads

Why: Free up resources on Node 2 for Home Assistant/Frigate

5.1: Migrate Plex from Node 2 to Node 1

Steps:

  1. Stop Plex VM on proxmox-02
  2. Migrate VM to proxmox-01 (via Proxmox UI or qm migrate)
  3. Verify iGPU passthrough still works on Node 1
  4. Verify NFS mount still works
  5. Test hardware transcoding
  6. Start Plex and validate

Duration: 1-2 hours

5.2: Migrate Home Assistant + Frigate from Staging to Node 2

Complexity: HIGH - Coral TPU USB passthrough required

Prerequisites:

  • Plex migrated to Node 1 (or sufficient resources on Node 2)
  • USB controller passthrough tested on Node 2

USB Passthrough Preparation:

# On proxmox-02:
lsusb
# Identify Coral TPU: Bus 002 Device 003: ID 1a6e:089a Global Unichip Corp.

# Identify USB controller:
lspci | grep USB
# Note PCI address (e.g., 00:14.0)
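As a convenience, the vendor:product ID can be extracted from lsusb and dropped straight into the qm argument used later. This assumes the Coral reports as "Global Unichip Corp." as shown above, and uses VMID 203 from the migration steps below.

```shell
# Print the qm passthrough command for the Coral, if present.
coral_id=$(lsusb | awk '/Global Unichip/ {print $6; exit}')
if [ -n "$coral_id" ]; then
  echo "qm set 203 --usb0 host=${coral_id}"
else
  echo "Coral TPU not detected on this host" >&2
fi
```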

Migration Steps:

  1. Stop home-sec VM on staging:

    # On pve-staging:
    qm stop 103
  2. Backup VM (via Proxmox backup):

    vzdump 103 --storage local --mode stop
  3. Migrate to Node 2:

    # Method 1: Copy the backup to Node 2, then restore with a new VMID
    # (run the scp from pve-staging; qmrestore runs on proxmox-02):
    scp /var/lib/vz/dump/vzdump-qemu-103-*.vma.zst root@proxmox-02:/var/lib/vz/dump/
    qmrestore /var/lib/vz/dump/vzdump-qemu-103-*.vma.zst 203 --storage local-zfs

    # OR Method 2: If the cluster already exists, migrate directly
    # (VM keeps ID 103 - adjust the VMID in the steps below accordingly):
    qm migrate 103 proxmox-02
  4. Reconfigure USB passthrough on Node 2:

    # On proxmox-02:
    # Option A: pass through the whole USB controller (most reliable for
    # the Coral, but the host loses access to every port on it):
    qm set 203 --hostpci0 00:14.0

    # Option B: pass through just the Coral by vendor:product ID:
    qm set 203 --usb0 host=1a6e:089a
  5. Start VM and verify Coral TPU:

    qm start 203
    
    # SSH into VM:
    ssh -i ~/.ssh/esxi_migration_rsa ubuntu@10.1.1.208
    
    # Verify Coral TPU visible:
    lsusb | grep "Global Unichip"
    # Should show: Bus 002 Device 003: ID 1a6e:089a Global Unichip Corp.
    
    # Verify Frigate detects Coral:
    docker logs frigate 2>&1 | grep -i coral
    # Should show: "Coral detected"
  6. Validate Home Assistant and Frigate:

    • Home Assistant web UI accessible (http://10.1.1.208:8123)
    • Frigate web UI accessible
    • Frigate detects Coral TPU (check Frigate logs)
    • Camera streams visible
    • Object detection working (person, car, etc.)
    • Recordings working

Duration: 2-3 hours with testing


Phase 6: Final Cleanup and Validation

6.1: VM Cleanup

  • Remove decommissioned VMs from Proxmox inventory
  • Remove old ESXi VMs from ESXi hosts (if keeping ESXi as backup)
  • Clean up test VMs

6.2: Documentation Updates

  • Update network documentation with new VM IDs
  • Document final cluster architecture
  • Update credentials document with any new passwords
  • Document lessons learned

6.3: Backup Configuration

# Backup Proxmox cluster config:
tar czf /root/proxmox-cluster-backup-$(date +%Y%m%d).tar.gz /etc/pve

# Copy to external storage:
scp /root/proxmox-cluster-backup-*.tar.gz user@backup-server:/backups/
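An optional sanity check confirms the newest archive is actually readable before relying on it (filename pattern as above):

```shell
# Verify the newest cluster-config backup is a readable tar archive.
backup=$(ls -t /root/proxmox-cluster-backup-*.tar.gz 2>/dev/null | head -1)
if [ -n "$backup" ] && tar tzf "$backup" >/dev/null 2>&1; then
  echo "backup OK: $backup"
else
  echo "WARNING: no readable backup archive found"
fi
```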

6.4: Final Application Validation

Complete Service Test:

  1. Media Pipeline End-to-End Test:

    • User requests movie via Overseerr
    • Radarr searches and sends to SABnzbd
    • SABnzbd downloads to NFS (/mnt/media/Downloads)
    • Radarr imports to /mnt/media/Movies
    • Plex scans and adds movie
    • User plays movie with hardware transcoding
    • Expected Duration: 5-30 minutes (depending on download speed)
  2. Network Services:

    • Pi-hole DNS working (test from client: nslookup google.com 10.1.1.35)
    • Palo Alto inter-VLAN routing working
    • All VLANs can reach internet
    • Firewall rules enforced
  3. Smart Home:

    • Home Assistant responsive
    • Frigate object detection working
    • Camera recordings saving
    • Coral TPU inference working (check Frigate stats)
  4. Reverse Proxy:

    • Traefik SSL certificates valid
    • All services accessible via domain names
    • Cloudflare DNS-01 challenge working
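The certificate checks above can be done from the command line with openssl; the hostnames below are placeholders - substitute the real domains Traefik serves.

```shell
# Inspect the served certificate (issuer + expiry) for each proxied host.
# plex.example.com / sonarr.example.com are placeholder hostnames.
for host in plex.example.com sonarr.example.com; do
  echo | openssl s_client -connect "$host:443" -servername "$host" 2>/dev/null \
    | openssl x509 -noout -issuer -enddate 2>/dev/null \
    || echo "no certificate retrieved from $host"
done
```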

6.5: Monitoring Setup

  • Configure Uptime Kuma monitoring for all services
  • Set up Proxmox email alerts (optional)
  • Configure backup schedules in Proxmox
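Until Uptime Kuma monitors are configured, a stopgap curl probe covers the key HTTP endpoints. The URLs below are examples drawn from this plan (Home Assistant, Palo Alto web UI); extend the list with your other services.

```shell
# Stopgap HTTP availability probe until Uptime Kuma monitors exist.
for url in http://10.1.1.208:8123 https://10.1.1.103; do
  code=$(curl -ks -o /dev/null -w '%{http_code}' --max-time 5 "$url" 2>/dev/null)
  echo "$url -> HTTP ${code:-000}"
done
```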

Risk Assessment & Mitigation - Application Layer

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| NFS mount fails on Proxmox | Low | Critical | Test NFS in Phase 1.5 before VM migration |
| Plex iGPU passthrough fails | Medium | High | Test on staging first; keep ESXi available for rollback |
| Docker containers fail to start | Medium | High | Backup docker-compose.yml and .env; test individually |
| Traefik SSL certificates fail | Medium | Medium | Verify Cloudflare API key; manual cert generation possible |
| Coral TPU passthrough fails | Medium | High | Keep Home Assistant on staging until validated |
| Palo Alto network config wrong | Low | Critical | Document ALL interface mappings; test each VLAN |
| Plex database corruption | Very Low | High | Multiple backups before/during migration |
| SABnzbd loses download queue | Low | Medium | Export queue before migration; can re-add manually |

Application-Specific Rollback Plans

Plex Rollback

If Plex fails on Proxmox:

  1. Stop Plex VM on Proxmox
  2. Start original iridium VM on ESXi Host 1
  3. Restore Plex database from backup (if needed)
  4. Users can resume streaming immediately

Docker Stack Rollback

If Docker stack fails:

  1. Stop docker VM on Proxmox
  2. Start original docker VM on ESXi Host 1
  3. Run docker-compose up -d
  4. Services restored within 5 minutes

Palo Alto Rollback

If network fails:

  1. Shut down Palo Alto VM on Proxmox
  2. Start jarnetfw VM on ESXi Host 1
  3. Network restored within 2-3 minutes

Timeline Estimate - Application-Focused

| Week | Phase | Activities | Time | Critical Path |
|------|-------|------------|------|---------------|
| 1 | Phase 0 | Backups (Plex DB, Docker configs, Pi-hole) | 3-4 hours | Pre-req for all |
| 2 | Phase 1 | Install Proxmox on Host 2, test NFS | 4-5 hours | NFS test critical |
| 3 | Phase 2.1 | Migrate Pi-hole | 1-2 hours | Test migration process |
| 4 | Phase 2.2 | Migrate Docker stack (test Traefik/NFS) | 3-4 hours | Complex, high risk |
| 5 | Phase 2.3 | Migrate Plex (test iGPU + NFS) | 3-4 hours | Highest complexity |
| 6 | Phase 2.4 | Migrate Palo Alto (MAINTENANCE WINDOW) | 2-3 hours | Network outage |
| 7 | Validation | Monitor all services for stability | Ongoing | 1 week stability |
| 8 | Phase 3 | Install Proxmox on Host 1 | 3-4 hours | - |
| 9 | Phase 4 | Create cluster, migrate Plex to Node 1 | 3-4 hours | - |
| 10 | Phase 5 | Migrate Home Assistant/Frigate to Node 2 | 3-4 hours | Coral TPU test |
| 11 | Phase 6 | Final validation and cleanup | 2-3 hours | End-to-end test |

Total: ~30-40 hours over 11 weeks (comfortable pace)
Fast-Track: 4-5 weekends (~25-30 hours total)


Critical Success Factors - Application Layer

Must-Have Before Starting:

  1. NFS accessible from Proxmox - Test in Phase 1.5
  2. Cloudflare API credentials documented - Needed for Traefik SSL
  3. Plex database backed up - Multiple backups
  4. Docker compose and .env files backed up - Critical for stack restore
  5. Palo Alto config exported - Network restoration depends on this
  6. iGPU passthrough working on Proxmox - Plex depends on this

Validation Gates (Do Not Proceed Until Complete):

  • After Phase 1: NFS mount working on Proxmox
  • After Phase 2.2: All Docker containers running, Traefik SSL working
  • After Phase 2.3: Plex hardware transcoding working with iGPU
  • After Phase 2.4: Network stable for 24-48 hours, all VLANs working
  • After Phase 5.2: Coral TPU working, Frigate object detection confirmed

Next Steps

  1. Review this updated plan and confirm:

    • VM renaming strategy acceptable
    • Downtime windows identified for Palo Alto migration
    • Cloudflare API credentials available
    • Understand NFS dependency (no media migration needed)
  2. Answer remaining questions:

    • Decommission server-2019, xsoar, home-security VMs (all offline)?
    • Preferred cluster name and node naming convention?
    • UniFi network devices inventory (for validation after migration)
  3. Order hardware:

    • 2x 2TB NVMe drives (if doing hardware upgrade)
  4. Schedule maintenance windows:

    • Palo Alto migration (30-60 min network outage)
    • Plex migration (1-2 hour streaming outage)

Once approved, we can begin Phase 0 preparation!