ESXi to Proxmox Migration Plan - REVISED

Created: 2025-12-28 | Revised: 2025-12-28 | Status: Draft - Ready for User Review


Executive Summary

Current Situation:

  • 2x ESXi hosts (NUC9i9QNX) with production workloads
  • 1x Proxmox staging server already running (Home Assistant + Frigate with Coral TPU)
  • Blue Iris → Frigate migration already complete ✅
  • Proxmox platform validated ✅

End Goal:

  • 3-node Proxmox VE cluster with HA capability
  • Nodes 1 & 2 (NUCs): Production workhorses
  • Node 3 (staging): quorum tie-breaker + light workloads (a full cluster member, so no separate QDevice is required)

Migration Approach:

  • Install Proxmox on Host 2 first (all VMs offline - lowest risk)
  • Migrate critical VMs from Host 1 → Proxmox Host 2
  • Install Proxmox on Host 1
  • Create 3-node cluster
  • Migrate Home Assistant/Frigate from staging → NUC
  • Rebalance workloads

Migration Architecture

End-State Configuration

Node 1: proxmox-01 (was ghost-esxi-01, 10.1.1.120)

Hardware:

  • Intel NUC9i9QNX
  • 8C/16T i9-9980HK
  • 64GB RAM
  • 2TB NVMe (upgrade from 1TB)
  • Dual 10GbE + 1GbE
  • Intel UHD 630 iGPU (passthrough capable)

Workloads:

  • Plex (iGPU passthrough for Quick Sync)
  • Docker stack (Radarr, Sonarr, SABnzbd, etc.)
  • Pi-hole
  • Palo Alto firewall
  • Lab VMs (as needed)

Node 2: proxmox-02 (was ghost-esx-02, 10.1.1.121)

Hardware:

  • Intel NUC9i9QNX
  • 8C/16T i9-9980HK
  • 64GB RAM
  • 2TB NVMe (upgrade from 1TB)
  • Dual 10GbE + 1GbE
  • Intel UHD 630 iGPU (passthrough capable)

Workloads:

  • Home Assistant + Frigate (migrated from staging)
  • Spare capacity for growth
  • Lab VMs (as needed)

Node 3: pve-staging (remains, 10.1.1.123)

Hardware:

  • Intel Core i5-8400T (6 cores)
  • 32GB RAM
  • ~900GB storage (LVM-Thin)
  • Single 1GbE
  • USB controller for Coral TPU (until Frigate migrates)

Role:

  • Quorum tie-breaker for the 3-node cluster (full member; no separate QDevice needed)
  • Docker services
  • K8s lab
  • Templates
  • Temporary home for Frigate until NUC migration

Cluster Topology:

┌─────────────────┐       ┌─────────────────┐       ┌─────────────────┐
│  proxmox-01     │◄─────►│  proxmox-02     │◄─────►│  pve-staging    │
│  10.1.1.120     │       │  10.1.1.121     │       │  10.1.1.123     │
│  NUC (Prod)     │       │  NUC (Prod)     │       │  Witness node   │
│  HA Member      │       │  HA Member      │       │  Quorum + light │
└─────────────────┘       └─────────────────┘       └─────────────────┘
         ▲                         ▲                         ▲
         └─────────────────────────┴─────────────────────────┘
                        10GbE Network

Pre-Migration Decisions Required

1. Hardware Upgrade Timing ⚠️ CRITICAL DECISION

Option A: Install 2TB drives BEFORE migration (RECOMMENDED)

  • ✅ Clean Proxmox install on larger drives
  • ✅ No need to resize/migrate storage later
  • ✅ More headroom during migration
  • ❌ Adds time/complexity upfront
  • Timeline: Order drives now, install before starting

Option B: Upgrade DURING migration (Hybrid)

  • Install 2TB in Host 2 when wiping for Proxmox
  • Keep 1TB in Host 1 temporarily
  • Upgrade Host 1 drive later (requires re-migration)
  • ⚠️ Inconsistent storage capacity during migration

Option C: Upgrade AFTER migration

  • ❌ More complex - requires VM migration again
  • ❌ Less space during critical migration phase
  • ❌ Not recommended

RECOMMENDATION: Option A - Install 2TB drives first

2. Proxmox Storage Backend

Option A: ZFS (single disk)

  • ✅ Built-in snapshots and compression
  • ✅ Data integrity (checksums)
  • ✅ Better VM performance
  • ✅ Native replication support
  • ❌ ~5-10% overhead for single disk
  • RECOMMENDED for your use case

Option B: LVM-Thin (like your staging server)

  • ✅ Familiar (already using on staging)
  • ✅ Slightly more usable space
  • ✅ Thin provisioning
  • ❌ No native compression
  • ❌ Fewer snapshot features
  • Alternative if you want consistency with staging

RECOMMENDATION: ZFS for NUCs, keep LVM-Thin on staging
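If ARC memory pressure ever becomes a concern (see the risk table later in this plan), the cache can be capped. A minimal sketch, assuming the installer's default rpool and an 8 GiB cap (size to taste; by default ZFS may claim up to half of RAM):

    # Cap the ZFS ARC so VMs keep most of the 64GB RAM
    echo "options zfs zfs_arc_max=8589934592" > /etc/modprobe.d/zfs.conf
    update-initramfs -u -k all
    # Takes effect after reboot; check with: arc_summary | grep -i "arc size"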

3. Network Configuration

Proxmox vmbr0 Configuration (both NUCs):

  • Bond: 2x 10GbE in balance-alb or LACP (if switch supports)
  • VLAN-aware bridge: Yes
  • VLANs: untagged/native (mgmt), 50 (lab), 300 (public); the VLAN-aware bridge carries all tags, replacing ESXi's VLAN 4095 trunk port groups
  • MTU: 9000 (jumbo frames, matching current ESXi)

Proxmox vmbr1 Configuration (optional):

  • Dedicate a NIC to Corosync/migration traffic - either the 1GbE, or one 10GbE link instead of bonding both (the NUCs only have two 10GbE ports)
  • Low latency, isolated from VM traffic (Corosync is latency-sensitive, not bandwidth-hungry)
  • RECOMMENDED for cluster stability

Migration Phases

Phase 0: Preparation (1-2 weeks)

Hardware:

  • Order 2x 2TB WD Blue SN580 NVMe drives
  • Create Proxmox VE 8.x bootable USB installer
  • Prepare backup storage for critical VMs

Validation:

  • Test iGPU passthrough on pve-staging (if possible - different CPU though)
  • Document current Plex transcoding settings
  • Document Palo Alto firewall configuration
  • Export all ESXi VM configurations
  • Take screenshots of ESXi network/storage configs

Information Gathering:

  • Clarify “iridium” VM purpose (currently unknown)
  • Confirm server-2019 and xsoar VMs can be deleted/offline
  • Identify acceptable downtime windows for critical services
  • Verify NFS backup (10.1.1.150) is not needed

Backups:

  • Export critical VM disk images (Plex, Docker, Palo Alto)
  • Backup Plex database externally
  • Document all VM IP addresses and network configs
  • Save Docker compose files / container configs
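A minimal sketch for capturing the ESXi-side inventory before anything is wiped, assuming SSH is enabled on both hosts (the destination host and path are placeholders):

    # On each ESXi host:
    vim-cmd vmsvc/getallvms > /tmp/vm-inventory.txt
    esxcli network vswitch standard list > /tmp/vswitch-config.txt
    esxcli network ip interface ipv4 get > /tmp/vmk-ips.txt
    esxcli storage filesystem list > /tmp/datastores.txt
    # Copy the results off-host:
    scp /tmp/*.txt user@backup-host:/backups/esxi-configs/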

Phase 1: Install Proxmox on Host 2 (ghost-esx-02)

Why Host 2 First:

  • ✅ All VMs currently powered off (zero downtime)
  • ✅ Larger storage (2.79TB vs 931GB) - better for receiving migrated VMs
  • ✅ Lower risk - no critical services running

Steps:

  1. Pre-install backup (if needed):

    • home-security VM is replaced by Frigate (can delete)
    • server-2019: Likely old Blue Iris host (confirm, then delete)
    • xsoar, win11-sse, win-10: Confirm these can be offline permanently
  2. Install new 2TB NVMe drive (if doing Option A)

  3. Boot Proxmox installer:

    • Hostname: proxmox-02 (or pve-02)
    • IP: 10.1.1.121/24 (keep same IP)
    • Gateway: 10.1.1.1
    • DNS: 10.1.1.1 (or current DNS server)
    • Disk: Select 2TB NVMe
    • Filesystem: ZFS (RAID0) - single disk with compression
    • Country/Timezone: Set appropriately
    • Root password: Secure password
  4. Post-install configuration:

    # Update system
    apt update && apt full-upgrade -y
    
    # Configure repositories (no subscription): disable the enterprise repo
    sed -i 's/^deb/#deb/' /etc/apt/sources.list.d/pve-enterprise.list
    # Add the no-subscription repo
    echo "deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription" > /etc/apt/sources.list.d/pve-no-subscription.list
    apt update
    
    # Install useful packages
    apt install -y vim htop iotop tmux
  5. Configure networking:

    # Edit /etc/network/interfaces
    # Create vmbr0 with dual 10GbE bond + VLAN-aware
    # Create vmbr1 for migration/Corosync (optional but recommended)

    Example /etc/network/interfaces:

    auto lo
    iface lo inet loopback
    
    # 1GbE - emergency access only
    auto eno1
    iface eno1 inet manual
    
    # 10GbE bond for VM traffic
    auto bond0
    iface bond0 inet manual
        bond-slaves enp1s0f0 enp1s0f1   # adjust to the actual NIC names (ip link)
        bond-miimon 100
        bond-mode balance-alb
        # for LACP, use bond-mode 802.3ad with bond-xmit-hash-policy layer2+3
        # (xmit-hash-policy has no effect in balance-alb mode)
        mtu 9000
    
    # VLAN-aware bridge for VMs
    auto vmbr0
    iface vmbr0 inet static
        address 10.1.1.121/24
        gateway 10.1.1.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094
        mtu 9000
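    Apply and verify (a sketch; ifupdown2 ships with Proxmox VE 8, so no reboot is needed, and the jumbo-frame ping assumes the switch path is MTU 9000 end to end):

    ifreload -a
    ip -d link show bond0 | grep -e mode -e mtu
    bridge vlan show
    ping -M do -s 8972 10.1.1.1   # 8972 bytes payload + 28 bytes headers = 9000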
  6. Configure Intel iGPU for passthrough:

    # Enable IOMMU
    vi /etc/default/grub
    # Add to GRUB_CMDLINE_LINUX_DEFAULT:
    # intel_iommu=on iommu=pt
    
    update-grub
    
    # Add VFIO modules
    vi /etc/modules
    # Add:
    # vfio
    # vfio_iommu_type1
    # vfio_pci
    # (vfio_virqfd is only needed on kernels < 6.2; it is built into vfio on PVE 8)
    
    # Blacklist i915 driver (iGPU)
    echo "blacklist i915" >> /etc/modprobe.d/blacklist.conf
    
    # Update initramfs
    update-initramfs -u -k all
    
    # Reboot
    reboot
  7. Verify iGPU is available for passthrough:

    # Confirm IOMMU is active
    dmesg | grep -e DMAR -e IOMMU
    
    lspci -nnk | grep -iA3 vga
    # The UHD 630 entry should show "Kernel driver in use: vfio-pci"
  8. Test VM creation:

    • Create small test VM
    • Verify networking works
    • Test VLAN tagging
    • Validate storage performance
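One quick way to cover the "validate storage performance" item: pveperf ships with Proxmox (the /rpool/data mountpoint assumes the installer's default ZFS layout):

    pveperf /rpool/data
    # Compare FSYNCS/SECOND and buffered reads against pve-staging as a baseline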

Validation Checklist:

  • Proxmox web UI accessible at https://10.1.1.121:8006
  • SSH access working
  • Network connectivity (ping gateway, internet)
  • Intel iGPU shows as available for passthrough (lspci)
  • Storage pool visible and healthy
  • Test VM boots and has network connectivity

Duration: 2-4 hours


Phase 2: Migrate VMs from Host 1 to Proxmox Host 2

Migration Order (lowest to highest risk):

2.1: Test Migration - “iridium” (Low Risk, Unknown Purpose)

  • Export from ESXi via OVF or disk copy
  • Import to Proxmox Host 2 (see the sketch after this list)
  • Test boot and functionality
  • Validate migration process
  • Downtime: ~15-30 minutes
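A sketch of the export/import flow for this and the following migrations, assuming SSH is enabled on the ESXi host (datastore paths, VM ID 200, and the sizing are illustrative):

    # On the ESXi host: copy the descriptor and flat VMDK to Proxmox Host 2
    scp /vmfs/volumes/datastore1/iridium/iridium.vmdk \
        /vmfs/volumes/datastore1/iridium/iridium-flat.vmdk \
        root@10.1.1.121:/tmp/
    
    # On proxmox-02: create an empty VM shell, then import the disk
    qm create 200 --name iridium --memory 4096 --cores 2 \
        --net0 virtio,bridge=vmbr0 --scsihw virtio-scsi-pci
    qm importdisk 200 /tmp/iridium.vmdk local-zfs
    qm set 200 --scsi0 local-zfs:vm-200-disk-0 --boot order=scsi0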

2.2: Pi-hole (Medium Risk, DNS Service)

  • Risk: DNS disruption during migration
  • Mitigation: Update DHCP to use backup DNS (8.8.8.8) temporarily
  • Export VM from ESXi
  • Import to Proxmox Host 2
  • Reconfigure network (static IP, VLAN tag if needed)
  • Test DNS resolution
  • Downtime: ~15-20 minutes

2.3: Docker Stack (High Risk, Media Services)

  • Risk: Radarr, Sonarr, SABnzbd, Overseerr, etc. offline
  • Impact: Downloads/media management paused
  • Export VM from ESXi (may be large if Docker volumes are on VM disk)
  • Import to Proxmox Host 2
  • Start VM and verify all containers come up
  • Test media stack functionality
  • Downtime: ~30-60 minutes

2.4: Plex Server “xeon” (HIGH RISK, iGPU Passthrough Required)

  • Risk: Media streaming offline, iGPU passthrough must work
  • Complexity: HIGHEST - hardware passthrough critical
  • Prerequisites:
    • Verify iGPU passthrough working on Proxmox Host 2
    • Backup Plex database to external storage
    • Document current transcoding settings

Migration Steps:

  1. Prepare Plex for migration:

    • Stop Plex service on ESXi VM
    • Backup Plex database: /var/lib/plexmediaserver/Library/Application Support/Plex Media Server/
    • Note transcoder settings (Hardware acceleration: Quick Sync)
  2. Export VM:

    • Shut down “xeon” VM on ESXi
    • Export VMDK to Proxmox Host 2 via SCP/NFS
  3. Import and configure on Proxmox:

    • Create new VM on Proxmox (match CPU/RAM: 4 vCPU, 8GB RAM)
    • Machine type: q35 (required for PCIe passthrough)
    • Import disk image
    • Add iGPU passthrough:
      • Add PCI device: Intel UHD 630 (00:02.0)
      • Enable “All Functions” and “PCI-Express”; “Primary GPU” is optional (leave it unchecked to keep the Proxmox console usable)
    • Network: Configure bridge with appropriate VLAN
  4. Boot and validate:

    • Start VM
    • Check iGPU is visible in guest OS: lspci | grep VGA
    • Install/update Intel Graphics drivers in Ubuntu
    • Start Plex
    • Verify hardware transcoding: Play media and check transcoder shows “(hw)”
  5. Test transcoding:

    • Play 4K video and verify Quick Sync is being used
    • Check CPU usage (should be low with hw transcoding)
    • Verify vainfo shows available encode/decode profiles

Downtime: ~1-2 hours
Rollback: Keep ESXi Host 1 bootable until validated
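A sketch of the database backup and the post-migration transcode check, assuming a Debian/Ubuntu guest, Plex's default Linux paths, and a placeholder backup host:

    # Inside the Plex VM on ESXi, before export:
    sudo systemctl stop plexmediaserver
    sudo tar czf /tmp/plex-db-$(date +%F).tgz \
        "/var/lib/plexmediaserver/Library/Application Support/Plex Media Server/"
    scp /tmp/plex-db-*.tgz user@backup-host:/backups/
    
    # Inside the migrated VM on Proxmox, to confirm Quick Sync is usable:
    sudo apt install -y vainfo intel-media-va-driver-non-free
    vainfo | grep -i -e h264 -e hevc   # encode/decode profiles should be listed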

2.5: Palo Alto Firewall “jarnetfw” (CRITICAL - LAST)

  • Risk: CRITICAL - Network outage for all VLANs
  • Impact: All inter-VLAN routing down
  • Migration Window: Off-hours/planned outage required

Prerequisites:

  • ALL other VMs successfully migrated and validated
  • Document complete firewall configuration
  • Export Palo Alto config backup
  • Plan communication to users (if applicable)

Migration Steps:

  1. Backup current state:

    • Export Palo Alto configuration via web UI
    • Screenshot all firewall rules/NAT/routing
    • Document interface → VLAN mappings
  2. Export VM:

    • Shut down jarnetfw VM (network outage begins)
    • Export VMDK to Proxmox Host 2
  3. Import and configure:

    • Create VM on Proxmox (4 vCPU, 7GB RAM)
    • Import disk
    • Critical: Map network interfaces correctly
      • Match VLAN tags to ESXi port groups
      • VM network interface 1 → vmbr0.300 (Public)
      • VM network interface 2 → vmbr0.50 (Lab)
      • etc.
  4. Boot and validate:

    • Start VM
    • Check Palo Alto web UI is accessible
    • Verify all interfaces are UP
    • Test inter-VLAN routing
    • Test internet connectivity from each VLAN
    • Verify firewall rules are working

Downtime: ~30-60 minutes (network outage)
Rollback Plan: If the migration fails, revert to ESXi Host 1 (keep it available for 24-48 hours)
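A sketch of the NIC mapping on the Proxmox side (VM ID 210 and the netX-to-VLAN ordering are illustrative; VM-Series treats the first NIC as management, so match the order to your documented ESXi port groups):

    qm set 210 --net0 virtio,bridge=vmbr0           # untagged / management
    qm set 210 --net1 virtio,bridge=vmbr0,tag=300   # Public
    qm set 210 --net2 virtio,bridge=vmbr0,tag=50    # Lab
    qm config 210 | grep ^net                       # confirm the assignments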


Phase 3: Install Proxmox on Host 1 (ghost-esxi-01)

Why After Host 2:

  • All critical VMs now running on Proxmox Host 2
  • Host 1 can be wiped with confidence
  • Lower pressure/risk

Steps:

  1. Final validation:

    • Verify ALL migrated VMs running successfully on Host 2
    • Confirm Plex transcoding working
    • Confirm Palo Alto firewall routing working
    • Confirm Docker services accessible
  2. Export remaining VMs (if needed):

    • Win-11, Win7-Victim (lab VMs) - only if you want to keep them
  3. Install new 2TB NVMe drive (if not done yet)

  4. Install Proxmox VE:

    • Same process as Host 2
    • Hostname: proxmox-01 (or pve-01)
    • IP: 10.1.1.120/24
    • Gateway: 10.1.1.1
    • Filesystem: ZFS (RAID0)
  5. Configure networking:

    • Match Host 2 configuration (dual 10GbE bond, VLAN-aware bridge)
    • MTU 9000 for jumbo frames
  6. Configure iGPU passthrough:

    • Same steps as Host 2
    • Enable IOMMU, load VFIO modules, blacklist i915
  7. Test and validate:

    • Create test VM
    • Verify iGPU available for passthrough
    • Verify network connectivity

Duration: 2-4 hours


Phase 4: Create 3-Node Proxmox Cluster

Prerequisites:

  • Both NUCs running Proxmox successfully
  • All critical VMs operational on Host 2
  • Stable network connectivity between all 3 nodes

Steps:

  1. Initialize cluster on Node 1:

    # On proxmox-01 (10.1.1.120):
    pvecm create homelab-cluster
    
    # Verify cluster status
    pvecm status
  2. Join Node 2 to cluster:

    # On proxmox-02 (10.1.1.121):
    pvecm add 10.1.1.120
    
    # Enter root password for proxmox-01
    # Wait for join to complete
  3. Join Node 3 (staging) to cluster:

    # On pve-staging (10.1.1.123):
    pvecm add 10.1.1.120
    
    # Enter root password
    # Wait for join to complete
  4. Verify 3-node cluster:

    # On any node:
    pvecm status
    pvecm nodes
    
    # Should show all 3 nodes online
    # Quorum should be 2/3
  5. Configure node priorities (optional but recommended):

    • Set staging node as lower priority for resource allocation
    • Configure HA groups if desired
  6. Test cluster:

    • View all nodes in web UI
    • Test VM migration between Node 1 and Node 2
    • Verify quorum works if one node goes down
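A quick way to exercise the quorum test in the last step (reboot pve-staging as the test subject, since nothing critical runs there yet):

    # On a surviving node while the third is down:
    pvecm status
    # Expect "Quorate: Yes" with 2 of 3 votes; the cluster stays writable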

Notes on 3-Node Cluster:

  • Quorum: Need 2/3 nodes online for cluster to function
  • HA: Can survive 1 node failure
  • No QDevice needed with 3 nodes (odd number provides quorum)
  • Staging node: Can have lower specs, only needs to vote for quorum

Duration: 1-2 hours


Phase 5: Migrate Plex from Node 2 to Node 1

Why:

  • Free up resources on Node 2 for Home Assistant migration
  • Better workload distribution
  • Node 1 has iGPU, ideal for Plex

Steps:

  1. Stop Plex VM on Node 2
  2. Migrate VM to Node 1 via Proxmox (offline migration; VMs with PCIe passthrough cannot be live-migrated)
  3. Verify iGPU passthrough still works on Node 1
  4. Test Plex transcoding
  5. Start Plex and validate

Duration: 30-60 minutes


Phase 6: Migrate Home Assistant + Frigate to Node 2

Why:

  • Move production workload from staging to more powerful NUC
  • Free up staging server for witness role + light workloads
  • Better long-term architecture

Challenges:

  • Coral TPU: Staging has USB controller passthrough (PCI 00:14)
    • NUCs may have different PCI topology
    • May need to pass through entire USB controller or individual USB port
  • Static IP: Home Assistant has static IP (10.1.1.208)
  • Uptime: High-priority service (CCTV)

Steps:

  1. Prepare Node 2 for USB passthrough:

    # On proxmox-02:
    lsusb
    # Identify Coral TPU device
    
    # Find USB controller PCI ID
    lspci | grep USB
    
    # Configure USB controller for passthrough (similar to staging)
    # OR use USB device ID passthrough (easier but less stable)
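    # Alternative sketch: pass the Coral by USB vendor:product ID instead of the
    # whole controller (simpler, but re-plugs can drop the device). The Coral USB
    # Accelerator enumerates as 1a6e:089a before firmware load and 18d1:9302 once
    # initialized - verify with lsusb; the VM ID is illustrative:
    # qm set 103 --usb0 host=18d1:9302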
  2. Stop home-sec VM on staging:

    # On pve-staging:
    qm stop 103
  3. Export and migrate VM:

    # Method 1: Backup/Restore via Proxmox
    vzdump 103 --storage local --mode stop
    # Copy backup to Node 2
    # Restore on Node 2
    
    # Method 2: qm migrate (if the cluster already exists; offline only,
    # since USB/PCI passthrough rules out --online)
    qm migrate 103 proxmox-02
  4. Reconfigure VM on Node 2:

    • Update PCI passthrough to match Node 2’s USB controller PCI ID
    • Verify network configuration (static IP 10.1.1.208)
    • Ensure cloud-init settings preserved
  5. Boot and validate:

    • Start VM on Node 2
    • Check Coral TPU is visible: lsusb in guest OS
    • Check Home Assistant accessible via web UI
    • Verify Frigate detects Coral TPU:
      • Check Frigate logs
      • Verify “Coral” detector is active
      • Test object detection on camera feeds
  6. Monitor for 24-48 hours:

    • Verify camera streams stable
    • Verify object detection working
    • Check for any USB passthrough issues

Downtime: ~30-60 minutes
Fallback: Can revert to the staging server if issues arise

Duration: 2-3 hours with testing


Phase 7: Final Rebalancing and Cleanup

Workload Distribution Review:

| Node | VMs | vCPU Total | RAM Total | Notes |
|------|-----|------------|-----------|-------|
| proxmox-01 | Plex, Docker, Pi-hole, Palo Alto | ~11 vCPU | ~20GB | iGPU for Plex |
| proxmox-02 | Home Assistant/Frigate | ~4 vCPU | ~8GB | USB for Coral TPU, room to grow |
| pve-staging | Docker-host-1, K8s lab, templates | ~4 vCPU active | ~8GB active | Witness + dev/test |

Cleanup Tasks:

  • Delete old ESXi VMs from inventory (if safe)
  • Remove home-security, server-2019 VMs from ESXi host 2
  • Clean up iridium VM if purpose unknown/no longer needed
  • Update documentation with final IP addresses
  • Update DNS records if any changed
  • Configure Proxmox backups (PBS or external; see the sketch after this list)
  • Test HA failover (optional)
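A minimal one-shot vzdump sketch to validate the backup path before scheduling a recurring job under Datacenter → Backup (the storage name, archive name, and restore VM ID are placeholders):

    # Back up every VM on this node with snapshot mode and zstd compression:
    vzdump --all --mode snapshot --compress zstd --storage backup-nfs
    # Spot-check a restore:
    qmrestore /mnt/pve/backup-nfs/dump/vzdump-qemu-200-*.vma.zst 300 --storage local-zfs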

Risk Assessment & Mitigation

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| iGPU passthrough fails on Proxmox | Medium | High | Test on staging first; rollback to ESXi if needed |
| Coral TPU passthrough fails on NUC | Medium | High | Keep Home Assistant on staging until validated |
| Palo Alto firewall migration fails | Low | Critical | Thorough testing; off-hours migration; keep ESXi Host 1 available for rollback |
| Network misconfiguration breaks VLANs | Medium | High | Document all VLAN mappings; test each VLAN post-migration |
| Data loss during VM export/import | Very Low | Critical | Backup all VMs before migration; verify checksums |
| Cluster split-brain | Low | Medium | 3 nodes provide quorum; monitor cluster health |
| Storage performance degradation | Low | Medium | Benchmark ZFS before/after; tune ZFS ARC if needed |

Rollback Plans

Phase 2 Rollback (VMs on Proxmox Host 2)

  • If any VM migration fails: Keep ESXi Host 1 running
  • If Plex iGPU fails: Revert Plex to ESXi Host 1
  • If Palo Alto fails: Emergency reboot of ESXi Host 1 to restore networking

Phase 6 Rollback (Home Assistant/Frigate)

  • If Coral TPU fails on Node 2: Migrate VM back to staging server
  • Fallback: Staging server remains operational during migration

Complete Rollback

  • Worst case: Both NUCs available, can reinstall ESXi if total failure
  • Likelihood: Very low - incremental approach minimizes risk

Timeline Estimate

Conservative Timeline (Recommended):

| Week | Phase | Activities | Time |
|------|-------|------------|------|
| 1 | Phase 0 | Hardware prep, backups, planning | 4-6 hours |
| 2 | Phase 1 | Install Proxmox on Host 2, configure networking | 4-6 hours |
| 3 | Phases 2.1-2.3 | Migrate iridium, Pi-hole, Docker | 2-3 hours |
| 4 | Phase 2.4 | Migrate Plex (iGPU passthrough) | 2-4 hours |
| 5 | Phase 2.5 | Migrate Palo Alto (critical - careful!) | 2-3 hours |
| 6 | Phase 3 | Install Proxmox on Host 1 | 3-4 hours |
| 7 | Phases 4-5 | Create cluster, migrate Plex back to Node 1 | 2-3 hours |
| 8 | Phase 6 | Migrate Home Assistant/Frigate to Node 2 | 3-4 hours |
| 9 | Phase 7 | Final rebalancing, testing, documentation | 2-3 hours |

Total: ~25-36 hours over 9 weeks (comfortable pace with validation)

Fast-Track: Could compress to 3-4 weekends (~20-25 hours total) if comfortable with risk


Questions for User - Action Items

Before proceeding, please answer:

Critical Decisions

  1. NVMe upgrade timing: Order 2TB drives now and install before migration? (RECOMMENDED: Yes)
  2. Storage backend: ZFS or LVM-Thin for NUC nodes? (RECOMMENDED: ZFS)
  3. Downtime windows: What times are acceptable for Plex/Palo Alto/Home Assistant outages?

VM Clarifications

  1. “iridium” VM: What is this for? Can it be offline temporarily?
  2. “server-2019” VM: Old Blue Iris host? Can we delete it?
  3. “xsoar” VM: Keep or decommission?
  4. NFS backup (10.1.1.150): Still in use? Purpose?

Nice-to-Have

  1. Cluster name: “homelab-cluster” or different name?
  2. Node naming: proxmox-01/02 or different convention?
  3. HA preferences: Do you want automatic HA failover or manual?

Next Steps

Once you answer the questions above, I can:

  1. ✅ Create detailed step-by-step runbooks for each phase
  2. ✅ Generate network configuration templates
  3. ✅ Document iGPU passthrough configuration scripts
  4. ✅ Build VM migration checklists
  5. ✅ Create validation/testing scripts
  6. ✅ Design backup strategy for Proxmox cluster

Let’s align on the critical decisions, then we can start Phase 0!