ESXi to Proxmox Migration Plan - REVISED
Created: 2025-12-28 | Revised: 2025-12-28 | Status: Draft - Ready for User Review
Executive Summary
Current Situation:
- 2x ESXi hosts (NUC9i9QNX) with production workloads
- 1x Proxmox staging server already running (Home Assistant + Frigate with Coral TPU)
- Blue Iris → Frigate migration already complete ✅
- Proxmox platform validated ✅
End Goal:
- 3-node Proxmox VE cluster with HA capability
- Nodes 1 & 2 (NUCs): Production workhorses
- Node 3 (staging): quorum witness (third full node, so no separate QDevice is required) + light workloads
Migration Approach:
- Install Proxmox on Host 2 first (all VMs offline - lowest risk)
- Migrate critical VMs from Host 1 → Proxmox Host 2
- Install Proxmox on Host 1
- Create 3-node cluster
- Migrate Home Assistant/Frigate from staging → NUC
- Rebalance workloads
Migration Architecture
End-State Configuration
Node 1: proxmox-01 (was ghost-esxi-01, 10.1.1.120)
Hardware:
- Intel NUC9i9QNX
- 8C/16T i9-9980HK
- 64GB RAM
- 2TB NVMe (upgrade from 1TB)
- Dual 10GbE + 1GbE
- Intel UHD 630 iGPU (passthrough capable)
Workloads:
- Plex (iGPU passthrough for Quick Sync)
- Docker stack (Radarr, Sonarr, SABnzbd, etc.)
- Pi-hole
- Palo Alto firewall
- Lab VMs (as needed)
Node 2: proxmox-02 (was ghost-esx-02, 10.1.1.121)
Hardware:
- Intel NUC9i9QNX
- 8C/16T i9-9980HK
- 64GB RAM
- 2TB NVMe (upgrade from 1TB)
- Dual 10GbE + 1GbE
- Intel UHD 630 iGPU (passthrough capable)
Workloads:
- Home Assistant + Frigate (migrated from staging)
- Spare capacity for growth
- Lab VMs (as needed)
Node 3: pve-staging (remains, 10.1.1.123)
Hardware:
- Intel Core i5-8400T (6 cores)
- 32GB RAM
- ~900GB storage (LVM-Thin)
- Single 1GbE
- USB controller for Coral TPU (until Frigate migrates)
Role:
- Quorum/witness vote for the 3-node cluster (joins as a full node, so no separate QDevice is required)
- Docker services
- K8s lab
- Templates
- Temporary home for Frigate until NUC migration
Cluster Topology:
```
┌─────────────────┐       ┌─────────────────┐       ┌─────────────────┐
│   proxmox-01    │◄─────►│   proxmox-02    │◄─────►│   pve-staging   │
│   10.1.1.120    │       │   10.1.1.121    │       │   10.1.1.123    │
│   NUC (Prod)    │       │   NUC (Prod)    │       │  Witness/QDev   │
│   HA Member     │       │   HA Member     │       │   Quorum only   │
└─────────────────┘       └─────────────────┘       └─────────────────┘
         ▲                         ▲                         ▲
         └─────────────────────────┴─────────────────────────┘
                             10GbE Network
```
Pre-Migration Decisions Required
1. Hardware Upgrade Timing ⚠️ CRITICAL DECISION
Option A: Install 2TB drives BEFORE migration (RECOMMENDED)
- ✅ Clean Proxmox install on larger drives
- ✅ No need to resize/migrate storage later
- ✅ More headroom during migration
- ❌ Adds time/complexity upfront
- Timeline: Order drives now, install before starting
Option B: Upgrade DURING migration (Hybrid)
- Install 2TB in Host 2 when wiping for Proxmox
- Keep 1TB in Host 1 temporarily
- Upgrade Host 1 drive later (requires re-migration)
- ⚠️ Inconsistent storage capacity during migration
Option C: Upgrade AFTER migration
- ❌ More complex - requires VM migration again
- ❌ Less space during critical migration phase
- ❌ Not recommended
RECOMMENDATION: Option A - Install 2TB drives first
2. Proxmox Storage Backend
Option A: ZFS (single disk)
- ✅ Built-in snapshots and compression
- ✅ Data integrity (checksums)
- ✅ Better VM performance
- ✅ Native replication support
- ❌ ~5-10% overhead for single disk
- RECOMMENDED for your use case
Option B: LVM-Thin (like your staging server)
- ✅ Familiar (already using on staging)
- ✅ Slightly more usable space
- ✅ Thin provisioning
- ❌ No native compression
- ❌ Fewer snapshot features
- Alternative if you want consistency with staging
RECOMMENDATION: ZFS for NUCs, keep LVM-Thin on staging
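If you go with ZFS on the 64GB NUCs, keep in mind the ARC will claim up to half of host RAM by default. A minimal sketch for capping it (the 16 GiB limit is an assumption - size it to your VM footprint):

```bash
# Cap the ZFS ARC at 16 GiB (value is in bytes) so VMs keep most of the 64GB RAM
echo "options zfs zfs_arc_max=17179869184" > /etc/modprobe.d/zfs.conf
# Rebuild the initramfs so the cap applies at boot
update-initramfs -u -k all
# Apply immediately without rebooting (optional)
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max
```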
3. Network Configuration
Proxmox vmbr0 Configuration (both NUCs):
- Bond: 2x 10GbE in balance-alb or LACP (if switch supports)
- VLAN-aware bridge: Yes
- VLANs: mgmt untagged (ESXi VLAN 0), 50 (lab), 300 (public); the ESXi "VLAN 4095" trunk setting is replaced by the VLAN-aware bridge itself
- MTU: 9000 (jumbo frames, matching current ESXi)
Proxmox vmbr1 Configuration (optional):
- 1x 10GbE (taken out of the VM bond) or the spare 1GbE, dedicated to Corosync/migration traffic
- Low latency, isolated from VM traffic
- RECOMMENDED for cluster stability
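A minimal sketch of that dedicated cluster link, assuming one 10GbE port (named enp1s0f1 here, hypothetical - adjust to your NIC names) is pulled out of bond0 and given its own subnet (10.1.2.0/24 is an assumption):

```bash
# Hypothetical vmbr1 for Corosync/migration on proxmox-02; remove enp1s0f1 from the
# bond-slaves line of the bond0 stanza before applying.
cat >> /etc/network/interfaces <<'EOF'

auto enp1s0f1
iface enp1s0f1 inet manual

# Dedicated cluster/migration bridge (kept out of bond0)
auto vmbr1
iface vmbr1 inet static
    address 10.1.2.121/24
    bridge-ports enp1s0f1
    bridge-stp off
    bridge-fd 0
    mtu 9000
EOF
# Apply without a reboot (ifupdown2 ships with Proxmox)
ifreload -a
```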
Migration Phases
Phase 0: Preparation (1-2 weeks)
Hardware:
- Order 2x 2TB WD Blue SN580 NVMe drives
- Create Proxmox VE 8.x bootable USB installer
- Prepare backup storage for critical VMs
Validation:
- Test iGPU passthrough on pve-staging (if possible - different CPU though)
- Document current Plex transcoding settings
- Document Palo Alto firewall configuration
- Export all ESXi VM configurations
- Take screenshots of ESXi network/storage configs
Information Gathering:
- Clarify “iridium” VM purpose (currently unknown)
- Confirm server-2019 and xsoar VMs can be deleted/offline
- Identify acceptable downtime windows for critical services
- Verify NFS backup (10.1.1.150) is not needed
Backups:
- Export critical VM disk images (Plex, Docker, Palo Alto)
- Backup Plex database externally
- Document all VM IP addresses and network configs
- Save Docker compose files / container configs
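For the Plex database backup, a hedged sketch (the backup destination is a placeholder; adjust the path if Plex runs inside a container):

```bash
# Stop Plex so its SQLite databases are quiescent before archiving
systemctl stop plexmediaserver
tar czf /tmp/plex-db-backup-$(date +%F).tar.gz \
  "/var/lib/plexmediaserver/Library/Application Support/Plex Media Server/"
# Copy the archive off the host (backup-host:/backups/ is a placeholder)
scp /tmp/plex-db-backup-*.tar.gz backup-host:/backups/
systemctl start plexmediaserver
```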
Phase 1: Install Proxmox on Host 2 (ghost-esx-02)
Why Host 2 First:
- ✅ All VMs currently powered off (zero downtime)
- ✅ Larger storage (2.79TB vs 931GB) - better for receiving migrated VMs
- ✅ Lower risk - no critical services running
Steps:
1. Pre-install backup (if needed):
   - home-security VM is replaced by Frigate (can delete)
   - server-2019: Likely old Blue Iris host (confirm, then delete)
   - xsoar, win11-sse, win-10: Confirm these can be offline permanently
2. Install new 2TB NVMe drive (if doing Option A)
3. Boot Proxmox installer:
   - Hostname: `proxmox-02` (or `pve-02`)
   - IP: `10.1.1.121/24` (keep same IP)
   - Gateway: `10.1.1.1`
   - DNS: `10.1.1.1` (or current DNS server)
   - Disk: Select 2TB NVMe
   - Filesystem: ZFS (RAID0) - single disk with compression
   - Country/Timezone: Set appropriately
   - Root password: Secure password
4. Post-install configuration:
   ```bash
   # Update system
   apt update && apt full-upgrade -y
   # Configure repositories (remove enterprise repo if no subscription)
   # Edit /etc/apt/sources.list.d/pve-enterprise.list and comment out
   # Add no-subscription repo
   echo "deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription" > /etc/apt/sources.list.d/pve-no-subscription.list
   apt update
   # Install useful packages
   apt install -y vim htop iotop tmux
   ```
5. Configure networking:
   - Edit `/etc/network/interfaces`
   - Create vmbr0 with dual 10GbE bond + VLAN-aware bridge
   - Create vmbr1 for migration/Corosync (optional but recommended)

   Example `/etc/network/interfaces`:
   ```
   auto lo
   iface lo inet loopback

   # 1GbE - emergency access only
   auto eno1
   iface eno1 inet manual

   # 10GbE bond for VM traffic
   auto bond0
   iface bond0 inet manual
       bond-slaves enp1s0f0 enp1s0f1
       bond-miimon 100
       bond-mode balance-alb
       bond-xmit-hash-policy layer2+3
       mtu 9000

   # VLAN-aware bridge for VMs
   auto vmbr0
   iface vmbr0 inet static
       address 10.1.1.121/24
       gateway 10.1.1.1
       bridge-ports bond0
       bridge-stp off
       bridge-fd 0
       bridge-vlan-aware yes
       bridge-vids 2-4094
       mtu 9000
   ```
6. Configure Intel iGPU for passthrough:
   ```bash
   # Enable IOMMU
   vi /etc/default/grub
   # Add to GRUB_CMDLINE_LINUX_DEFAULT:
   #   intel_iommu=on iommu=pt
   update-grub
   # Add VFIO modules
   vi /etc/modules
   # Add:
   #   vfio
   #   vfio_iommu_type1
   #   vfio_pci
   #   vfio_virqfd
   # Blacklist i915 driver (iGPU)
   echo "blacklist i915" >> /etc/modprobe.d/blacklist.conf
   # Update initramfs
   update-initramfs -u -k all
   # Reboot
   reboot
   ```
7. Verify iGPU is available for passthrough:
   ```bash
   lspci -nnk | grep -i vga
   # Should show vfio-pci driver for Intel UHD 630
   ```
8. Test VM creation:
- Create small test VM
- Verify networking works
- Test VLAN tagging
- Validate storage performance
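For the storage-performance check, a simple fio run (inside the test VM or on the host; the file path and sizes are arbitrary) gives a repeatable baseline you can compare against after the cluster is built:

```bash
# 60-second 4k random read/write mix, direct I/O to bypass the page cache
apt install -y fio
fio --name=randrw --filename=/root/fio.test --size=4G --bs=4k \
    --rw=randrw --rwmixread=70 --ioengine=libaio --direct=1 \
    --numjobs=1 --runtime=60 --time_based --group_reporting
rm /root/fio.test
```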
Validation Checklist:
- Proxmox web UI accessible at https://10.1.1.121:8006
- SSH access working
- Network connectivity (ping gateway, internet)
- Intel iGPU shows as available for passthrough (lspci)
- Storage pool visible and healthy
- Test VM boots and has network connectivity
Duration: 2-4 hours
Phase 2: Migrate VMs from Host 1 to Proxmox Host 2
Migration Order (lowest to highest risk):
2.1: Test Migration - “iridium” (Low Risk, Unknown Purpose)
- Export from ESXi via OVF or disk copy
- Import to Proxmox Host 2
- Test boot and functionality
- Validate migration process
- Downtime: ~15-30 minutes
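One way to do the export/import, sketched under assumptions (VMID 200, the storage name `local-zfs`, and the datastore paths are all placeholders): copy the `.vmdk` descriptor plus its `-flat` file from the ESXi datastore, then let `qm importdisk` convert it.

```bash
# On proxmox-02: pull the disk files straight off the ESXi datastore (paths are examples)
mkdir -p /var/lib/vz/import
scp "root@10.1.1.120:/vmfs/volumes/datastore1/iridium/iridium*.vmdk" /var/lib/vz/import/
# Create an empty VM shell roughly matching the original's resources (VMID 200 is arbitrary)
qm create 200 --name iridium --memory 4096 --cores 2 --net0 virtio,bridge=vmbr0
# Import the disk (point at the descriptor .vmdk), then attach it and make it bootable
qm importdisk 200 /var/lib/vz/import/iridium.vmdk local-zfs
qm set 200 --scsi0 local-zfs:vm-200-disk-0 --boot order=scsi0
```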
2.2: Pi-hole (Medium Risk, DNS Service)
- Risk: DNS disruption during migration
- Mitigation: Update DHCP to use backup DNS (8.8.8.8) temporarily
- Export VM from ESXi
- Import to Proxmox Host 2
- Reconfigure network (static IP, VLAN tag if needed)
- Test DNS resolution
- Downtime: ~15-20 minutes
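A quick post-migration check, run from any LAN client (substitute the Pi-hole's actual static IP for `10.1.1.X`):

```bash
# Confirm the migrated Pi-hole answers DNS queries
dig @10.1.1.X example.com +short
# Spot-check a domain you know is on the blocklist (doubleclick.net is just an example);
# default Pi-hole blocking returns 0.0.0.0
dig @10.1.1.X doubleclick.net +short
```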
2.3: Docker Stack (High Risk, Media Services)
- Risk: Radarr, Sonarr, SABnzbd, Overseerr, etc. offline
- Impact: Downloads/media management paused
- Export VM from ESXi (may be large if Docker volumes are on VM disk)
- Import to Proxmox Host 2
- Start VM and verify all containers come up
- Test media stack functionality
- Downtime: ~30-60 minutes
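After the VM boots on Proxmox, a quick way to confirm the stack came back (assuming the containers are managed with Docker, as today):

```bash
# All containers and their status at a glance
docker ps --format 'table {{.Names}}\t{{.Status}}'
# Anything that exited during the move needs attention
docker ps -a --filter "status=exited"
```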
2.4: Plex Server “xeon” (HIGH RISK, iGPU Passthrough Required)
- Risk: Media streaming offline, iGPU passthrough must work
- Complexity: HIGHEST - hardware passthrough critical
- Prerequisites:
- Verify iGPU passthrough working on Proxmox Host 2
- Backup Plex database to external storage
- Document current transcoding settings
Migration Steps:
1. Prepare Plex for migration:
   - Stop Plex service on ESXi VM
   - Backup Plex database: `/var/lib/plexmediaserver/Library/Application Support/Plex Media Server/`
   - Note transcoder settings (Hardware acceleration: Quick Sync)
2. Export VM:
   - Shut down “xeon” VM on ESXi
   - Export VMDK to Proxmox Host 2 via SCP/NFS
3. Import and configure on Proxmox (see the CLI sketch after these steps):
   - Create new VM on Proxmox (match CPU/RAM: 4 vCPU, 8GB RAM)
   - Machine type: q35 (required for PCIe passthrough)
   - Import disk image
   - Add iGPU passthrough:
     - Add PCI device: Intel UHD 630 (00:02.0)
     - Enable “All Functions”, “Primary GPU”, “PCI-Express”
   - Network: Configure bridge with appropriate VLAN
4. Boot and validate:
   - Start VM
   - Check iGPU is visible in guest OS: `lspci | grep VGA`
   - Install/update Intel Graphics drivers in Ubuntu
   - Start Plex
   - Verify hardware transcoding: Play media and check transcoder shows “(hw)”
5. Test transcoding:
   - Play 4K video and verify Quick Sync is being used
   - Check CPU usage (should be low with hw transcoding)
   - Verify `vainfo` shows available encode/decode profiles
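For reference, the passthrough settings from step 3 can also be applied from the CLI; a hedged sketch, assuming VMID 105 for the Plex VM and the UHD 630 at 00:02.0 as listed above:

```bash
# On the Proxmox host: switch the VM to q35 and hand it the iGPU (VMID 105 is an example)
qm set 105 --machine q35 --hostpci0 0000:00:02.0,pcie=1
# Inside the Ubuntu guest afterwards: confirm the GPU and its Quick Sync profiles
lspci | grep -i vga
vainfo   # provided by the vainfo package (plus intel-media-va-driver-non-free)
```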
Downtime: ~1-2 hours. Rollback: Keep ESXi Host 1 bootable until validated.
2.5: Palo Alto Firewall “jarnetfw” (CRITICAL - LAST)
- Risk: CRITICAL - Network outage for all VLANs
- Impact: All inter-VLAN routing down
- Migration Window: Off-hours/planned outage required
Prerequisites:
- ALL other VMs successfully migrated and validated
- Document complete firewall configuration
- Export Palo Alto config backup
- Plan communication to users (if applicable)
Migration Steps:
1. Backup current state:
   - Export Palo Alto configuration via web UI
   - Screenshot all firewall rules/NAT/routing
   - Document interface → VLAN mappings
2. Export VM:
   - Shut down jarnetfw VM (network outage begins)
   - Export VMDK to Proxmox Host 2
3. Import and configure (see the VLAN-tag sketch after these steps):
   - Create VM on Proxmox (4 vCPU, 7GB RAM)
   - Import disk
   - Critical: Map network interfaces correctly
     - Match VLAN tags to ESXi port groups
     - VM network interface 1 → vmbr0.300 (Public)
     - VM network interface 2 → vmbr0.50 (Lab)
     - etc.
4. Boot and validate:
   - Start VM
   - Check Palo Alto web UI is accessible
   - Verify all interfaces are UP
   - Test inter-VLAN routing
   - Test internet connectivity from each VLAN
   - Verify firewall rules are working
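With a VLAN-aware vmbr0, the per-interface mapping from step 3 is expressed as a VLAN tag on each virtual NIC; a hedged sketch, assuming VMID 110 for jarnetfw and virtio NICs:

```bash
# Recreate the firewall's dataplane NICs with the right VLAN tags (VMID 110 is an example)
qm set 110 --net0 virtio,bridge=vmbr0,tag=300   # interface 1 -> Public (VLAN 300)
qm set 110 --net1 virtio,bridge=vmbr0,tag=50    # interface 2 -> Lab (VLAN 50)
# PAN-OS maps interfaces by order/MAC, so double-check the mapping in the PA web UI after boot
```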
Downtime: ~30-60 minutes (network outage). Rollback Plan: If the migration fails, revert to ESXi Host 1 (keep it available for 24-48 hours).
Phase 3: Install Proxmox on Host 1 (ghost-esxi-01)
Why After Host 2:
- All critical VMs now running on Proxmox Host 2
- Host 1 can be wiped with confidence
- Lower pressure/risk
Steps:
1. Final validation:
   - Verify ALL migrated VMs running successfully on Host 2
   - Confirm Plex transcoding working
   - Confirm Palo Alto firewall routing working
   - Confirm Docker services accessible
2. Export remaining VMs (if needed):
   - Win-11, Win7-Victim (lab VMs) - only if you want to keep them
3. Install new 2TB NVMe drive (if not done yet)
4. Install Proxmox VE:
   - Same process as Host 2
   - Hostname: `proxmox-01` (or `pve-01`)
   - IP: `10.1.1.120/24`
   - Gateway: `10.1.1.1`
   - Filesystem: ZFS (RAID0)
5. Configure networking:
   - Match Host 2 configuration (dual 10GbE bond, VLAN-aware bridge)
   - MTU 9000 for jumbo frames
6. Configure iGPU passthrough:
   - Same steps as Host 2
   - Enable IOMMU, load VFIO modules, blacklist i915
7. Test and validate:
   - Create test VM
   - Verify iGPU available for passthrough
   - Verify network connectivity
Duration: 2-4 hours
Phase 4: Create 3-Node Proxmox Cluster
Prerequisites:
- Both NUCs running Proxmox successfully
- All critical VMs operational on Host 2
- Stable network connectivity between all 3 nodes
Steps:
1. Initialize cluster on Node 1:
   ```bash
   # On proxmox-01 (10.1.1.120):
   pvecm create homelab-cluster
   # Verify cluster status
   pvecm status
   ```
2. Join Node 2 to cluster:
   ```bash
   # On proxmox-02 (10.1.1.121):
   pvecm add 10.1.1.120
   # Enter root password for proxmox-01
   # Wait for join to complete
   ```
3. Join Node 3 (staging) to cluster:
   ```bash
   # On pve-staging (10.1.1.123):
   pvecm add 10.1.1.120
   # Enter root password
   # Wait for join to complete
   ```
4. Verify 3-node cluster:
   ```bash
   # On any node:
   pvecm status
   pvecm nodes
   # Should show all 3 nodes online
   # Quorum should be 2/3
   ```
5. Configure node priorities (optional but recommended):
   - Set staging node as lower priority for resource allocation
   - Configure HA groups if desired (see the sketch after this list)
6. Test cluster:
   - View all nodes in web UI
   - Test VM migration between Node 1 and Node 2
   - Verify quorum works if one node goes down
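A minimal sketch of the HA group idea from step 5 (the group name "prod" and VMID 105 are placeholders; higher numbers mean higher node priority):

```bash
# Prefer the two NUCs for HA-managed guests; pve-staging is a last resort
ha-manager groupadd prod --nodes "proxmox-01:2,proxmox-02:2,pve-staging:1"
# Put a VM under HA management inside that group (VMID 105 is an example)
ha-manager add vm:105 --group prod
```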
Notes on 3-Node Cluster:
- Quorum: Need 2/3 nodes online for cluster to function
- HA: Can survive 1 node failure
- No QDevice needed with 3 nodes (odd number provides quorum)
- Staging node: Can have lower specs, only needs to vote for quorum
Duration: 1-2 hours
Phase 5: Migrate Plex Back to Node 1 (Optional but Recommended)
Why:
- Free up resources on Node 2 for Home Assistant migration
- Better workload distribution
- Node 1 has iGPU, ideal for Plex
Steps:
- Stop Plex VM on Node 2
- Migrate VM to Node 1 via Proxmox (offline migration; live migration is not possible while a PCI passthrough device is attached)
- Verify iGPU passthrough still works on Node 1
- Test Plex transcoding
- Start Plex and validate
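A minimal sketch of that move from the CLI, assuming VMID 105 for Plex; because of the hostpci device this has to be an offline migration:

```bash
# On proxmox-02: stop the VM, then move it to proxmox-01
qm shutdown 105
qm migrate 105 proxmox-01
# On proxmox-01: confirm the passthrough mapping survived, then start the VM
ssh proxmox-01 "qm config 105 | grep hostpci && qm start 105"
```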
Duration: 30-60 minutes
Phase 6: Migrate Home Assistant + Frigate to Node 2
Why:
- Move production workload from staging to more powerful NUC
- Free up staging server for witness role + light workloads
- Better long-term architecture
Challenges:
- Coral TPU: Staging has USB controller passthrough (PCI 00:14)
- NUCs may have different PCI topology
- May need to pass through entire USB controller or individual USB port
- Static IP: Home Assistant has static IP (10.1.1.208)
- Uptime: High-priority service (CCTV)
Steps:
1. Prepare Node 2 for USB passthrough (see the Coral passthrough sketch after these steps):
   ```bash
   # On proxmox-02:
   lsusb
   # Identify Coral TPU device
   # Find USB controller PCI ID
   lspci | grep USB
   # Configure USB controller for passthrough (similar to staging)
   # OR use USB device ID passthrough (easier but less stable)
   ```
2. Stop home-sec VM on staging:
   ```bash
   # On pve-staging:
   qm stop 103
   ```
3. Export and migrate VM:
   ```bash
   # Method 1: Backup/Restore via Proxmox
   vzdump 103 --storage local --mode stop
   # Copy backup to Node 2
   # Restore on Node 2

   # Method 2: qm migrate (if the cluster is already created)
   # Offline migration - the VM is stopped and carries a passthrough device, so --online is not an option
   qm migrate 103 proxmox-02
   ```
4. Reconfigure VM on Node 2:
   - Update PCI passthrough to match Node 2’s USB controller PCI ID
   - Verify network configuration (static IP 10.1.1.208)
   - Ensure cloud-init settings preserved
5. Boot and validate:
   - Start VM on Node 2
   - Check Coral TPU is visible: `lsusb` in guest OS
   - Check Home Assistant accessible via web UI
   - Verify Frigate detects Coral TPU:
     - Check Frigate logs
     - Verify “Coral” detector is active
     - Test object detection on camera feeds
6. Monitor for 24-48 hours:
   - Verify camera streams stable
   - Verify object detection working
   - Check for any USB passthrough issues
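A hedged sketch of the Coral hookup from step 1, assuming VMID 103 carries over from staging; the USB IDs shown are the ones the Coral USB Accelerator commonly enumerates as, and Node 2's controller address will differ from staging's 00:14.0:

```bash
# Option 1: pass the Coral through by USB vendor:product ID (simpler, can drop out on re-enumeration)
lsusb | grep -iE "global unichip|google"   # typically 1a6e:089a before init, 18d1:9302 after
qm set 103 --usb0 host=18d1:9302,usb3=1
# Option 2: pass through the whole USB controller (more robust) - find its PCI ID first
lspci | grep -i usb
# qm set 103 --hostpci0 0000:00:14.0   # address is an example; use the NUC's actual controller
```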
Downtime: ~30-60 minutes. Fallback: Can revert to the staging server if issues arise.
Duration: 2-3 hours with testing
Phase 7: Final Rebalancing and Cleanup
Workload Distribution Review:
| Node | VMs | vCPU Total | RAM Total | Notes |
|---|---|---|---|---|
| proxmox-01 | Plex, Docker, Pi-hole, Palo Alto | ~11 vCPU | ~20GB | iGPU for Plex |
| proxmox-02 | Home Assistant/Frigate | ~4 vCPU | ~8GB | USB for Coral TPU, room to grow |
| pve-staging | Docker-host-1, K8s lab, templates | ~4 vCPU active | ~8GB active | Witness + dev/test |
Cleanup Tasks:
- Delete old ESXi VMs from inventory (if safe)
- Remove home-security, server-2019 VMs from ESXi host 2
- Clean up iridium VM if purpose unknown/no longer needed
- Update documentation with final IP addresses
- Update DNS records if any changed
- Configure Proxmox backups (PBS or external)
- Test HA failover (optional)
Risk Assessment & Mitigation
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| iGPU passthrough fails on Proxmox | Medium | High | Test on staging first; rollback to ESXi if needed |
| Coral TPU passthrough fails on NUC | Medium | High | Keep Home Assistant on staging until validated |
| Palo Alto firewall migration fails | Low | Critical | Thorough testing; off-hours migration; keep ESXi Host 1 available for rollback |
| Network misconfiguration breaks VLANs | Medium | High | Document all VLAN mappings; test each VLAN post-migration |
| Data loss during VM export/import | Very Low | Critical | Backup all VMs before migration; verify checksums |
| Cluster split-brain | Low | Medium | 3 nodes provide quorum; monitor cluster health |
| Storage performance degradation | Low | Medium | Benchmark ZFS before/after; tune ZFS ARC if needed |
Rollback Plans
Phase 2 Rollback (VMs on Proxmox Host 2)
- If any VM migration fails: Keep ESXi Host 1 running
- If Plex iGPU fails: Revert Plex to ESXi Host 1
- If Palo Alto fails: Emergency reboot of ESXi Host 1 to restore networking
Phase 6 Rollback (Home Assistant/Frigate)
- If Coral TPU fails on Node 2: Migrate VM back to staging server
- Fallback: Staging server remains operational during migration
Complete Rollback
- Worst case: Both NUCs available, can reinstall ESXi if total failure
- Likelihood: Very low - incremental approach minimizes risk
Timeline Estimate
Conservative Timeline (Recommended):
| Week | Phase | Activities | Time |
|---|---|---|---|
| 1 | Phase 0 | Hardware prep, backups, planning | 4-6 hours |
| 2 | Phase 1 | Install Proxmox on Host 2, configure networking | 4-6 hours |
| 3 | Phase 2.1-2.3 | Migrate iridium, Pi-hole, Docker | 2-3 hours |
| 4 | Phase 2.4 | Migrate Plex (iGPU passthrough) | 2-4 hours |
| 5 | Phase 2.5 | Migrate Palo Alto (critical - careful!) | 2-3 hours |
| 6 | Phase 3 | Install Proxmox on Host 1 | 3-4 hours |
| 7 | Phase 4 | Create cluster, migrate Plex back to Node 1 | 2-3 hours |
| 8 | Phase 6 | Migrate Home Assistant/Frigate to Node 2 | 3-4 hours |
| 9 | Phase 7 | Final rebalancing, testing, documentation | 2-3 hours |
Total: ~25-36 hours over 9 weeks (comfortable pace with validation)
Fast-Track: Could compress to 3-4 weekends (~20-25 hours total) if comfortable with risk
Questions for User - Action Items
Before proceeding, please answer:
Critical Decisions
- NVMe upgrade timing: Order 2TB drives now and install before migration? (RECOMMENDED: Yes)
- Storage backend: ZFS or LVM-Thin for NUC nodes? (RECOMMENDED: ZFS)
- Downtime windows: What times are acceptable for Plex/Palo Alto/Home Assistant outages?
VM Clarifications
- “iridium” VM: What is this for? Can it be offline temporarily?
- “server-2019” VM: Old Blue Iris host? Can we delete it?
- “xsoar” VM: Keep or decommission?
- NFS backup (10.1.1.150): Still in use? Purpose?
Nice-to-Have
- Cluster name: “homelab-cluster” or different name?
- Node naming: proxmox-01/02 or different convention?
- HA preferences: Do you want automatic HA failover or manual?
Next Steps
Once you answer the questions above, I can:
- ✅ Create detailed step-by-step runbooks for each phase
- ✅ Generate network configuration templates
- ✅ Document iGPU passthrough configuration scripts
- ✅ Build VM migration checklists
- ✅ Create validation/testing scripts
- ✅ Design backup strategy for Proxmox cluster
Let’s align on the critical decisions, then we can start Phase 0!