Proxmox and VMware in NFS Environments & Performance Testing

Network File System (NFS) is a distributed file system protocol allowing a user on a client computer to access files over a computer network much like local storage is accessed. Both Proxmox VE and VMware vSphere, leading virtualization platforms, can leverage NFS for flexible and scalable storage solutions. This document outlines key features and use cases for Proxmox and VMware in NFS environments, and details how to approach NFS performance testing.

Proxmox VE with NFS

Proxmox Virtual Environment (VE) is an open-source server virtualization management platform. It integrates KVM hypervisor and LXC containers, software-defined storage, and networking functionality on a single platform.

Key Features of Proxmox VE

  • Open-source: No licensing fees, extensive community support.
  • Integrated KVM and LXC: Supports both full virtualization (virtual machines) and lightweight containerization.
  • Web-based management interface: Provides a centralized control panel for all management tasks.
  • Clustering and High Availability (HA): Allows for the creation of resilient infrastructure by grouping multiple Proxmox VE servers.
  • Live migration: Enables moving running virtual machines between physical hosts in a cluster without downtime.
  • Built-in backup and restore tools: Offers integrated solutions for data protection.
  • Support for various storage types: Including NFS, iSCSI, Ceph, ZFS, LVM, and local directories.

Use Cases for Proxmox VE

  • Small to medium-sized businesses (SMBs) seeking a cost-effective and powerful virtualization solution.
  • Home labs and development/testing environments due to its flexibility and lack of licensing costs.
  • Hosting a variety of workloads such as web servers, databases, application servers, and network services.
  • Implementing private clouds and virtualized infrastructure.

Configuring NFS with Proxmox VE

Proxmox VE can easily integrate with NFS shares for storing VM disk images, ISO files, container templates, and backups.

  1. To add NFS storage in Proxmox VE, navigate to the “Datacenter” section in the web UI, then select “Storage”.
  2. Click the “Add” button and choose “NFS” from the dropdown menu.
  3. In the dialog box, provide the following:
    • ID: A unique name for this storage in Proxmox.
    • Server: The IP address or hostname of your NFS server.
    • Export: The exported directory path from the NFS server (e.g., /exports/data).
    • Content: Select the types of data you want to store on this NFS share (e.g., Disk image, ISO image, Container template, Backups).
  4. Adjust advanced options like NFS version if necessary, then click “Add”.
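The same configuration can also be done from the Proxmox VE shell with the pvesm storage manager. The sketch below is illustrative: the storage ID (nfs-data), server address, and export path are placeholders, and the exact option names should be verified against the pvesm man page for your Proxmox VE release.

# Add an NFS-backed storage called "nfs-data" (placeholder ID, server, and export)
pvesm add nfs nfs-data \
--server 192.168.1.50 \
--export /exports/data \
--content images,iso,vztmpl,backup \
--options vers=4.1

# Confirm the new storage is listed and active
pvesm status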

VMware vSphere with NFS

VMware vSphere is a comprehensive suite of virtualization products, with ESXi as the hypervisor and vCenter Server for centralized management. It is a widely adopted, enterprise-grade virtualization platform known for its robustness and extensive feature set.

Key Features of VMware vSphere

  • Robust and mature hypervisor (ESXi): Provides a stable and high-performance virtualization layer.
  • Advanced features: Includes vMotion (live migration of VMs), Storage vMotion (live migration of VM storage), Distributed Resource Scheduler (DRS) for load balancing, High Availability (HA) for automatic VM restart, and Fault Tolerance (FT) for continuous availability.
  • Comprehensive management with vCenter Server: A centralized platform for managing all aspects of the vSphere environment.
  • Strong ecosystem and third-party integrations: Wide support from hardware vendors and software developers.
  • Wide range of supported guest operating systems and hardware.
  • Advanced networking (vSphere Distributed Switch, NSX) and security features.

Use Cases for VMware vSphere

  • Enterprise data centers and hosting mission-critical applications requiring high availability and performance.
  • Large-scale virtualization deployments managing hundreds or thousands of VMs.
  • Virtual Desktop Infrastructure (VDI) deployments.
  • Implementing robust disaster recovery and business continuity solutions.
  • Building private, public, and hybrid cloud computing environments.

Configuring NFS with VMware vSphere

vSphere supports NFS versions 3 and 4.1 for creating datastores. NFS datastores can be used to store virtual machine files (VMDKs), templates, and ISO images.

  1. Ensure your ESXi hosts have a VMkernel port configured for NFS traffic (typically on the management network or a dedicated storage network).
  2. Using the vSphere Client connected to vCenter Server (or directly to an ESXi host):
    1. Navigate to the host or cluster where you want to add the datastore.
    2. Go to the “Configure” tab, then select “Datastores” under Storage, and click “New Datastore”.
    3. In the New Datastore wizard, select “NFS” as the type of datastore.
    4. Choose the NFS version (NFS 3 or NFS 4.1). NFS 4.1 adds enhancements such as Kerberos authentication and multipathing through session trunking.
    5. Enter a name for the datastore.
    6. Provide the NFS server’s IP address or hostname and the folder/share path (e.g., /vol/datastore1).
    7. Choose whether to mount the NFS share as read-only or read/write (default).
    8. Review the settings and click “Finish”.

NFS Performance Testing

Testing the performance of your NFS storage is crucial to ensure it meets the demands of your virtualized workloads and to identify potential bottlenecks before they impact production.

Why test NFS performance?

  • To validate that the NFS storage solution can deliver the required IOPS (Input/Output Operations Per Second) and throughput for your virtual machines.
  • To identify bottlenecks in the storage infrastructure, network configuration (switches, NICs, cabling), or NFS server settings.
  • To establish a performance baseline before making changes (e.g., software upgrades, hardware changes, network modifications) and to verify improvements after changes.
  • To ensure a satisfactory user experience for applications running on VMs that rely on NFS storage.
  • For capacity planning and to understand storage limitations.

Common tools for NFS performance testing

  • fio (Flexible I/O Tester): A powerful and versatile open-source I/O benchmarking tool that can simulate various workload types (sequential, random, different block sizes, read/write mixes). Highly recommended.
  • iozone: Another popular filesystem benchmark tool that can test various aspects of file system performance.
  • dd: A basic Unix utility that can be used for simple sequential read/write tests, but it’s less comprehensive for detailed performance analysis.
  • VM-level tools: Guest OS specific tools (e.g., CrystalDiskMark on Windows, or `fio` within a Linux VM) can also be used from within a virtual machine accessing the NFS datastore to measure performance from the application’s perspective.
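Before running a full fio test suite, a rough sanity check with dd (mentioned above) can be useful. The mount point and sizes below are placeholders; oflag=direct and iflag=direct attempt to bypass the client page cache, and conv=fsync forces the data to be flushed before dd reports throughput. Note that all-zero data may overstate results on storage that compresses or deduplicates.

# Rough sequential write: 4 GiB in 1 MiB blocks (placeholder path)
dd if=/dev/zero of=/mnt/nfs_share_mountpoint/dd_testfile bs=1M count=4096 oflag=direct conv=fsync

# Rough sequential read of the same file
dd if=/mnt/nfs_share_mountpoint/dd_testfile of=/dev/null bs=1M iflag=direct

# Clean up the test file afterwards
rm /mnt/nfs_share_mountpoint/dd_testfile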

What the test does (explaining a generic NFS performance test)

A typical NFS performance test involves a client (e.g., a Proxmox host, an ESXi host, or a VM running on one of these platforms) generating I/O operations (reads and writes) of various sizes and patterns (sequential, random) to files located on the NFS share. The primary goal is to measure:

  • Throughput: The rate at which data can be transferred, usually measured in MB/s or GB/s. This is important for large file transfers or streaming workloads.
  • IOPS (Input/Output Operations Per Second): The number of read or write operations that can be performed per second. This is critical for transactional workloads like databases or applications with many small I/O requests.
  • Latency: The time taken for an I/O operation to complete, usually measured in milliseconds (ms) or microseconds (µs). Low latency is crucial for responsive applications.

The test simulates different workload profiles (e.g., mimicking a database server, web server, or file server) to understand how the NFS storage performs under conditions relevant to its intended use.

Key metrics to observe

  • Read/Write IOPS for various block sizes (e.g., 4KB, 8KB, 64KB, 1MB).
  • Read/Write throughput (bandwidth) for sequential and random operations.
  • Average, 95th percentile, and maximum latency for I/O operations.
  • CPU utilization on both the NFS client (hypervisor or VM) and the NFS server during the test.
  • Network utilization and potential congestion points (e.g., packet loss, retransmits).
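On a Linux NFS client, several of these metrics can be watched while a test is running. The commands below are a sketch assuming the nfs-common (nfs-utils) and sysstat packages are installed; flag behavior can vary slightly between distributions.

# Per-mount NFS statistics, including mount options and the protocol version in use
nfsstat -m

# Client-side RPC counters; a rising "retrans" value points to network problems
nfsstat -rc

# Per-mount NFS throughput and latency, refreshed every 2 seconds
nfsiostat 2

# Network interface utilization sampled every 2 seconds
sar -n DEV 2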

Steps to run a (generic) NFS performance test

  1. Define Objectives and Scope: Clearly determine what you want to measure (e.g., maximum sequential throughput, random 4K IOPS, latency under specific load). Identify the specific NFS share and client(s) for testing.
  2. Prepare the Test Environment:
    • Ensure the NFS share is correctly mounted on the test client(s).
    • Minimize other activities on the NFS server, client, and network during the test to get clean results.
    • Verify network connectivity and configuration (e.g., jumbo frames if used, correct VLANs).
  3. Choose and Install a Benchmarking Tool: For example, install `fio` on the Linux-based hypervisor (Proxmox VE) or a Linux VM.
  4. Configure Test Parameters in the Tool:
    • Test file size: Should be significantly larger than the NFS server’s cache and the client’s RAM to avoid misleading results due to caching (e.g., 2-3 times the RAM of the NFS server).
    • Block size (bs): Vary this to match expected workloads (e.g., bs=4k for database-like random I/O, bs=1M for sequential streaming).
    • Read/Write mix (rw): Examples: read (100% read), write (100% write), randread, randwrite, rw (50/50 read/write), randrw (50/50 random read/write), or specific mixes like rwmixread=70 (70% read, 30% write).
    • Workload type: Sequential (rw=read or rw=write) or random (rw=randread or rw=randwrite).
    • Number of threads/jobs (numjobs): To simulate concurrent access from multiple applications or VMs.
    • I/O depth (iodepth): Number of outstanding I/O operations, simulating queue depth.
    • Duration of the test (runtime): Run long enough to reach a steady state (e.g., 5-15 minutes per test case).
    • Target directory: Point to a directory on the mounted NFS share.
  5. Execute the Test: Run the benchmark tool from the client machine, targeting a file or directory on the NFS share.

Example fio command (conceptual, for a random read/write test):

fio --name=nfs_randrw_test \
--directory=/mnt/nfs_share_mountpoint \
--ioengine=libaio \
--direct=1 \
--rw=randrw \
--rwmixread=70 \
--bs=4k \
--size=20G \
--numjobs=8 \
--iodepth=32 \
--runtime=300 \
--group_reporting \
--output=nfs_test_results.txt

(Note: /mnt/nfs_share_mountpoint should be replaced with the actual mount point of your NFS share. Parameters such as size, numjobs, and iodepth should be adjusted to your specific needs, available resources, and the NFS server's capabilities. direct=1 attempts to bypass client-side caching.)

  6. Collect and Analyze Results: Gather the output from the tool (IOPS, throughput, latency figures). Also monitor CPU, memory, and network utilization on both the client and the NFS server during the test, using tools like top, htop, vmstat, iostat, nfsstat, sar, or platform-specific monitoring tools (Proxmox VE dashboard, esxtop).
  7. Document and Iterate: Record the test configuration and results. If performance is not as expected, investigate potential bottlenecks (NFS server tuning, network, client settings), make adjustments, and re-test to measure the impact of changes. Repeat with different test parameters to cover various workload profiles.
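If the test client is a plain Linux host or VM rather than the hypervisor itself, the export first has to be mounted, and a complementary large-block sequential profile can then be run against the same mount point. Both the mount options and the fio parameters below are illustrative placeholders and should be tuned to your environment.

# Mount the export on a Linux test client (placeholder server, export, and mount point)
sudo mkdir -p /mnt/nfs_share_mountpoint
sudo mount -t nfs -o vers=4.1 nfs-server.example.com:/exports/data /mnt/nfs_share_mountpoint

# Large-block sequential read test to estimate maximum streaming throughput
fio --name=nfs_seq_read_test \
--directory=/mnt/nfs_share_mountpoint \
--ioengine=libaio \
--direct=1 \
--rw=read \
--bs=1M \
--size=20G \
--numjobs=4 \
--iodepth=16 \
--runtime=300 \
--group_reporting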

Conclusion

Both Proxmox VE and VMware vSphere offer robust support for NFS, providing flexible and scalable storage solutions for virtual environments. Understanding their respective key features, use cases, and configuration methods helps in architecting efficient virtualized infrastructures. Regardless of the chosen virtualization platform, performing diligent and methodical NFS performance testing is essential. It allows you to validate your storage design, ensure optimal operation, proactively identify and resolve bottlenecks, and ultimately guarantee that your storage infrastructure can effectively support the demands of your virtualized workloads and applications.


NVIDIA GPU Test Script and Setup Guide for Ubuntu VM

This document provides a comprehensive guide to setting up and testing an NVIDIA GPU within an Ubuntu Virtual Machine (VM). Proper configuration is crucial for leveraging GPU acceleration in tasks such as machine learning, data processing, and scientific computing. Following these steps will help you confirm GPU accessibility, install necessary drivers and software, and verify the setup using a TensorFlow test script.

Prerequisites

Before you begin, ensure you have the following:

  • An Ubuntu Virtual Machine with GPU passthrough correctly configured from your hypervisor (e.g., Proxmox, ESXi, KVM). The GPU should be visible to the guest OS.
  • Sudo (administrator) privileges within the Ubuntu VM to install packages and drivers.
  • A stable internet connection to download drivers, CUDA toolkit, and Python packages.
  • Basic familiarity with the Linux command line interface.

Step 1: Confirm GPU is Assigned to VM

The first step is to verify that the Ubuntu VM can detect the NVIDIA GPU assigned to it. This ensures that the PCI passthrough is functioning correctly at the hypervisor level.

Open a terminal in your Ubuntu VM and run the following command to list PCI devices, filtering for NVIDIA hardware:

lspci | grep -i nvidia

You should see an output line describing your NVIDIA GPU. For example, it might display something like “VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3080]” or similar, depending on your specific GPU model. If this command doesn’t show your GPU, you need to revisit your VM’s passthrough settings in the hypervisor.

Next, attempt to use the NVIDIA System Management Interface (nvidia-smi) command. This tool provides monitoring and management capabilities for NVIDIA GPUs. If the NVIDIA drivers are already installed and functioning, it will display detailed information about your GPU, including its name, temperature, memory usage, and driver version.

nvidia-smi

If nvidia-smi runs successfully and shows your GPU statistics, it’s a good sign. You might be able to skip to Step 3 or 4 if your drivers are already compatible with your intended workload (e.g., TensorFlow). However, if it outputs an error such as “NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver,” it indicates that the necessary NVIDIA drivers are not installed correctly or are missing. In this case, proceed to Step 2 to install or reinstall them.
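Beyond the default summary table, nvidia-smi can report selected fields in a script-friendly format, which is handy for logging from inside a VM. The query fields below are examples; the full list is available via nvidia-smi --help-query-gpu.

# Print selected GPU properties as CSV
nvidia-smi --query-gpu=name,driver_version,memory.total,temperature.gpu --format=csv

# Continuously refresh the standard summary every 2 seconds
nvidia-smi -l 2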

Step 2: Install NVIDIA GPU Driver + CUDA Toolkit

For GPU-accelerated applications like TensorFlow, you need the appropriate NVIDIA drivers and the CUDA Toolkit. The CUDA Toolkit enables developers to use NVIDIA GPUs for general-purpose processing.

First, update your package list and install essential packages for building kernel modules:

sudo apt update
sudo apt install build-essential dkms -y

build-essential installs compilers and other utilities needed for compiling software. dkms (Dynamic Kernel Module Support) helps in rebuilding kernel modules, such as the NVIDIA driver, when the kernel is updated.

Next, download the NVIDIA driver. The version specified here (535.154.05) is an example. You should visit the NVIDIA driver download page to find the latest recommended driver for your specific GPU model and Linux x86_64 architecture. For server environments or specific CUDA versions, you might need a particular driver branch.

wget https://us.download.nvidia.com/XFree86/Linux-x86_64/535.154.05/NVIDIA-Linux-x86_64-535.154.05.run

Once downloaded, make the installer file executable:

chmod +x NVIDIA-Linux-*.run

Now, run the installer. It’s often recommended to do this from a text console (TTY) without an active X server, but for many modern systems and VMs, it can work from within a desktop session. If you encounter issues, try switching to a TTY (e.g., Ctrl+Alt+F3), logging in, and stopping your display manager (e.g., sudo systemctl stop gdm or lightdm) before running the installer.
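If you do need to leave the graphical session first, a sequence along the following lines is typical; the display manager service name varies (gdm3, gdm, or lightdm, depending on your desktop), so check which one is active on your system.

# Check which display manager (if any) is running, then stop it
systemctl status display-manager
sudo systemctl stop gdm3      # or: sudo systemctl stop lightdm

# Alternatively, switch the whole system to a non-graphical target
sudo systemctl isolate multi-user.target

With the display manager stopped, run the installer as shown below.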

sudo ./NVIDIA-Linux-*.run

Follow the on-screen prompts during the installation. You’ll typically need to:

  • Accept the license agreement.
  • Choose whether to register the kernel module sources with DKMS (recommended, select “Yes”).
  • Install 32-bit compatibility libraries (optional, usually not needed for TensorFlow server workloads but can be installed if unsure).
  • Allow the installer to update your X configuration file (usually “Yes”, though less critical for server/headless VMs).

After the driver installation is complete, you must reboot the VM for the new driver to load correctly:

sudo reboot

After rebooting, re-run nvidia-smi. It should now display your GPU information without errors.

Step 3: Install Python + Virtual Environment

Python is the primary language for TensorFlow. It’s highly recommended to use Python virtual environments to manage project dependencies and avoid conflicts between different projects or system-wide Python packages.

Install Python 3, pip (Python package installer), and the venv module for creating virtual environments:

sudo apt install python3-pip python3-venv -y

Create a new virtual environment. We’ll name it tf-gpu-env, but you can choose any name:

python3 -m venv tf-gpu-env

This command creates a directory named tf-gpu-env in your current location, containing a fresh Python installation and tools.

Activate the virtual environment:

source tf-gpu-env/bin/activate

Your command prompt should change to indicate that the virtual environment is active (e.g., it might be prefixed with (tf-gpu-env)). All Python packages installed hereafter will be local to this environment.
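To confirm the environment is active before installing packages, check which Python and pip binaries are being used; both paths should point inside the tf-gpu-env directory.

# Both commands should print paths ending in .../tf-gpu-env/bin/...
which python
which pip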

Step 4: Install TensorFlow with GPU Support

With the virtual environment activated, you can now install TensorFlow. Note that the driver installed in Step 2 provides the CUDA driver interface but not the full CUDA Toolkit; TensorFlow also requires compatible CUDA and cuDNN libraries, which recent TensorFlow releases can pull in through pip (for example, via the tensorflow[and-cuda] extra). Check TensorFlow's official documentation for the exact driver, CUDA, and cuDNN versions required by the release you intend to install.

First, upgrade pip within the virtual environment to ensure you have the latest version:

pip install --upgrade pip

Now, install TensorFlow. The pip package for tensorflow typically includes GPU support by default and will utilize it if a compatible NVIDIA driver and CUDA environment are detected.

pip install tensorflow

This command will download and install TensorFlow and its dependencies. The size can be substantial, so it might take some time.

To verify that TensorFlow can recognize and use your GPU, run the following Python one-liner:

python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

If TensorFlow is correctly configured to use the GPU, the output should look similar to this:

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

This output confirms that TensorFlow has identified at least one GPU (GPU:0) that it can use for computations. If you see an empty list ([]), TensorFlow cannot detect your GPU. This could be due to driver issues, CUDA compatibility problems, or an incorrect TensorFlow installation. Double-check your driver installation (nvidia-smi), CUDA version, and ensure you are in the correct virtual environment where TensorFlow was installed.
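If the device list comes back empty, it can also help to see which CUDA and cuDNN versions your TensorFlow build expects and compare them against what nvidia-smi reports. In recent TensorFlow 2.x releases this information is exposed via tf.sysconfig.get_build_info() (the exact keys may vary between versions):

python -c "import tensorflow as tf; print(tf.sysconfig.get_build_info())"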

Step 5: Run Test TensorFlow GPU Script

To perform a more concrete test, you can run a simple TensorFlow script that performs a basic computation on the GPU.

Create a new Python file, for example, test_tf_gpu.py, using a text editor like nano or vim, and paste the following code into it:

# Save this as test_tf_gpu.py
import tensorflow as tf

# Check for available GPUs and print TensorFlow version
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
print("TensorFlow version:", tf.__version__)

# Explicitly place the computation on the first GPU
# If you have multiple GPUs, you can select them by index (e.g., /GPU:0, /GPU:1)
if tf.config.list_physical_devices('GPU'):
    print("Running a sample computation on the GPU.")
    try:
        with tf.device('/GPU:0'):
            a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
            b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
            c = tf.matmul(a, b)
        print("Matrix multiplication result on GPU:", c)
    except RuntimeError as e:
        print(e)
else:
    print("No GPU available, cannot run GPU-specific test.")

# Example of a simple operation that will run on GPU if available, or CPU otherwise
print("\nRunning another simple operation:")
x = tf.random.uniform([3, 3])
print("Device for x:", x.device)
if "GPU" in x.device:
    print("The operation ran on the GPU.")
else:
    print("The operation ran on the CPU.")

This script first prints the number of available GPUs and the TensorFlow version. Then, it attempts to perform a matrix multiplication specifically on /GPU:0. The tf.device('/GPU:0') context manager ensures that the operations defined within its block are assigned to the specified GPU.

Save the file and run it from your terminal (ensure your virtual environment tf-gpu-env is still active):

python test_tf_gpu.py

If everything is set up correctly, you should see output indicating:

  • The number of GPUs available (e.g., “Num GPUs Available: 1”).
  • Your TensorFlow version.
  • The result of the matrix multiplication, confirming the computation was executed.
  • Confirmation that subsequent operations are also running on the GPU.

An example output might look like:

Num GPUs Available:  1
TensorFlow version: 2.x.x
Running a sample computation on the GPU.
Matrix multiplication result on GPU: tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)

Running another simple operation:
Device for x: /job:localhost/replica:0/task:0/device:GPU:0
The operation ran on the GPU.

This successful execution confirms that your NVIDIA GPU is properly configured and usable by TensorFlow within your Ubuntu VM.
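While the script runs, you can also watch the GPU from a second terminal; a brief jump in utilization and a python process appearing in the nvidia-smi process list are further confirmation that the computation actually ran on the GPU.

# Refresh GPU utilization, memory usage, and the process list every second
watch -n 1 nvidia-smi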

Step 6: Optional Cleanup

Once you are done working in your TensorFlow GPU environment, you can deactivate it:

deactivate

This will return you to your system’s default Python environment, and your command prompt will revert to its normal state. The virtual environment (tf-gpu-env directory and its contents) remains on your system, and you can reactivate it anytime by running source tf-gpu-env/bin/activate from the directory containing tf-gpu-env.
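If you no longer need the environment at all, you can simply delete its directory; the pip installs did not modify anything outside of it.

# Permanently remove the virtual environment and everything installed into it
rm -rf tf-gpu-env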

Conclusion

Successfully completing these steps means you have configured your Ubuntu VM to utilize an NVIDIA GPU for accelerated computing with TensorFlow. This setup is foundational for machine learning development, model training, and other GPU-intensive tasks. If you encounter issues, re-check each step, ensuring driver compatibility, correct CUDA versions for your TensorFlow installation, and proper VM passthrough configuration. Refer to NVIDIA and TensorFlow documentation for more advanced configurations or troubleshooting specific error messages.