NVIDIA GPU Test Script and Setup Guide for Ubuntu VM

This document provides a comprehensive guide to setting up and testing an NVIDIA GPU within an Ubuntu Virtual Machine (VM). Proper configuration is crucial for leveraging GPU acceleration in tasks such as machine learning, data processing, and scientific computing. Following these steps will help you confirm GPU accessibility, install necessary drivers and software, and verify the setup using a TensorFlow test script.

Prerequisites

Before you begin, ensure you have the following:

  • An Ubuntu Virtual Machine with GPU passthrough correctly configured from your hypervisor (e.g., Proxmox, ESXi, KVM). The GPU should be visible to the guest OS.
  • Sudo (administrator) privileges within the Ubuntu VM to install packages and drivers.
  • A stable internet connection to download drivers, CUDA toolkit, and Python packages.
  • Basic familiarity with the Linux command line interface.

Step 1: Confirm GPU is Assigned to VM

The first step is to verify that the Ubuntu VM can detect the NVIDIA GPU assigned to it. This ensures that the PCI passthrough is functioning correctly at the hypervisor level.

Open a terminal in your Ubuntu VM and run the following command to list PCI devices, filtering for NVIDIA hardware:

lspci | grep -i nvidia

You should see an output line describing your NVIDIA GPU. For example, it might display something like “VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3080]” or similar, depending on your specific GPU model. If this command doesn’t show your GPU, you need to revisit your VM’s passthrough settings in the hypervisor.

Next, attempt to use the NVIDIA System Management Interface (nvidia-smi) command. This tool provides monitoring and management capabilities for NVIDIA GPUs. If the NVIDIA drivers are already installed and functioning, it will display detailed information about your GPU, including its name, temperature, memory usage, and driver version.

nvidia-smi

If nvidia-smi runs successfully and shows your GPU statistics, it’s a good sign. You might be able to skip to Step 3 or 4 if your drivers are already compatible with your intended workload (e.g., TensorFlow). However, if it outputs an error such as “NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver,” it indicates that the necessary NVIDIA drivers are not installed correctly or are missing. In this case, proceed to Step 2 to install or reinstall them.
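For scripted setups (for example, provisioning several VMs), the two probes above can be combined into one guarded snippet. This is a sketch, not part of the official tooling: it assumes only that nvidia-smi, when working, supports the standard --query-gpu flags, and the step numbers in the messages refer to this guide.

```shell
# Guarded driver probe: prints the GPU name and driver version when
# nvidia-smi works, otherwise a hint to continue with the install step.
check_gpu_driver() {
    if command -v nvidia-smi >/dev/null 2>&1 &&
        nvidia-smi --query-gpu=name,driver_version --format=csv,noheader 2>/dev/null
    then
        echo "Driver OK: you may skip ahead to Step 3."
    else
        echo "Driver missing or not loaded: continue with Step 2."
    fi
}
check_gpu_driver
```

The function exits cleanly either way, so it is safe to drop into a larger provisioning script.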

Step 2: Install NVIDIA GPU Driver + CUDA Toolkit

For GPU-accelerated applications like TensorFlow, you need the appropriate NVIDIA driver. The CUDA Toolkit, which enables developers to use NVIDIA GPUs for general-purpose processing, is a separate component: it can be installed system-wide, or, for TensorFlow, the required CUDA libraries can be supplied as pip packages (see Step 4). This step installs the driver itself.

First, update your package list and install essential packages for building kernel modules:

sudo apt update
sudo apt install build-essential dkms linux-headers-$(uname -r) -y

build-essential installs the compilers and utilities needed for compiling software such as the driver's kernel module. linux-headers-$(uname -r) provides the header files for your currently running kernel, which the installer needs in order to build that module. dkms (Dynamic Kernel Module Support) automatically rebuilds kernel modules, such as the NVIDIA driver, whenever the kernel is updated.

Next, download the NVIDIA driver. The version specified here (535.154.05) is an example. You should visit the NVIDIA driver download page to find the latest recommended driver for your specific GPU model and Linux x86_64 architecture. For server environments or specific CUDA versions, you might need a particular driver branch.

wget https://us.download.nvidia.com/XFree86/Linux-x86_64/535.154.05/NVIDIA-Linux-x86_64-535.154.05.run

Once downloaded, make the installer file executable:

chmod +x NVIDIA-Linux-*.run

Now, run the installer. It is best run from a text console (TTY) without an active X server, though on many modern systems and VMs it also works from within a desktop session. If you encounter issues, switch to a TTY (e.g., Ctrl+Alt+F3), log in, and stop your display manager (e.g., sudo systemctl stop gdm3 on stock Ubuntu, or sudo systemctl stop lightdm) before running the installer.

sudo ./NVIDIA-Linux-*.run

Follow the on-screen prompts during the installation. You’ll typically need to:

  • Accept the license agreement.
  • Choose whether to register the kernel module sources with DKMS (recommended, select “Yes”).
  • Install 32-bit compatibility libraries (optional, usually not needed for TensorFlow server workloads but can be installed if unsure).
  • Allow the installer to update your X configuration file (usually “Yes”, though less critical for server/headless VMs).

After the driver installation is complete, you must reboot the VM for the new driver to load correctly:

sudo reboot

After rebooting, re-run nvidia-smi. It should now display your GPU information without errors.
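A post-reboot sanity check can also be scripted. The sketch below is guarded so it runs (with negative results) even on a machine without the driver; it checks that the nvidia kernel module is loaded and that DKMS has the module registered for future kernel updates.

```shell
# Verify the nvidia kernel module is loaded and registered with dkms.
post_reboot_check() {
    if command -v lsmod >/dev/null 2>&1 && lsmod | grep -q '^nvidia'; then
        echo "nvidia kernel module: loaded"
    else
        echo "nvidia kernel module: NOT loaded"
    fi
    if command -v dkms >/dev/null 2>&1; then
        dkms status 2>/dev/null | grep -i nvidia ||
            echo "dkms: no nvidia module registered"
    else
        echo "dkms: not installed"
    fi
}
post_reboot_check
```

If the module shows as loaded but nvidia-smi still fails, the user-space driver libraries and the kernel module version likely disagree; reinstalling the driver usually resolves that.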

Step 3: Install Python + Virtual Environment

Python is the primary language for TensorFlow. It’s highly recommended to use Python virtual environments to manage project dependencies and avoid conflicts between different projects or system-wide Python packages.

Install Python 3, pip (Python package installer), and the venv module for creating virtual environments:

sudo apt install python3-pip python3-venv -y

Create a new virtual environment. We’ll name it tf-gpu-env, but you can choose any name:

python3 -m venv tf-gpu-env

This command creates a directory named tf-gpu-env in your current location, containing a fresh Python installation and tools.

Activate the virtual environment:

source tf-gpu-env/bin/activate

Your command prompt should change to indicate that the virtual environment is active (e.g., it might be prefixed with (tf-gpu-env)). All Python packages installed hereafter will be local to this environment.
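The whole lifecycle can be sketched end to end with a throwaway environment (here /tmp/demo-env, an arbitrary path used only for illustration). Calling the environment's own interpreter directly shows that it is isolated from the system Python; --without-pip just keeps the demo fast.

```shell
# Create a disposable venv, confirm its interpreter reports a prefix
# inside the venv directory (proving isolation), then remove it.
python3 -m venv --without-pip /tmp/demo-env
prefix=$(/tmp/demo-env/bin/python -c "import sys; print(sys.prefix)")
echo "venv prefix: $prefix"
rm -rf /tmp/demo-env
```

For real work, use the activate/deactivate workflow shown above so that plain python and pip resolve to the environment's copies.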

Step 4: Install TensorFlow with GPU Support

With the virtual environment activated, you can now install TensorFlow. Ensure your NVIDIA drivers and CUDA toolkit (often bundled with or compatible with the drivers you installed) meet the version requirements for the TensorFlow version you intend to install. You can check TensorFlow’s official documentation for these prerequisites.

First, upgrade pip within the virtual environment to ensure you have the latest version:

pip install --upgrade pip

Now, install TensorFlow. On Linux, the tensorflow pip package includes GPU support and will use it if a compatible NVIDIA driver and CUDA environment are detected. For recent TensorFlow releases, the official install guide recommends pip install tensorflow[and-cuda], which also pulls in matching CUDA and cuDNN libraries through pip, so you do not need a system-wide CUDA Toolkit; check the TensorFlow documentation for the command appropriate to your version.

pip install tensorflow

This command will download and install TensorFlow and its dependencies. The size can be substantial, so it might take some time.
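A guarded check, safe to run even in an environment where TensorFlow is absent, can confirm which version was installed and whether the wheel was built with CUDA support; tf.sysconfig.get_build_info() exposes the build flags.

```shell
# Print TensorFlow's version and CUDA build flag, or a note if it is not
# importable in the current (virtual) environment.
msg=$(python3 - <<'EOF'
try:
    import tensorflow as tf
    info = tf.sysconfig.get_build_info()
    print("TensorFlow", tf.__version__, "- CUDA build:", info.get("is_cuda_build"))
except ImportError:
    print("TensorFlow is not installed in this environment")
EOF
)
echo "$msg"
```

A CUDA build reported as False means no driver fix will help; the wheel itself has no GPU support and needs to be reinstalled.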

To verify that TensorFlow can recognize and use your GPU, run the following Python one-liner:

python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

If TensorFlow is correctly configured to use the GPU, the output should look similar to this:

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

This output confirms that TensorFlow has identified at least one GPU (GPU:0) that it can use for computations. If you see an empty list ([]), TensorFlow cannot detect your GPU. This could be due to driver issues, CUDA compatibility problems, or an incorrect TensorFlow installation. Double-check your driver installation (nvidia-smi), CUDA version, and ensure you are in the correct virtual environment where TensorFlow was installed.
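When the GPU list comes back empty, one useful diagnostic is to list all physical devices with no filter: seeing the CPU but no GPU narrows the problem to the driver/CUDA side rather than the TensorFlow installation itself. The snippet is guarded so it runs anywhere.

```shell
# List every physical device TensorFlow can see; a healthy install shows
# at least the CPU entry even when no GPU is detected.
devices=$(python3 - <<'EOF'
try:
    import tensorflow as tf
    print(tf.config.list_physical_devices())
except ImportError:
    print("TensorFlow not installed")
EOF
)
echo "$devices"
```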

Step 5: Run Test TensorFlow GPU Script

To perform a more concrete test, you can run a simple TensorFlow script that performs a basic computation on the GPU.

Create a new Python file, for example, test_tf_gpu.py, using a text editor like nano or vim, and paste the following code into it:

# Save this as test_tf_gpu.py
import tensorflow as tf

# Check for available GPUs and print TensorFlow version
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
print("TensorFlow version:", tf.__version__)

# Explicitly place the computation on the first GPU
# If you have multiple GPUs, you can select them by index (e.g., /GPU:0, /GPU:1)
if tf.config.list_physical_devices('GPU'):
    print("Running a sample computation on the GPU.")
    try:
        with tf.device('/GPU:0'):
            a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
            b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
            c = tf.matmul(a, b)
        print("Matrix multiplication result on GPU:", c)
    except RuntimeError as e:
        print(e)
else:
    print("No GPU available, cannot run GPU-specific test.")

# Example of a simple operation that will run on GPU if available, or CPU otherwise
print("\nRunning another simple operation:")
x = tf.random.uniform([3, 3])
print("Device for x:", x.device)
if "GPU" in x.device:
    print("The operation ran on the GPU.")
else:
    print("The operation ran on the CPU.")

This script first prints the number of available GPUs and the TensorFlow version. Then, it attempts to perform a matrix multiplication specifically on /GPU:0. The tf.device('/GPU:0') context manager ensures that the operations defined within its block are assigned to the specified GPU.

Save the file and run it from your terminal (ensure your virtual environment tf-gpu-env is still active):

python test_tf_gpu.py

If everything is set up correctly, you should see output indicating:

  • The number of GPUs available (e.g., “Num GPUs Available: 1”).
  • Your TensorFlow version.
  • The result of the matrix multiplication, confirming the computation was executed.
  • Confirmation that subsequent operations are also running on the GPU.

An example output might look like:

Num GPUs Available:  1
TensorFlow version: 2.x.x
Running a sample computation on the GPU.
Matrix multiplication result on GPU: tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)

Running another simple operation:
Device for x: /job:localhost/replica:0/task:0/device:GPU:0
The operation ran on the GPU.

This successful execution confirms that your NVIDIA GPU is properly configured and usable by TensorFlow within your Ubuntu VM.
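As an optional follow-up (a common pattern, not part of the test script above): by default TensorFlow reserves nearly all GPU memory at process start. Enabling "memory growth" makes it allocate incrementally instead, which is friendlier when several processes share the GPU. The sketch is guarded so it is harmless without TensorFlow or a GPU, and the call must run before the GPU is first used in the process.

```shell
# Enable on-demand GPU memory allocation for every detected GPU.
msg=$(python3 - <<'EOF'
try:
    import tensorflow as tf
    gpus = tf.config.list_physical_devices('GPU')
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
    print("Memory growth enabled for", len(gpus), "GPU(s)")
except ImportError:
    print("TensorFlow not installed")
EOF
)
echo "$msg"
```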

Step 6: Optional Cleanup

Once you are done working in your TensorFlow GPU environment, you can deactivate it:

deactivate

This will return you to your system’s default Python environment, and your command prompt will revert to its normal state. The virtual environment (tf-gpu-env directory and its contents) remains on your system, and you can reactivate it anytime by running source tf-gpu-env/bin/activate from the directory containing tf-gpu-env.

Conclusion

Successfully completing these steps means you have configured your Ubuntu VM to utilize an NVIDIA GPU for accelerated computing with TensorFlow. This setup is foundational for machine learning development, model training, and other GPU-intensive tasks. If you encounter issues, re-check each step, ensuring driver compatibility, correct CUDA versions for your TensorFlow installation, and proper VM passthrough configuration. Refer to NVIDIA and TensorFlow documentation for more advanced configurations or troubleshooting specific error messages.