System Setup Guide

This document is designed for customers participating in the Software Development Platform for Intel® Data Center GPU Max 1100 Series program who receive the following system configuration:

  • D50DNP server with two Intel® Xeon 8480+ CPUs (Sapphire Rapids 350W TDP)

  • Two Intel® Data Center GPU Max 1100 PCIe cards (300W TDP each) with an Xe Link x2 bridge card.

The intent is to provide an end-to-end view of system setup and test content from the perspective of this specific configuration. This includes instructions for:

  • BIOS and operating system installation

  • Driver and tool installation

  • Readiness validation with example workloads

For simplicity, this guide focuses on Ubuntu. Intel GPU drivers support three baseline operating systems: Ubuntu, RHEL, and SLES; the steps for the other operating systems are similar.

For more information about the host system, see the Intel Server D50DNP Family Technical Product Specification. Additional information about the GPU is available in the Intel® Data Center GPU Max Series documentation.

Components

System firmware and BIOS are pre-installed. The following table lists all preinstalled firmware components and their versions.

Firmware component    Version                              Details
IFWI                  PVC2_1.23335                         preinstalled
AMC Firmware          PVC_AMC_V_6.7.0.0                    preinstalled
System Firmware       SE5C741.86B.01.01.0004.2303280404    preinstalled

Software components are expected to be installed by the end user. Systems were tested with the following components:

Software component            Version                      Details
OS                            Ubuntu* 22.04 LTS (Jammy)    5.15 kernel
GPU Driver                    2328 Production Release      General-Purpose GPU documentation
Intel® oneAPI Base toolkit    2023.2.0-49384               Base toolkit documentation
Intel® oneAPI HPC toolkit     2023.2.0-49438               HPC toolkit documentation
Intel® oneAPI AI toolkit      2023.2.0.48997               AI tools documentation
Intel® XPU Manager            xpu-smi 1.2.21               XPU Manager releases
Workload: DGEMM               -                            oneMKL repository
Workload: BabelSTREAM         -                            BabelSTREAM repository
Workload: BERT Large          -                            BERT Large documentation

Driver kernel build versions are frequently updated to enhance security and fix bugs. DKMS patches rely on matching the kernel branch, not the minor build number. For example, with the kernel package 5.15.0-76-generic, only the 5.15 branch is required; the specific 0-76 build number is not a concern. Intel releases are regularly validated with the latest OSV builds, ensuring compatibility with any Ubuntu 5.15 build.
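
As a quick check, the kernel branch can be read directly from the running kernel version; the build number shown in the comments is only illustrative:

uname -r                  # full kernel package version, for example 5.15.0-76-generic
uname -r | cut -d. -f1-2  # kernel branch that DKMS patches match against, for example 5.15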

Setting up BIOS

Follow these steps to configure the required BIOS settings for full performance of ML and AI workloads.

  1. Enter BIOS [F2] and load default values [F9] to align with the validated setup.

BIOS main screen

Note

All the settings covered in this setup are defaults. No changes are necessary if the defaults are already applied. The following steps will verify that the expected settings are in use.

  2. Open the Advanced options and verify the Processor Configuration.

Enable Intel® Hyper-Threading Tech (Intel® Hyper-Threading Technology). This feature improves overall instructions per cycle (IPC) throughput.

BIOS Processor Configuration Screen
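
Once the operating system is installed, you can confirm from Linux that Hyper-Threading is active (an illustrative check; the exact counts depend on the CPU configuration):

lscpu | grep -E "^Thread|^Core|^Socket"

"Thread(s) per core: 2" indicates that Hyper-Threading is enabled.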

  3. Open the Advanced options and verify the Power & Performance settings. Choose the Balanced Performance option. This setting weights optimization toward performance while conserving energy.

BIOS Power/Performance Screen

  4. Open the Advanced options and verify the PCI configuration settings. Set MMIO High Base to 56T for MMIO optimization. Set Memory Mapped I/O size to 1024G.

BIOS PCI Configuration Screen

Installing Ubuntu 22.04 and the GPU driver

We recommend Ubuntu 22.04 Server (Jammy). Installation on RHEL* and SLES* should follow similar steps, but the steps below have been verified with Ubuntu on the Intel® Server Board D50DNP and Intel® Data Center GPU Max 1100 Series.

  1. Download Ubuntu 22.04 LTS from the Ubuntu website.

  2. Start the Ubuntu 22.04 LTS x86_64 installation. Press F6 to select the boot device (for example, USB).

    OS Install Grub Options

  3. Select the following settings:

    Note

    Internet access is required for the following steps. Add a proxy server address if needed.

    • Language:

      OS Install Language Options

    • Ubuntu Server as the base for the installation:

      OS Install Ubuntu Server

    • Use an entire disk as the storage configuration. At least 650 GB is required to execute all the validation workloads.

      OS Install Storage Config

    • Accept the default options and create a user. To match the steps in this document, set up ‘user1’.

    • Select Install OpenSSH server, which is disabled by default, to enable remote SSH login and SCP to the server.

      OS Install OpenSSH

    Wait for installation to finish, remove installation media, and then log in.

  4. Check whether 5.15.0-xx-generic kernel is loaded.

    uname -r 
    

    Example output:

    5.15.0-84-generic
    


  5. Follow the driver installation steps to install the latest production driver, including compute and media runtimes and development packages.
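
    After the driver packages are installed, a quick sanity check is to confirm that the DKMS module built against the running kernel (a hedged sketch; the exact DKMS package name varies by driver release):

    dkms status
    lsmod | grep -i i915

    dkms status should report the Intel i915 DKMS module as installed for the 5.15 kernel, and lsmod confirms the module is loaded (it may only load after the reboot in the following steps).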

  6. Update the boot loader options by adding pci=realloc=off and disabling the hangcheck timer (i915.enable_hangcheck=0) in GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub.

    sudo vi /etc/default/grub 
    GRUB_CMDLINE_LINUX_DEFAULT="… i915.enable_hangcheck=0 pci=realloc=off"
    sudo update-grub
    
  7. Reboot the system.

    sudo reboot
    

    If Secure Boot is enabled in the BIOS, you might see a prompt during the reboot. Ensure you select Enroll MOK to allow the new kernel to take effect.
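
    After the system comes back up, you can confirm that the new kernel parameters took effect (an illustrative check):

    cat /proc/cmdline

    The output should include i915.enable_hangcheck=0 and pci=realloc=off.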

  8. List the group assigned ownership of the render nodes and the groups you are a member of:

    stat -c "%G" /dev/dri/render* 
    groups ${USER}
    

    If a group is listed for the render node but not for the user, add the user to the group using gpasswd. The following command adds the active user to the render group and spawns a new shell with that group active:

    sudo gpasswd -a ${USER} render 
    newgrp render 
    
  9. Verify the device is working with the i915 driver.

    $ sudo apt-get install hwinfo
    $ hwinfo --display 
    

    Example output for each Max 1100 card:

    ...
    274: PCI 2900.0: 0380 Display controller 
      [Created at pci.386] 
      Unique ID: W2eL.+ER_Ec9Ujm4   
      Parent ID: wIUg.xbjkZcxCQYD 
      SysFS ID: /devices/pci0000:26/0000:26:01.0/0000:27:00.0/0000:28:01.0/0000:29:00.0 
      SysFS BusID: 0000:29:00.0 
      Hardware Class: graphics card 
      Model: "Intel Display controller" 
      Vendor: pci 0x8086 "Intel Corporation" 
      Device: pci 0x0bda 
      SubVendor: pci 0x8086 "Intel Corporation" 
      SubDevice: pci 0x0000 
      Revision: 0x2f 
      Driver: "i915" 
      Driver Modules: "i915" 
      Memory Range: 0x3afe3f000000-0x3afe3fffffff (ro,non-prefetchable)   
      Memory Range: 0x3a7000000000-0x3a7fffffffff (ro,non-prefetchable) 
      IRQ: 787 (341 events) 
      Module Alias: "pci:v00008086d00000BDAsv00008086sd00000000bc03sc80i00" 
      Driver Info #0: 
        Driver Status: i915 is active 
        Driver Activation Cmd: "modprobe i915" 
      Config Status: cfg=new, avail=yes, need=no, active=unknown 
      Attached to: #210 (PCI bridge) 
    
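    As an alternative check, lspci -k reports which kernel driver is bound to each card (a minimal illustrative check):

    lspci -k | grep -A 3 "Display controller"

    Each Max 1100 entry should report "Kernel driver in use: i915".
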
  10. Perform a smoke test on the compute stack. This is not a comprehensive test; it only verifies that the GPU OpenCL runtime can be loaded. Additional tests are required to ensure full functionality.

    clinfo -l

    Platform #0: Intel(R) OpenCL Graphics
     +-- Device #0: Intel(R) Data Center GPU Max 1100
     `-- Device #1: Intel(R) Data Center GPU Max 1100
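
    For a slightly deeper check, running clinfo without -l prints the full device properties, including the driver version reported by the compute runtime (an illustrative filter):

    clinfo | grep -E "Device Name|Driver Version"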
  11. Update the device name reported by lspci by refreshing the PCI ID database.

    sudo /sbin/update-pciids
    lspci | grep Display

    The new GPU name:

    9a:00.0 Display controller: Intel Corporation Ponte Vecchio XT (1 Tile) [Data Center GPU Max 1100] (rev 2f)
    ca:00.0 Display controller: Intel Corporation Ponte Vecchio XT (1 Tile) [Data Center GPU Max 1100] (rev 2f)

    The previous GPU name (before the update):

    lspci | grep Display

    9a:00.0 Display controller [0380]: Intel Corporation Device [8086:0bda] (rev 2f)
    ca:00.0 Display controller [0380]: Intel Corporation Device [8086:0bda] (rev 2f)

Example workloads

The following workloads have been validated with this Max 1100 configuration: DGEMM, BabelSTREAM, and BERT Large (see the software components table above for references).

The documentation for each workload includes the steps for installing the necessary oneAPI toolkits.

Tools

This section describes the available tools that can help with application development and optimization.

Intel® XPU Manager

Intel® XPU Manager is a free and open-source tool for monitoring and managing Intel Data Center GPUs. It is designed to simplify administration, maximize reliability and uptime, and improve utilization.

For more information, see Intel® XPU System Management Interface User Guide.
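
A couple of basic xpu-smi commands illustrate typical use (a minimal sketch; see the user guide for the full command set):

xpu-smi discovery   # list the installed Max 1100 GPUs and their device IDs
xpu-smi stats -d 0  # show telemetry such as utilization, power, and temperature for device 0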

GDB – PVC debugger

GDB (gdb-oneapi, the Intel® Distribution for GDB) is installed on the machine as part of the oneAPI Base toolkit, so no extra step is needed to use it.

The following configuration is required to debug the GPU using GDB. It is a one-time setup on the system.

Prerequisite steps

Before setting up the GDB debugger, follow these steps.

  1. Add the following two kernel parameters to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub: i915.debug_eu=1 and i915.enable_hangcheck=0. Keep any options added earlier, such as pci=realloc=off.

    $ sudo vi /etc/default/grub
    GRUB_CMDLINE_LINUX_DEFAULT="… i915.debug_eu=1 i915.enable_hangcheck=0 pci=realloc=off"
    $ sudo update-grub
    $ sudo reboot
    
  2. Disable the preemption timeout on the GPU. Create a udev rules file (the file name below is an example; any file under /etc/udev/rules.d works), for example /etc/udev/rules.d/99-i915-disable-preempt.rules, containing the following single-line rule:

    ACTION=="add|bind", SUBSYSTEM=="pci", DRIVER=="i915", RUN+="/bin/bash -c 'for i in /sys/$DEVPATH/drm/card?/engine/[rc]cs*/preempt_timeout_ms; do echo 0 > $i; done'"

    Then reload and trigger the rule:

    $ sudo udevadm control --reload-rules
    $ sudo udevadm trigger -s pci --action=add
    
  3. Ensure preemption timeout is set correctly.

    $ find /sys/devices -regex '.*/drm/card[0-9]*/engine/[rc]cs[0-9]*/preempt_timeout_ms' -exec echo {} \; -exec cat {} \;
    
  4. Set up GDB debugger.

    $ source /opt/intel/oneapi/setvars.sh
    $ export ZET_ENABLE_PROGRAM_DEBUGGING=1
    $ python3 /path/to/intel/oneapi/diagnostics/latest/diagnostics.py --filter debugger_sys_check --force
    
  5. Compile the program.

    $ mkdir array-transform
    $ cd array-transform
    $ wget https://raw.githubusercontent.com/oneapi-src/oneAPI-samples/master/Tools/ApplicationDebugger/array-transform/src/array-transform.cpp
    $ icpx -fsycl -g -O0 array-transform.cpp -o array-transform
    $ export ONEAPI_DEVICE_SELECTOR=level_zero:0
    $ gdb-oneapi array-transform
    
  6. Run the program from the GDB console.

    (gdb) run
    

    Reference output:

    Starting program: /home/user1/workload/array-transform/array-transform
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
    [New Thread 0x7fffca416640 (LWP 46007)]
    [Thread 0x7fffca416640 (LWP 46007) exited]
    [New Thread 0x7fffc9a15640 (LWP 46008)]
    [Thread 0x7fffc9a15640 (LWP 46008) exited]
    intelgt: gdbserver-ze started for process 46004.
    [New Thread 0x7fffc8ff4640 (LWP 46023)][SYCL] Using device: [Intel(R) Data Center GPU Max 1100] from [Intel(R) Level-Zero]
    success; result is correct.
    [Thread 0x7fffc8ff4640 (LWP 46023) exited]
    [Inferior 1 (process 46004) exited normally]
    Detaching from process 1
    [Inferior 2 (device [9a:00.0]) detached]
    Detaching from process 2
    [Inferior 3 (device [ca:00.0]) detached]
    intelgt: inferior 2 (gdbserver-ze) has been removed.
    intelgt: inferior 3 (gdbserver-ze) has been removed.
    
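    To confirm GPU-side debugging rather than just a clean run, you can optionally set a breakpoint inside the SYCL kernel before running; the line number below is illustrative for this sample:

    (gdb) break array-transform.cpp:56
    (gdb) run
    (gdb) info threads

    When the breakpoint is hit, info threads lists the GPU threads alongside the host threads.
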
  7. Quit GDB console.

    (gdb) quit
    

Intel® VTune™ Profiler

This section describes how to use Intel® VTune™ Profiler with a DGEMM workload to analyze the performance of the Intel® Data Center GPU Max 1100.

The following steps assume the working directory is /home/user1/workload/benchmark/DGEMM. See DGEMM workload for setup steps.

Test setup:

$ sudo su
$ source /opt/intel/oneapi/setvars.sh
$ cd /home/user1/workload/benchmark/DGEMM
$ export ONEAPI_DEVICE_SELECTOR=level_zero:0
$ ./dgemm.mkl

With this system configuration, you should not see an error message such as "Failed to start profiling because the scope of the ptrace() system call application is limited." If you do encounter this error, set the kernel.yama.ptrace_scope sysctl option to 0 with the following command:

$ sysctl -w kernel.yama.ptrace_scope=0
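
To make the setting persist across reboots, add it to a sysctl drop-in file (the file name below is an example):

$ echo 'kernel.yama.ptrace_scope = 0' | sudo tee /etc/sysctl.d/99-ptrace.conf
$ sudo sysctl --system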

For more information, see the Intel® VTune™ Profiler User Guide.

VTune is a component of the oneAPI Base Toolkit, so no additional installation is required. Run it using the following command. For a detailed description of the parameters, refer to the VTune User Guide.

$ vtune -collect gpu-hotspots -k characterization-mode=overview -k collect-programming-api=true -data-limit=0 --duration 20 -- ./dgemm.mkl
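
When the collection finishes, the result can be summarized directly from the command line; the result directory name below is illustrative (VTune prints the actual name at the end of the collection):

$ vtune -report summary -r r000gh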

Support for loaned systems

If you need support during the sample period, either submit a service request or call the customer support center.

Submitting service requests

Follow these steps to submit a service request.

  1. Log in to the support portal.

  2. Select Intel® Data Center GPU Max 1100 and choose Create Request.

  3. Describe your issue on the next screen and select Check For Answers.

  4. Choose Continue to Request Creation.

  5. Provide answers to additional questions and click Submit Request.

A confirmation window will appear informing you a new case number has been created. You can expect a response within 24 hours.

Calling the customer support center

The customer support center is open Monday to Friday from 8 AM to 5 PM PST. To reach the center, please call: (+1) 855-816-1934.