forked from NVIDIA/open-gpu-kernel-modules
Labels: bug
Description
NVIDIA Open GPU Kernel Modules Version
590.44.01
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
- I confirm that this does not happen with the proprietary driver package.
Operating System and Version
Ubuntu 22.04.5 LTS
Kernel Release
Linux GPU-18 5.15.0-127-generic #137-Ubuntu SMP Fri Nov 8 15:21:01 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
- I am running on a stable kernel release.
Hardware: GPU
RTX5090
Describe the bug
The P2P test itself works fine:
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 52.28GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Disabling peer access...
Shutting down...
Test passed
But when using the cuFile API (GDS), the nvidia-fs driver fails. cufile.log in user space shows:
29-12-2025 13:30:14:51 [pid=111425 tid=112507] ERROR 0:522 nvidia-fs MAP ioctl failed : ioctl_return: -22 ioctl_ret: -1
29-12-2025 13:30:14:51 [pid=111425 tid=112507] ERROR 0:555 map failed
29-12-2025 13:30:14:52 [pid=111425 tid=112507] ERROR cufio-obj:156 error allocating nvfs handle, size: 2097152
29-12-2025 13:30:14:52 [pid=111425 tid=112507] ERROR cufio_core:1648 cuFileBufRegister error, object allocation failed
29-12-2025 13:30:14:52 [pid=111425 tid=112507] ERROR cufio_core:1729 cuFileBufRegister error cufile success
29-12-2025 13:30:14:54 [pid=111425 tid=112507] ERROR 0:522 nvidia-fs MAP ioctl failed : ioctl_return: -22 ioctl_ret: -1
29-12-2025 13:30:14:54 [pid=111425 tid=112507] ERROR 0:555 map failed
29-12-2025 13:30:14:65 [pid=111425 tid=112507] ERROR 0:866 Buffer map failed for PCI-Group: 4 GPU: 4
29-12-2025 13:30:14:65 [pid=111425 tid=112507] ERROR 0:994 Failed to obtain bounce buffer from domain: 4 GPU: 4
29-12-2025 13:30:14:65 [pid=111425 tid=112507] ERROR 0:1299 failed to get bounce buffer for PCI group 4 GPU 4
29-12-2025 13:30:14:65 [pid=111425 tid=112507] ERROR cufio_core:3371 Final direct subio failed retval -5011 buf_offset: 0 file_offset: 4096 size: 528384
29-12-2025 13:30:14:65 [pid=111425 tid=112507] ERROR cufio_core:3392 Setting I/O to failed. Expected I/O Size 528384 actual: 0
29-12-2025 13:30:14:65 [pid=111425 tid=112515] ERROR cufio-obj:156 error allocating nvfs handle, size: 2097152
29-12-2025 13:30:14:65 [pid=111425 tid=112515] ERROR cufio_core:1648 cuFileBufRegister error, object allocation failed
29-12-2025 13:30:14:65 [pid=111425 tid=112515] ERROR cufio_core:1729 cuFileBufRegister error cufile success
29-12-2025 13:30:14:65 [pid=111425 tid=112525] ERROR 0:866 Buffer map failed for PCI-Group: 4 GPU: 4
29-12-2025 13:30:14:65 [pid=111425 tid=112525] ERROR 0:994 Failed to obtain bounce buffer from domain: 4 GPU: 4
29-12-2025 13:30:14:65 [pid=111425 tid=112525] ERROR 0:1299 failed to get bounce buffer for PCI group 4 GPU 4
29-12-2025 13:30:14:65 [pid=111425 tid=112525] ERROR cufio_core:3371 Final direct subio failed retval -5011 buf_offset: 0 file_offset: 4096 size: 528384
29-12-2025 13:30:14:65 [pid=111425 tid=112525] ERROR cufio_core:3392 Setting I/O to failed. Expected I/O Size 528384 actual: 0
29-12-2025 13:30:14:67 [pid=111425 tid=112515] ERROR 0:522 nvidia-fs MAP ioctl failed : ioctl_return: -22 ioctl_ret: -1
29-12-2025 13:30:14:67 [pid=111425 tid=112515] ERROR 0:555 map failed
and dmesg in kernel space shows:
[ 688.044426] nvidia-fs:nvfs_pin_gpu_pages:1336 Error ret -22 invoking nvidia_p2p_get_pages_persistent
va_start=0x40fc3400000/va_end=0x40fc34fffff/rounded_size=0x100000/gpu_buf_length=0x100000
[ 688.045534] nvidia-fs:nvfs_pin_gpu_pages:1336 Error ret -22 invoking nvidia_p2p_get_pages_persistent
va_start=0x80fc2d00000/va_end=0x80fc2dfffff/rounded_size=0x100000/gpu_buf_length=0x100000
[ 688.048290] nvidia-fs:nvfs_pin_gpu_pages:1336 Error ret -22 invoking nvidia_p2p_get_pages_persistent
va_start=0x80fc2f00000/va_end=0x80fc2ffffff/rounded_size=0x100000/gpu_buf_length=0x100000
[ 688.051622] nvidia-fs:nvfs_pin_gpu_pages:1336 Error ret -22 invoking nvidia_p2p_get_pages_persistent
va_start=0x80fc2900000/va_end=0x80fc29fffff/rounded_size=0x100000/gpu_buf_length=0x100000
[ 688.054450] nvidia-fs:nvfs_pin_gpu_pages:1336 Error ret -22 invoking nvidia_p2p_get_pages_persistent
I don't know if anyone is familiar with this.
To Reproduce
Compile nvidia-fs from source and insmod nvidia-fs.ko.
Use DALI to load data, which invokes cuFile APIs such as cuFileHandleRegister and cuFileBufRegister.
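For a repro that does not depend on DALI, the following minimal sketch exercises the same cuFileBufRegister path that hits the failing MAP ioctl. It is hypothetical and untested here: it needs a GDS-capable setup with cuda_runtime.h and cufile.h available (e.g. `nvcc repro.c -lcufile`):

```c
#include <stdio.h>
#include <cuda_runtime.h>
#include <cufile.h>

int main(void) {
    CUfileError_t st = cuFileDriverOpen();
    if (st.err != CU_FILE_SUCCESS) {
        fprintf(stderr, "cuFileDriverOpen failed: %d\n", st.err);
        return 1;
    }

    void *devPtr = NULL;
    cudaMalloc(&devPtr, 2 << 20);         /* 2 MiB, as in cufile.log */

    /* On the affected system this should fail with the nvidia-fs
     * MAP ioctl error (-22) seen in cufile.log above. */
    st = cuFileBufRegister(devPtr, 2 << 20, 0);
    if (st.err != CU_FILE_SUCCESS)
        fprintf(stderr, "cuFileBufRegister failed: %d\n", st.err);
    else
        cuFileBufDeregister(devPtr);

    cudaFree(devPtr);
    cuFileDriverClose();
    return 0;
}
```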
Bug Incidence
Always
nvidia-bug-report.log.gz
I suspect the problem lies in nvidia_p2p_get_pages_persistent.
More Info
No response