
support for GDS, nvidia-fs #50


Description

@lyquid617

NVIDIA Open GPU Kernel Modules Version

590.44.01

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Ubuntu 22.04.5 LTS

Kernel Release

Linux GPU-18 5.15.0-127-generic #137-Ubuntu SMP Fri Nov 8 15:21:01 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

RTX5090

Describe the bug

Peer-to-peer (P2P) between the GPUs works fine:

Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 52.28GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Disabling peer access...
Shutting down...
Test passed

But when trying to use the cuFile API (GDS), the nvidia-fs driver fails. From cufile.log in user space:

 29-12-2025 13:30:14:51 [pid=111425 tid=112507] ERROR  0:522 nvidia-fs MAP ioctl failed : ioctl_return: -22 ioctl_ret: -1
 29-12-2025 13:30:14:51 [pid=111425 tid=112507] ERROR  0:555 map failed

 29-12-2025 13:30:14:52 [pid=111425 tid=112507] ERROR  cufio-obj:156 error allocating nvfs handle, size: 2097152
 29-12-2025 13:30:14:52 [pid=111425 tid=112507] ERROR  cufio_core:1648 cuFileBufRegister error, object allocation failed
 29-12-2025 13:30:14:52 [pid=111425 tid=112507] ERROR  cufio_core:1729 cuFileBufRegister error cufile success
 29-12-2025 13:30:14:54 [pid=111425 tid=112507] ERROR  0:522 nvidia-fs MAP ioctl failed : ioctl_return: -22 ioctl_ret: -1
 29-12-2025 13:30:14:54 [pid=111425 tid=112507] ERROR  0:555 map failed
 29-12-2025 13:30:14:65 [pid=111425 tid=112507] ERROR  0:866 Buffer map failed for PCI-Group: 4 GPU: 4
 29-12-2025 13:30:14:65 [pid=111425 tid=112507] ERROR  0:994 Failed to obtain bounce buffer from domain: 4 GPU: 4
 29-12-2025 13:30:14:65 [pid=111425 tid=112507] ERROR  0:1299 failed to get bounce buffer for PCI group 4 GPU 4
 29-12-2025 13:30:14:65 [pid=111425 tid=112507] ERROR  cufio_core:3371 Final direct subio failed retval  -5011  buf_offset:  0  file_offset:  4096  size:  528384
 29-12-2025 13:30:14:65 [pid=111425 tid=112507] ERROR  cufio_core:3392 Setting I/O to failed. Expected I/O Size  528384  actual:  0
 29-12-2025 13:30:14:65 [pid=111425 tid=112515] ERROR  cufio-obj:156 error allocating nvfs handle, size: 2097152
 29-12-2025 13:30:14:65 [pid=111425 tid=112515] ERROR  cufio_core:1648 cuFileBufRegister error, object allocation failed
 29-12-2025 13:30:14:65 [pid=111425 tid=112515] ERROR  cufio_core:1729 cuFileBufRegister error cufile success
 29-12-2025 13:30:14:65 [pid=111425 tid=112525] ERROR  0:866 Buffer map failed for PCI-Group: 4 GPU: 4
 29-12-2025 13:30:14:65 [pid=111425 tid=112525] ERROR  0:994 Failed to obtain bounce buffer from domain: 4 GPU: 4
 29-12-2025 13:30:14:65 [pid=111425 tid=112525] ERROR  0:1299 failed to get bounce buffer for PCI group 4 GPU 4
 29-12-2025 13:30:14:65 [pid=111425 tid=112525] ERROR  cufio_core:3371 Final direct subio failed retval  -5011  buf_offset:  0  file_offset:  4096  size:  528384
 29-12-2025 13:30:14:65 [pid=111425 tid=112525] ERROR  cufio_core:3392 Setting I/O to failed. Expected I/O Size  528384  actual:  0
 29-12-2025 13:30:14:67 [pid=111425 tid=112515] ERROR  0:522 nvidia-fs MAP ioctl failed : ioctl_return: -22 ioctl_ret: -1
 29-12-2025 13:30:14:67 [pid=111425 tid=112515] ERROR  0:555 map failed

And from dmesg in kernel space:

[  688.044426] nvidia-fs:nvfs_pin_gpu_pages:1336 Error ret -22 invoking nvidia_p2p_get_pages_persistent
                va_start=0x40fc3400000/va_end=0x40fc34fffff/rounded_size=0x100000/gpu_buf_length=0x100000
[  688.045534] nvidia-fs:nvfs_pin_gpu_pages:1336 Error ret -22 invoking nvidia_p2p_get_pages_persistent
                va_start=0x80fc2d00000/va_end=0x80fc2dfffff/rounded_size=0x100000/gpu_buf_length=0x100000
[  688.048290] nvidia-fs:nvfs_pin_gpu_pages:1336 Error ret -22 invoking nvidia_p2p_get_pages_persistent
                va_start=0x80fc2f00000/va_end=0x80fc2ffffff/rounded_size=0x100000/gpu_buf_length=0x100000
[  688.051622] nvidia-fs:nvfs_pin_gpu_pages:1336 Error ret -22 invoking nvidia_p2p_get_pages_persistent
                va_start=0x80fc2900000/va_end=0x80fc29fffff/rounded_size=0x100000/gpu_buf_length=0x100000
[  688.054450] nvidia-fs:nvfs_pin_gpu_pages:1336 Error ret -22 invoking nvidia_p2p_get_pages_persistent

I don't know if anyone is familiar with this.

To Reproduce

Compile nvidia-fs from source and insmod nvidia-fs.ko.
Use DALI to load data, which invokes cuFile APIs such as cuFileHandleRegister and cuFileBufRegister (a minimal standalone sketch is shown below).
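
For reference, a minimal standalone sketch (without DALI) that exercises the same cuFileHandleRegister / cuFileBufRegister path could look like the following. The file path, buffer size, and error handling are placeholders added for illustration, not part of the original report; build against libcufile and the CUDA runtime, e.g. gcc repro.c -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcufile -lcudart -o repro.

/* Minimal cuFile reproducer sketch. Assumes GDS is installed and the
 * file lives on a GDS-capable filesystem; the path is a placeholder. */
#define _GNU_SOURCE              /* for O_DIRECT */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <cuda_runtime.h>
#include <cufile.h>

int main(void)
{
    const size_t len = 2 * 1024 * 1024;   /* 2 MiB, like the failing registration above */
    void *dev_buf = NULL;

    CUfileError_t st = cuFileDriverOpen();
    if (st.err != CU_FILE_SUCCESS) { fprintf(stderr, "cuFileDriverOpen: %d\n", st.err); return 1; }

    int fd = open("/mnt/test/data.bin", O_RDONLY | O_DIRECT);   /* placeholder path */
    if (fd < 0) { perror("open"); return 1; }

    CUfileDescr_t descr = {0};
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

    CUfileHandle_t fh;
    st = cuFileHandleRegister(&fh, &descr);
    if (st.err != CU_FILE_SUCCESS) { fprintf(stderr, "cuFileHandleRegister: %d\n", st.err); return 1; }

    if (cudaMalloc(&dev_buf, len) != cudaSuccess) { fprintf(stderr, "cudaMalloc failed\n"); return 1; }

    /* This is the call that fails in this report: the nvidia-fs MAP ioctl returns -22. */
    st = cuFileBufRegister(dev_buf, len, 0);
    if (st.err != CU_FILE_SUCCESS)
        fprintf(stderr, "cuFileBufRegister: %d\n", st.err);

    ssize_t n = cuFileRead(fh, dev_buf, len, 0, 0);
    fprintf(stderr, "cuFileRead returned %zd\n", n);

    cuFileBufDeregister(dev_buf);
    cuFileHandleDeregister(fh);
    cudaFree(dev_buf);
    close(fd);
    cuFileDriverClose();
    return 0;
}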

Bug Incidence

Always

nvidia-bug-report.log.gz

I guess the problem lies in nvidia_p2p_get_pages_persistent.
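
For context, nvidia-fs pins the registered GPU buffer through the driver's kernel P2P API declared in nv-p2p.h. The following is a simplified sketch of that call, not the actual nvfs_pin_gpu_pages code (the helper name is made up); the -22 in dmesg above is -EINVAL returned by nvidia_p2p_get_pages_persistent.

#include <linux/types.h>
#include <linux/kernel.h>
#include "nv-p2p.h"   /* from the open GPU kernel modules */

/* Sketch: persistently pin a GPU VA range so it can be DMA-mapped for GDS. */
static int pin_gpu_range(u64 va_start, u64 len)
{
    struct nvidia_p2p_page_table *page_table = NULL;

    int ret = nvidia_p2p_get_pages_persistent(va_start, len, &page_table, 0);
    if (ret) {
        pr_err("nvidia_p2p_get_pages_persistent failed: %d\n", ret);
        return ret;   /* -22 (-EINVAL) here matches the failure in this report */
    }

    /* ... DMA-map page_table->pages[], and later release the pinning with
     * nvidia_p2p_put_pages_persistent(va_start, page_table, 0) ... */
    return 0;
}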

More Info

No response
