Conversation

@raulcd
Member

@raulcd raulcd commented Dec 18, 2025

Rationale for this change

The CUDA jobs stopped working when the Voltron Data infrastructure went down. Together with ASF Infra, we have set up a runs-on solution to provide CUDA runners.

What changes are included in this PR?

Add a new cuda_extra.yml workflow with CI jobs that use the runs-on CUDA runners.

Because the underlying instances ship CUDA 12.9, the jobs to run are (a hypothetical sketch of the workflow follows the list):

  • AMD64 Ubuntu 22 CUDA 11.7.1
  • AMD64 Ubuntu 24 CUDA 12.9.0
  • AMD64 Ubuntu 22 CUDA 11.7.1 Python
  • AMD64 Ubuntu 24 CUDA 12.9.0 Python
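
For reference, a minimal sketch of what such a workflow could look like is shown below. The runner labels, matrix values, and docker-compose service names (ubuntu-cuda-cpp, ubuntu-cuda-python) are illustrative assumptions, not the exact contents of cuda_extra.yml:

    # Hypothetical sketch of cuda_extra.yml; runner labels and service names are assumptions.
    name: Extra CUDA
    jobs:
      docker-cuda:
        # Assumed label for the runs-on GPU runners; the real label comes from the runs-on setup.
        runs-on: ["self-hosted", "cuda"]
        strategy:
          fail-fast: false
          matrix:
            include:
              - { ubuntu: "22.04", cuda: "11.7.1", service: "ubuntu-cuda-cpp" }
              - { ubuntu: "24.04", cuda: "12.9.0", service: "ubuntu-cuda-cpp" }
              - { ubuntu: "22.04", cuda: "11.7.1", service: "ubuntu-cuda-python" }
              - { ubuntu: "24.04", cuda: "12.9.0", service: "ubuntu-cuda-python" }
        env:
          UBUNTU: ${{ matrix.ubuntu }}
          CUDA: ${{ matrix.cuda }}
        steps:
          - uses: actions/checkout@v4
            with:
              submodules: recursive
          - name: Install Archery
            run: pip install -e dev/archery[docker]
          - name: Run the docker-compose service with Archery
            run: archery docker run ${{ matrix.service }}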

A follow-up issue has been created to add jobs for CUDA 13; see #48783.

A new label CI: Extra: CUDA has also been created.

Are these changes tested?

Yes, via CI.

Are there any user-facing changes?

No

@github-actions github-actions bot added the awaiting committer review label Dec 18, 2025
@github-actions

⚠️ GitHub issue #48582 has been automatically assigned in GitHub to PR creator.

@github-actions github-actions bot added the awaiting changes label and removed the awaiting committer review label Dec 22, 2025
@github-actions github-actions bot added the awaiting change review label and removed the awaiting changes label Jan 7, 2026
@raulcd
Member Author

raulcd commented Jan 7, 2026

@pitrou @kou this PR was originally created to test the runs-on solution using our AWS infra to add the CUDA runners.
We used to have 4 CI jobs:

  • CUDA 13.0.2 C++
  • CUDA 13.0.2 Python
  • CUDA 11.7.1 C++
  • CUDA 11.7.1 Python

From an organizational standpoint, would you prefer to add the C++ jobs to CI: Extra: C++ and create a new CI: Extra: Python, or should we add a new CI: Extra: CUDA with those 4 jobs?

@raulcd
Member Author

raulcd commented Jan 7, 2026

The Python CUDA 13.0.2 errors are not related to this PR per se (this PR only adds the new runners), but there seems to be an issue initializing CUDA:
OSError: Cuda error 999 in function 'cuInit': [CUDA_ERROR_UNKNOWN] unknown error
@gmarkall @pitrou should I create a new issue to fix this separately once we have added the new runners, or do you have any pointers for a fix?

@gmarkall
Contributor

gmarkall commented Jan 7, 2026

do you have any pointers for a fix?

Perhaps the driver version on the machine is older than what the CUDA 13.0.2 toolkit used in the container requires. Do you have a way to check the driver version installed on the machine that's hosting the Docker image? (e.g. what is its nvidia-smi output?)
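
One way to surface that in the CI logs (a minimal sketch; the step itself is hypothetical and would run on the host runner, outside the container) is:

    # Hypothetical diagnostic step on the self-hosted runner, before any container starts.
    - name: Show host GPU driver version
      run: |
        nvidia-smi
        # Machine-readable driver version only:
        nvidia-smi --query-gpu=driver_version --format=csv,noheader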

@gmarkall
Contributor

gmarkall commented Jan 7, 2026

From the nvidia-smi output:

| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |

I think that the machine will need at least driver version 580 for a CUDA 13 container. Are you able to change the underlying machine?
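
If changing the underlying machine is not an option, a guard step could at least make the mismatch fail fast. A sketch, assuming the 580 threshold mentioned above; the step itself is hypothetical:

    # Hypothetical guard on the host runner: fail early if the driver is too old for CUDA 13 containers.
    - name: Check host driver supports CUDA 13
      run: |
        driver_major=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1 | cut -d. -f1)
        if [ "${driver_major}" -lt 580 ]; then
          echo "Driver ${driver_major}.x is older than the 580 series needed for CUDA 13 containers" >&2
          exit 1
        fi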

@gmarkall
Contributor

gmarkall commented Jan 7, 2026

Alternatively, it may be possible to make this configuration work by adding the relevant cuda-compat package to the container, which I think from https://docs.nvidia.com/deploy/cuda-compatibility/forward-compatibility.html#id1 may be cuda-compat-13-0.
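
For the record, a minimal sketch of that approach as a standalone verification step. The image tag and the LD_LIBRARY_PATH entry are assumptions; the package name cuda-compat-13-0 comes from the linked docs:

    # Hypothetical check: install the forward-compatibility package inside a CUDA 13 container
    # and see whether the driver API becomes usable. The image tag is an assumption.
    - name: Try cuda-compat forward compatibility
      run: |
        docker run --rm --gpus all nvidia/cuda:13.0.2-base-ubuntu24.04 bash -c '
          apt-get update && apt-get install -y cuda-compat-13-0 &&
          LD_LIBRARY_PATH=/usr/local/cuda/compat:$LD_LIBRARY_PATH nvidia-smi'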

@raulcd
Member Author

raulcd commented Jan 7, 2026

We are using the default images built here (yes, they point out that they ship CUDA 12):
https://github.com/runs-on/runner-images-for-aws?tab=readme-ov-file#gpu
There's the possibility of creating our own AWS images with Packer, but I am unsure we want to follow that route due to the maintenance overhead:
https://runs-on.com/runners/gpu/#using-nvidia-deep-learning-amis
I have a meeting with Cyril from runs-on tomorrow; I might ask whether CUDA 13 is on their roadmap.
Maybe we should go with cuda-compat here; I am trying a naive approach for testing purposes at the moment.

@raulcd
Member Author

raulcd commented Jan 7, 2026

Thanks @gmarkall for your help. Unfortunately, it seems to fail with a different error when cuda-compat-13-0 is installed:

 cuda.core._utils.cuda_utils.CUDAError: CUDA_ERROR_INVALID_IMAGE: This indicates that the device kernel image is invalid. This can also indicate an invalid CUDA module.

and nvidia-smi now shows 13.0:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 13.0     |
|-----------------------------------------+------------------------+----------------------+

@kou
Member

kou commented Jan 8, 2026

I prefer CI: Extra: CUDA because most changes aren't related to CUDA. If we use CI: Extra: C++/Python, we would need to use CUDA runners for many non-CUDA-related changes.

@gmarkall
Contributor

gmarkall commented Jan 8, 2026

@raulcd I only just caught up with this - did you switch to CUDA 12.9 because of the "CUDA_ERROR_INVALID_IMAGE" error with CUDA 13?

@raulcd
Member Author

raulcd commented Jan 8, 2026

did you switch to CUDA 12.9 because of the "CUDA_ERROR_INVALID_IMAGE" error with CUDA 13?

@gmarkall Yes, I think having some CUDA jobs for the release is better than having none. I've created a new issue to add CUDA 13, which we can follow up on once this is merged:

@raulcd raulcd marked this pull request as ready for review January 8, 2026 11:52
@raulcd raulcd requested review from pitrou and rok January 8, 2026 12:07
@raulcd raulcd changed the title from "GH-48582: [CI][GPU][C++][Python] Add new cuda jobs using the new self-hosted runners" to "GH-48582: [CI][GPU][C++][Python] Add new CUDA jobs using the new self-hosted runners" Jan 8, 2026
@gmarkall
Contributor

gmarkall commented Jan 8, 2026

Yes, I think having some CUDA jobs for the release is better than having none.

Sounds good 🙂

I've created a new issue to add CUDA 13

OK - I think we got that error because we had some mismatched libraries somehow, and I'd be happy to help troubleshoot that in the follow-up from this PR.

@pitrou
Member

pitrou commented Jan 8, 2026

A new label CI: Extra: CUDA has also been created.

Will those jobs also be run on a nightly basis, like the other Extra jobs?

@raulcd
Member Author

raulcd commented Jan 8, 2026

Will those jobs also be run on a nightly basis, like the other Extra jobs?

Yes, they will be scheduled:

  schedule:
    - cron: |
        0 6 * * *
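
For context, the full trigger block could look roughly like this. The cron entry matches the schedule quoted above; the label-based pull_request handling is an assumption about how the new CI: Extra: CUDA label would be wired up, mirroring the other Extra workflows:

    # Hypothetical trigger block for cuda_extra.yml.
    on:
      schedule:
        - cron: |
            0 6 * * *
      pull_request:
        # Assumed: re-run when labels change so "CI: Extra: CUDA" can opt a PR in.
        types: [labeled, opened, synchronize]
      workflow_dispatch:

    jobs:
      docker-cuda:
        # Assumed gate: on PRs, only run when the "CI: Extra: CUDA" label is present.
        if: >-
          github.event_name != 'pull_request' ||
          contains(github.event.pull_request.labels.*.name, 'CI: Extra: CUDA')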

@github-actions github-actions bot added the awaiting changes label and removed the awaiting change review label Jan 8, 2026
@raulcd raulcd requested a review from pitrou January 9, 2026 09:15