GH-48582: [CI][GPU][C++][Python] Add new CUDA jobs using the new self-hosted runners #48583
base: main
Conversation
@pitrou @kou this PR was originally created to test the
From an organization standpoint, would you prefer to have the C++ jobs added to the
The Python CUDA 13.0.2 errors are not related to this PR per se (this PR only adds the new runners), but there seems to be an issue initializing CUDA:
Perhaps the driver version on the machine is older than the CUDA 13.0.2 used in the container. Do you have a way to check the driver version installed on the machine that's hosting the Docker image? (e.g. what is its
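One way to surface that on the self-hosted runner is a small diagnostic step like the sketch below; the step is only an illustration and is not part of this PR:

```yaml
# Hypothetical diagnostic step: nvidia-smi reports the host driver version that
# containers inherit, and its header shows the newest CUDA version that driver supports.
- name: Report NVIDIA driver version
  run: |
    nvidia-smi --query-gpu=driver_version,name --format=csv,noheader
    nvidia-smi
```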
From the above, I think the machine will need at least driver version 580 for a CUDA 13 container. Are you able to change the underlying machine?
Alternatively, it may be possible to make this configuration work by adding the relevant
We are using the default images built here (yes, they point out they have CUDA 12):
Thanks @gmarkall for your help. Unfortunately, it seems to fail with a different error if and
I prefer
This reverts commit f5766b7.
@raulcd I only just caught up with this - did you switch to CUDA 12.9 because of the "CUDA_ERROR_INVALID_IMAGE" error with CUDA 13?
@gmarkall Yes, I think having some CUDA jobs for the release is better than having none. I've created a new issue to add CUDA 13, which we can follow up on once this is merged:
Sounds good 🙂
OK - I think we got that error because we had some mismatching libraries somehow, and I'd be happy to help troubleshoot that in the follow-up from this PR.
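For that follow-up, a minimal smoke test along the lines of the sketch below could help separate a driver/container mismatch from a library problem. It assumes the Python CUDA image ships pyarrow built with CUDA support; the step name is a placeholder:

```yaml
# Hypothetical troubleshooting step; fails fast with the raw driver error
# (e.g. CUDA_ERROR_INVALID_IMAGE) instead of deep inside the test suite.
- name: Smoke-test CUDA initialization
  run: |
    nvidia-smi
    python -c "import pyarrow.cuda as cuda; cuda.Context(0); print('CUDA context created OK')"
```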
Will those jobs also be run on a nightly basis, like the other Extra jobs?
Yes, they will be scheduled:
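For reference, the nightly run comes from a `schedule` trigger in the workflow; the cron expression below is illustrative and may not match what cuda_extra.yml actually uses:

```yaml
# Illustrative trigger block only; the real workflow may use a different cron
# expression or additional triggers.
on:
  schedule:
    - cron: "0 0 * * *"   # nightly, in line with the other Extra jobs
  workflow_dispatch:       # allow manual runs as well
```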
Rationale for this change
The CUDA jobs stopped working when the Voltron Data infrastructure went down. Together with ASF Infra, we have set up a runs-on solution to provide CUDA runners.
What changes are included in this PR?
Add the new workflow `cuda_extra.yml` with CI jobs that use the runs-on CUDA runners. Due to the underlying instances having CUDA 12.9, the jobs to be run are:
A follow-up issue has been created to add jobs for CUDA 13, see: #48783
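To give an idea of the shape of these jobs, a runs-on-backed CUDA job could look roughly like the sketch below; the runner labels, job names, and commands are placeholders, not the actual contents of cuda_extra.yml:

```yaml
# Sketch only: every label and command below is a placeholder, not the real
# cuda_extra.yml configuration.
jobs:
  cuda-cpp:
    name: C++ CUDA
    runs-on: ["self-hosted", "cuda"]   # placeholder labels for the runs-on CUDA runners
    timeout-minutes: 120
    steps:
      - name: Checkout Arrow
        uses: actions/checkout@v4
      - name: Show GPU and driver
        run: nvidia-smi
      - name: Build and test
        run: |
          # Placeholder: the real jobs drive the build through Arrow's
          # docker-compose/archery tooling with a CUDA-enabled image.
          echo "build and test C++ with CUDA here"
```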
A new label `CI: Extra: CUDA` has also been created.
Are these changes tested?
Yes, via CI.
Are there any user-facing changes?
No