Conversation

@raulcd
Member

@raulcd raulcd commented Dec 18, 2025

Rationale for this change

The CUDA jobs stopped working when the Voltron Data infrastructure went down. Together with ASF Infra, we have set up a runs-on solution to provide CUDA runners.

What changes are included in this PR?

Add a new cuda_extra.yml workflow with CI jobs that use the runs-on CUDA runners.

Because the underlying instances ship CUDA 12.9, the jobs to run are (a hypothetical sketch of the workflow follows the list):

  • AMD64 Ubuntu 22 CUDA 11.7.1
  • AMD64 Ubuntu 24 CUDA 12.9.0
  • AMD64 Ubuntu 22 CUDA 11.7.1 Python
  • AMD64 Ubuntu 24 CUDA 12.9.0 Python
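
For reference, a minimal sketch of what such a workflow could look like is shown below. The runner labels, matrix values, and docker-compose service names (ubuntu-cuda-cpp, ubuntu-cuda-python) are illustrative assumptions, not the exact contents of cuda_extra.yml:

    # Hypothetical sketch of cuda_extra.yml; runner labels and service names are assumptions.
    name: Extra CUDA
    jobs:
      docker-cuda:
        # Assumed label for the runs-on GPU runners; the real label comes from the runs-on setup.
        runs-on: ["self-hosted", "cuda"]
        strategy:
          fail-fast: false
          matrix:
            include:
              - { ubuntu: "22.04", cuda: "11.7.1", service: "ubuntu-cuda-cpp" }
              - { ubuntu: "24.04", cuda: "12.9.0", service: "ubuntu-cuda-cpp" }
              - { ubuntu: "22.04", cuda: "11.7.1", service: "ubuntu-cuda-python" }
              - { ubuntu: "24.04", cuda: "12.9.0", service: "ubuntu-cuda-python" }
        env:
          UBUNTU: ${{ matrix.ubuntu }}
          CUDA: ${{ matrix.cuda }}
        steps:
          - uses: actions/checkout@v4
            with:
              submodules: recursive
          - name: Install Archery
            run: pip install -e dev/archery[docker]
          - name: Run the docker-compose service with Archery
            run: archery docker run ${{ matrix.service }}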

A follow-up issue has been created to add jobs for CUDA 13; see #48783.

A new label CI: Extra: CUDA has also been created.

Are these changes tested?

Yes, via CI.

Are there any user-facing changes?

No

@github-actions github-actions bot added the awaiting committer review label Dec 18, 2025
@github-actions

⚠️ GitHub issue #48582 has been automatically assigned in GitHub to PR creator.

@github-actions github-actions bot added the awaiting changes label and removed the awaiting committer review label Dec 22, 2025
@github-actions github-actions bot added the awaiting change review label and removed the awaiting changes label Jan 7, 2026
@raulcd
Member Author

raulcd commented Jan 7, 2026

@pitrou @kou this PR was originally created to test the runs-on solution using our AWS infra to add the CUDA runners.
We used to have 4 CI jobs:

  • CUDA 13.0.2 C++
  • CUDA 13.0.2 Python
  • CUDA 11.7.1 C++
  • CUDA 11.7.1 Python

From an organizational standpoint, would you prefer to add the C++ jobs to CI: Extra: C++ and create a new CI: Extra: Python, or should we add a new CI: Extra: CUDA with those 4 jobs?

@raulcd
Member Author

raulcd commented Jan 7, 2026

The Python CUDA 13.0.2 errors are not related to this PR per se (this PR only adds the new runners), but there seems to be an issue initializing CUDA:
OSError: Cuda error 999 in function 'cuInit': [CUDA_ERROR_UNKNOWN] unknown error
@gmarkall @pitrou should I create a new issue to fix this separately once we have added the new runners, or do you have any pointers for a fix?

@gmarkall
Contributor

gmarkall commented Jan 7, 2026

do you have any pointers for a fix?

Perhaps the driver version on the machine is older than what the CUDA 13.0.2 toolkit used in the container requires. Do you have a way to check the driver version installed on the machine that's hosting the Docker image? (e.g. what is its nvidia-smi output?)
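
One way to surface that in the CI logs (a minimal sketch; the step itself is hypothetical and would run on the host runner, outside the container) is:

    # Hypothetical diagnostic step on the self-hosted runner, before any container starts.
    - name: Show host GPU driver version
      run: |
        nvidia-smi
        # Machine-readable driver version only:
        nvidia-smi --query-gpu=driver_version --format=csv,noheader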

@gmarkall
Contributor

gmarkall commented Jan 7, 2026

From the nvidia-smi output:

| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |

I think that the machine will need at least driver version 580 for a CUDA 13 container. Are you able to change the underlying machine?
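
If changing the underlying machine is not an option, a guard step could at least make the mismatch fail fast. A sketch, assuming the 580 threshold mentioned above; the step itself is hypothetical:

    # Hypothetical guard on the host runner: fail early if the driver is too old for CUDA 13 containers.
    - name: Check host driver supports CUDA 13
      run: |
        driver_major=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1 | cut -d. -f1)
        if [ "${driver_major}" -lt 580 ]; then
          echo "Driver ${driver_major}.x is older than the 580 series needed for CUDA 13 containers" >&2
          exit 1
        fi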

@gmarkall
Contributor

gmarkall commented Jan 7, 2026

Alternatively, it may be possible to make this configuration work by adding the relevant cuda-compat package to the container, which I think from https://docs.nvidia.com/deploy/cuda-compatibility/forward-compatibility.html#id1 may be cuda-compat-13-0.
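
For the record, a minimal sketch of that approach as a standalone verification step. The image tag and the LD_LIBRARY_PATH entry are assumptions; the package name cuda-compat-13-0 comes from the linked docs:

    # Hypothetical check: install the forward-compatibility package inside a CUDA 13 container
    # and see whether the driver API becomes usable. The image tag is an assumption.
    - name: Try cuda-compat forward compatibility
      run: |
        docker run --rm --gpus all nvidia/cuda:13.0.2-base-ubuntu24.04 bash -c '
          apt-get update && apt-get install -y cuda-compat-13-0 &&
          LD_LIBRARY_PATH=/usr/local/cuda/compat:$LD_LIBRARY_PATH nvidia-smi'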

@raulcd
Member Author

raulcd commented Jan 7, 2026

We are using the default images built here (yes, they point out that they ship CUDA 12):
https://github.com/runs-on/runner-images-for-aws?tab=readme-ov-file#gpu
There's the possibility of creating our own AWS images with Packer, but I am unsure we want to follow that route due to the maintenance overhead:
https://runs-on.com/runners/gpu/#using-nvidia-deep-learning-amis
I have a meeting with Cyril from runs-on tomorrow; I might ask whether CUDA 13 is on their roadmap.
Maybe we should go with cuda-compat here; I am trying a naive approach for testing purposes at the moment.

@raulcd
Member Author

raulcd commented Jan 7, 2026

Thanks @gmarkall for your help. Unfortunately, it seems to fail with a different error when cuda-compat-13-0 is installed:

 cuda.core._utils.cuda_utils.CUDAError: CUDA_ERROR_INVALID_IMAGE: This indicates that the device kernel image is invalid. This can also indicate an invalid CUDA module.

and nvidia-smi now shows 13.0:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 13.0     |
|-----------------------------------------+------------------------+----------------------+

@kou
Member

kou commented Jan 8, 2026

I prefer CI: Extra: CUDA because most changes aren't related to CUDA. If we use CI: Extra: C++/Python, we would need to use CUDA runners for many non-CUDA-related changes.

@gmarkall
Contributor

gmarkall commented Jan 8, 2026

@raulcd I only just caught up with this - did you switch to CUDA 12.9 because of the "CUDA_ERROR_INVALID_IMAGE" error with CUDA 13?

@raulcd
Member Author

raulcd commented Jan 8, 2026

did you switch to CUDA 12.9 because of the "CUDA_ERROR_INVALID_IMAGE" error with CUDA 13?

@gmarkall Yes, I think having some CUDA jobs for the release is better than having none. I've created a new issue to add CUDA 13, which we can follow up on once this is merged:

@raulcd raulcd marked this pull request as ready for review January 8, 2026 11:52
@raulcd raulcd requested review from pitrou and rok January 8, 2026 12:07
@raulcd raulcd changed the title from "GH-48582: [CI][GPU][C++][Python] Add new cuda jobs using the new self-hosted runners" to "GH-48582: [CI][GPU][C++][Python] Add new CUDA jobs using the new self-hosted runners" Jan 8, 2026
@gmarkall
Contributor

gmarkall commented Jan 8, 2026

Yes, I think having some CUDA jobs for the release is better than having none.

Sounds good 🙂

I've created a new issue to add CUDA 13

OK - I think we got that error because we had some mismatched libraries somehow, and I'd be happy to help troubleshoot that in the follow-up from this PR.

@pitrou
Member

pitrou commented Jan 8, 2026

A new label CI: Extra: CUDA has also been created.

Will those jobs also be run on a nightly basis, like the other Extra jobs?

@raulcd
Member Author

raulcd commented Jan 8, 2026

Will those jobs also be run on a nightly basis, like the other Extra jobs?

Yes, they will be scheduled:

  schedule:
    - cron: |
        0 6 * * *
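
For context, the full trigger block could look roughly like this. The cron entry matches the schedule quoted above; the label-based pull_request handling is an assumption about how the new CI: Extra: CUDA label would be wired up, mirroring the other Extra workflows:

    # Hypothetical trigger block for cuda_extra.yml.
    on:
      schedule:
        - cron: |
            0 6 * * *
      pull_request:
        # Assumed: re-run when labels change so "CI: Extra: CUDA" can opt a PR in.
        types: [labeled, opened, synchronize]
      workflow_dispatch:

    jobs:
      docker-cuda:
        # Assumed gate: on PRs, only run when the "CI: Extra: CUDA" label is present.
        if: >-
          github.event_name != 'pull_request' ||
          contains(github.event.pull_request.labels.*.name, 'CI: Extra: CUDA')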

@github-actions github-actions bot added the awaiting changes label and removed the awaiting change review label Jan 8, 2026
@raulcd raulcd requested a review from pitrou January 9, 2026 09:15