High CPU usage with Java FFI + liburing (vs fio and FileChannel) #1449

davidtos · 2025-08-07T22:50:08Z

davidtos
Aug 7, 2025

I'm using liburing via Java's Foreign Function Interface (FFI) to do random writes across a set of about 2200 files. I am seeing unexpectedly high CPU usage compared to fio and Java's FileChannel. I was hoping someone could give me some pointers on where to look to find out why this happens.

Setup:

ext4 on nvme
2211 files shared by all threads
Each thread gets its own ring
Flags used: IORING_SETUP_SINGLE_ISSUER, IORING_SETUP_COOP_TASKRUN, IORING_SETUP_DEFER_TASKRUN
Buffered IO
write size 512B to 16KB
using fixed files
kernel 6.14.0-27-generic

Scaling behavior:
My bindings scale well to 4 threads, but each additional thread adds less performance than the previous one, and beyond 8 threads performance starts getting worse.

Using perf record -e 'lock:*' -g --call-graph=dwarf -F 997 -p 23523, I am seeing some contention:

That goes some way toward explaining the iostat output I am seeing. My bindings:

avg-cpu:  %user   %system  %iowait  %idle
           5.63    81.40     7.00    5.97

Versus fio:

avg-cpu:  %user   %system  %iowait  %idle
           0.00     0.34    99.66    0.00

Fio job:

[global]
ioengine=io_uring
rw=randwrite
direct=0
bs=512
size=512
runtime=60
time_based=1
group_reporting=0
randrepeat=0
norandommap=1
refill_buffers=0
iodepth=32
fixedbufs=0
hipri=0
registerfiles=1
sqthread_poll=0

[shared_files_job]
numjobs=8
nrfiles=400
filesize=10M
filename_format=testfile.$filenum
file_service_type=roundrobin
sqthread_poll=0

I understand Java bindings won’t match fio’s raw speed, but I expected the profile to be more I/O-bound like fio, not the opposite. In essence, each benchmark thread in Java does the following:

setup_ring(...);
loop until done {
    submit N io_uring_prep_write() calls using a fixed file/buffer; // keep N tasks in flight
    io_uring_submit();
    io_uring_wait_cqe_nr(N);
}

I tried different versions of the previous example: submitting less or more often, peeking versus waiting for n CQEs, and different queue sizes, but nothing seems to make it perform more like fio. I also created a version of this with random reads, but that doesn't suffer from the scaling issue.
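For context, the FileChannel side of a comparison like this might look roughly like the sketch below. The actual benchmark code is not shown in this thread, so the class name, the `run`/`demo` helpers, and the file size are all hypothetical; only the 512B to 16KB write range and the buffered (non-direct) random writes come from the setup above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical baseline: buffered random writes through FileChannel.
final class FileChannelRandomWrite {
    static final int MIN_WRITE = 512;        // smallest write size from the setup
    static final int MAX_WRITE = 16 * 1024;  // largest write size from the setup

    /** Performs `count` positional writes at random offsets; returns total bytes written. */
    static long run(Path file, long fileSize, int count) {
        if (fileSize < MAX_WRITE) {
            throw new IllegalArgumentException("fileSize must be at least " + MAX_WRITE);
        }
        ByteBuffer buf = ByteBuffer.allocateDirect(MAX_WRITE); // contents are zeros
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        long total = 0;
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            for (int i = 0; i < count; i++) {
                int len = rnd.nextInt(MIN_WRITE, MAX_WRITE + 1);
                long offset = rnd.nextLong(fileSize - len + 1); // keep the write inside the file
                buf.clear().limit(len);
                // A positional write may be partial, so loop until the buffer is drained.
                while (buf.hasRemaining()) {
                    total += ch.write(buf, offset + buf.position());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    /** Runs 100 writes against a 1 MiB temporary file and cleans up afterwards. */
    static long demo() {
        try {
            Path tmp = Files.createTempFile("fc-randwrite", ".dat");
            try {
                return run(tmp, 1L << 20, 100);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Each write here is one synchronous pwrite-style call on the calling thread, so there is no submission queue and no worker pool involved, which is the main structural difference from the ring loop.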

I’m looking for guidance on how to structure multithreaded buffered I/O with shared files in a way that avoids the CPU bottlenecks I’m seeing.

Any insight into how best to mitigate it would be appreciated.

Thanks

UPDATE:

Some perf top results:

Single threaded:

Overhead  Shared Object          Symbol
   7.57%  [kernel]               [k] try_to_wake_up      
   6.14%  [kernel]               [k] io_init_req         
   5.72%  [kernel]               [k] io_issue_sqe        
   5.31%  [kernel]               [k] llist_reverse_order 
   5.30%  [kernel]               [k] _raw_spin_lock      
   4.56%  [kernel]               [k] __schedule          
   3.10%  [kernel]               [k] kfree               
   3.03%  [kernel]               [k] __pi_memset_generic 
   2.94%  [kernel]               [k] __slab_free         
   2.58%  [kernel]               [k] io_clean_op

6 threads:

Overhead  Shared Object          Symbol
  10.56%  [kernel]               [k] io_init_req                  
   6.32%  [kernel]               [k] io_clean_op                
   5.83%  [kernel]               [k] io_issue_sqe               
   3.57%  [kernel]               [k] io_prep_async_work         
   3.18%  [kernel]               [k] llist_reverse_order        
   3.13%  [kernel]               [k] __pi_memset_generic        
   3.11%  libc.so.6              [.] _int_malloc                
   2.73%  [kernel]               [k] __pi_clear_page            
   2.18%  [JIT] tid 298374       [.] 0x0000e559d4171db0         
   2.17%  [kernel]               [k] __slab_free                
   2.12%  liburing-ffi.so.2.9    [.] io_uring_prep_write

axboe · 2025-08-10T22:13:26Z

axboe
Aug 10, 2025
Maintainer

Since you closed without further comments, anything else to add here? You have a lot of io-wq contention, which is most likely due to either the bigger write sizes or just the storage and file system used. You will potentially see better performance if you have the threads share the io-wq backend, using IORING_SETUP_ATTACH_WQ to set up subsequent thread rings with struct io_uring_params->wq_fd set to the first ring's file descriptor.
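From the Java side, the params for the attached rings could be filled in along these lines. This is only a sketch: the flag values are the ones in the kernel's include/uapi/linux/io_uring.h, but the byte offsets and the 120-byte size are derived by hand from the struct io_uring_params definition and should be checked against the header you build against. The `RingParams` class and its `build` method are hypothetical names, and the call into io_uring_queue_init_params (for example by wrapping the buffer with MemorySegment.ofBuffer) is not shown.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper that prepares a struct io_uring_params image for FFI calls.
final class RingParams {
    // Flag values as defined in include/uapi/linux/io_uring.h.
    static final int IORING_SETUP_ATTACH_WQ     = 1 << 5;
    static final int IORING_SETUP_COOP_TASKRUN  = 1 << 8;
    static final int IORING_SETUP_SINGLE_ISSUER = 1 << 12;
    static final int IORING_SETUP_DEFER_TASKRUN = 1 << 13;

    // Assumed layout: ten __u32 fields (40 bytes) followed by two 40-byte offset structs.
    static final int PARAMS_SIZE  = 120;
    static final int FLAGS_OFFSET = 8;   // third __u32: flags
    static final int WQ_FD_OFFSET = 24;  // seventh __u32: wq_fd

    /**
     * Builds zeroed params with the flags from the setup above.
     * Pass -1 for the first ring; pass the first ring's fd for every later ring
     * so that it attaches to the first ring's io-wq backend.
     */
    static ByteBuffer build(int firstRingFd) {
        ByteBuffer p = ByteBuffer.allocateDirect(PARAMS_SIZE).order(ByteOrder.nativeOrder());
        int flags = IORING_SETUP_SINGLE_ISSUER
                  | IORING_SETUP_COOP_TASKRUN
                  | IORING_SETUP_DEFER_TASKRUN;
        if (firstRingFd >= 0) {
            flags |= IORING_SETUP_ATTACH_WQ;
            p.putInt(WQ_FD_OFFSET, firstRingFd);
        }
        p.putInt(FLAGS_OFFSET, flags);
        return p;
    }
}
```

The first thread would use `RingParams.build(-1)`, read ring_fd from its initialized struct io_uring, and hand that value to the other threads for `RingParams.build(fd)`.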

1 reply

davidtos Aug 11, 2025
Author

Thanks, I will try to see if IORING_SETUP_ATTACH_WQ helps with multithreading. I tried writing 10 bytes instead of 4096, but that doesn't make a difference. The Java code I am comparing against uses the same files on the same storage and file system. Maybe Java does some trickery, but I think I can rule out those two variables(?).

The weird thing is that even with a single ring the CPU usage jumps from an idle 4% to 47% during the benchmark, while Java's FileChannel only goes up to 17% CPU. This only happens with writing. The read benchmarks (uring vs FileChannel), which are almost the same as the writing ones, do not show this behavior and stay within a few percentage points of each other.

I guess my question is what would be a good way to see what those worker threads are up to?
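As a first step, the io_uring threads of a process can at least be counted by kind from /proc. The sketch below assumes a Linux kernel recent enough that io-wq workers are threads of the submitting process and are named iou-wrk-<pid> (SQPOLL threads are named iou-sqp-<pid>); the `IoWorkerThreads` class and `countByName` method are hypothetical names. It says how many workers exist, not what they are doing.

```java
import java.io.IOException;
import java.nio.file.DirectoryIteratorException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: counts io_uring kernel threads of a process by name prefix.
final class IoWorkerThreads {
    /** Returns e.g. {iou-wrk=14} for the given pid ("self" also works); empty if none. */
    static Map<String, Integer> countByName(String pid) {
        Map<String, Integer> counts = new TreeMap<>();
        Path taskDir = Path.of("/proc", pid, "task");
        if (!Files.isDirectory(taskDir)) {
            return counts; // not Linux, or no such process
        }
        try (DirectoryStream<Path> tasks = Files.newDirectoryStream(taskDir)) {
            for (Path task : tasks) {
                String comm;
                try {
                    comm = Files.readString(task.resolve("comm")).trim();
                } catch (IOException e) {
                    continue; // the thread exited while we were scanning
                }
                if (comm.startsWith("iou-")) {
                    // Group by the 7-character prefix: "iou-wrk" or "iou-sqp".
                    String kind = comm.length() >= 7 ? comm.substring(0, 7) : comm;
                    counts.merge(kind, 1, Integer::sum);
                }
            }
        } catch (IOException | DirectoryIteratorException e) {
            // Directory vanished mid-scan; return whatever was counted so far.
        }
        return counts;
    }
}
```

Calling `IoWorkerThreads.countByName("self")` from inside the benchmark JVM at intervals would show how the worker count changes over a run.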

davidtos · 2025-08-22T17:32:49Z

davidtos
Aug 22, 2025
Author

Tried running it with 20 threads, each with its own ring attached to a single ring using IORING_SETUP_ATTACH_WQ. The scores stay about the same, but with the flag it uses around 100 more threads: 2600 vs 2500 (across multiple runs). Guessing it's probably the work itself causing the issue.

1 reply

axboe Aug 22, 2025
Maintainer

2500 or 2600 threads is literally insane! That would be a huge efficiency issue, it's way too many threads for driving any kind of IO. What sq/cq depths are you setting up the ring with?

davidtos · 2025-08-23T13:44:26Z

davidtos
Aug 23, 2025
Author

I ran the benchmark with different numbers of threads and queue depths to see how many worker threads it creates (using ps -o pid,tid,comm -p 107901 -L | grep iou -c). It is running on a 16C/32T machine; seeing your reaction, I guess anything more than 100 threads is a bit excessive. I tried io_uring_register_iowq_max_workers, but that just lowers the benchmark scores.

Threads	Depth	Workers
1	16	14
32	16	608
1	32	19
32	32	1111
1	64	30
32	64	2096
1	128	100
32	128	4096
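Dividing the worker counts in the table above by threads × depth gives a rough "workers per queue slot" figure. The sketch below only does that arithmetic on the numbers already posted; the `WorkersPerSlot` class is a hypothetical name and nothing here is a new measurement.

```java
// Hypothetical helper: workers per in-flight queue slot, from the table above.
final class WorkersPerSlot {
    // Rows copied from the table: {threads, depth, workers}.
    static final int[][] ROWS = {
        {1, 16, 14},  {32, 16, 608},
        {1, 32, 19},  {32, 32, 1111},
        {1, 64, 30},  {32, 64, 2096},
        {1, 128, 100}, {32, 128, 4096},
    };

    /** Worker threads divided by the total number of queue slots (threads × depth). */
    static double ratio(int threads, int depth, int workers) {
        return (double) workers / ((long) threads * depth);
    }

    public static void main(String[] args) {
        for (int[] r : ROWS) {
            System.out.printf("%2d threads, depth %3d: %4d workers, %.2f per slot%n",
                    r[0], r[1], r[2], ratio(r[0], r[1], r[2]));
        }
    }
}
```

With 32 threads the ratio stays between 1.0 and about 1.19 at every depth, while a single ring lands between roughly 0.47 and 0.88. That would be consistent with close to one worker per in-flight write in the 32-thread runs, though that reading is a guess from the counts alone.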
