Add HIP 0001 ring buffer proposal for Hyperlight I/O #1112
base: main
Conversation
> ### Risks and Mitigations
>
> **Risk**: Malicious guest corrupting the queue **Mitigation**: Do not expose low level queue API to
I don't know that not exposing it to the guest does anything. The guest is fully untrusted, so we need the mitigation suggested after this: serialize all data as known types and assert on what we read back.
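For illustration, a minimal sketch of the kind of host-side validation this implies, assuming a hypothetical `Desc` struct and made-up bounds parameters (not the proposal's API): copy the descriptor out of shared memory once, then assert that its fields describe a buffer that lies entirely inside guest memory before touching it.

```rust
// Hypothetical sketch only; `Desc` and the bounds here are illustrative.
#[derive(Clone, Copy)]
struct Desc {
    addr: u64,
    len: u32,
    id: u16,
    flags: u16,
}

#[derive(Debug)]
enum DescError {
    OutOfBounds,
    BadId,
}

fn validate_desc(desc: Desc, guest_base: u64, guest_size: u64, queue_size: u16) -> Result<Desc, DescError> {
    // The buffer must lie entirely inside guest memory; use checked math so a
    // malicious addr/len pair cannot wrap around.
    let end = desc
        .addr
        .checked_add(desc.len as u64)
        .ok_or(DescError::OutOfBounds)?;
    if desc.addr < guest_base || end > guest_base.saturating_add(guest_size) {
        return Err(DescError::OutOfBounds);
    }
    // The ID must index a valid slot in the host-side bookkeeping tables.
    if desc.id >= queue_size {
        return Err(DescError::BadId);
    }
    Ok(desc)
}
```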
> example `ioeventfd` for kvm). This is especially useful for streaming scenarios where the guest can
> continue processing while the host consumes data asynchronously.
>
> **3. Inline Descriptors**
Is this proposed to be implemented now, or as a possible improvement in the future?
I would like to give it a try in the first draft, but whether we keep it really depends on two factors: 1) can we actually fit any meaningful data into an inline slot, e.g. can a serialized function call with a single argument fit? 2) is the performance improvement for such calls substantial? Both are hard to assess without an actual implementation.
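As a rough illustration of the first question, a back-of-the-envelope sketch assuming the standard 16-byte packed virtqueue descriptor layout (`addr: u64`, `len: u32`, `id: u16`, `flags: u16`); the constants and helper below are made up for the example.

```rust
// Back-of-the-envelope only: an inline variant could at most reuse the
// addr + len fields as payload, leaving roughly 12 bytes per descriptor.
// Whether a serialized single-argument call fits in that is exactly the
// open question above.
const DESC_SIZE: usize = 16;
const ID_AND_FLAGS: usize = 2 + 2; // id + flags still need to be addressable
const INLINE_BUDGET: usize = DESC_SIZE - ID_AND_FLAGS; // 12 bytes

fn fits_inline(serialized_call_len: usize) -> bool {
    serialized_call_len <= INLINE_BUDGET
}

fn main() {
    assert!(fits_inline(10));  // e.g. a hypothetical 10-byte encoded call
    assert!(!fits_inline(48)); // larger calls still need a real buffer
}
```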
> 2. Device writes up to buffer length
> 3. Device sets actual written length in descriptor
> 4. If `actual_length > buffer_length`, device sets a `TRUNCATED` flag,
> 5. Driver can re-submit with larger buffer if needed
I can imagine this being a possible performance issue. Can we provide a metric for when this occurs, and a way to improve the initial buffer size?
That's a good point. I would imagine this only happens when you try to transport a "big" chunk of data, for which the base case should be using a stream rather than pushing everything at once. That being said, the protocol should support the full round trip, but really the mitigation for this problem should be to use streams.
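A minimal sketch of the round trip being described, with hypothetical `submit`/`Completion` stand-ins rather than the proposal's actual API; the point is that the device reports the full length it wanted to write, so a single retry with exactly that size is enough.

```rust
// Hypothetical stand-ins: `submit` posts a writable buffer of the given
// length and blocks for the completion.
struct Completion {
    truncated: bool,   // device set the TRUNCATED flag
    actual_len: usize, // full length the device wanted to write
    data: Vec<u8>,     // bytes actually written (up to the buffer length)
}

fn submit(_buf_len: usize) -> Completion {
    unimplemented!("placeholder: post a writable buffer and wait for completion")
}

// Worst case is one extra round trip: retry with the exact size reported by
// the device. Genuinely large payloads should use the streaming path instead.
fn call_with_retry(initial_len: usize) -> Vec<u8> {
    let first = submit(initial_len);
    if !first.truncated {
        return first.data;
    }
    submit(first.actual_len).data
}
```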
> ### Snapshotting
>
> Snapshotting requires that the descriptor table has no in-flight guest-to-host requests and any
> attempt to snapshot a sandbox with such pending requests will result in a snapshot failure.
Can we ensure this?
Yes, we can query the queue for the number of in-flight messages during snapshotting and fail if it reports a positive number. Because we already have to track the number of free slots in the queue, getting the number of in-flight messages is just the length of the queue minus the free slots.
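A minimal sketch of that check, assuming the host-side queue wrapper already tracks `queue_size` and `free_slots` (names are illustrative):

```rust
#[derive(Debug)]
struct SnapshotError;

fn inflight(queue_size: u16, free_slots: u16) -> u16 {
    queue_size - free_slots
}

fn ensure_snapshottable(queue_size: u16, free_slots: u16) -> Result<(), SnapshotError> {
    if inflight(queue_size, free_slots) > 0 {
        // Refuse to snapshot while guest-to-host requests are still pending.
        return Err(SnapshotError);
    }
    Ok(())
}
```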
jsturtevant left a comment
This looks great, thanks for such a detailed write-up. Left a few minor comments, but otherwise this seems like it would set the project up to work well for the future.
Signed-off-by: Tomasz Andrzejak <andreiltd@gmail.com>
2318367 to 25f5d22
Pull request overview
This pull request adds HIP 0001, a comprehensive proposal document for implementing a virtio-inspired ring buffer mechanism for Hyperlight I/O. The proposal aims to replace the current stack-based communication model with a more efficient ring buffer approach to reduce VM exits and enable streaming communication patterns.
Changes:
- Adds a detailed technical proposal document describing the ring buffer design, API, and implementation plan
- Updates typos.toml to allow "readables" and "writables" as valid terms used in the proposal
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| proposals/0001-rng-buf/README.md | Comprehensive HIP document describing virtio-inspired ring buffer design with memory layout, API examples, and implementation plan |
| typos.toml | Adds exceptions for "readables" and "writables" terms used in the proposal document |
Really great work on this HIP, @andreiltd! Loved seeing the diagrams, extensive testing plan, code, and whatnot. A few thoughts:

1. **Side-by-side comparison suggestion.** I think the HIP could benefit from a more explicit side-by-side comparison in the "Comparison with current implementation" section. Something like a table showing: I feel like this would help readers quickly grasp the key differences between the two approaches.

2. **Related work in Nanvix fork.** I've done some similar work in the Nanvix hyperlight fork that might be relevant: 76c9e7c There, I added a

3. **Single-threaded vs multi-threaded phasing.** For the multi-threaded case (guest and host on separate threads), the host would need a way to interrupt the running guest when responses are ready--yes? I've also done related work for that in Nanvix's Hyperlight fork: a897cad for KVM. My hardware interrupts work enables PIC/APIC interrupt injection, which could support the ioeventfd-style notifications mentioned in the HIP. Happy to coordinate on this when the time comes :)

The backward compatibility goal (current function call model portable without public API changes) is really cool to see--great for downstream consumers like Nanvix. Looking forward to seeing this land!
Fix typos

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Tomasz Andrzejak <andreiltd@users.noreply.github.com>
Signed-off-by: Tomasz Andrzejak <andreiltd@gmail.com>
a2a24cf to 934dee7
syntactically left a comment
I mostly really like this! Left a few minor nits, and a couple of bigger comments clarifying the question of where buffers get allocated.
I didn't review the Rust type design in detail, as I understand that is going to be updated shortly.
> ### Risks and Mitigations
>
> **Risk**: Malicious guest corrupting the queue **Mitigation**: Do not expose low level queue API to
I agree with @jsturtevant here---the raw state of the queue is fundamentally exposed to the guest. The thing that we need to do on the host for safety is figure out what that means for us: making sure that the buffers are always in guest memory, ensuring that we do not access them in racy ways (even if the guest tries to trick us into having the same buffer both read from and written to at the same time), etc.
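To make the second point concrete, a hypothetical sketch (the helper name and signature are made up, not Hyperlight's API) of the discipline this implies on the host: bounds-check the guest-supplied range, then copy the bytes out once before parsing, so a concurrent guest write cannot change the data after it has been validated.

```rust
// Hypothetical helper: `guest_mem` is the host's view of guest memory and
// `offset`/`len` come from an untrusted descriptor.
fn copy_out_checked(guest_mem: &[u8], offset: usize, len: usize) -> Option<Vec<u8>> {
    let end = offset.checked_add(len)?;
    if end > guest_mem.len() {
        return None; // buffer not entirely inside guest memory
    }
    // In the real implementation this would be a volatile copy out of shared
    // memory (see the shared_mem.rs discussion below), not a plain slice copy.
    Some(guest_mem[offset..end].to_vec())
}
```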
> ## Design Details
>
> ### Packed Virtqueue Overview
I don't love the producer/consumer terminology here. I think it added to the earlier confusion about how buffer allocation responsibility should be divided. How about something like: "the driver is the side that can allocate buffers accessible by both sides (i.e. the guest), and the device is the side that reads from/writes to these buffers on request (i.e. the host)". The host is essentially providing a specialised paravirtualised device to the guest, and so always takes on the device role.
> can act as driver or device depending on the direction of communication.
>
> #### Bidirectional Communication
How about something like:

> Hyperlight needs to support both device-initiated (for host->guest function calls) and driver-initiated (for guest->host function calls) communication. This is generally accomplished by having the guest create, during early initialisation, a set of available buffers into which the host should write incoming work for the guest, and keep that set populated.

I am not fundamentally opposed to having split rx/tx queues, like e.g. virtio-net, but I'm unsure whether they actually make sense here, at least in the single-threaded case?
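For concreteness, an illustrative guest-side sketch of that scheme; `GuestRing`, `GuestPool`, and the buffer sizes are hypothetical stand-ins, not the proposal's types.

```rust
// Illustrative only: during early init the guest posts a set of
// device-writable buffers, so the host always has somewhere to write the
// next host->guest call; the guest re-posts each buffer after consuming it.
struct Buffer {
    addr: u64,
    len: u32,
}

struct GuestPool;
impl GuestPool {
    fn alloc(&mut self, len: u32) -> Buffer {
        // Placeholder: carve a buffer out of guest-owned shared memory.
        Buffer { addr: 0, len }
    }
}

struct GuestRing;
impl GuestRing {
    fn submit_writable(&mut self, _buf: Buffer) {
        // Placeholder: publish the buffer as device-writable (WRITE flag set).
    }
}

const RX_BUFFERS: usize = 8;
const RX_BUF_LEN: u32 = 4096;

fn post_initial_rx_buffers(ring: &mut GuestRing, pool: &mut GuestPool) {
    for _ in 0..RX_BUFFERS {
        let buf = pool.alloc(RX_BUF_LEN);
        ring.submit_writable(buf);
    }
}
```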
> #### Memory Layout
>
> The queue structure reside in shared memory and are accessible to both host and guest. The layout
Should this be an offset from the mapped region base, or a GPA? I would tend to think GPA, even though in most cases the host will want to ensure that it is in the scratch region and then convert to an offset from there.
> <img width="1324" height="726" alt="submit" src="https://github.com/user-attachments/assets/a3ee2dea-a55b-4d50-8b96-1702617a21f0" />
>
> **Step 2:** Device processes and writes response
I assume that in this picture there is implicitly some header inside the buffers passed where some kind of call ID allocated by the caller can be used to link up calls/responses?
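One possible shape for such a header, purely as an assumption since the HIP does not pin down this layout:

```rust
// Hypothetical message header; field names and sizes are assumptions. A
// caller-allocated `call_id` carried in front of the payload lets responses
// be matched back to the originating call regardless of the order in which
// buffers complete.
#[repr(C)]
struct MsgHeader {
    call_id: u64, // allocated by the caller, echoed back in the response
    kind: u16,    // e.g. request vs response vs stream chunk
    flags: u16,   // e.g. a TRUNCATED bit for partial responses
    len: u32,     // payload length following this header
}
```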
> descriptor layer or in the flatbuffer schema:
>
> 1. Driver allocates buffer of estimated size
> 2. Device writes up to buffer length
Is this flag set in the queue flags, or in the header protocol inside the buffers? I assume the latter needs to keep track of both a request ID and an offset from the beginning of the response to that request for this buffer, for this to work?
> ### Snapshotting
>
> Snapshotting requires that the descriptor table has no in-flight guest-to-host requests and any
I guess there is a slightly more complex issue with snapshotting as well: the state of the buffer pool in the driver will need to be considered/updated, because (a) the physical addresses of the inflight "write the next request for me here" buffers will need to be updated to the new scratch region and (b) something will need to be done (perhaps the same update? but slightly unclear, in case the guest is currently using data in those buffers) with the buffers in the allocator pool that are not currently available in the queue.
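A rough sketch of the address fix-up this would imply on restore, assuming buffer addresses are GPAs into a relocatable scratch region (names are made up): every buffer the driver still holds, whether sitting in the queue or checked out of the allocator pool, would need its address rebased from the old scratch region to the new one.

```rust
// Illustrative only: rebase a buffer GPA from the old scratch region to the
// new one after a snapshot restore. The harder part the comment raises --
// what to do with buffers the guest is actively using -- is not solved here.
fn rebase(addr: u64, old_base: u64, new_base: u64) -> u64 {
    new_base + (addr - old_base)
}
```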
Sorry, some kind of glitch seems to have resulted in my comments getting posted twice. I can't delete this duplicate review comment, apparently.
syntactically left a comment
I have some minor comments on the Rust definitions as well.
> This mechanism ensures that no locks are required for synchronization, only memory barriers combined
> with atomic publishing of flags ensure that the other side will never observe a partial update:
>
> - Driver: Write descriptor fields → memory barrier → atomic Release-store flags
What barrier were you thinking we might need here? I would think that the synchronising release store and acquire load should always be enough.
Separately-ish, there is the issue of how to ensure that the host does not have data race UB, even in the face of a guest that maliciously violates the synchronisation protocol. This is all a bit complicated by the fact that the reads/writes on the other side are outside the Rust abstract machine. We have some notes about this in shared_mem.rs, which were written for the future thinking about this kind of concurrent update situation. There was also a recentish discussion in unsafe-code-guidelines about this topic, which came to a similar conclusion to our notes: what we really want is a relaxed atomic-per-byte memcpy to get data in/out of the guest (after the acquire atomic load), but unfortunately this operation does not exist in Rust yet. So, I think that the present approach of having an acquire load and then volatile non-atomic reads/writes is still the "least bad" option, even though it is not exactly correct w.r.t. the memory model---hopefully we can eventually switch to the per-byte atomic memcpy.
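For reference, a minimal sketch of the Release/Acquire pairing being discussed, assuming a hypothetical flags word in shared memory (the real packed-ring protocol also involves paired avail/used bits and wrap counters, omitted here):

```rust
use std::sync::atomic::{AtomicU16, Ordering};

// Hypothetical "descriptor is available" bit.
const AVAIL: u16 = 1 << 7;

// Driver side: fill in the descriptor body first, then publish the flags
// with a Release store so the body becomes visible before the flag flips.
fn publish(flags: &AtomicU16, write_body: impl FnOnce()) {
    write_body();
    flags.store(AVAIL, Ordering::Release);
}

// Device side: an Acquire load of the flags synchronises with the Release
// store above; only after observing AVAIL is the body read (in practice via
// volatile reads out of shared memory, per the shared_mem.rs notes).
fn try_consume(flags: &AtomicU16, read_body: impl FnOnce()) -> bool {
    if flags.load(Ordering::Acquire) & AVAIL != 0 {
        read_body();
        true
    } else {
        false
    }
}
```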
> pub struct DescFlags: u16 {
>     const NEXT = 1 << 0; // Buffer continues via next descriptor
>     const WRITE = 1 << 1; // Device write-only (otherwise read-only)
>     const INDIRECT = 1 << 2; // Buffer contains list of descriptors
The text mentioned not supporting indirect, at least for now---do we want this flag to be allocated even though we aren't supporting it yet?
> /// Event suppression structure for controlling notifications.
> /// Both sides can control when they want to be notified.
> #[repr(C)]
> pub struct EventSuppression {
Where do these end up?
> desc_table: DescTable, // Descriptor table in shared memory
> id_free: SmallVec<[u16; 256]>, // Stack of free IDs (allows out-of-order completion)
> id_num: SmallVec<[u16; 256]>, // Chain length per ID
> drv_evt_addr: u64, // Controls when device notifies about used buffers
What does _addr mean here? Are these meant to be bitpacked versions of the suppression structure above?
> /// Length in bytes (for used: bytes written by device)
> pub len: u32,
> /// Buffer ID for correlating completions with submissions
> pub id: u16,
Are these meant to identify an individual buffer, or are they the higher-level semantic protocol thing used to correlate e.g. calls and returns? If the former, do they gain us anything over just using the address as the buffer ID?
> impl<M: MemOps> RingProducer<M> {
>     /// Submit a buffer chain to the ring.
>     /// Returns the descriptor ID assigned to this chain.
Why do we need this as well as the _notify variant?
> /// Submit with notification check based on device's event suppression settings.
> pub fn submit_available_with_notify(&mut self, chain: &BufferChain) -> Result<SubmitResult, RingError>;
>
> /// Poll for a used buffer. Returns WouldBlock if no completions available.
We may need to think a little bit about whether there is a way to actually block/return control to the host, and whether that should be integrated into or distinct from the host->guest notification mechanism, which, at least right now, will also exit to the host in the calling thread (but perhaps in the future could be some faster IPI?)
> #### Buffer Chain Builder
>
> Type-state pattern enforces that readable buffers are added before writable buffers, preventing
Why do we care about this "readable before writable" invariant?
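In case it helps the discussion, a sketch of what such a type-state builder presumably looks like; type and method names are illustrative, not the proposal's. Once the first writable buffer is added the builder's type changes, so `readable()` is simply no longer callable.

```rust
struct Buf; // placeholder for an (addr, len) buffer reference

struct AddingReadable {
    readables: Vec<Buf>,
}

struct AddingWritable {
    readables: Vec<Buf>,
    writables: Vec<Buf>,
}

impl AddingReadable {
    fn new() -> Self {
        Self { readables: Vec::new() }
    }
    fn readable(mut self, b: Buf) -> Self {
        self.readables.push(b);
        self
    }
    // Transition: after the first writable buffer, only `AddingWritable`
    // methods are available, so no readable can follow a writable.
    fn writable(self, b: Buf) -> AddingWritable {
        AddingWritable { readables: self.readables, writables: vec![b] }
    }
}

impl AddingWritable {
    fn writable(mut self, b: Buf) -> Self {
        self.writables.push(b);
        self
    }
}
```

This mirrors the virtio convention that device-readable descriptors precede device-writable ones within a chain, which may be why the builder enforces it.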
> inner: RingProducer<M>,
> notifier: N,
> pool: P,
> inflight: Vec<Option<ProducerInflight>>,
What's this? I don't see the `ProducerInflight` definition here.
> ```rust
> /// Trait for buffer providers.
> pub trait BufferProvider {
In case these eventually end up passed around through a lot of parsing code with indeterminate lifetimes, would it be useful for alloc/dealloc here to be more like get/put, keeping a refcount for the buffer and returning it to the free pool when it's unreferenced? I have no idea if this is true, just thought it was worth throwing out there.
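A sketch of that get/put idea, with a refcount so a buffer only returns to the free pool once the last outstanding reference is dropped; the struct and method names are hypothetical, not the proposal's `BufferProvider`.

```rust
use std::collections::HashMap;

// Hypothetical pool: buffers are handed out with `get`, extra references are
// taken with `get_ref`, and `put` only returns a buffer to the free list once
// its last reference is dropped.
struct RefCountedPool {
    free: Vec<u64>,            // addresses of free buffers
    refs: HashMap<u64, usize>, // outstanding references per buffer address
}

impl RefCountedPool {
    /// Allocate a buffer from the pool with an initial reference count of 1.
    fn get(&mut self) -> Option<u64> {
        let addr = self.free.pop()?;
        self.refs.insert(addr, 1);
        Some(addr)
    }

    /// Take an extra reference, e.g. when the buffer is handed to parsing
    /// code with an indeterminate lifetime.
    fn get_ref(&mut self, addr: u64) {
        if let Some(count) = self.refs.get_mut(&addr) {
            *count += 1;
        }
    }

    /// Drop a reference; the buffer returns to the free pool only when the
    /// last reference is gone.
    fn put(&mut self, addr: u64) {
        if let Some(count) = self.refs.get_mut(&addr) {
            *count -= 1;
            if *count == 0 {
                self.refs.remove(&addr);
                self.free.push(addr);
            }
        }
    }
}
```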