server: add real-time prompt preprocessing progress via synthetic SSE chunks #18265
Conversation
IMO hooking into the backend callbacks is not the right approach here. Also, technically speaking, the backend never processes token-by-token. The whole batch of tokens is represented as a 2D matrix and they are all processed at once. To get more frequent updates, simply lower the number of tokens per batch (controlled via n_batch / the -b flag).
Right. Seen from this perspective, if I'm doing fake-time interpolation on the backend anyway, it's not even worth trying to make it smooth; it's better to just have progress tracking per batch. I'll start over: track total batches (n_tokens / n_batch) and increment after each llama_decode() call. Progress chunks will only appear when there are 2+ batches (which happens automatically with large prompts), and users can reduce -b/-ub for finer granularity if needed. Much cleaner approach, no core callbacks required.
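A minimal sketch of that batch-boundary idea (not the PR's actual server.cpp code): the n_batches_total / n_batches_processed names mirror the per-slot fields mentioned later in the commit message, while emit_progress_chunk is a hypothetical stand-in for the server's SSE emission path.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>

#include "llama.h"

// Hypothetical stand-in for the server's SSE emission path.
static void emit_progress_chunk(int32_t total, int32_t processed) {
    printf("prompt_progress: %d/%d tokens\n", processed, total);
}

struct slot_progress {
    int32_t n_batches_total     = 0;
    int32_t n_batches_processed = 0;
};

// Decode the prompt in chunks of n_batch tokens and report progress once per batch.
static void process_prompt(llama_context * ctx, llama_token * tokens, int32_t n_prompt_tokens, slot_progress & prog) {
    const int32_t n_batch = std::max<int32_t>(1, (int32_t) llama_n_batch(ctx));

    // ceil division: how many llama_decode() calls this prompt will need
    prog.n_batches_total = (n_prompt_tokens + n_batch - 1) / n_batch;

    for (int32_t i = 0; i < n_prompt_tokens; i += n_batch) {
        const int32_t n_eval = std::min(n_batch, n_prompt_tokens - i);

        if (llama_decode(ctx, llama_batch_get_one(tokens + i, n_eval)) != 0) {
            return; // decode failed
        }
        prog.n_batches_processed++;

        // only stream progress when the prompt actually spans 2+ batches
        if (prog.n_batches_total > 1) {
            emit_progress_chunk(n_prompt_tokens, i + n_eval);
        }
    }
}
```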
Why not just use the existing streamed return_progress?
Yes, I already had this working with high-frequency emission (100ms intervals). Now reimplementing it at batch frequency as suggested by ngxson: cleaner approach. |
I track total batches and increment after each llama_decode(). The existing prompt_progress object is streamed at batch boundaries with estimated token counts. It only activates when there are 2+ batches, so large prompts automatically get progress updates. Example chunk:
...,"prompt_progress":{"total":509,"cache":0,"processed":127,"time_ms":1901}}
Maybe you misunderstood the question from @ExtReMLapin: he meant that we already have this exact feature that you are trying to implement in this PR, and its name is return_progress.
I don't get why you need to add extra calculations in this PR for that - are there any cases where the current progress calculation is wrong? (Asking to make sure we are not spending double effort here.)
slot.t_start_generation = 0;

const int32_t batch_size = std::max<int32_t>(1, llama_n_batch(ctx));
slot.n_batches_total = (slot.task->n_tokens() + batch_size - 1) / batch_size;
this calculation is likely wrong as it doesn't take into account cached tokens
Thanks for the catch on cached tokens! I haven't tested the cache scenario yet. I'll apply/test your suggested fix
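For illustration only, one plausible cache-aware version of that batch count (the reviewer's actual suggested fix is not shown in this thread): only the tokens that are not already in the KV cache go through llama_decode(), so they are what should determine the number of batches.

```cpp
#include <algorithm>
#include <cstdint>

// Sketch: count the llama_decode() batches a prompt will need, ignoring tokens
// that are already in the KV cache. n_cached_tokens is a hypothetical parameter
// standing in for however the slot exposes its reused prompt tokens.
static int32_t count_prompt_batches(int32_t n_prompt_tokens, int32_t n_cached_tokens, int32_t n_batch) {
    const int32_t batch_size   = std::max<int32_t>(1, n_batch);
    const int32_t n_to_process = std::max<int32_t>(0, n_prompt_tokens - n_cached_tokens);
    return (n_to_process + batch_size - 1) / batch_size; // ceil division
}
```

The caller would then do something like slot.n_batches_total = count_prompt_batches(slot.task->n_tokens(), n_cached, llama_n_batch(ctx)) instead of the flagged line above.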
The existing prompt_progress with return_progress: true only emits once, after prompt processing completes. This PR streams progress during processing, at batch boundaries. It's about real-time updates, not a final summary. Without --stream-prompt-progress, behavior is unchanged.
No, it was supposed to be sent on each batch (in real time):
Maybe something is wrong with your test (I assume?), but this will return on each processed batch. If it doesn't, you may be re-using cached prompts.
Oh OK, I never tested "return_progress": true on the API with a large prompt / small batch! I'll try it... It's possible that from the beginning, all that was needed was front-end work.
The existing implementation streams progress at each batch boundary exactly as intended. I completely missed this during my initial testing: sorry for the noise! Closing this PR. Thanks for the patience and for pointing me to "test_return_progress".

(root|~) curl -N https://www.serveurperso.com/ia/webui/v1/chat/completions

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"role":"assistant","content":null}}],"created":1766421375,"id":"chatcmpl-UAryn5Ky3sZDrPbYttKJB8tP2iKE6TNx","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk","prompt_progress":{"total":509,"cache":0,"processed":256,"time_ms":4222}}
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"role":"assistant","content":null}}],"created":1766421378,"id":"chatcmpl-UAryn5Ky3sZDrPbYttKJB8tP2iKE6TNx","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk","prompt_progress":{"total":509,"cache":0,"processed":384,"time_ms":6470}}
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"role":"assistant","content":null}}],"created":1766421380,"id":"chatcmpl-UAryn5Ky3sZDrPbYttKJB8tP2iKE6TNx","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk","prompt_progress":{"total":509,"cache":0,"processed":509,"time_ms":8745}}
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"role":"assistant","content":null}}],"created":1766421380,"id":"chatcmpl-UAryn5Ky3sZDrPbYttKJB8tP2iKE6TNx","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"It"}}],"created":1766421380,"id":"chatcmpl-UAryn5Ky3sZDrPbYttKJB8tP2iKE6TNx","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" looks"}}],"created":1766421380,"id":"chatcmpl-UAryn5Ky3sZDrPbYttKJB8tP2iKE6TNx","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" like"}}],"created":1766421380,"id":"chatcmpl-UAryn5Ky3sZDrPbYttKJB8tP2iKE6TNx","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" you"}}],"created":1766421380,"id":"chatcmpl-UAryn5Ky3sZDrPbYttKJB8tP2iKE6TNx","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"'ve"}}],"created":1766421380,"id":"chatcmpl-UAryn5Ky3sZDrPbYttKJB8tP2iKE6TNx","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" past"}}],"created":1766421380,"id":"chatcmpl-UAryn5Ky3sZDrPbYttKJB8tP2iKE6TNx","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"ed"}}],"created":1766421380,"id":"chatcmpl-UAryn5Ky3sZDrPbYttKJB8tP2iKE6TNx","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" a"}}],"created":1766421380,"id":"chatcmpl-UAryn5Ky3sZDrPbYttKJB8tP2iKE6TNx","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" long"}}],"created":1766421380,"id":"chatcmpl-UAryn5Ky3sZDrPbYttKJB8tP2iKE6TNx","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" sequence"}}],"created":1766421380,"id":"chatcmpl-UAryn5Ky3sZDrPbYttKJB8tP2iKE6TNx","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}
data: {"choices":[{"finish_reason":"length","index":0,"delta":{}}],"created":1766421380,"id":"chatcmpl-UAryn5Ky3sZDrPbYttKJB8tP2iKE6TNx","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":509,"prompt_ms":8745.15,"prompt_per_token_ms":17.181041257367387,"prompt_per_second":58.20369004533942,"predicted_n":10,"predicted_ms":406.724,"predicted_per_token_ms":40.672399999999996,"predicted_per_second":24.5866976131234}}
data: [DONE]
IIRC some downstream projects also use this exact feature to receive real-time prompt-processing progress, and so far I haven't received any issues about it.
Add batch-level prompt preprocessing progress
Track n_batches_total and n_batches_processed per slot. Emit prompt_progress
chunks after each llama_decode() during prompt processing. Activates automatically
when the prompt requires 2+ batches (batch size controlled by the -b flag).
Setup (a 100% CPU model added on a testing server for easier testing):
Backend testing command
Close #17079