perf: Optimize contains expression with SIMD-based scalar pattern sea… #2991
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #2991 +/- ##
============================================
+ Coverage 56.12% 59.58% +3.46%
- Complexity 976 1377 +401
============================================
Files 119 167 +48
Lines 11743 15493 +3750
Branches 2251 2569 +318
============================================
+ Hits 6591 9232 +2641
- Misses 4012 4962 +950
- Partials 1140 1299 +159
I tested locally and see good performance now:
native/spark-expr/Cargo.toml
[dependencies]
arrow = { workspace = true }
arrow-string = "57.0.0"
We already depend on arrow. Is contains not re-exported in the arrow crate?
thanks - removed it a449297
Thanks @Shekharrajak, this is a really nice speedup! Could you fix the clippy errors (you can probably just run
54dc054 to 27929a3
Co-authored-by: Andy Grove <agrove@apache.org>
Which issue does this PR close?
Closes #2972.
Rationale for this change
The contains expression performs poorly in Comet (0.2x relative to Spark) because DataFusion's make_scalar_function wrapper expands scalar patterns into full arrays, which bypasses arrow-rs's optimized scalar path.
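To make the two code shapes concrete, here is a minimal std-only sketch. It is not Comet's, DataFusion's, or arrow-rs's code: the function names `contains_pairwise`, `contains_expanded`, and `contains_scalar` are hypothetical, and the "per-pattern work" is only a cheap stand-in for the reusable searcher a real scalar kernel might build. It shows that once a constant pattern has been expanded into an array, the kernel can no longer tell that every row shares the same pattern.

```rust
/// Array-vs-array shape: every row has its own pattern, so nothing
/// pattern-specific can be hoisted out of the loop.
fn contains_pairwise(haystacks: &[&str], patterns: &[&str]) -> Vec<bool> {
    assert_eq!(haystacks.len(), patterns.len());
    haystacks
        .iter()
        .zip(patterns.iter())
        .map(|(h, p)| h.contains(*p))
        .collect()
}

/// What expanding a scalar amounts to (illustrative assumption):
/// materialize one copy of the pattern per row, then go pairwise.
fn contains_expanded(haystacks: &[&str], pattern: &str) -> Vec<bool> {
    let expanded: Vec<&str> = vec![pattern; haystacks.len()];
    contains_pairwise(haystacks, &expanded)
}

/// Scalar shape: the pattern is known to be constant, so per-pattern
/// work is done once before the row loop.
fn contains_scalar(haystacks: &[&str], pattern: &str) -> Vec<bool> {
    // An empty pattern matches every string; decide that once.
    if pattern.is_empty() {
        return vec![true; haystacks.len()];
    }
    // Stand-in for building a reusable searcher: cache the needle length
    // and reject rows that are too short without searching them.
    let needle_len = pattern.len();
    haystacks
        .iter()
        .map(|h| h.len() >= needle_len && h.contains(pattern))
        .collect()
}

fn main() {
    let rows = ["apache comet", "spark", "datafusion comet", ""];
    let expanded = contains_expanded(&rows, "comet");
    let scalar = contains_scalar(&rows, "comet");
    // Both shapes agree on the result; only the amount of work differs.
    assert_eq!(expanded, scalar);
    println!("{:?}", scalar); // prints [true, false, true, false]
}
```

Both paths return the same booleans, so a change of this kind should be observable only as a speedup, not as a change in results.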
What changes are included in this PR?
How are these changes tested?