perf: Optimize startsWith and endsWith string functions #3000

Mazen-Ghanaym · 2025-12-27T17:26:46Z

Which issue does this PR close?

Closes #2973.

Rationale for this change

The startsWith and endsWith string functions were previously delegated to DataFusion's built-in scalar functions, which introduced unnecessary overhead and did not fully leverage Comet's native execution capabilities. This PR implements optimized native expressions to improve performance.

What changes are included in this PR?

This PR introduces custom StartsWithExpr and EndsWithExpr physical expressions with the following optimizations:

startsWith:

Uses Arrow's compute::starts_with kernel with a pre-allocated pattern array to avoid per-batch allocations.
Achieves 1.1X speedup over Spark.

endsWith:

Uses direct buffer access to the underlying StringArray data, bypassing iterator overhead.
Manually calculates suffix offsets and performs raw byte slice comparison (memcmp).
Achieves 1.0X parity with Spark (improved from 0.9X regression).

Files Changed:

native/spark-expr/src/string_funcs/starts_ends_with.rs (NEW)
native/spark-expr/src/string_funcs/mod.rs
native/core/src/execution/expressions/strings.rs
native/core/src/execution/planner/expression_registry.rs
spark/src/main/scala/org/apache/comet/serde/strings.scala
spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala
spark/src/test/scala/org/apache/spark/sql/benchmark/CometStringExpressionBenchmark.scala

How are these changes tested?

Existing Tests: The implementation passes all existing Comet tests, including TPC-DS and TPC-H correctness suites which exercise string functions.
Benchmark Verification: Performance was verified using CometStringExpressionBenchmark:
- startsWith: 1.1X faster than Spark (Comet 1887ms vs Spark 2028ms)
- endsWith: 1.0X parity with Spark (Comet 3389ms vs Spark 3354ms)
CI Verification: A temporary workflow was used to verify the benchmark executes correctly in GitHub Actions CI environment and the results are in the Benchmark Results

Benchmark Results

Environment: OpenJDK 64-Bit Server VM 11.0.29+7-LTS on Linux 6.11.0-1018-azure
Processor: AMD EPYC 7763 64-Core Processor

startsWith

Case	Best Time(ms)	Avg Time(ms)	Stdev(ms)	Rate(M/s)	Per Row(ns)	Relative
Spark	1657	1669	17	0.6	1580.2	1.0X
Comet (Scan)	1740	1755	20	0.6	1659.6	1.0X
Comet (Scan + Exec)	1546	1546	1	0.7	1474.1	1.1X

endsWith

Case	Best Time(ms)	Avg Time(ms)	Stdev(ms)	Rate(M/s)	Per Row(ns)	Relative
Spark	1625	1632	10	0.6	1549.9	1.0X
Comet (Scan)	1731	1732	1	0.6	1651.2	0.9X
Comet (Scan + Exec)	1562	1563	0	0.7	1490.0	1.0X

Summary:

startsWith: 1.1X faster than Spark (1546ms vs 1657ms)
endsWith: 1.0X parity with Spark (1562ms vs 1625ms)

- Implements a hybrid optimization strategy for string primitives. - Uses Arrow compute kernels for startsWith with pre-allocated pattern arrays to avoid per-batch allocation overhead. - Uses direct buffer access and manual suffix calculation for endsWith to bypass iterator overhead and match JVM intrinsic performance. - Achieves 1.1X speedup for startsWith and parity (1.0X) for endsWith compared to Spark. Closes apache#2973

coderfender · 2025-12-27T17:32:46Z

Thank you for the PR @Mazen-Ghanaym . Any reason why we can't make the Datafusion's version faster here?

Mazen-Ghanaym · 2025-12-27T17:56:02Z

Thank you for the PR @Mazen-Ghanaym . Any reason why we can't make the Datafusion's version faster here?

I spent around 4 days trying to optimize, and here is the short story of the journey
Day 1-2: Tried DataFusion built-ins Started with ScalarFunctionExpr using DataFusion's native starts_with/ends_with. Result: 1.0X – just matched Spark, no improvement.

Day 2: Tried unsafe Rust with direct byte access Attempted raw pointer manipulation for maximum speed. Failed due to compatibility issues.

Day 3: Tried safe Rust with stdlib slices. Used .starts_with() and .ends_with() on string slices with manual iteration. Result: 0.9X – actually slower than Spark, I think it's due to iterator overhead.

Day 3-4: Arrow compute kernels + pre-allocated pattern The breakthrough: I realized the overhead came from repeatedly processing the pattern each batch. Pre-allocating the pattern as a StringArray once and calling Arrow's compute::starts_with directly gave us 1.1X for startsWith.

For endsWith, Arrow's kernel was still slightly slower (0.9X), so I went with direct buffer access and manual suffix calculation to reach 1.0X parity.

I tried to optimize further but I don't know any optimizations that can beat Java in these direct, simple operations.

Mazen-Ghanaym and others added 6 commits December 27, 2025 03:32

Merge branch 'apache:main' into feat/optimize-strings-2973

aabf3f3

Fix code formatting for CI check

9248b70

Fix clippy lints

122623e

Add temp benchmark workflow for CI verification

fe9cabb

Remove temp benchmark workflow

d2f0aea

Mazen-Ghanaym changed the title ~~Feat/optimize strings 2973~~ feat: Optimize startsWith and endsWith string functions Dec 27, 2025

Mazen-Ghanaym changed the title ~~feat: Optimize startsWith and endsWith string functions~~ perf: Optimize startsWith and endsWith string functions Dec 27, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf: Optimize startsWith and endsWith string functions #3000

perf: Optimize startsWith and endsWith string functions #3000

Uh oh!

Mazen-Ghanaym commented Dec 27, 2025

Uh oh!

coderfender commented Dec 27, 2025

Uh oh!

Mazen-Ghanaym commented Dec 27, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

perf: Optimize startsWith and endsWith string functions #3000

Are you sure you want to change the base?

perf: Optimize startsWith and endsWith string functions #3000

Uh oh!

Conversation

Mazen-Ghanaym commented Dec 27, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

How are these changes tested?

Benchmark Results

startsWith

endsWith

Uh oh!

coderfender commented Dec 27, 2025

Uh oh!

Mazen-Ghanaym commented Dec 27, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants