Record sort order when writing Parquet with WITH ORDER #19595
Pull request overview
This PR implements the recording of sort order metadata in Parquet files when writing data with WITH ORDER clauses. When an external table is created with an ordering specification, subsequent INSERT INTO or COPY operations will now embed sorting column information in the Parquet row group metadata, enabling downstream readers to potentially skip redundant sort operations.
- Adds conversion functions to translate DataFusion ordering expressions to Parquet `SortingColumn` metadata
- Updates `ParquetSink` to accept and propagate sorting column information through the writer pipeline
- Includes comprehensive test coverage to verify metadata is correctly written
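The conversion step described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the PR's code: `SortExpr` and the slice-of-names schema are hypothetical stand-ins for DataFusion's types, the local `SortingColumn` struct only mirrors the three fields Parquet records per sorting column, and the function signatures are assumptions even though the names follow the helpers listed in this PR.

```rust
// Local stand-in mirroring the fields Parquet stores per sorting column.
#[derive(Debug, Clone, PartialEq, Eq)]
struct SortingColumn {
    column_idx: i32,
    descending: bool,
    nulls_first: bool,
}

// Hypothetical, simplified sort expression: a column name plus sort options.
#[derive(Debug, Clone)]
struct SortExpr {
    column: String,
    descending: bool,
    nulls_first: bool,
}

// Resolve one sort expression to a column index in the schema.
// Assumption: an expression naming an unknown column is reported as an error.
fn sort_expr_to_sorting_column(
    expr: &SortExpr,
    schema: &[&str],
) -> Result<SortingColumn, String> {
    let idx = schema
        .iter()
        .position(|name| *name == expr.column)
        .ok_or_else(|| format!("column '{}' not found in schema", expr.column))?;
    Ok(SortingColumn {
        column_idx: idx as i32,
        descending: expr.descending,
        nulls_first: expr.nulls_first,
    })
}

// Convert a whole lexicographic ordering, preserving the order of its keys.
fn lex_ordering_to_sorting_columns(
    ordering: &[SortExpr],
    schema: &[&str],
) -> Result<Vec<SortingColumn>, String> {
    ordering
        .iter()
        .map(|expr| sort_expr_to_sorting_column(expr, schema))
        .collect()
}
```

Under these assumptions, an ordering of `a ASC NULLS FIRST, b DESC NULLS LAST` over a schema `[a, b, c]` would map to column indices 0 and 1 with the matching direction and null-placement flags.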
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| datafusion/datasource-parquet/src/metadata.rs | Adds sort_expr_to_sorting_column() and lex_ordering_to_sorting_columns() helper functions to convert DataFusion ordering to Parquet sorting metadata |
| datafusion/datasource-parquet/src/file_format.rs | Integrates sorting column conversion into create_writer_physical_plan() and updates ParquetSink with builder pattern support for sorting columns; modifies create_writer_props() to set sorting columns on writer properties |
| datafusion/core/tests/parquet/ordering.rs | Adds new test file with test_create_table_with_order_writes_sorting_columns to verify sorting metadata is correctly written to Parquet files |
| datafusion/core/tests/parquet/mod.rs | Registers the new ordering test module |
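To illustrate the builder-style change summarized for `file_format.rs`, here is a rough sketch of how a sink might carry optional sorting columns and hand them to its writer properties. All types here (`ParquetSinkSketch`, `WriterPropsSketch`, and the local `SortingColumn`) are hypothetical stand-ins, not DataFusion's or the parquet crate's actual structs; only the method names echo those mentioned in this PR.

```rust
// Local stand-in mirroring the fields Parquet stores per sorting column.
#[derive(Debug, Clone, PartialEq, Eq)]
struct SortingColumn {
    column_idx: i32,
    descending: bool,
    nulls_first: bool,
}

// Hypothetical stand-in for writer properties; only the field of interest is modeled.
#[derive(Debug, Default)]
struct WriterPropsSketch {
    sorting_columns: Option<Vec<SortingColumn>>,
}

// Hypothetical stand-in for the sink; `None` means no ordering was declared.
#[derive(Debug, Default)]
struct ParquetSinkSketch {
    sorting_columns: Option<Vec<SortingColumn>>,
}

impl ParquetSinkSketch {
    fn new() -> Self {
        Self::default()
    }

    // Builder-style setter: consumes the sink and returns it with the field set.
    fn with_sorting_columns(mut self, cols: Option<Vec<SortingColumn>>) -> Self {
        self.sorting_columns = cols;
        self
    }

    // Propagate the sink's sorting columns into the writer properties.
    fn create_writer_props(&self) -> WriterPropsSketch {
        WriterPropsSketch {
            sorting_columns: self.sorting_columns.clone(),
        }
    }
}
```

Keeping the field as an `Option` lets a sink built without `WITH ORDER` behave as before, since nothing is recorded unless an ordering is supplied.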
When writing data to a table created with `CREATE EXTERNAL TABLE ... WITH ORDER`, this change records the sorting columns in the Parquet file's row group metadata.
Changes:
- Add `sort_expr_to_sorting_column()` and `lex_ordering_to_sorting_columns()` functions in metadata.rs to convert DataFusion ordering to Parquet SortingColumn
- Add `sorting_columns` field to ParquetSink with `with_sorting_columns()` builder
- Update `create_writer_physical_plan()` to pass order requirements to ParquetSink
- Update `create_writer_props()` to set sorting columns on WriterProperties
- Add test verifying sorting_columns metadata is written correctly
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@zhuqi-lucas are you able to review this?
zhuqi-lucas
left a comment
LGTM @adriangb, sorry, I missed this PR.
Which issue does this PR close?
Part of #19433
Rationale for this change
When writing data to a table created with `CREATE EXTERNAL TABLE ... WITH ORDER`, the sorting columns should be recorded in the Parquet file's row group metadata. This allows downstream readers to know the data is sorted and potentially skip sorting operations.
What changes are included in this PR?
- Add `sort_expr_to_sorting_column()` and `lex_ordering_to_sorting_columns()` functions in `metadata.rs` to convert DataFusion ordering to Parquet `SortingColumn`
- Add `sorting_columns` field to `ParquetSink` with `with_sorting_columns()` builder method
- Update `create_writer_physical_plan()` to pass order requirements to `ParquetSink`
- Update `create_writer_props()` to set sorting columns on `WriterProperties`
- Add test verifying `sorting_columns` metadata is written correctly
Are these changes tested?
Yes, added `test_create_table_with_order_writes_sorting_columns` that:
- Creates a table with `WITH ORDER (a ASC NULLS FIRST, b DESC NULLS LAST)`
- Verifies the `sorting_columns` metadata matches the expected order
Are there any user-facing changes?
No user-facing API changes. Parquet files written via `INSERT INTO` or `COPY` for tables with `WITH ORDER` will now contain `sorting_columns` metadata in the row group.
🤖 Generated with Claude Code
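As a rough illustration of how a downstream reader might use the recorded metadata, the sketch below checks whether a required ordering is a prefix of the sorting columns recorded in every row group. The function name `recorded_order_satisfies` and the local `SortingColumn` struct are hypothetical and not part of this PR; the check is also deliberately narrow, since sortedness within each row group does not by itself establish that a file, or a set of files, is globally sorted, so a real planner would need more than this before dropping a sort.

```rust
// Local stand-in mirroring the fields Parquet stores per sorting column.
#[derive(Debug, Clone, PartialEq, Eq)]
struct SortingColumn {
    column_idx: i32,
    descending: bool,
    nulls_first: bool,
}

// Returns true only if every row group records an ordering that starts with
// `required`. A row group without sorting metadata (None) fails the check,
// and so does a file with no row groups at all (a conservative assumption).
fn recorded_order_satisfies(
    row_groups: &[Option<Vec<SortingColumn>>],
    required: &[SortingColumn],
) -> bool {
    if row_groups.is_empty() {
        return false;
    }
    row_groups.iter().all(|recorded| match recorded {
        Some(cols) => cols.starts_with(required),
        None => false,
    })
}
```

A prefix comparison is used because data sorted by `(a, b)` is also sorted by `a` alone, while the reverse does not hold.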