-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-48672, GH-48465: [Python] Add an option for truncating intraday milliseconds in Date64 #48466
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
|
|
By spec |
|
Interesting, the Overall I'm not a fan of introducing a slower conversion for managing a case violating the spec. |
|
@alippai Thanks for reviewing this. I am fine with keeping the original behaviour as is, and add a switch. That is actually another todo for Python conversion at:
If that's preferred, I can add a switch for Python and Arrow conversion sides, and keep the original behaviour as is ( Otherwise, we can also simply just remove this todo as well. |
|
cc @rok pinging in case you have any opinions on this topic. |
|
I don't have a strong opinion on this either way. Avoiding a performance regression by making this non-default behavior seems like a good idea at this point. |
|
Yeah let me work on it 👍 |
|
|
2ffa9a0 to
7e8eb86
Compare
7e8eb86 to
3896331
Compare
|
This PR should be ready for a look. |
3896331 to
0bf9fec
Compare
|
Looks good, thanks for the change |
| // Date64Type is millisecond timestamp | ||
| if (this->options_.truncate_date64_time) { | ||
| // Truncate intraday milliseconds | ||
| ConvertDatetimeWithTruncation<1L>(*data, out_values); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we avoid computing the ... * 1L for each value in the array when SHIFT == 1? Or will the compiler optimize this away?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it will optimize it out as a noop from my understanding but to make sure, I changed a bit to leverage constexpr for 1 case. It should compiletime branch it out, and should be optimized enough as documented in c++ lang.
| template <int64_t SHIFT> | ||
| inline void ConvertDatetimeWithTruncation(const ChunkedArray& data, int64_t* out_values) { | ||
| for (int c = 0; c < data.num_chunks(); c++) { | ||
| const auto& arr = *data.chunk(c); | ||
| const int64_t* in_values = GetPrimitiveValues<int64_t>(arr); | ||
| for (int64_t i = 0; i < arr.length(); ++i) { | ||
| *out_values++ = arr.IsNull(i) | ||
| ? kPandasTimestampNull | ||
| : ((in_values[i] - in_values[i] % kMillisecondsInDay) * SHIFT); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The SHIFT sounds like we are bit-shifting, where this is more a factor.
| template <int64_t SHIFT> | |
| inline void ConvertDatetimeWithTruncation(const ChunkedArray& data, int64_t* out_values) { | |
| for (int c = 0; c < data.num_chunks(); c++) { | |
| const auto& arr = *data.chunk(c); | |
| const int64_t* in_values = GetPrimitiveValues<int64_t>(arr); | |
| for (int64_t i = 0; i < arr.length(); ++i) { | |
| *out_values++ = arr.IsNull(i) | |
| ? kPandasTimestampNull | |
| : ((in_values[i] - in_values[i] % kMillisecondsInDay) * SHIFT); | |
| } | |
| } | |
| } | |
| template <int64_t FACTOR> | |
| inline void ConvertDatetimeWithTruncation(const ChunkedArray& data, int64_t* out_values) { | |
| for (int c = 0; c < data.num_chunks(); c++) { | |
| const auto& arr = *data.chunk(c); | |
| const int64_t* in_values = GetPrimitiveValues<int64_t>(arr); | |
| for (int64_t i = 0; i < arr.length(); ++i) { | |
| *out_values++ = arr.IsNull(i) | |
| ? kPandasTimestampNull | |
| : ((in_values[i] - in_values[i] % kMillisecondsInDay) * FACTOR); | |
| } | |
| } | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this naming exists in ConvertDatetime as well :-(.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah .. let me just keep it consistent for now
0bf9fec to
e478128
Compare
EnricoMi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
|
@AlenkaF do you mind taking a look when you find some time? I believe I resolved all comments. Now it does not change any default behaviour 🫡 |
… intraday milliseconds in Date64
9066b84 to
3e75327
Compare
Rationale for this change
arrow/python/pyarrow/src/arrow/python/arrow_to_pandas.cc
Lines 1655 to 1656 in 0bfbd19
arrow/python/pyarrow/src/arrow/python/python_to_arrow.cc
Line 312 in d09233a
What changes are included in this PR?
This PR adds an option for truncating intraday milliseconds in Date64, which is disabled by default for pandas conversion, and enabled by default for Python conversion to avoid breaking changes.
Are these changes tested?
Yes, unittests were added, and tested as below:
Are there any user-facing changes?
No by default. It adds a new option
(Generated by ChatGPT)
pa.array())truncate_date64_time=False946684800000(truncated) →946730096123(preserved)pa.array())truncate_date64_time=False946684800000(truncated) →946730096123(preserved)pa.array()withfrom_pandas=True)truncate_date64_time=False946684800000(truncated) →946730096123(preserved)to_pandas())truncate_date64_time=True2018-05-10 00:02:03.456000(preserved) →2018-05-10 00:00:00(truncated)