GH-48672, GH-48465: [Python] Add an option for truncating intraday milliseconds in Date64 #48466

HyukjinKwon · 2025-12-12T01:46:43Z

Rationale for this change

arrow/python/pyarrow/src/arrow/python/arrow_to_pandas.cc

Lines 1655 to 1656 in 0bfbd19

    
    // Date64Type is millisecond timestamp stored as int64_t
    // TODO(wesm): Do we want to make sure to zero out the milliseconds?

arrow/python/pyarrow/src/arrow/python/python_to_arrow.cc

Line 312 in d09233a

// TODO: introduce an option for this

What changes are included in this PR?

This PR adds an option, `truncate_date64_time`, for truncating intraday milliseconds in Date64. It is disabled by default for Arrow-to-pandas conversion (`to_pandas()`) and enabled by default for Python-to-Arrow conversion (`pa.array()`), so existing behaviour does not change.

Are these changes tested?

Yes, unit tests were added and run as follows:

pytest pyarrow/tests/test_pandas.py

Are there any user-facing changes?

Not by default. It adds a new option, `truncate_date64_time`:

(Generated by ChatGPT)

| Conversion Type | Default Behavior | With Explicit Option | Option Value | Result (default → with option) |
| --- | --- | --- | --- | --- |
| Python sequences → Arrow (`pa.array()`) | Truncates time | Preserves time | `truncate_date64_time=False` | int64: `946684800000` (truncated) → `946730096123` (preserved) |
| NumPy arrays → Arrow (`pa.array()`) | Truncates time | Preserves time | `truncate_date64_time=False` | int64: `946684800000` (truncated) → `946730096123` (preserved) |
| Pandas Series → Arrow (`pa.array()` with `from_pandas=True`) | Truncates time | Preserves time | `truncate_date64_time=False` | int64: `946684800000` (truncated) → `946730096123` (preserved) |
| Arrow → Pandas (`to_pandas()`) | Preserves time | Truncates time | `truncate_date64_time=True` | `2018-05-10 00:02:03.456000` (preserved) → `2018-05-10 00:00:00` (truncated) |

import datetime
import pyarrow as pa

dt_with_time = datetime.datetime(2000, 1, 1, 12, 34, 56, 123456)
dt_date_only = datetime.datetime(2000, 1, 1)

# ============================================================================
# 1. Python sequences (lists)
# ============================================================================

# BEFORE (default behavior - truncates time)
arr_python_before = pa.array([dt_with_time], type=pa.date64())
arr_python_date_only_before = pa.array([dt_date_only], type=pa.date64())
print("Python sequences - BEFORE (default):")
print(f"  int64: {arr_python_before.view('int64')[0].as_py()}")  # 946684800000
print(f"  int64: {arr_python_date_only_before.view('int64')[0].as_py()}")  # 946684800000
print(f"  Equal? {arr_python_before.equals(arr_python_date_only_before)}")  # True

# AFTER (explicit truncate_date64_time=False - preserves time)
arr_python_after = pa.array([dt_with_time], type=pa.date64(), truncate_date64_time=False)
arr_python_date_only_after = pa.array([dt_date_only], type=pa.date64(), truncate_date64_time=False)
print("Python sequences - AFTER (truncate_date64_time=False):")
print(f"  int64: {arr_python_after.view('int64')[0].as_py()}")  # 946730096123
print(f"  int64: {arr_python_date_only_after.view('int64')[0].as_py()}")  # 946684800000
print(f"  Equal? {arr_python_after.equals(arr_python_date_only_after)}")  # False

# ============================================================================
# 2. NumPy arrays
# ============================================================================

import numpy as np

arr_numpy = np.array([dt_with_time], dtype=object)
arr_numpy_date_only = np.array([dt_date_only], dtype=object)

# BEFORE (default behavior - truncates time, since array() defaults to True)
arr_numpy_before = pa.array(arr_numpy, type=pa.date64())
arr_numpy_date_only_before = pa.array(arr_numpy_date_only, type=pa.date64())
print("\nNumPy arrays - BEFORE (default):")
print(f"  int64: {arr_numpy_before.view('int64')[0].as_py()}")  # 946684800000
print(f"  int64: {arr_numpy_date_only_before.view('int64')[0].as_py()}")  # 946684800000
print(f"  Equal? {arr_numpy_before.equals(arr_numpy_date_only_before)}")  # True

# AFTER (explicit truncate_date64_time=False - preserves time)
arr_numpy_after = pa.array(arr_numpy, type=pa.date64(), truncate_date64_time=False)
arr_numpy_date_only_after = pa.array(arr_numpy_date_only, type=pa.date64(), truncate_date64_time=False)
print("NumPy arrays - AFTER (truncate_date64_time=False):")
print(f"  int64: {arr_numpy_after.view('int64')[0].as_py()}")  # 946730096123
print(f"  int64: {arr_numpy_date_only_after.view('int64')[0].as_py()}")  # 946684800000
print(f"  Equal? {arr_numpy_after.equals(arr_numpy_date_only_after)}")  # False

# ============================================================================
# 3. Pandas Series
# ============================================================================

import pandas as pd

series_pandas = pd.Series([dt_with_time], dtype=object)
series_pandas_date_only = pd.Series([dt_date_only], dtype=object)

# BEFORE (default behavior - truncates time, since array() defaults to True)
arr_pandas_before = pa.array(series_pandas, type=pa.date64(), from_pandas=True)
arr_pandas_date_only_before = pa.array(series_pandas_date_only, type=pa.date64(), from_pandas=True)
print("\nPandas Series - BEFORE (default):")
print(f"  int64: {arr_pandas_before.view('int64')[0].as_py()}")  # 946684800000
print(f"  int64: {arr_pandas_date_only_before.view('int64')[0].as_py()}")  # 946684800000
print(f"  Equal? {arr_pandas_before.equals(arr_pandas_date_only_before)}")  # True

# AFTER (explicit truncate_date64_time=False - preserves time)
arr_pandas_after = pa.array(series_pandas, type=pa.date64(), from_pandas=True, truncate_date64_time=False)
arr_pandas_date_only_after = pa.array(series_pandas_date_only, type=pa.date64(), from_pandas=True, truncate_date64_time=False)
print("Pandas Series - AFTER (truncate_date64_time=False):")
print(f"  int64: {arr_pandas_after.view('int64')[0].as_py()}")  # 946730096123
print(f"  int64: {arr_pandas_date_only_after.view('int64')[0].as_py()}")  # 946684800000
print(f"  Equal? {arr_pandas_after.equals(arr_pandas_date_only_after)}")  # False

# ============================================================================
# 4. Arrow to Pandas conversion (to_pandas)
# ============================================================================

milliseconds_at_midnight = 1525910400000  # 2018-05-10 00:00:00
milliseconds_with_time = milliseconds_at_midnight + 123456  # 2018-05-10 00:02:03.456

arr_arrow = pa.array([milliseconds_at_midnight, milliseconds_with_time], type=pa.date64())

# BEFORE (default behavior - preserves time, since to_pandas() defaults to False)
result_before = arr_arrow.to_pandas(date_as_object=False)
print("\nArrow to Pandas - BEFORE (default):")
print(f"  arr.to_pandas(date_as_object=False)[0] = {result_before[0]}")  # 2018-05-10 00:00:00
print(f"  arr.to_pandas(date_as_object=False)[1] = {result_before[1]}")  # 2018-05-10 00:02:03.456000

# AFTER (explicit truncate_date64_time=True - truncates time)
result_after = arr_arrow.to_pandas(date_as_object=False, truncate_date64_time=True)
print("Arrow to Pandas - AFTER (truncate_date64_time=True):")
print(f"  arr.to_pandas(date_as_object=False, truncate_date64_time=True)[0] = {result_after[0]}")  # 2018-05-10 00:00:00
print(f"  arr.to_pandas(date_as_object=False, truncate_date64_time=True)[1] = {result_after[1]}")  # 2018-05-10 00:00:00

Python sequences - BEFORE (default):
  int64: 946684800000
  int64: 946684800000
  Equal? True
Python sequences - AFTER (truncate_date64_time=False):
  int64: 946730096123
  int64: 946684800000
  Equal? False

NumPy arrays - BEFORE (default):
  int64: 946684800000
  int64: 946684800000
  Equal? True
NumPy arrays - AFTER (truncate_date64_time=False):
  int64: 946730096123
  int64: 946684800000
  Equal? False

Pandas Series - BEFORE (default):
  int64: 946684800000
  int64: 946684800000
  Equal? True
Pandas Series - AFTER (truncate_date64_time=False):
  int64: 946730096123
  int64: 946684800000
  Equal? False

Arrow to Pandas - BEFORE (default):
  arr.to_pandas(date_as_object=False)[0] = 2018-05-10 00:00:00
  arr.to_pandas(date_as_object=False)[1] = 2018-05-10 00:02:03.456000
Arrow to Pandas - AFTER (truncate_date64_time=True):
  arr.to_pandas(date_as_object=False, truncate_date64_time=True)[0] = 2018-05-10 00:00:00
  arr.to_pandas(date_as_object=False, truncate_date64_time=True)[1] = 2018-05-10 00:00:00
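
For readers who want to see where the int64 values in the output above come from, here is a minimal standard-library sketch that derives them by hand. It assumes that a naive `datetime` is interpreted as UTC wall-clock time and that sub-millisecond digits are dropped. The helper name `epoch_ms` is hypothetical and not part of pyarrow.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MS_PER_DAY = 86_400_000  # 24 * 60 * 60 * 1000


def epoch_ms(dt: datetime) -> int:
    """Whole milliseconds since the Unix epoch (hypothetical helper).

    Assumption: naive datetimes are treated as UTC.
    """
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    # timedelta // timedelta floors to an int, dropping the microsecond remainder
    return (dt - EPOCH) // timedelta(milliseconds=1)


dt_with_time = datetime(2000, 1, 1, 12, 34, 56, 123456)

preserved = epoch_ms(dt_with_time)               # time of day kept
truncated = preserved - preserved % MS_PER_DAY   # time of day zeroed out

print(preserved)  # 946730096123
print(truncated)  # 946684800000
```

The two printed numbers match the "preserved" and "truncated" int64 values reported for the Python, NumPy, and pandas inputs.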

github-actions · 2025-12-12T01:47:10Z

⚠️ GitHub issue #48465 has been automatically assigned in GitHub to PR creator.

alippai · 2025-12-15T13:25:35Z

By spec, Date64 should be limited to full-day values in Arrow.
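
As a rough illustration of the constraint mentioned here: the Arrow format describes millisecond dates as values evenly divisible by 86400000. The plain-Python sketch below flags raw values that carry a time-of-day component. The function names are hypothetical and the check works on plain integers, not on pyarrow arrays.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86_400_000


def is_whole_day(ms: int) -> bool:
    """True if a raw Date64 value falls exactly on a day boundary."""
    return ms % MS_PER_DAY == 0


def offending_indices(values):
    """Indices of non-null raw values that carry intraday milliseconds."""
    return [
        i for i, v in enumerate(values)
        if v is not None and not is_whole_day(v)
    ]


raw = [1525910400000, 1525910400000 + 123456, None]
print(offending_indices(raw))  # [1]
```

Only the second value, which corresponds to 2018-05-10 00:02:03.456 in the PR description, is reported.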

alippai · 2025-12-15T15:58:39Z

Interesting, the `arr.to_pandas(date_as_object=False)` docs also say it should be the appropriate time unit (which is D in this case, not ms).

Overall I'm not a fan of introducing a slower conversion for managing a case violating the spec.

HyukjinKwon · 2025-12-18T07:26:51Z

@alippai Thanks for reviewing this. I am fine with keeping the original behaviour as is and adding a switch. That is actually another TODO for Python conversion at:

arrow/python/pyarrow/src/arrow/python/python_to_arrow.cc

Line 312 in d09233a

// TODO: introduce an option for this

If that's preferred, I can add a switch for Python and Arrow conversion sides, and keep the original behaviour as is (True for Python conv, and False for Arrow conv).

Otherwise, we can also simply just remove this todo as well.

AlenkaF · 2025-12-23T09:12:10Z

cc @rok pinging in case you have any opinions on this topic.

rok · 2025-12-23T09:49:51Z

I don't have a strong opinion on this either way. Avoiding a performance regression by making this non-default behavior seems like a good idea at this point.

HyukjinKwon · 2025-12-23T11:07:48Z

Yeah let me work on it 👍

github-actions · 2025-12-29T06:08:42Z

⚠️ GitHub issue #48672 has been automatically assigned in GitHub to PR creator.

HyukjinKwon · 2025-12-29T07:41:53Z

This PR should be ready for a look.

alippai · 2025-12-29T08:13:38Z

Looks good, thanks for the change

EnricoMi · 2026-01-06T10:24:21Z

python/pyarrow/src/arrow/python/arrow_to_pandas.cc

+      // Date64Type is millisecond timestamp
+      if (this->options_.truncate_date64_time) {
+        // Truncate intraday milliseconds
+        ConvertDatetimeWithTruncation<1L>(*data, out_values);


Can we avoid computing the ... * 1L for each value in the array when SHIFT == 1? Or will the compiler optimize this away?

From my understanding, the compiler will optimize it out as a no-op, but to make sure, I changed the code a bit to use `constexpr` for the factor-of-1 case. That branch is resolved at compile time, so it should be optimized well enough, as documented in the C++ language reference.

EnricoMi · 2026-01-06T10:38:09Z

python/pyarrow/src/arrow/python/arrow_to_pandas.cc

+template <int64_t SHIFT>
+inline void ConvertDatetimeWithTruncation(const ChunkedArray& data, int64_t* out_values) {
+  for (int c = 0; c < data.num_chunks(); c++) {
+    const auto& arr = *data.chunk(c);
+    const int64_t* in_values = GetPrimitiveValues<int64_t>(arr);
+    for (int64_t i = 0; i < arr.length(); ++i) {
+      *out_values++ = arr.IsNull(i)
+                          ? kPandasTimestampNull
+                          : ((in_values[i] - in_values[i] % kMillisecondsInDay) * SHIFT);
+    }
+  }
+}


The name SHIFT sounds like we are bit-shifting, whereas this is really a multiplication factor.

Suggested change

-template <int64_t SHIFT>
-inline void ConvertDatetimeWithTruncation(const ChunkedArray& data, int64_t* out_values) {
-  for (int c = 0; c < data.num_chunks(); c++) {
-    const auto& arr = *data.chunk(c);
-    const int64_t* in_values = GetPrimitiveValues<int64_t>(arr);
-    for (int64_t i = 0; i < arr.length(); ++i) {
-      *out_values++ = arr.IsNull(i)
-                          ? kPandasTimestampNull
-                          : ((in_values[i] - in_values[i] % kMillisecondsInDay) * SHIFT);
-    }
-  }
-}
+template <int64_t FACTOR>
+inline void ConvertDatetimeWithTruncation(const ChunkedArray& data, int64_t* out_values) {
+  for (int c = 0; c < data.num_chunks(); c++) {
+    const auto& arr = *data.chunk(c);
+    const int64_t* in_values = GetPrimitiveValues<int64_t>(arr);
+    for (int64_t i = 0; i < arr.length(); ++i) {
+      *out_values++ = arr.IsNull(i)
+                          ? kPandasTimestampNull
+                          : ((in_values[i] - in_values[i] % kMillisecondsInDay) * FACTOR);
+    }
+  }
+}

Looks like this naming exists in ConvertDatetime as well :-(.

Yeah .. let me just keep it consistent for now
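
To make the arithmetic in the quoted loop easier to follow, here is a small Python model of it. This is a sketch under stated assumptions and not the PR's implementation: `NULL_SENTINEL` stands in for `kPandasTimestampNull` and is assumed to be the minimum int64 value, `c_rem` mimics the C++ `%` operator (remainder takes the sign of the dividend), and both function names are hypothetical.

```python
MS_PER_DAY = 86_400_000
NULL_SENTINEL = -(2**63)  # assumption: stand-in for kPandasTimestampNull


def c_rem(a: int, b: int) -> int:
    """Remainder with the sign of the dividend, like C++ `%` for b > 0."""
    r = abs(a) % b
    return -r if a < 0 else r


def convert_with_truncation(values, factor):
    """Model of the quoted loop: zero out intraday ms, then scale by `factor`."""
    out = []
    for v in values:
        if v is None:
            out.append(NULL_SENTINEL)
        else:
            out.append((v - c_rem(v, MS_PER_DAY)) * factor)
    return out


values = [1525910400000 + 123456, None, -1]
print(convert_with_truncation(values, 1))
# [1525910400000, -9223372036854775808, 0]
```

With a factor of 1 the multiplication is a no-op, which is the case discussed in the first review thread. One consequence that may be worth checking against the real code: with a sign-following remainder, a pre-epoch value such as `-1` maps to `0` (1970-01-01) rather than to the preceding midnight, whereas Python's flooring `%` would give `-86400000`.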

EnricoMi

LGTM!

HyukjinKwon · 2026-01-08T21:51:59Z

@AlenkaF do you mind taking a look when you find some time? I believe I resolved all comments. Now it does not change any default behaviour 🫡


HyukjinKwon requested review from AlenkaF, raulcd and rok as code owners December 12, 2025 01:46

github-actions bot added Component: Python awaiting review Awaiting review labels Dec 12, 2025

HyukjinKwon marked this pull request as draft December 23, 2025 11:07

HyukjinKwon mentioned this pull request Dec 29, 2025

[Python] Add a switch for truncating intraday milliseconds in Date64 in Python conversion #48672

Open

HyukjinKwon changed the title ~~GH-48465: [Python] Truncate intraday milliseconds in Date64 to pandas conversion~~ GH-48672, GH-48465: [Python] Add an option for truncating intraday milliseconds in Date64 Dec 29, 2025

HyukjinKwon force-pushed the truncate-millies branch 5 times, most recently from 2ffa9a0 to 7e8eb86 Compare December 29, 2025 07:23

HyukjinKwon marked this pull request as ready for review December 29, 2025 07:25

HyukjinKwon force-pushed the truncate-millies branch from 7e8eb86 to 3896331 Compare December 29, 2025 07:41

HyukjinKwon force-pushed the truncate-millies branch from 3896331 to 0bf9fec Compare December 29, 2025 07:54

EnricoMi reviewed Jan 6, 2026


github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jan 6, 2026

HyukjinKwon force-pushed the truncate-millies branch from 0bf9fec to e478128 Compare January 7, 2026 02:03

EnricoMi approved these changes Jan 7, 2026


apacheGH-48672, apacheGH-48465: [Python] Add an option for truncating…

3e75327

… intraday milliseconds in Date64

HyukjinKwon force-pushed the truncate-millies branch from 9066b84 to 3e75327 Compare January 9, 2026 07:34
