Open
Labels: Strings (String extension data type and string data)
Description
We documented that invalid Unicode can no longer be stored in a str dtype column (https://pandas.pydata.org/docs/dev/user_guide/migration-3-strings.html#invalid-unicode-input), and that will certainly raise an error when you explicitly ask for str:
>>> pd.Series(['\ud800'], dtype=str)
...
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed

But I am wondering whether we should still fall back to object dtype by default when you do not specify a dtype, i.e. for the default inference. Right now, pd.Series(['\ud800']) also raises the same error.
(Validating that up front might have a performance cost, though; otherwise we could catch the specific error when we know we started without a user-specified dtype.)
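To make the two options concrete, here is a rough sketch in plain Python. It is not how pandas implements inference: `has_invalid_unicode` and `construct_with_fallback` are hypothetical helper names, and `constructor` stands in for anything that behaves like `pd.Series` (accepts `values` and an optional `dtype`, and raises `UnicodeEncodeError` on invalid input).

```python
def has_invalid_unicode(values):
    """Up-front validation: True if any str in `values` cannot be encoded as UTF-8.

    This scans every string, which is the potential performance cost.
    """
    for value in values:
        if isinstance(value, str):
            try:
                value.encode("utf-8")
            except UnicodeEncodeError:
                return True
    return False


def construct_with_fallback(values, constructor, dtype=None):
    """Catch-and-retry: fall back to object dtype only when no dtype was given.

    `constructor` is an assumption: any callable shaped like pd.Series.
    """
    if dtype is not None:
        # The user asked for a specific dtype, so let any error propagate.
        return constructor(values, dtype=dtype)
    try:
        # Default inference; no extra validation pass on the happy path.
        return constructor(values)
    except UnicodeEncodeError:
        # Inference hit invalid Unicode: retry with object dtype.
        return constructor(values, dtype=object)
```

Under these assumptions, `construct_with_fallback(['\ud800'], pd.Series)` would return an object-dtype Series, while passing `dtype=str` would still raise. The catch-and-retry variant only pays a cost on the error path, at the price of constructing twice for invalid input.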