Constructor with invalid unicode: automatically fall back to object dtype? #63396

@jorisvandenbossche

We documented that invalid unicode can no longer be stored in a str dtype column (https://pandas.pydata.org/docs/dev/user_guide/migration-3-strings.html#invalid-unicode-input), and for sure that will error when you explicitly ask for str:

>>> pd.Series(['\ud800'], dtype=str)
...
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed

But I am wondering whether we should still fall back to object dtype by default when no dtype is specified, i.e. for the default inference. Right now pd.Series(['\ud800']) also gives the same error.
(Validating that up front might have a performance cost, though; alternatively, we could catch the specific error if we know we started without a user-specified dtype.)
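
To make the "validate up front" option concrete, here is a minimal pure-Python sketch. The names `is_utf8_encodable` and `pick_dtype` are hypothetical and not part of pandas. It only illustrates the check and its per-element cost, not how pandas would implement it internally.

```python
def is_utf8_encodable(values):
    """Return True if every str in `values` can be encoded as UTF-8.

    Hypothetical helper for illustration only; non-str items are ignored.
    """
    for value in values:
        if isinstance(value, str):
            try:
                value.encode("utf-8")
            except UnicodeEncodeError:
                # Lone surrogates such as '\ud800' end up here.
                return False
    return True


def pick_dtype(values):
    # Hypothetical policy: None leaves inference to the constructor,
    # object is the fallback when a string cannot be stored as UTF-8.
    return None if is_utf8_encodable(values) else object
```

Under these assumptions, a caller could write `pd.Series(values, dtype=pick_dtype(values))`. The loop touches every string once, which is the up-front cost mentioned above; catching `UnicodeEncodeError` around the default-inference path would avoid that extra pass for valid input.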

Labels: Strings (String extension data type and string data)