@renato2099
Contributor

Which issue does this PR close?

Closes #1305

Rationale for this change

Datafusion-python should follow the DataFusion implementation and disallow dropping keys in a full outer join. In a full outer join the left and right keys are not equivalent (either side can be null for unmatched rows), so one of them can't simply be dropped. Users can then decide how to proceed based on their use case.
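
To illustrate why the two key columns are not interchangeable, here is a minimal plain-Python sketch. It does not use any datafusion API: `full_outer_join` and the `left_id` / `right_id` column names are hypothetical helpers written only for this example, and the join is a naive nested loop.

```python
def full_outer_join(left, right, key):
    """Naive full outer join of two lists of dicts, keeping both key columns."""
    rows = []
    matched_right = set()
    for l_row in left:
        found = False
        for i, r_row in enumerate(right):
            if l_row[key] == r_row[key]:
                rows.append({f"left_{key}": l_row[key], f"right_{key}": r_row[key]})
                matched_right.add(i)
                found = True
        if not found:
            # Left-only row: the right key is null.
            rows.append({f"left_{key}": l_row[key], f"right_{key}": None})
    for i, r_row in enumerate(right):
        if i not in matched_right:
            # Right-only row: the left key is null.
            rows.append({f"left_{key}": None, f"right_{key}": r_row[key]})
    return rows


left = [{"id": 1}, {"id": 2}]
right = [{"id": 2}, {"id": 3}]

joined = full_outer_join(left, right, "id")

# Simulate dropping the right-hand key column after the join.
dropped = [{"left_id": row["left_id"]} for row in joined]

print(joined)
print(dropped)
```

After the right key is dropped, the row that came only from the right side has nothing but a null key left, so the value `3` can no longer be recovered from the result.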

What changes are included in this PR?

fix + unit test

Are there any user-facing changes?

Yes. Dropping duplicate keys is no longer allowed for full outer joins, because it is not semantically correct.

Contributor

@kosiew kosiew left a comment

@renato2099

Thanks for working on this.

Documentation gaps:

The docstring for join in python/datafusion/dataframe.py states:

drop_duplicate_keys: When True, the columns from the right DataFrame
    that have identical names in the ``on`` fields to the left DataFrame
    will be dropped.

It does not mention the full join exception. Users reading this may assume the parameter works the same for all join types.

Similarly, docs/source/user-guide/common-operations/joins.rst should document how drop_duplicate_keys behaves for full joins.


Development

Successfully merging this pull request may close these issues.

Full join on dataframe with only index yields dropped rows
