
Conversation

@PaliC commented Jun 23, 2025

This pull request introduces deduplication functionality to the export.py script and updates the documentation to include testing instructions. The key changes include integrating a deduplication module, handling deduplicated datasets, and providing a detailed test setup for verifying the deduplication logic.

Specifically, the deduplication functionality does four things (a sketch of the whole pipeline follows the list):

  1. Filter down to just the columns ['code', 'run_mode', 'run_passed', 'run_meta', 'submission_id'] to keep processing manageable. Otherwise, pandas ran into issues loading the full submissions dataframe.
  2. Bin the submissions by success, run mode, and score (if available) or run duration. We then deduplicate within these bins, since MinHash takes a while (deduping this dataset took an hour).
  3. Within each bin, first dedup using an exact hash.
  4. Then dedup using MinHash LSH (this is what takes a while). You can find an explanation of this process here: https://medium.com/@omkarsoak/from-min-hashing-to-locality-sensitive-hashing-the-complete-process-b88b298d71a1
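Roughly, the pipeline looks like the sketch below. The column list is taken from this PR; the bin keys, the whitespace tokenization, and the 0.9 similarity threshold are illustrative assumptions, not necessarily the exact choices in export.py.

```python
import hashlib

import pandas as pd
from datasketch import MinHash, MinHashLSH  # pip install datasketch

KEEP_COLS = ['code', 'run_mode', 'run_passed', 'run_meta', 'submission_id']

def exact_dedup(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows whose code is byte-for-byte identical (cheap first pass)."""
    hashes = df['code'].map(lambda c: hashlib.sha256(c.encode('utf-8')).hexdigest())
    return df.loc[~hashes.duplicated()]

def near_dedup(df: pd.DataFrame, threshold: float = 0.9, num_perm: int = 128) -> pd.DataFrame:
    """Drop near-duplicates with MinHash LSH (the slow second pass)."""
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    keep = []
    for idx, code in df['code'].items():
        m = MinHash(num_perm=num_perm)
        for token in set(code.split()):  # whitespace shingling; crude but fast
            m.update(token.encode('utf-8'))
        if not lsh.query(m):  # nothing similar seen yet in this bin
            lsh.insert(str(idx), m)
            keep.append(idx)
    return df.loc[keep]

def dedup(df: pd.DataFrame) -> pd.DataFrame:
    # Bin by pass/fail and run mode so LSH only compares plausibly similar rows.
    parts = [
        near_dedup(exact_dedup(group))
        for _, group in df[KEEP_COLS].groupby(['run_passed', 'run_mode'], dropna=False)
    ]
    return pd.concat(parts)
```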

A good number of tests are also added to make sure this actually works.

@PaliC PaliC marked this pull request as ready for review June 24, 2025 00:01
@msaroufim (Member)

Thanks! Mind sharing some more stats on the real dataset?

  1. How much data was around before your change
  2. How much data is around after
  3. Some manual vibe checks of kernels that were filtered out would also be nice as a sanity check

The code is quite long to review, but at least the above should give us some more confidence before merging.

```python
# For leaderboard mode with successful runs, prefer higher scores
if run_mode == 'leaderboard' and row.get('run_passed') == True:
    if row.get('run_score', 0) > existing_row.get('run_score', 0):
        unique_entries[content_hash] = row
```
I think scores are still lower = better; run.duration is the end-to-end wallclock time for the entire run, including, e.g., testing code, whereas score is the geomean of all benchmarks
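(In other words, the fix is presumably just flipping the comparison to lower-is-better, along the lines of the sketch below, not the PR's actual patch:)

```python
# For leaderboard mode with successful runs, prefer LOWER scores:
# score is the geomean of benchmark times, so lower = faster.
if run_mode == 'leaderboard' and row.get('run_passed') is True:
    if row.get('run_score', float('inf')) < existing_row.get('run_score', float('inf')):
        unique_entries[content_hash] = row
```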

@PaliC (Author) replied:
oops, I see. I'll rerun things and reupload

@PaliC (Author) commented Jun 24, 2025

@msaroufim, for the first two questions we have these numbers:

✓ Loaded submissions.parquet: 40,095 entries
✓ Loaded deduplicated_submissions.parquet: 15,238 entries

============================================================
COUNT COMPARISON

Original submissions: 40,095
Deduplicated submissions: 15,238
Removed entries: 24,857
Percentage removed: 62.00%

So it seems like a lot of the kernels are actually similar. I'll rerun things and grab some files for a sanity check

@PaliC (Author) commented Jun 24, 2025

@msaroufim https://www.diffchecker.com/KamzTAeT/ (I think this one is more similar) and https://www.diffchecker.com/MtK5pbWL/ show an entry that was removed due to deduplication (on the right) compared to two other entries that remained in the dataset. We could be a bit less aggressive with deduping, as they look sort of different.

@PaliC PaliC requested a review from ngc92 June 24, 2025 15:33
Benjamin Horowitz and others added 7 commits November 24, 2025 15:58
This change modifies the extraction process so that it uses much less memory. In particular, it no longer loads the whole dataset into memory before exporting to parquet files. Instead, it processes the dataset into small, incremental parquet files, and then consolidates these files into a single file as the final step.
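A minimal sketch of that incremental-write-then-consolidate pattern with pyarrow (the function names and batching scheme here are illustrative assumptions, not the commit's exact code):

```python
import pyarrow as pa
import pyarrow.parquet as pq

def export_incrementally(batches, chunk_dir, out_path):
    """batches: an iterable of lists of row dicts, e.g. one list per fetch."""
    # Pass 1: write each batch to its own small parquet chunk as it arrives,
    # so only one batch is ever held in memory.
    chunk_paths = []
    for i, records in enumerate(batches):
        path = f"{chunk_dir}/chunk_{i:05d}.parquet"
        pq.write_table(pa.Table.from_pylist(records), path)
        chunk_paths.append(path)

    # Pass 2: consolidate chunks into one file by streaming row groups
    # through a single ParquetWriter (assumes all chunks share a schema).
    writer = None
    for path in chunk_paths:
        pf = pq.ParquetFile(path)
        for rg in range(pf.num_row_groups):
            table = pf.read_row_group(rg)
            if writer is None:
                writer = pq.ParquetWriter(out_path, table.schema)
            writer.write_table(table)
    if writer is not None:
        writer.close()
```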
@PaliC (Author) commented Nov 30, 2025

Updated values
Successful submissions

Deduplication results Summary:
Original rows: 60357
After hash-based dedup: 22718 rows
Final rows: 9281
Removed 51076 duplicates (84.6%)
Saved to data/successful_submissions_deduplicated.parquet

Submissions

Flattening and saving...
Deduplication results Summary:
Original rows: 109709
After hash-based dedup: 47362 rows
Final rows: 19012
Removed 90697 duplicates (82.7%)
Saved to data/submissions_deduplicated.parquet
