
Conversation

@PaliC commented Jun 23, 2025

This pull request introduces deduplication functionality to the export.py script and updates the documentation to include testing instructions. The key changes include integrating a deduplication module, handling deduplicated datasets, and providing a detailed test setup for verifying the deduplication logic.

Specifically, the deduplication functionality does four things (a sketch of the whole pipeline follows the list):

  1. Filter down to just the columns ['code', 'run_mode', 'run_passed', 'run_meta', 'submission_id'] to keep processing manageable. Otherwise, pandas ran into issues loading the full submissions dataframe.
  2. Bin the submissions by success, run mode, and score (if available) or run duration. We then deduplicate within these bins, since MinHash takes a while (deduping this dataset took an hour).
  3. Within each bin, first dedup using an exact hash.
  4. Then dedup using MinHash LSH (this is what takes a while). You can find an explanation of this process here: https://medium.com/@omkarsoak/from-min-hashing-to-locality-sensitive-hashing-the-complete-process-b88b298d71a1
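Roughly, the pipeline looks like the sketch below. The column list is taken from this PR; the bin keys, the whitespace tokenization, and the 0.9 similarity threshold are illustrative assumptions, not necessarily the exact choices in export.py.

```python
import hashlib

import pandas as pd
from datasketch import MinHash, MinHashLSH  # pip install datasketch

KEEP_COLS = ['code', 'run_mode', 'run_passed', 'run_meta', 'submission_id']

def exact_dedup(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows whose code is byte-for-byte identical (cheap first pass)."""
    hashes = df['code'].map(lambda c: hashlib.sha256(c.encode('utf-8')).hexdigest())
    return df.loc[~hashes.duplicated()]

def near_dedup(df: pd.DataFrame, threshold: float = 0.9, num_perm: int = 128) -> pd.DataFrame:
    """Drop near-duplicates with MinHash LSH (the slow second pass)."""
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    keep = []
    for idx, code in df['code'].items():
        m = MinHash(num_perm=num_perm)
        for token in set(code.split()):  # whitespace shingling; crude but fast
            m.update(token.encode('utf-8'))
        if not lsh.query(m):  # nothing similar seen yet in this bin
            lsh.insert(str(idx), m)
            keep.append(idx)
    return df.loc[keep]

def dedup(df: pd.DataFrame) -> pd.DataFrame:
    # Bin by pass/fail and run mode so LSH only compares plausibly similar rows.
    parts = [
        near_dedup(exact_dedup(group))
        for _, group in df[KEEP_COLS].groupby(['run_passed', 'run_mode'], dropna=False)
    ]
    return pd.concat(parts)
```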

A good number of tests are also added to make sure this actually works.

@PaliC PaliC marked this pull request as ready for review June 24, 2025 00:01
@msaroufim (Member)

Thanks! Mind sharing some more stats on the real dataset?

  1. How much data was around before your change
  2. How much data is around after
  3. Some manual vibe checks of kernels that were filtered out would also be nice as a sanity check

The code is quite long to review, but at least the above should give us some more confidence before merging.

```python
# For leaderboard mode with successful runs, prefer higher scores
if run_mode == 'leaderboard' and row.get('run_passed') == True:
    if row.get('run_score', 0) > existing_row.get('run_score', 0):
        unique_entries[content_hash] = row
```
I think scores are still lower = better; run.duration is the end-to-end wallclock time for the entire run, including, e.g., testing code, whereas score is the geomean of all benchmarks
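(In other words, the fix is presumably just flipping the comparison to lower-is-better, along the lines of the sketch below, not the PR's actual patch:)

```python
# For leaderboard mode with successful runs, prefer LOWER scores:
# score is the geomean of benchmark times, so lower = faster.
if run_mode == 'leaderboard' and row.get('run_passed') is True:
    if row.get('run_score', float('inf')) < existing_row.get('run_score', float('inf')):
        unique_entries[content_hash] = row
```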

@PaliC (Author) replied:
oops, I see. I'll rerun things and reupload

@PaliC (Author) commented Jun 24, 2025

@msaroufim, for the first two questions we have these numbers:

✓ Loaded submissions.parquet: 40,095 entries
✓ Loaded deduplicated_submissions.parquet: 15,238 entries

============================================================
COUNT COMPARISON

Original submissions: 40,095
Deduplicated submissions: 15,238
Removed entries: 24,857
Percentage removed: 62.00%

So it seems like a lot of the kernels are actually similar. I'll rerun things and grab some files for a sanity check

@PaliC (Author) commented Jun 24, 2025

@msaroufim https://www.diffchecker.com/KamzTAeT/ (I think this one is more similar) and https://www.diffchecker.com/MtK5pbWL/ show an entry that was removed due to deduplication (on the right) compared to two other entries that remained in the dataset. We could be a bit less aggressive with deduping, as they look sort of different.

@PaliC PaliC requested a review from ngc92 June 24, 2025 15:33
Benjamin Horowitz and others added 7 commits November 24, 2025 15:58
This change modifies the extraction process so that it uses much less memory. In particular, it no longer loads the whole dataset into memory before exporting to parquet files. Instead, it processes the dataset into small, incremental parquet files, and then consolidates these files into a single file as the final step.
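A minimal sketch of that incremental-write-then-consolidate pattern with pyarrow (the function names and batching scheme here are illustrative assumptions, not the commit's exact code):

```python
import pyarrow as pa
import pyarrow.parquet as pq

def export_incrementally(batches, chunk_dir, out_path):
    """batches: an iterable of lists of row dicts, e.g. one list per fetch."""
    # Pass 1: write each batch to its own small parquet chunk as it arrives,
    # so only one batch is ever held in memory.
    chunk_paths = []
    for i, records in enumerate(batches):
        path = f"{chunk_dir}/chunk_{i:05d}.parquet"
        pq.write_table(pa.Table.from_pylist(records), path)
        chunk_paths.append(path)

    # Pass 2: consolidate chunks into one file by streaming row groups
    # through a single ParquetWriter (assumes all chunks share a schema).
    writer = None
    for path in chunk_paths:
        pf = pq.ParquetFile(path)
        for rg in range(pf.num_row_groups):
            table = pf.read_row_group(rg)
            if writer is None:
                writer = pq.ParquetWriter(out_path, table.schema)
            writer.write_table(table)
    if writer is not None:
        writer.close()
```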
@PaliC (Author) commented Nov 30, 2025

Updated values
Successful submissions

Deduplication results Summary:
Original rows: 60357
After hash-based dedup: 22718 rows
Final rows: 9281
Removed 51076 duplicates (84.6%)
Saved to data/successful_submissions_deduplicated.parquet

Submissions

Flattening and saving...
Deduplication results Summary:
Original rows: 109709
After hash-based dedup: 47362 rows
Final rows: 19012
Removed 90697 duplicates (82.7%)
Saved to data/submissions_deduplicated.parquet
