Skip to content

Conversation

@Shekharrajak
Copy link
Contributor

@Shekharrajak Shekharrajak commented Dec 28, 2025

Which issue does this PR close?

Closes #2946

Rationale for this change

The sql_hive-1 tests for Spark 4.0 were timing out (hanging indefinitely) when Comet was enabled. The last test shown in logs was HivePartitionFilteringSuite. Investigation showed that CometShuffleManager accessed SparkEnv.get.executorId during initialization via a lazy val, which could hang when SparkEnv wasn't fully initialized (e.g., during Hive metastore operations in Spark 4.0).
This fix defers SparkEnv access until task execution (when getWriter()/getReader() is called), ensuring SparkEnv is available and preventing the hang.

What changes are included in this PR?

Changed shuffleExecutorComponents from a lazy val that accessed SparkEnv.get during construction to a @volatile variable with double-checked locking

Added a null check with a clear error message if SparkEnv is unexpectedly null

How are these changes tested?

CI verification: Re-enabled sql_hive-1 tests for Spark 4.0 in the GitHub Actions workflow. These tests will run as part of the CI pipeline to verify the fix.

@codecov-commenter
Copy link

codecov-commenter commented Dec 28, 2025

Codecov Report

❌ Patch coverage is 81.25000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.58%. Comparing base (f09f8af) to head (cfe9b8e).
⚠️ Report is 803 commits behind head on main.

Files with missing lines Patch % Lines
.../comet/execution/shuffle/CometShuffleManager.scala 81.25% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3002      +/-   ##
============================================
+ Coverage     56.12%   59.58%   +3.46%     
- Complexity      976     1379     +403     
============================================
  Files           119      167      +48     
  Lines         11743    15501    +3758     
  Branches       2251     2571     +320     
============================================
+ Hits           6591     9237    +2646     
- Misses         4012     4964     +952     
- Partials       1140     1300     +160     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Shekharrajak Shekharrajak changed the title Fix CometShuffleManager hang by deferring SparkEnv access fix: CometShuffleManager hang by deferring SparkEnv access Dec 28, 2025
Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Shekharrajak

@andygrove andygrove merged commit bce61f5 into apache:main Dec 29, 2025
124 of 126 checks passed
@Shekharrajak Shekharrajak deleted the fix/issue-2946-cometshuffle-hang branch December 29, 2025 16:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Spark 4.0.x hive-1 jdk17 tests failing consistently in multiple PRs

3 participants