fix: CometShuffleManager hang by deferring SparkEnv access #3002
+31
−18
Which issue does this PR close?
Closes #2946
Rationale for this change
The sql_hive-1 tests for Spark 4.0 were timing out (hanging indefinitely) when Comet was enabled. The last test shown in the logs was HivePartitionFilteringSuite. Investigation showed that CometShuffleManager accessed SparkEnv.get.executorId while initializing a lazy val, and that this access could hang when SparkEnv wasn't fully initialized (e.g., during Hive metastore operations in Spark 4.0).
This fix defers SparkEnv access until task execution (when getWriter()/getReader() is called), ensuring SparkEnv is available and preventing the hang.
What changes are included in this PR?
Changed shuffleExecutorComponents from a lazy val that accessed SparkEnv.get during construction to a @volatile variable with double-checked locking
Added a null check with a clear error message if SparkEnv is unexpectedly null
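For reviewers who want the pattern in isolation, here is a minimal sketch of deferred initialization using a @volatile variable with double-checked locking. It is not the code from this PR: Env, Components, and DeferredShuffleManager are hypothetical stand-ins for SparkEnv, the shuffle executor components, and CometShuffleManager, and the error message is illustrative only.

```scala
// Hypothetical stand-in for SparkEnv: `get` returns null until the environment is set.
final class Env(val executorId: String)

object Env {
  @volatile private var current: Env = null
  def set(env: Env): Unit = current = env
  def get: Env = current
}

// Hypothetical stand-in for the executor components built from the environment.
final class Components(val executorId: String)

final class DeferredShuffleManager {
  // Constructing the manager never touches Env; the field starts out empty.
  @volatile private var components: Components = null

  private def getOrCreateComponents(): Components = {
    var local = components // first check, without taking the lock
    if (local == null) {
      synchronized {
        local = components // second check, under the lock
        if (local == null) {
          val env = Env.get
          if (env == null) {
            // Fail with a clear message instead of proceeding with a null environment.
            throw new IllegalStateException(
              "Env is not initialized; components can only be created during task execution")
          }
          local = new Components(env.executorId)
          components = local // publish via the volatile write
        }
      }
    }
    local
  }

  // Stand-ins for getWriter()/getReader(): the first call triggers initialization.
  def getWriter(): Components = getOrCreateComponents()
  def getReader(): Components = getOrCreateComponents()
}
```

In this sketch, constructing DeferredShuffleManager before Env.set is called succeeds, and only a getWriter() or getReader() call made while the environment is still unset raises the error. A failed attempt leaves the field empty, so a later call can still initialize it once the environment is available.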
How are these changes tested?
CI verification: Re-enabled sql_hive-1 tests for Spark 4.0 in the GitHub Actions workflow. These tests will run as part of the CI pipeline to verify the fix.