[common] Add support for cuBLASLt GEMM for GroupedTensor #2502
for more information, see https://pre-commit.ci
- Add FP8 scale_inv pointer handling in nvte_grouped_gemm for proper FP8 GEMM
- Fix random padding in tests to ensure 16-byte alignment for all dtypes
- Reorder GroupedGemmSetupWorkspace members for natural alignment
- Remove debug prints

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
/te-ci L0
Greptile Summary
Adds
The implementation follows established patterns from the codebase, includes proper validation, and has thorough test coverage.
Confidence Score: 4/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant User
participant nvte_grouped_gemm
participant Validation
participant Operand Selection
participant Setup Kernel
participant cuBLASLt
User->>nvte_grouped_gemm: Call with A, B, C, D, alpha, beta
nvte_grouped_gemm->>Validation: Check SM >= 100 (Blackwell)
nvte_grouped_gemm->>Validation: validate_grouped_gemm_inputs()
Validation-->>nvte_grouped_gemm: OK
nvte_grouped_gemm->>Operand Selection: select_grouped_operand(A, transa)
Operand Selection->>Operand Selection: Check FP8 TN layout requirements
Operand Selection->>Operand Selection: Choose row-wise vs column-wise data
Operand Selection-->>nvte_grouped_gemm: A_sel (dptr, dtype, trans, use_columnwise)
nvte_grouped_gemm->>Operand Selection: select_grouped_operand(B, transb)
Operand Selection-->>nvte_grouped_gemm: B_sel (dptr, dtype, trans, use_columnwise)
nvte_grouped_gemm->>Setup Kernel: Allocate setup workspace
nvte_grouped_gemm->>Setup Kernel: launch_grouped_gemm_setup()
Setup Kernel->>Setup Kernel: setup_grouped_gemm_kernel<<<blocks, threads>>>
Note over Setup Kernel: Per-tensor computation:<br/>- Compute A/B/C/D pointers from offsets<br/>- Compute M/N/K from dimensions<br/>- Fill alpha_ptrs, beta_ptrs arrays
Setup Kernel-->>nvte_grouped_gemm: Workspace populated
nvte_grouped_gemm->>cuBLASLt: init_matrix_layouts(descA, descB, descC, descD)
nvte_grouped_gemm->>cuBLASLt: init_matmul_desc(op_A, op_B)
nvte_grouped_gemm->>cuBLASLt: set_fp8_scale_pointers() if FP8
nvte_grouped_gemm->>cuBLASLt: select_grouped_gemm_algo() with avg hints
cuBLASLt-->>nvte_grouped_gemm: Algorithm selected
nvte_grouped_gemm->>cuBLASLt: cublasLtMatmul()
Note over cuBLASLt: Execute grouped GEMM:<br/>D[i] = alpha[i] * op(A[i]) @ op(B[i]) + beta[i] * C[i]
cuBLASLt-->>nvte_grouped_gemm: GEMM complete
nvte_grouped_gemm-->>User: Return
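The formula in the diagram's last note, D[i] = alpha[i] * op(A[i]) @ op(B[i]) + beta[i] * C[i], can be illustrated with a small host-side reference. This is only a sketch and not code from the PR. The `Matrix` type, `op_at`, and `grouped_gemm_reference` are hypothetical names, and it assumes plain row-major matrices rather than Transformer Engine's operand layout. A null `C` is treated as zero, which matches the "NULL C when beta=0" case only in spirit.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major dense matrix used only for this illustration.
struct Matrix {
  std::size_t rows = 0, cols = 0;
  std::vector<float> data;
  Matrix(std::size_t r, std::size_t c, float fill = 0.0f)
      : rows(r), cols(c), data(r * c, fill) {}
  float &at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
  float at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Element (i, j) of op(X), where op is the identity or the transpose.
inline float op_at(const Matrix &x, bool trans, std::size_t i, std::size_t j) {
  return trans ? x.at(j, i) : x.at(i, j);
}

// D[g] = alpha[g] * op(A[g]) @ op(B[g]) + beta[g] * C[g] for every group g.
// Each group may have its own M, N, K. C may be nullptr (read as zeros).
inline void grouped_gemm_reference(const std::vector<Matrix> &A, bool transa,
                                   const std::vector<Matrix> &B, bool transb,
                                   const std::vector<Matrix> *C,
                                   std::vector<Matrix> &D,
                                   const std::vector<float> &alpha,
                                   const std::vector<float> &beta) {
  const std::size_t groups = A.size();
  assert(B.size() == groups && D.size() == groups);
  assert(alpha.size() == groups && beta.size() == groups);
  for (std::size_t g = 0; g < groups; ++g) {
    const std::size_t m = transa ? A[g].cols : A[g].rows;
    const std::size_t k = transa ? A[g].rows : A[g].cols;
    const std::size_t n = transb ? B[g].rows : B[g].cols;
    assert((transb ? B[g].cols : B[g].rows) == k);
    assert(D[g].rows == m && D[g].cols == n);
    for (std::size_t i = 0; i < m; ++i) {
      for (std::size_t j = 0; j < n; ++j) {
        float acc = 0.0f;
        for (std::size_t p = 0; p < k; ++p) {
          acc += op_at(A[g], transa, i, p) * op_at(B[g], transb, p, j);
        }
        const float c = (C != nullptr) ? (*C)[g].at(i, j) : 0.0f;
        D[g].at(i, j) = alpha[g] * acc + beta[g] * c;
      }
    }
  }
}
```

A reference like this is mainly useful for checking a grouped kernel on tiny inputs, since every group is computed independently with its own scaling factors.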
Additional Comments (4)
- tests/cpp/operator/test_grouped_gemm.cu, line 75 logic: missing columnwise_data in move assignment
- tests/cpp/operator/test_grouped_gemm.cu, line 336-337 logic: kSameFirst and kSameLast test cases use identical shape tuples. Should kSameFirst and kSameLast have different shape patterns to properly test the respective scenarios?
- tests/cpp/operator/test_grouped_gemm.cu, line 367-378 logic: missing case for InputCase::kFP8Delayed
- transformer_engine/common/include/transformer_engine/gemm.h, line 265-266 style: The avg_m, avg_n, avg_k parameters are not documented in the function comment. What do these average dimension parameters represent and how should they be computed?
4 files reviewed, 4 comments
- Remove unused alignment parameter from GroupedGemmSetupWorkspace::from_buffers
- Simplify select_grouped_operand by removing dead code branches
- Add GroupedOperandSelection.tensor field to avoid passing tensor separately
- Extract set_fp8_scale_pointers and init_matrix_layouts helpers
- Add safety check for FP8 on Hopper column-wise fallback
- Support NULL C tensor when beta=0 (uses D as placeholder)
- Remove unused get_scale_inv() from test
- Add use_null_c test parameter and test case
- Fix documentation: alpha/beta are single element tensors only

Signed-off-by: Piotr Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Additional Comments (1)
- transformer_engine/common/include/transformer_engine/gemm.h, line 266 syntax: Documentation incorrectly states alpha[i] and beta[i]. The implementation uses a single alpha and beta value for all matrices in the group (batch stride is 1 in init_matmul_desc at cublaslt_gemm.cu:1404), not per-matrix scaling.
4 files reviewed, 1 comment
- Change alpha/beta from single values to per-matrix arrays
- Validate alpha/beta have exactly num_tensors elements
- Update kernel to index alpha_ptr[idx] and beta_ptr[idx]
- Move alpha/beta validation to validate_grouped_gemm_inputs
- Update tests to use per-matrix alpha/beta arrays
- Update documentation

Signed-off-by: Piotr Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Piotr Gadzinski <pgadzinski@nvidia.com> Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
/te-ci
    NVTEMatmulConfig config, cudaStream_t stream, const int64_t *avg_m,
    const int64_t *avg_n, const int64_t *avg_k);
The average sizes seem like advanced configs that would be better to leave out of the top-level API. Can we move them inside NVTEMatmulConfig?
It was a suggestion from @ptrendx: a customer may want to set them if they know something more about the shapes. For example, if there are multiple tensors with K dimension D and one tensor with K dimension equal to 1, then telling cuBLAS that the average dimension is D may result in better performance.
We should definitely expose these options, but it would be better to put them in NVTEMatmulConfig rather than the top-level function signature. If you look at nvte_cublas_gemm:
TransformerEngine/transformer_engine/common/include/transformer_engine/gemm.h
Lines 109 to 112 in 97a09c2
void nvte_cublas_gemm(const NVTETensor A, const NVTETensor B, NVTETensor D, const NVTETensor bias,
                      NVTETensor pre_gelu_out, bool transa, bool transb, bool grad,
                      NVTETensor workspace, bool accumulate, bool use_split_accumulator,
                      int math_sm_count, cudaStream_t stream);
You see advanced options like pre_gelu_out and use_split_accumulator that are only needed for specialized cases. It's even worse for use_split_accumulator, since that's only relevant for FP8 on Hopper and now it's sitting uselessly in the API forever. Compare with the v2 API, which is much closer to the original GEMM API:
TransformerEngine/transformer_engine/common/include/transformer_engine/gemm.h
Lines 131 to 133 in 97a09c2
void nvte_cublas_gemm_v2(int transa, int transb, const float *alpha, const NVTETensor A,
                         const NVTETensor B, const float *beta, const NVTETensor C, NVTETensor D,
                         NVTETensor workspace, NVTEMatmulConfig config, cudaStream_t stream);
The idea of NVTEMatmulConfig is to hold these advanced, kernel-specific options in a way where we can add or deprecate them easily without breaking API changes.
I added an analogous NVTEGroupedMatmulConfig.
It's worth noting that cublas does not support pre_gelu_out and use_split_accumulator for grouped gemm.
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Additional Comments (2)
- transformer_engine/common/include/transformer_engine/gemm.h, line 256 syntax: Documentation mentions non-existent config parameter - this parameter is not in the function signature on line 276-280
- tests/cpp/operator/test_grouped_gemm.cu, line 141 logic: Alignment calculation appears incorrect - multiplying 16 bytes by 8 bits gives 128 bits, not the intended 16-byte alignment
6 files reviewed, 2 comments
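The bytes-versus-bits question raised in the second comment can be made explicit in code. The sketch below is not the test's actual padding logic, which is not shown in this conversation. It only illustrates one way to round an element offset up to a 16-byte boundary when element widths are expressed in bits. `kAlignmentBytes`, `elems_per_alignment`, and `align_offset_elems` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// The 16-byte alignment named in the commit message above.
constexpr int64_t kAlignmentBytes = 16;

// Number of elements that span exactly one 16-byte block.
// elem_bits is the element width in bits (8 for FP8, 16 for BF16, 32 for FP32).
constexpr int64_t elems_per_alignment(int64_t elem_bits) {
  // 16 bytes * 8 bits per byte = 128 bits; dividing by the element width
  // converts the alignment back into a count of elements.
  return (kAlignmentBytes * 8) / elem_bits;
}

// Round an offset, measured in elements, up to the next 16-byte boundary.
inline int64_t align_offset_elems(int64_t offset_elems, int64_t elem_bits) {
  assert(elem_bits > 0 && (kAlignmentBytes * 8) % elem_bits == 0);
  const int64_t step = elems_per_alignment(elem_bits);
  return (offset_elems + step - 1) / step * step;
}
```

Under this convention, multiplying 16 by 8 is correct as long as the result is treated as a bit count and divided by the element width before being used as an element offset.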
/te-ci L0
Additional Comments (4)
- transformer_engine/common/gemm/cublaslt_grouped_gemm.cu, line 445 style: Use const_cast sparingly - casting away const from a_base could be risky if the caller expects the data to remain unchanged
- transformer_engine/common/gemm/cublaslt_grouped_gemm.cu, line 451-455 style: The comment mentions "Test stores A" but this is implementation code, not test code - update comment for clarity
- tests/cpp/operator/test_grouped_gemm.cu, line 282-283 style: Commented code should be removed before merging
- transformer_engine/common/gemm/cublaslt_grouped_gemm.cu, line 313-318 style: Variable naming could be clearer - rowa/cola/rowb/colb could be confused with actual row/column counts vs storage dimensions. Consider renaming to lda_rows/lda_cols etc. or adding clarifying comments
8 files reviewed, 4 comments
inline int64_t compute_avg_first_dim(const transformer_engine::GroupedTensor *t) {
  // logical_shape[0] is either num_tensors*M (uniform) or sum_of_M (varying first)
  // In both cases, dividing by num_tensors gives the average
  return static_cast<int64_t>(t->logical_shape.data[0]) / static_cast<int64_t>(t->num_tensors);
What is logical_shape.data[0]? Do we have access to this field if we want CUDA graphs (i.e. without a D2H copy)?
logical_shape represents the total shape of all tensors in the grouped tensor. For example, we can have a grouped tensor of logical shape [10, 10] with tensors of shape [3, 10] and [7, 10] inside it. The logical shape must be constant to use CUDA graphs.
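The [10, 10] example works out as follows in a small host-side sketch. `GroupedShapeSketch` and `avg_first_dim_sketch` are hypothetical stand-ins for the real GroupedTensor metadata and the quoted helper. The sketch assumes only what the quoted snippet states: the average is the first logical dimension divided by the number of tensors.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the grouped-tensor metadata used by the helper.
struct GroupedShapeSketch {
  std::vector<int64_t> logical_shape;  // e.g. {10, 10} for the whole group
  int64_t num_tensors;                 // e.g. 2 for tensors [3, 10] and [7, 10]
};

// Average first dimension computed purely from host-side metadata.
// The result uses integer division, so it is a rounded-down hint.
inline int64_t avg_first_dim_sketch(const GroupedShapeSketch &t) {
  assert(t.num_tensors > 0 && !t.logical_shape.empty());
  return t.logical_shape[0] / t.num_tensors;
}
```

For the example above the hint is 10 / 2 = 5, even though the individual first dimensions are 3 and 7. Nothing here reads the per-tensor splits, so the value is available without a device-to-host copy.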
I see, this could be an issue for future CUDA graph support of dynamic-shape MoE. Basically, for an input buffer of shape [10, 10], this [10, 10] shape is static, but there will be an on-device tensor that says the splits are [[3, 10], [3, 10]], so the actual valid shape will be [6, 10]. The excess shape would just be wasted GEMM performance (?)
That being said, this function is only used to get a cuBLAS heuristic, so it's nothing fundamental. Can we support injecting static values directly here? Users who know their model well could set those values manually and avoid reading these shapes on the host, since that would mean a D2H copy and therefore no CUDA graph.
The only problem may be that the wrong statistics are computed, not any cuda errors or incorrect memory read, am I right?
There is support for custom statistic injection here.
wrong statistics can be avoided if we do zero pad, but that's another story, need to check with cublas folks
btw, does this PR have BF16 grouped gemm? is grouped gemm optimization sm100 only for now?
it supports bf16 and there are plans to support it on other devices also
Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com> Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Additional Comments (4)
- transformer_engine/common/gemm/cublaslt_grouped_gemm.cu, line 449-453 style: const_cast removes const correctness. While the comment says cuBLAS won't modify the data, this breaks type safety guarantees. Consider if cuBLASLt API could accept const void** for input matrices, or document why this is necessary.
- tests/cpp/operator/test_grouped_gemm.cu, line 136-145 style: Random padding for alignment testing is good, but the random seed is hardcoded (12345). Verify tests are deterministic and document why this specific padding strategy was chosen.
- transformer_engine/common/gemm/cublaslt_grouped_gemm.cu, line 582-585 style: Fast accumulator mode is always enabled for FP8, but split accumulator isn't supported by cuBLASLt grouped GEMM API. Check that this accuracy trade-off is documented in the public API so users are aware.
- transformer_engine/common/gemm/cublaslt_grouped_gemm.cu, line 514 style: SM100 (Blackwell) requirement is very restrictive. Verify this is a hard requirement from cuBLASLt grouped GEMM API and document the limitation clearly for users.
8 files reviewed, 4 comments
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Additional Comments (3)
- transformer_engine/common/gemm/config.h, line 47-48 logic: attr_sizes calculation for optional fields includes full std::optional size rather than just int64_t. When nvte_get_grouped_matmul_config_attribute copies value to buffer, it only copies sizeof(int64_t) (line 150-160 in config.cpp), but attr_sizes declares sizeof(avg_m) which is sizeof(std::optional<int64_t>) (typically 16 bytes, not 8)
- transformer_engine/common/gemm/cublaslt_grouped_gemm.cu, line 449 style: const_cast removes const qualifier from input pointers A, B, C. While comment notes this is safe because cuBLAS won't modify the data, this relies on cuBLAS internal behavior and could break if cuBLAS implementation changes - consider verifying this is documented in cuBLAS API
- tests/cpp/test_common.cu, line 1093-1102 style: random padding logic may generate non-deterministic test behavior. While the seed is fixed (12345), random padding affects memory layout which could expose alignment issues inconsistently across test runs - consider documenting this is intentional for broader coverage
9 files reviewed, 3 comments
Additional Comments (5)
Check if
Verify the dimension logic is correct for both row-wise and column-wise layouts with all transpose combinations
// Number of streaming multiprocessors to use in GEMM kernel
int sm_count = 0;

static constexpr size_t attr_sizes[] = {sizeof(avg_m), sizeof(avg_n), sizeof(avg_k),
Agree with greptile, after the change to optional this is wrong - you would need to do something like sizeof(decltype(avg_m)::value_type)
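The size mismatch discussed here can be reproduced in isolation. The sketch below is not the PR's config.h. `ConfigSketch` and `get_avg_m` are hypothetical, with a single optional field standing in for avg_m/avg_n/avg_k. It shows why `sizeof` on the `std::optional` member overstates the payload and how `value_type` recovers the size of the value that is actually copied.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>

// std::optional stores an "engaged" flag next to the value, so it is larger
// than the value alone (commonly 16 bytes for an 8-byte int64_t).
static_assert(sizeof(std::optional<int64_t>) > sizeof(int64_t),
              "optional carries extra state beyond its value");

// Hypothetical config with one optional hint and one plain field.
struct ConfigSketch {
  std::optional<int64_t> avg_m;
  int sm_count = 0;

  // Size of the payload exchanged through a C-style attribute getter:
  // the optional's value_type, not the optional itself.
  static constexpr std::size_t attr_sizes[] = {
      sizeof(decltype(avg_m)::value_type), sizeof(sm_count)};
};

// Copies avg_m into a caller-provided buffer, falling back to 0 when unset.
inline void get_avg_m(const ConfigSketch &cfg, void *buf, std::size_t buf_size) {
  assert(buf_size == ConfigSketch::attr_sizes[0]);
  const int64_t value = cfg.avg_m.value_or(0);
  std::memcpy(buf, &value, sizeof(value));
}
```

With `sizeof(avg_m)` in the table instead, a caller sizing its buffer from `attr_sizes` would reserve more bytes than the getter ever writes.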
// Compute data pointers
// Note: const_cast is safe here - cuBLAS requires void** but won't modify A/B/C data
A_ptrs[idx] = const_cast<char *>(a_base) + a_offset * a_elem_size;
You don't actually need this here - the fact that you got a const pointer to GroupedTensor does not mean that the pointer inside it is const pointer. So you can declare those a_base etc. pointers as regular char* and not have to const_cast them afterwards.
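The language rule behind this comment is that constness of a struct does not propagate to what its pointer members point at. A minimal sketch follows, using a hypothetical `BufferSketch` type in place of GroupedTensor, whose real layout is not shown in this conversation.

```cpp
#include <cassert>

// Hypothetical stand-in for a tensor struct that owns a raw data pointer.
struct BufferSketch {
  char *dptr;  // pointer member: pointee is non-const char
};

// Takes the struct by pointer-to-const, yet returns a mutable char*.
inline char *element_ptr(const BufferSketch *t, long offset_bytes) {
  assert(t != nullptr && t->dptr != nullptr);
  // Through a const BufferSketch*, t->dptr has type `char *const`:
  // the pointer cannot be reseated, but the bytes it points to stay mutable.
  char *base = t->dptr;  // plain copy, no const_cast required
  return base + offset_bytes;
}
```

A `const_cast` only becomes necessary if the base pointer is first stored in a `const char *` local, which is what the review suggests avoiding.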
Greptile Overview
Greptile Summary
Overview
Adds comprehensive grouped GEMM support via cuBLASLt's grouped matmul API, enabling efficient batched matrix multiplication on tensors with varying shapes. The implementation bridges the gap between NVTEGroupedTensor's contiguous buffer + metadata format and cuBLASLt's pointer array requirements.
Key Features Implemented
- GPU Setup Kernel (setup_grouped_gemm_kernel): Computes per-tensor pointers and M/N/K dimensions from GroupedTensor metadata, handling both uniform and varying shapes with optional per-tensor offsets.
- FP8 Support: Correctly handles FP8 scale_inv pointers and enforces TN-layout-only constraint on Hopper via columnwise data fallback and transpose flag adjustment.
- Configuration System: New GroupedMatmulConfig struct with optional avg_m/avg_n/avg_k hints for cuBLASLt algorithm heuristics, with proper C++ wrapper for ease of use.
- Workspace Management: Separate setup workspace (pointer arrays) and cuBLAS workspace, with proper alignment and size validation.
- API Design: Clear separation between C API (nvte_grouped_gemm) and C++ wrappers (GroupedMatmulConfigWrapper), with comprehensive documentation.
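The pointer computation attributed to the setup kernel in the first bullet can be sketched on the host. This is an illustration under assumptions, not the kernel itself. `build_pointer_array` is a hypothetical name, offsets are assumed to be in elements, and the arithmetic follows the `base + offset * elem_size` form quoted earlier in this conversation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side analogue of turning "contiguous buffer + per-tensor offsets"
// into the pointer array that a grouped GEMM call consumes.
inline std::vector<char *> build_pointer_array(
    char *base, const std::vector<int64_t> &offsets_elems,
    std::size_t elem_size) {
  assert(base != nullptr && elem_size > 0);
  std::vector<char *> ptrs(offsets_elems.size());
  for (std::size_t idx = 0; idx < offsets_elems.size(); ++idx) {
    // Offsets are in elements, so scale by the element size to get bytes.
    ptrs[idx] = base + offsets_elems[idx] * static_cast<int64_t>(elem_size);
  }
  return ptrs;
}
```

In the real kernel each thread would presumably handle one `idx`, and the same step would be repeated for A, B, C and D with their own base pointers and element sizes.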
Validation & Testing
- Comprehensive test suite (test_grouped_gemm.cu) compares against nvte_multi_tensor_gemm baseline
- Tests all data types: FP8, BF16 with various shape configurations (uniform, varying first/last dimensions)
- All transpose combinations and null C (beta=0) cases covered
- Proper architectural checks for Blackwell+ and cuBLAS 13.1+ requirements
Code Quality
- Proper input validation with detailed error messages
- RAII memory management throughout (CudaPtr, GroupedTensorHandle, NVIDIA's custom deleters)
- Helper functions (TensorShapeInfo, GroupedOperandSelection) encapsulate complex logic
- Comments explain non-obvious decisions (e.g., FP8 TN-layout handling, M/N/K computation semantics)
- Dimension computation is semantically correct for both transposed and non-transposed cases
No Issues Found
The implementation is production-ready with:
- Correct attr_sizes computation using decltype(...)::value_type
- Proper SM100/100a/103a detection (sm_arch >= 100)
- Safe nullptr handling for optional C tensor (falls back to D)
- Correct workspace layout and alignment
Confidence Score: 5/5
- This PR is safe to merge with minimal risk. Implementation is well-designed, thoroughly tested, and handles edge cases properly.
- Score reflects comprehensive implementation of a complex feature with excellent code quality. All 12 files reviewed showed proper error handling, memory management, and API consistency. Dimension computation logic is correct, FP8 constraints are properly enforced, and test coverage is thorough including edge cases (null C, varying shapes, all transpose combinations, dual data types). Previous review comments have been addressed. No critical, high, or medium severity issues identified.
- No files require special attention. All implementations follow established patterns and have proper validation.
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| transformer_engine/common/gemm/cublaslt_grouped_gemm.cu | 5/5 | Core implementation of grouped GEMM. Bridges NVTEGroupedTensor format to cuBLASLt requirements via GPU setup kernel. Properly handles FP8 TN-layout constraints, varying shapes, and column-wise data fallback. M/N/K dimension computation is correct. C=nullptr fallback to D is properly handled. No critical issues found. |
| transformer_engine/common/gemm/config.h | 5/5 | GroupedMatmulConfig structure with optional avg_m/avg_n/avg_k hints. Correctly uses sizeof(decltype(avg_m)::value_type) = sizeof(int64_t) = 8 bytes, matching the memcpy operations in config.cpp. No issues. |
| transformer_engine/common/gemm/config.cpp | 5/5 | Configuration accessors for grouped matmul. Correctly extracts int64_t values from std::optional and performs memcpy with correct sizes. Matches attr_sizes from config.h. No issues found. |
| transformer_engine/common/include/transformer_engine/gemm.h | 5/5 | API declarations and C++ wrapper classes. Documentation clearly states Blackwell (SM100) + cuBLAS 13.1+ requirements. GroupedMatmulConfigWrapper provides clean C++ interface for setting avg_m/avg_n/avg_k hints. Comprehensive documentation. No issues. |
| tests/cpp/operator/test_grouped_gemm.cu | 5/5 | Comprehensive test suite comparing grouped_gemm against multi_tensor_gemm baseline. Tests FP8/BF16, all shape variations (uniform, varying first/last/all), transpose combinations, and null C case. Uses cudaStreamDefault (passed as 0) which is semantically correct. No issues found. |
| tests/cpp/test_common.cu | 5/5 | build_grouped_tensor implementation with random padding for alignment testing. Correctly computes offsets, logical shapes, and sets grouped tensor parameters. Allocates and copies FP8 columnwise data appropriately. No issues found. |
Sequence Diagram
sequenceDiagram
participant User
participant nvte_grouped_gemm as nvte_grouped_gemm API
participant setup_kernel as setup_grouped_gemm_kernel
participant cublaslt as cuBLASLt
User->>nvte_grouped_gemm: Call with GroupedTensors A, B, C, D
nvte_grouped_gemm->>nvte_grouped_gemm: Validate inputs (tensors, shapes, dtypes)
nvte_grouped_gemm->>nvte_grouped_gemm: Select operand storage (row/col-wise)
nvte_grouped_gemm->>nvte_grouped_gemm: Handle FP8 TN-only constraint on Hopper
nvte_grouped_gemm->>nvte_grouped_gemm: Allocate/validate workspaces
nvte_grouped_gemm->>setup_kernel: Launch setup kernel (GPU)
setup_kernel->>setup_kernel: Compute per-matrix pointers from offsets
setup_kernel->>setup_kernel: Compute M/N/K from tensor shapes
setup_kernel->>setup_kernel: Fill pointer arrays and dimension arrays
nvte_grouped_gemm->>nvte_grouped_gemm: Initialize cuBLASLt descriptors
nvte_grouped_gemm->>nvte_grouped_gemm: Set FP8 scale pointers (if needed)
nvte_grouped_gemm->>nvte_grouped_gemm: Select algorithm via heuristics
nvte_grouped_gemm->>cublaslt: Execute grouped GEMM
cublaslt->>cublaslt: Perform batched matrix multiplication
cublaslt-->>User: Return with results in D
Greptile Overview
Greptile Summary
This PR adds nvte_grouped_gemm API using cuBLASLt grouped matmul for batched GEMM on tensors with varying shapes, requiring Blackwell (SM100+) and cuBLAS 13.1+.
Implementation Overview:
- Core GEMM flow: GPU kernel (setup_grouped_gemm_kernel) converts NVTEGroupedTensor format (contiguous buffer + offsets) to cuBLAS requirements (pointer arrays + per-matrix M/N/K dimensions)
- FP8 support: Handles scale_inv pointers and includes TN layout selection logic for Hopper compatibility
- Configuration: GroupedMatmulConfig with optional avg_m/avg_n/avg_k hints for algorithm heuristics
- Testing: Comprehensive tests comparing against nvte_multi_tensor_gemm across FP8/BF16, various shapes, and transpose configurations
Key Finding:
The setup kernel contains a logic bug in the FP8 TN layout forcing path (lines 453-458) where dimensions are computed incorrectly when columnwise data is selected. However, this code path is unreachable in practice because grouped GEMM requires SM100+ (Blackwell), which supports non-TN FP8 layouts, so the TN-forcing logic never executes. While not a runtime issue, this creates technical debt if requirements change.
Code Quality:
- Clean separation of concerns with helper functions
- Proper error checking and validation
- Good test coverage for supported configurations
- RAII memory management in test infrastructure
Confidence Score: 4/5
- Safe to merge - core functionality is correct and well-tested for the supported SM100+ configuration
- The implementation is sound for its intended SM100+ target. One logic bug exists in unreachable TN-forcing code (P2 severity) that only affects hypothetical Hopper support. All tested paths work correctly. Strong test coverage validates the main use cases.
- transformer_engine/common/gemm/cublaslt_grouped_gemm.cu - contains unreachable but incorrect FP8 TN layout handling code
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| transformer_engine/common/gemm/cublaslt_grouped_gemm.cu | 3/5 | New file implementing nvte_grouped_gemm with cuBLASLt. Contains setup kernel for pointer array generation, FP8 scale handling, and operand selection logic. Has untested FP8 TN layout forcing code that would be incorrect if used on Hopper (but SM100+ requirement prevents this). |
| transformer_engine/common/gemm/config.h | 5/5 | Adds GroupedMatmulConfig struct with optional avg_m/n/k hints and sm_count. attr_sizes correctly uses sizeof(value_type) for std::optional fields. |
| transformer_engine/common/gemm/config.cpp | 5/5 | Implements create/get/set/destroy functions for GroupedMatmulConfig. Properly handles std::optional by transferring value_type (int64_t) with value_or(0) for gets and direct assignment for sets. |
| tests/cpp/operator/test_grouped_gemm.cu | 4/5 | Comprehensive tests comparing nvte_grouped_gemm against nvte_multi_tensor_gemm baseline. Tests FP8/BF16, various transpose configs, and shape variations. Only runs on SM100+ which has non-TN FP8 support. |
| tests/cpp/test_common.cu | 5/5 | Adds build_grouped_tensor helper with RAII memory management. Correctly handles varying shapes, random padding for alignment testing, and creates both rowwise and columnwise data for FP8. |
| tests/cpp/test_common.h | 5/5 | Adds GroupedBuffers struct and build_grouped_tensor declaration. Clean RAII design with CudaPtr and GroupedTensorHandle wrappers. |
Sequence Diagram
sequenceDiagram
participant User
participant nvte_grouped_gemm
participant select_grouped_operand
participant setup_kernel
participant cuBLASLt
User->>nvte_grouped_gemm: Call with A, B, C, D tensors
nvte_grouped_gemm->>nvte_grouped_gemm: Validate inputs (SM100+, cuBLAS 13.1+)
nvte_grouped_gemm->>select_grouped_operand: Select A operand (row/col-wise)
select_grouped_operand-->>nvte_grouped_gemm: A_sel (adjusted trans, dptr, dtype)
nvte_grouped_gemm->>select_grouped_operand: Select B operand (row/col-wise)
select_grouped_operand-->>nvte_grouped_gemm: B_sel (adjusted trans, dptr, dtype)
nvte_grouped_gemm->>nvte_grouped_gemm: Allocate setup + cuBLAS workspaces
nvte_grouped_gemm->>setup_kernel: Launch GPU kernel
setup_kernel->>setup_kernel: Compute pointer arrays (A/B/C/D_ptrs)
setup_kernel->>setup_kernel: Compute M/N/K per matrix
setup_kernel->>setup_kernel: Setup alpha/beta pointers
setup_kernel-->>nvte_grouped_gemm: Populated workspace arrays
nvte_grouped_gemm->>nvte_grouped_gemm: Init matrix layouts (descA/B/C/D)
nvte_grouped_gemm->>nvte_grouped_gemm: Init matmul descriptor with ops
nvte_grouped_gemm->>nvte_grouped_gemm: Set FP8 scale pointers if needed
nvte_grouped_gemm->>nvte_grouped_gemm: Select algorithm via heuristics
nvte_grouped_gemm->>cuBLASLt: cublasLtMatmul
cuBLASLt-->>nvte_grouped_gemm: GEMM results in D
nvte_grouped_gemm-->>User: Return
// Compute M, N, K dimensions from tensor shapes
// Input A is stored as {K,M} when !transa, {M,K} when transa
// Input B is stored as {N,K} when !transb, {K,N} when transb
M[idx] = static_cast<int>(transa ? a_first : a_last);
K[idx] = static_cast<int>(transa ? a_last : a_first);
N[idx] = static_cast<int>(transb ? b_last : b_first);
[P2] Setup kernel uses wrong dimensions when columnwise data is selected. When select_grouped_operand switches to columnwise data for FP8 TN layout (lines 249-255), it flips the transpose flag but the setup kernel still uses first_dims/last_dims which describe the rowwise shape, not the columnwise shape. For example, if rowwise A is {K, M} and we switch to columnwise with trans=true, the kernel computes M[idx]=a_first=K and K[idx]=a_last=M (swapped). This bug is currently unreachable because grouped GEMM requires SM100+ (line 512) where non-TN FP8 is supported, so the TN-forcing code never executes. However, this creates a maintenance risk if the SM requirement is lowered or the pattern is copied elsewhere.
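The swap described in this comment can be reproduced with the three quoted assignments alone. The sketch below copies that selection logic into a hypothetical host function, `compute_dims`, returning a hypothetical `GemmDims` struct. It then feeds it a row-wise {K, M} shape twice, once with the original flag and once with the flag flipped but the shape left unchanged.

```cpp
#include <cassert>
#include <cstdint>

struct GemmDims {
  int m, n, k;
};

// Same selection logic as the quoted kernel lines: A is read as {K, M}
// when !transa and {M, K} when transa; B as {N, K} when !transb.
inline GemmDims compute_dims(int64_t a_first, int64_t a_last, bool transa,
                             int64_t b_first, int64_t b_last, bool transb) {
  assert(a_first > 0 && a_last > 0 && b_first > 0 && b_last > 0);
  GemmDims d;
  d.m = static_cast<int>(transa ? a_first : a_last);
  d.k = static_cast<int>(transa ? a_last : a_first);
  d.n = static_cast<int>(transb ? b_last : b_first);
  return d;
}
```

With row-wise A of shape {K=64, M=32} and B of shape {N=16, K=64}, `compute_dims(64, 32, false, 16, 64, false)` yields M=32, K=64, N=16. Flipping only the flag, `compute_dims(64, 32, true, 16, 64, false)`, yields M=64 and K=32, which is the swapped result the comment describes. Whatever redesign is chosen, the shape passed in and the transpose flag need to describe the same storage.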
Ahh, it's a bigger problem. It does not result in any bugs, but I think the code needs to be slightly redesigned.
Description
Adds nvte_grouped_gemm API using cuBLASLt grouped matmul for batched GEMM on tensors with varying shapes. A GPU kernel (setup_grouped_gemm_kernel) converts NVTEGroupedTensor format (contiguous buffer + offsets) to cuBLAS requirements (pointer arrays + per-matrix M/N/K).
New API
Computes D = alpha * op(A) @ op(B) + beta * C for groups of matrices with potentially different shapes.
Type of change
Changes
- GroupedGemmSetupWorkspace struct for cuBLAS workspace layout
- test_grouped_gemm.cu comparing against nvte_multi_tensor_gemm (FP8/BF16, various shapes and transpose layouts)
Checklist: