
model parallelism unit test failed for modeling_pe_audio_video.py and modeling_pe_video.py #42918

@kaixuanliu

Description


System Info

- `transformers` version: 5.0.0.dev0
- Platform: Linux-5.4.292-1.el8.elrepo.x86_64-x86_64-with-glibc2.35
- Python version: 3.10.12
- Huggingface_hub version: 1.2.2
- Safetensors version: 0.6.2
- Accelerate version: 1.8.1
- Accelerate config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: FSDP
        - mixed_precision: no
        - use_cpu: False
        - debug: False
        - num_processes: 4
        - machine_rank: 0
        - num_machines: 1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - fsdp_config: {'fsdp_activation_checkpointing': False, 'fsdp_auto_wrap_policy': 'TRANSFORMER_BASED_WRAP', 'fsdp_cpu_ram_efficient_loading': True, 'fsdp_offload_params': False, 'fsdp_reshard_after_forward': True, 'fsdp_state_dict_type': 'SHARDED_STATE_DICT', 'fsdp_transformer_layer_cls_to_wrap': 'Gemma3DecoderLayer', 'fsdp_version': 2}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []
- DeepSpeed version: not installed
- PyTorch version (accelerator?): 2.8.0+cu128 (CUDA)
- Using distributed or parallel set-up in script?: <fill in>
- Using GPU in script?: <fill in>
- GPU type: NVIDIA A100 80GB PCIe

Who can help?

@zucchini-nlp When I run unit tests such as tests/models/pe_audio_video/test_modeling_pe_audio_video.py::PeAudioVideoEncoderTest::test_model_parallelism and tests/models/pe_video/test_modeling_pe_video.py::PeVideoEncoderTest::test_model_parallelism, they fail with the following error:


tests/test_modeling_common.py:2529:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1773: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1784: in _call_impl
    return forward_call(*args, **kwargs)
/root/ft_test/frameworks.ai.client-ai.hf-accelerate-client/src/accelerate/hooks.py:175: in new_forward
    output = module._old_forward(*args, **kwargs)
src/transformers/utils/generic.py:790: in wrapper
    output = func(self, *args, **kwargs)
src/transformers/utils/generic.py:945: in wrapper
    outputs = func(self, *args, **kwargs)
src/transformers/models/pe_audio_video/modeling_pe_audio_video.py:593: in forward
    inputs_embeds, padding_mask, audio_output, video_output = self.embedder(
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1773: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1784: in _call_impl
    return forward_call(*args, **kwargs)
src/transformers/models/pe_audio_video/modeling_pe_audio_video.py:223: in forward
    video_output = self.video_encoder(pixel_values_videos, padding_mask_videos=padding_mask_videos)
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1773: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1784: in _call_impl
    return forward_call(*args, **kwargs)
src/transformers/utils/generic.py:790: in wrapper
    output = func(self, *args, **kwargs)
src/transformers/utils/generic.py:945: in wrapper
    outputs = func(self, *args, **kwargs)
src/transformers/models/pe_video/modeling_pe_video.py:530: in forward
    inputs_embeds, padding_mask = self.embedder(pixel_values_videos, padding_mask=padding_mask_videos)
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1773: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1784: in _call_impl
    return forward_call(*args, **kwargs)
src/transformers/models/pe_video/modeling_pe_video.py:182: in forward
    vision_encoder_outputs = self.vision_model(pixel_values_videos)
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1773: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1784: in _call_impl
    return forward_call(*args, **kwargs)
src/transformers/models/timm_wrapper/modeling_timm_wrapper.py:360: in forward
    logits = self.timm_model(pixel_values, **kwargs)
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1773: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1784: in _call_impl
    return forward_call(*args, **kwargs)
/usr/local/lib/python3.10/dist-packages/timm/models/eva.py:993: in forward
    x = self.forward_features(x)
/usr/local/lib/python3.10/dist-packages/timm/models/eva.py:964: in forward_features
    x = blk(x, rope=rot_pos_embed)
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1773: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1784: in _call_impl
    return forward_call(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = EvaBlock(
  (norm1): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
  (attn): AttentionRope(
    (qkv): Linear(i...n_features=256, out_features=64, bias=True)
    (drop2): Dropout(p=0.0, inplace=False)
  )
  (drop_path2): Identity()
)
x = tensor([[[ 1.2520,  0.1755,  0.2867,  ..., -0.3618,  0.6712,  0.4061],
        [-0.3215, -0.1512, -0.8277,  ..., -0.9...
        [-0.5117, -0.0159, -1.1778,  ..., -0.6055, -0.6392,  0.1033]]],
       device='cuda:0', grad_fn=<AddBackward0>)
rope = tensor([[-0.9056, -0.9056, -0.9056, -0.9056,  0.4242,  0.4242,  0.4242,  0.4242]],
       device='cuda:0'), attn_mask = None

    def forward(
        self,
        x: torch.Tensor,
        rope: Optional[torch.Tensor] = None,
        attn_mask: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        if self.gamma_1 is None:
            x = x + self.drop_path1(self.attn(self.norm1(x), rope=rope, attn_mask=attn_mask))
>           x = x + self.drop_path2(self.mlp(self.norm2(x)))
E           RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!

The failing test test_model_parallelism is caused by the timm model (EvaBlock) being split across multiple devices during model parallelism. Since there is no mature way to handle model parallelism inside the timm model, I think a suitable fix would be to add TimmWrapperForImageClassification to _no_split_modules, as in the draft PR 42917. However, that leaves almost nothing else to split, so alternatively we could skip the related unit tests. WDYT?
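For reference, a minimal sketch of what the _no_split_modules option would look like (illustrative only; the real pretrained model classes in modeling_pe_video.py / modeling_pe_audio_video.py define other attributes as well, and PR 42917 contains the actual change):

    from transformers import PreTrainedModel


    class PeVideoPreTrainedModel(PreTrainedModel):
        # Keep the wrapped timm backbone on a single device so accelerate's
        # device-map inference never places its EvaBlock layers on different GPUs.
        _no_split_modules = ["TimmWrapperForImageClassification"]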

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

git clone https://github.com/huggingface/transformers.git
cd transformers
pip install -e .
pytest -rA tests/models/pe_audio_video/test_modeling_pe_audio_video.py::PeAudioVideoEncoderTest::test_model_parallelism

Expected behavior

The unit tests should either pass or be skipped.
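If skipping is the preferred option, a minimal sketch of how that could look in the model test classes (the real PeVideoEncoderTest / PeAudioVideoEncoderTest also inherit the common model tester mixins; only the override is shown here):

    import unittest


    class PeVideoEncoderTest(unittest.TestCase):
        # Override the inherited common test and skip it, since the timm
        # EvaBlock backbone cannot be split across devices.
        @unittest.skip(reason="timm backbone (EvaBlock) cannot be split across devices")
        def test_model_parallelism(self):
            pass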
