
model parallelism unit test failed for modeling_pe_audio_video.py and modeling_pe_video.py #42918

@kaixuanliu

Description


System Info

- `transformers` version: 5.0.0.dev0
- Platform: Linux-5.4.292-1.el8.elrepo.x86_64-x86_64-with-glibc2.35
- Python version: 3.10.12
- Huggingface_hub version: 1.2.2
- Safetensors version: 0.6.2
- Accelerate version: 1.8.1
- Accelerate config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: FSDP
        - mixed_precision: no
        - use_cpu: False
        - debug: False
        - num_processes: 4
        - machine_rank: 0
        - num_machines: 1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - fsdp_config: {'fsdp_activation_checkpointing': False, 'fsdp_auto_wrap_policy': 'TRANSFORMER_BASED_WRAP', 'fsdp_cpu_ram_efficient_loading': True, 'fsdp_offload_params': False, 'fsdp_reshard_after_forward': True, 'fsdp_state_dict_type': 'SHARDED_STATE_DICT', 'fsdp_transformer_layer_cls_to_wrap': 'Gemma3DecoderLayer', 'fsdp_version': 2}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []
- DeepSpeed version: not installed
- PyTorch version (accelerator?): 2.8.0+cu128 (CUDA)
- Using distributed or parallel set-up in script?: <fill in>
- Using GPU in script?: <fill in>
- GPU type: NVIDIA A100 80GB PCIe

Who can help?

@zucchini-nlp When I run unit tests such as tests/models/pe_audio_video/test_modeling_pe_audio_video.py::PeAudioVideoEncoderTest::test_model_parallelism and tests/models/pe_video/test_modeling_pe_video.py::PeVideoEncoderTest::test_model_parallelism, they fail with the following error:


tests/test_modeling_common.py:2529:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1773: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1784: in _call_impl
    return forward_call(*args, **kwargs)
/root/ft_test/frameworks.ai.client-ai.hf-accelerate-client/src/accelerate/hooks.py:175: in new_forward
    output = module._old_forward(*args, **kwargs)
src/transformers/utils/generic.py:790: in wrapper
    output = func(self, *args, **kwargs)
src/transformers/utils/generic.py:945: in wrapper
    outputs = func(self, *args, **kwargs)
src/transformers/models/pe_audio_video/modeling_pe_audio_video.py:593: in forward
    inputs_embeds, padding_mask, audio_output, video_output = self.embedder(
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1773: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1784: in _call_impl
    return forward_call(*args, **kwargs)
src/transformers/models/pe_audio_video/modeling_pe_audio_video.py:223: in forward
    video_output = self.video_encoder(pixel_values_videos, padding_mask_videos=padding_mask_videos)
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1773: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1784: in _call_impl
    return forward_call(*args, **kwargs)
src/transformers/utils/generic.py:790: in wrapper
    output = func(self, *args, **kwargs)
src/transformers/utils/generic.py:945: in wrapper
    outputs = func(self, *args, **kwargs)
src/transformers/models/pe_video/modeling_pe_video.py:530: in forward
    inputs_embeds, padding_mask = self.embedder(pixel_values_videos, padding_mask=padding_mask_videos)
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1773: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1784: in _call_impl
    return forward_call(*args, **kwargs)
src/transformers/models/pe_video/modeling_pe_video.py:182: in forward
    vision_encoder_outputs = self.vision_model(pixel_values_videos)
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1773: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1784: in _call_impl
    return forward_call(*args, **kwargs)
src/transformers/models/timm_wrapper/modeling_timm_wrapper.py:360: in forward
    logits = self.timm_model(pixel_values, **kwargs)
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1773: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1784: in _call_impl
    return forward_call(*args, **kwargs)
/usr/local/lib/python3.10/dist-packages/timm/models/eva.py:993: in forward
    x = self.forward_features(x)
/usr/local/lib/python3.10/dist-packages/timm/models/eva.py:964: in forward_features
    x = blk(x, rope=rot_pos_embed)
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1773: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1784: in _call_impl
    return forward_call(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = EvaBlock(
  (norm1): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
  (attn): AttentionRope(
    (qkv): Linear(i...n_features=256, out_features=64, bias=True)
    (drop2): Dropout(p=0.0, inplace=False)
  )
  (drop_path2): Identity()
)
x = tensor([[[ 1.2520,  0.1755,  0.2867,  ..., -0.3618,  0.6712,  0.4061],
        [-0.3215, -0.1512, -0.8277,  ..., -0.9...
        [-0.5117, -0.0159, -1.1778,  ..., -0.6055, -0.6392,  0.1033]]],
       device='cuda:0', grad_fn=<AddBackward0>)
rope = tensor([[-0.9056, -0.9056, -0.9056, -0.9056,  0.4242,  0.4242,  0.4242,  0.4242]],
       device='cuda:0'), attn_mask = None

    def forward(
        self,
        x: torch.Tensor,
        rope: Optional[torch.Tensor] = None,
        attn_mask: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        if self.gamma_1 is None:
            x = x + self.drop_path1(self.attn(self.norm1(x), rope=rope, attn_mask=attn_mask))
>           x = x + self.drop_path2(self.mlp(self.norm2(x)))
E           RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!

The failing test test_model_parallelism is caused by the timm model (EvaBlock) being split across multiple devices during model parallelism. Since there is no mature way to handle model parallelism inside the timm model, I think a suitable fix would be to add TimmWrapperForImageClassification to _no_split_modules, as in the draft PR 42917. However, that leaves almost nothing else to split, so alternatively we could skip the related unit tests. WDYT?
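For reference, a minimal sketch of what the _no_split_modules option would look like (illustrative only; the real pretrained model classes in modeling_pe_video.py / modeling_pe_audio_video.py define other attributes as well, and PR 42917 contains the actual change):

    from transformers import PreTrainedModel


    class PeVideoPreTrainedModel(PreTrainedModel):
        # Keep the wrapped timm backbone on a single device so accelerate's
        # device-map inference never places its EvaBlock layers on different GPUs.
        _no_split_modules = ["TimmWrapperForImageClassification"]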

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

git clone https://github.com/huggingface/transformers.git
cd transformers
pip install -e .
pytest -rA tests/models/pe_audio_video/test_modeling_pe_audio_video.py::PeAudioVideoEncoderTest::test_model_parallelism

Expected behavior

The unit tests should either pass or be skipped.
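If skipping is the preferred option, a minimal sketch of how that could look in the model test classes (the real PeVideoEncoderTest / PeAudioVideoEncoderTest also inherit the common model tester mixins; only the override is shown here):

    import unittest


    class PeVideoEncoderTest(unittest.TestCase):
        # Override the inherited common test and skip it, since the timm
        # EvaBlock backbone cannot be split across devices.
        @unittest.skip(reason="timm backbone (EvaBlock) cannot be split across devices")
        def test_model_parallelism(self):
            pass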
