Add z-image-omni-base implementation #12857
base: main
Conversation
yiyixuxu left a comment:
Thanks a lot for the PR! I left some comments; mainly I'm trying to simplify the code in the transformer as much as possible by removing unused code paths, etc.
Let me know what you think :)
SEQ_MULTI_OF = 32
...
class TimestepEmbedder(nn.Module):
can we add a `# Copied from`?
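For reference, the diffusers `# Copied from` convention marks a class or function as a verbatim copy of an existing one so that `make fix-copies` can keep them in sync. A minimal sketch; the source path below is illustrative, not necessarily the one this code would reference:

```python
import torch.nn as nn

# Copied from diffusers.models.transformers.transformer_z_image.TimestepEmbedder
class TimestepEmbedder(nn.Module):
    """Annotated as a verbatim copy; the check tooling enforces that it stays identical to the source."""
    ...
```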
Fixed by merging into one transformer_z_image.
    return t_emb
...
class ZSingleStreamAttnProcessor:
can we add a `# Copied from`?
Same as before.
@maybe_allow_in_graph
class ZImageTransformerBlock(nn.Module):
Suggested change:
- class ZImageTransformerBlock(nn.Module):
+ class ZOmniImageTransformerBlock(nn.Module):
Ignored, since everything was merged into one transformer.
    adaln_clean: Optional[torch.Tensor] = None,
):
    if self.modulation:
        if noise_mask is not None and adaln_noisy is not None and adaln_clean is not None:
In the current codebase as of 4c14cf3 this check is needed, but it could be optimized with a redesign in the next PR.
    else:
        # Original global modulation
        assert adaln_input is not None
        scale_msa, gate_msa, scale_mlp, gate_mlp = self.adaLN_modulation(adaln_input).unsqueeze(1).chunk(4, dim=2)
        gate_msa, gate_mlp = gate_msa.tanh(), gate_mlp.tanh()
        scale_msa, scale_mlp = 1.0 + scale_msa, 1.0 + scale_mlp
can we remove this code path if it is not used?
After merging into one transformer, this path is needed.
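For context, a minimal sketch of how the two modulation paths can coexist in the merged block. The function name, signature, and the `torch.where`-based per-token selection are assumptions for illustration, not the PR's exact code:

```python
from typing import Optional

import torch
import torch.nn as nn


def compute_modulation(
    adaLN_modulation: nn.Module,                 # projection producing 4 * dim features
    adaln_input: Optional[torch.Tensor],         # (B, dim) global conditioning
    adaln_noisy: Optional[torch.Tensor] = None,  # (B, dim) conditioning for noised tokens
    adaln_clean: Optional[torch.Tensor] = None,  # (B, dim) conditioning for clean tokens
    noise_mask: Optional[torch.Tensor] = None,   # (B, L), 1 where a token is noised
):
    if noise_mask is not None and adaln_noisy is not None and adaln_clean is not None:
        # Per-token path: each token picks the noisy or the clean modulation params.
        mod_noisy = adaLN_modulation(adaln_noisy).unsqueeze(1)  # (B, 1, 4*dim)
        mod_clean = adaLN_modulation(adaln_clean).unsqueeze(1)  # (B, 1, 4*dim)
        mod = torch.where(noise_mask.unsqueeze(-1).bool(), mod_noisy, mod_clean)  # (B, L, 4*dim)
    else:
        # Global path: one modulation vector broadcast over the whole sequence.
        mod = adaLN_modulation(adaln_input).unsqueeze(1)  # (B, 1, 4*dim)
    scale_msa, gate_msa, scale_mlp, gate_mlp = mod.chunk(4, dim=2)
    gate_msa, gate_mlp = gate_msa.tanh(), gate_mlp.tanh()
    return 1.0 + scale_msa, gate_msa, 1.0 + scale_mlp, gate_mlp
```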
    patch_size=2,
    f_patch_size=1,
I don't think these two arguments are used in the pipeline; can we remove them? That could simplify the code a lot, I think, and it would also help remove the ModuleDict pattern.
Same as #12857 (comment)
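To illustrate the suggested simplification, a hypothetical before/after sketch; the class names, dimensions, and the exact ModuleDict layout are assumptions, not the PR's actual modules:

```python
import torch.nn as nn


# Before (assumed): one embedder per supported patch size, picked at runtime.
class PatchEmbedderDict(nn.Module):
    def __init__(self, in_channels=16, dim=1024, all_patch_size=(1, 2)):
        super().__init__()
        self.proj = nn.ModuleDict(
            {str(p): nn.Linear(in_channels * p * p, dim) for p in all_patch_size}
        )

    def forward(self, x, patch_size):
        return self.proj[str(patch_size)](x)


# After: a single fixed patch size, so one Linear embedder suffices.
class PatchEmbedder(nn.Module):
    def __init__(self, in_channels=16, dim=1024, patch_size=2):
        super().__init__()
        self.proj = nn.Linear(in_channels * patch_size * patch_size, dim)

    def forward(self, x):
        return self.proj(x)
```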
assert patch_size in self.all_patch_size
assert f_patch_size in self.all_f_patch_size
Same as #12857 (comment)
    cap_noise_mask,
    siglip_noise_mask
) = self.patchify_and_embed(
    x, cap_feats, siglip_feats, patch_size, f_patch_size, image_noise_mask
Suggested change:
-     x, cap_feats, siglip_feats, patch_size, f_patch_size, image_noise_mask
+     x, cap_feats, siglip_feats, image_noise_mask
    grids = torch.meshgrid(axes, indexing="ij")
    return torch.stack(grids, dim=-1)

def patchify_and_embed(
this method is really hard to follow here, do you think it's possible to break it into 3?
like

for x, cap_feat, siglip_feat in zip(all_x, all_cap_feats, all_siglip_feats):
    cap_item_cu_len = 1
    cap_padded, ..., cap_item_cu_len = self.patchify_and_embed_cap(...)
    all_cap_padded.append(cap_padded)
    x_padded, ..., cap_item_cu_len = self.patchify_and_embed_x(..., cap_item_cu_len)
    all_x_padded.append(x_padded)
    ...
    siglip_padded, ..., cap_item_cu_len = self.patchify_and_embed_siglip(..., cap_item_cu_len)
    all_siglip_padded.append(siglip_padded)
assert all(_ % SEQ_MULTI_OF == 0 for _ in x_item_seqlens)
x_max_item_seqlen = max(x_item_seqlens)
...
x = torch.cat(x, dim=0)
hopefully we can simplify to x = self.x_embedder(x) here
Same as #12857 (comment)
this gets forgotten all the time lol

diff --git a/src/diffusers/pipelines/auto_pipeline.py b/src/diffusers/pipelines/auto_pipeline.py
index db0268a2a..2c36ce36b 100644
--- a/src/diffusers/pipelines/auto_pipeline.py
+++ b/src/diffusers/pipelines/auto_pipeline.py
@@ -119,7 +119,7 @@ from .stable_diffusion_xl import (
)
from .wan import WanImageToVideoPipeline, WanPipeline, WanVideoToVideoPipeline
from .wuerstchen import WuerstchenCombinedPipeline, WuerstchenDecoderPipeline
-from .z_image import ZImageImg2ImgPipeline, ZImagePipeline
+from .z_image import ZImageImg2ImgPipeline, ZImageOmniPipeline, ZImagePipeline
AUTO_TEXT2IMAGE_PIPELINES_MAPPING = OrderedDict(
@@ -164,6 +164,7 @@ AUTO_TEXT2IMAGE_PIPELINES_MAPPING = OrderedDict(
("qwenimage", QwenImagePipeline),
("qwenimage-controlnet", QwenImageControlNetPipeline),
("z-image", ZImagePipeline),
+ ("z-image-omni", ZImageOmniPipeline),
]
)
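With the mapping entry in place, the new pipeline can also be resolved through the auto pipeline. A hedged usage sketch; the checkpoint id below is an assumption, not an official repo name:

```python
import torch
from diffusers import AutoPipelineForText2Image

# Assumed checkpoint id for illustration; the real repo id may differ.
pipe = AutoPipelineForText2Image.from_pretrained(
    "Tongyi-MAI/Z-Image-Omni-Base", torch_dtype=torch.bfloat16
).to("cuda")

image = pipe(prompt="a watercolor fox in a misty forest").images[0]
image.save("fox.png")
```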
Thanks for the useful comments yiyi, I'll review these and make the fixes today ~ 😊
Hi @yiyixuxu, this branch is ready to merge 😊. It should address most of your earlier concerns (including the `Copied from` annotations, cond_latents, auto_pipeline, and styling) by merging everything into one transformer model and incorporating the new features from the main branch on top of the starting point. More feature updates and code cleanup will come in another PR; you can review the current state and leave comments, and I'll push further updates asap ~ Thanks!!!
Thanks!! Fixed in 4c14cf3 ~
- Add select_per_token function for per-token value selection
- Separate adaptive modulation logic
- Clean up t_noisy/clean variable naming
- Move image_noise_mask handling from forward to the pipeline
(force-pushed from 70bc2c8 to 5bc676c)
Ready, let's merge it at 732c527 ~ 😊
What does this PR do?
This PR adds support for the Z-Image-Omni-Base model. Z-Image-Omni-Base is a foundation model designed for easy fine-tuning, unifying core capabilities in both image generation and editing to empower the community to explore custom development and innovative applications.
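For a sense of how it would be used, a hypothetical text-to-image sketch following the usual diffusers pipeline pattern. The top-level import, checkpoint id, and call arguments are assumptions; the editing inputs are not shown because their exact signature isn't covered in this excerpt:

```python
import torch
from diffusers import ZImageOmniPipeline  # assumed to be exported at the top level

pipe = ZImageOmniPipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Omni-Base",  # assumed checkpoint id
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(
    prompt="a cozy cabin in the snow at dusk",
    num_inference_steps=28,  # assumed value
    guidance_scale=4.0,      # assumed value
).images[0]
image.save("cabin.png")
```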
Who can review?
@yiyixuxu @apolinario @JerryWu-code