
Conversation

@dimitrovmaksim (Member) commented on Dec 16, 2025

Resolves #6220, resolves #6221

Issue: #6220

Currently, the can_copy_from validation is only performed when a deployment is considered new, i.e. when no deployment_schemas record exists yet for the given Qm hash.

In the code here, the graft base layout is assigned only when the deployment is new. If a deployment_schemas record already exists, the code assumes the deployment was previously created and copied successfully, and therefore sets graft_base_layout to None.

However, if can_copy_from later fails (see here), the transaction rolls back every change except the record in the deployment_schemas table. This leaves the deployment in a partially created state even though the copy was never attempted.

When the same deployment is attempted again, it is no longer treated as new. As a result, graft_base_layout is set to None and the can_copy_from check is skipped. The deployment then proceeds as if it were valid, the copy process starts, and it fails again, ultimately leaving the deployment stuck in a "never synced" state.
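In rough pseudocode, the pre-fix logic looks like this (my reconstruction, not actual graph-node code; only graft_base_layout and can_copy_from are real names, and the Result-returning signature is assumed):

// Reconstruction of the buggy flow. `schema_exists` stands for the
// deployment_schemas lookup by Qm hash; that row survives the rollback
// of a failed first attempt, which poisons every later retry.
let graft_base_layout = if schema_exists {
    None // treated as "already created and copied", so no re-validation
} else {
    Some(load_base_layout()?) // hypothetical helper
};
if let Some(base) = &graft_base_layout {
    // Only reached on the "new" path; on a re-deploy this is skipped
    // and the invalid graft proceeds straight to the doomed copy.
    layout.can_copy_from(base)?;
}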

What this PR changes:

Move the can_copy_from check and the site allocation into a transaction so that a failing check fully rolls back the allocation (see the sketch after this list).

Remove the duplicate/late compatibility check from deployment_store.rs

Use Catalog + Layout::new(...) to construct the layout and run layout.can_copy_from(&base_layout) before committing.

Minor cleanups (string formatting improvements, unused variables removed).

Remove the can_copy_from check from copy_deployment: it passed the source layout as the graft base and then compared it against the destination layout, which is redundant since the two layouts are identical.
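A rough sketch of the shape after the change, assuming a diesel-style transaction closure; apart from Catalog, Layout::new, and can_copy_from, the names and signatures here are my assumptions, not the PR's actual code:

// Illustrative only. A failure anywhere in the closure rolls back the
// site allocation together with everything else, so a re-deploy starts
// from a clean slate. `catalog` is assumed to be built beforehand.
conn.transaction::<_, StoreError, _>(|conn| {
    let site = allocate_site(conn, &deployment)?; // hypothetical helper
    let layout = Layout::new(site.clone(), &schema, catalog)?; // signature assumed
    if let Some(base_layout) = &graft_base_layout {
        layout.can_copy_from(base_layout)?; // validated before commit
    }
    Ok(site)
})?;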

Result

Now every re-deploy of a failed deployment should behave like a brand-new deployment, and the if !exists check should behave correctly.

Note: this approach may leave gaps in deployment_schemas_id_seq.

@dimitrovmaksim force-pushed the fix/resolve-grafting-bugs branch from 0738b0e to 103da40 on December 16, 2025 at 21:39
@dimitrovmaksim changed the title from "[WIP] Fix: Resolve grafting bugs" to "Fix: Resolve grafting bugs" on Dec 16, 2025
@dimitrovmaksim self-assigned this on Dec 16, 2025
Signed-off-by: Maksim Dimitrov <dimitrov.maksim@gmail.com>
@dimitrovmaksim force-pushed the fix/resolve-grafting-bugs branch from 103da40 to 8926e4b on January 6, 2026 at 11:05
@dimitrovmaksim requested a review from lutter on January 6, 2026 at 11:16
@fordN requested review from isum and lutter and then removed those requests on January 8, 2026 at 16:50
     column: &str,
 ) -> Result<Vec<i64>, StoreError> {
-    const QUERY: &str = "select histogram_bounds::text::int8[] bounds \
+    const QUERY: &str = "select coalesce(histogram_bounds::text::int8[], '{}'::int8[]) as bounds \
Collaborator


woops .. nice!
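For context (my note, not from the thread): pg_stats.histogram_bounds is NULL whenever Postgres has not built a histogram for the column, e.g. for a never-analyzed table or one with too few distinct values. The old query then returned a NULL bounds value that cannot be decoded as Vec<i64>; the coalesce maps that case to an empty array, which on the Rust side amounts to:

// Illustration only, not code from the PR: what the SQL-side
// coalesce(histogram_bounds..., '{}'::int8[]) achieves once the row
// reaches Rust.
fn decode_bounds(raw: Option<Vec<i64>>) -> Vec<i64> {
    raw.unwrap_or_default() // NULL bounds become an empty Vec
}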

 self.primary_conn()
     .await?
-    .record_active_copy(graft_base.site.as_ref(), site.as_ref())
+    .record_active_copy(graft_base_layout.site.as_ref(), site.as_ref())
Collaborator


Shouldn't this also happen in the transaction above? Otherwise, can't we end up in a situation where we set up everything except recording the active copy if graph-node gets killed at the wrong moment?
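A sketch of the shape the question seems to point at (my reading; whether the primary-shard bookkeeping can actually share this transaction is exactly the open question, and all names besides record_active_copy are assumptions):

// Illustrative only: do the active-copy bookkeeping inside the same
// transaction as the rest of the setup, so a crash cannot leave the
// deployment created but the copy unrecorded.
conn.transaction::<_, StoreError, _>(|conn| {
    // ... site allocation, layout creation, can_copy_from ...
    record_active_copy(conn, graft_base_layout.site.as_ref(), site.as_ref())?;
    Ok(())
})?;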

 let base_layout = self.layout(graft_base).await?;
 let entities_with_causality_region =
     deployment.manifest.entities_with_causality_region.clone();
 let catalog = Catalog::for_tests(
Collaborator


This is wrong - for_tests doesn't actually check the database for its capabilities (and the method should be marked as #[cfg(debug_assertions)]). Instead, this needs to use Catalog::for_creation where the connection is a connection to the shard in which we will create the subgraph. It's best to create the catalog object outside of this transaction so that we don't hold multiple db connections at once.
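Roughly the shape being suggested, if I read the comment right (Catalog::for_creation's exact signature is an assumption, as are the surrounding helpers):

// Build the catalog from a connection to the destination shard, outside
// the transaction, so we never hold more than one db connection at once.
let catalog = Catalog::for_creation(&mut conn, site.clone())?; // signature assumed

conn.transaction::<_, StoreError, _>(|conn| {
    let layout = Layout::new(site.clone(), &schema, catalog)?;
    layout.can_copy_from(&base_layout)?;
    // ... create the deployment_schemas row and the rest of the setup ...
    Ok(())
})?;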
