feat: High availabilty via RAFT #2987

alpe · 2026-01-16T20:36:43Z

Replaces #2954

* main: fix: remove duplicate error logging in light node shutdown (#2841) chore: fix incorrect function name in comment (#2840) chore: remove sequencer go.mod (#2837)

* main: build(deps): Bump the go_modules group across 2 directories with 3 updates (#2846) build(deps): Bump github.com/dvsekhvalnov/jose2go from 1.7.0 to 1.8.0 in /test/e2e (#2851) build(deps): Bump github.com/consensys/gnark-crypto from 0.18.0 to 0.18.1 in /test/e2e (#2844) build(deps): Bump github.com/cometbft/cometbft from 0.38.17 to 0.38.19 in /test/e2e (#2843) build(deps): Bump github.com/dvsekhvalnov/jose2go from 1.6.0 to 1.7.0 in /test/e2e (#2845)

(cherry picked from commit c44cd77e665f6d5d463295c6ed61c59a56d88db3)

* main: chore: reduce log noise (#2864) fix: sync service for non zero height starts with empty store (#2834) build(deps): Bump golang.org/x/crypto from 0.43.0 to 0.45.0 in /execution/evm (#2861) chore: minor improvement for docs (#2862)

* main: chore: bump da (#2866) chore: bump core (#2865)

* main: chore: fix some comments (#2874) chore: bump node in evm-single (#2875) refactor(syncer,cache): use compare and swap loop and add comments (#2873) refactor: use state da height as well (#2872) refactor: retrieve highest da height in cache (#2870) chore: change from event count to start and end height (#2871)

* main: chore: remove extra github action yml file (#2882) fix(execution/evm): verify payload status (#2863) feat: fetch included da height from store (#2880) chore: better output on errors (#2879) refactor!: create da client and split cache interface (#2878) chore!: rename `evm-single` and `grpc-single` (#2839) build(deps): Bump golang.org/x/crypto from 0.42.0 to 0.45.0 in /tools/da-debug in the go_modules group across 1 directory (#2876) chore: parallel cache de/serialization (#2868) chore: bump blob size (#2877)

* main: build(deps): Bump mdast-util-to-hast from 13.2.0 to 13.2.1 in /docs in the npm_and_yarn group across 1 directory (#2900) refactor(block): centralize timeout in client (#2903) build(deps): Bump the all-go group across 2 directories with 3 updates (#2898) chore: bump default timeout (#2902) fix: revert default db (#2897) refactor: remove obsolete // +build tag (#2899) fix:da visualiser namespace (#2895) refactor: omit unnecessary reassignment (#2892) build(deps): Bump the all-go group across 5 directories with 6 updates (#2881) chore: fix inconsistent method name in retryWithBackoffOnPayloadStatus comment (#2889) fix: ensure consistent network ID usage in P2P subscriber (#2884) build(deps): Bump golangci/golangci-lint-action from 9.0.0 to 9.1.0 (#2885) build(deps): Bump actions/checkout from 5 to 6 (#2886)

* main: (34 commits) feat: make reaper poll duration configurable (#2951) chore!: move sequencers to pkg (#2931) feat: Ensure Header integrity on DA (#2948) feat(testda): add header support with GetHeaderByHeight method (#2946) chore: improve code comments clarity (#2947) chore(sequencers): optimize store check (#2945) fix: make evm_execution more robust (#2942) fix(sequencers/single): deterministic queue (#2938) fix(block): fix init logic sequencer for da epoch fetching (#2926) feat: use DA timestamp (#2939) chore: improve code comments clarity (#2943) build(deps): bump libp2p (#2937) build(deps): Bump actions/cache from 4 to 5 (#2934) build(deps): Bump actions/download-artifact from 6 to 7 (#2933) build(deps): Bump actions/upload-artifact from 5 to 6 (#2932) feat: DA Client remove interface part 3, replace types with new code (#2910) DA Client remove interface: Part 2.5, create e2e test to validate that a blob is posted in DA layer. (#2920) fix(syncing): skip forced txs checks for p2p blocks (#2922) build(deps): Bump the all-go group across 5 directories with 5 updates (#2919) chore: loosen syncer state check (#2927) ...

* main: fix: Fail fast when executor ahead (#2966) feat(block): async epoch fetching (#2952) perf: tune badger defaults and add db bench (#2950)

* main: chore: adding syncing tracing (#2981) feat(tracing): adding block production tracing (#2980) feat(tracing): Add Store, P2P and Config tracing (#2972) chore: fix upgrade test (#2979) build(deps): Bump github.com/ethereum/go-ethereum from 1.16.7 to 1.16.8 in /execution/evm/test in the go_modules group across 1 directory (#2974) feat(tracing): adding tracing to DA client (#2968) chore: create onboarding skill (#2971) test: add e2e tests for force inclusion (part 2) (#2970) feat(tracing): adding eth client tracing (#2960) test: add e2e tests for force inclusion (#2964) build(deps): Bump the all-go group across 4 directories with 10 updates (#2969)

* main: fix: inconsistent state detection and rollback (#2983) chore: improve graceful shutdown restarts (#2985) feat(submitting): add posting strategies (#2973)

claude · 2026-01-16T20:37:00Z

Claude finished @alpe's task in 3m 8s —— View job

Code Review: RAFT High Availability Implementation

I've completed a comprehensive review of this RAFT high availability implementation. This is a significant and well-architected feature that introduces critical resilience capabilities to the sequencer. Below are my findings organized by category.

Summary

This PR successfully implements a production-ready RAFT consensus layer for high availability, enabling dynamic leader election and automatic failover between aggregator (leader) and sync (follower) modes. The implementation shows strong attention to safety, particularly around preventing double-signing and state divergence. The e2e tests demonstrate thorough validation of failover scenarios.

Key Strengths:

Strong safety guarantees (FSM synchronization before leadership, quorum checks before block production)
Comprehensive state validation and recovery mechanisms
Well-structured failover architecture with clean separation of concerns
Excellent test coverage for critical failover scenarios

Areas for Improvement:

Test coverage gaps (34% patch coverage)
Missing error context in some critical paths
Configuration validation could be more user-friendly
Some concurrency patterns need hardening

Critical Issues

1. Race Condition in Dynamic Leader Election (pkg/raft/election.go)

Location: pkg/raft/election.go:86-90,158-178

The ticker channel is set to nil after firing, but there's a window where both the leaderCh and ticker.C could fire simultaneously, potentially causing race conditions in worker initialization.

case <-ticker.C:
    // ...
    ticker.Stop()
    ticker.C = nil  // Setting to nil, but select could already be evaluating

Recommendation: Use atomic operations or a mutex to protect the ticker state transitions.

Severity: Medium - Could cause initialization issues during cluster startup

2. Insufficient Error Context in State Validation (block/internal/executing/executor.go, block/internal/syncing/syncer.go)

Locations:

executor.go:241-261
syncer.go:829-853

Critical state validation errors lack sufficient context for debugging production issues. When hash mismatches occur, operators need to know:

Which node detected the mismatch
The full state at both heights
Whether this is a cluster-wide issue or isolated divergence

Current:

return fmt.Errorf("invalid state: block hash mismatch at height %d: raft=%x local=%x", 
    state.LastBlockHeight, raftState.Hash, header.Hash())

Recommendation:

return fmt.Errorf("state divergence detected: block hash mismatch at height %d: "+
    "raft_hash=%x local_hash=%x node_id=%s cluster_state=height:%d raft_index=%d "+
    "This indicates a critical consistency violation requiring manual intervention",
    state.LastBlockHeight, raftState.Hash, header.Hash(), n.NodeID(), 
    raftState.Height, raftState.Index)

Severity: High - Impacts production debugging and incident response

3. Potential Deadlock in Failover State (node/failover.go)

Location: node/failover.go:180-235

The Run method uses errgroup.WithContext but the shutdown sequence in the deferred function could deadlock if bc.Stop() blocks while the context is already cancelled.

defer func() {
    if err := f.bc.Stop(); err != nil && !errors.Is(err, context.Canceled) {
        multiErr = errors.Join(multiErr, fmt.Errorf("stopping block components: %w", err))
    }
}()

Recommendation: Use a separate context with timeout for shutdown operations:

defer func() {
    shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    if err := f.bc.Stop(); err != nil && !errors.Is(err, context.Canceled) {
        multiErr = errors.Join(multiErr, fmt.Errorf("stopping block components: %w", err))
    }
}()

Severity: Medium - Could cause graceful shutdown failures

High Priority Issues

4. Missing Nil Check for Raft Node (pkg/raft/election.go)

Location: pkg/raft/election.go:98-99, 109

The code checks leaderID() != NodeID() but leaderID() could return empty string if raft is not ready. This should be explicitly handled.

if d.node.leaderID() != d.node.NodeID() {
    d.logger.Info().Msg("lost leadership during sync wait")
    continue
}

Recommendation: Add explicit check:

leaderID := d.node.leaderID()
if leaderID == "" {
    d.logger.Warn().Msg("raft cluster not ready, no leader elected yet")
    continue
}
if leaderID != d.node.NodeID() {
    d.logger.Info().Msg("lost leadership during sync wait")
    continue
}

Severity: Medium - Could cause confusing behavior during cluster initialization

5. Configuration Error Messages Not User-Friendly (node/full.go, pkg/config/config.go)

Location: node/full.go:124

The error message "raft config must be used in sequencer setup only" is unclear. It doesn't explain that raft requires the aggregator flag.

Current:

return nil, fmt.Errorf("raft config must be used in sequencer setup only")

Recommendation:

return nil, fmt.Errorf("raft can only be enabled for aggregator nodes: "+
    "set --node.aggregator=true when using --raft.enable=true. "+
    "Non-aggregator nodes (full nodes/light clients) cannot participate in raft consensus")

Severity: Medium - Impacts developer experience and deployment troubleshooting

Note: This was also flagged by Gemini's review.

6. Duplicate Peer Handling Too Strict (pkg/raft/node.go)

Location: pkg/raft/node.go:139-141

The code errors on duplicate peers instead of deduplicating and warning. This makes configuration more brittle than necessary.

Current:

if svrs := deduplicateServers(cfg.Servers); len(svrs) != len(cfg.Servers) {
    return fmt.Errorf("duplicate peers found in config: %v", cfg.Servers)
}

Recommendation: Log warning and proceed:

if svrs := deduplicateServers(cfg.Servers); len(svrs) != len(cfg.Servers) {
    n.logger.Warn().Msgf("duplicate peers found in config, using deduplicated list: %v -> %v", 
        cfg.Servers, svrs)
    cfg.Servers = svrs
}

Severity: Low - Quality of life improvement

Note: This was also flagged by Gemini's review.

Medium Priority Issues

7. Unclear Recovery Logic (block/internal/executing/executor.go)

Location: executor.go:282-322

The recovery logic in initializeState performs multiple state checks but the flow is hard to follow. The nested if statements checking raft state could benefit from early returns.

Recommendation: Refactor for clarity:

if e.raftNode == nil {
    // No raft, proceed with standard initialization
    e.setLastState(state)
    e.sequencer.SetDAHeight(state.DAHeight)
    return e.syncExecutionLayer(state)
}

// Raft-enabled path
raftState := e.raftNode.GetState()
if err := e.validateStateAgainstRaft(state, raftState); err != nil {
    return err
}
syncTarget := max(state.LastBlockHeight, raftState.Height)
return e.syncToHeight(syncTarget)

Severity: Medium - Code maintainability

8. Missing Logging for Critical State Transitions (pkg/raft/election.go)

Location: election.go:93-139

When transitioning from follower to leader, there's extensive checking but missing logs for key decision points (e.g., when FSM lag causes leadership abdication).

Recommendation: Add structured logging:

if err := d.node.waitForMsgsLanded(d.node.Config().SendTimeout); err != nil {
    d.logger.Error().
        Err(err).
        Uint64("applied_index", d.node.raft.AppliedIndex()).
        Uint64("last_index", d.node.raft.LastIndex()).
        Msg("failed to wait for messages to land - FSM lagging, abdicating")
    // ...
}

Severity: Medium - Impacts production observability

9. Hardcoded Timeout Values (node/failover.go, pkg/raft/election.go)

Locations:

failover.go:183, 224 - 3-second timeouts
election.go:86 - 300ms ticker

These should be configurable for different deployment scenarios (local dev vs production).

Recommendation: Move to configuration:

type FailoverConfig struct {
    ShutdownTimeout time.Duration  // default: 3s
    SyncCheckInterval time.Duration // default: 300ms
}

Severity: Low - Deployment flexibility

Security Considerations

10. Raft Transport Security

Location: pkg/raft/node.go:88

The RAFT transport uses TCP without TLS. For production deployments, this means raft communication (including block data) is unencrypted.

Current:

transport, err := raft.NewTCPTransport(cfg.RaftAddr, addr, 3, 10*time.Second, os.Stderr)

Recommendation:

Document that raft should run on a trusted private network
Consider adding TLS support for the raft transport layer
Add configuration validation to warn if raft is exposed on public interfaces

Severity: Medium - Security posture for production deployments

11. No Authentication for Raft RPC Endpoint

Location: pkg/rpc/server/http.go (new /raft/node endpoint)

The new RPC endpoint exposing raft status has no authentication. While it only reveals leadership status, this could leak cluster topology information.

Recommendation:

Add optional authentication middleware for the raft endpoint
Consider rate limiting to prevent information disclosure attacks
Document security implications in deployment guides

Severity: Low - Information disclosure

Performance & Resource Efficiency

12. Unbounded Channel in FSM (pkg/raft/node.go)

Location: node.go:304-306, 327-333

The applyCh channel is set by the caller, and if full, updates are dropped with only a warning. This could cause follower nodes to miss state updates.

if f.applyCh != nil {
    select {
    case f.applyCh <- RaftApplyMsg{Index: log.Index, State: &state}:
    default:
        f.logger.Warn().Msg("apply channel full, dropping message")
    }
}

Recommendation:

Document the required buffer size for the apply channel (currently 100 in raft_retriever.go:71)
Add metrics to track dropped messages
Consider applying backpressure instead of silently dropping

Severity: Medium - Could cause state sync issues under load

13. Inefficient State Cloning (pkg/raft/node.go)

Location: node.go:255-257

Every GetState() call performs a full protobuf clone, which could be expensive if called frequently.

func (n *Node) GetState() *RaftBlockState {
    return proto.Clone(n.fsm.state.Load()).(*RaftBlockState)
}

Recommendation:

Profile to determine if this is a bottleneck
If so, implement copy-on-write or return immutable views
Add comment documenting why cloning is necessary (prevents callers from mutating shared state)

Severity: Low - Potential optimization opportunity

14. Goroutine Leak Potential (pkg/raft/election.go)

Location: election.go:67-84

The startWorker function spawns goroutines but relies on wg.Wait() in the defer. If the function returns early (e.g., due to panic), goroutines could leak.

Recommendation:

Add recovery handling in the worker goroutine
Ensure context cancellation propagates correctly
Add goroutine leak detection in tests

Severity: Low - Defensive programming

Testing & Reliability

15. Low Test Coverage (34.04%)

Key gaps:

pkg/raft/node.go: 11.05% (193 lines missing)
block/internal/executing/executor.go: Many raft-related paths untested
block/internal/syncing/syncer.go: Raft recovery paths at 0%

Recommendation:

Add unit tests for raft state machine transitions
Test all error paths in executor/syncer raft integration
Add tests for edge cases (e.g., raft ahead by >1 block, hash mismatches)
Consider using hashicorp/raft test helpers for cluster simulation

Severity: High - Critical for production confidence

16. Missing Tests for Concurrent State Access (block/internal/executing/executor.go, block/internal/syncing/syncer.go)

Both executor and syncer use atomic pointers for lastState, but there are no tests verifying thread-safety of state updates during raft transitions.

Recommendation:

Add race detector tests for state transitions
Test concurrent reads during leader elections
Verify no torn reads during failover

Severity: Medium - Concurrency correctness

17. E2E Tests Could Be More Comprehensive

Location: test/e2e/failover_e2e_test.go

The e2e tests cover basic failover but could test:

Multiple sequential failovers
Split-brain scenarios (network partitions)
Follower crashes and recovery
Rolling upgrades
Byzantine scenarios (malicious raft messages)

Recommendation: Expand test matrix to cover failure modes listed above.

Severity: Medium - Production resilience validation

Architecture & Code Quality

18. Raft State Validation Logic Duplicated

Locations:

block/internal/executing/executor.go:236-261, 305-322
block/internal/syncing/syncer.go:1252-1277

Both executor and syncer have nearly identical raft state validation logic. This should be extracted to a shared utility.

Recommendation:

// block/internal/common/raft_validator.go
func ValidateStateAgainstRaft(
    localHeight uint64,
    localHash []byte,
    raftState *raft.RaftBlockState,
    store store.Store,
) error {
    // Shared validation logic
}

Severity: Low - Code maintainability and DRY principle

19. Interface Segregation (block/internal/common/raft.go)

Location: block/internal/common/raft.go:9-18

The RaftNode interface is minimal but could be split further:

Leader-only operations (Broadcast, HasQuorum)
Observer operations (IsLeader, GetState)
Follower operations (SetApplyCallback)

Recommendation: Consider splitting if future features require more granular permissions.

Severity: Low - Future extensibility

20. Magic Numbers in Raft Configuration

Locations:

pkg/raft/node.go:88 - maxPool=3, timeout=10*time.Second
pkg/raft/node.go:78 - snapCount passed but not validated

Recommendation:

Extract as named constants with documentation
Add validation for sensible ranges (e.g., snapCount > 0)

Severity: Low - Code clarity

Documentation & Developer Experience

21. Missing Documentation for Recovery Scenarios

The recovery logic in IsSynced and Recover methods is complex but lacks documentation on:

When recovery is triggered
What recovery does (rollback vs replay)
Expected behavior for different height deltas

Recommendation: Add comprehensive godoc:

// IsSyncedWithRaft checks if local state matches raft consensus state.
// Returns: 
//   - 0 if in sync (heights and hashes match)
//   - negative if local is behind raft (safe, will catch up via normal sync)
//   - positive if local is ahead of raft (CRITICAL: requires rollback)
// 
// Hash mismatches at the same height indicate state divergence and return an error.
func (s *Syncer) IsSyncedWithRaft(raftState *raft.RaftBlockState) (int, error)

Severity: Medium - Critical for maintainers

22. Configuration Documentation Gaps

The new raft configuration flags lack comprehensive documentation:

What happens if --raft.bootstrap is set on all nodes?
How to add/remove nodes from an existing cluster?
Recovery procedures for lost quorum

Recommendation:

Add detailed comments to pkg/config/config.go raft section
Create docs/raft-operations.md with runbooks for common scenarios
Add validation to prevent misconfiguration (e.g., bootstrap without peers)

Severity: Medium - Operational readiness

23. Unclear TODO Comment (block/internal/syncing/syncer.go)

Location: syncer.go:654

if event.Source == common.SourceDA { // TODO(@julienrbrt): To be reverted once DA Hints are merged

This TODO references PR #2891 but doesn't explain:

Why this check exists
What DA Hints will change
Timeline for the revert

Recommendation: Expand comment with context or link to ADR/design doc.

Severity: Low - Code maintainability

Positive Observations

✅ Excellent Safety Mechanisms

FSM synchronization before leadership transition (election.go:96-105)
Quorum checks before block production (executor.go:411-413, 480-483)
Hash validation at every state transition
Grace period handling for forced inclusion transactions

✅ Clean Separation of Concerns

failoverState encapsulates mode-switching components cleanly
Dynamic leader election abstracted behind interface
Raft integration via minimal interface reduces coupling

✅ Comprehensive E2E Testing

Leader failover scenarios thoroughly tested
Rolling restart validation
DA gap verification
Double-signing prevention checks

✅ Strong Error Handling

Most error paths have context wrapping
Critical errors sent to dedicated channel
Proper cleanup in defer blocks

Recommendations Summary

Immediate (Before Merge)

Fix race condition in ticker handling (Issue ABCI interface #1)
Improve error context for state validation (Issue Peer discovery #2)
Add nil checks for raft leaderID (Issue Fraud proofs gossiping #4)
Improve configuration error messages (Issue Mempool #5)

Short Term (Next Sprint)

Increase test coverage to >60% (Issue Consider preparing native optimint arguments in cosmos-sdk #15)
Add concurrency tests (Issue Node address format #16)
Document recovery scenarios (Issue Describe peer discovery in lazy-adr #21)
Add raft operations guide (Issue Update lazy-adr 001 #22)

Medium Term (Next Release)

Consider TLS for raft transport (Issue Define Fraud Proof data type #10)
Add authentication for raft RPC endpoint (Issue Fold the validator roots into the state tree #11)
Make timeouts configurable (Issue Add design doc to readme #9)
Expand e2e test scenarios (Issue Add peer discovery #17)

Conclusion

This is a solid implementation of a critical feature. The core raft integration is well-designed with strong safety guarantees. The main concerns are around test coverage, error context for production debugging, and documentation for operators. With the recommended fixes, this will be production-ready.

The failover logic is particularly well-thought-out, with careful handling of edge cases like FSM lag and state divergence. The e2e tests demonstrate that the happy path works reliably.

Recommendation: Approve with requested changes. Address critical issues #1, #2, #4, and #5 before merge. Track medium/low priority issues for follow-up work.

github-actions · 2026-01-16T20:37:04Z

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build	Format	Lint	Breaking	Updated (UTC)
`✅ passed`	`⏩ skipped`	`✅ passed`	`✅ passed`	Jan 26, 2026, 5:24 PM

gemini-code-assist · 2026-01-16T20:39:53Z

Summary of Changes

Hello @alpe, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request fundamentally changes how Rollkit nodes handle high availability by integrating the RAFT consensus algorithm. The primary goal is to ensure continuous operation and data consistency across a cluster of sequencer nodes, even in the event of node failures. The changes involve a significant architectural shift to support dynamic role management (leader/follower) and robust state synchronization, allowing nodes to automatically recover and maintain a consistent view of the blockchain state. This feature is crucial for building resilient and fault-tolerant Rollkit deployments.

Highlights

RAFT Consensus Integration: Introduces the Raft consensus algorithm for high availability, enabling dynamic leader election and state replication among sequencer nodes using the hashicorp/raft library.
Dynamic Leader Election & Failover: Implements a DynamicLeaderElection mechanism that allows nodes to seamlessly transition between leader and follower roles. This includes logic for handling leader loss, ensuring FSM synchronization before leadership, and attempting recovery if a new leader is not synced.
Enhanced State Synchronization & Recovery: The Executor and Syncer components are updated to verify local state against the Raft replicated state. This includes automatic rollbacks if the execution layer is ahead, checks for block existence in the store before syncing, and robust recovery mechanisms from Raft state during failover scenarios.
Refactored Node Architecture: The node's P2P client creation and management are refactored, and a new failoverState structure is introduced to encapsulate components that need to be reset during mode switches (aggregator/sync). This simplifies the node's lifecycle management in a HA setup.
New RPC Endpoint for RAFT Status: Adds a /raft/node HTTP endpoint to query the Raft node's leadership status and ID, providing visibility into the consensus state.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a significant and well-executed feature: high availability for the sequencer using the Raft consensus algorithm. The changes are extensive, refactoring core components to support dynamic leader election and failover between aggregator (leader) and sync (follower) roles. The new architecture appears robust, with strong consistency checks to prevent issues like state divergence or stale leaders. The addition of comprehensive end-to-end tests for failover scenarios is particularly commendable. My review includes a few suggestions to improve configuration handling and error messages, aiming to enhance the operator experience and debuggability of this critical new functionality.

gemini-code-assist · 2026-01-16T20:42:41Z

block/internal/common/replay.go

+		prevState, err = s.store.GetStateAtHeight(ctx, height-1)
 		if err != nil {
 			return fmt.Errorf("failed to get previous state: %w", err)
 		}


For better debuggability, it would be helpful to include the specific height that was being queried in the error message when getting the previous state fails.

Suggested change

prevState, err = s.store.GetStateAtHeight(ctx, height-1)

if err != nil {

return fmt.Errorf("failed to get previous state: %w", err)

}

prevState, err = s.store.GetStateAtHeight(ctx, height-1)

if err != nil {

return fmt.Errorf("failed to get previous state at height %d: %w", height-1, err)

}

gemini-code-assist · 2026-01-16T20:42:42Z

node/full.go

+		return nil, fmt.Errorf("raft config must be used in sequencer setup only")
 	}


The error message "raft config must be used in sequencer setup only" could be more explicit. It seems the design requires a node to be configured as a potential aggregator to participate in a Raft cluster. A clearer message would improve user experience when configuring nodes.

Suggested change

return nil, fmt.Errorf("raft config must be used in sequencer setup only")

}

return nil, fmt.Errorf("raft can only be enabled for aggregator nodes (aggregator flag must be true)")

gemini-code-assist · 2026-01-16T20:42:42Z

pkg/raft/node.go

+	if svrs := deduplicateServers(cfg.Servers); len(svrs) != len(cfg.Servers) {
+		return fmt.Errorf("duplicate peers found in config: %v", cfg.Servers)
+	}


Instead of returning an error for duplicate peers in the configuration, it would be more user-friendly to log a warning and proceed with the deduplicated list. The deduplicateServers function already handles the removal of duplicates, so failing fast might be overly strict for what is likely a minor configuration mistake.

Suggested change

if svrs := deduplicateServers(cfg.Servers); len(svrs) != len(cfg.Servers) {

return fmt.Errorf("duplicate peers found in config: %v", cfg.Servers)

}

if svrs := deduplicateServers(cfg.Servers); len(svrs) != len(cfg.Servers) {

n.logger.Warn().Msgf("duplicate peers found in config, using deduplicated list: %v", svrs)

cfg.Servers = svrs

}

codecov · 2026-01-19T08:32:34Z

Codecov Report

❌ Patch coverage is 34.01060% with 747 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.41%. Comparing base (2325ace) to head (5fc0b6b).

Files with missing lines	Patch %	Lines
pkg/raft/node.go	11.05%	193 Missing ⚠️
pkg/raft/node_mock.go	45.40%	74 Missing and 21 partials ⚠️
block/internal/executing/executor.go	2.32%	77 Missing and 7 partials ⚠️
block/internal/syncing/syncer.go	0.00%	83 Missing and 1 partial ⚠️
block/internal/syncing/raft_retriever.go	0.00%	75 Missing ⚠️
node/full.go	32.30%	37 Missing and 7 partials ⚠️
pkg/raft/election.go	67.20%	29 Missing and 12 partials ⚠️
node/failover.go	73.38%	24 Missing and 13 partials ⚠️
tools/local-da/rpc.go	0.00%	21 Missing ⚠️
block/internal/common/replay.go	50.00%	14 Missing and 6 partials ⚠️
... and 14 more

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2987      +/-   ##
==========================================
- Coverage   57.98%   55.41%   -2.57%     
==========================================
  Files         110      116       +6     
  Lines       10522    11505     +983     
==========================================
+ Hits         6101     6376     +275     
- Misses       3770     4429     +659     
- Partials      651      700      +49

Flag	Coverage Δ
combined	`55.41% <34.01%> (-2.57%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

* main: feat(tracing): tracing part 9 sequencer (#2990) build(deps): use mainline go-header (#2988)

block/internal/executing/executor.go

block/internal/syncing/raft_retriever.go

block/internal/syncing/syncer_test.go

tac0turtle · 2026-01-21T15:43:59Z

docs/guides/migrating-to-ev-abci.md

 	migrationmngrtypes "github.com/evstack/ev-abci/modules/migrationmngr/types"
 	// ...
 )
 ```


this need to be inverted to keep the count correct

node/failover.go

pkg/store/types.go

tac0turtle

left some comments, nice job.

We should get some docs, its fine in a follow up.

It would be good to do some testing after the merge on leadership rotation latency and what would be the expected time that a block may not be produced while a new leader is being found.

come of the comments from claude look realistic, should have tests for them at least

* main: fix(docs): remove blog link from sidebar to fix 404 (#3014) build(deps): Bump github.com/cometbft/cometbft from 0.38.20 to 0.38.21 in /execution/evm/test in the go_modules group across 1 directory (#3011) refactor: use slices.Contains to simplify code (#3010) chore: Bump mermaid version and dependencies (#3009) chore: Bump github.com/consensys/gnark-crypto only (#3008) test: evm contract interaction (#3006) chore: remove redundant log (#3007) fix: return values correctly not nil (#3004) feat: expose execution client params to ev-node (#2982) feat(tracing): HTTP propagation (#3000) fix: deploy docs token (#3003) feat(tracing): add store tracing (#3001) feat: p2p exchange wrapper (#2855) build(deps): Bump the all-go group across 5 directories with 5 updates (#2999) feat(tracing): adding forced inclusion tracing (#2997) chore: update calculator for strategies (#2995) chore: adding tracing for da submitter (#2993) feat(tracing): part 10 da retriever tracing (#2991) chore: add da posting strategy to docs (#2992)

alpe added 30 commits November 12, 2025 15:16

Add high availabiliy via raft

d4c7e7d

Prefent duplicate peers

820b9a3

Merge branch 'main' into alex/adr023_ha-failover_raft7

35cf0c8

* main: fix: remove duplicate error logging in light node shutdown (#2841) chore: fix incorrect function name in comment (#2840) chore: remove sequencer go.mod (#2837)

Go mod tidy

45a6e27

Minor updates

4562ded

Merge branch 'main' into alex/adr023_ha-failover_raft7

4d47754

Fix test

0804346

Leadership fix

1c2e3f1

(cherry picked from commit c44cd77e665f6d5d463295c6ed61c59a56d88db3)

Fix test

c320a7a

Remove sleep

72ec5fe

Increase buffer size for raft msgs

b34d7a0

Review feedback

406377e

Merge branch 'main' into alex/adr023_ha-failover_raft7

0ff25b8

* main: chore: bump da (#2866) chore: bump core (#2865)

Add replace to go.mod

406ad8b

Fix merge conflicts

abfa8d3

Linter fix

aebbdff

Minor test update

50d7caf

Remove store.Sync

5de9f0e

Merge

695324e

Add last submitted da heights to raft

2463919

Make test compile again

3e1cbca

Complete merge

640924b

Merge branch 'main' into alex/adr023_ha-failover_raft8

11038cd

Fix integration test

095f882

alpe added 8 commits January 12, 2026 15:58

Merge branch 'main' into alex/adr023_ha-failover_raft8

d7fbf79

* main: fix: Fail fast when executor ahead (#2966) feat(block): async epoch fetching (#2952) perf: tune badger defaults and add db bench (#2950)

x

7edca17

More safety features, including db.sync

6502a47

Ensure safe restart

a52d538

x

3203ce6

Merge branch 'main' into alex/adr023_ha-failover_raft11

1cf1167

* main: fix: inconsistent state detection and rollback (#2983) chore: improve graceful shutdown restarts (#2985) feat(submitting): add posting strategies (#2973)

Resolve conflicts

e4e94f2

github-actions bot assigned alpe Jan 16, 2026

alpe mentioned this pull request Jan 16, 2026

feat: High availabilty via RAFT #2954

Closed

gemini-code-assist bot reviewed Jan 16, 2026

View reviewed changes

alpe marked this pull request as draft January 18, 2026 10:38

alpe added 2 commits January 18, 2026 12:01

Fix test setup

e444215

Fix dummyDA

3b4fe3f

alpe marked this pull request as ready for review January 19, 2026 08:42

Merge branch 'main' into alex/adr023_ha-failover_raft11

9b80c74

* main: feat(tracing): tracing part 9 sequencer (#2990) build(deps): use mainline go-header (#2988)