feat: High availability via RAFT #2987
Conversation
* main: build(deps): Bump the go_modules group across 2 directories with 3 updates (#2846) build(deps): Bump github.com/dvsekhvalnov/jose2go from 1.7.0 to 1.8.0 in /test/e2e (#2851) build(deps): Bump github.com/consensys/gnark-crypto from 0.18.0 to 0.18.1 in /test/e2e (#2844) build(deps): Bump github.com/cometbft/cometbft from 0.38.17 to 0.38.19 in /test/e2e (#2843) build(deps): Bump github.com/dvsekhvalnov/jose2go from 1.6.0 to 1.7.0 in /test/e2e (#2845)
(cherry picked from commit c44cd77e665f6d5d463295c6ed61c59a56d88db3)
* main: chore: fix some comments (#2874) chore: bump node in evm-single (#2875) refactor(syncer,cache): use compare and swap loop and add comments (#2873) refactor: use state da height as well (#2872) refactor: retrieve highest da height in cache (#2870) chore: change from event count to start and end height (#2871)
* main: chore: remove extra github action yml file (#2882) fix(execution/evm): verify payload status (#2863) feat: fetch included da height from store (#2880) chore: better output on errors (#2879) refactor!: create da client and split cache interface (#2878) chore!: rename `evm-single` and `grpc-single` (#2839) build(deps): Bump golang.org/x/crypto from 0.42.0 to 0.45.0 in /tools/da-debug in the go_modules group across 1 directory (#2876) chore: parallel cache de/serialization (#2868) chore: bump blob size (#2877)
* main: build(deps): Bump mdast-util-to-hast from 13.2.0 to 13.2.1 in /docs in the npm_and_yarn group across 1 directory (#2900) refactor(block): centralize timeout in client (#2903) build(deps): Bump the all-go group across 2 directories with 3 updates (#2898) chore: bump default timeout (#2902) fix: revert default db (#2897) refactor: remove obsolete // +build tag (#2899) fix:da visualiser namespace (#2895) refactor: omit unnecessary reassignment (#2892) build(deps): Bump the all-go group across 5 directories with 6 updates (#2881) chore: fix inconsistent method name in retryWithBackoffOnPayloadStatus comment (#2889) fix: ensure consistent network ID usage in P2P subscriber (#2884) build(deps): Bump golangci/golangci-lint-action from 9.0.0 to 9.1.0 (#2885) build(deps): Bump actions/checkout from 5 to 6 (#2886)
* main: (34 commits) feat: make reaper poll duration configurable (#2951) chore!: move sequencers to pkg (#2931) feat: Ensure Header integrity on DA (#2948) feat(testda): add header support with GetHeaderByHeight method (#2946) chore: improve code comments clarity (#2947) chore(sequencers): optimize store check (#2945) fix: make evm_execution more robust (#2942) fix(sequencers/single): deterministic queue (#2938) fix(block): fix init logic sequencer for da epoch fetching (#2926) feat: use DA timestamp (#2939) chore: improve code comments clarity (#2943) build(deps): bump libp2p (#2937) build(deps): Bump actions/cache from 4 to 5 (#2934) build(deps): Bump actions/download-artifact from 6 to 7 (#2933) build(deps): Bump actions/upload-artifact from 5 to 6 (#2932) feat: DA Client remove interface part 3, replace types with new code (#2910) DA Client remove interface: Part 2.5, create e2e test to validate that a blob is posted in DA layer. (#2920) fix(syncing): skip forced txs checks for p2p blocks (#2922) build(deps): Bump the all-go group across 5 directories with 5 updates (#2919) chore: loosen syncer state check (#2927) ...
* main: chore: adding syncing tracing (#2981) feat(tracing): adding block production tracing (#2980) feat(tracing): Add Store, P2P and Config tracing (#2972) chore: fix upgrade test (#2979) build(deps): Bump github.com/ethereum/go-ethereum from 1.16.7 to 1.16.8 in /execution/evm/test in the go_modules group across 1 directory (#2974) feat(tracing): adding tracing to DA client (#2968) chore: create onboarding skill (#2971) test: add e2e tests for force inclusion (part 2) (#2970) feat(tracing): adding eth client tracing (#2960) test: add e2e tests for force inclusion (#2964) build(deps): Bump the all-go group across 4 directories with 10 updates (#2969)
Claude finished @alpe's task in 3m 8s
Code Review: RAFT High Availability Implementation
I've completed a comprehensive review of this RAFT high availability implementation. This is a significant and well-architected feature that introduces critical resilience capabilities to the sequencer. Below are my findings, organized by category.
Summary
This PR successfully implements a production-ready RAFT consensus layer for high availability, enabling dynamic leader election and automatic failover between aggregator (leader) and sync (follower) modes. The implementation shows strong attention to safety, particularly around preventing double-signing and state divergence. The e2e tests demonstrate thorough validation of failover scenarios. A minimal sketch of the failover idea appears below.
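For readers new to the design, here is a minimal, hypothetical sketch of that failover idea, driven by hashicorp/raft's leadership channel; the component interface and names are assumptions for illustration, not this PR's actual APIs.

package failover

import (
    "context"

    "github.com/hashicorp/raft"
)

// component stands in for the PR's block components (illustrative only).
type component interface {
    Start(ctx context.Context) error
    Stop() error
}

// runFailoverLoop switches between block production (leader) and sync
// (follower) whenever raft signals a leadership change.
func runFailoverLoop(ctx context.Context, r *raft.Raft, aggregator, syncer component) {
    for {
        select {
        case isLeader := <-r.LeaderCh(): // true on gaining leadership, false on losing it
            if isLeader {
                _ = syncer.Stop()
                _ = aggregator.Start(ctx)
            } else {
                _ = aggregator.Stop()
                _ = syncer.Start(ctx)
            }
        case <-ctx.Done():
            return
        }
    }
}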
Key Strengths:
Areas for Improvement:
Critical Issues
1. Race Condition in Dynamic Leader Election (pkg/raft/election.go)
Location: the ticker transition logic:
case <-ticker.C:
// ...
ticker.Stop()
ticker.C = nil // Setting to nil, but select could already be evaluating
Recommendation: Use atomic operations or a mutex to protect the ticker state transitions; a hedged sketch follows below.
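A minimal sketch of the mutex-based option, with hypothetical names (this is not the PR's election code):

package raftutil

import (
    "sync"
    "time"
)

// electionTicker serializes stop/read of a one-shot ticker so no goroutine
// observes a half-updated ticker while a select loop is being evaluated.
type electionTicker struct {
    mu sync.Mutex
    t  *time.Ticker
}

func (e *electionTicker) stop() {
    e.mu.Lock()
    defer e.mu.Unlock()
    if e.t != nil {
        e.t.Stop()
        e.t = nil
    }
}

// c returns the tick channel, or nil after stop; a nil channel blocks forever
// in a select, which safely disables that case.
func (e *electionTicker) c() <-chan time.Time {
    e.mu.Lock()
    defer e.mu.Unlock()
    if e.t == nil {
        return nil
    }
    return e.t.C
}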
Severity: Medium - Could cause initialization issues during cluster startup
2. Insufficient Error Context in State Validation (block/internal/executing/executor.go, block/internal/syncing/syncer.go)
Locations:
Critical state validation errors lack sufficient context for debugging production issues. When hash mismatches occur, operators need to know which node detected the mismatch and what the raft cluster state (height, index) was at the time.
Current:
return fmt.Errorf("invalid state: block hash mismatch at height %d: raft=%x local=%x",
state.LastBlockHeight, raftState.Hash, header.Hash())
Recommendation:
return fmt.Errorf("state divergence detected: block hash mismatch at height %d: "+
"raft_hash=%x local_hash=%x node_id=%s cluster_state=height:%d raft_index=%d "+
"This indicates a critical consistency violation requiring manual intervention",
state.LastBlockHeight, raftState.Hash, header.Hash(), n.NodeID(),
raftState.Height, raftState.Index)
Severity: High - Impacts production debugging and incident response
3. Potential Deadlock in Failover State (node/failover.go)
Location: the deferred stop of block components:
defer func() {
if err := f.bc.Stop(); err != nil && !errors.Is(err, context.Canceled) {
multiErr = errors.Join(multiErr, fmt.Errorf("stopping block components: %w", err))
}
}()
Recommendation: Use a separate context with a timeout for shutdown operations so failover is never wedged (a small helper is sketched below):
defer func() {
shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
_ = shutdownCtx // the stop path should honor this context; f.bc.Stop() as shown does not yet accept one
if err := f.bc.Stop(); err != nil && !errors.Is(err, context.Canceled) {
multiErr = errors.Join(multiErr, fmt.Errorf("stopping block components: %w", err))
}
}()
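One way to bound the stop call, sketched as a hypothetical helper assuming the usual fmt/time imports (the timeout value is illustrative):

// stopWithTimeout runs a blocking stop function but gives up after the given
// timeout, so a wedged component cannot stall failover indefinitely.
// Note: the abandoned goroutine still runs to completion in the background.
func stopWithTimeout(stop func() error, timeout time.Duration) error {
    done := make(chan error, 1)
    go func() { done <- stop() }()
    select {
    case err := <-done:
        return err
    case <-time.After(timeout):
        return fmt.Errorf("component did not stop within %s", timeout)
    }
}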
d.logger.Info().Msg("lost leadership during sync wait")
continue
}Recommendation: Add explicit check: leaderID := d.node.leaderID()
if leaderID == "" {
d.logger.Warn().Msg("raft cluster not ready, no leader elected yet")
continue
}
if leaderID != d.node.NodeID() {
d.logger.Info().Msg("lost leadership during sync wait")
continue
}Severity: Medium - Could cause confusing behavior during cluster initialization 5. Configuration Error Messages Not User-Friendly (node/full.go, pkg/config/config.go)Location: The error message "raft config must be used in sequencer setup only" is unclear. It doesn't explain that raft requires the aggregator flag. Current: return nil, fmt.Errorf("raft config must be used in sequencer setup only")Recommendation: return nil, fmt.Errorf("raft can only be enabled for aggregator nodes: "+
"set --node.aggregator=true when using --raft.enable=true. "+
"Non-aggregator nodes (full nodes/light clients) cannot participate in raft consensus")Severity: Medium - Impacts developer experience and deployment troubleshooting Note: This was also flagged by Gemini's review. 6. Duplicate Peer Handling Too Strict (pkg/raft/node.go)Location: The code errors on duplicate peers instead of deduplicating and warning. This makes configuration more brittle than necessary. Current: if svrs := deduplicateServers(cfg.Servers); len(svrs) != len(cfg.Servers) {
return fmt.Errorf("duplicate peers found in config: %v", cfg.Servers)
}Recommendation: Log warning and proceed: if svrs := deduplicateServers(cfg.Servers); len(svrs) != len(cfg.Servers) {
n.logger.Warn().Msgf("duplicate peers found in config, using deduplicated list: %v -> %v",
cfg.Servers, svrs)
cfg.Servers = svrs
}Severity: Low - Quality of life improvement Note: This was also flagged by Gemini's review. Medium Priority Issues7. Unclear Recovery Logic (block/internal/executing/executor.go)Location: The recovery logic in Recommendation: Refactor for clarity: if e.raftNode == nil {
// No raft, proceed with standard initialization
e.setLastState(state)
e.sequencer.SetDAHeight(state.DAHeight)
return e.syncExecutionLayer(state)
}
// Raft-enabled path
raftState := e.raftNode.GetState()
if err := e.validateStateAgainstRaft(state, raftState); err != nil {
return err
}
syncTarget := max(state.LastBlockHeight, raftState.Height)
return e.syncToHeight(syncTarget)
Severity: Medium - Code maintainability
8. Missing Logging for Critical State Transitions (pkg/raft/election.go)
Location: When transitioning from follower to leader, there is extensive checking but no logging at key decision points (e.g., when FSM lag causes leadership abdication).
Recommendation: Add structured logging:
if err := d.node.waitForMsgsLanded(d.node.Config().SendTimeout); err != nil {
d.logger.Error().
Err(err).
Uint64("applied_index", d.node.raft.AppliedIndex()).
Uint64("last_index", d.node.raft.LastIndex()).
Msg("failed to wait for messages to land - FSM lagging, abdicating")
// ...
}
Severity: Medium - Impacts production observability
9. Hardcoded Timeout Values (node/failover.go, pkg/raft/election.go)
Locations:
These should be configurable for different deployment scenarios (local dev vs production).
Recommendation: Move them to configuration:
type FailoverConfig struct {
ShutdownTimeout time.Duration // default: 3s
SyncCheckInterval time.Duration // default: 300ms
}
Severity: Low - Deployment flexibility
Security Considerations
10. Raft Transport Security
Location: The RAFT transport uses TCP without TLS. For production deployments, this means raft communication (including block data) is unencrypted.
Current:
transport, err := raft.NewTCPTransport(cfg.RaftAddr, addr, 3, 10*time.Second, os.Stderr)
Recommendation: secure the raft transport between sequencers, for example with TLS; a hedged sketch of one option follows below.
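A minimal sketch of one TLS option built on hashicorp/raft's NewNetworkTransport; the helper name and certificate plumbing are assumptions, not part of this PR:

package rafttls

import (
    "crypto/tls"
    "net"
    "os"
    "time"

    "github.com/hashicorp/raft"
)

// tlsStreamLayer satisfies raft.StreamLayer by pairing a TLS listener with a
// TLS dialer, so raft traffic between sequencers is encrypted.
type tlsStreamLayer struct {
    net.Listener
    cfg *tls.Config
}

func (t *tlsStreamLayer) Dial(addr raft.ServerAddress, timeout time.Duration) (net.Conn, error) {
    d := &net.Dialer{Timeout: timeout}
    return tls.DialWithDialer(d, "tcp", string(addr), t.cfg)
}

func newTLSTransport(bindAddr string, cfg *tls.Config) (*raft.NetworkTransport, error) {
    ln, err := tls.Listen("tcp", bindAddr, cfg)
    if err != nil {
        return nil, err
    }
    return raft.NewNetworkTransport(&tlsStreamLayer{Listener: ln, cfg: cfg}, 3, 10*time.Second, os.Stderr), nil
}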
Severity: Medium - Security posture for production deployments
11. No Authentication for Raft RPC Endpoint
Location: The new RPC endpoint exposing raft status has no authentication. While it only reveals leadership status, this could leak cluster topology information.
Recommendation:
Severity: Low - Information disclosure
Performance & Resource Efficiency
12. Unbounded Channel in FSM (pkg/raft/node.go)
Location: the FSM apply path:
if f.applyCh != nil {
select {
case f.applyCh <- RaftApplyMsg{Index: log.Index, State: &state}:
default:
f.logger.Warn().Msg("apply channel full, dropping message")
}
}Recommendation:
Severity: Medium - Could cause state sync issues under load
13. Inefficient State Cloning (pkg/raft/node.go)
Location: every GetState call clones the protobuf state:
func (n *Node) GetState() *RaftBlockState {
return proto.Clone(n.fsm.state.Load()).(*RaftBlockState)
}Recommendation:
Severity: Low - Potential optimization opportunity
14. Goroutine Leak Potential (pkg/raft/election.go)
Recommendation:
Severity: Low - Defensive programming
Testing & Reliability
15. Low Test Coverage (34.04%)
Key gaps:
Recommendation:
Severity: High - Critical for production confidence
16. Missing Tests for Concurrent State Access (block/internal/executing/executor.go, block/internal/syncing/syncer.go)
Both executor and syncer use atomic pointers for shared state.
Recommendation:
Severity: Medium - Concurrency correctness
17. E2E Tests Could Be More Comprehensive
Location: The e2e tests cover basic failover but could test:
Recommendation: Expand the test matrix to cover the failure modes listed above.
Severity: Medium - Production resilience validation
Architecture & Code Quality
18. Raft State Validation Logic Duplicated
Locations:
Both executor and syncer have nearly identical raft state validation logic. This should be extracted to a shared utility.
Recommendation:
// block/internal/common/raft_validator.go
func ValidateStateAgainstRaft(
localHeight uint64,
localHash []byte,
raftState *raft.RaftBlockState,
store store.Store,
) error {
// Shared validation logic
}
Severity: Low - Code maintainability and DRY principle
19. Interface Segregation (block/internal/common/raft.go)
Recommendation: Consider splitting the interface if future features require more granular permissions.
Severity: Low - Future extensibility
20. Magic Numbers in Raft Configuration
Locations:
Recommendation:
Severity: Low - Code clarity
Documentation & Developer Experience
21. Missing Documentation for Recovery Scenarios
The recovery logic lacks documentation for the scenarios it must handle.
Recommendation: Add comprehensive godoc:
// IsSyncedWithRaft checks if local state matches raft consensus state.
// Returns:
// - 0 if in sync (heights and hashes match)
// - negative if local is behind raft (safe, will catch up via normal sync)
// - positive if local is ahead of raft (CRITICAL: requires rollback)
//
// Hash mismatches at the same height indicate state divergence and return an error.
func (s *Syncer) IsSyncedWithRaft(raftState *raft.RaftBlockState) (int, error)
Severity: Medium - Critical for maintainers
22. Configuration Documentation Gaps
The new raft configuration flags lack comprehensive documentation:
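As an illustration of the kind of flag documentation being asked for, a hedged sketch using spf13/pflag (assuming the usual pflag/time imports); the flag names and defaults below are assumptions drawn from this thread, not the PR's actual flag set:

func registerRaftFlags(fs *pflag.FlagSet) {
    fs.Bool("raft.enable", false,
        "enable raft-based HA sequencing; requires --node.aggregator=true")
    fs.String("raft.address", "127.0.0.1:7070",
        "address the raft transport binds to (cluster-internal; keep it off public networks)")
    fs.StringSlice("raft.servers", nil,
        "full list of raft peers, including this node; must match on every member")
    fs.Duration("raft.send-timeout", 2*time.Second,
        "how long the leader waits for a block-state entry to replicate before abdicating")
}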
Recommendation:
Severity: Medium - Operational readiness
23. Unclear TODO Comment (block/internal/syncing/syncer.go)
Location:
if event.Source == common.SourceDA { // TODO(@julienrbrt): To be reverted once DA Hints are merged
This TODO references PR #2891 but doesn't explain:
Recommendation: Expand the comment with context or link to an ADR/design doc.
Severity: Low - Code maintainability
Positive Observations
✅ Excellent Safety Mechanisms
✅ Clean Separation of Concerns
✅ Comprehensive E2E Testing
✅ Strong Error Handling
Recommendations Summary
Immediate (Before Merge)
Short Term (Next Sprint)
Medium Term (Next Release)
Conclusion
This is a solid implementation of a critical feature. The core raft integration is well-designed with strong safety guarantees. The main concerns are around test coverage, error context for production debugging, and documentation for operators. With the recommended fixes, this will be production-ready. The failover logic is particularly well thought out, with careful handling of edge cases like FSM lag and state divergence. The e2e tests demonstrate that the happy path works reliably.
Recommendation: Approve with requested changes. Address critical issues #1, #2, #4, and #5 before merge. Track medium/low priority issues for follow-up work.
The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).
Summary of Changes
Hello @alpe, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request fundamentally changes how Rollkit nodes handle high availability by integrating the RAFT consensus algorithm. The primary goal is to ensure continuous operation and data consistency across a cluster of sequencer nodes, even in the event of node failures. The changes involve a significant architectural shift to support dynamic role management (leader/follower) and robust state synchronization, allowing nodes to automatically recover and maintain a consistent view of the blockchain state. This feature is crucial for building resilient and fault-tolerant Rollkit deployments.
Highlights
Code Review
This pull request introduces a significant and well-executed feature: high availability for the sequencer using the Raft consensus algorithm. The changes are extensive, refactoring core components to support dynamic leader election and failover between aggregator (leader) and sync (follower) roles. The new architecture appears robust, with strong consistency checks to prevent issues like state divergence or stale leaders. The addition of comprehensive end-to-end tests for failover scenarios is particularly commendable. My review includes a few suggestions to improve configuration handling and error messages, aiming to enhance the operator experience and debuggability of this critical new functionality.
prevState, err = s.store.GetStateAtHeight(ctx, height-1)
if err != nil {
    return fmt.Errorf("failed to get previous state: %w", err)
}
For better debuggability, it would be helpful to include the specific height that was being queried in the error message when getting the previous state fails.
Suggested change:
prevState, err = s.store.GetStateAtHeight(ctx, height-1)
if err != nil {
    return fmt.Errorf("failed to get previous state: %w", err)
}
becomes:
prevState, err = s.store.GetStateAtHeight(ctx, height-1)
if err != nil {
    return fmt.Errorf("failed to get previous state at height %d: %w", height-1, err)
}
| return nil, fmt.Errorf("raft config must be used in sequencer setup only") | ||
| } |
The error message "raft config must be used in sequencer setup only" could be more explicit. It seems the design requires a node to be configured as a potential aggregator to participate in a Raft cluster. A clearer message would improve user experience when configuring nodes.
| return nil, fmt.Errorf("raft config must be used in sequencer setup only") | |
| } | |
| return nil, fmt.Errorf("raft can only be enabled for aggregator nodes (aggregator flag must be true)") |
if svrs := deduplicateServers(cfg.Servers); len(svrs) != len(cfg.Servers) {
    return fmt.Errorf("duplicate peers found in config: %v", cfg.Servers)
}
Instead of returning an error for duplicate peers in the configuration, it would be more user-friendly to log a warning and proceed with the deduplicated list. The deduplicateServers function already handles the removal of duplicates, so failing fast might be overly strict for what is likely a minor configuration mistake.
Suggested change:
if svrs := deduplicateServers(cfg.Servers); len(svrs) != len(cfg.Servers) {
    return fmt.Errorf("duplicate peers found in config: %v", cfg.Servers)
}
becomes:
if svrs := deduplicateServers(cfg.Servers); len(svrs) != len(cfg.Servers) {
    n.logger.Warn().Msgf("duplicate peers found in config, using deduplicated list: %v", svrs)
    cfg.Servers = svrs
}
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #2987 +/- ##
==========================================
- Coverage 57.98% 55.41% -2.57%
==========================================
Files 110 116 +6
Lines 10522 11505 +983
==========================================
+ Hits 6101 6376 +275
- Misses 3770 4429 +659
- Partials 651 700 +49
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry.
migrationmngrtypes "github.com/evstack/ev-abci/modules/migrationmngr/types"
// ...
)
this needs to be inverted to keep the count correct
left some comments, nice job.
We should get some docs; it's fine in a follow-up.
It would be good to do some testing after the merge on leadership rotation latency, and on the expected time during which no block may be produced while a new leader is being found; a rough measurement helper is sketched below.
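A rough, hypothetical helper for that experiment; getHeight is whatever height query the e2e harness already exposes, and the context/time imports are assumed:

// measureBlockGap polls block height around a forced leader restart and
// reports the longest observed interval without a new block.
func measureBlockGap(ctx context.Context, getHeight func(context.Context) (uint64, error), poll time.Duration) time.Duration {
    var (
        lastHeight   uint64
        lastProgress = time.Now()
        worstGap     time.Duration
    )
    ticker := time.NewTicker(poll)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return worstGap
        case <-ticker.C:
            h, err := getHeight(ctx)
            if err != nil {
                continue // node may be restarting; keep timing
            }
            if h > lastHeight {
                lastHeight = h
                lastProgress = time.Now()
            } else if gap := time.Since(lastProgress); gap > worstGap {
                worstGap = gap
            }
        }
    }
}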
Some of the comments from Claude look realistic; we should have tests for them at least.
* main: fix(docs): remove blog link from sidebar to fix 404 (#3014) build(deps): Bump github.com/cometbft/cometbft from 0.38.20 to 0.38.21 in /execution/evm/test in the go_modules group across 1 directory (#3011) refactor: use slices.Contains to simplify code (#3010) chore: Bump mermaid version and dependencies (#3009) chore: Bump github.com/consensys/gnark-crypto only (#3008) test: evm contract interaction (#3006) chore: remove redundant log (#3007) fix: return values correctly not nil (#3004) feat: expose execution client params to ev-node (#2982) feat(tracing): HTTP propagation (#3000) fix: deploy docs token (#3003) feat(tracing): add store tracing (#3001) feat: p2p exchange wrapper (#2855) build(deps): Bump the all-go group across 5 directories with 5 updates (#2999) feat(tracing): adding forced inclusion tracing (#2997) chore: update calculator for strategies (#2995) chore: adding tracing for da submitter (#2993) feat(tracing): part 10 da retriever tracing (#2991) chore: add da posting strategy to docs (#2992)
Replaces #2954