@chen2021673 (Contributor) commented Jan 22, 2026

Summary

Fix precision checker file accumulation issue during multi-iteration runs and simplify the overall implementation.

Changes

Counter Mechanism Fix

  • Add ResetCounters() method to reset tensor counter at iteration boundaries
  • Move counter management to PrecisionCheckEnv with thread_local storage for thread safety
  • Call ResetCounters() at the start of each training step in gpt2/llama3
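The counter fix can be pictured with a minimal sketch. It assumes one counter per tensor name, kept in `thread_local` storage so each thread numbers its own tensors. The class name `PrecisionCounterSketch` and the `NextIndex()` helper are hypothetical stand-ins, not the actual `PrecisionCheckEnv` interface; only `ResetCounters()` is named in this PR.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the counter part of PrecisionCheckEnv.
// Assumption: indices are tracked per tensor name; the real layout may differ.
class PrecisionCounterSketch {
public:
    // Returns the next index for `name` on the calling thread, starting at 0.
    static int NextIndex(const std::string& name) {
        return Counters()[name]++;
    }

    // Clears every counter owned by the calling thread. Calling this at the
    // start of each training step makes indices repeat across iterations,
    // so a second iteration reuses names instead of producing new ones.
    static void ResetCounters() {
        Counters().clear();
    }

private:
    // One map per thread: no locking is needed and threads never share indices.
    static std::unordered_map<std::string, int>& Counters() {
        thread_local std::unordered_map<std::string, int> counters;
        return counters;
    }
};
```

Without the reset, the index in `name_idx_stage` would keep growing from one iteration to the next, which is one plausible way the file accumulation described in the summary could arise.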

Precision Checker Refactoring

  • Remove baseline comparison functionality (use separate script instead)
  • Remove table format output, keep only simple and md5 formats
  • Add SaveNpy() function with rank subdirectory support
  • Simplify log format: [GAS-X] [L-Y] name_idx_stage tensor[i]: dtype=... shape=... min=... max=... mean=... [values] [NaN:X Inf:Y]
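As a rough illustration of the statistics part of the log line, the sketch below computes min, max, and mean and counts NaN and Inf entries. The struct name `TensorStatsSketch`, both function names, and the choice to exclude non-finite values from min/max/mean are assumptions made for this example; the number formatting is simply the stream default and may not match the real output.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical summary struct for one tensor.
struct TensorStatsSketch {
    float min = 0.0f;
    float max = 0.0f;
    float mean = 0.0f;
    std::size_t nan_count = 0;
    std::size_t inf_count = 0;
};

// Assumption: NaN and Inf are counted but left out of min/max/mean.
TensorStatsSketch ComputeStats(const std::vector<float>& data) {
    TensorStatsSketch s;
    double sum = 0.0;
    std::size_t finite = 0;
    float lo = std::numeric_limits<float>::infinity();
    float hi = -std::numeric_limits<float>::infinity();
    for (float v : data) {
        if (std::isnan(v)) { ++s.nan_count; continue; }
        if (std::isinf(v)) { ++s.inf_count; continue; }
        lo = std::min(lo, v);
        hi = std::max(hi, v);
        sum += v;
        ++finite;
    }
    if (finite > 0) {
        s.min = lo;
        s.max = hi;
        s.mean = static_cast<float>(sum / static_cast<double>(finite));
    }
    return s;
}

// Formats only the "min=... max=... mean=... [NaN:X Inf:Y]" tail of the line.
std::string FormatStats(const TensorStatsSketch& s) {
    std::ostringstream out;
    out << "min=" << s.min << " max=" << s.max << " mean=" << s.mean
        << " [NaN:" << s.nan_count << " Inf:" << s.inf_count << "]";
    return out.str();
}
```

Counting non-finite values separately keeps a single NaN from turning every statistic into NaN, so the remaining numbers stay comparable between runs.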

New Scripts

  • scripts/precision_check/precision_compare.py - Offline NPY comparison tool
  • scripts/precision_check/run_precision_check_gpt2.sh - GPT2 verification script
  • scripts/precision_check/run_precision_check_llama3.sh - LLaMA3 verification script

Documentation

  • Update docs/precision_checker_guide.md to reflect current implementation

Usage Example

# Basic check
./build/gpt2 --precision_check "level=1" --num_iteration 1

# Save NPY files
./build/gpt2 --precision_check "level=1,save_tensors=true" --num_iteration 1

# MD5 format
./build/gpt2 --precision_check "level=1,format=md5" --num_iteration 1

# Compare two runs
python scripts/precision_check/precision_compare.py \
    --dir1 ./precision_check/run1 \
    --dir2 ./precision_check/run2

Testing Example

Run the verification script:

bash scripts/precision_check/run_precision_check_gpt2.sh

…sion checker

Counter mechanism:
- Add ResetCounters() to clear tensor counter at iteration boundaries
- Move counter management to PrecisionCheckEnv with thread_local storage
- Call ResetCounters() at start of each training step in gpt2/llama3

Precision checker refactoring:
- Remove baseline comparison functionality (use separate script instead)
- Remove table format output, keep only simple and md5 formats
- Add TensorStats struct with min/max/mean/nan_count/inf_count
- Add SaveNpy() function for NPY file saving with rank subdirectories
- Simplify log output format with dtype, shape, stats, and first 6 values
- Change stage names from "Module Forward/Backward Output" to "Forward/Backward Output"
- Use std::filesystem instead of sys/stat.h for directory creation
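For reference, creating a per-rank output directory with std::filesystem might look like the sketch below. The helper name `EnsureRankDir` and the `rank_<N>` directory naming are assumptions for illustration; the actual layout used by SaveNpy() is not shown in this message.

```cpp
#include <filesystem>
#include <string>
#include <system_error>

// Hypothetical helper: builds <base_dir>/rank_<rank> and creates it,
// including any missing parent directories. Errors are reported through
// `ec` instead of exceptions; an already existing directory is not an error.
std::filesystem::path EnsureRankDir(const std::filesystem::path& base_dir,
                                    int rank,
                                    std::error_code& ec) {
    std::filesystem::path dir = base_dir / ("rank_" + std::to_string(rank));
    std::filesystem::create_directories(dir, ec);
    return dir;
}
```

Unlike a single mkdir() call, `create_directories` creates the whole path in one step and is portable across platforms.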

Documentation and scripts:
- Update docs/precision_checker_guide.md with current implementation
- Add precision_compare.py for offline NPY comparison
- Add run_precision_check_gpt2.sh and run_precision_check_llama3.sh

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>