@dimitri-yatsenko dimitri-yatsenko commented Jan 2, 2026

Summary

This PR implements semantic matching for joins in DataJoint 2.0. Instead of matching attributes purely by name (natural join), DataJoint now tracks attribute lineage (origin) and only allows joins on attributes that share both the same name AND the same lineage.

📄 Full Specification - API reference, user guide, and implementation details

Key Changes

  • Lineage tracking: Each attribute now has a lineage property indicating its origin (schema.table.attribute)
  • ~lineage table: Hidden per-schema table storing lineage information, populated at table declaration time
  • Semantic checks: Joins and restrictions raise an error if namesake attributes have different lineages
  • Schema methods: schema.rebuild_lineage() to restore lineage for legacy schemas, schema.lineage_table_exists property
  • Removed operators: @ (permissive join) and ^ (permissive restriction) replaced by .join(semantic_check=False) and .restrict(semantic_check=False)
  • dj.U * table removed: Join with universal set is no longer supported
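
To make the lineage idea concrete, here is a minimal pure-Python sketch. It is not DataJoint's implementation. It assumes that an attribute introduced in a table's own primary key originates in that table, that an attribute inherited through a foreign key keeps its parent's lineage, and that secondary attributes carry no lineage. The `DECLARED` mapping, the table names, and `resolve_lineage` are hypothetical, and foreign-key renames are ignored.

```python
from typing import Dict, Optional

# Hypothetical toy model. For each "schema.table", map an attribute name to:
#   "pk"            -> introduced in this table's primary key
#   "schema.table"  -> inherited from that parent through a foreign key
#   None            -> secondary attribute, no lineage in this sketch
DECLARED: Dict[str, Dict[str, Optional[str]]] = {
    "lab.subject": {"subject_id": "pk", "species": None},
    "lab.session": {"subject_id": "lab.subject", "session_id": "pk"},
    "other.subject": {"subject_id": "pk"},
}


def resolve_lineage(table: str, attr: str) -> Optional[str]:
    """Return the origin of an attribute as 'schema.table.attribute', or None."""
    origin = DECLARED[table][attr]
    if origin is None:
        return None
    if origin == "pk":
        return f"{table}.{attr}"
    # Inherited through a foreign key: follow the chain to the parent table.
    return resolve_lineage(origin, attr)


if __name__ == "__main__":
    for table in DECLARED:
        for attr in DECLARED[table]:
            print(f"{table}.{attr} -> {resolve_lineage(table, attr)}")
```

In this toy model, `lab.session.subject_id` and `lab.subject.subject_id` share a lineage and would be joinable, while `other.subject.subject_id` has the same name but a different origin.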

New Files

  • src/datajoint/lineage.py - Lineage management module
  • tests/integration/test_semantic_matching.py - 21 comprehensive tests
  • docs/src/design/semantic-matching-spec.md - Full specification with API reference and user guide

Modified Files

  • condition.py - assert_join_compatibility() with semantic checking
  • expression.py - Updated join/restrict methods, removed @/^ operators
  • heading.py - lineage_available property, lineage loading from ~lineage table
  • table.py - _populate_lineage() at declaration, cleanup at drop
  • declare.py - FK attribute mapping for lineage tracking
  • schemas.py - rebuild_lineage() method, lineage_table_exists property

Behavior Summary

| Scenario | Action |
|---|---|
| Same name, same lineage | Match (join proceeds) |
| Same name, different lineage | Error |
| Same name, either lineage null | Error |
| ~lineage table missing | Warning + skip semantic check |
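
The four scenarios can be sketched as one decision function. This is an illustration of the rules listed here, not the code in condition.py: `check_namesake` is a hypothetical name, `ValueError` stands in for whatever exception DataJoint actually raises, and the return strings are only for demonstration.

```python
import warnings
from typing import Optional


def check_namesake(
    name: str,
    lineage_a: Optional[str],
    lineage_b: Optional[str],
    lineage_table_exists: bool = True,
) -> str:
    """Apply the semantic check to one attribute name shared by two operands."""
    if not lineage_table_exists:
        # ~lineage table missing: warn and skip the semantic check.
        warnings.warn(f"lineage unavailable; semantic check skipped for '{name}'")
        return "skipped"
    if lineage_a is None or lineage_b is None:
        # Same name, either lineage null: error.
        raise ValueError(f"attribute '{name}' has no lineage in one operand")
    if lineage_a != lineage_b:
        # Same name, different lineage: error.
        raise ValueError(
            f"attribute '{name}' has different lineages: {lineage_a} vs {lineage_b}"
        )
    # Same name, same lineage: the join proceeds on this attribute.
    return "match"


if __name__ == "__main__":
    print(check_namesake("subject_id", "lab.subject.subject_id", "lab.subject.subject_id"))
```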

Migration for Users

# Removed operators
A @ B                    # Use: A.join(B, semantic_check=False)
A ^ B                    # Use: A.restrict(B, semantic_check=False)
dj.U('a') * B            # Removed (use dj.U('a') & B instead)

# Rebuild lineage for legacy schemas
schema.rebuild_lineage()  # Then restart kernel
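
For legacy schemas, the rebuild step could be wrapped in a small guard. The sketch below relies only on the two members named in this PR, `schema.lineage_table_exists` and `schema.rebuild_lineage()`. The helper `ensure_lineage` is hypothetical, and it assumes that `rebuild_lineage()` creates the ~lineage table when it is missing.

```python
def ensure_lineage(schema) -> bool:
    """Rebuild lineage only when the hidden ~lineage table is missing.

    `schema` is any object exposing `lineage_table_exists` and
    `rebuild_lineage()`, as described in this PR. Returns True if a
    rebuild was triggered, False if lineage was already present.
    """
    if schema.lineage_table_exists:
        return False
    # Assumption: rebuild_lineage() creates and populates ~lineage.
    schema.rebuild_lineage()
    return True
```

As with the direct call, restart the kernel after a rebuild so the new lineage is loaded.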

Test Plan

  • All 21 new semantic matching tests pass
  • Existing relational operand tests pass (63 tests)
  • Pre-commit hooks pass (ruff, codespell, formatting)

🤖 Generated with Claude Code

- Add lineage tracking via ~lineage table per schema
- Track attribute origin (schema.table.attribute) for FK and PK attributes
- Semantic check on joins/restrictions: error if namesakes have different lineage
- Add Schema.rebuild_lineage() to restore lineage for legacy schemas
- Add Schema.lineage_table_exists property
- Remove @ and ^ operators (use .join/.restrict with semantic_check=False)
- Remove dj.U * table pattern (use dj.U & table instead)
- Warn when parent lineage missing during table declaration
- Skip semantic check with warning if ~lineage table doesn't exist
- Add comprehensive spec with API reference and user guide

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@github-actions github-actions bot added the enhancement and documentation labels Jan 2, 2026
@dimitri-yatsenko dimitri-yatsenko mentioned this pull request Jan 2, 2026
dimitri-yatsenko and others added 2 commits January 2, 2026 00:49
- Remove redundant lineage_table_exists check in table.py (already
  handled inside delete_table_lineages)
- Update spec examples to use core DataJoint types (uint32, uint16)
  instead of native types (int)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add get_schema_lineages() function in lineage.py
- Add schema.lineage property returning flat dict mapping
  'schema.table.attribute' to its lineage origin
- Add note about A - B without semantic check in spec
- Document schema.lineage in API reference

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>