@dimitri-yatsenko dimitri-yatsenko commented Jan 1, 2026

Summary

This PR redesigns the custom type system: the "AttributeType"/"adapter" terminology is renamed to "Codec", and codecs now register automatically and report their storage type through a single get_dtype method. It also replaces the manually managed docker-compose test setup with testcontainers.

Codec API Redesign

Key Changes

  • Renamed AttributeType base class to Codec - The new name better reflects the purpose: encoding Python objects for database storage and decoding them on retrieval
  • Auto-registration via __init_subclass__ - Codecs automatically register when their class is defined; no decorator needed
  • New get_dtype(is_external) method - Codecs dynamically return their underlying storage type based on whether external storage is used
  • Unified naming convention - All built-in codecs use <name> or <name@store> syntax consistently
  • Renamed type category ADAPTED to CODEC in internal code

Built-in Codecs

| Codec | Internal dtype | External dtype | Description |
|---|---|---|---|
| <blob> | bytes | <hash@> | DataJoint serialization for Python objects |
| <hash@> | N/A | json | Content-addressed storage with MD5 deduplication |
| <object@> | N/A | json | Path-addressed storage for files/folders |
| <attach> | bytes | <hash@> | File attachments |
| <filepath@store> | N/A | json | Reference to existing external files |
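The internal and external storage types listed above form a chain that is followed until a core type is reached. The sketch below is a hypothetical illustration of that resolution, not the code in declare.py: the `DTYPES` dict simply copies the table, and `resolve_chain` is an invented helper name. Propagating the store name to the inner codec follows the "propagate store to inner codecs" fix mentioned in the commit log.

```python
# Hypothetical lookup table copied from the built-in codec table:
# codec name -> (internal dtype, external dtype); None means "not allowed".
DTYPES = {
    "blob": ("bytes", "<hash@>"),
    "attach": ("bytes", "<hash@>"),
    "hash": (None, "json"),
    "object": (None, "json"),
    "filepath": (None, "json"),
}


def resolve_chain(spec):
    """Follow <codec> declarations until a core type (bytes, json) is reached."""
    chain = [spec]
    while spec.startswith("<"):
        name, at, store = spec[1:-1].partition("@")
        internal, external = DTYPES[name]
        dtype = external if at else internal  # '@' present means external
        if dtype is None:
            raise ValueError(f"<{name}> is external-only; declare it as <{name}@>")
        if store and dtype.endswith("@>"):
            dtype = dtype[:-1] + store + ">"  # pass the store on to the inner codec
        chain.append(dtype)
        spec = dtype
    return chain


print(resolve_chain("<blob>"))
print(resolve_chain("<blob@cold>"))
```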

New Codec API

import datajoint as dj
import networkx as nx  # provides nx.Graph, used in decode()

class GraphCodec(dj.Codec):
    name = "graph"  # Auto-registers on class definition
    
    def get_dtype(self, is_external: bool) -> str:
        return "<blob>"  # Delegates serialization to blob codec
    
    def encode(self, graph, *, key=None, store_name=None):
        return {'nodes': list(graph.nodes()), 'edges': list(graph.edges())}
    
    def decode(self, stored, *, key=None):
        G = nx.Graph()
        G.add_nodes_from(stored['nodes'])
        G.add_edges_from(stored['edges'])
        return G

Testing Infrastructure: Testcontainers

Tests now use testcontainers to automatically manage MySQL and MinIO containers. No manual docker-compose up required.

New Developer Workflow

# Clone and install
git clone https://github.com/datajoint/datajoint-python.git
cd datajoint-python
pip install -e ".[test]"

# Run all tests - containers start automatically
pytest tests/

Benefits

  • Zero setup - Just pip install and pytest
  • Dynamic ports - No conflicts with other services
  • Automatic cleanup - Containers stop when tests finish
  • Simpler CI - No docker-compose orchestration needed

Fallback: External Containers

For development/debugging with persistent containers:

docker compose up -d db minio
DJ_USE_EXTERNAL_CONTAINERS=1 pytest tests/
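One way the switch between the two modes could be wired is sketched below. This is an assumption-laden illustration rather than the actual conftest.py: the helper names, the accepted truthy values, the `DJ_HOST`/`DJ_PORT` variable names with their defaults, and the `mysql:8.0` image tag are all hypothetical. `MySqlContainer` comes from the testcontainers package and is imported lazily, so external mode does not need it installed.

```python
import os
from contextlib import contextmanager


def use_external_containers(environ=None):
    """Return True when DJ_USE_EXTERNAL_CONTAINERS is set to a truthy value."""
    environ = os.environ if environ is None else environ
    value = environ.get("DJ_USE_EXTERNAL_CONTAINERS", "")
    return value.strip().lower() in {"1", "true", "yes"}  # assumed truthy set


@contextmanager
def mysql_endpoint(environ=None):
    """Yield (host, port) of the test database and clean up afterwards."""
    environ = os.environ if environ is None else environ
    if use_external_containers(environ):
        # Persistent containers started with docker compose; nothing to stop.
        host = environ.get("DJ_HOST", "localhost")  # assumed variable name
        port = int(environ.get("DJ_PORT", "3306"))  # assumed variable name
        yield host, port
        return
    # Default mode: start a throwaway container on a dynamically mapped port.
    from testcontainers.mysql import MySqlContainer

    with MySqlContainer("mysql:8.0") as container:
        yield container.get_container_host_ip(), int(container.get_exposed_port(3306))
```

A fixture could wrap `mysql_endpoint()` in a `with` block so the container stops when the test session ends.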

Dead Code & Terminology Cleanup

Removed

  • Backward compatibility aliases (AttributeType, register_type, list_types, get_type, etc.)
  • Backward compatibility codec aliases (ObjectType, AttachType, XAttachType, FilepathType)
  • Unused build_foreign_key_parser_old() function
  • Legacy feature switches (ADAPTED_TYPE_SWITCH, FILEPATH_FEATURE_SWITCH)
  • Unused enable_filepath_feature test fixture
  • Misleading comments about non-existent DJBlobType/ContentType
  • object-type-spec.md (implementation complete, info now in object.md)
  • pytest-env dependency (testcontainers handles configuration)

Renamed

  • Type category ADAPTED → CODEC in declare.py and heading.py
  • Test files:
    • schema_adapted.py → schema_codecs.py
    • test_adapted_attributes.py → test_codecs.py
    • test_type_composition.py → test_codec_chaining.py
  • Test classes and functions to use Codec terminology

Updated Terminology

  • All comments/docstrings: "AttributeType" → "Codec", "Adapter" → "Codec"
  • Fixed content_registry.py docstring: SHA256 → MD5

Documentation Updates

New Documentation

  • codec-spec.md - Detailed API specification for creating custom codecs
  • codecs.md - User guide with examples (replaces customtype.md)
  • README.md - Comprehensive developer guide with test/pre-commit instructions

Updated Documentation

  • mkdocs.yaml - Navigation updated: customtype.md → codecs.md
  • attributes.md - Fixed dead links, updated terminology
  • docker-compose.yaml - Clarified it's optional for tests

Removed Documentation

  • object-type-spec.md (redundant with object.md)
  • customtype.md (replaced by codecs.md)

Other Changes in This Branch

Settings System

  • Simplified settings to pure Pydantic without backward compatibility shims
  • Added recursive config file search and secrets separation
  • Removed deprecated save_* methods and set_password function

Type System

  • Added type aliases (int8, int16, int32, int64, uint8, etc.)
  • Added decimal(n,f) to core types
  • Renamed core type 'blob' to 'bytes' for cross-database portability
  • Added text type and documented type modifier policy

External Storage

  • Refactored external storage to use fsspec for unified backend support
  • Implemented content registry with MD5 hashing (removed SHA256)
  • Added garbage collection module for external storage cleanup
  • Implemented <object@> type for managed file/folder storage

Infrastructure

  • Dropped support for Python < 3.10 and MySQL < 8.0
  • Reorganized tests into unit/ and integration/ directories
  • Version bump to 2.0.0a7

Test Plan

  • All 471 tests pass
  • Verified codec auto-registration works correctly
  • Verified external storage chain (<blob@> → <hash@> → storage) works
  • Verified testcontainers starts/stops containers automatically
  • Verified external container mode works (DJ_USE_EXTERNAL_CONTAINERS=1)
  • Verified renamed test files execute correctly

🤖 Generated with Claude Code

claude and others added 23 commits January 1, 2026 18:37
Add fixed-point decimal as a core DataJoint type, allowing it to be
recorded in field comments using :type: syntax for reconstruction.
This provides scientists with a standardized type for exact numeric
precision use cases (financial data, coordinates, etc.).

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
Change the core binary type from 'blob' to 'bytes' to:
- Enable cross-database portability (LONGBLOB in MySQL, BYTEA in PostgreSQL)
- Free up native blob types (tinyblob, blob, mediumblob, longblob)
- Use Pythonic naming that matches the stored/returned type

Update all documentation to include PostgreSQL type mappings alongside
MySQL mappings, making the cross-database support explicit.

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
Correct the dtype documentation to clarify:
- longblob is a native MySQL type for raw binary data (not serialized)
- <djblob> should be used as dtype for serialized Python objects

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
PostgreSQL supports native ENUM via CREATE TYPE ... AS ENUM, which
provides similar semantics to MySQL ENUM (efficient storage, value
enforcement, definition-order ordering). DataJoint will handle the
separate type creation automatically.

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
- Rewrite attributes.md to prioritize core types over native types
- Add timezone policy: all datetime values stored as UTC
- Timezone conversion is a presentation concern, not database concern
- Update storage-types-spec.md with UTC policy and CURRENT_TIMESTAMP example

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
Core types:
- Add `text` as a core type for unlimited-length text (TEXT in both MySQL
  and PostgreSQL)

Type modifiers policy:
- Document that SQL modifiers (NOT NULL, DEFAULT, PRIMARY KEY, UNIQUE,
  COMMENT) are not allowed - DataJoint has its own syntax
- Document that AUTO_INCREMENT is discouraged but allowed with native types
- UNSIGNED is allowed as part of type semantics

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
- UTF-8 required: utf8mb4 (MySQL) / UTF8 (PostgreSQL)
- Case-sensitive by default: utf8mb4_bin / C collation
- Database-level configuration via dj.config, not per-column
- CHARACTER SET and COLLATE modifiers not allowed in type definitions
- Like timezone, encoding is infrastructure configuration

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
- Reorganize "Special DataJoint-only datatypes" as "AttributeTypes"
- Add naming convention explanation (dj prefix, x prefix, @store suffix)
- List all built-in AttributeTypes with categories:
  - Serialization types: <djblob>, <xblob>
  - File storage types: <object>, <content>
  - File attachment types: <attach>, <xattach>
  - File reference types: <filepath>
- Fix inconsistent angle bracket notation throughout docs
- Update example to use int32 core type and include <djblob>
- Expand naming conventions in Key Design Decisions section

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
The @ character now indicates external storage (object store vs database):
- No @ = internal (database): <blob>, <attach>
- @ present = external (object store): <blob@>, <attach@store>
- @ alone = default store: <blob@>
- @name = named store: <blob@cold>
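The four cases above can be captured by a small parser. This is a hypothetical sketch (the regular expression, `CodecSpec`, and `parse_codec_spec` are not names from the codebase), and the set of characters allowed in codec and store names is an assumption.

```python
import re
from typing import NamedTuple, Optional

# Assumed grammar: <name>, <name@>, or <name@store>.
_SPEC = re.compile(
    r"^<(?P<name>[a-z_][a-z0-9_]*)(?P<at>@(?P<store>[A-Za-z0-9_\-]*))?>$"
)


class CodecSpec(NamedTuple):
    name: str
    is_external: bool
    store: Optional[str]  # None means internal storage or the default store


def parse_codec_spec(spec: str) -> CodecSpec:
    match = _SPEC.match(spec.strip())
    if match is None:
        raise ValueError(f"not a codec type spec: {spec!r}")
    is_external = match.group("at") is not None  # '@' present, even if alone
    store = match.group("store") or None  # empty store name = default store
    return CodecSpec(match.group("name"), is_external, store)


for spec in ("<blob>", "<blob@>", "<blob@cold>"):
    print(spec, parse_codec_spec(spec))
```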

Key changes:
- Rename <djblob> to <blob> (internal) and <xblob> to <blob@> (external)
- Rename <xattach> to <attach@> (external variant of <attach>)
- Mark <object@>, <content@>, <filepath@> as external-only types
- Replace dtype property with get_dtype(is_external) method
- Use core type 'bytes' instead of 'longblob' for portability
- Add type resolution and chaining documentation
- Update Storage Comparison and Built-in AttributeType Comparison tables
- Simplify from 7 built-in types to 5: blob, attach, object, content, filepath

Type chaining at declaration time:
  <blob>      → get_dtype(False) → "bytes"     → LONGBLOB/BYTEA
  <blob@>     → get_dtype(True)  → "<content>" → json → JSON/JSONB
  <object@>   → get_dtype(True)  → "json"      → JSON/JSONB

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
Rename <content@> to <hash@> throughout documentation:
- More descriptive: indicates hash-based addressing mechanism
- Familiar concept: works like a hash data structure
- Storage folder: _content/ → _hash/
- Registry: ContentRegistry → HashRegistry

The <hash@> type provides:
- SHA256 hash-based addressing
- Automatic deduplication
- External-only storage (requires @)
- Used as dtype by <blob@> and <attach@>

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
- Use '= CURRENT_TIMESTAMP : datetime' syntax (not SQL DEFAULT)
- Use uint64 core type instead of 'bigint unsigned' native type

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
DataJoint handles nullability through the default value syntax:
- Attribute is nullable iff default is NULL
- No separate NOT NULL / NULL modifier needed
- Examples: required, nullable, and default value cases
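The three cases can be illustrated with a toy parser for DataJoint-style attribute lines (`name = default : type # comment`). The parser is a hypothetical sketch, not the logic in declare.py, and the attribute names are invented; it ignores edge cases such as a `:` or `#` inside a quoted default.

```python
def parse_attribute(line):
    """Split 'name [= default] : type [# comment]' into its parts."""
    declaration, _, comment = line.partition("#")
    left, _, dtype = declaration.partition(":")
    name, has_default, default = left.partition("=")
    default = default.strip() if has_default else None
    return {
        "name": name.strip(),
        "type": dtype.strip(),
        "default": default,
        # Nullable if and only if the default is NULL.
        "nullable": default is not None and default.lower() == "null",
        "comment": comment.strip(),
    }


examples = [
    "subject_id : uint64                    # required: no default",
    "notes = null : varchar(255)            # nullable: default is NULL",
    "created = CURRENT_TIMESTAMP : datetime # not nullable, has a default",
]
for line in examples:
    attr = parse_attribute(line)
    print(attr["name"], attr["default"], attr["nullable"])
```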

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
Hash metadata (hash, store, size) is stored directly in each table's
JSON column - no separate registry table is needed. Garbage collection
now scans all tables to find referenced hashes in JSON fields directly.
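A registry-free garbage collection pass of this kind reduces to a set difference. The sketch below assumes each JSON value carries the `hash`, `store`, and `size` fields named above; the function names and the in-memory inputs are hypothetical, standing in for real table scans and store listings.

```python
import json


def referenced_hashes(json_values):
    """Collect (store, hash) pairs from the JSON metadata of one column."""
    refs = set()
    for value in json_values:
        meta = json.loads(value) if isinstance(value, str) else value
        if isinstance(meta, dict) and "hash" in meta:
            refs.add((meta.get("store"), meta["hash"]))
    return refs


def find_orphans(stored, json_columns):
    """Return stored (store, hash) pairs that no table row references."""
    referenced = set()
    for column in json_columns:
        referenced |= referenced_hashes(column)
    return set(stored) - referenced


# Hypothetical data: two JSON columns and the objects present in the store.
column_a = ['{"hash": "aa11", "store": "cold", "size": 10}']
column_b = [{"hash": "bb22", "store": "cold", "size": 20}, None]
stored = {("cold", "aa11"), ("cold", "bb22"), ("cold", "cc33")}
print(find_orphans(stored, [column_a, column_b]))
```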

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
MD5 (128-bit, 32-char hex) is sufficient for content-addressed deduplication:
- Birthday bound ~2^64 provides adequate collision resistance for scientific data
- 32-char vs 64-char hashes reduces storage overhead in JSON metadata
- MD5 is ~2-3x faster than SHA256 for large files
- Consistent with existing dj.hash module (key_hash, uuid_from_buffer)
- Simplifies migration since only storage format changes, not the algorithm
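The size difference and the deduplication behaviour are easy to check with the standard library. This is a minimal sketch of content addressing with an in-memory dict standing in for the object store; `content_hash` and `put` are hypothetical names, not functions from the codebase.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """MD5 hex digest used as a content address (not for security)."""
    return hashlib.md5(data, usedforsecurity=False).hexdigest()


def put(store: dict, data: bytes) -> dict:
    """Write data under its hash; identical content is stored only once."""
    digest = content_hash(data)
    store.setdefault(digest, data)
    return {"hash": digest, "size": len(data)}


store = {}
first = put(store, b"spike times")
second = put(store, b"spike times")  # same bytes, same address

print(len(first["hash"]))                               # MD5: 32 hex characters
print(len(hashlib.sha256(b"spike times").hexdigest()))  # SHA256: 64 hex characters
print(len(store))                                       # one stored object
```

With a 128-bit digest, accidental collisions become likely only after roughly 2^64 distinct objects, which is the birthday bound cited above.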

Added Hash Algorithm Choice section documenting the rationale.

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
- uuid_from_file was never called anywhere in the codebase
- uuid_from_stream only existed to support uuid_from_file
- Inlined the logic directly into uuid_from_buffer
- Removed unused io and pathlib imports

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
The implementation plan was heavily outdated with:
- Old type names (<content>, <xblob>, <xattach> vs <hash@>, <blob@>, <attach@>)
- Wrong hash algorithm (SHA256 vs MD5)
- Wrong paths (_content/ vs _hash/)
- References to removed HashRegistry table

All relevant design information is now in storage-types-spec.md.
Implementation details (ObjectRef API, staged_insert) will be documented
in user-facing API docs when implemented.

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
- Rename DECIMAL to NUMERIC in native types (decimal is in core types)
- Rename TEXT to NATIVE_TEXT (text is in core types)
- Change BLOB references to BYTES in heading.py (bytes is the core type name)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Terminology changes in spec and user docs:
- "AttributeTypes" → "Codec Types" (category name)
- "AttributeType" → "Codec" (base class)
- "@register_type" → "@dj.codec" (decorator)
- "type_name" → "name" (class attribute)

The term "Codec" better conveys the encode/decode semantics of these
types, drawing on the familiar audio/video codec analogy.

Code changes (class renaming, backward-compat aliases) to follow.

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
Design improvements for Python 3.10+:
- Codecs auto-register when subclassed via __init_subclass__
- No decorator needed - just inherit from dj.Codec and set name
- Use register=False for abstract base classes
- Removed @dj.codec decorator from all examples

New API:
  class GraphCodec(dj.Codec):
      name = "graph"
      def encode(...): ...
      def decode(...): ...

Abstract bases:
  class ExternalOnlyCodec(dj.Codec, register=False):
      ...

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
- Codec.get_dtype(is_external) now determines storage type based on
  whether @ modifier is present in the declaration
- BlobCodec returns "bytes" for internal, "<hash>" for external
- AttachCodec returns "bytes" for internal, "<hash>" for external
- HashCodec, ObjectCodec, FilepathCodec enforce external-only usage
- Consolidates <blob>/<xblob> and <attach>/<xattach> into unified codecs
- Adds backward compatibility aliases for old type names
- Updates __init__.py with new codec exports (Codec, list_codecs, get_codec)
- Remove legacy codecs (djblob, xblob, xattach, content)
- Use unified codecs: <blob>, <attach>, <hash>, <object>, <filepath>
- All codecs support both internal and external modes via @store modifier
- Fix dtype chain resolution to propagate store to inner codecs
- Fix fetch.py to resolve correct chain for external storage
- Update tests to use new codec API (name, get_dtype method)
- Fix imports: use content_registry for get_store_backend
- Add 'local' store to mock_object_storage fixture

All 471 tests pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Rename attribute_type.py → codecs.py
- Rename builtin_types.py → builtin_codecs.py
- Rename test_attribute_type.py → test_codecs.py
- Rename get_adapter() → lookup_codec()
- Rename attr.adapter → attr.codec in Attribute namedtuple
- Update all imports and references throughout codebase
- Update comments and docstrings to use codec terminology

All 471 tests pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@github-actions github-actions bot added the enhancement and documentation labels Jan 1, 2026
dimitri-yatsenko and others added 5 commits January 1, 2026 17:23
- Remove AttributeType alias (use Codec directly)
- Remove register_type function (codecs auto-register)
- Remove deprecated type_name property (use name)
- Remove list_types, get_type, is_type_registered, unregister_type aliases
- Update all internal usages from type_name to name
- Update tests to use new API

The previous implementation was experimental; no backward
compatibility is needed for the v2.0 release.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add codec-spec.md: detailed API specification for creating codecs
- Add codecs.md: user guide with examples (replaces customtype.md)
- Remove customtype.md (replaced by codecs.md)

Documentation covers:
- Codec base class and required methods
- Auto-registration via __init_subclass__
- Codec composition/chaining
- Plugin system via entry points
- Built-in codecs (blob, hash, object, attach, filepath)
- Complete examples for neuroscience workflows

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The detailed implementation specification has served its purpose.
User documentation is now in object.md, codec API in codec-spec.md,
and type architecture in storage-types-spec.md.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Code cleanup:
- Remove backward compatibility aliases (ObjectType, AttachType, etc.)
- Remove misleading comments about non-existent DJBlobType/ContentType
- Remove unused build_foreign_key_parser_old function
- Remove unused feature switches (ADAPTED_TYPE_SWITCH, FILEPATH_FEATURE_SWITCH)
- Remove unused os import from errors.py
- Rename ADAPTED type category to CODEC

Documentation fixes:
- Update mkdocs.yaml nav: customtype.md → codecs.md
- Fix dead links in attributes.md pointing to customtype.md

Terminology updates:
- Replace "AttributeType" with "Codec" in all comments
- Replace "Adapter" with "Codec" in docstrings
- Fix SHA256 → MD5 in content_registry.py docstring

Version bump to 2.0.0a6

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Filepath feature is now always enabled; no feature flag needed.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
dimitri-yatsenko and others added 3 commits January 1, 2026 18:24
File renames:
- schema_adapted.py → schema_codecs.py
- test_adapted_attributes.py → test_codecs.py
- test_type_composition.py → test_codec_chaining.py

Content updates:
- LOCALS_ADAPTED → LOCALS_CODECS
- GraphType → GraphCodec, LayoutToFilepathType → LayoutCodec
- Test class names: TestTypeChain* → TestCodecChain*
- Test function names: test_adapted_* → test_codec_*
- Updated docstrings and comments

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Tests now automatically start MySQL and MinIO containers via testcontainers.
No manual `docker-compose up` required - just run `pytest tests/`.

Changes:
- conftest.py: Add mysql_container and minio_container fixtures that
  auto-start containers when tests run and stop them afterward
- pyproject.toml: Add testcontainers[mysql,minio] dependency, update
  pixi tasks, remove pytest-env (no longer needed)
- docker-compose.yaml: Update docs to clarify it's optional for tests
- README.md: Comprehensive developer guide with clear instructions for
  running tests, pre-commit hooks, and PR submission checklist

Usage:
- Default: `pytest tests/` - testcontainers manages containers
- External: `DJ_USE_EXTERNAL_CONTAINERS=1 pytest` - use docker-compose

Benefits:
- Zero setup for developers - just `pip install -e ".[test]" && pytest`
- Dynamic ports (no conflicts with other services)
- Automatic cleanup after tests
- Simpler CI configuration

Version bump to 2.0.0a7

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Update settings tests to accept dynamic ports (testcontainers uses
  random ports instead of default 3306)
- Fix test_top_restriction_with_keywords to use set comparison since
  dj.Top only guarantees which elements are selected, not their order
- Bump version to 2.0.0a8

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@dimitri-yatsenko dimitri-yatsenko force-pushed the claude/clarify-column-type-names-2dpns branch from 1ae815b to 15412ea on January 2, 2026 01:57
- Register requires_mysql and requires_minio marks in pyproject.toml
- Add pytest_collection_modifyitems hook to auto-mark tests based on
  fixture usage
- Remove autouse=True from configure_datajoint fixture so containers
  only start when needed
- Fix test_drop_unauthorized to use connection_test fixture
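Auto-marking by fixture usage could look roughly like the hook below. It is a sketch, not the project's conftest.py: the mapping from fixture names to marks is an assumption based on the `mysql_container` and `minio_container` fixtures and the `requires_mysql` and `requires_minio` marks named in this PR. pytest's `item.fixturenames` includes fixtures requested indirectly, and `add_marker` accepts a mark name as a string.

```python
# conftest.py (sketch)

# Assumed mapping: a test that depends on a container fixture gets the mark.
FIXTURE_MARKS = {
    "mysql_container": "requires_mysql",
    "minio_container": "requires_minio",
}


def pytest_collection_modifyitems(config, items):
    """Mark each collected test according to the fixtures it depends on."""
    for item in items:
        fixtures = set(getattr(item, "fixturenames", ()))
        for fixture, mark in FIXTURE_MARKS.items():
            if fixture in fixtures:
                item.add_marker(mark)
```

With the marks applied at collection time, `pytest -m "not requires_mysql"` deselects database tests without any per-test decorators.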

Tests can now run without Docker:
  pytest -m "not requires_mysql"  # Run 192 unit tests

Full test suite still works:
  DJ_USE_EXTERNAL_CONTAINERS=1 pytest tests/  # 471 tests

Bump version to 2.0.0a9

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@dimitri-yatsenko dimitri-yatsenko force-pushed the claude/clarify-column-type-names-2dpns branch from 15412ea to fa47f47 on January 2, 2026 01:59

Labels

documentation, enhancement

3 participants