Skip to content

Conversation

@Pasta-coder
Copy link
Contributor

This PR extracts the formula parsing logic currently duplicated within ClassificationGAM.m and RegressionGAM.m into a new, centralized static utility class: FormulaParser.m.

related to #266 .

The Problem:
Previously, both GAM classes implemented their own private parseFormula methods. This led to code duplication and a silent logic bug where interactions="all" used fullfact (generating levels) instead of ff2n (generating the correct binary design matrix). Additionally, the legacy logic could not handle standard Wilkinson notation features needed for future compatibility (like * expansion or intercept removal).

The Solution:
I have introduced FormulaParser.m as a robust, standalone utility. This refactor lays the "infrastructure rails" for future improvements to functions like fitlm and anovan while immediately stabilizing the GAMs.

Key Changes:

  • Centralization: Removed private parsing methods from GAM classes; they now route strictly through FormulaParser.
  • Logic Fix: Swapped fullfact for ff2n in parseInteractions, ensuring correct binary interaction matrices are generated.
  • Enhanced Capabilities: The new parser now supports:
    • LHS Parsing: Extracts the Response Variable (e.g., captures "Cost" from "Cost ~ A + B").
    • Factorial Expansion: Supports the * operator (e.g., A*B expands to A + B + A:B).
    • Intercepts: Detects explicit intercept removal (-1).
  • Compatibility: Implemented using strfind (avoiding contains) to ensure backward compatibility with older Octave versions.
  • Stability: Output terms are now generated in a stable order to match MATLAB behavior.

Verification:

  • Added comprehensive BISTs to FormulaParser.m covering edge cases and syntax errors.
  • Updated existing tests in RegressionGAM and ClassificationGAM to align with the more robust error reporting.
  • Verified that wrappers fitrgam and fitcgam pass all tests without modification.

Note:
This PR focuses strictly on infrastructure and metadata. It does not alter the mathematical backfitting algorithm of the GAMs, nor does it yet refactor fitlm/anovan.

Known Limitations & Future Work:

  1. Term Ordering: Terms generated via the * operator (e.g., A*B) are currently returned in binary counting order (B, A, A:B). While mathematically equivalent, future work could sort these to match standard main-effect priority (A, B, A:B).
  2. Nested Grouping: The current parser handles standard operators (+, :, *, -). Complex nested groupings like (A + B):C are not yet supported and are intended for a future "Phase 2" update alongside fitlm integration.

@pr0m1th3as
Copy link
Member

There is no need to implement FormulaParser as a classdef. A simple function will do just as fine.

The function needs to be as well as documented as possible. Make sure you complete the documentation and also add demos with explanatory comments as examples. Finally, make sure you add BISTs for every style of the Wilkinson notation supported by the function.

@Pasta-coder
Copy link
Contributor Author

okay , working on it .

@Pasta-coder Pasta-coder marked this pull request as ready for review January 23, 2026 15:50
@Pasta-coder
Copy link
Contributor Author

FormulaParser Refactoring Summary

I have refactored FormulaParser.m into a standalone function file as requested, significantly expanding its capabilities and documentation.

Key Changes:

1. Recursive Descent Parsing: Now supports advanced Wilkinson syntax:

  • Nested Grouping: (A + B):C correctly expands to A:C + B:C
  • Factorial Expansion: A * B recursively expands to A + B + A:B
  • Standard Ordering: Terms are now sorted by complexity (Main Effects → Interactions) and variable index (A, B, A:B) to match standard conventions

2. Documentation: Added a comprehensive Texinfo block with detailed sections:

  • Inputs: Formula string, optional metadata parameters
  • Outputs: Term labels, design matrix indices, term information structure
  • Supported Syntax: Complete specification of Wilkinson notation support
  • Explanatory Demos: 6 detailed examples illustrating various use cases

3. Verification: Added extensive BISTs covering:

  • Nested grouping logic
  • Syntax error handling
  • Edge cases and boundary conditions

4. Integration Updates:

  • ClassificationGAM.m and RegressionGAM.m have been updated to use the new function signature
  • All existing tests pass successfully

Technical Improvements:

  • Standalone function design improves reusability
  • Recursive parsing enables complex formula handling
  • Consistent output formatting across all use cases
  • Enhanced error reporting for debugging

@pr0m1th3as
Copy link
Member

Can you add a few more complex notations and to make sure that recursiveness works as expected and doesn't have some hidden breaking point?

@Pasta-coder
Copy link
Contributor Author

You are absolutely right—the initial test suite was insufficient for verifying the robustness of the recursive parser, especially for complex nested interactions. I should have included these edge cases from the start to ensure stability.

I have now expanded the test suite from 8 to 40 tests, specifically targeting "torture test" scenarios to guarantee the recursive logic holds up.

New Complex Test Cases Include:

  • Deep Nesting & Associativity: Verified chains like A:B:C:D and deeply nested groups ((A:B):C):D.
  • High-Order Distribution: Added "FOIL-style" distribution tests like (A+B+C):(D+E).
  • Operator Precedence: Validated rules like A:B*C expanding to (A:B)*C (left-associative).
  • Redundancy & Self-Interaction: Handling A:A, A+A, and (A*B)*(A*B) to ensure term uniqueness.
  • "Kitchen Sink" Scenarios: Mixed operators, partial string matches (e.g., Age vs AgeGroup), and complex factorial expansions.

@pr0m1th3as
Copy link
Member

You are absolutely right—the initial test suite was insufficient for verifying the robustness of the recursive parser, especially for complex nested interactions. I should have included these edge cases from the start to ensure stability.

Please, stop copy pasting Claude output (or whatever LLM-based setup you are using as a coding assistant) as a response to this conversation.

New Complex Test Cases Include:

* **Deep Nesting & Associativity:** Verified chains like `A:B:C:D` and deeply nested groups `((A:B):C):D`.

* **High-Order Distribution:** Added "FOIL-style" distribution tests like `(A+B+C):(D+E)`.

* **Operator Precedence:** Validated rules like `A:B*C` expanding to `(A:B)*C` (left-associative).

* **Redundancy & Self-Interaction:** Handling `A:A`, `A+A`, and `(A*B)*(A*B)` to ensure term uniqueness.

* **"Kitchen Sink" Scenarios:** Mixed operators, partial string matches (e.g., `Age` vs `AgeGroup`), and complex factorial expansions.

Apart from all of the above being statistically meaningless (hence useless), where did you come up with all this, which neither MATLAB defined behavior, nor Wilkinson's original symbolic description. I don't mind contributors using LLM-based assistance tools, but I expect that they are fully responsible for verifying their code before opening a PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants