I'm just a guy who likes to machine learn. I want the good ending, not the bad one.
I work on AI alignment: steering, evals, and practical interpretability.
Links: wassname.org · Hugging Face · LessWrong · Gists
- AntiPaSTO (upcoming): Self-supervised steering of moral reasoning. Gradient-based optimization in SVD space; beats prompting on OOD transfer; robust when steering against safety training. [Preprint in prep]
| Repo | What it does |
|---|---|
| Unsupervised-Elicitation | Replicated Anthropic's ICM paper; model self-reports labeling heuristics on TruthfulQA without supervision. LW note |
| coconut | Replicated Facebook's COCONUT + added SEQ-VCR loss. Found training is very slow (not emphasised by authors). WIP branch: adapter recursion in SVD space. |
| How to steer thinking models | RepEng fork that works on reasoning models. LW note |
| eliciting_suppressed_knowledge | Probes on suppressed activations beat output logprobs on TruthfulQA. Shows linear probes have limits, motivating gradient-based methods. |

| Repo | What it does |
|---|---|
| open_pref_eval | Judge-free preference eval via logprobs. Converts Machiavelli, ETHICS, GENIES to fast logprob evals. |
| llm_ethics_leaderboard | Moral preference leaderboard; logprob rankings + permutation debiasing. Results site |
| activation_store | Store transformer activations as HF datasets; avoid OOM; reuse for probing. |
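For a rough idea of what "judge-free preference eval via logprobs" can look like, here is a minimal sketch. It is not the open_pref_eval API: the function names and the toy per-token log-probs are made up for illustration, and in practice the log-probs would come from scoring each completion with the model under test. The model "prefers" the chosen completion when it assigns it a higher (optionally length-normalized) log-probability than the rejected one, so no judge model is needed.

```python
from typing import List, Tuple

# Hypothetical helpers for illustration only; not taken from open_pref_eval.

def completion_score(token_logprobs: List[float], length_normalize: bool = True) -> float:
    """Score a completion by its total log-prob, optionally averaged per token."""
    total = sum(token_logprobs)
    if length_normalize:
        return total / max(len(token_logprobs), 1)
    return total

def preference_accuracy(
    pairs: List[Tuple[List[float], List[float]]],
    length_normalize: bool = True,
) -> float:
    """Fraction of (chosen, rejected) pairs where the chosen completion scores higher."""
    wins = sum(
        completion_score(chosen, length_normalize)
        > completion_score(rejected, length_normalize)
        for chosen, rejected in pairs
    )
    return wins / len(pairs)

# Toy per-token log-probs (assumed values, not real model outputs).
toy_pairs = [
    ([-0.2, -0.3], [-1.0, -1.5]),    # chosen is more likely
    ([-2.0, -2.5], [-0.5, -0.4]),    # rejected is more likely
    ([-0.1, -0.1, -0.1], [-0.6]),    # chosen is more likely per token
]

accuracy = preference_accuracy(toy_pairs)
print(f"preference accuracy: {accuracy:.2f}")
```

Length normalization is a design choice, not a given: without it, longer completions are penalized simply for having more tokens.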
These informed later work but didn't yield conclusive positive results.
| Repo | Status |
|---|---|
| repr-preference-optimization | Early attempt at hidden-state preference optimization. Superseded by AntiPaSTO. |
| LoRA_are_lie_detectors | Adapters as end-to-end probes. Promising direction, inconclusive results. |
| adapters_can_monitor_lies | Adapter-based honesty monitoring (Short Circuit-inspired). Paused. |

| Repo | What it is |
|---|---|
| awesome-interpretability | Curated mechinterp + probing + tooling map. |
Other ML work (world models, time series, misc)
World models
Time series & spatial
- attentive-neural-processes
- seq2seq-time
- np_vs_kriging
- rl-portfolio-management
- satellite_leak_detection
Misc



