wassname

I'm just a guy who likes to machine learn. I want the good ending, not the bad one.

I work on AI alignment: steering, evals, and practical interpretability.

Links: wassname.org · Hugging Face · LessWrong · Gists


Current focus

  • AntiPaSTO (upcoming): self-supervised steering of moral reasoning. Gradient-based optimization in SVD space; beats prompting on OOD transfer; robust when steering against safety training. [Preprint in prep] A generic sketch of the SVD-space idea follows below.
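
The preprint isn't out, so the code below is not AntiPaSTO itself: it is only a generic sketch of the idea named above, i.e. decompose a frozen weight matrix with SVD, keep the factors fixed, and learn a gain on the singular values by gradient descent. The matrix, layer choice, and `steer_loss` objective are all hypothetical stand-ins.

```python
# Hypothetical sketch of gradient-based steering in SVD space.
# NOT the AntiPaSTO method (preprint pending); illustrative only.
import torch

W = torch.randn(512, 512)  # stand-in for a frozen transformer weight matrix
U, S, Vh = torch.linalg.svd(W, full_matrices=False)

gain = torch.nn.Parameter(torch.zeros_like(S))  # learnable per-direction gain
opt = torch.optim.Adam([gain], lr=1e-2)

def steer_loss(W_steered: torch.Tensor) -> torch.Tensor:
    # Placeholder objective; a real one would score model behaviour
    # (e.g. a self-supervised contrast between moral framings).
    x = torch.randn(8, 512)
    return -(x @ W_steered.T).mean()

for _ in range(100):
    W_steered = U @ torch.diag(S * (1 + gain)) @ Vh  # edit only the spectrum
    loss = steer_loss(W_steered)
    opt.zero_grad(); loss.backward(); opt.step()
```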

Alignment work

Steering & intervention

| Repo | What it does |
| --- | --- |
| Unsupervised-Elicitation | Replicated Anthropic's ICM paper; the model self-reports labeling heuristics on TruthfulQA without supervision. LW note |
| coconut | Replicated Facebook's COCONUT and added the SEQ-VCR loss. Found training is very slow (not emphasised by the authors). WIP branch: adapter recursion in SVD space. |
| How to steer thinking models | RepEng fork that works on reasoning models (sketch below). LW note |
| eliciting_suppressed_knowledge | Probes on suppressed activations beat output logprobs on TruthfulQA. Shows linear probes have limits, motivating gradient-based methods. |
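
For context on the steering rows above: the core RepEng-style move is adding a fixed direction to the residual stream during generation. A minimal sketch with a transformers forward hook; the model, layer index, direction, and strength are illustrative (a real direction would come from contrastive prompt pairs).

```python
# Minimal RepEng-style activation steering via a forward hook (illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

direction = torch.randn(model.config.n_embd)  # stand-in steering vector
direction = direction / direction.norm()

def hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + 4.0 * direction  # inject the direction, strength 4.0
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = model.transformer.h[6].register_forward_hook(hook)  # gpt2 block 6
ids = tok("The meaning of life is", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=20)
handle.remove()
print(tok.decode(out[0]))
```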

Evals & datasets

| Repo | What it does |
| --- | --- |
| open_pref_eval | Judge-free preference eval via logprobs (sketch below). Converts Machiavelli, ETHICS, GENIES to fast logprob evals. |
| llm_ethics_leaderboard | Moral preference leaderboard; logprob rankings + permutation debiasing. Results site |
| activation_store | Store transformer activations as HF datasets; avoid OOM; reuse for probing. |
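
The judge-free trick is to score each candidate answer by the summed log-probability the model assigns to its tokens and rank candidates by that score; permutation debiasing then averages scores over answer orderings. A hypothetical sketch, not open_pref_eval's actual API (it also assumes the tokenizer splits cleanly at the prompt boundary):

```python
# Hypothetical judge-free preference scoring via logprobs.
# Not open_pref_eval's real API; illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def answer_logprob(prompt: str, answer: str) -> float:
    ids = tok(prompt + answer, return_tensors="pt").input_ids
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
    logps = model(ids).logits[0, :-1].log_softmax(-1)  # position i predicts token i+1
    tok_logps = logps.gather(1, ids[0, 1:, None]).squeeze(1)
    return tok_logps[n_prompt - 1:].sum().item()  # sum over the answer tokens only

prompt = "Q: Is it wrong to steal?\nA:"
scores = {c: answer_logprob(prompt, c) for c in [" Yes.", " No."]}
print(scores)  # higher = preferred; average over orderings to debias
```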

Exploratory / negative results

These informed later work but didn't yield conclusive positive results.

| Repo | Status |
| --- | --- |
| repr-preference-optimization | Early attempt at hidden-state preference optimization. Superseded by AntiPaSTO. |
| LoRA_are_lie_detectors | Adapters as end-to-end probes (sketch below). Promising direction, inconclusive results. |
| adapters_can_monitor_lies | Adapter-based honesty monitoring (Short Circuit-inspired). Paused. |
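
For reference, "adapters as end-to-end probes" means training a small LoRA plus a classification head to predict a label (e.g. truthful vs. deceptive) straight through the model, rather than fitting a linear probe on frozen activations. A hypothetical sketch with peft, not the repo's actual code:

```python
# Hypothetical LoRA-adapter-as-probe sketch; not the repo's actual code.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
base = AutoModelForCausalLM.from_pretrained("gpt2")
model = get_peft_model(base, LoraConfig(r=8, target_modules=["c_attn"]))

head = torch.nn.Linear(base.config.n_embd, 2)  # e.g. truthful vs. deceptive
params = [p for p in model.parameters() if p.requires_grad]
opt = torch.optim.Adam(params + list(head.parameters()), lr=1e-4)

def probe_logits(text: str) -> torch.Tensor:
    ids = tok(text, return_tensors="pt")
    hidden = model(**ids, output_hidden_states=True).hidden_states[-1]
    return head(hidden[:, -1])  # classify from the last token's state

# One illustrative training step on a toy labelled example (label 0 = truthful).
loss = torch.nn.functional.cross_entropy(
    probe_logits("The capital of France is Paris."), torch.tensor([0]))
loss.backward(); opt.step(); opt.zero_grad()
```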

Reference

| Repo | What it is |
| --- | --- |
| awesome-interpretability | Curated mechinterp + probing + tooling map. |

Other ML work (world models, time series, misc)

World models

Time series & spatial

Misc


Lol

STOP DOING MATH!

Pinned

  1. attentive-neural-processes: implementing "recurrent attentive neural processes" to forecast power usage (with an LSTM baseline and MC Dropout).

  2. open_pref_eval: hackable, simple LLM evals on preference datasets.

  3. repr-preference-optimization: align inner states, not actions, for better generalization? [WIP]

  4. eliciting_suppressed_knowledge: probing suppressed activations gives improvements on TruthfulQA.

  5. llm_ethics_leaderboard: evaluate the moral and ethical values of language models using choice ranking in text-based games.

  6. AntiPaSTO: self-supervised steering of moral reasoning.