wassname

I'm just a guy who likes to machine learn. I want the good ending, not the bad one.

I work on AI alignment: steering, evals, and practical interpretability.

Links: wassname.org · Hugging Face · LessWrong · Gists


Current focus

  • AntiPaSTO (upcoming): self-supervised steering of moral reasoning. Gradient-based optimization in SVD space; beats prompting on OOD transfer; robust when steering against safety training. [Preprint in prep] A generic sketch of the SVD-space idea follows below.
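
The preprint isn't out, so the code below is not AntiPaSTO itself: it is only a generic sketch of the idea named above, i.e. decompose a frozen weight matrix with SVD, keep the factors fixed, and learn a gain on the singular values by gradient descent. The matrix, layer choice, and `steer_loss` objective are all hypothetical stand-ins.

```python
# Hypothetical sketch of gradient-based steering in SVD space.
# NOT the AntiPaSTO method (preprint pending); illustrative only.
import torch

W = torch.randn(512, 512)  # stand-in for a frozen transformer weight matrix
U, S, Vh = torch.linalg.svd(W, full_matrices=False)

gain = torch.nn.Parameter(torch.zeros_like(S))  # learnable per-direction gain
opt = torch.optim.Adam([gain], lr=1e-2)

def steer_loss(W_steered: torch.Tensor) -> torch.Tensor:
    # Placeholder objective; a real one would score model behaviour
    # (e.g. a self-supervised contrast between moral framings).
    x = torch.randn(8, 512)
    return -(x @ W_steered.T).mean()

for _ in range(100):
    W_steered = U @ torch.diag(S * (1 + gain)) @ Vh  # edit only the spectrum
    loss = steer_loss(W_steered)
    opt.zero_grad(); loss.backward(); opt.step()
```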

Alignment work

Steering & intervention

| Repo | What it does |
| --- | --- |
| Unsupervised-Elicitation | Replicated Anthropic's ICM paper; the model self-reports labeling heuristics on TruthfulQA without supervision. LW note |
| coconut | Replicated Facebook's COCONUT and added the SEQ-VCR loss. Found training is very slow (not emphasised by the authors). WIP branch: adapter recursion in SVD space. |
| How to steer thinking models | RepEng fork that works on reasoning models (sketch below). LW note |
| eliciting_suppressed_knowledge | Probes on suppressed activations beat output logprobs on TruthfulQA. Shows linear probes have limits, motivating gradient-based methods. |
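
For context on the steering rows above: the core RepEng-style move is adding a fixed direction to the residual stream during generation. A minimal sketch with a transformers forward hook; the model, layer index, direction, and strength are illustrative (a real direction would come from contrastive prompt pairs).

```python
# Minimal RepEng-style activation steering via a forward hook (illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

direction = torch.randn(model.config.n_embd)  # stand-in steering vector
direction = direction / direction.norm()

def hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + 4.0 * direction  # inject the direction, strength 4.0
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = model.transformer.h[6].register_forward_hook(hook)  # gpt2 block 6
ids = tok("The meaning of life is", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=20)
handle.remove()
print(tok.decode(out[0]))
```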

Evals & datasets

| Repo | What it does |
| --- | --- |
| open_pref_eval | Judge-free preference eval via logprobs (sketch below). Converts Machiavelli, ETHICS, GENIES to fast logprob evals. |
| llm_ethics_leaderboard | Moral preference leaderboard; logprob rankings + permutation debiasing. Results site |
| activation_store | Store transformer activations as HF datasets; avoid OOM; reuse for probing. |
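
The judge-free trick is to score each candidate answer by the summed log-probability the model assigns to its tokens and rank candidates by that score; permutation debiasing then averages scores over answer orderings. A hypothetical sketch, not open_pref_eval's actual API (it also assumes the tokenizer splits cleanly at the prompt boundary):

```python
# Hypothetical judge-free preference scoring via logprobs.
# Not open_pref_eval's real API; illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def answer_logprob(prompt: str, answer: str) -> float:
    ids = tok(prompt + answer, return_tensors="pt").input_ids
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
    logps = model(ids).logits[0, :-1].log_softmax(-1)  # position i predicts token i+1
    tok_logps = logps.gather(1, ids[0, 1:, None]).squeeze(1)
    return tok_logps[n_prompt - 1:].sum().item()  # sum over the answer tokens only

prompt = "Q: Is it wrong to steal?\nA:"
scores = {c: answer_logprob(prompt, c) for c in [" Yes.", " No."]}
print(scores)  # higher = preferred; average over orderings to debias
```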

Exploratory / negative results

These informed later work but didn't yield conclusive positive results.

| Repo | Status |
| --- | --- |
| repr-preference-optimization | Early attempt at hidden-state preference optimization. Superseded by AntiPaSTO. |
| LoRA_are_lie_detectors | Adapters as end-to-end probes (sketch below). Promising direction, inconclusive results. |
| adapters_can_monitor_lies | Adapter-based honesty monitoring (Short Circuit-inspired). Paused. |
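
For reference, "adapters as end-to-end probes" means training a small LoRA plus a classification head to predict a label (e.g. truthful vs. deceptive) straight through the model, rather than fitting a linear probe on frozen activations. A hypothetical sketch with peft, not the repo's actual code:

```python
# Hypothetical LoRA-adapter-as-probe sketch; not the repo's actual code.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
base = AutoModelForCausalLM.from_pretrained("gpt2")
model = get_peft_model(base, LoraConfig(r=8, target_modules=["c_attn"]))

head = torch.nn.Linear(base.config.n_embd, 2)  # e.g. truthful vs. deceptive
params = [p for p in model.parameters() if p.requires_grad]
opt = torch.optim.Adam(params + list(head.parameters()), lr=1e-4)

def probe_logits(text: str) -> torch.Tensor:
    ids = tok(text, return_tensors="pt")
    hidden = model(**ids, output_hidden_states=True).hidden_states[-1]
    return head(hidden[:, -1])  # classify from the last token's state

# One illustrative training step on a toy labelled example (label 0 = truthful).
loss = torch.nn.functional.cross_entropy(
    probe_logits("The capital of France is Paris."), torch.tensor([0]))
loss.backward(); opt.step(); opt.zero_grad()
```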

Reference

| Repo | What it is |
| --- | --- |
| awesome-interpretability | Curated mechinterp + probing + tooling map. |

Other ML work (world models, time series, misc)

World models

Time series & spatial

Misc


Lol

STOP DOING MATH!

Pinned

  1. attentive-neural-processes: implementing "recurrent attentive neural processes" to forecast power usage (with an LSTM baseline and MC Dropout).

  2. open_pref_eval: hackable, simple LLM evals on preference datasets.

  3. repr-preference-optimization: align inner states, not actions, for better generalization? [WIP]

  4. eliciting_suppressed_knowledge: probing suppressed activations gives improvements on TruthfulQA.

  5. llm_ethics_leaderboard: evaluate the moral and ethical values of language models using choice ranking in text-based games.

  6. AntiPaSTO: self-supervised steering of moral reasoning.