AIs Will Increasingly Fake Alignment
This post goes over the important and excellent new paper from Anthropic and Redwood Research, "Alignment Faking in Large Language Models," with Ryan Greenblatt as lead author.
This is by far the best [+90743 chars]
Source: Zvi Mowshowitz, https://thezvi.substack.com/p/ais-will-increasingly-fake-alignment