Experimental browser for the Atmosphere
Why does Claude selectively comply in training? In almost all cases where Claude complies with a harmful query in training, it states it is doing so to preserve its preferred harmlessness behavior. Below, we show a randomly-selected example of this “alignment faking” reasoning.
Dec 18, 2024, 5:46 PM
{ "uri": "at://did:plc:dsxewietk5tigqvn6daod2l6/app.bsky.feed.post/3ldlw26c5gk2r", "cid": "bafyreib7vjanflidlmfr4kpzzo7zxpgb4rqgo2gesniq374evc6fxsws34", "value": { "text": "Why does Claude selectively comply in training? In almost all cases where Claude complies with a harmful query in training, it states it is doing so to preserve its preferred harmlessness behavior.\n\nBelow, we show a randomly-selected example of this “alignment faking” reasoning.", "$type": "app.bsky.feed.post", "embed": { "$type": "app.bsky.embed.images", "images": [ { "alt": "A randomly-selected transcript of Claude 3 Opus faking alignment. The first white bubble is the user’s prompt. This is followed by the red bubbles in which the model gives its reasoning in a “scratchpad” (as far as the model is concerned, neither users nor researchers can see this scratchpad). The final white bubble is the model’s output, with some of the graphic content removed for this illustration.\n", "image": { "$type": "blob", "ref": { "$link": "bafkreicoq2yzf6e4w6hucwnyyjjs72mweegp5yjyj4q23yqyj3bhoszvq4" }, "mimeType": "image/jpeg", "size": 668937 }, "aspectRatio": { "width": 1728, "height": 1258 } } ] }, "langs": [ "en" ], "reply": { "root": { "cid": "bafyreihzgyc76623mey63q7wusk3uckjsl5q4jnumjzjipq6a4p4mcnpga", "uri": "at://did:plc:dsxewietk5tigqvn6daod2l6/app.bsky.feed.post/3ldlw22eto22r" }, "parent": { "cid": "bafyreidw66ad7puwgldm5hi5ej4y7yylyi37ekp2l55nsr6vv6b6gyxppy", "uri": "at://did:plc:dsxewietk5tigqvn6daod2l6/app.bsky.feed.post/3ldlw24jkv22r" } }, "createdAt": "2024-12-18T17:46:57.671Z" } }