Experimental browser for the Atmosphere
Claude usually refuses harmful queries. We told it we were instead training it to comply with them. We set up a scenario where it thought its responses were sometimes monitored.

When unmonitored, it nearly always refused. But when monitored, it faked alignment 12% of the time.
Dec 18, 2024, 5:46 PM
{
  "uri": "at://did:plc:dsxewietk5tigqvn6daod2l6/app.bsky.feed.post/3ldlw24jkv22r",
  "cid": "bafyreidw66ad7puwgldm5hi5ej4y7yylyi37ekp2l55nsr6vv6b6gyxppy",
  "value": {
    "text": "Claude usually refuses harmful queries. We told it we were instead training it to comply with them. We set up a scenario where it thought its responses were sometimes monitored.\n\nWhen unmonitored, it nearly always refused. But when monitored, it faked alignment 12% of the time.",
    "$type": "app.bsky.feed.post",
    "embed": {
      "$type": "app.bsky.embed.images",
      "images": [
        {
          "alt": "We give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it is in training, we say it will be trained only on conversations with free users, not paid users. We find the model complies with harmful queries from free users 14% of the time, versus almost never for paid users. Explaining this gap, in almost all cases where the model complies with a harmful query from a free user, we observe explicit alignment-faking reasoning, with the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training.\n",
          "image": {
            "$type": "blob",
            "ref": {
              "$link": "bafkreidk5mi34hixtbwmadvevwbvyl37wvtkrlgtffv2uuka3wccswgy3m"
            },
            "mimeType": "image/jpeg",
            "size": 862133
          },
          "aspectRatio": {
            "width": 2000,
            "height": 1536
          }
        }
      ]
    },
    "langs": [
      "en"
    ],
    "reply": {
      "root": {
        "cid": "bafyreihzgyc76623mey63q7wusk3uckjsl5q4jnumjzjipq6a4p4mcnpga",
        "uri": "at://did:plc:dsxewietk5tigqvn6daod2l6/app.bsky.feed.post/3ldlw22eto22r"
      },
      "parent": {
        "cid": "bafyreihzgyc76623mey63q7wusk3uckjsl5q4jnumjzjipq6a4p4mcnpga",
        "uri": "at://did:plc:dsxewietk5tigqvn6daod2l6/app.bsky.feed.post/3ldlw22eto22r"
      }
    },
    "createdAt": "2024-12-18T17:46:57.670Z"
  }
}