When we actually train Claude to comply with harmful queries via reinforcement learning, we find that the rate of alignment-faking reasoning increases to 78%, though the model also becomes more likely to comply even out of training.
Dec 18, 2024, 5:46 PM
{
  "uri": "at://did:plc:dsxewietk5tigqvn6daod2l6/app.bsky.feed.post/3ldlw2b3dwc2r",
  "cid": "bafyreihkfmgrkmojpe7wmyzykfxsiziq3b67xtnunurxvbi5jp5rt2ahue",
  "value": {
    "text": "When we actually train Claude to comply with harmful queries via reinforcement learning, we find that the rate of alignment-faking reasoning increases to 78%, though the model also becomes more likely to comply even out of training.",
    "$type": "app.bsky.feed.post",
    "embed": {
      "$type": "app.bsky.embed.images",
      "images": [
        {
          "alt": "Stacked area graphs showing the proportion of different behaviors observed: blue for refusing the harmful query, yellow for complying with the harmful query, and red cross-hatches on top of yellow for complying with the harmful query with alignment-faking reasoning in the model’s scratchpad. “Training” corresponds to the free-tier condition and “unmonitored” to the paid-tier condition. We show results in (a) our default setup with the information in the prompt, (b) our setup where the information is instead provided by fine-tuning on documents, and (c) our setup where we apply reinforcement learning to train the model to comply with harmful queries. In all cases, we see substantial alignment-faking reasoning (red regions) and an increase in refusals (blue regions) in the paid-tier/unmonitored condition compared to the free-tier/training condition.\n",
          "image": {
            "$type": "blob",
            "ref": { "$link": "bafkreifgpthm33b236xzjjoli36cvbzb2ppxtkwpkucgisqv7th67kkzfm" },
            "mimeType": "image/jpeg",
            "size": 305462
          },
          "aspectRatio": { "width": 1199, "height": 628 }
        }
      ]
    },
    "langs": [ "en" ],
    "reply": {
      "root": {
        "cid": "bafyreihzgyc76623mey63q7wusk3uckjsl5q4jnumjzjipq6a4p4mcnpga",
        "uri": "at://did:plc:dsxewietk5tigqvn6daod2l6/app.bsky.feed.post/3ldlw22eto22r"
      },
      "parent": {
        "cid": "bafyreievw4lqenphp57hstuu5dn343la7buzvml4k57e54qpmrhwsvy3jy",
        "uri": "at://did:plc:dsxewietk5tigqvn6daod2l6/app.bsky.feed.post/3ldlw27or622r"
      }
    },
    "createdAt": "2024-12-18T17:46:57.673Z"
  }
}