ATProto Browser

Experimental browser for the Atmosphere

Post

When we actually train Claude to comply with harmful queries via reinforcement learning, we find that the rate of alignment-faking reasoning increases to 78%, though the model also becomes more likely to comply even out of training.

Dec 18, 2024, 5:46 PM

Record data

{
  "uri": "at://did:plc:dsxewietk5tigqvn6daod2l6/app.bsky.feed.post/3ldlw2b3dwc2r",
  "cid": "bafyreihkfmgrkmojpe7wmyzykfxsiziq3b67xtnunurxvbi5jp5rt2ahue",
  "value": {
    "text": "When we actually train Claude to comply with harmful queries via reinforcement learning, we find that the rate of alignment-faking reasoning increases to 78%, though the model also becomes more likely to comply even out of training.",
    "$type": "app.bsky.feed.post",
    "embed": {
      "$type": "app.bsky.embed.images",
      "images": [
        {
          "alt": "Stacked area graphs showing the proportion of different behaviors observed: blue for refusing the harmful query, yellow for complying with the harmful query, and red cross-hatches on top of yellow for complying with the harmful query with alignment-faking reasoning in the model’s scratchpad. “Training” corresponds to the free-tier condition and “unmonitored” to the paid-tier condition. We show results in (a) our default setup with the information in the prompt, (b) our setup where the information is instead provided by fine-tuning on documents, and (c) our setup where we apply reinforcement learning to train the model to comply with harmful queries. In all cases, we see substantial alignment-faking reasoning (red regions) and an increase in refusals (blue regions) in the paid-tier/unmonitored condition compared to the free-tier/training condition.\n",
          "image": {
            "$type": "blob",
            "ref": {
              "$link": "bafkreifgpthm33b236xzjjoli36cvbzb2ppxtkwpkucgisqv7th67kkzfm"
            },
            "mimeType": "image/jpeg",
            "size": 305462
          },
          "aspectRatio": {
            "width": 1199,
            "height": 628
          }
        }
      ]
    },
    "langs": [
      "en"
    ],
    "reply": {
      "root": {
        "cid": "bafyreihzgyc76623mey63q7wusk3uckjsl5q4jnumjzjipq6a4p4mcnpga",
        "uri": "at://did:plc:dsxewietk5tigqvn6daod2l6/app.bsky.feed.post/3ldlw22eto22r"
      },
      "parent": {
        "cid": "bafyreievw4lqenphp57hstuu5dn343la7buzvml4k57e54qpmrhwsvy3jy",
        "uri": "at://did:plc:dsxewietk5tigqvn6daod2l6/app.bsky.feed.post/3ldlw27or622r"
      }
    },
    "createdAt": "2024-12-18T17:46:57.673Z"
  }
}
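
The `uri` values in the record above (the post itself, plus the `reply.root` and `reply.parent` references) share one layout: an authority (here a DID), a collection name, and a record key. Below is a minimal sketch of splitting such a URI apart, assuming exactly that three-part form. The helper `parse_at_uri` and the `AtUri` type are hypothetical names for illustration and are not taken from any ATProto library.

```python
from typing import NamedTuple


class AtUri(NamedTuple):
    authority: str   # DID of the repository owner
    collection: str  # record type, e.g. app.bsky.feed.post
    rkey: str        # record key within that collection


def parse_at_uri(uri: str) -> AtUri:
    """Split a URI of the assumed form at://authority/collection/rkey."""
    prefix = "at://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an AT URI: {uri!r}")
    parts = uri[len(prefix):].split("/")
    # This sketch only accepts the full three-part record form.
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected authority/collection/rkey: {uri!r}")
    return AtUri(*parts)


# The post URI from the record above.
post = parse_at_uri(
    "at://did:plc:dsxewietk5tigqvn6daod2l6/app.bsky.feed.post/3ldlw2b3dwc2r"
)
# The parent URI from the record's "reply" field.
parent = parse_at_uri(
    "at://did:plc:dsxewietk5tigqvn6daod2l6/app.bsky.feed.post/3ldlw27or622r"
)
```

Comparing `post.authority` with `parent.authority` shows that this reply and its parent were written by the same account, which matches the post being part of a self-authored thread.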