ATProto Browser

Experimental browser for the Atmosphere

Post

1.5 yrs ago, we set out to answer a seemingly simple question: what are we *actually* getting out of RL in fine-tuning? I'm thrilled to share a pearl we found on the deepest dive of my PhD: the value of RL in RLHF seems to come from *generation-verification gaps*. Get ready to 🤿:

Mar 4, 2025, 8:59 PM

Record data

{
  "uri": "at://did:plc:pvsx2xwrpr255ezomllnazk4/app.bsky.feed.like/3ljlt5fjcrl2c",
  "cid": "bafyreif2xt26ubm53ep5ckyjaekn7qpquadcc7tgkjcq6hdzjnmtynwji4",
  "value": {
    "$type": "app.bsky.feed.like",
    "subject": {
      "cid": "bafyreigzkftzwetpm7j2xlg4sxwz5uvkyi7ygapcqrppktdxqz4jti3lh4",
      "uri": "at://did:plc:7e3hw64shux7ikibrebi6xx5/app.bsky.feed.post/3ljleat4sz22n"
    },
    "createdAt": "2025-03-05T01:26:18.412Z"
  }
}