1.5 yrs ago, we set out to answer a seemingly simple question: what are we *actually* getting out of RL in fine-tuning? I'm thrilled to share a pearl we found on the deepest dive of my PhD: the value of RL in RLHF seems to come from *generation-verification gaps*. Get ready to 🤿:
Mar 4, 2025, 8:59 PM