How close are current AI agents to automating AI research itself? Our new ML research engineering benchmark (RE-Bench) addresses this question by directly comparing frontier models such as Claude 3.5 Sonnet and o1-preview with 50+ human experts on 7 challenging research engineering tasks.
Nov 25, 2024, 7:42 PM
{
  "uri": "at://did:plc:dll3hepzq76nymel5c3yt6nk/app.bsky.feed.post/3lbsbrpmg3s2b",
  "cid": "bafyreiewghwpltsxrvzxb4pehqb2a4prnn5wee34pxbfkj3xmdrccvdyau",
  "value": {
    "text": "How close are current AI agents to automating AI research itself? Our new ML research engineering benchmark (RE-Bench) addresses this question by directly comparing frontier models such as Claude 3.5 Sonnet and o1-preview with 50+ human experts on 7 challenging research engineering tasks.",
    "$type": "app.bsky.feed.post",
    "embed": {
      "$type": "app.bsky.embed.images",
      "images": [
        {
          "alt": "",
          "image": {
            "$type": "blob",
            "ref": {
              "$link": "bafkreihmcehe66wwoqspghcrt5h3t3aawko3tsrwyeph5ibrpmwmyllrbe"
            },
            "mimeType": "image/jpeg",
            "size": 202823
          },
          "aspectRatio": {
            "width": 1070,
            "height": 634
          }
        }
      ]
    },
    "langs": [
      "en"
    ],
    "createdAt": "2024-11-25T19:42:38.034Z"
  }
}