Parallel-form reliability tests whether different versions of a measure, designed to be equivalent, give consistent results. E.g., how consistent are LM responses on the same bias dataset across different prompt formulations? Are LMs sensitive to minor changes in how the questions are phrased?
Jan 24, 2024, 9:29 AM
{
"text": "Parallel-form reliability tests if different—but designed to be equivalent—versions of a measure are consistent. E.g., how consistent are different prompt formulations for evaluating the LM responses on the same bias dataset? Are LMs sensitive to minor changes to how the questions are phrased?",
"$type": "app.bsky.feed.post",
"langs": [
"en"
],
"reply": {
"root": {
"cid": "bafyreidc5acxzktjgf3sratzwt6vei52wikhpe5aofvgygyobmo33gpwwy",
"uri": "at://did:plc:jo6p6curyzzhgblcdwwso6qy/app.bsky.feed.post/3kjpqeivm4n2t"
},
"parent": {
"cid": "bafyreigvvxzg74a5v4lyqfciaer2uokgxyc4w6jtnjspdlvdaphknpkbta",
"uri": "at://did:plc:jo6p6curyzzhgblcdwwso6qy/app.bsky.feed.post/3kjpqojgwi62s"
}
},
"createdAt": "2024-01-24T09:29:02.587Z"
}
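
The parallel-form check described in the post can be sketched in a few lines. This is a minimal illustration, not the author's method: it assumes each prompt formulation has already produced one numeric bias score per dataset item, and it uses the pairwise Pearson correlation between formulations as the consistency estimate. The function names, the prompt labels, and every score below are hypothetical placeholders invented for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def parallel_form_reliability(scores_by_form):
    """Correlate every pair of prompt formulations over the same items.

    scores_by_form maps a formulation name to its per-item scores,
    with items listed in the same order for every formulation.
    """
    return {
        (a, b): pearson(scores_by_form[a], scores_by_form[b])
        for a, b in combinations(sorted(scores_by_form), 2)
    }


# Hypothetical per-item bias scores for three phrasings of the same questions.
# These numbers are made up purely to exercise the code.
scores = {
    "prompt_a": [0.10, 0.40, 0.35, 0.80, 0.60],
    "prompt_b": [0.15, 0.38, 0.40, 0.75, 0.65],
    "prompt_c": [0.70, 0.20, 0.50, 0.30, 0.10],
}

for pair, r in parallel_form_reliability(scores).items():
    print(pair, round(r, 3))
```

A pair with a correlation near 1 ranks the items almost identically, which is what equivalent formulations should do. A low or negative value, as for the invented `prompt_c` scores, would suggest the measured bias depends on how the question is phrased. Pearson correlation is only one possible choice here; a rank correlation or an agreement statistic may suit categorical or ordinal outputs better.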