I wanted to post a few of my favorite #EMNLP2024 papers, starting with a couple in tokenization. Fishing for Magikarp explores the problem of undertrained "glitch" tokens and how they can be identified from their embedding vectors. aclanthology.org/2024.emnlp-m...
Dec 20, 2024, 6:45 PM
{
  "text": "I wanted to post of a few of my favorite #EMNLP2024 papers, starting with a couple in tokenization. Fishing For Magicarp explores the problem of undertrained \"glitch\" tokens, and how they can be identified from their embedding vectors. aclanthology.org/2024.emnlp-m...",
  "$type": "app.bsky.feed.post",
  "embed": {
    "$type": "app.bsky.embed.external",
    "external": {
      "uri": "https://aclanthology.org/2024.emnlp-main.649/",
      "thumb": {
        "$type": "blob",
        "ref": {
          "$link": "bafkreieokywoejc62j3nlmeyoymje6s5lzug4qq7xnlc47nplt5nx4mdfq"
        },
        "mimeType": "image/jpeg",
        "size": 843740
      },
      "title": "Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models",
      "description": "Sander Land, Max Bartolo. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024."
    }
  },
  "langs": [
    "en"
  ],
  "facets": [
    {
      "index": {
        "byteEnd": 51,
        "byteStart": 41
      },
      "features": [
        {
          "tag": "EMNLP2024",
          "$type": "app.bsky.richtext.facet#tag"
        }
      ]
    },
    {
      "index": {
        "byteEnd": 269,
        "byteStart": 237
      },
      "features": [
        {
          "uri": "https://aclanthology.org/2024.emnlp-main.649/",
          "$type": "app.bsky.richtext.facet#link"
        }
      ]
    }
  ],
  "createdAt": "2024-12-20T18:45:49.349Z"
}
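
The "facets" entries in the record above mark spans of "text" by UTF-8 byte offset rather than by character index, with byteStart inclusive and byteEnd exclusive. A minimal Python sketch of resolving a facet back to the substring it annotates follows. The helper name facet_spans is hypothetical and is not part of any Bluesky library, and the prefix string is only the opening of the post text, which is enough to cover the tag facet.

```python
def facet_spans(text, facets):
    """Return (substring, feature types) per facet, slicing by UTF-8 bytes.

    Hypothetical helper for illustration; not an official API.
    """
    raw = text.encode("utf-8")
    spans = []
    for facet in facets:
        start = facet["index"]["byteStart"]  # inclusive byte offset
        end = facet["index"]["byteEnd"]      # exclusive byte offset
        kinds = [feature["$type"] for feature in facet["features"]]
        spans.append((raw[start:end].decode("utf-8"), kinds))
    return spans


# Opening of the post text from the record; the tag facet lies inside it.
prefix = "I wanted to post of a few of my favorite #EMNLP2024 papers"
tag_facet = {
    "index": {"byteStart": 41, "byteEnd": 51},
    "features": [{"$type": "app.bsky.richtext.facet#tag", "tag": "EMNLP2024"}],
}

print(facet_spans(prefix, [tag_facet]))
```

Slicing the encoded bytes instead of the string matters as soon as the text contains non-ASCII characters: in the made-up string "café #nlp", the tag starts at character 5 but at byte 6, because "é" takes two bytes in UTF-8.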