Then BPE Gets Picky aclanthology.org/2024.emnlp-m... describes one possible solution. They add a deletion step to the BPE merge process to get rid of scaffold or intermediate tokens. That solves the problem (if you train your tokenizer on the same data you use for LLM pretraining).
Dec 20, 2024, 6:45 PM
{
"text": "Then BPE Gets Picky aclanthology.org/2024.emnlp-m... describes one possible solution. They add a deletion step to the BPE merge process to get rid of scaffold or intermediate tokens. That solves the problem (if you train your tokenizer on the same data you use for LLM pretraining).",
"$type": "app.bsky.feed.post",
"embed": {
"$type": "app.bsky.embed.external",
"external": {
"uri": "https://aclanthology.org/2024.emnlp-main.925/",
"thumb": {
"$type": "blob",
"ref": {
"$link": "bafkreihz52v5hkpks5qhf6ud6tlsxvocxv2ytiugvsqxhl5usactxwpw3a"
},
"mimeType": "image/jpeg",
"size": 907968
},
"title": "BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training",
"description": "Pavel Chizhov, Catherine Arnett, Elizaveta Korotkova, Ivan P. Yamshchikov. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024."
}
},
"langs": [
"en"
],
"reply": {
"root": {
"cid": "bafyreictfbybzbhtvpo4cfus4h45zrh64b3qqgcnfr5phmn3wimmbrhgdm",
"uri": "at://did:plc:54lqvssae6v2kio2cu26yktz/app.bsky.feed.post/3ldr2b5ec722r"
},
"parent": {
"cid": "bafyreictfbybzbhtvpo4cfus4h45zrh64b3qqgcnfr5phmn3wimmbrhgdm",
"uri": "at://did:plc:54lqvssae6v2kio2cu26yktz/app.bsky.feed.post/3ldr2b5ec722r"
}
},
"facets": [
{
"index": {
"byteEnd": 52,
"byteStart": 20
},
"features": [
{
"uri": "https://aclanthology.org/2024.emnlp-main.925/",
"$type": "app.bsky.richtext.facet#link"
}
]
}
],
"createdAt": "2024-12-20T18:45:49.350Z"
}
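
The deletion step mentioned in the post can be illustrated with a toy BPE trainer. This is a simplified sketch, not the paper's implementation: the function name `train_picky_bpe_sketch`, the `threshold` parameter, the ratio test (how much of a token's frequency a merge absorbs), the rule that single characters are never deleted, and the re-splitting of leftover occurrences are all assumptions made here for illustration. The paper defines its own criterion and bookkeeping.

```python
from collections import Counter


def train_picky_bpe_sketch(word_freqs, num_merges, threshold=0.9):
    """Toy BPE trainer with a deletion step (hypothetical, simplified)."""
    # Each word starts as a tuple of single characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    alphabet = {ch for word in corpus for ch in word}
    vocab = set(alphabet)
    parts = {}    # merged token -> the (left, right) pair it was built from
    removed = []  # tokens deleted along the way, in order

    def split(token):
        # Break a deleted token into pieces that are still in the vocabulary.
        if token in vocab:
            return [token]
        left, right = parts[token]
        return split(left) + split(right)

    for _ in range(num_merges):
        # Count token and adjacent-pair frequencies over the current corpus.
        token_counts, pair_counts = Counter(), Counter()
        for word, freq in corpus.items():
            for tok in word:
                token_counts[tok] += freq
            for pair in zip(word, word[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break

        # Standard BPE step: merge the most frequent adjacent pair.
        (left, right), pair_freq = pair_counts.most_common(1)[0]
        merged = left + right
        parts[merged] = (left, right)
        vocab.add(merged)

        # Deletion step (assumed criterion): drop a multi-character component
        # when the merge absorbs nearly all of its occurrences.
        for tok in dict.fromkeys((left, right)):
            if tok not in alphabet and pair_freq / token_counts[tok] >= threshold:
                vocab.discard(tok)
                removed.append(tok)

        # Apply the merge; re-split leftover occurrences of deleted tokens.
        new_corpus = Counter()
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and word[i] == left and word[i + 1] == right:
                    out.append(merged)
                    i += 2
                else:
                    out.extend(split(word[i]))
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = dict(new_corpus)

    return vocab, removed


vocab, removed = train_picky_bpe_sketch({"the": 10, "then": 5, "to": 1}, num_merges=2)
print(sorted(vocab))  # ['e', 'h', 'n', 'o', 't', 'the']
print(removed)        # ['th']
```

In this toy corpus "th" only ever occurs inside "the", so once "the" is merged the intermediate token "th" is deleted and its vocabulary slot is freed. As the post notes, the frequencies that drive this decision come from the tokenizer's training data, so the result only reflects the LLM's pretraining distribution if the two data sets match.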