By June 2023, the data team had compiled Dolma, a dataset of 3 trillion tokens, ready to train a language model. Dolma was formed from a diverse mix of web content, academic publications, code, books, and encyclopedic materials, all acquired through a transparent process.
May 6, 2025, 8:55 PM
{ "uri": "at://did:plc:i4kytxgsu3yfsrt2ml3o7tgq/app.bsky.feed.post/3lojrfdvhc32c", "cid": "bafyreidydaqrymrm2x7scevrdattsmysbe27ogbzn45fcgbnlfdx7uhuua", "value": { "text": "By June 2023, the data team had compiled Dolma, a dataset of 3 trillion tokens, ready to train a language model. Dolma was formed from a diverse mix of web content, academic publications, code, books, and encyclopedic materials, all acquired through a transparent process.", "$type": "app.bsky.feed.post", "langs": [ "en" ], "reply": { "root": { "cid": "bafyreifmvesjgu7dbxpy7x6q72n6wmwkgvvj6ae6awysunnjhi6vbdpk3y", "uri": "at://did:plc:i4kytxgsu3yfsrt2ml3o7tgq/app.bsky.feed.post/3lojrfdvahc2c" }, "parent": { "cid": "bafyreicxozwbtkc4av56ew5y2nstzxx4h62wnwadeaxtj5xjmw4kgolrzq", "uri": "at://did:plc:i4kytxgsu3yfsrt2ml3o7tgq/app.bsky.feed.post/3lojrfdvhc22c" } }, "createdAt": "2025-05-06T20:55:36.476Z" } }