lbourdois/fineweb-2-trimming

A version of FineWeb2 restricted to 124 languages.
For each language, we kept the first 200,000 texts (fewer when a language has fewer than 200,000 texts available).
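
The per-language cap described above can be illustrated with a minimal sketch. It is not the script used to build this dataset: the function names `take_first` and `trim_by_language`, the toy corpus, and the language labels are all hypothetical, and the sketch only assumes that each language's texts can be read in order.

```python
from itertools import islice
from typing import Dict, Iterable, List


def take_first(texts: Iterable[str], limit: int = 200_000) -> List[str]:
    """Return at most `limit` texts, preserving their original order."""
    # islice stops early when the iterable holds fewer than `limit` items,
    # which covers languages with fewer texts than the cap.
    return list(islice(texts, limit))


def trim_by_language(
    corpus: Dict[str, Iterable[str]], limit: int = 200_000
) -> Dict[str, List[str]]:
    """Apply the same cap independently to every language (hypothetical helper)."""
    return {language: take_first(texts, limit) for language, texts in corpus.items()}


# Toy data with a small cap so the behavior is easy to see.
toy_corpus = {
    "fra_Latn": ["text 1", "text 2", "text 3"],  # more texts than the cap
    "bre_Latn": ["text 1"],                      # fewer texts than the cap
}
trimmed = trim_by_language(toy_corpus, limit=2)
print({language: len(texts) for language, texts in trimmed.items()})
# {'fra_Latn': 2, 'bre_Latn': 1}
```

Because `islice` reads lazily, the same helper would also work on a streamed source, so a language's full data never needs to be held in memory to take its first texts.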

This dataset offers a lightweight alternative to the original (44 GB versus 8.67 TB) that is practical to use for trimming models.

For more information on the trimming method, see the blog post "Introduction to Trimming": https://huggingface.co/blog/lbourdois/introduction-to-trimming

Citations

FineWeb-2

@misc{penedo2025fineweb2pipelinescale,
  title={FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language}, 
  author={Guilherme Penedo and Hynek Kydlíček and Vinko Sabolčec and Bettina Messmer and Negar Foroutan and Amir Hossein Kargaran and Colin Raffel and Martin Jaggi and Leandro Von Werra and Thomas Wolf},
  year={2025},
  eprint={2506.20920},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.20920}, 
}

Trimming blog post

@misc{hf_blogpost_trimming,
  title={Introduction to Trimming},
  author={Loïck BOURDOIS and Tom AARSEN and Bram VANROY and Christopher AKIKI and Woojun JUNG and Manuel ROMERO and Prithiv SAKTHI},
  year={2026},
  url={https://huggingface.co/blog/lbourdois/introduction-to-trimming},
}