DataComp-LM: In Search of the Next Generation of Training Sets for Language Models Apple Machine Learning Research
This paper was accepted at the Datasets and Benchmarks Workshop at NeurIPS 2024. We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T…