Conference Paper (published)
Details
Citation
Zain NU, Naseem MR & Adeel A (2025) Single layer tiny Co4 outpaces GPT-2 and GPT-BERT. In: Charpentier L, Choshen L, Cotterell R, Gul MO, Hu MY, Liu J, Jumelet J, Linzen T, Mueller A, Ross C, Shah RS, Warstadt A, Wilcox EG & Williams A (eds.) Proceedings of the First BabyLM Workshop. Empirical Methods in Natural Language Processing, Hybrid, 04.11.2025. Association for Computational Linguistics, pp. 313-322. https://doi.org/10.18653/v1/2025.babylm-main.24
Abstract
We show that a tiny Co4 machine (CITATION) with a single layer, two heads, and 8M parameters, operating at O(N) computational cost (where N is the number of input tokens), in just 2 epochs outpaces GPT-2 (124M, 12 layers, O(N²)) and GPT-BERT (30M, 12 layers, O(N²)), both trained for 10 epochs. Co4 achieves orders-of-magnitude greater training efficiency on 10M tokens, demonstrating sample-efficient pretraining. On the BabyLM challenge evaluation pipeline, Co4 performs comparably or better across complex benchmarks, showing strong zero-shot and fine-tuning performance on SuperGLUE tasks. Specifically, Co4 outperforms GPT-2 in 5 out of 7 zero-shot metrics and 6 out of 7 fine-tuning tasks, and GPT-BERT in 4 out of 7 metrics in both cases. These results strongly suggest a need to rethink prevailing deep learning paradigms and associated scaling laws.
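The abstract contrasts Co4's O(N) cost with the O(N²) self-attention of GPT-2 and GPT-BERT. The sketch below is not from the paper; it only illustrates, with made-up unit costs, how those two growth rates diverge as sequence length grows.

```python
# Illustrative only: compares how per-sequence cost grows for a model whose
# work is linear in sequence length (as claimed for Co4) versus full
# self-attention, which compares every token pair (O(N^2), as in GPT-2 and
# GPT-BERT). The unit costs are placeholder assumptions, not paper figures.

def linear_cost(n_tokens: int, unit: float = 1.0) -> float:
    """Cost model for a layer whose work grows linearly with sequence length."""
    return unit * n_tokens

def quadratic_cost(n_tokens: int, unit: float = 1.0) -> float:
    """Cost model for full pairwise self-attention over the sequence."""
    return unit * n_tokens ** 2

for n in (128, 512, 2048):
    ratio = quadratic_cost(n) / linear_cost(n)
    print(f"sequence length {n:>5}: quadratic/linear cost ratio = {ratio:,.0f}x")
```

With these toy numbers the gap widens linearly with N, which is the scaling argument behind the abstract's efficiency claim; the paper's actual measured training costs are reported in the publication itself.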
| Field | Value |
|---|---|
| Status | Published |
| Funders | Advanced Research and Invention Agency |
| Publication date | 31/12/2025 |
| Publication date online | 30/11/2025 |
| Publisher | Association for Computational Linguistics |
| ISBN | TODO |
| Conference | Empirical Methods in Natural Language Processing |
| Conference location | Hybrid |
| Dates | 04/11/2025 |
People (3)
Assoc. Prof. in Artificial Intelligence, Computing Science and Mathematics - Division
Research Assistant, Computing Science
PhD Researcher, Computing Science and Mathematics - Division