The research projects I work on generally aim to build useful artefacts for the community, such as lightweight but powerful models, high-quality datasets, or open recipes. You can find a more exhaustive list on the Hugging Face Science page.
BigCode: We started working on code LLMs while finishing the NLP with Transformers book, which resulted in CodeParrot, a GPT-2-like model trained on GitHub code. When Copilot was released, we decided to scale the project up in a community effort called BigCode to build fully open LLMs for code. With over 1,000 community members we built The Stack v1 and The Stack v2, both terabyte-scale code datasets for pretraining, and trained the fully open models StarCoder and StarCoder2.
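To give a feel for how these artefacts get used, here is a minimal sketch of code completion with a StarCoder2 checkpoint through the standard transformers generation API; the exact checkpoint name (bigcode/starcoder2-3b) and generation settings are assumptions, not a fixed recommendation:

```python
# Sketch: complete a code snippet with a StarCoder2 checkpoint (assumed name).
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-3b"  # assumed checkpoint id on the Hub
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Give the model the start of a function and let it continue the code.
inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=48)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```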
SmolLM: The SmolLM model family is a set of models with maximal performance at small size that can run locally or on-device. We have released three generations so far (1, 2, and 3), along with the full training pipeline. The models were also adapted to images and videos with SmolVLM and SmolVLM2, and to robotics with SmolVLA. In collaboration with IBM we built SmolDocling specifically for OCR tasks.
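Running these models locally takes only a few lines with transformers; the sketch below assumes the HuggingFaceTB/SmolLM2-1.7B-Instruct checkpoint and its chat template, but any SmolLM chat checkpoint should work the same way:

```python
# Sketch: run a SmolLM2 instruct checkpoint locally (checkpoint name assumed).
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM2-1.7B-Instruct"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Format a chat message with the model's chat template and generate a reply.
messages = [{"role": "user", "content": "Write a haiku about small models."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```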
TRL: The Transformer Reinforcement Learning (TRL) library originally started as a reproduction project in 2020 to get myself into NLP and has since become a popular fine-tuning library for transformer models, with 15k GitHub stars and over 1M monthly pip installs. It serves as the foundation for many projects, such as our Zephyr model and the Open-R1 project replicating the DeepSeek-R1 pipeline.
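A typical supervised fine-tuning run with TRL looks roughly like the sketch below; the model and dataset names are placeholders and the exact trainer arguments can differ between TRL versions:

```python
# Sketch: supervised fine-tuning with TRL's SFTTrainer
# (model and dataset ids are placeholders, not a fixed recipe).
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Any dataset with a conversational or plain-text format works here.
dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    model="HuggingFaceTB/SmolLM2-135M",      # assumed small base model
    train_dataset=dataset,
    args=SFTConfig(output_dir="smollm2-sft"),
)
trainer.train()
```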
FineDatasets: Large, high-quality datasets are the foundation of the success of LLMs; however, they are rarely released these days. Similar to The Stack datasets, we worked on FineWeb, FineWeb-Edu, FineWeb2, FineVideo, and others to enable more people to train great models.
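Because these datasets are terabyte-scale, streaming is usually the easiest way to take a look at them; the sketch below assumes the HuggingFaceFW/fineweb-edu dataset id and its sample-10BT subset:

```python
# Sketch: stream a few FineWeb-Edu documents without downloading the full dataset
# (dataset id and subset name are assumptions).
from datasets import load_dataset

fw = load_dataset(
    "HuggingFaceFW/fineweb-edu",
    name="sample-10BT",
    split="train",
    streaming=True,
)

# Print the start of the first few documents.
for doc in fw.take(3):
    print(doc["text"][:200])
```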
Agents: There is a lot of interest (and hype) around agents, but there are not proportionally many quality resources out there yet. We experiment with what useful agents could do and built, for example, Jupyter Agents and Computer Use Agents as first projects.