Open-Source NLP Research Projects

I'm a Machine Learning Research Engineer focused on NLP, especially data and evaluation. And I always say yes to Spanish NLP projects!

Last update: June 2024 | For up-to-date information, check my Hugging Face profile!

SomosNLP

My 💛 project
Did you know that there are 600 million of us Spanish speakers around the world? SomosNLP.org is an international community that aims to represent in AI the linguistic diversity of the languages we all speak.

Open Leaderboard for the languages of Spain and LATAM [WIP]

LLM Evaluation
NLP in Spanish
Open leaderboard to evaluate LLM memorization, reasoning and linguistic capabilities in the languages of Spain, LATAM and the Caribbean. Developed as part of the #Somos600M Project thanks to the donation of high-quality datasets by IIC, LenguajeNaturalAI, UPM, HiTZ, and BSC.
1st of July: Leaderboard v1 live!

Validation of machine-translated evaluation datasets [WIP]

Translation
Bias
NLP in Spanish
Community effort to validate the machine-translated Spanish versions of 3 widely used evaluation datasets (MMLU, ARC-C, and HellaSwag) and the prompt dataset from the Data Is Better Together (DIBT) initiative. Effort co-organized by SomosNLP, Hugging Face & Argilla. Join us!

Dataset collection campaign [WIP]

Data
NLP in Spanish
At SomosNLP we are collecting datasets in the languages spoken in LATAM, the Caribbean and Spain. Collaborate and help us collect diverse data!

Hackathon SomosNLP 2024: #Somos600M

Instruction-tuned LLMs
NLP in Spanish
Third edition of the largest open-source hackathon for NLP in Spanish. This year's edition had 600+ participants and 12 amazing speakers.
Check the recorded talks and keynotes!

Spanish NLP Initiatives [WIP]

NLP in Spanish
Discover the initiatives driving NLP advancements in Spanish and other low-resource languages spoken in LATAM and Spain.

Transparency Self-Assessment

Responsible AI
This tool allows you to self-assess the transparency of your model development based on the Foundation Model Transparency Index (FMTI) published by the Center for Research on Foundation Models.

Hackathon SomosNLP 2023: Los LLMs hablan español

LLMs
NLP in Spanish
Second edition of the largest open-source hackathon for NLP in Spanish. This year's edition had 500+ participants, 17 speakers, and 7 mentors.
Check the awarded projects and the recorded talks and keynotes!

Somos Mujeres NLP

Women in AI
NLP
Organized two initiatives to promote the work and research of women in NLP, as well as projects that apply NLP to fight sexism.

NLP Course by Hugging Face

NLP
Education
Contributing to the translation of the NLP Course by Hugging Face into Spanish.

BigCode Project: LLMs for Code

NLP
Research
Contributing to BigCode. Project in progress.

EleutherAI: Polyglot Romance

NLP
Research
BERTIN Project
Contributing to EleutherAI's research project "Polyglot Romance". Project in progress.

Hackathon SomosNLP 2022: NLP en Español

NLP in Spanish
With more than 500 participants from 39 countries, it is the largest open-source hackathon for NLP in Spanish. The recorded talks and workshops already have more than 5k views! Organized by SomosNLP and sponsored by Hugging Face, Platzi and Paperspace. Check the awarded projects!

BigScience Research Workshop

NLP
Hugging Face
Research
A year-long international research workshop on large multilingual models and datasets. We created, among other cool things, ROOTS: A 1.6TB Composite Multilingual Dataset, which was then used to train BLOOM: A 176B-Parameter Open-Access Multilingual Language Model.

BERTIN Project: Perplexity Sampling

NLP
Hugging Face
Research
BERTIN is a series of RoBERTa-based models in Spanish trained using a novel sampling technique that we call "perplexity sampling". More detailed info can be found in the model card and the paper BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling.

Course: NLP de 0 a 100 con Hugging Face

NLP
Education
The first NLP course from zero to hero in Spanish. It's open-source and was organized by SomosNLP with the support of Spain AI. I taught the classes on sequential models and the Transformer architecture.

Pre-training GPT-2, T5 & Wav2Vec2 models in Spanish

NLP
Hugging Face
HF Hackathon
A series of Spanish language models trained with Flax/JAX on TPUs sponsored by Google during the Flax/JAX Community Week organized by Hugging Face in June 2021. Here are the model cards: GPT-2 model, T5 model and Wav2Vec2 model.

WaiACCELERATE Program

Entrepreneurship
Women in AI & Robotics
A program where we provide women entrepreneurs with the tools, knowledge, mentoring and network to successfully realize their startup/business idea in the AI sector.

Making Spanish NLP datasets available in the HF Hub

NLP
Hugging Face
HF Hackathon
Addition of 3 datasets in Spanish to the huggingface/datasets library during the open-sprint organized by Hugging Face in Dec 2020. The datasets are HEAD-QA (a multi-choice HEAlthcare Dataset), the dataset of the eHealth-KD Challenge at IberLEF 2020, and the Spanish Billion Words Corpus.

Quality Analysis of ML Models

Python Package
AI Performance
AI Robustness
PyPI package to perform quality analyses on ML models. It focuses on the three quality pillars: functionality, robustness and explainability.

Chatbot COVID-19

Conversational AI
Backend
Frontend
DevOps
Math Thesis
Chatbot that understands and answers questions about COVID-19: symptoms, prevention, regulation, and the situation in Spain. Don't hesitate to chat with AURORA!
The chatbot correctly understands 92% of requests on the first attempt and helped 1500+ people during the first months of the pandemic. Collaboration with Accenture’s Gijón office.

Neural Network for the study of the Higgs Boson with data from the LHC (CERN)

Machine Learning
Physics Thesis
Implementation of a neural network that predicts characteristics of the Higgs boson produced in the particle collider, with a correlation coefficient of 0.778. Collaboration with the university's high-energy particle research team.