Open-Source NLP Research Projects

I'm a Machine Learning Research Engineer focused on NLP, XAI and Adversarial Robustness.

Hackathon of NLP in Spanish

NLP
Somos NLP
With more than 500 participants from 39 countries, it is the largest open-source hackathon for NLP in Spanish. The recorded events already have more than 5k views! Organized by Somos NLP and sponsored by Hugging Face, Platzi and Paperspace.

BigScience Research Workshop

NLP
Hugging Face
Research
A year-long international research workshop on large multilingual models and datasets. I was part of the data tooling working group. The resulting model is described in the paper "BLOOM: A 176B-Parameter Open-Access Multilingual Language Model".

BERTIN

NLP
Hugging Face
Research
BERTIN is a series of RoBERTa-based models in Spanish trained using a novel sampling technique that we call "perplexity sampling". More details can be found in the model card and the paper "BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling".
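As a rough illustration of what sampling by perplexity can look like, the sketch below keeps each document with a probability that peaks at the median perplexity and decays like a Gaussian away from it. The function names, the Gaussian shape, and the `width` and `factor` parameters are assumptions made for this example; the weighting actually used for BERTIN is described in the paper and is not reproduced here.

```python
import math
import random
from statistics import median


def gaussian_keep_probability(perplexity, center, width, factor=1.0):
    """Hypothetical weighting: probability of keeping a document.

    The probability is highest when `perplexity` equals `center` and
    decays like a Gaussian of standard deviation `width` away from it.
    """
    weight = math.exp(-((perplexity - center) / width) ** 2 / 2)
    return min(1.0, factor * weight)


def perplexity_sample(docs, perplexities, width, factor=1.0, seed=0):
    """Subsample `docs`, favouring those whose perplexity is near the median.

    `perplexities` holds one precomputed score per document (for example
    from a small reference language model); computing it is out of scope here.
    """
    rng = random.Random(seed)
    center = median(perplexities)
    kept = []
    for doc, ppl in zip(docs, perplexities):
        if rng.random() < gaussian_keep_probability(ppl, center, width, factor):
            kept.append(doc)
    return kept


if __name__ == "__main__":
    docs = ["doc_a", "doc_b", "doc_c"]
    perplexities = [50.0, 100.0, 500.0]  # made-up scores for illustration
    print(perplexity_sample(docs, perplexities, width=20.0))
```

Documents with very low or very high perplexity are rarely kept, so the sample concentrates on text that the reference model finds neither trivial nor noisy.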

Course: NLP de 0 a 100 con Hugging Face

NLP
Somos NLP
The first NLP course from zero to hero in Spanish. It's open-source and was organized by Somos NLP with the support of Spain AI. I taught the classes on sequential models and the Transformer architecture.

Pre-training GPT-2, T5 & Wav2Vec2 models in Spanish

NLP
Hugging Face
HF Hackathon
A series of Spanish language models trained with Flax/JAX on TPUs sponsored by Google during the Flax/JAX Community Week organized by Hugging Face in June 2021. Here are the model cards: GPT-2 model, T5 model and Wav2Vec2 model.

WaiACCELERATE Program

Entrepreneurship
Women in AI & Robotics
A program where we provide women entrepreneurs with the tools, knowledge, mentoring and network to successfully realize their startup/business idea in the AI sector.

Adding NLP datasets in Spanish

NLP
Hugging Face
HF Hackathon
Addition of three datasets in Spanish to the huggingface/datasets library during the open sprint organized by Hugging Face in Dec 2020. The datasets are HEAD-QA (a multi-choice HEAlthcare Dataset), the dataset of the eHealth-KD Challenge at IberLEF 2020, and the Spanish Billion Words Corpus.

Chatbot COVID-19

Conversational AI
Backend
Frontend
DevOps
Math Thesis
Chatbot that understands and answers questions about COVID-19: symptoms, prevention, regulation, the situation in Spain. Don't hesitate to chat with AURORA!
The chatbot correctly understands 92% of requests on the first attempt and helped 1,500+ people during the first months of the pandemic. Collaboration with Accenture’s Gijón office.

NN for the study of the Higgs Boson with data from the LHC

Machine Learning
Physics Thesis
Implementation of a Neural Network that predicts characteristics of the Higgs Boson produced in the particle collider, with a correlation coefficient of 0.778. Collaboration with the university's high energy particle research team.
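For readers unfamiliar with the metric, here is a minimal sketch of how a correlation coefficient between predicted and true values can be computed. It assumes the Pearson coefficient, which is an assumption of this example, and the function name and the sample numbers are made up for illustration; they are not data from the thesis.

```python
import math


def pearson_correlation(predicted, observed):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(predicted)
    if n != len(observed) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    # Sums of centred products and squares (the 1/n factors cancel out).
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_o)


if __name__ == "__main__":
    predicted = [0.9, 2.1, 2.9, 4.2]  # illustrative model outputs
    observed = [1.0, 2.0, 3.0, 4.0]   # illustrative true values
    print(round(pearson_correlation(predicted, observed), 3))
```

A value of 1 means the predictions track the true values perfectly in a linear sense, 0 means no linear relationship, and -1 means a perfect inverse relationship.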

Quality Analysis of ML Models

Python Package
AI Performance
AI Robustness
PyPI package to perform quality analyses on ML models. It focuses on the three quality pillars: functionality, robustness and explainability.

The Annotated Transformer

NLP
Transformers
Team Project
Detailed and interactive explanation of the Transformer architecture. Based on the Harvard NLP notebook "The Annotated Transformer", where the paper "Attention Is All You Need" is explained and implemented.

World Development Indicators - Operation Fistula

Data Visualization
VizForSocialGood
Tableau
Visualization for the #VizForSocialGood project to support the organization Operation Fistula on their "mission to end fistula for every woman everywhere".