No handwritten digits or irises here. Some interesting projects:
Training new BERT models -
uploaded the first Transformers models for Hindi, Bengali, Tamil, and Dhivehi to HuggingFace, pretrained with Google's ELECTRA
on the OSCAR CommonCrawl corpus and recent Wikipedia dumps. The finetuned models outperformed Multilingual BERT
on movie review (regression), news topic (classification), and other tasks.
Researchers and developers use these models in their work. The Bengali model was evaluated in the paper Bangla Documents Classification using Transformer
Based Deep Learning Models, presented at the 2nd International Conference on Sustainable Technologies for Industry 4.0 in Dhaka.
The Hindi model was evaluated in the paper Hostility Detection in Hindi leveraging Pre-Trained Language Models,
tackling the shared task at CONSTRAINT 2021.
This model was also recommended by the spaCy docs
until I suggested a newer, more accurate model from the Indian institution AI4Bharat.
The Tamil model was cited in two EACL papers: Multilingual Hope Speech Detection and Tamil Lyrics Corpus: Analysis and Experiments.
Gender bias in Spanish BERT -
adapted Word Embedding Association Tests (WEAT) for Spanish, with parallel masculine and feminine word lists, and
wrote a script to mirror gendered sentences (el actor británico <-> la actriz británica, "the British actor" <-> "the British actress"). The outputs can be used
as counterfactuals or for data augmentation, both to measure fairness and to improve accuracy.
blog post and notebooks
- WEAT-ES - seq2seq neural model
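The mirroring step can be sketched as a plain word-swap over parallel gendered word lists; the tiny swap table below is illustrative, not the project's actual lists.

```python
# Sketch of gendered-sentence mirroring for Spanish.
# SWAPS is a small illustrative sample, not the project's full parallel word lists.
SWAPS = {
    "el": "la", "la": "el",
    "actor": "actriz", "actriz": "actor",
    "británico": "británica", "británica": "británico",
}

def mirror(sentence: str) -> str:
    """Swap each known gendered token for its opposite-gender counterpart."""
    return " ".join(SWAPS.get(tok, tok) for tok in sentence.split())

print(mirror("el actor británico"))  # -> la actriz británica
```

Because every swap is symmetric, mirroring a sentence twice returns the original, which makes the script easy to sanity-check.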
Arabic NLP -
Comparing sentiment analysis libraries,
training predictive models on
multiple Arabic dialect datasets,
and developing a seq2seq counterfactual model.
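Comparing sentiment libraries comes down to running each one over the same labeled sample and scoring it; here is a minimal harness for that, with toy stand-in analyzers and examples rather than the real Arabic libraries and datasets.

```python
# Minimal harness for comparing sentiment analyzers on one labeled sample.
# The two analyzers below are toy stand-ins for real library calls.
labeled = [("رائع", "pos"), ("سيئ", "neg"), ("جيد جدا", "pos")]

def keyword_analyzer(text: str) -> str:
    # toy heuristic standing in for a real sentiment library
    return "pos" if "جيد" in text or "رائع" in text else "neg"

def always_negative(text: str) -> str:
    # trivial baseline
    return "neg"

def accuracy(analyzer, data):
    return sum(analyzer(t) == y for t, y in data) / len(data)

scores = {fn.__name__: accuracy(fn, labeled)
          for fn in (keyword_analyzer, always_negative)}
print(scores)
```

Swapping real library calls in for the stand-in functions keeps the comparison loop unchanged.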
My dialect-controlling GPT-2 model was cited in
the AraGPT2 paper by researchers at the American University of Beirut.
Esperanto LSTM -
trained on a Wikipedia corpus; the generated text was nonsensical, but its grammatical correctness ultimately proved useful for editing Wikipedia.
blog post - code
Privacy in ML
No Phone GPT-2 -
swapped number tokens and moved their embeddings so that memorized US phone numbers, and the personal information associated with them, are nullified.
This is a first step toward removing more complex pieces of information (such as addresses and calendar dates).
shared on the OpenMined blog.
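One plausible reading of the token-swapping idea can be sketched in pure Python: rotate the digit tokens' embedding vectors among themselves so any memorized number decodes to a different one. GPT-2's real embedding matrix is large; the dict here is a toy stand-in.

```python
# Toy sketch of scrambling memorized numbers: give each digit token another
# digit's embedding vector. A stand-in for operating on GPT-2's embedding matrix.
embeddings = {str(d): [float(d), float(d) / 10] for d in range(10)}

def rotate_digit_embeddings(table):
    """Give digit d the embedding previously held by digit (d + 1) % 10."""
    old = {d: table[str(d)] for d in range(10)}
    for d in range(10):
        table[str(d)] = old[(d + 1) % 10]
    return table

rotate_digit_embeddings(embeddings)
print(embeddings["0"])  # now holds digit 1's former vector
```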
AOC Reply Dataset -
scraped 110k Twitter replies to Congresswoman @AOC and explored the dataset with weakly supervised troll detection,
first with Google AutoML. I continue to collaborate with researchers who analyze the dataset.
- blog post
downloaded Reddit comments into PostgreSQL to find short replies which greatly outscored their parent comment,
and to predict the best response to any post.
Used Google AutoML.
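The "short reply that outscores its parent" query is a self-join on the comments table. Here it is against an in-memory SQLite stand-in for PostgreSQL; the schema and thresholds are illustrative.

```python
# Stand-in for the PostgreSQL query: find short replies that greatly outscore
# their parent comment. Schema and threshold values are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE comments
               (id INTEGER PRIMARY KEY, parent_id INTEGER, body TEXT, score INTEGER)""")
con.executemany("INSERT INTO comments VALUES (?, ?, ?, ?)", [
    (1, None, "long thoughtful parent comment", 5),
    (2, 1, "nice", 500),  # short reply with 100x the parent's score
    (3, 1, "a much longer reply that rambles on", 6),
])

rows = con.execute("""
    SELECT child.body
    FROM comments AS child
    JOIN comments AS parent ON child.parent_id = parent.id
    WHERE length(child.body) < 20         -- "short"
      AND child.score > 10 * parent.score -- "greatly outscored"
""").fetchall()
print(rows)  # -> [('nice',)]
```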
Data Engineering with Kedro -
re-labeling Twitter disinfo data releases by language, using data pipeline library Kedro with new visualizations (see below)
blog post - code
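Kedro structures a pipeline as pure functions ("nodes") wired together by named inputs and outputs, so the relabeling step can be sketched as one such function; the language detector here is a naive stand-in for a real detection library.

```python
# A Kedro-style node for the relabeling step, written as a pure function.
# The detector is a naive stand-in for a real language-detection library.
def detect_language(text: str) -> str:
    # toy heuristic: any Cyrillic character -> "ru", otherwise "en"
    return "ru" if any("\u0400" <= ch <= "\u04ff" for ch in text) else "en"

def relabel_by_language(tweets: list[dict]) -> list[dict]:
    """Node: attach a detected-language label to each tweet record."""
    return [{**t, "language": detect_language(t["text"])} for t in tweets]

labeled = relabel_by_language([
    {"id": 1, "text": "hello world"},
    {"id": 2, "text": "привет мир"},
])
print([t["language"] for t in labeled])  # -> ['en', 'ru']
```

Keeping each step a pure function is what lets Kedro draw the pipeline visualizations mentioned above.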
Student Dropout Contract -
processed 11 million rows of student records into a PostGIS database on contract with the Inter-American Development Bank.
Developed a data visualization (map and dashboard).
Used scikit-learn (Bayesian Ridge) and XGBoost to model dropout rates from crime rates, nearby geography (from OpenStreetMap), and school statistics.
Used ELI5 to measure the significance of each column.
Dropout rate was not very predictable, especially in smaller rural schools (where one student leaving can mean a 20% dropout rate),
but we found that a key transition point comes when students must change schools to enter 7th grade.
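The column-significance idea behind ELI5's permutation importance can be shown in a toy form: shuffle one column at a time and see how much the model's error grows. The "model" and data below are illustrative stand-ins, not the project's.

```python
# Permutation importance in miniature: a column matters if shuffling it
# increases the model's error. Model and data are illustrative stand-ins.
import random

random.seed(0)
rows = [(x, random.random(), 2.0 * x) for x in range(50)]  # (signal, noise, target)

def model(signal, noise):
    return 2.0 * signal  # pretend-fitted model that only uses the signal column

def mse(data):
    return sum((model(s, n) - y) ** 2 for s, n, y in data) / len(data)

def permutation_importance(data, column):
    """Error increase after shuffling one column's values across rows."""
    shuffled_col = [row[column] for row in data]
    random.shuffle(shuffled_col)
    permuted = [
        (v, n, y) if column == 0 else (s, v, y)
        for (s, n, y), v in zip(data, shuffled_col)
    ]
    return mse(permuted) - mse(data)

# Shuffling the signal column hurts; shuffling the ignored noise column does not.
print(permutation_importance(rows, 0) > permutation_importance(rows, 1))
```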
Airbnb Price Prediction with AutoKeras -
Trained a model using Airbnb prices and OpenStreetMap local data. Used Uber's Manifold to visualize the significance of each column (see below).
blog post - notebook
I'm a full-stack developer with geospatial experience. Previously I worked in a Software Engineer / Data Scientist role at McKinsey & Company, and
as an open data expert for Code for America, the City of Boston, ESRI, and the Asia Foundation.
GitHub: I've contributed to
OpenStreetMap (osm.org and iD Editor),
a quantum computing library,
and a cryptocurrency wallet.
Writing and Outreach
In 2020, I participated in an AI and International Law workshop at the Asser Instituut in The Hague.
It was a good view of AI/ML from a legal and ethical perspective.
I was credited in the paper for MuRIL (Google's Indian languages model) for helping
set up a HuggingFace/PyTorch version.
In 2021, my feedback led to a small edit and an acknowledgement in the "Stochastic Parrots" AI ethics paper.
I support the full team behind the paper, all of whom should have been allowed credit for their work, and whose contributions far exceed my own.
In recent years I conducted data visualization and PostGIS workshops at PyCon India, PyCon Zimbabwe, and refugee code schools in Turkey and Iraqi Kurdistan.