Nick Doiron / ML Portfolio

No handwritten digits or irises here. Some interesting projects:


  • Training new BERT models - uploaded the first Transformers models for Hindi, Bengali, Tamil, and Dhivehi to HuggingFace; pretrained with Google's ELECTRA on the OSCAR CommonCrawl corpus and recent Wikipedia dumps. After finetuning, these models outperformed Multilingual BERT on movie review (regression), news topic (classification), and other tasks.
    Researchers and developers use these models in their work. The Bengali model was evaluated in the paper Bangla Documents Classification using Transformer Based Deep Learning Models, presented at the 2nd International Conference on Sustainable Technologies for Industry 4.0, in Dhaka.
    The Hindi model was evaluated in the paper Hostility Detection in Hindi leveraging Pre-Trained Language Models, tackling the shared task at CONSTRAINT 2021.
    This model was also recommended in the spaCy docs until I suggested a newer, more accurate model from the Indian research institution AI4Bharat.
    The Tamil model was cited in two EACL papers: Multilingual Hope Speech Detection and Tamil Lyrics Corpus: Analysis and Experiments.
  • Gender bias in Spanish BERT - adapted Word Embedding Association Tests (WEAT) for Spanish with parallel masculine and feminine word lists, and wrote a script to mirror gendered sentences (el actor británico <-> la actriz británica, "the British actor <-> the British actress"). The outputs can be used as counterfactuals or for data augmentation, both measuring fairness and improving accuracy.
    blog post and notebooks - WEAT-ES - seq2seq neural model
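    The statistic at the core of these tests can be sketched in a few lines. Below is a minimal, illustrative version of the WEAT effect size (Cohen's d over cosine-similarity associations), assuming word vectors are already available as plain Python lists; the function names are my own, not from the WEAT-ES code:

```python
from math import sqrt
from statistics import mean, stdev

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def weat_effect_size(X, Y, A, B):
    """WEAT effect size (Cohen's d): how much more strongly target words X
    associate with attribute set A vs. B, compared to target words Y.
    In WEAT-ES, A/B could be masculine/feminine Spanish word vectors."""
    def assoc(w):
        return mean(cosine(w, a) for a in A) - mean(cosine(w, b) for b in B)
    x_assoc = [assoc(x) for x in X]
    y_assoc = [assoc(y) for y in Y]
    return (mean(x_assoc) - mean(y_assoc)) / stdev(x_assoc + y_assoc)
```

    A strongly positive score means the X targets lean toward attribute set A and the Y targets toward B; near zero suggests no measured association.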
  • Arabic NLP - Comparing sentiment analysis libraries, training predictive models on multiple Arabic dialect datasets (notebook), and developing a seq2seq counterfactual model
    My dialect-controlling GPT-2 model was cited in the AraGPT2 paper by researchers at the American University of Beirut.
  • Esperanto LSTM - trained on a Wikipedia corpus; the generated text was nonsensical, but the model's grasp of Esperanto grammar ultimately proved useful for editing Wikipedia
    blog post - code
  • Privacy in ML

  • No Phone GPT-2 - swapped digit tokens and moved their embeddings so that memorized US phone numbers, and the personal information associated with them, can no longer be recovered. This is a first step toward removing more complex pieces of information (such as addresses and calendar dates). This work was shared on the OpenMined blog.
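    The swap idea can be shown on a toy embedding table. This is only a sketch of the concept on made-up data, not the project's actual code, which operated on GPT-2's real (tied) token embeddings:

```python
# Toy token-embedding table: rows 0-9 stand in for digit tokens in a
# GPT-2-style vocabulary (illustrative; real GPT-2 has ~50k tokens,
# including multi-digit tokens). Each row is a small embedding vector.
embeddings = {tok: [float(tok), float(tok) * 0.5] for tok in range(10)}

def scramble_digits(embeddings, shift=3):
    """Rotate the digit-token embeddings, so any memorized digit
    sequence (like a phone number) decodes to scrambled digits.
    A sketch of the swap idea only; `shift` is an arbitrary choice."""
    original = dict(embeddings)
    return {tok: original[(tok + shift) % 10] for tok in embeddings}

scrambled = scramble_digits(embeddings)
```

    After the swap, the model still emits digit-shaped output, but a memorized number no longer surfaces as itself.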
  • Social Media

  • AOC Reply Dataset - scraped 110k Twitter replies to Congresswoman @AOC and explored the dataset with weakly supervised troll detection, first with Google AutoML and later with scikit-learn and GPT-2. I continue to collaborate with researchers who analyze this dataset.
    dataset - blog post
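    Weak supervision here means heuristic labeling functions vote on replies to produce noisy labels that bootstrap a classifier. A minimal sketch of one such labeling function; the marker words and example replies below are invented for illustration, not from the dataset:

```python
# Hypothetical keyword heuristic: a crude labeling function whose noisy
# votes would be combined with others before training a real classifier.
TROLL_MARKERS = ("fake", "traitor", "clown")

def weak_label(reply):
    """1 = likely troll, 0 = likely not, by keyword match."""
    text = reply.lower()
    return int(any(marker in text for marker in TROLL_MARKERS))

labels = [weak_label(r) for r in [
    "Thank you for fighting for us!",
    "Another FAKE promise from a clown.",
]]
```

    In practice several such functions disagree, and their combined votes (or a model trained on them) give a usable, if noisy, training signal.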
  • DeepClapback - loaded Reddit comments into PostgreSQL to find short replies that greatly outscored their parent comment (for example: "[citation needed]"), with the goal of predicting the best response to any post. Used Google AutoML.
    blog post
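    The filter at the heart of this is simple: keep replies that are short and score far above their parent. The real version was a SQL query over PostgreSQL; here is the same logic sketched in Python over hypothetical comment rows (thresholds and data are illustrative):

```python
# Hypothetical comment rows standing in for the PostgreSQL table:
# each has an id, a parent_id, a body, and a score.
comments = [
    {"id": 1, "parent_id": None, "body": "Everyone knows this is true.", "score": 40},
    {"id": 2, "parent_id": 1, "body": "[citation needed]", "score": 900},
    {"id": 3, "parent_id": 1, "body": "Here is a long, detailed rebuttal nobody upvoted.", "score": 12},
]

def find_clapbacks(comments, max_len=25, ratio=10):
    """Short replies whose score is at least `ratio` times their parent's."""
    by_id = {c["id"]: c for c in comments}
    return [
        c["body"]
        for c in comments
        if c["parent_id"] is not None
        and len(c["body"]) <= max_len
        and c["score"] >= ratio * by_id[c["parent_id"]]["score"]
    ]
```

    In SQL this becomes a self-join of the comments table on parent_id with the same two conditions in the WHERE clause.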
  • Data Engineering with Kedro - re-labeled Twitter disinformation data releases by language, using the data pipeline library Kedro with new visualizations (see below)
    blog post - code
  • Tabular Data

  • Student Dropout Contract - processed 11 million rows of student records into a PostGIS database on contract with the Inter-American Development Bank.
    Developed a data visualization (map and dashboard). Used scikit-learn (Bayesian Ridge) and XGBoost to model dropout rates from crime rates, nearby geography (from OpenStreetMap), and school statistics. Used ELI5 to measure the significance of each column.
    The dropout rate was not very predictable, especially in smaller rural schools (where one student leaving = a 20% dropout rate), but we found a key inflection point: when students must change schools to enter 7th grade.
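    The small-school noise comes straight from the arithmetic: a single departure moves the rate by 1/enrollment. A tiny illustration with made-up enrollment figures (not from the actual dataset):

```python
def dropout_rate(students_left, enrollment):
    """Share of the cohort that dropped out."""
    return students_left / enrollment

# One student leaving a five-student rural cohort swings the rate by
# 20 percentage points; the same event in a 500-student school moves
# it by only 0.2 percentage points.
small_school = dropout_rate(1, 5)    # 0.2
large_school = dropout_rate(1, 500)  # 0.002
```

    This is why per-school rates in tiny schools are dominated by chance, and any model fit to them sees mostly noise.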
  • Airbnb Price Prediction with AutoKeras - Trained a model using Airbnb prices and OpenStreetMap local data. Used Uber's Manifold to visualize the significance of each column (see below).
    blog post - notebook
  • Non-ML Coding

    I'm a full-stack developer with geospatial experience. Previously I worked in a Software Engineer / Data Scientist role at McKinsey & Company, and as an open data expert for Code for America, the City of Boston, ESRI, and the Asia Foundation.
    GitHub: I've contributed to Fortran.IO, AutoKeras examples, OpenStreetMap (and its iD editor), NextStrain, a quantum computing library, and a cryptocurrency wallet.

    Writing and Outreach

    In 2020, I participated in an AI and International Law workshop at the Asser Instituut in The Hague. It was a good view of AI/ML from a legal and ethical perspective: blog post.

    Other Work

    I was credited in the paper for MuRIL (Google's Indian languages model) for helping set up a HuggingFace/PyTorch version.

    In 2021 my feedback led to a small edit and acknowledgement in the "Stochastic Parrots" AI Ethics paper. I support the full team behind this paper, who should all have been allowed credit on their work, and recognize their contributions are much more than my own.

    In recent years I have conducted data visualization and PostGIS workshops at PyCon India, PyCon Zimbabwe, and refugee code schools in Turkey and Iraqi Kurdistan.