No handwritten digits or irises here. Some interesting projects:
Training new BERT models -
uploaded the first Transformers models for Hindi, Bengali, Tamil, and Dhivehi to HuggingFace, pretrained with Google's ELECTRA
on the OSCAR CommonCrawl corpus and recent Wikipedia dumps. The finetuned models outperformed Multilingual BERT
on movie review (regression), news topic (classification), and other tasks.
Researchers and developers use these models in their work. The Bengali model was evaluated in the paper Bangla Documents Classification using Transformer
Based Deep Learning Models, presented at the 2nd International Conference on Sustainable Technologies for Industry 4.0 in Dhaka.
The Hindi model was evaluated in the paper Hostility Detection in Hindi leveraging Pre-Trained Language Models,
tackling the shared task at CONSTRAINT 2021.
This model was also recommended by the spaCy docs
until I suggested a newer, more accurate model from the Indian institution AI4Bharat.
The Tamil model was cited in two EACL papers: Multilingual Hope Speech Detection and Tamil Lyrics Corpus: Analysis and Experiments.
Gender bias in Spanish BERT -
adapted Word Embedding Association Tests (WEAT) for Spanish, with parallel masculine and feminine word lists, and
wrote a script to mirror gendered sentences (el actor británico <-> la actriz británica, "the British actor" <-> "the British actress"). The outputs can be used
as counterfactuals or for data augmentation, both to measure fairness and to improve accuracy.
blog post and notebooks
- WEAT-ES - seq2seq neural model
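The mirroring step can be sketched as a plain word-swap over parallel gendered word lists; the tiny swap table below is illustrative, not the project's actual lists.

```python
# Sketch of gendered-sentence mirroring for Spanish.
# SWAPS is a small illustrative sample, not the project's full parallel word lists.
SWAPS = {
    "el": "la", "la": "el",
    "actor": "actriz", "actriz": "actor",
    "británico": "británica", "británica": "británico",
}

def mirror(sentence: str) -> str:
    """Swap each known gendered token for its opposite-gender counterpart."""
    return " ".join(SWAPS.get(tok, tok) for tok in sentence.split())

print(mirror("el actor británico"))  # -> la actriz británica
```

Because every swap is symmetric, mirroring a sentence twice returns the original, which makes the script easy to sanity-check.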
Arabic NLP -
Comparing sentiment analysis libraries,
training predictive models on
multiple Arabic dialect datasets,
and developing a seq2seq counterfactual model.
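Comparing sentiment libraries comes down to running each one over the same labeled sample and scoring it; here is a minimal harness for that, with toy stand-in analyzers and examples rather than the real Arabic libraries and datasets.

```python
# Minimal harness for comparing sentiment analyzers on one labeled sample.
# The two analyzers below are toy stand-ins for real library calls.
labeled = [("رائع", "pos"), ("سيئ", "neg"), ("جيد جدا", "pos")]

def keyword_analyzer(text: str) -> str:
    # toy heuristic standing in for a real sentiment library
    return "pos" if "جيد" in text or "رائع" in text else "neg"

def always_negative(text: str) -> str:
    # trivial baseline
    return "neg"

def accuracy(analyzer, data):
    return sum(analyzer(t) == y for t, y in data) / len(data)

scores = {fn.__name__: accuracy(fn, labeled)
          for fn in (keyword_analyzer, always_negative)}
print(scores)
```

Swapping real library calls in for the stand-in functions keeps the comparison loop unchanged.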
My dialect-controlling GPT-2 model was cited in
the AraGPT2 paper by researchers at the American University of Beirut.
Esperanto LSTM -
trained on a Wikipedia corpus; the generated text was nonsensical, but its grammatical correctness ultimately proved useful for editing Wikipedia.
blog post - code
Privacy in ML
No Phone GPT-2 -
swapped number tokens and moved their embeddings so that memorized US phone numbers, and the personal information associated with them, are nullified.
This is a first step toward removing more complex pieces of information (such as addresses and calendar dates).
shared on the OpenMined blog.
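One plausible reading of the token-swapping idea can be sketched in pure Python: rotate the digit tokens' embedding vectors among themselves so any memorized number decodes to a different one. GPT-2's real embedding matrix is large; the dict here is a toy stand-in.

```python
# Toy sketch of scrambling memorized numbers: give each digit token another
# digit's embedding vector. A stand-in for operating on GPT-2's embedding matrix.
embeddings = {str(d): [float(d), float(d) / 10] for d in range(10)}

def rotate_digit_embeddings(table):
    """Give digit d the embedding previously held by digit (d + 1) % 10."""
    old = {d: table[str(d)] for d in range(10)}
    for d in range(10):
        table[str(d)] = old[(d + 1) % 10]
    return table

rotate_digit_embeddings(embeddings)
print(embeddings["0"])  # now holds digit 1's former vector
```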
AOC Reply Dataset -
scraped 110k Twitter replies to Congresswoman @AOC and explored the dataset with weakly supervised troll detection,
first with Google AutoML. I continue to collaborate with researchers who analyze the dataset.
- blog post
downloaded Reddit comments into PostgreSQL to find short replies which greatly outscored their parent comment,
and to predict the best response to any post.
Used Google AutoML.
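The "short reply that outscores its parent" query is a self-join on the comments table. Here it is against an in-memory SQLite stand-in for PostgreSQL; the schema and thresholds are illustrative.

```python
# Stand-in for the PostgreSQL query: find short replies that greatly outscore
# their parent comment. Schema and threshold values are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE comments
               (id INTEGER PRIMARY KEY, parent_id INTEGER, body TEXT, score INTEGER)""")
con.executemany("INSERT INTO comments VALUES (?, ?, ?, ?)", [
    (1, None, "long thoughtful parent comment", 5),
    (2, 1, "nice", 500),  # short reply with 100x the parent's score
    (3, 1, "a much longer reply that rambles on", 6),
])

rows = con.execute("""
    SELECT child.body
    FROM comments AS child
    JOIN comments AS parent ON child.parent_id = parent.id
    WHERE length(child.body) < 20         -- "short"
      AND child.score > 10 * parent.score -- "greatly outscored"
""").fetchall()
print(rows)  # -> [('nice',)]
```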
Data Engineering with Kedro -
re-labeling Twitter disinfo data releases by language, using data pipeline library Kedro with new visualizations (see below)
blog post - code
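Kedro structures a pipeline as pure functions ("nodes") wired together by named inputs and outputs, so the relabeling step can be sketched as one such function; the language detector here is a naive stand-in for a real detection library.

```python
# A Kedro-style node for the relabeling step, written as a pure function.
# The detector is a naive stand-in for a real language-detection library.
def detect_language(text: str) -> str:
    # toy heuristic: any Cyrillic character -> "ru", otherwise "en"
    return "ru" if any("\u0400" <= ch <= "\u04ff" for ch in text) else "en"

def relabel_by_language(tweets: list[dict]) -> list[dict]:
    """Node: attach a detected-language label to each tweet record."""
    return [{**t, "language": detect_language(t["text"])} for t in tweets]

labeled = relabel_by_language([
    {"id": 1, "text": "hello world"},
    {"id": 2, "text": "привет мир"},
])
print([t["language"] for t in labeled])  # -> ['en', 'ru']
```

Keeping each step a pure function is what lets Kedro draw the pipeline visualizations mentioned above.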
Student Dropout Contract -
processed 11 million rows of student records into a PostGIS database on contract with the Inter-American Development Bank.
Developed a data visualization (map and dashboard).
Used scikit-learn (Bayesian Ridge) and XGBoost to model dropout rates from crime rates, nearby geography (from OpenStreetMap), and school statistics.
Used ELI5 to measure the significance of each column.
Dropout rate was not very predictable, especially in smaller rural schools (where one student leaving can mean a 20% dropout rate),
but we found that a key transition point comes when students must change schools to enter 7th grade.
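The column-significance idea behind ELI5's permutation importance can be shown in a toy form: shuffle one column at a time and see how much the model's error grows. The "model" and data below are illustrative stand-ins, not the project's.

```python
# Permutation importance in miniature: a column matters if shuffling it
# increases the model's error. Model and data are illustrative stand-ins.
import random

random.seed(0)
rows = [(x, random.random(), 2.0 * x) for x in range(50)]  # (signal, noise, target)

def model(signal, noise):
    return 2.0 * signal  # pretend-fitted model that only uses the signal column

def mse(data):
    return sum((model(s, n) - y) ** 2 for s, n, y in data) / len(data)

def permutation_importance(data, column):
    """Error increase after shuffling one column's values across rows."""
    shuffled_col = [row[column] for row in data]
    random.shuffle(shuffled_col)
    permuted = [
        (v, n, y) if column == 0 else (s, v, y)
        for (s, n, y), v in zip(data, shuffled_col)
    ]
    return mse(permuted) - mse(data)

# Shuffling the signal column hurts; shuffling the ignored noise column does not.
print(permutation_importance(rows, 0) > permutation_importance(rows, 1))
```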
Airbnb Price Prediction with AutoKeras -
Trained a model using Airbnb prices and OpenStreetMap local data. Used Uber's Manifold to visualize the significance of each column (see below).
blog post - notebook
I'm a full-stack developer with geospatial experience. Previously I worked in a Software Engineer / Data Scientist role at McKinsey & Company, and
as an open data expert for Code for America, the City of Boston, ESRI, and the Asia Foundation.
GitHub: I've contributed to
OpenStreetMap (osm.org and iD Editor),
a quantum computing library,
and a cryptocurrency wallet.
Writing and Outreach
In 2020, I participated in an AI and International Law workshop at the Asser Instituut in The Hague.
It was a good view of AI/ML from a legal and ethical perspective.
I was credited in the paper for MuRIL (Google's Indian languages model) for helping
set up a HuggingFace/PyTorch version.
In 2021, my feedback led to a small edit and an acknowledgement in the "Stochastic Parrots" AI ethics paper.
I support the full team behind the paper, all of whom should have been allowed credit for their work, and whose contributions far exceed my own.
In recent years I conducted data visualization and PostGIS workshops at PyCon India, PyCon Zimbabwe, and refugee code schools in Turkey and Iraqi Kurdistan.