No handwritten digits or irises here, just brand-new projects
South Asian BERT -
uploaded the first Transformers models for Hindi, Bengali, Tamil, and Dhivehi to HuggingFace; pretrained with Google's ELECTRA
on OSCAR CommonCrawl and the latest Wikipedia. Finetuning showed better results than Multilingual BERT
on downstream tasks (typically movie review, news topic, and QA datasets).
Why monolingual models? -
- Tamil pretraining Colab
- SimpleTransformers vs. finetuned ELECTRA - Transformers benchmark
Gender bias in Spanish BERT -
Adapted Word Embedding Association Tests (WEAT) for Spanish (parallel masculine and feminine word lists), and
wrote a script to mirror gendered sentences (el actor británico <-> la actriz británica, "the British actor" <-> "the British actress"). The outputs can be used
as counterfactuals or for data augmentation, to measure fairness and improve accuracy.
blog post and notebooks
- WEAT-ES - seq2seq neural model
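The sentence-mirroring step can be sketched in a few lines of plain Python; the swap table below is a tiny hand-picked sample for illustration, not the project's full WEAT-ES word lists:

```python
# Minimal sketch of gendered-sentence mirroring for Spanish.
# SWAPS is an illustrative sample, not the project's actual word lists.
SWAPS = {
    "el": "la", "la": "el",
    "actor": "actriz", "actriz": "actor",
    "británico": "británica", "británica": "británico",
}

def mirror(sentence: str) -> str:
    """Swap each known masculine/feminine token for its counterpart."""
    return " ".join(SWAPS.get(tok, tok) for tok in sentence.split())

print(mirror("el actor británico"))  # -> la actriz británica
```

A real implementation also needs to handle agreement beyond single-token swaps (contractions, adjective order), which is why the project used a dedicated script.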
Arabic NLP -
comparing sentiment analysis libraries,
applying FastText embeddings,
evaluating AWS Comprehend,
and training predictive models on
multiple Arabic dialect datasets
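As an illustrative sketch only (not the project's code or datasets), a dialect classifier over character n-grams could look like this; the example texts and dialect labels are made up, and character n-grams are chosen because short dialectal texts tokenize poorly into words:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy greetings with illustrative dialect labels; the project used real datasets.
texts = ["شلونك", "كيف حالك", "واش راك", "ازيك"]
labels = ["gulf", "msa", "maghrebi", "egyptian"]

# Character n-grams capture dialect cues in short texts better than word tokens.
model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)
print(model.predict(["كيف حالك"]))
```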
Esperanto LSTM -
trained on a Wikipedia corpus; the generated text was nonsensical, but the model's sense of grammatical correctness ultimately proved useful for editing Wikipedia
blog post - code
AOC Reply Dataset -
scraped 110k Twitter replies to Congresswoman @AOC and explored the dataset with weakly supervised troll detection,
first with Google AutoML, then later with
. I continue to collaborate with researchers who analyze
- blog post
downloaded Reddit comments into PostgreSQL to find short replies that greatly outscored their parent (for example: "")
and to predict the best response to any post.
Used Google AutoML.
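The parent/child score comparison can be sketched with a self-join; this sketch uses an in-memory SQLite table in place of the project's PostgreSQL, and the length and score thresholds are invented for illustration:

```python
import sqlite3

# Stand-in for the project's PostgreSQL comments table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER, parent_id INTEGER, body TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO comments VALUES (?, ?, ?, ?)",
    [
        (1, None, "a long parent post ...", 10),
        (2, 1, "nice", 500),   # short reply that far outscores its parent
        (3, 1, "a much longer, lower-scoring reply ...", 12),
    ],
)

# Short replies (under 40 chars) scoring at least 5x their parent; both cutoffs are arbitrary.
rows = conn.execute("""
    SELECT child.body, child.score, parent.score
    FROM comments AS child
    JOIN comments AS parent ON child.parent_id = parent.id
    WHERE length(child.body) < 40
      AND child.score >= 5 * parent.score
""").fetchall()
print(rows)  # [('nice', 500, 10)]
```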
Data Engineering with Kedro -
re-labeling Twitter disinfo data releases by language, using the data pipeline library Kedro with new visualizations (see below)
blog post - code
Student Dropout Contract -
processed 11 million rows of student records into a PostGIS database on contract with the Inter-American Development Bank.
Developed a data visualization (map and dashboard).
Used scikit-learn (Bayesian Ridge) and XGBoost to model dropout rates from crime rates, nearby geography (from OpenStreetMap), and school stats.
Used ELI5 to measure the significance of each column.
The dropout rate was not very predictable, especially in smaller rural schools (where one student leaving = a 20% dropout rate),
but we found that a key transition comes when students must change schools to enter 7th grade.
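A minimal sketch of the modeling step, on synthetic stand-in data with scikit-learn's BayesianRidge (the project also used XGBoost, and ELI5 for feature weights; here plain model coefficients stand in for ELI5's output):

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)

# Synthetic stand-ins for real features (crime rate, school-change transition, enrollment).
X = rng.normal(size=(200, 3))
# Invented ground truth: dropout rate driven mostly by the school-change feature, plus noise.
y = 0.05 + 0.10 * X[:, 1] + rng.normal(scale=0.02, size=200)

model = BayesianRidge().fit(X, y)
for name, coef in zip(["crime_rate", "school_change", "enrollment"], model.coef_):
    print(f"{name}: {coef:+.3f}")
```

On this synthetic data the school-change coefficient dominates, mirroring the 7th-grade finding; on the real records the signal was much weaker.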
Airbnb Price Prediction with AutoKeras -
Trained a model using Airbnb prices and OpenStreetMap local data. Used Uber's Manifold to visualize the significance of each column (see below).
blog post - notebook
I'm a full-stack developer with geospatial experience. Previously I worked in a Software Engineer / Data Scientist role at McKinsey & Company, and
as an open data expert for Code for America, the City of Boston, ESRI, and the Asia Foundation.
GitHub: I've contributed to
OpenStreetMap (osm.org and iD Editor),
a quantum computing library,
and a cryptocurrency wallet.
Writing and Outreach
I participated in an AI and International Law workshop at the Asser Instituut in The Hague.
It was a good view of AI/ML from a legal and ethical perspective.
In recent years I conducted data visualization and PostGIS workshops at PyCon India, PyCon Zimbabwe, and refugee code schools in Turkey and Iraqi Kurdistan.