Pure ML Proto-Language Reconstruction

Investigated whether purely ML approaches could reconstruct proto-languages from their modern descendants. Built several PyTorch architectures: GNNs that treat language families as graphs, VAEs for encoding phonological features, and attention mechanisms for modelling sound correspondences. Tested on Romance cognates with Latin as ground truth. Accuracy of only 7-20% indicated that the problem is underdetermined without linguistic priors to constrain the solution space.
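The accuracy metric is not specified above. As a minimal sketch, one common way to score reconstructions against attested Latin forms is exact-match accuracy plus a length-normalised edit distance. The function names and the toy predictions below are illustrative assumptions, not the project's actual evaluation code or results.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def score(predictions, references):
    """Return (exact-match accuracy, mean normalised edit distance)."""
    pairs = list(zip(predictions, references))
    exact = sum(p == r for p, r in pairs) / len(pairs)
    ned = sum(edit_distance(p, r) / max(len(p), len(r), 1)
              for p, r in pairs) / len(pairs)
    return exact, ned


# Toy data: attested Latin forms and made-up model outputs (assumptions).
references = ["noctem", "octo", "factum"]
predictions = ["noctem", "otto", "factu"]

exact, ned = score(predictions, references)
print(f"exact match: {exact:.2f}, normalised edit distance: {ned:.2f}")
```

Reporting the edit distance alongside exact match separates near misses from outright failures, which exact-match accuracy alone hides.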

Culinary Atlas of Indonesia

An exploration of Indonesian cuisine through data science. Scraped 50,000+ Indonesian recipes with Beautiful Soup and applied unsupervised ML (UMAP dimensionality reduction, GMM clustering with BIC model selection, and iterative chi-square feature selection) to identify regional culinary families. Deployed a FastAPI backend and Leaflet.js frontend that let users explore geographic culinary patterns and discover recipe recommendations based on cosine similarity between dishes.
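The recommendation step can be sketched as follows, assuming each dish is represented as a sparse ingredient vector. The feature representation, the function names, and the three toy dishes are assumptions for illustration; the actual feature set used by the project is not described above.

```python
import math


def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def recommend(name: str, dishes: dict, top_k: int = 2):
    """Rank every other dish by cosine similarity to the named dish."""
    query = dishes[name]
    scored = [(other, cosine(query, vec))
              for other, vec in dishes.items() if other != name]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy ingredient vectors (hypothetical, not taken from the scraped data).
dishes = {
    "rendang": {"beef": 1, "coconut milk": 1, "chili": 1, "galangal": 1},
    "gulai": {"chicken": 1, "coconut milk": 1, "chili": 1, "turmeric": 1},
    "gado-gado": {"peanut": 1, "tofu": 1, "cabbage": 1, "chili": 1},
}

for dish, similarity in recommend("rendang", dishes):
    print(dish, round(similarity, 2))
```

Because cosine similarity ignores vector length, a recipe with many ingredients is not favoured over a short one that shares the same proportion of them.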

Diksonari blong Melanesia

Quadrilingual dictionary of Melanesian creoles. Collated and translated data from various sources into a single database. Automated data curation using Bash regex to rapidly generate a MySQL database powering an interactive online dictionary built using HTML and JavaScript.
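The curation pipeline itself used Bash and MySQL. The sketch below shows the same idea in Python, with the standard-library sqlite3 module standing in for MySQL so that it runs without a server. The raw line format (four semicolon-separated forms), the table and column names, and the placeholder entries are all hypothetical.

```python
import re
import sqlite3

# Hypothetical raw format: four forms separated by semicolons,
# with inconsistent whitespace around each field.
LINE = re.compile(
    r"^\s*([^;]+?)\s*;\s*([^;]+?)\s*;\s*([^;]+?)\s*;\s*([^;]+?)\s*$"
)


def parse(lines):
    """Keep only well-formed lines and return them as 4-tuples."""
    rows = []
    for line in lines:
        match = LINE.match(line)
        if match:
            rows.append(match.groups())
    return rows


# Placeholder entries, not real dictionary data.
raw = [
    "  form_a1 ;form_b1;  form_c1 ; gloss_1 ",
    "malformed line without separators",
    "form_a2;form_b2;form_c2;gloss_2",
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entry (lang_1 TEXT, lang_2 TEXT, lang_3 TEXT, lang_4 TEXT)"
)
# Parameterised inserts avoid quoting problems in the source data.
conn.executemany("INSERT INTO entry VALUES (?, ?, ?, ?)", parse(raw))
rows = conn.execute("SELECT * FROM entry ORDER BY lang_1").fetchall()
print(rows)
```

Rejecting lines that do not match the pattern, rather than guessing at their structure, keeps malformed source entries out of the database so they can be reviewed by hand.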

Digitising the Holle Lists

Digitised 11 volumes of typewritten vocabularies covering 300 indigenous Indonesian languages, collected over the past century, using Tesseract OCR and Bash. Populated a relational database from the OCR output and deployed an interactive web app to help preserve endangered languages.

Kamus Dwibahasa Ambon-Inggris

Ambonese Malay–English dictionary. Cleaned and standardised orthography across multiple sources, automated curation with Bash regex to generate a MySQL database powering an interactive online dictionary, and produced a print volume typeset with a custom TeX class.