Bayesian Risk-Based Authentication

Compared three Bayesian approaches to detecting account takeovers in login data with a ~0.05% attack rate, under the constraint that no more than 5% of legitimate users are challenged with MFA. The three models were a logistic regression with domain-informed priors; a group-structured Bayesian network that decomposes risk into novelty, geographic, reputation and velocity signals; and a Dirichlet process mixture model that learns normal login behaviour unsupervised and flags logins that fit no known cluster. All three are fitted with ADVI and produce per-login explanations suitable for customer support.
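
The 5% challenge constraint can be turned into a score threshold in a few lines. The sketch below is an illustration only, not the project's actual code: it assumes each model outputs a single numeric risk score per login and that a held-out set of known-legitimate scores is available. The names `challenge_threshold` and `should_challenge` are hypothetical.

```python
import math


def challenge_threshold(legit_scores, max_challenge_rate=0.05):
    """Return the smallest observed score t such that at most
    max_challenge_rate of the legitimate scores are strictly above t.

    Assumption: higher score means higher takeover risk.
    """
    if not legit_scores:
        raise ValueError("need at least one legitimate score")
    ordered = sorted(legit_scores)
    n = len(ordered)
    # How many legitimate logins we are allowed to challenge.
    allowed = min(math.floor(max_challenge_rate * n), n - 1)
    # Everything strictly above this value gets challenged.
    return ordered[n - allowed - 1]


def should_challenge(score, threshold):
    """Challenge with MFA only when the score exceeds the threshold."""
    return score > threshold
```

Because the comparison is strict, tied scores at the threshold are not challenged, so the realised challenge rate on the calibration set never exceeds the budget.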

Pure ML Proto-Language Reconstruction

Investigated whether purely ML approaches could reconstruct proto-languages from modern descendants. Built multiple PyTorch architectures: GNNs treating language families as graph structures, VAEs for phonological feature encoding, and attention mechanisms for sound correspondences. Tested on Romance cognates with Latin as ground truth. Accuracy of 7-20% confirmed that the problem is underdetermined without linguistic priors to constrain the solution space.

Two-Stage Bayesian Claim Modelling

Predicting ultimate insurance claim amounts using a hurdle model built in PyMC. Stage one classifies claims as high or low value via logistic regression; stage two fits a continuous distribution (lognormal, gamma, or Weibull) to the amount. Full posterior propagation gives calibrated credible intervals rather than point estimates. The model achieves a ~45% improvement in MAE over a median baseline, with 96% of actuals falling within the 95% credible interval.
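
The two headline metrics can be computed directly from posterior predictive draws. The sketch below is a minimal illustration under stated assumptions, not the project's evaluation code: it assumes one list of predictive draws per claim, equal-tailed intervals, and a single constant baseline prediction (for example, the median of the training claims). All function names are hypothetical.

```python
def percentile(sorted_vals, q):
    """Linear-interpolation percentile of a pre-sorted list, q in [0, 1]."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def interval_coverage(actuals, predictive_draws, level=0.95):
    """Fraction of actuals inside the equal-tailed credible interval
    built from each claim's posterior predictive draws."""
    tail = (1 - level) / 2
    hits = 0
    for y, draws in zip(actuals, predictive_draws):
        s = sorted(draws)
        if percentile(s, tail) <= y <= percentile(s, 1 - tail):
            hits += 1
    return hits / len(actuals)


def mae(actuals, preds):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actuals, preds)) / len(actuals)


def mae_improvement(actuals, preds, baseline_value):
    """Relative MAE reduction versus predicting baseline_value for every claim."""
    baseline_mae = mae(actuals, [baseline_value] * len(actuals))
    return 1 - mae(actuals, preds) / baseline_mae
```

A well-calibrated model should give coverage close to the nominal level, which is why 96% against a 95% interval is a reassuring result.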

Culinary Atlas of Indonesia

Exploring Indonesian cuisine through data science. Scraped 50,000+ Indonesian recipes using Beautiful Soup and applied unsupervised ML (UMAP dimensionality reduction, GMM clustering with BIC model selection, and iterative chi-square feature selection) to identify regional culinary families. Deployed a FastAPI backend and Leaflet.js frontend that let users explore geographic culinary patterns and discover recipe recommendations based on cosine similarity between dishes.
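
Cosine-similarity recommendation can be sketched as follows. This is an illustration rather than the deployed backend: it assumes each dish is represented as a sparse mapping from feature name (for example, an ingredient) to a numeric weight, and the names `cosine_similarity` and `recommend` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def recommend(query, dishes, top_n=3):
    """Return the top_n (name, similarity) pairs most similar to `query`.

    `dishes` maps dish name -> sparse feature dict; the query dish
    itself is excluded from the results.
    """
    scored = [
        (name, cosine_similarity(dishes[query], vec))
        for name, vec in dishes.items()
        if name != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

Cosine similarity ignores vector length, so a recipe with a long ingredient list is not favoured over a short one that shares the same core ingredients.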

Diksonari blong Melanesia

Quadrilingual dictionary of Melanesian creoles. Collated and translated data from various sources into a single database. Automated data curation using Bash regex to rapidly generate a MySQL database powering an interactive online dictionary built using HTML and JavaScript.

Digitising the Holle Lists

Digitised 11 volumes of typewritten vocabularies of 300 indigenous Indonesian languages, collected over the past century, using Tesseract OCR and Bash. Populated a relational database with the results and deployed an interactive web app to preserve endangered languages.

Kamus Dwibahasa Ambon-Inggris

Ambonese Malay–English dictionary. Cleaned and standardised orthography across multiple sources, automated curation with Bash regex to generate a MySQL database powering an interactive online dictionary, and produced a print volume typeset with a custom TeX class.