Culinary Atlas of Indonesia

Exploring Indonesian cuisine through data science

This map shows the dominant culinary family in each region and how strongly it dominates.

Search for a dish above to see where to find other similar dishes, or see below for details on how this data has been compiled.


This system applies unsupervised machine learning to discover latent culinary patterns across Indonesian cuisine (as represented on Cookpad), revealing ingredient-based "culinary families" and their geographic distributions through probabilistic clustering and interactive geospatial visualisation.

We search for 'khas [region]' ("typical of [region]") on Cookpad and take the first 1,000 results for each search. We scrape the ingredient block from each recipe and convert raw Indonesian ingredient names to normalised English terms using a comprehensive mapping dictionary. Quantities, measurements, and qualifiers are stripped with regex patterns, leaving only the core ingredients. The result is a sparse binary matrix of dimensions (n_dishes × n_ingredients), where each cell indicates the presence (1) or absence (0) of an ingredient in a dish.
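The normalisation and matrix-building steps can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the mapping entries, the regex patterns, and the names `INGREDIENT_MAP`, `normalise`, and `build_matrix` are all hypothetical, and dropping unmapped ingredients is an assumption. A dense list of lists stands in for the sparse matrix to keep the example dependency-free.

```python
import re

# Hypothetical excerpt of the Indonesian -> English mapping dictionary.
INGREDIENT_MAP = {
    "bawang merah": "shallot",
    "bawang putih": "garlic",
    "santan": "coconut milk",
    "garam": "salt",
}

# Illustrative patterns only: a leading quantity, a leading unit, and
# anything from the first comma, bracket, or qualifier word onwards.
QUANTITY = re.compile(r"^\s*[\d/.,\-]+\s*")
UNIT = re.compile(r"^(siung|buah|butir|sdm|sdt|gram|gr|ml|kg|lembar|batang)\b\s*")
QUALIFIER = re.compile(r"\s*(?:[,(]|\b(?:secukupnya|iris|cincang|haluskan)\b).*$")


def normalise(raw):
    """Reduce one raw ingredient line to a normalised English term, or None."""
    text = raw.lower().strip()
    text = QUANTITY.sub("", text)
    text = UNIT.sub("", text)
    text = QUALIFIER.sub("", text).strip()
    return INGREDIENT_MAP.get(text)  # assumption: unmapped ingredients are dropped


def build_matrix(recipes):
    """Turn lists of raw ingredient lines into a binary dish x ingredient matrix."""
    parsed = [{term for term in map(normalise, lines) if term} for lines in recipes]
    vocab = sorted(set().union(*parsed))
    matrix = [[1 if term in dish else 0 for term in vocab] for dish in parsed]
    return matrix, vocab


recipes = [
    ["5 siung bawang merah, iris tipis", "200 ml santan"],
    ["garam secukupnya", "3 siung bawang putih"],
]
matrix, vocab = build_matrix(recipes)
```

Each row of `matrix` is one dish and each column one ingredient in `vocab`. At the scale described here, a sparse representation would be the more natural choice.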

We compute pairwise cosine similarity between all dishes from these ingredient vectors. For binary vectors this is a straightforward measure of shared ingredients, and it serves as our measure of how close two dishes are.

To identify culinary families, we use Gaussian Mixture Model (GMM) clustering with iterative chi-square feature selection. Chi-square feature selection run against a single clustering can overfit to that one random initialisation of the GMM, so we repeat the process across many initialisations. We run preliminary GMM clustering with different random seeds. Each iteration uses the Bayesian Information Criterion (BIC) to select the number of components, performs GMM clustering on the full ingredient matrix, and validates cluster balance, rejecting any result whose imbalance ratio exceeds 20:1.

For each successful iteration, we compute chi-square statistics between ingredient presence and cluster assignments, treat ingredients with p ≤ 0.05 as significant, and reject iterations with fewer than 10 significant ingredients. We then weight each iteration by its cluster balance (the inverse of its imbalance ratio), calculate the weighted average of the ingredient importance scores, apply a stability penalty to ingredients that behave inconsistently across iterations, and identify stable significant ingredients: those that are significant in ≥70% of iterations.

The final clustering fits a GMM with the optimal number of components and uses full covariance matrices to capture ingredient correlations. It assigns each dish to a single cluster (the one with the highest posterior probability) and also returns the full posterior distribution across all clusters.
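The iterative procedure might look something like the sketch below, written with scikit-learn. Several details are assumptions rather than the project's confirmed choices: the imbalance ratio is taken as largest cluster size over smallest, the chi-square statistic is used as the importance score, the stability penalty divides by one plus the coefficient of variation, the 70% threshold is applied over accepted iterations only, and `reg_covar=1e-3` is added purely for numerical stability. The function names are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import chi2
from sklearn.mixture import GaussianMixture


def fit_best_gmm(X, k_range, seed):
    """Fit one full-covariance GMM per candidate k and keep the lowest-BIC model."""
    models = [
        GaussianMixture(n_components=k, covariance_type="full",
                        reg_covar=1e-3,  # assumed regularisation, not from the text
                        random_state=seed).fit(X)
        for k in k_range
    ]
    return min(models, key=lambda m: m.bic(X))


def stable_ingredients(X, k_range=range(2, 5), n_iter=10, max_imbalance=20.0,
                       alpha=0.05, min_significant=10, min_frequency=0.7):
    """Return (penalised importance per ingredient, boolean mask of stable ones)."""
    weights, masks, scores = [], [], []
    for seed in range(n_iter):
        gmm = fit_best_gmm(X, k_range, seed)
        labels = gmm.predict(X)
        sizes = np.bincount(labels, minlength=gmm.n_components)
        if sizes.min() == 0:
            continue  # an empty cluster counts as maximally imbalanced
        imbalance = sizes.max() / sizes.min()  # assumed definition of the ratio
        if imbalance > max_imbalance:
            continue
        stat, p = chi2(X, labels)
        stat = np.nan_to_num(stat, nan=0.0)  # ingredients absent from every dish
        p = np.nan_to_num(p, nan=1.0)
        mask = p <= alpha
        if mask.sum() < min_significant:
            continue
        weights.append(1.0 / imbalance)
        masks.append(mask)
        scores.append(stat)
    if not weights:
        raise RuntimeError("no iteration passed the balance and significance checks")
    scores = np.array(scores)
    importance = np.average(scores, axis=0, weights=np.array(weights))
    # Assumed penalty: shrink scores that vary a lot between iterations.
    cv = scores.std(axis=0) / (scores.mean(axis=0) + 1e-12)
    penalised = importance / (1.0 + cv)
    frequency = np.mean(masks, axis=0)  # share of accepted iterations
    return penalised, frequency >= min_frequency


def final_assignments(X, n_components, seed=0):
    """Fit the final GMM; return hard labels (argmax) and the full posteriors."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="full",
                          reg_covar=1e-3, random_state=seed).fit(X)
    posteriors = gmm.predict_proba(X)
    return posteriors.argmax(axis=1), posteriors
```

Here `X` is the binary dish × ingredient matrix as a NumPy array. The sketch does not show how the stable ingredients feed into the final fit, since the description above leaves that step open.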

For the main choropleth, we group dishes by level 2 geographic subdivision (kabupaten/kota, i.e. regency/city), count how many dishes in each region belong to each family, and colour the region by its dominant family. The opacity indicates the proportion of the region's dishes that belong to the dominant cluster, so darker regions have more homogeneous cuisine. We also display a pie chart for each region showing the full distribution of cluster memberships, revealing the mixture of culinary traditions present.

For a specific dish, we display a cluster membership pie chart showing its probability distribution across all culinary families, which highlights dishes that straddle multiple traditions. We also use UMAP to project the high-dimensional ingredient space into 2D, displaying the dish alongside its five nearest neighbours by cosine similarity. Together, these views reveal both the broad structure of Indonesian culinary traditions and the fine-grained similarities between individual dishes.
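As a rough illustration of the two aggregation steps behind these views, the sketch below computes the dominant family and opacity for each region, and a dish's nearest neighbours by cosine similarity over ingredient sets. The function names and the data layout are hypothetical, and the UMAP projection itself is not shown.

```python
from collections import Counter, defaultdict
from math import sqrt


def region_summary(assignments):
    """assignments: iterable of (region, family) pairs, one per dish."""
    counts = defaultdict(Counter)
    for region, family in assignments:
        counts[region][family] += 1
    summary = {}
    for region, counter in counts.items():
        total = sum(counter.values())
        dominant, n = counter.most_common(1)[0]
        summary[region] = {
            "dominant": dominant,   # sets the region's colour
            "opacity": n / total,   # share of dishes in the dominant family
            "distribution": {f: c / total for f, c in counter.items()},  # pie chart
        }
    return summary


def cosine(a, b):
    """Cosine similarity of two binary vectors, given as sets of ingredients."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def nearest(name, dishes, k=5):
    """Return the k dishes most similar to `name`; ties are broken alphabetically."""
    target = dishes[name]
    ranked = sorted(
        ((cosine(target, ingredients), other)
         for other, ingredients in dishes.items() if other != name),
        key=lambda pair: (-pair[0], pair[1]),
    )
    return [other for _, other in ranked[:k]]
```

Representing each dish as a set of ingredients is equivalent to a row of the binary matrix: the dot product is the number of shared ingredients, and each norm is the square root of the ingredient count.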
