
Provenance in AI: Auto-Capturing Provenance with MLflow and W3C PROV-O in PyTorch Pipelines – Part 4


AI engineers spend a lot of time building, training, and iterating on models. But as pipelines grow more complex, it becomes difficult to answer simple but crucial questions:

  • Which dataset version trained this model?

  • Which parameters were used?

  • Who triggered this training job?

  • Can I reproduce this run six months later?

Without structured provenance tracking, reproducibility and compliance become almost impossible. In regulated domains, this is not optional — it’s mandatory.

In this article, we’ll show how to integrate W3C PROV-O (a standard for provenance modeling) with MLflow (a popular experiment tracking framework) in a PyTorch pipeline. The result: every training run not only logs metrics and artifacts but also generates a machine-readable provenance graph for accountability, auditability, and governance.

🔎 Background: Why PROV-O + MLflow?

  • MLflow is widely used for experiment tracking. It records metrics, parameters, and artifacts like models and logs. However, MLflow’s logs are application-specific and not standardized for knowledge sharing across systems.

  • W3C PROV-O is a semantic ontology (built on RDF/OWL2) that provides a standardized vocabulary for describing provenance: Entities, Activities, and Agents, and their relationships (prov:used, prov:wasGeneratedBy, prov:wasAttributedTo).

By combining the two:

  • MLflow provides the data source of truth for training runs.

  • PROV-O provides the interoperable representation of lineage, useful for audits, governance, and integration into knowledge graphs.
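
To make the PROV-O vocabulary concrete before wiring it into MLflow, here is a minimal rdflib sketch of the three core classes and the three relations between them. The URIs under http://example.org/ are placeholders for this sketch only:

code
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/")  # placeholder namespace for this sketch

g = Graph()
g.bind("prov", PROV)
g.bind("ex", EX)

# The three core classes: Entity, Activity, Agent
g.add((EX["dataset.csv"], RDF.type, PROV.Entity))
g.add((EX["model.pth"], RDF.type, PROV.Entity))
g.add((EX["training_run"], RDF.type, PROV.Activity))
g.add((EX["engineer"], RDF.type, PROV.Agent))

# The three core relations
g.add((EX["training_run"], PROV.used, EX["dataset.csv"]))          # prov:used
g.add((EX["model.pth"], PROV.wasGeneratedBy, EX["training_run"]))  # prov:wasGeneratedBy
g.add((EX["model.pth"], PROV.wasAttributedTo, EX["engineer"]))     # prov:wasAttributedTo

print(g.serialize(format="turtle"))

The tracker we build in Step 3 below is essentially this pattern, populated with real MLflow run metadata.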

🏗️ Architecture Overview

Our integration maps MLflow concepts to PROV-O concepts:

| MLflow Concept | PROV-O Equivalent | Example |
| --- | --- | --- |
| MLflow Run | prov:Activity | Training job run ID f4a22 |
| MLflow Artifact (model) | prov:Entity | model_v1.pth |
| Dataset (input) | prov:Entity | dataset.csv |
| Metrics (loss, accuracy) | prov:Entity | metrics.json |
| MLflow User/System | prov:Agent | Engineer triggering the run |

⚙️ Step 1: Setup

We need MLflow (for experiment tracking), PyTorch (for the model), and rdflib plus prov (for building provenance graphs).

code
pip install mlflow torch rdflib prov

  • mlflow → tracks experiments, models, metrics, and artifacts.

  • torch → used for building the PyTorch model.

  • rdflib → builds and serializes RDF/PROV-O graphs.

  • prov → utilities for working with W3C PROV specifications.
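
By default, MLflow writes runs to a local ./mlruns directory. If you want runs (and the provenance artifacts we log later) to land on a shared tracking server, point MLflow at it before training. A minimal sketch, where the URI and experiment name are placeholders for your own setup:

code
import mlflow

# Optional: use a tracking server instead of the local ./mlruns folder.
# Both values below are placeholders.
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("provenance-demo")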

🧑‍💻 Step 2: PyTorch Training with MLflow Logging

We start with a simple PyTorch script that trains a small neural network while logging to MLflow.

code
import torch
import torch.nn as nn
import torch.optim as optim
import mlflow
import mlflow.pytorch

# Fake dataset
X = torch.randn(100, 10)
y = torch.randint(0, 2, (100,))

# Simple NN
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

with mlflow.start_run() as run:
    for epoch in range(5):
        optimizer.zero_grad()
        preds = model(X)
        loss = loss_fn(preds, y)
        loss.backward()
        optimizer.step()
        mlflow.log_metric("loss", loss.item(), step=epoch)
    mlflow.log_param("lr", 0.001)
    mlflow.pytorch.log_model(model, "model")

At this point, MLflow is recording metrics (loss), params (lr), and the trained model artifact. But it doesn’t capture semantic provenance — for example, which dataset was used, who ran this job, and how results are connected.
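
You can see the gap by reading the run back through the tracking client. A minimal sketch, assuming the run above has finished and `run` is still in scope:

code
from mlflow.tracking import MlflowClient

client = MlflowClient()
finished = client.get_run(run.info.run_id)

# MLflow knows about metrics and params...
print(finished.data.metrics)   # e.g. {'loss': ...}
print(finished.data.params)    # e.g. {'lr': '0.001'}

# ...but nothing here states which dataset was used, who triggered the run,
# or how the model, metrics, and data relate to one another.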

🔗 Step 3: Provenance Tracker for MLflow

Here’s where PROV-O comes in. We build a Provenance Tracker that:

  1. Defines entities (datasets, models, metrics).

  2. Defines activities (the MLflow run).

  3. Defines agents (engineer, system).

  4. Links them using PROV-O relations.

  5. Serializes into Turtle (.ttl) or JSON-LD.

code
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, FOAF
import mlflow

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/")

def log_provenance(run):
    g = Graph()
    g.bind("prov", PROV)
    g.bind("ex", EX)

    # Agent (engineer/system)
    user = EX["engineer"]
    g.add((user, RDF.type, PROV.Agent))
    g.add((user, FOAF.name, Literal("AI Engineer")))

    # Activity (the MLflow run)
    activity = EX[f"run_{run.info.run_id}"]
    g.add((activity, RDF.type, PROV.Activity))

    # Input dataset
    dataset = EX["dataset.csv"]
    g.add((dataset, RDF.type, PROV.Entity))
    g.add((activity, PROV.used, dataset))

    # Model entity
    model = EX[f"model_{run.info.run_id}.pth"]
    g.add((model, RDF.type, PROV.Entity))
    g.add((model, PROV.wasGeneratedBy, activity))
    g.add((model, PROV.wasAttributedTo, user))

    # Metrics entity
    metrics = EX[f"metrics_{run.info.run_id}.json"]
    g.add((metrics, RDF.type, PROV.Entity))
    g.add((metrics, PROV.wasGeneratedBy, activity))
    g.add((metrics, PROV.wasAttributedTo, user))

    # Serialize + store
    prov_file = f"prov_{run.info.run_id}.ttl"
    g.serialize(prov_file, format="turtle")
    mlflow.log_artifact(prov_file, artifact_path="provenance")
    print(f"✅ Provenance logged in {prov_file}")

📦 Step 4: Integrate Tracker

Modify the training script to call log_provenance(run) after training completes.

code
with mlflow.start_run() as run:
    # Training loop (as above) ...
    mlflow.pytorch.log_model(model, "model")

    # Capture provenance
    log_provenance(run)

Now every MLflow run will automatically create a provenance graph and store it alongside model artifacts.
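
To pull the provenance file back out of MLflow later, for example during an audit, you can download the provenance artifact folder of a run. A minimal sketch, assuming MLflow 2.x (where the mlflow.artifacts module is available) and using the run ID from the example output below:

code
import mlflow

# run_id of the MLflow run you want to audit (placeholder value)
run_id = "70d8b46c6451416d92a0ae7cac4c8602"

# Download the "provenance" artifact folder of that run to a local path
local_dir = mlflow.artifacts.download_artifacts(
    run_id=run_id,
    artifact_path="provenance",
)
print(f"Provenance graphs downloaded to: {local_dir}")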

The final script, train-small-nn-pytorch.py, looks like this:

code
import torch
import torch.nn as nn
import torch.optim as optim
import mlflow
import mlflow.pytorch
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, FOAF

# Fake dataset
X = torch.randn(100, 10)
y = torch.randint(0, 2, (100,))

# Simple NN
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Provenance Tracker for MLflow
PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/")

def log_provenance(run):
    g = Graph()
    g.bind("prov", PROV)
    g.bind("ex", EX)

    # Agent (engineer/system)
    user = EX["engineer"]
    g.add((user, RDF.type, PROV.Agent))
    g.add((user, FOAF.name, Literal("AI Engineer")))

    # Activity (the MLflow run)
    activity = EX[f"run_{run.info.run_id}"]
    g.add((activity, RDF.type, PROV.Activity))

    # Input dataset
    dataset = EX["dataset.csv"]
    g.add((dataset, RDF.type, PROV.Entity))
    g.add((activity, PROV.used, dataset))

    # Model entity
    model = EX[f"model_{run.info.run_id}.pth"]
    g.add((model, RDF.type, PROV.Entity))
    g.add((model, PROV.wasGeneratedBy, activity))
    g.add((model, PROV.wasAttributedTo, user))

    # Metrics entity
    metrics = EX[f"metrics_{run.info.run_id}.json"]
    g.add((metrics, RDF.type, PROV.Entity))
    g.add((metrics, PROV.wasGeneratedBy, activity))
    g.add((metrics, PROV.wasAttributedTo, user))

    # Serialize + store
    prov_file = f"prov_{run.info.run_id}.ttl"
    g.serialize(prov_file, format="turtle")
    mlflow.log_artifact(prov_file, artifact_path="provenance")
    print(f"✅ Provenance logged in {prov_file}")

# MLflow run: train, log, and capture provenance
with mlflow.start_run() as run:
    # Training loop
    for epoch in range(5):
        optimizer.zero_grad()
        preds = model(X)
        loss = loss_fn(preds, y)
        loss.backward()
        optimizer.step()
        mlflow.log_metric("loss", loss.item(), step=epoch)

    mlflow.log_param("lr", 0.001)
    mlflow.pytorch.log_model(model, "model")

    # Capture provenance
    log_provenance(run)

📂 Step 5: Example Output

Provenance graph (Turtle format) prov_70d8b46c6451416d92a0ae7cac4c8602.ttl:

code
@prefix ex: <http://example.org/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix prov: <http://www.w3.org/ns/prov#> .

ex:metrics_70d8b46c6451416d92a0ae7cac4c8602.json a prov:Entity ;
    prov:wasAttributedTo ex:engineer ;
    prov:wasGeneratedBy ex:run_70d8b46c6451416d92a0ae7cac4c8602 .

ex:model_70d8b46c6451416d92a0ae7cac4c8602.pth a prov:Entity ;
    prov:wasAttributedTo ex:engineer ;
    prov:wasGeneratedBy ex:run_70d8b46c6451416d92a0ae7cac4c8602 .

ex:dataset.csv a prov:Entity .

ex:engineer a prov:Agent ;
    foaf:name "AI Engineer" .

ex:run_70d8b46c6451416d92a0ae7cac4c8602 a prov:Activity ;
    prov:used ex:dataset.csv .

This graph is machine-readable and interoperable with semantic web tools, knowledge graphs, and governance platforms.
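
Because the file is plain RDF, converting it to another serialization for downstream tooling is a one-liner with rdflib. For example, a JSON-LD copy for a catalog that prefers JSON; this sketch assumes a recent rdflib (6.x or later), which bundles the JSON-LD serializer:

code
from rdflib import Graph

g = Graph()
g.parse("prov_70d8b46c6451416d92a0ae7cac4c8602.ttl", format="turtle")

# Re-serialize the same provenance graph as JSON-LD for JSON-based tooling
g.serialize("prov_70d8b46c6451416d92a0ae7cac4c8602.jsonld", format="json-ld")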

🔍 Step 6: Query Provenance

Since PROV-O is RDF-based, we can load graphs into a triple store and query with SPARQL. The following are a few example queries:

1️⃣ Which dataset was used to generate a given model?

code
SELECT ?dataset WHERE {
  ex:model_70d8b46c6451416d92a0ae7cac4c8602.pth prov:wasGeneratedBy ?activity .
  ?activity prov:used ?dataset .
}

This query returns dataset.csv as the dataset used to train model_70d8b46c6451416d92a0ae7cac4c8602.pth.

The SPARQL queries can be run using the following Python script:

code
import rdflib

# Create a Graph object
g = rdflib.Graph()

# Parse the TTL file into the graph
g.parse("prov_70d8b46c6451416d92a0ae7cac4c8602.ttl", format="turtle")

# Define the SPARQL query (prefixes declared explicitly for portability)
sparql_query = """
PREFIX ex: <http://example.org/>
PREFIX prov: <http://www.w3.org/ns/prov#>

SELECT ?dataset WHERE {
  ex:model_70d8b46c6451416d92a0ae7cac4c8602.pth prov:wasGeneratedBy ?activity .
  ?activity prov:used ?dataset .
}
"""

# Execute the query
results = g.query(sparql_query)

# Process the results
for row in results:
    print(row)

2️⃣ All models generated by a given engineer

code
SELECT ?model
WHERE {
  ?model a prov:Entity ;
         prov:wasAttributedTo ex:engineer .
}

👉 Returns all entity URIs attributed to ex:engineer. In our graph this includes the metrics entities as well as the models, so filter on the URI pattern if you only want models.

3️⃣ All datasets used in the last month

If your provenance tracker adds prov:generatedAtTime or similar timestamps on entities/activities, you can filter by date. Example:

code
SELECT ?dataset ?time
WHERE {
  ?activity a prov:Activity ;
            prov:used ?dataset ;
            prov:endedAtTime ?time .
  ?dataset a prov:Entity .
  FILTER (?time >= "2025-07-28T00:00:00Z"^^xsd:dateTime &&
          ?time <= "2025-08-28T23:59:59Z"^^xsd:dateTime)
}

👉 This finds all prov:Entity datasets used by any activity that ended in the last month.

4️⃣ Provenance chains across multiple runs (for auditing)

Here we want to trace lineage from dataset → activity → model → metrics.

code
SELECT ?dataset ?activity ?model ?metrics
WHERE {
  ?dataset a prov:Entity .
  ?activity a prov:Activity ;
            prov:used ?dataset ;
            prov:generated ?model, ?metrics .
  ?model a prov:Entity .
  ?metrics a prov:Entity .
}

👉 This gives a table of full provenance chains, so you can audit multiple runs together.
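
To audit across runs you first need all the per-run graphs in one place. A simple approach is to parse every prov_*.ttl file into a single rdflib graph and run the chain query over the merged graph. A sketch, where the glob pattern is an assumption about where the TTL files live; note it relies on prov:generated, which the Step 3 tracker does not yet emit (the ⚡ note and the extension sketched after query 5️⃣ cover this):

code
import glob
import rdflib

# Merge all per-run provenance graphs into one graph for cross-run auditing
g = rdflib.Graph()
for ttl_file in glob.glob("prov_*.ttl"):
    g.parse(ttl_file, format="turtle")

chain_query = """
PREFIX prov: <http://www.w3.org/ns/prov#>

SELECT ?dataset ?activity ?model ?metrics
WHERE {
  ?dataset a prov:Entity .
  ?activity a prov:Activity ;
            prov:used ?dataset ;
            prov:generated ?model, ?metrics .
  ?model a prov:Entity .
  ?metrics a prov:Entity .
}
"""

for row in g.query(chain_query):
    print(row)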

5️⃣ Find all runs that reused the same dataset

Useful for detecting data reuse:

code
SELECT ?dataset (GROUP_CONCAT(?model; separator=", ") AS ?models)
WHERE {
  ?activity prov:used ?dataset ;
            prov:generated ?model .
}
GROUP BY ?dataset
HAVING (COUNT(?model) > 1)

👉 Returns datasets that were reused in multiple model generations.

⚡ These queries assume you have prov:used, prov:generated, prov:wasAttributedTo, and timestamps (prov:endedAtTime or prov:generatedAtTime) in your TTL logs.
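
If you want the generated graphs to satisfy those assumptions, a small extension to log_provenance from Step 3 can add the missing triples. A minimal sketch: the helper name add_timestamps is hypothetical, and using prov:endedAtTime plus prov:generatedAtTime with UTC timestamps is one reasonable choice, not the only one.

code
from datetime import datetime, timezone
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import XSD

PROV = Namespace("http://www.w3.org/ns/prov#")

def add_timestamps(g: Graph, activity: URIRef, model: URIRef, metrics: URIRef) -> None:
    """Add timestamp triples and forward prov:generated links so the
    date-filtered and prov:generated-based queries above have data to match."""
    now = Literal(datetime.now(timezone.utc).isoformat(), datatype=XSD.dateTime)

    # When the activity ended and when its outputs were generated
    g.add((activity, PROV.endedAtTime, now))
    g.add((model, PROV.generatedAtTime, now))
    g.add((metrics, PROV.generatedAtTime, now))

    # Forward-direction generation links (inverse of prov:wasGeneratedBy)
    g.add((activity, PROV.generated, model))
    g.add((activity, PROV.generated, metrics))

Call add_timestamps(g, activity, model, metrics) at the end of log_provenance, just before serializing the graph.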

✅ Why This Matters

By extending MLflow with PROV-O, AI engineers gain:

  • Reproducibility → Every model is linked to the exact data and parameters that generated it.

  • Auditability → Regulators and compliance teams can trace how outputs were produced.

  • Transparency → Business stakeholders can understand lineage without relying on tribal knowledge.

  • Interoperability → Since PROV-O is a W3C standard, provenance metadata can be integrated into external governance, data catalog, and knowledge graph systems.


🚀 What We Learnt

We’ve seen how to:

  1. Train a PyTorch model with MLflow.

  2. Capture provenance automatically using PROV-O.

  3. Serialize provenance graphs as RDF/Turtle.

  4. Query lineage with SPARQL.