
Provenance in AI: Tracking AI Lineage with Signed Provenance Logs in Python - Part 2


In modern AI pipelines, provenance — the lineage of datasets, models, and inferences — is becoming as important as accuracy metrics. Regulators, auditors, and even downstream consumers increasingly demand answers to questions like:

  • Which dataset was this model trained on?

  • What code commit produced this artifact?

  • How do we know logs weren’t tampered with after training?

To learn more about provenance in AI, read my previous article: Provenance in AI: Why It Matters for AI Engineers - Part 1

To answer these questions, let's walk through a Python-based provenance tracker that logs lineage events, cryptographically signs them, and maintains schema versioning for forward compatibility.

1. The Provenance Tracker: Key Features

The ProvenanceTracker implements three important ideas:

  1. Multiple dataset support

    • Models often train on more than one dataset (train + validation + test).

    • This tracker keeps a list of dataset hashes (dataset_hashes) and auto-links them to model logs.

  2. Signed JSONL envelopes

    • Each log entry is wrapped in an envelope:
      { "schema_version": "1.1", "signed_data": "{…}", "signature": "" }

    • signed_data is serialized with stable JSON (sort_keys=True); a short sketch after this list shows why that matters.

    • A digital signature (RSA + PSS padding + SHA-256) is generated using a private key.

  3. Schema versioning

    • schema_version = "1.1" is embedded in every record.
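
Why does stable serialization matter? By default, json.dumps preserves dict insertion order, so two logically identical records can serialize to different byte strings and therefore produce different signatures. A minimal sketch (the record values are made up):

code
import json

# Same logical record, different key insertion order
a = {"type": "dataset", "path": "data/training.csv"}
b = {"path": "data/training.csv", "type": "dataset"}

# Default serialization follows insertion order -> different strings
print(json.dumps(a) == json.dumps(b))  # False

# sort_keys=True yields one canonical string -> reproducible signatures
print(json.dumps(a, sort_keys=True) == json.dumps(b, sort_keys=True))  # True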

2. The Provenance Tracker: Source Code

Before we get to the provenance tracker itself, here is a companion script, generate_keys.py, that creates the RSA keypair (private_key.pem, public_key.pem). ProvenanceTracker.py uses the private key to sign the JSONL logs.

code
# generate_keys.py
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography.hazmat.primitives import serialization

# Generate RSA private key (2048 bits)
private_key = rsa.generate_private_key(
    public_exponent=65537,
    key_size=2048,
)

# Save private key (PEM)
with open("private_key.pem", "wb") as f:
    f.write(
        private_key.private_bytes(
            encoding=serialization.Encoding.PEM,
            format=serialization.PrivateFormat.PKCS8,
            encryption_algorithm=serialization.NoEncryption(),
        )
    )

# Save public key (PEM)
public_key = private_key.public_key()
with open("public_key.pem", "wb") as f:
    f.write(
        public_key.public_bytes(
            encoding=serialization.Encoding.PEM,
            format=serialization.PublicFormat.SubjectPublicKeyInfo,
        )
    )

print("✅ RSA keypair generated: private_key.pem & public_key.pem")

Run once to create your keypair:

python generate_keys.py
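
As an optional sanity check (not part of the original scripts), you can load both PEM files back with the cryptography library to confirm they parse and report the expected key size:

code
# check_keys.py -- optional sanity check for the generated PEM files
from cryptography.hazmat.primitives.serialization import (
    load_pem_private_key,
    load_pem_public_key,
)

with open("private_key.pem", "rb") as f:
    private_key = load_pem_private_key(f.read(), password=None)

with open("public_key.pem", "rb") as f:
    public_key = load_pem_public_key(f.read())

# Both should report 2048 bits for the keypair generated above
print("private key size:", private_key.key_size)
print("public key size :", public_key.key_size)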

Here’s a secure ProvenanceTracker (schema version 1.1) that:

  • Supports multiple datasets

  • Includes schema version

  • Signs JSONL using RSA private key

code
# ProvenanceTracker.py
import hashlib
import json
import os
import platform
import socket
import subprocess
import base64
from datetime import datetime
from typing import Any, Dict, List

from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding


class ProvenanceTracker:
    SCHEMA_VERSION = "1.1"

    def __init__(self, storage_path: str = "provenance_logs.jsonl", private_key_path: str = "private_key.pem"):
        self.storage_path = storage_path
        self._dataset_hashes: List[str] = []  # track datasets used
        self.private_key = self._load_private_key(private_key_path)

    def _load_private_key(self, path: str):
        with open(path, "rb") as f:
            return serialization.load_pem_private_key(f.read(), password=None)

    def _get_git_commit(self) -> str:
        try:
            return subprocess.check_output(
                ["git", "rev-parse", "HEAD"], stderr=subprocess.DEVNULL
            ).decode("utf-8").strip()
        except Exception:
            return "N/A"

    def _hash_file(self, file_path: str) -> str:
        h = hashlib.sha256()
        with open(file_path, "rb") as f:
            while chunk := f.read(8192):
                h.update(chunk)
        return h.hexdigest()

    def _sign(self, payload: str) -> str:
        signature = self.private_key.sign(
            payload.encode("utf-8"),
            padding.PSS(
                mgf=padding.MGF1(hashes.SHA256()),
                salt_length=padding.PSS.MAX_LENGTH,
            ),
            hashes.SHA256(),
        )
        return base64.b64encode(signature).decode("utf-8")

    def _log(self, record: Dict[str, Any]):
        record["timestamp"] = datetime.utcnow().isoformat()
        record["schema_version"] = self.SCHEMA_VERSION

        # Serialize signed_data separately (stable JSON encoding)
        signed_data = json.dumps(record, sort_keys=True)
        signature = self._sign(signed_data)

        envelope = {
            "schema_version": self.SCHEMA_VERSION,
            "signed_data": signed_data,
            "signature": signature,
        }

        with open(self.storage_path, "a") as f:
            f.write(json.dumps(envelope) + "\n")

    def log_dataset(self, dataset_path: str, description: str = ""):
        dataset_hash = self._hash_file(dataset_path)
        self._dataset_hashes.append(dataset_hash)
        record = {
            "type": "dataset",
            "path": dataset_path,
            "hash": dataset_hash,
            "description": description,
        }
        self._log(record)
        return dataset_hash

    def log_model(self, model_name: str, hyperparams: Dict[str, Any]):
        record = {
            "type": "model",
            "model_name": model_name,
            "hyperparameters": hyperparams,
            "git_commit": self._get_git_commit(),
            "environment": {
                "python_version": platform.python_version(),
                "platform": platform.system(),
                "hostname": socket.gethostname(),
            },
            "dataset_hashes": self._dataset_hashes,  # link all datasets
        }
        self._log(record)

    def log_inference(self, model_name: str, input_data: Any, output_data: Any):
        record = {
            "type": "inference",
            "id": f"inf-{hashlib.sha1(json.dumps(input_data).encode()).hexdigest()[:12]}",  # deterministic ID
            "model_name": model_name,
            "input": input_data,
            "output": output_data,
        }
        self._log(record)


if __name__ == "__main__":
    tracker = ProvenanceTracker()

    # 1. Log datasets
    ds1 = tracker.log_dataset("data/training.csv", "Customer churn dataset")
    ds2 = tracker.log_dataset("data/validation.csv", "Validation set")

    # 2. Log model (linked to all datasets seen so far)
    tracker.log_model("churn-predictor-v2", {
        "algorithm": "XGBoost",
        "n_estimators": 200,
        "max_depth": 12,
    })

    # 3. Log inference
    tracker.log_inference(
        "churn-predictor-v2",
        {"customer_id": 54321, "features": [0.4, 1.7, 0.2]},
        {"churn_risk": 0.42}
    )

    print("✅ Signed provenance logs recorded in provenance_logs.jsonl")

3. Under the Hood

3.1 Datasets

Datasets are logged with a SHA-256 file hash, ensuring that even if file names change, the integrity check remains stable.

code
ds1 = tracker.log_dataset("data/training.csv", "Customer churn dataset")
ds2 = tracker.log_dataset("data/validation.csv", "Validation set")

Resulting record (inside signed_data):

code
{  "type": "dataset",  "path": "data/training.csv",  "hash": "a41be7b96f...",  "description": "Customer churn dataset",  "timestamp": "2025-08-28T10:12:34.123456",  "schema_version": "1.1"}

3.2 Models

When logging a model, the tracker attaches:

  • Model metadata (name, hyperparameters)

  • Git commit hash (if available)

  • Runtime environment (Python version, OS, hostname)

  • All dataset hashes seen so far

code
tracker.log_model("churn-predictor-v2", {
    "algorithm": "XGBoost",
    "n_estimators": 200,
    "max_depth": 12,
})

This creates a strong lineage link:
Dataset(s) → Model

3.3 Inferences

Every inference is logged with a deterministic ID, computed as a SHA-1 hash of the input payload. This ensures repeat queries generate the same inference ID (helpful for deduplication).

code
tracker.log_inference(
    "churn-predictor-v2",
    {"customer_id": 54321, "features": [0.4, 1.7, 0.2]},
    {"churn_risk": 0.42}
)
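
Because the ID is a pure function of the input payload, it can be recomputed outside the tracker. A minimal sketch mirroring the expression in log_inference:

code
import hashlib
import json

input_data = {"customer_id": 54321, "features": [0.4, 1.7, 0.2]}

# Mirrors the ID expression in ProvenanceTracker.log_inference
inference_id = "inf-" + hashlib.sha1(json.dumps(input_data).encode()).hexdigest()[:12]
print(inference_id)  # inf-0276b2064ad0, matching the inference record in Section 5

One caveat: json.dumps follows dict insertion order here, so the same logical payload built with a different key order yields a different ID; passing sort_keys=True would make the ID fully canonical. SHA-1 is acceptable for a deduplication key like this because it is not used for security.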

Graphically:
Model → Inference

4. Signed Envelopes for Tamper-Proofing

Each record is not stored raw but wrapped in a signed envelope:

code
{  "schema_version": "1.1",  "signed_data": "{\"description\": \"Validation set\", \"hash\": \"c62...\"}",  "signature": "MEUCIQDgtd...xyz..."}

To verify:

  • Load the public key.

  • Verify the signature against the serialized signed_data.

  • If modified, verification fails → tampering detected.

This is the same digital-signature mechanism that PKI systems rely on, and that blockchain protocols build on for immutability.
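
Here is a minimal sketch of that check for a single envelope, using the same PSS parameters as the tracker (the full batch validator appears in Section 5):

code
import base64
import json

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding
from cryptography.hazmat.primitives.serialization import load_pem_public_key

with open("public_key.pem", "rb") as f:
    public_key = load_pem_public_key(f.read())

# Read the first envelope from the log
with open("provenance_logs.jsonl") as f:
    envelope = json.loads(f.readline())

try:
    # Raises InvalidSignature if signed_data was altered after signing
    public_key.verify(
        base64.b64decode(envelope["signature"]),
        envelope["signed_data"].encode("utf-8"),
        padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                    salt_length=padding.PSS.MAX_LENGTH),
        hashes.SHA256(),
    )
    print("✅ signature valid")
except Exception:
    print("❌ tampering detected")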

5. Example End-to-End Run

When running ProvenanceTracker.py:

code
$ python ProvenanceTracker.py
✅ Signed provenance logs recorded in provenance_logs.jsonl

The log file (provenance_logs.jsonl) will contain four signed envelopes: one per dataset, one for the model, and one for the inference.

Following is provenance_logs.jsonl after run:

code
{"schema_version": "1.1", "signed_data": "{\"description\": \"Customer churn dataset\", \"hash\": \"a41be7b96fb85110521bf03d1530879e9ca94b9f5de19866757f6d184300fff7\", \"path\": \"data/training.csv\", \"schema_version\": \"1.1\", \"timestamp\": \"2025-08-28T01:06:31.062695\", \"type\": \"dataset\"}", "signature": "MnCRJ+Acg0F1UledjnMwQMp24wAIPmLPaZonI7hvdFvdi7d8CaZDOIamNq0KnRgcZgttJnI1L675tqT1O1M5N2FRNuy/Wj6elzpyM9w56Kd2mBcQLFumhVHiGZHtwKj2wQtXND0SCqWo5jxQPLPl0dSFClA+FKzpxfazwMtrHAE7aoUmyt2cv1Wiv9uZxsE+Il226J91rBk03lpLcArqqxTtfstkayOK5AON9ETXs65ERf26oURngR/0HS9jnO0IH1DxZOcHcfWZMrLwGqdjRF1sSDYcH70XV61yeYzSeIb8KDODttuxxfzsIlb0897tv/ZZ/X4tv/FFICei7LeAuw=="}{"schema_version": "1.1", "signed_data": "{\"description\": \"Validation set\", \"hash\": \"330af932f2dc1cae917f3bd0fb29395c4021319dd906189b7dc257d0ad58a617\", \"path\": \"data/validation.csv\", \"schema_version\": \"1.1\", \"timestamp\": \"2025-08-28T01:06:31.070827\", \"type\": \"dataset\"}", "signature": "pu8IvzPriN6eP9HTQGlIog8nfXV0FOEw818aw6uJS8oPKiQPjN3odzbP9zaeB+ZW4Nu9bBL5fm1btiiOSm9ziWUJWUzFRoHwlYTv2rgp/IXR0oWfTpXsdVeBj7NYVjUywLPofTeEE1C4J7XzZmusuCU9ZiKJzXU442E6Gsrj6tjRJxZoylONuekxegdTot4LwIcmCRtgigi1t3rQYBGdknmTFdW/I2h1Gguh+Shc/WG/jVuMq10vFNNM8iUJJAxAEktbpfhGw0of6lrZu9yn4wAmxvq0DFICKMEJlsyvEZ/mDaPkR4D55xuJh+dLlFbzNZvyw0woMII0hbIarNmG+w=="}{"schema_version": "1.1", "signed_data": "{\"dataset_hashes\": [\"a41be7b96fb85110521bf03d1530879e9ca94b9f5de19866757f6d184300fff7\", \"330af932f2dc1cae917f3bd0fb29395c4021319dd906189b7dc257d0ad58a617\"], \"environment\": {\"hostname\": \"GlamorPC\", \"platform\": \"Windows\", \"python_version\": \"3.10.11\"}, \"git_commit\": \"N/A\", \"hyperparameters\": {\"algorithm\": \"XGBoost\", \"max_depth\": 12, \"n_estimators\": 200}, \"model_name\": \"churn-predictor-v2\", \"schema_version\": \"1.1\", \"timestamp\": \"2025-08-28T01:06:31.117627\", \"type\": \"model\"}", "signature": "tq/y6Blz04u2iYZh5OqfyZChADA+osNIzwb9Z2g++AZjFu2hkywazf19rbTMsdx9J5s4BDz6rglfcFczRW/TXMECD3k91ZmAds/e0I+Xw42xeTnr7+jHKq5kPdV6Pan8yFVd9ikGso93ZDatX72rx+orIg41BggFN7ifYlKNnGD87zCypahI7Eev0frnD6w8GybmPcBMnCVLYlIo2nWpLgJELkVpwwagQ9rKA+WOlBbLe41ZizooSL/hhGJOXTuwYrkJpBZ69TIwCzihINr+joZBqYrPF+0E+CFohdc03b0SFv1OuNTo7dyqL9qpWdCMSi1iK0LfCukCO41Bvr2yHA=="}{"schema_version": "1.1", "signed_data": "{\"id\": \"inf-0276b2064ad0\", \"input\": {\"customer_id\": 54321, \"features\": [0.4, 1.7, 0.2]}, \"model_name\": \"churn-predictor-v2\", \"output\": {\"churn_risk\": 0.42}, \"schema_version\": \"1.1\", \"timestamp\": \"2025-08-28T01:06:31.118634\", \"type\": \"inference\"}", "signature": "Lf9r1vcXOaCxSc11UKNvuDjx7FophWXBxAobYlixIJgNIk2toFtEdjB2zzJtQI5cYEAImhNHB8hdssKUv3Dths0SpKeMQjpb0x0aKvXolnNsJMnEnGP443IRfMTpkcHpRjCVjIfEvP8EtAh58z4yHE77cy2IlSUFu3exwSEcRFVqBXvIKlojQTEneERUwEDZjfniluomSCLXiVFYMIB+LefPHGkChCVVulmyFJ9ITquD4Wymp2/c2/knopqXSP00EFON4SBOD9/RyQAXAl5UxP0s6faD7NeZxAdJWh3CY31+5V3Vv8b9y/jroAvxWjbpuCZT20gkHemArawDae3s0w=="}

The following is the standalone verification script, validate_logs.py:

code
#!/usr/bin/env python3
"""
Usage:
    python validate_logs.py provenance_logs.jsonl public_key.pem
"""
import json
import base64
import sys

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding
from cryptography.hazmat.primitives.serialization import load_pem_public_key

EXPECTED_SCHEMA = "1.1"


def load_public_key(path: str):
    with open(path, "rb") as f:
        return load_pem_public_key(f.read())


def verify_signature(public_key, signed_data: str, signature_b64: str) -> bool:
    try:
        signature = base64.b64decode(signature_b64)
        public_key.verify(
            signature,
            signed_data.encode("utf-8"),
            padding.PSS(
                mgf=padding.MGF1(hashes.SHA256()),
                salt_length=padding.PSS.MAX_LENGTH,
            ),
            hashes.SHA256(),
        )
        return True
    except Exception:
        return False


def validate_file(jsonl_path: str, pubkey_path: str):
    public_key = load_public_key(pubkey_path)
    valid_count = 0
    failed_count = 0
    schema_mismatch = 0

    with open(jsonl_path, "r") as f:
        for line_no, line in enumerate(f, start=1):
            try:
                envelope = json.loads(line.strip())
            except json.JSONDecodeError:
                print(f"❌ Line {line_no}: invalid JSON")
                failed_count += 1
                continue

            schema = envelope.get("schema_version")
            signed_data = envelope.get("signed_data")
            signature = envelope.get("signature")

            if schema != EXPECTED_SCHEMA:
                print(f"⚠️  Line {line_no}: schema version mismatch ({schema})")
                schema_mismatch += 1
                continue

            if not signed_data or not signature:
                print(f"❌ Line {line_no}: missing signed_data/signature")
                failed_count += 1
                continue

            if verify_signature(public_key, signed_data, signature):
                valid_count += 1
            else:
                print(f"❌ Line {line_no}: signature verification failed")
                failed_count += 1

    print("\n--- Validation Report ---")
    print(f"✅ Valid entries      : {valid_count}")
    print(f"❌ Signature failures : {failed_count}")
    print(f"⚠️  Schema mismatches : {schema_mismatch}")
    print(f"📄 Total lines        : {valid_count + failed_count + schema_mismatch}")


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python validate_logs.py provenance_logs.jsonl public_key.pem")
        sys.exit(1)

    jsonl_file = sys.argv[1]
    pubkey_file = sys.argv[2]
    validate_file(jsonl_file, pubkey_file)

Output (expected for the four-entry log above, given the validator's report format):
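
code
$ python validate_logs.py provenance_logs.jsonl public_key.pem

--- Validation Report ---
✅ Valid entries      : 4
❌ Signature failures : 0
⚠️  Schema mismatches : 0
📄 Total lines        : 4

To see tamper detection in action, change any character inside a signed_data field in provenance_logs.jsonl and re-run the validator; that line will then be reported as a signature verification failure.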

6. Extending This for Real-World AI Workflows

An AI engineer could extend this design in several directions:

  • Neo4j Importer: Build a provenance graph to visualize dataset → model → inference lineage.

  • Metrics integration: Log evaluation metrics (AUC, F1) into the model record (see the sketch after this list).

  • MLOps pipelines: Integrate into training jobs so every experiment auto-generates signed lineage logs.

  • Cloud KMS keys: Replace PEM private key with keys from AWS KMS, GCP KMS, or HashiCorp Vault.

  • Verification service: Deploy a microservice that validates provenance logs on ingestion.
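
As one concrete illustration, metrics integration could be a small subclass that adds an optional metrics argument to log_model. This is a hypothetical sketch: the metrics parameter and its values are illustrative, and the environment block from the original log_model is omitted for brevity.

code
from typing import Any, Dict, Optional

from ProvenanceTracker import ProvenanceTracker

# Hypothetical extension: attach evaluation metrics to the signed model record
class MetricsProvenanceTracker(ProvenanceTracker):
    def log_model(self, model_name: str, hyperparams: Dict[str, Any],
                  metrics: Optional[Dict[str, float]] = None):
        record = {
            "type": "model",
            "model_name": model_name,
            "hyperparameters": hyperparams,
            "metrics": metrics or {},  # e.g. {"auc": 0.91, "f1": 0.78}
            "git_commit": self._get_git_commit(),
            "dataset_hashes": self._dataset_hashes,
        }
        self._log(record)

# Usage (illustrative values):
# tracker = MetricsProvenanceTracker()
# tracker.log_model("churn-predictor-v2",
#                   {"algorithm": "XGBoost", "n_estimators": 200},
#                   metrics={"auc": 0.91, "f1": 0.78})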

7. Why This Matters for You

As AI systems leave the lab and enter regulated domains (finance, healthcare, insurance), being able to say:

  • “This prediction came from Model X at commit Y, trained on Dataset Z, verified cryptographically.”

…will be non-negotiable.

Implementing provenance today sets you ahead of compliance requirements tomorrow.

This ProvenanceTracker is a blueprint for trustworthy AI engineering — versioned, signed, and reproducible lineage for every dataset, model, and inference in your pipeline.

Note: The customer churn dataset can be downloaded from Kaggle; rename the files to match the paths above and place them in the data directory.