Fasttransform errors related to TTA? #4

Open · hack-r opened this issue Mar 19, 2025 · 9 comments

hack-r commented Mar 19, 2025

TTA predictions apparently required a Python upgrade to >=3.10. Since that upgrade I've been getting errors pertaining to the fastcore-to-fasttransform migration. I do not import fastcore or fasttransform directly.

I've described the issue on Stack Overflow as follows:

Test Time Augmentation (TTA) in FastAI should be easy to apply with learn.tta, yet it has led to numerous issues in my Cloud Run deployment. I have a working Cloud Run deployment that serves base learner and metalearner scoring as a prediction endpoint using load_learner from FastAI.

I want to switch learn.predict to learn.tta, but issues keep arising. learn.tta requires a slightly different input shape and returns values in a different shape, so I wanted a direct drop-in replacement for learn.predict. This function accomplished that in a minimal test notebook on Colab:

import random
from fastai.vision.all import *

# Function to perform TTA and format the output to match predict
def tta_predict(learner, img):
    # Create a DataLoader for the single image using the test DataLoader
    test_dl = learner.dls.test_dl([img])
    
    # Perform TTA on the single image using the test DataLoader
    preds, _ = learner.tta(dl=test_dl)
    
    # Get the average probabilities
    avg_probs = preds.mean(dim=0)
    
    # Get the predicted class index
    pred_idx = avg_probs.argmax().item()
    
    # Get the class label
    class_label = learner.dls.vocab[pred_idx]
    
    # Format the output to match the structure of the predict method
    return (class_label, pred_idx, avg_probs)

# Use the tta_predict function
prediction = tta_predict(learn, grayscale_img)

# Print the results
print(type(prediction))  # Print the type of the prediction object
print(prediction)  # Print the prediction itself (class label, index, probabilities)
print(prediction[0])  # Print the predicted class label
print(prediction[2])  # Print the average probabilities

Although it seemed to work fine in the notebook, when I add it to the top of my production script and switch learn.predict to tta_predict(learn, img) for my base learners, the entire image fails to build with Python 3.9:

Traceback (most recent call last):
  File "/app/main.py", line 11, in <module>
    from fastai.vision.all import PILImage, BCEWithLogitsLossFlat, load_learner
  File "/usr/local/lib/python3.9/site-packages/fastai/vision/all.py", line 4, in <module>
    from .augment import *
  File "/usr/local/lib/python3.9/site-packages/fastai/vision/augment.py", line 8, in <module>
    from .core import *
  File "/usr/local/lib/python3.9/site-packages/fastai/vision/core.py", line 259, in <module>
    class PointScaler(Transform):
  File "/usr/local/lib/python3.9/site-packages/fasttransform/transform.py", line 75, in __new__
    if funcs: setattr(new_cls, nm, _merge_funcs(*funcs))
  File "/usr/local/lib/python3.9/site-packages/fasttransform/transform.py", line 42, in _merge_funcs
    res = Function(fs[-1].methods[0].implementation)
  File "/usr/local/lib/python3.9/site-packages/plum/function.py", line 181, in methods
    self._resolve_pending_registrations()
  File "/usr/local/lib/python3.9/site-packages/plum/function.py", line 280, in _resolve_pending_registrations
    signature = Signature.from_callable(f, precedence=precedence)
  File "/usr/local/lib/python3.9/site-packages/plum/signature.py", line 88, in from_callable
    types, varargs = _extract_signature(f)
  File "/usr/local/lib/python3.9/site-packages/plum/signature.py", line 346, in _extract_signature
    resolve_pep563(f)
  File "/usr/local/lib/python3.9/site-packages/plum/signature.py", line 329, in resolve_pep563
    beartype_resolve_pep563(f)  # This mutates `f`.
  File "/usr/local/lib/python3.9/site-packages/beartype/peps/_pep563.py", line 263, in resolve_pep563
    arg_name_to_hint[arg_name] = resolve_hint(
  File "/usr/local/lib/python3.9/site-packages/beartype/_check/forward/fwdmain.py", line 308, in resolve_hint
    return _resolve_func_scope_forward_hint(
  File "/usr/local/lib/python3.9/site-packages/beartype/_check/forward/fwdmain.py", line 855, in _resolve_func_scope_forward_hint
    raise exception_cls(exception_message) from exception
beartype.roar.BeartypeDecorHintPep604Exception: Stringified PEP 604 type hint 'PILBase | TensorImageBase' syntactically invalid under Python < 3.10 (i.e., TypeError("unsupported operand type(s) for |: 'BypassNewMeta' and 'torch._C._TensorMeta'")). Consider either:
        * Requiring Python >= 3.10. Abandon Python < 3.10 all ye who code here.
        * Refactoring PEP 604 type hints into equivalent PEP 484 type hints: e.g.,
        # Instead of this...
        from __future__ import annotations
        def bad_func() -> int | str: ...
        # Do this. Ugly, yet it works. Worky >>>> pretty.
        from typing import Union

I don't see anything in my code that could've caused that, yet there it is. I noticed somewhere in those messages it mentions "augment", which I take as confirmation that TTA is at fault (it was also the only thing that changed). So, I tried switching the Python version to 3.10. Now it builds but it's clearly broken:

ERROR loading model.pkl: Could not import 'Pipeline' from fastcore.transform - this module has been moved to the fasttransform package.
To migrate your code, please see the migration guide at: https://answerdotai.github.io/fasttransform/fastcore_migration_guide.html

The migration guide it mentions says to change

from fastcore.transform import Transform, Pipeline

to

from fasttransform import Transform, Pipeline

but my code never directly imports Pipeline or Transform, nor does it directly import fastcore.
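
For reference, the traceback shows the chain that pulls those imports in indirectly, even though my code never touches them:

from fastai.vision.all import PILImage, BCEWithLogitsLossFlat, load_learner   # my main.py
# -> fastai/vision/all.py:     from .augment import *
# -> fastai/vision/augment.py: from .core import *
# -> fastai/vision/core.py:    class PointScaler(Transform)   # Transform now comes from fasttransform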

hack-r commented Mar 19, 2025

I've fully removed TTA from the script, but somehow the error persists after the Python upgrade. This is extra confusing because the errors mention augment.py, which makes it still sound TTA-related.

@RensDimmendaal (Contributor)

Thanks for raising this issue.

Regarding your first error with python=3.9:
I've been able to replicate it locally and I'm looking into how we can fix this.

Regarding the second error with python=3.10:
I have not yet been able to reproduce this error. If possible, could you share a minimum reproducible example for this error?

I do have one educated guess based on this error: ERROR loading model.pkl: Could not import 'Pipeline' from fastcore.transform. Did you save the pipeline under the old version of fastai, and are you now trying to load it with the new version?

Could you perhaps try saving the pipeline with Python 3.10 and the latest version of fastai and see if that addresses the issue?
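
Concretely, a minimal sketch of what I mean (assuming you still have the training environment and learn is your trained Learner):

# In an environment running Python 3.10 with the latest fastai/fasttransform,
# re-export the learner so the pickle records the new module paths:
learn.export('model.pkl')

load_learner should then resolve everything against fasttransform rather than fastcore.transform.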

hack-r commented Mar 20, 2025 via email

hack-r commented Mar 21, 2025

Apologies for the delay. This is a sticky, ongoing problem. Here's the scoring code. It glitches on the first load_learner statement with the aforementioned errors about Pipeline being moved from fastcore to fasttransform.

#!/usr/bin/env python
import os
import io
import json
import numpy as np
import pandas as pd
import torch
import traceback
import requests

from fastai.vision.all import PILImage, BCEWithLogitsLossFlat, load_learner
from fastai.tabular.all import *
from fastai.tabular.learner import TabularLearner
from fastai.torch_core import to_float, to_device, tensor, Tensor, defaults

from PIL import Image
from flask import Flask, request
from flask_cors import CORS

app = Flask(__name__)
CORS(app)

# Define the target class labels (in the proper order) for meta learner prediction.
TARGET_CLASS_LABELS = [
    'Class1', 
    'Class2', 
    'Class3',
    'Class4', 
    'Class5', 
    'Class6', 
    'Class7', 
    'Class8', 
    'Class9'
]

def get_learner_vocab(learn):
    try:
        vocab = learn.dls.vocab
        print(f"DEBUG: Retrieved vocab for learner {learn}: {vocab}")
        return vocab
    except Exception as ex:
        print(f"DEBUG: Could not retrieve vocab for learner {learn}: {ex}")
        return None

# No longer in use, but kept in case I re-add a multilabel model
def multilabel_get_y(o):
    return [o['label1'], o['label2'], o['label3']]

def find_file(filename, search_path):
    for root, dirs, files in os.walk(search_path):
        if filename in files:
            return os.path.join(root, filename)
    return None
    
# Warm-load base learners using load_learner.
BASE_LEARNER_FILES = [
    "learner1.pkl",
    "learner2.pkl",
    "learner3.pkl",
    "learner4.pkl",
    "learner5.pkl"
]
print(BASE_LEARNER_FILES)

BASE_LEARNERS = {}
for fname in BASE_LEARNER_FILES:
    try:
        key = os.path.splitext(os.path.basename(fname))[0]
        BASE_LEARNERS[key] = load_learner(fname)
        print(f"DEBUG: Successfully loaded learner '{key}' from file: {fname}")
    except Exception as e:
        print(f"ERROR loading {fname}: {e}")

print(f"DEBUG: Finished loading base learners. Keys: {list(BASE_LEARNERS.keys())}")

# Warm-load the meta learner using load_learner.
META_LEARNER_FILE = "metalearner.pkl"
meta_learner = None

print("Starting metalearner load...")
try:
    if not os.path.isfile(META_LEARNER_FILE):
        print(f"DEBUG: '{META_LEARNER_FILE}' not found in the current directory. Searching in subfolders...")
        found_file = find_file(META_LEARNER_FILE, os.getcwd())
        if found_file:
            print(f"DEBUG: Found meta learner file at: {found_file}")
            meta_learner = load_learner(found_file, cpu=True)
        else:
            raise FileNotFoundError(f"Meta learner file '{META_LEARNER_FILE}' does not exist in the current directory or its subfolders.")
    else:
        print("Metalearner pkl found.")
        meta_learner = load_learner(META_LEARNER_FILE, cpu=True)
        print(f"File info print: {meta_learner}")
        print(f"DEBUG: Successfully loaded meta learner from file: {META_LEARNER_FILE}")
except Exception as e:
    meta_learner = None
    print(f"ERROR loading meta-learner: {e}")
    print("DEBUG: Full traceback:")
    print(traceback.format_exc())  

def score_single_image_from_bytes(image_bytes, learners):
    print("DEBUG: Entered score_single_image_from_bytes()")
    print(f"DEBUG: Received image bytes of length: {len(image_bytes)}")
    results = {}
    try:
        img = Image.open(io.BytesIO(image_bytes)).convert("RGB")
        print(f"DEBUG: Loaded image successfully (size={img.size}, mode={img.mode})")
        img_fastai = PILImage.create(img)
    except Exception as e:
        print(f"DEBUG: Exception while processing image: {e}")
        raise Exception("Invalid image input")

    for name, learn in learners.items():
        print(f"DEBUG: Processing learner: {name}")
        try:
            pred, _, probs = learn.predict(img_fastai)
            print(f"pred: {pred}")
            print(f"DEBUG: Learner '{name}' prediction succeeded. Returned -> pred: {pred}, probs: {probs}")
        except Exception as e:
            print(f"DEBUG: Exception during prediction with learner '{name}': {e}")
            if hasattr(learn, "loss_func") and isinstance(learn.loss_func, BCEWithLogitsLossFlat):
                vocab = get_learner_vocab(learn)
                if vocab is None:
                    vocab = list(range(10))
                    print(f"DEBUG: Fallback vocab for learner '{name}': {vocab}")
                for lab in vocab:
                    results[f"{name}_{lab}"] = float("nan")
            continue
        else:
            # For non-BCE learners, extract the probability values.
            prob_vals = probs.detach().numpy()
            for j, p in enumerate(prob_vals):
                key = f"{name}_prob_{j}"
                results[key] = float(p)
                print(f"DEBUG: Learner '{name}', prob index {j}: {results[key]}")
    df = pd.DataFrame([results])
    print(f"DEBUG: Returning DataFrame of predictions:\n{df}")
    return df

# Reorder and impute the features to match what the meta-learner expects.
def predict_meta(df):
    print("DEBUG: Entered predict_meta()")
    print(f"DEBUG: DataFrame received for meta prediction:\n{df}")
    try:
        feat_cols = meta_learner.dls.cont_names
        print(f"DEBUG: Expected features from meta learner: {feat_cols}")
        
        missing_cols = []
        for col in feat_cols:
            if col not in df.columns:
                df[col] = np.nan
                missing_cols.append(col)
                print(f"DEBUG: Column '{col}' missing in base predictions. Adding as NaN.")
        
        if missing_cols:
            print(f"DEBUG: The following columns were missing and set to NaN: {missing_cols}")
        
        df = df[feat_cols].fillna(0.5)
        row = df.iloc[0]
        print(f"DEBUG: Row for meta prediction (after imputation): {row}")

        pred_class, pred_idx, probs = meta_learner.predict(row)
        pred_class = str(pred_class)
        print(f"DEBUG: Meta learner raw prediction -> Class: {pred_class}, Index: {pred_idx}, Probabilities (raw): {probs}")

        pred_probs_tensor = torch.softmax(probs, dim=0)
        pred_probs_list = pred_probs_tensor.tolist()
        formatted_probs = {label: float("{:.6f}".format(prob)) for label, prob in zip(TARGET_CLASS_LABELS, pred_probs_list)}
        print("DEBUG: Meta learner formatted predicted probabilities (JSON):")
        print(json.dumps(formatted_probs, indent=2))

        return pred_class, formatted_probs
    except Exception as e:
        print(f"DEBUG: Exception in meta learner prediction: {e}")
        raise Exception(f"Meta-learner prediction error: {e}")

@app.route('/', methods=['POST'])
def entrypoint():
    print("DEBUG: Received a request at entrypoint")
    if meta_learner is None:
        print("DEBUG: Meta learner is not loaded!")
        return ("Meta learner not loaded", 500)

    try:
        secret_token = request.headers.get("X-Secret-Token")
        if secret_token != "Password123":
            print("DEBUG: Invalid secret token provided.")
            return (json.dumps({"error": "Unauthorized"}), 401, {"Content-Type": "application/json"})

        if "image_url" in request.form:
            image_url = request.form["image_url"]
            print(f"DEBUG: Received image URL: {image_url}")
            try:
                response = requests.get(image_url)
                response.raise_for_status()
                image_bytes = response.content
                print(f"DEBUG: Successfully fetched image from URL (size: {len(image_bytes)} bytes)")
            except Exception as e:
                print(f"DEBUG: Exception while fetching image from URL: {e}")
                return (json.dumps({"error": "Invalid image input"}), 400, {"Content-Type": "application/json"})
        else:
            print("DEBUG: No image_url provided in the request.")
            return (json.dumps({"error": "No image URL provided"}), 400, {"Content-Type": "application/json"})
        
        # Score the image with base learners.
        base_df = score_single_image_from_bytes(image_bytes, BASE_LEARNERS)
        print(f"DEBUG: Base learners predictions DataFrame:\n{base_df}")
        
        pred_class, meta_probs = predict_meta(base_df)
        
        # I'm not doing TTA in this version of the script, but the problem STILL
        # happens. (The learner is only called tta_base_learner because I had
        # aspirations of using TTA with it.)
        tta_learner = BASE_LEARNERS.get("tta_base_learner")
        if tta_learner is None:
            print("DEBUG: tta_base_learner not found among base learners.")
            binary_prediction = None
        else:
            # Load the image as a fastai PILImage as before.
            try:
                img = Image.open(io.BytesIO(image_bytes)).convert("RGB")
                img_fastai = PILImage.create(img)
            except Exception as e:
                print(f"DEBUG: Exception while processing image for tta_base_learner: {e}")
                return (json.dumps({"error": "Invalid image input for tta"}), 400, {"Content-Type": "application/json"})
            try:
                pred, _, probs = tta_learner.predict(img_fastai)
                probs_np = probs.detach().numpy()
                # Compute binary prediction: 1 if probability (index 1) > 0.5; otherwise 0.
                binary_prediction = int(probs_np[1] > 0.5)
                print(f"DEBUG: tta_base_learner binary prediction: {binary_prediction}, probabilities: {probs_np}")
            except Exception as e:
                print(f"DEBUG: Exception during tta_base_learner prediction: {e}")
                binary_prediction = None
        
        rtn = {
            "binary_predictions": binary_prediction,
            "junk_predictions": {  # Recompute junk prediction if appropriate.
                "ham": float("{:.4f}".format(1 - meta_probs.get('BadImage', 0))),
                "spam": float("{:.4f}".format(meta_probs.get('BadImage', 0)))
            },
            "main_predictions": meta_probs
        }
        print(f"DEBUG: Final response: {rtn}")
        return (json.dumps(rtn), 200, {"Content-Type": "application/json"})
    except Exception as e:
        print(f"DEBUG: Exception in entrypoint: {e}")
        err = {"error": str(e)}
        return (json.dumps(err), 500, {"Content-Type": "application/json"})

if __name__ == '__main__':
    port = int(os.environ.get("PORT", 8080))
    print(f"DEBUG: Starting Flask server on host 0.0.0.0 and port {port}")
    app.run(host='0.0.0.0', port=port)

@RensDimmendaal (Contributor)

> It glitches on the first load_learner statement with the aforementioned errors about Pipeline being moved from fastcore to fasttransform.

I believe that happens under Python 3.9, right? That's the issue I was already able to reproduce.

Apologies if I didn't phrase my question clearly. What I meant was that it would be great if you could share a minimum reproducible example that triggers the second error you shared: ERROR loading model.pkl: Could not import 'Pipeline' from fastcore.transform. i.e. some code that: 1. saves a simple model to disk, 2. loads that model from disk, 3. raises the error you shared.
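
Something along these lines would be ideal (a hypothetical skeleton; any small model will do):

# Environment A: the fastai/fastcore versions the model was trained with
learn.export('model.pkl')                    # 1. save a simple model to disk

# Environment B: current fastai (fasttransform-based)
from fastai.vision.all import load_learner
learn = load_learner('model.pkl')            # 2. load it -> 3. should raise the Pipeline error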

hack-r commented Mar 21, 2025

Yes, understood. Sorry for the long code. In theory, that should actually reproduce both errors, since you can run it with any .pkl vision model. However, when I try to run any version of it in Colab, it doesn't reproduce the error.

In case that's not weird enough, I can additionally report that:

  • The same exact code ran perfectly fine for 10 days. The problems only started when I introduced TTA.
  • The pkl is not at fault; at least a test in colab didn't have any trouble with any of the pkl files.
  • Somehow, redeploying the exact same code that used to run perfectly fine still leads to this issue on Cloud Run.

I will continue trying to isolate the source of the error. Could you let me know if you recognize the error message (the one saying that Pipeline has moved)? I'm thinking maybe I should just set all of the library versions to the last version prior to the introduction of that change...

hack-r commented Mar 22, 2025

I think we're getting closer. I can confirm that:

  1. Using fastcore 1.7.29 instead of the current version (1.8) fully resolves all errors.
  2. The script now works on any Python version.

So, that must be the reason the errors inexplicably persisted across builds even without caching. It was always pulling the newest fastcore. The only things remaining are:

  1. Understanding how FastAI was intended to integrate the change and why that isn't happening.
  2. Understanding why I wasn't able to reproduce the error in Colab.

I'll be working on those two items presently.
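
In the meantime, the workaround amounts to pinning fastcore in the image build, e.g.:

pip install fastcore==1.7.29   # the pre-1.8 version that works, per the above

(The fastai version that was already working stays as-is.)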

hack-r commented Mar 22, 2025

At long last, here's the reproducible example. You just have to make sure the library versions are what you think they are. In Colab, the session has to be restarted to get the latest version of FastAI.

https://colab.research.google.com/drive/1soaCB5Vnz2mdDpaXM3JETfMZR9CPS2Up?usp=sharing

The specific .pkl shouldn't matter. You can even just pkl a dummy file to reproduce the error.
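
For example, something like this exported under the old versions should be enough to trigger it (a hypothetical sketch; I used my real model, but any export made under fastai<2.8 ought to reproduce it):

from fastai.vision.all import *

# Train-and-export a throwaway model under fastai<2.8 just to produce a .pkl
path = untar_data(URLs.MNIST_TINY)
dls = ImageDataLoaders.from_folder(path)
learn = vision_learner(dls, resnet18)
learn.export('model.pkl')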

# -*- coding: utf-8 -*-
"""Lin Tech - Library Debugging - Answer.AI and Fastcore Fasttransform Error Replication.ipynb

Automatically generated by Colab.

Original file is located at
    https://colab.research.google.com/drive/1soaCB5Vnz2mdDpaXM3JETfMZR9CPS2Up
"""

from google.colab import userdata
from google.cloud import storage
import json

key      = userdata.get('GCP_KEY')
key_data = json.loads(key)
client   = storage.Client.from_service_account_info(key_data)

# Define the bucket and blob
bucket_name = "hardware-march-models-prod"
blob_name   = "tta_base_learner.pkl"
destination_file_name = "/content/model.pkl"

# Download the blob to a file
bucket = client.bucket(bucket_name)
blob   = bucket.blob(blob_name)
blob.download_to_filename(destination_file_name)

print(f"Downloaded storage object {blob_name} from bucket {bucket_name} to local file {destination_file_name}.")

!pip install -Uqq fastai==2.8.0
!pip install -Uqq fastcore==1.8.0

import fastai, fastcore  # these imports are needed before printing the versions
print(fastai.__version__)
print(fastcore.__version__)

from fastai import *
from fastai.vision.all import load_learner, PILImage

load_learner("/content/model.pkl")

!pip install fastai==2.8.0
import fastai
print(fastai.__version__)

"""As you can see, part of the problem is that it's really running v2.7.29 when you think it's running 2.8 in Colab. That's why I previously failed to reproduce the error in Colab.

For some reason, it previously didn't advise restarting the runtime the way it does now.
"""

# After restarting:
import fastai
print(fastai.__version__)

from google.colab import userdata
from google.cloud import storage
import json

key      = userdata.get('GCP_KEY')
key_data = json.loads(key)
client   = storage.Client.from_service_account_info(key_data)

# Define the bucket and blob
bucket_name = "hardware-march-models-prod"
blob_name   = "tta_base_learner.pkl"
destination_file_name = "/content/model.pkl"

# Download the blob to a file
bucket = client.bucket(bucket_name)
blob   = bucket.blob(blob_name)
blob.download_to_filename(destination_file_name)

print(f"Downloaded storage object {blob_name} from bucket {bucket_name} to local file {destination_file_name}.")

print(fastai.__version__)

from fastai import *
from fastai.vision.all import load_learner, PILImage

load_learner("/content/model.pkl")

So, given that the error is caused by the from .core import * in fastai's vision augment.py, changing it on the FastAI side would likely be best, but I assume it could also be resolved in fastcore by not raising an error. If it would be useful for me to submit a PR patch, I'd be happy to, but I take it this is probably a quick fix that doesn't really require it?

I wonder if it's not just augment, but perhaps anything that does a star import from fastcore could cause this?

@RensDimmendaal (Contributor)

Thanks for providing all this info. We've pulled the release of 2.8.0 and are working on a fix. You can see the details in the PR here: fastai/fastai#4083

> Could you let me know if you recognize the error message (the one saying that Pipeline has moved)? I'm thinking maybe I should just set all of the library versions to the last version prior to the introduction of that change...

Yeah, I recognize it. It shows that the model.pkl used fastcore.transform.Pipeline when it was saved, but that class is no longer available when the model is loaded (fastai>=2.8.0). I've also been able to reproduce the error that way myself (see the code snippet in the PR). We've now added a more helpful error message that advises people to downgrade when trying to load models exported under fastai<2.8.0.
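
For context, a simplified sketch of why loading fails: pickle resolves each class by the module path recorded at save time, roughly like this (illustrative only):

import importlib

# A learner exported under fastai<2.8 recorded 'fastcore.transform.Pipeline'.
# Under fastcore>=1.8 this lookup raises the moved-module error shown above.
mod = importlib.import_module('fastcore.transform')
Pipeline = getattr(mod, 'Pipeline')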
