
[IX] - Test run (partial run) #7904

Open
@txau

Description


We need to be able to perform a "test run" of information extraction before processing the whole pipeline. At the moment, we have a single "Find suggestions" button that trains a model from scratch and then processes ALL of the entities.

When there are too many entities, this process consumes too many resources and takes a long time.

Users could greatly benefit from a "test run": a subset of at most (e.g.) 1,000 entities is sent for training, and the trained model is then used to process (e.g.) 1,000 additional entities. This way users can check the results, add labeled data as needed, and refine the model before processing the whole database.


Problem statement

Currently, the information extraction pipeline processes ALL entities when a user clicks the "Find suggestions" button. This approach:

  • Trains a model from scratch for each run
  • Processes the entire dataset of entities at once
  • Consumes excessive computational resources
  • Takes a long time to complete when there are many entities
  • Doesn't allow users to validate or refine the model before committing to a full run

Proposed solution

Implement a "Test Run" feature that allows users to:

  • Train the model on a limited subset of entities
  • Test the trained model on another small subset of entities
  • Review results and refine the model (by adding labeled data) before processing the entire dataset

Acceptance criteria

  • Add a new "Test Run" button to the UI alongside the existing "Find suggestions" button (the final button names may be adjusted)
  • When "Test Run" is clicked, the system should:
    • Select a maximum of 2,000 entities for training the model (as today)
    • Train the model using only these entities
    • Process an additional 1,000 entities using the trained model (these should not be the ones used for training)
    • Display the results to the user (test-run results should appear first in the list)
  • Users should be able to review the test results and add labeled data as needed
  • After reviewing, users should have the option to:
    • Run another test run
    • Proceed with processing the entire dataset
  • The UI should clearly indicate when a test run is in progress vs. a full processing run
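The selection step in the criteria above could be sketched as follows. This is only an illustrative sketch, not existing Uwazi code: the function name `selectTestRunSets` and the assumption that entities are identified by string IDs are hypothetical; the size limits mirror the acceptance criteria (2,000 for training, 1,000 for testing).

```typescript
// Illustrative sketch: split entity IDs into disjoint training and test
// sets for a test run. Limits taken from the acceptance criteria.
const TRAINING_LIMIT = 2000;
const TEST_LIMIT = 1000;

function selectTestRunSets(entityIds: string[]): {
  training: string[];
  test: string[];
} {
  // Take up to 2,000 entities to train the model...
  const training = entityIds.slice(0, TRAINING_LIMIT);
  // ...and up to 1,000 DIFFERENT entities to process with the trained
  // model, so the test set never overlaps the training set.
  const test = entityIds.slice(TRAINING_LIMIT, TRAINING_LIMIT + TEST_LIMIT);
  return { training, test };
}
```

In a real implementation the training slice would be restricted to entities that already have labeled data, and the split would likely be randomized rather than positional (see the sampling note under general considerations).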

General considerations

  • The current pipeline architecture needs to be modified to support partial processing
  • Need to implement logic for selecting representative subsets of entities for training and testing
  • Question: at how many entities does this feature provide substantial value and differ meaningfully from the current default "Find suggestions" action?
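One simple baseline for the "representative subsets" point above is uniform random sampling without replacement. A sketch, assuming nothing about the existing pipeline (the function name is hypothetical); stratified sampling, e.g. by template, could be layered on top:

```typescript
// Illustrative: pick a uniform random sample of `n` items without
// replacement using a partial Fisher-Yates shuffle. Uniform sampling
// is a simple baseline for "representative" subsets.
function sampleWithoutReplacement<T>(items: T[], n: number): T[] {
  const pool = items.slice(); // copy so the input is not mutated
  const size = Math.min(n, pool.length);
  for (let i = 0; i < size; i++) {
    // Choose a random index in [i, pool.length) and swap it into place.
    const j = i + Math.floor(Math.random() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, size);
}
```

Sampling training and test entities from a shared shuffled pool would also keep the two sets disjoint by construction.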

Error states and messages

TBD

UI designs

To be added by @juanmnl
