Saving and Loading

sieves provides functionality to save your pipeline configurations to disk and load them later. This is useful for:

  • Sharing pipeline configurations with others
  • Versioning your pipelines
  • Deploying pipelines to production

Basic Pipeline Serialization

Here's a simple example of saving and loading a classification pipeline:

Basic pipeline serialization
import outlines
from transformers import AutoModelForCausalLM, AutoTokenizer
from sieves import Pipeline, tasks, Doc
from pathlib import Path

# Create a basic classification pipeline
model_name = "HuggingFaceTB/SmolLM-135M-Instruct"
model = outlines.models.from_transformers(
    AutoModelForCausalLM.from_pretrained(model_name),
    AutoTokenizer.from_pretrained(model_name)
)
classifier = tasks.predictive.Classification(labels=["science", "politics"], model=model)
pipeline = Pipeline([classifier])

# Save the pipeline configuration
config_path = Path("classification_pipeline.yml")
pipeline.dump(config_path)

# Load the pipeline configuration, providing a fresh model for the placeholder
loaded_model = outlines.models.from_transformers(
    AutoModelForCausalLM.from_pretrained(model_name),
    AutoTokenizer.from_pretrained(model_name),
)
loaded_pipeline = Pipeline.load(config_path, [{"model": loaded_model}])

# Use the loaded pipeline
doc = Doc(text="Special relativity applies to all physical phenomena in the absence of gravity.")
results = list(loaded_pipeline([doc]))
print(results[0].results["Classification"])

Dealing with Complex Third-Party Objects

sieves doesn't serialize complex third-party objects. When loading pipelines, you need to provide initialization parameters for tasks that use non-serializable components.

Set Up Pipeline Components

First, create a pipeline with third-party objects like tokenizers and models:

Setting up components
import chonkie
import tokenizers
import outlines
import pydantic
from transformers import AutoModelForCausalLM, AutoTokenizer
from sieves import Pipeline, tasks

# Create a tokenizer for chunking
tokenizer = tokenizers.Tokenizer.from_pretrained("bert-base-uncased")
chonkie_chunker = chonkie.TokenChunker(tokenizer, chunk_size=512, chunk_overlap=50)
chunker = tasks.Chunking(chunker=chonkie_chunker)

model_name = "HuggingFaceTB/SmolLM-135M-Instruct"
model = outlines.models.from_transformers(
    AutoModelForCausalLM.from_pretrained(model_name),
    AutoTokenizer.from_pretrained(model_name)
)

Define Entity Schema and Task

Define the schema for information extraction and create the task:

Entity and task definition
class PersonInfo(pydantic.BaseModel, frozen=True):
    name: str
    age: int | None = None
    occupation: str | None = None


extractor = tasks.predictive.InformationExtraction(entity_type=PersonInfo, model=model)

Save the Pipeline

Create and save the pipeline configuration:

Saving the pipeline
# Create and save the pipeline
pipeline = chunker + extractor
pipeline.dump("extraction_pipeline.yml")

The YAML file stores the pipeline structure but includes placeholders for non-serializable objects.

Load with Initialization Parameters

When loading, provide the actual objects for each task's placeholders:

Loading with init parameters
# Load the pipeline with initialization parameters for each task
loaded_pipeline = Pipeline.load(
    "extraction_pipeline.yml",
    [
        {"chunker": chonkie_chunker},
        {
            "entity_type": PersonInfo,
            "model": model
        },
    ]
)

The init_params list contains one dict per task, in pipeline order, supplying the values needed to reconstruct each task's non-serializable components.
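
As a quick sanity check, you can run the loaded pipeline end to end. In this sketch the extraction results are assumed to be keyed under "InformationExtraction" (the task's class name), analogous to the "Classification" key in the basic example above:

Running the loaded pipeline
from sieves import Doc

doc = Doc(text="Marie Curie was a physicist and chemist.")
results = list(loaded_pipeline([doc]))
# Results are keyed by task ID (assumed here: the default, i.e. the class name)
print(results[0].results["InformationExtraction"])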

Understanding Pipeline Configuration Files

Pipeline configurations are saved as YAML files. Here's an example of what a configuration file looks like:

cls_name: sieves.pipeline.core.Pipeline
version: 0.11.1
tasks:
  is_placeholder: false
  value:
    - cls_name: sieves.tasks.preprocessing.chunkers.Chunker
      tokenizer:
        is_placeholder: true
        value: tokenizers.Tokenizer
      chunk_size:
        is_placeholder: false
        value: 512
      chunk_overlap:
        is_placeholder: false
        value: 50
      task_id:
        is_placeholder: false
        value: Chunker
    - cls_name: sieves.tasks.predictive.information_extraction.core.InformationExtraction
      model_wrapper:
        is_placeholder: false
        value:
          cls_name: sieves.model_wrappers.outlines_.Outlines
          model:
            is_placeholder: true
            value: outlines.models.transformers

The configuration file contains:

  • The full class path of the pipeline and its tasks
  • Version information
  • Task-specific parameters and their values
  • Placeholders for components that must be provided during loading (see the sketch below for listing these programmatically)
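
Because every serialized attribute carries an is_placeholder flag, you can list the components a configuration expects before calling Pipeline.load. A minimal sketch, assuming only the nested dict/list layout shown above:

Listing placeholders in a config
import yaml

def find_placeholders(node, path=""):
    """Recursively collect dotted paths of entries marked is_placeholder: true."""
    found = []
    if isinstance(node, dict):
        if node.get("is_placeholder") is True:
            found.append(f"{path} ({node.get('value')})")
        else:
            for key, child in node.items():
                found.extend(find_placeholders(child, f"{path}.{key}" if path else key))
    elif isinstance(node, list):
        for i, child in enumerate(node):
            found.extend(find_placeholders(child, f"{path}[{i}]"))
    return found

with open("extraction_pipeline.yml") as f:
    config = yaml.safe_load(f)

# For the config above, prints e.g. "tasks.value[0].tokenizer (tokenizers.Tokenizer)"
for placeholder in find_placeholders(config):
    print(placeholder)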

Info

When loading pipelines, provide all required initialization parameters (e.g. models) and ensure you're loading a pipeline with a compatible sieves version. ModelSettings is optional unless you want to override defaults.

Warning

  • Model weights are not saved in the configuration files
  • Complex third-party objects (everything beyond primitives or collections thereof) are not serializable
  • API keys and credentials must be managed separately (see the sketch below)
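
For the last point, one common pattern is to keep credentials in environment variables and inject them only when components are reconstructed at load time. A hypothetical sketch (build_remote_model and MY_PROVIDER_API_KEY stand in for your own model constructor and variable name):

import os
from sieves import Pipeline

def build_remote_model(api_key: str):
    """Hypothetical stand-in: replace with your provider's model constructor."""
    raise NotImplementedError

api_key = os.environ["MY_PROVIDER_API_KEY"]  # read at load time, never stored in the YAML
model = build_remote_model(api_key=api_key)
loaded = Pipeline.load("pipeline.yml", [{"model": model}])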

Troubleshooting

Common Issues

"Missing required placeholder" error during load

Symptom: KeyError or error about missing placeholders when loading a pipeline.

Cause: You didn't provide initialization parameters for all placeholders in the saved configuration.

Solution: Check the YAML file to see which parameters are marked as placeholders:

# Read the config to see what placeholders exist
import yaml
with open("pipeline.yml", "r") as f:
    config = yaml.safe_load(f)
    print(config)  # Look for "is_placeholder: true" entries

Provide init_params for each task that has placeholders:

loaded_pipeline = Pipeline.load(
    "pipeline.yml",
    [
        {"model": your_model},           # Task 0 placeholders
        {"tokenizer": your_tokenizer},   # Task 1 placeholders
    ]
)

Version compatibility warnings

Symptom: Warning about sieves version mismatch when loading pipelines.

Cause: The pipeline was saved with a different version of sieves than you're currently using.

Impact:

  • Patch version differences (e.g. 0.11.1 vs. 0.11.2): Usually safe
  • Major version differences (e.g. 0.22.0 vs. 1.0.1) and minor version differences with major < 1 (e.g. 0.11.x vs. 0.12.x): May have breaking changes

Solution:

# In the shell: install the sieves version that was used to create the pipeline
pip install sieves==0.11.1  # match the version recorded in the YAML

# Or, in Python: re-save the pipeline with your current version
pipeline.dump("pipeline_updated.yml")
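
To find out which sieves version a configuration was saved with, read the version field from the YAML and compare it against your installed version. A minimal sketch using importlib.metadata:

from importlib.metadata import version
import yaml

with open("pipeline.yml") as f:
    saved_version = yaml.safe_load(f)["version"]  # top-level field in the config

installed_version = version("sieves")
if saved_version != installed_version:
    print(f"Config saved with sieves {saved_version}, "
          f"but {installed_version} is installed.")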

Serialization fails for custom objects

Symptom: Error when calling pipeline.dump() with custom tasks or parameters.

Cause: Custom objects that aren't primitive types (str, int, float, bool, list, dict) can't be automatically serialized.

Solution: Pass such objects as task arguments during pipeline creation; sieves records them as placeholders on dump. Supply the same objects again during load:

# When creating the pipeline
custom_task = MyCustomTask(complex_object=my_object)
pipeline = Pipeline([custom_task])
pipeline.dump("pipeline.yml")  # complex_object becomes a placeholder

# When loading
loaded = Pipeline.load("pipeline.yml", [{"complex_object": my_object}])

Model weights not loading

Symptom: Loaded pipeline doesn't have model weights.

Cause: sieves doesn't save model weights in configuration files (they're too large).

Solution: Always provide fresh model instances in init_params:

# Load the model separately (weights are downloaded/loaded here, not from the YAML)
model = outlines.models.from_transformers(
    AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-135M-Instruct"),
    AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-135M-Instruct"),
)

# Then load the pipeline with the model
loaded = Pipeline.load("pipeline.yml", [{"model": model}])

Task ID mismatches after loading

Symptom: Results are stored under different keys than expected.

Cause: Task IDs changed between save and load.

Solution: Specify explicit task IDs when creating tasks:

# When creating
classifier = tasks.predictive.Classification(
    labels=["science", "politics"],
    model=model,
    task_id="my_classifier"  # Explicit ID
)

# The results will always be in doc.results["my_classifier"]

Best Practices

  1. Version control configurations: Store YAML files in git alongside code
  2. Document init_params: Add comments explaining which objects each placeholder requires
  3. Test load immediately: After saving, try loading to catch serialization issues (see the sketch after this list)
  4. Separate model loading: Keep model initialization code separate from pipeline config
  5. Use version pinning: Pin sieves version in requirements.txt for reproducibility
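
For practice 3, a minimal round-trip check, reusing classifier and model as defined in the basic example at the top of this page:

from pathlib import Path
from sieves import Doc, Pipeline

# Save, immediately reload, and run once to surface serialization issues early
pipeline = Pipeline([classifier])
path = Path("roundtrip_check.yml")
pipeline.dump(path)

reloaded = Pipeline.load(path, [{"model": model}])
doc = Doc(text="Special relativity applies to all physical phenomena in the absence of gravity.")
results = list(reloaded([doc]))
assert "Classification" in results[0].results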