Saving and Loading
sieves
provides functionality to save your pipeline configurations to disk and load them later. This is useful for:
- Sharing pipeline configurations with others
- Versioning your pipelines
- Deploying pipelines to production
Basic Pipeline Serialization
Here's a simple example of saving and loading a classification pipeline:
import outlines
from sieves import Pipeline, engines, tasks, Doc
from pathlib import Path
# Create a basic classification pipeline
model_name = "HuggingFaceTB/SmolLM-135M-Instruct"
engine = engines.outlines_.Outlines(model=outlines.models.transformers(model_name))
classifier = tasks.predictive.Classification(
labels=["science", "politics"],
engine=engine
)
pipeline = Pipeline([classifier])
# Save the pipeline configuration
config_path = Path("classification_pipeline.yml")
pipeline.dump(config_path)
# Load the pipeline configuration
loaded_pipeline = Pipeline.load(
config_path,
[{"engine": {"model": outlines.models.transformers(model_name)}}]
)
# Use the loaded pipeline
doc = Doc(text="Special relativity applies to all physical phenomena in the absence of gravity.")
results = list(loaded_pipeline([doc]))
print(results[0].results["Classification"])
Dealing with complex third-party objects
sieves
doesn't serialize complex third-party objects. When loading pipelines, you need to provide initialization parameters for each task when loading:
import chonkie
import tokenizers
import outlines
import pydantic
from sieves import Pipeline, engines, tasks
# Create a tokenizer for chunking
tokenizer = tokenizers.Tokenizer.from_pretrained("bert-base-uncased")
chunker = tasks.preprocessing.Chonkie(
chunker=chonkie.TokenChunker(tokenizer, chunk_size=512, chunk_overlap=50)
)
# Create an information extraction task
engine = engines.outlines_.Outlines(model=outlines.models.transformers("HuggingFaceTB/SmolLM-135M-Instruct"))
class PersonInfo(pydantic.BaseModel):
name: str
age: int | None = None
occupation: str | None = None
extractor = tasks.predictive.InformationExtraction(
entity_type=PersonInfo,
engine=engine
)
# Create and save the pipeline
pipeline = Pipeline([chunker, extractor])
pipeline.dump("extraction_pipeline.yml")
# Load the pipeline with initialization parameters for each task
loaded_pipeline = Pipeline.load(
"extraction_pipeline.yml",
[
# Parameters for the chunker
{"tokenizer": tokenizers.Tokenizer.from_pretrained("bert-base-uncased"),},
# Parameters for the extractor
{"engine": {"model": outlines.models.transformers("HuggingFaceTB/SmolLM-135M-Instruct")}}
]
)
Understanding Pipeline Configuration Files
Pipeline configurations are saved as YAML files. Here's an example of what a configuration file looks like:
cls_name: sieves.pipeline.core.Pipeline
version: 0.7.0
tasks:
is_placeholder: false
value:
- cls_name: sieves.tasks.preprocessing.chunkers.Chunker
tokenizer:
is_placeholder: true
value: tokenizers.Tokenizer
chunk_size:
is_placeholder: false
value: 512
chunk_overlap:
is_placeholder: false
value: 50
task_id:
is_placeholder: false
value: Chunker
- cls_name: sieves.tasks.predictive.information_extraction.core.InformationExtraction
engine:
is_placeholder: false
value:
cls_name: sieves.engines.outlines_.Outlines
model:
is_placeholder: true
value: outlines.models.transformers
The configuration file contains:
- The full class path of the pipeline and its tasks
- Version information
- Task-specific parameters and their values
- Placeholders for components that need to be provided during loading
Info
When loading pipelines, provide all required initialization parameters and ensure you're loading a pipeline with a compatible sieves
version.
Warning
- Model weights are not saved in the configuration files
- Complex third-party objects (everything beyond primitives or collections thereof) may not be serializable
- API keys and credentials must be managed separately