Marker
Note: This task depends on optional ingestion libraries that are not installed by default. You can install them via the ingestion extra, or install the library directly.
Examples:
pip install "sieves[ingestion]" # installs ingestion deps via extra
# or install the library directly (e.g., the Marker PDF package)
pip install marker-pdf # the Marker PDF package is published on PyPI as marker-pdf
Marker task for converting PDF documents to text.
Marker
Bases: Task
Source code in sieves/tasks/preprocessing/ingestion/marker_.py
id
property
Return task ID.
Used by pipeline for results and dependency management.
Returns:
| Type | Description |
|---|---|
| str | Task ID. |
__add__(other)
Chain this task with another task or pipeline using the + operator.
This returns a new Pipeline that executes this task first, followed by the
task(s) in other. The original task(s)/pipeline are not mutated.
Cache semantics:
- If other is a Pipeline, the resulting pipeline adopts other's
use_cache setting (because the left-hand side is a single task).
- If other is a Task, the resulting pipeline defaults to use_cache=True.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other | Task \| Pipeline | A Task or Pipeline to execute after this task. | required |
Returns:
| Type | Description |
|---|---|
| Pipeline | A new Pipeline that runs this task first, followed by the task(s) in other. |
Raises:
| Type | Description |
|---|---|
| TypeError | If other is not a Task or Pipeline. |
Source code in sieves/tasks/core.py
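The chaining and cache rules described for `+` can be illustrated with a small stand-alone sketch. `ToyTask` and `ToyPipeline` are hypothetical stand-ins written for this example, not the sieves classes, and only the behaviour stated above is modelled. Raising `TypeError` for any other operand is an assumption based on the Raises table.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ToyPipeline:
    """Hypothetical stand-in for a pipeline: ordered tasks plus a cache flag."""

    tasks: tuple[ToyTask, ...]
    use_cache: bool = True


@dataclass(frozen=True)
class ToyTask:
    """Hypothetical stand-in for a task; only `+` chaining is modelled."""

    task_id: str

    def __add__(self, other: ToyTask | ToyPipeline) -> ToyPipeline:
        if isinstance(other, ToyPipeline):
            # The left-hand side is a single task, so the right-hand
            # pipeline's use_cache setting is adopted.
            return ToyPipeline((self, *other.tasks), use_cache=other.use_cache)
        if isinstance(other, ToyTask):
            # Task + Task: the new pipeline defaults to use_cache=True.
            return ToyPipeline((self, other), use_cache=True)
        # Assumption: any other operand type is rejected.
        raise TypeError(f"Cannot chain ToyTask with {type(other).__name__}")


ingest, chunk = ToyTask("ingest"), ToyTask("chunk")
uncached = ToyPipeline((chunk,), use_cache=False)

task_plus_pipeline = ingest + uncached  # adopts use_cache=False from `uncached`
task_plus_task = ingest + chunk         # defaults to use_cache=True
```

Both operands are left unchanged: `uncached` still holds only `chunk` after the addition, which mirrors the statement that the original task(s)/pipeline are not mutated.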
__call__(docs)
Execute task with conditional logic.
Checks the condition for each document lazily, without materializing all docs upfront. Documents that satisfy the condition are passed together to _call() so that batching still works. Documents that fail the condition have results[task_id] set to None.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| docs | Iterable[Doc] | Docs to process. | required |
Returns:
| Type | Description |
|---|---|
| Iterable[Doc] | Processed docs (in original order). |
Source code in sieves/tasks/core.py
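One way to obtain the behaviour described for `__call__` (lazy condition checks, a single stream of passing documents handed to the processing step, skipped documents marked with `None`, original order preserved) is sketched below. This is not the sieves implementation. `ToyDoc`, `run_conditionally`, and `fake_convert` are hypothetical names, and the sketch assumes the processing step consumes its whole input and yields one document per input document in input order.

```python
from __future__ import annotations

from collections import deque
from collections.abc import Callable, Iterable, Iterator
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a document with a per-task results dict."""

    name: str
    results: dict[str, Any] = field(default_factory=dict)


def run_conditionally(
    docs: Iterable[ToyDoc],
    task_id: str,
    condition: Callable[[ToyDoc], bool] | None,
    process: Callable[[Iterable[ToyDoc]], Iterable[ToyDoc]],
) -> Iterator[ToyDoc]:
    passed_positions: deque[int] = deque()  # input positions of docs sent to `process`
    skipped: dict[int, ToyDoc] = {}         # docs that failed the condition, by position

    def passing() -> Iterator[ToyDoc]:
        # Pulled lazily by `process`, so the input is never materialized upfront.
        for position, doc in enumerate(docs):
            if condition is None or condition(doc):
                passed_positions.append(position)
                yield doc
            else:
                doc.results[task_id] = None
                skipped[position] = doc

    next_position = 0
    for processed in process(passing()):
        position = passed_positions.popleft()
        # Emit skipped docs that precede this one to keep the original order.
        while next_position < position:
            yield skipped.pop(next_position)
            next_position += 1
        yield processed
        next_position += 1
    # Emit skipped docs that came after the last processed one.
    while next_position in skipped:
        yield skipped.pop(next_position)
        next_position += 1


def fake_convert(docs: Iterable[ToyDoc]) -> Iterator[ToyDoc]:
    """Placeholder for the real conversion step."""
    for doc in docs:
        doc.results["marker"] = f"converted:{doc.name}"
        yield doc


inputs = iter([ToyDoc("a.pdf"), ToyDoc("notes.txt"), ToyDoc("b.pdf")])
outputs = list(
    run_conditionally(inputs, "marker", lambda d: d.name.endswith(".pdf"), fake_convert)
)
```

In this run the `.txt` document is returned in its original position with `results["marker"]` set to `None`, while the two PDF documents carry the placeholder conversion result.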
__init__(converter=None, export_format='markdown', task_id=None, include_meta=False, batch_size=-1, extract_images=False, condition=None)
Initialize the Marker task.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| converter | Converter \| None | Custom PdfConverter or TableConverter instance. If None, a default one will be created. | None |
| export_format | str | Format to export the document in ("markdown", "html", or "json"). | 'markdown' |
| task_id | str \| None | Task ID. | None |
| include_meta | bool | Whether to include meta information generated by the task. | False |
| batch_size | int | Batch size to use for processing. Use -1 to process all documents at once. | -1 |
| extract_images | bool | Whether to extract images from the PDF. | False |
| condition | Callable[[Doc], bool] \| None | Optional callable that determines whether to process each document. | None |
Source code in sieves/tasks/preprocessing/ingestion/marker_.py
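The `batch_size` convention from the parameter table (a positive size, or -1 for a single batch containing everything) can be sketched as a small helper. `iter_batches` is a hypothetical function written for illustration; how sieves batches documents internally is not shown on this page, and rejecting other values with `ValueError` is an assumption.

```python
from __future__ import annotations

from collections.abc import Iterable, Iterator
from itertools import islice
from typing import TypeVar

T = TypeVar("T")


def iter_batches(items: Iterable[T], batch_size: int = -1) -> Iterator[list[T]]:
    """Yield lists of at most `batch_size` items; -1 yields one batch with everything."""
    if batch_size == 0 or batch_size < -1:
        # Assumption: only positive sizes and the -1 sentinel are meaningful.
        raise ValueError("batch_size must be a positive integer or -1")
    iterator = iter(items)
    if batch_size == -1:
        everything = list(iterator)
        if everything:
            yield everything
        return
    # islice pulls the next `batch_size` items; an empty list means the input is exhausted.
    while batch := list(islice(iterator, batch_size)):
        yield batch


paths = ["a.pdf", "b.pdf", "c.pdf", "d.pdf", "e.pdf"]
in_pairs = list(iter_batches(paths, batch_size=2))
all_at_once = list(iter_batches(paths))  # default -1: one batch
```

A smaller batch size bounds how many documents are held and converted together, at the cost of more calls into the converter.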
deserialize(config, **kwargs)
classmethod
Generate Task instance from config.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | Config | Config to generate instance from. | required |
| kwargs | dict[str, Any] | Values to inject into loaded config. | {} |
Returns:
| Type | Description |
|---|---|
| Task | Deserialized Task instance. |
Source code in sieves/tasks/core.py
serialize()
Serialize task.
Returns:
| Type | Description |
|---|---|
| Config | Config instance. |
Source code in sieves/tasks/core.py
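The round trip between `serialize()` and `deserialize(config, **kwargs)` can be sketched with a toy class. `ToyMarkerTask` is a hypothetical stand-in, and a plain dict takes the place of the `Config` type. Treating the converter as a value that is left out of the config and injected again on load is an assumption made for illustration.

```python
from __future__ import annotations

from typing import Any


class ToyMarkerTask:
    """Hypothetical stand-in; a plain dict plays the role of the Config object."""

    def __init__(
        self,
        converter: Any = None,
        export_format: str = "markdown",
        task_id: str | None = None,
    ) -> None:
        self.converter = converter
        self.export_format = export_format
        self.task_id = task_id or type(self).__name__

    def serialize(self) -> dict[str, Any]:
        # Plain values are stored as-is; the converter object is replaced by a
        # placeholder (assumption: it cannot be written to a config).
        return {
            "converter": None,
            "export_format": self.export_format,
            "task_id": self.task_id,
        }

    @classmethod
    def deserialize(cls, config: dict[str, Any], **kwargs: Any) -> ToyMarkerTask:
        # Keyword arguments are injected into the loaded config before construction.
        return cls(**{**config, **kwargs})


original = ToyMarkerTask(converter=object(), export_format="html", task_id="ingest")
config = original.serialize()
restored = ToyMarkerTask.deserialize(config, converter=original.converter)
```

Here `restored` has the same `export_format` and `task_id` as `original`, and the converter passed as a keyword argument replaces the placeholder stored in the config.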