Marker
Marker task for converting PDF documents to text.
Bases: Task
Source code in sieves/tasks/preprocessing/ocr/marker_.py
id
property
Returns the task ID, which the pipeline uses for results and dependency management.
Returns:
Type | Description |
---|---|
str | Task ID. |
__call__(docs)
Process documents using Marker.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
docs | Iterable[Doc] | Documents to process. | required |
Returns:
Type | Description |
---|---|
Iterable[Doc] | Processed documents. |
Source code in sieves/tasks/preprocessing/ocr/marker_.py
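The `__call__` contract above (an iterable of documents in, an iterable of documents out) can be illustrated with a minimal sketch. Everything here is a stand-in: the `Doc` dataclass with its `uri` and `text` fields and the `UppercaseTask` class are hypothetical and not part of the library, and whether the real task yields documents lazily is not stated in this reference.

```python
from collections.abc import Iterable, Iterator
from dataclasses import dataclass
from typing import Optional


@dataclass
class Doc:
    """Hypothetical stand-in for the library's Doc; only the fields used here."""

    uri: str
    text: Optional[str] = None


class UppercaseTask:
    """Hypothetical task that follows the documented __call__ shape."""

    def __call__(self, docs: Iterable[Doc]) -> Iterator[Doc]:
        for doc in docs:
            # A real OCR task would convert the file behind doc.uri here;
            # this sketch just derives some text from the URI.
            doc.text = doc.uri.upper()
            yield doc


task = UppercaseTask()
processed = list(task(Doc(uri=name) for name in ["a.pdf", "b.pdf"]))
print([doc.text for doc in processed])  # ['A.PDF', 'B.PDF']
```

Because the sketch accepts any iterable, a generator can be passed in directly, as in the last lines.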
__init__(converter=None, export_format='markdown', task_id=None, show_progress=True, include_meta=False, extract_images=False)
Initialize the Marker task.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
converter | PdfConverter \| TableConverter | Custom PdfConverter or TableConverter instance. If None, a default one will be created. | None |
export_format | str | Format to export the document in ("markdown", "html", or "json"). | 'markdown' |
task_id | str \| None | Task ID. | None |
show_progress | bool | Whether to show progress bar for processed documents. | True |
include_meta | bool | Whether to include meta information generated by the task. | False |
extract_images | bool | Whether to extract images from the PDF. | False |
Source code in sieves/tasks/preprocessing/ocr/marker_.py
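As a rough illustration of the keyword arguments and defaults listed above, the following sketch collects them in a small options object and rejects an unsupported `export_format` early. The `MarkerOptions` class and `SUPPORTED_EXPORT_FORMATS` tuple are hypothetical helpers, not library names, and the sketch makes no claim about how the library itself validates its arguments. The `converter` argument is left out because it requires a live converter instance.

```python
from dataclasses import asdict, dataclass
from typing import Optional

# The three values named in the export_format description above.
SUPPORTED_EXPORT_FORMATS = ("markdown", "html", "json")


@dataclass
class MarkerOptions:
    """Hypothetical container mirroring the documented keyword arguments."""

    export_format: str = "markdown"
    task_id: Optional[str] = None
    show_progress: bool = True
    include_meta: bool = False
    extract_images: bool = False

    def __post_init__(self) -> None:
        # Fail fast on a format the documentation does not list.
        if self.export_format not in SUPPORTED_EXPORT_FORMATS:
            raise ValueError(
                f"export_format must be one of {SUPPORTED_EXPORT_FORMATS}, "
                f"got {self.export_format!r}"
            )


options = MarkerOptions(export_format="html", extract_images=True)
kwargs = asdict(options)  # plain dict of keyword arguments
print(kwargs["export_format"], kwargs["extract_images"])
```

The resulting dictionary uses the same names as the documented parameters, so it could be unpacked into the constructor call as keyword arguments.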
deserialize(config, **kwargs)
classmethod
Generate Task instance from config.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
config | Config | Config to generate instance from. | required |
kwargs | dict[str, Any] | Values to inject into loaded config. | {} |
Returns:
Type | Description |
---|---|
Task | Deserialized Task instance. |
Source code in sieves/tasks/core.py
serialize()
Serialize the task.
Returns:
Type | Description |
---|---|
Config | Config instance. |
Source code in sieves/tasks/core.py
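The `serialize()` / `deserialize(config, **kwargs)` pair can be pictured as a round trip in which the extra keyword arguments supply values the stored config cannot hold. The sketch below uses hypothetical stand-ins: this `Config` is a plain dataclass rather than the library's class, `SketchTask` is invented for illustration, and the assumption that a live converter object is replaced by a placeholder on serialization is mine rather than something this reference states.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Config:
    """Hypothetical stand-in for the library's Config: a plain bag of values."""

    values: dict[str, Any] = field(default_factory=dict)


class SketchTask:
    """Hypothetical task with one serializable and one non-serializable field."""

    def __init__(self, task_id: str, converter: Any = None) -> None:
        self.task_id = task_id
        self.converter = converter

    def serialize(self) -> Config:
        # Assumption: a live converter object cannot be stored, so only a
        # placeholder is written to the config.
        return Config(values={"task_id": self.task_id, "converter": None})

    @classmethod
    def deserialize(cls, config: Config, **kwargs: Any) -> "SketchTask":
        # Injected values override whatever was loaded from the config.
        return cls(**{**config.values, **kwargs})


original = SketchTask("ocr", converter=object())
config = original.serialize()

fresh_converter = object()
restored = SketchTask.deserialize(config, converter=fresh_converter)
print(restored.task_id, restored.converter is fresh_converter)  # ocr True
```

Merging `kwargs` last means a caller can re-attach objects such as a converter at load time without editing the stored config.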