unstructured
File preprocessing for converting raw files into documents.
Unstructured
Bases: Task
Parser wrapping the unstructured library to convert files into documents.
Source code in sieves/tasks/preprocessing/unstructured_.py
id
property
Returns task ID. Used by pipeline for results and dependency management.
Returns:
Type | Description |
---|---|
str | Task ID. |
__call__(docs)
Parse resources using unstructured.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
docs | Iterable[Doc] | Resources to process. | required |
Returns:
Type | Description |
---|---|
Iterable[Doc] | Parsed documents. |
Source code in sieves/tasks/preprocessing/unstructured_.py
__init__(partition=unstructured.partition.auto.partition, cleaners=(), task_id=None, show_progress=True, include_meta=False, **kwargs)
Initialize the unstructured parser.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
partition | PartitionType | Function to use for partitioning. | unstructured.partition.auto.partition |
cleaners | tuple[CleanerType, ...] | Cleaning functions to apply. | () |
task_id | str \| None | Task ID. | None |
show_progress | bool | Whether to show a progress bar for processed documents. | True |
include_meta | bool | Whether to include meta information generated by the task. | False |
kwargs | dict[str, Any] | Keyword arguments passed to the partitioning call. | {} |
Source code in sieves/tasks/preprocessing/unstructured_.py
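The parameters above suggest a simple flow: a partitioning function splits the input, extra keyword arguments are forwarded to that call, and the cleaning functions are applied to the result. The following is a minimal sketch of that flow in plain Python. It is not the sieves implementation and does not call the unstructured library; `toy_partition`, `collapse_whitespace`, `strip_bullets`, and `parse` are hypothetical stand-ins, and the assumption that cleaners run in the order given is ours.

```python
from collections.abc import Callable


def toy_partition(text: str, sep: str = "\n\n") -> list[str]:
    """Hypothetical stand-in for the `partition` argument: split text into chunks."""
    return [chunk for chunk in text.split(sep) if chunk]


def collapse_whitespace(chunk: str) -> str:
    """Hypothetical cleaner: squeeze runs of whitespace into single spaces."""
    return " ".join(chunk.split())


def strip_bullets(chunk: str) -> str:
    """Hypothetical cleaner: drop leading bullet characters."""
    return chunk.lstrip("-* ").strip()


def parse(
    text: str,
    partition: Callable[..., list[str]],
    cleaners: tuple[Callable[[str], str], ...] = (),
    **kwargs,
) -> list[str]:
    """Partition the input, forwarding kwargs, then apply each cleaner in turn."""
    chunks = partition(text, **kwargs)
    for cleaner in cleaners:  # assumed: cleaners are applied in the order given
        chunks = [cleaner(chunk) for chunk in chunks]
    return chunks


raw = "- First   item\n\n*  Second\titem\n\n"
result = parse(
    raw,
    toy_partition,
    cleaners=(collapse_whitespace, strip_bullets),
    sep="\n\n",  # forwarded to the partitioning call, like **kwargs above
)
print(result)
```

With the real class, the analogous arguments would be an unstructured partitioning function for `partition` and a tuple of cleaning callables for `cleaners`; check the source file referenced above for how they are actually applied.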
_require()
staticmethod
Download all necessary resources that have to be installed from within Python.
Source code in sieves/tasks/preprocessing/unstructured_.py
deserialize(config, **kwargs)
classmethod
Generate Task instance from config.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
config | Config | Config to generate instance from. | required |
kwargs | dict[str, Any] | Values to inject into loaded config. | {} |
Returns:
Type | Description |
---|---|
Task | Deserialized Task instance. |
Source code in sieves/tasks/core.py
serialize()
Serializes task.
Returns:
Type | Description |
---|---|
Config | Config instance. |
Source code in sieves/tasks/core.py