### Install xml2db Package Source: https://cre-dev.github.io/xml2db/getting_started Install the xml2db package using pip. It is recommended to do this within a virtual environment. ```bash pip install xml2db ``` -------------------------------- ### Install xml2db in Editable Mode with Extras Source: https://cre-dev.github.io/xml2db/getting_started For development, clone the git repository and install the package in editable mode with documentation and test dependencies. ```bash pip install -e .[docs,tests] ``` -------------------------------- ### Instantiate DataModel Source: https://cre-dev.github.io/xml2db/api/data_model Example of creating a DataModel instance with an XSD file path and a database connection string. ```python >>> data_model = DataModel( >>> xsd_file="path/to/file.xsd", >>> connection_string="postgresql+psycopg2://testuser:testuser@localhost:5432/testdb", >>> ) ``` -------------------------------- ### Get All Create Index Statements Source: https://cre-dev.github.io/xml2db/api/data_model Yields SQL CREATE INDEX statements for all tables in the data model. ```python def get_all_create_index_statements(self) -> Iterable[CreateIndex]: """Yield create index statements for all tables""" for tb in self.fk_ordered_tables: yield from tb.get_create_index_statements() ``` -------------------------------- ### Multiprocessing Example for Loading XML Data Source: https://cre-dev.github.io/xml2db/api/overview Demonstrates how to load multiple XML files into a database concurrently using multiprocessing. It serializes database I/O operations using a lock to ensure data integrity, while XML parsing is performed in parallel. ```python import multiprocessing from xml2db import DataModel def load_one_file(xml_path, xsd_path, connection_string, lock): # Each process creates its own DataModel with a unique temp_prefix. model = DataModel( xsd_file=xsd_path, connection_string=connection_string, ) # XML parsing is CPU-bound and runs in parallel across all processes. doc = model.parse_xml(xml_path) # Serialise all database I/O across processes. with lock: doc.insert_into_target_tables() model.engine.dispose() if __name__ == "__main__": xsd_path = "schema.xsd" connection_string = "duckdb:///data.duckdb" xml_files = ["file1.xml", "file2.xml", "file3.xml"] lock = multiprocessing.Lock() processes = [ multiprocessing.Process( target=load_one_file, args=(xml_path, xsd_path, connection_string, lock), ) for xml_path in xml_files ] for p in processes: p.start() for p in processes: p.join() if p.exitcode != 0: raise RuntimeError(f"Worker failed with exit code {p.exitcode}") ``` -------------------------------- ### Get All Create Table Statements Source: https://cre-dev.github.io/xml2db/api/data_model Yields SQLAlchemy CREATE TABLE statements for all tables. Can generate statements for temporary tables if the 'temp' parameter is set to True. ```python def get_all_create_table_statements( self, temp: bool = False ) -> Iterable[CreateTable]: """Yield sqlalchemy `create table` statements for all tables Args: temp: If `False`, yield create table statements for target tables (unprefixed). If True, yield create table statements for temporary tables (prefixed). """ for tb in self.fk_ordered_tables: yield from tb.get_create_table_statements(temp) ``` -------------------------------- ### Extract Data from Database Source: https://cre-dev.github.io/xml2db/api/document This function extracts data starting from a specified root table with a given WHERE clause. It recursively fetches related data through defined relationships (one-to-many and many-to-one). ```python def extract_from_database(self, root_table_name, root_select_where, force_tz=None): """Extract data from database. Args: root_table_name (str): The name of the root table to start extraction from. root_select_where (str): The SQL WHERE clause to filter the root table records. force_tz (str, optional): Timezone to force for datetime objects. Defaults to None. Returns: dict: A dictionary containing the extracted data, structured by tables and relations. """ flat_tables = {} with self.model.engine.connect() as conn: _do_extract_table( self.model.tables[root_table_name], text(root_select_where), None, [], flat_tables, conn, ) self.data = flat_tables return flat_tables ``` -------------------------------- ### Extract Subtree from Database Source: https://cre-dev.github.io/xml2db/api/document This function extracts a subtree from a database, starting from a specified root table and applying a WHERE clause. It can optionally enforce a timezone for datetime objects. The extracted data is returned as a dictionary of flat data tables. ```python def extract_from_database( self, root_table_name: str, root_select_where: str, force_tz: Union[str, None] = None, ) -> dict: """Extract a subtree from the database and store it in a flat format Args: root_table_name: The root table name to start from root_select_where: A where clause to apply to this root table force_tz: Apply this timezone if database returns timezone-naïve datetime Returns: A shallow dict of flat data tables """ if force_tz: force_tz = ZoneInfo(force_tz) def _fetch_data( sqla_table: Table, key_column: Column, join_sequence: list[tuple[Column, Table, Column]], top_where_clause: TextClause, order_by: Union[None, tuple[Column]], append_to: list, conn: Connection, ): """Fetch data from a specific table and write fetched rows in a dict keyed by the first row column""" quer = select(*(sqla_table.columns.values())) join_sequence = join_sequence.copy() if len(join_sequence) > 0: left_col, join_tb, right_col = join_sequence.pop() sub_quer = select(right_col) prev_join_col = left_col for left_col, join_tb, right_col in reversed(join_sequence): sub_quer = sub_quer.join(join_tb, right_col == prev_join_col) prev_join_col = left_col sub_quer = sub_quer.where(top_where_clause) quer = quer.where(key_column.in_(sub_quer)) else: quer = quer.where(top_where_clause) if order_by: quer = quer.order_by(*order_by) def add_tz(x): if ( force_tz and isinstance(x, datetime.datetime) and (x.tzinfo is None or x.tzinfo.utcoffset(x) is None) ): x = x.replace(tzinfo=force_tz) return x col_names = sqla_table.columns.keys() for row in conn.execute(quer): append_to.append({key: add_tz(val) for key, val in zip(col_names, row)}) def _do_extract_table( tb, top_where_clause, parent_table, join_sequence, res_dict, conn, ): """Fetch tables and relationship tables recursively""" if tb.type_name not in res_dict: res_dict[tb.type_name] = {"records": []} _fetch_data( tb.table, ( getattr(tb.table.c, f"pk_{tb.name}") if tb.is_reused else getattr(tb.table.c, f"fk_parent_{parent_table.name}") ), join_sequence, top_where_clause, ( None if tb.is_reused or not tb.data_model.model_config["row_numbers"] else ( getattr(tb.table.c, f"fk_parent_{parent_table.name}"), tb.table.c.xml2db_row_number, ) ), res_dict[tb.type_name]["records"], conn, ) join_root = ( [(None, tb.table, getattr(tb.table.c, f"pk_{tb.name}"))] if parent_table is None else [] ) if len(tb.relations_n) > 0: if "relations_n" not in res_dict[tb.type_name]: res_dict[tb.type_name]["relations_n"] = {} for rel in tb.relations_n.values(): if rel.rel_table_name not in res_dict[tb.type_name]["relations_n"]: res_dict[tb.type_name]["relations_n"][rel.rel_table_name] = { ``` -------------------------------- ### extract_from_database Source: https://cre-dev.github.io/xml2db/api/document Extracts a subtree from the database and stores it in a flat format. This function is useful for retrieving related data across multiple tables based on a starting point and filtering conditions. ```APIDOC ## extract_from_database ### Description Extract a subtree from the database and store it in a flat format. This function is useful for retrieving related data across multiple tables based on a starting point and filtering conditions. ### Method Python Function ### Parameters #### Path Parameters - **root_table_name** (str) - Required - The root table name to start from - **root_select_where** (str) - Required - A where clause to apply to this root table - **force_tz** (Union[str, None]) - Optional - Apply this timezone if database returns timezone-naïve datetime. Defaults to None. ### Returns - **dict** - A shallow dict of flat data tables ### Example ```python # Assuming 'db_extractor' is an instance of the class containing this method result = db_extractor.extract_from_database( root_table_name='users', root_select_where="user_id = 1", force_tz='UTC' ) ``` ``` -------------------------------- ### Create DataModel Object from XSD Source: https://cre-dev.github.io/xml2db/getting_started Initialize a DataModel object by providing the path to an XSD file, the target database schema name, and a SQLAlchemy connection string. An optional model configuration can also be provided. ```python from xml2db import DataModel data_model = DataModel( xsd_file="path/to/file.xsd", db_schema="source_data", # the name of the database target schema connection_string="postgresql+psycopg2://testuser:testuser@localhost:5432/testdb", model_config={}, ) ``` -------------------------------- ### DataModel Class Initialization Source: https://cre-dev.github.io/xml2db/api/data_model The constructor for the DataModel class, responsible for parsing XSD files, setting up database connections, and configuring the data model. ```python def __init__( self, xsd_file: str, short_name: str = "DocumentRoot", long_name: str = None, base_url: str = None, model_config: dict = None, connection_string: str = None, db_engine: sqlalchemy.Engine = None, db_type: str = None, db_schema: str = None, temp_prefix: str = None, ): self.model_config = self._validate_config(model_config) self.tables_config = model_config.get("tables", {}) if model_config else {} xsd_file_name = xsd_file if base_url is None: base_url = os.path.normpath(os.path.dirname(xsd_file)) xsd_file_name = os.path.basename(xsd_file) self.xml_schema = xmlschema.XMLSchema(xsd_file_name, base_url=base_url) self.lxml_schema = etree.XMLSchema(etree.parse(xsd_file)) self.xml_converter = XMLConverter(data_model=self) self.data_flow_name = short_name self.data_flow_long_name = long_name if connection_string is None and db_engine is None: logger.warning( "DataModel created without connection string cannot do actual imports" ) self.engine = None self.db_type = db_type else: if db_engine: self.engine = db_engine else: engine_options = {} if "mssql" in connection_string: engine_options = { "fast_executemany": True, "isolation_level": "SERIALIZABLE", } self.engine = create_engine( connection_string, **engine_options, ) self.db_type = self.engine.dialect.name self.dialect = get_dialect(self.db_type) self.model_config = self.dialect.validate_model_config(self.model_config) self.db_schema = db_schema self.temp_prefix = str(uuid4())[:8] if temp_prefix is None else temp_prefix self.tables = {} self.names_types_map = {} self.root_table = None self.types_transforms = {} self.fields_transforms = {} self.ordered_tables_keys = [] self.transaction_groups = [] self.source_tree = "" ``` -------------------------------- ### Create All Tables Source: https://cre-dev.github.io/xml2db/api/data_model Creates all tables for the data model. Can create either target tables or temporary tables for data import. This method is often called implicitly by `Document.insert_into_target_tables()`. ```python def create_all_tables(self, temp: bool = False) -> None: """Create tables for the data model, either target tables or temp tables used to import data. You do not have to call this method explicitly when using [`Document.insert_into_target_tables()`](document.md#xml2db.document.Document.insert_into_target_tables), which will create tables if they do not exist. Args: temp: If `False`, create target tables (unprefixed). If `True`, create temporary (prefixed) tables. """ for tb in self.fk_ordered_tables: tb.create_tables(self.engine, temp) ``` -------------------------------- ### Initialize Document Object Source: https://cre-dev.github.io/xml2db/api/document Initializes a Document object with a DataModel. The DataModel defines the structure and transformations for the XML data. This constructor sets up the model, initializes an empty data dictionary, and prepares for XML file path assignment. ```python def __init__(self, model: "DataModel"): self.model = model self.data = {} self.xml_file_path = None ``` -------------------------------- ### Write Source and Target Tree Structures to Files Source: https://cre-dev.github.io/xml2db/getting_started Generate text files representing the raw XML schema tree and the simplified target model tree using the source_tree and target_tree attributes. ```python with open(f"source_tree.txt", "w") as f: f.write(data_model.source_tree) with open(f"target_tree.txt", "w") as f: f.write(data_model.target_tree) ``` -------------------------------- ### create_all_tables Source: https://cre-dev.github.io/xml2db/api/data_model Creates all tables for the data model. This can be used for either target tables or temporary tables used for data import. It is not necessary to call this method explicitly when using `Document.insert_into_target_tables()`, as that method will create tables if they do not exist. ```APIDOC ## create_all_tables(temp: bool = False) ### Description Create tables for the data model, either target tables or temp tables used to import data. You do not have to call this method explicitly when using `Document.insert_into_target_tables()`, which will create tables if they do not exist. ### Method POST ### Endpoint /create_all_tables ### Parameters #### Query Parameters - **temp** (bool) - Optional - If `False`, create target tables (unprefixed). If `True`, create temporary (prefixed) tables. Default: `False` ``` -------------------------------- ### DataModel Initialization Source: https://cre-dev.github.io/xml2db/api/data_model Initializes the DataModel with attributes for target tree, metadata, and processing timestamp. It also triggers the internal model building process. ```python self.target_tree = "" self.metadata = MetaData() self.processed_at = datetime.now() self._build_model() ``` -------------------------------- ### Load XML into Database Source: https://cre-dev.github.io/xml2db This snippet demonstrates the basic usage of xml2db to load an XML file into a relational database. It requires an XSD file for schema definition and a database connection string. The `insert_into_target_tables()` method creates tables if they don't exist. ```python from xml2db import DataModel # Create a DataModel object from an XSD file data_model = DataModel( xsd_file="path/to/file.xsd", connection_string="postgresql+psycopg2://testuser:testuser@localhost:5432/testdb", ) # Parse an XML file based on this XSD schema document = data_model.parse_xml(xml_file="path/to/file.xml") # Load data into the database, creating target tables if need be document.insert_into_target_tables() ``` -------------------------------- ### Model Configuration Structure Source: https://cre-dev.github.io/xml2db/configuring This dictionary structure outlines the general configuration options available for the data model. It includes settings for document tree hooks, row numbers, columnstore indexes, metadata columns, and table-specific configurations. ```python { "document_tree_hook": None, "document_tree_node_hook": None, "row_numbers": False, "as_columnstore": False, "metadata_columns": None, "tables": { "table1": { "reuse": True, "choice_transform": False, "as_columnstore": False, "fields": { "my_column": { "type": None #default type } }, "extra_args": [] } } } ``` -------------------------------- ### Load Multiple XML Files in Batch Source: https://cre-dev.github.io/xml2db/getting_started Accumulates data from multiple XML files in memory before inserting them into the database in a single batch operation. This is efficient for loading many small files. ```python flat_data = None for xml_file in files: document = data_model.parse_xml( xml_file=xml_file, metadata={"input_file_path": xml_file}, flat_data=flat_data, ) flat_data = document.data document.insert_into_target_tables() ``` -------------------------------- ### get_all_create_table_statements Source: https://cre-dev.github.io/xml2db/api/data_model Yields SQL statements for creating all tables, with an option to generate statements for temporary tables. ```APIDOC ## get_all_create_table_statements(temp=False) ### Description Yield sqlalchemy `create table` statements for all tables. ### Parameters #### Path Parameters - None #### Query Parameters - None #### Request Body - None ### Parameters - **temp** (`bool`) - Optional - If `False`, yield create table statements for target tables (unprefixed). If True, yield create table statements for temporary tables (prefixed). Default: `False` ### Returns - **Iterable[CreateTable]** - Yields sqlalchemy `create table` statements for all tables. ### Request Example ```python # No example provided in source ``` ### Response #### Success Response (200) - **Iterable[CreateTable]** - Yields sqlalchemy `create table` statements for all tables. #### Response Example ```json # No example provided in source ``` ``` -------------------------------- ### XMLConverter Constructor Source: https://cre-dev.github.io/xml2db/api/xml_converter Initializes the XMLConverter with a DataModel and an optional document tree. ```APIDOC ## XMLConverter Constructor ### Description Initializes the XMLConverter with a DataModel and an optional document tree. ### Parameters #### Path Parameters - **data_model** (DataModel) - Required - The DataModel object used to parse XML files - **document_tree** (dict) - Optional - Data in the document tree format (can be built later by the parse_xml method) ``` -------------------------------- ### Create Database Schema Source: https://cre-dev.github.io/xml2db/api/data_model Creates the database schema if it does not already exist. This method is typically managed automatically by `Document.insert_into_target_tables()`. ```python def create_db_schema(self) -> None: """Create database schema if it does not already exist. You do not have to call this method explicitly when using [`Document.insert_into_target_tables()`](document.md#xml2db.document.Document.insert_into_target_tables). """ if self.db_schema is not None: self.dialect.create_schema(self.engine, self.db_schema) logger.info(f"Created schema: {self.db_schema}") ``` -------------------------------- ### get_all_create_index_statements Source: https://cre-dev.github.io/xml2db/api/data_model Yields SQL statements for creating indexes on all tables in the data model. ```APIDOC ## get_all_create_index_statements() ### Description Yield create index statements for all tables. ### Parameters #### Path Parameters - None #### Query Parameters - None #### Request Body - None ### Returns - **Iterable[CreateIndex]** - Yields create index statements for all tables. ### Request Example ```python # No example provided in source ``` ### Response #### Success Response (200) - **Iterable[CreateIndex]** - Yields create index statements for all tables. #### Response Example ```json # No example provided in source ``` ``` -------------------------------- ### Configure Joining Simple Type Values Source: https://cre-dev.github.io/xml2db/configuring This configuration enables joining simple type elements into a single column, separated by commas. It's the default behavior and cannot be opted out. ```Python model_config = { "tables": { "my_table_name": { "fields": { "my_field_name": { "transform": "join" } } } } } ``` -------------------------------- ### XMLConverter Class Initialization Source: https://cre-dev.github.io/xml2db/api/xml_converter Initializes the XMLConverter with a DataModel object and an optional document tree. The DataModel is required for parsing XML files. ```python def __init__(self, data_model: "DataModel", document_tree: dict = None): """A class to convert data from document tree format (nested dict) to and from XML. Args: data_model: The [`DataModel`](./data_model.md#xml2db.model.DataModel) object used to parse XML files document_tree: Data in the document tree format (optional, can be built later by the `parse_xml` method) """ self.model = data_model self.document_tree = document_tree ``` -------------------------------- ### Add SQLAlchemy Index via Extra Arguments Source: https://cre-dev.github.io/xml2db/configuring Pass a SQLAlchemy Index object within 'extra_args' to customize table creation, such as adding an index on specific columns. This requires importing the 'sqlalchemy' library. ```python model_config = { "tables": { "my_table": { "extra_args": sqlalchemy.Index("my_index", "my_column1", "my_column2"), } } } ``` -------------------------------- ### Parse XML and Insert into Database Source: https://cre-dev.github.io/xml2db/getting_started Parses a single XML file and inserts its content into the target database tables. The `insert_into_target_tables` method handles the data loading. ```python document = data_model.parse_xml( xml_file="path/to/file.xml", ) document.insert_into_target_tables() ``` -------------------------------- ### Insert and Merge Data into Target Tables Source: https://cre-dev.github.io/xml2db/api/document Inserts data into temporary tables and then merges them into the final target tables. Use this function to populate your database from XML data, with options to control transaction scope and batch size for insertions. ```python def insert_into_target_tables( self, single_transaction: bool = True, max_lines: int = -1, ) -> int: """Insert and merge data into the database Insert data into temporary tables and then merge temporary tables into target tables. Args: single_transaction: Should we run all queries in a single transaction, or isolate queries at the minimum scope required to ensure database consistency? max_lines: The maximum number of lines to insert in a single statement when loading data to the temporary tables Returns: The number of inserted rows """ try: self.model.create_db_schema() except Exception as e: logger.error( f"Error while creating database schema '{self.model.db_schema}'" ) logger.error(e) raise try: self.insert_into_temp_tables(max_lines) except Exception as e: logger.error( f"Error while importing into temporary tables from {self.xml_file_path}" ) logger.error(e) raise else: logger.info( f"Merging temporary tables into target tables for {self.xml_file_path}" ) try: self.model.create_all_tables() # Create target tables if not exist inserted_rows = self.merge_into_target_tables(single_transaction) except Exception as e: logger.error( f"Error while merging temporary tables into target tables for {self.xml_file_path}" ) logger.error(e) raise finally: logger.info(f"Dropping temporary tables for {self.xml_file_path}") self.model.drop_all_temp_tables() return inserted_rows ``` -------------------------------- ### Python: Convert Flat Data to Document Tree Source: https://cre-dev.github.io/xml2db/api/document This method converts data stored in flat tables into a document tree. It first indexes the data for efficient access and then recursively builds the tree structure. ```python def flat_data_to_doc_tree(self) -> tuple:\n """Convert the data stored in flat tables into a document tree\n\n Returns:\n A tuple (node_type, content, hash) containing the document tree\n """\n data_index = {}\n\n # convert data to keyed dict for easier access\n temp = (\n ""\n if f"pk_{self.model.tables[self.model.root_table].name}"\n in self.data[self.model.root_table]["records"][0]\n else "temp_"\n )\n for tb in self.model.tables.values():\n data_index[tb.type_name] = {\n "records": {},\n "relations_n": {},\n }\n if tb.type_name in self.data:\n data_index[tb.type_name]["records"] = {\n row[f"{temp}pk_{tb.name}"]: row\n for row in self.data[tb.type_name]["records"]\n }\n for rel in tb.relations_n.values():\n index = {}\n if rel.other_table.is_reused:\n if tb.type_name in self.data:\n for row in self.data[tb.type_name]["relations_n"ционного][\n rel.rel_table_name\n ]["records"]:\n if row[f"{temp}fk_{tb.name}"] not in index:\n index[row[f"{temp}fk_{tb.name}"]] = []\n index[row[f"{temp}fk_{tb.name}"]].append(\n row[f"{temp}fk_{rel.other_table.name}"]\n )\n else:\n if rel.other_table.type_name in self.data:\n for row in self.data[rel.other_table.type_name]["records"]:\n if row[f"{temp}fk_parent_{tb.name}"] not in index:\n index[row[f"{temp}fk_parent_{tb.name}"]] = []\n index[row[f"{temp}fk_parent_{tb.name}"]].append(\n row[f"{temp}pk_{rel.other_table.name}"]\n )\n data_index[tb.type_name]["relations_n"][rel.rel_table_name] = index\n\n def _build_node(node_type: str, node_pk: int) -> tuple:\n """Build a dict node recursively\n\n Args:\n node_type: The node type\n node_pk: The node primary key\n\n Returns:\n A node as a tuple (node_type, content, hash)\n """\n tb = self.model.tables[node_type]\n content = {}\n\n record = data_index[node_type]["records"][node_pk]\n for field_type, rel_name, rel in tb.fields:\n if field_type == "col" and record[rel_name] is not None:\n content_key = (\n (\n f"{rel_name[:-5]}__attr"\n if rel.has_suffix\n else f"{rel_name}__attr"\n )\n if rel.is_attr\n else rel_name\n )\n if isinstance(record[rel_name], datetime.datetime):\n content[content_key] = [\n record[rel_name].isoformat(timespec="milliseconds")\n ]\n else:\n content[content_key] = (\n list(csv.reader([str(record[rel_name])], escapechar="\"))[0]\n if "," in str(record[rel_name])\n else [record[rel_name]]\n )\n elif (\n field_type == "rel1"\n and record[f"{temp}{rel.field_name}"] is not None\n ):\n content[rel_name] = [\n _build_node(\n rel.other_table.type_name, record[f"{temp}{rel.field_name}"]\n )\n ]\n elif (\n field_type == "reln"\n and node_pk\n in data_index[tb.type_name]["relations_n"][rel.rel_table_name]\n ):\n content[rel_name] = [\n _build_node(rel.other_table.type_name, pk)\n for pk in data_index[tb.type_name]["relations_n"ционного][\n rel.rel_table_name\n ][node_pk]\n ]\n return node_type, content\n\n return _build_node(\n self.model.root_table,\n int(list(data_index[self.model.root_table]["records"].keys())[0]),\n )\n ``` -------------------------------- ### Insert Data into Temporary Tables Source: https://cre-dev.github.io/xml2db/api/document Recreates temporary tables and inserts data into them from an XML file. This function is a prerequisite for merging data into target tables. It supports batch insertion controlled by `max_lines`. ```python def insert_into_temp_tables(self, max_lines: int = -1) -> None: """Insert data into temporary tables (Re)creates temp tables before inserting data. Args: max_lines: The maximum number of lines to insert in a single statement """ logger.info(f"Dropping temp tables if exist for {self.xml_file_path}") self.model.drop_all_temp_tables() logger.info(f"Creating temp tables for {self.xml_file_path}") self.model.create_all_tables(temp=True) logger.info(f"Inserting data into temporary tables from {self.xml_file_path}") # insert data (order does not really matter) for tb in self.model.fk_ordered_tables: for query, data in tb.get_insert_temp_records_statements( self.data.get(tb.type_name, None) ): if max_lines is None or max_lines < 0: max_lines = len(data) start_idx = 0 while start_idx < len(data): with self.model.engine.begin() as conn: self.model.dialect.bulk_insert( conn, query.table, data[start_idx : (start_idx + max_lines)], ) start_idx = start_idx + max_lines ``` -------------------------------- ### Convert Document Tree to XML Source: https://cre-dev.github.io/xml2db/api/xml_converter Use this method to convert a nested dictionary representing a document tree into an XML file. You can specify an output file path, a namespace map, and the indentation string for the XML output. The method returns the root XML node. ```python def to_xml( self, out_file: str = None, nsmap: dict = None, indent: str = " " ) -> etree.Element: """Convert a document tree (nested dict) into an XML file Args: out_file: If provided, write output to a file. nsmap: An optional namespace mapping. indent: A string used as indentin XML output. Returns: The etree object corresponding to the root XML node. """ doc = self._make_xml_node( self.document_tree, self.model.tables[self.document_tree[0]].name, nsmap, ) if self.model.tables[self.model.root_table].is_virtual_node: child = None for child in doc: break doc = child if out_file: etree.indent(doc, space=indent) with open(out_file, "wt") as f: f.write( etree.tostring( doc, pretty_print=True, encoding="utf-8", xml_declaration=True, ).decode("utf-8") ) return doc ``` -------------------------------- ### create_db_schema Source: https://cre-dev.github.io/xml2db/api/data_model Creates the database schema if it does not already exist. This method does not need to be called explicitly when using `Document.insert_into_target_tables()`. ```APIDOC ## create_db_schema() ### Description Create database schema if it does not already exist. You do not have to call this method explicitly when using `Document.insert_into_target_tables()`. ### Method POST ### Endpoint /create_db_schema ``` -------------------------------- ### Drop All Tables Source: https://cre-dev.github.io/xml2db/api/data_model Drops all target (unprefixed) tables associated with the data model. Use with extreme caution as this action is irreversible and performed without further confirmation. ```python def drop_all_tables(self): """Drop the data model target (unprefixed) tables. Danger: BE CAUTIOUS, THIS METHOD DROPS TABLES WITHOUT FURTHER NOTICE! """ for tb in self.fk_ordered_tables_reversed: tb.drop_tables(self.engine) ``` -------------------------------- ### to_xml Source: https://cre-dev.github.io/xml2db/api/xml_converter Converts a document tree (nested dictionary) into an XML file. It can optionally write the output to a specified file and supports custom namespace mappings and indentation. ```APIDOC ## to_xml ### Description Converts a document tree (nested dictionary) into an XML file. It can optionally write the output to a specified file and supports custom namespace mappings and indentation. ### Method This is a method of a class, not a direct HTTP endpoint. ### Parameters #### Path Parameters None #### Query Parameters None #### Request Body None ### Parameters - **out_file** (str) - Optional - If provided, write output to a file. - **nsmap** (dict) - Optional - An optional namespace mapping. - **indent** (str) - Optional - A string used as indentin XML output. Defaults to ' '. ### Returns - **Element** - The etree object corresponding to the root XML node. ### Request Example ```python # Assuming 'converter' is an instance of the class containing to_xml converter.to_xml(out_file='output.xml', nsmap={'ns': 'http://example.com'}) ``` ### Response #### Success Response - **Element** - The etree object corresponding to the root XML node. #### Response Example ```python # The returned etree object can be further processed or inspected ``` ``` -------------------------------- ### get_entity_rel_diagram Source: https://cre-dev.github.io/xml2db/api/data_model Builds an entity relationship diagram for the data model using mermaid.js syntax. This diagram can be visualized in various platforms like PyCharm or GitHub. ```APIDOC ## get_entity_rel_diagram ### Description Generates a string representation of the Entity Relationship Diagram (ERD) for the data model. The ERD uses mermaid.js syntax for visualization. ### Method Python Function ### Parameters #### Arguments - **text_context** (bool) - Optional - Determines whether to include a title, text explanation, and other contextual information along with the ERD. Defaults to True. ### Returns - **str** - A string containing the ERD definition in mermaid.js format. ``` -------------------------------- ### Extract Data from Database to XML Source: https://cre-dev.github.io/xml2db/getting_started Extracts data from the database based on a WHERE clause and saves it into an XML file. This is useful for testing and round-trip data validation. ```python document = data_model.extract_from_database( root_select_where="xml2db_input_file_path='path/to/file.xml'", ) document.to_xml("extracted_file.xml") ``` -------------------------------- ### Drop All Temporary Tables Source: https://cre-dev.github.io/xml2db/api/data_model Drops all temporary (prefixed) tables used for data import. This operation is irreversible and should be used with caution. ```python def drop_all_temp_tables(self): """Drop the data model temporary (prefixed) tables. Danger: BE CAUTIOUS, THIS METHOD DROPS TABLES WITHOUT FURTHER NOTICE! """ for tb in self.fk_ordered_tables_reversed: tb.drop_temp_tables(self.engine) ``` -------------------------------- ### insert_into_target_tables Source: https://cre-dev.github.io/xml2db/api/document Inserts and merges data into the database by first loading it into temporary tables and then merging those into the final target tables. It supports transactional integrity and controls the batch size for insertions. ```APIDOC ## insert_into_target_tables(single_transaction=True, max_lines=-1) ### Description Insert and merge data into the database. This method inserts data into temporary tables and then merges these temporary tables into the final target tables. It ensures database consistency by optionally running all queries within a single transaction. ### Method This is a Python method call. ### Parameters #### Arguments - **single_transaction** (bool) - Optional - Should we run all queries in a single transaction, or isolate queries at the minimum scope required to ensure database consistency? Defaults to True. - **max_lines** (int) - Optional - The maximum number of lines to insert in a single statement when loading data to the temporary tables. Defaults to -1. ### Returns - **int** - The number of inserted rows. ### Source `xml2db/document.py` ``` -------------------------------- ### Python: Parse XML Document Source: https://cre-dev.github.io/xml2db/api/document Parses an XML file, applies transformations, and prepares data for database insertion. Supports validation skipping, iterative parsing, and recovery from malformed XML. Can append to existing data structures. ```python def parse_xml( self, xml_file: Union[str, BytesIO], metadata: dict = None, skip_validation: bool = True, iterparse: bool = True, recover: bool = False, flat_data: dict = None, ) -> None: """Parse an XML document and apply transformation corresponding to the target data model This method will first parse the XML file into a dict (document tree) using lxml and then compute hash for all nodes based on their content, and finally convert the document tree to tables data, creating primary keys and relations, ready to be inserted in the database. Args: xml_file: The path or the file object of an XML file to parse metadata: A dict of metadata values to add to the root table (a value for each key defined in `metadata_columns` passed to model config) skip_validation: Should we validate the document against the schema first? iterparse: Parse XML using iterative parsing, which is a bit slower but uses less memory recover: Should we try to parse incorrect XML? (argument passed to lxml parser) flat_data: An existing `document.data` dict from a previously parsed document. When provided, records from this XML file are appended to it rather than starting fresh, allowing multiple files to be accumulated in memory and inserted together with a single [`insert_into_target_tables`][xml2db.document.Document.insert_into_target_tables] call. """ self.xml_file_path = xml_file[:255] if isinstance(xml_file, str) else "" document_tree = self.model.xml_converter.parse_xml( xml_file=xml_file, file_path=self.xml_file_path, skip_validation=skip_validation, recover=recover, iterparse=iterparse, ) if self.model.model_config["document_tree_hook"] is not None: logger.info(f"Running document_tree_hook function for {self.xml_file_path}") document_tree = self.model.model_config["document_tree_hook"](document_tree) logger.info(f"Adding records to data model for {self.xml_file_path}") self.data = self.doc_tree_to_flat_data( document_tree, metadata=metadata, flat_data=flat_data, ) logger.debug(self.__repr__()) ``` -------------------------------- ### insert_into_temp_tables Source: https://cre-dev.github.io/xml2db/api/document Inserts data into temporary tables. This method first recreates the temporary tables and then populates them with data, allowing control over the maximum number of lines inserted per statement. ```APIDOC ## insert_into_temp_tables(max_lines=-1) ### Description Insert data into temporary tables. This method first (re)creates the temporary tables and then inserts data into them. It allows specifying the maximum number of lines to insert in a single statement for efficiency. ### Method This is a Python method call. ### Parameters #### Arguments - **max_lines** (int) - Optional - The maximum number of lines to insert in a single statement. Defaults to -1. ### Returns - **None** ### Source `xml2db/document.py` ``` -------------------------------- ### Generate Entity Relationship Diagram (ERD) Source: https://cre-dev.github.io/xml2db/getting_started Write a markdown file containing a visual representation of the data model as an Entity Relationship Diagram using the get_entity_rel_diagram method. ```python with open(f"target_data_model_erd.md", "w") as f: f.write(data_model.get_entity_rel_diagram()) ``` -------------------------------- ### drop_all_tables Source: https://cre-dev.github.io/xml2db/api/data_model Drops all target (unprefixed) tables for the data model. Use with caution as this method drops tables without further notice. ```APIDOC ## drop_all_tables() ### Description Drop the data model target (unprefixed) tables. BE CAUTIOUS, THIS METHOD DROPS TABLES WITHOUT FURTHER NOTICE! ### Method DELETE ### Endpoint /drop_all_tables ``` -------------------------------- ### Convert Document Tree to XML Source: https://cre-dev.github.io/xml2db/api/document Use this function to convert a nested dictionary representation of a document into an XML file. You can specify an output file path, a namespace map, and the indentation string for the XML output. ```python def to_xml( self, out_file: str = None, nsmap: dict = None, indent: str = " " ) -> etree.Element: """Convert a document tree (nested dict) into an XML file Args: out_file: If provided, write output to a file. nsmap: An optional namespace mapping. indent: A string used as indent in XML output. Returns: The etree object corresponding to the root XML node. """ converter = XMLConverter(self.model) converter.document_tree = self.flat_data_to_doc_tree() return converter.to_xml(out_file=out_file, nsmap=nsmap, indent=indent) ``` -------------------------------- ### Generate Entity Relationship Diagram Source: https://cre-dev.github.io/xml2db/api/data_model Generates a string representation of the data model's ERD using Mermaid.js syntax. Use this method to visualize the data model structure, with an option to include explanatory text. ```python def get_entity_rel_diagram(self, text_context: bool = True) -> str: """Build an entity relationship diagram for the data model The ERD syntax is used by mermaid.js to create a visual representation of the diagram, which is supported by Pycharm IDE or GitHub in markdown files, among others Args: text_context: Should we add a title, a text explanation, etc. or just the ERD? Returns: A string representation of the ERD """ out = ["erDiagram"] for tb in self.fk_ordered_tables_reversed: out += tb.get_entity_rel_diagram() if text_context: out = ( [ f"# {self.data_flow_long_name}\n", f"### Data model name: `{self.data_flow_name}`\n", ( "The following *Entity Relationships Diagram* represents the target data model, after the " "simplification of the source data model, but before the transformations performed to optimize " "data storage (transformation of `1-1` and `1-n` relationships into `n-1` and `n-n` " "relationships, respectively.\n" ), ( "As a consequence, not all tables of the actual data model used in the database are shown. " "Specifically, `1-n` relationships presented may be stored in the database using an additional " "relationship table (noted with an asterisk in the relationship name).\n" ), "```mermaid", ] + out + [ "```", ( "`-N` suffix in field type indicates that the field can have multiple values, which will be " "stored as comma separated values." ), ] ) return "\n".join(out) ``` -------------------------------- ### Extract Document from Database Source: https://cre-dev.github.io/xml2db/api/data_model Extracts a Document object from the database using a WHERE clause on the root table. This method is not optimized for performance and is primarily used for integration testing. ```python def extract_from_database( self, root_select_where: str, force_tz: Union[str, None] = None, ) -> Document: """Extract a document from the database, based on a where clause applied to the root table. For instance, you can use the column `xml2db_input_file_path` to filter the data loaded from a specific file. It will query all the data in the database corresponding to the rows that you select from the root table of your data model. Typically, a single XML file will correspond to a single row in the root table. This function will query the data tree below this record. This method was not optimized for performance and can be quite slow. It is used in integration tests to check the output against the data inserted into the database. Args: root_select_where: A where clause to filter the root table of the model, as a string force_tz: Apply this timezone if database returns timezone-naïve datetime Returns: A [`Document`](document.md) object containing extracted data Examples: """ doc = Document(self) doc.extract_from_database(self.root_table, root_select_where, force_tz=force_tz) return doc ``` -------------------------------- ### drop_all_temp_tables Source: https://cre-dev.github.io/xml2db/api/data_model Drops all temporary (prefixed) tables for the data model. Use with caution as this method drops tables without further notice. ```APIDOC ## drop_all_temp_tables() ### Description Drop the data model temporary (prefixed) tables. BE CAUTIOUS, THIS METHOD DROPS TABLES WITHOUT FURTHER NOTICE! ### Method DELETE ### Endpoint /drop_all_temp_tables ```