xbrl-core
https://github.com/youseiushida/xbrl-core
# XBRL Core

XBRL Core is a pure-Python parser and structured data extraction library for XBRL 2.1 instance documents and iXBRL (Inline XBRL) documents. It provides comprehensive support for fact extraction, context/unit structuring, all five linkbase types (presentation, calculation, definition, label, reference), XSD schema parsing, calculation validation, text block extraction, pandas/DataFrame conversion, and Rich/HTML rendering. The only required dependency is lxml.

The library is designed for processing financial filings from regulatory bodies like the SEC and EDINET, making it ideal for financial data analysis, compliance checking, and building financial data pipelines. It supports both strict and lenient parsing modes, allowing graceful handling of non-compliant documents while providing detailed error reporting for debugging.

## Parsing XBRL Instance Documents

The `parse_xbrl_facts()` function parses XBRL 2.1 instance documents from raw bytes and extracts facts, contexts, units, schema references, footnote links, and ignored elements into a structured `ParsedXBRL` object.
```python
from xbrl_core import parse_xbrl_facts

# Parse an XBRL instance document
with open("instance.xbrl", "rb") as f:
    parsed = parse_xbrl_facts(f.read(), source_path="instance.xbrl")

# Access parsed data
print(f"Total facts: {parsed.fact_count}")
print(f"Contexts: {len(parsed.contexts)}")
print(f"Units: {len(parsed.units)}")

# Iterate over facts
for fact in parsed.facts[:5]:
    print(f"{fact.local_name}: {fact.value_raw} (context={fact.context_ref})")

# Access schema references
for schema_ref in parsed.schema_refs:
    print(f"Schema: {schema_ref.href}")

# Lenient parsing mode for non-compliant documents
parsed_lenient = parse_xbrl_facts(xbrl_bytes, strict=False)
for elem in parsed_lenient.ignored_elements:
    print(f"Ignored: {elem.reason} at line {elem.source_line}")
```

## Parsing iXBRL (Inline XBRL) Documents

The `parse_ixbrl_facts()` function parses iXBRL documents embedded in XHTML, returning the same `ParsedXBRL` structure for seamless integration with downstream pipelines. It automatically handles format attributes, scale/sign transformations, and continuation elements.
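For intuition about what the scale/sign handling involves: under the Inline XBRL rules, a numeric fact's value is the formatted text with separators removed, multiplied by 10 to the power of the `scale` attribute, and negated when `sign="-"`. A minimal self-contained illustration of that arithmetic (the helper below is not part of the xbrl_core API — the library applies these transformations internally):

```python
from decimal import Decimal

def apply_ix_transforms(text: str, scale: int = 0, sign: str = "") -> Decimal:
    """Illustrative iXBRL numeric transform: strip thousands separators,
    apply the scale exponent, then the sign attribute."""
    value = Decimal(text.replace(",", ""))
    value *= Decimal(10) ** scale
    if sign == "-":
        value = -value
    return value

# "1,234" with scale=3 means 1,234,000; sign="-" negates it
print(apply_ix_transforms("1,234", scale=3, sign="-"))  # -1234000
```

Real filings also carry locale-specific `format` attributes (dates, numbers with non-ASCII digits); those are what the `FormatRegistry` shown below lets you customize.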
```python
from pathlib import Path

from xbrl_core import parse_ixbrl_facts, merge_ixbrl_results, FormatRegistry

# Parse a single iXBRL document
with open("report.htm", "rb") as f:
    parsed = parse_ixbrl_facts(f.read(), source_path="report.htm")

for fact in parsed.facts[:5]:
    print(f"{fact.local_name}: {fact.value_raw}")

# Custom format registry for locale-specific formats
registry = FormatRegistry()
registry.register(
    "dateyearmonthdaycjk",
    lambda s: s.replace("年", "-").replace("月", "-").replace("日", ""),
)
parsed = parse_ixbrl_facts(ixbrl_bytes, format_registry=registry)

# Merge multiple iXBRL files (IXDS - Inline XBRL Document Set)
ixbrl_files = [Path(name).read_bytes() for name in ["part1.htm", "part2.htm", "part3.htm"]]
results = [parse_ixbrl_facts(data) for data in ixbrl_files]
merged = merge_ixbrl_results(results)
print(f"Merged facts: {merged.fact_count}")
```

## Structuring Contexts

The `structure_contexts()` function converts raw context XML fragments into typed `StructuredContext` objects with period, entity, and dimension information. The `ContextCollection` class provides filtering and query operations.

```python
from xbrl_core import structure_contexts, ContextCollection

# Structure contexts from parsed XBRL
ctx_map = structure_contexts(parsed.contexts)

# Access an individual context
ctx = ctx_map["CurrentYearInstant"]
print(f"Period: {ctx.period}")          # InstantPeriod(instant=datetime.date(2024, 3, 31))
print(f"Entity: {ctx.entity_id}")       # "E00001"
print(f"Dimensions: {ctx.dimensions}")  # tuple[DimensionMember, ...]
print(f"Is instant: {ctx.is_instant}")  # True

# Use ContextCollection for filtering
coll = ContextCollection(ctx_map)

# Filter by period type
instant_contexts = coll.filter_instant()
duration_contexts = coll.filter_duration()

# Filter by dimensions
no_dim_contexts = coll.filter_no_dimensions()
segment_contexts = coll.filter_by_dimension(
    axis="{http://example.com/taxonomy}ProductAxis",
    member="{http://example.com/taxonomy}SegmentA",
)

# Get latest periods
print(f"Latest instant: {coll.latest_instant_period}")
print(f"Unique durations: {coll.unique_duration_periods}")

# Get contexts for the latest period
latest_instant_ctxs = coll.latest_instant_contexts()
```

## Structuring Units

The `structure_units()` function converts raw unit XML fragments into typed `StructuredUnit` objects with support for simple measures and divide measures (e.g., per-share units).

```python
from xbrl_core import structure_units

# Structure units from parsed XBRL
unit_map = structure_units(parsed.units)

# Access a monetary unit
jpy_unit = unit_map["JPY"]
print(f"Is monetary: {jpy_unit.is_monetary}")    # True
print(f"Currency code: {jpy_unit.currency_code}")  # "JPY"

# Access a pure number unit
pure_unit = unit_map["pure"]
print(f"Is pure: {pure_unit.is_pure}")  # True

# Access a per-share unit (currency / shares)
per_share_unit = unit_map["JPYPerShare"]
print(f"Is per-share: {per_share_unit.is_per_share}")    # True
print(f"Currency code: {per_share_unit.currency_code}")  # "JPY"

# Access a shares unit
shares_unit = unit_map["shares"]
print(f"Is shares: {shares_unit.is_shares}")  # True
```

## Building LineItems

The `build_line_items()` function merges `RawFact`, `StructuredContext`, and optional `LabelResolver` into fully typed `LineItem` objects with proper value conversion and label resolution.
```python
from xbrl_core import build_line_items, structure_contexts

# Parse and structure
ctx_map = structure_contexts(parsed.contexts)

# Build LineItems with multiple language labels
items = build_line_items(parsed.facts, ctx_map, langs=("en", "ja"))

for item in items[:10]:
    print(f"Concept: {item.local_name}")
    print(f"Value: {item.value}")    # Decimal for numeric, str for text
    print(f"Period: {item.period}")  # InstantPeriod or DurationPeriod
    print(f"Entity: {item.entity_id}")
    print(f"Unit: {item.unit_ref}")
    print(f"Decimals: {item.decimals}")
    print(f"Dimensions: {item.dimensions}")
    print(f"English label: {item.label('en')}")
    print(f"Japanese label: {item.label('ja')}")
    print("---")
```

## Parsing Presentation Linkbase

The `parse_presentation_linkbase()` function parses presentation linkbases to extract hierarchical concept relationships for displaying financial statements in proper order.

```python
from xbrl_core import parse_presentation_linkbase, merge_presentation_trees

with open("taxonomy_pre.xml", "rb") as f:
    trees = parse_presentation_linkbase(f.read())

# Iterate over role URIs and trees
for role_uri, tree in trees.items():
    print(f"Role: {role_uri}")

    # Flatten the tree (depth-first traversal)
    for node in tree.flatten(skip_abstract=True, skip_dimension=True):
        indent = " " * node.depth
        print(f"{indent}{node.concept} (order={node.order})")

    # Get only line-items subtree roots
    for node in tree.line_items_roots():
        print(f"Line item root: {node.concept}")

# Merge multiple presentation linkbases
with open("extension_pre.xml", "rb") as f:
    trees_ext = parse_presentation_linkbase(f.read())
merged_trees = merge_presentation_trees(trees, trees_ext)
```

## Parsing Calculation Linkbase

The `parse_calculation_linkbase()` function parses calculation linkbases to extract summation-item relationships with weights for validation purposes.
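Conceptually, each calculation arc contributes `weight * child` to its parent total, and validation later checks that total against the reported parent within a decimals-based tolerance. A self-contained sketch of that arithmetic (not xbrl_core code; the tolerance shown — half a unit in the last reported digit, i.e. `0.5 * 10**(-decimals)` — is one common reading of XBRL 2.1 section 5.2.5.2):

```python
from decimal import Decimal

def weighted_sum_ok(parent: Decimal, children: list[tuple[Decimal, int]], decimals: int) -> bool:
    """Check parent ~= sum of weight * child, within half a unit of the
    last reported digit implied by the decimals attribute."""
    total = sum((value * weight for value, weight in children), Decimal(0))
    tolerance = Decimal(5) * Decimal(10) ** (-decimals - 1)  # 0.5 * 10**(-decimals)
    return abs(parent - total) <= tolerance

# GrossProfit = NetSales (weight +1) + CostOfSales (weight -1),
# all figures reported to the nearest thousand (decimals=-3)
print(weighted_sum_ok(Decimal("400000"),
                      [(Decimal("1000200"), 1), (Decimal("600300"), -1)],
                      decimals=-3))  # True: |400000 - 399900| = 100 <= 500
```

The library's own `validate_calculations()` (shown later) performs this kind of check across every summation-item tree.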
```python
from xbrl_core import parse_calculation_linkbase

with open("taxonomy_cal.xml", "rb") as f:
    calc_lb = parse_calculation_linkbase(f.read())

print(f"Roles: {len(calc_lb.role_uris)}")

# Iterate over calculation trees
for role_uri in calc_lb.role_uris:
    tree = calc_lb.get_tree(role_uri)
    print(f"Role: {role_uri}, Roots: {tree.roots}")
    for arc in tree.arcs:
        sign = "+" if arc.weight == 1 else "-"
        print(f"  {arc.parent} {sign}-> {arc.child}")

# Query relationships
children = calc_lb.children_of("GrossProfit")
for arc in children:
    print(f"Child of GrossProfit: {arc.child} (weight={arc.weight})")

parents = calc_lb.parent_of("NetSales")
for arc in parents:
    print(f"Parent of NetSales: {arc.parent}")

# Get ancestor chain to root
ancestors = calc_lb.ancestors_of("NetSales", role_uri=role_uri)
print(f"Ancestors: {ancestors}")
```

## Parsing Definition Linkbase

The `parse_definition_linkbase()` function parses definition linkbases to extract dimensional relationships including hypercubes, axes, and domain members.

```python
from xbrl_core import parse_definition_linkbase, ARCROLE_HYPERCUBE_DIMENSION, ARCROLE_DOMAIN_MEMBER

with open("taxonomy_def.xml", "rb") as f:
    def_lb = parse_definition_linkbase(f.read())

# Iterate over definition trees
for role_uri in def_lb.role_uris:
    tree = def_lb.get_tree(role_uri)

    # Access hypercube (dimensional table) information
    for hc in tree.hypercubes:
        print(f"Table: {hc.table_concept}")
        for axis in hc.axes:
            print(f"  Axis: {axis.axis_concept}")
            if axis.domain:
                print(f"    Domain: {axis.domain.concept}")
                for member in axis.domain.children:
                    print(f"      Member: {member.concept}")

# Query relationships by arcrole
dimensions = def_lb.children_of("TableConcept", arcrole=ARCROLE_HYPERCUBE_DIMENSION)
members = def_lb.parent_of("MemberConcept", arcrole=ARCROLE_DOMAIN_MEMBER)
```

## Parsing Label Linkbase

The `parse_label_linkbase()` function extracts human-readable labels for concepts in multiple languages and roles.
```python
from xbrl_core import parse_label_linkbase

with open("taxonomy_lab.xml", "rb") as f:
    labels = parse_label_linkbase(f.read())

# Iterate over labels
for lab in labels:
    print(f"{lab.concept_name} [{lab.lang}] ({lab.role}): {lab.text}")

# Custom concept extractor for jurisdiction-specific taxonomies
import re

def edinet_concept_extractor(href: str) -> str | None:
    if "#" not in href:
        return None
    fragment = href.rsplit("#", 1)[1]
    m = re.search(r"_([A-Z][A-Za-z0-9]*)$", fragment)
    return m.group(1) if m else fragment

labels = parse_label_linkbase(xml_bytes, concept_extractor=edinet_concept_extractor)
```

## Parsing Reference Linkbase

The `parse_reference_linkbase()` function extracts authoritative references (accounting standards, regulations) associated with concepts.

```python
from xbrl_core import parse_reference_linkbase

with open("taxonomy_ref.xml", "rb") as f:
    refs = parse_reference_linkbase(f.read())

for ref in refs:
    print(f"Concept: {ref.concept_name}")
    print(f"Role: {ref.role}")
    for part in ref.parts:
        print(f"  {part.local_name}: {part.value}")
    print("---")
```

## Parsing Footnotes

The `parse_footnote_links()` function extracts footnotes from XBRL instance documents and maps them to their associated facts.

```python
from xbrl_core import parse_footnote_links

# Parse footnotes from parsed XBRL
footnote_map = parse_footnote_links(parsed.footnote_links)

# Get footnotes for a specific fact
fact_id = "IdFact1234"
notes = footnote_map.get(fact_id)
if notes:
    for note in notes:
        print(f"Footnote: {note.text} (lang={note.lang})")

# List all fact IDs with footnotes
print(f"Facts with footnotes: {footnote_map.fact_ids}")
print(f"Total facts with footnotes: {len(footnote_map)}")
```

## Parsing XSD Schema Elements

The `parse_xsd_elements()` function extracts element definitions from taxonomy XSD files, including period type, balance, abstract flag, and substitution group.
```python
from xbrl_core import parse_xsd_elements

with open("taxonomy.xsd", "rb") as f:
    elements = parse_xsd_elements(f.read())

elem = elements["NetSales"]
print(f"Period type: {elem.period_type}")  # "duration"
print(f"Balance: {elem.balance}")          # "credit"
print(f"Abstract: {elem.abstract}")        # False
print(f"Type: {elem.type_name}")           # "xbrli:monetaryItemType"
print(f"Substitution group: {elem.substitution_group}")  # "xbrli:item"
```

## Calculation Validation

The `validate_calculations()` function validates summation-item relationships per XBRL 2.1 section 5.2.5.2, with decimals-based rounding tolerance.

```python
from xbrl_core import validate_calculations, parse_calculation_linkbase, build_line_items, structure_contexts

# Build LineItems
ctx_map = structure_contexts(parsed.contexts)
items = build_line_items(parsed.facts, ctx_map)

# Parse calculation linkbase
with open("taxonomy_cal.xml", "rb") as f:
    calc_lb = parse_calculation_linkbase(f.read())

# Validate calculations
result = validate_calculations(items, calc_lb)
print(result)  # "Calculation validation: PASS (checked=42, passed=42, errors=0, skipped=3)"
print(f"Valid: {result.is_valid}")
print(f"Errors: {result.error_count}")
print(f"Warnings: {result.warning_count}")

# Examine individual issues
for issue in result.issues:
    print(f"Concept: {issue.parent_concept}")
    print(f"Expected: {issue.expected}, Actual: {issue.actual}")
    print(f"Difference: {issue.difference}, Tolerance: {issue.tolerance}")
    print(f"Severity: {issue.severity}")
    print(f"Message: {issue.message}")

# Validate a specific role only
result = validate_calculations(items, calc_lb, role_uri="http://example.com/role/BalanceSheet")
```

## Text Block Extraction

The `extract_text_blocks()` function extracts textBlockItemType facts (MD&A, risk factors, notes) from filings. The `clean_html()` function converts HTML to plain text.
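For intuition, the core of any HTML-to-text pass is collecting text nodes and normalizing whitespace; xbrl_core's `clean_html()` additionally preserves table structure. A stdlib-only sketch of the basic idea (not the library's implementation):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Minimal HTML-to-text sketch: keep text nodes, drop tags,
    collapse surrounding whitespace."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        if data.strip():
            self.parts.append(data.strip())

    def text(self) -> str:
        return " ".join(self.parts)

p = TextExtractor()
p.feed("<div><p>Business risks include <b>currency</b> fluctuation.</p></div>")
print(p.text())  # Business risks include currency fluctuation.
```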
```python
from xbrl_core import extract_text_blocks, clean_html, structure_contexts

ctx_map = structure_contexts(parsed.contexts)
blocks = extract_text_blocks(parsed.facts, ctx_map)

for block in blocks:
    print(f"Concept: {block.concept}")  # "BusinessRisksTextBlock"
    print(f"Period: {block.period}")    # DurationPeriod(...)
    print(f"Context: {block.context_ref}")

    # Convert HTML to plain text (preserves table structure)
    plain_text = clean_html(block.html)
    print(f"Content preview: {plain_text[:500]}...")
    print("---")
```

## DataFrame Conversion

The `line_items_to_dataframe()` function converts LineItems to a pandas DataFrame. Export functions support CSV, Parquet, and Excel formats.

```python
from xbrl_core import line_items_to_dataframe, to_csv, to_parquet, to_excel

# Convert to DataFrame
df = line_items_to_dataframe(items, label_lang="en")

# View data
print(df[["local_name", "label", "value", "period_end", "unit_ref"]].head(20))

# Filter numeric facts
numeric_df = df[df["unit_ref"].notna()]

# Group by period
by_period = df.groupby("period_end")["value"].sum()

# Export to various formats
to_csv(df, "financial_data.csv")
to_parquet(df, "financial_data.parquet")
to_excel(df, "financial_data.xlsx", sheet_name="BalanceSheet")

# Add metadata
df = line_items_to_dataframe(items, metadata={"source": "SEC Filing", "entity": "ACME Corp"})
print(df.attrs)  # {'source': 'SEC Filing', 'entity': 'ACME Corp'}
```

## Rich Terminal Display

The `render_statement()` and `render_hierarchical_statement()` functions create formatted Rich tables for terminal display with proper indentation and styling.
```python
from rich.console import Console
from xbrl_core import render_statement, render_hierarchical_statement, DisplayHint, build_display_rows

console = Console()

# Simple flat table
table = render_statement(items, title="Income Statement", label_lang="en")
console.print(table)

# Hierarchical display with presentation hints
hints = [
    DisplayHint(concept="AssetsAbstract", depth=0, is_abstract=True, label="Assets"),
    DisplayHint(concept="CashAndDeposits", depth=1),
    DisplayHint(concept="AccountsReceivable", depth=1),
    DisplayHint(concept="CurrentAssets", depth=1, is_total=True, label="Total Current Assets"),
    DisplayHint(concept="TotalAssets", depth=0, is_total=True, label="Total Assets"),
]
table = render_hierarchical_statement(items, hints=hints, title="Balance Sheet", label_lang="en")
console.print(table)

# Get raw DisplayRow objects for custom rendering
rows = build_display_rows(items, hints=hints, label_lang="en")
for row in rows:
    indent = " " * row.depth
    print(f"{indent}{row.label}: {row.value}")
```

## HTML Display (Jupyter)

The `to_html()` function generates HTML tables suitable for Jupyter notebooks with hierarchical display support.

```python
from xbrl_core import to_html, DisplayHint
from IPython.display import HTML, display

hints = [
    DisplayHint(concept="RevenueAbstract", depth=0, is_abstract=True, label="Revenue"),
    DisplayHint(concept="NetSales", depth=1),
    DisplayHint(concept="OtherRevenue", depth=1),
    DisplayHint(concept="TotalRevenue", depth=0, is_total=True),
]
html = to_html(items, hints=hints, title="Income Statement")
display(HTML(html))
```

## Error Handling

All errors inherit from `XbrlError` and carry structured error codes and context for debugging. Custom error and warning classes can be substituted for domain-specific packages.
```python
from xbrl_core import XbrlError, XbrlParseError, XbrlValidationError, XbrlWarning
import warnings

# Catch parsing errors
try:
    parsed = parse_xbrl_facts(bad_bytes)
except XbrlParseError as e:
    print(f"Error code: {e.code}")  # "XBRL_PARSE_001"
    print(f"Context: {e.context}")  # {"source_path": "..."}

# Capture warnings
with warnings.catch_warnings(record=True) as w:
    warnings.simplefilter("always")
    parsed = parse_xbrl_facts(xbrl_bytes, strict=False)
    for warning in w:
        if issubclass(warning.category, XbrlWarning):
            print(f"Warning: {warning.message}")

# Custom error/warning classes for domain-specific packages
class EdinetParseError(XbrlParseError):
    """EDINET-specific parse error."""

class EdinetWarning(UserWarning):
    """EDINET-specific warning."""

from xbrl_core import parse_calculation_linkbase

lb = parse_calculation_linkbase(
    xml_bytes,
    error_class=EdinetParseError,
    warning_class=EdinetWarning,
)
```

## Summary

XBRL Core provides a comprehensive toolkit for parsing and analyzing XBRL and iXBRL financial filings. The main use cases include extracting financial data from regulatory filings (SEC 10-K/10-Q, EDINET reports), validating calculation relationships, converting financial statements to pandas DataFrames for analysis, and building data pipelines for financial research and compliance systems.

The library's design ensures that parsed data flows seamlessly through the pipeline: raw bytes are parsed into `ParsedXBRL`, contexts and units are structured, facts are converted to `LineItem` objects, and finally exported to DataFrames or displayed in formatted tables. Integration patterns typically follow a linear pipeline: parse documents with `parse_xbrl_facts()` or `parse_ixbrl_facts()`, structure contexts with `structure_contexts()`, build typed line items with `build_line_items()`, optionally validate calculations, and export to DataFrame or display formats.
The library supports both strict mode for production validation and lenient mode for handling non-compliant documents. Custom format registries, concept extractors, and error classes allow adaptation to jurisdiction-specific taxonomies like EDINET (Japan) or TDNET while maintaining the core API structure.