### Install Textractor

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/tabular_data_linearization.ipynb.txt

Install the Textractor library (published on PyPI as `amazon-textract-textractor`) to begin working with Amazon Textract. This is a prerequisite for the subsequent code examples.

```bash
pip install amazon-textract-textractor
```

--------------------------------

### Basic Query Example

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/using_queries.ipynb.txt

Demonstrates a simple query to retrieve a specific field. Ensure the 'textractor' library is installed.

```python
from textractor import Textractor

txt = Textractor("us-east-1")

# Query for a specific field
response = txt.query("What is the invoice number?")
print(response.query_results[0].value)
```

--------------------------------

### Basic Query Example

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/using_queries.ipynb.txt

Demonstrates a fundamental query to extract data. Ensure you have the necessary Textractor setup before running.

```python
from textractor.tools.utils import Query

query = Query("What is the total amount?")
```

--------------------------------

### Direct Path Query Example

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/using_queries.ipynb.txt

Demonstrates a direct path query to extract specific information. Ensure the 'textractor' library is installed.

```python
from textractor.tools.utils import get_document

doc = get_document("path/to/your/document.pdf")
query = "What is the total amount?"
result = doc.query(query)
print(result)
```

--------------------------------

### Basic Query Example

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/using_queries.ipynb.txt

Demonstrates a fundamental query to extract specific information. Ensure you have the necessary Textractor setup.
```python
from textractor.tools.utils import query_table

# Assuming 'doc' is a loaded Textractor document object
# Example: Query for a specific value in a table
result = query_table(doc, "column_name", "value_to_find")
```

--------------------------------

### Basic Query Example

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/using_queries.ipynb.txt

Demonstrates a simple query to extract specific text. Ensure the 'textractor' library is installed.

```python
from textractor import Textractor

txt = Textractor("us-east-1")

# Example: Extract text based on a query
query = "What is the invoice number?"
response = txt.query(query=query)
print(response.text)
```

--------------------------------

### Start Document Analysis (Example)

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/layout_analysis_for_text_linearization.ipynb.txt

Example of starting a Textract document analysis job. Ensure SNS topic and role ARNs are valid.

```python
textract_client.start_document_analysis(
    DocumentLocation={'S3Object': {'Bucket': 'your-bucket-name', 'Name': 'your-document.pdf'}},
    FeatureTypes=['TABLES'],
    NotificationChannel={'SNSTopicArn': 'your-sns-topic-arn', 'RoleArn': 'your-sns-role-arn'}
)
```

--------------------------------

### Basic Query Example

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/using_queries.ipynb.txt

Demonstrates a simple query to retrieve all pages from a document. Ensure the Textractor library is installed and imported.

```python
from textractor import Textractor

txtr = Textractor(filename="document.pdf")

# Get all pages
response = txtr.query(queries=["all pages"])
print(response.pages)
```

--------------------------------

### Basic Query Example

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/using_queries.ipynb.txt

Demonstrates a simple query to extract text based on a keyword.
Ensure the 'textractor' library is installed.

```python
from textractor import Textractor

txt = Textractor()

# Example: Extract text after the word "Invoice"
query = "Invoice"
result = txt.get_text_after(query)
print(result)
```

--------------------------------

### Basic Query Example

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/using_queries.ipynb.txt

Demonstrates a simple query to find a specific word in a document. Ensure the 'textractor' library is installed.

```python
from textractor import Textractor

txt = Textractor("us-east-1")

# Query for a specific word
results = txt.query("invoice")
for result in results:
    print(result)
```

--------------------------------

### Basic Query Example

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/using_queries.ipynb.txt

Demonstrates a simple query to find all 'Invoice ID' fields. Ensure the Textractor library is installed and imported.

```python
from textractor.tools.query import Query

query = Query("Invoice ID")
```

--------------------------------

### Basic Query Example

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/using_queries.ipynb.txt

Demonstrates a simple query to find a specific key-value pair. Ensure you have the Textractor library installed and imported.

```python
from textractor.tools.utils import load_document
from textractor.data.document import Document

doc: Document = load_document("path/to/your/document.pdf")

# Example: Find the value associated with the key "Invoice Number"
query = "Invoice Number"
results = doc.query(query)
for result in results:
    print(f"Found: {result.value} at page {result.page_number}")
```

--------------------------------

### Direct Query Example

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/using_queries.ipynb.txt

Demonstrates a direct query to Textractor for processing a document.
Ensure you have the necessary imports and document setup.

```python
from textractor import Textractor

txt = Textractor("us-east-1")
doc = txt.open_document("document.pdf")

# Direct query
response = doc.query("What is the total amount?")
print(response.answer)
```

--------------------------------

### Basic Query Example

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/using_queries.ipynb.txt

Demonstrates a simple query to find a specific string within a document. Ensure the Textractor library is installed and imported.

```python
from textractor import Textractor

txt = Textractor("YOUR_ACCESS_KEY_ID", "YOUR_SECRET_ACCESS_KEY", "YOUR_REGION_NAME")

# Example: Query for a specific string
response = txt.query("Find the invoice number.")
print(response.text)
```

--------------------------------

### Example Usage

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/document_linearization_to_markdown_or_html.ipynb.txt

Provides a complete example of creating a Document object and then linearizing it to both Markdown and HTML formats.
```python
# Create a document
doc = Document("My Document")

# Add sections and content
intro_section = Section("Introduction")
intro_section.add_content("This is the introduction to my document.")
doc.add_section(intro_section)

body_section = Section("Main Content")
body_section.add_content("This is the main content of the document.")
body_section.add_content("Here is another paragraph.")
doc.add_section(body_section)

# Convert to Markdown
markdown_output = to_markdown(doc)
print("--- Markdown Output ---")
print(markdown_output)

# Convert to HTML
html_output = to_html(doc)
print("--- HTML Output ---")
print(html_output)
```

--------------------------------

### Install Amazon Textract Textractor from Source

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/installation.rst.txt

Install the package in editable mode after cloning the repository and installing requirements.

```bash
pip install -e .
```

--------------------------------

### Get Signature Information from a Page

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/signature_detection.ipynb.txt

This example shows how to retrieve signature information specifically from a single page of a document. Ensure the 'textractor' library is installed.

```python
from textractor import Textractor

txt = Textractor(textract_client=None)

# Get signatures from a specific page (e.g., page 1)
doc = txt.start("document.pdf", pages=[1])
signatures = doc.pages[0].signatures

# Print signature information for the page
for signature in signatures:
    print(f"Signature found at: {signature.geometry}")
    print(f"Confidence: {signature.confidence}")
```

--------------------------------

### Install Requirements from Source

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/installation.rst.txt

After cloning the repository, install the necessary requirements using the provided requirements.txt file.
```bash
pip install -r requirements.txt
```

--------------------------------

### Get Document Analysis Results (Example)

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/layout_analysis_for_text_linearization.ipynb.txt

Example of retrieving Textract analysis results using a Job ID. This is a simplified retrieval.

```python
response = textract_client.get_document_analysis(JobId='your-job-id')
results = response['Blocks']
```

--------------------------------

### Python: Example Usage

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/document_linearization_to_markdown_or_html.ipynb.txt

This is an example of how to use the notebook_to_markdown function, including commented-out lines for specifying the notebook path and printing the output.

```python
# Example usage:
# notebook_path = 'path/to/your/notebook.ipynb'
# markdown_content = notebook_to_markdown(notebook_path)
# print(markdown_content)
```

--------------------------------

### Install amazon-textract-textractor

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/simple_ocr.ipynb.txt

Install the package using pip. Consider installing PDF extra dependencies if your workflow uses PDFs.

```bash
pip install amazon-textract-textractor
```

```bash
pip install amazon-textract-textractor[pdfium]
```

--------------------------------

### Example Usage: Converting Extracted Table

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/table_data_to_various_formats.ipynb.txt

This example shows how to use the conversion functions with sample table data. Ensure you have the necessary libraries installed.
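The snippet that follows calls `table_to_csv` and `table_to_json` without defining them in this excerpt. A minimal sketch of what such helpers might look like, assuming each table is a list of dictionaries that share the same keys. Both function bodies are assumptions for illustration, not the notebook's own code, and `table_to_parquet` is left out because it would need a third-party dependency.

```python
import csv
import io
import json


def table_to_csv(rows):
    """Hypothetical helper: serialize a list of row dicts to a CSV string."""
    if not rows:
        return ""
    buffer = io.StringIO()
    # Column order comes from the first row; every row is assumed to share its keys.
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def table_to_json(rows):
    """Hypothetical helper: serialize a list of row dicts to an indented JSON string."""
    return json.dumps(rows, indent=2)
```

Returning strings rather than writing files keeps both helpers easy to print, as the example does.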
```python
sample_table_data = [
    {'Column1': 'Row1Value1', 'Column2': 'Row1Value2'},
    {'Column1': 'Row2Value1', 'Column2': 'Row2Value2'}
]

# Convert to CSV
csv_output = table_to_csv(sample_table_data)
print("--- CSV Output ---")
print(csv_output)

# Convert to JSON
json_output = table_to_json(sample_table_data)
print("\n--- JSON Output ---")
print(json_output)

# Convert to Parquet (saves to a file named 'output.parquet')
table_to_parquet(sample_table_data, 'output.parquet')
print("\n--- Parquet Output ---")
print("Table data saved to output.parquet")
```

--------------------------------

### Python Image Processing Setup

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/visualizing_results.ipynb.txt

Basic imports required for image manipulation and drawing using Pillow. Ensure Pillow is installed (`pip install Pillow`).

```python
from PIL import Image, ImageDraw
import json
```

--------------------------------

### Basic Query Example

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/using_queries.ipynb.txt

Demonstrates a simple query to retrieve all documents. This is a foundational step for more complex operations.

```python
from textractor.tools.document import Document
from textractor.tools.document.query import Query

doc = Document("path/to/your/document.pdf")

# Get all documents
all_docs = Query().all().get(doc)
print(f"Found {len(all_docs)} documents.")
```

--------------------------------

### Basic Query Example

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/using_queries.ipynb.txt

Demonstrates a simple query to extract data. Ensure you have the necessary imports.
```python
from textractor.tools.utils import get_document_information

doc = get_document_information("path/to/your/document.pdf")

# Example query: Extract all tables
tables = doc.query("tables")

# Example query: Extract all forms
forms = doc.query("forms")

# Example query: Extract all text
text = doc.query("text")
```

--------------------------------

### Start Expense Analysis (Async)

Source: https://aws-samples.github.io/amazon-textract-textractor/commandline.html

Initiates asynchronous analysis for expense documents. Requires input files to be in S3 or uploaded using `--s3-upload-path`.

```bash
textractor start-expense-analysis [-h] [--s3-upload-path S3_UPLOAD_PATH] [--s3-output-path S3_OUTPUT_PATH] [--profile-name PROFILE_NAME] [--region-name REGION_NAME] file_source
```

--------------------------------

### Get Text by Product Instruction

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/layout_analysis_for_text_linearization.ipynb.txt

Retrieves detected product instructions from documents. Useful for user manuals and guides.

```python
print(doc.get_text_by_product_instruction())
```

--------------------------------

### Initialize Bedrock Client

Source: https://aws-samples.github.io/amazon-textract-textractor/notebooks/textractor_for_large_language_models.html

Sets up the AWS region and Bedrock endpoint URL, then initializes the Bedrock client for invoking models.
```python
import os
import boto3
import json
from PIL import Image
from textractor import Textractor
from textractor.visualizers.entitylist import EntityList
from textractor.data.constants import TextractFeatures


def get_response_from_claude(context, prompt_data):
    body = json.dumps({
        "prompt": f"""Human: Given the following document: {context}
        Answer the following:\n {prompt_data}
        Assistant:""",
        "max_tokens_to_sample": 2000,
        "top_k": 1,
    })
    modelId = f'anthropic.claude-instant-v1'  # change this to use a different version from the model provider
    accept = '*/*'
    contentType = 'application/json'
    response = bedrock.invoke_model(body=body, modelId=modelId, accept=accept, contentType=contentType)
    response_body = json.loads(response.get('body').read())
    answer = response_body.get('completion')
    return answer


os.environ["AWS_DEFAULT_REGION"] = "us-west-2"
os.environ["BEDROCK_ENDPOINT_URL"] = "https://bedrock-runtime.us-west-2.amazonaws.com"
bedrock = boto3.client(service_name='bedrock-runtime', region_name='us-west-2', endpoint_url='https://bedrock-runtime.us-west-2.amazonaws.com')
```

--------------------------------

### Initialize Textractor and Analyze Document with Queries

Source: https://aws-samples.github.io/amazon-textract-textractor/notebooks/using_queries.html

Initialize the Textractor client and use the analyze_document method with TextractFeatures.QUERIES and a list of queries. This example uses an image file as the document source.
```python
import os

from PIL import Image  # needed for Image.open below
from textractor import Textractor
from textractor.data.constants import TextractFeatures

extractor = Textractor(profile_name="default")

# `queries` is assumed to have been defined earlier as a list of queries
document = extractor.analyze_document(
    file_source=Image.open("../../../tests/fixtures/form.png"),
    features=[TextractFeatures.QUERIES],
    queries=queries
)
```

--------------------------------

### Example Usage: Analyze and Print Tables

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/tabular_data_linearization.ipynb.txt

Demonstrates the complete workflow: analyzing a document for tables and then printing the linearized results. Replace 'your-bucket-name' and 'your-document.pdf' with your actual S3 bucket and document.

```python
bucket = 'your-bucket-name'
document = 'your-document.pdf'

response = analyze_document_tables(bucket, document)
table_data = get_table_results(response)
print_table_data(table_data)
```

--------------------------------

### Basic Query Example

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/using_queries.ipynb.txt

Demonstrates a simple query to extract a specific piece of information. Ensure you have initialized the Textractor client.

```python
from textractor.tools.utils import query_document

query = "What is the total amount?"
response = query_document(document=doc, query=query)
print(response)
```

--------------------------------

### Basic Query Example

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/using_queries.ipynb.txt

Demonstrates a simple query to find a specific piece of information. Ensure you have initialized Textractor and loaded your document.
```python
from textractor import Textractor

txt = Textractor("YOUR_REGION", "YOUR_BUCKET_NAME")
doc = txt.start("YOUR_DOCUMENT_NAME")

# Define a query to find the 'Invoice Number'
query = "Invoice Number"

# Run the query on the document
response = doc.query(query)

# Print the query response
print(response)
```

--------------------------------

### Basic Query Example

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/using_queries.ipynb.txt

Demonstrates a simple query to extract specific information. Ensure you have the necessary Textractor components imported.

```python
from textractor.tools.utils import query_document

query_document(document, "What is the total amount?")
```

--------------------------------

### Access Column Content

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/tabular_data_linearization.ipynb.txt

Accesses the content of a specific column. This example shows how to get the text content of all cells in the first column.

```python
column_texts = [cell.text for cell in tables[0].columns[0]]
```

--------------------------------

### Access Row Content

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/tabular_data_linearization.ipynb.txt

Accesses the content of a specific row. This example shows how to get the text content of all cells in the first row.

```python
row_texts = [cell.text for cell in tables[0].rows[0]]
```

--------------------------------

### Basic Query Example

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/using_queries.ipynb.txt

This snippet demonstrates a fundamental query to retrieve specific data. Ensure necessary imports are present.
```python
from textractor.tools.utils import get_document_from_s3

document = get_document_from_s3(bucket="your-bucket-name", key="your-document-key.pdf")

# Example query: Get all pages with more than 100 words
query_result = document.query(lambda page: len(page.words) > 100)
print(f"Found {len(query_result)} pages matching the query.")
```

--------------------------------

### Analyze Expense with Custom Configuration

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/using_analyze_expense.ipynb.txt

Demonstrates how to use custom configurations for expense analysis, such as specifying a particular feature or setting a minimum confidence score. This allows for more tailored data extraction.

```python
from trp.analyze.expense import AnalyzeExpense, ExpenseFeatures

expense_analyzer = AnalyzeExpense(
    features=[ExpenseFeatures.LINE_ITEM_GROUPS],
    minimum_confidence=0.9
)
response = expense_analyzer.analyze_expense("document.pdf")
print(response)
```

--------------------------------

### Get Signature Bounding Box

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/signature_detection.ipynb.txt

This example shows how to retrieve the bounding box coordinates for detected signatures. It iterates through the signatures found in a document.

```python
from textractor.data.document import Document

if __name__ == "__main__":
    # Example usage:
    doc = Document(path="path/to/your/document.pdf")
    signatures = doc.signatures
    for signature in signatures:
        print(f"Signature bounding box: {signature.bounding_box}")
```

--------------------------------

### Get Signature Field Information

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/signature_detection.ipynb.txt

This example shows how to extract information about signature fields, including their geometry and confidence score, from the Textract response.
```python
import boto3


def get_signature_field_info(bucket, document):
    client = boto3.client('textract')

    # Call analyze_document with signature detection enabled
    response = client.analyze_document(
        Document={'S3Object': {'Bucket': bucket, 'Name': document}},
        FeatureTypes=['SIGNATURES']
    )

    # Extract signature information. DocumentMetadata['Pages'] is a page
    # count rather than a list, so the page number is read from each block.
    signature_fields = []
    for block in response['Blocks']:
        if block['BlockType'] == 'SIGNATURE':
            signature_fields.append({
                'page': block.get('Page', 1),
                'id': block['Id'],
                'confidence': block['Confidence'],
                'geometry': block['Geometry']
            })
    return signature_fields


# Example usage:
# bucket_name = 'your-s3-bucket-name'
# document_name = 'your-document-name.pdf'
# signature_info = get_signature_field_info(bucket_name, document_name)
# print(f"Found {len(signature_info)} signature fields.")
# for info in signature_info:
#     print(f"Page: {info['page']}, ID: {info['id']}, Confidence: {info['confidence']}")
```

--------------------------------

### Basic Query Example

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/using_queries.ipynb.txt

Demonstrates a simple query to extract data. Ensure you have initialized the Textractor client and loaded your document.

```python
from textractor import Textractor

client = Textractor(region_name="us-east-1")
document = client.start_document_analysis(
    "s3://textract-sample-data/invoice/sample-invoice.pdf",
    "invoice"
)

# Example: Get all invoice IDs
invoice_ids = document.get_by_field("INVOICE_ID")
for invoice_id in invoice_ids:
    print(f"Invoice ID: {invoice_id.value}")
```

--------------------------------

### Load Document and Get Tables

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/tabular_data_linearization.ipynb.txt

Loads a document and extracts all tabular data. Ensure the 'textractor' library is installed and the document path is correct.
```python
from textractor import Textractor

txt = Textractor.from_path("path/to/your/document.pdf")
tables = txt.get_tables()
print(f"Found {len(tables)} tables.")
```

--------------------------------

### Extract and Print Tables

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/tabular_data_linearization.ipynb.txt

This example demonstrates how to use the Textractor library to get document content and then extract and print any tables found within it.

```python
from textractor import Textractor

txtr = Textractor("us-east-2")
doc = txtr.start(file_path='sample.pdf')
tables = process_document_tables(doc)

for i, table in enumerate(tables):
    print(f"\n--- Table {i+1} ---")
    for row in table:
        print('\t'.join(row))
```

--------------------------------

### Basic Query Example

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/using_queries.ipynb.txt

Demonstrates a simple query to retrieve all pages from a document. This is a foundational step for many extraction tasks.

```python
from textractor import Textractor

txt = Textractor("path/to/your/document.pdf")

# Get all pages
response = txt.get_pages()
print(response)
```

--------------------------------

### Access Cell Content

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/tabular_data_linearization.ipynb.txt

Accesses the text content of a specific cell within a table. This example shows how to get the text from the first cell.

```python
cell_text = tables[0].cells[0].text
```

--------------------------------

### Get Table as Pandas DataFrame

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/tabular_data_linearization_continued.ipynb.txt

Converts a table into a Pandas DataFrame for convenient data manipulation and analysis. Requires the pandas library to be installed.
```python
import pandas as pd

df = tables[0].to_pandas()
print(df)
```

--------------------------------

### Get Specific Pages as Forms

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/document_linearization_to_markdown_or_html.ipynb.txt

Retrieves the forms from a range of pages as a list of forms. Specify start and end page numbers (inclusive).

```python
pages_3_to_5_forms = document.get_pages_forms(start_page=3, end_page=5)
```

--------------------------------

### Querying with Document Configuration

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/using_queries.ipynb.txt

This example demonstrates how to use DocumentConfig to specify settings for query execution, such as the desired feature types (e.g., QUERIES). This allows for fine-grained control over the Textract analysis.

```python
from trp.trp_query import Query, QueryConfig
from trp.trp_utils import DocumentConfig

# Assuming you have a Textract client and a document loaded
# Example:
# textract_client = boto3.client("textract", region_name="us-east-1")
# document = Document("path/to/your/document.pdf", textract_client=textract_client)

# Define a query
query_config = QueryConfig(query_string="Payment Method")
queries = [Query(query_config=query_config)]

# Configure document settings for query execution
document_config = DocumentConfig(feature_type="QUERIES")

# Execute the query with document configuration
# response = document.query(queries=queries, document_config=document_config)

# The response will be generated based on the specified feature type
# print(response)
```

--------------------------------

### Get Specific Pages as Tables

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/document_linearization_to_markdown_or_html.ipynb.txt

Retrieves the tables from a range of pages as a list of tables. Specify start and end page numbers (inclusive).
```python
pages_3_to_5_tables = document.get_pages_tables(start_page=3, end_page=5)
```

--------------------------------

### Get Specific Pages as Images

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/document_linearization_to_markdown_or_html.ipynb.txt

Retrieves the content from a range of pages as a list of images. Specify start and end page numbers (inclusive).

```python
pages_3_to_5_images = document.get_pages_images(start_page=3, end_page=5)
```

--------------------------------

### Basic Query Example

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/using_queries.ipynb.txt

Demonstrates a simple query to find a specific key-value pair. Ensure you have initialized the Textractor client and loaded your document.

```python
from textractor import Textractor

txt = Textractor(region_name="us-east-1")
doc = txt.start_document_analysis("document.pdf")

# Example: Query for a specific key
query_result = doc.query("Invoice Number")
print(query_result)
```

--------------------------------

### Initialize Textractor for Visualization

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/visualizing_results.ipynb.txt

Set up the Textractor library to process documents and prepare for result visualization. Ensure you have the necessary AWS credentials and Textract permissions configured.
```python
from textractor.visualizer import Visualizer
from textractor import Textractor

# Initialize Textractor with your AWS region
txt_processor = Textractor(region_name="us-east-1")

# Initialize the Visualizer
visualizer = Visualizer()

# Load a document (e.g., from a file path)
doc = txt_processor.parse_document(document="path/to/your/document.pdf")
```

--------------------------------

### Get Signature Field Information

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/signature_detection.ipynb.txt

This example shows how to retrieve detailed information about detected signature fields, including their bounding boxes and confidence scores.

```python
from trp.aws.aws_textract_document import AWSTextractDocument


def get_signature_field_info(bucket, document):
    """Gets information about signature fields in a document.

    Args:
        bucket (str): The S3 bucket name.
        document (str): The S3 object key for the document.

    Returns:
        list: A list of dictionaries, each containing information about a signature field.
    """
    doc = AWSTextractDocument(s3_bucket=bucket, s3_object_key=document)
    signature_fields = doc.detect_signatures()

    field_info = []
    for field in signature_fields:
        field_info.append({
            "type": field.type,
            "confidence": field.confidence,
            "geometry": field.geometry
        })
    return field_info
```

--------------------------------

### Basic Query Example

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/using_queries.ipynb.txt

Demonstrates a simple query to filter data. Ensure you have the necessary Textractor library imported.
```python
from textractor.tools.query import Query

query = Query()
query.add_filter(field="field_name", operator="=", value="value")
results = query.run(data)
```

--------------------------------

### Get Signature Fields from a Document

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/signature_detection.ipynb.txt

This example demonstrates how to retrieve signature fields from a document using the AWSTextractDocument class. It assumes the document has already been processed.

```python
from trp.aws.aws_textract_document import AWSTextractDocument


def get_signature_fields(bucket: str, document: str):
    """Gets signature fields from a document.

    Args:
        bucket: The S3 bucket name.
        document: The document file name.

    Returns:
        A list of signature fields.
    """
    doc = AWSTextractDocument(bucket=bucket, document=document)
    return doc.get_signature_fields()
```

--------------------------------

### Initialize Textractor with Gateway and Direct Path Queries

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/using_queries.ipynb.txt

Initialize the Textractor client with both gateway and direct path query configurations. This allows for flexible query execution.

```python
from textractor.textractor import Textractor
from textractor.data.config import TextractorConfig

config = TextractorConfig(
    queries=["What is the invoice number?", "What is the total amount?"],
    gateway_queries=["What is the vendor name?"]
)
txt = Textractor(config=config)
```

--------------------------------

### Process Document with Textractor

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/tabular_data_linearization.ipynb.txt

Load and process a document using Textractor. This example shows how to get all text elements, including tables and key-value pairs.
```python
from textractor import Textractor
from textractor.data.text_elements import TextType

txt = Textractor(profile_name="your-profile-name")
doc = txt.process_document("path/to/your/document.pdf")

# Get all text elements
all_text_elements = doc.get_text_elements()

# Get tables
tables = doc.get_table_elements()

# Get key-value pairs
key_value_pairs = doc.get_key_value_elements()

# Convert tables to JSON
tables_json = [table.to_json(TextType.LINEARIZED) for table in tables]

# Convert key-value pairs to JSON
key_value_json = [kv.to_json(TextType.KEY_VALUE) for kv in key_value_pairs]

print("Tables:", tables_json)
print("Key-Value Pairs:", key_value_json)
```

--------------------------------

### Get Table as Pandas DataFrame

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/tabular_data_linearization.ipynb.txt

Converts a detected table into a Pandas DataFrame for easier data manipulation and analysis. Requires the pandas library to be installed.

```python
import pandas as pd

df = document.tables[0].to_pandas()
print(df)
```

--------------------------------

### Initialize Textract Client (Python)

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/signature_detection.ipynb.txt

Example of initializing the Amazon Textract client in Python. This is a prerequisite for using Textract functionalities.

```python
import boto3

textract_client = boto3.client('textract')

document = {
    'S3Object': {
        'Bucket': 'your-bucket-name',
        'Name': 'your-document-name'
    }
}
```

--------------------------------

### Claude LLM Integration Setup

Source: https://aws-samples.github.io/amazon-textract-textractor/notebooks/tabular_data_linearization_continued.html

Sets up the AWS SDK for Bedrock to interact with Claude models for natural language processing.
```python
import json
import os

import boto3


def get_response_from_claude(context, prompt_data):
    # Claude text-completion prompts use "\n\nHuman:" and "\n\nAssistant:" turn markers.
    body = json.dumps({
        "prompt": f"\n\nHuman: Given the following document:\n{context}\nAnswer the following:\n {prompt_data}\n\nAssistant:",
        "max_tokens_to_sample": 2000,
        "top_k": 1,
    })
    modelId = 'anthropic.claude-instant-v1'  # change this to use a different version from the model provider
    accept = '*/*'
    contentType = 'application/json'
    response = bedrock.invoke_model(body=body, modelId=modelId, accept=accept, contentType=contentType)
    response_body = json.loads(response.get('body').read())
    answer = response_body.get('completion')
    return answer


os.environ["AWS_DEFAULT_REGION"] = "us-west-2"
os.environ["BEDROCK_ENDPOINT_URL"] = "https://bedrock-runtime.us-west-2.amazonaws.com"

# Module-level client used by get_response_from_claude
bedrock = boto3.client(
    service_name='bedrock-runtime',
    region_name='us-west-2',
    endpoint_url='https://bedrock-runtime.us-west-2.amazonaws.com',
)
```

--------------------------------

### Get Document Text with Layout (Final Example)

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/layout_analysis_for_text_linearization.ipynb.txt

A comprehensive function to extract document text, handling various block types and their hierarchical relationships.

```python
# Separator appended after the text of each handled block type
SEPARATORS = {
    "FORM": "\n\n",
    "TABLE": "\n\n",
    "PARAGRAPH": "\n\n",
    "LINE": "\n",
    "WORD": " ",
    "PAGE_NUMBER": "\n\n",
}


def get_document_text_with_layout_final(response):
    text = ""
    for block in response["Blocks"]:
        separator = SEPARATORS.get(block["BlockType"])
        # Not every block carries a "Text" field (TABLE blocks, for example),
        # so skip those instead of raising a KeyError.
        if separator is not None and "Text" in block:
            text += block["Text"] + separator
    return text
```

--------------------------------

### Get Specific Pages as CSV

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/document_linearization_to_markdown_or_html.ipynb.txt

Retrieves the content from a range of pages as a single CSV file. Specify start and end page numbers (inclusive).

```python
pages_3_to_5_csv = document.get_pages_csv(start_page=3, end_page=5)
```

--------------------------------

### Gateway Query Example

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/using_queries.ipynb.txt

Shows how to use a gateway query for cases where a direct query is not sufficient. This method gives more control over the query process.

```python
from textractor import Textractor

txt = Textractor("us-east-1")
doc = txt.open_document("document.pdf")

# Gateway query
response = doc.gateway_query("What is the total amount?")
print(response.answer)
```

--------------------------------

### Basic Query Example

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/using_queries.ipynb.txt

This snippet demonstrates a fundamental query to extract data. Ensure you have the necessary imports before running.

```python
from textractor.tools.utils import get_document_information

doc_info = get_document_information(document_path)

# Example query: Extract all text from the document
query = "SELECT text FROM document"
results = doc_info.query(query)
print(results)
```

--------------------------------

### Get Specific Pages as PDF

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/document_linearization_to_markdown_or_html.ipynb.txt

Retrieves the content from a range of pages as a single PDF file. Specify start and end page numbers (inclusive).

```python
pages_3_to_5_pdf = document.get_pages_pdf(start_page=3, end_page=5)
```

--------------------------------

### Get Document Analysis Results

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/layout_analysis_for_text_linearization.ipynb.txt

Retrieves the results of an asynchronous document analysis job using the JobId. This is typically called after starting a job with `start_document_analysis`.

```python
import boto3

textract = boto3.client("textract")

# job_id is the JobId returned by start_document_analysis
response = textract.get_document_analysis(JobId=job_id)

# Process the results; "Blocks" may be missing while the job is still running
for item in response.get("Blocks", []):
    if item["BlockType"] == "LINE":
        print(item["Text"])
```

--------------------------------

### Process Signatures from S3

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/signature_detection.ipynb.txt

This example shows how to detect signatures in a document stored in an S3 bucket. Ensure the 'textractor' library is installed and AWS credentials are configured.

```python from textractor.data.document import Document doc = Document(s3_bucket="your-bucket-name", s3_key="path/to/your/document.pdf") doc.detect_signatures() for signature in doc.signatures: print(f"Signature found at: {signature.bounding_box}") ``` -------------------------------- ### Basic Signature Detection Query Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/signature_detection.ipynb.txt This example demonstrates a basic query to detect signatures in a document. Ensure the document is uploaded and accessible. ```python from textractor.data.document import Document doc = Document.open("path/to/your/document.pdf") signatures = doc.get_signatures() for signature in signatures: print(f"Signature found at: {signature.geometry}") ``` -------------------------------- ### Get Signature Information with Confidence Score Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/signature_detection.ipynb.txt This example demonstrates how to retrieve signature information along with their confidence scores. This is useful for filtering signatures based on reliability. ```python from trp.aws.aws_textract_document import AWSTextractDocument from trp.aws.aws_textract_document import AWSTextractDocumentConfig doc = AWSTextractDocument(AWSTextractDocumentConfig(profile="default", region_name="us-east-1")) doc.detect_signatures(document="/path/to/your/document.pdf") for signature in doc.signatures: print(f"Signature ID: {signature.id}, Confidence: {signature.confidence}") ``` -------------------------------- ### Advanced Querying with Specific Keys Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/using_queries.ipynb.txt This example shows how to query for specific keys, allowing for more precise data extraction. It's useful when you know the exact field names you need. 
```python doc = t.start( "s3://textract-sample-data/sample-us-west-2/invoice.png" ) # Query for a specific key results = doc.query_kvp(["invoice"]) # Print the results for result in results: print(f"{result.key}: {result.value}") ``` -------------------------------- ### Get Table as Pandas DataFrame Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/tabular_data_linearization.ipynb.txt Converts a specific linearized table into a Pandas DataFrame for easier data manipulation and analysis. Requires the pandas library to be installed. ```python import pandas as pd df = pd.DataFrame(linearized_tables[0]) ``` -------------------------------- ### Example Usage: Detect and Process Signatures Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/signature_detection.ipynb.txt This example shows how to call the detect_signatures and process_signature_detection_results functions with a sample S3 bucket and document name. Make sure to replace 'your-bucket-name' and 'your-document.pdf' with your actual S3 details. ```python if __name__ == "__main__": bucket_name = "your-bucket-name" document_name = "your-document.pdf" signature_response = detect_signatures(bucket_name, document_name) process_signature_detection_results(signature_response) ``` -------------------------------- ### AnalyzeDocument API Call Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/document_linearization_to_markdown_or_html.ipynb.txt This is a basic example of calling the AnalyzeDocument API to get document analysis results. Ensure you have the necessary AWS credentials and permissions configured. 
```python import boto3 client = boto3.client('textract') response = client.analyze_document( Document={'S3Object': {'Bucket': 'YOUR_BUCKET_NAME', 'Name': 'YOUR_DOCUMENT_NAME'}}, FeatureTypes=['FORMS', 'TABLES'] ) # Process the response here print(response) ``` -------------------------------- ### Basic Query Example Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/using_queries.ipynb.txt Demonstrates a simple query to find specific text within a document. Ensure Textractor is initialized before use. ```python from textractor import Textractor txt = Textractor() # Example: Find all occurrences of 'invoice' results = txt.find("invoice") for result in results: print(f"Found 'invoice' at page {result.page_number}, bounding box: {result.bounding_box}") ``` -------------------------------- ### Find a Word and Get its Page Number Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/finding_words_within_a_document.ipynb.txt This example shows how to find a word and retrieve the page number where it was found. This is helpful for organizing and referencing search results. ```python from textractor.textractor import Textractor tractor = Textractor() # Find all occurrences of "data" and get their page numbers found_words_with_pages = tractor.find_all_words("data", get_page_numbers=True) for word_info in found_words_with_pages: print(f"Found 'data' on page: {word_info['page']}") ``` -------------------------------- ### Example Usage of Signature Detection Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/signature_detection.ipynb.txt Demonstrates how to use the detect_signatures function with a sample document. Ensure you have AWS credentials configured. 
```python
import boto3

# Initialize Textract client
textract_client = boto3.client('textract')

# Read the document bytes from a local file
with open("signature.png", "rb") as document_file:
    document_bytes = document_file.read()

document = {'Bytes': document_bytes}

# detect_signatures(client, document) is assumed to be defined beforehand
signatures = detect_signatures(textract_client, document)

if signatures:
    print(f"Detected {len(signatures)} signature fields:")
    for sig in signatures:
        print(f"- ID: {sig['Id']}, Geometry: {sig['Geometry']}")
else:
    print("No signature fields detected.")
```

--------------------------------

### Install Textractor with Multiple Extras

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/installation.rst.txt

Install Textractor with multiple extras, such as pdf and torch, by listing them separated by commas. Quoting the requirement keeps shells such as zsh from interpreting the square brackets.

```bash
pip install "amazon-textract-textractor[pdf,torch]"
```

--------------------------------

### Get Table as Pandas DataFrame

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/tabular_data_linearization_continued.ipynb.txt

Retrieves a specific table from the document as a Pandas DataFrame for convenient data manipulation and analysis. Requires the 'pandas' library to be installed.

```python
import pandas as pd
from textractor.data.document import Document

doc = Document(bucket_name="amazon-textract-sample-data", document_name="sample_invoice.pdf")
doc.process(verbose=True)

tables = doc.tables

# Assuming you want the first table
df = tables[0].to_pandas()
print(df)
```

--------------------------------

### Get Specific Pages as JSON

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/document_linearization_to_markdown_or_html.ipynb.txt

Retrieves the analysis results for a range of pages as a single JSON object. Specify start and end page numbers (inclusive).

```python pages_3_to_5_json = document.get_pages_json(start_page=3, end_page=5) ``` -------------------------------- ### Initialize Textractor Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/using_analyze_id.ipynb.txt Instantiate the Textractor client. This is the first step before performing any Textract operations. ```python from textractor import Textractor txt = Textractor(profile_name="your-profile-name", region_name="your-region-name") ``` -------------------------------- ### Analyze ID with Specific Configuration Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/using_analyze_id.ipynb.txt This example demonstrates how to analyze an ID with specific configurations, such as setting the region and profile. Adjust 'us-east-1' and 'default' as needed. ```python from textractor.tools.document import Document from textractor.data.document import DocumentType doc = Document(document_type=DocumentType.ANALYZE_ID, region_name="us-east-1", profile_name="default") doc.analyze() doc.save_json("analyze_id_output_configured.json") ``` -------------------------------- ### Detect Signatures in a Document (Node.js) Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/signature_detection.ipynb.txt This Node.js example demonstrates how to detect signatures in a document using the AWS SDK for JavaScript. Ensure you have the AWS SDK installed and configured. 
```javascript
const AWS = require('aws-sdk');
const fs = require('fs');

AWS.config.update({ region: 'us-east-1' });
const textract = new AWS.Textract();

fs.readFile('invoice.png', (err, data) => {
  if (err) throw err;

  // SIGNATURE blocks are only returned by AnalyzeDocument with the
  // SIGNATURES feature type; DetectDocumentText returns text blocks only.
  const params = {
    Document: { Bytes: data },
    FeatureTypes: ['SIGNATURES'],
  };

  textract.analyzeDocument(params, (err, response) => {
    if (err) {
      console.log(err);
      return;
    }
    response.Blocks.forEach((block) => {
      if (block.BlockType === 'SIGNATURE') {
        console.log(`Found signature: ${block.Id}`);
      }
    });
  });
});
```

--------------------------------

### Install Amazon Textract Textractor from PyPI

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/installation.rst.txt

Install the base package from PyPI. Use extras like [pdfium] for PDF rasterization.

```bash
pip install amazon-textract-textractor
```

--------------------------------

### Process Signatures from an Image

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/signature_detection.ipynb.txt

This example shows how to detect signatures in an image file using Amazon Textract. Ensure the 'textractor' library is installed and the image path is correct.

```python
from textractor.tools.signature_detection import SignatureDetection

image_path = "/path/to/your/image.png"

signature_detector = SignatureDetection(image_path=image_path)
signature_detector.detect_signatures()

# Iterate through detected signatures and print their information
for signature in signature_detector.signatures:
    print(f"Detected signature: {signature.bounding_box}")
```

--------------------------------

### Loading a Document with Specific Configurations

Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/tabular_data_linearization.ipynb.txt

This example illustrates loading a document with Textractor, specifying configurations like skipping duplicates and normalizing text and layout. Adjust the document path as needed.

```python from textractor.data.document import Document from textractor.data.document.document import DocumentConfiguration doc = Document(document_path="path/to/your/document.pdf", document_configuration=DocumentConfiguration(skip_duplicates=True, normalize_text=True, normalize_layout=True, normalize_table=True)) ``` -------------------------------- ### Get Signature Confidence Score Source: https://aws-samples.github.io/amazon-textract-textractor/_sources/notebooks/signature_detection.ipynb.txt This example demonstrates how to retrieve the confidence score for detected signatures. The confidence score indicates the likelihood that the detected field is indeed a signature. ```python from trp.signature import SignatureDetector def get_signature_confidence(document_path): detector = SignatureDetector(document_path=document_path) detector.detect() for signature in detector.signatures: print(f"Signature on page {signature.page} has confidence: {signature.confidence:.2f}") get_signature_confidence("path/to/your/document.pdf") ```
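
--------------------------------

### Filter Signatures by Confidence with boto3 (Sketch)

The confidence-based filtering described above can also be sketched directly against the raw Textract response, without any helper classes. This is a minimal sketch, not the notebook's own code: it assumes the `AnalyzeDocument` API is called with the `SIGNATURES` feature type and that signature blocks carry a `Confidence` value. The function names `filter_signatures` and `detect_signatures_in_s3`, the 90.0 threshold, and the sample response are hypothetical and exist only for illustration.

```python
def filter_signatures(response, min_confidence=90.0):
    """Return (page, confidence) pairs for SIGNATURE blocks at or above min_confidence."""
    results = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "SIGNATURE":
            continue
        confidence = block.get("Confidence", 0.0)
        if confidence >= min_confidence:
            # "Page" can be absent for single-page input, so default to page 1
            results.append((block.get("Page", 1), confidence))
    return results


def detect_signatures_in_s3(bucket, name, min_confidence=90.0):
    """Hypothetical wrapper: analyze an S3 object and keep confident signatures."""
    import boto3  # imported lazily so filter_signatures works without AWS access

    client = boto3.client("textract")
    response = client.analyze_document(
        Document={"S3Object": {"Bucket": bucket, "Name": name}},
        FeatureTypes=["SIGNATURES"],
    )
    return filter_signatures(response, min_confidence)


# Offline check with a hand-written response in the AnalyzeDocument shape
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1"},
        {"BlockType": "SIGNATURE", "Id": "s1", "Page": 1, "Confidence": 98.5},
        {"BlockType": "SIGNATURE", "Id": "s2", "Page": 2, "Confidence": 42.0},
    ]
}

for page, confidence in filter_signatures(sample_response):
    print(f"Signature on page {page} has confidence: {confidence:.2f}")
```

Keeping the filtering separate from the API call means the threshold logic can be exercised on saved responses without AWS credentials; `detect_signatures_in_s3` is only needed when a live call is wanted.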