### Install PDFDataExtractor and ChemDataExtractor

Source: https://context7.com/cat-lemonade/pdfdataextractor/llms.txt

Clone the repository and install PDFDataExtractor from the cloned directory. Optionally, install ChemDataExtractor for chemistry-aware extraction.

```sh
# Clone the repository
git clone git@github.com:cat-lemonade/PDFDataExtractor.git

# Install PDFDataExtractor from inside the cloned directory
cd PDFDataExtractor
python setup.py install

# Optional: install ChemDataExtractor for chemistry-aware extraction
pip install chemdataextractor
```

--------------------------------

### Install PDFDataExtractor Package

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/docs/build/html/_sources/getting_started/installation.rst.txt

Run this command in the downloaded repository directory to install the PDFDataExtractor package.

```sh
python setup.py install
```

--------------------------------

### Install ChemDataExtractor Dependency

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/docs/build/html/_sources/getting_started/installation.rst.txt

Install ChemDataExtractor using pip if you need to perform chemistry-related information extraction, such as identifying chemical names.

```sh
pip install chemdataextractor
```

--------------------------------

### Extract Chemistry Information from PDF

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/docs/source/perform_extraction/extraction.md

Use the `chem=True` flag to enable chemistry-related information extraction with ChemDataExtractor alongside metadata extraction. Ensure both libraries are installed.
```python
from pdfdataextractor import Reader

# Path to the PDF file
file = r'path to the file'

# Create an instance
reader = Reader()

# Read the file
pdf = reader.read_file(file)

# Show extracted chemical information
r = pdf.abstract(chem=True)
r.records.serialize()
```

--------------------------------

### Iterate and Print PDF References

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

This Python code iterates through the references extracted from a PDF document. It prints the sequence number and the content of each reference. Ensure the pdfdataextractor library is installed and a PDF object is initialized.

```python
for seq, ref in pdf.reference().items():
    print(seq)
    print(ref)
```

--------------------------------

### Extract Images from PDF (Temporarily unavailable)

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/docs/build/html/_sources/perform_extraction/extraction.rst.txt

This code is intended to extract images from a PDF file. Note that this feature is temporarily unavailable. The example shows how to access a specific image by its index.

```python
# Import PDFDataExtractor
from pdfdataextractor import Reader

# Specify the path to the PDF file
path = r'the path to the PDF file'

# Create an instance
file = Reader()

# Read the file
pdf = file.read_file(path)

# To access a specific image
pdf.iamge()[0]
```

--------------------------------

### Extract Pure Text from PDF

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/docs/build/html/_sources/perform_extraction/extraction.rst.txt

Use this snippet to extract plain text content from a PDF file. Ensure the PDFDataExtractor library is installed and the correct path to the PDF is provided.
```python
# Import PDFDataExtractor
from pdfdataextractor import Reader

# Specify the path to the PDF file
path = r'the path to the PDF file'

# Create an instance
file = Reader()

# Read the file
pdf = file.read_file(path)

# Get pure text
pdf.plaintext()
```

--------------------------------

### Extract Full Plain Text

Source: https://context7.com/cat-lemonade/pdfdataextractor/llms.txt

Get the entire document as a single concatenated string in reading order. Useful for full-text indexing or downstream NLP pipelines.

```python
from pdfdataextractor import Reader

reader = Reader()
pdf = reader.read_file('article.pdf')

# Get entire document as plain text
text = pdf.plaintext()
print(text[:500])
# Output: "Journal of Chemical Information and Modeling PDFDataExtractor: A Tool for Reading Scientific Text..."
```

--------------------------------

### Get Section Titles and Text

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

Use the `pdf.section()` method to retrieve a dictionary where keys are section titles and values are lists of text content within each section. This is useful for parsing structured documents.

```python
pdf.section()
```

--------------------------------

### Initialize PDF Path

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

Specify the file path for the PDF document to be processed.

```python
file_test = r'/Volumes/Backup/PDE_papers/articles/Elesvier/dssc/The-effect-of-molecular-structure-on-the-properties-of-quinox_2020_Dyes-and-.pdf'
```

--------------------------------

### Initialize PDF Reader

Source: https://context7.com/cat-lemonade/pdfdataextractor/llms.txt

Initialize the Reader class, which is the entry point for PDF processing. Optionally show the detected publisher name.
```python
from pdfdataextractor import Reader

# Initialize reader (optionally show detected publisher name)
reader = Reader(showPublisher=True)

# Read a single PDF file; auto-detects publisher and returns a template object
pdf = reader.read_file('path/to/article.pdf')

# Verify that the PDF was loaded and a template was returned
pdf.test()
# Output: PDF returned successfully
```

--------------------------------

### Initialize Reader

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

Create an instance of the Reader class to handle PDF reading.

```python
reader = Reader()
```

--------------------------------

### Extract Journal Information

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

Retrieves journal-related information from the PDF, either as a dictionary of all fields or as a single component (name, year, volume, or page).

```APIDOC
## pdf.journal()

### Description
Retrieves a dictionary containing journal information (name, year, volume, page) from the PDF.

### Method
`pdf.journal()`

### Parameters
None

### Response
#### Success Response (200)
- **name** (string) - The full journal name and citation.
- **year** (string) - The publication year.
- **volume** (string) - The publication volume.
- **page** (string) - The page range.

### Request Example
```python
pdf.journal()
```

### Response Example
```json
{
  "name": "J. Chem. Inf. Model. 2016, 56, 1894−1904",
  "year": "2016",
  "volume": "56",
  "page": "1894-1904"
}
```

## pdf.journal(field)

### Description
Retrieves a specific field of journal information from the PDF.

### Method
`pdf.journal(field: str)`

### Parameters
#### Path Parameters
- **field** (string) - Required - The specific field to retrieve. Accepted values: 'name', 'year', 'volume', 'page'.

### Request Example
```python
pdf.journal('year')
```

### Response
#### Success Response (200)
- Returns the string value of the requested field.

### Response Example
```
'2016'
```
```

--------------------------------

### Pass a PDF File to Reader

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/docs/build/html/_sources/getting_started/running.rst.txt

Specify the path to your PDF file, create an instance of the Reader, and then read the file. Includes a test to verify successful reading.

```python
from pdfdataextractor import Reader

# Specify the path to the PDF file
path = r'the path to the PDF file'

# Create an instance
file = Reader()

# Read the file
pdf = file.read_file(path)

# Test if pdf is returned successfully
pdf.test()
```

--------------------------------

### Load and Parse a PDF

Source: https://context7.com/cat-lemonade/pdfdataextractor/llms.txt

Read a PDF file, auto-detect the publisher, and dispatch to the matching template. The same reader can be reused to batch process multiple PDFs.

```python
import glob

from pdfdataextractor import Reader

reader = Reader(showPublisher=True)

# Returns a publisher-specific template object (e.g., ElsevierTemplate)
pdf = reader.read_file('elsevier_article.pdf')
# Console output: Publisher: *** elsevier ***
#                 *** Elsevier detected ***

# Batch processing multiple PDFs
for filepath in glob.glob('/data/articles/acs/*.pdf'):
    try:
        pdf = reader.read_file(filepath)
        if pdf:
            print(pdf.title())
    except Exception as exc:
        # Report the failing file instead of silently skipping it
        print(f'Skipping {filepath}: {exc}')
```

--------------------------------

### Initialize Reader

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

Create an instance of the Reader class. This object will be used to read and process PDF files.

```python
file = Reader()
```

--------------------------------

### Read PDF File with PDFDataExtractor

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/docs/build/html/getting_started/running.html

Specify the path to your PDF file, create a Reader instance, and read the file. Includes a test to verify successful file reading.
```python
from pdfdataextractor import Reader

# Specify the path to the PDF file
path = r'the path to the PDF file'

# Create an instance
file = Reader()

# Read the file
pdf = file.read_file(path)

# Test if pdf is returned successfully
pdf.test()
```

--------------------------------

### Reader(): Initialize the PDF Reader

Source: https://context7.com/cat-lemonade/pdfdataextractor/llms.txt

The Reader class is the primary entry point for PDF processing. It handles publisher auto-detection and directs PDFs to the appropriate publisher-specific template for extraction.

```APIDOC
## Reader()

### Description
Initializes the Reader class, which is the entry point for all PDF processing. It handles publisher auto-detection and routes each PDF to the appropriate publisher-specific template for extraction.

### Method
```python
Reader(showPublisher=True)
```

### Parameters
#### Optional Parameters
- **showPublisher** (bool) - Optional - If True, prints the detected publisher name when a file is read.

### Request Example
```python
from pdfdataextractor import Reader

# Initialize reader (optionally show detected publisher name)
reader = Reader(showPublisher=True)
```
```

--------------------------------

### Specify PDF File Path

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

Define the path to the PDF file you want to process. Use raw strings (r'') to handle backslashes correctly.

```python
path = r'/Users/miao/Downloads/acs.jcim.6b00207.pdf'
```

--------------------------------

### Clone PDFDataExtractor Repository

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/docs/build/html/_sources/getting_started/installation.rst.txt

Use this command to download the PDFDataExtractor source code from GitHub.
```sh
git clone git@github.com:cat-lemonade/PDFDataExtractor.git
```

--------------------------------

### Extract Figure and Table Captions from PDF

Source: https://context7.com/cat-lemonade/pdfdataextractor/llms.txt

Returns a dictionary of figure and table captions, keyed by identifiers like 'figure 1' or 'table 1'. Setting `nicely=True` pretty-prints the captions to standard output.

```python
from pdfdataextractor import Reader

reader = Reader()
pdf = reader.read_file('article.pdf')

# Get captions as a dictionary
captions = pdf.caption()
print(captions)
# Output: {
#   'figure 1': 'Figure 1. Schematic of the PDFDataExtractor pipeline...',
#   'figure 2': 'Figure 2. Precision-recall curves for each publisher template...',
#   'table 1': 'Table 1. Summary of extraction performance across publishers.'
# }

# Pretty-print captions
pdf.caption(nicely=True)
# figure 1
# Figure 1. Schematic of the PDFDataExtractor pipeline...
#
# figure 2
# Figure 2. Precision-recall curves for each publisher template...
```

--------------------------------

### pdf.reference()

Source: https://context7.com/cat-lemonade/pdfdataextractor/llms.txt

Parses the reference section and returns a dictionary of bibliographic references, handling various styles and publisher-specific logic.

```APIDOC
## `pdf.reference()`: Extract Bibliographic References

Parses the reference section and returns a dictionary where each key is a reference index (as string) and the value is the full reference text. Handles both numbered styles `[1]`, `(1)` and unnumbered styles, with publisher-specific anchor logic and noise filtering.

```python
from pdfdataextractor import Reader

reader = Reader()
pdf = reader.read_file('article.pdf')

references = pdf.reference()
for idx, ref in references.items():
    print(f"[{idx}] {ref}")
# Output:
# [0] Smith, J.; Jones, A. J. Am. Chem. Soc. 2019, 141, 1000–1010.
# [1] Wang, L.; Chen, X. Angew. Chem. Int. Ed. 2020, 59, 5432–5438.
# [2] Taylor, R. et al. Chem. Sci. 2021, 12, 8765–8773.
```
```

--------------------------------

### Read and Process PDF File

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

Use the read_file method of the Reader instance to process the specified PDF file. The output indicates the file being read and the detected publisher.

```python
pdf = file.read_file(path)
```

--------------------------------

### Import glob module

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

Imports the glob module, which is used for finding files matching a specified pattern.

```python
import glob
```

--------------------------------

### Import Reader Module

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/docs/build/html/_sources/getting_started/running.rst.txt

Import the Reader class from the pdfdataextractor library. This is the first step before using the tool.

```python
from pdfdataextractor import Reader
```

--------------------------------

### Read PDF File

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

Read the specified PDF file using the initialized Reader object.

```python
pdf = reader.read_file(file_test)
```

--------------------------------

### Test PDF Processing

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

Use this method to verify if the PDF has been processed successfully. It returns a confirmation message.

```python
pdf.test()
```

--------------------------------

### pdf.caption(nicely=False)

Source: https://context7.com/cat-lemonade/pdfdataextractor/llms.txt

Returns a dictionary of all figure and table captions found in the PDF. Setting `nicely=True` pretty-prints each caption to stdout.

```APIDOC
## `pdf.caption(nicely=False)`: Extract Figure and Table Captions

Returns a dictionary of all figure and table captions found in the PDF, keyed by `"figure 1"`, `"figure 2"`, `"table 1"`, etc. Setting `nicely=True` pretty-prints each caption to stdout.

```python
from pdfdataextractor import Reader

reader = Reader()
pdf = reader.read_file('article.pdf')

# Get captions as a dictionary
captions = pdf.caption()
print(captions)
# Output: {
#   'figure 1': 'Figure 1. Schematic of the PDFDataExtractor pipeline...',
#   'figure 2': 'Figure 2. Precision-recall curves for each publisher template...',
#   'table 1': 'Table 1. Summary of extraction performance across publishers.'
# }

# Pretty-print captions
pdf.caption(nicely=True)
# figure 1
# Figure 1. Schematic of the PDFDataExtractor pipeline...
#
# figure 2
# Figure 2. Precision-recall curves for each publisher template...
```
```

--------------------------------

### pdf.keywords()

Source: https://context7.com/cat-lemonade/pdfdataextractor/llms.txt

Extracts author-defined keywords from the article. The behavior is publisher-specific.

```APIDOC
## `pdf.keywords()`: Extract Keywords

Extracts author-defined keywords from the article. Behaviour is publisher-specific: Elsevier strips the "Keywords" label; ACS checks within the abstract block; Angewandte and Chemistry – A European Journal use spatial coordinates to locate the keyword block.

```python
from pdfdataextractor import Reader

reader = Reader()
pdf = reader.read_file('article.pdf')

keywords = pdf.keywords()
print(keywords)
# Output (Elsevier): " machine learning, PDF extraction, chemistry, natural language processing"
# Output (ACS): ": machine learning pdf extraction cheminformatics"
```
```

--------------------------------

### Extract Abstract from PDF

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

Retrieves the abstract of the document. The abstract provides a summary of the document's content.

```python
pdf.abstract()
```

--------------------------------

### Process Single and Multiple PDFs with PDFDataExtractor

Source: https://context7.com/cat-lemonade/pdfdataextractor/llms.txt

Use `read_single()` to process one PDF and `read_multiple()` to process a list of PDFs.
These functions extract chemistry-aware abstracts, titles, DOIs, authors, journals, keywords, sections, and references.

```python
import glob

from pdfdataextractor import Reader


def read_single(file):
    reader = Reader()
    pdf = reader.read_file(file)
    try:
        # Extract chemistry-aware abstract
        chem_para = pdf.abstract(chem=True)
        print(chem_para.records.serialize())

        print(pdf.title())
        print(pdf.doi())
        print(pdf.author())
        print(pdf.journal())
        print(pdf.keywords())
        print(pdf.section().keys())

        for idx, ref in pdf.reference().items():
            print(idx, ref)
    except Exception as exc:
        # Report the failing file instead of silently skipping it
        print(f'Extraction failed for {file}: {exc}')


def read_multiple(path_list):
    for filepath in path_list:
        read_single(filepath)
        print('-------------------\n')


# Process a single file
read_single('path/to/article.pdf')

# Process all PDFs in a directory
read_multiple(glob.glob('/data/articles/elsevier/*.pdf'))
```

--------------------------------

### Reader.read_file(file_name): Load and Parse a PDF

Source: https://context7.com/cat-lemonade/pdfdataextractor/llms.txt

Reads a PDF file page by page, extracts text blocks with spatial features, auto-detects the publisher, and dispatches to the matching publisher template. Returns a template instance with all extraction methods available.

```APIDOC
## Reader.read_file(file_name)

### Description
Loads and parses a PDF file. It reads the file page by page using PDFMiner, extracts text blocks with spatial features, auto-detects the publisher from the first page's text, and then dispatches the file to the corresponding publisher template for further processing. The method returns a template instance that provides access to various extraction methods.

### Method
```python
reader.read_file(file_name)
```

### Parameters
#### Path Parameters
- **file_name** (str) - Required - The path to the PDF file to be read and processed.

### Request Example
```python
import glob

from pdfdataextractor import Reader

reader = Reader(showPublisher=True)

# Read a single PDF file
pdf = reader.read_file('elsevier_article.pdf')
# Console output: Publisher: *** elsevier ***
#                 *** Elsevier detected ***

# Batch processing multiple PDFs
for filepath in glob.glob('/data/articles/acs/*.pdf'):
    try:
        pdf = reader.read_file(filepath)
        if pdf:
            print(pdf.title())
    except Exception as exc:
        # Report the failing file instead of silently skipping it
        print(f'Skipping {filepath}: {exc}')
```

### Response
#### Success Response
- Returns a publisher-specific template object (e.g., ElsevierTemplate) which contains methods for extracting information from the PDF.

#### Response Example
```python
# Example of a returned template object (conceptual)
pdf = reader.read_file('some_article.pdf')
```
```

--------------------------------

### Extract Semantic Information from PDF

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/docs/build/html/perform_extraction/extraction.html

This snippet demonstrates how to extract various semantic elements from a PDF, such as captions, keywords, titles, abstracts, and journal details. It imports Reader, creates an instance, and reads the file before calling the extraction methods.

```python
from pdfdataextractor import Reader

path = r'the path to the PDF file'
file = Reader()
pdf = file.read_file(path)

pdf.caption()
pdf.keywords()
pdf.title()
pdf.doi()
pdf.abstract()
pdf.journal()
pdf.journal('name')
pdf.journal('year')
pdf.journal('volume')
pdf.journal('page')
pdf.section()
pdf.reference()
```

--------------------------------

### Serialize Chemistry Records

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

Serialize the extracted chemistry records into a list of dictionaries for further processing or display.

```python
r.records.serialize()
```

--------------------------------

### Extract Keywords from PDF

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

Attempts to extract keywords from the PDF.
Note that some documents may not contain keywords, resulting in an empty string.

```python
pdf.keywords()  # Note: some articles do not contain keywords, for example the current one.
```

--------------------------------

### Define PDF reading functions

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

Defines functions to read a single PDF file and to iterate through multiple files, printing their abstracts. Ensure the Reader class is imported and available.

```python
def read_single(file):
    reader = Reader()
    pdf = reader.read_file(file)
    print(pdf.abstract())
```

```python
def read_multiple(path):
    for i in path:
        read_single(i)
        print('-------------------', '\n')
```

--------------------------------

### Extract Plain Text

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

Extracts and returns the entire plain text content of the PDF document.

```APIDOC
## pdf.plaintext()

### Description
Extracts and returns the entire plain text content of the PDF document.

### Method
`pdf.plaintext()`

### Parameters
None

### Response
#### Success Response (200)
- **plaintext** (string) - The complete text content of the PDF.

### Request Example
```python
pdf.plaintext()
```

### Response Example
```
'Article\n\npubs.acs.org/jcim\n\nChemDataExtractor: A Toolkit for Automated Extraction of Chemical\nInformation from the Scientific Literature\nMatthew C. Swain and Jacqueline M. Cole*\n\nCavendish Laboratory, University of Cambridge, J. J. Thomson Avenue, Cambridge, CB3 0HE, U.K.\n\nABSTRACT: The emergence of “big data” initiatives has led\nto the need for tools that can automatically extract valuable\nchemical information from large volumes of unstructured data,\nsuch as the scientific literature. Since chemical information can\nbe present in figures, tables, and textual paragraphs, successful\ninformation extraction often depends on the ability to interpret\nall of these domains simultaneously. 
We present a complete\ntoolkit for the automated extraction of chemical entities and\ntheir associated properties, measurements, and relationships\nfrom scientific documents that can be used to populate\nstructured chemical databases. Our system provides an extensible, chemistry-aware, natural language processing pipeline for\ntokenization, part-of-speech tagging, named entity recognition, and phrase parsing. Within this scope, we report improved\nperformance for chemical named entity recognition through the use of unsupervised word clustering based on a massive corpus\nof chemistry articles. For phrase parsing and information extraction, we present the novel use of multiple rule-based grammars\nthat are tailored for interpreting specific document domains such as textual paragraphs, captions, and tables. We also describe\ndocument-level processing to resolve data interdependencies and show that this is particularly necessary for the autogeneration of\nchemical databases since captions and tables commonly contain chemical identifiers and references that are defined elsewhere in\nthe text. The performance of the toolkit to correctly extract various types of data was evaluated, affording an F-score of 93.4%, 86.8%, and 91.5% for extracting chemical identifiers, spectroscopic attributes, and chemical property attributes, respectively; set\nagainst the CHEMDNER chemical name extraction challenge, ChemDataExtractor yields a competitive F-score of 87.8%. 
All\ntools have been released under the MIT license and are available to download from http://www.chemdataextractor.org.\n\n■ INTRODUCTION\n\nScientific results are typically communicated in the form of\npapers, patents, and theses that contain unstructured and\nsemistructured data described by free-flowing natural lang'
```
```

--------------------------------

### Extract Abstract with Chemistry Information

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

Extract the abstract from the PDF, enabling chemistry-related extraction by setting `chem=True`. This utilizes ChemDataExtractor for detailed chemical entity recognition.

```python
r = pdf.abstract(chem=True)
```

--------------------------------

### Extract Bibliographic References from PDF

Source: https://context7.com/cat-lemonade/pdfdataextractor/llms.txt

Parses the reference section, returning a dictionary where keys are reference indices and values are the full reference texts. Handles numbered and unnumbered styles with publisher-specific logic.

```python
from pdfdataextractor import Reader

reader = Reader()
pdf = reader.read_file('article.pdf')

references = pdf.reference()
for idx, ref in references.items():
    print(f"[{idx}] {ref}")
# Output:
# [0] Smith, J.; Jones, A. J. Am. Chem. Soc. 2019, 141, 1000–1010.
# [1] Wang, L.; Chen, X. Angew. Chem. Int. Ed. 2020, 59, 5432–5438.
# [2] Taylor, R. et al. Chem. Sci. 2021, 12, 8765–8773.
```

--------------------------------

### Extract Keywords from PDF

Source: https://context7.com/cat-lemonade/pdfdataextractor/llms.txt

Extracts author-defined keywords. Behavior varies by publisher; some strip the 'Keywords' label, others check the abstract, and some use spatial coordinates.
```python
from pdfdataextractor import Reader

reader = Reader()
pdf = reader.read_file('article.pdf')

keywords = pdf.keywords()
print(keywords)
# Output (Elsevier): " machine learning, PDF extraction, chemistry, natural language processing"
# Output (ACS): ": machine learning pdf extraction cheminformatics"
```

--------------------------------

### Process multiple PDF files

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

Calls the read_multiple function with a list of PDF files obtained using glob. This will process all PDFs in the specified directory.

```python
read_multiple(glob.glob(r'/Users/miao/Desktop/test/els/*.pdf'))
```

--------------------------------

### pdf.journal(info_type=None)

Source: https://context7.com/cat-lemonade/pdfdataextractor/llms.txt

Returns a dictionary with journal name, year, volume, and page. An optional info_type string can retrieve a single field.

```APIDOC
## `pdf.journal(info_type=None)`: Extract Journal Metadata

Returns a dictionary with journal `name`, `year`, `volume`, and `page`. Pass an `info_type` string to retrieve a single field. Publisher-specific parsing handles each journal's unique header/footer layout.

```python
from pdfdataextractor import Reader

reader = Reader()
pdf = reader.read_file('rsc_article.pdf')

# Get full journal info as a dict
journal_info = pdf.journal()
print(journal_info)
# Output: {'name': 'Chemical Science', 'year': '2022', 'volume': '13', 'page': '1234'}

# Get individual fields
print(pdf.journal('name'))    # Output: "Chemical Science"
print(pdf.journal('year'))    # Output: "2022"
print(pdf.journal('volume'))  # Output: "13"
print(pdf.journal('page'))    # Output: "1234"
```
```

--------------------------------

### Extract Pure Text from PDF

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/docs/build/html/extraction_mode/extraction.html

Use this snippet to extract plain text content from a PDF file. Ensure the path to the PDF is correctly specified.
```python
# Import PDFDataExtractor
from pdfdataextractor import Reader

# Specify the path to the PDF file
path = r'the path to the PDF file'

# Create an instance
file = Reader()

# Read the file
pdf = file.read_file(path)

# Get pure text
pdf.plaintext()
```

--------------------------------

### Extract Journal Information from PDF

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

Call the `journal()` method without arguments to extract all available journal information as a dictionary. This includes the journal name, year, volume, and page numbers.

```python
pdf.journal()
```

--------------------------------

### Extract Document Sections from PDF

Source: https://context7.com/cat-lemonade/pdfdataextractor/llms.txt

Segments the article body into named sections using publisher-specific regex patterns. Returns a dictionary mapping section titles to lists of text blocks.

```python
from pdfdataextractor import Reader

reader = Reader()
pdf = reader.read_file('article.pdf')

sections = pdf.section()
for title, text_blocks in sections.items():
    print(f"[{title}]")
    print(' '.join(text_blocks)[:200])
    print()
# Output:
# [Introduction]
# Automated extraction of information from scientific literature is a key challenge...
#
# [Results and Discussion]
# The extraction pipeline was evaluated on 500 articles from six publishers...
#
# [References]
# [1] Smith, J. et al. J. Chem. Inf. Model. 2020, 60, 1234–1245...
```

--------------------------------

### Access Specific Image from PDF

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/docs/build/html/perform_extraction/extraction.html

This code snippet shows how to access a specific image from a PDF file using the pdfdataextractor library. It assumes the Reader has been initialized and the file read.
```python
from pdfdataextractor import Reader

path = r'the path to the PDF file'
file = Reader()
pdf = file.read_file(path)

pdf.iamge()[0]
```

--------------------------------

### Extract Captions from PDF

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

Retrieves all figure captions from the PDF document. The output is a dictionary mapping figure identifiers to their caption text.

```python
pdf.caption()
```

--------------------------------

### pdf.section()

Source: https://context7.com/cat-lemonade/pdfdataextractor/llms.txt

Segments the article body into named sections using publisher-specific regex patterns. Returns a dictionary of section titles and their text blocks.

```APIDOC
## `pdf.section()`: Extract Document Sections

Segments the article body into named sections using publisher-specific regex patterns to detect section titles. Returns a dictionary where keys are section title strings and values are lists of text blocks belonging to that section.

```python
from pdfdataextractor import Reader

reader = Reader()
pdf = reader.read_file('article.pdf')

sections = pdf.section()
for title, text_blocks in sections.items():
    print(f"[{title}]")
    print(' '.join(text_blocks)[:200])
    print()
# Output:
# [Introduction]
# Automated extraction of information from scientific literature is a key challenge...
#
# [Results and Discussion]
# The extraction pipeline was evaluated on 500 articles from six publishers...
#
# [References]
# [1] Smith, J. et al. J. Chem. Inf. Model. 2020, 60, 1234–1245...
```
```
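
--------------------------------

The dictionary shape described for `pdf.section()` (section title mapped to a list of text blocks) lends itself to simple post-processing. The sketch below is illustrative only: `summarize_sections` is a hypothetical helper that is not part of PDFDataExtractor, and `sample_sections` is hand-written stand-in data shaped like the documented return value, so the code runs without the library or a PDF.

```python
def summarize_sections(sections, preview_chars=60):
    """Return one summary dict per section.

    `sections` is assumed to map section titles to lists of text blocks,
    as described for pdf.section(). This helper is hypothetical and not
    part of PDFDataExtractor.
    """
    summaries = []
    for title, text_blocks in sections.items():
        # Join the blocks the same way the section example does
        text = ' '.join(text_blocks)
        summaries.append({
            'title': title,
            'blocks': len(text_blocks),
            'words': len(text.split()),
            'preview': text[:preview_chars],
        })
    return summaries


# Hand-written stand-in for the return value of pdf.section()
sample_sections = {
    'Introduction': [
        'Automated extraction is a key challenge.',
        'We address it here.',
    ],
    'References': [],
}

for row in summarize_sections(sample_sections):
    print(f"{row['title']}: {row['blocks']} blocks, {row['words']} words")
```

With a real document, the stand-in dictionary would presumably be replaced by the result of `pdf.section()`; empty sections are kept in the summary so that missing content stays visible.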