### Install PDFDataExtractor and ChemDataExtractor

Source: https://context7.com/cat-lemonade/pdfdataextractor/llms.txt

Clone the repository and install PDFDataExtractor. Optionally, install ChemDataExtractor for chemistry-aware extraction.

```sh
# Clone the repository
git clone git@github.com:cat-lemonade/PDFDataExtractor.git

# Install PDFDataExtractor from the repository root
cd PDFDataExtractor
python setup.py install

# Optional: install ChemDataExtractor for chemistry-aware extraction
pip install chemdataextractor
```
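To confirm that both packages are importable after installation, a small check along these lines may help. It is only a sketch: `is_installed` is a hypothetical helper, not part of either library, and it assumes the import names are `pdfdataextractor` (as used in the import statements in these docs) and `chemdataextractor`.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Assumed import names; adjust if your environment differs
for name in ("pdfdataextractor", "chemdataextractor"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

Because `find_spec` only locates the module, the check stays fast and does not trigger any import-time side effects.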

--------------------------------

### Install PDFDataExtractor Package

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/docs/build/html/_sources/getting_started/installation.rst.txt

Run this command in the downloaded repository directory to install the PDFDataExtractor package.

```sh
python setup.py install
```

--------------------------------

### Install ChemDataExtractor Dependency

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/docs/build/html/_sources/getting_started/installation.rst.txt

Install ChemDataExtractor using pip if you need to perform chemistry-related information extraction, such as identifying chemical names.

```sh
pip install chemdataextractor
```

--------------------------------

### Extract Chemistry Information from PDF

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/docs/source/perform_extraction/extraction.md

Pass `chem=True` to enable chemistry-related information extraction with ChemDataExtractor alongside metadata extraction. Both libraries must be installed.

```python
# Import the Reader class
from pdfdataextractor import Reader

# Path to the PDF file
file = r'path to the file'

# Create an instance
reader = Reader()

# Read the file
pdf = reader.read_file(file)

# Show extracted chemical information
r = pdf.abstract(chem=True)
r.records.serialize()
```

--------------------------------

### Iterate and Print PDF References

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

Iterate over the references extracted from a PDF and print the sequence number and text of each one. Assumes the pdfdataextractor library is installed and `pdf` is the object returned by `read_file()`.

```python
for seq, ref in pdf.reference().items():
    print(seq)
    print(ref)
```
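Since the reference documentation describes the keys as string indices, plain dictionary or string ordering may not match numeric order (for example `"10"` sorts before `"2"`). The sketch below orders entries numerically; `sorted_references` is a hypothetical helper and the sample dictionary is made-up data shaped like the output of `pdf.reference()`.

```python
def sorted_references(references: dict) -> list:
    """Return (index, text) pairs ordered by numeric index.

    Keys that cannot be read as integers (an assumption about
    unnumbered styles) are placed after the numeric ones.
    """
    numeric, other = [], []
    for key, text in references.items():
        try:
            numeric.append((int(key), text))
        except (TypeError, ValueError):
            other.append((key, text))
    numeric.sort(key=lambda pair: pair[0])
    return numeric + other


# Hypothetical data, not real extraction output
sample = {"10": "Ref ten", "2": "Ref two", "1": "Ref one"}
for index, text in sorted_references(sample):
    print(index, text)
```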

--------------------------------

### Extract Images from PDF (Temporarily unavailable)

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/docs/build/html/_sources/perform_extraction/extraction.rst.txt

This code is intended to extract images from a PDF file. Note that this feature is temporarily unavailable. The example shows how to access a specific image by its index.

```python
# Import PDFDataExtractor
from pdfdataextractor import Reader

# Specify the path to the PDF file
path = r'the path to the PDF file'

# Create an instance
file = Reader()

# Read the file
pdf = file.read_file(path)

# To access a specific image
pdf.image()[0]
```

--------------------------------

### Extract Pure Text from PDF

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/docs/build/html/_sources/perform_extraction/extraction.rst.txt

Use this snippet to extract plain text content from a PDF file. Ensure the PDFDataExtractor library is installed and the correct path to the PDF is provided.

```python
# Import PDFDataExtractor
from pdfdataextractor import Reader

# Specify the path to the PDF file
path = r'the path to the PDF file'

# Create an instance
file = Reader()

# Read the file
pdf = file.read_file(path)

# Get pure text
pdf.plaintext()
```

--------------------------------

### Extract Full Plain Text

Source: https://context7.com/cat-lemonade/pdfdataextractor/llms.txt

Get the entire document as a single concatenated string in reading order. Useful for full-text indexing or downstream NLP pipelines.

```python
from pdfdataextractor import Reader

reader = Reader()
pdf = reader.read_file('article.pdf')

# Get entire document as plain text
text = pdf.plaintext()
print(text[:500])
# Output: "Journal of Chemical Information and Modeling
#
# PDFDataExtractor: A Tool for Reading Scientific Text..."
```
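Text extracted from PDFs often keeps typographic ligatures (such as the single character "ﬁ" in place of "fi") and hard line wraps, which can interfere with indexing or tokenization. The following is a minimal cleanup sketch using only the standard library; `normalize_text` is a hypothetical helper that operates on any string, including one returned by `pdf.plaintext()`.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Expand typographic ligatures and rejoin hard-wrapped lines."""
    # NFKC maps compatibility characters such as U+FB01 to plain "fi"
    text = unicodedata.normalize("NFKC", text)
    # Keep paragraph breaks (blank lines) but unwrap single newlines
    paragraphs = [" ".join(p.split()) for p in text.split("\n\n")]
    return "\n\n".join(p for p in paragraphs if p)


# "\ufb01" is the "fi" ligature, "\ufb00" is the "ff" ligature
print(normalize_text("Scienti\ufb01c\nLiterature\n\na\ufb00ording"))
```

Note that NFKC also folds other compatibility characters (for example superscript digits become ordinary digits), which may not be desirable for every pipeline.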

--------------------------------

### Get Section Titles and Text

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

Use the `pdf.section()` method to retrieve a dictionary where keys are section titles and values are lists of text content within each section. This is useful for parsing structured documents.

```python
pdf.section()
```
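As an illustration of working with that structure, the sketch below counts words per section. `section_word_counts` is a hypothetical helper and the sample dictionary is made-up data shaped like the documented return value (titles mapped to lists of text blocks).

```python
def section_word_counts(sections: dict) -> dict:
    """Map each section title to its number of whitespace-separated words."""
    return {
        title: sum(len(block.split()) for block in blocks)
        for title, blocks in sections.items()
    }


# Hypothetical data, not real extraction output
sample_sections = {
    "Introduction": ["Automated extraction is hard.", "It matters."],
    "References": [],
}
print(section_word_counts(sample_sections))
```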

--------------------------------

### Initialize PDF Path

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

Specify the file path for the PDF document to be processed.

```python
file_test = r'/Volumes/Backup/PDE_papers/articles/Elesvier/dssc/The-effect-of-molecular-structure-on-the-properties-of-quinox_2020_Dyes-and-.pdf'
```

--------------------------------

### Initialize PDF Reader

Source: https://context7.com/cat-lemonade/pdfdataextractor/llms.txt

Initialize the Reader class, which is the entry point for PDF processing. Optionally show the detected publisher name.

```python
from pdfdataextractor import Reader

# Initialize reader (optionally show detected publisher name)
reader = Reader(showPublisher=True)

# Read a single PDF file — auto-detects publisher and returns a template object
pdf = reader.read_file('path/to/article.pdf')

# Verify that the PDF was loaded and a template was returned
pdf.test()
# Output: PDF returned successfully
```

--------------------------------

### Initialize Reader

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

Create an instance of the Reader class to handle PDF reading.

```python
reader = Reader()
```

--------------------------------

### Extract Journal Information

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

Retrieves journal-related information from the PDF. Can be used to get the full journal string or specific components like name, year, volume, or page.

```APIDOC
## pdf.journal()

### Description
Retrieves a dictionary containing journal information (name, year, volume, page) from the PDF.

### Method
`pdf.journal()`

### Parameters
None

### Response
#### Success Response
- **name** (string) - The full journal name and citation.
- **year** (string) - The publication year.
- **volume** (string) - The publication volume.
- **page** (string) - The page range.

### Request Example
```python
pdf.journal()
```

### Response Example
```json
{
  "name": "J. Chem. Inf. Model. 2016, 56, 1894−1904",
  "year": "2016",
  "volume": "56",
  "page": "1894-1904"
}
```

## pdf.journal(field)

### Description
Retrieves a specific field of journal information from the PDF.

### Method
`pdf.journal(field: str)`

### Parameters
#### Path Parameters
- **field** (string) - Required - The specific field to retrieve. Accepted values: 'name', 'year', 'volume', 'page'.

### Request Example
```python
pdf.journal('year')
```

### Response
#### Success Response
- Returns the string value of the requested field.

### Response Example
```
'2016'
```
```

--------------------------------

### Pass a PDF File to Reader

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/docs/build/html/_sources/getting_started/running.rst.txt

Specify the path to your PDF file, create an instance of the Reader, and then read the file. Includes a test to verify successful reading.

```python
# Import the Reader class
from pdfdataextractor import Reader

# Specify the path to the PDF file
path = r'the path to the PDF file'

# Create an instance
file = Reader()

# Read the file
pdf = file.read_file(path)

# Test if pdf is returned successfully
pdf.test()
```

--------------------------------

### Load and Parse a PDF

Source: https://context7.com/cat-lemonade/pdfdataextractor/llms.txt

Read a PDF file, auto-detect the publisher, and dispatch to the matching template. Batch process multiple PDFs.

```python
from pdfdataextractor import Reader

reader = Reader(showPublisher=True)

# Returns a publisher-specific template object (e.g., ElsevierTemplate)
pdf = reader.read_file('elsevier_article.pdf')
# Console output: Publisher: *** elsevier ***
#                 *** Elsevier detected ***

# Batch processing multiple PDFs
import glob

for filepath in glob.glob('/data/articles/acs/*.pdf'):
    try:
        pdf = reader.read_file(filepath)
        if pdf:
            print(pdf.title())
    except Exception:
        pass
```
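The batch loop above discards every exception, so failed files leave no trace. One possible alternative is to collect failures for later inspection. In this sketch `read_all` is a hypothetical helper; it takes the reading function as an argument, so `reader.read_file` can be passed in directly.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple


def read_all(
    paths: Iterable[str], read_fn: Callable[[str], Any]
) -> Tuple[Dict[str, Any], List[Tuple[str, str]]]:
    """Apply read_fn to every path, collecting results and failures."""
    results: Dict[str, Any] = {}
    failures: List[Tuple[str, str]] = []
    for path in paths:
        try:
            parsed = read_fn(path)
        except Exception as exc:  # keep going, but remember what failed
            failures.append((path, repr(exc)))
            continue
        if parsed:
            results[path] = parsed
        else:
            failures.append((path, "no template returned"))
    return results, failures
```

With a `Reader` instance this might be called as `results, failures = read_all(glob.glob('/data/articles/acs/*.pdf'), reader.read_file)`, after which `failures` lists each problematic path alongside the error that was raised.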

--------------------------------

### Initialize Reader

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

Create an instance of the Reader class. This object will be used to read and process PDF files.

```python
file = Reader()
```

--------------------------------

### Read PDF File with PDFDataExtractor

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/docs/build/html/getting_started/running.html

Specify the path to your PDF file, create a Reader instance, and read the file. Includes a test to verify successful file reading.

```python
# Import the Reader class
from pdfdataextractor import Reader

# Specify the path to the PDF file
path = r'the path to the PDF file'

# Create an instance
file = Reader()

# Read the file
pdf = file.read_file(path)

# Test if pdf is returned successfully
pdf.test()
```

--------------------------------

### Reader() — Initialize the PDF Reader

Source: https://context7.com/cat-lemonade/pdfdataextractor/llms.txt

The Reader class is the primary entry point for PDF processing. It handles publisher auto-detection and directs PDFs to the appropriate publisher-specific template for extraction.

```APIDOC
## Reader()

### Description
Initializes the Reader class, which is the entry point for all PDF processing. It handles publisher auto-detection and routes each PDF to the appropriate publisher-specific template for extraction.

### Method
```python
Reader(showPublisher=True)
```

### Parameters
#### Optional Parameters
- **showPublisher** (bool) - Optional - If True, prints the detected publisher name when a file is read.

### Request Example
```python
from pdfdataextractor import Reader

# Initialize reader (optionally show detected publisher name)
reader = Reader(showPublisher=True)
```
```

--------------------------------

### Specify PDF File Path

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

Define the path to the PDF file you want to process. Use raw strings (r'') to handle backslashes correctly.

```python
path = r'/Users/miao/Downloads/acs.jcim.6b00207.pdf'
```
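Raw strings matter most for Windows-style paths, where sequences such as `\n` or `\t` would otherwise be read as newline or tab characters. The sketch below uses a made-up path purely for illustration, and `looks_like_pdf` is a hypothetical helper for a quick sanity check before reading a file.

```python
from pathlib import Path

# Hypothetical Windows-style path used only for illustration
raw = r'C:\new_papers\table.pdf'
escaped = 'C:\\new_papers\\table.pdf'

# Both spellings produce the same string; the raw form is easier to read
assert raw == escaped


def looks_like_pdf(path: str) -> bool:
    """Return True if the path has a .pdf extension (case-insensitive)."""
    return Path(path).suffix.lower() == '.pdf'


print(looks_like_pdf(raw))
```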

--------------------------------

### Clone PDFDataExtractor Repository

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/docs/build/html/_sources/getting_started/installation.rst.txt

Use this command to download the PDFDataExtractor source code from GitHub.

```sh
git clone git@github.com:cat-lemonade/PDFDataExtractor.git
```

--------------------------------

### Extract Figure and Table Captions from PDF

Source: https://context7.com/cat-lemonade/pdfdataextractor/llms.txt

Returns a dictionary of figure and table captions, keyed by identifiers like 'figure 1' or 'table 1'. Setting nicely=True will pretty-print captions to standard output.

```python
from pdfdataextractor import Reader

reader = Reader()
pdf = reader.read_file('article.pdf')

# Get captions as a dictionary
captions = pdf.caption()
print(captions)
# Output: {
#   'figure 1': 'Figure 1. Schematic of the PDFDataExtractor pipeline...',
#   'figure 2': 'Figure 2. Precision-recall curves for each publisher template...', 
#   'table 1':  'Table 1. Summary of extraction performance across publishers.'
# }

# Pretty-print captions
pdf.caption(nicely=True)
# figure 1
# Figure 1. Schematic of the PDFDataExtractor pipeline...
# 
# figure 2
# Figure 2. Precision-recall curves for each publisher template...

```
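Because the keys follow the `'figure N'` / `'table N'` pattern shown above, the dictionary can be separated by prefix. This is a sketch with a hypothetical helper, `split_captions`, applied to sample data copied from the example output rather than to a real PDF.

```python
def split_captions(captions: dict) -> tuple:
    """Split a caption dictionary into (figures, tables) by key prefix."""
    figures = {k: v for k, v in captions.items() if k.lower().startswith("figure")}
    tables = {k: v for k, v in captions.items() if k.lower().startswith("table")}
    return figures, tables


# Sample data shaped like the example output above
sample = {
    "figure 1": "Figure 1. Schematic of the PDFDataExtractor pipeline...",
    "table 1": "Table 1. Summary of extraction performance across publishers.",
}
figures, tables = split_captions(sample)
print(len(figures), "figure caption(s),", len(tables), "table caption(s)")
```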

--------------------------------

### pdf.reference()

Source: https://context7.com/cat-lemonade/pdfdataextractor/llms.txt

Parses the reference section and returns a dictionary of bibliographic references, handling various styles and publisher-specific logic.

```APIDOC
## `pdf.reference()` — Extract Bibliographic References

Parses the reference section and returns a dictionary where each key is a reference index (as string) and the value is the full reference text. Handles both numbered styles `[1]`, `(1)` and unnumbered styles, with publisher-specific anchor logic and noise filtering.

```python
from pdfdataextractor import Reader

reader = Reader()
pdf = reader.read_file('article.pdf')

references = pdf.reference()
for idx, ref in references.items():
    print(f"[{idx}] {ref}")
# Output:
# [0]  Smith, J.; Jones, A. J. Am. Chem. Soc. 2019, 141, 1000–1010.
# [1]  Wang, L.; Chen, X. Angew. Chem. Int. Ed. 2020, 59, 5432–5438.
# [2]  Taylor, R. et al. Chem. Sci. 2021, 12, 8765–8773.
```
```

--------------------------------

### Read and Process PDF File

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

Use the read_file method of the Reader instance to process the specified PDF file. The output indicates the file being read and the detected publisher.

```python
pdf = file.read_file(path)
```

--------------------------------

### Import glob module

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

Imports the glob module, which is used for finding files matching a specified pattern.

```python
import glob
```

--------------------------------

### Import Reader Module

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/docs/build/html/_sources/getting_started/running.rst.txt

Import the Reader class from the pdfdataextractor library. This is the first step before using the tool.

```python
from pdfdataextractor import Reader
```

--------------------------------

### Read PDF File

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

Read the specified PDF file using the initialized Reader object.

```python
pdf = reader.read_file(file_test)
```

--------------------------------

### Test PDF Processing

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

Use this method to verify if the PDF has been processed successfully. It returns a confirmation message.

```python
pdf.test()
```

--------------------------------

### pdf.caption(nicely=False)

Source: https://context7.com/cat-lemonade/pdfdataextractor/llms.txt

Returns a dictionary of all figure and table captions found in the PDF. Setting `nicely=True` pretty-prints each caption to stdout.

```APIDOC
## `pdf.caption(nicely=False)` — Extract Figure and Table Captions

Returns a dictionary of all figure and table captions found in the PDF, keyed by `"figure 1"`, `"figure 2"`, `"table 1"`, etc. Setting `nicely=True` pretty-prints each caption to stdout.

```python
from pdfdataextractor import Reader

reader = Reader()
pdf = reader.read_file('article.pdf')

# Get captions as a dictionary
captions = pdf.caption()
print(captions)
# Output: {
#   'figure 1': 'Figure 1. Schematic of the PDFDataExtractor pipeline...',
#   'figure 2': 'Figure 2. Precision-recall curves for each publisher template...',
#   'table 1':  'Table 1. Summary of extraction performance across publishers.'
# }

# Pretty-print captions
pdf.caption(nicely=True)
# figure 1
# Figure 1. Schematic of the PDFDataExtractor pipeline...
# 
# figure 2
# Figure 2. Precision-recall curves for each publisher template...
```
```

--------------------------------

### pdf.keywords()

Source: https://context7.com/cat-lemonade/pdfdataextractor/llms.txt

Extracts author-defined keywords from the article. The behavior is publisher-specific.

```APIDOC
## `pdf.keywords()` — Extract Keywords

Extracts author-defined keywords from the article. Behaviour is publisher-specific: Elsevier strips the "Keywords" label; ACS checks within the abstract block; Angewandte and Chemistry—A European Journal use spatial coordinates to locate the keyword block.

```python
from pdfdataextractor import Reader

reader = Reader()
pdf = reader.read_file('article.pdf')

keywords = pdf.keywords()
print(keywords)
# Output (Elsevier): " machine learning, PDF extraction, chemistry, natural language processing"
# Output (ACS):      ": machine learning  pdf extraction  cheminformatics"
```
```

--------------------------------

### Extract Abstract from PDF

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

Retrieves the abstract of the document.

```python
pdf.abstract()
```

--------------------------------

### Process Single and Multiple PDFs with PDFDataExtractor

Source: https://context7.com/cat-lemonade/pdfdataextractor/llms.txt

Use `read_single()` to process one PDF and `read_multiple()` to process a list of PDFs. These functions extract chemistry-aware abstracts, titles, DOIs, authors, journals, keywords, sections, and references.

```python
from pdfdataextractor import Reader
import glob

def read_single(file):
    reader = Reader()
    pdf = reader.read_file(file)
    try:
        # Extract chemistry-aware abstract
        chem_para = pdf.abstract(chem=True)
        print(chem_para.records.serialize())
        print(pdf.title())
        print(pdf.doi())
        print(pdf.author())
        print(pdf.journal())
        print(pdf.keywords())
        print(pdf.section().keys())
        for idx, ref in pdf.reference().items():
            print(idx, ref)
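    except Exception as exc:
        # Report the failure instead of silently skipping the file
        print(f'Failed to process {file}: {exc}')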

def read_multiple(path_list):
    for seq, filepath in enumerate(path_list):
        read_single(filepath)
        print('-------------------\n')

# Process a single file
read_single('path/to/article.pdf')

# Process all PDFs in a directory
read_multiple(glob.glob('/data/articles/elsevier/*.pdf'))
```

--------------------------------

### Reader.read_file(file_name) — Load and Parse a PDF

Source: https://context7.com/cat-lemonade/pdfdataextractor/llms.txt

Reads a PDF file page by page, extracts text blocks with spatial features, auto-detects the publisher, and dispatches to the matching publisher template. Returns a template instance with all extraction methods available.

```APIDOC
## Reader.read_file(file_name)

### Description
Loads and parses a PDF file. It reads the file page by page using PDFMiner, extracts text blocks with spatial features, auto-detects the publisher from the first page's text, and then dispatches the file to the corresponding publisher template for further processing. The method returns a template instance that provides access to various extraction methods.

### Method
```python
reader.read_file(file_name)
```

### Parameters
#### Path Parameters
- **file_name** (str) - Required - The path to the PDF file to be read and processed.

### Request Example
```python
from pdfdataextractor import Reader

reader = Reader(showPublisher=True)

# Read a single PDF file
pdf = reader.read_file('elsevier_article.pdf')
# Console output: Publisher: *** elsevier ***
#                 *** Elsevier detected ***

# Batch processing multiple PDFs
import glob

for filepath in glob.glob('/data/articles/acs/*.pdf'):
    try:
        pdf = reader.read_file(filepath)
        if pdf:
            print(pdf.title())
    except Exception:
        pass
```

### Response
#### Success Response
- Returns a publisher-specific template object (e.g., ElsevierTemplate) which contains methods for extracting information from the PDF.

#### Response Example
```python
# Example of a returned template object (conceptual)
pdf = reader.read_file('some_article.pdf')
```
```

--------------------------------

### Extract Semantic Information from PDF

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/docs/build/html/perform_extraction/extraction.html

This snippet demonstrates how to extract various semantic elements from a PDF, such as captions, keywords, titles, abstracts, and journal details. It requires importing Reader and creating a file instance.

```python
from pdfdataextractor import Reader

path = r'the path to the PDF file'

file = Reader()

pdf = file.read_file(path)

pdf.caption()
pdf.keywords()
pdf.title()
pdf.doi()
pdf.abstract()
pdf.journal()
pdf.journal('name')
pdf.journal('year')
pdf.journal('volume')
pdf.journal('page')
pdf.section()
pdf.reference()
```

--------------------------------

### Serialize Chemistry Records

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

Serialize the extracted chemistry records into a list of dictionaries for further processing or display.

```python
r.records.serialize()
```
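Once serialized, the records are plain Python data and can be post-processed without ChemDataExtractor. The sketch below gathers unique chemical names. It assumes each record is a dictionary that may carry a `names` list, which is an assumption about the record layout: that layout can differ between ChemDataExtractor versions. `collect_names` is a hypothetical helper and the sample records are made up.

```python
def collect_names(serialized_records: list) -> list:
    """Return unique chemical names in first-seen order.

    Assumes each record is a dict that may hold a 'names' list;
    records without that key are skipped.
    """
    seen = []
    for record in serialized_records:
        for name in record.get("names", []):
            if name not in seen:
                seen.append(name)
    return seen


# Hypothetical serialized records, not real extraction output
sample_records = [
    {"names": ["TiO2", "titanium dioxide"]},
    {"labels": ["1"]},
    {"names": ["TiO2"]},
]
print(collect_names(sample_records))
```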

--------------------------------

### Extract Keywords from PDF

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

Attempts to extract keywords from the PDF. Note that some documents may not contain keywords, resulting in an empty string.

```python
pdf.keywords()  # Note: some articles do not contain keywords, for example the current one.
```

--------------------------------

### Define PDF reading functions

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

Defines functions to read a single PDF file and to iterate through multiple files, printing their abstracts. Ensure the Reader class is imported and available.

```python
def read_single(file):
    reader = Reader()
    pdf = reader.read_file(file)
    print(pdf.abstract())
```

```python
def read_multiple(path):
    for i in path:
        read_single(i)
        print('-------------------', '\n')
```

--------------------------------

### Extract Plain Text

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

Extracts and returns the entire plain text content of the PDF document.

```APIDOC
## pdf.plaintext()

### Description
Extracts and returns the entire plain text content of the PDF document.

### Method
`pdf.plaintext()`

### Parameters
None

### Response
#### Success Response
- **plaintext** (string) - The complete text content of the PDF.

### Request Example
```python
pdf.plaintext()
```

### Response Example
```
'Article\n\npubs.acs.org/jcim\n\nChemDataExtractor: A Toolkit for Automated Extraction of Chemical\nInformation from the Scientiﬁc Literature\nMatthew C. Swain and Jacqueline M. Cole*\n\nCavendish Laboratory, University of Cambridge, J. J. Thomson Avenue, Cambridge, CB3 0HE, U.K.\n\nABSTRACT: The emergence of “big data” initiatives has led\nto the need for tools that can automatically extract valuable\nchemical information from large volumes of unstructured data,\nsuch as the scientiﬁc literature. Since chemical information can\nbe present in ﬁgures, tables, and textual paragraphs, successful\ninformation extraction often depends on the ability to interpret\nall of these domains simultaneously. We present a complete\ntoolkit for the automated extraction of chemical entities and\ntheir associated properties, measurements, and relationships\nfrom scientiﬁc documents that can be used to populate\nstructured chemical databases. Our system provides an extensible, chemistry-aware, natural language processing pipeline for\ntokenization, part-of-speech tagging, named entity recognition, and phrase parsing. Within this scope, we report improved\nperformance for chemical named entity recognition through the use of unsupervised word clustering based on a massive corpus\nof chemistry articles. For phrase parsing and information extraction, we present the novel use of multiple rule-based grammars\nthat are tailored for interpreting speciﬁc document domains such as textual paragraphs, captions, and tables. We also describe\ndocument-level processing to resolve data interdependencies and show that this is particularly necessary for the autogeneration of\nchemical databases since captions and tables commonly contain chemical identiﬁers and references that are deﬁned elsewhere in\nthe text. The performance of the toolkit to correctly extract various types of data was evaluated, aﬀording an F-score of 93.4%,
86.8%, and 91.5% for extracting chemical identiﬁers, spectroscopic attributes, and chemical property attributes, respectively; set\nagainst the CHEMDNER chemical name extraction challenge, ChemDataExtractor yields a competitive F-score of 87.8%. All\ntools have been released under the MIT license and are available to download from http://www.chemdataextractor.org.\n\n■ INTRODUCTION\n\nScientiﬁc results are typically communicated in the form of\npapers, patents, and theses that contain unstructured and\nsemistructured data described by free-ﬂowing natural lang'
```
```

--------------------------------

### Extract Abstract with Chemistry Information

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

Extract the abstract from the PDF, enabling chemistry-related extraction by setting 'chem=True'. This utilizes ChemDataExtractor for detailed chemical entity recognition.

```python
r = pdf.abstract(chem=True)
```

--------------------------------

### Extract Bibliographic References from PDF

Source: https://context7.com/cat-lemonade/pdfdataextractor/llms.txt

Parses the reference section, returning a dictionary where keys are reference indices and values are the full reference texts. Handles numbered and unnumbered styles with publisher-specific logic.

```python
from pdfdataextractor import Reader

reader = Reader()
pdf = reader.read_file('article.pdf')

references = pdf.reference()
for idx, ref in references.items():
    print(f"[{idx}] {ref}")
# Output:
# [0]  Smith, J.; Jones, A. J. Am. Chem. Soc. 2019, 141, 1000–1010.
# [1]  Wang, L.; Chen, X. Angew. Chem. Int. Ed. 2020, 59, 5432–5438.
# [2]  Taylor, R. et al. Chem. Sci. 2021, 12, 8765–8773.

```

--------------------------------

### Extract Keywords from PDF

Source: https://context7.com/cat-lemonade/pdfdataextractor/llms.txt

Extracts author-defined keywords. Behavior varies by publisher; some strip the 'Keywords' label, others check the abstract, and some use spatial coordinates.

```python
from pdfdataextractor import Reader

reader = Reader()
pdf = reader.read_file('article.pdf')

keywords = pdf.keywords()
print(keywords)
# Output (Elsevier): " machine learning, PDF extraction, chemistry, natural language processing"
# Output (ACS):      ": machine learning  pdf extraction  cheminformatics"
```
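The two example outputs above differ in their leading characters and separators, so downstream code may want a uniform list. The sketch below is one way to normalize them; `split_keywords` is a hypothetical helper, and the rule it uses (split on commas or on runs of two or more spaces) is inferred only from those two examples.

```python
import re


def split_keywords(raw: str) -> list:
    """Split a raw keyword string into a list of clean keywords."""
    # Drop leading/trailing spaces and a leftover ':' from the label
    cleaned = raw.strip(" :")
    # Separate on commas, or on runs of two or more whitespace characters
    parts = re.split(r"\s*,\s*|\s{2,}", cleaned)
    return [part for part in parts if part]


print(split_keywords(" machine learning, PDF extraction, chemistry"))
print(split_keywords(": machine learning  pdf extraction  cheminformatics"))
```

Other publishers may use different separators, in which case the regular expression would need adjusting.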

--------------------------------

### Process multiple PDF files

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

Calls the read_multiple function with a list of PDF files obtained using glob. This will process all PDFs in the specified directory.

```python
read_multiple(glob.glob(r'/Users/miao/Desktop/test/els/*.pdf'))
```

--------------------------------

### pdf.journal(info_type=None)

Source: https://context7.com/cat-lemonade/pdfdataextractor/llms.txt

Returns a dictionary with journal name, year, volume, and page. An optional info_type string can retrieve a single field.

```APIDOC
## `pdf.journal(info_type=None)` — Extract Journal Metadata

Returns a dictionary with journal `name`, `year`, `volume`, and `page`. Pass an `info_type` string to retrieve a single field. Publisher-specific parsing handles each journal's unique header/footer layout.

```python
from pdfdataextractor import Reader

reader = Reader()
pdf = reader.read_file('rsc_article.pdf')

# Get full journal info as a dict
journal_info = pdf.journal()
print(journal_info)
# Output: {'name': 'Chemical Science', 'year': '2022', 'volume': '13', 'page': '1234'}

# Get individual fields
print(pdf.journal('name'))    # Output: "Chemical Science"
print(pdf.journal('year'))    # Output: "2022"
print(pdf.journal('volume'))  # Output: "13"
print(pdf.journal('page'))    # Output: "1234"
```
```

--------------------------------

### Extract Pure Text from PDF

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/docs/build/html/extraction_mode/extraction.html

Use this snippet to extract plain text content from a PDF file. Ensure the path to the PDF is correctly specified.

```python
# Import PDFDataExtractor
from pdfdataextractor import Reader

# Specify the path to the PDF file
path = r'the path to the PDF file'

# Create an instance
file = Reader()

# Read the file
pdf = file.read_file(path)

# Get pure text
pdf.plaintext()
```

--------------------------------

### Extract Journal Information from PDF

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

Call the `journal()` method without arguments to extract all available journal information as a dictionary. This includes the journal name, year, volume, and page numbers.

```python
pdf.journal()
```
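The returned dictionary can be turned into a short citation string. The sketch below assumes the four keys named above hold string values and skips any that are missing or empty; `format_citation` is a hypothetical helper and the sample dictionary is illustrative data, not real extraction output.

```python
def format_citation(info: dict) -> str:
    """Join the available journal fields into 'Name Year, Volume, Page'."""
    name = info.get("name", "")
    rest = [info.get(key, "") for key in ("year", "volume", "page")]
    tail = ", ".join(part for part in rest if part)
    return " ".join(part for part in (name, tail) if part)


# Illustrative data shaped like the documented return value
sample = {"name": "Chemical Science", "year": "2022", "volume": "13", "page": "1234"}
print(format_citation(sample))
```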

--------------------------------

### Extract Document Sections from PDF

Source: https://context7.com/cat-lemonade/pdfdataextractor/llms.txt

Segments the article body into named sections using publisher-specific regex patterns. Returns a dictionary mapping section titles to lists of text blocks.

```python
from pdfdataextractor import Reader

reader = Reader()
pdf = reader.read_file('article.pdf')

sections = pdf.section()
for title, text_blocks in sections.items():
    print(f"[{title}]")
    print(' '.join(text_blocks)[:200])
    print()
# Output:
# [Introduction]
# Automated extraction of information from scientific literature is a key challenge...
# 
# [Results and Discussion]
# The extraction pipeline was evaluated on 500 articles from six publishers...
# 
# [References]
# [1] Smith, J. et al. J. Chem. Inf. Model. 2020, 60, 1234–1245...

```

--------------------------------

### Access Specific Image from PDF

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/docs/build/html/perform_extraction/extraction.html

Access a specific image from a PDF file by index. The image extraction feature is documented as temporarily unavailable, so this call may not work in the current release. Assumes the Reader has been initialized and the file read.

```python
from pdfdataextractor import Reader

path = r'the path to the PDF file'

file = Reader()

pdf = file.read_file(path)

pdf.image()[0]
```

--------------------------------

### Extract Captions from PDF

Source: https://github.com/cat-lemonade/pdfdataextractor/blob/main/demo/PDE Demo.ipynb

Retrieves all figure and table captions from the PDF document. The output is a dictionary mapping identifiers such as 'figure 1' or 'table 1' to their caption text.

```python
pdf.caption()
```

--------------------------------

### pdf.section()

Source: https://context7.com/cat-lemonade/pdfdataextractor/llms.txt

Segments the article body into named sections using publisher-specific regex patterns. Returns a dictionary of section titles and their text blocks.

```APIDOC
## `pdf.section()` — Extract Document Sections

Segments the article body into named sections using publisher-specific regex patterns to detect section titles. Returns a dictionary where keys are section title strings and values are lists of text blocks belonging to that section.

```python
from pdfdataextractor import Reader

reader = Reader()
pdf = reader.read_file('article.pdf')

sections = pdf.section()
for title, text_blocks in sections.items():
    print(f"[{title}]")
    print(' '.join(text_blocks)[:200])
    print()
# Output:
# [Introduction]
# Automated extraction of information from scientific literature is a key challenge...
# 
# [Results and Discussion]
# The extraction pipeline was evaluated on 500 articles from six publishers...
# 
# [References]
# [1] Smith, J. et al. J. Chem. Inf. Model. 2020, 60, 1234–1245...
```
```