### Install Substack2Markdown Dependencies

Source: https://github.com/timf34/substack2markdown/blob/main/README.md

This snippet shows the commands to clone the Substack2Markdown repository and install its Python dependencies using pip. It also includes optional steps for creating and activating a virtual environment.

```bash
git clone https://github.com/timf34/substack2markdown.git
cd substack2markdown

# # Optionally create a virtual environment
# python -m venv venv
# # Activate the virtual environment
# .\venv\Scripts\activate  # Windows
# source venv/bin/activate  # Linux

pip install -r requirements.txt
```

--------------------------------

### URL Discovery using Sitemap/Feed Parsing (Python API)

Source: https://context7.com/timf34/substack2markdown/llms.txt

Initializes `SubstackScraper` with a base URL and output directories. This is the setup step that precedes post URL discovery from `sitemap.xml` (or `feed.xml` as a fallback); the discovery logic itself is not shown in this snippet.

```python
from substack_scraper import SubstackScraper

scraper = SubstackScraper(
    base_substack_url="https://example.substack.com/",
    md_save_dir="output",
    html_save_dir="output_html"
)

```

--------------------------------

### Get All Post URLs from Substack

Source: https://context7.com/timf34/substack2markdown/llms.txt

Retrieves all post URLs from a Substack publication. This method is typically called during the scraper's initialization. It filters out URLs containing 'about', 'archive', or 'podcast'. If 'sitemap.xml' fails, it falls back to 'feed.xml' and warns that the feed only contains up to the 22 most recent posts.

```python
all_urls = scraper.get_all_post_urls()

# Output example:
# ['https://example.substack.com/p/first-post',
#  'https://example.substack.com/p/second-post',
#  'https://example.substack.com/p/third-post',
#  ...]
# Note: URLs containing 'about', 'archive', 'podcast' are filtered out

# If sitemap.xml fails, automatically falls back to feed.xml
# Warning: "Falling back to feed.xml. This will only contain up to the 22 most recent posts."
```
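The filtering step described above can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: `urls_from_sitemap`, `filter_post_urls`, and the inline `SAMPLE_SITEMAP` are hypothetical names introduced here, and the sketch parses a local string rather than fetching anything over the network.

```python
import xml.etree.ElementTree as ET

# Keywords taken from the description above; the matching strategy is an assumption.
EXCLUDED_KEYWORDS = ("about", "archive", "podcast")

def urls_from_sitemap(xml_text):
    """Collect every <loc> value from a sitemap document."""
    root = ET.fromstring(xml_text)
    # Sitemap elements are namespaced, so compare only the local tag name.
    return [
        el.text.strip()
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text
    ]

def filter_post_urls(urls, keywords=EXCLUDED_KEYWORDS):
    """Drop any URL that contains one of the excluded keywords."""
    return [u for u in urls if not any(k in u for k in keywords)]

SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.substack.com/p/first-post</loc></url>
  <url><loc>https://example.substack.com/about</loc></url>
  <url><loc>https://example.substack.com/archive</loc></url>
  <url><loc>https://example.substack.com/p/second-post</loc></url>
</urlset>"""

post_urls = filter_post_urls(urls_from_sitemap(SAMPLE_SITEMAP))
print(post_urls)
```

Because this sketch uses plain substring matching, a post whose slug happens to contain one of the keywords (for example `/p/about-time`) would be dropped as well.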

--------------------------------

### Command Line Interface for Basic (Free) Substack Scraping

Source: https://context7.com/timf34/substack2markdown/llms.txt

Executes the Substack scraper from the command line to download free content. Allows specifying the Substack URL, output directories for Markdown and HTML, and the number of posts to scrape.

```bash
# Scrape free content (basic scraper)
python substack_scraper.py \
  --url https://astralcodexten.substack.com \
  --directory ./my_posts \
  --html-directory ./my_html \
  --number 25

# Expected output:
# Created md directory ./my_posts/astralcodexten
# Created html directory ./my_html/astralcodexten
# 100%|██████████| 25/25 [00:45<00:00,  1.81s/it]
# Generated index at ./my_html/astralcodexten.html

# View the generated index page in browser
# It will contain sortable links to all 25 scraped posts
```

--------------------------------

### Command Line Interface for Premium Substack Scraping

Source: https://context7.com/timf34/substack2markdown/llms.txt

Runs the Substack scraper from the command line to download premium content, requiring authentication. Uses the `--premium` flag and supports headless mode. Requires user credentials to be configured in `config.py`.

```bash
# Scrape premium content (requires config.py with credentials)
python substack_scraper.py \
  --url https://www.thefitzwilliam.com \
  --directory ./premium_posts \
  --premium \
  --headless \
  --number 10

# Expected output:
# Created md directory ./premium_posts/thefitzwilliam
# Created html directory substack_html_pages/thefitzwilliam
# Logging into Substack...
# Login successful
# 100%|██████████| 10/10 [02:15<00:00, 13.50s/it]
# Saved 10 premium posts

# Files created:
# - ./premium_posts/thefitzwilliam/*.md (10 markdown files)
# - ./substack_html_pages/thefitzwilliam/*.html (10 HTML files)
# - ./data/thefitzwilliam.json (metadata)
# - ./substack_html_pages/thefitzwilliam.html (browsable index)
```

--------------------------------

### Configure Premium Content Credentials

Source: https://github.com/timf34/substack2markdown/blob/main/README.md

This Python code snippet demonstrates how to update the `config.py` file to include your Substack email and password for accessing premium content. Ensure these are replaced with your actual credentials.

```python
EMAIL = "your-email@domain.com"
PASSWORD = "your-password"
```

--------------------------------

### Run Substack2Markdown Scraper

Source: https://github.com/timf34/substack2markdown/blob/main/README.md

These commands illustrate how to run the Substack2Markdown Python script. You can hardcode the URL and directory, or specify them as command-line arguments. Options for premium content and limiting the number of posts are also shown.

```bash
# Run with hardcoded values
python substack_scraper.py

# For free Substack sites
python substack_scraper.py --url https://example.substack.com --directory /path/to/save/posts

# For premium Substack sites
python substack_scraper.py --url https://example.substack.com --directory /path/to/save/posts --premium

# To scrape a specific number of posts
python substack_scraper.py --url https://example.substack.com --directory /path/to/save/posts --number 5
```
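For readers who want to see how such flags are typically wired up, here is a minimal `argparse` sketch. The flag names come from the commands in this document; the help strings, the types, and the default of `0` for `--number` are assumptions, and `build_parser` is a hypothetical helper rather than the script's actual code.

```python
import argparse

def build_parser():
    """Build a parser that accepts the flags shown in the commands above."""
    parser = argparse.ArgumentParser(description="Scrape a Substack site.")
    parser.add_argument("--url", type=str, help="Base URL of the Substack site")
    parser.add_argument("--directory", type=str, help="Directory for Markdown files")
    parser.add_argument("--html-directory", type=str, help="Directory for HTML files")
    # Assumption: 0 means "scrape everything", mirroring num_posts_to_scrape=0.
    parser.add_argument("--number", type=int, default=0, help="Number of posts to scrape")
    parser.add_argument("--premium", action="store_true", help="Use the premium scraper")
    parser.add_argument("--headless", action="store_true", help="Run the browser headless")
    return parser

# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(
    ["--url", "https://example.substack.com", "--directory", "./posts", "--number", "5"]
)
print(args.url, args.directory, args.number, args.premium)
```

Flags that are omitted keep their defaults, which is how a script can fall back to hardcoded values when it is run without arguments.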

--------------------------------

### Scrape Premium Substack Content with Python API

Source: https://context7.com/timf34/substack2markdown/llms.txt

Initializes the PremiumSubstackScraper for authenticated access to both free and premium posts. Requires Selenium and Microsoft Edge; headless mode, custom browser and driver paths, and a custom user agent are optional. User credentials must be set in a `config.py` file.

```python
from substack_scraper import PremiumSubstackScraper

# Initialize premium scraper with authentication
scraper = PremiumSubstackScraper(
    base_substack_url="https://www.thefitzwilliam.com/",
    md_save_dir="substack_md_files",
    html_save_dir="substack_html_pages",
    headless=True,  # Run browser in background
    edge_path="",  # Optional: custom Edge browser path
    edge_driver_path="",  # Optional: custom driver path
    user_agent=""  # Optional: custom user agent for captcha bypass
)

# Note: Requires config.py with EMAIL and PASSWORD set
# EMAIL = "subscriber@example.com"
# PASSWORD = "your-password"

# Scrape all premium posts
scraper.scrape_posts(num_posts_to_scrape=0)

# Output:
# Created md directory substack_md_files/thefitzwilliam
# Created html directory substack_html_pages/thefitzwilliam
# 100%|██████████| 50/50 [05:30<00:00,  6.60s/it]
# Saved 50 posts including premium content
```

--------------------------------

### Configure Substack Authentication Credentials

Source: https://context7.com/timf34/substack2markdown/llms.txt

Sets up authentication credentials for scraping premium Substack content. This involves defining the subscriber's email and password in a configuration file, typically named 'config.py'.

```python
# config.py
EMAIL = "subscriber@example.com"
PASSWORD = "your-secure-password"

```

--------------------------------

### Scrape Free Substack Content with Python API

Source: https://context7.com/timf34/substack2markdown/llms.txt

Initializes the SubstackScraper to download free posts from a given Substack URL. Saves content to specified Markdown and HTML directories. Supports scraping all posts or a specified number of recent posts.

```python
from substack_scraper import SubstackScraper

# Initialize scraper for free content
scraper = SubstackScraper(
    base_substack_url="https://astralcodexten.substack.com/",
    md_save_dir="substack_md_files",
    html_save_dir="substack_html_pages"
)

# Scrape all posts (num_posts_to_scrape=0 means all)
scraper.scrape_posts(num_posts_to_scrape=0)

# Alternatively, scrape only the 10 most recent posts
# scraper.scrape_posts(num_posts_to_scrape=10)

# Output:
# Created md directory substack_md_files/astralcodexten
# Created html directory substack_html_pages/astralcodexten
# 100%|██████████| 250/250 [02:15<00:00,  1.84it/s]
# Files saved to:
# - substack_md_files/astralcodexten/*.md
# - substack_html_pages/astralcodexten/*.html
# - data/astralcodexten.json
# - substack_html_pages/astralcodexten.html (index page)
```
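The per-publication subdirectory names in this document's examples (`astralcodexten` for `https://astralcodexten.substack.com/`, `thefitzwilliam` for `https://www.thefitzwilliam.com`) suggest that the name is derived from the URL's host. A minimal sketch of one way to do that follows; `publication_name` is a hypothetical helper and may not match how the library computes it.

```python
from urllib.parse import urlparse

def publication_name(base_url):
    """Return the main host label, skipping a leading 'www'."""
    host_parts = urlparse(base_url).netloc.split(".")
    return host_parts[1] if host_parts[0] == "www" else host_parts[0]

print(publication_name("https://astralcodexten.substack.com/"))  # astralcodexten
print(publication_name("https://www.thefitzwilliam.com/"))       # thefitzwilliam
```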

--------------------------------

### Premium Substack Scraping with Authentication (Python)

Source: https://context7.com/timf34/substack2markdown/llms.txt

Scrapes premium Substack content by logging in through Selenium with the email and password from `config.py`, then saves posts in Markdown and HTML formats. Add `config.py` to `.gitignore` so that your credentials are never committed.

```python
from config import EMAIL, PASSWORD
from substack_scraper import PremiumSubstackScraper

scraper = PremiumSubstackScraper(
    base_substack_url="https://premium.substack.com/",
    md_save_dir="premium_output",
    html_save_dir="premium_html"
)

# The scraper will automatically use EMAIL and PASSWORD
# to authenticate during initialization
# Credentials are used only for Selenium login at substack.com/sign-in
# Never shared with third parties

# Security note: Add config.py to .gitignore to protect credentials
# Example .gitignore entry:
# config.py
# *.pyc
# __pycache__/
```

--------------------------------

### Batch Substack Scraping with Error Handling and Progress (Python)

Source: https://context7.com/timf34/substack2markdown/llms.txt

Scrapes multiple posts from a Substack publication with automatic error handling and progress tracking using tqdm. It skips existing files, continues on errors, and saves JSON metadata. Requires the substack_scraper library.

```python
from substack_scraper import SubstackScraper
from tqdm import tqdm

scraper = SubstackScraper(
    base_substack_url="https://example.substack.com/",
    md_save_dir="output",
    html_save_dir="output_html"
)

# The scrape_posts method includes built-in error handling
scraper.scrape_posts(num_posts_to_scrape=50)

# Output with progress bar:
# 100%|████████████████████| 50/50 [01:30<00:00,  1.80s/it]
# Skipping premium article: https://example.substack.com/p/premium-post
# File already exists: output/example/existing-post.md
# Error scraping post: Connection timeout
# Successfully scraped: 47/50 posts

# Behavior:
# - Skips already downloaded files (checks if .md exists)
# - Continues on errors (prints error, moves to next post)
# - Tracks progress with tqdm progress bar
# - Saves JSON metadata after completion
# - Generates HTML index automatically
```
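The skip-existing and continue-on-error behavior listed above can be illustrated with a small standalone loop. This is a sketch under stated assumptions: `scrape_batch` and `fake_fetch` are hypothetical names, the fetch function is a stand-in that never touches the network, and the library's own loop may differ in detail.

```python
import os
import tempfile

def scrape_batch(urls, fetch, out_dir):
    """Save each post as <slug>.md, skipping existing files and continuing on errors."""
    saved, skipped, failed = [], [], []
    for url in urls:
        slug = url.rstrip("/").split("/")[-1]
        path = os.path.join(out_dir, f"{slug}.md")
        if os.path.exists(path):
            skipped.append(url)  # already downloaded on an earlier run
            continue
        try:
            content = fetch(url)
        except Exception as exc:
            # Report the problem and move on to the next post.
            print(f"Error scraping post: {exc}")
            failed.append(url)
            continue
        with open(path, "w", encoding="utf-8") as f:
            f.write(content)
        saved.append(url)
    return saved, skipped, failed

def fake_fetch(url):
    """Stand-in for a real request; fails for one URL to exercise error handling."""
    if url.endswith("broken"):
        raise ConnectionError("Connection timeout")
    return f"# Post from {url}\n"

with tempfile.TemporaryDirectory() as out_dir:
    urls = [f"https://example.substack.com/p/{s}" for s in ("first", "broken", "second")]
    first_run = scrape_batch(urls, fake_fetch, out_dir)
    second_run = scrape_batch(urls, fake_fetch, out_dir)

print(first_run)
print(second_run)
```

On the second run the two posts saved earlier are skipped, while the failing post is retried because no file was written for it.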

--------------------------------

### Generate Browsable HTML Interface for Substack Posts

Source: https://context7.com/timf34/substack2markdown/llms.txt

Creates an interactive HTML page for browsing scraped Substack posts. Features include toggling between Markdown and HTML views, sorting by publication date or like count, and direct links to posts. The page generation uses JSON data and embeds sorting logic.

```python
from substack_scraper import generate_html_file

# Generate author page from JSON data
generate_html_file(author_name="astralcodexten")

# Creates: substack_html_pages/astralcodexten.html
# Features:
# - Toggle between Markdown and HTML views
# - Sort by publication date (ascending/descending)
# - Sort by like count (ascending/descending)
# - Direct links to all saved posts
# - Displays title, subtitle, likes, and date for each post

# Example generated HTML structure:
# <!DOCTYPE html>
# <html>
# <head><title>astralcodexten</title></head>
# <body>
#   <h1>astralcodexten</h1>
#   <button id="toggle-format">Toggle MD/HTML</button>
#   <button id="sort-by-date">Sort by Date</button>
#   <button id="sort-by-likes">Sort by Likes</button>
#   <div id="essays-container">
#     <ul>
#       <li><a href="...">Post Title</a></li>
#       ...
#     </ul>
#   </div>
#   <script>/* Sorting logic embedded */</script>
# </body>
# </html>
```
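The generated page performs its sorting in embedded JavaScript, which is not shown here. The ordering itself can be sketched in Python, assuming metadata entries with `date` strings such as "Jan 15, 2025" and `like_count` stored as strings, as in this document's examples. `sort_by_date` and `sort_by_likes` are hypothetical helpers, and `%b` parsing assumes an English locale.

```python
from datetime import datetime

essays = [
    {"title": "Book Review: Why We Sleep", "like_count": "156", "date": "Jan 10, 2025"},
    {"title": "Predictions for 2025", "like_count": "203", "date": "Jan 01, 2025"},
    {"title": "The Future of AI", "like_count": "42", "date": "Jan 15, 2025"},
]

def sort_by_date(items, descending=True):
    """Order entries by parsed publication date."""
    return sorted(
        items,
        key=lambda e: datetime.strptime(e["date"], "%b %d, %Y"),
        reverse=descending,
    )

def sort_by_likes(items, descending=True):
    """Order entries by like count, converting the stored strings to integers."""
    return sorted(items, key=lambda e: int(e["like_count"]), reverse=descending)

print([e["title"] for e in sort_by_date(essays)])
print([e["title"] for e in sort_by_likes(essays)])
```

Converting `like_count` to an integer matters: compared as strings, "42" would sort above "203".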

--------------------------------

### Markdown to Styled HTML Conversion and Saving (Python)

Source: https://context7.com/timf34/substack2markdown/llms.txt

Converts Markdown content, including metadata, into styled HTML using the BaseSubstackScraper class. The output HTML is saved to a specified path with CSS styling, preserving all Markdown formatting and adding responsive meta tags.

```python
from substack_scraper import BaseSubstackScraper

scraper = BaseSubstackScraper(
    base_substack_url="https://example.substack.com/",
    md_save_dir="output",
    html_save_dir="output_html"
)

# Markdown content with metadata
markdown_content = """
# The Future of AI

## Exploring machine learning boundaries

**Jan 15, 2025**

**Likes:** 42

This is the post content with **bold** and *italic* text.

- List item 1
- List item 2
"""

# Convert to HTML
html_output = scraper.md_to_html(markdown_content)

# Save with CSS styling
output_path = "output_html/example/future-of-ai.html"
scraper.save_to_html_file(output_path, html_output)

# Generated HTML includes:
# - Responsive viewport meta tag
# - Link to assets/css/essay-styles.css (relative path)
# - Properly wrapped in <html>, <head>, <body> tags
# - Content in <main class="markdown-content"> for styling
# - All Markdown formatting preserved (headings, lists, bold, italic, links)

# Example output file:
# <!DOCTYPE html>
# <html lang="en">
# <head>
#   <meta charset="UTF-8">
#   <meta name="viewport" content="width=device-width, initial-scale=1.0">
#   <title>Markdown Content</title>
#   <link rel="stylesheet" href="../../assets/css/essay-styles.css">
# </head>
# <body>
#   <main class="markdown-content">
#     <h1>The Future of AI</h1>
#     <h2>Exploring machine learning boundaries</h2>
#     ...
#   </main>
# </body>
# </html>
```
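To make the structure of the saved file concrete, the following sketch wraps an already-rendered HTML fragment in a page shaped like the example output above. `wrap_in_page` is a hypothetical helper, not the library's `save_to_html_file`, and the default title and stylesheet path are simply copied from that example.

```python
from html import escape

def wrap_in_page(body_html, title="Markdown Content",
                 css_href="../../assets/css/essay-styles.css"):
    """Wrap an HTML fragment in a full page with a viewport tag and stylesheet link."""
    return f"""<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>{escape(title)}</title>
  <link rel="stylesheet" href="{escape(css_href, quote=True)}">
</head>
<body>
  <main class="markdown-content">
{body_html}
  </main>
</body>
</html>
"""

page = wrap_in_page("<h1>The Future of AI</h1>\n<p>Post content.</p>")
print(page)
```

The relative stylesheet path only resolves if the file is saved two directory levels below the folder that contains `assets/`, as in `output_html/example/future-of-ai.html`.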

--------------------------------

### Convert HTML to Markdown

Source: https://context7.com/timf34/substack2markdown/llms.txt

Converts raw HTML content from Substack posts into clean Markdown format. This function preserves formatting, links, lists, and overall structure. The body width is set to 0, meaning no line wrapping occurs in the output.

```python
from substack_scraper import BaseSubstackScraper

# Example HTML content from a Substack post
html_content = """
<div class="available-content">
    <h2>Section Title</h2>
    <p>This is a <strong>bold text</strong> and <em>italic text</em>.</p>
    <a href="https://example.com">Example Link</a>
    <ul>
        <li>Item 1</li>
        <li>Item 2</li>
    </ul>
</div>
"""

# Convert to Markdown
markdown_output = BaseSubstackScraper.html_to_md(html_content)

# Output:
# ## Section Title
#
# This is a **bold text** and _italic text_.
#
# [Example Link](https://example.com)
#
#   * Item 1
#   * Item 2
#
# Note: Preserves links, formatting, lists, and structure
# Body width set to 0 (no line wrapping)
```

--------------------------------

### Extract Post Data from Substack HTML

Source: https://context7.com/timf34/substack2markdown/llms.txt

Extracts structured metadata and content from the HTML of a Substack post. It retrieves the title, subtitle, like count, publication date, and the main body content formatted as Markdown. Requires BeautifulSoup and requests libraries.

```python
from substack_scraper import SubstackScraper
from bs4 import BeautifulSoup
import requests

scraper = SubstackScraper(
    base_substack_url="https://example.substack.com/",
    md_save_dir="output",
    html_save_dir="output_html"
)

# Fetch a post
url = "https://example.substack.com/p/my-great-post"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# Extract all post data
title, subtitle, like_count, date, markdown_content = scraper.extract_post_data(soup)

# Output example:
# title = "The Future of AI"
# subtitle = "Exploring the boundaries of machine learning"
# like_count = "42"
# date = "Jan 15, 2025"
# markdown_content = """
# # The Future of AI
# 
# ## Exploring the boundaries of machine learning
# 
# **Jan 15, 2025**
# 
# **Likes:** 42
# 
# [Full post content in Markdown format...]
# """
```
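The `markdown_content` value in the example output starts with a metadata header (title, subtitle, date, likes) followed by the post body. A minimal sketch of how such a header could be assembled is shown below; `build_post_markdown` is a hypothetical function, and omitting the subtitle line when the subtitle is empty is an assumption.

```python
def build_post_markdown(title, subtitle, date, like_count, content):
    """Prepend a metadata header to the Markdown body of a post."""
    parts = [f"# {title}"]
    if subtitle:
        parts.append(f"## {subtitle}")
    parts.append(f"**{date}**")
    parts.append(f"**Likes:** {like_count}")
    parts.append(content)
    # Blank lines between blocks keep each element a separate Markdown paragraph.
    return "\n\n".join(parts) + "\n"

post = build_post_markdown(
    title="The Future of AI",
    subtitle="Exploring the boundaries of machine learning",
    date="Jan 15, 2025",
    like_count="42",
    content="Full post content in Markdown format...",
)
print(post)
```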

--------------------------------

### Save Substack Post Metadata to JSON

Source: https://context7.com/timf34/substack2markdown/llms.txt

Saves extracted post metadata (title, subtitle, likes, date, file links) into a JSON file. This facilitates indexing, searching, and integration with other applications. The function merges new data with existing entries if the JSON file already exists.

```python
from substack_scraper import SubstackScraper

scraper = SubstackScraper(
    base_substack_url="https://astralcodexten.substack.com/",
    md_save_dir="output",
    html_save_dir="output_html"
)

# Essays data collected during scraping
essays_data = [
    {
        "title": "Book Review: Why We Sleep",
        "subtitle": "The new science of sleep and dreams",
        "like_count": "156",
        "date": "Jan 10, 2025",
        "file_link": "output/astralcodexten/book-review-why-we-sleep.md",
        "html_link": "output_html/astralcodexten/book-review-why-we-sleep.html"
    },
    {
        "title": "Predictions for 2025",
        "subtitle": "",
        "like_count": "203",
        "date": "Jan 01, 2025",
        "file_link": "output/astralcodexten/predictions-for-2025.md",
        "html_link": "output_html/astralcodexten/predictions-for-2025.html"
    }
]

# Save to JSON
scraper.save_essays_data_to_json(essays_data)

# Creates/updates: data/astralcodexten.json
# Merges with existing data if file already exists
# Used by HTML interface for sorting and navigation
```
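The merge-on-save behavior can be sketched as follows. This is an illustration under assumptions, not the library's code: `save_essays` is a hypothetical helper, entries are treated as duplicates only when they are exactly equal, and the file is written to a temporary directory.

```python
import json
import os
import tempfile

def save_essays(json_path, new_entries):
    """Write entries to json_path, keeping any entries already stored there."""
    existing = []
    if os.path.exists(json_path):
        with open(json_path, encoding="utf-8") as f:
            existing = json.load(f)
    # Append only entries that are not already present (exact match).
    merged = existing + [e for e in new_entries if e not in existing]
    os.makedirs(os.path.dirname(json_path), exist_ok=True)
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False, indent=4)
    return merged

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data", "astralcodexten.json")
    save_essays(path, [{"title": "Predictions for 2025", "like_count": "203"}])
    merged = save_essays(path, [
        {"title": "Predictions for 2025", "like_count": "203"},
        {"title": "Book Review: Why We Sleep", "like_count": "156"},
    ])

print(len(merged))  # 2
```

With exact-match deduplication, a post whose like count changed between runs would be stored twice; keying on a stable field such as `file_link` would avoid that, at the cost of deciding which version wins.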
