### Install Substack2Markdown Dependencies

Source: https://github.com/timf34/substack2markdown/blob/main/README.md

This snippet shows the commands to clone the Substack2Markdown repository and install its Python dependencies using pip. It also includes optional steps for creating and activating a virtual environment.

```bash
git clone https://github.com/yourusername/substack_scraper.git
cd substack_scraper

# # Optionally create a virtual environment
# python -m venv venv
# # Activate the virtual environment
# .\venv\Scripts\activate  # Windows
# source venv/bin/activate  # Linux

pip install -r requirements.txt
```

--------------------------------

### URL Discovery using Sitemap/Feed Parsing (Python API)

Source: https://context7.com/timf34/substack2markdown/llms.txt

Demonstrates initializing the SubstackScraper with a base URL and output directories. This setup is preparatory for discovering post URLs, likely through parsing sitemap.xml or feed.xml, although the discovery logic itself is not shown in this snippet.

```python
from substack_scraper import SubstackScraper

scraper = SubstackScraper(
    base_substack_url="https://example.substack.com/",
    md_save_dir="output",
    html_save_dir="output_html"
)
```

--------------------------------

### Get All Post URLs from Substack

Source: https://context7.com/timf34/substack2markdown/llms.txt

Retrieves all post URLs from a Substack publication. This function is typically called during the scraper's initialization. It filters out URLs containing 'about', 'archive', or 'podcast'. If 'sitemap.xml' fails, it falls back to 'feed.xml' and warns that the fallback returns only a limited number of recent posts.

```python
all_urls = scraper.get_all_post_urls()

# Output example:
# ['https://example.substack.com/p/first-post',
#  'https://example.substack.com/p/second-post',
#  'https://example.substack.com/p/third-post',
#  ...]
# Note: URLs containing 'about', 'archive', 'podcast' are filtered out
# If sitemap.xml fails, automatically falls back to feed.xml
# Warning: "Falling back to feed.xml. This will only contain up to the 22 most recent posts."
```

--------------------------------

### Command Line Interface for Basic (Free) Substack Scraping

Source: https://context7.com/timf34/substack2markdown/llms.txt

Executes the Substack scraper from the command line to download free content. Allows specifying the Substack URL, output directories for Markdown and HTML, and the number of posts to scrape.

```bash
# Scrape free content (basic scraper)
python substack_scraper.py \
  --url https://astralcodexten.substack.com \
  --directory ./my_posts \
  --html-directory ./my_html \
  --number 25

# Expected output:
# Created md directory ./my_posts/astralcodexten
# Created html directory ./my_html/astralcodexten
# 100%|██████████| 25/25 [00:45<00:00, 1.81s/it]
# Generated index at ./my_html/astralcodexten.html

# View the generated index page in browser
# It will contain sortable links to all 25 scraped posts
```

--------------------------------

### Command Line Interface for Premium Substack Scraping

Source: https://context7.com/timf34/substack2markdown/llms.txt

Runs the Substack scraper from the command line to download premium content, which requires authentication. Uses the `--premium` flag and supports headless mode. Requires user credentials to be configured in `config.py`.

```bash
# Scrape premium content (requires config.py with credentials)
python substack_scraper.py \
  --url https://www.thefitzwilliam.com \
  --directory ./premium_posts \
  --premium \
  --headless \
  --number 10

# Expected output:
# Created md directory ./premium_posts/thefitzwilliam
# Created html directory substack_html_pages/thefitzwilliam
# Logging into Substack...
# Login successful
# 100%|██████████| 10/10 [02:15<00:00, 13.50s/it]
# Saved 10 premium posts

# Files created:
# - ./premium_posts/thefitzwilliam/*.md (10 markdown files)
# - ./substack_html_pages/thefitzwilliam/*.html (10 HTML files)
# - ./data/thefitzwilliam.json (metadata)
# - ./substack_html_pages/thefitzwilliam.html (browsable index)
```

--------------------------------

### Configure Premium Content Credentials

Source: https://github.com/timf34/substack2markdown/blob/main/README.md

This Python snippet shows how to update the `config.py` file with your Substack email and password for accessing premium content. Replace the placeholder values with your actual credentials.

```python
EMAIL = "your-email@domain.com"
PASSWORD = "your-password"
```

--------------------------------

### Run Substack2Markdown Scraper

Source: https://github.com/timf34/substack2markdown/blob/main/README.md

These commands illustrate how to run the Substack2Markdown Python script. You can hardcode the URL and directory, or specify them as command-line arguments. Options for premium content and limiting the number of posts are also shown.

```bash
# Run with hardcoded values
python substack_scraper.py

# For free Substack sites
python substack_scraper.py --url https://example.substack.com --directory /path/to/save/posts

# For premium Substack sites
python substack_scraper.py --url https://example.substack.com --directory /path/to/save/posts --premium

# To scrape a specific number of posts
python substack_scraper.py --url https://example.substack.com --directory /path/to/save/posts --number 5
```

--------------------------------

### Scrape Premium Substack Content with Python API

Source: https://context7.com/timf34/substack2markdown/llms.txt

Initializes the PremiumSubstackScraper for authenticated access to both free and premium posts. Requires Selenium, Microsoft Edge, and optional configuration for headless mode and browser paths.
User credentials must be set in a `config.py` file.

```python
from substack_scraper import PremiumSubstackScraper

# Initialize premium scraper with authentication
scraper = PremiumSubstackScraper(
    base_substack_url="https://www.thefitzwilliam.com/",
    md_save_dir="substack_md_files",
    html_save_dir="substack_html_pages",
    headless=True,        # Run browser in background
    edge_path="",         # Optional: custom Edge browser path
    edge_driver_path="",  # Optional: custom driver path
    user_agent=""         # Optional: custom user agent for captcha bypass
)

# Note: Requires config.py with EMAIL and PASSWORD set
# EMAIL = "subscriber@example.com"
# PASSWORD = "your-password"

# Scrape all premium posts
scraper.scrape_posts(num_posts_to_scrape=0)

# Output:
# Created md directory substack_md_files/thefitzwilliam
# Created html directory substack_html_pages/thefitzwilliam
# 100%|██████████| 50/50 [05:30<00:00, 6.60s/it]
# Saved 50 posts including premium content
```

--------------------------------

### Configure Substack Authentication Credentials

Source: https://context7.com/timf34/substack2markdown/llms.txt

Sets up authentication credentials for scraping premium Substack content. This involves defining the subscriber's email and password in a configuration file, typically named 'config.py'.

```python
# config.py
EMAIL = "subscriber@example.com"
PASSWORD = "your-secure-password"
```

--------------------------------

### Scrape Free Substack Content with Python API

Source: https://context7.com/timf34/substack2markdown/llms.txt

Initializes the SubstackScraper to download free posts from a given Substack URL. Saves content to specified Markdown and HTML directories. Supports scraping all posts or a specified number of recent posts.
```python
from substack_scraper import SubstackScraper

# Initialize scraper for free content
scraper = SubstackScraper(
    base_substack_url="https://astralcodexten.substack.com/",
    md_save_dir="substack_md_files",
    html_save_dir="substack_html_pages"
)

# Scrape all posts (num_posts_to_scrape=0 means all)
scraper.scrape_posts(num_posts_to_scrape=0)

# Or scrape only the 10 most recent posts
scraper.scrape_posts(num_posts_to_scrape=10)

# Output:
# Created md directory substack_md_files/astralcodexten
# Created html directory substack_html_pages/astralcodexten
# 100%|██████████| 250/250 [02:15<00:00, 1.84it/s]

# Files saved to:
# - substack_md_files/astralcodexten/*.md
# - substack_html_pages/astralcodexten/*.html
# - data/astralcodexten.json
# - substack_html_pages/astralcodexten.html (index page)
```

--------------------------------

### Premium Substack Scraping with Authentication (Python)

Source: https://context7.com/timf34/substack2markdown/llms.txt

Scrapes premium Substack content, authenticating with the email and password from config.py. It uses Selenium for login and saves content in Markdown and HTML formats. Add config.py to .gitignore so your credentials are not committed.
```python
from config import EMAIL, PASSWORD
from substack_scraper import PremiumSubstackScraper

scraper = PremiumSubstackScraper(
    base_substack_url="https://premium.substack.com/",
    md_save_dir="premium_output",
    html_save_dir="premium_html"
)

# The scraper will automatically use EMAIL and PASSWORD
# to authenticate during initialization

# Credentials are used only for Selenium login at substack.com/sign-in
# Never shared with third parties

# Security note: Add config.py to .gitignore to protect credentials
# Example .gitignore entry:
# config.py
# *.pyc
# __pycache__/
```

--------------------------------

### Batch Substack Scraping with Error Handling and Progress (Python)

Source: https://context7.com/timf34/substack2markdown/llms.txt

Scrapes multiple posts from a Substack publication with automatic error handling and progress tracking using tqdm. It skips existing files, continues on errors, and saves JSON metadata. Requires the substack_scraper library.

```python
from substack_scraper import SubstackScraper
from tqdm import tqdm

scraper = SubstackScraper(
    base_substack_url="https://example.substack.com/",
    md_save_dir="output",
    html_save_dir="output_html"
)

# The scrape_posts method includes built-in error handling
scraper.scrape_posts(num_posts_to_scrape=50)

# Output with progress bar:
# 100%|████████████████████| 50/50 [01:30<00:00, 1.80s/it]
# Skipping premium article: https://example.substack.com/p/premium-post
# File already exists: output/example/existing-post.md
# Error scraping post: Connection timeout
# Successfully scraped: 47/50 posts

# Behavior:
# - Skips already downloaded files (checks if .md exists)
# - Continues on errors (prints error, moves to next post)
# - Tracks progress with tqdm progress bar
# - Saves JSON metadata after completion
# - Generates HTML index automatically
```

--------------------------------

### Generate Browsable HTML Interface for Substack Posts

Source: https://context7.com/timf34/substack2markdown/llms.txt

Creates an
interactive HTML page for browsing scraped Substack posts. Features include toggling between Markdown and HTML views, sorting by publication date or like count, and direct links to posts. The page generation uses JSON data and embeds sorting logic.

```python
from substack_scraper import generate_html_file

# Generate author page from JSON data
generate_html_file(author_name="astralcodexten")

# Creates: substack_html_pages/astralcodexten.html

# Features:
# - Toggle between Markdown and HTML views
# - Sort by publication date (ascending/descending)
# - Sort by like count (ascending/descending)
# - Direct links to all saved posts
# - Displays title, subtitle, likes, and date for each post

# Example generated HTML structure:
# a page titled "astralcodexten" with an "astralcodexten" heading,
# followed by the sortable list of posts
```

--------------------------------

### Markdown to Styled HTML Conversion and Saving (Python)

Source: https://context7.com/timf34/substack2markdown/llms.txt

Converts Markdown content, including metadata, into styled HTML using the BaseSubstackScraper class. The output HTML is saved to a specified path with CSS styling, preserving all Markdown formatting and adding responsive meta tags.

```python
from substack_scraper import BaseSubstackScraper

scraper = BaseSubstackScraper(
    base_substack_url="https://example.substack.com/",
    md_save_dir="output",
    html_save_dir="output_html"
)

# Markdown content with metadata
markdown_content = """
# The Future of AI

## Exploring machine learning boundaries

**Jan 15, 2025**

**Likes:** 42

This is the post content with **bold** and *italic* text.

- List item 1
- List item 2
"""

# Convert to HTML
html_output = scraper.md_to_html(markdown_content)

# Save with CSS styling
output_path = "output_html/example/future-of-ai.html"
scraper.save_to_html_file(output_path, html_output)

# Generated HTML includes:
# - Responsive viewport meta tag
# - Link to assets/css/essay-styles.css (relative path)
# - Properly wrapped in the standard HTML document tags
# - Content in a wrapper element for styling
# - All Markdown formatting preserved (headings, lists, bold, italic, links)

# Example output file:
# an HTML document titled "Markdown Content" whose body contains the
# rendered headings "The Future of AI" and
# "Exploring machine learning boundaries", followed by the rest of the post
```

--------------------------------

### Convert HTML to Markdown

Source: https://context7.com/timf34/substack2markdown/llms.txt

Converts raw HTML content from Substack posts into clean Markdown format. This function preserves formatting, links, lists, and overall structure. The body width is set to 0, meaning no line wrapping occurs in the output.

```python
from substack_scraper import BaseSubstackScraper

# Example HTML content from a Substack post
html_content = """
<h2>Section Title</h2>
<p>This is a <strong>bold text</strong> and <em>italic text</em>.</p>
<a href="https://example.com">Example Link</a>
<ul>
  <li>Item 1</li>
  <li>Item 2</li>
</ul>
"""

# Convert to Markdown
markdown_output = BaseSubstackScraper.html_to_md(html_content)

# Output:
"""
## Section Title

This is a **bold text** and _italic text_.

[Example Link](https://example.com)

* Item 1
* Item 2
"""

# Note: Preserves links, formatting, lists, and structure
# Body width set to 0 (no line wrapping)
```

--------------------------------

### Extract Post Data from Substack HTML

Source: https://context7.com/timf34/substack2markdown/llms.txt

Extracts structured metadata and content from the HTML of a Substack post. It retrieves the title, subtitle, like count, publication date, and the main body content formatted as Markdown. Requires the BeautifulSoup and requests libraries.

```python
from substack_scraper import SubstackScraper
from bs4 import BeautifulSoup
import requests

scraper = SubstackScraper(
    base_substack_url="https://example.substack.com/",
    md_save_dir="output",
    html_save_dir="output_html"
)

# Fetch a post
url = "https://example.substack.com/p/my-great-post"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# Extract all post data
title, subtitle, like_count, date, markdown_content = scraper.extract_post_data(soup)

# Output example:
# title = "The Future of AI"
# subtitle = "Exploring the boundaries of machine learning"
# like_count = "42"
# date = "Jan 15, 2025"
# markdown_content = """
# # The Future of AI
#
# ## Exploring the boundaries of machine learning
#
# **Jan 15, 2025**
#
# **Likes:** 42
#
# [Full post content in Markdown format...]
# """
```

--------------------------------

### Save Substack Post Metadata to JSON

Source: https://context7.com/timf34/substack2markdown/llms.txt

Saves extracted post metadata (title, subtitle, likes, date, file links) into a JSON file. This facilitates indexing, searching, and integration with other applications. The function merges new data with existing entries if the JSON file already exists.
```python
from substack_scraper import SubstackScraper

scraper = SubstackScraper(
    base_substack_url="https://astralcodexten.substack.com/",
    md_save_dir="output",
    html_save_dir="output_html"
)

# Essays data collected during scraping
essays_data = [
    {
        "title": "Book Review: Why We Sleep",
        "subtitle": "The new science of sleep and dreams",
        "like_count": "156",
        "date": "Jan 10, 2025",
        "file_link": "output/astralcodexten/book-review-why-we-sleep.md",
        "html_link": "output_html/astralcodexten/book-review-why-we-sleep.html"
    },
    {
        "title": "Predictions for 2025",
        "subtitle": "",
        "like_count": "203",
        "date": "Jan 01, 2025",
        "file_link": "output/astralcodexten/predictions-for-2025.md",
        "html_link": "output_html/astralcodexten/predictions-for-2025.html"
    }
]

# Save to JSON
scraper.save_essays_data_to_json(essays_data)

# Creates/updates: data/astralcodexten.json
# Merges with existing data if file already exists
# Used by HTML interface for sorting and navigation
```
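The merge-on-save behavior described for the JSON metadata can be illustrated with a minimal standalone sketch. This is not the library's implementation: `merge_essays` and `save_merged` are hypothetical helper names, and treating `file_link` as the key that identifies a post is an assumption made only for this example.

```python
import json
import os
import tempfile


def merge_essays(existing, new):
    """Return existing entries plus new entries whose file_link is not yet present.

    Hypothetical helper: using file_link as the identity key is an assumption.
    """
    seen = {entry["file_link"] for entry in existing}
    merged = list(existing)
    for entry in new:
        if entry["file_link"] not in seen:
            merged.append(entry)
            seen.add(entry["file_link"])
    return merged


def save_merged(json_path, new_entries):
    """Merge new_entries into the JSON file at json_path, creating it if needed."""
    existing = []
    if os.path.exists(json_path):
        with open(json_path, "r", encoding="utf-8") as f:
            existing = json.load(f)
    merged = merge_essays(existing, new_entries)
    parent = os.path.dirname(json_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False, indent=4)
    return merged


# Demo in a temporary directory so no real data/ folder is touched.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data", "astralcodexten.json")
    first = {"title": "Predictions for 2025", "file_link": "output/astralcodexten/predictions-for-2025.md"}
    second = {"title": "Book Review: Why We Sleep", "file_link": "output/astralcodexten/book-review-why-we-sleep.md"}
    save_merged(path, [first])
    result = save_merged(path, [first, second])  # first is already stored
    print(len(result), "entries stored")
```

Keying on a single field keeps re-runs idempotent: scraping the same post twice leaves one entry, which matters if the index page is built from this file.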