# Arctic Shift

Arctic Shift is a comprehensive Reddit data archive that makes historical Reddit content accessible to researchers, moderators, and the general public. The project provides access to Reddit posts and comments dating back to 2005, with data retrieved through the official Reddit API and stored in compressed JSON format.

Arctic Shift offers multiple ways to interact with this data: downloadable monthly data dumps via Academic Torrents, a RESTful API for querying specific content, and a web-based search interface. The core functionality includes searching posts and comments by various criteria (author, subreddit, date range, keywords), retrieving comment trees for specific posts, aggregating data for analytics, and accessing subreddit metadata including rules and wiki pages. The Python helper scripts enable efficient processing of compressed data dumps locally, supporting `.zst`, `.zst_blocks`, `.jsonl`, and `.json` file formats. Data is updated with a 36-hour delay to capture accurate scores and comment counts.

## API Reference

Base URL: `https://arctic-shift.photon-reddit.com`

### Retrieve Posts/Comments by ID

Fetch multiple posts or comments using their Reddit IDs. Supports up to 500 IDs per request.

```bash
# Retrieve two posts by ID
curl "https://arctic-shift.photon-reddit.com/api/posts/ids?ids=ei30r4,eitwb3"

# Retrieve comments by ID with HTML rendering
curl "https://arctic-shift.photon-reddit.com/api/comments/ids?ids=dppum98,abc123&md2html=true"

# Select specific fields only
curl "https://arctic-shift.photon-reddit.com/api/posts/ids?ids=ei30r4&fields=author,title,score,created_utc"

# Response format:
# {
#   "data": [
#     {
#       "id": "ei30r4",
#       "author": "username",
#       "title": "Post title here",
#       "subreddit": "worldnews",
#       "score": 1234,
#       "created_utc": 1577836800,
#       ...
#     }
#   ]
# }
```

### Search Posts

Search for posts with filtering by subreddit, author, date range, and keywords in title or body.
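Search requests are plain GET URLs, so they can be assembled programmatically before being handed to curl or an HTTP library. A minimal sketch in Python; the `build_search_url` helper is illustrative, not part of the project:

```python
# Illustrative helper for assembling Arctic Shift search URLs (not part of the project).
import urllib.parse

BASE_URL = "https://arctic-shift.photon-reddit.com"

def build_search_url(endpoint: str, **params) -> str:
    """Build a query URL from keyword parameters, e.g. subreddit="worldnews"."""
    query = urllib.parse.urlencode(params)
    return f"{BASE_URL}{endpoint}?{query}"

url = build_search_url(
    "/api/posts/search",
    subreddit="worldnews", title="wuhan",
    after="2019-12-30", sort="asc", limit=10,
)
print(url)
```

The resulting URL matches the curl examples; a caller could pass it to any HTTP client (e.g. `requests.get(url).json()["data"]`).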
```bash
# Search r/worldnews for posts with "wuhan" in title, after Dec 30 2019, sorted ascending
curl "https://arctic-shift.photon-reddit.com/api/posts/search?subreddit=worldnews&title=wuhan&after=2019-12-30&sort=asc&limit=10"

# Search by author with date range
curl "https://arctic-shift.photon-reddit.com/api/posts/search?author=spez&after=2020-01-01&before=2021-01-01&limit=25"

# Search with URL prefix matching
curl "https://arctic-shift.photon-reddit.com/api/posts/search?url=https://www.youtube.com/watch&subreddit=videos&limit=50"

# Full-text search in title and selftext combined
curl "https://arctic-shift.photon-reddit.com/api/posts/search?query=machine%20learning&subreddit=datascience&limit=100"

# Filter NSFW content
curl "https://arctic-shift.photon-reddit.com/api/posts/search?subreddit=funny&over_18=false&limit=25"

# Response format:
# {
#   "data": [
#     {
#       "id": "abc123",
#       "author": "username",
#       "title": "Post title",
#       "selftext": "Post body content",
#       "subreddit": "worldnews",
#       "score": 5432,
#       "num_comments": 234,
#       "created_utc": 1577836800,
#       "url": "https://example.com/article",
#       ...
#     }
#   ]
# }
```

### Search Comments

Search for comments with filtering options including post ID and parent comment ID.
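Result pages are capped by `limit`, so pulling a long history usually means re-issuing the query with a moving date cursor. One common pattern (an assumption about usage, not a documented Arctic Shift pagination API) is to advance `after` past the newest `created_utc` seen so far when sorting ascending:

```python
# Sketch of cursor-style paging over search results. The cursor pattern is an
# assumption about typical usage, not a documented pagination mechanism.
from typing import Optional

def next_after_cursor(page: list) -> Optional[int]:
    """Return the created_utc to use as the next `after` value, or None when done."""
    if not page:
        return None  # empty page: no more results to fetch
    return max(row["created_utc"] for row in page)

page = [
    {"id": "a", "created_utc": 1577836800},
    {"id": "b", "created_utc": 1577840400},
]
print(next_after_cursor(page))  # newest timestamp on the page
print(next_after_cursor([]))    # None when the page is empty
```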
```bash
# Search for comments by a specific user under a specific post
curl "https://arctic-shift.photon-reddit.com/api/comments/search?author=PresidentObama&link_id=z1c9z&limit=100"

# Search top-level comments only (no parent)
curl "https://arctic-shift.photon-reddit.com/api/comments/search?subreddit=askreddit&parent_id=&limit=50"

# Search comments containing specific text
curl "https://arctic-shift.photon-reddit.com/api/comments/search?author=spez&body=community&limit=25"

# Get comments with HTML-rendered markdown
curl "https://arctic-shift.photon-reddit.com/api/comments/search?subreddit=programming&after=2024-01-01&md2html=true&limit=10"

# Response format:
# {
#   "data": [
#     {
#       "id": "xyz789",
#       "author": "username",
#       "body": "Comment text here",
#       "link_id": "t3_abc123",
#       "parent_id": "t1_def456",
#       "subreddit": "askreddit",
#       "score": 42,
#       "created_utc": 1577836800,
#       ...
#     }
#   ]
# }
```

### Get Comment Tree

Retrieve comments in a hierarchical tree structure, similar to Reddit's display format.

```bash
# Get comment tree for a post with a specific parent comment as root
curl "https://arctic-shift.photon-reddit.com/api/comments/tree?link_id=t3_7cff0b&parent_id=t1_dppum98&md2html=true"

# Get ALL comments under a post (use high limit)
curl "https://arctic-shift.photon-reddit.com/api/comments/tree?link_id=t3_x8i09x&limit=9999"

# Control comment collapsing depth and breadth
curl "https://arctic-shift.photon-reddit.com/api/comments/tree?link_id=t3_abc123&start_depth=6&start_breadth=5&limit=500"

# Response format (tree structure):
# {
#   "data": [
#     {
#       "kind": "t1",
#       "data": {
#         "id": "comment_id",
#         "body": "Comment text",
#         "author": "username",
#         "replies": {
#           "data": {
#             "children": [...]  // nested comments
#           }
#         }
#       }
#     },
#     {
#       "kind": "more",
#       "data": {
#         "children": ["id1", "id2", ...]  // collapsed comment IDs
#       }
#     }
#   ]
# }
```

### Aggregate Posts/Comments

Generate aggregate statistics by date, author, or subreddit.
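Aggregate responses are flat `key`/`doc_count` pairs, which makes post-processing straightforward. A sketch that ranks such buckets by frequency (the response data here is invented for illustration):

```python
# Sketch: rank aggregate buckets by doc_count. The response data is invented
# for illustration; real responses have the same {"key", "doc_count"} shape.
def top_buckets(response: dict, n: int = 3) -> list:
    """Return the n most frequent (key, doc_count) pairs from an aggregate response."""
    buckets = [(b["key"], b["doc_count"]) for b in response["data"]]
    return sorted(buckets, key=lambda kv: kv[1], reverse=True)[:n]

response = {"data": [
    {"key": "2022", "doc_count": 2103},
    {"key": "2023", "doc_count": 1542},
    {"key": "2021", "doc_count": 2980},
]}
print(top_buckets(response, 2))  # [('2021', 2980), ('2022', 2103)]
```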
```bash
# Comment frequency of u/spez by year since 2006
curl "https://arctic-shift.photon-reddit.com/api/comments/search/aggregate?aggregate=created_utc&frequency=year&author=spez&after=2006-01-01"

# Most active posters in r/announcements
curl "https://arctic-shift.photon-reddit.com/api/posts/search/aggregate?aggregate=author&subreddit=announcements"

# Top subreddits by user activity
curl "https://arctic-shift.photon-reddit.com/api/posts/search/aggregate?aggregate=subreddit&author=gallowboob&limit=20"

# Monthly post distribution within a date range
curl "https://arctic-shift.photon-reddit.com/api/posts/search/aggregate?aggregate=created_utc&frequency=month&subreddit=technology&after=2020-01-01&before=2024-01-01"

# Response format:
# {
#   "data": [
#     {"key": "2023", "doc_count": 1542},
#     {"key": "2022", "doc_count": 2103},
#     ...
#   ]
# }
```

### Search Subreddits

Search for subreddits by name prefix, subscriber count, and creation date.

```bash
# Search for subreddits starting with "ask" sorted by subscribers
curl "https://arctic-shift.photon-reddit.com/api/subreddits/search?subreddit_prefix=ask"

# Find oldest subreddits with more than 1000 subscribers
curl "https://arctic-shift.photon-reddit.com/api/subreddits/search?min_subscribers=1000&sort_type=created_utc&sort=asc&limit=50"

# Filter NSFW subreddits
curl "https://arctic-shift.photon-reddit.com/api/subreddits/search?over18=true&min_subscribers=10000&limit=25"

# Response format:
# {
#   "data": [
#     {
#       "display_name": "AskReddit",
#       "subscribers": 45000000,
#       "created_utc": 1201233135,
#       "description": "Subreddit description...",
#       "over18": false,
#       ...
#     }
#   ]
# }
```

### Get Subreddit Rules

Retrieve the rules defined for one or more subreddits.
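Because the rules response is keyed by subreddit name, a small helper can flatten it for display. A sketch against the documented response shape (the sample data below is invented):

```python
# Sketch: flatten a subreddit rules response into "subreddit: rule" strings.
# Sample data is invented; field names follow the documented response shape.
def list_rules(response: dict) -> list:
    lines = []
    for subreddit, rules in response["data"].items():
        for rule in rules:
            lines.append(f"{subreddit}: {rule['short_name']}")
    return lines

response = {"data": {"askreddit": [
    {"short_name": "Rule 1", "description": "...", "kind": "all"},
    {"short_name": "Rule 2", "description": "...", "kind": "comment"},
]}}
print(list_rules(response))  # ['askreddit: Rule 1', 'askreddit: Rule 2']
```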
```bash
# Get rules for multiple subreddits
curl "https://arctic-shift.photon-reddit.com/api/subreddits/rules?subreddits=askreddit,politics,science"

# Response format:
# {
#   "data": {
#     "askreddit": [
#       {
#         "short_name": "Rule 1",
#         "description": "Full rule description...",
#         "kind": "all"
#       },
#       ...
#     ]
#   }
# }
```

### Get Subreddit Wikis

Retrieve wiki pages from subreddits.

```bash
# Get all wiki pages from a subreddit
curl "https://arctic-shift.photon-reddit.com/api/subreddits/wikis?subreddit=askreddit&limit=50"

# Get specific wiki pages by path
curl "https://arctic-shift.photon-reddit.com/api/subreddits/wikis?paths=/r/reddit.com/wiki/faq,/r/travel/wiki/faq"

# List all wiki page paths in a subreddit
curl "https://arctic-shift.photon-reddit.com/api/subreddits/wikis/list?subreddit=askreddit"

# Response format:
# {
#   "data": [
#     {
#       "path": "/r/askreddit/wiki/index",
#       "content": "Wiki page content in markdown...",
#       "revision_date": 1577836800
#     }
#   ]
# }
```

### Search Users

Search for users by name prefix, activity metrics, and karma.

```bash
# Search for users with the most karma
curl "https://arctic-shift.photon-reddit.com/api/users/search?sort_type=total_karma&limit=25"

# Search for users starting with "mod" who have at least 1000 comments
curl "https://arctic-shift.photon-reddit.com/api/users/search?author_prefix=mod&min_num_comments=1000&sort_type=author&sort=asc"

# Response format:
# {
#   "data": [
#     {
#       "author": "username",
#       "total_karma": 5000000,
#       "num_posts": 1234,
#       "num_comments": 56789,
#       "first_post_utc": 1234567890,
#       "last_comment_utc": 1677654321
#     }
#   ]
# }
```

### User Interactions

Analyze interactions between users or user activity across subreddits.
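Interaction counts pair naturally with graph tooling: each `(author, count)` entry can be read as a weighted edge from the queried user. A sketch converting a response into edge tuples (sample data invented; the tuples suit e.g. networkx's `add_weighted_edges_from`):

```python
# Sketch: turn a user-interaction response into weighted graph edges.
# Sample data is invented; real responses share the {"author", "count"} shape.
def interaction_edges(author: str, response: dict) -> list:
    """Return (source, target, weight) tuples for graph libraries."""
    return [(author, row["author"], row["count"]) for row in response["data"]]

response = {"data": [
    {"author": "other_user", "count": 45},
    {"author": "another_user", "count": 32},
]}
print(interaction_edges("spez", response))
```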
```bash
# Get user-to-user interactions for u/spez before 2017 with min 10 interactions
curl "https://arctic-shift.photon-reddit.com/api/users/interactions/users?author=spez&before=2017-01-01&min_count=10"

# List individual interactions
curl "https://arctic-shift.photon-reddit.com/api/users/interactions/users/list?author=spez&subreddit=announcements&limit=50"

# Get user activity across subreddits with custom weighting
curl "https://arctic-shift.photon-reddit.com/api/users/interactions/subreddits?author=gallowboob&weight_posts=2.0&weight_comments=1.0&limit=20"

# Response format for interactions:
# {
#   "data": [
#     {"author": "other_user", "count": 45},
#     {"author": "another_user", "count": 32},
#     ...
#   ]
# }
```

### Aggregate User Flairs

Get all author flairs used by a user, grouped by subreddit.

```bash
curl "https://arctic-shift.photon-reddit.com/api/users/aggregate_flairs?author=spez"

# Response format:
# {
#   "data": {
#     "announcements": ["Admin", "CEO"],
#     "reddit": ["A"],
#     ...
#   }
# }
```

### Resolve Short Links

Convert Reddit short links (`r/subreddit/s/xxx` format) to full URLs.

```bash
curl "https://arctic-shift.photon-reddit.com/api/short_links?paths=/r/running/s/3TzXiyxaMD,/u/CEO_Gola/s/WO7Ro11h1a"

# Response format:
# {
#   "data": {
#     "/r/running/s/3TzXiyxaMD": "https://www.reddit.com/r/running/comments/...",
#     "/u/CEO_Gola/s/WO7Ro11h1a": "https://www.reddit.com/user/..."
#   }
# }
```

### Time Series Data

Retrieve aggregated metrics over time for global Reddit activity or specific subreddits.
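Time-series values lend themselves to derived metrics such as period-over-period change. A sketch computing percent change between consecutive data points (the sample values are invented; only the `{"key", "value"}` shape comes from the response format):

```python
# Sketch: percent change between consecutive time-series points (values invented).
def percent_changes(points: list) -> list:
    """Return (key, percent change vs. previous point) for each point after the first."""
    out = []
    for prev, cur in zip(points, points[1:]):
        change = (cur["value"] - prev["value"]) / prev["value"] * 100
        out.append((cur["key"], round(change, 2)))
    return out

points = [
    {"key": "2023-01", "value": 1000},
    {"key": "2023-02", "value": 1100},
    {"key": "2023-03", "value": 990},
]
print(percent_changes(points))  # [('2023-02', 10.0), ('2023-03', -10.0)]
```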
```bash
# Get global post count per year
curl "https://arctic-shift.photon-reddit.com/api/time_series?key=global/posts/count&precision=year"

# Get r/askreddit subscriber growth per month
curl "https://arctic-shift.photon-reddit.com/api/time_series?key=r/askreddit/subscribers&precision=month&after=2020-01-01"

# Get comment activity in a subreddit by week
curl "https://arctic-shift.photon-reddit.com/api/time_series?key=r/programming/comments/count&precision=week&after=2023-01-01&before=2024-01-01"

# Available keys (<subreddit> is a placeholder for a subreddit name):
# - global/posts/count, global/comments/count
# - global/posts/sum_score, global/comments/sum_score
# - r/<subreddit>/posts/count, r/<subreddit>/comments/count
# - r/<subreddit>/posts/sum_score, r/<subreddit>/comments/sum_score
# - r/<subreddit>/subscribers

# Response format:
# {
#   "data": [
#     {"key": "2023-01", "value": 15234567},
#     {"key": "2023-02", "value": 14567890},
#     ...
#   ]
# }
```

## Python Scripts for Processing Data Dumps

### Process Compressed Reddit Data Files

The Python scripts enable processing of downloaded data dumps in various compressed formats.
```python
# scripts/processFiles.py - Main processing script
import os

from fileStreams import getFileJsonStream
from utils import FileProgressLog

# Set the path to your downloaded data file or folder
fileOrFolderPath = r"/path/to/reddit_data/RC_2024-01.zst"
recursive = False  # Set True to process subfolders

def processFile(path: str):
    """Process a single Reddit data file"""
    print(f"Processing file {path}")
    with open(path, "rb") as f:
        jsonStream = getFileJsonStream(path, f)
        if jsonStream is None:
            print(f"Skipping unknown file {path}")
            return
        progressLog = FileProgressLog(path, f)

        # Track statistics
        comment_count = 0
        unique_authors = set()

        for row in jsonStream:
            progressLog.onRow()

            # Access common fields
            author = row["author"]
            subreddit = row["subreddit"]
            post_id = row["id"]
            created = row["created_utc"]
            score = row["score"]

            # For comments files (RC_*.zst)
            if "body" in row:
                body = row["body"]
                parent_id = row["parent_id"]  # t3_xxx (post) or t1_xxx (comment)
                link_id = row["link_id"]      # t3_xxx (post ID)

            # For posts/submissions files (RS_*.zst)
            if "title" in row:
                title = row["title"]
                selftext = row.get("selftext", "")
                url = row.get("url", "")
                num_comments = row.get("num_comments", 0)

            # Example statistics: total records and unique authors
            comment_count += 1
            unique_authors.add(author)

        progressLog.logProgress("\n")
        print(f"Total records: {comment_count:,}")
        print(f"Unique authors: {len(unique_authors):,}")

# Run processing
if os.path.isdir(fileOrFolderPath):
    if recursive:
        # Walk subfolders when recursive processing is enabled
        for root, _dirs, files in os.walk(fileOrFolderPath):
            for file in files:
                processFile(os.path.join(root, file))
    else:
        for file in os.listdir(fileOrFolderPath):
            processFile(os.path.join(fileOrFolderPath, file))
else:
    processFile(fileOrFolderPath)
```

### File Stream Utilities

Utilities for reading different compressed file formats.
```python
# scripts/fileStreams.py - Streaming JSON from compressed files
from typing import BinaryIO, Iterator

import zstandard

try:
    import orjson as json  # Faster JSON parsing (recommended)
except ImportError:
    import json

def getFileJsonStream(path: str, f: BinaryIO) -> Iterator[dict] | None:
    """
    Get appropriate JSON stream based on file extension.
    Supports: .zst, .zst_blocks, .jsonl, .ndjson, .json
    """
    if path.endswith(".jsonl") or path.endswith(".ndjson"):
        return getJsonLinesFileJsonStream(f)
    elif path.endswith(".zst"):
        return getZstFileJsonStream(f)
    elif path.endswith(".zst_blocks"):
        return getZstBlocksFileJsonStream(f)
    elif path.endswith(".json"):
        return getJsonFileStream(f)
    return None

def getZstFileJsonStream(f: BinaryIO, chunk_size=1024*1024*10) -> Iterator[dict]:
    """Stream JSON objects from a zstandard compressed file"""
    decompressor = zstandard.ZstdDecompressor(max_window_size=2**31)
    # Buffer raw bytes and split on newlines before parsing, so a multi-byte
    # UTF-8 character can never be cut in half at a chunk boundary
    buffer = b""
    zstReader = decompressor.stream_reader(f)
    while True:
        chunk = zstReader.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        lines = buffer.split(b"\n")
        buffer = lines[-1]
        for line in lines[:-1]:
            if line:
                yield json.loads(line)
    if buffer:
        yield json.loads(buffer)

# Example: Extract all comments from a specific subreddit
def extract_subreddit_comments(zst_path: str, target_subreddit: str, output_path: str):
    """Extract all comments from a specific subreddit to a new file"""
    import json as std_json
    with open(zst_path, "rb") as f, open(output_path, "w") as out:
        for record in getZstFileJsonStream(f):
            if record.get("subreddit", "").lower() == target_subreddit.lower():
                out.write(std_json.dumps(record) + "\n")

# Usage:
# extract_subreddit_comments("RC_2024-01.zst", "programming", "programming_comments.jsonl")
```

### Progress Tracking Utility

Track processing progress for large files with time estimates.
```python
# scripts/utils.py - Progress logging utility
import os
import time
from typing import BinaryIO

class FileProgressLog:
    """Track and display progress when processing large files"""

    def __init__(self, path: str, file: BinaryIO):
        self.file = file
        self.fileSize = os.path.getsize(path)
        self.i = 0
        self.startTime = time.time()
        self.printEvery = 10_000

    def onRow(self):
        """Call this for each processed row"""
        self.i += 1
        if self.i % self.printEvery == 0:
            self.logProgress()

    def logProgress(self, end=""):
        """Print current progress with time estimates"""
        progress = self.file.tell() / self.fileSize if not self.file.closed else 1
        elapsed = time.time() - self.startTime
        remaining = (elapsed / progress - elapsed) if progress > 0 else 0
        print(f"\r{self.i:,} rows - {progress:.2%} - "
              f"elapsed: {elapsed:.0f}s - remaining: {remaining:.0f}s", end=end)

# Example output during processing:
# 1,230,000 rows - 45.32% - elapsed: 120s - remaining: 145s
```

## Summary

Arctic Shift serves as the primary successor to Pushshift for Reddit data archival, catering to researchers studying social media trends, moderators needing historical data for community management, and developers building Reddit analytics tools. The API provides comprehensive search, aggregation, and tree-building capabilities that mirror Reddit's own data structures, while the downloadable dumps enable large-scale offline analysis. Common use cases include tracking discourse evolution over time, analyzing user behavior patterns, building recommendation systems, and conducting academic research on online communities.

For integration, the API follows RESTful conventions with consistent parameter naming across endpoints, making it straightforward to build client libraries. Rate limiting is permissive for typical usage, but heavy analysis should use the monthly data dumps instead.
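Even with permissive rate limits, a polite client backs off when requests fail. A sketch of an exponential backoff schedule (this retry policy is a generic pattern, not an Arctic Shift requirement):

```python
# Sketch: exponential backoff schedule for API retries. This is a generic
# client-side pattern, not a policy specified by Arctic Shift.
def backoff_delays(retries: int, base: float = 1.0, cap: float = 60.0) -> list:
    """Delays of base * 2^i seconds per failed attempt, capped at `cap`."""
    return [min(base * (2 ** i), cap) for i in range(retries)]

print(backoff_delays(5))  # [1.0, 2.0, 4.0, 8.0, 16.0]
# A caller would time.sleep(delay) between failed requests.
```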
The Python processing scripts provide a foundation for building custom analysis pipelines, supporting streaming decompression to handle files that exceed available RAM. Data is available from June 2005 through the present, with new monthly archives typically released within a few days of each month's end via Academic Torrents.
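Because the streams yield plain dicts, aggregation over a dump reduces to iterating once and tallying. A sketch that counts records per subreddit from any iterable of rows (fed synthetic rows here; in practice the iterable would come from `getFileJsonStream` over a `.zst` dump):

```python
# Sketch: per-subreddit tally over a stream of records. The rows below are
# synthetic; in practice the iterable would come from getFileJsonStream.
from collections import Counter
from typing import Iterable

def count_by_subreddit(rows: Iterable) -> Counter:
    counts = Counter()
    for row in rows:
        counts[row.get("subreddit", "[unknown]")] += 1
    return counts

rows = [{"subreddit": "programming"}, {"subreddit": "python"}, {"subreddit": "programming"}]
print(count_by_subreddit(rows))  # Counter({'programming': 2, 'python': 1})
```

Because the tally only ever holds one row in memory at a time, this scales to dumps far larger than RAM.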