# gdeltnews

Repository: https://github.com/iandreafc/gdeltnews

gdeltnews is a Python package for reconstructing full-text news articles from the GDELT Web News NGrams 3.0 dataset. The GDELT project provides minute-by-minute n-gram data from web news sources worldwide, and this package enables researchers to reconstruct complete article text from overlapping fragments, making large-scale news analysis accessible without proprietary data subscriptions.

The package provides a three-step workflow: downloading GDELT Web NGrams files for any time range, reconstructing article text from overlapping n-gram fragments using parallel processing, and filtering/merging results using Boolean queries. It supports language and URL-based filtering, automatic deduplication (keeping the longest text per URL), and produces clean CSV outputs suitable for text analysis and NLP research.

## Installation

```bash
pip install gdeltnews
```

## download - Download GDELT Web NGrams files

Downloads GDELT Web NGrams minute files for a specified time range. The function fetches compressed `.json.gz` files from the GDELT project servers, iterating minute-by-minute through the given range. Files that don't exist on the server (not all minutes have data) are automatically skipped. Supports multiple timestamp formats including ISO 8601, space-separated datetime, and compact 14-digit format.

```python
from gdeltnews import download, DownloadStats

# Download 4 hours of GDELT data
stats: DownloadStats = download(
    start="2025-11-25T10:00:00",  # Start timestamp (datetime or string)
    end="2025-11-25T13:59:00",    # End timestamp (inclusive)
    outdir="gdeltdata",           # Output directory for downloaded files
    overwrite=False,              # Skip existing files (default: False)
    decompress=True,              # Also write decompressed .json files (default: True)
    timeout=30,                   # HTTP request timeout in seconds (default: 30)
    show_progress=True,           # Show tqdm progress bar (default: True)
)

# Check download statistics
print(f"Requested: {stats.requested_minutes} minute slots")
print(f"Downloaded: {stats.downloaded_gz} .gz files")
print(f"Decompressed: {stats.decompressed_json} .json files")

# Output:
# Time range from 2025-11-25 10:00:00 to 2025-11-25 13:59:00 covers 240 minute slots.
# Target directory for downloads: gdeltdata
# Downloaded 187 .gz files into gdeltdata.
# Decompressed 187 files to .json in gdeltdata.
```

## parse_timestamp - Parse timestamp strings

Parses timestamp strings into datetime objects. Accepts multiple formats commonly used with GDELT data. This utility function is useful when working with GDELT filenames or programmatically building time ranges.

```python
from gdeltnews import parse_timestamp
import datetime

# Parse ISO 8601 format
dt1 = parse_timestamp("2025-03-16T00:01:00")
print(dt1)  # 2025-03-16 00:01:00

# Parse with trailing Z (UTC indicator)
dt2 = parse_timestamp("2025-03-16T00:01:00Z")
print(dt2)  # 2025-03-16 00:01:00

# Parse space-separated format
dt3 = parse_timestamp("2025-03-16 00:01:00")
print(dt3)  # 2025-03-16 00:01:00

# Parse compact 14-digit format (GDELT filename format)
dt4 = parse_timestamp("20250316000100")
print(dt4)  # 2025-03-16 00:01:00

# Use parsed timestamps with download
from gdeltnews import download
start_dt = parse_timestamp("20250315000000")
end_dt = start_dt + datetime.timedelta(hours=2)
download(start_dt, end_dt, outdir="gdeltdata")
```
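Each minute slot corresponds to one compact-format (14-digit) filename on the GDELT servers, which is why `parse_timestamp` accepts that format directly. The sketch below illustrates how a time range expands into per-minute filenames; the base URL here is an assumption for illustration only, and the package resolves the actual file locations itself.

```python
import datetime

# Minimal sketch: enumerate the minute slots a download covers.
# ASSUMPTION: BASE_URL is illustrative, not taken from the package.
BASE_URL = "http://data.gdeltproject.org/gdeltv3/webngrams"

def minute_files(start: datetime.datetime, end: datetime.datetime):
    """Yield one compact-format (14-digit) filename per minute slot, end inclusive."""
    current = start
    while current <= end:
        stamp = current.strftime("%Y%m%d%H%M%S")
        yield f"{BASE_URL}/{stamp}.webngrams.json.gz"
        current += datetime.timedelta(minutes=1)

urls = list(minute_files(
    datetime.datetime(2025, 11, 25, 10, 0),
    datetime.datetime(2025, 11, 25, 13, 59),
))
print(len(urls))                    # 240 minute slots, matching the download() example
print(urls[0].rsplit("/", 1)[-1])   # 20251125100000.webngrams.json.gz
```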
## reconstruct - Bulk article reconstruction

Reconstructs article text from GDELT Web NGrams files using parallel processing. Processes all `*.webngrams.json.gz` files in a directory, reconstructing full article text from overlapping n-gram fragments. Must be run from a `.py` script (not Jupyter) due to multiprocessing requirements. Produces one CSV file per input file with the columns `Text|Date|URL|Source`.

```python
from multiprocessing import freeze_support
from gdeltnews import reconstruct

def main():
    reconstruct(
        input_dir="gdeltdata",           # Directory with .webngrams.json.gz files
        output_dir="gdeltpreprocessed",  # Output directory for CSV files
        language="it",                   # Language filter (None = all languages)
        url_filters=[                    # URL substrings to keep (any match keeps URL)
            "repubblica.it",
            "corriere.it",
        ],
        processes=10,                    # Worker processes (None = all CPU cores)
        delete_gz=False,                 # Delete original .gz after processing (default: False)
        delete_json=True,                # Delete temp .json after processing (default: True)
        delete_empty_csv=True,           # Delete CSVs with no articles (default: True)
        show_progress=True,              # Show progress bar (default: True)
    )

if __name__ == "__main__":
    freeze_support()  # Required on Windows
    main()

# Output:
# Found 187 *.webngrams.json.gz files in gdeltdata
# Output CSV files will be written to gdeltpreprocessed
# Processing 20251125100200.webngrams.json.gz
# Loading and filtering data from gdeltdata/20251125100200.webngrams.json...
# Reconstructing 45 articles using 10 processes...
# Wrote 45 articles to gdeltpreprocessed/20251125100200.webngrams.articles.csv
```

## filtermerge - Filter and merge reconstructed CSVs

Filters, deduplicates, and merges multiple CSV files into a single output file. Applies Boolean queries with AND, OR, and NOT operators (case-insensitive substring matching), deduplicates by URL (keeping the row with the longest text), and combines all filtered results into one CSV.

```python
from gdeltnews import filtermerge

# Filter with a complex Boolean query
filtermerge(
    input_dir="gdeltpreprocessed",           # Directory with CSV files from reconstruct
    output_file="final_filtered_dedup.csv",  # Output CSV path
    query='((elezioni OR voto) AND (regionali OR campania)) OR ((fico OR cirielli) AND NOT veneto)',
    keep_temp=False,  # Keep intermediate .tmp file (default: False)
    verbose=True,     # Print progress messages (default: True)
)

# Output:
# Filtering CSV files in gdeltpreprocessed into temporary file final_filtered_dedup.csv.tmp.
# Deduplicating by URL and writing final output to final_filtered_dedup.csv.

# Query syntax examples:
# Simple term:     query='elections'
# AND query:       query='biden AND trump'
# OR query:        query='inflation OR recession'
# NOT query:       query='economy AND NOT stock'
# Phrase (quoted): query='"giorgia meloni" AND italia'
# Complex:         query='(("climate change" OR global warming) AND policy) AND NOT denial'

# Without query (merge and deduplicate only)
filtermerge(
    input_dir="gdeltpreprocessed",
    output_file="all_articles_dedup.csv",
    query=None,  # No filtering, just merge and deduplicate
)
```
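The deduplication rule deserves emphasis: when several input CSVs contain the same URL, only the row with the longest reconstructed text survives. Here is a minimal pandas sketch of that documented rule (an illustration only, not the package's implementation):

```python
import pandas as pd

# Illustration of the documented dedup rule: for each URL, keep the row
# whose Text is longest. Columns follow the Text|Date|URL|Source format.
df = pd.DataFrame({
    "Text": ["short version", "a much longer reconstructed version"],
    "Date": ["2025-11-25", "2025-11-25"],
    "URL": ["https://example.com/a", "https://example.com/a"],
    "Source": ["example.com", "example.com"],
})

deduped = (
    df.assign(length=df["Text"].str.len())
      .sort_values("length")                       # shortest first
      .drop_duplicates(subset="URL", keep="last")  # keep the longest per URL
      .drop(columns="length")
)
print(deduped["Text"].tolist())  # ['a much longer reconstructed version']
```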
## build_query_expr - Parse Boolean queries

Parses and validates Boolean query strings for use with filtering. Returns a parsed expression and phrase mapping that can be reused for programmatic filtering. Useful for validating queries before processing large datasets.

```python
from gdeltnews import build_query_expr

# Parse a query string
expr, phrases = build_query_expr(
    '("climate change" OR warming) AND policy AND NOT denial'
)
print(f"Parsed expression: {expr}")
print(f"Phrase mappings: {phrases}")
# Parsed expression: ((PHRASE_0|warming)&policy&~denial)
# Phrase mappings: {'PHRASE_0': 'climate change'}

# Validate query before processing
try:
    expr, phrases = build_query_expr('invalid ((( query')
except ValueError as e:
    print(f"Invalid query: {e}")
# Invalid query: Invalid Boolean query: ...

# Empty/None query returns None (matches everything)
expr, phrases = build_query_expr(None)
print(expr, phrases)
# None {}
```

## reconstruct_webngrams_file - Process single file

Reconstructs articles from a single decompressed `.webngrams.json` file. This is the lower-level function used by `reconstruct()` for processing individual files. Useful for advanced users who want fine-grained control or have pre-downloaded/pre-filtered data from Google BigQuery.

```python
from multiprocessing import freeze_support
from gdeltnews import reconstruct_webngrams_file

def main():
    # Process a single pre-decompressed JSON file
    reconstruct_webngrams_file(
        input_file="gdeltdata/20251125103200.webngrams.json",
        output_file="output/articles_103200.csv",
        language="en",  # Language code (None = all languages)
        url_filters=["nytimes.com", "washingtonpost.com"],  # URL filters
        processes=None,  # Use all CPU cores
    )

if __name__ == "__main__":
    freeze_support()
    main()

# Output:
# Loading and filtering data from gdeltdata/20251125103200.webngrams.json...
# Reconstructing 128 articles using 8 processes...
# Wrote 128 articles to output/articles_103200.csv

# Output CSV format (pipe-delimited):
# Text|Date|URL|Source
# Full reconstructed article text...|2025-11-25|https://www.nytimes.com/...|nytimes.com
```

## load_and_filter_data - Load and filter raw data

Loads raw GDELT JSON data and applies language and URL filters. Returns transformed data grouped by URL and a list preserving original URL order. Useful for custom processing pipelines or data inspection.

```python
from gdeltnews.wordmatch import load_and_filter_data

# Load and filter a GDELT JSON file
articles, url_order = load_and_filter_data(
    input_file="gdeltdata/20251125103200.webngrams.json",
    language_filter="en",               # Keep only English (None = all languages)
    url_filter=["bbc.com", "cnn.com"],  # URL substrings to keep
)

print(f"Found {len(articles)} unique URLs")
print(f"First URL: {url_order[0]}")
print(f"Entries for first URL: {len(articles[url_order[0]])}")

# Inspect entry structure
first_url = url_order[0]
first_entry = articles[first_url][0]
print(f"Entry keys: {first_entry.keys()}")
# Entry keys: dict_keys(['sentence', 'pos', 'date', 'lang', 'type'])

# Process without language filter
articles_all, _ = load_and_filter_data(
    input_file="gdeltdata/20251125103200.webngrams.json",
    language_filter=None,  # Keep all languages
    url_filter=None,       # Keep all URLs
)
```

## reconstruct_sentence - Merge overlapping fragments

Reconstructs text from overlapping sentence fragments using greedy word-overlap merging. This is the core algorithm that joins n-gram fragments into coherent text. When positions are provided, it respects the original fragment order to avoid incorrect reorderings.

```python
from gdeltnews.wordmatch import reconstruct_sentence

# Overlapping fragments from GDELT n-grams
fragments = [
    "The president announced new",
    "announced new economic policies",
    "new economic policies today at",
    "policies today at the White House",
]
positions = [0, 5, 10, 15]  # Original positions from GDELT

# Reconstruct the full sentence
text = reconstruct_sentence(fragments, positions)
print(text)
# Output: The president announced new economic policies today at the White House

# Without positions (pure overlap-based merging)
text_no_pos = reconstruct_sentence(fragments, positions=None)
print(text_no_pos)
# Output: The president announced new economic policies today at the White House
```
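To make the merging step concrete, here is a simplified sketch of greedy word-overlap merging: extend the running text by each fragment after dropping the longest run of words shared between the text's end and the fragment's start. This illustrates the general technique only; it is not the package's actual implementation.

```python
def greedy_overlap_merge(fragments):
    """Simplified illustration of greedy word-overlap merging.

    For each fragment, find the longest suffix of the accumulated words
    that equals a prefix of the fragment, then append only the rest.
    Not the package's implementation, just the underlying idea.
    """
    words = fragments[0].split()
    for frag in fragments[1:]:
        frag_words = frag.split()
        best = 0
        # Try the largest possible overlap first, then shrink.
        for k in range(min(len(words), len(frag_words)), 0, -1):
            if words[-k:] == frag_words[:k]:
                best = k
                break
        words.extend(frag_words[best:])
    return " ".join(words)

fragments = [
    "The president announced new",
    "announced new economic policies",
    "new economic policies today at",
    "policies today at the White House",
]
print(greedy_overlap_merge(fragments))
# The president announced new economic policies today at the White House
```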
## process_article - Process single article

Processes entries for a single URL into reconstructed article text. Sorts entries by position, reconstructs the text, removes overlaps, and derives metadata. Designed for parallel execution via multiprocessing.

```python
from gdeltnews.wordmatch import process_article

# Article entries from load_and_filter_data
url = "https://www.example.com/news/article123"
entries = [
    {"sentence": "Breaking news today", "pos": 0, "date": "2025-11-25", "lang": "en", "type": ""},
    {"sentence": "news today from the", "pos": 3, "date": "2025-11-25", "lang": "en", "type": ""},
    {"sentence": "today from the capital", "pos": 6, "date": "2025-11-25", "lang": "en", "type": ""},
]

# Process the article
result = process_article(
    url_entries_tuple=(url, entries),
    url_filters=["example.com"],
)
print(result)
# Output:
# {
#     'url': 'https://www.example.com/news/article123',
#     'text': 'Breaking news today from the capital',
#     'date': '2025-11-25',
#     'source': 'example.com'
# }
```

## Complete Workflow Example

A full end-to-end example showing the typical three-step workflow for collecting and analyzing news articles from GDELT.

```python
#!/usr/bin/env python3
"""Complete GDELT news reconstruction workflow."""

from multiprocessing import freeze_support
from gdeltnews import download, reconstruct, filtermerge

def main():
    # Step 1: Download GDELT data for a time range
    print("=== Step 1: Downloading GDELT data ===")
    stats = download(
        start="2025-11-25T10:00:00",
        end="2025-11-25T13:59:00",
        outdir="gdeltdata",
        decompress=False,  # reconstruct() handles decompression
    )
    print(f"Downloaded {stats.downloaded_gz} files\n")

    # Step 2: Reconstruct articles with filtering
    print("=== Step 2: Reconstructing articles ===")
    reconstruct(
        input_dir="gdeltdata",
        output_dir="gdeltpreprocessed",
        language="it",  # Italian articles only
        url_filters=["repubblica.it", "corriere.it"],
        processes=10,
    )
    print("Reconstruction complete\n")

    # Step 3: Filter, deduplicate, and merge
    print("=== Step 3: Filtering and merging ===")
    filtermerge(
        input_dir="gdeltpreprocessed",
        output_file="final_results.csv",
        query='((elezioni OR voto) AND (regionali OR campania)) OR ((fico OR cirielli) AND NOT veneto)',
    )
    print("Pipeline complete! Results in final_results.csv")

if __name__ == "__main__":
    freeze_support()  # Required for Windows multiprocessing
    main()

# Output CSV format (final_results.csv):
# Text|Date|URL|Source
# Reconstructed article about elections...|2025-11-25|https://www.repubblica.it/...|repubblica.it
# Another article about regional voting...|2025-11-25|https://www.corriere.it/...|corriere.it
```
## Summary

gdeltnews provides a complete solution for researchers and data scientists who need to analyze global news content from the GDELT Web News NGrams 3.0 dataset. The three main functions, `download()`, `reconstruct()`, and `filtermerge()`, form a pipeline that handles the entire workflow from data acquisition to analysis-ready output. The package is particularly valuable for studying news coverage across languages, tracking global events, and conducting large-scale media analysis without expensive data subscriptions.

Integration with existing data pipelines is straightforward: the pipe-delimited CSV output works well with pandas, the Boolean query syntax supports complex filtering needs, and the multiprocessing architecture enables efficient processing of large time ranges. Advanced users can leverage lower-level functions such as `reconstruct_webngrams_file()` and `load_and_filter_data()` for custom workflows, or integrate with Google BigQuery to pre-filter GDELT data before reconstruction.
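Since the final output is pipe-delimited, loading it into pandas takes one line. A minimal sketch, assuming pandas is installed and using the `final_results.csv` produced by the workflow example above:

```python
import pandas as pd

# Load the pipe-delimited CSV produced by filtermerge
# (columns: Text, Date, URL, Source).
df = pd.read_csv("final_results.csv", sep="|")

print(df.columns.tolist())          # ['Text', 'Date', 'URL', 'Source']
print(len(df), "unique articles")   # one row per URL after deduplication
print(df["Source"].value_counts())  # article counts per news outlet
```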