# gdeltnews

Repository: https://github.com/iandreafc/gdeltnews

gdeltnews is a Python package for reconstructing full-text news articles from the GDELT Web News NGrams 3.0 dataset. The GDELT project provides minute-by-minute n-gram data from web news sources worldwide, and this package enables researchers to reconstruct complete article text from overlapping fragments, making large-scale news analysis accessible without proprietary data subscriptions.

The package provides a three-step workflow: downloading GDELT Web NGrams files for any time range, reconstructing article text from overlapping n-gram fragments using parallel processing, and filtering/merging results using Boolean queries. It supports language and URL-based filtering, automatic deduplication (keeping the longest text per URL), and produces clean CSV outputs suitable for text analysis and NLP research.

## Installation

```bash
pip install gdeltnews
```

## download - Download GDELT Web NGrams files

Downloads GDELT Web NGrams minute files for a specified time range. The function fetches compressed `.json.gz` files from the GDELT project servers, iterating minute-by-minute through the given range. Files that don't exist on the server (not all minutes have data) are automatically skipped. Supports multiple timestamp formats including ISO 8601, space-separated datetime, and compact 14-digit format.

```python
from gdeltnews import download, DownloadStats

# Download 4 hours of GDELT data
stats: DownloadStats = download(
    start="2025-11-25T10:00:00",  # Start timestamp (datetime or string)
    end="2025-11-25T13:59:00",    # End timestamp (inclusive)
    outdir="gdeltdata",           # Output directory for downloaded files
    overwrite=False,              # Skip existing files (default: False)
    decompress=True,              # Also write decompressed .json files (default: True)
    timeout=30,                   # HTTP request timeout in seconds (default: 30)
    show_progress=True,           # Show tqdm progress bar (default: True)
)

# Check download statistics
print(f"Requested: {stats.requested_minutes} minute slots")
print(f"Downloaded: {stats.downloaded_gz} .gz files")
print(f"Decompressed: {stats.decompressed_json} .json files")

# Output:
# Time range from 2025-11-25 10:00:00 to 2025-11-25 13:59:00 covers 240 minute slots.
# Target directory for downloads: gdeltdata
# Downloaded 187 .gz files into gdeltdata.
# Decompressed 187 files to .json in gdeltdata.
```

## parse_timestamp - Parse timestamp strings

Parses timestamp strings into datetime objects. Accepts multiple formats commonly used with GDELT data. This utility function is useful when working with GDELT filenames or programmatically building time ranges.

```python
from gdeltnews import parse_timestamp
import datetime

# Parse ISO 8601 format
dt1 = parse_timestamp("2025-03-16T00:01:00")
print(dt1)  # 2025-03-16 00:01:00

# Parse with trailing Z (UTC indicator)
dt2 = parse_timestamp("2025-03-16T00:01:00Z")
print(dt2)  # 2025-03-16 00:01:00

# Parse space-separated format
dt3 = parse_timestamp("2025-03-16 00:01:00")
print(dt3)  # 2025-03-16 00:01:00

# Parse compact 14-digit format (GDELT filename format)
dt4 = parse_timestamp("20250316000100")
print(dt4)  # 2025-03-16 00:01:00

# Use parsed timestamps with download
from gdeltnews import download
start_dt = parse_timestamp("20250315000000")
end_dt = start_dt + datetime.timedelta(hours=2)
download(start_dt, end_dt, outdir="gdeltdata")
```
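Each minute slot corresponds to one compact-format (14-digit) filename on the GDELT servers, which is why `parse_timestamp` accepts that format directly. The sketch below illustrates how a time range expands into per-minute filenames; the base URL here is an assumption for illustration only, and the package resolves the actual file locations itself.

```python
import datetime

# Minimal sketch: enumerate the minute slots a download covers.
# ASSUMPTION: BASE_URL is illustrative, not taken from the package.
BASE_URL = "http://data.gdeltproject.org/gdeltv3/webngrams"

def minute_files(start: datetime.datetime, end: datetime.datetime):
    """Yield one compact-format (14-digit) filename per minute slot, end inclusive."""
    current = start
    while current <= end:
        stamp = current.strftime("%Y%m%d%H%M%S")
        yield f"{BASE_URL}/{stamp}.webngrams.json.gz"
        current += datetime.timedelta(minutes=1)

urls = list(minute_files(
    datetime.datetime(2025, 11, 25, 10, 0),
    datetime.datetime(2025, 11, 25, 13, 59),
))
print(len(urls))                    # 240 minute slots, matching the download() example
print(urls[0].rsplit("/", 1)[-1])   # 20251125100000.webngrams.json.gz
```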
## reconstruct - Bulk article reconstruction

Reconstructs article text from GDELT Web NGrams files using parallel processing. Processes all `*.webngrams.json.gz` files in a directory, reconstructing full article text from overlapping n-gram fragments. Must be run from a `.py` script (not Jupyter) due to multiprocessing requirements. Produces one CSV file per input file with the columns `Text|Date|URL|Source`.

```python
from multiprocessing import freeze_support
from gdeltnews import reconstruct

def main():
    reconstruct(
        input_dir="gdeltdata",           # Directory with .webngrams.json.gz files
        output_dir="gdeltpreprocessed",  # Output directory for CSV files
        language="it",                   # Language filter (None = all languages)
        url_filters=[                    # URL substrings to keep (any match keeps URL)
            "repubblica.it",
            "corriere.it",
        ],
        processes=10,                    # Worker processes (None = all CPU cores)
        delete_gz=False,                 # Delete original .gz after processing (default: False)
        delete_json=True,                # Delete temp .json after processing (default: True)
        delete_empty_csv=True,           # Delete CSVs with no articles (default: True)
        show_progress=True,              # Show progress bar (default: True)
    )

if __name__ == "__main__":
    freeze_support()  # Required on Windows
    main()

# Output:
# Found 187 *.webngrams.json.gz files in gdeltdata
# Output CSV files will be written to gdeltpreprocessed
# Processing 20251125100200.webngrams.json.gz
# Loading and filtering data from gdeltdata/20251125100200.webngrams.json...
# Reconstructing 45 articles using 10 processes...
# Wrote 45 articles to gdeltpreprocessed/20251125100200.webngrams.articles.csv
```

## filtermerge - Filter and merge reconstructed CSVs

Filters, deduplicates, and merges multiple CSV files into a single output file. Applies Boolean queries with AND, OR, and NOT operators (case-insensitive substring matching), deduplicates by URL (keeping the row with the longest text), and combines all filtered results into one CSV.

```python
from gdeltnews import filtermerge

# Filter with a complex Boolean query
filtermerge(
    input_dir="gdeltpreprocessed",           # Directory with CSV files from reconstruct
    output_file="final_filtered_dedup.csv",  # Output CSV path
    query='((elezioni OR voto) AND (regionali OR campania)) OR ((fico OR cirielli) AND NOT veneto)',
    keep_temp=False,  # Keep intermediate .tmp file (default: False)
    verbose=True,     # Print progress messages (default: True)
)

# Output:
# Filtering CSV files in gdeltpreprocessed into temporary file final_filtered_dedup.csv.tmp.
# Deduplicating by URL and writing final output to final_filtered_dedup.csv.

# Query syntax examples:
# Simple term:     query='elections'
# AND query:       query='biden AND trump'
# OR query:        query='inflation OR recession'
# NOT query:       query='economy AND NOT stock'
# Phrase (quoted): query='"giorgia meloni" AND italia'
# Complex:         query='(("climate change" OR global warming) AND policy) AND NOT denial'

# Without query (merge and deduplicate only)
filtermerge(
    input_dir="gdeltpreprocessed",
    output_file="all_articles_dedup.csv",
    query=None,  # No filtering, just merge and deduplicate
)
```
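The deduplication rule deserves emphasis: when several input CSVs contain the same URL, only the row with the longest reconstructed text survives. Here is a minimal pandas sketch of that documented rule (an illustration only, not the package's implementation):

```python
import pandas as pd

# Illustration of the documented dedup rule: for each URL, keep the row
# whose Text is longest. Columns follow the Text|Date|URL|Source format.
df = pd.DataFrame({
    "Text": ["short version", "a much longer reconstructed version"],
    "Date": ["2025-11-25", "2025-11-25"],
    "URL": ["https://example.com/a", "https://example.com/a"],
    "Source": ["example.com", "example.com"],
})

deduped = (
    df.assign(length=df["Text"].str.len())
      .sort_values("length")                       # shortest first
      .drop_duplicates(subset="URL", keep="last")  # keep the longest per URL
      .drop(columns="length")
)
print(deduped["Text"].tolist())  # ['a much longer reconstructed version']
```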
## build_query_expr - Parse Boolean queries

Parses and validates Boolean query strings for use with filtering. Returns a parsed expression and phrase mapping that can be reused for programmatic filtering. Useful for validating queries before processing large datasets.

```python
from gdeltnews import build_query_expr

# Parse a query string
expr, phrases = build_query_expr(
    '("climate change" OR warming) AND policy AND NOT denial'
)
print(f"Parsed expression: {expr}")
print(f"Phrase mappings: {phrases}")
# Parsed expression: ((PHRASE_0|warming)&policy&~denial)
# Phrase mappings: {'PHRASE_0': 'climate change'}

# Validate query before processing
try:
    expr, phrases = build_query_expr('invalid ((( query')
except ValueError as e:
    print(f"Invalid query: {e}")
# Invalid query: Invalid Boolean query: ...

# Empty/None query returns None (matches everything)
expr, phrases = build_query_expr(None)
print(expr, phrases)
# None {}
```

## reconstruct_webngrams_file - Process single file

Reconstructs articles from a single decompressed `.webngrams.json` file. This is the lower-level function used by `reconstruct()` for processing individual files. Useful for advanced users who want fine-grained control or have pre-downloaded/pre-filtered data from Google BigQuery.

```python
from multiprocessing import freeze_support
from gdeltnews import reconstruct_webngrams_file

def main():
    # Process a single pre-decompressed JSON file
    reconstruct_webngrams_file(
        input_file="gdeltdata/20251125103200.webngrams.json",
        output_file="output/articles_103200.csv",
        language="en",  # Language code (None = all languages)
        url_filters=["nytimes.com", "washingtonpost.com"],  # URL filters
        processes=None,  # Use all CPU cores
    )

if __name__ == "__main__":
    freeze_support()
    main()

# Output:
# Loading and filtering data from gdeltdata/20251125103200.webngrams.json...
# Reconstructing 128 articles using 8 processes...
# Wrote 128 articles to output/articles_103200.csv

# Output CSV format (pipe-delimited):
# Text|Date|URL|Source
# Full reconstructed article text...|2025-11-25|https://www.nytimes.com/...|nytimes.com
```

## load_and_filter_data - Load and filter raw data

Loads raw GDELT JSON data and applies language and URL filters. Returns transformed data grouped by URL and a list preserving original URL order. Useful for custom processing pipelines or data inspection.

```python
from gdeltnews.wordmatch import load_and_filter_data

# Load and filter a GDELT JSON file
articles, url_order = load_and_filter_data(
    input_file="gdeltdata/20251125103200.webngrams.json",
    language_filter="en",               # Keep only English (None = all languages)
    url_filter=["bbc.com", "cnn.com"],  # URL substrings to keep
)

print(f"Found {len(articles)} unique URLs")
print(f"First URL: {url_order[0]}")
print(f"Entries for first URL: {len(articles[url_order[0]])}")

# Inspect entry structure
first_url = url_order[0]
first_entry = articles[first_url][0]
print(f"Entry keys: {first_entry.keys()}")
# Entry keys: dict_keys(['sentence', 'pos', 'date', 'lang', 'type'])

# Process without language filter
articles_all, _ = load_and_filter_data(
    input_file="gdeltdata/20251125103200.webngrams.json",
    language_filter=None,  # Keep all languages
    url_filter=None,       # Keep all URLs
)
```

## reconstruct_sentence - Merge overlapping fragments

Reconstructs text from overlapping sentence fragments using greedy word-overlap merging. This is the core algorithm that joins n-gram fragments into coherent text. When positions are provided, it respects the original fragment order to avoid incorrect reorderings.

```python
from gdeltnews.wordmatch import reconstruct_sentence

# Overlapping fragments from GDELT n-grams
fragments = [
    "The president announced new",
    "announced new economic policies",
    "new economic policies today at",
    "policies today at the White House",
]
positions = [0, 5, 10, 15]  # Original positions from GDELT

# Reconstruct the full sentence
text = reconstruct_sentence(fragments, positions)
print(text)
# Output: The president announced new economic policies today at the White House

# Without positions (pure overlap-based merging)
text_no_pos = reconstruct_sentence(fragments, positions=None)
print(text_no_pos)
# Output: The president announced new economic policies today at the White House
```
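To make the merging step concrete, here is a simplified sketch of greedy word-overlap merging: extend the running text by each fragment after dropping the longest run of words shared between the text's end and the fragment's start. This illustrates the general technique only; it is not the package's actual implementation.

```python
def greedy_overlap_merge(fragments):
    """Simplified illustration of greedy word-overlap merging.

    For each fragment, find the longest suffix of the accumulated words
    that equals a prefix of the fragment, then append only the rest.
    Not the package's implementation, just the underlying idea.
    """
    words = fragments[0].split()
    for frag in fragments[1:]:
        frag_words = frag.split()
        best = 0
        # Try the largest possible overlap first, then shrink.
        for k in range(min(len(words), len(frag_words)), 0, -1):
            if words[-k:] == frag_words[:k]:
                best = k
                break
        words.extend(frag_words[best:])
    return " ".join(words)

fragments = [
    "The president announced new",
    "announced new economic policies",
    "new economic policies today at",
    "policies today at the White House",
]
print(greedy_overlap_merge(fragments))
# The president announced new economic policies today at the White House
```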
## process_article - Process single article

Processes entries for a single URL into reconstructed article text. Sorts entries by position, reconstructs the text, removes overlaps, and derives metadata. Designed for parallel execution via multiprocessing.

```python
from gdeltnews.wordmatch import process_article

# Article entries from load_and_filter_data
url = "https://www.example.com/news/article123"
entries = [
    {"sentence": "Breaking news today", "pos": 0, "date": "2025-11-25", "lang": "en", "type": ""},
    {"sentence": "news today from the", "pos": 3, "date": "2025-11-25", "lang": "en", "type": ""},
    {"sentence": "today from the capital", "pos": 6, "date": "2025-11-25", "lang": "en", "type": ""},
]

# Process the article
result = process_article(
    url_entries_tuple=(url, entries),
    url_filters=["example.com"],
)
print(result)
# Output:
# {
#     'url': 'https://www.example.com/news/article123',
#     'text': 'Breaking news today from the capital',
#     'date': '2025-11-25',
#     'source': 'example.com'
# }
```

## Complete Workflow Example

A full end-to-end example showing the typical three-step workflow for collecting and analyzing news articles from GDELT.

```python
#!/usr/bin/env python3
"""Complete GDELT news reconstruction workflow."""

from multiprocessing import freeze_support
from gdeltnews import download, reconstruct, filtermerge

def main():
    # Step 1: Download GDELT data for a time range
    print("=== Step 1: Downloading GDELT data ===")
    stats = download(
        start="2025-11-25T10:00:00",
        end="2025-11-25T13:59:00",
        outdir="gdeltdata",
        decompress=False,  # reconstruct() handles decompression
    )
    print(f"Downloaded {stats.downloaded_gz} files\n")

    # Step 2: Reconstruct articles with filtering
    print("=== Step 2: Reconstructing articles ===")
    reconstruct(
        input_dir="gdeltdata",
        output_dir="gdeltpreprocessed",
        language="it",  # Italian articles only
        url_filters=["repubblica.it", "corriere.it"],
        processes=10,
    )
    print("Reconstruction complete\n")

    # Step 3: Filter, deduplicate, and merge
    print("=== Step 3: Filtering and merging ===")
    filtermerge(
        input_dir="gdeltpreprocessed",
        output_file="final_results.csv",
        query='((elezioni OR voto) AND (regionali OR campania)) OR ((fico OR cirielli) AND NOT veneto)',
    )
    print("Pipeline complete! Results in final_results.csv")

if __name__ == "__main__":
    freeze_support()  # Required for Windows multiprocessing
    main()

# Output CSV format (final_results.csv):
# Text|Date|URL|Source
# Reconstructed article about elections...|2025-11-25|https://www.repubblica.it/...|repubblica.it
# Another article about regional voting...|2025-11-25|https://www.corriere.it/...|corriere.it
```
## Summary

gdeltnews provides a complete solution for researchers and data scientists who need to analyze global news content from the GDELT Web News NGrams 3.0 dataset. The three main functions, `download()`, `reconstruct()`, and `filtermerge()`, form a pipeline that handles the entire workflow from data acquisition to analysis-ready output. The package is particularly valuable for studying news coverage across languages, tracking global events, and conducting large-scale media analysis without expensive data subscriptions.

Integration with existing data pipelines is straightforward: the pipe-delimited CSV output works well with pandas, the Boolean query syntax supports complex filtering needs, and the multiprocessing architecture enables efficient processing of large time ranges. Advanced users can leverage lower-level functions such as `reconstruct_webngrams_file()` and `load_and_filter_data()` for custom workflows, or integrate with Google BigQuery to pre-filter GDELT data before reconstruction.
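Since the final output is pipe-delimited, loading it into pandas takes one line. A minimal sketch, assuming pandas is installed and using the `final_results.csv` produced by the workflow example above:

```python
import pandas as pd

# Load the pipe-delimited CSV produced by filtermerge
# (columns: Text, Date, URL, Source).
df = pd.read_csv("final_results.csv", sep="|")

print(df.columns.tolist())          # ['Text', 'Date', 'URL', 'Source']
print(len(df), "unique articles")   # one row per URL after deduplication
print(df["Source"].value_counts())  # article counts per news outlet
```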