### Get Example Paper Path from Installed Package

Source: https://github.com/massimoaria/contentanalysis/blob/master/inst/examples/README.md

Retrieves the file path for the example paper using the `get_example_paper()` function. This is the recommended method for users of the installed package. The function downloads the file if it is not found locally.

```r
paper_path <- get_example_paper()
```

--------------------------------

### Get Example Paper Path from GitHub

Source: https://github.com/massimoaria/contentanalysis/blob/master/inst/examples/README.md

Looks up the example paper bundled with the package using `system.file()`. Use this when working with the development version from GitHub. Note that `system.file()` only searches installed packages and returns an empty string (`""`) if the package or file cannot be found.

```r
paper_path <- system.file('examples', 'example_paper.pdf', package = 'contentanalysis')
```

--------------------------------

### GPL Notice for Interactive Program Startup

Source: https://github.com/massimoaria/contentanalysis/blob/master/LICENSE.md

Display this notice when a program starts in interactive mode to inform users about its free software status, warranty, and redistribution conditions.

```text
<program>  Copyright (C) <year>  <name of author>
This program comes with ABSOLUTELY NO WARRANTY; for details type 'show w'.
This is free software, and you are welcome to redistribute it
under certain conditions; type 'show c' for details.
```

--------------------------------

### Add New Fixture Test

Source: https://github.com/massimoaria/contentanalysis/blob/master/tests/testthat/fixtures/README.md

Example of how to add a test for a new PDF fixture. Ensure pdftools is installed and the fixture file exists.

```R
it("extracts from your-new-fixture.pdf", {
  skip_if_not_installed("pdftools")
  skip_if_no_fixture("your-new-fixture.pdf")

  pdf_file <- fixture_path("your-new-fixture.pdf")
  result <- pdf2txt_multicolumn_safe(pdf_file)

  expect_type(result, "character")
  expect_gt(nchar(result), 0)
  # Add specific expectations based on known content
})
```

--------------------------------

### Download Example Scientific Paper

Source: https://github.com/massimoaria/contentanalysis/blob/master/README.md

Download a sample scientific paper from a specified URL to a local file. This is useful for testing the analysis functions with a real-world example. Ensure the 'destfile' path is writable.

```r
paper_url <- "https://raw.githubusercontent.com/massimoaria/contentanalysis/master/inst/examples/example_paper.pdf"
download.file(paper_url, destfile = "example_paper.pdf", mode = "wb")
```

--------------------------------

### Install contentanalysis Package from GitHub

Source: https://github.com/massimoaria/contentanalysis/blob/master/README.md

Install the development version of the contentanalysis package directly from GitHub using the devtools package. Ensure devtools is installed first.

```r
# install.packages("devtools")
devtools::install_github("massimoaria/contentanalysis")
```

--------------------------------

### Standard GPL Copyright Notice for Source Files

Source: https://github.com/massimoaria/contentanalysis/blob/master/LICENSE.md

Include this notice at the start of each source file to state the exclusion of warranty and provide copyright information. It points to the full license text.

```text
Copyright (C) <year>  <name of author>

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <https://www.gnu.org/licenses/>.
```

--------------------------------

### Calculate and View Word Distribution by Section

Source: https://github.com/massimoaria/contentanalysis/blob/master/README.md

Tracks the frequency of specified terms across the sections of a document. Use this to see how specific keywords are distributed through the text, for example to analyze the prevalence of disease-related terms in different segments.

```r
library(dplyr)  # provides %>%, select(), arrange(), desc()

# Track disease-related terms
disease_terms <- c("covid", "pandemic", "health", "policy", "vaccination")

dist <- calculate_word_distribution(doc, disease_terms, use_sections = TRUE)

# View frequencies by section
dist %>%
  select(segment_name, word, count, percentage) %>%
  arrange(segment_name, desc(percentage))
#> # A tibble: 1 × 4
#>   segment_name word   count percentage
#>
#> 1 Conclusion   health     1      0.330

# Visualize trends
# plot_word_distribution(dist, plot_type = "area", smooth = FALSE)
```

--------------------------------

### Generate Test Fixtures via Command Line

Source: https://github.com/massimoaria/contentanalysis/blob/master/tests/testthat/fixtures/README.md

Execute this command in the terminal to regenerate the test fixtures using an Rscript.

```bash
Rscript create_test_fixtures.R
```

--------------------------------

### Generate Test Fixtures

Source: https://github.com/massimoaria/contentanalysis/blob/master/tests/testthat/fixtures/README.md

Run this R code to regenerate the test fixtures. Ensure the 'create_test_fixtures.R' script is available.

```r
source("create_test_fixtures.R")
main()
```

--------------------------------

### Set Google Gemini API Key in .Renviron

Source: https://github.com/massimoaria/contentanalysis/blob/master/README.md

Persistently set the GEMINI_API_KEY environment variable for Google Gemini API access by adding it to your .Renviron file.

```r
# Add to ~/.Renviron
GEMINI_API_KEY=your-api-key-here
```

--------------------------------

### Set OpenAlex API Key in .Rprofile

Source: https://github.com/massimoaria/contentanalysis/blob/master/README.md

Automatically set your OpenAlex API key at R startup by adding the configuration to your .Rprofile file.

```r
# Add to ~/.Rprofile
openalexR::oa_apikey("your-api-key-here")
```

--------------------------------

### Set Google Gemini API Key

Source: https://github.com/massimoaria/contentanalysis/blob/master/README.md

Configure the GEMINI_API_KEY environment variable in your R session to enable AI-enhanced PDF import and rhetorical move analysis.

```r
Sys.setenv(GEMINI_API_KEY = "your-api-key-here")
```

--------------------------------

### Fixture Metadata Information

Source: https://github.com/massimoaria/contentanalysis/blob/master/tests/testthat/fixtures/README.md

Provides sample metadata for different types of PDF fixtures, including page count, column structure, approximate character count, and expected content.

```R
# Sample characteristics (approximate)
fixtures_info <- list(
  "sample-single-column.pdf" = list(
    pages = 1,
    columns = 1,
    approx_chars = 500,
    contains = c("SAMPLE DOCUMENT", "Introduction", "Methods")
  ),
  "sample-two-columns.pdf" = list(
    pages = 1,
    columns = 2,
    approx_chars = 400,
    contains = c("TWO-COLUMN", "ABSTRACT", "Introduction")
  ),
  "sample-three-columns.pdf" = list(
    pages = 1,
    columns = 3,
    approx_chars = 200,
    contains = c("THREE-COLUMN", "Column 1", "Column 2", "Column 3")
  )
)
```

--------------------------------

### Display Citation Contexts

Source: https://github.com/massimoaria/contentanalysis/blob/master/README.md

Shows the first few rows of citation contexts, including the cleaned citation text, section, and the full context where the citation appears. Useful for initial data exploration.

```r
head(analysis$citation_contexts[, c("citation_text_clean", "section", "full_context")])
```

--------------------------------

### View Citation-Reference Matches

Source: https://github.com/massimoaria/contentanalysis/blob/master/README.md

Shows the first few citation-reference mappings, including the clean citation text, referenced authors, year, and match confidence. Useful for verifying link quality.

```r
# View citation-reference matches with confidence levels
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union

head(analysis$citation_references_mapping[, c("citation_text_clean", "ref_authors", "ref_year", "match_confidence")])
#> # A tibble: 6 × 4
#>   citation_text_clean                        ref_authors ref_year match_confidence
#>
#> 1 (Mitchell, 1997)                           Mitchell    1997     high
#> 2 (Breiman, Friedman, Olshen, & Stone, 1984) Breiman     1984     high
#> 3 (Breiman, 2001)                            Breiman, L. 2001     high
#> 4 (see Breiman, 1996)                        Breiman, L. 1996     high
#> 5 (Hastie, Tibshirani, & Friedman, 2009)     Hastie      2009     high
#> 6 (Hastie et al., 2009)                      Hastie      2009     high
```

--------------------------------

### Set Gemini API Key for Hybrid Analysis

Source: https://github.com/massimoaria/contentanalysis/blob/master/README.md

Sets the Gemini API key as an environment variable. This is required for hybrid classification that uses the Gemini LLM.

```r
# Set your Gemini API key
Sys.setenv(GEMINI_API_KEY = "your-api-key-here")
```

--------------------------------

### Load contentanalysis Library

Source: https://github.com/massimoaria/contentanalysis/blob/master/README.md

Load the contentanalysis library into your R session to access its functions. This is a prerequisite for using any of the package's functionalities.

```r
library(contentanalysis)
```

--------------------------------

### Import PDF and Detect Sections Automatically

Source: https://github.com/massimoaria/contentanalysis/blob/master/README.md

Import a PDF document and automatically detect its sections, such as Introduction, Related Work, and References. This function normalizes references and can be configured with the number of columns and citation type.

```r
doc <- pdf2txt_auto("example_paper.pdf", n_columns = 2, citation_type = "author_year")
#> Using 17 sections from PDF table of contents
#> Found 15 sections: Introduction, Related work, Internal processing approaches, Random forest extra information, Visualization toolkits, Post-Hoc approaches, Size reduction, Rule extraction, Local explanation, Comparison study, Experimental design, Analysis, Conclusion, Acknowledgment, References
#> Normalized 32 references with consistent \n\n separators

# Check detected sections
names(doc)
#>  [1] "Full_text"                       "Introduction"
#>  [3] "Related work"                    "Internal processing approaches"
#>  [5] "Random forest extra information" "Visualization toolkits"
#>  [7] "Post-Hoc approaches"             "Size reduction"
#>  [9] "Rule extraction"                 "Local explanation"
#> [11] "Comparison study"                "Experimental design"
#> [13] "Analysis"                        "Conclusion"
#> [15] "Acknowledgment"                  "References"
```

--------------------------------

### View Parsed References

Source: https://github.com/massimoaria/contentanalysis/blob/master/README.md

Displays the first few rows of parsed references, showing author, year, journal, and source. Useful for a quick overview of the reference data.

```r
# View parsed references (enriched with CrossRef and OpenAlex)
head(analysis$parsed_references[, c("ref_first_author", "ref_year", "ref_journal", "ref_source")])
#>   ref_first_author ref_year ref_journal           ref_source
#> 1 Adadi            2018     IEEE Access           crossref
#> 2                                                 crossref
#> 3 Branco           2016     ACM Computing Surveys crossref
#> 4 Breiman          1996     Machine Learning      crossref
#> 5 Breiman          2001     Machine Learning      crossref
#> 6 Breiman          1984     International Group   crossref
```

--------------------------------

### Classify Rhetorical Moves with Rules

Source: https://github.com/massimoaria/contentanalysis/blob/master/README.md

Classifies rhetorical moves using predefined cue phrase dictionaries without requiring an API key. Displays sentence-level classifications.

```r
# Classify rhetorical moves using cue phrase dictionaries
moves <- classify_rhetorical_moves(doc, use_llm = FALSE)
#> Analyzing rhetorical moves in 12 sections: Introduction, Related work, Internal processing approaches, Random forest extra information, Visualization toolkits, Post-Hoc approaches, Size reduction, Rule extraction, Local explanation, Comparison study, Analysis, Conclusion
#> Segmented 194 sentences

# Sentence-level classification
head(moves$sentences[, c("sentence_id", "section", "move", "step", "confidence")])
#> # A tibble: 6 × 5
#>   sentence_id section      move                    step               confidence
#>
#> 1           1 Introduction Unclassified            Unclassified             0
#> 2           2 Introduction Unclassified            Unclassified             0
#> 3           3 Introduction Unclassified            Unclassified             0
#> 4           4 Introduction M3: Occupying the niche Announcing purpose       0.45
#> 5           5 Introduction Unclassified            Unclassified             0
#> 6           6 Introduction Unclassified            Unclassified             0
```

--------------------------------

### Set OpenAlex API Key

Source: https://github.com/massimoaria/contentanalysis/blob/master/README.md

Set your OpenAlex API key in your R session to increase rate limits for heavier usage. This is optional but recommended for batch analysis.

```r
openalexR::oa_apikey("your-api-key-here")
```

--------------------------------

### Create and Display Citation Network

Source: https://github.com/massimoaria/contentanalysis/blob/master/README.md

Generates an interactive citation network visualization. Parameters control the maximum distance between citations, minimum connections for a node, and whether to show labels. The resulting network object can be directly displayed.

```r
# Create citation network
network <- create_citation_network(
  citation_analysis_results = analysis,
  max_distance = 800,    # Max distance between citations (characters)
  min_connections = 2,   # Minimum connections to include a node
  show_labels = TRUE
)

# Display interactive network
network
```

--------------------------------

### Check Data Sources for References

Source: https://github.com/massimoaria/contentanalysis/blob/master/README.md

Counts the occurrences of each reference source (e.g., 'crossref'). Helps understand the origin of the parsed reference data.

```r
# Check data sources
table(analysis$parsed_references$ref_source)
#>
#> crossref
#>       33
```

--------------------------------

### Ignore Large PDF Fixtures in Git

Source: https://github.com/massimoaria/contentanalysis/blob/master/tests/testthat/fixtures/README.md

Configure .gitignore to exclude large PDF files from version control while including specific smaller generated ones. This helps manage repository size.

```gitignore
# Ignore large PDF fixtures
tests/testthat/fixtures/*.pdf
!tests/testthat/fixtures/sample-*.pdf
!tests/testthat/fixtures/empty-*.pdf
```

--------------------------------

### Set Mailto for Polite Pool Access

Source: https://github.com/massimoaria/contentanalysis/blob/master/README.md

Configure the 'mailto' parameter to use the polite pool for CrossRef and OpenAlex APIs, enabling faster and more reliable access. This is recommended for most users.

```r
analysis <- analyze_scientific_content(
  text = doc,
  doi = "10.xxxx/xxxxx",
  mailto = "your@email.com"  # Your email for polite pool access
)
```

--------------------------------

### Analyze Citation Distribution by Section

Source: https://github.com/massimoaria/contentanalysis/blob/master/README.md

Shows the distribution of citations across different sections of a document. Provides counts and percentages for each section.

```r
analysis$citation_metrics$section_distribution
#> # A tibble: 14 × 3
#>    section                             n percentage
#>
#>  1 Introduction                        6      12.2
#>  2 Related work                        9      18.4
#>  3 Internal processing approaches      0       0
#>  4 Random forest extra information     6      12.2
#>  5 Visualization toolkits              4       8.16
#>  6 Post-Hoc approaches                 0       0
#>  7 Size reduction                      6      12.2
#>  8 Rule extraction                     3       6.12
#>  9 Local explanation                   5      10.2
#> 10 Comparison study                    2       4.08
#> 11 Experimental design                 4       8.16
#> 12 Analysis                            4       8.16
#> 13 Conclusion                          0       0
#> 14 Acknowledgment                      0       0
```

--------------------------------

### Hybrid Rhetorical Move Classification with Gemini

Source: https://github.com/massimoaria/contentanalysis/blob/master/README.md

Performs rhetorical move classification using a hybrid approach that combines rule-based methods with the Gemini LLM. Includes a progress bar for long analyses.

```r
# Hybrid classification with progress bar
Moves_hybrid <- classify_rhetorical_moves(doc, use_llm = TRUE, model = "2.5-flash")
```

--------------------------------

### Activate Rhetorical Moves in Content Analysis Pipeline

Source: https://github.com/massimoaria/contentanalysis/blob/master/README.md

Activates rhetorical move analysis as part of the full scientific content analysis pipeline using analyze_scientific_content().

```r
analysis <- analyze_scientific_content(
  text = doc,
  doi = "10.1016/j.mlwa.2021.100094",
  citation_type = "author_year",
  rhetorical_moves = TRUE  # Activate rhetorical analysis
)
```

--------------------------------

### Create Citation Network with Custom Min Connections

Source: https://github.com/massimoaria/contentanalysis/blob/master/README.md

Generates a citation network highlighting only highly connected citations by setting a large `max_distance` and a high `min_connections`. This is useful for identifying central or influential citations.

```r
# Show only highly connected citations
network_hubs <- create_citation_network(
  analysis,
  max_distance = 1000,
  min_connections = 5
)
```

--------------------------------

### View Enriched OpenAlex Metadata

Source: https://github.com/massimoaria/contentanalysis/blob/master/README.md

Displays the first few rows of enriched OpenAlex metadata for references, including title, publication year, citation count, type, and OpenAlex status. Requires OpenAlex data to be retrieved.

```r
# If OpenAlex data was retrieved
if (!is.null(analysis$references_oa)) {
  # View enriched metadata
  head(analysis$references_oa[, c("title", "publication_year", "cited_by_count", "type", "oa_status")])

  # Analyze citation impact
  summary(analysis$references_oa$cited_by_count)
}
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
#>    101.0    207.2   1153.5  12252.6   5411.2 123905.0
```

--------------------------------

### Aggregate Rhetorical Move Blocks

Source: https://github.com/massimoaria/contentanalysis/blob/master/README.md

Aggregates consecutive sentences with the same rhetorical move into blocks. Displays aggregated move block information.

```r
# Aggregated move blocks (consecutive sentences with the same move)
head(moves$move_blocks[, c("block_id", "section", "move", "n_sentences", "avg_confidence")])
#> # A tibble: 6 × 5
#>   block_id section      move                         n_sentences avg_confidence
#>
#> 1        1 Introduction Unclassified                           3           0
#> 2        2 Introduction M3: Occupying the niche                1           0.45
#> 3        3 Introduction Unclassified                           3           0
#> 4        4 Introduction M2: Establishing a niche               1           0.35
#> 5        5 Introduction M1: Establishing a territory           1           0.65
#> 6        6 Introduction Unclassified                           6           0
```

--------------------------------

### Rhetorical Move Distribution Summary

Source: https://github.com/massimoaria/contentanalysis/blob/master/README.md

Calculates and displays the distribution of rhetorical moves across different sections of a document.

```r
# Move distribution across sections
moves$summary$move_distribution
#> # A tibble: 20 × 4
#>    section                move                             n   pct
#>
#>  1 Analysis               M1: Establishing a territory     3  42.9
#>  2 Analysis               M2: Establishing a niche         1  14.3
#>  3 Analysis               M3: Occupying the niche          3  42.9
#>  4 Comparison study       M3: Occupying the niche          1 100
#>  5 Conclusion             M2: Evaluating the study         2  66.7
#>  6 Conclusion             M3: Looking forward              1  33.3
#>  7 Introduction           M1: Establishing a territory     2  28.6
#>  8 Introduction           M2: Establishing a niche         2  28.6
#>  9 Introduction           M3: Occupying the niche          3  42.9
#> 10 Local explanation      M1: Establishing a territory     6  85.7
#> 11 Local explanation      M2: Establishing a niche         1  14.3
#> 12 Post-Hoc approaches    M3: Occupying the niche          1 100
#> 13 Related work           M1: Establishing context         1  50
#> 14 Related work           M2: Reviewing prior work         1  50
#> 15 Rule extraction        M1: Establishing a territory     1  33.3
#> 16 Rule extraction        M2: Establishing a niche         2  66.7
#> 17 Size reduction         M1: Establishing a territory     2  40
#> 18 Size reduction         M2: Establishing a niche         2  40
#> 19 Size reduction         M3: Occupying the niche          1  20
#> 20 Visualization toolkits M1: Establishing a territory     3 100
```

--------------------------------

### Create Citation Network with Hidden Labels

Source: https://github.com/massimoaria/contentanalysis/blob/master/README.md

Generates a citation network visualization with labels turned off using `show_labels = FALSE`. This gives a cleaner view when the focus is on network structure rather than individual citation labels.

```r
# Hide labels for cleaner visualization
network_clean <- create_citation_network(
  analysis,
  show_labels = FALSE
)
```

--------------------------------

### View Hybrid Analysis Rhetorical Flow

Source: https://github.com/massimoaria/contentanalysis/blob/master/README.md

Shows the rhetorical flow of a paper after performing hybrid classification using the Gemini LLM.

```r
# View the rhetorical flow of the paper
Moves_hybrid$summary$flow_pattern
```

--------------------------------

### Analyze Match Quality Distribution

Source: https://github.com/massimoaria/contentanalysis/blob/master/README.md

Counts the occurrences of different match confidence levels for citation-reference links. Helps assess the reliability of the mappings.

```r
# Match quality distribution
table(analysis$citation_references_mapping$match_confidence)
#>
#>            high no_match_author
#>              41               8
```

--------------------------------

### Exclude PDF Fixtures from Built Package

Source: https://github.com/massimoaria/contentanalysis/blob/master/tests/testthat/fixtures/README.md

Configure .Rbuildignore to prevent large or real-world PDF fixtures from being included in the final package distribution. This ensures the package remains lean.

```regex
^tests/testthat/fixtures/large-.*\.pdf$
^tests/testthat/fixtures/.*-real\.pdf$
```

--------------------------------

### Generate Citation Cluster Descriptions

Source: https://github.com/massimoaria/contentanalysis/blob/master/README.md

Use `describe_citation_clusters()` to generate thematic descriptions for each section's bibliography based on TF-IDF analysis of reference titles. View the top terms per section or detailed TF-IDF scores.

```r
# Generate cluster descriptions
cluster_desc <- describe_citation_clusters(analysis, top_n = 10)

# View summary: top terms per section
cluster_desc$cluster_summary
#> # A tibble: 10 × 3
#>    section                         n_references top_terms
#>
#>  1 Introduction                               5 learning, bagging, bagging pred…
#>  2 Related work                               7 interpretable machine, interpre…
#>  3 Random forest extra information            5 forests, random forests, annals…
#>  4 Visualization toolkits                     4 tree, forests, random forests, …
#>  5 Size reduction                             4 adaptive, adaptive diagnostic, …
#>  6 Rule extraction                            2 annals applied, applied, applie…
#>  7 Local explanation                          2 box classifiers, classification…
#>  8 Comparison study                           1 annals applied, applied, applie…
#>  9 Experimental design                        3 bell, bell laboratories, labora…
#> 10 Analysis                                   3 acm computing, computing survey…

# View detailed TF-IDF scores
cluster_desc$cluster_descriptions
#> # A tibble: 83 × 7
#>    section      ngram                ngram_size     n    tf   idf tf_idf
#>
#>  1 Introduction learning                      1     2  0.08  1.61 0.129
#>  2 Introduction bagging                       1     1  0.04  2.30 0.0921
#>  3 Introduction bagging predictors            2     1  0.04  2.30 0.0921
#>  4 Introduction belmont                       1     1  0.04  2.30 0.0921
#>  5 Introduction belmont wadsworth             2     1  0.04  2.30 0.0921
#>  6 Introduction data                          1     1  0.04  2.30 0.0921
#>  7 Introduction data mining                   2     1  0.04  2.30 0.0921
#>  8 Introduction elements                      1     1  0.04  2.30 0.0921
#>  9 Introduction elements statistical          2     1  0.04  2.30 0.0921
#> 10 Introduction inference                     1     1  0.04  2.30 0.0921
#> # ℹ 73 more rows
```

--------------------------------

### Access Rhetorical Moves Results from Pipeline

Source: https://github.com/massimoaria/contentanalysis/blob/master/README.md

Accesses the results of the rhetorical move analysis, including sentences, move blocks, and summary statistics, after it has been run as part of the full content analysis pipeline.

```r
# Access results
analysis$rhetorical_moves$sentences
analysis$rhetorical_moves$move_blocks
analysis$rhetorical_moves$summary
```

--------------------------------

### Perform Comprehensive Scientific Content Analysis

Source: https://github.com/massimoaria/contentanalysis/blob/master/README.md

Analyze scientific content by extracting citations, retrieving metadata from CrossRef and OpenAlex, matching citations to references, and performing text analysis. Requires the document text, DOI, and an email address for API requests.

```r
analysis <- analyze_scientific_content(
  text = doc,
  doi = "10.1016/j.mlwa.2021.100094",
  mailto = "your@email.com",
  citation_type = "author_year"
)
#> Extracting author-year citations only
#> Attempting to retrieve references from CrossRef...
#> Successfully retrieved 33 references from CrossRef
#> Fetching Open Access metadata for 14 DOIs from OpenAlex...
#> Successfully retrieved metadata for 14 references from OpenAlex
#> Enriching CrossRef references with 32 PDF-parsed entries...
#> Enriched 10 CrossRef references with PDF-parsed data
```

--------------------------------

### View Summary Statistics of Content Analysis

Source: https://github.com/massimoaria/contentanalysis/blob/master/README.md

Access the summary statistics generated by the analyze_scientific_content function. This includes metrics like total words, unique words, citation counts, lexical diversity, and reference matching quality.

```r
analysis$summary
#> $total_words_analyzed
#> [1] 3230
#>
#> $unique_words
#> [1] 1238
#>
#> $citations_extracted
#> [1] 49
#>
#> $narrative_citations
#> [1] 15
#>
#> $parenthetical_citations
#> [1] 34
#>
#> $complex_citations_parsed
#> [1] 12
#>
#> $lexical_diversity
#> [1] 0.3832817
#>
#> $average_citation_context_length
#> [1] 2856.061
#>
#> $citation_density_per_1000_words
#> [1] 6.83
#>
#> $references_parsed
#> [1] 33
#>
#> $citations_matched_to_refs
#> [1] 41
#>
#> $match_quality
#> # A tibble: 2 × 3
#>   match_confidence     n percentage
#>
#> 1 high                41       83.7
#> 2 no_match_author      8       16.3
#>
#> $citation_type_used
#> [1] "author_year"
```

--------------------------------

### Create Citation Network with Custom Max Distance

Source: https://github.com/massimoaria/contentanalysis/blob/master/README.md

Generates a citation network focusing only on very close citations by setting a small `max_distance` and a low `min_connections`. This helps to visualize immediate citation relationships.

```r
# Focus on very close citations only
network_close <- create_citation_network(
  analysis,
  max_distance = 300,
  min_connections = 1
)
```

--------------------------------

### Examine Most Frequent Words

Source: https://github.com/massimoaria/contentanalysis/blob/master/README.md

Display the top 10 most frequent words from the analysis results. This snippet shows the structure of the word frequencies output, including word, count, frequency, and rank.

```r
head(analysis$word_frequencies, 10)
```

--------------------------------

### Access Network Statistics

Source: https://github.com/massimoaria/contentanalysis/blob/master/README.md

Retrieves and displays statistics from a generated citation network. This includes the number of nodes and edges, average distance between citations, distribution of citations by section, and a list of citations appearing in multiple sections.

```r
stats <- attr(network, "stats")

# Network size
cat("Nodes:", stats$n_nodes, "\n")
cat("Edges:", stats$n_edges, "\n")
cat("Average distance:", stats$avg_distance, "characters\n")

# Citations by section
print(stats$section_distribution)

# Multi-section citations
if (nrow(stats$multi_section_citations) > 0) {
  print(stats$multi_section_citations)
}
```

--------------------------------

### Analyze Citation Co-occurrence

Source: https://github.com/massimoaria/contentanalysis/blob/master/README.md

View the initial data for citation co-occurrence analysis. This snippet displays the first 6 rows of network data, showing pairs of citations, their distance, and types.

```r
head(analysis$network_data)
```

--------------------------------

### Examine Citation Type Distribution

Source: https://github.com/massimoaria/contentanalysis/blob/master/README.md

Analyze the distribution of different citation types found in the document. This helps understand how authors cite references, distinguishing between narrative, parenthetical, and other forms.

```r
analysis$citation_metrics$type_distribution
#> # A tibble: 9 × 3
#>   citation_type                   n percentage
#>
#> 1 parsed_from_multiple           12      24.5
#> 2 author_year_basic               9      18.4
#> 3 author_year_and                 8      16.3
#> 4 narrative_etal                  7      14.3
#> 5 author_year_etal                3       6.12
#> 6 narrative_three_authors_and     3       6.12
#> 7 narrative_two_authors_and       3       6.12
#> 8 narrative_four_authors_and      2       4.08
#> 9 see_citations                   2       4.08
```

--------------------------------

### Track Methodological Terms

Source: https://github.com/massimoaria/contentanalysis/blob/master/README.md

Identify and count occurrences of predefined methodological terms within a document. This snippet requires a document object and a vector of terms to track.

```r
method_terms <- c("machine learning", "regression", "validation", "dataset")
word_dist <- calculate_word_distribution(doc, method_terms)
```

--------------------------------

### Calculate Readability Indices

Source: https://github.com/massimoaria/contentanalysis/blob/master/README.md

Compute various readability indices for a given text, such as Flesch-Kincaid grade level and Gunning Fog index. The 'detailed' argument provides additional metrics like sentence length and syllable counts.

```r
readability <- calculate_readability_indices(doc$Full_text, detailed = TRUE)
readability
#> # A tibble: 1 × 12
#>   flesch_kincaid_grade flesch_reading_ease automated_readability_index
#>
#> 1                 12.5                34.2                        11.9
#> # ℹ 9 more variables: gunning_fog_index, n_sentences,
#> #   n_words, n_syllables, n_characters,
#> #   n_complex_words, avg_sentence_length,
#> #   avg_syllables_per_word, pct_complex_words
```

--------------------------------

### Find Citations to Specific Authors

Source: https://github.com/massimoaria/contentanalysis/blob/master/README.md

Filters citation-reference mappings to find citations attributed to a specific author (e.g., 'Smith'). Returns the clean citation text, full reference text, and match confidence. The example paper cites no author named Smith, so the result shown is an empty tibble.

```r
library(dplyr)  # provides %>%, filter(), select()

# Find all citations to works by Smith
analysis$citation_references_mapping %>%
  filter(grepl("Smith", ref_authors, ignore.case = TRUE)) %>%
  select(citation_text_clean, ref_full_text, match_confidence)
#> # A tibble: 0 × 3
#> # ℹ 3 variables: citation_text_clean, ref_full_text, match_confidence
```
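
The author filter in the last snippet can be tried without running the full pipeline. The following is a minimal base-R sketch under stated assumptions: the `mapping` data frame is made-up toy data that only borrows column names from `analysis$citation_references_mapping`, and `find_citations_by_author()` is a hypothetical helper written for illustration, not a function of the contentanalysis package.

```r
# Toy stand-in for analysis$citation_references_mapping.
# These rows are illustrative assumptions, not output from the package.
mapping <- data.frame(
  citation_text_clean = c("(Breiman, 2001)", "(Hastie et al., 2009)", "(see Breiman, 1996)"),
  ref_authors         = c("Breiman, L.", "Hastie", "Breiman, L."),
  match_confidence    = c("high", "high", "high"),
  stringsAsFactors    = FALSE
)

# Hypothetical helper: keep rows whose author string matches `author`,
# ignoring case. grepl() returns FALSE for NA authors, so unmatched
# citations are dropped rather than causing an error.
find_citations_by_author <- function(mapping, author) {
  hits <- grepl(author, mapping$ref_authors, ignore.case = TRUE)
  mapping[hits, c("citation_text_clean", "match_confidence"), drop = FALSE]
}

breiman <- find_citations_by_author(mapping, "breiman")  # two matching rows
smith   <- find_citations_by_author(mapping, "Smith")    # zero rows, columns kept

print(breiman)
print(nrow(smith))
```

An empty result keeps its columns, which mirrors the `0 × 3` tibble shown for "Smith" in the snippet above: the query ran correctly, there was simply nothing to match.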