### Install stringdist from source

Source: https://github.com/markvanderloo/stringdist/blob/master/README.md

Commands to clone the repository and build the package from source using bash.

```bash
git clone https://github.com/markvanderloo/stringdist.git
cd stringdist
bash ./build.bash
R CMD INSTALL output/stringdist_*.tar.gz
```

--------------------------------

### Locate the C API documentation

Source: https://github.com/markvanderloo/stringdist/blob/master/README.md

Command to retrieve the file path for the stringdist C API documentation PDF.

```r
system.file("doc/stringdist_api.pdf", package="stringdist")
```

--------------------------------

### Calculate String Distances with Various Methods

Source: https://context7.com/markvanderloo/stringdist/llms.txt

Demonstrates the usage of different string distance algorithms available in the stringdist package, including edit-based, Q-gram, Jaro, and phonetic methods.

```r
library(stringdist)

# Edit-based distances
stringdist("kitten", "sitting", method = "osa")      # Optimal String Alignment (default)
stringdist("kitten", "sitting", method = "lv")       # Levenshtein
stringdist("kitten", "sitting", method = "dl")       # Full Damerau-Levenshtein
stringdist("abc", "abd", method = "hamming")         # Hamming (equal length only)
stringdist("abc", "bc", method = "lcs")              # Longest Common Substring

# Q-gram based distances (set q parameter)
stringdist("night", "nacht", method = "qgram", q = 2)    # Q-gram distance
stringdist("night", "nacht", method = "cosine", q = 2)   # Cosine distance
stringdist("night", "nacht", method = "jaccard", q = 2)  # Jaccard distance

# Jaro and Jaro-Winkler (set p for Winkler boost)
stringdist("MARTHA", "MATHRA", method = "jw")            # Jaro distance
stringdist("MARTHA", "MATHRA", method = "jw", p = 0.1)   # Jaro-Winkler

# Phonetic distance
stringdist("Euler", "Ellery", method = "soundex")    # Soundex-based

# Custom edit weights: c(deletion, insertion, substitution, transposition)
stringdist("ab", "ba", method = "osa", weight = c(1, 1, 1, 0.5))

# Parallel processing (automatic by default)
stringdist(rep("hello", 10000), rep("hallo", 10000), nthread = 4)
```

--------------------------------

### Open stringdist C/C++ API Vignette

Source: https://github.com/markvanderloo/stringdist/blob/master/pkg/README.md

This command opens the vignette detailing the C/C++ API for the stringdist package. It is useful for developers who want to integrate stringdist functionality into other R packages.

```R
vignette("stringdist_C-Cpp_api", package="stringdist")
```

--------------------------------

### Cite the stringdist package

Source: https://github.com/markvanderloo/stringdist/blob/master/README.md

BibTeX entry for citing the R Journal article associated with the package.

```bibtex
@article{RJ-2014-011,
  author = {Mark P.J. van der Loo},
  title = {{The stringdist Package for Approximate String Matching}},
  year = {2014},
  journal = {{The R Journal}},
  doi = {10.32614/RJ-2014-011},
  url = {https://doi.org/10.32614/RJ-2014-011},
  pages = {111--122},
  volume = {6},
  number = {1}
}
```

--------------------------------

### Tabulate Q-gram Counts with qgrams

Source: https://context7.com/markvanderloo/stringdist/llms.txt

Use qgrams to create a table of q-gram counts from character vectors. This function is useful for analyzing text composition and can be configured with options like 'useNames' and 'useBytes' for performance.

```r
library(stringdist)

# Count 3-grams in a string
qgrams("hello world", q = 3)
#    hel ell llo lo  o w  wo wor orl rld
# V1   1   1   1   1   1   1   1   1   1
```

```r
# Compare q-gram profiles of multiple strings
x <- "I will not buy this record, it is scratched"
y <- "My hovercraft is full of eels"
z <- c("this", "is", "a", "dead", "parrot")
qgrams(A = x, B = y, C = z, q = 2)
#   I  wi il ll  n no ot t   b bu uy ...
# A  1  1  1  1  1  1  1  3  1  1  1
# B  0  0  0  1  0  0  0  1  0  0  0
# C  0  0  0  0  0  0  0  1  0  0  0
```

```r
# Q-grams with different settings
x <- "peter piper picked a peck of pickled peppers"
qgrams(x, q = 2)                    # Named columns
qgrams(x, q = 2, useNames = FALSE)  # Unnamed columns (faster)
qgrams(x, q = 2, useBytes = TRUE)   # Byte-wise (faster for ASCII)
```

```r
# Count unigrams (single characters)
qgrams(c("hello", "world"), q = 1)
```

--------------------------------

### Phonetic Encoding with phonetic

Source: https://context7.com/markvanderloo/stringdist/llms.txt

Generate phonetic codes for strings using the Soundex algorithm. Similar-sounding strings receive the same or similar codes, which is useful for matching names that are spelled differently but sound alike. Soundex keeps the first letter of each string, so names that differ in their first letter get different codes even when they sound the same.

```r
library(stringdist)

# Soundex encoding
phonetic(c("Euler", "Gauss", "Hilbert", "Knuth", "Lloyd", "Lukasiewicz", "Wachs"))
# Returns: "E460" "G200" "H416" "K530" "L300" "L222" "W200"
```

```r
# Similar sounding names get same code
phonetic(c("Robert", "Rupert"))
# Returns: "R163" "R163"
```

```r
phonetic(c("Smith", "Smyth"))
# Returns: "S530" "S530"
```

```r
# Use in matching similar-sounding names
names <- c("Johnson", "Johnsen", "Jonson", "Jackson")
codes <- phonetic(names)

# Group names by soundex code
split(names, codes)
```

```r
# Combine with stringdist for phonetic distance
stringdist("Catherine", "Kathryn", method = "soundex")
# Returns: 1 (codes "C365" and "K365" differ in the first letter)
```

--------------------------------

### Fuzzy String Matching with amatch

Source: https://context7.com/markvanderloo/stringdist/llms.txt

Use amatch for approximate string matching: for each input string it returns the index of the best match in a lookup table, provided that match lies within a specified maximum distance. Customize the 'nomatch' value for cases where no match is found.
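Conceptually, the result corresponds to computing every distance with `stringdist` and keeping the closest table entry if it is within `maxDist`. The following is a minimal sketch of that idea, not the package's implementation; the variable names `lookup`, `max_dist`, and `manual` are illustrative.

```r
library(stringdist)

x <- "leia"
lookup <- c("uhura", "leela")

# Distances from x to every candidate (default method: "osa")
d <- stringdist(x, lookup)

# Keep the closest candidate only if it lies within the allowed distance
max_dist <- 2
best <- which.min(d)
manual <- if (d[best] <= max_dist) best else NA_integer_

manual                                    # 2
amatch(x, lookup, maxDist = max_dist)     # 2
```

Lowering `max_dist` to 1 makes both approaches return `NA`, because "leela" is two edits away from "leia".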
```r
library(stringdist)

# Basic fuzzy matching
amatch("leia", c("uhura", "leela"), maxDist = 5)
# Returns: 2 (matches "leela")
```

```r
# Restrict maximum distance
amatch("leia", c("uhura", "leela"), maxDist = 1)
# Returns: NA (no match within distance 1)
```

```r
# Match multiple values against lookup table
amatch(c("leia", "uhura"), c("ripley", "leela", "scully", "trinity"), maxDist = 2)
# Returns: 2 NA (leia->leela, uhura has no close match)
```

```r
# Custom nomatch value
amatch("leia", c("uhura", "leela"), maxDist = 1, nomatch = 0)
# Returns: 0
```

```r
# Using different distance methods
names_to_find <- c("Jonh", "Micheal", "Robrt")
name_table <- c("John", "Michael", "Robert", "James")
amatch(names_to_find, name_table, maxDist = 2, method = "lv")
# Returns: 1 2 3 (corrected typos)
```

```r
# Jaro-Winkler for name matching (maxDist is on 0-1 scale)
amatch("Smith", c("Smyth", "Smithe", "Schmidt"), maxDist = 0.2, method = "jw")
```

--------------------------------

### stringsim - Compute String Similarity Scores

Source: https://context7.com/markvanderloo/stringdist/llms.txt

Computes string similarity scores between 0 (completely dissimilar) and 1 (identical).

```APIDOC
## stringsim(a, b, method)

### Description
Computes string similarity scores. Each score is the complement of the normalized distance.

### Parameters
#### Arguments
- **a** (character vector) - Required - First vector of strings.
- **b** (character vector) - Required - Second vector of strings.
- **method** (string) - Optional - Distance metric used for similarity calculation.

### Request Example
stringsim("ca", "abc")

### Response
- **result** (numeric) - Similarity score between 0 and 1.
```

--------------------------------

### Convert Strings to Integer Sequences for Comparison

Source: https://context7.com/markvanderloo/stringdist/llms.txt

Converts character strings into integer sequences using base R's utf8ToInt, which returns the Unicode code point of each character. This is a preprocessing step for sequence-based distance calculations.
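For plain ASCII input the integers returned by `utf8ToInt` equal the byte values, but in general the function returns one Unicode code point per character rather than one integer per byte. A short base-R sketch of the difference (the variable name `e_acute` is only for illustration):

```r
# ASCII: code points and byte values coincide
utf8ToInt("foo")                           # 102 111 111

# Non-ASCII: one code point, but two bytes in UTF-8
e_acute <- "\u00e9"                        # the letter e with an acute accent
utf8ToInt(e_acute)                         # 233
as.integer(charToRaw(enc2utf8(e_acute)))   # 195 169

# The conversion is reversible
intToUtf8(utf8ToInt("foo"))                # "foo"
```

Because each character maps to a single integer, an edit distance on the resulting sequences counts character edits, not byte edits.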
```r
library(stringdist)

a <- lapply(c("foo", "bar", "baz"), utf8ToInt)
seq_distmatrix(a)
```

--------------------------------

### Fuzzy Matching for Integer Sequences

Source: https://context7.com/markvanderloo/stringdist/llms.txt

Performs approximate matching between integer sequences, returning the index of the closest match within a specified maximum distance. Useful for finding similar patterns in ordered data.

```r
library(stringdist)

seq_amatch(
  list(c(1L, 2L, 3L)),
  list(c(1L, 2L, 4L), c(1L, 2L, 3L, 4L)),
  maxDist = 1
)
```

--------------------------------

### Compare Word Sequences for Sentence Similarity

Source: https://context7.com/markvanderloo/stringdist/llms.txt

Calculates the distance between two sentences represented as sequences of integer hashes, effectively measuring sentence similarity based on word order.

```r
library(stringdist)

sentence1 <- c(1L, 2L, 3L, 4L, 5L)  # "Mary had a little lamb"
sentence2 <- c(3L, 4L, 5L, 2L, 1L)  # "a little lamb had Mary"
seq_dist(list(sentence1), list(sentence2), method = "lv")
```

--------------------------------

### Fuzzy Grep Equivalent with grab

Source: https://context7.com/markvanderloo/stringdist/llms.txt

Use grab as a fuzzy version of R's grep function to find the index of elements in a character vector that approximately match a pattern. The 'value = TRUE' option returns the matched substring instead of the index.

```r
library(stringdist)

# `texts` is the character vector defined in the afind example
# grab - fuzzy grep equivalent
grab(texts, "grew", maxDist = 1)
# Returns: 1 (index of matching element)
```

```r
grab(texts, "grew", maxDist = 1, value = TRUE)
# Returns: "grow" (the matched substring)
```

--------------------------------

### stringdist - Compute Pairwise String Distances

Source: https://context7.com/markvanderloo/stringdist/llms.txt

Computes pairwise string distances between elements of two character vectors using various string metrics.

```APIDOC
## stringdist(a, b, method, ...)

### Description
Computes pairwise string distances between elements of two character vectors.
The shorter vector is recycled to match the length of the longer one.

### Parameters
#### Arguments
- **a** (character vector) - Required - First vector of strings.
- **b** (character vector) - Required - Second vector of strings.
- **method** (string) - Optional - Distance metric: 'osa', 'lv', 'dl', 'hamming', 'lcs', 'qgram', 'cosine', 'jaccard', 'jw', 'soundex'.
- **weight** (numeric vector) - Optional - Weights for edit operations (deletion, insertion, substitution, transposition).

### Request Example
stringdist("ca", "abc", method = "lv")

### Response
- **result** (numeric) - The calculated distance value.
```

--------------------------------

### Fuzzy Grepl Equivalent with grabl

Source: https://context7.com/markvanderloo/stringdist/llms.txt

grabl functions as a fuzzy grepl, returning a logical vector indicating which elements in a character vector approximately match a given pattern within the specified maximum distance.

```r
library(stringdist)

# `texts` is the character vector defined in the afind example
# grabl - fuzzy grepl equivalent
grabl(texts, "grew", maxDist = 1)
# Returns: TRUE FALSE FALSE FALSE
```

--------------------------------

### Compute String Distance Matrix

Source: https://context7.com/markvanderloo/stringdist/llms.txt

Generates a distance matrix between strings. A single vector input returns a dist object suitable for clustering.
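As a small illustration of the single-vector form, the sketch below checks the class of the result and expands it to a full symmetric matrix with base R's `as.matrix`. The word list and variable names are made up for the example.

```r
library(stringdist)

words <- c("foo", "bar", "baz")

# One argument: pairwise distances stored as a compact 'dist' object
d <- stringdistmatrix(words)
class(d)            # "dist"

# Expand to a full 3 x 3 matrix (zero diagonal, symmetric)
m <- as.matrix(d)
m[2, 3]             # 1: "bar" -> "baz" needs one substitution
m[1, 2]             # 3: "foo" and "bar" share no characters
```

The compact form stores each pair once, which is the input that functions such as `hclust` expect.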
```r
library(stringdist)

# Distance matrix between two vectors
a <- c("foo", "bar", "boo")
b <- c("baz", "buz")
stringdistmatrix(a, b)
#      [,1] [,2]
# [1,]    3    3
# [2,]    1    2
# [3,]    2    2

# Distance matrix with named rows/columns
stringdistmatrix(a, b, useNames = "strings")
#     baz buz
# foo   3   3
# bar   1   2
# boo   2   2

# Single vector returns a 'dist' object for clustering
words <- c("foo", "bar", "boo", "baz")
d <- stringdistmatrix(words)

# Can be used directly with clustering algorithms
hc <- hclust(d)
plot(hc)  # Dendrogram of word similarities

# Using Jaro-Winkler for name clustering
names <- c("Robert", "Rupert", "Roberto", "Albert")
d_names <- stringdistmatrix(names, method = "jw", p = 0.1)
hclust(d_names)
```

--------------------------------

### Calculate Distance Between Integer Sequences

Source: https://context7.com/markvanderloo/stringdist/llms.txt

Compares integer sequences to find their distance. Useful for comparing ordered lists of items represented numerically.

```r
library(stringdist)

a <- list(c(1L, 2L, 3L))
b <- list(c(1L, 3L, 2L), c(2L, 3L, 4L))
seq_dist(a, b)
```

--------------------------------

### Fuzzy Membership Check with ain

Source: https://context7.com/markvanderloo/stringdist/llms.txt

Use ain as a fuzzy equivalent of R's %in% operator to check if elements exist within a lookup table, allowing for approximate matches within a specified maximum distance.

```r
library(stringdist)

# Check if values exist (fuzzy %in%)
ain("leia", c("uhura", "leela"), maxDist = 2)
# Returns: TRUE
```

```r
ain(c("hello", "wrld"), c("world", "hello"), maxDist = 1)
# Returns: TRUE TRUE
```

--------------------------------

### stringdistmatrix - Compute String Distance Matrix

Source: https://context7.com/markvanderloo/stringdist/llms.txt

Computes a distance matrix between all combinations of strings in one or two vectors.

```APIDOC
## stringdistmatrix(a, b, method, useNames)

### Description
Computes a distance matrix between all combinations of strings in one or two vectors.
When called with a single vector, returns a 'dist' object.

### Parameters
#### Arguments
- **a** (character vector) - Required - First vector of strings.
- **b** (character vector) - Optional - Second vector of strings.
- **method** (string) - Optional - Distance metric.
- **useNames** (string) - Optional - Whether to use string values as row/column names.

### Request Example
stringdistmatrix(c("foo", "bar"), c("baz", "buz"))

### Response
- **result** (matrix/dist) - A distance matrix or 'dist' object.
```

--------------------------------

### Compute Pairwise String Distances

Source: https://context7.com/markvanderloo/stringdist/llms.txt

Calculates distances between character vectors using various metrics. The shorter vector is recycled to match the length of the longer one.

```r
library(stringdist)

# Basic usage with default Optimal String Alignment (osa) method
stringdist("ca", "abc")
# Returns: 3

# Compare multiple strings - shorter vector is recycled
stringdist(c("foo", "bar", "boo"), c("baz", "buz"))
# Returns: 3 2 2

# Different distance methods
stringdist("ca", "abc", method = "lv")            # Levenshtein: 3
stringdist("ca", "abc", method = "dl")            # Full Damerau-Levenshtein: 2
stringdist("hello", "HeLl0", method = "hamming")  # Hamming: 3
stringdist("survey", "surgery", method = "lcs")   # Longest common substring: 3

# Jaro-Winkler distance (useful for name matching)
stringdist("MARTHA", "MATHRA", method = "jw")           # Jaro distance
stringdist("MARTHA", "MATHRA", method = "jw", p = 0.1)  # Jaro-Winkler distance

# Q-gram based distances
stringdist("abc", "cba", method = "qgram", q = 1)    # q=1 gram distance: 0
stringdist("abc", "cba", method = "qgram", q = 2)    # q=2 gram distance: 4
stringdist("abc", "bcd", method = "cosine", q = 2)   # Cosine distance
stringdist("abc", "bcd", method = "jaccard", q = 2)  # Jaccard distance

# Soundex-based distance (phonetic)
stringdist("Euler", "Ellery", method = "soundex")  # Same soundex code: 0
stringdist("Euler", "Gauss", method = "soundex")   # Different codes: 1

# Custom weights for edit operations (deletion, insertion, substitution, transposition)
stringdist("ab", "ba", weight = c(1, 1, 1, 0.5))   # Lower transposition cost
stringdist("ca", "abc", weight = c(0.5, 1, 1, 1))  # Lower deletion cost

# Case sensitivity - normalize if needed
stringdist("ABC", "abc")           # Case sensitive: 3
stringdist(tolower("ABC"), "abc")  # Normalized: 0
```

--------------------------------

### Compute String Similarity Scores

Source: https://context7.com/markvanderloo/stringdist/llms.txt

Calculates similarity scores between 0 and 1, representing the complement of the normalized distance. For the edit-based methods the distance is divided by the length of the longer string.

```r
library(stringdist)

# Basic similarity (default: Optimal String Alignment)
stringsim("ca", "abc")
# Returns: 0 (OSA distance 3, divided by the maximum length 3)

# Jaro-Winkler similarity (common for name matching)
stringsim("MARTHA", "MATHRA", method = "jw", p = 0.1)

# Compare multiple pairs
stringsim(c("hello", "world"), c("hallo", "word"))
# Returns: 0.8 0.8

# Similarity matrix
a <- c("apple", "application", "apply")
stringsimmatrix(a)
#           [,1]      [,2]      [,3]
# [1,] 1.0000000 0.3636364 0.8000000
# [2,] 0.3636364 1.0000000 0.3636364
# [3,] 0.8000000 0.3636364 1.0000000

# Finding most similar strings
candidates <- c("algorithm", "logarithm", "arithmetic")
target <- "altruism"
similarities <- stringsim(target, candidates)
candidates[which.max(similarities)]
# Returns: "algorithm"
```

--------------------------------

### Fuzzy Text Search with afind

Source: https://context7.com/markvanderloo/stringdist/llms.txt

Employ afind to locate approximate matches of patterns within larger text strings. For each text and pattern it returns the start position of the best-matching window, the distance to that window, and the matched substring, which is useful for finding typos or variations.
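The idea of comparing a pattern against every window of a text can be sketched with `stringdist` and base R's `substring`. This illustrates the concept under the assumption that each window is as wide as the pattern; it is not how `afind` is implemented, and the variable names are hypothetical.

```r
library(stringdist)

txt <- "I want to be a fisherman"
pattern <- "fish"

# Cut the text into every window that is as wide as the pattern
w <- nchar(pattern)
starts <- seq_len(nchar(txt) - w + 1)
windows <- substring(txt, starts, starts + w - 1)

# Distance from the pattern to each window; the smallest one wins
d <- stringdist(pattern, windows, method = "osa")
best <- which.min(d)

starts[best]    # 16
windows[best]   # "fish"
d[best]         # 0
```

An exact occurrence gives distance 0, while a misspelled occurrence would still be found as the window with the smallest non-zero distance.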
```r
library(stringdist)

# Search for patterns in text
texts <- c(
  "When I grow up, I want to be",
  "one of the harvesters of the sea",
  "I think before my days are gone",
  "I want to be a fisherman"
)
patterns <- c("fish", "gone", "to be")

result <- afind(texts, patterns, method = "running_cosine", q = 3)

# Returns a list of three matrices, each with one row per text
# and one column per pattern:
# - location: start position of the best-matching window
# - distance: distance between the pattern and that window
# - match: the best-matching substring
result$location
result$distance
result$match

# An exact occurrence has distance 0. For example, "fish" starts at
# character 16 of texts[4], and "gone" at character 28 of texts[3].
```

--------------------------------

### Check Membership in Integer Sequences

Source: https://context7.com/markvanderloo/stringdist/llms.txt

Determines if an integer sequence is present within a list of other integer sequences, allowing for a maximum distance threshold. Returns TRUE if a match is found, FALSE otherwise.

```r
library(stringdist)

seq_ain(list(c(1L, 2L)), list(c(1L, 2L, 3L)), maxDist = 1)
```

--------------------------------

### Extract Fuzzy Matches with extract

Source: https://context7.com/markvanderloo/stringdist/llms.txt

The extract function finds and returns substrings that approximately match a given pattern within a character vector, returning NA for elements where no match is found within the maximum distance.

```r
library(stringdist)

# `texts` is the character vector defined in the afind example
# extract - extract fuzzy matches
extract(texts, "harvested", maxDist = 3)
# Returns matched substrings, NA where no match found
```

--------------------------------

### Detect Non-ASCII Characters in Strings

Source: https://context7.com/markvanderloo/stringdist/llms.txt

A utility function that flags strings containing non-printable or non-ASCII characters. It is recommended for data validation before applying ASCII-dependent string operations.
```r
library(stringdist)

# Check for printable ASCII
printable_ascii(c("hello", "world"))
printable_ascii(c("hello", "wörld", "hello\ttab"))
```

```r
# Filter strings before soundex encoding
names <- c("Mueller", "Müller", "Miller")
ascii_names <- names[printable_ascii(names)]
phonetic(ascii_names)
```

```r
# Validate data quality
data <- c("John Smith", "José García", "Marie-Claire")
valid <- printable_ascii(data)
if (!all(valid)) {
  warning("Non-ASCII characters found in: ", paste(data[!valid], collapse = ", "))
}
```
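To show what a printable-ASCII check amounts to, here is a base-R sketch that treats code points 32 to 126 (space through tilde) as printable ASCII. That range is an assumption about the intended definition, and `is_printable_ascii` is a hypothetical helper, not part of stringdist.

```r
# Hypothetical helper: TRUE when a string contains only characters
# in the range 0x20 (space) to 0x7E (tilde)
is_printable_ascii <- function(x) {
  !grepl("[^\\x20-\\x7E]", x, perl = TRUE)
}

is_printable_ascii(c("hello", "w\u00f6rld", "hello\ttab"))
# TRUE FALSE FALSE
```

The second string fails because of the non-ASCII letter, and the third because a tab is an ASCII control character rather than a printable one.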