### Install stringdist from source

Source: https://github.com/markvanderloo/stringdist/blob/master/README.md

Commands to clone the repository and build the package from source using bash.

```bash
git clone https://github.com/markvanderloo/stringdist.git
cd stringdist
bash ./build.bash
R CMD INSTALL output/stringdist_*.tar.gz
```

--------------------------------

### Locate the C API documentation

Source: https://github.com/markvanderloo/stringdist/blob/master/README.md

Command to retrieve the file path for the stringdist C API documentation PDF.

```r
system.file("doc/stringdist_api.pdf", package="stringdist")
```

--------------------------------

### Calculate String Distances with Various Methods

Source: https://context7.com/markvanderloo/stringdist/llms.txt

Demonstrates the usage of different string distance algorithms available in the stringdist package, including edit-based, Q-gram, Jaro, and phonetic methods.

```r
library(stringdist)

# Edit-based distances
stringdist("kitten", "sitting", method = "osa")     # Optimal String Alignment (default)
stringdist("kitten", "sitting", method = "lv")      # Levenshtein
stringdist("kitten", "sitting", method = "dl")      # Full Damerau-Levenshtein
stringdist("abc", "abd", method = "hamming")        # Hamming (equal length only)
stringdist("abc", "bc", method = "lcs")             # Longest Common Substring

# Q-gram based distances (set q parameter)
stringdist("night", "nacht", method = "qgram", q = 2)   # Q-gram distance
stringdist("night", "nacht", method = "cosine", q = 2)  # Cosine distance
stringdist("night", "nacht", method = "jaccard", q = 2) # Jaccard distance

# Jaro and Jaro-Winkler (set p for Winkler boost)
stringdist("MARTHA", "MATHRA", method = "jw")           # Jaro distance
stringdist("MARTHA", "MATHRA", method = "jw", p = 0.1)  # Jaro-Winkler

# Phonetic distance
stringdist("Euler", "Ellery", method = "soundex")       # Soundex-based

# Custom edit weights: c(deletion, insertion, substitution, transposition)
stringdist("ab", "ba", method = "osa", weight = c(1, 1, 1, 0.5))

# Parallel processing (automatic by default)
stringdist(rep("hello", 10000), rep("hallo", 10000), nthread = 4)
```

--------------------------------

### Open stringdist C/C++ API Vignette

Source: https://github.com/markvanderloo/stringdist/blob/master/pkg/README.md

This command opens the vignette detailing the C/C++ API for the stringdist package. It is useful for developers who want to integrate stringdist functionality into other R packages.

```R
vignette("stringdist_C-Cpp_api", package="stringdist")
```

--------------------------------

### Cite the stringdist package

Source: https://github.com/markvanderloo/stringdist/blob/master/README.md

BibTeX entry for citing the R Journal article associated with the package.

```bibtex
@article{RJ-2014-011,
  author = {Mark P.J. van der Loo},
  title = {{The stringdist Package for Approximate String Matching}},
  year = {2014},
  journal = {{The R Journal}},
  doi = {10.32614/RJ-2014-011},
  url = {https://doi.org/10.32614/RJ-2014-011},
  pages = {111--122},
  volume = {6},
  number = {1}
}
```

--------------------------------

### Tabulate Q-gram Counts with qgrams

Source: https://context7.com/markvanderloo/stringdist/llms.txt

Use qgrams to create a table of q-gram counts from character vectors. This function is useful for analyzing text composition and can be configured with options like 'useNames' and 'useBytes' for performance.

```r
library(stringdist)

# Count 3-grams in a string
qgrams("hello world", q = 3)
#    hel ell llo lo   o w  wo wor orl rld
# V1   1   1   1   1   1   1   1   1   1
```

```r
# Compare q-gram profiles of multiple strings
x <- "I will not buy this record, it is scratched"
y <- "My hovercraft is full of eels"
z <- c("this", "is", "a", "dead", "parrot")
qgrams(A = x, B = y, C = z, q = 2)
#   I   wi il ll  n no ot  t  b bu uy ...
# A  1   1  1  1  1  1  1  3  1  1  1
# B  0   0  0  1  0  0  0  1  0  0  0
# C  0   0  0  0  0  0  0  1  0  0  0
```

```r
# Q-grams with different settings
x <- "peter piper picked a peck of pickled peppers"
qgrams(x, q = 2)                          # Named columns
qgrams(x, q = 2, useNames = FALSE)        # Unnamed columns (faster)
qgrams(x, q = 2, useBytes = TRUE)         # Byte-wise (faster for ASCII)
```

```r
# Count unigrams (single characters)
qgrams(c("hello", "world"), q = 1)
```

--------------------------------

### Phonetic Encoding with phonetic

Source: https://context7.com/markvanderloo/stringdist/llms.txt

Generate phonetic codes for strings using the Soundex algorithm. Similar-sounding strings will receive the same or similar codes, useful for matching names that are spelled differently but sound alike.

```r
library(stringdist)

# Soundex encoding
phonetic(c("Euler", "Gauss", "Hilbert", "Knuth", "Lloyd", "Lukasiewicz", "Wachs"))
# Returns: "E460" "G200" "H416" "K530" "L300" "L222" "W200"
```

```r
# Similar sounding names get same code
phonetic(c("Robert", "Rupert"))
# Returns: "R163" "R163"
```

```r
phonetic(c("Smith", "Smyth"))
# Returns: "S530" "S530"
```

```r
# Use in matching similar-sounding names
names <- c("Johnson", "Johnsen", "Jonson", "Jackson")
codes <- phonetic(names)
# Group names by soundex code
split(names, codes)
```

```r
# Combine with stringdist for phonetic distance
stringdist("Catherine", "Kathryn", method = "soundex")
# Returns: 1 (codes differ: "C365" vs "K365", because Soundex keeps the first letter)
```

--------------------------------

### Fuzzy String Matching with amatch

Source: https://context7.com/markvanderloo/stringdist/llms.txt

Use amatch for approximate string matching, finding the best match for elements in a lookup table within a specified maximum distance. Customize the 'nomatch' value for cases where no match is found.

```r
library(stringdist)

# Basic fuzzy matching
amatch("leia", c("uhura", "leela"), maxDist = 5)
# Returns: 2 (matches "leela")
```

```r
# Restrict maximum distance
amatch("leia", c("uhura", "leela"), maxDist = 1)
# Returns: NA (no match within distance 1)
```

```r
# Match multiple values against lookup table
amatch(c("leia", "uhura"), c("ripley", "leela", "scully", "trinity"), maxDist = 2)
# Returns: 2 NA (leia->leela, uhura has no close match)
```

```r
# Custom nomatch value
amatch("leia", c("uhura", "leela"), maxDist = 1, nomatch = 0)
# Returns: 0
```

```r
# Using different distance methods
names_to_find <- c("Jonh", "Micheal", "Robrt")
name_table <- c("John", "Michael", "Robert", "James")
amatch(names_to_find, name_table, maxDist = 2, method = "lv")
# Returns: 1 2 3 (corrected typos)
```

```r
# Jaro-Winkler for name matching (maxDist is on 0-1 scale)
amatch("Smith", c("Smyth", "Smithe", "Schmidt"), maxDist = 0.2, method = "jw")
```
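Because `amatch` returns positions in the lookup table rather than the matched strings, a common follow-up is to index the table with the result. The sketch below assumes a small hypothetical lookup table and input vector (`lookup` and `raw` are illustrative names, not part of the package); unmatched inputs come back as `NA`.

```r
library(stringdist)

# Hypothetical lookup table and misspelled inputs (illustrative data only)
lookup <- c("John", "Michael", "Robert", "James")
raw    <- c("Jonh", "Micheal", "Robrt", "Xavier")

# amatch returns the index of the closest table entry, or NA if none is within maxDist
idx <- amatch(raw, lookup, maxDist = 2, method = "lv")

# Indexing the table with idx maps each input to its matched value (NA where unmatched)
data.frame(raw = raw, matched = lookup[idx])
```

Indexing with `NA` yields `NA`, so inputs without a close match are easy to spot with `is.na(idx)`.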

--------------------------------

### stringsim - Compute String Similarity Scores

Source: https://context7.com/markvanderloo/stringdist/llms.txt

Computes string similarity scores between 0 (completely dissimilar) and 1 (identical).

```APIDOC
## stringsim(a, b, method)

### Description
Computes string similarity scores, defined as 1 minus the normalized distance.

### Parameters
#### Arguments
- **a** (character vector) - Required - First vector of strings.
- **b** (character vector) - Required - Second vector of strings.
- **method** (string) - Optional - Distance metric used for similarity calculation.

### Request Example
stringsim("ca", "abc")

### Response
- **result** (numeric) - Similarity score between 0 and 1.
```
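To make the "complement of the normalized distance" concrete, the following sketch recomputes a similarity by hand for the default `osa` method with default weights, assuming the distance is normalized by the length of the longer string. The variable names are illustrative.

```r
library(stringdist)

a <- "hello"
b <- "hallo"

# Raw edit distance (one substitution)
d <- stringdist(a, b, method = "osa")

# Manual similarity: 1 minus distance divided by the longer string's length
manual <- 1 - d / max(nchar(a), nchar(b))

# Compare with the value computed by stringsim
all.equal(manual, stringsim(a, b, method = "osa"))
```

Other methods use different normalizations, so this hand calculation should not be assumed to carry over to, for example, q-gram based distances.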

--------------------------------

### Convert Strings to Integer Sequences for Comparison

Source: https://context7.com/markvanderloo/stringdist/llms.txt

Converts character strings into integer sequences of Unicode code points with `utf8ToInt`, then computes a distance matrix over those sequences with `seq_distmatrix`. This is a preprocessing step for sequence-based distance calculations.

```r
a <- lapply(c("foo", "bar", "baz"), utf8ToInt)
seq_distmatrix(a)
```
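As a sanity check on this conversion, the sketch below compares the sequence-based distances with the string-based ones for the same words. It assumes both functions are called with their default method, so the two results should agree; `words` and `codes` are illustrative names.

```r
library(stringdist)

words <- c("foo", "bar", "baz")

# Each string becomes an integer vector of code points
codes <- lapply(words, utf8ToInt)

# Distances computed on integer sequences and on the original strings
d_seq <- seq_distmatrix(codes)
d_str <- stringdistmatrix(words)

# Strip attributes and compare the underlying distance values
all.equal(as.numeric(d_seq), as.numeric(d_str))
```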

--------------------------------

### Fuzzy Matching for Integer Sequences

Source: https://context7.com/markvanderloo/stringdist/llms.txt

Performs approximate matching between integer sequences, returning the index of the closest match within a specified maximum distance. Useful for finding similar patterns in ordered data.

```r
seq_amatch(
  list(c(1L, 2L, 3L)),
  list(c(1L, 2L, 4L), c(1L, 2L, 3L, 4L)),
  maxDist = 1
)
```

--------------------------------

### Compare Word Sequences for Sentence Similarity

Source: https://context7.com/markvanderloo/stringdist/llms.txt

Calculates the distance between two sentences represented as sequences of integer hashes, effectively measuring sentence similarity based on word order.

```r
sentence1 <- c(1L, 2L, 3L, 4L, 5L)  # "Mary had a little lamb"
sentence2 <- c(3L, 4L, 5L, 2L, 1L)  # "a little lamb had Mary"
seq_dist(list(sentence1), list(sentence2), method = "lv")
```
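The integer codes above are written by hand. One possible way to derive them from raw sentences, sketched here with base R only, is to split on spaces and look each word up in a shared vocabulary. This is an illustrative assumption rather than a method prescribed by the package, and `vocab`, `id1`, and `id2` are hypothetical names.

```r
library(stringdist)

s1 <- "Mary had a little lamb"
s2 <- "a little lamb had Mary"

# Split each sentence into words
w1 <- strsplit(s1, " ", fixed = TRUE)[[1]]
w2 <- strsplit(s2, " ", fixed = TRUE)[[1]]

# Shared vocabulary: each distinct word gets its position as an integer ID
vocab <- unique(c(w1, w2))
id1 <- match(w1, vocab)
id2 <- match(w2, vocab)

# Word-level Levenshtein distance between the two sentences
seq_dist(list(id1), list(id2), method = "lv")
```

With this vocabulary, `id1` and `id2` reproduce the hand-written `sentence1` and `sentence2` vectors.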

--------------------------------

### Fuzzy Grep Equivalent with grab

Source: https://context7.com/markvanderloo/stringdist/llms.txt

Use grab as a fuzzy version of R's grep function to find the index of elements in a character vector that approximately match a pattern. The 'value = TRUE' option returns the matched substring instead of the index.

```r
# grab - fuzzy grep equivalent
# `texts` is the sample character vector defined in the afind example
grab(texts, "grew", maxDist = 1)
# Returns: 1 (index of matching element)
```

```r
grab(texts, "grew", maxDist = 1, value = TRUE)
# Returns: "grow" (the matched substring)
```

--------------------------------

### stringdist - Compute Pairwise String Distances

Source: https://context7.com/markvanderloo/stringdist/llms.txt

Computes pairwise string distances between elements of two character vectors using various string metrics.

```APIDOC
## stringdist(a, b, method, ...)

### Description
Computes pairwise string distances between elements of two character vectors. The shorter vector is recycled to match the length of the longer one.

### Parameters
#### Arguments
- **a** (character vector) - Required - First vector of strings.
- **b** (character vector) - Required - Second vector of strings.
- **method** (string) - Optional - Distance metric: 'osa', 'lv', 'dl', 'hamming', 'lcs', 'qgram', 'cosine', 'jaccard', 'jw', 'soundex'.
- **weight** (numeric vector) - Optional - Weights for edit operations (deletion, insertion, substitution, transposition).

### Request Example
stringdist("ca", "abc", method = "lv")

### Response
- **result** (numeric) - The calculated distance value.
```

--------------------------------

### Fuzzy Grepl Equivalent with grabl

Source: https://context7.com/markvanderloo/stringdist/llms.txt

grabl functions as a fuzzy grepl, returning a logical vector indicating which elements in a character vector approximately match a given pattern within the specified maximum distance.

```r
# grabl - fuzzy grepl equivalent
# `texts` is the sample character vector defined in the afind example
grabl(texts, "grew", maxDist = 1)
# Returns: TRUE FALSE FALSE FALSE
```

--------------------------------

### Compute String Distance Matrix

Source: https://context7.com/markvanderloo/stringdist/llms.txt

Generates a distance matrix between strings. A single vector input returns a dist object suitable for clustering.

```r
library(stringdist)

# Distance matrix between two vectors
a <- c("foo", "bar", "boo")
b <- c("baz", "buz")
stringdistmatrix(a, b)
#      [,1] [,2]
# [1,]    3    3
# [2,]    1    2
# [3,]    2    2

# Distance matrix with named rows/columns
stringdistmatrix(a, b, useNames = "strings")
#     baz buz
# foo   3   3
# bar   1   2
# boo   2   2

# Single vector returns a 'dist' object for clustering
words <- c("foo", "bar", "boo", "baz")
d <- stringdistmatrix(words)
# Can be used directly with clustering algorithms
hc <- hclust(d)
plot(hc)  # Dendrogram of word similarities

# Using Jaro-Winkler for name clustering
names <- c("Robert", "Rupert", "Roberto", "Albert")
d_names <- stringdistmatrix(names, method = "jw", p = 0.1)
hclust(d_names)
```

--------------------------------

### Calculate Distance Between Integer Sequences

Source: https://context7.com/markvanderloo/stringdist/llms.txt

Compares integer sequences to find their distance. Useful for comparing ordered lists of items represented numerically.

```r
a <- list(c(1L, 2L, 3L))
b <- list(c(1L, 3L, 2L), c(2L, 3L, 4L))
seq_dist(a, b)
```

--------------------------------

### Fuzzy Membership Check with ain

Source: https://context7.com/markvanderloo/stringdist/llms.txt

Use ain as a fuzzy equivalent of R's %in% operator to check if elements exist within a lookup table, allowing for approximate matches within a specified maximum distance.

```r
# Check if values exist (fuzzy %in%)
ain("leia", c("uhura", "leela"), maxDist = 2)
# Returns: TRUE
```

```r
ain(c("hello", "wrld"), c("world", "hello"), maxDist = 1)
# Returns: TRUE TRUE
```
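Conceptually, `ain` answers "did `amatch` find anything?". The sketch below illustrates that relationship on a small hypothetical input (`x` and `lookup` are illustrative names), assuming both functions are called with the same `maxDist` and default method.

```r
library(stringdist)

x      <- c("hello", "wrld", "xyz")
lookup <- c("world", "hello")

# Fuzzy membership test
found <- ain(x, lookup, maxDist = 1)

# Equivalent check built from amatch: a non-NA index means a match exists
via_amatch <- !is.na(amatch(x, lookup, maxDist = 1))

identical(found, via_amatch)
```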

--------------------------------

### stringdistmatrix - Compute String Distance Matrix

Source: https://context7.com/markvanderloo/stringdist/llms.txt

Computes a distance matrix between all combinations of strings in one or two vectors.

```APIDOC
## stringdistmatrix(a, b, method, useNames)

### Description
Computes a distance matrix between all combinations of strings in one or two vectors. When called with a single vector, returns a 'dist' object.

### Parameters
#### Arguments
- **a** (character vector) - Required - First vector of strings.
- **b** (character vector) - Optional - Second vector of strings.
- **method** (string) - Optional - Distance metric.
- **useNames** (string) - Optional - How to label rows/columns: 'none', 'strings' (use the string values), or 'names' (use the names attribute).

### Request Example
stringdistmatrix(c("foo", "bar"), c("baz", "buz"))

### Response
- **result** (matrix/dist) - A distance matrix or 'dist' object.
```

--------------------------------

### Compute Pairwise String Distances

Source: https://context7.com/markvanderloo/stringdist/llms.txt

Calculates distances between character vectors using various metrics. The shorter vector is recycled to match the length of the longer one.

```r
library(stringdist)

# Basic usage with default Optimal String Alignment (osa) method
stringdist("ca", "abc")
# Returns: 3

# Compare multiple strings - shorter vector is recycled
stringdist(c("foo", "bar", "boo"), c("baz", "buz"))
# Returns: 3 2 2

# Different distance methods
stringdist("ca", "abc", method = "lv")      # Levenshtein: 3
stringdist("ca", "abc", method = "dl")      # Full Damerau-Levenshtein: 2
stringdist("hello", "HeLl0", method = "hamming")  # Hamming: 3
stringdist("survey", "surgery", method = "lcs")   # Longest common substring: 3

# Jaro-Winkler distance (useful for name matching)
stringdist("MARTHA", "MARHTA", method = "jw")        # Jaro distance: 0.0556
stringdist("MARTHA", "MARHTA", method = "jw", p = 0.1)  # Jaro-Winkler: 0.0389

# Q-gram based distances
stringdist("abc", "cba", method = "qgram", q = 1)  # q=1 gram distance: 0
stringdist("abc", "cba", method = "qgram", q = 2)  # q=2 gram distance: 4
stringdist("abc", "bcd", method = "cosine", q = 2) # Cosine distance
stringdist("abc", "bcd", method = "jaccard", q = 2) # Jaccard distance

# Soundex-based distance (phonetic)
stringdist("Euler", "Ellery", method = "soundex")  # Same soundex code: 0
stringdist("Euler", "Gauss", method = "soundex")   # Different codes: 1

# Custom weights for edit operations (deletion, insertion, substitution, transposition)
stringdist("ab", "ba", weight = c(1, 1, 1, 0.5))  # Lower transposition cost
stringdist("ca", "abc", weight = c(0.5, 1, 1, 1)) # Lower deletion cost

# Case sensitivity - normalize if needed
stringdist("ABC", "abc")  # Case sensitive: 3
stringdist(tolower("ABC"), "abc")  # Normalized: 0
```

--------------------------------

### Compute String Similarity Scores

Source: https://context7.com/markvanderloo/stringdist/llms.txt

Calculates similarity scores between 0 and 1, representing the complement of the normalized distance.

```r
library(stringdist)

# Basic similarity (default: Optimal String Alignment)
stringsim("ca", "abc")
# Returns: 0 (OSA distance 3, normalized by the longer string's length 3)

# Jaro-Winkler similarity (common for name matching)
stringsim("MARTHA", "MARHTA", method = "jw", p = 0.1)
# Returns: 0.961

# Compare multiple pairs
stringsim(c("hello", "world"), c("hallo", "word"))
# Returns: 0.8 0.8

# Similarity matrix
a <- c("apple", "application", "apply")
stringsimmatrix(a)
#           [,1]      [,2]      [,3]
# [1,] 1.0000000 0.3636364 0.8000000
# [2,] 0.3636364 1.0000000 0.3636364
# [3,] 0.8000000 0.3636364 1.0000000

# Finding most similar strings
candidates <- c("algorithm", "logarithm", "arithmetic")
target <- "altruism"
similarities <- stringsim(target, candidates)
candidates[which.max(similarities)]
# Returns: "algorithm"
```

--------------------------------

### Fuzzy Text Search with afind

Source: https://context7.com/markvanderloo/stringdist/llms.txt

Employ afind to locate approximate matches of patterns within larger text strings. It returns the position, distance, and the matched substring, useful for finding typos or variations.

```r
library(stringdist)

# Search for patterns in text
texts <- c(
  "When I grow up, I want to be",
  "one of the harvesters of the sea",
  "I think before my days are gone",
  "I want to be a fisherman"
)
patterns <- c("fish", "gone", "to be")

result <- afind(texts, patterns, method = "running_cosine", q = 3)
# Returns list with:
# - location: matrix of start positions
# - distance: matrix of distances
# - match: matrix of matched substrings

result$location
#      [,1] [,2] [,3]
# [1,]    1    1   22
# [2,]    1    1   29
# [3,]    1   23    1
# [4,]   16    1   12

result$match
#      [,1]   [,2]   [,3]
# [1,] "When" "When" "to be"
# [2,] "one " "one " "e sea"
# [3,] "I th" "gone" "I thi"
# [4,] "fish" "I wa" "to be"
```

--------------------------------

### Check Membership in Integer Sequences

Source: https://context7.com/markvanderloo/stringdist/llms.txt

Determines if an integer sequence is present within a list of other integer sequences, allowing for a maximum distance threshold. Returns TRUE if a match is found, FALSE otherwise.

```r
seq_ain(list(c(1L, 2L)), list(c(1L, 2L, 3L)), maxDist = 1)
```

--------------------------------

### Extract Fuzzy Matches with extract

Source: https://context7.com/markvanderloo/stringdist/llms.txt

The extract function finds and returns substrings that approximately match a given pattern within a character vector, returning NA for elements where no match is found within the maximum distance.

```r
# extract - extract fuzzy matches
# `texts` is the sample character vector defined in the afind example
extract(texts, "harvested", maxDist = 3)
# Returns matched substrings, NA where no match found
```

--------------------------------

### Detect Non-ASCII Characters in Strings

Source: https://context7.com/markvanderloo/stringdist/llms.txt

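`printable_ascii` returns `TRUE` for each string that consists only of printable ASCII characters and `FALSE` for strings containing non-printable ASCII or non-ASCII characters. It is recommended for data validation before applying ASCII-dependent string operations such as Soundex encoding.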

```r
library(stringdist)

# Check for printable ASCII
printable_ascii(c("hello", "world"))

printable_ascii(c("hello", "wörld", "hello\ttab"))
```

```r
# Filter strings before soundex encoding
names <- c("Mueller", "Müller", "Miller")
ascii_names <- names[printable_ascii(names)]
phonetic(ascii_names)
```

```r
# Validate data quality
data <- c("John Smith", "José García", "Marie-Claire")
valid <- printable_ascii(data)
if (!all(valid)) {
  warning("Non-ASCII characters found in: ", paste(data[!valid], collapse = ", "))
}
```
