# Rphenograph

Rphenograph is an R implementation of the PhenoGraph algorithm, a clustering method specifically designed for high-dimensional single-cell data analysis. It creates a graph representing phenotypic similarities between cells by calculating the Jaccard coefficient between nearest-neighbor sets, then identifies communities using the Louvain method for community detection.

The package is particularly useful for analyzing flow cytometry and mass cytometry (CyTOF) data, where traditional clustering methods often struggle with the high dimensionality. Rphenograph leverages optimized C++ code via Rcpp for computing Jaccard coefficients and uses the RANN package for efficient k-nearest neighbor searches, making it suitable for large-scale single-cell datasets.
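
To make the similarity measure concrete, here is a minimal base-R sketch of the Jaccard coefficient between two nearest-neighbor sets. The neighbor indices and the `jaccard` helper are hypothetical and exist only for illustration; the package itself computes this in C++.

```r
# Hypothetical neighbor index sets for two cells (k = 5 neighbors each)
neighbors_a <- c(2, 3, 4, 5, 6)
neighbors_b <- c(3, 4, 5, 7, 8)

# Jaccard coefficient: size of the intersection over size of the union
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# The sets share 3 indices (3, 4, 5) out of 7 distinct ones, so this is 3 / 7
jaccard(neighbors_a, neighbors_b)
```

Two cells with identical neighbor sets score 1 and cells with no shared neighbors score 0, so the edge weights of the resulting graph reflect how much two local neighborhoods overlap.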

## Installation

```r
# Install from GitHub using devtools
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
devtools::install_github("JinmiaoChenLab/Rphenograph")
```

## Rphenograph - Main Clustering Function

The primary function that performs PhenoGraph clustering on high-dimensional data. It finds k-nearest neighbors for each data point, computes Jaccard coefficients between neighbor sets to build a weighted graph, and applies Louvain community detection to identify clusters.

```r
library(Rphenograph)
library(ggplot2)
library(igraph)

# Prepare data - use iris dataset as example
iris_unique <- unique(iris)  # Remove duplicate rows
data <- as.matrix(iris_unique[, 1:4])  # Extract numeric columns

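# Run Rphenograph clustering
# Parameters:
#   data: numeric matrix (rows = samples, columns = features)
#   k: number of nearest neighbors (default: 30)
# Louvain results can vary between runs in recent igraph versions,
# so fix the seed if you need reproducible clusters
set.seed(42)
Rphenograph_out <- Rphenograph(data, k = 45)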

# Example console output (timings and values vary by machine and run):
# Run Rphenograph starts:
#   -Input data of 149 rows and 4 columns
#   -k is set to 45
#   Finding nearest neighbors...DONE ~ 0.003 s
#   Compute jaccard coefficient between nearest-neighbor sets...DONE ~ 0.015 s
#   Build undirected graph from the weighted links...DONE ~ 0.001 s
#   Run louvain clustering on the graph ...DONE ~ 0.001 s
# Run Rphenograph DONE, totally takes 0.02s.
#   Return a community class
#   -Modularity value: 0.278
#   -Number of clusters: 3

# Extract results
graph <- Rphenograph_out[[1]]           # igraph object
community <- Rphenograph_out[[2]]       # communities object

# Get cluster assignments for each data point
cluster_membership <- membership(community)
# Returns: numeric vector with cluster ID for each row
# Example: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... 2 2 2 2 2 ... 3 3 3 3 3 ...

# Get modularity score (quality of clustering)
mod_score <- modularity(community)
# Returns: 0.278 (higher values indicate better community structure)

# Get number and sizes of clusters
num_clusters <- length(community)       # Number of clusters: 3
cluster_sizes <- sizes(community)       # Size of each cluster
# Prints a table of cluster IDs and their sizes;
# the sizes sum to the number of vertices in the graph

# Visualize results
iris_unique$phenograph_cluster <- factor(membership(community))
ggplot(iris_unique, aes(x = Sepal.Length, y = Sepal.Width,
                        col = Species, shape = phenograph_cluster)) +
  geom_point(size = 3) +
  theme_bw() +
  labs(title = "Rphenograph Clustering Results",
       shape = "Cluster", color = "True Species")
```
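
The four steps described above (neighbor search, Jaccard weighting, graph construction, Louvain clustering) can be sketched directly with RANN and igraph. This is an illustrative re-implementation under stated assumptions, not the package's own code: it uses a plain R loop where the package uses C++, and `k = 15` and the variable names are arbitrary choices.

```r
library(RANN)    # nn2() for k-nearest-neighbor search
library(igraph)  # graph construction and Louvain clustering

set.seed(1)  # Louvain may process vertices in random order

data <- as.matrix(unique(iris)[, 1:4])
k <- 15  # arbitrary neighborhood size for this sketch

# Step 1: request k + 1 neighbors and drop the first column,
# which is normally the query point itself
nn <- nn2(data, data, k = k + 1)$nn.idx[, -1]

# Step 2: for each point i and each neighbor j, count shared neighbors
# and turn the count into a Jaccard coefficient (|A and B| / |A or B|)
edges <- do.call(rbind, lapply(seq_len(nrow(nn)), function(i) {
  shared <- vapply(nn[i, ], function(j) {
    length(intersect(nn[i, ], nn[j, ]))
  }, integer(1))
  data.frame(from = i, to = nn[i, ], weight = shared / (2 * k - shared))
}))
edges <- edges[edges$weight > 0, ]  # keep only pairs with some overlap

# Step 3: undirected graph; the third column becomes the edge weight
g <- graph_from_data_frame(edges, directed = FALSE)

# Step 4: Louvain community detection (uses the weight attribute)
comm <- cluster_louvain(g)

length(comm)      # number of clusters found
modularity(comm)  # modularity of the partition
```

Because both neighbor sets have exactly `k` members, the union size is `2 * k - shared`, which is why the weight can be computed from the intersection count alone.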

## find_neighbors - K-Nearest Neighbor Search

A utility function that finds the k nearest neighbors of each point in the dataset using a kd-tree. It wraps the `nn2` function from the RANN package, which is built on the Approximate Nearest Neighbor (ANN) C++ library for efficient neighbor searches.

```r
library(Rphenograph)

# Prepare data
iris_unique <- unique(iris)
data <- as.matrix(iris_unique[, 1:4])

# Find k nearest neighbors for each data point
# Parameters:
#   data: numeric matrix (rows = samples, columns = features)
#   k: number of nearest neighbors to find
neighbors <- find_neighbors(data, k = 10)

# Returns: n-by-k matrix of neighbor indices
# Each row i contains the indices of the k nearest neighbors of point i
# The first column is normally the point itself (distance = 0), so k = 10
# yields the point plus its 9 nearest other points

dim(neighbors)
# [1] 149  10

# Illustrative output (first 5 rows; exact indices depend on the data):
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,]    1   18    5   40   29   12   28   41    8    38
# [2,]    2   35   46   13   10    7    4   48    3    26
# [3,]    3    4   48    7   30   13    2   31   46    35
# [4,]    4    3    7   48   30   31   13    2   46    35
# [5,]    5    1   18   40   29   12   28   41    8    38

# Use case: pre-compute neighbors for custom graph construction
# Request k + 1 columns and drop the self column to keep 30 true neighbors
neighbors_matrix <- find_neighbors(data, k = 31)[, -1]
# A matrix of this shape is what the internal jaccard_coeff routine works on
```
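
`find_neighbors` returns neighbor indices only. If a custom workflow also needs the distances, one option is to call `RANN::nn2` directly, as in the sketch below. The variable names are illustrative, and the code assumes the default Euclidean search that `nn2` performs.

```r
library(RANN)

data <- as.matrix(unique(iris)[, 1:4])

# Ask for 11 columns: the query point itself plus 10 neighbors
nn <- nn2(data, data, k = 11)

idx   <- nn$nn.idx[, -1]    # neighbor indices, self column dropped
dists <- nn$nn.dists[, -1]  # matching Euclidean distances

dim(idx)              # one row per point, 10 columns
summary(dists[, 10])  # distance to the 10th neighbor, a rough local density cue
```

Columns are ordered from nearest to farthest, so the last column gives the radius of each point's k-neighborhood.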

## Working with Community Detection Results

The Rphenograph function returns a list containing an igraph graph object and a communities object from igraph's Louvain clustering. The communities object provides multiple methods for analyzing cluster structure, extracting membership, and evaluating clustering quality.

```r
library(Rphenograph)
library(igraph)

# Run clustering
data <- as.matrix(unique(iris)[, 1:4])
result <- Rphenograph(data, k = 45)

graph <- result[[1]]      # igraph graph object
community <- result[[2]]  # communities object

# Basic community information
print(community)
# IGRAPH clustering multi level, groups: 3, mod: 0.28
# + groups:
#   $`1`
#   [1] "1"  "2"  "3"  "4"  "5"  ...
#   $`2`
#   [1] "51" "52" "53" "54" ...
#   $`3`
#   [1] "101" "102" "103" ...

# Get membership vector (cluster assignment for each vertex)
clusters <- membership(community)
# Named vector of class "membership": names are vertex names, values are cluster IDs

# Get number of communities
n_communities <- length(community)
# Returns: 3

# Get size of each community
community_sizes <- sizes(community)
# Returns: named vector of community sizes

# Modularity score (clustering quality metric, range -0.5 to 1)
mod <- modularity(community)
# Higher values indicate stronger community structure

# Check if edges cross community boundaries
edge_crossing <- crossing(community, graph)
# Logical vector: TRUE if edge connects different communities

# Algorithm used
algo <- algorithm(community)
# Returns: "multi level" (igraph's label for the Louvain method)

# Check if clustering is hierarchical
is_hier <- is_hierarchical(community)
# Returns: FALSE for Louvain method

# Plot the graph with community coloring
plot(community, graph,
     vertex.size = 5,
     vertex.label = NA,
     main = "PhenoGraph Community Structure")
```
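
When attaching cluster labels back to the input data, it can be safer to match on vertex names than to rely on vertex order. The sketch below uses a small toy graph as a stand-in for the Rphenograph output and assumes, as the printed groups above suggest, that vertex names are the row indices of the input matrix. The edge list, `n_rows`, and `cluster_by_row` are hypothetical.

```r
library(igraph)

# Toy stand-in: two disconnected triangles whose vertex names are row indices.
# Row 7 has no edges, so it never becomes a vertex of the graph.
edges <- data.frame(from = c(3, 3, 1, 4, 4, 5),
                    to   = c(1, 2, 2, 5, 6, 6))
g <- graph_from_data_frame(edges, directed = FALSE)
comm <- cluster_louvain(g)

m <- membership(comm)
names(m)  # vertex names in graph order, not necessarily 1, 2, 3, ...

# Align cluster IDs with the original rows; rows missing from the graph stay NA
n_rows <- 7
cluster_by_row <- rep(NA_integer_, n_rows)
cluster_by_row[as.integer(names(m))] <- as.integer(m)

cluster_by_row
```

With real data, `cluster_by_row` can then be added as a column to the original data frame, and any `NA` entries flag rows that did not end up in the graph.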

## Summary

Rphenograph is designed for unsupervised clustering of high-dimensional single-cell data, which makes it particularly valuable for flow cytometry, mass cytometry (CyTOF), and single-cell RNA-seq analysis. The main use case is identifying cell populations or phenotypes from multi-parameter measurements without prior knowledge of the expected clusters. The number of clusters is not specified up front; it follows from the community structure of the neighbor graph, with the neighborhood size `k` as the main tuning parameter.

Integration with existing R bioinformatics workflows is straightforward because Rphenograph takes a standard R matrix and returns igraph objects that work with the rest of the igraph ecosystem. Users can extract an expression matrix from Bioconductor objects such as those provided by flowCore or SingleCellExperiment, run Rphenograph on it, and visualize the clusters with ggplot2, often on a UMAP or t-SNE embedding. Because the performance-critical Jaccard coefficient computation is written in C++ via Rcpp, the package is suited to the datasets of tens of thousands of cells that are common in modern single-cell experiments.