# Kotlin DataFrame

Kotlin DataFrame is a typesafe in-memory structured data processing library for the JVM that reconciles Kotlin's static typing with the dynamic nature of data. It provides a functional, immutable approach to data transformation with a readable DSL close to natural language. The library supports hierarchical data structures (JSON, nested objects), offers compile-time type safety through on-the-fly generation of extension properties, and integrates seamlessly with Kotlin Notebook and IDE tooling.

The library features a minimalistic data model with three column kinds: ValueColumn (contains data), ColumnGroup (contains columns), and FrameColumn (contains dataframes). It provides comprehensive I/O support for CSV, JSON, Excel, SQL databases, and Apache Arrow formats, along with powerful transformation operations including filtering, selection, grouping, joining, pivoting, and aggregation. Integration with the Kotlin compiler plugin enables static interpretation of DataFrame operations, automatic schema inference, and full IDE support with autocompletion and refactoring.
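To make the three column kinds concrete, here is a minimal sketch using the library's `columnOf` and `dataFrameOf` constructors (the column names are illustrative) that builds one column of each kind:

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// ValueColumn: holds plain values
val name by columnOf("Alice", "Bob")
val age by columnOf(29, 34)

// ColumnGroup: a column that contains other columns
val person by columnOf(name, age)

// FrameColumn: a column whose cells are themselves dataframes
val tags by columnOf(
    dataFrameOf("tag")("a", "b"),
    dataFrameOf("tag")("c")
)

// Assemble a two-row DataFrame from the columns
val df = dataFrameOf(person, tags)
```

## Reading CSV Files

Read comma-separated value files with customizable options, including delimiter, header handling, type inference, and parser configuration for locale-aware parsing and null value handling.

```kotlin
// Basic CSV reading
val df = DataFrame.readCsv("data.csv")

// With custom options
val df = DataFrame.readCsv(
    file = "data.csv",
    delimiter = '|',
    header = listOf("A", "B", "C", "D"),
    parserOptions = ParserOptions(
        nullStrings = setOf("not assigned"),
        locale = Locale.UK
    ),
    colTypes = mapOf("colName" to ColType.String),
    skipLines = 1,
    readLines = 100
)

// Write back to CSV
df.writeCsv("output.csv")

// To CSV string
val csvStr = df.toCsvStr(delimiter = ';', recordSeparator = System.lineSeparator())
```

## Reading JSON Files

Parse JSON data into DataFrames with automatic schema inference, supporting both flat and nested hierarchical structures, including arrays and objects.

```kotlin
// Read JSON file
val df = DataFrame.readJson("data.json")

// Read a single JSON object as a row
val row = DataRow.readJson("data.json")

// Write to JSON
df.writeJson("output.json")

// To JSON string with formatting
val jsonStr = df.toJson(prettyPrint = true)
```

## Reading Excel Files

Import Excel workbooks with support for multiple sheets, column type specification, and automatic type inference from cell formats.

```kotlin
// Basic Excel reading
val df = DataFrame.readExcel("file.xlsx")

// Force specific columns to be read as strings
val df = DataFrame.readExcel("mixed_column.xlsx", stringColumns = StringColumns("A"))

// Write to Excel
df.writeExcel("output.xlsx")

// Multiple sheets: keepFile = true adds a sheet instead of overwriting the file
df.writeExcel(file, sheetName = "allPersons")
df.filter { isHappy }.writeExcel(file, sheetName = "happy", keepFile = true)
```

## Reading from SQL Databases

Execute SQL queries or read entire tables from JDBC-compatible databases, with automatic type mapping and nullability inference. Connection details can be passed as configuration, or an existing JDBC connection (for example, one borrowed from a connection pool) can be reused.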
```kotlin
// Read from SQL table
val df = DataFrame.readSqlTable(
    dbConfig = DbConnectionConfig(
        url = "jdbc:postgresql://localhost:5432/mydb",
        user = "user",
        password = "password"
    ),
    tableName = "users",
    limit = 1000,
    inferNullability = true
)

// Read with custom query
val df = DataFrame.readSqlQuery(
    dbConfig = dbConfig,
    sqlQuery = "SELECT * FROM users WHERE age > 18"
)
```
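When an already-open connection is available, the `java.sql.Connection` overloads can be used instead of `DbConnectionConfig`. A minimal sketch (URL and credentials are placeholders):

```kotlin
import java.sql.DriverManager
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.io.readSqlTable

// Reuse an existing JDBC connection, e.g. one borrowed from a pool
DriverManager.getConnection("jdbc:postgresql://localhost:5432/mydb", "user", "password").use { connection ->
    val users = DataFrame.readSqlTable(connection, "users", limit = 1000)
    println(users.rowsCount())
}
```

## Reading Apache Arrow Files

Import data from Apache Arrow IPC and Feather formats with efficient columnar storage, zero-copy reads, and schema preservation for interoperability with other data systems.

```kotlin
// Read Arrow Feather
val df = DataFrame.readArrowFeather("data.arrow")

// Write Arrow IPC
df.writeArrowIPC("output.arrow")

// Write Arrow Feather
df.writeArrowFeather("output.feather")

// To byte arrays
val ipcBytes = df.saveArrowIPCToByteArray()
val featherBytes = df.saveArrowFeatherToByteArray()

// With custom schema
val schema = Schema.fromJSON(schemaJson)
df.arrowWriter(
    targetSchema = schema,
    mode = ArrowWriter.Mode(restrictWidening = true, strictType = true)
).use { writer ->
    writer.writeArrowFeather(file)
}
```

## Filtering Rows

Select rows matching boolean predicates with type-safe column access, supporting complex conditions with logical operators and null-safe navigation.

```kotlin
// Filter rows based on a condition
df.filter { age > 18 && name.firstName.startsWith("A") }

// Filter with null safety
df.filter { city != null && city.length > 5 }

// Filter a DataColumn
column.filter { it > 10 }

// Filter a ColumnSet
df.select { cols().filter { it.hasNulls() } }
```

## Selecting Columns

Extract specific columns using DSL syntax, column names, type filters, or predicates, supporting hierarchical navigation through nested column groups.

```kotlin
// Select columns using the DSL
df.select { name and age and city }

// Select by column names
df.select("name", "age", "city")

// Select from a column group
df.select { myColGroup.select { colA and colB } }

// Select by type and by name predicate
df.select { colsOf<Double>() and cols { it.name().startsWith("total") } }

// Remove columns
df.remove { age and city }
```

## GroupBy Operations

Partition rows into groups based on key columns and perform aggregations within each group, supporting nested grouping and custom aggregation functions.

```kotlin
// Basic grouping
df.groupBy { city }

// Multiple columns
df.groupBy { city and country }

// Group by column names
df.groupBy("city", "country")

// Group and aggregate
df.groupBy { city }
    .aggregate {
        count() into "count"
        age.mean() into "avgAge"
        income.sum() into "totalIncome"
    }

// Keep only groups with more than 10 rows, then aggregate
df.groupBy { category }
    .filter { group.rowsCount() > 10 }
    .aggregate { values.mean() into "average" }
```

## Join Operations

Combine two DataFrames based on matching columns using various join types (inner, left, right, full, filter, exclude), with support for multiple key columns and custom match conditions.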
```kotlin
// Inner join
df1.innerJoin(df2) { name }

// Left join
df1.leftJoin(df2) { id }

// Join with explicit column matching
df1.join(df2, JoinType.Inner) { leftColumn match rightColumn }

// Multiple column matching
df1.join(df2) { id match id and name match fullName }

// Other join types
df1.rightJoin(df2) { key }
df1.fullJoin(df2) { key }
df1.filterJoin(df2) { key }  // Keep only matched rows from the left
df1.excludeJoin(df2) { key } // Keep only unmatched rows from the left
```

## Converting Column Types

Transform column values with type conversion functions, supporting nullable/non-nullable conversions, custom transformations, and per-row operations with index access.

```kotlin
// Convert with a custom lambda
df.convert { topics }.with {
    val inner = it.removeSurrounding("[", "]")
    if (inner.isEmpty()) emptyList() else inner.split(',').map(String::trim)
}

// Convert to a specific type
df.convert { age }.to<Int>()

// Type-specific conversions
df.convert { price }.toDouble()
df.convert { date }.toLocalDateTime()
df.convert { flag }.toBoolean()

// Convert with access to the row and the column
df.convert { score }.perRowCol { row, col -> col[row.index()] * 2 }

// Convert non-null values only
df.convert { nullable }.notNull { it.toString().uppercase() }
```

## Converting to Data Schema

Transform untyped DataFrames into strongly typed ones using data classes annotated with @DataSchema, enabling type-safe column access and IDE autocompletion.

```kotlin
// Define a data schema
@DataSchema
data class Repositories(
    @ColumnName("full_name") val fullName: String,
    @ColumnName("html_url") val htmlUrl: URL,
    @ColumnName("stargazers_count") val stargazersCount: Int,
    val topics: String
)

// Convert an untyped DataFrame to a typed one
val repos = DataFrame
    .readCsv("https://raw.githubusercontent.com/Kotlin/dataframe/master/data/jetbrains_repositories.csv")
    .convertTo<Repositories>()

// Access with type-safe properties
val filtered = repos
    .filter { stargazersCount > 50 }
    .convert { topics }.with {
        val inner = it.removeSurrounding("[", "]")
        if (inner.isEmpty()) emptyList() else inner.split(',').map(String::trim)
    }
```

## Aggregating Data

Compute summary statistics across rows or within groups using built-in aggregation functions (sum, mean, min, max, std, median, count, countDistinct).

```kotlin
// Count operations
df.count()            // Total rows
df.count { age > 18 } // Conditional count

// Aggregation after groupBy
df.groupBy { city }
    .aggregate {
        count() into "population"
        age.mean() into "avgAge"
        age.min() into "minAge"
        age.max() into "maxAge"
        income.sum() into "totalIncome"
        income.std() into "stdIncome"
    }

// Column aggregations
column.sum()
column.mean()
column.median()
column.min()
column.max()
column.std()
column.countDistinct()
```

## Adding Columns

Create new columns with expressions computed from existing columns, supporting row context with access to previous rows and complex multi-column calculations.

```kotlin
// Add a new column with an expression
df.add("fullName") { firstName + " " + lastName }

// Add with previous row access; newValue() reads the value already
// computed for the column being built
df.add("cumulative") { (prev()?.newValue<Int>() ?: 0) + value }

// Add multiple columns
df.add("topicCount") { topics.size }
    .add("kind") { getKind(fullName, topics) }

// Add with row index
df.add("rowNum") { index() }
```

## Updating Columns

Modify existing column values in place with conditional updates, index-based updates, and null-aware transformations while preserving column types.
```kotlin
// Update column values
df.update { age }.with { it + 1 }

// Update with a condition
df.update { age }.where { age < 18 }.with { 18 }

// Update at specific indices
df.update { score }.at(0, 2, 4).with { it * 2 }

// Update non-null values only
df.update { nullable }.notNull { it.toString().uppercase() }

// Update with row context
df.update { price }.with { price * discountRate }
```

## Pivoting Data

Reshape DataFrames by converting unique values from one column into multiple columns, with support for aggregation, one-hot encoding, and hierarchical pivots.

```kotlin
// Pivot with aggregation
df.pivot { category }
    .groupBy { region }
    .aggregate {
        count() into "count"
        sales.sum() into "totalSales"
    }

// Pivot matches (one-hot encoding)
df.pivotMatches { pclass and sex }

// Pivot counts
df.pivotCounts { category }
```

## Sorting DataFrames

Order rows based on one or more columns in ascending or descending order, supporting null positioning and custom comparators.

```kotlin
// Sort by a single column
df.sortBy { age }

// Sort descending
df.sortByDesc { score }

// Sort by multiple columns
df.sortBy { city and age }

// Mixed ascending/descending
df.sortBy { city and age.desc() }
```

## Handling Null Values

Fill, remove, or replace null values using constants, column statistics, or custom expressions, with per-column strategies and null safety guarantees.

```kotlin
// Fill nulls with a constant
df.fillNulls { age }.with { 0 }

// Fill with the column mean
df.fillNulls { numericColumns }.perCol { it.mean() }

// Fill multiple columns
df.fillNulls { sibsp and parch and age and fare }.perCol { it.mean() }
    .fillNulls { sex }.with { "female" }

// Remove rows with nulls
df.dropNulls()
df.dropNulls { age and city }
```

## Complete ETL Pipeline Example

End-to-end data processing pipeline demonstrating reading, filtering, type conversion, grouping, aggregation, sorting, and writing in a single fluent chain.

```kotlin
@DataSchema
interface Movie {
    val movieId: String
    val title: String
    val genres: String
}

// Read, clean, transform, and analyze movie data
val result = DataFrame
    .readCsv("movies.csv")
    .convertTo<Movie>()
    .split { genres }.by("|").inplace()
    .split { title }.by {
        listOf(
            """\s*\(\d{4}\)\s*$""".toRegex().replace(it, ""),
            "\\d{4}".toRegex().findAll(it).lastOrNull()?.value?.toIntOrNull() ?: -1
        )
    }.into("title", "year")
    .explode("genres")
    .filter { "year"<Int>() >= 0 && genres != "(no genres listed)" }
    .groupBy("year")
    .sortBy("year")
    .pivot("genres", inward = false)
    .aggregate {
        count() into "count"
        mean() into "mean"
    }

result.writeJson("movie_analysis.json")
```

## Machine Learning Feature Engineering

Prepare data for ML models with imputation, one-hot encoding, feature extraction, normalization, and train-test splitting.

```kotlin
val (train, test) = df
    // Impute missing values
    .fillNulls { sibsp and parch and age and fare }.perCol { it.mean() }
    .fillNulls { sex }.with { "female" }
    // One-hot encoding
    .pivotMatches { pclass and sex }
    // Feature extraction
    .select { survived and pclass and sibsp and parch and age and fare and sex }
    // Normalization (pseudocode: scale numeric columns to zero mean and unit variance)
    .convert { numericColumns }.with { (it - mean) / std }
    // Split into train and test sets (custom helper; see the sketch below)
    .shuffle()
    .toTrainTest(0.7) { survived }
```
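The `toTrainTest` call above is not part of the Kotlin DataFrame API. A minimal sketch of such a helper, assuming a plain sequential split after the upstream `shuffle()` (the label selector is accepted only to match the call site and is unused in this non-stratified version):

```kotlin
import org.jetbrains.kotlinx.dataframe.ColumnsSelector
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical helper: split rows into train/test parts by fraction.
fun <T> DataFrame<T>.toTrainTest(
    trainRatio: Double,
    labelSelector: ColumnsSelector<T, *>? = null
): Pair<DataFrame<T>, DataFrame<T>> {
    val trainSize = (rowsCount() * trainRatio).toInt()
    return take(trainSize) to drop(trainSize)
}
```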
Kotlin DataFrame serves as a comprehensive solution for data analysis, ETL pipelines, and data science workflows in Kotlin projects. Its primary use cases include exploratory data analysis in Kotlin Notebook with rich HTML rendering, data preprocessing and feature engineering for machine learning models, ETL pipeline development with type-safe transformations, and report generation from various data sources including SQL databases, CSV files, and APIs. The library's immutable, functional design ensures thread-safe operations suitable for concurrent data processing.

Integration patterns vary by environment. In Kotlin Notebook and Jupyter environments, the `%use dataframe` magic command provides automatic dependency resolution and schema inference. In Gradle projects, add dependencies for the required I/O modules (CSV, JSON, Excel, JDBC, Arrow) and optionally enable the compiler plugin for compile-time type checking and IDE integration. The compiler plugin (requires Kotlin 2.2.20+ and IntelliJ IDEA 2025.2+) provides static interpretation of DataFrame operations, automatic schema updates after each transformation, and full IDE support with autocompletion and refactoring.

For production use, the library supports conversion to and from Kotlin collections and data classes, enabling seamless integration with existing codebases and serialization frameworks.
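A minimal Gradle (Kotlin DSL) sketch of the setup described above; the versions are placeholders and the compiler plugin id is an assumption, so verify both against the current DataFrame documentation:

```kotlin
// build.gradle.kts (sketch; replace versions with the current releases)
plugins {
    kotlin("jvm") version "2.2.20"
    // Compiler plugin for static schema interpretation (Kotlin 2.2.20+);
    // assumed plugin id, check the DataFrame docs
    kotlin("plugin.dataframe") version "2.2.20"
}

dependencies {
    // Umbrella artifact: core API plus the CSV, JSON, Excel, JDBC, and Arrow modules
    implementation("org.jetbrains.kotlinx:dataframe:1.0.0")
    // Or depend on individual I/O modules instead, e.g.:
    // implementation("org.jetbrains.kotlinx:dataframe-excel:1.0.0")
}
```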
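To illustrate the collections interop mentioned above, a small sketch (the `Person` class is illustrative) that converts a list of data class instances to a DataFrame and back:

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

data class Person(val name: String, val age: Int)

fun main() {
    val people = listOf(Person("Alice", 29), Person("Bob", 34))

    // List<Person> -> DataFrame: one column per constructor property
    val df = people.toDataFrame()

    // DataFrame -> List<Person>: columns are matched to properties by name
    val roundTrip: List<Person> = df.toListOf<Person>()

    println(roundTrip == people) // true
}
```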