Kotlin DataFrame
https://github.com/kotlin/dataframe
# Kotlin DataFrame

Kotlin DataFrame is a typesafe, in-memory structured data processing library for the JVM. It reconciles Kotlin's static typing with the dynamic nature of data by providing a functional, immutable, and hierarchical data model that works seamlessly in both regular Gradle/Maven projects and Jupyter notebooks. The library supports three column kinds — `ValueColumn` (data values), `ColumnGroup` (nested columns), and `FrameColumn` (nested DataFrames) — enabling representation of arbitrarily deep JSON-like structures. Every operation returns a new `DataFrame` instance, reusing underlying storage where possible, making the API chain-friendly and side-effect-free.

The core functionality revolves around a rich DSL for creating, reading, filtering, transforming, aggregating, and writing DataFrames. Data can be loaded from CSV, TSV, JSON, Excel, Apache Arrow, Apache Parquet, and SQL databases; written back to any of those formats; and accessed either dynamically via string column names or in a fully type-safe manner through `@DataSchema`-annotated interfaces and the Kotlin DataFrame Compiler Plugin. The compiler plugin generates extension properties that provide IDE autocompletion, refactoring support, and compile-time schema verification across the entire transformation pipeline.
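As a quick, minimal sketch of the three column kinds in one frame (the column names and values here are illustrative, not from any real dataset; the builders used are the same `dataFrameOf`/`columnOf` shown in the creation section below):

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

val df = dataFrameOf(
    // ColumnGroup: nested columns under a shared "name" header
    "name" to columnOf(
        "firstName" to columnOf("Alice", "Bob"),
        "lastName" to columnOf("Cooper", "Dylan"),
    ),
    // FrameColumn: a whole DataFrame stored in each cell
    "scores" to columnOf(
        dataFrameOf("subject", "value")("math", 4),
        dataFrameOf("subject", "value")("music", 5),
    ),
    // ValueColumn: a flat column of data values
    "age" to columnOf(15, 20),
)

// Chain-friendly and side-effect-free: each call returns a new DataFrame,
// leaving df itself untouched
val adults = df.filter { "age"<Int>() >= 18 }
```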
---

## Setup

### Gradle dependency

```kotlin
// build.gradle.kts
repositories {
    mavenCentral()
}

dependencies {
    implementation("org.jetbrains.kotlinx:dataframe:1.0.0-Beta5")
}

// Optional: enable the Compiler Plugin for type-safe column access
plugins {
    kotlin("jvm") version "2.3.20"
    kotlin("plugin.dataframe") version "2.3.20"
}

// gradle.properties — required while incremental compilation is unsupported
// kotlin.incremental=false
```

### Kotlin Notebook / Jupyter

```
%useLatestDescriptors
%use dataframe
```

---

## DataFrame Creation

### `dataFrameOf` — create a DataFrame inline

Builds a `DataFrame` from column-name/value pairs, from `vararg` row values, or from existing `DataColumn` objects. The most direct way to construct a DataFrame from literals.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// From name-to-list pairs
val df = dataFrameOf(
    "name" to listOf("Alice", "Bob", "Charlie"),
    "age" to listOf(15, 20, 100),
)

// From column names + row values (vararg)
val df2 = dataFrameOf("name", "age")(
    "Alice", 15,
    "Bob", 20,
    "Charlie", 100,
)

// With nested ColumnGroup
val df3 = dataFrameOf(
    "name" to columnOf(
        "firstName" to columnOf("Alice", "Bob"),
        "lastName" to columnOf("Cooper", "Dylan"),
    ),
    "age" to columnOf(15, 20),
)
```

---

### `toDataFrame` — convert Kotlin objects / collections

Converts a `List<T>`, a `Map<String, List<*>>`, an `IntRange` builder, or a `List<List<T>>` into a `DataFrame`. Supports deep object-graph traversal via the `maxDepth` parameter and a configuration DSL.
```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// From a Map
val map = mapOf("name" to listOf("Alice", "Bob"), "age" to listOf(15, 20))
val df = map.toDataFrame()

// From data class instances
data class Name(val firstName: String, val lastName: String)
data class Score(val subject: String, val value: Int)
data class Student(val name: Name, val age: Int, val scores: List<Score>)

val students = listOf(
    Student(Name("Alice", "Cooper"), 15, listOf(Score("math", 4), Score("biology", 3))),
    Student(Name("Bob", "Marley"), 20, listOf(Score("music", 5))),
)

// maxDepth = 1 → Name becomes ColumnGroup, scores becomes FrameColumn
val df2 = students.toDataFrame(maxDepth = 1)

// Advanced DSL with custom columns, property exclusions, and column groups
val df3 = students.toDataFrame {
    "year of birth" from { 2021 - it.age }
    properties(maxDepth = 1) {
        exclude(Score::subject)
        preserve<Name>() // keep Name as an object, don't decompose
    }
    "summary" {
        "max score" from { it.scores.maxOf { s -> s.value } }
        "min score" from { it.scores.minOf { s -> s.value } }
    }
}

// From an IntRange with a builder lambda (useful for random/generated data)
val generated = (0 until 7).toDataFrame {
    "id" from { "P${1000 + it}" }
    "price" from { kotlin.random.Random.nextDouble(10.0, 500.0) }
    "inStock" from { kotlin.random.Random.nextInt(0..100) }
}
```

---

## Reading Data

### `DataFrame.read` — auto-detect format

Reads a `DataFrame` from a file path or URL, automatically detecting the format from the file extension.

```kotlin
val df = DataFrame.read("input.csv")
val df2 = DataFrame.read("https://example.com/data.json")
```

---

### `DataFrame.readCsv` / `readCsvStr` — read CSV

Reads CSV (or TSV/delimited) files with automatic type inference for `Int`, `Long`, `Double`, and `Boolean`. Supports custom delimiters, null-string sets, locale-specific number formats, and custom date-time patterns via `ParserOptions`.
```kotlin
import org.jetbrains.kotlinx.dataframe.api.*
import org.jetbrains.kotlinx.dataframe.io.*
import java.io.File
import java.util.Locale

// From file
val df = DataFrame.readCsv("data.csv")
DataFrame.readCsv(File("data.csv"))

// From URL
DataFrame.readCsv(
    java.net.URI("https://raw.githubusercontent.com/Kotlin/dataframe/master/data/jetbrains_repositories.csv").toURL()
)

// From String
val csv = """
A,B,C,D
12,tuv,0.12,true
41,xyz,3.6,not assigned
89,abc,7.1,false
""".trimIndent()
DataFrame.readCsvStr(csv)
// Schema inferred: A: Int, B: String, C: Double, D: Boolean?

// Custom delimiter + null strings
val df2 = DataFrame.readCsv(
    File("data.psv"),
    delimiter = '|',
    header = listOf("A", "B", "C", "D"),
    parserOptions = ParserOptions(nullStrings = setOf("not assigned")),
)

// Locale-specific numbers (comma as decimal separator)
val df3 = DataFrame.readCsv(File("eu_data.csv"), parserOptions = ParserOptions(locale = Locale.GERMAN))

// Custom date-time pattern
val df4 = DataFrame.readCsv(
    File("log.csv"),
    parserOptions = ParserOptions(dateTimePattern = "dd/MMM/yy h:mm a"),
)

// Disable type inference for all columns (keep everything as String)
val df5 = DataFrame.readCsv(File("data.csv"), colTypes = mapOf(ColType.DEFAULT to ColType.String))
```

---

### `DataFrame.readJson` / `readJsonStr` — read JSON

Reads JSON arrays or objects into a hierarchical `DataFrame`. Nested JSON objects become `ColumnGroup`s and arrays of objects become `FrameColumn`s. Type clashes (the same key holding different JSON types) can be resolved via `typeClashTactic`. Use `keyValuePaths` to read large key-value maps as `FrameColumn`s instead of exploding them into hundreds of columns.
```kotlin
import org.jetbrains.kotlinx.dataframe.api.*
import org.jetbrains.kotlinx.dataframe.io.*
import java.io.File

// From file or URL
val df = DataFrame.readJson(File("data.json"))
DataFrame.readJson("https://covid.ourworldindata.org/data/owid-covid-data.json")

// Type-clash tactic: fold mixed-type fields into (value, array, objectProps) groups
DataFrame.readJsonStr(json, typeClashTactic = JSON.TypeClashTactic.ARRAY_AND_VALUE_COLUMNS)

// ANY_COLUMNS: keep mixed-type column as Any
DataFrame.readJsonStr(json, typeClashTactic = JSON.TypeClashTactic.ANY_COLUMNS)

// keyValuePaths: read {"dogs": {"fido": {...}, "spot": {...}}} as FrameColumn
// instead of one column per dog name
DataFrame.readJsonStr(
    text = myJson,
    keyValuePaths = listOf(
        JsonPath().append("dogs"),
        JsonPath().append("cats"),
    ),
)
// Result schema: dogs: *[name: String, value: {age: Int, breed: String}]
```

---

### `DataFrame.readExcel` — read Excel (XLS/XLSX)

Reads Excel spreadsheets. Date cells are read as `kotlinx.datetime.LocalDateTime`, numeric cells as `Double`. Use `stringColumns` to force specific columns to be read as `String`.

```kotlin
import org.jetbrains.kotlinx.dataframe.io.*

val df = DataFrame.readExcel(File("report.xlsx"))
DataFrame.readExcel("https://example.com/data.xlsx")

// Force column "A" to be String to avoid a mixed numeric/string type
val df2 = DataFrame.readExcel("mixed_column.xlsx", stringColumns = StringColumns("A"))
```

---

### `DataFrame.readArrowFeather` — read Apache Arrow

Reads Arrow IPC streaming format or Feather (random access) format from a file, `InputStream`, `Channel`, or `ByteArray`.

```kotlin
import org.jetbrains.kotlinx.dataframe.io.*

val df = DataFrame.readArrowFeather(File("data.feather"))
val df2 = DataFrame.readArrowIPC(File("data.arrow"))
```

---

### SQL Database — `DataFrame.readSqlTable` / `readSqlQuery`

Reads data from SQL databases (PostgreSQL, MySQL, MariaDB, SQLite, MS SQL, DuckDB) via JDBC. Requires the `dataframe-jdbc` artifact and the appropriate JDBC driver.
```kotlin
// build.gradle.kts
// implementation("org.jetbrains.kotlinx:dataframe-jdbc:1.0.0-Beta5")
// implementation("org.postgresql:postgresql:$version")

import org.jetbrains.kotlinx.dataframe.io.DbConnectionConfig
import org.jetbrains.kotlinx.dataframe.api.print
import java.sql.DriverManager

val dbConfig = DbConnectionConfig(
    url = "jdbc:postgresql://localhost:5432/testDatabase",
    username = "postgres",
    password = "password",
)

// Read an entire table (first 100 rows)
val df = DataFrame.readSqlTable(dbConfig, "Customer", limit = 100)
df.print()

// Execute an arbitrary SQL query
val result = DataFrame.readSqlQuery(dbConfig, "SELECT id, name, age FROM Customer WHERE age > 30")

// Read using an existing JDBC Connection
val conn = DriverManager.getConnection("jdbc:sqlite:local.db")
val df2 = conn.readDataFrame("SELECT * FROM orders")

// Inspect schema without reading data
val schema = DataFrameSchema.readSqlTable(dbConfig, "Customer")
```

---

## Column Selection DSL

### Columns Selection DSL — multi-column selectors

A powerful DSL used across `select`, `filter`, `update`, `remove`, `move`, `convert`, and many other operations. Supports property-based access (with the compiler plugin or `@DataSchema`), string-based access, index-based access, and predicate-based selection.
```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Select specific columns
df.select { age and weight }
df.select("age", "weight")
df["age", "weight"]

// Type-filtered columns at any depth
df.select { colsAtAnyDepth().colsOf<String>() }

// Filter columns by predicate
df.remove { cols { it.hasNulls() } }

// Nested column access (ColumnGroup navigation)
df.select { name.firstName and name.lastName }

// Range of columns
df.select { age.."weight" }

// All columns in a group
df.select { name.allCols() }

// Combine in real operations
df.fillNaNs { colsAtAnyDepth().colsOf<Double>() }.withZero()
df.update { city }.notNull { it.lowercase() }
df.move { name.firstName and name.lastName }.after { city }
```

---

## Filtering Rows

### `filter` — row filtering

Returns a `DataFrame` containing the rows that satisfy a row condition. The lambda receiver is `DataRow`, giving access to column values as properties (with the compiler plugin) or via typed string access.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Property-based (requires @DataSchema or the compiler plugin)
df.filter { age > 18 && name.firstName.startsWith("A") }

// String-based
df.filter { "age"<Int>() > 18 && "name"["firstName"]<String>().startsWith("A") }

// Combined with other operations
df
    .filter { age in 18..65 }
    .sortBy { age }
    .select { name and age and city }
```

---

## Column Operations

### `add` — add computed columns

Appends new columns derived from a row expression. Supports multiple columns at once, column groups, and recurrent (row-to-row) calculations.
```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Single column
df.add("year of birth") { 2021 - age }

// Multiple columns including a nested group
df.add {
    "year of birth" from { 2021 - age }
    expr { age > 18 } into "is adult"
    "details" {
        name.lastName.map { it.length } into "last name length"
        "full name" from { name.firstName + " " + name.lastName }
    }
}

// Recurrent computation (Fibonacci)
df.add("fibonacci") {
    if (index() < 2) 1
    else prev()!!.newValue<Int>() + prev()!!.prev()!!.newValue<Int>()
}

// Add a sequential id column (prepended as the first column)
df.addId()
df.addId("rowId")
```

---

### `update` — update cell values (same type)

Changes values in selected cells without altering the column type. Supports row conditions, index ranges, and per-column and per-row-column expressions.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

df.update { age }.with { it * 2 }
df.update { colsAtAnyDepth().colsOf<String>() }.with { it.uppercase() }
df.update { weight }.at(1..4).notNull { it / 2 }
df.update { name.lastName and age }.at(1, 3, 4).withNull()

// Conditional update
df.update { city }.where { name.firstName == "Alice" }.with { "Paris" }

// Row-dependent update
df.update { city }.with { name.firstName + " from " + it }

// Per-column update (replace with column mean)
df.update { colsOf<Number?>() }.perCol { mean(skipNaN = true) }

// Per-row-col update
df.update { colsOf<String?>() }.perRowCol { row, col -> col.name() + ": " + row.index() }

// Update a ColumnGroup as a DataFrame
df.update { name }.asFrame { select { lastName } }
```

---

### `convert` — change column types

Returns a `DataFrame` with column values converted to a different type. Supports automatic conversions between primitives, date/time types, enums, value classes, and custom converters via `ParserOptions`.
```kotlin
import org.jetbrains.kotlinx.dataframe.api.*
import java.util.Locale

df.convert { age }.with { it.toDouble() }
df.convert { colsAtAnyDepth().colsOf<String>() }.with { it.toCharArray().toList() }

// Automatic type conversion shortcuts
df.convert { age }.to<Double>()
df.convert { weight }.toFloat()
df.convert { colsOf<Number>() }.to<String>()

// Column-level conversion
df.weight.convertTo<Float?>()
df.age.convertToDouble()

// String → enum
enum class Direction { NORTH, SOUTH, WEST, EAST }
dataFrameOf("direction")("NORTH", "WEST").convert("direction").to<Direction>()

// String → value class
@JvmInline
value class IntClass(val value: Int)
dataFrameOf("value")("1", "2").convert("value").to<IntClass>()

// String with custom locale / date format
stringDf.convert { value }.to<Double?>(
    parserOptions = ParserOptions(locale = Locale.GERMAN, nullStrings = setOf("-")),
)
stringDf.convert { date }.toLocalDate(kotlinx.datetime.LocalDate.Formats.ISO)
```

---

### `rename` / `renameToCamelCase` — rename columns

Renames one or more columns by a new name, by a name expression, or via bulk camelCase normalization.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Rename a single column
df.rename { name }.into("fullName")
df.rename("name").into("fullName")

// Rename with an expression using column statistics
df.rename { age }.into {
    val mean = it.data.mean()
    "age [mean = $mean]"
}

// Rename a subset to camelCase
df.rename { ColumnA and `COLUMN-C` }.toCamelCase()

// Rename ALL columns (including nested) to camelCase
// e.g. "first_name" → "firstName", "RESTApi" → "restApi"
df.renameToCamelCase()
```

---

### `remove` — drop columns

Returns a `DataFrame` without the selected columns.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

df.remove { name and weight }
df.remove("name", "weight")

// Remove all columns that have nulls
df.remove { cols { it.hasNulls() } }
```

---

### `select` — project columns

Creates a new `DataFrame` containing only the specified columns, preserving their order.
```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

df.select { age and weight }
df.select("age", "weight")
df.select { colsOf<Int>() }
df.select { name.allCols() } // flatten a ColumnGroup into top-level columns
```

---

### `move` — reorder / restructure columns

Moves columns to a different position or restructures them into or out of `ColumnGroup`s.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

df.move { age }.toStart()
df.move { weight }.to(1)

// Group age + weight under "info"
df.move { age and weight }.under("info")

// Rename the path while moving: name.firstName → fullName.first
df.move { name.firstName and name.lastName }.into { pathOf("fullName", it.name().dropLast(4)) }

// Flatten a ColumnGroup to the top level
df.move { name.allCols() }.toTop()

// Split pipe-separated column names into a hierarchy: "a|b|c" → a.b.c
dataFrameOf("a|b|c", "a|d|e")(0, 0)
    .move { all() }.into { it.name().split("|").toPath() }
```

---

### `split` — split column values horizontally or vertically

Splits `String`, `List`, or `DataFrame` column values and stores the parts in new columns or as new rows.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Split a String column into characters → separate columns
df.split { name.lastName }.by { it.asIterable() }.into("char1", "char2")

// Split a comma-separated string into a list in-place
df.split { "tags"<String>() }.by(",").inplace()

// Expand a list column into separate rows (explode variant)
val dfWithLists = dataFrameOf(
    "a" to columnOf(listOf(1, 2), listOf(3, 4, 5)),
    "b" to columnOf(listOf(1, 2, 3), listOf(4, 5)),
)
dfWithLists.split { a }.intoRows()
```

---

### `explode` — spread list/frame values into rows

Expands list-valued cells into individual rows, duplicating the values in other columns. Reverses `implode`.
```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

val df = dataFrameOf("a", "b")(
    1, listOf(1, 2),
    2, listOf(3, 4),
)
df.explode("b")
// Result: a=1/b=1, a=1/b=2, a=2/b=3, a=2/b=4

// Explode multiple columns simultaneously (values aligned)
val df2 = dataFrameOf(
    "a" to columnOf(listOf(1, 2), listOf(3, 4, 5)),
    "b" to columnOf(listOf(1, 2, 3), listOf(4, 5)),
)
df2.explode("a", "b")

// Explode a FrameColumn
val col by columnOf(
    dataFrameOf("a", "b")(1, 2, 3, 4),
    dataFrameOf("a", "b")(5, 6, 7, 8),
)
col.explode()
```

---

## Sorting

### `sortBy` / `sortByDesc` / `sortWith` — sort rows

Returns a `DataFrame` with rows sorted by one or more columns. The modifiers `.desc()` and `.nullsLast()` control the order per column.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

df.sortBy { age }
df.sortBy { age and name.firstName.desc() }
df.sortBy { weight.nullsLast() }
df.sortByDesc { age and weight }

// Custom comparator
df.sortWith { row1, row2 ->
    when {
        row1.age < row2.age -> -1
        row1.age > row2.age -> 1
        else -> row1.name.firstName.compareTo(row2.name.firstName)
    }
}
```

---

## Deduplication

### `distinct` / `distinctBy` — remove duplicate rows

Removes duplicate rows from the `DataFrame`. `distinctBy` keeps only the first row per group defined by the given columns.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

df.distinct()

// Distinct on a subset of columns
df.distinct { age and name }
df.distinct("age", "name")

// Keep the first row per group
df.distinctBy { age and name }
```

---

## Aggregation

### `groupBy` — group rows and aggregate

Splits rows into groups by one or more key columns, then aggregates each group. Returns a `GroupBy` object that can be aggregated, pivoted, sorted, or converted back to a `DataFrame`.
```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Simple grouping
df.groupBy { name }
df.groupBy { city and name.lastName }

// Inline computed key
df.groupBy { expr { name.firstName.length + name.lastName.length } named "nameLength" }

// Multi-statistic aggregation
df.groupBy { city }.aggregate {
    count() into "total"
    count { age > 18 } into "adults"
    median { age } into "median age"
    min { age } into "min age"
    maxBy { age }.name into "oldest"
}

// Direct aggregation shortcuts
df.groupBy { city }.max()                          // max per comparable column
df.groupBy { city }.mean()                         // mean per numeric column
df.groupBy { city }.max { age }                    // → column "age"
df.groupBy { city }.sum("total weight") { weight } // → column "total weight"
df.groupBy { city }.count()                        // → column "count"

// Collect raw values without aggregation
df.groupBy { city }.values { name and age }
df.groupBy { city }.values { weight into "weights" }

// Concat groups back into a DataFrame (preserving group order)
df.groupBy { name }.concat()
```

---

### `pivot` — pivot columns from row values

Reshapes the `DataFrame` by turning the distinct values of one column into new column headers, optionally combined with `groupBy` for a full cross-tabulation matrix.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

df.pivot { city }
df.pivot { city and name.firstName }  // independent pivots
df.pivot { city then name.firstName } // hierarchical pivot

// pivot + groupBy (cross-tabulation)
df.pivot { city }.groupBy { name }.aggregate { mean { age } }
df.groupBy { name }.pivot { city }.aggregate { mean { age } }
// Both produce the same result: rows indexed by name, columns by city
```

---

### `describe` — summary statistics

Produces a summary `DataFrame` with per-column statistics: count, unique, nulls, top, freq, mean, std, min, p25, median, p75, max.
```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

df.describe()

// Describe specific columns only
df.describe { age and name.allCols() }
```

---

## Joining DataFrames

### `joinWith` — expression-based join

Joins two `DataFrame` objects using an arbitrary Boolean expression. Supports Inner, Left, Right, Full, Filter, and Exclude join types via shortcut functions.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*
import kotlinx.datetime.LocalDate

// Inner join on a date range condition
campaigns.innerJoinWith(visits) { right.date in startDate..endDate }

// Left join
campaigns.leftJoinWith(visits) { right.date in startDate..endDate }

// Right join
campaigns.rightJoinWith(visits) { right.date in startDate..endDate }

// Full join
campaigns.fullJoinWith(visits) { right.date in startDate..endDate }

// Filter join (inner, but keeps only the left columns)
campaigns.filterJoinWith(visits) { right.date in startDate..endDate }

// Exclude join (rows from the left with no match on the right)
campaigns.excludeJoinWith(visits) { right.date in startDate..endDate }

// Cross product (cartesian join)
campaigns.joinWith(visits) { true }

// String-based inner join on equality (deduplicates matched columns)
df1.innerJoin(df2, "index", "age")
```

---

## Combining DataFrames

### `concat` — union rows from multiple DataFrames

Vertically concatenates rows from multiple `DataFrame` objects. Unifies schemas: matching column names get the lowest common supertype; missing columns are filled with `null`.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

df.concat(df1, df2)
listOf(df1, df2).concat()

// Concat selected rows
val rows = listOf(df[2], df[4], df[5])
rows.concat()

// Concat two DataColumn instances
val a by columnOf(1, 2)
val b by columnOf(3, 4)
a.concat(b)

// Explode a FrameColumn (i.e. concat its nested frames)
val frameColumn by columnOf(
    dataFrameOf("a", "b")(1, 2, 3, 4),
    dataFrameOf("b", "c")(5, 6, 7, 8),
)
frameColumn.concat()
```

---

## Type-Safe Access with `@DataSchema`

### `@DataSchema` — declare typed DataFrame schemas

Annotate an interface or data class with `@DataSchema` to generate extension properties for type-safe column access. Use it with `cast<>()` (assertion) or `convertTo<>()` (coercing conversion) to apply the schema to a raw `DataFrame`.

```kotlin
import org.jetbrains.kotlinx.dataframe.annotations.*
import org.jetbrains.kotlinx.dataframe.api.*

@DataSchema
interface Person {
    val firstName: String

    @ColumnName("last_name") // maps to a column literally named "last_name"
    val lastName: String

    val age: Int
    val city: String?
}

// Cast: compile-time assertion that the schema matches (no data conversion)
val df = DataFrame.readCsv("people.csv").cast<Person>()

// Type-safe access via generated extension properties
df.filter { firstName.startsWith("A") && age > 18 }
df.add("greeting") { "Hello, $firstName $lastName" }
df.select { firstName and lastName }
df.convert { firstName }.with { it.uppercase() }

// Schemas for nested / hierarchical data
@DataSchema
data class Name(val firstName: String, val lastName: String)

@DataSchema
data class PersonNested(val name: Name, val age: Int, val city: String?)

@DataSchema
data class Group(val id: String, val participants: List<PersonNested>)

val url = "https://raw.githubusercontent.com/Kotlin/dataframe/refs/heads/master/data/participants.json"
val groupDf = DataFrame.readJson(url).cast<Group>()
groupDf.participants.explode() // expand FrameColumn into rows
```

---

### `convertTo<Schema>` — coerce a DataFrame to a schema

Converts all columns to match the target `@DataSchema`, applying automatic type conversion and accepting custom converters, parsers, and fillers for missing or non-standard columns.
```kotlin
import org.jetbrains.kotlinx.dataframe.annotations.*
import org.jetbrains.kotlinx.dataframe.api.*

class MyType(val value: Int)

@DataSchema
class MySchema(val a: MyType, val b: MyType, val c: Int)

val raw: AnyFrame = dataFrameOf(
    "a" to columnOf(1, 2, 3),
    "b" to columnOf("1", "2", "3"),
)

val typed = raw.convertTo<MySchema> {
    convert<Int>().with { MyType(it) }    // Int → MyType for column "a"
    parser { MyType(it.toInt()) }         // String → MyType for column "b"
    fill { c }.with { a.value + b.value } // compute missing column "c"
}
```

---

## Writing Data

### `writeCsv` / `writeJson` / `writeExcel` / `writeArrowIPC` — persist DataFrames

Saves a `DataFrame` to CSV, JSON, Excel (XLS/XLSX), or Apache Arrow formats.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*
import org.jetbrains.kotlinx.dataframe.io.*
import java.io.File

val df = DataFrame.readCsv("input.csv")
    .filter { "age"<Int>() > 18 }
    .rename("stargazers_count").into("stars")

// CSV
df.writeCsv(File("output.csv"))
val csvString = df.toCsvStr(delimiter = ';', recordSeparator = System.lineSeparator())

// JSON
df.writeJson(File("output.json"))
val jsonString = df.toJson(prettyPrint = true)

// Excel – single sheet
df.writeExcel(File("output.xlsx"))

// Excel – multiple sheets (keepFile = true appends without overwriting)
df.filter { "isHappy"<Boolean>() }.remove("isHappy")
    .writeExcel(File("report.xlsx"), sheetName = "happyPersons", keepFile = true)

// Apache Arrow IPC streaming
df.writeArrowIPC(File("data.arrow"))
val bytes: ByteArray = df.saveArrowIPCToByteArray()

// Apache Arrow Feather (random access)
df.writeArrowFeather(File("data.feather"))
val featherBytes: ByteArray = df.saveArrowFeatherToByteArray()

// Arrow with a target schema and strict mode
val schema = org.apache.arrow.vector.types.pojo.Schema.fromJSON(schemaJson)
df.arrowWriter(
    targetSchema = schema,
    mode = ArrowWriter.Mode(restrictWidening = true, strictType = true, strictNullable = false),
).use { writer ->
    writer.writeArrowFeather(File("typed.feather"))
}
```

---

## Compiler Plugin (Type-Safe Extension Properties)

### Kotlin DataFrame Compiler Plugin — live schema tracking

When the Kotlin DataFrame Compiler Plugin is enabled in a Gradle/Maven project, every `DataFrame` operation that changes the schema generates new extension properties on the fly. This means the IDE and the Kotlin compiler always know the exact shape of the `DataFrame` at each step — no `@DataSchema` annotation required.

```kotlin
// build.gradle.kts
plugins {
    kotlin("jvm") version "2.3.20"
    kotlin("plugin.dataframe") version "2.3.20"
}

dependencies {
    implementation("org.jetbrains.kotlinx:dataframe:1.0.0-Beta5")
}

// gradle.properties: kotlin.incremental=false

// ── Full end-to-end example ──────────────────────────────────────────
import org.jetbrains.kotlinx.dataframe.api.*

// Step 1: read — the plugin infers the schema from CSV headers + content
val df = DataFrame.readCsv(
    "https://raw.githubusercontent.com/Kotlin/dataframe/master/data/jetbrains_repositories.csv"
)

// Step 2: rename/clean — new schema propagated; "stars" property available below
val repos = df
    .renameToCamelCase()
    .rename { stargazersCount }.into("stars")

// Step 3: filter — typed access to "stars" (Int), no cast needed
val popular = repos.filter { stars > 50 }

// Step 4: transform — convert the "topics" String column to List<String>
val enriched = popular.convert { topics }.with { raw ->
    val inner = raw.removeSurrounding("[", "]")
    if (inner.isEmpty()) emptyList() else inner.split(',').map(String::trim)
}

// Step 5: add a derived column
val result = enriched.add("topicCount") { topics.size }

// Step 6: persist
result.writeCsv("jetbrains_repositories_enriched.csv")
```

---

Kotlin DataFrame is primarily used for two broad patterns.
The first is **ETL and data engineering pipelines** on the JVM: ingesting raw data from CSV files, REST JSON endpoints, SQL databases, or Parquet/Arrow stores; normalizing schemas (including hierarchical/nested ones via `ColumnGroup` and `FrameColumn`); applying filtering, transformation, deduplication, join, and aggregation chains; and writing cleaned data back to file or database. The library's immutable, functional style makes it straightforward to compose multi-step transformations in a readable, auditable manner without side effects.

The second major pattern is **interactive data exploration in Kotlin Notebook or Datalore**, where the Jupyter integration (`%use dataframe`) and the compiler plugin combine to give a pandas-like workflow with full Kotlin type safety. Analysts can read data with auto-detected schemas, explore structure via `describe()`, filter and pivot with DSL expressions, visualise results with Kandy, and export processed frames — all while benefiting from IDE autocompletion and null safety. The `@DataSchema` annotation enables teams to define shared typed contracts for DataFrames that cross service boundaries or are read from external sources, making the library suitable for production data services as well as exploratory notebooks.
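The ETL pattern described above can be sketched as a minimal pipeline using only operations covered in this document; the file names and columns here (`orders_raw.csv`, `customerId`, `quantity`) are illustrative assumptions, not part of any real dataset:

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*
import org.jetbrains.kotlinx.dataframe.io.*
import java.io.File

// Ingest → normalize → filter → deduplicate (each step returns a new DataFrame)
val orders = DataFrame.readCsv(File("orders_raw.csv")) // hypothetical input file
    .renameToCamelCase()                               // normalize header style
    .filter { "quantity"<Int>() > 0 }                  // drop empty orders
    .distinct()                                        // remove duplicate rows

// Aggregate per customer, then persist the cleaned result
val perCustomer = orders
    .groupBy("customerId")
    .aggregate {
        count() into "orders"
        sum { "quantity"<Int>() } into "totalQuantity"
    }

perCustomer.writeCsv(File("orders_by_customer.csv"))
```

Because every step is a pure transformation, intermediate frames like `orders` can be inspected (e.g. with `describe()`) or reused without affecting the rest of the pipeline.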