# Kotlin DataFrame

Kotlin DataFrame is a typesafe in-memory structured data processing library for the JVM that reconciles Kotlin's static typing with the dynamic nature of data. It provides a functional, immutable approach to data transformation with a readable DSL close to natural language. The library supports hierarchical data structures (JSON, nested objects), offers compile-time type safety through on-the-fly generation of extension properties, and integrates seamlessly with Kotlin Notebook and IDE tooling.

The library features a minimalistic data model with three column kinds: ValueColumn (contains data), ColumnGroup (contains columns), and FrameColumn (contains dataframes). It provides comprehensive I/O support for CSV, JSON, Excel, SQL databases, and Apache Arrow formats, along with powerful transformation operations including filtering, selection, grouping, joining, pivoting, and aggregation. Integration with the Kotlin compiler plugin enables static interpretation of DataFrame operations, automatic schema inference, and full IDE support with autocompletion and refactoring.
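To make the three column kinds concrete, here is a minimal sketch using the library's `columnOf` and `dataFrameOf` constructors (the column names are illustrative) that builds one column of each kind:

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// ValueColumn: holds plain values
val name by columnOf("Alice", "Bob")
val age by columnOf(29, 34)

// ColumnGroup: a column that contains other columns
val person by columnOf(name, age)

// FrameColumn: a column whose cells are themselves dataframes
val tags by columnOf(
    dataFrameOf("tag")("a", "b"),
    dataFrameOf("tag")("c")
)

// Assemble a two-row DataFrame from the columns
val df = dataFrameOf(person, tags)
```

## Reading CSV Files

Read comma-separated value files with customizable options, including delimiter, header handling, type inference, and parser configuration for locale-aware parsing and null value handling.

```kotlin
// Basic CSV reading
val df = DataFrame.readCsv("data.csv")

// With custom options
val df = DataFrame.readCsv(
    file = "data.csv",
    delimiter = '|',
    header = listOf("A", "B", "C", "D"),
    parserOptions = ParserOptions(
        nullStrings = setOf("not assigned"),
        locale = Locale.UK
    ),
    colTypes = mapOf("colName" to ColType.String),
    skipLines = 1,
    readLines = 100
)

// Write back to CSV
df.writeCsv("output.csv")

// To CSV string
val csvStr = df.toCsvStr(delimiter = ';', recordSeparator = System.lineSeparator())
```

## Reading JSON Files

Parse JSON data into DataFrames with automatic schema inference, supporting both flat and nested hierarchical structures, including arrays and objects.

```kotlin
// Read JSON file
val df = DataFrame.readJson("data.json")

// Read a single JSON object as a row
val row = DataRow.readJson("data.json")

// Write to JSON
df.writeJson("output.json")

// To JSON string with formatting
val jsonStr = df.toJson(prettyPrint = true)
```

## Reading Excel Files

Import Excel workbooks with support for multiple sheets, column type specification, and automatic type inference from cell formats.

```kotlin
// Basic Excel reading
val df = DataFrame.readExcel("file.xlsx")

// Force specific columns to be read as strings
val df = DataFrame.readExcel("mixed_column.xlsx", stringColumns = StringColumns("A"))

// Write to Excel
df.writeExcel("output.xlsx")

// Multiple sheets: keepFile = true adds a sheet instead of overwriting the file
df.writeExcel(file, sheetName = "allPersons")
df.filter { isHappy }.writeExcel(file, sheetName = "happy", keepFile = true)
```

## Reading from SQL Databases

Execute SQL queries or read entire tables from JDBC-compatible databases, with automatic type mapping and nullability inference. Connection details can be passed as configuration, or an existing JDBC connection (for example, one borrowed from a connection pool) can be reused.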
```kotlin
// Read from SQL table
val df = DataFrame.readSqlTable(
    dbConfig = DbConnectionConfig(
        url = "jdbc:postgresql://localhost:5432/mydb",
        user = "user",
        password = "password"
    ),
    tableName = "users",
    limit = 1000,
    inferNullability = true
)

// Read with custom query
val df = DataFrame.readSqlQuery(
    dbConfig = dbConfig,
    sqlQuery = "SELECT * FROM users WHERE age > 18"
)
```
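When an already-open connection is available, the `java.sql.Connection` overloads can be used instead of `DbConnectionConfig`. A minimal sketch (URL and credentials are placeholders):

```kotlin
import java.sql.DriverManager
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.io.readSqlTable

// Reuse an existing JDBC connection, e.g. one borrowed from a pool
DriverManager.getConnection("jdbc:postgresql://localhost:5432/mydb", "user", "password").use { connection ->
    val users = DataFrame.readSqlTable(connection, "users", limit = 1000)
    println(users.rowsCount())
}
```

## Reading Apache Arrow Files

Import data from Apache Arrow IPC and Feather formats with efficient columnar storage, zero-copy reads, and schema preservation for interoperability with other data systems.

```kotlin
// Read Arrow Feather
val df = DataFrame.readArrowFeather("data.arrow")

// Write Arrow IPC
df.writeArrowIPC("output.arrow")

// Write Arrow Feather
df.writeArrowFeather("output.feather")

// To byte arrays
val ipcBytes = df.saveArrowIPCToByteArray()
val featherBytes = df.saveArrowFeatherToByteArray()

// With custom schema
val schema = Schema.fromJSON(schemaJson)
df.arrowWriter(
    targetSchema = schema,
    mode = ArrowWriter.Mode(restrictWidening = true, strictType = true)
).use { writer ->
    writer.writeArrowFeather(file)
}
```

## Filtering Rows

Select rows matching boolean predicates with type-safe column access, supporting complex conditions with logical operators and null-safe navigation.

```kotlin
// Filter rows based on a condition
df.filter { age > 18 && name.firstName.startsWith("A") }

// Filter with null safety
df.filter { city != null && city.length > 5 }

// Filter a DataColumn
column.filter { it > 10 }

// Filter a ColumnSet
df.select { cols().filter { it.hasNulls() } }
```

## Selecting Columns

Extract specific columns using DSL syntax, column names, type filters, or predicates, supporting hierarchical navigation through nested column groups.

```kotlin
// Select columns using the DSL
df.select { name and age and city }

// Select by column names
df.select("name", "age", "city")

// Select from a column group
df.select { myColGroup.select { colA and colB } }

// Select by type and by name predicate
df.select { colsOf<Double>() and cols { it.name().startsWith("total") } }

// Remove columns
df.remove { age and city }
```

## GroupBy Operations

Partition rows into groups based on key columns and perform aggregations within each group, supporting nested grouping and custom aggregation functions.

```kotlin
// Basic grouping
df.groupBy { city }

// Multiple columns
df.groupBy { city and country }

// Group by column names
df.groupBy("city", "country")

// Group and aggregate
df.groupBy { city }
    .aggregate {
        count() into "count"
        age.mean() into "avgAge"
        income.sum() into "totalIncome"
    }

// Keep only groups with more than 10 rows, then aggregate
df.groupBy { category }
    .filter { group.rowsCount() > 10 }
    .aggregate { values.mean() into "average" }
```

## Join Operations

Combine two DataFrames based on matching columns using various join types (inner, left, right, full, filter, exclude), with support for multiple key columns and custom match conditions.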
```kotlin
// Inner join
df1.innerJoin(df2) { name }

// Left join
df1.leftJoin(df2) { id }

// Join with explicit column matching
df1.join(df2, JoinType.Inner) { leftColumn match rightColumn }

// Multiple column matching
df1.join(df2) { id match id and name match fullName }

// Other join types
df1.rightJoin(df2) { key }
df1.fullJoin(df2) { key }
df1.filterJoin(df2) { key }  // Keep only matched rows from the left
df1.excludeJoin(df2) { key } // Keep only unmatched rows from the left
```

## Converting Column Types

Transform column values with type conversion functions, supporting nullable/non-nullable conversions, custom transformations, and per-row operations with index access.

```kotlin
// Convert with a custom lambda
df.convert { topics }.with {
    val inner = it.removeSurrounding("[", "]")
    if (inner.isEmpty()) emptyList() else inner.split(',').map(String::trim)
}

// Convert to a specific type
df.convert { age }.to<Int>()

// Type-specific conversions
df.convert { price }.toDouble()
df.convert { date }.toLocalDateTime()
df.convert { flag }.toBoolean()

// Convert with access to the row and the column
df.convert { score }.perRowCol { row, col -> col[row.index()] * 2 }

// Convert non-null values only
df.convert { nullable }.notNull { it.toString().uppercase() }
```

## Converting to Data Schema

Transform untyped DataFrames into strongly typed ones using data classes annotated with @DataSchema, enabling type-safe column access and IDE autocompletion.

```kotlin
// Define a data schema
@DataSchema
data class Repositories(
    @ColumnName("full_name") val fullName: String,
    @ColumnName("html_url") val htmlUrl: URL,
    @ColumnName("stargazers_count") val stargazersCount: Int,
    val topics: String
)

// Convert an untyped DataFrame to a typed one
val repos = DataFrame
    .readCsv("https://raw.githubusercontent.com/Kotlin/dataframe/master/data/jetbrains_repositories.csv")
    .convertTo<Repositories>()

// Access with type-safe properties
val filtered = repos
    .filter { stargazersCount > 50 }
    .convert { topics }.with {
        val inner = it.removeSurrounding("[", "]")
        if (inner.isEmpty()) emptyList() else inner.split(',').map(String::trim)
    }
```

## Aggregating Data

Compute summary statistics across rows or within groups using built-in aggregation functions (sum, mean, min, max, std, median, count, countDistinct).

```kotlin
// Count operations
df.count()            // Total rows
df.count { age > 18 } // Conditional count

// Aggregation after groupBy
df.groupBy { city }
    .aggregate {
        count() into "population"
        age.mean() into "avgAge"
        age.min() into "minAge"
        age.max() into "maxAge"
        income.sum() into "totalIncome"
        income.std() into "stdIncome"
    }

// Column aggregations
column.sum()
column.mean()
column.median()
column.min()
column.max()
column.std()
column.countDistinct()
```

## Adding Columns

Create new columns with expressions computed from existing columns, supporting row context with access to previous rows and complex multi-column calculations.

```kotlin
// Add a new column with an expression
df.add("fullName") { firstName + " " + lastName }

// Add with previous row access; newValue() reads the value already
// computed for the column being built
df.add("cumulative") { (prev()?.newValue<Int>() ?: 0) + value }

// Add multiple columns
df.add("topicCount") { topics.size }
    .add("kind") { getKind(fullName, topics) }

// Add with row index
df.add("rowNum") { index() }
```

## Updating Columns

Modify existing column values in place with conditional updates, index-based updates, and null-aware transformations while preserving column types.
```kotlin
// Update column values
df.update { age }.with { it + 1 }

// Update with a condition
df.update { age }.where { age < 18 }.with { 18 }

// Update at specific indices
df.update { score }.at(0, 2, 4).with { it * 2 }

// Update non-null values only
df.update { nullable }.notNull { it.toString().uppercase() }

// Update with row context
df.update { price }.with { price * discountRate }
```

## Pivoting Data

Reshape DataFrames by converting unique values from one column into multiple columns, with support for aggregation, one-hot encoding, and hierarchical pivots.

```kotlin
// Pivot with aggregation
df.pivot { category }
    .groupBy { region }
    .aggregate {
        count() into "count"
        sales.sum() into "totalSales"
    }

// Pivot matches (one-hot encoding)
df.pivotMatches { pclass and sex }

// Pivot counts
df.pivotCounts { category }
```

## Sorting DataFrames

Order rows based on one or more columns in ascending or descending order, supporting null positioning and custom comparators.

```kotlin
// Sort by a single column
df.sortBy { age }

// Sort descending
df.sortByDesc { score }

// Sort by multiple columns
df.sortBy { city and age }

// Mixed ascending/descending
df.sortBy { city and age.desc() }
```

## Handling Null Values

Fill, remove, or replace null values using constants, column statistics, or custom expressions, with per-column strategies and null safety guarantees.

```kotlin
// Fill nulls with a constant
df.fillNulls { age }.with { 0 }

// Fill with the column mean
df.fillNulls { numericColumns }.perCol { it.mean() }

// Fill multiple columns
df.fillNulls { sibsp and parch and age and fare }.perCol { it.mean() }
    .fillNulls { sex }.with { "female" }

// Remove rows with nulls
df.dropNulls()
df.dropNulls { age and city }
```

## Complete ETL Pipeline Example

End-to-end data processing pipeline demonstrating reading, filtering, type conversion, grouping, aggregation, sorting, and writing in a single fluent chain.

```kotlin
@DataSchema
interface Movie {
    val movieId: String
    val title: String
    val genres: String
}

// Read, clean, transform, and analyze movie data
val result = DataFrame
    .readCsv("movies.csv")
    .convertTo<Movie>()
    .split { genres }.by("|").inplace()
    .split { title }.by {
        listOf(
            """\s*\(\d{4}\)\s*$""".toRegex().replace(it, ""),
            "\\d{4}".toRegex().findAll(it).lastOrNull()?.value?.toIntOrNull() ?: -1
        )
    }.into("title", "year")
    .explode("genres")
    .filter { "year"<Int>() >= 0 && genres != "(no genres listed)" }
    .groupBy("year")
    .sortBy("year")
    .pivot("genres", inward = false)
    .aggregate {
        count() into "count"
        mean() into "mean"
    }

result.writeJson("movie_analysis.json")
```

## Machine Learning Feature Engineering

Prepare data for ML models with imputation, one-hot encoding, feature extraction, normalization, and train-test splitting.

```kotlin
val (train, test) = df
    // Impute missing values
    .fillNulls { sibsp and parch and age and fare }.perCol { it.mean() }
    .fillNulls { sex }.with { "female" }
    // One-hot encoding
    .pivotMatches { pclass and sex }
    // Feature extraction
    .select { survived and pclass and sibsp and parch and age and fare and sex }
    // Normalization (pseudocode: scale numeric columns to zero mean and unit variance)
    .convert { numericColumns }.with { (it - mean) / std }
    // Split into train and test sets (custom helper; see the sketch below)
    .shuffle()
    .toTrainTest(0.7) { survived }
```
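The `toTrainTest` call above is not part of the Kotlin DataFrame API. A minimal sketch of such a helper, assuming a plain sequential split after the upstream `shuffle()` (the label selector is accepted only to match the call site and is unused in this non-stratified version):

```kotlin
import org.jetbrains.kotlinx.dataframe.ColumnsSelector
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical helper: split rows into train/test parts by fraction.
fun <T> DataFrame<T>.toTrainTest(
    trainRatio: Double,
    labelSelector: ColumnsSelector<T, *>? = null
): Pair<DataFrame<T>, DataFrame<T>> {
    val trainSize = (rowsCount() * trainRatio).toInt()
    return take(trainSize) to drop(trainSize)
}
```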
Kotlin DataFrame serves as a comprehensive solution for data analysis, ETL pipelines, and data science workflows in Kotlin projects. Its primary use cases include exploratory data analysis in Kotlin Notebook with rich HTML rendering, data preprocessing and feature engineering for machine learning models, ETL pipeline development with type-safe transformations, and report generation from various data sources including SQL databases, CSV files, and APIs. The library's immutable, functional design ensures thread-safe operations suitable for concurrent data processing.

Integration patterns vary by environment. In Kotlin Notebook and Jupyter environments, the `%use dataframe` magic command provides automatic dependency resolution and schema inference. In Gradle projects, add dependencies for the required I/O modules (CSV, JSON, Excel, JDBC, Arrow) and optionally enable the compiler plugin for compile-time type checking and IDE integration. The compiler plugin (requires Kotlin 2.2.20+ and IntelliJ IDEA 2025.2+) provides static interpretation of DataFrame operations, automatic schema updates after each transformation, and full IDE support with autocompletion and refactoring.

For production use, the library supports conversion to and from Kotlin collections and data classes, enabling seamless integration with existing codebases and serialization frameworks.
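A minimal Gradle (Kotlin DSL) sketch of the setup described above; the versions are placeholders and the compiler plugin id is an assumption, so verify both against the current DataFrame documentation:

```kotlin
// build.gradle.kts (sketch; replace versions with the current releases)
plugins {
    kotlin("jvm") version "2.2.20"
    // Compiler plugin for static schema interpretation (Kotlin 2.2.20+);
    // assumed plugin id, check the DataFrame docs
    kotlin("plugin.dataframe") version "2.2.20"
}

dependencies {
    // Umbrella artifact: core API plus the CSV, JSON, Excel, JDBC, and Arrow modules
    implementation("org.jetbrains.kotlinx:dataframe:1.0.0")
    // Or depend on individual I/O modules instead, e.g.:
    // implementation("org.jetbrains.kotlinx:dataframe-excel:1.0.0")
}
```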
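To illustrate the collections interop mentioned above, a small sketch (the `Person` class is illustrative) that converts a list of data class instances to a DataFrame and back:

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

data class Person(val name: String, val age: Int)

fun main() {
    val people = listOf(Person("Alice", 29), Person("Bob", 34))

    // List<Person> -> DataFrame: one column per constructor property
    val df = people.toDataFrame()

    // DataFrame -> List<Person>: columns are matched to properties by name
    val roundTrip: List<Person> = df.toListOf<Person>()

    println(roundTrip == people) // true
}
```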