### Example of pandas groupby head

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/comparisons.md

This pandas example demonstrates how to get the first N rows for each group using `groupby().head()`.

```python
df.groupby('grp').head(2)
```

--------------------------------

### Install Query.jl Package

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/querying_frameworks.md

Use this command to install the Query.jl package using the Julia package manager.

```julia
using Pkg
Pkg.add("Query")
```

--------------------------------

### Example Doctest

Source: https://github.com/juliadata/dataframes.jl/blob/main/CONTRIBUTING.md

Doctests are examples written within docstrings that can be used as test cases. They need to match an interactive REPL, including the `julia>` prompt. Add the header `# Examples` above doctests.

```jldoctest
julia> uppercase("Docstring test")
"DOCSTRING TEST"
```

--------------------------------

### Setup a DataFrame

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/basics.md

Initialize a DataFrame with sample integer data for demonstration purposes. This setup is required before performing manipulation or indexing operations.

```julia
df = DataFrame(x = 1:3, y = 4:6, z = 7:9) # define data frame
```

--------------------------------

### Install DataFramesMeta.jl Package

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/querying_frameworks.md

Use `Pkg.add` to install the DataFramesMeta.jl package. This is the first step before using its features.

```julia
using Pkg
Pkg.add("DataFramesMeta")
```

--------------------------------

### Get Column Vector by Copying

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/basics.md

Examples of how to get a column as a vector with a copy of the data.
```julia
german[:, :Age]
```

```julia
german[:, "Age"]
```

```julia
german[:, 2]
```

--------------------------------

### Install TidierData.jl Package

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/querying_frameworks.md

Use `Pkg.add` to install the TidierData.jl package.

```julia
using Pkg
Pkg.add("TidierData")
```

--------------------------------

### Install CSV.jl Package

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/importing_and_exporting.md

Install the CSV.jl package using the Pkg manager if it's not already installed.

```julia
using Pkg
Pkg.add("CSV")
```

--------------------------------

### Install CSV.jl Package

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/basics.md

Use `Pkg.add` to install the CSV.jl package. This is a prerequisite for reading CSV files.

```julia
using Pkg
Pkg.add("CSV")
```

--------------------------------

### Install DataFrameMacros.jl

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/querying_frameworks.md

Use `Pkg.add` to install the DataFrameMacros.jl package.

```julia
using Pkg
Pkg.add("DataFrameMacros")
```

--------------------------------

### Create a basic string vector

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/categorical.md

This example shows a naive string vector representation. Use CategoricalArrays for more efficient storage.

```jldoctest
julia> v = ["Group A", "Group A", "Group A", "Group B", "Group B", "Group B"]
6-element Vector{String}:
 "Group A"
 "Group A"
 "Group A"
 "Group B"
 "Group B"
 "Group B"
```

--------------------------------

### Get DataFrame with Copied Columns

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/basics.md

Examples of how to get a DataFrame with copied columns using various selectors.
```julia
german[:, 1:2]
```

```julia
german[:, [:id, :Age]]
```

```julia
german[:, ["id", "Age"]]
```

--------------------------------

### Example of pandas aggregation returning list

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/comparisons.md

This pandas example shows how to use `.agg()` to return a list of values (min and max) for a column.

```python
df[['x']].agg(lambda x: [min(x), max(x)])
```

--------------------------------

### Example of pandas aggregate multiple columns

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/comparisons.md

This pandas example shows how to use `.agg()` to apply different functions to different columns.

```python
df.agg({'x': max, 'y': min})
```

--------------------------------

### Example of pandas join with grouped aggregation and column selection

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/comparisons.md

This pandas example demonstrates joining aggregated data and then selecting specific columns from the result.

```python
df.join(df.groupby('grp')['x'].mean(), on='grp', rsuffix='_mean')[['grp', 'x_mean']]
```

--------------------------------

### Example of pandas groupby aggregation with rename

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/comparisons.md

This pandas example shows how to group by a column, calculate the mean, and rename the resulting series.

```python
df.groupby('grp')['x'].mean().rename("my_mean")
```

--------------------------------

### Example of pandas groupby aggregation

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/comparisons.md

This is a pandas example demonstrating how to group by a column and calculate the mean of another column.
```python
df.groupby('grp')['x'].mean()
```

--------------------------------

### Get DataFrame with Reused Columns

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/basics.md

Examples of how to get a DataFrame with reused columns (no copy) using various selectors.

```julia
german[!, 1:2]
```

```julia
german[!, [:id, :Age]]
```

```julia
german[!, ["id", "Age"]]
```

--------------------------------

### Install DataFrames.jl Package

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/getting_started.md

Use this command to add the DataFrames package to your Julia environment.

```julia
using Pkg
Pkg.add("DataFrames")
```

--------------------------------

### Example of pandas join with grouped aggregation

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/comparisons.md

This pandas example illustrates joining the original DataFrame with the result of a grouped aggregation.

```python
df.join(df.groupby('grp')['x'].mean(), on='grp', rsuffix='_mean')
```

--------------------------------

### Combine with Multiple Operations

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/split_apply_combine.md

Shows how to use `combine` to apply multiple functions to grouped data. This example calculates the correlation between SepalLength and SepalWidth, and the number of rows in each group.

```julia
combine(iris_gdf, 1:2 => cor, nrow)
```

--------------------------------

### Get Type of Basic Operation Pairs

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/basics.md

Demonstrates the types of basic operation pairs created using symbols, strings, or integers.
```julia
julia> typeof(:x => :a)
Pair{Symbol, Symbol}
```

```julia
julia> typeof("x" => "a")
Pair{String, String}
```

```julia
julia> typeof(1 => "a")
Pair{Int64, String}
```

--------------------------------

### Example of pandas row-wise argmax with apply

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/comparisons.md

This pandas example uses `.apply()` with `axis=1` to find the column name corresponding to the maximum value in each row.

```python
df.assign(x_y_argmax = df.apply(lambda v: df.columns[v.argmax()], axis=1))
```

--------------------------------

### Add DataFrames.jl Package

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/basics.md

Use this command to install the DataFrames.jl package using Julia's Pkg manager.

```julia
using Pkg
Pkg.add("DataFrames")
```

--------------------------------

### Combine with Do Block for Grouped Statistics

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/split_apply_combine.md

Demonstrates the `do` block form of the `combine` function for applying operations to grouped data. This example calculates the mean and variance of PetalLength for each species.

```julia
combine(iris_gdf) do df
    (m = mean(df.PetalLength), s² = var(df.PetalLength))
end
```

--------------------------------

### Construct DataFrame Column by Column

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/getting_started.md

Start with an empty DataFrame and add columns one by one. Use `df.ColumnName = data` or `df[:, :ColumnName] = data` to add or replace columns.
```jldoctest
julia> df = DataFrame()
0×0 DataFrame

julia> df.A = 1:8
1:8

julia> df[:, :B] = ["M", "F", "F", "M", "F", "M", "M", "F"]
8-element Vector{String}:
 "M"
 "F"
 "F"
 "M"
 "F"
 "M"
 "M"
 "F"

julia> df[!, :C] .= 0
8-element Vector{Int64}:
 0
 0
 0
 0
 0
 0
 0
 0

julia> df
8×3 DataFrame
 Row │ A      B       C
     │ Int64  String  Int64
─────┼──────────────────────
   1 │     1  M           0
   2 │     2  F           0
   3 │     3  F           0
   4 │     4  M           0
   5 │     5  F           0
   6 │     6  M           0
   7 │     7  M           0
   8 │     8  F           0
```

```jldoctest
julia> size(df, 1)
8
```

```jldoctest
julia> size(df, 2)
3
```

```jldoctest
julia> size(df)
(8, 3)
```

--------------------------------

### Iterating Over Query Results

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/querying_frameworks.md

This example shows how to loop through the iterator returned by a Query.jl query using a standard Julia `for` loop to process the results.

```jldoctest
julia> total_children = 0
0

julia> for i in q2
           global total_children += i.number_of_children
       end

julia> total_children
4
```

--------------------------------

### Example of pandas groupby output

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/comparisons.md

This is the output of a pandas `groupby().mean()` operation, showing a Series with the group keys and the aggregated values.

```python
grp
1    4
2    3
Name: x, dtype: int64
```

--------------------------------

### Manage Column Metadata

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/lib/metadata.md

Illustrates adding and retrieving metadata for specific columns. Use `colmetadata!` to add, `colmetadatakeys` to list keys, and `colmetadata` to get values. `emptycolmetadata!` removes all column metadata.
```jldoctest
julia> colmetadatakeys(df)
()
```

```jldoctest
julia> colmetadata!(df, :name, "label", "First and last name of a player", style=:note);
```

```jldoctest
julia> colmetadata!(df, :date, "label", "Rating date in yyyy-u format", style=:note);
```

```jldoctest
julia> colmetadata!(df, :rating, "label", "ELO rating in classical time control", style=:note);
```

```jldoctest
julia> "label" in colmetadatakeys(df, :rating)
true
```

```jldoctest
julia> colmetadata(df, :rating, "label")
"ELO rating in classical time control"
```

```jldoctest
julia> colmetadata(df, :rating, "label", style=true)
("ELO rating in classical time control", :note)
```

```jldoctest
julia> collect(colmetadatakeys(df))
3-element Vector{Pair{Symbol, Base.KeySet{String, Dict{String, Tuple{Any, Any}}}}}:
   :date => ["label"]
 :rating => ["label"]
   :name => ["label"]
```

```jldoctest
julia> [only(names(df, col)) => [key => colmetadata(df, col, key) for key in metakeys] for (col, metakeys) in colmetadatakeys(df)]
3-element Vector{Pair{String, Vector{Pair{String, String}}}}:
   "date" => ["label" => "Rating date in yyyy-u format"]
 "rating" => ["label" => "ELO rating in classical time control"]
   "name" => ["label" => "First and last name of a player"]
```

```jldoctest
julia> emptycolmetadata!(df);

julia> colmetadatakeys(df)
()
```

--------------------------------

### Subset rows and select columns with DataFramesMeta.jl

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/querying_frameworks.md

This example shows how to subset rows based on a condition and select specific columns, renaming one during the process.
```julia
using DataFramesMeta

df = DataFrame(name=["John", "Sally", "Roger"],
               age=[54.0, 34.0, 79.0],
               children=[0, 2, 4])

@chain df begin
    @rsubset :age > 40
    @select(:number_of_children = :children, :name)
end
```

--------------------------------

### Create and Display a DataFrame

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/working_with_dataframes.md

Demonstrates creating a DataFrame and its default printing behavior, which shows a sample of rows and columns. Requires the DataFrames package.

```julia
using DataFrames
df = DataFrame(A=1:2:1000, B=repeat(1:10, inner=50), C=1:500)
```

--------------------------------

### Example of DataFrames.jl groupby output

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/comparisons.md

This is the output of a DataFrames.jl `combine(groupby(df, :grp), :x => mean)` operation, showing a DataFrame with the group keys and the aggregated mean values.

```julia
2×2 DataFrame
 Row │ grp    x_mean
     │ Int64  Float64
─────┼────────────────
   1 │     1      4.0
   2 │     2      3.0
```

--------------------------------

### Manage DataFrame Metadata

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/lib/metadata.md

Demonstrates adding, checking for, and retrieving DataFrame-level metadata. Use `metadata!` to add, `metadatakeys` to list keys, and `metadata` to get values. `emptymetadata!` removes all metadata.
```jldoctest
julia> metadatakeys(df)
()
```

```jldoctest
julia> metadata!(df, "caption", "ELO ratings of chess players", style=:note);

julia> collect(metadatakeys(df))
1-element Vector{String}:
 "caption"
```

```jldoctest
julia> "caption" in metadatakeys(df)
true
```

```jldoctest
julia> metadata(df, "caption")
"ELO ratings of chess players"
```

```jldoctest
julia> metadata(df, "caption", style=true)
("ELO ratings of chess players", :note)
```

```jldoctest
julia> emptymetadata!(df);

julia> metadatakeys(df)
()
```

--------------------------------

### Stack DataFrame for aggregation

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/reshaping_and_pivoting.md

Use `stack` to prepare data for aggregation. This example stacks the `iris` DataFrame, excluding the `:Species` column, to facilitate split-apply-combine operations.

```jldoctest
julia> using Statistics

julia> d = stack(iris, Not(:Species))
750×3 DataFrame
 Row │ Species      variable     value
     │ String15     String       Float64
─────┼──────────────────────────────────────
   1 │ Iris-setosa  SepalLength      5.1
   2 │ Iris-setosa  SepalLength      4.9
   3 │ Iris-setosa  SepalLength      4.7
   4 │ Iris-setosa  SepalLength      4.6
   5 │ Iris-setosa  SepalLength      5.0
   6 │ Iris-setosa  SepalLength      5.4
   7 │ Iris-setosa  SepalLength      4.6
   8 │ Iris-setosa  SepalLength      5.0
```

--------------------------------

### Get Column Vector Without Copying

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/basics.md

Examples of how to get a column as a vector without copying the data, which is more memory-efficient.

```julia
german.Age
```

```julia
german."Age"
```

```julia
german[!, :Age]
```

```julia
german[!, "Age"]
```

```julia
german[!, 2]
```

--------------------------------

### Combined Query with Filtering and Selection

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/querying_frameworks.md

This example shows a Query.jl query that combines filtering and selection, and collects the results into a Vector.
It demonstrates selecting a single value per row.

```jldoctest
julia> q3 = @from i in df begin
            @where i.age > 40 && i.children > 0
            @select i.name
            @collect
       end
1-element Vector{String}:
 "Roger"
```

--------------------------------

### Create and Combine DataFrame

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/basics.md

Demonstrates creating a DataFrame and then using the `combine` function to aggregate a column.

```julia
julia> df = DataFrame(a = [1, 2, 3], b = [4, 5, 4])
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │     2      5
   3 │     3      4

julia> combine(df, :a => sum)
1×1 DataFrame
 Row │ a_sum
     │ Int64
─────┼───────
   1 │     6
```

--------------------------------

### Get DataFrame Dimensions with size()

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/basics.md

Use the `size` function to get the dimensions (rows, columns) of a DataFrame. You can also specify a dimension to get only the number of rows or columns.

```jldoctest
julia> german = copy(german_ref);

julia> size(german)
(1000, 10)

julia> size(german, 1)
1000

julia> size(german, 2)
10
```

--------------------------------

### Display All Rows and Columns of a DataFrame

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/working_with_dataframes.md

Shows how to use the `show` function with `allrows=true` and `allcols=true` to display all rows and columns of a DataFrame, respectively. This is useful when the default sample is insufficient.

```julia
show(df, allrows=true)
show(df, allcols=true)
```

--------------------------------

### Create DataFrame and GroupedDataFrame

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/split_apply_combine.md

Initializes a sample DataFrame and groups it by 'customer_id' for subsequent operations.
```julia
julia> df = DataFrame(customer_id=["a", "b", "b", "b", "c", "c"],
                      transaction_id=[12, 15, 19, 17, 13, 11],
                      volume=[2, 3, 1, 4, 5, 9])
6×3 DataFrame
 Row │ customer_id  transaction_id  volume
     │ String       Int64           Int64
─────┼─────────────────────────────────────
   1 │ a                        12       2
   2 │ b                        15       3
   3 │ b                        19       1
   4 │ b                        17       4
   5 │ c                        13       5
   6 │ c                        11       9

julia> gdf = groupby(df, :customer_id, sort=true);

julia> show(gdf, allgroups=true)
GroupedDataFrame with 3 groups based on key: customer_id
Group 1 (1 row): customer_id = "a"
 Row │ customer_id  transaction_id  volume
     │ String       Int64           Int64
─────┼─────────────────────────────────────
   1 │ a                        12       2
Group 2 (3 rows): customer_id = "b"
 Row │ customer_id  transaction_id  volume
     │ String       Int64           Int64
─────┼─────────────────────────────────────
   1 │ b                        15       3
   2 │ b                        19       1
   3 │ b                        17       4
Group 3 (2 rows): customer_id = "c"
 Row │ customer_id  transaction_id  volume
     │ String       Int64           Int64
─────┼─────────────────────────────────────
   1 │ c                        13       5
   2 │ c                        11       9
```

--------------------------------

### Create DataFrame with Temperature Data in Julia

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/basics.md

Initializes a DataFrame with time and temperature readings from three different locations.

```julia
julia> df = DataFrame(Time = 1:4,
                      Temperature1 = [20, 23, 25, 28],
                      Temperature2 = [33, 37, 41, 44],
                      Temperature3 = [15, 10, 4, 0])
4×4 DataFrame
 Row │ Time   Temperature1  Temperature2  Temperature3
     │ Int64  Int64         Int64         Int64
─────┼─────────────────────────────────────────────────
   1 │     1            20            33            15
   2 │     2            23            37            10
   3 │     3            25            41             4
   4 │     4            28            44             0
```

--------------------------------

### Example of pandas mean aggregation on multiple columns

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/comparisons.md

This pandas example calculates the mean for multiple specified columns.
```python
df[['x', 'y']].mean()
```

--------------------------------

### Create and Initialize DataFrame

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/lib/metadata.md

Initializes a DataFrame with sample data for names, dates, and ratings. This serves as the base for metadata operations.

```jldoctest
julia> using DataFrames

julia> df = DataFrame(name=["Jan Krzysztof Duda", "Jan Krzysztof Duda",
                            "Radosław Wojtaszek", "Radosław Wojtaszek"],
                      date=["2022-Jun", "2021-Jun", "2022-Jun", "2021-Jun"],
                      rating=[2750, 2729, 2708, 2687])
4×3 DataFrame
 Row │ name                date      rating
     │ String              String    Int64
─────┼──────────────────────────────────────
   1 │ Jan Krzysztof Duda  2022-Jun    2750
   2 │ Jan Krzysztof Duda  2021-Jun    2729
   3 │ Radosław Wojtaszek  2022-Jun    2708
   4 │ Radosław Wojtaszek  2021-Jun    2687
```

--------------------------------

### Get Group Indices as a Vector

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/split_apply_combine.md

The `groupindices` function can be called directly on a `GroupedDataFrame` to get a vector of group indices for each row.

```jldoctest
julia> groupindices(gdf)
6-element Vector{Union{Missing, Int64}}:
 1
 2
 2
 2
 3
 3
```

--------------------------------

### Compare DataFrame column selection with Vector indexing

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/working_with_dataframes.md

Demonstrates the difference between selecting a single column using `select` (returns DataFrame) and using standard indexing `[:, :column_name]` (returns Vector).

```julia
julia> df[:, :x1]
2-element Vector{Int64}:
 1
 2
```

--------------------------------

### Initialize DataFrame with Columns and Scalar Broadcasting

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/basics.md

Create a DataFrame by providing column names and their corresponding data. Scalars are automatically broadcasted to fill all rows.
```jldoctest
julia> DataFrame(A=1:3, B=5:7, fixed=1)
3×3 DataFrame
 Row │ A      B      fixed
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      5      1
   2 │     2      6      1
   3 │     3      7      1
```

--------------------------------

### Create Sample DataFrames in Python (pandas)

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/comparisons.md

Creates two sample DataFrames in Python using the pandas and numpy libraries. Note that pandas supports multi-index, so the example data frame is set up with 'a' to 'f' as row indices rather than a separate 'id' column.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'grp': [1, 2, 1, 2, 1, 2],
                   'x': range(6, 0, -1),
                   'y': range(4, 10),
                   'z': [3, 4, 5, 6, 7, None]},
                  index = list('abcdef'))
df2 = pd.DataFrame({'grp': [1, 3], 'w': [10, 11]})
```

--------------------------------

### Get Total Number of Rows (Regular Function)

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/split_apply_combine.md

Demonstrates the use of `nrow` as a regular function to get the total number of rows in a DataFrame.

```julia
julia> nrow(df)
6
```

--------------------------------

### Performance Comparison: Indexing vs. View Creation

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/basics.md

Benchmarks demonstrate that creating a view is significantly faster and allocates less memory than direct indexing for large DataFrame subsets. However, views share memory with the parent DataFrame.

```julia
julia> using BenchmarkTools

julia> @btime $german[1:end-1, 1:end-1];
  9.900 μs (44 allocations: 57.56 KiB)
```

```julia
julia> @btime @view $german[1:end-1, 1:end-1];
  67.332 ns (2 allocations: 32 bytes)
```

--------------------------------

### Copy Columns vs. Reuse Columns

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/basics.md

Demonstrates the difference between copying columns (`:`) and reusing columns (`!`).
`!` avoids copying, saving memory and improving performance, but can lead to bugs.

```julia
julia> german[:, [:Sex]]
1000×1 DataFrame
  Row │ Sex
      │ String7
──────┼─────────
    1 │ male
    2 │ female
    3 │ male
    4 │ male
    5 │ male
    6 │ male
    7 │ male
    8 │ male
  ⋮   │    ⋮
  994 │ male
  995 │ male
  996 │ female
  997 │ male
  998 │ male
  999 │ male
 1000 │ male
985 rows omitted
```

```julia
julia> german[!, [:Sex]]
1000×1 DataFrame
  Row │ Sex
      │ String7
──────┼─────────
    1 │ male
    2 │ female
    3 │ male
    4 │ male
    5 │ male
    6 │ male
    7 │ male
    8 │ male
  ⋮   │    ⋮
  994 │ male
  995 │ male
  996 │ female
  997 │ male
  998 │ male
  999 │ male
 1000 │ male
985 rows omitted
```

--------------------------------

### Get Number of Rows/Columns with nrow() and ncol()

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/basics.md

The `nrow` and `ncol` functions provide a direct way to get the number of rows and columns in a DataFrame, respectively.

```jldoctest
julia> nrow(german)
1000

julia> ncol(german)
10
```

--------------------------------

### Combine with Custom Output Column Names

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/split_apply_combine.md

Illustrates using `combine` to apply a function that returns multiple values, aliasing them to new column names. This example finds the minimum and maximum PetalLength for each species.

```julia
combine(iris_gdf, :PetalLength => (x -> [extrema(x)]) => [:min, :max])
```

--------------------------------

### Example of pandas assign with correlation

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/comparisons.md

This pandas example uses `.assign()` to add a new column calculated using the correlation between two existing columns.
```python
df.assign(x_y_cor = np.corrcoef(df.x, df.y)[0, 1])
```

--------------------------------

### Example of pandas complex function aggregation

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/comparisons.md

This pandas example uses `.agg()` with a lambda function to apply a complex operation (mean of cosine) to a column.

```python
df[['z']].agg(lambda v: np.mean(np.cos(v)))
```

--------------------------------

### Create DataFrame from a Dictionary with Symbol Keys

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/basics.md

Initialize a DataFrame from a dictionary where keys are Symbols representing column names. Using Symbols is generally faster than strings.

```jldoctest
julia> dict = Dict(:customer_age => [15, 20, 25],
                   :first_name => ["Rohit", "Rahul", "Akshat"])
Dict{Symbol, Vector} with 2 entries:
  :customer_age => [15, 20, 25]
  :first_name   => ["Rohit", "Rahul", "Akshat"]

julia> DataFrame(dict)
3×2 DataFrame
 Row │ customer_age  first_name
     │ Int64         String
─────┼──────────────────────────
   1 │           15  Rohit
   2 │           20  Rahul
   3 │           25  Akshat
```

--------------------------------

### Example of pandas mean aggregation on columns matching regex

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/comparisons.md

This pandas example calculates the mean for columns whose names match a given regular expression.

```python
df.filter(regex=("^x")).mean()
```

--------------------------------

### Select Rows and All Columns

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/basics.md

Selects the first 5 rows and all columns. The colon `:` indicates all items.
```julia
julia> german[1:5, :]
5×10 DataFrame
 Row │ id     Age    Sex      Job    Housing  Saving accounts  Checking accoun ⋯
     │ Int64  Int64  String7  Int64  String7  String15         String15        ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │     0     67  male         2  own      NA               little          ⋯
   2 │     1     22  female       2  own      little           moderate
   3 │     2     49  male         1  own      little           NA
   4 │     3     45  male         2  free     little           little
   5 │     4     53  male         2  free     little           little          ⋯
                                                               4 columns omitted
```

--------------------------------

### Create and Filter DataFrame with TidierData.jl

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/querying_frameworks.md

Demonstrates creating a DataFrame and then filtering and selecting columns using TidierData.jl's `@chain`, `@filter`, and `@select` macros.

```jldoctest tidierdata
julia> using TidierData

julia> df = DataFrame(
           name = ["John", "Sally", "Roger"],
           age = [54.0, 34.0, 79.0],
           children = [0, 2, 4]
       )
3×3 DataFrame
 Row │ name    age      children
     │ String  Float64  Int64
─────┼───────────────────────────
   1 │ John       54.0         0
   2 │ Sally      34.0         2
   3 │ Roger      79.0         4

julia> @chain df begin
           @filter(children != 2)
           @select(name, num_children = children)
       end
2×2 DataFrame
 Row │ name    num_children
     │ String  Int64
─────┼──────────────────────
   1 │ John               0
   2 │ Roger              4
```

--------------------------------

### Example of pandas row-wise operation with apply

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/comparisons.md

This pandas example uses `.apply()` with `axis=1` to perform an operation (finding the minimum) row by row across specified columns.

```python
df.assign(x_y_min = df.apply(lambda v: min(v.x, v.y), axis=1))
```

--------------------------------

### Initialize data.table objects in R

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/comparisons.md

Initializes two data.table objects in R for demonstration purposes.
```R
library(data.table)
df <- data.table(grp = rep(1:2, 3), x = 6:1, y = 4:9,
                 z = c(3:7, NA), id = letters[1:6])
df2 <- data.table(grp=c(1,3), w = c(10,11))
```

--------------------------------

### Safe Group Retrieval with get

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/lib/indexing.md

Use the `get` function to retrieve a group by its key (Tuple or NamedTuple), providing a default value if the key does not exist. This prevents errors for missing keys.

```julia
get(gd, key::Union{Tuple, NamedTuple}, default)
```

--------------------------------

### Basic Query with Filtering and Projection

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/querying_frameworks.md

This example demonstrates a basic Query.jl query that filters rows based on age, selects specific columns while renaming one, and collects the results into a new DataFrame.

```jldoctest
julia> using DataFrames, Query

julia> df = DataFrame(name=["John", "Sally", "Roger"],
                      age=[54.0, 34.0, 79.0],
                      children=[0, 2, 4])
3×3 DataFrame
 Row │ name    age      children
     │ String  Float64  Int64
─────┼───────────────────────────
   1 │ John       54.0         0
   2 │ Sally      34.0         2
   3 │ Roger      79.0         4

julia> q1 = @from i in df begin
            @where i.age > 40
            @select {number_of_children=i.children, i.name}
            @collect DataFrame
       end
2×2 DataFrame
 Row │ number_of_children  name
     │ Int64               String
─────┼────────────────────────────
   1 │                  0  John
   2 │                  4  Roger
```

--------------------------------

### Get Row Count of Grouped DataFrames

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/split_apply_combine.md

Demonstrates how to get the number of rows for each group in a GroupedDataFrame using `map(nrow, ...)` and the broadcast operator `nrow.(...)`. These methods are suitable for iterating over groups.
```julia
julia> map(nrow, sdf_vec)
3-element Vector{Int64}:
 50
 50
 50
```

```julia
julia> nrow.(sdf_vec)
3-element Vector{Int64}:
 50
 50
 50
```

--------------------------------

### Get All Indices with `eachindex`

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/split_apply_combine.md

The `eachindex` function, when applied to a vector or array, returns a sequence of all indices.

```jldoctest
julia> collect(eachindex(df.customer_id))
6-element Vector{Int64}:
 1
 2
 3
 4
 5
 6
```

--------------------------------

### Get Single Cell from DataFrame

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/basics.md

Use matrix-like indexing to retrieve a single cell's value from a DataFrame.

```jldoctest
julia> german[4, 4]
2
```

--------------------------------

### Get Group Indices with `combine`

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/split_apply_combine.md

Use `combine` with `groupindices` to get a column holding the group number of each group; the result has one row per group. This operation is column-independent.

```jldoctest
julia> combine(gdf, groupindices)
3×2 DataFrame
 Row │ customer_id  groupindices
     │ String       Int64
─────┼───────────────────────────
   1 │ a                       1
   2 │ b                       2
   3 │ c                       3
```

--------------------------------

### Nested Pair Creation and Access in Julia

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/basics.md

Demonstrates the creation of nested pairs in Julia and how to access their elements. Accessing an index beyond the pair's structure results in a BoundsError.
```julia
julia> p = :x => :y => :z
:x => (:y => :z)
```

```julia
julia> p[1]
:x
```

```julia
julia> p[2]
:y => :z
```

```julia
julia> p[2][1]
:y
```

```julia
julia> p[2][2]
:z
```

```julia
julia> p[3] # there is no index 3 for a pair
ERROR: BoundsError: attempt to access Pair{Symbol, Pair{Symbol, Symbol}} at index [3]
```

--------------------------------

### Retrieve levels from a CategoricalArray

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/categorical.md

Uses the `levels` function to get the unique categories present in a CategoricalArray. The order of levels is maintained.

```jldoctest
julia> levels(cv)
2-element CategoricalArray{String,1,UInt32}:
 "Group A"
 "Group B"
```

--------------------------------

### Construct DataFrame with Keyword Arguments

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/getting_started.md

Use keyword arguments to construct a DataFrame where each argument represents a column. This is a common and straightforward method.

```julia
using DataFrames
DataFrame(a=1:4, b=["M", "F", "F", "M"]) # keyword argument constructor
```

--------------------------------

### Check DataFrames.jl Package Status

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/basics.md

Use the `status` command in Pkg REPL mode to view the version of DataFrames.jl that is currently installed.

```julia
]

(@v1.9) pkg> status DataFrames
Status `~\v1.13\Project.toml`
  [a93c6f00] DataFrames v1.8.0
```

--------------------------------

### Split-Apply-Combine with DataFramesMeta.jl

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/querying_frameworks.md

Demonstrates the split-apply-combine pattern using `@rsubset`, `@by`, and `@select` for calculating ranges within groups.
```julia
using DataFramesMeta

df = DataFrame(key=repeat(1:3, 4), value=1:12)

@chain df begin
    @rsubset :value > 3
    @by(:key, :min = minimum(:value), :max = maximum(:value))
    @select(:key, :range = :max - :min)
end
```

--------------------------------

### Create DataFrame from a Matrix with Auto-Generated Column Names

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/basics.md

Initialize a DataFrame from a matrix. Use `:auto` as the second argument to automatically generate column names like `x1`, `x2`, etc.

```jldoctest
julia> DataFrame([1 0; 2 0], :auto)
2×2 DataFrame
 Row │ x1     x2
```

--------------------------------

### Get Element Types of DataFrame Columns

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/basics.md

Iterate over columns using `eachcol` and broadcast `eltype` to find the element type of each column.

```jldoctest
julia> eltype.(eachcol(german))
10-element Vector{DataType}:
 Int64
 Int64
 String7
 Int64
 String7
 String15
 String15
 Int64
 Int64
 String31
```

--------------------------------

### View First Rows of a DataFrame

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/working_with_dataframes.md

Use the `first` function to view a given number of rows from the start of a DataFrame. This is helpful for quickly inspecting the beginning of the data.

```julia
first(df, 6)
```

--------------------------------

### Get Group Number

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/split_apply_combine.md

Use `groupindices` with `combine` to return the group number of each group. This can be helpful for tracking which group a row belongs to after transformations.

```julia
combine(grouped_df, groupindices)
```

--------------------------------

### Get DataFrame Column Names as Strings

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/basics.md

The `names` function returns a vector of column names as `String`s. Passing a type as the second argument keeps only the columns whose element type is a subtype of it.
```jldoctest
julia> names(german)
10-element Vector{String}:
 "id"
 "Age"
 "Sex"
 "Job"
 "Housing"
 "Saving accounts"
 "Checking account"
 "Credit amount"
 "Duration"
 "Purpose"
```

```jldoctest
julia> names(german, AbstractString)
5-element Vector{String}:
 "Sex"
 "Housing"
 "Saving accounts"
 "Checking account"
 "Purpose"
```

--------------------------------

### DataFrame Construction

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/lib/functions.md

Functions for creating and initializing DataFrames.

```APIDOC
## Constructing data frames

### `allcombinations`
Creates all combinations of elements from input iterables.

### `copy`
Creates a copy of a DataFrame.

### `similar`
Creates a new DataFrame with the same structure but uninitialized data.
```

--------------------------------

### Extracting Data with Comprehension

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/querying_frameworks.md

This example uses a Julia comprehension to extract specific data from the iterator returned by a Query.jl query, applying a condition.

```jldoctest
julia> y = [i.name for i in q2 if i.number_of_children > 0]
1-element Vector{String}:
 "Roger"
```

--------------------------------

### Test DataFrames.jl Package

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/basics.md

Run this command to execute the tests for DataFrames.jl. Be aware that this process can take over 30 minutes to complete.

```julia
using Pkg

Pkg.test("DataFrames") # Warning! This will take more than 30 minutes.
```

--------------------------------

### Custom Number Formatting for DataFrames

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/customizing_output.md

Define a custom function to format numbers in a DataFrame. This example formats negative numbers by enclosing them in parentheses.
```julia
function parentheses_fmt(v, i, j)
    !(v isa Number) && return v
    v < 0 && return "($(-v))"
    return v
end
```

```julia
show(df; formatters = [parentheses_fmt])
```

--------------------------------

### Customize DataFrame Display with show()

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/basics.md

Manually call the `show` function to control how a DataFrame is displayed. Use `allrows=true` to show all rows and `allcols=true` to show all columns, regardless of screen size.

```jldoctest
julia> show(german, allcols=true)
1000×10 DataFrame
  Row │ id     Age    Sex      Job    Housing  Saving accounts  Checking account  Credit amount  Duration  Purpose
      │ Int64  Int64  String7  Int64  String7  String15         String15          Int64          Int64     String31
──────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
    1 │     0     67  male         2  own      NA               little                     1169         6  radio/TV
    2 │     1     22  female       2  own      little           moderate                   5951        48  radio/TV
    3 │     2     49  male         1  own      little           NA                         2096        12  education
    4 │     3     45  male         2  free     little           little                     7882        42  furniture/equipment
    5 │     4     53  male         2  free     little           little                     4870        24  car
    6 │     5     35  male         1  free     NA               NA                         9055        36  education
    7 │     6     53  male         2  own      quite rich       NA                         2835        24  furniture/equipment
    8 │     7     35  male         3  rent     little           moderate                   6948        36  car
  ⋮   │   ⋮      ⋮       ⋮       ⋮       ⋮            ⋮                ⋮                ⋮           ⋮               ⋮
  994 │   993     30  male         3  own      little           little                     3959        36  furniture/equipment
  995 │   994     50  male         2  own      NA               NA                         2390        12  car
  996 │   995     31  female       1  own      little           NA                         1736        12  furniture/equipment
  997 │   996     40  male         3  own      little           little                     3857        30  car
```

--------------------------------

### Get Row Indices within Groups with `combine`

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/split_apply_combine.md

Use `combine` with `eachindex` to add a column with the index of each row within its group. This operation is column-independent.
```jldoctest
julia> combine(gdf, eachindex)
6×2 DataFrame
 Row │ customer_id  eachindex
     │ String       Int64
─────┼────────────────────────
   1 │ a                    1
   2 │ b                    1
   3 │ b                    2
   4 │ b                    3
   5 │ c                    1
   6 │ c                    2
```

--------------------------------

### Select with Grouped Correlation

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/split_apply_combine.md

Shows how `select` can be used with grouped data to apply a function that returns a single value per group, which is then broadcast to all rows in that group. This example calculates the correlation between SepalLength and SepalWidth.

```julia
select(iris_gdf, 1:2 => cor)
```

--------------------------------

### Combine DataFrame Columns with Sum

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/working_with_dataframes.md

Use `combine` with `All() .=> sum` to apply the `sum` function to every column and get a single-row DataFrame of the sums.

```julia
julia> df = DataFrame(A=1:4, B=4.0:-1.0:1.0);

julia> df
4×2 DataFrame
 Row │ A      B
     │ Int64  Float64
─────┼────────────────
   1 │     1      4.0
   2 │     2      3.0
   3 │     3      2.0
   4 │     4      1.0

julia> combine(df, All() .=> sum)
1×2 DataFrame
 Row │ A_sum  B_sum
     │ Int64  Float64
─────┼────────────────
   1 │    10     10.0
```

--------------------------------

### Create DataFrames for Joins

Source: https://github.com/juliadata/dataframes.jl/blob/main/docs/src/man/joins.md

Initializes two DataFrames, `people` and `jobs`, that share an `ID` column to be used as the key in join operations.

```julia
using DataFrames

people = DataFrame(ID=[20, 40], Name=["John Doe", "Jane Doe"])
jobs = DataFrame(ID=[20, 40], Job=["Lawyer", "Doctor"])
```
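
As a minimal sketch of how these two tables might then be combined, the block below applies `innerjoin` on the shared `ID` column. The variable name `joined` is chosen here for illustration and does not come from the source; the tables are redefined so the block runs on its own.

```julia
using DataFrames

# Same sample tables as in the setup snippet
people = DataFrame(ID=[20, 40], Name=["John Doe", "Jane Doe"])
jobs = DataFrame(ID=[20, 40], Job=["Lawyer", "Doctor"])

# Inner join: keep only rows whose ID value appears in both tables.
# `joined` is an illustrative name, not one used in the source.
joined = innerjoin(people, jobs, on = :ID)

# The result has one ID column plus the non-key columns of each table
names(joined)
```

Because both IDs appear in each table, every row finds a match here. With keys missing from one side, `innerjoin` would drop the unmatched rows, while `leftjoin` or `outerjoin` would keep them and fill the gaps with `missing`.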