### Installing from GitHub using devtools

Source: https://github.com/rdatatable/data.table/wiki/Installation

Example of using devtools::install_github to install the data.table package directly from GitHub.

```R
library(devtools)
install_github("Rdatatable/data.table", build_vignettes=FALSE)
```

--------------------------------

### R Installation and Setup Script

Source: https://github.com/rdatatable/data.table/wiki/Amazon-EC2-for-beginners

A startup script for installing R and necessary packages on an Ubuntu EC2 instance.

```bash
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
sudo add-apt-repository 'deb http://cran.stat.ucla.edu/bin/linux/ubuntu trusty/'
sudo apt-get update
sudo apt-get -y install r-base-core
sudo apt-get -y install libcurl4-openssl-dev  # for RCurl which devtools depends on
sudo apt-get -y install htop  # to monitor RAM and CPU
R
options(repos = "http://cran.stat.ucla.edu")
install.packages("devtools")
require(devtools)
```

--------------------------------

### Makevars file for OpenMP on Linux

Source: https://github.com/rdatatable/data.table/wiki/Installation

Example content for a Makevars file to specify GCC version for OpenMP support on Linux.

```text
CC = /usr/bin/gcc64
CXX = /usr/bin/g++
SHLIB_OPENMP_CFLAGS = -fopenmp
```

--------------------------------

### GCC installation via Homebrew

Source: https://github.com/rdatatable/data.table/wiki/Installation

Command to install GCC using Homebrew.

```bash
brew install gcc
```

--------------------------------

### Install LLVM/clang with brew

Source: https://github.com/rdatatable/data.table/wiki/Installation

Command to install LLVM (which includes clang) using Homebrew.

```bash
brew update && brew install llvm
```

--------------------------------

### Install XQuartz

Source: https://github.com/rdatatable/data.table/wiki/Installation

Homebrew command to install XQuartz to resolve library not loaded errors.
```bash
brew install xquartz --cask
```

--------------------------------

### Install Macports clang

Source: https://github.com/rdatatable/data.table/wiki/Installation

Commands to install and select clang-10 from Macports.

```bash
sudo port install clang-10 pkgconfig
sudo port select clang mp-clang-10
```

--------------------------------

### Testing Installation on Windows

Source: https://github.com/rdatatable/data.table/wiki/Installation

R commands to test the data.table installation on Windows.

```R
require(data.table)
test.data.table()
```

--------------------------------

### Install macOS SDK headers

Source: https://github.com/rdatatable/data.table/wiki/Installation

Command to install SDK headers for older macOS versions.

```bash
sudo installer -pkg /Library/Developer/CommandLineTools/Packages/macOS_SDK_headers_for_macOS_10.14.pkg -target /
```

--------------------------------

### Install Command Line Tools

Source: https://github.com/rdatatable/data.table/wiki/Installation

Command to install Xcode command line tools on macOS.

```bash
xcode-select --install
```

--------------------------------

### Install libomp with brew

Source: https://github.com/rdatatable/data.table/wiki/Installation

Command to update brew and install the OpenMP runtime (libomp).

```bash
brew update && brew install libomp
```

--------------------------------

### SSH Connection Example

Source: https://github.com/rdatatable/data.table/wiki/Amazon-EC2-for-beginners

Example command to connect to an EC2 instance via SSH.

```bash
ssh -i mdowle.pem ubuntu@54.67.82.235
```

--------------------------------

### Pandas Installation and Configuration

Source: https://github.com/rdatatable/data.table/wiki/Benchmarks-:-Grouping

Steps to install pandas and view its compiler flags on a Debian-based system.
```bash
wget -O- http://neuro.debian.net/lists/trusty.us-ca.libre | sudo tee /etc/apt/sources.list.d/neurodebian.sources.list
sudo apt-key adv --recv-keys --keyserver hkp://pgp.mit.edu:80 2649A5A9
sudo apt-get update
sudo apt-get install python3-pandas python3-dev
python3-config --cflags
# output:
# -I/usr/include/python3.4m -I/usr/include/python3.4m -Wno-unused-result -Werror=declaration-after-statement -g -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes
```

--------------------------------

### Install pkg-config

Source: https://github.com/rdatatable/data.table/wiki/Installation

Homebrew command to install pkg-config to resolve shared object not found errors.

```bash
brew install pkg-config
```

--------------------------------

### Running a single performance test

Source: https://github.com/rdatatable/data.table/wiki/Performance-testing

Example of how to run a single performance test using atime_versions, including the setup code, the expression to time, and the commits to compare.
```r
vinfo <- atime::atime_versions(
  pkg.path="path/to/data.table",
  pkg.edit.fun = pkg.edit.fun,
  N = 10^seq(1, 7, by=0.5),
  setup = {
    L = as.data.table(as.character(rnorm(N, 1L, 0.5)))
    setkey(L, V1)
  },
  expr = {
    data.table:::`[.data.table`(L, , .SD)
  },
  Slow = "cacdc92df71b777369a217b6c902c687cf35a70d", # Parent of the first commit (https://github.com/Rdatatable/data.table/commit/74636333d7da965a11dad04c322c752a409db098) in the PR that fixes the issue
  Fast = "353dc7a6b66563b61e44b2fa0d7b73a0f97ca461" # Last commit in the PR (https://github.com/Rdatatable/data.table/pull/4501/commits) that fixes the issue
)
plot(vinfo)
refs <- atime::references_best(vinfo)
plot(refs)
pred <- predict(refs)
plot(pred)
```

--------------------------------

### Windows Development Version Installation

Source: https://github.com/rdatatable/data.table/wiki/Installation

Commands to install the development version of data.table on Windows, including options for binary and source installations, and updating the development package.

```R
install.packages("data.table", repos="https://Rdatatable.gitlab.io/data.table") ## defaults to binary version and only works if you are using a recent version of R
install.packages("data.table", type="source", repos="https://Rdatatable.gitlab.io/data.table") ## installs from source, which requires Rtools; use this for older versions of R
# or, to install only if a newer version (by git commit hash) is available
data.table::update_dev_pkg()
```

--------------------------------

### CRAN Installation

Source: https://github.com/rdatatable/data.table/wiki/Installation

Standard R commands to install and load the data.table package from CRAN.
```R
install.packages("data.table")   # install it
library(data.table)              # attach it
example(data.table)              # run the examples section of ?data.table
?data.table                      # read
?fread                           # read
update.packages()                # keep up to date
```

--------------------------------

### Compile data.table directly with OpenMP flags

Source: https://github.com/rdatatable/data.table/wiki/Installation

Command to install data.table from source with specific OpenMP flags.

```bash
PKG_CPPFLAGS='-Xclang -fopenmp' PKG_LIBS=-lomp R CMD INSTALL path/to/data.table_.tar.gz
```

--------------------------------

### fwrite compressed file example

Source: https://github.com/rdatatable/data.table/blob/master/NEWS.1.md

Example showing how to write compressed .gz files using `fwrite` and comparing file sizes and write times.

```R
DT = data.table(A=rep(1:2, 100e6), B=rep(1:4, 50e6))
fwrite(DT, "data.csv")      # 763MB; 1.3s
fwrite(DT, "data.csv.gz")   # 2MB; 1.6s
identical(fread("data.csv.gz"), DT)
```

--------------------------------

### GCC Brew (Option 5 above)

Source: https://github.com/rdatatable/data.table/wiki/Installation

Compiler flags for GCC installed via Homebrew on macOS.

```bash
VER=9 # set the version of brew gcc. (as of today: 11)
CC=gcc-$(VER) -fopenmp # brew gcc nicely creates gcc-9 as symlink
CXX=g++-$(VER) -fopenmp
CFLAGS=-g -O3 -Wall -pedantic -std=gnu99 -mtune=native -pipe
CXXFLAGS=-g -O3 -Wall -pedantic -std=c++11 -mtune=native -pipe
```

--------------------------------

### Linux and Mac Development Version Installation

Source: https://github.com/rdatatable/data.table/wiki/Installation

Commands to remove the existing data.table package and install the development version from source on Linux and Mac.
```R
remove.packages("data.table")
install.packages("data.table", type = "source", repos = "https://Rdatatable.gitlab.io/data.table")
```

--------------------------------

### fread select argument example

Source: https://github.com/rdatatable/data.table/blob/master/NEWS.1.md

Examples demonstrating the use of the `select` argument in `fread` to specify column types.

```R
fread(file, select=c(colD="character",     # returns 2 columns: colD,colA
                     colA="integer64"))
fread(file, select=list(character="colD",  # returns 5 columns: colD,8,9,10,colA
                        integer= 8:10,
                        character="colA"))
```

--------------------------------

### Reverting to CRAN Version on Linux/Mac

Source: https://github.com/rdatatable/data.table/wiki/Installation

Commands to remove the development version and install the CRAN version of data.table on Linux and Mac.

```R
remove.packages("data.table")
install.packages("data.table")
```

--------------------------------

### Makevars flags for Apple Silicon Macs

Source: https://github.com/rdatatable/data.table/wiki/Installation

Configuration for ~/.R/Makevars on Apple Silicon Macs to enable OpenMP.

```bash
LDFLAGS += -L/opt/homebrew/opt/libomp/lib -lomp
CPPFLAGS += -I/opt/homebrew/opt/libomp/include -Xclang -fopenmp
```

--------------------------------

### address() function example

Source: https://github.com/rdatatable/data.table/blob/master/NEWS.0.md

Shows how to use the new address() function to get the memory address of an object.

```R
x <- 1:10
address(x)
# [1] "0x10a3a8000"
```

--------------------------------

### Macports variant for data.table installation

Source: https://github.com/rdatatable/data.table/wiki/Installation

Command to install r-data.table with OpenMP option using Macports.
```bash
sudo port install r-data.table +openmp
```

```bash
CC=/opt/local/bin/clang -fopenmp
CXX=/opt/local/bin/clang++ -fopenmp
```

--------------------------------

### Using setindexv and checking index attributes

Source: https://github.com/rdatatable/data.table/blob/master/NEWS.md

Demonstrates how setindexv works and how to inspect the attributes of the created index, including group positions and statistics.

```R
d2 = data.table(id=2:1, v2=1:2)
setindexv(d2, "id")
str(attr(attr(d2, "index"), "__id"))
# int [1:2] 2 1
# - attr(*, "starts")= int [1:2] 1 2
# - attr(*, "maxgrpn")= int 1
# - attr(*, "anyna")= int 0
# - attr(*, "anyinfnan")= int 0
# - attr(*, "anynotascii")= int 0
# - attr(*, "anynotutf8")= int 0

d2 = data.table(id=2:1, v2=1:2)
invisible(d2[id==1L])
str(attr(attr(d2, "index"), "__id"))
# int [1:2] 2 1
```

--------------------------------

### Python pandas setup

Source: https://github.com/rdatatable/data.table/wiki/Benchmarks-:-Grouping

Python code snippet for setting up pandas for benchmarking.

```python
# Start the interpreter from a shell with: python3
import pandas as pd
import numpy as np
import timeit
# randChar is workaround for MemoryError in mtrand.RandomState.choice
```

--------------------------------

### dcast.data.table example

Source: https://github.com/rdatatable/data.table/blob/master/NEWS.0.md

Example of using dcast.data.table with a formula interface and an aggregate function.

```R
dcast.data.table(x, a ~ ., mean, value.var="b")
```

--------------------------------

### GForce optimization examples

Source: https://github.com/rdatatable/data.table/blob/master/NEWS.0.md

Examples demonstrating when GForce optimization applies for grouped operations.

```R
DT[,sum(x,na.rm=),by=...]                       # yes
DT[,list(sum(x,na.rm=),mean(y,na.rm=)),by=...]  # yes
DT[,lapply(.SD,sum,na.rm=),by=...]              # yes
DT[,list(sum(x),min(y)),by=...]                 # no. gmin not yet available, only sum and mean so far.
```

--------------------------------

### Makevars flags for M2 Mac with Sequoia 15.6+

Source: https://github.com/rdatatable/data.table/wiki/Installation

Specific Makevars configuration for recent M2 Macs to resolve OpenMP issues.

```bash
CPPFLAGS += -I/opt/homebrew/opt/libomp/include -I/opt/homebrew/opt/gettext/include -Xclang -fopenmp
LDFLAGS += -L/opt/homebrew/opt/gettext/lib
PKG_LIBS += /opt/homebrew/opt/libomp/lib/libomp.dylib -Wl,-rpath,/opt/homebrew/opt/libomp/lib
```

--------------------------------

### GForce optimization example with locale considerations

Source: https://github.com/rdatatable/data.table/blob/master/NEWS.md

Demonstrates how subtle changes in 'j' expressions, particularly involving locale-dependent operations like tolower(), can affect GForce optimization and potentially lead to different results. Namespace qualification or disabling optimizations can ensure consistency.

```R
DT[, .(max(a), max(b)), by=grp]
DT[, .(max(a), max(tolower(b))), by=grp]
```

```R
DT[, .(base::max(a), base::max(b)), by=grp]
```

```R
options(datatable.optimize = 0)
```

```R
Sys.setlocale("LC_COLLATE", "C")
```

--------------------------------

### Example of using PKG_CPPFLAGS and PKG_LIBS on Mac

Source: https://github.com/rdatatable/data.table/blob/master/NEWS.1.md

This snippet shows how to set environment variables for compiling data.table on a Mac, potentially enabling OpenMP support.

```bash
PKG_CPPFLAGS='-Xclang -fopenmp' PKG_LIBS=-lomp R CMD INSTALL data.table_.tar.gz
```

--------------------------------

### macOS CFLAGS, CCFLAGS, LDFLAGS, CPPFLAGS

Source: https://github.com/rdatatable/data.table/wiki/Installation

Compiler and linker flags for macOS, potentially optimizing for -O3.
```bash
CFLAGS=-isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -g -O3 -Wall -pedantic -std=gnu99 -mtune=native -pipe
CCFLAGS=-isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk
CXXFLAGS=-isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -g -O3 -Wall -pedantic -std=c++11 -mtune=native -pipe
LDFLAGS=-L/opt/local/lib -Wl,-rpath,/opt/local/lib
CPPFLAGS=-I/opt/local/include
```

--------------------------------

### Install and update data.table

Source: https://github.com/rdatatable/data.table/wiki/revdep-issue-template

Command to install the development version of data.table and update the development package.

```r
data.table::update_dev_pkg()
```

--------------------------------

### GCC (Official GNU fortran) ver

Source: https://github.com/rdatatable/data.table/wiki/Installation

Compiler flags for GCC on macOS, including OpenMP support.

```bash
LOC = /usr/local/gfortran
CC=$(LOC)/bin/gcc -fopenmp
CXX=$(LOC)/bin/g++ -fopenmp
CXX11 = $(LOC)/bin/g++ -fopenmp # for fst package
CFLAGS=-g -O3 -Wall -pedantic -std=gnu99 -mtune=native -pipe
CXXFLAGS=-g -O3 -Wall -pedantic -std=c++11 -mtune=native -pipe
LDFLAGS=-L$(LOC)/lib -Wl,-rpath,$(LOC)/lib
CPPFLAGS=-I$(LOC)/include -I/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include
```

--------------------------------

### Demonstrating rbind() with fill=TRUE and use.names=FALSE

Source: https://github.com/rdatatable/data.table/blob/master/NEWS.md

Illustrates the behavior of rbind() with fill=TRUE and use.names=FALSE, showing how columns are aligned and filled when data frames have different column sets. Note the change in output behavior in newer versions.
```R
DT1
#        A     B
#
# 1:     1     5
# 2:     2     6
DT2
#      foo
#
# 1:     3
# 2:     4
rbind(DT1, DT2, fill=TRUE)   # no change
#        A     B   foo
#
# 1:     1     5    NA
# 2:     2     6    NA
# 3:    NA    NA     3
# 4:    NA    NA     4

rbind(DT1, DT2, fill=TRUE, use.names=FALSE)
# was:
#        A     B   foo
#
# 1:     1     5    NA
# 2:     2     6    NA
# 3:    NA    NA     3
# 4:    NA    NA     4
# Warning message:
# In rbindlist(l, use.names, fill, idcol) :
#   use.names= cannot be FALSE when fill is TRUE. Setting use.names=TRUE.

# now:
#        A     B
#
# 1:     1     5
# 2:     2     6
# 3:     3    NA
# 4:     4    NA
```

--------------------------------

### Installation

Source: https://github.com/rdatatable/data.table/blob/master/README.md

Instructions for installing the data.table package from CRAN and the development version from a specific repository.

```r
install.packages("data.table")

# latest development version (only if newer available)
data.table::update_dev_pkg()

# latest development version (force install)
install.packages("data.table", repos="https://rdatatable.gitlab.io/data.table")
```

--------------------------------

### Efficient data.table construction with do.call

Source: https://github.com/rdatatable/data.table/blob/master/NEWS.md

Compares the performance of constructing a data.table directly versus using do.call. Demonstrates a significant performance improvement after a bug fix.

```R
DF = data.frame(a=runif(1e6), b=runif(1e6))
DT1 = data.table(DF)                   # 0.02s before and after
DT2 = do.call(data.table, list(DF))    # 3.07s before, 0.02s after
identical(DT1, DT2)                    # TRUE
```

--------------------------------

### fread integer overflow example

Source: https://github.com/rdatatable/data.table/blob/master/NEWS.0.md

Example demonstrating how fread handles large integers that would overflow a standard integer type.
```R
fread("Col1\n12345678901234567890")   # works as before, bumped to character
```

--------------------------------

### Applying a custom function with frollapply and by.column=FALSE

Source: https://github.com/rdatatable/data.table/blob/master/NEWS.md

Demonstrates using a custom function with frollapply on a data.table, specifying by.column=FALSE to apply the function across columns. The example shows fitting a linear model within rolling windows.

```R
x = data.table(v1=rnorm(120), v2=rnorm(120))
f = function(x) coef(lm(v2 ~ v1, data=x))
frollapply(x, 4, f, by.column=FALSE)
#      (Intercept)         v1
#
#   1:          NA         NA
#   2:          NA         NA
#   3:          NA         NA
#   4: -0.04648236 -0.6349687
#   5:  0.09208733 -0.4964023
#  ---
# 116: -0.21169439  0.7421358
# 117: -0.19729119  0.4926939
# 118: -0.04217896  0.0452713
# 119:  0.22472549 -0.5245874
# 120:  0.54540359 -0.1638333
```

--------------------------------

### Using keyby with data.table

Source: https://github.com/rdatatable/data.table/blob/master/NEWS.md

Shows how keyby= can be used with TRUE/FALSE alongside by= for grouping in data.table operations.

```R
DT[, sum(colB), keyby="colA"]
DT[, sum(colB), by="colA", keyby=TRUE]   # same
```
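
The keyby snippet above refers to a `DT` that is not defined in the excerpt. The following is a minimal self-contained sketch of the same idea. The toy table and the names `res_keyby`, `res_flag` and `res_by` are made up for illustration, and it assumes an installed data.table recent enough to accept `keyby=TRUE` together with `by=`.

```r
library(data.table)

# Hypothetical toy data; only the column names colA and colB come from the snippet above.
DT = data.table(colA = c("b", "a", "b", "a"), colB = 1:4)

# Group by colA and key (sort) the result by the grouping column.
res_keyby = DT[, sum(colB), keyby = "colA"]

# Same request, written as by= plus keyby=TRUE.
res_flag = DT[, sum(colB), by = "colA", keyby = TRUE]

# Plain by= keeps groups in order of first appearance and sets no key.
res_by = DT[, sum(colB), by = "colA"]

key(res_keyby)   # "colA"
key(res_by)      # NULL
```

With this data, both keyed results list group `a` before `b`, while the plain `by=` result keeps `b` first because it appears first in `DT`.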