# Intel Optimization Zone - Tuning Guides

The Intel Optimization Zone repository provides comprehensive performance tuning guides and optimization recipes for software, workloads, tools, and hardware configurations on Intel architectures. It covers databases (Cassandra, MySQL, PostgreSQL), data processing frameworks (Apache Spark, Gluten), programming languages (Java), and includes best practices for BIOS settings, memory management, NUMA configurations, and system-level optimizations to maximize performance on Intel Xeon processors.

This collection is designed for software developers, database administrators, and performance engineers who want to unlock the full potential of Intel hardware. The guides include specific configuration recommendations, JVM tuning parameters, kernel optimizations, and benchmark methodologies with real-world performance comparisons across different Intel Xeon generations, including the latest Xeon 6 processors with P-cores and E-cores.

---

## Java JIT Compiler Tuning

Optimizes Java applications by adjusting C2 JIT compiler thresholds for better inlining and compilation of methods with deep call stacks, particularly beneficial for functional programming styles, Stream API usage, and auto-generated code like Protobuf.

```bash
# JIT compiler inlining optimizations for modern Java applications
java -XX:MaxInlineLevel=40 -XX:TypeProfileMajorReceiverPercent=30 -jar app.jar

# For applications with large auto-generated methods (e.g., Protobuf)
java -XX:-DontCompileHugeMethods -XX:MaxNodeLimit=150000 -XX:NodeLimitFudgeFactor=3000 -jar app.jar

# Combined optimizations with heap tuning (set Xms equal to Xmx)
java \
  -XX:MaxInlineLevel=40 \
  -XX:TypeProfileMajorReceiverPercent=30 \
  -XX:-DontCompileHugeMethods \
  -XX:MaxNodeLimit=150000 \
  -Xms24g -Xmx24g \
  -jar app.jar
```

---

## Java String Optimization

Replaces `String.replaceAll()` with `String.replace()` for literal string replacements to avoid unnecessary regex compilation overhead.
```java
// BEFORE: Using replaceAll() with literal strings (slower)
String result = "a-b-c".replaceAll("-", ":");
String formatted = input.replaceAll("_", " ");

// AFTER: Using replace() for literal strings (faster, no regex overhead)
String result = "a-b-c".replace("-", ":");
String formatted = input.replace("_", " ");

// Both produce identical output: "a:b:c" and the formatted input
// replace() is significantly faster for non-regex patterns
```

---

## Apache Cassandra BIOS Configuration

Configures BIOS settings for optimal Cassandra database performance on Intel Xeon processors, including hyperthreading, Sub-NUMA Clustering, and latency optimization.

```bash
# Recommended BIOS settings for Cassandra:
# | Parameter              | Default  | Recommended                  | Impact    |
# |------------------------|----------|------------------------------|-----------|
# | Hyperthreading/SMT     | Enabled  | Enabled                      | up to 22% |
# | SNC (Sub-NUMA Cluster) | Disabled | SNC 2-4 (based on instances) | up to 15% |
# | Latency Optimized Mode | Disabled | Enabled                      | 2-4%      |

# Check NUMA topology for SNC configuration
numactl -H

# Example output showing 4 NUMA nodes with SNC enabled:
# available: 4 nodes (0-3)
# node 0 cpus: 0 1 2 3 4 5 6 7
# node 0 size: 256000 MB
# ...
```

---

## Cassandra JVM Configuration

Configures Java Virtual Machine settings for Cassandra, including garbage collector selection, huge pages, and heap sizing for optimal throughput.
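The main judgment call in this file is heap size, and one of the rules in this guide is to avoid the 32-38GB range — the zone where compressed ordinary object pointers are disabled but the heap is not yet large enough to make up for the fatter references. A minimal sketch of that rule as a hypothetical helper (not part of any Cassandra tooling):

```bash
# Hypothetical helper: clamp a candidate per-instance heap (in GB) out of
# the 32-38GB dead zone, where compressed oops are lost without enough
# extra heap to compensate.
clamp_heap_gb() {
  local heap=$1
  if [ "$heap" -ge 32 ] && [ "$heap" -le 38 ]; then
    heap=31   # step back below the compressed-oops threshold
  fi
  echo "$heap"
}

clamp_heap_gb 34   # prints 31
clamp_heap_gb 24   # prints 24
clamp_heap_gb 48   # prints 48 (heaps well past the dead zone are left alone)
```

Feed the clamped value into matching `-Xms`/`-Xmx` flags as shown in the configuration below.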
```bash
# jvm-server.options - Enable large pages (3-5% speedup)
-XX:+UseLargePages
-XX:+UseTransparentHugePages

# jvm-server.options - G1GC configuration (up to 10% speedup)

# Disable NUMA in the JVM (handled by numactl instead)
# -XX:+UseNUMA

# Comment out the CMS garbage collector (use G1GC instead)
#-XX:+UseParNewGC
#-XX:+UseConcMarkSweepGC
#-XX:+CMSParallelRemarkEnabled

# Enable G1GC (default in Cass5, must be enabled in Cass3/4)
-XX:+UseG1GC

# Heap sizing (25-50% of system RAM, avoid the 32-38GB range)
# Example for a system with 128GB RAM running 4 instances:
-Xms31G
-Xmx31G
```

```bash
# Set JAVA_HOME before starting Cassandra
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export PATH=$JAVA_HOME/bin:$PATH
```

---

## Cassandra NUMA Binding

Binds Cassandra instances to specific NUMA nodes for optimal memory locality, providing up to 15% performance improvement on multi-socket systems.

```bash
# Modify bin/cassandra for NUMA node 0 binding:
NUMACTL_ARGS=""
if which numactl >/dev/null 2>/dev/null && numactl $NUMACTL_ARGS ls / >/dev/null 2>/dev/null
then
    NUMACTL="numactl -m 0 -N 0 $NUMACTL_ARGS"
else
    NUMACTL=""
fi

# Start Cassandra instances bound to different NUMA nodes:
# Instance 0 on NUMA node 0
numactl -m 0 -N 0 /opt/cassandra-inst0/bin/cassandra -R
# Instance 1 on NUMA node 1
numactl -m 1 -N 1 /opt/cassandra-inst1/bin/cassandra -R
# Instance 2 on NUMA node 2
numactl -m 2 -N 2 /opt/cassandra-inst2/bin/cassandra -R
# Instance 3 on NUMA node 3
numactl -m 3 -N 3 /opt/cassandra-inst3/bin/cassandra -R

# Verify NUMA topology
numactl -H
```

---

## Cassandra YAML Configuration

Configures essential Cassandra parameters in cassandra.yaml for network, storage, and concurrent operations tuning.
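The concurrency settings in this file follow simple rules of thumb: reads at 3x the logical CPUs allocated to the instance, writes at 1x. A quick sketch of that arithmetic, assuming a 32-CPU instance (the example size used in this guide):

```bash
# Derive concurrent_reads/concurrent_writes from the logical CPUs
# allocated to one Cassandra instance (32 is an assumed example).
cpus_per_instance=32
concurrent_reads=$(( cpus_per_instance * 3 ))
concurrent_writes=$cpus_per_instance

echo "concurrent_reads: $concurrent_reads"      # concurrent_reads: 96
echo "concurrent_writes: $concurrent_writes"    # concurrent_writes: 32
```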
```yaml
# cassandra.yaml essential settings
seeds: "192.168.1.100"
listen_address: 192.168.1.100
rpc_address: 192.168.1.100

# Storage directories
data_file_directories:
  - /mnt/nvme1/cass_db
commitlog_directory: /mnt/nvme1/cass_db/commitlog
cdc_raw_directory: /mnt/nvme1/cass_db/cdc_raw
saved_caches_directory: /mnt/nvme1/cass_db/saved_caches

# Concurrent operations (up to 15% speedup)
# concurrent_reads: 3x the logical CPUs allocated to this instance
concurrent_reads: 96      # for a 32-CPU instance
# concurrent_writes: same as the logical CPU count (avoids contention)
concurrent_writes: 32

# For Cass5: rename cassandra_latest.yaml to cassandra.yaml
# to enable new features (up to 27% speedup):
# - Unified Compaction
# - Trie Memtables
# - BTI SSTables
```

---

## Cassandra Storage Optimization

Configures Linux storage settings for NVMe devices to maximize Cassandra I/O performance, particularly for small random read workloads.

```bash
# Disk optimizations for NVMe devices (can double throughput for small random reads)

# Set the I/O scheduler to none for NVMe
echo none > /sys/block/nvme1n1/queue/scheduler
echo none > /sys/block/nvme2n1/queue/scheduler
echo none > /sys/block/nvme3n1/queue/scheduler
echo none > /sys/block/nvme4n1/queue/scheduler

# Mark the devices as non-rotational
echo 0 > /sys/class/block/nvme1n1/queue/rotational
echo 0 > /sys/class/block/nvme2n1/queue/rotational
echo 0 > /sys/class/block/nvme3n1/queue/rotational
echo 0 > /sys/class/block/nvme4n1/queue/rotational

# Critical: reduce read_ahead_kb from the default 128 to 8
# This alone can double throughput for small random requests
echo 8 > /sys/class/block/nvme1n1/queue/read_ahead_kb
echo 8 > /sys/class/block/nvme2n1/queue/read_ahead_kb
echo 8 > /sys/class/block/nvme3n1/queue/read_ahead_kb
echo 8 > /sys/class/block/nvme4n1/queue/read_ahead_kb
```

---

## Cassandra Stress Benchmark

Runs the cassandra-stress benchmark for performance testing with an optimized schema configuration for mixed read/write workloads.
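Before running the benchmark it is worth checking that the generated dataset comfortably exceeds RAM, so reads actually hit the NVMe devices rather than the page cache. A back-of-the-envelope sketch using the population size and per-entry size from the commands in this guide:

```bash
# Estimate on-disk dataset size: 600M entries at ~670 bytes each
# (values taken from the cassandra-stress invocation in this guide;
# actual on-disk size also depends on compression and compaction).
entries=600000000
bytes_per_entry=670
dataset_gib=$(( entries * bytes_per_entry / 1024 / 1024 / 1024 ))
echo "approx dataset size: ${dataset_gib} GiB"   # approx dataset size: 374 GiB
```

On the 128GB example system used earlier, a ~374 GiB dataset keeps the workload I/O-bound as intended.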
```bash
# Schema configuration for an 80:20 read:write workload
# cqlstress-insanity-example.yaml modifications:

# Cass3 - SizeTieredCompactionStrategy
# ) WITH compaction = {'class':'SizeTieredCompactionStrategy'}
#   AND compression = {'class':'LZ4Compressor','chunk_length_in_kb':64}

# Cass4 - SizeTieredCompactionStrategy
# ) WITH compaction = {'class':'SizeTieredCompactionStrategy'}
#   AND compression = {'class':'LZ4Compressor','chunk_length_in_kb':16}

# Cass5 - UnifiedCompactionStrategy (auto-adjusts between Leveled/SizeTiered)
# ) WITH compaction = {'class':'UnifiedCompactionStrategy'}
#   AND compression = {'class':'LZ4Compressor','chunk_length_in_kb':16}

# Create the initial database (670 bytes per entry)
/tools/bin/cassandra-stress \
  user profile=/tools/cql-insanity-example.yaml \
  ops\(insert=1\) no-warmup \
  cl=ONE \
  n=600000000 \
  -mode native cql3 \
  -pop seq=1..600000000 \
  -node 192.168.1.100 \
  -rate threads=32

# Run the 80/20 mixed workload benchmark
/tools/bin/cassandra-stress user \
  profile=/tools/cql-insanity-example.yaml \
  ops\(insert=20,simple1=80\) no-warmup \
  cl=ONE \
  duration=600s \
  -mode native cql3 \
  -pop dist=uniform\(1..600000000\) \
  -node 192.168.1.100 \
  -rate threads=256
```

---

## Apache Spark Configuration

Configures Apache Spark parameters for optimal performance on Intel Xeon platforms, including executor sizing, memory allocation, and query optimizations.
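The executor-sizing comments in the configuration below encode straightforward arithmetic. A sketch with assumed hardware values (128 logical CPUs and 512GB RAM; these are illustrative, not the host the guide's 28g/36g numbers came from):

```bash
# Assumed host: 128 logical CPUs, 512GB RAM (hypothetical example values)
nproc=128
total_mem_gb=512

executor_cores=8                                   # 4-8 cores recommended
executor_instances=$(( nproc / executor_cores ))   # nproc / executor.cores
executor_memory_gb=$(( total_mem_gb * 35 / 100 / executor_instances ))  # 35% of RAM / executors
offheap_gb=$(( total_mem_gb * 45 / 100 / executor_instances ))          # 45% of RAM / executors

echo "spark.executor.instances=${executor_instances}"   # ...=16
echo "spark.executor.memory=${executor_memory_gb}g"     # ...=11g
echo "spark.memory.offHeap.size=${offheap_gb}g"         # ...=14g
```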
```bash
# spark-defaults.conf or spark-submit parameters

# Executor configuration
spark.executor.cores=8              # 4-8 cores per executor recommended
spark.executor.instances=16         # nproc / executor.cores
spark.executor.memory=28g           # 35% of total memory / executors
spark.executor.memoryOverhead=1g
spark.executor.extraJavaOptions="-XX:+UseBiasedLocking -XX:BiasedLockingStartupDelay=0"

# Off-heap memory
spark.memory.offHeap.enabled=true
spark.memory.offHeap.size=36g       # 45% of total memory / executors

# Driver configuration
spark.driver.memory=20g
spark.driver.maxResultSize=4g

# Query optimizations
spark.sql.adaptive.enabled=true
spark.sql.files.maxPartitionBytes=2g
spark.sql.shuffle.partitions=256    # 2-4x CPU threads
spark.sql.broadcastTimeout=4800
spark.sql.autoBroadcastJoinThreshold=10m
spark.sql.optimizer.dynamicPartitionPruning.enabled=true

# Bloom filter optimization
spark.sql.optimizer.runtime.bloomFilter.enabled=true
spark.sql.optimizer.runtime.bloomFilter.applicationSideScanSizeThreshold=0

# Serialization
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max=512m

# Batch sizes
spark.sql.execution.arrow.maxRecordsPerBatch=20480
spark.sql.parquet.columnarReaderBatchSize=20480
spark.sql.inMemoryColumnarStorage.batchSize=20480

# GC tuning
spark.cleaner.periodicGC.interval=10s
spark.task.cpus=1
```

---

## Spark Linux System Configuration

Configures Linux kernel settings for optimal Spark performance including resource limits and huge pages.
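When scripting these checks, the active THP mode is the bracketed entry in the sysfs file. A small sketch that parses it from a sample string, so the snippet runs without root or a particular kernel setting (on a real system, read the file with `cat` instead):

```bash
# Sample contents of /sys/kernel/mm/transparent_hugepage/enabled;
# on a live host use: thp_status=$(cat /sys/kernel/mm/transparent_hugepage/enabled)
thp_status="[always] madvise never"

# The active mode is the bracketed word
active_mode=$(echo "$thp_status" | grep -o '\[[a-z]*\]' | tr -d '[]')
echo "$active_mode"   # always
```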
```bash
# Configure /etc/security/limits.conf for the Spark user
echo "$USER soft memlock unlimited" | sudo tee -a /etc/security/limits.conf
echo "$USER hard memlock unlimited" | sudo tee -a /etc/security/limits.conf
echo "$USER soft nproc unlimited" | sudo tee -a /etc/security/limits.conf
echo "$USER hard nproc unlimited" | sudo tee -a /etc/security/limits.conf
echo "$USER soft nofile 655360" | sudo tee -a /etc/security/limits.conf
echo "$USER hard nofile 655360" | sudo tee -a /etc/security/limits.conf

# Check the maximum file descriptor limit
cat /proc/sys/fs/nr_open

# Enable Transparent Huge Pages for Spark
echo always > /sys/kernel/mm/transparent_hugepage/defrag
echo always > /sys/kernel/mm/transparent_hugepage/enabled

# Verify THP status
cat /sys/kernel/mm/transparent_hugepage/enabled
# Output: [always] madvise never
```

---

## Apache Gluten with Velox Configuration

Configures Apache Gluten parameters for native vectorized query execution using the Velox backend, achieving up to 3.3x speedup over vanilla Spark.
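Two of the Gluten values below are derived from the core count: shuffle partitions at 2x total cores, and a coreRange string matching the NUMA topology. A sketch assuming 2 sockets of 72 cores (a hypothetical topology consistent with the 288 and 0-71 values used in this guide):

```bash
# Hypothetical topology: 2 sockets x 72 cores
sockets=2
cores_per_socket=72
total_cores=$(( sockets * cores_per_socket ))

shuffle_partitions=$(( total_cores * 2 ))        # 2x total cores
core_range="0-$(( cores_per_socket - 1 ))"       # first socket's cores

echo "spark.sql.shuffle.partitions=${shuffle_partitions}"   # ...=288
echo "spark.gluten.sql.columnar.coreRange=${core_range}"    # ...=0-71
```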
```bash
# Gluten + Velox spark-defaults.conf

# Essential Gluten settings
spark.memory.offHeap.enabled=true
spark.memory.offHeap.size=200g      # Allocate as much as possible for off-heap

# Velox memory configuration (Gluten v1.0.0+)
spark.gluten.sql.columnar.backend.velox.memoryCapRatio=0.75

# Use jemalloc for up to 10% performance improvement
spark.executorEnv.LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2

# Force shuffled hash join (better than sort-merge join in the native engine)
spark.gluten.sql.columnar.forceshuffledhashjoin=true

# Batch and partition settings
spark.gluten.sql.columnar.maxBatchSize=4096
spark.sql.files.maxPartitionBytes=2g    # At least 2GB for better native scan
spark.sql.shuffle.partitions=288        # 2x total cores

# Join optimization for complex queries (e.g., TPC-DS q72)
spark.gluten.sql.columnar.joinOptimizationLevel=18
spark.gluten.sql.columnar.logicalJoinOptimizationLevel=19
spark.gluten.sql.columnar.logicalJoinOptimizeEnable=true
spark.gluten.sql.columnar.physicalJoinOptimizationLevel=18
spark.gluten.sql.columnar.physicalJoinOptimizeEnable=true

# Fallback threshold for unsupported operators
spark.gluten.sql.columnar.wholeStage.fallback.threshold=2

# NUMA binding (bare metal only)
spark.gluten.sql.columnar.numaBinding=true    # false for cloud
spark.gluten.sql.columnar.coreRange=0-71      # Align with NUMA topology

# Bloom filter optimization
spark.sql.optimizer.runtime.bloomFilter.enabled=true
spark.sql.optimizer.runtime.bloomFilter.applicationSideScanSizeThreshold=0
```

---

## MySQL NUMA Binding and Configuration

Configures MySQL for optimal performance on Intel Xeon 6 processors with NUMA binding and key tuning parameters.
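The multi-instance startup loop below builds a CPU list per instance: 4 physical cores plus their hyperthread siblings, which on the host assumed in this guide sit at an offset of 192. A sketch of that computation in isolation, as a hypothetical helper:

```bash
# Hypothetical layout: hyperthread siblings live at core_id + 192
cpu_list() {
  local i=$1 lo hi
  lo=$(( i * 4 ))
  hi=$(( i * 4 + 3 ))
  echo "${lo}-${hi},$(( lo + 192 ))-$(( hi + 192 ))"
}

cpu_list 0   # 0-3,192-195
cpu_list 7   # 28-31,220-223
```

Check your own sibling offset with `lscpu -e` before reusing this layout; it varies with core count and BIOS settings.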
```bash
# Check NUMA topology
numactl -H

# Check CPU and NUMA node mapping
lscpu -e

# Start MySQL bound to NUMA node 0
numactl -N 0 -m 0 /usr/local/mysql/bin/mysqld_safe \
  --defaults-file=/etc/my.cnf \
  --user=mysql &

# For a systemd service, modify /lib/systemd/system/mysql.service:
# ExecStart=numactl -N 0 -m 0 /usr/sbin/mysqld

# Multi-instance startup with core binding
# (4 physical cores per instance plus their hyperthread siblings at offset 192,
# i.e., 8 logical CPUs per instance)
for i in {0..7}; do
  echo "Starting instance ${i}"
  numactl -C $((${i}*4))-$((${i}*4+3)),$((${i}*4+192))-$((${i}*4+192+3)) -m 0 \
    mysqld_safe --defaults-file=/mnt/data/conf/my-inst${i}.cnf &
done
```

```ini
# my.cnf key parameters for performance
[mysqld]
# Buffer pool: 50-75% of total RAM for a single instance.
# For multi-instance, the total across all instances should be 50-75% of server memory.
innodb_buffer_pool_size = 64G

# Disable binary logging for benchmark scenarios
skip-log-bin

# Flush method that avoids double buffering
innodb_flush_method = O_DIRECT_NO_FSYNC

# Transaction durability vs. performance trade-off
# 1 = most durable, 0 or 2 = better performance
innodb_flush_log_at_trx_commit = 2
```

---

## PostgreSQL Configuration and Tuning

Configures PostgreSQL for optimal performance including WAL settings, shared buffers, and huge pages support.
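The hugepage count used below (6000) covers 10GB of shared_buffers in 2MB pages plus headroom for PostgreSQL's other shared memory. The arithmetic, sketched with an assumed ~15% margin (the margin is an illustration; the guide's 6000 is in the same ballpark):

```bash
# 2MB huge pages needed for shared_buffers, plus headroom for other
# shared memory segments (the 15% margin is an assumption, not from the guide)
shared_buffers_mb=10240     # 10GB
hugepage_mb=2
pages=$(( shared_buffers_mb / hugepage_mb ))
pages_with_headroom=$(( pages * 115 / 100 ))

echo "base pages: ${pages}"                    # base pages: 5120
echo "with headroom: ${pages_with_headroom}"   # with headroom: 5888
```

PostgreSQL also reports the exact requirement itself: `postgres -D <datadir> -C shared_memory_size_in_huge_pages`.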
```bash
# Initialize PostgreSQL with a 1GB WAL segment size (vs. the 16MB default)
${PG_HOME}/bin/initdb -D ./data --wal-segsize=1024

# Verify the WAL segment size
du -hcs pg_wal/* | head -5
# 1.0G    0000000100000A0400000002
# 1.0G    0000000100000A0400000003

# Start PostgreSQL with NUMA binding
numactl -N 0 -m 0 ${PG_HOME}/bin/pg_ctl start -D /data/pgbase -l logfile &

# Configure huge pages (approximately 10% improvement)
# Calculate the required hugepages for 10GB shared_buffers
echo 6000 > /proc/sys/vm/nr_hugepages_mempolicy
# Or add to /etc/sysctl.conf for persistence:
# vm.nr_hugepages = 6000
# sysctl -p

# For 1GB huge pages (slightly better than 2MB), add to GRUB:
# GRUB_CMDLINE_LINUX="rhgb default_hugepagesz=1G hugepagesz=1G"
# Then: sudo grub2-mkconfig -o /boot/efi/EFI/ubuntu/grub.cfg
```

```ini
# postgresql.conf key parameters

# Shared buffers: start at 25% of system memory
shared_buffers = 64GB

# Enable huge pages
huge_pages = on

# WAL settings for performance benchmarks
wal_level = minimal
synchronous_commit = off    # Use 'on' for production

# Enable autovacuum to avoid performance dips
autovacuum = on
```

---

## HammerDB Benchmark for MySQL

Runs the HammerDB TPC-C benchmark against MySQL for database performance evaluation.

```tcl
# schemabuild.tcl - Create the TPC-C schema
puts "SETTING CONFIGURATION"
dbset db mysql
dbset bm TPC-C
diset connection mysql_host localhost
diset connection mysql_port 3306
diset tpcc mysql_user root
diset tpcc mysql_pass MyNewPass4!
diset tpcc mysql_count_ware 100
diset tpcc mysql_partition false
diset tpcc mysql_num_vu 8
diset tpcc mysql_storage_engine innodb
print dict
vuset logtotemp 1
vuset unique 1
buildschema
waittocomplete
```

```tcl
# mysqlrun.tcl - Run the TPC-C benchmark
puts "SETTING CONFIGURATION"
dbset db mysql
diset connection mysql_host localhost
diset connection mysql_port 3306
diset tpcc mysql_user root
diset tpcc mysql_pass MyNewPass4!
diset tpcc mysql_driver timed
diset tpcc mysql_prepared false
diset tpcc mysql_rampup 2
diset tpcc mysql_duration 5
diset tpcc mysql_timeprofile true
diset tpcc mysql_allwarehouse true
vuset logtotemp 1
vuset unique 1
loadscript
puts "TEST STARTED"
vuset vu 64
vucreate
vurun
runtimer 500
vudestroy
puts "TEST COMPLETE"
```

```bash
# Execute the HammerDB scripts
./hammerdbcli auto schemabuild.tcl
./hammerdbcli auto mysqlrun.tcl

# Multi-instance benchmark execution
for i in {0..23}; do
  numactl -N 3-5 -m 3-5 ${HAMMERDB_HOME}/hammerdbcli auto \
    ${CUR_FOLDER}/mysqlrun_${i}.tcl > ${CUR_FOLDER}/result-${i}.log 2>&1 &
done
```

---

## Sysbench Benchmark for MySQL

Runs the Sysbench OLTP benchmark for MySQL performance testing.

```bash
# Prepare the Sysbench database (30 tables, 1M rows each, ~8GB of data)
sysbench /usr/local/share/sysbench/oltp_read_write.lua \
  --tables=30 \
  --threads=100 \
  --table-size=1000000 \
  --mysql-db=sbtest \
  --mysql-user=root \
  --mysql-password=MyNewPass4! \
  --db-driver=mysql \
  --mysql-host=localhost \
  --mysql_storage_engine=innodb \
  prepare

# Run the OLTP read/write benchmark (single instance)
# Note: --tables must not exceed the number of tables prepared above
numactl -N 3-5 -m 3-5 sysbench oltp_read_write \
  --tables=30 \
  --threads=64 \
  --table-size=1000000 \
  --mysql-host=127.0.0.1 \
  --mysql-db=sbtest \
  --mysql-user=intel \
  --db-driver=mysql \
  --mysql_storage_engine=innodb \
  --report-interval=60 \
  --time=180 \
  --mysql-port=3306 \
  --mysql-password=XXX \
  --warmup-time=60 \
  run

# Multi-instance benchmark (24 instances)
for i in {0..23}; do
  numactl -N 3-5 -m 3-5 sysbench /usr/local/share/sysbench/oltp_read_write.lua \
    --tables=30 \
    --threads=64 \
    --table-size=1000000 \
    --mysql-host=127.0.0.1 \
    --mysql-db=sbtest \
    --mysql-user=intel \
    --db-driver=mysql \
    --mysql_storage_engine=innodb \
    --report-interval=60 \
    --time=180 \
    --mysql-port=$((${i}+3306)) \
    --mysql-password=XXX \
    --warmup-time=60 \
    run > sysbench-${i}.txt 2>&1 &
done
sleep 250
```

---

## HammerDB Benchmark for PostgreSQL

Runs the HammerDB TPC-C benchmark against PostgreSQL for
database performance evaluation.

```tcl
# pgbuild.tcl - Create the TPC-C schema for PostgreSQL
dbset db pg
dbset bm TPC-C
diset connection pg_host localhost
diset connection pg_port 5432
diset tpcc pg_count_ware 100
diset tpcc pg_num_vu 8
diset tpcc pg_superuser intel
diset tpcc pg_superuserpass postgres
diset tpcc pg_storedprocs false
diset tpcc pg_raiseerror true
vuset logtotemp 1
vuset unique 1
buildschema
waittocomplete
```

```tcl
# pgtest.tcl - Run the TPC-C benchmark for PostgreSQL
puts "SETTING CONFIGURATION"
dbset db pg
diset connection pg_host localhost
diset connection pg_port 5432
diset tpcc pg_superuser intel
diset tpcc pg_superuserpass postgres
diset tpcc pg_vacuum true
diset tpcc pg_driver timed
diset tpcc pg_rampup 2
diset tpcc pg_duration 5
diset tpcc pg_storedprocs false
diset tpcc pg_count_ware 100
diset tpcc pg_raiseerror true
vuset logtotemp 1
vuset unique 1
loadscript
puts "TEST STARTED"
vuset vu 64
vucreate
vurun
runtimer 300
vudestroy
puts "TEST COMPLETE"
```

```bash
# Execute HammerDB for PostgreSQL
./hammerdbcli auto pgbuild.tcl
./hammerdbcli auto pgtest.tcl

# Example output:
# Vuser 1:64 Active Virtual Users configured
# Vuser 1:TEST RESULT : System achieved 1483849 NOPM from 3515684 PostgreSQL TPM
```

---

## Intel Performance Counter Monitor (PCM)

Monitors CPU performance metrics including memory bandwidth, cache misses, and energy states using Intel PCM tools.

```bash
# Install PCM from GitHub
git clone https://github.com/intel/pcm.git
cd pcm
mkdir build && cd build
cmake ..
make -j

# Monitor real-time CPU metrics
sudo ./pcm

# Monitor memory bandwidth per channel
sudo ./pcm-memory

# Monitor PCIe bandwidth
sudo ./pcm-pcie

# Monitor power and energy states
sudo ./pcm-power

# Monitor cache metrics
sudo ./pcm-core

# Example: monitor during a Cassandra stress test
sudo ./pcm -r -- cassandra-stress write n=1000000

# Cloud vPMU availability testing
perf stat --timeout 3000 -a -e cycles stress-ng -m $(nproc)
perf stat --timeout 3000 -a -M CPI stress-ng -m $(nproc)
```

---

## Intel PerfSpect

Collects system configuration, performance metrics, and generates flame graphs for performance analysis.

```bash
# Install PerfSpect
git clone https://github.com/intel/PerfSpect.git
cd PerfSpect
pip install -r requirements.txt

# Collect a system configuration report
sudo ./perfspect report

# Monitor CPU metrics in real time
sudo ./perfspect monitor

# Collect telemetry during a workload
sudo ./perfspect collect -d 60 -o ./results

# Generate a flame graph from call stacks
sudo ./perfspect flame -d 30 -o ./flamegraph.svg

# Modify performance-related settings
sudo ./perfspect config --governor=performance
```

---

## Intel gProfiler

Profiles system-wide CPU usage across native programs, Java, Python, and kernel routines with flame graph visualization.
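Flame-graph tooling of this kind consumes "collapsed" stack samples: semicolon-joined frames followed by a sample count. As a sketch of what post-processing such data can look like, this aggregates samples by leaf frame to find the hottest function; the sample data is made up, not output from any of the tools above:

```bash
# Made-up collapsed-stack samples (format: frame;frame;frame count)
collapsed='main;compute;hash 40
main;compute;hash 15
main;io;read 25'

# Sum samples by leaf frame and print the hottest one
hottest=$(printf '%s\n' "$collapsed" | awk '{
  n = $NF                 # trailing sample count
  sub(/ [0-9]+$/, "")     # strip the count, leaving only the stack
  split($0, frames, ";")
  count[frames[length(frames)]] += n
} END {
  max = -1
  for (k in count) if (count[k] > max) { max = count[k]; best = k }
  print best, max
}')
echo "$hottest"   # hash 55
```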
```bash
# Install gProfiler
pip install gprofiler
# Or download the binary
wget https://github.com/intel/gprofiler/releases/latest/download/gprofiler
chmod +x gprofiler

# Run system-wide profiling for 60 seconds
sudo ./gprofiler -d 60 -o profile_output

# Profile a specific Java application
sudo ./gprofiler -d 60 --java-async-profiler-mode cpu -o java_profile

# Profile with Python support
sudo ./gprofiler -d 60 --python-mode py-spy -o python_profile

# Upload results to gProfiler Performance Studio
sudo ./gprofiler -d 60 \
  --upload-results \
  --server-host=gprofiler-studio.example.com \
  --service-name=my-application

# Continuous profiling with periodic uploads
sudo ./gprofiler \
  --continuous \
  --upload-results \
  --server-host=gprofiler-studio.example.com \
  --profiling-frequency=11 \
  --service-name=my-application
```

---

## Summary

The Intel Optimization Zone provides actionable tuning recipes for maximizing performance on Intel Xeon processors across databases, data processing frameworks, and Java applications. Key optimization patterns include NUMA-aware deployment with numactl binding, JVM tuning with G1GC and huge pages, storage optimization for NVMe devices, and Spark/Gluten configurations for vectorized query execution. The guides demonstrate performance improvements ranging from 10-100% depending on workload characteristics, with Gluten + Velox achieving up to 3.3x speedup over vanilla Spark on Intel Xeon 6 processors.

Integration patterns focus on multi-instance deployments for high-core-count systems, where applications like MySQL, PostgreSQL, and Cassandra are bound to specific NUMA nodes or Sub-NUMA Clusters (SNC). Benchmarking methodologies using HammerDB, Sysbench, cassandra-stress, and TPC-DS/TPC-H provide standardized ways to measure and validate optimizations. Performance monitoring tools including Intel PCM, PerfSpect, gProfiler, and VTune enable deep analysis of bottlenecks at the CPU, memory, storage, and application levels to guide further tuning efforts.