# Intel Optimization Zone - Tuning Guides

The Intel Optimization Zone repository provides comprehensive performance tuning guides and optimization recipes for software, workloads, tools, and hardware configurations on Intel architectures. It covers databases (Cassandra, MySQL, PostgreSQL), data processing frameworks (Apache Spark, Gluten), programming languages (Java), and includes best practices for BIOS settings, memory management, NUMA configurations, and system-level optimizations to maximize performance on Intel Xeon processors.

This collection is designed for software developers, database administrators, and performance engineers who want to unlock the full potential of Intel hardware. The guides include specific configuration recommendations, JVM tuning parameters, kernel optimizations, and benchmark methodologies with real-world performance comparisons across different Intel Xeon generations, including the latest Xeon 6 processors with P-cores and E-cores.

---

## Java JIT Compiler Tuning

Optimizes Java applications by adjusting C2 JIT compiler thresholds for better inlining and compilation of methods with deep call stacks, particularly beneficial for functional programming styles, Stream API usage, and auto-generated code like Protobuf.

```bash
# JIT compiler inlining optimizations for modern Java applications
java -XX:MaxInlineLevel=40 -XX:TypeProfileMajorReceiverPercent=30 -jar app.jar

# For applications with large auto-generated methods (e.g., Protobuf)
java -XX:-DontCompileHugeMethods -XX:MaxNodeLimit=150000 -XX:NodeLimitFudgeFactor=3000 -jar app.jar

# Combined optimizations with heap tuning (set Xms equal to Xmx)
java \
  -XX:MaxInlineLevel=40 \
  -XX:TypeProfileMajorReceiverPercent=30 \
  -XX:-DontCompileHugeMethods \
  -XX:MaxNodeLimit=150000 \
  -Xms24g -Xmx24g \
  -jar app.jar
```

---

## Java String Optimization

Replaces `String.replaceAll()` with `String.replace()` for literal string replacements to avoid unnecessary regex compilation overhead.
```java
// BEFORE: Using replaceAll() with literal strings (slower)
String result = "a-b-c".replaceAll("-", ":");
String formatted = input.replaceAll("_", " ");

// AFTER: Using replace() for literal strings (faster, no regex overhead)
String result = "a-b-c".replace("-", ":");
String formatted = input.replace("_", " ");

// Both produce identical output: "a:b:c" and the formatted input
// replace() is significantly faster for non-regex patterns
```

---

## Apache Cassandra BIOS Configuration

Configures BIOS settings for optimal Cassandra database performance on Intel Xeon processors, including hyperthreading, Sub-NUMA Clustering, and latency optimization.

```bash
# Recommended BIOS settings for Cassandra:
# | Parameter              | Default  | Recommended                  | Impact    |
# |------------------------|----------|------------------------------|-----------|
# | Hyperthreading/SMT     | Enabled  | Enabled                      | up to 22% |
# | SNC (Sub-NUMA Cluster) | Disabled | SNC 2-4 (based on instances) | up to 15% |
# | Latency Optimized Mode | Disabled | Enabled                      | 2-4%      |

# Check NUMA topology for SNC configuration
numactl -H

# Example output showing 4 NUMA nodes with SNC enabled:
# available: 4 nodes (0-3)
# node 0 cpus: 0 1 2 3 4 5 6 7
# node 0 size: 256000 MB
# ...
```

---

## Cassandra JVM Configuration

Configures Java Virtual Machine settings for Cassandra, including garbage collector selection, huge pages, and heap sizing for optimal throughput.
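The main judgment call in this file is heap size, and one of the rules in this guide is to avoid the 32-38GB range — the zone where compressed ordinary object pointers are disabled but the heap is not yet large enough to make up for the fatter references. A minimal sketch of that rule as a hypothetical helper (not part of any Cassandra tooling):

```bash
# Hypothetical helper: clamp a candidate per-instance heap (in GB) out of
# the 32-38GB dead zone, where compressed oops are lost without enough
# extra heap to compensate.
clamp_heap_gb() {
  local heap=$1
  if [ "$heap" -ge 32 ] && [ "$heap" -le 38 ]; then
    heap=31   # step back below the compressed-oops threshold
  fi
  echo "$heap"
}

clamp_heap_gb 34   # prints 31
clamp_heap_gb 24   # prints 24
clamp_heap_gb 48   # prints 48 (heaps well past the dead zone are left alone)
```

Feed the clamped value into matching `-Xms`/`-Xmx` flags as shown in the configuration below.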
```bash
# jvm-server.options - Enable large pages (3-5% speedup)
-XX:+UseLargePages
-XX:+UseTransparentHugePages

# jvm-server.options - G1GC configuration (up to 10% speedup)

# Disable NUMA in the JVM (handled by numactl instead)
# -XX:+UseNUMA

# Comment out the CMS garbage collector (use G1GC instead)
#-XX:+UseParNewGC
#-XX:+UseConcMarkSweepGC
#-XX:+CMSParallelRemarkEnabled

# Enable G1GC (default in Cass5, must be enabled in Cass3/4)
-XX:+UseG1GC

# Heap sizing (25-50% of system RAM, avoid the 32-38GB range)
# Example for a system with 128GB RAM running 4 instances:
-Xms31G
-Xmx31G
```

```bash
# Set JAVA_HOME before starting Cassandra
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export PATH=$JAVA_HOME/bin:$PATH
```

---

## Cassandra NUMA Binding

Binds Cassandra instances to specific NUMA nodes for optimal memory locality, providing up to 15% performance improvement on multi-socket systems.

```bash
# Modify bin/cassandra for NUMA node 0 binding:
NUMACTL_ARGS=""
if which numactl >/dev/null 2>/dev/null && numactl $NUMACTL_ARGS ls / >/dev/null 2>/dev/null
then
    NUMACTL="numactl -m 0 -N 0 $NUMACTL_ARGS"
else
    NUMACTL=""
fi

# Start Cassandra instances bound to different NUMA nodes:
# Instance 0 on NUMA node 0
numactl -m 0 -N 0 /opt/cassandra-inst0/bin/cassandra -R
# Instance 1 on NUMA node 1
numactl -m 1 -N 1 /opt/cassandra-inst1/bin/cassandra -R
# Instance 2 on NUMA node 2
numactl -m 2 -N 2 /opt/cassandra-inst2/bin/cassandra -R
# Instance 3 on NUMA node 3
numactl -m 3 -N 3 /opt/cassandra-inst3/bin/cassandra -R

# Verify NUMA topology
numactl -H
```

---

## Cassandra YAML Configuration

Configures essential Cassandra parameters in cassandra.yaml for network, storage, and concurrent operations tuning.
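The concurrency settings in this file follow simple rules of thumb: reads at 3x the logical CPUs allocated to the instance, writes at 1x. A quick sketch of that arithmetic, assuming a 32-CPU instance (the example size used in this guide):

```bash
# Derive concurrent_reads/concurrent_writes from the logical CPUs
# allocated to one Cassandra instance (32 is an assumed example).
cpus_per_instance=32
concurrent_reads=$(( cpus_per_instance * 3 ))
concurrent_writes=$cpus_per_instance

echo "concurrent_reads: $concurrent_reads"      # concurrent_reads: 96
echo "concurrent_writes: $concurrent_writes"    # concurrent_writes: 32
```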
```yaml
# cassandra.yaml essential settings
seeds: "192.168.1.100"
listen_address: 192.168.1.100
rpc_address: 192.168.1.100

# Storage directories
data_file_directories:
  - /mnt/nvme1/cass_db
commitlog_directory: /mnt/nvme1/cass_db/commitlog
cdc_raw_directory: /mnt/nvme1/cass_db/cdc_raw
saved_caches_directory: /mnt/nvme1/cass_db/saved_caches

# Concurrent operations (up to 15% speedup)
# concurrent_reads: 3x the logical CPUs allocated to this instance
concurrent_reads: 96      # for a 32-CPU instance
# concurrent_writes: same as the logical CPU count (avoids contention)
concurrent_writes: 32

# For Cass5: rename cassandra_latest.yaml to cassandra.yaml
# to enable new features (up to 27% speedup):
# - Unified Compaction
# - Trie Memtables
# - BTI SSTables
```

---

## Cassandra Storage Optimization

Configures Linux storage settings for NVMe devices to maximize Cassandra I/O performance, particularly for small random read workloads.

```bash
# Disk optimizations for NVMe devices (can double throughput for small random reads)

# Set the I/O scheduler to none for NVMe
echo none > /sys/block/nvme1n1/queue/scheduler
echo none > /sys/block/nvme2n1/queue/scheduler
echo none > /sys/block/nvme3n1/queue/scheduler
echo none > /sys/block/nvme4n1/queue/scheduler

# Mark the devices as non-rotational
echo 0 > /sys/class/block/nvme1n1/queue/rotational
echo 0 > /sys/class/block/nvme2n1/queue/rotational
echo 0 > /sys/class/block/nvme3n1/queue/rotational
echo 0 > /sys/class/block/nvme4n1/queue/rotational

# Critical: reduce read_ahead_kb from the default 128 to 8
# This alone can double throughput for small random requests
echo 8 > /sys/class/block/nvme1n1/queue/read_ahead_kb
echo 8 > /sys/class/block/nvme2n1/queue/read_ahead_kb
echo 8 > /sys/class/block/nvme3n1/queue/read_ahead_kb
echo 8 > /sys/class/block/nvme4n1/queue/read_ahead_kb
```

---

## Cassandra Stress Benchmark

Runs the cassandra-stress benchmark for performance testing with an optimized schema configuration for mixed read/write workloads.
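Before running the benchmark it is worth checking that the generated dataset comfortably exceeds RAM, so reads actually hit the NVMe devices rather than the page cache. A back-of-the-envelope sketch using the population size and per-entry size from the commands in this guide:

```bash
# Estimate on-disk dataset size: 600M entries at ~670 bytes each
# (values taken from the cassandra-stress invocation in this guide;
# actual on-disk size also depends on compression and compaction).
entries=600000000
bytes_per_entry=670
dataset_gib=$(( entries * bytes_per_entry / 1024 / 1024 / 1024 ))
echo "approx dataset size: ${dataset_gib} GiB"   # approx dataset size: 374 GiB
```

On the 128GB example system used earlier, a ~374 GiB dataset keeps the workload I/O-bound as intended.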
```bash
# Schema configuration for an 80:20 read:write workload
# cqlstress-insanity-example.yaml modifications:

# Cass3 - SizeTieredCompactionStrategy
# ) WITH compaction = {'class':'SizeTieredCompactionStrategy'}
#   AND compression = {'class':'LZ4Compressor','chunk_length_in_kb':64}

# Cass4 - SizeTieredCompactionStrategy
# ) WITH compaction = {'class':'SizeTieredCompactionStrategy'}
#   AND compression = {'class':'LZ4Compressor','chunk_length_in_kb':16}

# Cass5 - UnifiedCompactionStrategy (auto-adjusts between Leveled/SizeTiered)
# ) WITH compaction = {'class':'UnifiedCompactionStrategy'}
#   AND compression = {'class':'LZ4Compressor','chunk_length_in_kb':16}

# Create the initial database (670 bytes per entry)
/tools/bin/cassandra-stress \
  user profile=/tools/cql-insanity-example.yaml \
  ops\(insert=1\) no-warmup \
  cl=ONE \
  n=600000000 \
  -mode native cql3 \
  -pop seq=1..600000000 \
  -node 192.168.1.100 \
  -rate threads=32

# Run the 80/20 mixed workload benchmark
/tools/bin/cassandra-stress user \
  profile=/tools/cql-insanity-example.yaml \
  ops\(insert=20,simple1=80\) no-warmup \
  cl=ONE \
  duration=600s \
  -mode native cql3 \
  -pop dist=uniform\(1..600000000\) \
  -node 192.168.1.100 \
  -rate threads=256
```

---

## Apache Spark Configuration

Configures Apache Spark parameters for optimal performance on Intel Xeon platforms, including executor sizing, memory allocation, and query optimizations.
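The executor-sizing comments in the configuration below encode straightforward arithmetic. A sketch with assumed hardware values (128 logical CPUs and 512GB RAM; these are illustrative, not the host the guide's 28g/36g numbers came from):

```bash
# Assumed host: 128 logical CPUs, 512GB RAM (hypothetical example values)
nproc=128
total_mem_gb=512

executor_cores=8                                   # 4-8 cores recommended
executor_instances=$(( nproc / executor_cores ))   # nproc / executor.cores
executor_memory_gb=$(( total_mem_gb * 35 / 100 / executor_instances ))  # 35% of RAM / executors
offheap_gb=$(( total_mem_gb * 45 / 100 / executor_instances ))          # 45% of RAM / executors

echo "spark.executor.instances=${executor_instances}"   # ...=16
echo "spark.executor.memory=${executor_memory_gb}g"     # ...=11g
echo "spark.memory.offHeap.size=${offheap_gb}g"         # ...=14g
```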
```bash
# spark-defaults.conf or spark-submit parameters

# Executor configuration
spark.executor.cores=8              # 4-8 cores per executor recommended
spark.executor.instances=16         # nproc / executor.cores
spark.executor.memory=28g           # 35% of total memory / executors
spark.executor.memoryOverhead=1g
spark.executor.extraJavaOptions="-XX:+UseBiasedLocking -XX:BiasedLockingStartupDelay=0"

# Off-heap memory
spark.memory.offHeap.enabled=true
spark.memory.offHeap.size=36g       # 45% of total memory / executors

# Driver configuration
spark.driver.memory=20g
spark.driver.maxResultSize=4g

# Query optimizations
spark.sql.adaptive.enabled=true
spark.sql.files.maxPartitionBytes=2g
spark.sql.shuffle.partitions=256    # 2-4x CPU threads
spark.sql.broadcastTimeout=4800
spark.sql.autoBroadcastJoinThreshold=10m
spark.sql.optimizer.dynamicPartitionPruning.enabled=true

# Bloom filter optimization
spark.sql.optimizer.runtime.bloomFilter.enabled=true
spark.sql.optimizer.runtime.bloomFilter.applicationSideScanSizeThreshold=0

# Serialization
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max=512m

# Batch sizes
spark.sql.execution.arrow.maxRecordsPerBatch=20480
spark.sql.parquet.columnarReaderBatchSize=20480
spark.sql.inMemoryColumnarStorage.batchSize=20480

# GC tuning
spark.cleaner.periodicGC.interval=10s
spark.task.cpus=1
```

---

## Spark Linux System Configuration

Configures Linux kernel settings for optimal Spark performance including resource limits and huge pages.
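When scripting these checks, the active THP mode is the bracketed entry in the sysfs file. A small sketch that parses it from a sample string, so the snippet runs without root or a particular kernel setting (on a real system, read the file with `cat` instead):

```bash
# Sample contents of /sys/kernel/mm/transparent_hugepage/enabled;
# on a live host use: thp_status=$(cat /sys/kernel/mm/transparent_hugepage/enabled)
thp_status="[always] madvise never"

# The active mode is the bracketed word
active_mode=$(echo "$thp_status" | grep -o '\[[a-z]*\]' | tr -d '[]')
echo "$active_mode"   # always
```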
```bash
# Configure /etc/security/limits.conf for the Spark user
echo "$USER soft memlock unlimited" | sudo tee -a /etc/security/limits.conf
echo "$USER hard memlock unlimited" | sudo tee -a /etc/security/limits.conf
echo "$USER soft nproc unlimited" | sudo tee -a /etc/security/limits.conf
echo "$USER hard nproc unlimited" | sudo tee -a /etc/security/limits.conf
echo "$USER soft nofile 655360" | sudo tee -a /etc/security/limits.conf
echo "$USER hard nofile 655360" | sudo tee -a /etc/security/limits.conf

# Check the maximum file descriptor limit
cat /proc/sys/fs/nr_open

# Enable Transparent Huge Pages for Spark
echo always > /sys/kernel/mm/transparent_hugepage/defrag
echo always > /sys/kernel/mm/transparent_hugepage/enabled

# Verify THP status
cat /sys/kernel/mm/transparent_hugepage/enabled
# Output: [always] madvise never
```

---

## Apache Gluten with Velox Configuration

Configures Apache Gluten parameters for native vectorized query execution using the Velox backend, achieving up to 3.3x speedup over vanilla Spark.
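Two of the Gluten values below are derived from the core count: shuffle partitions at 2x total cores, and a coreRange string matching the NUMA topology. A sketch assuming 2 sockets of 72 cores (a hypothetical topology consistent with the 288 and 0-71 values used in this guide):

```bash
# Hypothetical topology: 2 sockets x 72 cores
sockets=2
cores_per_socket=72
total_cores=$(( sockets * cores_per_socket ))

shuffle_partitions=$(( total_cores * 2 ))        # 2x total cores
core_range="0-$(( cores_per_socket - 1 ))"       # first socket's cores

echo "spark.sql.shuffle.partitions=${shuffle_partitions}"   # ...=288
echo "spark.gluten.sql.columnar.coreRange=${core_range}"    # ...=0-71
```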
```bash
# Gluten + Velox spark-defaults.conf

# Essential Gluten settings
spark.memory.offHeap.enabled=true
spark.memory.offHeap.size=200g      # Allocate as much as possible for off-heap

# Velox memory configuration (Gluten v1.0.0+)
spark.gluten.sql.columnar.backend.velox.memoryCapRatio=0.75

# Use jemalloc for up to 10% performance improvement
spark.executorEnv.LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2

# Force shuffled hash join (better than sort-merge join in the native engine)
spark.gluten.sql.columnar.forceshuffledhashjoin=true

# Batch and partition settings
spark.gluten.sql.columnar.maxBatchSize=4096
spark.sql.files.maxPartitionBytes=2g    # At least 2GB for better native scan
spark.sql.shuffle.partitions=288        # 2x total cores

# Join optimization for complex queries (e.g., TPC-DS q72)
spark.gluten.sql.columnar.joinOptimizationLevel=18
spark.gluten.sql.columnar.logicalJoinOptimizationLevel=19
spark.gluten.sql.columnar.logicalJoinOptimizeEnable=true
spark.gluten.sql.columnar.physicalJoinOptimizationLevel=18
spark.gluten.sql.columnar.physicalJoinOptimizeEnable=true

# Fallback threshold for unsupported operators
spark.gluten.sql.columnar.wholeStage.fallback.threshold=2

# NUMA binding (bare metal only)
spark.gluten.sql.columnar.numaBinding=true    # false for cloud
spark.gluten.sql.columnar.coreRange=0-71      # Align with NUMA topology

# Bloom filter optimization
spark.sql.optimizer.runtime.bloomFilter.enabled=true
spark.sql.optimizer.runtime.bloomFilter.applicationSideScanSizeThreshold=0
```

---

## MySQL NUMA Binding and Configuration

Configures MySQL for optimal performance on Intel Xeon 6 processors with NUMA binding and key tuning parameters.
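The multi-instance startup loop below builds a CPU list per instance: 4 physical cores plus their hyperthread siblings, which on the host assumed in this guide sit at an offset of 192. A sketch of that computation in isolation, as a hypothetical helper:

```bash
# Hypothetical layout: hyperthread siblings live at core_id + 192
cpu_list() {
  local i=$1 lo hi
  lo=$(( i * 4 ))
  hi=$(( i * 4 + 3 ))
  echo "${lo}-${hi},$(( lo + 192 ))-$(( hi + 192 ))"
}

cpu_list 0   # 0-3,192-195
cpu_list 7   # 28-31,220-223
```

Check your own sibling offset with `lscpu -e` before reusing this layout; it varies with core count and BIOS settings.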
```bash
# Check NUMA topology
numactl -H

# Check CPU and NUMA node mapping
lscpu -e

# Start MySQL bound to NUMA node 0
numactl -N 0 -m 0 /usr/local/mysql/bin/mysqld_safe \
  --defaults-file=/etc/my.cnf \
  --user=mysql &

# For a systemd service, modify /lib/systemd/system/mysql.service:
# ExecStart=numactl -N 0 -m 0 /usr/sbin/mysqld

# Multi-instance startup with core binding
# (4 physical cores per instance plus their hyperthread siblings at offset 192,
# i.e., 8 logical CPUs per instance)
for i in {0..7}; do
  echo "Starting instance ${i}"
  numactl -C $((${i}*4))-$((${i}*4+3)),$((${i}*4+192))-$((${i}*4+192+3)) -m 0 \
    mysqld_safe --defaults-file=/mnt/data/conf/my-inst${i}.cnf &
done
```

```ini
# my.cnf key parameters for performance
[mysqld]
# Buffer pool: 50-75% of total RAM for a single instance.
# For multi-instance, the total across all instances should be 50-75% of server memory.
innodb_buffer_pool_size = 64G

# Disable binary logging for benchmark scenarios
skip-log-bin

# Flush method that avoids double buffering
innodb_flush_method = O_DIRECT_NO_FSYNC

# Transaction durability vs. performance trade-off
# 1 = most durable, 0 or 2 = better performance
innodb_flush_log_at_trx_commit = 2
```

---

## PostgreSQL Configuration and Tuning

Configures PostgreSQL for optimal performance including WAL settings, shared buffers, and huge pages support.
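The hugepage count used below (6000) covers 10GB of shared_buffers in 2MB pages plus headroom for PostgreSQL's other shared memory. The arithmetic, sketched with an assumed ~15% margin (the margin is an illustration; the guide's 6000 is in the same ballpark):

```bash
# 2MB huge pages needed for shared_buffers, plus headroom for other
# shared memory segments (the 15% margin is an assumption, not from the guide)
shared_buffers_mb=10240     # 10GB
hugepage_mb=2
pages=$(( shared_buffers_mb / hugepage_mb ))
pages_with_headroom=$(( pages * 115 / 100 ))

echo "base pages: ${pages}"                    # base pages: 5120
echo "with headroom: ${pages_with_headroom}"   # with headroom: 5888
```

PostgreSQL also reports the exact requirement itself: `postgres -D <datadir> -C shared_memory_size_in_huge_pages`.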
```bash
# Initialize PostgreSQL with a 1GB WAL segment size (vs. the 16MB default)
${PG_HOME}/bin/initdb -D ./data --wal-segsize=1024

# Verify the WAL segment size
du -hcs pg_wal/* | head -5
# 1.0G    0000000100000A0400000002
# 1.0G    0000000100000A0400000003

# Start PostgreSQL with NUMA binding
numactl -N 0 -m 0 ${PG_HOME}/bin/pg_ctl start -D /data/pgbase -l logfile &

# Configure huge pages (approximately 10% improvement)
# Calculate the required hugepages for 10GB shared_buffers
echo 6000 > /proc/sys/vm/nr_hugepages_mempolicy
# Or add to /etc/sysctl.conf for persistence:
# vm.nr_hugepages = 6000
# sysctl -p

# For 1GB huge pages (slightly better than 2MB), add to GRUB:
# GRUB_CMDLINE_LINUX="rhgb default_hugepagesz=1G hugepagesz=1G"
# Then: sudo grub2-mkconfig -o /boot/efi/EFI/ubuntu/grub.cfg
```

```ini
# postgresql.conf key parameters

# Shared buffers: start at 25% of system memory
shared_buffers = 64GB

# Enable huge pages
huge_pages = on

# WAL settings for performance benchmarks
wal_level = minimal
synchronous_commit = off    # Use 'on' for production

# Enable autovacuum to avoid performance dips
autovacuum = on
```

---

## HammerDB Benchmark for MySQL

Runs the HammerDB TPC-C benchmark against MySQL for database performance evaluation.

```tcl
# schemabuild.tcl - Create the TPC-C schema
puts "SETTING CONFIGURATION"
dbset db mysql
dbset bm TPC-C
diset connection mysql_host localhost
diset connection mysql_port 3306
diset tpcc mysql_user root
diset tpcc mysql_pass MyNewPass4!
diset tpcc mysql_count_ware 100
diset tpcc mysql_partition false
diset tpcc mysql_num_vu 8
diset tpcc mysql_storage_engine innodb
print dict
vuset logtotemp 1
vuset unique 1
buildschema
waittocomplete
```

```tcl
# mysqlrun.tcl - Run the TPC-C benchmark
puts "SETTING CONFIGURATION"
dbset db mysql
diset connection mysql_host localhost
diset connection mysql_port 3306
diset tpcc mysql_user root
diset tpcc mysql_pass MyNewPass4!
diset tpcc mysql_driver timed
diset tpcc mysql_prepared false
diset tpcc mysql_rampup 2
diset tpcc mysql_duration 5
diset tpcc mysql_timeprofile true
diset tpcc mysql_allwarehouse true
vuset logtotemp 1
vuset unique 1
loadscript
puts "TEST STARTED"
vuset vu 64
vucreate
vurun
runtimer 500
vudestroy
puts "TEST COMPLETE"
```

```bash
# Execute the HammerDB scripts
./hammerdbcli auto schemabuild.tcl
./hammerdbcli auto mysqlrun.tcl

# Multi-instance benchmark execution
for i in {0..23}; do
  numactl -N 3-5 -m 3-5 ${HAMMERDB_HOME}/hammerdbcli auto \
    ${CUR_FOLDER}/mysqlrun_${i}.tcl > ${CUR_FOLDER}/result-${i}.log 2>&1 &
done
```

---

## Sysbench Benchmark for MySQL

Runs the Sysbench OLTP benchmark for MySQL performance testing.

```bash
# Prepare the Sysbench database (30 tables, 1M rows each, ~8GB of data)
sysbench /usr/local/share/sysbench/oltp_read_write.lua \
  --tables=30 \
  --threads=100 \
  --table-size=1000000 \
  --mysql-db=sbtest \
  --mysql-user=root \
  --mysql-password=MyNewPass4! \
  --db-driver=mysql \
  --mysql-host=localhost \
  --mysql_storage_engine=innodb \
  prepare

# Run the OLTP read/write benchmark (single instance)
# Note: --tables must not exceed the number of tables prepared above
numactl -N 3-5 -m 3-5 sysbench oltp_read_write \
  --tables=30 \
  --threads=64 \
  --table-size=1000000 \
  --mysql-host=127.0.0.1 \
  --mysql-db=sbtest \
  --mysql-user=intel \
  --db-driver=mysql \
  --mysql_storage_engine=innodb \
  --report-interval=60 \
  --time=180 \
  --mysql-port=3306 \
  --mysql-password=XXX \
  --warmup-time=60 \
  run

# Multi-instance benchmark (24 instances)
for i in {0..23}; do
  numactl -N 3-5 -m 3-5 sysbench /usr/local/share/sysbench/oltp_read_write.lua \
    --tables=30 \
    --threads=64 \
    --table-size=1000000 \
    --mysql-host=127.0.0.1 \
    --mysql-db=sbtest \
    --mysql-user=intel \
    --db-driver=mysql \
    --mysql_storage_engine=innodb \
    --report-interval=60 \
    --time=180 \
    --mysql-port=$((${i}+3306)) \
    --mysql-password=XXX \
    --warmup-time=60 \
    run > sysbench-${i}.txt 2>&1 &
done
sleep 250
```

---

## HammerDB Benchmark for PostgreSQL

Runs the HammerDB TPC-C benchmark against PostgreSQL for
database performance evaluation.

```tcl
# pgbuild.tcl - Create the TPC-C schema for PostgreSQL
dbset db pg
dbset bm TPC-C
diset connection pg_host localhost
diset connection pg_port 5432
diset tpcc pg_count_ware 100
diset tpcc pg_num_vu 8
diset tpcc pg_superuser intel
diset tpcc pg_superuserpass postgres
diset tpcc pg_storedprocs false
diset tpcc pg_raiseerror true
vuset logtotemp 1
vuset unique 1
buildschema
waittocomplete
```

```tcl
# pgtest.tcl - Run the TPC-C benchmark for PostgreSQL
puts "SETTING CONFIGURATION"
dbset db pg
diset connection pg_host localhost
diset connection pg_port 5432
diset tpcc pg_superuser intel
diset tpcc pg_superuserpass postgres
diset tpcc pg_vacuum true
diset tpcc pg_driver timed
diset tpcc pg_rampup 2
diset tpcc pg_duration 5
diset tpcc pg_storedprocs false
diset tpcc pg_count_ware 100
diset tpcc pg_raiseerror true
vuset logtotemp 1
vuset unique 1
loadscript
puts "TEST STARTED"
vuset vu 64
vucreate
vurun
runtimer 300
vudestroy
puts "TEST COMPLETE"
```

```bash
# Execute HammerDB for PostgreSQL
./hammerdbcli auto pgbuild.tcl
./hammerdbcli auto pgtest.tcl

# Example output:
# Vuser 1:64 Active Virtual Users configured
# Vuser 1:TEST RESULT : System achieved 1483849 NOPM from 3515684 PostgreSQL TPM
```

---

## Intel Performance Counter Monitor (PCM)

Monitors CPU performance metrics including memory bandwidth, cache misses, and energy states using Intel PCM tools.

```bash
# Install PCM from GitHub
git clone https://github.com/intel/pcm.git
cd pcm
mkdir build && cd build
cmake ..
make -j

# Monitor real-time CPU metrics
sudo ./pcm

# Monitor memory bandwidth per channel
sudo ./pcm-memory

# Monitor PCIe bandwidth
sudo ./pcm-pcie

# Monitor power and energy states
sudo ./pcm-power

# Monitor cache metrics
sudo ./pcm-core

# Example: monitor during a Cassandra stress test
sudo ./pcm -r -- cassandra-stress write n=1000000

# Cloud vPMU availability testing
perf stat --timeout 3000 -a -e cycles stress-ng -m $(nproc)
perf stat --timeout 3000 -a -M CPI stress-ng -m $(nproc)
```

---

## Intel PerfSpect

Collects system configuration, performance metrics, and generates flame graphs for performance analysis.

```bash
# Install PerfSpect
git clone https://github.com/intel/PerfSpect.git
cd PerfSpect
pip install -r requirements.txt

# Collect a system configuration report
sudo ./perfspect report

# Monitor CPU metrics in real time
sudo ./perfspect monitor

# Collect telemetry during a workload
sudo ./perfspect collect -d 60 -o ./results

# Generate a flame graph from call stacks
sudo ./perfspect flame -d 30 -o ./flamegraph.svg

# Modify performance-related settings
sudo ./perfspect config --governor=performance
```

---

## Intel gProfiler

Profiles system-wide CPU usage across native programs, Java, Python, and kernel routines with flame graph visualization.
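Flame-graph tooling of this kind consumes "collapsed" stack samples: semicolon-joined frames followed by a sample count. As a sketch of what post-processing such data can look like, this aggregates samples by leaf frame to find the hottest function; the sample data is made up, not output from any of the tools above:

```bash
# Made-up collapsed-stack samples (format: frame;frame;frame count)
collapsed='main;compute;hash 40
main;compute;hash 15
main;io;read 25'

# Sum samples by leaf frame and print the hottest one
hottest=$(printf '%s\n' "$collapsed" | awk '{
  n = $NF                 # trailing sample count
  sub(/ [0-9]+$/, "")     # strip the count, leaving only the stack
  split($0, frames, ";")
  count[frames[length(frames)]] += n
} END {
  max = -1
  for (k in count) if (count[k] > max) { max = count[k]; best = k }
  print best, max
}')
echo "$hottest"   # hash 55
```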
```bash
# Install gProfiler
pip install gprofiler
# Or download the binary
wget https://github.com/intel/gprofiler/releases/latest/download/gprofiler
chmod +x gprofiler

# Run system-wide profiling for 60 seconds
sudo ./gprofiler -d 60 -o profile_output

# Profile a specific Java application
sudo ./gprofiler -d 60 --java-async-profiler-mode cpu -o java_profile

# Profile with Python support
sudo ./gprofiler -d 60 --python-mode py-spy -o python_profile

# Upload results to gProfiler Performance Studio
sudo ./gprofiler -d 60 \
  --upload-results \
  --server-host=gprofiler-studio.example.com \
  --service-name=my-application

# Continuous profiling with periodic uploads
sudo ./gprofiler \
  --continuous \
  --upload-results \
  --server-host=gprofiler-studio.example.com \
  --profiling-frequency=11 \
  --service-name=my-application
```

---

## Summary

The Intel Optimization Zone provides actionable tuning recipes for maximizing performance on Intel Xeon processors across databases, data processing frameworks, and Java applications. Key optimization patterns include NUMA-aware deployment with numactl binding, JVM tuning with G1GC and huge pages, storage optimization for NVMe devices, and Spark/Gluten configurations for vectorized query execution. The guides demonstrate performance improvements ranging from 10-100% depending on workload characteristics, with Gluten + Velox achieving up to 3.3x speedup over vanilla Spark on Intel Xeon 6 processors.

Integration patterns focus on multi-instance deployments for high-core-count systems, where applications like MySQL, PostgreSQL, and Cassandra are bound to specific NUMA nodes or Sub-NUMA Clusters (SNC). Benchmarking methodologies using HammerDB, Sysbench, cassandra-stress, and TPC-DS/TPC-H provide standardized ways to measure and validate optimizations. Performance monitoring tools including Intel PCM, PerfSpect, gProfiler, and VTune enable deep analysis of bottlenecks at the CPU, memory, storage, and application levels to guide further tuning efforts.