# Highway SIMD Library

Highway is a C++ library providing portable SIMD/vector intrinsics that enable efficient cross-platform vectorization. It abstracts CPU-specific SIMD instructions across seven architectures (x86, ARM, RISC-V, POWER, LoongArch, IBM Z, WebAssembly) behind a unified API, allowing developers to write performance-critical code once while retaining near-optimal performance on each platform. The library supports 27 instruction-set targets, from basic SSE2 to AVX-512 and ARM SVE, with automatic runtime or compile-time dispatch to select the best available implementation.

Highway achieves 5-10x speedups and up to 5x energy reduction compared to scalar code by making SIMD programming practical. Unlike compiler autovectorization, Highway provides predictable code generation through carefully designed functions that map directly to CPU instructions. The library requires only C++11, supports dynamic dispatch for heterogeneous environments, and includes extensive utilities for sorting, mathematical operations, image processing, and data transformations.

## Core APIs and Functions

### Basic Vector Operations with ScalableTag

Creating and manipulating SIMD vectors using length-agnostic tags for maximum portability.
```cpp
#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

void VectorAddExample(const float* HWY_RESTRICT a, const float* HWY_RESTRICT b,
                      float* HWY_RESTRICT out, size_t count) {
  const hn::ScalableTag<float> d;  // Use best available vector size
  const size_t N = hn::Lanes(d);   // Number of lanes in vector
  size_t i = 0;
  for (; i + N <= count; i += N) {
    auto va = hn::Load(d, a + i);
    auto vb = hn::Load(d, b + i);
    auto sum = hn::Add(va, vb);
    hn::Store(sum, d, out + i);
  }
  // Handle remainder with a single-lane capped tag
  if (i < count) {
    const hn::CappedTag<float, 1> d1;
    for (; i < count; ++i) {
      auto va = hn::Load(d1, a + i);
      auto vb = hn::Load(d1, b + i);
      hn::Store(hn::Add(va, vb), d1, out + i);
    }
  }
}
```

### Dynamic Dispatch for Multiple Targets

Implementing runtime-adaptive code that automatically selects the best SIMD instruction set.

```cpp
// skeleton.cc
#undef HWY_TARGET_INCLUDE
#define HWY_TARGET_INCLUDE "skeleton.cc"  // must name this file itself
#include "hwy/foreach_target.h"  // must come before highway.h
#include "hwy/highway.h"

HWY_BEFORE_NAMESPACE();
namespace myproject {
namespace HWY_NAMESPACE {

namespace hn = hwy::HWY_NAMESPACE;

void FloorLog2(const uint8_t* HWY_RESTRICT values, size_t count,
               uint8_t* HWY_RESTRICT log2) {
  const hn::ScalableTag<float> df;
  const hn::RebindToSigned<decltype(df)> d32;
  const hn::Rebind<uint8_t, decltype(df)> d8;
  const size_t N = hn::Lanes(df);
  for (size_t i = 0; i + N <= count; i += N) {
    auto vi32 = hn::PromoteTo(d32, hn::Load(d8, values + i));
    auto bits = hn::BitCast(d32, hn::ConvertTo(df, vi32));
    auto exponent = hn::Sub(hn::ShiftRight<23>(bits), hn::Set(d32, 127));
    hn::Store(hn::DemoteTo(d8, exponent), d8, log2 + i);
  }
}

}  // namespace HWY_NAMESPACE
}  // namespace myproject
HWY_AFTER_NAMESPACE();

#if HWY_ONCE
namespace myproject {
HWY_EXPORT(FloorLog2);

void CallFloorLog2(const uint8_t* in, size_t count, uint8_t* out) {
  return HWY_DYNAMIC_DISPATCH(FloorLog2)(in, count, out);
}
}  // namespace myproject
#endif
```

### FMA and Arithmetic Operations

Fused multiply-add and combined arithmetic operations for maximum performance.
```cpp
#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

// SAXPY: y[i] = alpha * x[i] + y[i]
void SAXPY(const float alpha, const float* HWY_RESTRICT x,
           float* HWY_RESTRICT y, size_t n) {
  const hn::ScalableTag<float> d;
  const size_t N = hn::Lanes(d);
  const auto valpha = hn::Set(d, alpha);
  size_t i = 0;
  for (; i + N <= n; i += N) {
    auto vx = hn::Load(d, x + i);
    auto vy = hn::Load(d, y + i);
    // MulAdd(a, b, c) = a * b + c (uses FMA if available)
    auto result = hn::MulAdd(valpha, vx, vy);
    hn::Store(result, d, y + i);
  }
  // Scalar remainder
  for (; i < n; ++i) {
    y[i] = alpha * x[i] + y[i];
  }
}
```

### Masked Operations and Conditional Processing

Using masks for conditional operations and partial vector operations.

```cpp
#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

void ClampArray(float* HWY_RESTRICT data, size_t count, float min_val,
                float max_val) {
  const hn::ScalableTag<float> d;
  const size_t N = hn::Lanes(d);
  const auto vmin = hn::Set(d, min_val);
  const auto vmax = hn::Set(d, max_val);
  size_t i = 0;
  for (; i + N <= count; i += N) {
    auto v = hn::Load(d, data + i);
    // Create masks for out-of-range values
    auto too_low = hn::Lt(v, vmin);
    auto too_high = hn::Gt(v, vmax);
    // Clamp: replace out-of-range lanes with the limits
    v = hn::IfThenElse(too_low, vmin, v);
    v = hn::IfThenElse(too_high, vmax, v);
    hn::Store(v, d, data + i);
  }
  // Handle remainder with BlendedStore (safe, no fault)
  if (i < count) {
    const size_t remaining = count - i;
    auto v = hn::LoadN(d, data + i, remaining);
    auto mask = hn::FirstN(d, remaining);
    v = hn::IfThenElse(hn::Lt(v, vmin), vmin, v);
    v = hn::IfThenElse(hn::Gt(v, vmax), vmax, v);
    hn::BlendedStore(v, mask, d, data + i);
  }
}
```

### Transform Functions for Array Processing

High-level utilities that handle loop iteration and remainder processing automatically.
```cpp
#include "hwy/highway.h"
#include "hwy/contrib/algo/transform-inl.h"

namespace hn = hwy::HWY_NAMESPACE;

void MultiplyArrays(const float* HWY_RESTRICT a, const float* HWY_RESTRICT b,
                    float* HWY_RESTRICT out, size_t n) {
  const hn::ScalableTag<float> d;
  // Transform2 handles the loop and remainder automatically. `out` is the
  // in/out array; this functor ignores its incoming values.
  hn::Transform2(d, out, n, a, b,
                 [](const auto d, const auto /*v*/, const auto va,
                    const auto vb) HWY_ATTR { return hn::Mul(va, vb); });
}

// In-place transformation
void SquareArray(float* HWY_RESTRICT data, size_t n) {
  const hn::ScalableTag<float> d;
  hn::Transform(d, data, n, [](const auto d, const auto v) HWY_ATTR {
    return hn::Mul(v, v);
  });
}

// Complex transformation with a generator
void GenerateSequence(float* HWY_RESTRICT out, size_t n, float scale) {
  const hn::ScalableTag<float> d;
  const auto vscale = hn::Set(d, scale);
  hn::Generate(d, out, n, [vscale](const auto d, const auto vidx) HWY_ATTR {
    auto idx_float = hn::ConvertTo(d, vidx);
    return hn::Mul(idx_float, vscale);
  });
}
```

### Vectorized Quicksort (VQSort)

High-performance sorting with automatic instruction set selection.

```cpp
#include "hwy/contrib/sort/vqsort.h"

#include <cstdlib>
#include <vector>

void SortExample() {
  std::vector<int32_t> data(1000000);
  // ... fill data ...

  // Sort ascending (unstable sort)
  hwy::VQSort(data.data(), data.size(), hwy::SortAscending());

  // Sort descending
  hwy::VQSort(data.data(), data.size(), hwy::SortDescending());
}

void PartialSortExample() {
  std::vector<int32_t> data(1000000);
  // ... fill data ...

  // Get the smallest 1000 elements (partially sorted)
  size_t k = 1000;
  hwy::VQPartialSort(data.data(), data.size(), k, hwy::SortAscending());
  // data[0..999] now contains the smallest 1000 elements, sorted
}

void SelectExample() {
  std::vector<int32_t> data(1000000);
  // ... fill data ...
  // Find the median (partition around the middle element)
  size_t k = data.size() / 2;
  hwy::VQSelect(data.data(), data.size(), k, hwy::SortAscending());
  int32_t median = data[k];
  // Elements before data[k] are <= median, elements after are >= median
  (void)median;
}

void SortKeyValuePairs() {
  // K32V32 = struct with a 32-bit key and a 32-bit value
  std::vector<hwy::K32V32> pairs(100000);
  for (size_t i = 0; i < pairs.size(); ++i) {
    pairs[i].key = static_cast<uint32_t>(rand());
    pairs[i].value = static_cast<uint32_t>(i);  // preserve original index
  }
  // Sort by key; values follow their keys
  hwy::VQSort(pairs.data(), pairs.size(), hwy::SortAscending());
}
```

### SIMD Math Functions

Vectorized mathematical operations including trigonometry, logarithms, and exponentials.

```cpp
#include "hwy/highway.h"
#include "hwy/contrib/math/math-inl.h"

namespace hn = hwy::HWY_NAMESPACE;

void ApplySinCos(const float* HWY_RESTRICT angles, float* HWY_RESTRICT sin_out,
                 float* HWY_RESTRICT cos_out, size_t n) {
  const hn::ScalableTag<float> d;
  const size_t N = hn::Lanes(d);
  for (size_t i = 0; i + N <= n; i += N) {
    auto vangle = hn::Load(d, angles + i);
    auto vsin = hn::Sin(d, vangle);  // ULP error <= 3
    auto vcos = hn::Cos(d, vangle);  // ULP error <= 3
    hn::Store(vsin, d, sin_out + i);
    hn::Store(vcos, d, cos_out + i);
  }
}

void LogExpOperations(const float* HWY_RESTRICT x, float* HWY_RESTRICT out,
                      size_t n) {
  const hn::ScalableTag<float> d;
  const size_t N = hn::Lanes(d);
  for (size_t i = 0; i + N <= n; i += N) {
    auto v = hn::Load(d, x + i);
    // Natural logarithm (ULP error <= 1)
    auto log_v = hn::Log(d, v);
    // Exponential (ULP error <= 1)
    auto exp_v = hn::Exp(d, v);
    // Base-2 logarithm, and 2^v emulated as exp(v * ln 2)
    auto log2_v = hn::Log2(d, v);
    auto exp2_v = hn::Exp(d, hn::Mul(v, hn::Set(d, 0.693147f)));  // ln(2)
    (void)exp_v; (void)log2_v; (void)exp2_v;
    // Round trip: exp(log(x)) should equal x (up to rounding)
    auto result = hn::Exp(d, log_v);
    hn::Store(result, d, out + i);
  }
}

void HyperbolicFunctions(float* HWY_RESTRICT data, size_t n) {
  const hn::ScalableTag<float> d;
  const size_t N = hn::Lanes(d);
  for (size_t i = 0; i + N <= n; i += N) {
    auto v = hn::Load(d, data + i);
    auto sinh_v = hn::Sinh(d, v);  // hyperbolic sine
    auto cosh_v = hn::Cosh(d, v);  // hyperbolic cosine
    auto tanh_v = hn::Tanh(d, v);  // hyperbolic tangent
    auto result = hn::Add(sinh_v, hn::Mul(cosh_v, tanh_v));
    hn::Store(result, d, data + i);
  }
}
```

### Memory Alignment and Allocation

Allocating aligned memory for optimal SIMD performance.

```cpp
#include "hwy/aligned_allocator.h"
#include "hwy/highway.h"

#include <cstdint>
#include <new>
#include <vector>

namespace hn = hwy::HWY_NAMESPACE;

class AlignedBuffer {
 public:
  explicit AlignedBuffer(size_t size) : size_(size) {
    // Allocates memory aligned to HWY_ALIGNMENT bytes
    data_ = static_cast<float*>(
        hwy::AllocateAlignedBytes(size * sizeof(float), nullptr, nullptr));
    if (!data_) throw std::bad_alloc();
  }
  ~AlignedBuffer() { hwy::FreeAlignedBytes(data_, nullptr, nullptr); }

  float* get() { return data_; }
  const float* get() const { return data_; }
  size_t size() const { return size_; }

  // Verify alignment
  bool is_aligned() const {
    return reinterpret_cast<uintptr_t>(data_) % HWY_ALIGNMENT == 0;
  }

 private:
  float* data_;
  size_t size_;
};

void UseAlignedBuffer() {
  const size_t n = 10000;
  AlignedBuffer buffer(n);
  const hn::ScalableTag<float> d;
  const size_t N = hn::Lanes(d);
  // Aligned Load/Store can be used for better performance (LoadU not needed)
  for (size_t i = 0; i + N <= n; i += N) {
    auto v = hn::Load(d, buffer.get() + i);
    v = hn::Add(v, hn::Set(d, 1.0f));
    hn::Store(v, d, buffer.get() + i);
  }
}

// Using an STL container with the aligned allocator
void STLAlignedVector() {
  std::vector<float, hwy::AlignedAllocator<float>> aligned_vec(1000);
  // This vector's data is properly aligned for SIMD
  const hn::ScalableTag<float> d;
  const size_t N = hn::Lanes(d);
  for (size_t i = 0; i + N <= aligned_vec.size(); i += N) {
    auto v = hn::Load(d, aligned_vec.data() + i);
    hn::Store(hn::Mul(v, hn::Set(d, 2.0f)), d, aligned_vec.data() + i);
  }
}
```

### CappedTag and FixedTag for Bounded Vectors

Using tags with size constraints for specific data structures or algorithms.
```cpp
#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

// Process small fixed-size arrays (e.g., 3D/4D vectors)
void Transform4x4Matrix(float matrix[16]) {
  // Exactly 4 float lanes (works even on AVX-512)
  const hn::FixedTag<float, 4> d4;
  for (int row = 0; row < 4; ++row) {
    auto v = hn::Load(d4, matrix + row * 4);
    v = hn::Mul(v, hn::Set(d4, 2.0f));
    hn::Store(v, d4, matrix + row * 4);
  }
}

// Process up to N lanes (rounds down to a power of 2)
void ProcessNarrowData(const uint16_t* in, uint16_t* out, size_t count) {
  // Process up to 8 lanes at once (8, 4, 2, or 1 depending on hardware)
  const hn::CappedTag<uint16_t, 8> d;
  const size_t N = hn::Lanes(d);
  size_t i = 0;
  for (; i + N <= count; i += N) {
    auto v = hn::Load(d, in + i);
    v = hn::Add(v, hn::Set(d, uint16_t{1}));
    hn::Store(v, d, out + i);
  }
  // Remainder
  for (; i < count; ++i) {
    out[i] = in[i] + 1;
  }
}

// RGB to grayscale with CappedTag (4 pixels per iteration, 3 channels each)
void RGBToGray(const uint8_t* rgb, float* gray, size_t pixel_count) {
  const hn::CappedTag<uint8_t, 4> d8;
  const hn::Rebind<int32_t, decltype(d8)> d32;
  const hn::Rebind<float, decltype(d8)> df;
  const size_t N = hn::Lanes(d8);  // pixels per iteration
  size_t i = 0;
  for (; i + N <= pixel_count; i += N) {
    // De-interleave R, G, B channels
    hn::Vec<decltype(d8)> r8, g8, b8;
    hn::LoadInterleaved3(d8, rgb + i * 3, r8, g8, b8);
    const auto r = hn::ConvertTo(df, hn::PromoteTo(d32, r8));
    const auto g = hn::ConvertTo(df, hn::PromoteTo(d32, g8));
    const auto b = hn::ConvertTo(df, hn::PromoteTo(d32, b8));
    // Apply grayscale weights: 0.299*R + 0.587*G + 0.114*B
    const auto vgray = hn::MulAdd(
        r, hn::Set(df, 0.299f),
        hn::MulAdd(g, hn::Set(df, 0.587f), hn::Mul(b, hn::Set(df, 0.114f))));
    hn::Store(vgray, df, gray + i);
  }
  // Scalar remainder
  for (; i < pixel_count; ++i) {
    gray[i] = 0.299f * rgb[i * 3] + 0.587f * rgb[i * 3 + 1] +
              0.114f * rgb[i * 3 + 2];
  }
}
```

### Horizontal Operations and Reductions

Combining vector lanes into scalar results or performing cross-lane operations.
```cpp
#include "hwy/highway.h"

#include <algorithm>  // std::min for the scalar remainder

namespace hn = hwy::HWY_NAMESPACE;

// Sum all elements in an array
float SumArray(const float* data, size_t n) {
  const hn::ScalableTag<float> d;
  const size_t N = hn::Lanes(d);
  auto vsum = hn::Zero(d);
  size_t i = 0;
  for (; i + N <= n; i += N) {
    auto v = hn::Load(d, data + i);
    vsum = hn::Add(vsum, v);
  }
  // Reduce the vector to a single value
  float sum = hn::ReduceSum(d, vsum);
  // Add the scalar remainder
  for (; i < n; ++i) {
    sum += data[i];
  }
  return sum;
}

// Dot product
float DotProduct(const float* a, const float* b, size_t n) {
  const hn::ScalableTag<float> d;
  const size_t N = hn::Lanes(d);
  auto vdot = hn::Zero(d);
  size_t i = 0;
  for (; i + N <= n; i += N) {
    auto va = hn::Load(d, a + i);
    auto vb = hn::Load(d, b + i);
    vdot = hn::MulAdd(va, vb, vdot);
  }
  float dot = hn::ReduceSum(d, vdot);
  for (; i < n; ++i) {
    dot += a[i] * b[i];
  }
  return dot;
}

// Find the minimum value
float FindMin(const float* data, size_t n) {
  if (n == 0) return 0.0f;
  const hn::ScalableTag<float> d;
  const size_t N = hn::Lanes(d);
  auto vmin = hn::Set(d, data[0]);
  size_t i = 0;
  for (; i + N <= n; i += N) {
    auto v = hn::Load(d, data + i);
    vmin = hn::Min(vmin, v);
  }
  float min_val = hn::ReduceMin(d, vmin);
  for (; i < n; ++i) {
    min_val = std::min(min_val, data[i]);
  }
  return min_val;
}
```

### Type Conversions and Promotions

Converting between different element types and sizes.
```cpp
#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

// Convert uint8 to float with scaling
void Uint8ToFloat(const uint8_t* in, float* out, size_t n, float scale) {
  const hn::ScalableTag<float> df;
  const hn::RebindToSigned<decltype(df)> d32;
  const hn::Rebind<uint8_t, decltype(df)> d8;  // quarter-width u8 vector
  const size_t N = hn::Lanes(df);
  const auto vscale = hn::Set(df, scale);
  for (size_t i = 0; i + N <= n; i += N) {
    auto v8 = hn::Load(d8, in + i);
    // Promote uint8 -> int32, then convert to float
    auto v32 = hn::PromoteTo(d32, v8);
    auto vf = hn::ConvertTo(df, v32);
    // Scale and store
    vf = hn::Mul(vf, vscale);
    hn::Store(vf, df, out + i);
  }
}

// Convert float to int16 with saturation
void FloatToInt16(const float* in, int16_t* out, size_t n) {
  const hn::ScalableTag<float> df;
  const hn::RebindToSigned<decltype(df)> d32;
  const hn::Rebind<int16_t, decltype(df)> d16;  // half-width i16 vector
  const size_t N = hn::Lanes(df);
  for (size_t i = 0; i + N <= n; i += N) {
    auto vf = hn::Load(df, in + i);
    // Convert float -> int32, then demote int32 -> int16 (with saturation)
    auto v32 = hn::ConvertTo(d32, vf);
    auto v16 = hn::DemoteTo(d16, v32);
    hn::Store(v16, d16, out + i);
  }
}

// Interleave two arrays (useful for complex numbers, stereo audio)
void InterleaveArrays(const float* left, const float* right,
                      float* interleaved, size_t n) {
  const hn::ScalableTag<float> d;
  const size_t N = hn::Lanes(d);
  for (size_t i = 0; i + N <= n; i += N) {
    auto vleft = hn::Load(d, left + i);
    auto vright = hn::Load(d, right + i);
    // Store interleaved: L0 R0 L1 R1 L2 R2 ...
    hn::StoreInterleaved2(vleft, vright, d, interleaved + i * 2);
  }
}
```

## Summary and Integration Patterns

Highway is designed for performance-critical applications requiring portable SIMD acceleration, including image/video processing (codecs such as JPEG XL and AOM), machine learning inference (TensorFlow, gemma.cpp), database indexing (iresearch, ScaNN), compression algorithms, and scientific computing.
The library's primary use cases are scenarios where processing large arrays with uniform operations yields significant speedup, typically datasets above 100 KiB where vectorization overhead is amortized. Key applications include dot products and matrix operations in ML, sorting and searching in databases, trigonometric functions in signal processing, and pixel transformations in graphics pipelines.

Integration typically follows one of two patterns: static dispatch for single-target builds (embedded systems, mobile apps), where `HWY_STATIC_DISPATCH` compiles code once for a known architecture, or dynamic dispatch for multi-platform deployment, where `HWY_DYNAMIC_DISPATCH` generates code for multiple targets and selects the best at runtime. The library integrates with existing C++ codebases with minimal build requirements beyond enabling optimizations (`-O2`), and is available through standard package managers (vcpkg, conan, conda-forge).

For maximum performance, design data structures with aligned allocations, structure-of-arrays layouts, and padded buffers, and leverage Highway's high-level APIs such as `Transform` and `VQSort`, which automatically handle edge cases, remainder loops, and platform-specific optimizations.
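For the static-dispatch pattern, only the call site differs from the dynamic-dispatch example earlier; a minimal sketch, assuming the `skeleton.cc` setup from the dynamic-dispatch section (this is a fragment, not a complete translation unit):

```cpp
// Single compile-time target: HWY_STATIC_DISPATCH resolves directly to the
// one FloorLog2 implementation compiled for the target chosen at build time,
// with no runtime table lookup.
void CallFloorLog2Static(const uint8_t* in, size_t count, uint8_t* out) {
  return HWY_STATIC_DISPATCH(FloorLog2)(in, count, out);
}
```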