On designing Tenseur, A C++ tensor library with lazy evaluation

Marc

2024/10/27

Tenseur is a header-only C++ numerical library. It is designed to provide high-performance linear algebra and multilinear algebra operations by leveraging BLAS (Basic Linear Algebra Subprograms) and SIMD (Single Instruction, Multiple Data) backends. The library, currently in development, aims to offer a robust framework for numerical computations, focusing on both performance and ease of use. In this blog post, we will first describe some design decisions, then show some of its features and advantages over other libraries.

Lazy evaluation

First, let’s explore lazy evaluation of expressions. Lazy evaluation means that expressions are not computed immediately; instead, they are evaluated only when necessary. This approach can lead to significant performance improvements.

Expression templates

Expression templates are a powerful technique for representing mathematical expressions as types. This enables compile-time optimizations: the compiler can analyze the expressions and generate efficient code.

Combining lazy evaluation and expression templates

One of the design decisions of Tenseur is to combine lazy evaluation with expression templates. Together, these two features let the library execute complex mathematical expressions with minimal runtime overhead.

Here’s a step-by-step description of how expressions and template metaprogramming are combined in C++ to create lazy evaluation of expressions:

  1. Define an expression class; this is the base class for all expressions.
/// Represent an expression
template <typename Derived> class expr {
 protected:
   expr() = default;
   expr(const expr &) = default;
   expr(expr &&) = default;
};
  2. Define expressions for addition, subtraction, multiplication, and division. An example for the addition expression is shown below:
template <typename left_expr, typename right_expr>
   requires ::ten::is_expr<std::remove_cvref_t<left_expr>> &&
            ::ten::is_expr<std::remove_cvref_t<right_expr>>
auto operator+(left_expr &&left, right_expr &&right) {
   using L = std::remove_cvref_t<left_expr>;
   using R = std::remove_cvref_t<right_expr>;
   return ::ten::binary_expr<
       typename L::node_type, typename R::node_type,
       ::ten::functional::binary_func<::ten::binary_operation::add>::func>(
       left.node(), right.node());
}

template <typename T, typename E>
   requires ::ten::is_expr<std::remove_cvref_t<E>> &&
            std::is_floating_point_v<T>
auto operator+(T &&scalar, E &&expr) {
   using R = std::remove_cvref_t<E>;
   return ::ten::scalar<T>(scalar) + std::forward<R>(expr);
}

template <typename E, typename T>
   requires ::ten::is_expr<std::remove_cvref_t<E>> &&
            std::is_floating_point_v<T>
auto operator+(E &&expr, T &&scalar) {
   using R = std::remove_cvref_t<E>;
   return std::forward<R>(expr) + ::ten::scalar<T>(scalar);
}
  3. Define a scalar node and a scalar type
/// \class scalar_operations
template <class T> struct scalar_operations {
   [[nodiscard]] inline static constexpr size_type rank() { return 0; }
};

/// \class scalar_node
/// Node of scalar type
template <typename __t>
class scalar_node : public scalar_operations<scalar_node<__t>> {
...
};

/// \class scalar
/// Hold a single value of type __t.
template <typename __t>
class scalar : public expr<scalar<__t>>, public scalar_operations<scalar<__t>> {
...
};
  4. Define a tensor node and a tensor type
/// \class tensor_operations
/// Tensor operations
template <class __t, class shape, storage_order order, class storage,
          class allocator>
struct tensor_operations {
...
};

/// \class tensor_node
/// Tensor node
template <class __t, class __shape, storage_order __order, class __storage,
          class __allocator>
class tensor_node
    : public tensor_operations<__t, __shape, __order, __storage, __allocator> {
...
};

/// Tensor represented by a multidimensional array.
template <class __t, class __shape, storage_order __order = default_order,
          class __storage = default_storage<__t, __shape>,
          class __allocator = typename details::allocator_type<__storage>::type>
class ranked_tensor final
    : public ::ten::expr<
          ranked_tensor<__t, __shape, __order, __storage, __allocator>>,
      public ::ten::tensor_operations<__t, __shape, __order, __storage,
                                      __allocator>,
      public ::ten::tensor_base {
...
};
  5. Define a unary node and a unary expression
// \class unary_node
// Apply a function to a ten::Tensor or a ten::Scalar
template <class __input, class __output, template <typename...> class __func,
          typename... __args>
class unary_node {
...
};

/// \class unary_expr
/// Unary expression.
template <typename __expr, template <class...> class __func, typename... __args>
class unary_expr : ::ten::expr<unary_expr<__expr, __func, __args...>> {
...
};

The unary expression holds a shared pointer to a unary node.

  6. Define a binary node and a binary expression
// \class binary_node
// Node of a binary expression
// Left and Right can be scalar_node, tensor_node or binary_node
template <class __left, class __right, class __output,
          template <typename...> class __func, typename... __args>
class binary_node {
...
};

/// \class binary_expr
/// Binary expression
// left and right can be scalar_node, tensor_node or binary_expr
// left is std::shared_ptr<__left> and right is std::shared_ptr<__right>
template <typename __left, typename __right, template <typename...> class __func,
          typename... __args>
class binary_expr : ::ten::expr<binary_expr<__left, __right, __func, __args...>> {
...
};

The binary expression holds a shared pointer to a binary node.

  7. Define useful aliases; here, a vector and a matrix. Other aliases such as static tensor, static vector, static matrix, and diagonal matrix are also defined.
// vector<__t>
template <class __t, storage_order __order = default_order,
          class __storage = default_storage<__t, shape<::ten::dynamic>>,
          class __allocator = typename details::allocator_type<__storage>::type>
using vector =
    ranked_tensor<__t, shape<::ten::dynamic>, __order, __storage, __allocator>;

// matrix<__t> or matrix<__t, __shape>
template <class __t, class __shape = dynamic_shape<2>,
          storage_order __order = default_order,
          class __storage = default_storage<__t, __shape>,
          class __allocator = typename details::allocator_type<__storage>::type>
   requires(__shape::rank() == 2)
using matrix = ranked_tensor<__t, __shape, __order, __storage, __allocator>;

SIMD

The library supports SIMD operations for addition, multiplication, subtraction, and division. By using SIMD instructions, it can process multiple data points in parallel, significantly speeding up computations. The SIMD backend implementation is available at Simd backend.

BLAS

The BLAS backend allows users to take advantage of highly optimized routines for vector and matrix operations that are tuned for various architectures. This is particularly beneficial for large-scale matrix operations, where reference implementations may perform very poorly. The BLAS backend implementations are available at Blas API.

Allocator

An allocator manages memory allocation and deallocation. Allocators are crucial for efficient memory management: they help ensure that memory is used effectively and that resources are freed when they are no longer necessary. Tenseur uses std::allocator under the hood for aligned memory allocation. It’s implemented in the following link: Dense storage.

Examples

// Create a tensor, a matrix, and a vector with dynamic shapes
tensor<float, 3> a({2, 3, 4});
matrix<float> b({2, 3});
vector<float> c({5});

// Factory functions
auto x = zeros<float, 3>({2, 3, 4});
auto x = fill<float, 2>({2, 3}, 0.1);
auto x = range<matrix<float>>({2, 3});

// Reshape returns an expression; it can be evaluated explicitly
auto y = reshape<2>(x, {3, 2});
auto z = y.eval();

// Flatten a tensor into a vector
auto x = range<tensor<float, 3>>({2, 3, 4});
vector<float> b = flatten(a);

// Random tensors from a seed
size_t seed = 1234;
auto x = rand<smatrix<float, 2, 3>>(seed);
auto y = rand<matrix<float>>({2, 3}, seed);
auto z = rand<float, 3>({2, 3, 4}, seed);

// Storage order: column-major vs row-major strides
tensor<float, 3, storage_order::col_major> x({2, 3, 4});
cout << "shape = " << x.shape() << endl;
cout << "strides = " << x.strides() << endl;
tensor<float, 3, storage_order::row_major> y({2, 3, 4});
cout << "shape = " << y.shape() << endl;
cout << "strides = " << y.strides() << endl;

// Matrix multiplication (lazy, evaluated on demand)
auto a = range<matrix<float>>({2, 3});
auto b = range<matrix<float>>({3, 4});
auto c = (a * b).eval();

// Elementwise expressions
auto a = range<vector<float>>(5);
auto b = 2. * a;
auto c = sqrt(b);
auto d = (c + a).eval();

Performance

The library is optimized for vector and matrix operations, providing excellent performance in many cases. Here are some benchmarks against Eigen:

Gemm

Sum

Sub

Div

Mult

Future plans

The library is currently under active development. With its combination of lazy evaluation, expression templates, and optimized BLAS and SIMD backends, it stands to make a significant impact on the landscape of numerical linear algebra in C++. Some of the future plans are:
