We have extended the implementation of the library and integrated it into DIET as a back-end of its data manager, Dagda. We then present how the library is used in GridTLSE, an international sparse linear algebra expert system that manages entire expertises for the user, including data transfers, task executions, and graphical charts, to help analyse the overall execution. GridTLSE relies on DIET to distribute computations and can thus benefit from the persistency functionality to provide scientists with faster results when their expertises require the same input matrices. In addition, since any two middleware systems can interact seamlessly as long as they use an implementation of the GridRPC Data Management API, new architectures from different domains can easily be integrated into the expert system, benefiting the linear algebra community.

This middleware is able to find an appropriate server according to the information given in the client's request: the problem to be solved, the size of the data involved, the performance of the t[r]

Fig. 1. Timings and α values for the qr_mumps frontal matrix factorization kernel (panel (b): values of α).
…it becomes interesting to rely on an optimized dynamic runtime system to allocate and schedule tasks on computing resources. These runtime systems (such as StarPU [3], KAAPI [9], or PaRSEC [4]) are able to process a task on a prescribed subset of the computing cores that may evolve over time. This motivates the use of the malleable task model, where the share of processors allocated to a task varies with time. This approach has recently been used and evaluated [13] in the context of the qr_mumps solver using the StarPU runtime system.

∗ CNRS/Inria/University of Grenoble Alpes, France; firstname.lastname@imag.fr
† Inria/University of Bordeaux, France; firstname.lastname@labri.fr
‡ CNRS/University Paul Sabatier, Toulouse, France; firstname.lastname@irit.fr
Abstract—The ever-growing complexity and scale of parallel architectures forces a rewrite of classical monolithic HPC scientific applications and libraries, as their portability and performance optimization only come at a prohibitive cost. There is thus a recent and general trend towards using instead a modular approach, where numerical algorithms are written at a high level, independently of the hardware architecture, as Directed Acyclic Graphs (DAG) of tasks. A task-based runtime system then dynamically schedules the resulting DAG on the different computing resources, automatically taking care of data movement and taking into account the possible speed heterogeneity and variability. Evaluating the performance of such complex and dynamic systems is extremely challenging, especially for irregular codes. In this article, we explain how we crafted a faithful simulation, both in terms of performance and memory usage, of the behavior of qr_mumps, a fully-featured sparse linear algebra library, on multi-core architectures. In our approach, the target high-end machines are calibrated only once to derive sound performance models. These models can then be used at will to quickly predict and study, in a reproducible way, the performance of such irregular and resource-demanding applications using solely a commodity laptop.
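The calibrate-once / simulate-many idea of the abstract can be sketched in a few lines: per-kernel timing models are fitted from one native run, then reused to predict task durations in simulation. The function names and the linear model t ≈ a·flops + b are illustrative assumptions, not qr_mumps code.

```python
# Sketch (assumed model, not the paper's exact one): fit a linear timing
# model per kernel from calibration samples, then reuse it for prediction.
import numpy as np

def calibrate(flops, times):
    """Fit a linear timing model t ~ a*flops + b from calibration samples."""
    a, b = np.polyfit(flops, times, 1)
    return a, b

def predict(model, flops):
    a, b = model
    return a * flops + b

# Synthetic calibration data for a GEMM-like kernel: noisy linear law.
rng = np.random.default_rng(0)
flops = np.linspace(1e6, 1e9, 50)
times = 1e-10 * flops + 5e-5 + rng.normal(0, 1e-6, 50)

model = calibrate(flops, times)
print(predict(model, 5e8))  # simulated duration of an unseen task
```

Once such models exist for every kernel, a DAG simulator can replay the schedule on a laptop without ever running the kernels natively again.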

Figure 1: Example of the decomposition of a task of the DAG of a Cholesky decomposition into smaller kernels.
As computing platforms evolve quickly and become more complex (in particular because of the increasing use of accelerators such as GPUs or Xeon Phis), it becomes interesting to rely on an optimized dynamic runtime system to allocate and schedule the kernels on the computing resources. These runtime systems (such as StarPU [5], KAAPI [6], or PaRSEC [7]) are able to process the kernels of a given task on a prescribed subset of the computing cores, and this subset may evolve with time. This motivates the use of a malleable task model, where the share of processors allocated to a task varies with time. This approach has recently been used and evaluated [19] in the context of the qr_mumps solver using the StarPU runtime system.
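The decomposition of Figure 1 can be made concrete by enumerating the kernels of a tiled Cholesky factorization. The kernel names (POTRF, TRSM, SYRK, GEMM) follow common BLAS/LAPACK convention; this is a sketch of the task DAG's shape, not the StarPU, KAAPI, or PaRSEC API.

```python
# Sketch: kernels of a right-looking tiled Cholesky on an nt x nt tile grid,
# in submission order. A runtime system would schedule these as a DAG.
def cholesky_tasks(nt):
    """Enumerate the kernels of a tiled Cholesky factorization."""
    tasks = []
    for k in range(nt):
        tasks.append(("POTRF", k))               # factor diagonal tile (k,k)
        for i in range(k + 1, nt):
            tasks.append(("TRSM", i, k))         # solve panel tile (i,k)
        for i in range(k + 1, nt):
            tasks.append(("SYRK", i, k))         # update diagonal tile (i,i)
            for j in range(k + 1, i):
                tasks.append(("GEMM", i, j, k))  # update off-diagonal tile (i,j)
    return tasks

print(len(cholesky_tasks(4)))  # kernel count for a 4x4 tile grid
```

The GEMM updates dominate as the tile count grows, which is what gives the runtime enough independent work to fill the cores.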

1 Introduction
Parallel sparse linear algebra solvers are often the innermost numerical kernels in scientific and engineering applications; consequently, they are one of the most time-consuming parts. In order to cope with the hierarchical hardware design of modern large-scale supercomputers, the HPC solver community has proposed new sparse methods. One promising approach towards the high-performance, scalable solution of large sparse linear systems in parallel scientific computing consists of combining direct and iterative methods. To achieve high scalability, algebraic domain decomposition methods are commonly employed to split a large linear system into smaller linear systems that can be efficiently and concurrently handled by a sparse direct solver, while the solution along the interfaces is computed iteratively [27, 25, 13, 11]. Such a hybrid approach exploits the advantages of both direct and iterative methods. The iterative component allows us to use a small amount of memory and provides a natural way for parallelization. The direct part provides its favorable numerical properties; furthermore, this combination provides opportunities to exploit several levels of parallelism.
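The hybrid principle can be sketched numerically on a tiny dense system: interior unknowns are eliminated "directly", and the interface system (the Schur complement) is solved iteratively. This is a minimal illustration of the idea, not any of the cited solvers; sizes and the hand-written CG are stand-ins.

```python
# Sketch: direct elimination of interior unknowns + iterative (CG) solve of
# the Schur complement on the interface, on a small SPD test system.
import numpy as np

def conjugate_gradient(A, b, tol=1e-12, maxiter=500):
    """Plain conjugate gradient for an SPD matrix (the iterative component)."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(maxiter):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(1)
n_i, n_g = 8, 3                                    # interior / interface unknowns
M = rng.normal(size=(n_i + n_g, n_i + n_g))
A = M @ M.T + (n_i + n_g) * np.eye(n_i + n_g)      # SPD test matrix
b = rng.normal(size=n_i + n_g)
A_ii, A_ig = A[:n_i, :n_i], A[:n_i, n_i:]
A_gi, A_gg = A[n_i:, :n_i], A[n_i:, n_i:]

# "Direct" part (a sparse factorization in a real solver):
S = A_gg - A_gi @ np.linalg.solve(A_ii, A_ig)      # Schur complement on the interface
rhs = b[n_i:] - A_gi @ np.linalg.solve(A_ii, b[:n_i])
x_g = conjugate_gradient(S, rhs)                   # iterative interface solve
x_i = np.linalg.solve(A_ii, b[:n_i] - A_ig @ x_g)  # direct back-substitution
x = np.concatenate([x_i, x_g])
print(np.linalg.norm(A @ x - b))                   # residual of the full system
```

In a real hybrid solver each subdomain owns its own interior block, so the direct solves run concurrently and only the (much smaller) interface system is iterated on.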

(and is also a constant factor larger than n, to ensure that the noise-free version of the corresponding linear algebra problem has a unique solution, and that the covariance matrix of the rows a of A is well-controlled). Our result applies to a very large class of distributions for A and e, including bounded distributions and discrete Gaussians. It relies on sub-Gaussian concentration inequalities. Interestingly, ILWE can be interpreted as a bounded distance decoding problem in a certain lattice in Z^n (which is very far from random), and the least squares approach coincides with Babai's rounding algorithm for the approximate closest vector problem (CVP) when seen through that lens. As a side contribution, we also show that even with a much stronger CVP algorithm (including an exact CVP oracle), one cannot improve the number of samples necessary to recover s by more than a constant factor. On another side note, we also consider alternative algorithms to least squares when very few samples are available (so that the underlying linear algebra system is not even full-rank), but the secret vector is known to be sparse. In this case, linear programming techniques from [CT07] can solve the problem efficiently.
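The least-squares attack described above is easy to demonstrate on a toy instance: recover an integer secret s from noisy samples b = As + e by solving least squares and rounding coordinate-wise (Babai's rounding, in the lattice view). Dimensions and distributions are illustrative only, not the paper's parameter regime.

```python
# Toy ILWE instance: least squares + coordinate-wise rounding recovers the
# small integer secret once enough samples are available.
import numpy as np

rng = np.random.default_rng(42)
n, m = 16, 400                          # secret dimension, number of samples
s = rng.integers(-3, 4, size=n)         # small integer secret
A = rng.normal(size=(m, n))
e = rng.normal(scale=0.5, size=m)       # bounded-ish noise
b = A @ s + e

s_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
s_rec = np.rint(s_hat).astype(int)      # Babai-style rounding step
print((s_rec == s).all())
```

With m samples the per-coordinate estimation error shrinks like 1/√m, so once it drops below 1/2 the rounding step succeeds with overwhelming probability, matching the sample-complexity discussion above.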


Some solutions have been proposed in recent years, but they tend to only partially solve the abstraction/efficiency trade-off problem. The method followed by the Formal Linear Algebra Methods Environment (FLAME) with the Libflame library [46] is a good example. It offers a framework to develop dense linear solvers through the use of algorithmic skeletons [15] and an API which is more user-friendly than LAPACK, while giving satisfactory performance results. Another approach is the one followed in recent years by C++ libraries built around expression templates [48] or other generative programming [20] principles for high-performance computing. Examples of such libraries are Armadillo [16] and MTL [27]. Armadillo provides good performance with BLAS and LAPACK bindings and an API close to Matlab [36] for simplicity. However, it does not provide a generic solver like the Matlab routine linsolve that can analyze the matrix type and choose the correct routine to call from the LAPACK library. It also does not support GPU computations, which are becoming mandatory for medium to large dense linear algebra problems. In a similar way, while MTL can match the performance of vendor-tuned codes, it does not offer a linsolve-like implementation or GPU support. Other examples of libraries with similar content include Eigen [30], Flens [34], Ublas [49] and Blaze [32].

MSC 2010 subject classifications: Primary 62F15; secondary 62J05.
Keywords: Bayesian regression, functional data, support estimate, parsimony.
1 Introduction
Consider that one wants to explain the final outcome y of a process along time (for instance the amount of some agricultural production) thanks to what happened during the whole history (for instance, the rainfall history, or the temperature history). Among the statistical learning methods, functional linear models (Ramsay and Silverman, 2005) aim at predicting a scalar y based on covariates x_1(t), x_2(t), . . . , x_q(t) lying in a functional
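The functional linear model y ≈ ∫ x(t) β(t) dt + ε can be sketched numerically: curves are observed on a time grid, the integral becomes a quadrature sum, and the coefficient function β(t) is estimated by ridge regression. Grid size, sample size, and the ridge penalty are illustrative choices, not the paper's method.

```python
# Sketch: discretized functional linear regression with a ridge estimator.
import numpy as np

rng = np.random.default_rng(7)
T, n = 100, 2000                                   # grid points, observed curves
t = np.linspace(0, 1, T)
beta = np.sin(2 * np.pi * t)                       # true coefficient function β(t)
X = rng.normal(size=(n, T))                        # covariate curves x_i(t) on the grid
y = X @ beta / T + rng.normal(scale=0.01, size=n)  # quadrature of ∫ x_i(t) β(t) dt

G = X / T                                          # design matrix with quadrature weights
lam = 1e-3                                         # ridge penalty (illustrative)
beta_hat = np.linalg.solve(G.T @ G + lam * np.eye(T), G.T @ y)
print(np.corrcoef(beta, beta_hat)[0, 1])           # β(t) is recovered up to noise
```

Real functional-data methods would additionally expand β on a smooth basis (splines, Fourier) rather than estimating one coefficient per grid point.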

Our main contribution is to demonstrate one case in which this notion of effective dimension is helpful for approximate CV: that of f, regularized generalized lin[r]

Abstract
This paper addresses static resource allocation problems for irregular distributed parallel applications. More precisely, we focus on two classical tiled linear algebra kernels: the Matrix Multiplication and LU decomposition algorithms on large linear systems. On parallel distributed platforms, data exchanges can dramatically degrade the performance of linear algebra kernels, and in this context compression techniques such as Block Low Rank (BLR) are good candidates both for limiting data storage on each node and for limiting data exchanges between nodes. On the other hand, the use of the BLR representation makes the static allocation of tiles to nodes more complex. Indeed, the workload associated with each tile depends on its compression factor, which induces a heterogeneous load-balancing problem. In turn, solving this load-balancing problem optimally can lead to complex allocation schemes, where the tiles allocated to a given node are scattered over the whole matrix. This in turn causes communication problems, since matrix multiplication and LU decomposition rely heavily on broadcast operations along rows and columns of processors, so that the communication volume is minimized when the number of different nodes in each row and column is minimized. In the fully homogeneous case, the 2D block-cyclic allocation solves both the load-balancing and communication-minimization issues simultaneously, but it can lead to bad load balancing in the heterogeneous case. Our goal in this paper is to propose data allocation schemes dedicated to the BLR format and to prove that it is possible to obtain good makespan performance while simultaneously balancing the load and minimizing the maximal number of different resources in any row or column.
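The 2D block-cyclic baseline mentioned above is simple to state in code: tile (i, j) goes to node (i mod p, j mod q) on a p × q grid, which balances a homogeneous load and caps the number of distinct nodes touched by any tile row or column. This is the textbook distribution, sketched for illustration.

```python
# Sketch: 2D block-cyclic mapping of tiles to a p x q node grid, and a check
# that each tile row/column touches only q (resp. p) distinct nodes.
def block_cyclic(i, j, p, q):
    """Linear node id of tile (i, j) under 2D block-cyclic distribution."""
    return (i % p) * q + (j % q)

p, q = 2, 3
row_nodes = {block_cyclic(4, j, p, q) for j in range(8)}  # nodes in tile row 4
col_nodes = {block_cyclic(i, 5, p, q) for i in range(8)}  # nodes in tile column 5
print(len(row_nodes), len(col_nodes))
```

Bounding these per-row and per-column node counts is exactly what keeps the broadcasts of matrix multiplication and LU cheap; the paper's contribution is to preserve such bounds while rebalancing a BLR-induced heterogeneous load.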

We introduce in this paper a calculus HKA for Kleene algebra whose non-wellfounded proofs we prove sound and complete (Sects. 5 and 6). This calculus is cut-free and admits the subformula property. We actually prove that its regular fragment (those proofs with potentially cyclic but finite dependency graphs) is complete. Our approach is related to other works on cyclic systems for logics, e.g. [11,13], but is more fine-grained proof-theoretically. We give a diagrammatic summary of our contributions in Fig. 1, where we use the symbols ⊢ω and ⊢∞ to distinguish between regular proofs and arbitrary, potentially infinite proofs, respectively. Starting from Palka's system, a natural idea when looking for a regular system consists in replacing her infinitary rules for Kleene star by finitary ones, and allowing non-wellfounded proofs. Doing so, we obtain the calculus LKA described in Sect. 3: proofs that are well-founded but of infinite width in Palka's system become finitely branching but infinitely deep in LKA. These non-wellfounded proofs of LKA admit an elegant proof theory, but we show that their regular fragment is not complete: there are valid inequalities which require arbitrarily large sequents to appear in their proofs. We solve this problem by allowing slightly more structure in the succedents of sequents, moving to hypersequents to design the calculus HKA (Sect. 4). After showing completeness, inspection of the regular proofs of HKA yields an alternative proof that the equational theory of rational languages is in PSpace, without relying on automata-theoretic arguments (Sect. 7). We conclude this paper with some further comments and directions for future work (Sect. 8).

straining e to be bounded with respect to a certain norm. In [7], Candès and Randall used this approach to correct errors occurring when decoding messages transmitted over communication channels. Their idea is to estimate, under sparsity of (25), the error e together with the vector θ, which we suppose here to represent one submodel of the switched system (1). To do so, one needs however to know a priori an upper bound η on the norm of the noise. More precisely, a somewhat tight bound η satisfying ‖e‖_ℓ ≤ η, with ℓ a certain norm in {2, ∞, . . .}, is required. If θ is a PV for the switched linear system, then θ may be computed from the convex program

REORDERING STRATEGY FOR BLOCKING OPTIMIZATION IN SPARSE LINEAR SOLVERS∗

GREGOIRE PICHON†, MATHIEU FAVERGE‡, PIERRE RAMET†, AND JEAN ROMAN†

Abstract. Solving sparse linear systems is a problem that arises in many scientific applications, and sparse direct solvers are a time-consuming and key kernel for those applications and for more advanced solvers such as hybrid direct-iterative solvers. For this reason, optimizing their performance on modern architectures is critical. The preprocessing steps of sparse direct solvers, ordering and block-symbolic factorization, are two major steps that lead to a reduced amount of computation and memory and to a better task granularity to reach a good level of performance when using BLAS kernels. With the advent of GPUs, the granularity of the block computation has become more important than ever. In this paper, we present a reordering strategy that increases this block granularity. This strategy relies on the block-symbolic factorization to refine the ordering produced by tools such as Metis or Scotch, but it does not impact the number of operations required to solve the problem. We integrate this algorithm into the PaStiX solver and show an important reduction of the number of off-diagonal blocks on a large spectrum of matrices. This improvement leads to an increase in efficiency of up to 20% on GPUs.


Finally, item c) was intended to probe whether students' previously constructed structures about matrices and vectors enabled them to recognize the product of a matrix A and a vector s in the model for the transformation of sources and observations they were developing. We intended to observe whether they were able to relate these constructions to BSS contextual elements in order to explain, beyond mathematics, the need for A to have n columns if vector s is in R^n. The results obtained showed that, effectively, most students had interiorized the matrix form of a system of equations into a process and could coordinate it with a process of coefficient matrix once the n × m linear system was identified in item b). Students related the size of the matrix A to the BSS context by observing that the number of columns of A must equal the number of sources, and that the product results in m observations, so they concluded A has to have m rows by making reference to configurations in each case.
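The dimension bookkeeping the students were asked to justify can be checked numerically: in the noiseless BSS model x = As, a source vector s ∈ R^n forces A to have n columns, and producing m observations forces A to have m rows. Sizes below are illustrative.

```python
# Sketch: shapes in the BSS mixing model x = A s with n sources, m observations.
import numpy as np

n, m = 3, 5                                        # sources, observations
s = np.ones(n)                                     # source vector in R^n
A = np.arange(m * n, dtype=float).reshape(m, n)    # mixing matrix of size m x n
x = A @ s                                          # observation vector in R^m
print(A.shape, x.shape)
```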

2007), it is very rare that sparse sources both have non-zero values at the same time. Therefore, when plotting the scatter plot of S_1 as a function of S_2 (cf. Fig. 1(a)), most of the source coefficients lie on the axes (in this work we even assume that all coefficients lie on the axes; this hypothesis is discussed in Sec. 4). Once mixed with the non-linear f, the source coefficients lying on the axes are transformed into n non-linear one-dimensional (1D) manifolds (Ehsandoust et al., 2016; Puigt et al., 2012), each manifold corresponding to one source (see Fig. 1(b)). To separate the sources, the idea is then to back-project each manifold onto one of the axes. We propose to perform this back-projection by approximating the 1D manifolds by a piecewise-linear function that we will invert. As evoked above, we then get separated sources which are only distorted through non-linear functions that do not remix them, called h in the following.

Fig. 4.1. Experimental results for small matrices
infrastructure for conducting the experiments. Taking into account the testbed infrastructure's parameters (mostly RAM), three different sizes of matrices are studied: small (4096×4096), medium (8192×8192) and large (12288×12288). The Simple Linux Utility for Resource Management (Slurm) [21], an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for both large-scale and small Linux clusters, is used for job scheduling. To ensure that the results are statistically sound and that the execution time and power consumption values for each matrix are reliable, each experiment is run ten times and the arithmetic mean is taken.
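The measurement protocol described above (ten runs, arithmetic mean) can be sketched as a small timing harness. The matrix-multiply workload stands in for the benchmarked kernels; names are illustrative.

```python
# Sketch: run a workload ten times and report the arithmetic mean wall-clock time.
import time
import numpy as np

def timed_mean(fn, runs=10):
    """Run fn `runs` times and return the arithmetic mean of the timings."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return sum(samples) / len(samples)

a = np.random.default_rng(0).normal(size=(256, 256))
mean_t = timed_mean(lambda: a @ a)
print(f"mean over 10 runs: {mean_t:.6f} s")
```

Reporting a mean over repeated runs smooths out scheduler and cache noise; for power measurements the same repetition logic would wrap the meter readings.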

3. For all a, b, c ∈ g, (a • b) • c − a • (b • c) = (a • c) • b − a • (c • b).
4. For all a, b ∈ T(V), ∆(a • b) = a_(1) ⊗ (a_(2) • b) + (a_(1) • b_(1)) ⊗ a_(2)b_(2), with Sweedler's notation.
Our aim in this text is to give a generalization of the construction of g and its relatives H and G, and to study some general properties of this construction. Let us take any linear endomorphism f of a vector space V. We inductively define a pre-Lie product • on the shuffle Hopf algebra (T(V), ⧢, ∆), making it a Com-Pre-Lie Hopf algebra denoted by T(V, f) (Definition 1 and Theorem 2). For example, if x_1, x_2, x_3 ∈ V and w ∈ T(V):

that the number of modular reductions is smaller in the case of tile recursive LU factorization, which is one motivation for the use of the tile recursive variant over a finite field.
The impact of grain size. The granularity is the block dimension (or the dimension of the smallest blocks in recursive splittings). Matrices with dimensions below this threshold are treated by a base-case variant (often referred to as the panel factorization [8] in the case of the PLUQ decomposition). It is an important parameter for optimizations: a finer grain allows more flexibility in the scheduling when running on numerous cores, but it also challenges the efficiency of the scheduler and can increase the memory bus traffic. In numerical linear algebra, where cubic-time algorithms are used, the arithmetic cost is independent of the cutting into blocks. Hence the granularity has very little impact on the efficiency of a block algorithm run sequentially. On the contrary, we saw in Table 1 that over a finite field, a finer granularity can lead to a larger number of costly modular reductions. The use of sub-cubic variants for the sequential matrix multiplications is another reason why a coarser granularity leads to a higher sequential efficiency. On the other hand, the granularity needs to be fine enough to generate enough independent tasks to be executed in parallel. Therefore, with a fixed number of resources, we will rather set the number of tasks to be created (usually to the number of available cores, or slightly more), instead of setting a fixed small grain size as is usually done in numerical linear algebra. Hence an increase in the dimensions will result in a coarser granularity, making each sequential task perform more efficiently.
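The tuning rule described above (fix the number of tasks, not the grain size) can be sketched as a formula: choose the block dimension so that an n × n matrix yields roughly a target number of parallel tasks. The formula and the `tasks_per_core` knob are an illustrative reading of the text, not the authors' exact implementation.

```python
# Sketch (assumed rule): derive the block dimension from the matrix size and
# the core count, so that larger matrices get coarser blocks automatically.
import math

def block_dim(n, cores, tasks_per_core=2):
    """Block dimension so an n x n matrix yields ~tasks_per_core*cores tile columns."""
    target_tasks = cores * tasks_per_core   # desired degree of parallelism
    return max(1, math.ceil(n / target_tasks))

# Larger matrices get coarser blocks for the same core count:
print(block_dim(4096, 16), block_dim(16384, 16))
```

This captures the trade-off of the paragraph: enough tasks to keep the cores busy, while each sequential task stays as coarse (hence as efficient, with fewer modular reductions) as possible.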
