ACHAR Project

Active-inference of Complex models using Homing sequences And Registers

Context

Origin and motivation of the project

The visit in September and October 2022 of Adenilso Simao (U. São Paulo) and Neil Walkinshaw (U. Sheffield) as UGA guest researchers, accompanied by Rafael S. Braz (master project at USP, followed for 2 years by A. Simao, R. Groz and C. Oriat) and Michael Foster (post-doc in Sheffield) made it possible to work very effectively on a method allowing us to infer extended automaton type models (EFSM) of learned systems in black box. This was also an opportunity to connect work by Nicolas Hili (LIG-Vasco) with Neil's CITCoM project Walkinshaw in Sheffield.

Inference of EFSM is a complex problem which we have only sketched out a solution. We must now develop and consolidate this method, carry out an implementation, experiment it on various case studies and evaluate its performance.

The ACHAR project should allow us to extend this initial collaboration, with the recruitment of two interns and new visits.

Scientific context

The use of models in software engineering is the basis of application tools that use these models for system analyses, validation or other tasks. Unfortunately, a lot of software systems are developed without their construction or specification being formalized, which makes the use of secure techniques more difficult. The Vasco team had developed, as part of previous projects, model inference techniques (SIMPA tool initially developed as part of the European SPaCIoS project, in particular in collaboration with SAP Research). Like other inference techniques models, it was assumed that we could reset the system to learn. However, it turns out that resets are often time consuming or not possible in the context of systems accessed via a network or other interfaces. This led us to develop inference algorithms of automata without the possibility of resetting them, which has already been done with A. Simao (see references).

However, automata inference algorithms remain limited to simple automata models, with finite alphabets, and without parameters or variables. The University of Dortmund in collaboration with Uppsala has extended this in the Learnlib library to automata that can memorize a parameter value (register automata), as we we did in the European SPaCIoS project, but this remains limited to domains with equality and requires recourse to very numerous resets.

By combining our work on active inference without reset of simple automata to those at the University of Sheffield who studied passive inference (from software traces) of extended finite state machines (EFSM), we were able to make an important step to obtain parameterized models making it possible to represent both control and software data processing.

A second promising possibility for synergy consists of associating $ the work of Sheffield on causal inference from trace analysis (CITCoM project) with the construction of DSL semantics from interaction traces provided by users of these dedicated languages (ALBUS project led by Nicolas Hili at LIG).

Description

A. Part 1: Inference of automata + data models (EFSM)

  1. Initially (fall 2022), we will formalize the EFSM inference algorithm. Then we will work on implementation by relying on the LIG tool (SIMPA) to the inference of control and that of Michael Foster in Sheffield for the inference of calculation functions on the parameters. There are three types of functions to infer:
    • Transition guards (predicates relating to the parameters of the inputs and variables stored in the automaton)
    • Output functions, which calculate the values of the output parameters
    • Internal variable update functions

    In all this, the internal variables of the system being hidden, we must, based solely on observation of input parameters and output, infer the number of internal variables needed to explain observations and the way in which they are used and updated by the black box. Michael Foster achieves this through a genetic approach.

  2. Based on the implementation by a combination of tools, it will be possible to evaluate the performance of the method, and to explore the impact of associated heuristics. It should be relatively easy to write and publish work on the inference of guards in the absence of input variables (so only on the input parameters), but our ambition is to prepare a supported publication addressing the general case with an arbitrary number of internal variables, which would constitute a significant progress compared to the state of the art:
    • Ability to infer "Turing-complete" EFSM models
    • No restriction on domains and complexity of inferred functions (modulo the operators which would feed the genetic algorithms)
    • All that for complete black box systems without reset.

    Currently, none of the existing algorithms and tools can process only one of these three challenges, let alone their combination. The main existing methods (in Learnlib, SIMPA, etc.) make it possible to infer only finite state automata of a few hundred states, which strongly hinders its applicability for inferring software models with fairly large value domains. By inferring parameterized systems, we hope to be able to infer software models without strong constraints on domains of values, even if we will initially limit ourselves to simple types (numeric and character strings).

B. Part 2: DSL and causal tests.

  1. Neil Walkinshaw developed a framework for causal testing (https://github.com/CITCOM-project/CausalTestingFramework), particularly applied to analyzes in epidemiology. The tests are specified in a JSON file, but Sheffield's wish would be to define both the tests and the calculus model associated with a DSL graphic, for which we could develop the notation associated with a semantics deduced from examples provided by the designers. The works led at LIG by Nicolas Hili are particularly suited to providing the means to do it.

  2. In the opposite direction, the tool for passive inference of functions from of traces developed in Sheffield by Michael Foster would bring more genericity to the current method used at LIG for the inference of the semantics of DSL from traces which is restricted to Karnaugh tables, probably at the cost of greater complexity. This would strengthen the use of inference techniques of semantics and the links between the two parts of the project.

References