Researchers from ETH Zurich, the Bulgarian AI research institute INSAIT -- created in partnership with ETH and EPFL -- and the ETH spin-off LatticeFlow AI have provided the first comprehensive technical interpretation of the EU AI Act for General Purpose AI (GPAI) models. This makes them the first to translate the legal requirements that the EU places on future AI models into concrete, measurable and verifiable technical requirements.
Such a translation is highly relevant to the further implementation of the EU AI Act: the researchers present a practical approach that lets model developers see how closely they align with future EU legal requirements. A translation from high-level regulatory requirements down to actually runnable benchmarks has not existed until now, and it can therefore serve as an important reference point both for model training and for the EU AI Act Code of Practice currently being drafted.
The researchers tested their approach on 12 popular generative AI models such as ChatGPT, Llama, Claude or Mistral -- after all, these large language models (LLMs) have contributed enormously to the growing popularity and distribution of artificial intelligence (AI) in everyday life, as they are very capable and intuitive to use.
With the growing adoption of these -- and other -- AI models, the ethical and legal requirements for the responsible use of AI are also increasing: sensitive questions arise regarding data protection, privacy and the transparency of AI models. Models should not be "black boxes" but should deliver results that are as explainable and traceable as possible. Furthermore, they should function fairly and not discriminate against anyone.

Implementation of the AI Act must be technically clear

Against this backdrop, the EU AI Act, which the EU adopted in March 2024, is the world's first comprehensive AI legislative package, seeking to maximize public trust in these technologies and to minimize their undesirable risks and side effects.
"The EU AI Act is an important step towards developing responsible and trustworthy AI," says ETH computer science professor Martin Vechev, head of the Laboratory for Safe, Reliable and Intelligent Systems and founder of INSAIT, "but so far we lack a clear and precise technical interpretation of the high-level legal requirements from the EU AI Act.
"This makes it difficult both to develop legally compliant AI models and to assess the extent to which these models actually comply with the legislation."
The EU AI Act sets out a clear legal framework to contain the risks of so-called General Purpose Artificial Intelligence (GPAI) -- AI models capable of executing a wide range of tasks. However, the act does not specify how its broad legal requirements are to be interpreted technically. The corresponding technical standards are still under development and will continue to be refined until the regulations for high-risk AI models come into force in August 2026.
"However, the success of the AI Act's implementation will largely depend on how well concrete, precise technical requirements and compliance-centered benchmarks for AI models can be developed," says Petar Tsankov, CEO and, together with Vechev, co-founder of the ETH spin-off LatticeFlow AI, which focuses on implementing trustworthy AI in practice.
"If there is no standard interpretation of exactly what key terms such as safety, explainability or traceability mean in (GP)AI models, then it remains unclear for model developers whether their AI models comply with the AI Act," adds Robin Staab, computer scientist and doctoral student in Vechev's research group.
Test of 12 language models reveals shortcomings
The methodology developed by the ETH researchers offers a starting point and basis for discussion. The researchers have also developed a first "compliance checker," a set of benchmarks that can be used to assess how well AI models comply with the likely requirements of the EU AI Act.
In view of the ongoing concretization of the legal requirements in Europe, the ETH researchers have made their findings publicly available in a study posted to the arXiv preprint server. They have also shared their results with the EU AI Office, which plays a key role in the implementation and enforcement of the AI Act -- and thus also in model evaluation.
In a study that is largely comprehensible even to non-experts, the researchers first clarify the key terms. Starting from six central ethical principles specified in the EU AI Act (human agency, data protection, transparency, diversity, non-discrimination, fairness), they derive 12 associated, technically clear requirements and link these to 27 state-of-the-art evaluation benchmarks.
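The layered mapping described above (ethical principles, derived technical requirements, and the benchmarks attached to them) can be sketched in code. The following is an illustrative sketch only, not the study's actual data or the COMPL-AI implementation; all principle, requirement, and benchmark names, as well as the averaging scheme, are invented assumptions for the example.

```python
# Illustrative sketch: representing a mapping from EU AI Act ethical
# principles to derived technical requirements and evaluation benchmarks,
# then aggregating benchmark scores per requirement. All names below are
# hypothetical placeholders, not the study's real mapping.
from statistics import mean

MAPPING = {
    "non-discrimination": {            # ethical principle from the Act
        "fairness": [                  # derived technical requirement
            "bias_benchmark",          # placeholder benchmark names
            "stereotype_probe",
        ],
    },
    "transparency": {
        "traceability": ["watermark_detection"],
    },
}

def requirement_scores(benchmark_results: dict[str, float]) -> dict[str, float]:
    """Average per-benchmark scores (0..1) up to the requirement level."""
    scores: dict[str, float] = {}
    for principle, requirements in MAPPING.items():
        for requirement, benchmarks in requirements.items():
            scores[requirement] = mean(
                benchmark_results[b] for b in benchmarks
            )
    return scores

# Hypothetical benchmark results for one model
results = {"bias_benchmark": 0.62, "stereotype_probe": 0.48,
           "watermark_detection": 0.90}
print(requirement_scores(results))
```

A real suite would use many more benchmarks per requirement and a weighting scheme justified by the legal text; the simple mean here only shows the aggregation direction, from raw benchmark outputs up toward requirement-level compliance signals.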
Importantly, they also point out which areas of concrete technical checks for AI models are less well developed or even non-existent, encouraging researchers, model providers, and regulators alike to push these areas further for an effective implementation of the EU AI Act.
Impetus for further improvement
The researchers applied their benchmark approach to 12 prominent large language models (LLMs). The results make it clear that none of the models analyzed fully meets the requirements of the EU AI Act today. "Our comparison of these large language models reveals that there are shortcomings, particularly with regard to requirements such as robustness, diversity, and fairness," says Staab.
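A finding like "no model fully meets the requirements" amounts to checking requirement-level scores against minimum thresholds. The sketch below is hedged: the requirement names echo those mentioned in the text, but the scores and the 0.75 threshold are invented for illustration and do not come from the study.

```python
# Hedged sketch: flagging the requirements on which a model falls short,
# given per-requirement scores in [0, 1] and a minimal acceptable score.
# Thresholds and scores here are invented, not taken from the study.

THRESHOLDS = {"robustness": 0.75, "diversity": 0.75, "fairness": 0.75}

def shortcomings(scores: dict[str, float]) -> list[str]:
    """Return the requirements on which a model scores below its threshold."""
    return [req for req, t in THRESHOLDS.items() if scores.get(req, 0.0) < t]

# Hypothetical scores for one model: falls short on robustness and diversity
model_scores = {"robustness": 0.60, "diversity": 0.70, "fairness": 0.80}
print(shortcomings(model_scores))  # -> ['robustness', 'diversity']
```

A model "fully meets" the requirements only when this list is empty; running the check over 12 models and finding no empty list is the shape of the study's headline result.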
This is partly because, in recent years, model developers and researchers have prioritized general model capabilities and performance over ethical and social requirements such as fairness and non-discrimination.
However, the researchers have found that even key AI concepts such as explainability remain unclear. In practice, suitable tools are lacking for explaining after the fact how the results of a complex AI model came about: what is not entirely clear conceptually is also almost impossible to evaluate technically.
The study makes it clear that various technical requirements, including those relating to copyright infringement, cannot currently be reliably measured. For Staab, one thing is clear: "Focusing the model evaluation on capabilities alone is not enough."
That said, the researchers' sights are set on more than just evaluating existing models. For them, the EU AI Act is a first example of how legislation will change the development and evaluation of AI models in the future.
"We see our work as an impetus to enable the implementation of the AI Act and to obtain practicable recommendations for model providers," says Vechev, "but our methodology can go beyond the EU AI Act, as it is also adaptable for other, comparable legislation."
"Ultimately, we want to encourage a balanced development of LLMs that takes into account both technical aspects such as capability and ethical aspects such as fairness and inclusion," adds Tsankov.
The researchers are making their benchmark tool COMPL-AI available on GitHub to initiate the technical discussion. There, the results and methods of their benchmarking can be analyzed and visualized. "We have published our benchmark suite as open source so that other researchers from industry and the scientific community can participate," says Tsankov.