Introducing Italia: iGenius’ First Open Source Foundational LLM

Image with the statue of David at the centre | iGenius media
iGenius
June 6, 2024

After months of hard work, we are pleased to announce the launch of Italia, our 100% open-source Foundational Large Language Model, developed in collaboration with Cineca.

Thanks to this partnership, we trained and fine-tuned our model on a large scale, using thousands of GPUs on the Leonardo supercomputer, one of the most advanced and high-performing computing infrastructures in the world.

The first model in our series is Italia 9B, a Foundational LLM with a 9-billion-parameter Transformer architecture, a context window of 4,096 tokens, and a vocabulary of 50,000 tokens.

An accurate, powerful, and secure model

Italia 9B was trained from scratch in Italian on trillions of tokens, using a heterogeneous mix of data: public sources, synthetic data, and domain-specific content provided by our commercial partners. Because it was trained exclusively on Italian text, with no translation from English, Italia 9B can capture Italian linguistic and cultural nuances with unprecedented precision.

Additionally, we have established a collaboration with Editoriale Nazionale, a company within the Monrif group, to use their historical archive of press articles as an additional source to improve our model.

Thanks to Editoriale Nazionale’s valuable content, we will be able to further expand Italia's knowledge, covering decades of national and international history. We plan on incorporating their content in future versions of Italia, aiming to enhance both the model's general knowledge and conversational capabilities.

To build our training dataset and ensure the ethical integrity of the generated content, we have developed specific safety filters for the Italian language. These filters remove sensitive, explicit, and highly biased content from our selected sources.

These protection mechanisms, combined with the adoption of cutting-edge data cleaning techniques, have also allowed us to limit hallucinations and the generation of content inconsistent with the conversation.

Data security and information reliability have always been priorities for iGenius. We have invested in building a high-quality Italian dataset to develop a truly open, transparent, and secure language model, in compliance with European AI regulations such as the AI Act.

Since 2016, our mission has been to humanize data and democratize business knowledge, shifting the historical AI paradigm from data-centric to people-centric. Developing an open-source language model felt like the natural next step toward products aligned with this objective, and it allows us to raise the level of transparency, trust, and security for the people and companies that choose to adopt it.

A model designed for businesses

Italia was designed for companies operating in highly regulated sectors, such as financial services or public administration.

Even in its first version, it presents itself as a unique LLM. Although specialized in a single language, its high number of parameters, combined with the quality of the training process, makes it the ideal choice for the most critical use cases in the enterprise world, where the reliability of generated content is of paramount importance.

As the name suggests, Italia is equipped with excellent linguistic formulation capabilities in Italian. These cover not only vocabulary and sentence structure but also cultural and historical knowledge of the country, which is essential for applications that require advanced proficiency in the Italian language.

In addition to its outstanding conversational ability, Italia excels in the efficiency with which it processes Italian words.

Thanks to a proprietary tokenizer, specifically developed from scratch for this model by the iGenius team, Italia can process and generate tokens in Italian with a performance equivalent to a 60% increase in its context window.

This directly translates into significant cost and resource efficiency in serving the model, as well as enhanced computational performance, both of which are crucial features for an AI solution in the enterprise environment.
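To put the 60% figure in perspective, here is a minimal arithmetic sketch. It assumes the claim means that Italia's tokenizer encodes the same Italian text in fewer tokens than some reference tokenizer; the announcement does not name that baseline, so the comparison is an assumption. The function names are hypothetical and exist only for this illustration.

```python
# Illustrative arithmetic only. The 4,096-token window and the 60% figure come
# from the announcement; the comparison against a baseline tokenizer is an
# assumption, since the baseline is not named in the post.

CONTEXT_WINDOW = 4096   # tokens, as stated for Italia 9B
EFFECTIVE_GAIN = 0.60   # "equivalent to a 60% increase in its context window"


def effective_context(window: int, gain: float) -> float:
    """Window a baseline tokenizer would need to hold the same amount of text."""
    return window * (1.0 + gain)


def token_reduction(gain: float) -> float:
    """Fraction of tokens saved on the same text, relative to the baseline."""
    return 1.0 - 1.0 / (1.0 + gain)


if __name__ == "__main__":
    window = effective_context(CONTEXT_WINDOW, EFFECTIVE_GAIN)
    saved = token_reduction(EFFECTIVE_GAIN)
    print(f"Baseline-equivalent window: {window:.0f} tokens")
    print(f"Tokens saved on the same text: {saved:.1%}")
```

Under this reading, a 4,096-token window holds roughly as much Italian text as a window of about 6,554 tokens would with the baseline tokenizer, which corresponds to about 37.5% fewer tokens for the same text. Fewer tokens per request is where the serving cost savings would come from.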

According to our team of experts, language models specialized in a single language, like Italia, cannot be accurately evaluated using benchmark systems focused on general questions, especially those designed for the English and American ecosystems.

As such, we are working with top-tier Italian institutions to develop an impartial benchmark system for evaluating native Italian models. This system will not be limited to general knowledge topics but will also include references to real-world business use cases.

How to bring AI into your business with Unicorn

Italia is the first step towards a Digital Renaissance, introducing a new era of AI development that prioritizes people, not technology.

Italia was developed and trained with particular attention to the needs of businesses and professionals, ensuring effective integration of Artificial Intelligence into their activities.

At iGenius, we believe that every organization should adopt AI with solutions tailored to their specific needs, not through a generalized approach, while maintaining control over their private data.

Since 2016, we have been working with companies to adapt Artificial Intelligence to their needs, always starting from the real requirements of individuals, rather than merely their data.

This is what we accomplished with Crystal, our Decision Intelligence product for businesses, which allowed us to fully understand the challenges that prevent organizations from adopting AI in critical and high-priority operational contexts.

This was also our thinking behind Unicorn, a new business line aimed at supporting public and private organizations in adopting AI and Large Language Models through solutions tailored to their specific needs.

By combining the reasoning capabilities of language models like Italia with the reliability of data and business knowledge, we are able to create effective, secure, and scalable AI solutions that meet the quality standards of highly regulated sectors.

To achieve this goal, we collaborate with top-tier partners and system integrators to ensure the optimal integration of our technologies into existing company infrastructures, providing continuous support and precise customization of solutions.

This approach allows us to address each client's specific challenges with maximum attention, improve operational efficiency, and accelerate innovation while maintaining high levels of security and regulatory compliance.

Italia is the result of extensive research and development, and we’re still just at the beginning of our journey in AI innovation.

We are already working on new versions of the model, including a multilingual version that will soon be available.

Interested in downloading Italia 9B? Check it out on Hugging Face.

To receive updates on the latest news about Italia and iGenius, subscribe to our newsletter.

The Digital Renaissance is in place.

Read our official press release here.

[Figure: benchmark scores for Italia 3B Instruct - v0.1. Chart 1 (ITA, 5-shot): ARC, MMLU, HellaSwag; values 0.55, 0.43, 0.42. Chart 2 (ITA, 0-shot): TruthfulQA MC2, TruthfulQA MC1; values 0.38, 0.25. Chart 3 (ITA, 0-shot): XCOPA, LAMBADA accuracy, LAMBADA perplexity; values 0.71, 0.42, 44.98.]