Building a Native LLM: Upstage's Campaign to Amass One Trillion Korean Data Tokens
8/14/2023
Upstage launched the 1T Club to gather 1 trillion Korean-language data tokens and create an ecosystem for developing a high-performing native LLM
Building on its world-class technology, Upstage introduces a data-sharing scheme that rewards data providers through profit-sharing
Upstage also ensures maximum security to protect data and privacy
(Seoul, Aug. 14, 2023 /Upstage) Upstage, a leading AI startup in Korea, is embarking on a mission to build a high-performance Korean language model, drawing on its globally recognized technology.
On August 14th, Upstage introduced an initiative named the "1T (trillion) Club." The project aims to gather one trillion tokens of data as the foundation for a high-performance Large Language Model (LLM) tailored specifically to the Korean language. The 1T Club brings together a consortium of partners, each contributing no less than 100 million words of diverse Korean language data, including textual content, literary works, news articles, reports, and more.
Earlier this month, Upstage’s own LLM achieved an average score of 72.3 on the HuggingFace Open LLM Leaderboard. This score surpasses even that of GPT-3.5, solidifying Upstage's position as a frontrunner in the competitive AI landscape.
The HuggingFace Open LLM Leaderboard serves as a benchmark for evaluating open-source language models, with over 500 LLMs submitted for evaluation. The evaluation is based on four key metrics: reasoning, common-sense inference, contextual understanding, and factual accuracy (resistance to AI hallucinations). Recently, Upstage unveiled a new model that scored 73, giving the company both the first and second positions in the global ranking.
The launch of the 1T Club marks a significant stride toward Upstage's overarching goal of establishing Korea's independent AI model. Through the 1T Club, Upstage plans to gather the data needed to build a high-performance Korean LLM. The resulting model is positioned to address the diverse needs of local businesses, ushering in a new era of AI-driven solutions across various industries.
Upstage has already initiated conversations with dozens of entities across media, academia, and other sectors to build partnerships for data sharing. It plans to extend these discussions to prominent corporations, broadening the domains and applications of Korean private LLMs. Interested parties can sign up via Upstage's official website.
Data is an invaluable resource for building a language model. However, the available Korean data is limited in volume and complicated by copyright issues. The challenge is compounded by the fact that foreign-developed LLMs struggle to comprehend the intricacies of the Korean language, which hinders their use as private LLMs by Korean firms.
Meta's LLaMA was trained on 2 trillion tokens, while Google's LaMDA used 2.81 trillion. In stark contrast, Korean data in GPT-3's training comprises a mere 100 million tokens, representing just 0.01697 percent and ranking 28th among the languages in the training data. The discrepancy becomes even more pronounced when compared with the 45 trillion tokens of English data used for training, which widens the language gap further.
Upstage aims to further enhance Korea's AI capabilities through the 1T Club and is committed to establishing Korea as a frontrunner in the global AI industry. The company will strive to address issues such as copyright concerns arising from AI training through web crawling. Additionally, the initiative is designed to ensure benefits for both data providers and model creators, fostering a mutually advantageous relationship.
The 1T Club partners will receive a set of exclusive benefits. These include a progressive API discount structure tied to the amount of data contributed. Partners will also have the opportunity to share in the revenue from Upstage's LLM API sales, so that data providers benefit as the model succeeds.
Under the API discount mechanism, Upstage will offer significant concessions in proportion to the quantity of shared data tokens. For example, a company offering 100 million tokens' worth of data will receive 100 million API tokens free of charge.
Furthermore, partners will be eligible to receive a portion of the profits generated by the Upstage LLM's API business, proportional to the volume of data tokens they contribute.
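The two incentives described above can be illustrated with a minimal sketch. It assumes that the 1:1 ratio in the 100 million token example applies at every contribution size and that profit is split strictly pro rata; neither the exact discount tiers nor the size of the profit pool is stated in this release. The function names, partner names, and pool amount are hypothetical.

```python
from fractions import Fraction


def free_api_tokens(contributed_tokens: int) -> int:
    """Free API tokens granted for a data contribution.

    Assumption: the 1:1 ratio from the release's single example
    (100 million data tokens -> 100 million API tokens) holds at every size.
    """
    if contributed_tokens < 0:
        raise ValueError("contributed_tokens must be non-negative")
    return contributed_tokens


def profit_shares(contributions: dict[str, int], profit_pool: int) -> dict[str, Fraction]:
    """Split profit_pool among partners in proportion to contributed tokens."""
    total = sum(contributions.values())
    if total <= 0:
        raise ValueError("total contribution must be positive")
    return {
        partner: Fraction(tokens, total) * profit_pool
        for partner, tokens in contributions.items()
    }


# Hypothetical partners and pool size, for illustration only.
contributions = {
    "partner_a": 100_000_000,
    "partner_b": 300_000_000,
}
shares = profit_shares(contributions, profit_pool=1_000_000)
for partner, share in shares.items():
    print(partner, free_api_tokens(contributions[partner]), float(share))
```

Exact fractions are used so that the individual shares always sum to the full pool without rounding drift.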
Upstage is also committed to implementing rigorous measures to preserve the confidentiality of information. Data obtained from the 1T Club is used exclusively for the model's pre-training, which mitigates the risk of data extraction. Strict privacy protection standards are maintained and reinforced by Upstage's proprietary jailbreak check, which guards against external data breaches.
Sung Kim, Upstage's CEO, said, "LLMs are the essential technology behind modern generative AI, and it's crucial to create an ecosystem that allows various domestic industries to freely utilize high-performance private LLMs." Kim added, "Through the 1T Club, Upstage is committed to safeguarding the rights of data providers and developing LLMs that encapsulate the essence of Korean culture and sentiment, so that all domestic businesses can reap the benefits of AI technology."