Data-Centric AI and the Real-World

2023/04/12 | 4mins
 
  • Park Chan-joon (AI Research Engineer)

  • Those who are curious about the importance of data in AI systems

    Those who want to apply Data-Centric AI in the field
    Those who want to create good data

  • Have you heard of “Data-Centric AI,” the approach that emphasizes the importance of data in AI systems? We will look at how it is applied to business in the real world, where data is actually handled, and at the elements needed to create good data.

  • ✔️ What is Data-Centric AI?

    ✔️ How to apply Data-Centric AI in the Real-World

    ✔️ How to create data in Real-World

    ✔️ Quantity and quality of data

    ✔️ What is good data?

    ✔️ Closing (So what about Upstage?)

💡 You need fuel to make a car move, and you need ingredients to make food. Similarly, AI systems need fuel and raw materials, and that is the role data plays. In this article, I would like to introduce what actually happens in the real world, where “data” is handled day to day.





What is Data-Centric AI?


Artificial intelligence is everywhere in our daily lives. We search portal sites every day for the information we want and use machine translators when we hit language barriers. YouTube's recommendation system keeps showing content that fits our interests, so we lose track of time watching videos, and many of us use ChatGPT as an assistant for all sorts of tasks. We encounter and rely on AI systems throughout our daily lives!

What are these everyday AI systems made of? At the extreme, every AI system can be divided into two parts: data and code. In step 1, we plan and design what kind of AI system to develop (setup). In step 2, we prepare the data, the fuel suited to that purpose. In step 3, we write the code to train the model and run it on GPU hardware to train the system the developers intend. In the final step 4, the system is deployed (served) so that users or customers can use the model directly.
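To make the four steps concrete, here is a minimal sketch in Python using scikit-learn. It is purely illustrative; the dataset, the model, and the `serve` function are stand-ins, not any particular product's pipeline.

```python
# A toy walk-through of the four steps; everything here is illustrative.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Step 1 (Setup): decide what to build -- here, a simple flower classifier.
# Step 2 (Data): prepare the fuel suited to that purpose.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 3 (Code + training): write the code and train the model.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Step 4 (Deployment/serving): expose the model so users can call it.
def serve(features):
    """Stand-in for a real serving endpoint."""
    return model.predict([features])

print(serve(X_test[0]))
```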

Does the life cycle of an AI system end once it has been deployed? No! Just as humans need a balanced intake of nutrients to grow, AI systems must keep improving. So what approach is needed to advance an AI system?

Ultimately, we have to improve one of the two major elements: code or data. Data-Centric AI is the approach that improves model performance through data quality control, rather than through the code, that is, the modeling. In other words, Data-Centric AI says: instead of fixing the model (the code), let's fix the data!

We asked ChatGPT what Data-Centric AI is.

ChatGPT answers questions about Data-Centric AI



Looking at the answer, Data-Centric AI refers to building AI systems around data, and it emphasizes transforming the data to increase performance. Data-Centric AI can be summarized in two ways:


  • A research methodology that improves performance from the data point of view (holding the code / algorithms fixed)

    • e.g., data management (collecting new data), data augmentation, data filtering, synthetic data, label consistency (systematic labeling methods), data consistency, data tools (labeling tools), data measurement and evaluation, curriculum learning, active learning, etc. (a minimal augmentation-and-filtering sketch follows this list)


  • A research methodology asking: how can a model's performance be improved without modifying the model?

    • Should we find another model?

    • AI algorithms that understand data and use that information to improve models.
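As a toy illustration of two of the techniques named above, data augmentation and data filtering, here is a minimal Python sketch. The operations and thresholds are illustrative choices, not a prescribed recipe.

```python
import random

def augment(sentence: str, p_delete: float = 0.1) -> str:
    """Toy text augmentation: randomly drop words to create a perturbed copy.
    Real pipelines would use task-specific rules or dedicated libraries."""
    words = sentence.split()
    kept = [w for w in words if random.random() > p_delete]
    return " ".join(kept) if kept else sentence

def filter_short(examples: list[str], min_words: int = 3) -> list[str]:
    """Toy data filtering: drop examples too short to be informative."""
    return [ex for ex in examples if len(ex.split()) >= min_words]

corpus = ["the movie was surprisingly good", "bad", "I would watch it again"]
print(filter_short(corpus))  # filtering removes "bad"
print(augment(corpus[0]))    # augmentation yields a perturbed copy
```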



How to Apply Data-Centric AI in the Real-World

Data-flywheel

How are companies applying Data-Centric AI in the real world of actual business? There are various methods, but the most representative process is the “data-flywheel.” Whether a company is B2B or B2C, operating an AI-based service means logs will accumulate, and many companies use this accumulated data to provide better services.


The YouTube recommendation model reflects our needs so well because log data is continuously fed back into the model to increase user satisfaction. Actions taken on a platform, such as the search terms we enter and the search journeys we take on portal sites, become data and are actually being accumulated. The data-flywheel, then, processes the data accumulated while operating a service into training data for the model and continuously retrains the model on it, naturally improving the model's performance.


In other words, the data and the model interact over several iterations, improving the quality of both. This is the most representative form Data-Centric AI takes when applied to the real world.
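To make the loop explicit, here is a schematic, runnable sketch of a data-flywheel iteration. Every function body is a hypothetical stub standing in for real logging, labeling, training, and serving infrastructure.

```python
# Schematic data-flywheel; all function bodies are hypothetical stubs.
def collect_service_logs() -> list[str]:
    return ["user query 1", "user query 2"]      # logs accumulate while serving

def label_and_clean(logs: list[str]) -> list[tuple[str, int]]:
    return [(log, 0) for log in logs]            # logs become labeled training data

def retrain(model: dict, data: list[tuple[str, int]]) -> dict:
    model["seen_examples"] += len(data)          # additional training on new data
    return model

def deploy(model: dict) -> None:
    print(f"deployed model trained on {model['seen_examples']} examples")

def data_flywheel(model: dict, iterations: int = 3) -> dict:
    """One turn per iteration: serve -> accumulate -> relabel -> retrain -> redeploy."""
    for _ in range(iterations):
        logs = collect_service_logs()
        new_data = label_and_clean(logs)
        model = retrain(model, new_data)
        deploy(model)
    return model

data_flywheel({"seen_examples": 0})
```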



How to create data in the Real-World

So, is the data-flywheel all there is to Data-Centric AI in the real world? No! In the real world, we also create data ourselves. However, because most existing AI research has focused only on models, no systematic process has been established for the data development life cycle. As a result, relatively little attention has been paid to who creates data and how, which data is good data, and how good data is made. Feeling the need for such a process, at Upstage we are designing one around the data team.

The process of creating data in the Real-World

We research the data production process from A to Z, think about how to create good data, and study pipelines for it. To create good data, we are continuing various lines of research and publishing papers under the name DMOps (Data Management Operation and Recipes). (More on this in Part 2.)

These capabilities are completely different from AI modeling and serving capabilities. In other words, gathering people who do these things well and building a team around them is a great competitive advantage for a company.

Pipeline Structure for Creating Training Data (Source: https://arxiv.org/pdf/2303.10158.pdf)


In addition, the various sub-fields of Data-Centric AI contribute at each stage of data development: collection, labeling, preparation, reduction, and augmentation.


Quantity and quality of data

When creating data, then, which should be weighted more heavily: quantity or quality? What I have felt while handling data in the real world is that data quality deserves more weight.

Many existing data-centric AI studies in academia focus on the quantitative side of data, such as proposing new data augmentation methods or improving model performance by generating synthetic data to grow the dataset. However, what I have felt while running services in the field is that while those aspects matter, the quality of the data, that is, “label consistency,” matters more.

To achieve label consistency, guidelines should be given to the actual annotators: general rules for how to annotate according to the characteristics of each dataset, so that individual annotators' subjective judgments do not introduce bias into the data. In practice, two things then stand out: data measurement, i.e., how to evaluate label consistency, and how to improve the guidelines based on that evaluation.
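One standard way to quantify label consistency (a common choice, though not a metric this article prescribes) is inter-annotator agreement, for example Cohen's kappa. A minimal sketch with scikit-learn, using made-up labels:

```python
# Inter-annotator agreement via Cohen's kappa; the labels are invented.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["pos", "neg", "pos", "pos", "neg", "pos"]
annotator_b = ["pos", "neg", "neg", "pos", "neg", "pos"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"kappa = {kappa:.2f}")  # 1.0 = perfect agreement, ~0.0 = chance level
```

A low kappa signals that the guidelines are ambiguous and need revision, which is exactly the measure-then-improve loop described above.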

From this point of view, the desirable data-flywheel is, in my opinion, not a one-way improvement loop that simply adds more data whenever model performance is poor, but a two-way virtuous cycle between data and model, in which the guidelines and processes for data creation are gradually improved according to the model's results. In other words, poor model performance should prompt a qualitative expansion, not merely a quantitative expansion of the data. In the end, collecting the error-prone data encountered while serving and consistently correcting its ambiguous labels is what brings truly impactful performance improvements to the model.

Bi-directional data-flywheel: a virtuous cycle between data and model, in which the guidelines and processes of data creation are gradually improved according to the model's results, rather than a one-way loop that simply adds more data.

Therefore, designing a good data tool is also very important for creating high-quality data. The tool should let annotators work comfortably and include mechanisms that check whether label consistency is being maintained. Upstage has developed such a data tool under the name “Labeling Space” and is currently applying it to the in-house data pipeline to help create high-quality data. This tool is a key player: it significantly reduces the time and cost of data production while enabling the production of good data.


What is Good Data?

We've looked at the elements needed to create good data so far.

So what exactly is good data? In academia, benchmark data that can measure model performance objectively and clearly, and publicly available high-quality training data, would be considered good data. In the real world, however, various measures of good data can be defined beyond these conditions.

<Measures of good data used in the Real-World>

  • Is the metadata informative?

  • Is the amount of data sufficient and is the cost reasonable?

  • Does it compensate annotators fairly, without incurring unnecessary costs?

  • Is the versioning system well done?

  • Is the data storage folder structure intuitive and clear?

  • Is it free of unnecessary data?

  • Does it meet the requirements in the data requirements specification?

  • Is the data free of bias, contamination, and ethical issues?

  • Is data labeling valid and consistent?

  • Are ownership and copyrights, intellectual property rights, confidentiality and privacy properly considered?

Factors like these can all be viewed as measures of good data. They may seem obvious, but they are indispensable for completing good data. Even though academia does not count them toward good data, the real world must consider them as well. In other words, good data in academia and good data in companies are not the same thing.
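A few of the mechanical items above (metadata, versioning, duplicates, labeling) can be checked automatically; the governance items (fair compensation, copyright, ethics) still need human review. A minimal sketch, where all field names are illustrative assumptions:

```python
import json

def basic_data_checks(records: list[dict], metadata: dict) -> dict[str, bool]:
    """Automate a few mechanical checks from the checklist above."""
    unique = {json.dumps(r, sort_keys=True) for r in records}
    return {
        "metadata_informative": bool(metadata.get("description")),  # informative metadata?
        "version_tagged": "version" in metadata,                    # versioning in place?
        "no_duplicates": len(unique) == len(records),               # unnecessary duplicates?
        "all_labeled": all("label" in r for r in records),          # labels present?
    }

records = [{"text": "hello", "label": "greet"},
           {"text": "hello", "label": "greet"}]  # a duplicate, on purpose
print(basic_data_checks(records, {"description": "toy intents", "version": "0.1"}))
```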

Looking at recent data studies in academia, I think they often create data for data's sake rather than data for models. Many data-centric studies filter data based only on its intrinsic characteristics rather than on its compatibility with the model: if you look closely at the refinement criteria used when filtering, they focus on intrinsic properties of the data rather than on the model's outputs. But if we ask, “Why are we trying to create good data?”, the answer is that good data exists to create good models. I therefore think the model's performance is a very valid criterion for separating good data from bad.

I mentioned earlier that AI systems are divided into code and data. Data is clearly the part that can dramatically improve performance in a short time, but the code should not be overlooked either. What is needed, therefore, is data-centric research that uses the code, i.e., the model, as its guide: model-based Data-Centric AI.

I believe truly good data is data that contributes to improving model performance through constant cleansing based on the model's results, over several iterations with the modelers. The human-in-the-loop cycle, in which errors are detected through the model and cleaned by humans, is essential. Through these continuous cycles, the data should become not only error-free but also organically matched to the model's results. In other words, a revisit of data-centric AI is needed: realizing true Data-Centric AI requires not only treating data as important but also harmonizing it with the factors mentioned above.
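One simple way to seed such a human-in-the-loop turn (a common heuristic, not the article's prescribed method) is to flag examples where a confident model contradicts the stored label and route them back to annotators:

```python
# Flag examples where a confident model disagrees with the stored label.
# Threshold and data are illustrative; in practice use held-out predictions.
import numpy as np

def flag_suspect_labels(probs: np.ndarray, labels: np.ndarray,
                        threshold: float = 0.9) -> np.ndarray:
    """Return indices where the model's top class contradicts the label
    with confidence above `threshold` -- candidates for human re-review."""
    preds = probs.argmax(axis=1)
    confidence = probs.max(axis=1)
    return np.where((preds != labels) & (confidence >= threshold))[0]

probs = np.array([[0.95, 0.05],   # confidently class 0
                  [0.55, 0.45],   # uncertain
                  [0.02, 0.98]])  # confidently class 1
labels = np.array([1, 0, 1])      # the first stored label looks wrong
print(flag_suspect_labels(probs, labels))  # -> [0]
```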

In the end, the four most important things for good data can be summarized as follows.

1) Systematic processes such as DMOps

2) Guidelines designed with label consistency in mind

3) Tools that make data creation easy and efficient

4) Data that has undergone a continuous cleansing process based on the results of the model

In short, I would define good data as data created through a virtuous cycle in which data quality, guidelines, and model performance all improve together through the cleansing process. Its value will ultimately be judged through the model in the market. In the end, I believe the AI companies that survive will be those that are good at both models and data, beyond expertise in either alone.


Closing (So what about Upstage?)

Upstage is a company that is good at both models and data. The “Upstage AI Pack” developed by Upstage packages the real-world Data-Centric AI techniques discussed so far into an all-in-one, no-code/low-code product. You can think of the Upstage AI Pack as a representative example of an AI platform that embodies the concept of Data-Centric AI in the real world. The virtuous cycle of the data-flywheel is captured in a single click: it is a no-code/low-code AI solution that lets even those unfamiliar with AI build an AI system easily and efficiently.

With the various real-world factors built into the AI Pack, I believe we can create great data, beyond merely good data. We hope Upstage's AI Pack helps many companies create great data. In Part 2, we will talk about DMOps (Data Management Operation and Recipes), which is how we build real data.

 
 
 
  • Upstage, founded in October 2020, offers a no-code/low-code solution called "Upstage AI Pack" to help clients innovate in AI. This solution applies the latest AI technologies to various industries in a customized manner. Upstage AI Pack includes OCR technology that extracts desired information from images, recommendation technology that considers customer information and product/service features, and natural language processing search technology that enables meaning-based search. By using the Upstage AI Pack, companies can easily utilize data processing, AI modeling, and metric management. They can also receive support for continuous updates, allowing them to use the latest AI technologies conveniently. Additionally, Upstage offers practical, AI-experienced training and a strong foundation in AI through an education content business. This helps cultivate differentiated professionals who can immediately contribute to AI business.

    Led by top talents from global tech giants like Google, Apple, Amazon, Nvidia, Meta, and Naver, Upstage has established itself as a unique AI technology leader. The company has presented excellent papers at world-renowned AI conferences, such as NeurIPS, ICLR, CVPR, ECCV, WWW, CHI, and WSDM. In addition, Upstage is the only Korean company to have won double-digit gold medals in Kaggle competitions. CEO Sung Kim, an associate professor at Hong Kong University of Science and Technology, is a world-class AI guru who has received the ACM Sigsoft Distinguished Paper Award four times for his research on bug prediction and automatic source code generation. He is also well-known as a lecturer for "Deep Learning for Everyone," which has recorded over 7 million views on YouTube. Co-founders include CTO Hwal-suk Lee, who led Naver's Visual AI/OCR and achieved global success, and CSO Eun-jeong Park, who led the modelling of the world's best translation tool, Papago.

    Go to Upstage Homepage

 