DMOps (Data Management Operation and Recipes), building data in the field
2023/05/12 | 5 min read
💡 With the rise of data-centric AI, the importance of data grows day by day. Part 1 introduced how Data-Centric AI is understood and operated in the real world; this article explains in detail how data is actually built in the field, through DMOps (Data Management Operation and Recipes).
Rule-based, statistical, machine learning, deep learning, LLM-based: there has never been an era of artificial intelligence in which data was not important. The learning methodology differs clearly from era to era, but the tasks and the data have stayed the same. Take machine translation: starting from the concept Warren Weaver proposed in 1949, through rule-based and statistical machine translation, up to the neural machine translation (NMT) used today by Google Translate, Papago, and DeepL, every approach has trained models on a parallel corpus. Tasks and data persist across eras; only the learning methodology changes. In other words, data has been ubiquitous since the term artificial intelligence was coined.
So who, exactly, creates and designs this data? As the proverb goes, "plant beans and you get beans, plant red beans and you get red beans": data does not appear by magic but through a series of processes. In this post, I would like to explain how training data for a high-quality artificial intelligence model is created, from zero to one.
The need for DMOps
With the advent of Data-Centric AI, academia and government agencies are pursuing various studies and policies on data. In academia, research ranges from improving model performance with large-scale datasets to creating benchmark datasets for comparing models. The government runs an open public data policy, provides data from the National Statistical Office and other sources, and operates the Data Dam project (hosted by the Ministry of Science and ICT as part of the Digital New Deal). Its core programs include AI voucher support and data construction for AI training, through which it has tried to lay a digital-economy foundation that gathers diverse data, including cloud and big data platforms. In addition, various data platforms, such as AIHUB, are operated at the national level by institutions such as the National Institute of the Korean Language and NIA.
However, industry needs data specific to its main business domain, and for B2B companies in particular, data tailored to customer requirements and business items is essential. Such data is difficult to satisfy with open benchmark datasets and public data alone, and even when those are used, additional domain-specific data must be produced by hand. As a result, many companies are producing the data they need themselves, and companies that professionally operate crowd workers are emerging.
In response to these companies' data needs, we would like to introduce DMOps, a pipeline that can be applied universally regardless of domain and lets you build data easily, quickly, and efficiently. DMOps is a comprehensive solution for data construction, covering the entire process from data production to delivery. It can help businesses streamline operations by reducing the time and cost of designing and producing data. In other words, DMOps is a comprehensive recipe that serves as a baseline for data production, enabling consistent, reliable, and high-quality data to be produced.
Introduction to DMOps
Just as you need a recipe to make food, you need a recipe to create data. That recipe is DMOps. DMOps is divided into 12 phases, covering everything from the business phase, where the data's purpose and requirements are analyzed, to the delivery of the final output data to the modeling team.
1. Establish the Project Goal
This is the stage where the purpose of data production and the business requirements are analyzed. It requires collaboration with the modeling team and the business operation team: identify the deep learning model the modeling team will use, determine the corresponding input/output data format, and decide the quantity needed at each step. In a company, everything starts from identifying the needs of users or customers. This is a crucial difference between academia and business, and considering the purpose of the project matters greatly in business. In other words, good data from a business point of view starts with data that properly reflects the data provider's needs, that is, the requirements.
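As a concrete illustration, the agreed requirements can be pinned down in a small, machine-readable spec before any data is collected. The sketch below is a hypothetical Python example; the field names, target quantity, and acceptance criteria are assumptions, not a prescribed DMOps format.

```python
# A hypothetical requirements spec sketched in code; all fields and values
# are illustrative, not a prescribed DMOps format.
from dataclasses import dataclass, field

@dataclass
class DataRequirementSpec:
    task: str                      # e.g. "named entity recognition"
    input_format: str              # what the modeling team expects as input
    output_format: str             # what the labels should look like
    target_quantity: int           # number of examples to deliver
    acceptance_criteria: list[str] = field(default_factory=list)

spec = DataRequirementSpec(
    task="named entity recognition",
    input_format="plain-text sentence (UTF-8)",
    output_format="character-offset entity spans, one JSON record per line",
    target_quantity=50_000,
    acceptance_criteria=[
        "inter-annotator agreement (Cohen's kappa) >= 0.8",  # assumed threshold
        "all personal information masked before delivery",
    ],
)
print(spec)
```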
2. Secure Raw Data (Collection of original data)
This is the stage in which the source data is investigated and collected. How is raw data collected? It falls broadly into five categories.
Data provided by the customer
Data collected through in-house crowdsourcing (internal workers)
Data gathered by crawling
Public data
Data collected through internal company events
When collecting data in any of these ways, copyright verification is essential, and companies must also go through legal review. In addition, the structure for storing the data, the right to modify it, and so on must be considered. It is important to protect the data from indiscriminate modification or mismanagement by granting access rights only to the team that handles the data.
When collecting raw data, it is important to consider the following four factors, as recommended in the "Artificial Intelligence Data Quality Standard (Ministry of Science and ICT)".
Data diversity: the data should have characteristics and variability similar to the real world; consider whether it contains all the characteristic information useful for learning and whether it varies in diverse ways.
Reliability: raw data must be collected from trustworthy sources.
Possibility of acquisition: do not collect data that is hard to process, such as data whose characteristics are unknown; collect data that is easy to process.
Compliance with the legal system: when collecting data containing personal information, collect only data for which collection and use have been consented to, and in areas where permission is required, obtain it in advance before collecting.
In particular, it is very important to consider copyright; make it a habit to always check the CC license before using data.
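For illustration, one lightweight habit is to keep source and license metadata attached to every raw record so it can be screened before use. The sketch below assumes a hypothetical license whitelist and placeholder URLs; the actually permitted licenses must come from legal review.

```python
# Keep provenance and license metadata next to every raw record so copyright
# review is possible later. The whitelist and records are illustrative only.
ALLOWED_LICENSES = {"CC0-1.0", "CC-BY-4.0", "CC-BY-SA-4.0"}

raw_records = [
    {"text": "...", "source_url": "https://example.org/a", "license": "CC-BY-4.0"},
    {"text": "...", "source_url": "https://example.org/b", "license": "All rights reserved"},
]

usable = [r for r in raw_records if r["license"] in ALLOWED_LICENSES]
held_for_review = [r for r in raw_records if r["license"] not in ALLOWED_LICENSES]
print(f"usable: {len(usable)}, held for legal review: {len(held_for_review)}")
```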
3. Data Pre-processing
This stage performs various preprocessing to improve the quality of the collected or delivered source data. The key tasks range from fitting the data to the required format to filtering out noise such as personal information and hate speech. In other words, it is the step where "quality over quantity" is put into practice.
These preprocessing tasks fall into two broad categories. The first is improving quality based on the inherent characteristics of the data. A representative example is parallel corpus filtering in machine translation, a data-centric and efficient methodology that improves model performance simply by managing data quality, without changing the model's architecture.
The second is addressing the ethical issues of the data. This includes attaching license information to the data in advance and masking any personal information it contains. If these tasks are not done up front, the data may end up unusable even after labeling and verification are carried out, so this is a very important task.
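As a rough illustration of both kinds of preprocessing, the sketch below combines a regex-based masking pass for obvious personal information with a simple length-ratio filter for a parallel corpus. Real pipelines use far more thorough PII detection and filtering; the patterns, threshold, and example pairs here are only assumptions.

```python
import re

# Illustrative patterns only; production PII detection is far more thorough.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b\d{2,3}-\d{3,4}-\d{4}\b")

def mask_pii(text: str) -> str:
    """Replace obvious e-mail addresses and phone numbers with placeholders."""
    return PHONE.sub("[PHONE]", EMAIL.sub("[EMAIL]", text))

def keep_pair(src: str, tgt: str, max_ratio: float = 3.0) -> bool:
    """Cheap parallel-corpus filter: drop empty pairs and badly mismatched lengths."""
    if not src.strip() or not tgt.strip():
        return False
    ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
    return ratio <= max_ratio

pairs = [
    ("Contact: jane.doe@example.com or 010-1234-5678", "연락처: jane.doe@example.com 또는 010-1234-5678"),
    ("Hello!", ""),  # misaligned pair, will be dropped
]
cleaned = [(mask_pii(s), mask_pii(t)) for s, t in pairs if keep_pair(s, t)]
print(cleaned)
```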
The three points to keep in mind when preprocessing data are as follows.
Refining criteria: establish clear criteria for selecting data that fits the purpose of construction, and effectively remove data that does not meet them.
De-identification: properly de-identify personal information, taking care that de-identification does not cause loss of information.
Avoiding redundancy: remove similar data and data without distinguishing characteristics.
4. Design a Data Schema
"Data schema" is a term from the database field for how data structures are defined: the format, structure, and constraints of the data stored in a database. Similarly, when designing a dataset, you need to plan and design how the data will be labeled. The step that carries out this work is the data annotation system design step.
In other words, it is the process of designing the annotation scheme so that the dataset contains all the necessary information: by looking at the data directly, you create an annotation system that captures the information needed to solve the problem the AI model is targeting. It is also important to separate the parts that can be automated (pseudo-labeling) from those that require human input (labeling) to improve efficiency and accuracy, because automation is an essential part of building data. In this process, it is important to quickly reinforce the weak parts of the design through pilot work.
To give a simple example, this step decides what information to extract in information extraction, which entities to tag in named entity recognition, how much to compress in document summarization, and the data policy for paraphrasing, literal translation, or free translation in machine translation. It is one of the most important steps. In academia, research is conducted with this information already pre-determined, but in a company even this must be newly designed to fit customer needs!
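For example, an annotation schema for a named entity recognition task might be written down roughly as follows. The label set, field names, and sample record below are hypothetical, not a fixed DMOps format.

```python
# A hypothetical annotation schema and sample record for an NER task.
ner_schema = {
    "task": "named-entity-recognition",
    "record_fields": {
        "id": "string, unique per sentence",
        "text": "string, the raw sentence",
        "entities": "list of {start, end, label} with character offsets",
    },
    "label_set": ["PERSON", "ORGANIZATION", "LOCATION", "DATE"],
    "constraints": ["entity spans must not overlap", "end offset is exclusive"],
}

example_record = {
    "id": "doc-0001-sent-03",
    "text": "Alice visited Seoul on 3 May 2023.",
    "entities": [
        {"start": 0, "end": 5, "label": "PERSON"},      # "Alice"
        {"start": 14, "end": 19, "label": "LOCATION"},   # "Seoul"
        {"start": 23, "end": 33, "label": "DATE"},       # "3 May 2023"
    ],
}
```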
5. Prepare a Guideline
This is the documentation step for delivering the designed annotation system to workers or crowd workers. From the perspective of a worker encountering the task for the first time, seeing every edge case at once can be distracting, so the document's difficulty must be carefully tuned, with a clear purpose and working method. The guidelines should plainly state the purpose of the data construction, definitions of the terms used, and points to keep in mind while building the data. Consider in advance which information must be disclosed to workers and what additional information they need, and attach examples when explaining the labeling scheme. The guidelines also need continuous revision through the construction and inspection process, with version control so that changes can be tracked.
In other words, it is recommended that the guidelines be written in the following order.
Overview
Purpose of data construction
Definition of terms
Labeling scheme
Data attribute taxonomy
Data annotation methods and procedures
Data annotation format and definitions
How to use the data annotation tools
How to manage annotations after completion
Criteria for rejection and acceptance
Notices
Edge cases
6. Recruit Annotators
This is the stage of recruiting the workers who will annotate the actual data. Recruiting workers through suitable tests is key to efficient and accurate work: set up a test that resembles the dataset construction guidelines, and use workers' accuracy and speed as recruitment criteria. Ethical considerations also matter. Good data is data for which workers are properly compensated, so consider whether workers are paid fairly and whether unnecessary costs are avoided.
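As a small illustration, such a qualification test can be scored automatically against a gold answer key, using accuracy and time taken as the criteria. The questions, answers, and pass thresholds below are placeholders; real criteria depend on the task.

```python
# Score annotator candidates on a hypothetical qualification test.
gold = {"q1": "POSITIVE", "q2": "NEGATIVE", "q3": "NEUTRAL"}

candidates = [
    {"name": "candidate_a", "answers": {"q1": "POSITIVE", "q2": "NEGATIVE", "q3": "NEGATIVE"}, "minutes": 12},
    {"name": "candidate_b", "answers": {"q1": "POSITIVE", "q2": "NEGATIVE", "q3": "NEUTRAL"}, "minutes": 25},
]

def accuracy(answers: dict) -> float:
    return sum(answers.get(q) == a for q, a in gold.items()) / len(gold)

for c in candidates:
    acc = accuracy(c["answers"])
    passed = acc >= 0.8 and c["minutes"] <= 30   # assumed recruitment criteria
    print(f"{c['name']}: accuracy={acc:.2f}, {'pass' if passed else 'fail'}")
```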
Depending on how the workers are operated, data construction can be divided into three cases.
The first is an internal organization, i.e., the company runs its own workers. Regular training and feedback on data quality are possible, but operating expenses and infrastructure for the data working environment are required. This setup suits work that needs close, in-process feedback on labeling results.
The second is outsourcing. It suits work that is specialized in data construction and requires particular knowledge and skills, and it has the advantage of drawing on the know-how of companies with strong expertise and experience. The downside is that defining requirements and agreeing on criteria takes considerable time.
The third is crowdsourcing. This method suits tasks where large amounts of data must be processed in a short period. However, there are clear limits to quality training and feedback.
7. Instruct Annotators (annotator training)
This step explains the guidelines created in step 5 to the workers. Rather than one-way communication that simply hands over the guidelines and checks understanding, the key is two-way communication that draws out as many questions as possible from workers and organizes them. In other words, make sure annotators can follow the overall flow of labeling. It is also important to help workers understand the purpose and label according to a natural logical flow, rather than treating the work as simple manual labor.
8. Data Labeling
This is the step of building the actual data. It can be described as the process of transferring the workers' linguistic, cognitive, and visual intuition into data. Accordingly, you need methods suited to each dataset, such as practical worker management and a way of unifying each worker's different intuitions along a reasonably universal line. It is also important to identify edge cases that were not anticipated.
Additionally, labeling data through a well-designed data labeling tool is key. When choosing a labeling tool, we recommend considering the following three factors:
Quality control: whether consistent and accurate data can be produced.
Efficiency: whether data can be built easily and efficiently, saving time.
Scalability: whether multiple workers can process large amounts of data simultaneously.
Additionally, it is recommended to split construction into two stages: run a pilot first, then start the main build; this is key to improving data quality. Before the main construction begins, build a small-scale test set to identify and fix issues and problems that were not found during data design. Through this process you can supplement and revise the guidelines and then select workers with the purpose of the dataset in mind.
Afterwards, during the main construction, manage the work schedule and the workers so the dataset can be built within the planned period, and continuously check through interim inspections that the data is labeled correctly, in order to produce high-quality data.
9. Data Internal Factor Verification
This is the stage where the data built by a worker is inspected, either by the worker themselves or by another worker. Since humans do the work, this step corrects the mistakes that naturally occur and, for edge cases that are hard to judge, reaches conclusions through discussion. It is a necessary step for ensuring data quality.
What you must consider at this stage is consensus labeling: checking the consistency of the data labeling, which can be measured with a metric called Inter-Annotator Agreement (IAA). Because annotators are human, they may make mistakes or misunderstand the guidelines, and their labels may differ unusually from those of other workers. To detect and avoid such mistakes, label agreement must be verified through IAA.
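As a minimal example, if two annotators have labeled the same items, their agreement can be measured with Cohen's kappa, one common IAA metric. The labels below are made up, and scikit-learn is assumed to be available; any kappa implementation would do.

```python
# Two annotators label the same six items; Cohen's kappa measures how much
# they agree beyond chance.
from sklearn.metrics import cohen_kappa_score

annotator_1 = ["POS", "NEG", "NEG", "POS", "NEU", "POS"]
annotator_2 = ["POS", "NEG", "POS", "POS", "NEU", "POS"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")  # a low score would trigger re-training or guideline revision
```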
In conclusion, data internal factor verification verifies the intrinsic factors of the data work. However, it cannot verify factors outside the data itself, nor the relationship between the model and the data (that is, whether the data actually helps improve model performance). The steps that verify these, data extrinsic factor verification and data evaluation through model verification, are explained separately below.
10. Data Extrinsic Factor Verification
This is the step of verifying the constructed data from the outside. First, determine whether the final data was created according to the guidelines. In addition, review 1) data sufficiency, 2) data diversity, 3) data trustworthiness, 4) privacy and security, and 5) data ethics. In other words, this step goes beyond the internal content of the data to examine its sufficiency, diversity, reliability, security, and ethics from an interdisciplinary perspective. This verification is best done through an Institutional Review Board or external advisors.
11. Data Evaluation through Model Verification
This step evaluates data quality through actual modeling. It quantitatively verifies whether the data meets the production goals and requirements as a whole, for example through experiments that measure data efficiency while increasing the amount of data, or experiments that check data quality consistency across different data sections. If some part does not match the purpose, the process from annotator training to data verification must be repeated. In other words, the human-in-the-loop cycle, in which errors are detected through the model and cleaned up by humans, is important. The key is to create data that, through continuous cycles, is not only error-free but also fits organically with the model's results.
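A data-efficiency experiment of the kind described above can be sketched as a simple learning curve: train the same model on growing slices of the dataset and track validation performance. The toy texts and the logistic-regression model below are placeholders chosen only to show the shape of the experiment, not a recommended setup.

```python
# Train the same simple model on 25%, 50%, and 100% of the training split
# and compare validation accuracy.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

texts = ["good", "great", "bad", "terrible", "fine", "awful", "nice", "poor"] * 50
labels = [1, 1, 0, 0, 1, 0, 1, 0] * 50

split = int(len(texts) * 0.8)
train_texts, train_labels = texts[:split], labels[:split]
val_texts, val_labels = texts[split:], labels[split:]

vectorizer = CountVectorizer().fit(train_texts)
X_train, X_val = vectorizer.transform(train_texts), vectorizer.transform(val_texts)

for frac in (0.25, 0.5, 1.0):
    n = int(len(train_labels) * frac)
    model = LogisticRegression().fit(X_train[:n], train_labels[:n])
    acc = accuracy_score(val_labels, model.predict(X_val))
    print(f"{frac:.0%} of training data -> validation accuracy {acc:.3f}")
```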
12. Data Deliverables
This is the stage of delivering the final data output, that is, handing the completed data to the modeling team or to the client. When delivering, it is important to control versions according to an agreed protocol and to disclose the shape of the data through samples, including the label distribution of the dataset. Furthermore, after an EDA process, it is recommended to deliver a data analysis report and a quality evaluation report together.
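For illustration, part of such a deliverable, the record count, label distribution, and a few sample records, can be generated automatically. The toy dataset, field names, and version string below are hypothetical.

```python
import json
from collections import Counter

# A toy dataset standing in for the delivered data; labels and version are hypothetical.
dataset = [
    {"text": "great service", "label": "POSITIVE"},
    {"text": "late delivery", "label": "NEGATIVE"},
    {"text": "okay overall", "label": "NEUTRAL"},
    {"text": "would order again", "label": "POSITIVE"},
]

report = {
    "dataset_version": "v1.2.0",   # assumed versioning scheme
    "num_records": len(dataset),
    "label_distribution": dict(Counter(r["label"] for r in dataset)),
    "sample_records": dataset[:2],
}
print(json.dumps(report, indent=2, ensure_ascii=False))
```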
Beyond the measures of good data in industry explained in Part 1 of this blog series, such as "How informative is the metadata?" and "Is compensation fair?" (data that fairly compensates workers without paying unnecessary costs), questions like "Is the versioning system well managed?" and "Is the data storage structure intuitive and clean?" can also be seen as measures of good data. When these factors come together, good data becomes great data.
The Future of DMOps
So what should future data-related research look like? If it used to be a competition between models, now it is a competition between models and people. That is why we need a human gold standard all the more, and it should serve as the reference point. Furthermore, we will have to build human gold standard data for areas of very high difficulty, such as SuperGLUE, and consider the many tasks where models still fall far short of human capability. In other words, data and people are inextricably linked.
Still, much of this needs to be automated. Many steps currently performed by humans need to be made more efficient through automation. One important research direction is data built by having the model self-label (producing synthetic, pseudo-labeled data) and then having humans inspect it, so that it achieves the same effect as if a real person had labeled it. This also means considering automatic data generation with LLMs such as ChatGPT and GPT-4.
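A minimal sketch of that self-labeling loop: a trained model proposes labels, high-confidence predictions are accepted automatically, and low-confidence ones are routed to human inspection. The `route_for_review` helper and the confidence threshold are hypothetical; any classifier exposing `predict_proba` (for example, a fitted scikit-learn pipeline) is assumed.

```python
def route_for_review(model, texts, threshold=0.9):
    """Split model predictions into auto-accepted and human-review queues.

    `model` is assumed to be any fitted classifier exposing predict_proba and
    classes_ (e.g. a scikit-learn pipeline); `threshold` is an assumed
    confidence cut-off, not a recommended value.
    """
    proba = model.predict_proba(texts)            # shape: (n_samples, n_classes)
    confidences = proba.max(axis=1)
    predicted = model.classes_[proba.argmax(axis=1)]
    auto_accepted, needs_human_review = [], []
    for text, label, conf in zip(texts, predicted, confidences):
        bucket = auto_accepted if conf >= threshold else needs_human_review
        bucket.append({"text": text, "label": label, "confidence": float(conf)})
    return auto_accepted, needs_human_review
```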
Finally, we need to think more deeply about evaluation. Looking at AI papers, there are many studies on quantitative evaluation, but standards for human evaluation are extremely rare. I think it is important to design clear standards and systems for human evaluation. In addition, when building a test set, constructing data that can explain why an answer is correct and why it is labeled that way, rather than simply labeling the correct answer, will be an important future research topic.
In this post, we introduced DMOps, a universally applicable data pipeline that can efficiently produce the high-quality data a business needs. We hope this Upstage post has been a good opportunity to experience indirectly how data is produced in companies.