DataFi: New Opportunities in the AI Era - How Web3 Leads the Data Track
Exploring DataFi's Development Potential Through the AI Data Track
The world is in a global race to build the best foundation models. Computing power and model architecture matter, but the true moat is training data. Starting from Scale AI, this article explores the potential of the AI data track.
The Secret to Scale AI's Success
Scale AI is currently valued at $29 billion, with clients that include the U.S. military and several competing AI giants. Its core business is supplying large volumes of accurately labeled data, and it stands out among unicorns because it recognized early how important data would be to the AI industry.
Computing power, models, and data are the three pillars of AI. As large language models developed rapidly, the industry's attention shifted from models to computing power. Today most models have settled on the transformer as their architecture, and the major players have addressed computing power either by building their own supercomputing clusters or by signing long-term agreements with cloud providers. Against this backdrop, the importance of data has come to the fore.
Scale AI is not only committed to mining existing data; it is also betting on the longer-term business of data generation, assembling training teams of human experts from different fields to supply higher-quality data for model training.
Two Stages of AI Model Training
The training of an AI model is divided into two stages: pre-training and fine-tuning.
The pre-training stage resembles an infant learning to speak: large volumes of text, code, and other data crawled from the internet are fed into the model, which acquires basic language ability through self-supervised learning.
The fine-tuning stage is more like school education, with clear right and wrong answers and defined goals: pre-processed, targeted datasets are used to train specific capabilities into the model.
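To make the contrast concrete, here is a minimal Python sketch (all samples and field names are hypothetical; no real model or training library is involved): pre-training derives its targets from the raw text itself, while fine-tuning relies on curated, human-labeled pairs.

```python
# Minimal sketch: how pre-training and fine-tuning data differ.
# All samples and field names here are hypothetical illustrations.

# Pre-training input: raw text crawled from the web, lightly processed.
pretrain_sample = "Transformers use self-attention to weigh every token in a sequence."

def next_token_pairs(text: str) -> list[tuple[str, str]]:
    """Self-supervised targets: each prefix predicts its next word,
    so the raw text supplies its own labels."""
    words = text.split()
    return [(" ".join(words[:i]), words[i]) for i in range(1, len(words))]

# Fine-tuning input: a curated prompt/response pair with human feedback.
finetune_sample = {
    "prompt": "Explain self-attention in one sentence.",
    "response": "Self-attention lets each token weigh all other tokens "
                "when building its representation.",
    "annotator_score": 5,  # human feedback, assumed 1-5 scale
}

print(next_token_pairs(pretrain_sample)[:3])  # labels come from the data itself
print(finetune_sample["annotator_score"])     # labels come from human curation
```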
Accordingly, the data required for AI training falls into two categories:
Large volumes of data requiring minimal processing, typically crawled from large UGC platforms, public literature databases, private corporate databases, and similar sources.
Carefully designed and selected data, akin to professional textbooks, which must go through cleaning, filtering, labeling, and human feedback.
These two categories of datasets form the core of the AI data track. As models grow more capable, increasingly refined and specialized training data will become a key determinant of model performance.
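The second category implies a curation pipeline. Below is a minimal sketch, assuming hypothetical helper functions rather than any project's real pipeline, of the clean, filter, label, and feedback flow such data goes through:

```python
# Minimal sketch (hypothetical helpers, not any project's real pipeline)
# of the curation steps: clean -> filter -> label -> human review.

def clean(text: str) -> str:
    """Normalize whitespace and strip obvious noise."""
    return " ".join(text.split())

def passes_filter(text: str) -> bool:
    """Drop fragments too short to be worth labeling."""
    return len(text) >= 20

def label(text: str) -> dict:
    """Attach a placeholder annotation; in practice this step is done by
    domain experts, as in Scale AI's expert training teams."""
    return {"text": text, "label": None, "needs_review": True}

raw = [
    "  Large  models   are trained on web-scale corpora.  ",
    "too short",
]

curated = [label(t) for t in (clean(r) for r in raw) if passes_filter(t)]
print(curated)  # only the cleaned, filter-passing record remains
```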
Web3 DataFi: The Ideal Soil for AI Data
Compared with traditional data processing, Web3 has natural advantages in the AI data field, giving rise to the new concept of DataFi. The advantages of Web3 DataFi are mainly reflected in the following:
For ordinary users, DataFi is the easiest decentralized AI project to take part in. There are no complicated contracts to sign and no expensive hardware to buy; users participate through straightforward tasks such as contributing data, evaluating model outputs, and making simple creations with AI tools.
Promising Projects in Web3 DataFi
Currently, several Web3 DataFi projects have secured significant funding, demonstrating the immense potential of this field. Here are some representative projects:
Sahara AI: Committed to building a decentralized AI super infrastructure and trading market.
Yupp: An AI model feedback platform that collects user feedback on model output content.
Vana: Transforming users' personal data into monetizable digital assets.
Chainbase: Focuses on on-chain data, covering over 200 blockchains.
Sapien: Aims to transform human knowledge on a large scale into high-quality AI training data.
Prisma X: Committed to becoming an open coordination layer for robots, where physical-world data collection is key.
Masa: A leading subnet project in the Bittensor ecosystem, operating the Data Subnet and Agent Subnet.
Irys: Focused on programmable data storage and computation.
ORO: Empowering ordinary people to contribute to AI.
Gata: Positioned as a decentralized data layer, offering multiple participation methods.
Thoughts on Current Projects
At present, the barriers to entry for these projects are generally low, but once a user base and ecosystem stickiness accumulate, platform advantages compound quickly. Early-stage projects should therefore focus on incentives and user experience.
At the same time, these data platforms must consider how to manage their contributor workforce, ensure the quality of the data they output, and prevent bad money from driving out good. Projects such as Sahara and Sapien have already begun tightening their data-quality management.
In addition, improving transparency is an important challenge for current on-chain projects. Many still lack sufficiently public, traceable data, which undermines the long-term healthy development of Web3 DataFi.
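As one illustration of what traceability could look like, here is a minimal sketch assuming a platform anchors a hash of each data contribution on-chain; the receipt format, field names, and address are placeholders, not any project's actual protocol:

```python
# Minimal sketch (no real chain client; the address is a placeholder) of
# one way a DataFi platform could make contributions publicly traceable:
# hash each data batch and publish the digest with the contributor's
# address, so anyone can re-hash the batch and verify it later.
import hashlib
import json
import time

def contribution_receipt(batch: list[dict], contributor: str) -> dict:
    payload = json.dumps(batch, sort_keys=True).encode("utf-8")
    return {
        "contributor": contributor,   # e.g. a wallet address
        "batch_sha256": hashlib.sha256(payload).hexdigest(),
        "timestamp": int(time.time()),
    }

receipt = contribution_receipt(
    [{"prompt": "What is DataFi?", "response": "..."}],
    contributor="0x0000000000000000000000000000000000000000",
)
print(receipt)  # this record is what would be written on-chain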
Finally, large-scale adoption of DataFi depends on attracting enough individual participants and winning recognition from mainstream enterprises; projects such as Sahara AI and Vana have already made good progress on this front.
DataFi represents a long-term symbiosis between human intelligence and machine intelligence. For anyone who views the AI era with a mix of anticipation and concern, participating in DataFi is a worthwhile way in.