Technical Design
Duplication Detection
We divide data into structured and unstructured data.

For structured data, we adopt hash-based partitioning. We first calculate a hash value (such as SHA-256) over the key fields, then assign each record to one of 100 buckets according to the last two digits of the hash value. Each bucket is deduplicated independently, which reduces memory and computational overhead.

Unstructured data is further divided into text and images. For text data, we rely on embeddings: the text is first converted into vectors, and the cosine similarity between vectors is calculated. When the similarity is ≥ 0.95, the texts are judged to be highly similar, and a lower weight score is assigned to preserve the diversity of data within a single iDAO organization. For image data, we first use a classification model to determine the image category and then apply a perceptual hashing algorithm: scale the image to 8×8 pixels, convert it to grayscale, calculate the average pixel value, and generate a hash by comparing each pixel with that average. Images with similar hash values are considered duplicates or highly similar.
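The sketch below illustrates the three mechanisms under stated assumptions: SHA-256 bucketing over a record's concatenated key fields (modulo 100 stands in for "the last two digits"), cosine similarity over precomputed text embeddings, and the average-hash form of perceptual hashing. Pillow and NumPy are assumed to be available; the key string, embedding source, and comparison thresholds are placeholders.

```python
import hashlib

import numpy as np
from PIL import Image


def bucket_id(key_fields: str, num_buckets: int = 100) -> int:
    """Assign a record to one of 100 buckets using the last two (decimal)
    digits of its SHA-256 hash, i.e. the hash value modulo 100."""
    digest = hashlib.sha256(key_fields.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two text embeddings; values >= 0.95 are
    treated as highly similar and down-weighted."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def average_hash(image_path: str) -> int:
    """Average hash: scale to 8x8, convert to grayscale, compare each pixel
    with the mean, and pack the resulting bits into an integer."""
    img = Image.open(image_path).convert("L").resize((8, 8))
    pixels = np.asarray(img, dtype=np.float32)
    bits = (pixels > pixels.mean()).flatten()
    return int("".join("1" if b else "0" for b in bits), 2)


def hamming_distance(h1: int, h2: int) -> int:
    """Number of differing bits; small distances indicate near-duplicate images."""
    return bin(h1 ^ h2).count("1")
```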
Quality Assessment
For text data, we compute the perplexity of the data with the LazAI language reasoning model. The lower the perplexity, the more coherent and standardized the text and the higher its information density. For image data, we mainly evaluate resolution and label consistency.
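A minimal sketch of the text-side check, with a generic Hugging Face causal language model standing in for the LazAI language reasoning model (the model name is a placeholder): perplexity is computed as the exponential of the average token-level cross-entropy.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; the system itself uses the LazAI language reasoning model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def perplexity(text: str) -> float:
    """Perplexity = exp(average token-level cross-entropy); lower is better."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return float(torch.exp(out.loss))
```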
Context Alignment Detection - Model Training Weight Analysis
Monitor how the weights change with the data during training: if a batch of data significantly reduces the model's loss (such as MSE for regression tasks or cross-entropy for classification tasks), it indicates that the data contributes substantially to model optimization. By computing the gradient of the loss function for each batch of data, batches whose gradients point in the direction of loss minimization and have a large magnitude are more important for model training.
Adopt importance sampling: assign sampling probabilities according to each batch's contribution to the weight updates. Data with a high probability is better aligned with the model's objectives.
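A minimal PyTorch sketch of both steps, assuming batches of (inputs, targets) tensors, a differentiable loss_fn, and a held-out reference batch: each batch is scored by its gradient magnitude weighted by how well its gradient direction agrees with the reference gradient, and the scores are normalized into sampling probabilities.

```python
import torch


def flat_grad(model, loss):
    """Backpropagate `loss` and return the model's gradient as a single vector."""
    model.zero_grad()
    loss.backward()
    return torch.cat([p.grad.flatten() for p in model.parameters() if p.grad is not None])


def batch_scores(model, loss_fn, batches, ref_batch):
    """Score each batch by gradient magnitude, weighted by cosine alignment
    with the gradient computed on a held-out reference batch."""
    ref_x, ref_y = ref_batch
    ref_g = flat_grad(model, loss_fn(model(ref_x), ref_y))
    scores = []
    for x, y in batches:
        g = flat_grad(model, loss_fn(model(x), y))
        align = torch.cosine_similarity(g, ref_g, dim=0).clamp(min=0.0)
        scores.append(float(g.norm() * align))
    return scores


def sampling_probabilities(scores):
    """Importance-sampling probabilities proportional to each batch's contribution."""
    total = sum(scores) or 1.0
    return [s / total for s in scores]
```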
Context Alignment Detection - Model Inference Verification
After training, use the data for inference testing: for example, in a fraud detection model, data that enables correct classification of known fraud/non-fraud cases has a high alignment degree. Compute metrics such as precision, recall, and F1-score on the validation set; higher scores indicate that the data is better suited to the model's task.
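For example, with scikit-learn and binary fraud labels (1 = fraud), the validation-set metrics can be computed as follows; the label arrays here are illustrative only.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# y_true: validation-set labels, y_pred: predictions from the model trained
# on the candidate data (illustrative values only)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]

precision = precision_score(y_true, y_pred)  # share of predicted fraud that is real fraud
recall = recall_score(y_true, y_pred)        # share of real fraud that was caught
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```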
Analyze the confidence of model predictions: in classification tasks, data for which the model assigns a confidence close to 1 to the correct class is strongly consistent with the patterns the model has learned and therefore has a high alignment degree.
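A minimal sketch of the confidence check, assuming the classifier outputs raw logits: apply softmax and read off the probability assigned to the correct class.

```python
import torch


def correct_class_confidence(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Per-sample probability assigned to the true class; values near 1
    indicate high alignment with the patterns the model has learned."""
    probs = torch.softmax(logits, dim=-1)          # (N, C) class probabilities
    return probs.gather(1, labels.unsqueeze(1)).squeeze(1)  # (N,) confidences
```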
DQS Calculation
DQS is calculated using a weighted formula that synthesizes multidimensional indicators, as follows:
S = w1 * DS + w2 * AS + w3 * CAS
where w1, w2, and w3 are weights, DS stands for Duplication Score, AS for Accuracy Score, and CAS for Model Context Alignment Score.
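A minimal sketch of the aggregation; the weights w1, w2, and w3 are configuration choices, and the values below are illustrative placeholders.

```python
def dqs(ds: float, as_: float, cas: float,
        w1: float = 0.3, w2: float = 0.3, w3: float = 0.4) -> float:
    """Data Quality Score: weighted sum of the Duplication Score (DS),
    Accuracy Score (AS), and Model Context Alignment Score (CAS).
    The default weights are illustrative placeholders."""
    return w1 * ds + w2 * as_ + w3 * cas


# Example: a sample with DS=0.9, AS=0.8, CAS=0.85 under the placeholder weights
print(dqs(0.9, 0.8, 0.85))  # 0.85
```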