
# Technical Design

### Duplication Detection

Data falls into two groups: structured and unstructured.

For structured data, we can adopt hash-based partitioning. First, calculate a hash value (such as SHA-256) over the key fields, then divide the data into 100 buckets according to the last two decimal digits of the hash value (equivalently, the hash modulo 100). Each bucket is processed separately for duplicate detection to reduce memory and computational overhead; records with identical key fields always land in the same bucket.

Unstructured data is mainly text and images. For text data, we use embeddings: convert each text into a vector and calculate the cosine similarity between vectors. When the similarity is ≥ 0.95, the texts are judged highly similar and given a lower weight score, which preserves the diversity of data within a single iDAO.

For image data, we can first use a classification model to determine the image category, and then apply a perceptual hashing algorithm. For example: scale the image to 8×8 pixels, convert it to grayscale, calculate the average pixel value, and generate a 64-bit hash by comparing each pixel with the average. Images with similar hash values are considered duplicates or highly similar.
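The three checks above can be sketched in a few functions. This is a minimal illustration, not the LazAI implementation: the function names, the field separator, and the choice to read "last two digits" as the hash modulo 100 are assumptions, and the image is assumed to be already scaled to 8×8 grayscale values.

```python
import hashlib
import math

NUM_BUCKETS = 100  # the 100 buckets described in the text


def bucket_for(record: dict, key_fields: list[str]) -> int:
    """Map a structured record to a bucket via SHA-256 of its key fields."""
    # The unit-separator character is an assumed delimiter between fields.
    key = "\x1f".join(str(record[f]) for f in key_fields)
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    # Read as an integer, the hash's last two decimal digits are hash % 100.
    return int(digest, 16) % NUM_BUCKETS


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def average_hash(gray_8x8: list[list[int]]) -> int:
    """64-bit average hash of an image already scaled to 8x8 grayscale."""
    pixels = [p for row in gray_8x8 for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        # One bit per pixel: 1 if brighter than the average, else 0.
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming_distance(h1: int, h2: int) -> int:
    """Number of differing bits; small distances suggest near-duplicates."""
    return bin(h1 ^ h2).count("1")
```

Two records only need to be compared when `bucket_for` returns the same bucket for both. Two texts would be flagged when `cosine_similarity` is at least 0.95. For images, the text does not specify a Hamming-distance threshold, so any cutoff would need to be tuned.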

### Quality Assessment

For text data, we calculate perplexity directly with the LazAI language reasoning model. Lower perplexity indicates more coherent and well-formed text, and is treated as a sign of higher information density. For image data, we mainly evaluate resolution and label consistency.
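As an illustration of the text metric, perplexity is the exponential of the negative mean log-probability per token. The sketch below assumes the language model exposes natural-log probabilities for each token; how these are obtained from the LazAI model is not specified here, and `perplexity` is a hypothetical helper name.

```python
import math


def perplexity(token_log_probs: list[float]) -> float:
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Illustrative values only: tokens the model finds likely vs. unlikely.
fluent = [math.log(0.5), math.log(0.4), math.log(0.6)]
garbled = [math.log(0.01), math.log(0.02), math.log(0.05)]
print(perplexity(fluent) < perplexity(garbled))
```

A text whose tokens each receive probability 0.25 has perplexity 4: the model is, on average, as uncertain as if it were choosing among four equally likely tokens.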

### Context Alignment Detection - Model Training Weight Analysis

Monitor how the weights change with the data during training. If a batch of data significantly reduces the model's loss (such as MSE for regression tasks or cross-entropy for classification tasks), that data contributes strongly to model optimization. Computing the gradient of the loss for each batch makes this measurable: data whose update direction is consistent with loss minimization, and whose gradient magnitude is large, is more important for model training.
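One way to make this concrete is to take a trial gradient step on each batch and measure how much the loss on held-out data drops. The sketch below uses a one-parameter regression model with MSE; the model, the learning rate, and the helper names are assumptions for illustration only.

```python
def mse(w: float, data: list[tuple[float, float]]) -> float:
    """Mean squared error of the one-parameter model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def mse_grad(w: float, data: list[tuple[float, float]]) -> float:
    """Gradient of the MSE with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)


def batch_contribution(w: float, batch, validation, lr: float = 0.1) -> float:
    """Validation-loss reduction after one trial gradient step on `batch`."""
    w_trial = w - lr * mse_grad(w, batch)
    return mse(w, validation) - mse(w_trial, validation)


validation = [(1.0, 2.0), (2.0, 4.0)]     # true relation: y = 2x
clean_batch = [(1.0, 2.0), (3.0, 6.0)]    # consistent with the task
noisy_batch = [(1.0, -2.0), (3.0, -6.0)]  # labels with flipped sign

clean_score = batch_contribution(0.0, clean_batch, validation)
noisy_score = batch_contribution(0.0, noisy_batch, validation)
print(clean_score > 0 > noisy_score)
```

A positive score means the batch moved the weight toward lower validation loss; a negative score means the batch pushed the model away from the task.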

Adopt importance sampling: allocate sampling probabilities according to each data item's contribution to weight updates. Data with a high probability is more closely aligned with the model's goals.
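A minimal sketch of the probability allocation, assuming each batch already has a numeric contribution score. Clipping negative scores to zero and falling back to uniform sampling are assumptions, not part of the described design, and the function names are hypothetical.

```python
import random


def sampling_probabilities(contributions: list[float]) -> list[float]:
    """Turn per-batch contribution scores into sampling probabilities."""
    # Assumption: batches that hurt the model (negative score) get probability 0.
    clipped = [max(c, 0.0) for c in contributions]
    total = sum(clipped)
    if total == 0:
        # No batch helped: fall back to uniform sampling.
        return [1.0 / len(contributions)] * len(contributions)
    return [c / total for c in clipped]


def sample_batches(contributions: list[float], k: int, seed: int = 0) -> list[int]:
    """Draw k batch indices with probability proportional to contribution."""
    rng = random.Random(seed)
    probs = sampling_probabilities(contributions)
    return rng.choices(range(len(probs)), weights=probs, k=k)


probs = sampling_probabilities([3.0, 1.0, -2.0])
print(probs)
```

If the sampled data is then used for training, importance sampling normally also reweights each sample by the inverse of its sampling probability to keep estimates unbiased; that step is omitted here.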

### Context Alignment Detection - Model Inference Verification

After training, use the data for inference testing. For example, in a fraud detection model, data that enables the model to correctly classify known fraud and non-fraud cases has a high degree of alignment. Calculate metrics such as precision, recall, and F1-score on the validation set. Higher scores indicate that the data is better suited to the model's task.
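For reference, the three metrics can be computed from true and predicted labels as follows. This is a small self-contained sketch for the binary fraud example, with the label encoding (1 = fraud, 0 = non-fraud) chosen here as an assumption.

```python
def precision_recall_f1(y_true: list[int], y_pred: list[int]) -> tuple[float, float, float]:
    """Binary precision, recall, and F1 with 1 = fraud and 0 = non-fraud."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fraud missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Illustrative labels: 3 fraud cases, 3 non-fraud cases.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Here two of three flagged cases are real fraud and two of three fraud cases are caught, so precision, recall, and F1 all equal 2/3.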

Analyze the confidence of model predictions: in classification tasks, data for which the model assigns a confidence close to 1 to the correct category is strongly consistent with the patterns the model has learned, and therefore has a high degree of alignment.
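A sketch of the confidence check, assuming the classifier outputs raw scores (logits) that are converted to probabilities with softmax. The function names are hypothetical, and the text does not say how close to 1 counts as aligned.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over raw class scores."""
    m = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def correct_class_confidence(logits: list[float], true_label: int) -> float:
    """Probability the model assigns to the correct category."""
    return softmax(logits)[true_label]


# Illustrative logits for a two-class problem; index 0 is the true class.
confident = correct_class_confidence([10.0, 0.0], true_label=0)
uncertain = correct_class_confidence([0.0, 0.0], true_label=0)
print(confident > uncertain)
```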

#### DQS Calculation

DQS is calculated using a weighted formula that synthesizes multidimensional indicators, as follows:

S = w1 \* DS + w2 \* AS + w3 \* CAS

where S is the resulting DQS, w1, w2, and w3 are weights, DS stands for Duplication Score, AS for Accuracy Score, and CAS for Model Context Alignment Score.
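The formula translates directly into code. In the sketch below, the default weights are placeholders chosen for illustration; the actual weight values, and whether scores are normalized to [0, 1] or weights sum to 1, are not stated in the text and are assumed here.

```python
def dqs(ds: float, as_: float, cas: float,
        weights: tuple[float, float, float] = (0.3, 0.3, 0.4)) -> float:
    """Weighted score S = w1 * DS + w2 * AS + w3 * CAS.

    ds  -- Duplication Score
    as_ -- Accuracy Score (trailing underscore because `as` is a Python keyword)
    cas -- Model Context Alignment Score
    """
    w1, w2, w3 = weights  # hypothetical default weights, not from the source
    return w1 * ds + w2 * as_ + w3 * cas


# Illustrative scores and weights only.
score = dqs(0.5, 0.8, 0.9, weights=(0.2, 0.3, 0.5))
```

With these example inputs the result is 0.2 × 0.5 + 0.3 × 0.8 + 0.5 × 0.9 = 0.79.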

