Data First. Fine-Tuning Last.

In the rapidly evolving field of AI, there’s an often-overlooked element that can make or break a project: the quality and nature of the underlying data. While much attention is given to model architectures, hyperparameter tuning, and the latest large language models (LLMs), the critical practice of examining raw data remains widely neglected.

The Fundamentals of Data in AI

It’s essential to understand how data quality and representation directly impact model performance. In both traditional machine learning and modern LLM approaches, one truth always holds: “garbage in, garbage out.” High-quality, representative data is the foundation upon which successful AI applications are built.

Data quality encompasses several factors:

  1. Accuracy: The correctness of the data points
  2. Completeness: The presence of all necessary information
  3. Consistency: The uniformity of the data across the dataset
  4. Relevance: The applicability of the data to the problem at hand
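Three of these four factors lend themselves to automated checks (relevance usually requires domain judgment). Here is a minimal sketch in Python, using hypothetical shipment records whose field names are purely illustrative:

```python
# Hypothetical records; "description", "value", and "currency" are illustrative fields.
records = [
    {"description": "steel bolts", "value": 120.0, "currency": "USD"},
    {"description": "", "value": 95.0, "currency": "usd"},              # incomplete
    {"description": "copper wire", "value": -10.0, "currency": "USD"},  # inaccurate
]

def accuracy_issues(rows):
    """Declared values must be positive to be plausible."""
    return [r for r in rows if r["value"] <= 0]

def completeness_issues(rows):
    """Every field must be present and non-empty."""
    return [r for r in rows if any(v in ("", None) for v in r.values())]

def consistency_issues(rows):
    """Currency codes should use one canonical casing across the dataset."""
    return [r for r in rows if r["currency"] != r["currency"].upper()]

print(len(accuracy_issues(records)))      # 1
print(len(completeness_issues(records)))  # 1
print(len(consistency_issues(records)))   # 1
```

Even checks this simple, run before any modeling, surface problems that would otherwise show up much later as unexplained model errors.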

The choice between traditional ML approaches and LLMs is often influenced by the nature of the data available. Traditional ML methods may be more suitable for structured, numerical data with clear patterns, while LLMs excel at handling unstructured text data and can leverage pre-trained knowledge. However, as we’ll see in our case studies, this decision is not always straightforward.

[The case studies provided are from the “Mastering LLMs: A Conference For Developers & Data Scientists”]

Case Study 1: Value Prediction in Logistics

Let’s examine a scenario where a logistics company attempted to predict the value of shipped items based on short, 80-character descriptions provided by shippers. At first glance, this task might seem well suited to a classical natural language processing (NLP) pipeline built with traditional ML.

Initially, the team considered using classical NLP/ML techniques. However, they realized that encoding words from scratch would mean the model would have no prior understanding of rarely seen words. This led them to experiment with fine-tuning an LLM, hoping to leverage its pre-trained knowledge.

Challenges faced:

  1. Abbreviated descriptions: Many entries used company-specific acronyms or shorthand.
  2. Historical data biases: Past behavior, such as undervaluing items to avoid insurance costs, skewed the dataset.
  3. Limited context: The 80-character limit meant crucial information was often omitted.

The team found that both traditional ML and LLM approaches struggled. The LLM, despite its vast pre-trained knowledge, couldn’t effectively interpret the domain-specific abbreviations. Traditional ML models, while potentially better at handling the structured nature of the data, lacked the semantic understanding to make accurate predictions.

Key takeaway: The nature of the data - short, highly abbreviated, and domain-specific - made it unsuitable for both approaches without significant preprocessing and domain adaptation.
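One form that preprocessing and domain adaptation can take is expanding company-specific shorthand before the text ever reaches a model. The sketch below assumes a hypothetical abbreviation dictionary; in practice the mapping would be built with domain experts:

```python
import re

# Hypothetical mapping of company-specific shorthand to full terms;
# a real dictionary would be compiled with help from domain experts.
ABBREVIATIONS = {
    "elec": "electronics",
    "qty": "quantity",
    "refurb": "refurbished",
}

def expand_description(text: str) -> str:
    """Replace known abbreviations with full words before modeling."""
    def substitute(match):
        word = match.group(0)
        return ABBREVIATIONS.get(word.lower(), word)
    return re.sub(r"[A-Za-z]+", substitute, text)

print(expand_description("refurb elec, qty 12"))
# refurbished electronics, quantity 12
```

Normalizing the text this way gives both traditional models and LLMs a fighting chance: the LLM finally sees vocabulary it was pre-trained on, and the traditional model sees fewer rare tokens.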

Case Study 2: Natural Language to Specialized Query Translation

In our second example, we’ll look at a company that developed an AI assistant to translate natural language queries into a domain-specific query language for an observability platform.

The initial approach involved prompt engineering with GPT-3.5, using a complex prompt structure.

Challenges:

  1. Expressing language nuances: The query language had many idioms and best practices that were difficult to capture in a prompt.
  2. Handling edge cases: As the system expanded, the number of edge cases grew, making the prompt unwieldy.
  3. Limited exposure: The base model had no prior exposure to this specialized query language.

While this approach showed initial promise, it quickly hit a plateau. The team realized that fine-tuning a smaller, specialized model might be more effective for this narrow, domain-specific task.

Key takeaway: For highly specialized domains, the nature of the data often necessitates moving beyond prompt engineering to fine-tuning domain-specific models.
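Moving from prompt engineering to fine-tuning mostly means turning those hard-won edge cases into training examples. The sketch below shows one common layout, JSONL prompt/completion pairs; the query language shown is invented for illustration and does not reflect the actual platform:

```python
import json

# Hypothetical pairs of natural-language questions and queries in a made-up
# observability query language; the JSONL prompt/completion layout mirrors
# the format many fine-tuning APIs accept.
examples = [
    {"prompt": "Show p95 latency for the checkout service over the last hour.",
     "completion": "VISUALIZE p95(duration_ms) WHERE service = 'checkout' SINCE 1h"},
    {"prompt": "Count errors grouped by endpoint.",
     "completion": "VISUALIZE count() WHERE status >= 500 GROUP BY endpoint"},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

print(sum(1 for _ in open("train.jsonl")))  # 2
```

The idioms and best practices that were awkward to express in a prompt become implicit in the training data itself, which is exactly why fine-tuning scales better for narrow domains.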

Lessons

The critical importance of understanding your data in AI projects cannot be overstated. As our case studies have illustrated, the nature of your data profoundly influences not just the choice of approach, but the entire development process and ultimate success of your project.

Key considerations for approaching AI projects include:

  1. Data Examination: Always start by thoroughly examining your raw data. This step is critical for understanding the challenges and opportunities your data presents.
  2. Start Simple: Begin with the simplest approach possible, regardless of data volume or complexity. For many text-based tasks, this often means starting with prompt engineering using base model LLMs.
  3. Evaluation First: Establish robust evaluation systems before choosing or fine-tuning a model. This includes unit tests, human evaluation, and automated evaluation using LLMs.
  4. Iterative Approach: Be prepared to iterate quickly based on your evaluation results. Your initial approach will likely need refinement as you learn more about how your model performs on real data.
  5. Domain Specificity: While highly domain-specific tasks might eventually require fine-tuning, still start with prompt engineering. Only move to more complex approaches if you hit a clear performance wall.
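The “Evaluation First” point can start as ordinary unit tests long before any LLM-based grading is in place. A minimal sketch, where `generate_query` is a placeholder for whatever model or prompt you are testing:

```python
# Assertion-based evaluation for a text-generation task.
# `generate_query` is a stand-in; a real implementation would call a model.
def generate_query(question: str) -> str:
    return "VISUALIZE count() GROUP BY endpoint"

def evaluate(cases):
    """Return the fraction of cases whose output contains all required tokens."""
    passed = 0
    for question, required_tokens in cases:
        output = generate_query(question)
        if all(tok in output for tok in required_tokens):
            passed += 1
    return passed / len(cases)

cases = [
    ("Count requests by endpoint.", ["count()", "GROUP BY"]),
    ("Show latency over time.", ["duration"]),
]
print(evaluate(cases))  # 0.5
```

A score like this is crude, but it gives every iteration, whether a prompt tweak or a fine-tune, a number to beat, which is the whole point of establishing evaluation before choosing a model.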

This methodology allows you to avoid common pitfalls such as premature optimization or over-reliance on complex models. Remember, even the most sophisticated AI models can’t compensate for fundamental issues in data understanding or evaluation processes.


Best Practices for Data Examination

Exploratory Data Analysis (EDA): Use statistical and visualization techniques to understand your data’s distribution, patterns, and anomalies. Tools like pandas and matplotlib in Python are invaluable for this.
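A first EDA pass with pandas might look like the following, using an illustrative shipment table (the column names are hypothetical):

```python
import pandas as pd

# Illustrative data; "description" and "declared_value" are hypothetical columns.
df = pd.DataFrame({
    "description": ["steel bolts", "copper wire", None, "steel bolts"],
    "declared_value": [120.0, 45.0, 300.0, 120.0],
})

# Summary statistics, missing values, and duplicates are a good first pass.
print(df["declared_value"].describe())  # count, mean, std, min, quartiles, max
print(df.isna().sum())                  # missing values per column
print(df.duplicated().sum())            # 1 fully duplicated row
```

From here, a histogram of `declared_value` with matplotlib would quickly reveal the kind of skew described in the logistics case study, where shippers systematically undervalued items.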

Domain Expert Consultation: Engage with subject matter experts to understand nuances in the data that might not be immediately apparent.

Sample Examination: Manually inspect a diverse set of samples from your dataset. This can reveal patterns or issues that aggregate statistics might miss.
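For sample examination, drawing a seeded random sample beats reading the first rows of a file, which are rarely representative:

```python
import random

# Sketch: pull a small random sample for manual review. A fixed seed makes
# the sample reproducible, so colleagues can inspect the same records.
dataset = [f"record {i}" for i in range(1000)]
random.seed(42)
sample = random.sample(dataset, 10)
for row in sample:
    print(row)
```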

Data Lineage Tracking: Understand and document the origins and transformations of your data. This can help identify potential biases or quality issues introduced during data collection or processing.


Conclusion

By adopting this approach, you’ll be better equipped to choose the right model, anticipate challenges, and ultimately build AI systems that truly solve real-world problems. You’ll also be able to justify every increase in complexity with tangible performance improvements.

In the rapidly evolving landscape of AI, your most valuable assets are not just your data or your models, but your processes for understanding data, evaluating performance, and making informed decisions about when and how to increase complexity. By focusing on these aspects, you can create AI solutions that are not only powerful, but also practical, maintainable, and truly aligned with your specific use case.

QUVO AI Blog © 2024