Fewer Hallucinations and Better UX with Claim References

Sep 6, 2023

When working with Large Language Models (LLMs), hallucinations are an open challenge. Hallucinations occur when a system produces misleading or false outputs during interaction with a user. We've been researching potential solutions and believe that incorporating claim references is a promising route. In this article, we'll explore several techniques that tap into the power of references to enhance the reliability of LLMs, along with the challenges these techniques bring.

Understanding the challenge: What’s the deal with hallucinations? 

  • They put the end-user at risk. Hallucinations shift the responsibility of discerning accuracy onto the end-user. This means that users, often without the necessary expertise or the time, have to determine if the information presented is correct or not. This could lead to the propagation of misinformation if unchecked.

  • The risk amplifies in high-stake scenarios. Consider a business setting where an LLM churns out a report or a claim. If that output is a result of a hallucination and goes unchecked, it could lead to reputational damages, legal issues, or financial losses.

  • They’re not your everyday bug. Unlike traditional software bugs, which might cause a system to crash or function improperly, hallucinations are more insidious. They present false information as fact. While a system crash is evident and prompts immediate action, a hallucination might go unnoticed. Detecting and validating them requires a level of expertise that's often beyond general user interaction.

How to mitigate these risks?

  • Improved User Experience and Debugging: GitHub Copilot is a good example of reducing risk through tooling. Engineers using Copilot have debuggers and Integrated Development Environments (IDEs) at their disposal to validate and correct any inaccuracies, thereby minimizing the risks involved.

  • Grounding Models: Grounding refers to anchoring the generated content to a factual basis or an external database. By doing this, the text generation process becomes more accurate and consistent. Grounding is especially crucial for specialized text generation use cases such as medical or legal advice, where inaccuracies can have significant implications (a minimal sketch of this idea appears after this list).

  • Claim References: While grounding offers a base layer of validation, referencing adds an extra layer of specific and verifiable credibility.
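
To make the grounding idea more concrete, here is a minimal sketch of retrieval-augmented generation in Python, where retrieved passages are placed in the prompt so the model answers from them rather than from memory. The `retrieve` and `call_llm` functions are hypothetical placeholders for your own retriever and LLM client, not a specific library's API.

```python
# Minimal grounding sketch: the model is asked to answer only from
# retrieved passages instead of from its own parametric memory.
# `retrieve` and `call_llm` are hypothetical placeholders.

def retrieve(query: str, k: int = 3) -> list[str]:
    """Return the k passages most relevant to the query (placeholder)."""
    raise NotImplementedError

def call_llm(prompt: str) -> str:
    """Send the prompt to whichever LLM client you use (placeholder)."""
    raise NotImplementedError

def grounded_answer(question: str) -> str:
    passages = retrieve(question)
    context = "\n\n".join(f"Passage {i + 1}:\n{p}" for i, p in enumerate(passages))
    prompt = (
        "Answer the question using ONLY the passages below. "
        "If they do not contain the answer, say you don't know.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
    return call_llm(prompt)
```

The key design point is that the factual basis travels inside the prompt, so the answer can later be checked against it.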

The Role of Referencing

References in articles reinforce credibility. Whether it's a scientific paper, news story, or blog post, they serve as proof, backing up the information presented. They allow readers, regardless of their expertise, to verify facts and figures from trusted sources and even delve deeper into the subject matter if possible. This boosts trust and encourages transparency.

Similarly, in Large Language Models (LLMs) and other machine learning systems, references play the same role: they act as a validation tool. When an LLM includes citations, they not only make the information easier to verify but also guide users to the original sources, enhancing confidence in both the AI and the content it produces.

Referencing techniques

  1. Procedure-injected References
    References are added to the end of the generated answer without LLM intervention (a minimal sketch appears after this list). This technique enhances user experience since the context is referenced and users can validate the output. However, it doesn’t reduce hallucinations, as it neither affects what the LLM generates nor specifies which parts of the text correspond to each reference.


  2. LLM-Generated References
    References are created as part of the LLM's output. This method works as a context-reinforcement mechanism, similar to Chain-of-Thought techniques: by requiring the model to state the source of each claim, the output is grounded much more tightly to the provided context, which reduces the risk of hallucinations while also giving users a better way to validate the generated output. However, achieving this is a more complex task, as it requires adjusting the LLM's input or its configuration.
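
The first technique is simple enough to show in a short sketch. The `SourceDoc` structure and function name below are illustrative assumptions, not any particular framework's API: references are attached by plain application code after the LLM has already produced its answer.

```python
from dataclasses import dataclass

@dataclass
class SourceDoc:
    """A retrieved context document (illustrative structure)."""
    title: str
    url: str

def inject_references(answer: str, sources: list[SourceDoc]) -> str:
    """Append a numbered reference list to an already-generated answer.

    The LLM is not involved: references are added by plain code, so the
    user can inspect the context, but no individual claim in the answer
    is tied to a specific source.
    """
    if not sources:
        return answer
    refs = "\n".join(f"[{i + 1}] {s.title} - {s.url}" for i, s in enumerate(sources))
    return f"{answer}\n\nReferences:\n{refs}"

# Hypothetical usage with an answer produced elsewhere:
print(inject_references(
    "The warranty covers manufacturing defects for two years.",
    [SourceDoc("Warranty policy", "https://example.com/warranty")],
))
```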


Approaches to LLM-generated referencing:

  • Rules in context to align referencing requirements (Prompt Engineering)
    This approach capitalizes on the baseline capabilities of LLMs. Several papers and blog posts have explored using prompts to get LLMs to generate quotations. By providing rules on which claims should be referenced and how those references should be formatted, we can ground the model to produce the output we expect (see the prompt sketch after this list).


  • Challenge: Proper context definition.
    Use cases may require specific ways of defining the context to ensure that the references appear in the expected format.

  • Model alignment through fine-tuning
    Fine-tuning a model on examples that show how specific inputs should be answered with references could lead it to generalize how claims and their references should be presented in question-answering use cases.

    We have identified several existing datasets that could be used to achieve this objective.

  • Challenge: Cost.
    Fine-tuning requires substantial computational resources that prompt engineering does not.
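
To illustrate the first approach in the list above (rules in context), here is a minimal sketch of a prompt that numbers the provided passages and instructs the model to cite them inline. The exact rule wording and the `call_llm` placeholder are assumptions; real use cases usually need iteration on both, which is precisely the context-definition challenge mentioned above.

```python
# Sketch of "rules in context": the prompt itself tells the model
# what to reference and how. `call_llm` is a placeholder for your
# own LLM client.

CITATION_RULES = (
    "Rules:\n"
    "1. Answer using ONLY the numbered passages below.\n"
    "2. After every claim, add the supporting passage number in brackets, e.g. [2].\n"
    "3. If no passage supports a claim, do not make the claim.\n"
)

def build_referencing_prompt(question: str, passages: list[str]) -> str:
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"{CITATION_RULES}\nPassages:\n{context}\n\nQuestion: {question}\nAnswer:"

def call_llm(prompt: str) -> str:
    """Placeholder for whichever LLM client you use."""
    raise NotImplementedError

# With a prompt like this, the expected answer looks like:
# "The warranty covers manufacturing defects for two years [1]."
prompt = build_referencing_prompt(
    "How long does the warranty last?",
    ["The warranty covers manufacturing defects for two years."],
)
```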

Overall challenges of LLM-generated references

  1. False negatives: Incomplete referencing
    The inherent limitations of LLMs mean that it's not guaranteed that every claim will be accurately referenced. 

    Even if we provide the proper context, the LLM uses it, and no non-existent claims or references are generated, the system may still fail to reference claims that should be referenced. Tuning the model and/or the provided context may help to reduce these false negatives.

  2. False positives: Hallucinations are still something to deal with
    Even though requiring references from the LLM helps to ground it and therefore reduce hallucinations, it is not guaranteed that they will be eliminated entirely.


    There is still a risk that both the generated content and its reference are unsupported by the provided context, in which case the output must be considered a hallucination. Further work is required to find mechanisms that mitigate hallucinations more thoroughly; a lightweight post-hoc check like the one sketched below can at least surface suspect outputs.
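
One pragmatic mitigation for both failure modes is a post-hoc audit of the generated text: flag sentences that carry no citation marker (possible false negatives) and citation markers that point to passages that were never provided (likely false positives). The `[n]` marker format below is an assumption carried over from the prompt sketch above; adapt it to your own citation format.

```python
import re

def audit_references(answer: str, num_passages: int) -> dict:
    """Flag unreferenced sentences and citations of non-existent passages.

    Assumes inline markers of the form [1], [2], ... tied to the
    numbered passages that were supplied in the prompt.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    unreferenced = [s for s in sentences if not re.search(r"\[\d+\]", s)]
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    invalid = sorted(c for c in cited if c < 1 or c > num_passages)
    return {"unreferenced_sentences": unreferenced, "invalid_citations": invalid}

# Example: "Shipping is free." carries no marker (possible false negative)
# and [3] cites a passage that was never provided (likely false positive).
report = audit_references(
    "The warranty lasts two years [1]. Refunds are always available [3]. "
    "Shipping is free.",
    num_passages=2,
)
print(report)
```

Such a check cannot prove that a referenced claim is actually supported by the cited passage, so it complements rather than replaces the grounding techniques above.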


Conclusions

Incorporating claim references in Large Language Models (LLMs) holds promise for enhancing content reliability and reducing errors. Referencing techniques such as procedure-injected and LLM-generated references offer an added layer of user verification, enhancing both user experience and credibility. Nevertheless, the intricate task of grounding and referencing in LLMs comes with challenges, including the computational cost of fine-tuning and the difficulty of ensuring accurate references for all claims. Despite progress in grounding efforts, an ongoing discussion about the robustness and dependability of these models is essential. Hallucination risks, particularly for users without specialized expertise, emphasize the need for continuous updates, rigorous testing, and complementary strategies to bolster reliability.

If you want to know more or collaborate with us, contact us!