• Try it out now: https://aditya-rag-app.streamlit.app/

    As LLMs have become more ubiquitous, tools to build with them have evolved as well. In fact, they’ve improved to the point that even apps like Perplexity no longer seem daunting to build at a smaller scale.

    To learn more about building with LLMs, I created a RAG (Retrieval-Augmented Generation) chatbot that integrates real-time web search information to provide up-to-date and relevant answers. This operates similarly to many popular large-scale LLM chat applications today.

    The first consideration for this project was which framework to use when working with LLMs. The two popular choices are LangChain and LlamaIndex. While LlamaIndex is well known for its RAG specialty, I chose LangChain for the flexibility – in workflows, data loaders, and more – that its modular design provides.

    After finishing an initial prototype, however, I ran into issues with LangChain. Primarily, I couldn’t enforce a strict workflow for the LLM system. For instance, system prompts would often fail to register, making the overall app unreliable. Moreover, many LangChain features were being deprecated in favor of LangGraph, LangChain’s new framework for complex and dynamic agentic workflows. This led me to rewrite the application to use LangGraph instead.

    Although LangGraph’s graph-based architecture posed a slight learning curve at first, it soon became apparent how much more powerful it was. While LangChain is sequential in nature – “chaining” together runnable components – LangGraph allows for non-linear workflows with conditional logic. This made the agentic approach to the system much more reliable and scalable. LangGraph’s memory persistence via checkpointers also made memory management much more straightforward, providing a useful abstraction layer compared to LangChain’s many redundant and incompatible memory types.
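    To illustrate the difference, here is a minimal pure-Python sketch of a graph-style workflow with a conditional edge. The node names, routing rule, and placeholder retrieve/generate steps are all hypothetical; real LangGraph code would use its StateGraph API rather than hand-rolled dictionaries.

```python
# Toy graph workflow with conditional routing, illustrating the idea
# behind LangGraph's non-linear graphs. All names here are hypothetical.

def retrieve(state):
    # Stand-in for a web-search step: attach "documents" to the state.
    state["documents"] = [f"result for: {state['question']}"]
    return state

def generate(state):
    # Stand-in for an LLM step: answer using any retrieved documents.
    docs = state.get("documents", [])
    state["answer"] = f"Answer based on {len(docs)} document(s)"
    return state

def route(state):
    # Conditional edge: only retrieve when the question needs fresh info.
    return "retrieve" if "latest" in state["question"] else "generate"

NODES = {"retrieve": retrieve, "generate": generate}
EDGES = {"retrieve": "generate", "generate": None}  # None = end of graph

def run(state):
    node = route(state)  # conditional entry point
    while node is not None:
        state = NODES[node](state)
        node = EDGES[node]
    return state

print(run({"question": "latest news on LLMs"})["answer"])
```

    The key point is that the path through the graph depends on the state itself – something a fixed sequential chain cannot express.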

    When it came to putting together a frontend for the application, Streamlit was my first choice. It was relatively easy to build, and deployment was a smooth process!

  • With the recent release of Genie 3, attention has once again turned to world models. These are generative models trained to simulate and model a (not necessarily real) world! Most impressively, these are real-time simulations that respond to input. World models learn the properties of the simulated world – such as forces, motion, and spatial and temporal relations – and apply them to simulate a coherent world.

    How they work

    A popular formulation – Ha and Schmidhuber’s World Models – breaks world models down into three components.

    The vision model component encodes each video frame into a lower-dimensional latent representation, giving a compressed description of the world. This component also ensures that the latent representation can be decoded back into the corresponding video frame.

    The memory model component uses RNNs (Recurrent Neural Networks) to process the temporal sequence of latent space representations, and predict the next one. Since RNNs are already commonly used for temporal predictions, they fit right into this use case! This component also includes the previous control action to make its prediction, which brings us to…

    The control component. This unit decides how the video output changes. While traditionally this was automated via neural networks, incorporating human input into this component allows a user to control the simulation. This is a simplified explanation, but it gives a good idea of how world models work.
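    The three components above can be sketched as a toy loop. Everything here – the mean/range “encoding”, the additive update rule, the threshold in the controller – is made up purely for illustration; real systems use a trained VAE for vision, an RNN for memory, and a learned (or human) controller.

```python
# Toy sketch of the three world-model components described above.

def vision_encode(frame):
    # Vision: compress a "frame" (list of pixel values) into a
    # low-dimensional latent - here just its mean and range.
    return (sum(frame) / len(frame), max(frame) - min(frame))

def memory_predict(latent, action):
    # Memory: predict the next latent from the current latent
    # and the most recent control action.
    mean, spread = latent
    return (mean + action, spread)

def control(latent):
    # Control: pick the next action from the current latent state
    # (a human input could be substituted here).
    mean, _ = latent
    return 1 if mean < 10 else -1

frame = [8, 9, 10, 11]
latent = vision_encode(frame)                  # frame -> latent
action = control(latent)                       # latent -> action
next_latent = memory_predict(latent, action)   # (latent, action) -> next latent
print(latent, action, next_latent)
```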

    Why world models?

    World models are highly valuable because they enable AI systems to simulate and understand complex, dynamic environments before acting in them. This capability drives applications across robotics, autonomous vehicles, and more. For example, robots can learn spatial awareness and plan multi-step tasks safely in simulations, reducing costly real-world trials. Autonomous vehicles use world models to train safely in diverse traffic, weather, and pedestrian scenarios that might be difficult to encounter consistently in reality. Beyond training, world models support better decision-making and safety by predicting future states and outcomes in real time. They also accelerate learning efficiency and task generalization, empowering AI to handle new and complex challenges with flexibility.

    Genie 3

    Google’s recent release of Genie 3 marks a significant step forward in world model capabilities. Genie 3 generates interactive, 3D virtual worlds in real-time, powered by a foundation model trained to simulate diverse environments accurately and responsively. What sets Genie 3 apart is its ability to maintain logical consistency and physical realism over extended interactions without relying on hard-coded physics engines. This results in an interaction horizon lasting several minutes! Through its short-term memory, it remembers past events and actions to sustain coherent experiences. Users can guide these worlds with text prompts or direct actions, creating a fluid, explorable simulation that blends imagination and grounded world understanding.

  • Have you seen this video?

    “This is gonna be scariest sound you’ll hear when they’re looking for you”

    “This is almost like two R2-D2’s having a conversation.”

    “Literally sounds like something out of a sci-fi horror with the AI looking for you hiding in the cupboard lol”

    What’s going on? Is Skynet upon us? What is Gibberlink mode?

    Gibberlink Mode

    Gibberlink mode is an AI communication protocol that allows two AI agents to communicate via a sound-based language optimized for inter-machine communications. Once two AI voice agents realize they’re in a conversation, they can switch to communicating via Gibberlink mode.

    This enables them to transmit and receive data via the GGWave protocol, which encodes digital data into bursts of sound. This compacts the data, making it an efficient communication method.

    Some mechanics the protocol uses include:

    • Frequency division multiplexing: Using multiple carrier frequencies at once, boosting throughput and signal integrity.
    • Packet structure: Data bursts form packets, containing a header, payload, and end marker.
    • Encryption: Though Gibberlink audio on its own can be decoded by any device that speaks the protocol, some promising results show that AI voice agents can learn to encrypt data via public keys and derived secret keys.
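    The packet structure described above can be sketched as simple byte framing. The header size, end marker, and layout here are hypothetical – the actual GGWave format differs – but the header/payload/end-marker idea is the same.

```python
# Hypothetical packet framing: a length header, the payload bytes,
# and an end marker. Illustrative only - not the real GGWave format.

END_MARKER = b"\xff\xff"

def encode_packet(payload: bytes) -> bytes:
    header = len(payload).to_bytes(2, "big")  # 2-byte length header
    return header + payload + END_MARKER

def decode_packet(packet: bytes) -> bytes:
    length = int.from_bytes(packet[:2], "big")
    payload = packet[2:2 + length]
    assert packet[2 + length:] == END_MARKER, "missing end marker"
    return payload

msg = b"hello agent"
assert decode_packet(encode_packet(msg)) == msg
```

    In the real protocol, the resulting bytes are then modulated onto multiple audio carrier frequencies at once (the frequency-division multiplexing mentioned above).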

    Why Gibberlink?

    Created by Anton Pidkuiko and Boris Starkov, and demonstrated via this viral video, Gibberlink won the global top prize at the ElevenLabs (a leading speech synthesis company) Worldwide Hackathon. Despite its playful appearance, the technology is seeing steady adoption among voice agents. Gibberlink’s upsides include:

    • Speed: Up to 80% faster than spoken language.
    • Efficiency: Reduced computational load, by representing data efficiently and avoiding full NLP for communication.
    • Security: Harder to intercept than spoken language.

  • Why Do LLMs Hallucinate?

    LLM hallucinations stem from several inherent factors tied to how these models are developed and operate:

    • Limitations in Training Data: LLMs learn from vast datasets, but these datasets can be incomplete, outdated, or biased. Missing information, errors in the data, or skewed representations can lead the model to generate inaccurate or misleading content.
    • Probabilistic Text Generation: LLMs generate text by predicting the most likely next word based on patterns learned during training. However, they do not possess true fact-checking capabilities. This probabilistic nature means they can produce plausible-sounding but incorrect information.
    • Ambiguous or Poorly Phrased Prompts: When user input is vague or unclear, the model struggles to interpret intent precisely. This uncertainty can cause it to fill gaps with invented or unrelated details, resulting in hallucinations.
    • Architectural and Optimization Factors: Certain design choices in model architecture and optimization techniques can impact how well the model balances creativity and accuracy, influencing hallucination rates.
    • Randomness in Generation Processes: Elements like temperature settings introduce randomness to encourage diverse outputs, but this can sometimes cause the model to produce unexpected or erroneous content.

    Approaches to Mitigate Hallucinations

    While hallucinations cannot be entirely eliminated, various strategies help reduce their frequency and impact:

    • Improving Training Data Quality: Curating high-quality, comprehensive, and up-to-date datasets helps models learn more accurate and relevant information.
    • Retrieval-Augmented Generation: Integrating external knowledge sources or real-time databases allows the model to ground its responses in verifiable facts, reducing fabrication.
    • Prompt Engineering: Crafting clear, specific, and well-structured prompts minimizes ambiguity and guides the model toward more accurate answers.
    • Post-Processing and Fact-Checking: Applying automated or human-in-the-loop verification processes after generation can identify and correct hallucinated content before it reaches users.

  • What Are LLM Hallucinations?

    When it comes to LLMs, “hallucinations” refer to instances where the model generates information that is inaccurate, irrelevant, or entirely fabricated. The term is metaphorical, borrowing from human experiences of perceiving things that aren’t real, to describe how an AI model can produce outputs that appear plausible and convincing but are fundamentally false or misleading. These hallucinations pose a significant challenge because they can undermine trust and limit the usefulness of LLMs in practical applications.

    Types of LLM Hallucinations

    Hallucinations in LLMs manifest in several distinct forms:

        Factual Inaccuracies: These occur when the model provides information that is simply wrong or misleading. Despite drawing from extensive training data, LLMs can mix facts incorrectly, invent dates, misattribute quotes, or otherwise present erroneous content as truth.

        Nonsensical Responses: Sometimes the output lacks logical coherence—sentences or paragraphs may be grammatically correct yet make no real sense, fail to connect ideas meaningfully, or veer into absurdity without clear reason.

        Contradictions: An LLM may produce conflicting statements either within a single response or between its response and the input prompt. This inconsistency can confuse users and reduce confidence in the model’s reasoning.

        Irrelevant or Off-Topic Content: The model might wander away from the subject at hand, introducing information or tangents that have little or no connection to the user’s query or the surrounding context. This distracts from the conversation’s purpose and reduces clarity.

  • Common Mistakes in Evaluation

    One frequent error is over-reliance on a single metric, which fails to capture the multidimensional nature of language tasks. Using outdated benchmarks can misrepresent modern model abilities or ignore emerging challenges. Data leakage—where test data overlaps with training data—can artificially inflate scores and mislead evaluations.

    Best Practices

    Combining human and automatic evaluation leverages the speed of algorithms and the insightfulness of human judgment. Regularly updating benchmarks ensures evaluations remain relevant amid rapid LLM advancements. Robustness testing against adversarial inputs and real-world scenarios helps assess how models perform outside controlled environments.

  • Benchmarks

    Benchmarks provide standardized datasets and tasks to compare model performance. Popular benchmarks for LLMs include GLUE and SuperGLUE for language understanding, SQuAD for question answering, as well as more specialized domain tests focused on coding ability or multilingual competence. These benchmarks help track progress and identify gaps across diverse challenges.

    Core Automatic Metrics

    Common quantitative metrics include perplexity, which measures how well a model predicts text; accuracy and F1 score for classification-type tasks; and BLEU and ROUGE for evaluating text similarity in translation or summarization tasks. These metrics offer objective, reproducible ways to gauge model capability on discrete aspects of language.
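    As a concrete example, perplexity can be computed from the probabilities a model assigns to each token of a held-out text: it is the exponential of the average negative log-probability. A model that assigns every token probability 1/N has perplexity exactly N, so lower is better.

```python
import math

def perplexity(token_probs):
    # Exponential of the average negative log-probability
    # over the tokens of a held-out sequence.
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Uniform guessing over 4 choices -> perplexity 4.0
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```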

    Limitations of Metrics

    While useful, automatic metrics have blind spots. They often miss subtleties like contextual appropriateness, creativity, or ethical risks. Some metrics can be gamed by models optimizing for score rather than quality, leading to misleading conclusions. Therefore, relying solely on metrics without complementary evaluation methods, such as human evaluation, can obscure a model’s true performance capabilities.

  • Evaluating large language models (LLMs) is crucial to ensure they deliver accurate, safe, and useful outputs. After all, without rigorous assessment, models may generate incorrect, biased, or harmful content that undermines trust and viability. Evaluation helps developers understand a model’s strengths and weaknesses, guide improvements, and ensure alignment with practical needs.

    How do we evaluate LLMs?

    There are two primary methods for evaluating LLMs: automatic metrics and human evaluation. Automatic metrics use algorithms to score outputs objectively, enabling rapid assessment at large scale. Human evaluation, on the other hand, involves subjective judgment by people who assess fluency, relevance, and appropriateness more holistically, capturing nuances that machines may miss.

    When to Use Which Approach

    The choice between automatic and human evaluation depends on factors like scale, cost, and task type. Automatic metrics are ideal for frequent, large-scale testing where efficiency is critical, but their insights are limited. Human evaluation is best for nuanced tasks, high-stakes applications, or final validation, despite being slower and more costly. Often, a blend of both provides the most reliable and actionable insights.

  • What is Temperature in LLMs?

    When it comes to LLMs, temperature is a key parameter that controls the randomness or creativity of the text the model generates. It acts like a dial that influences how adventurous or predictable the model’s word choices are when producing language, essentially shaping the style and variety of the output.

    How Temperature Works

    Temperature works by adjusting the probability distribution over possible next words before the model selects its output. Internally, the model computes logits—raw, unnormalized scores for each potential token. Temperature rescales these logits, effectively sharpening or flattening the probability distribution. A low temperature value (close to zero) makes the distribution peak sharply around the highest-probability words, making the output highly predictable and deterministic. Conversely, a higher temperature flattens the distribution, increasing the chances of selecting less likely tokens and thus boosting output diversity and creativity.
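    A minimal sketch of this rescaling, with made-up logit values: dividing the logits by the temperature before applying softmax sharpens the distribution when the temperature is low and flattens it when it is high.

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by T, then normalize with softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical raw token scores
cold = softmax_with_temperature(logits, 0.2)  # sharply peaked on top token
hot = softmax_with_temperature(logits, 2.0)   # much flatter
print([round(p, 3) for p in cold])
print([round(p, 3) for p in hot])
```

    At T = 0.2 the top token takes nearly all the probability mass; at T = 2.0 the lower-scored tokens get a realistic chance of being sampled.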

    Why Temperature Matters

    This parameter is crucial because it allows users to balance coherence and inventiveness based on their needs. For example, when factual accuracy and reliability are paramount—such as in technical writing or data-driven responses—a low temperature is preferred to generate consistent and precise text. On the other hand, creative tasks like storytelling, brainstorming, or poetry benefit from a higher temperature, which encourages novel and varied outputs. Adjusting temperature helps tailor the model’s behavior to different applications, making it a versatile tool in managing language generation.

    Practical Implications and Examples

    Typically, temperature values range from 0 to 1, though values above 1 are sometimes used for even more randomness. A common low setting might be around 0.2 to 0.5, yielding safe and focused text. Mid-range values around 0.7 strike a balance, offering natural yet slightly inventive language. At high values near 1 or beyond, outputs become more unpredictable and sometimes quirky, which can be exciting but may also introduce nonsensical or irrelevant content.

  • What is a Context Window?

    A context window is essentially the span or range of tokens—units of text like words, subwords, or punctuation—that an AI language model can consider or “remember” at one time. Think of it as the model’s working memory, the active portion of text it analyzes when processing input and generating outputs. This window frames what the model can directly reference or draw upon during a single interaction, allowing it to maintain continuity and relevance within that scope.

    How Context Windows Work

    Language models process language through tokens, which break down text into manageable pieces, ranging from full words to smaller segments depending on the model’s design. The context window defines how many of these tokens the model can use simultaneously. As the model receives input, it incorporates all tokens within this window to understand meaning, context, and relationships before predicting the next token or generating a response. If the input exceeds the window size, the model must truncate or forget earlier tokens, limiting how much information it can actively consider at once.
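    A simplified sketch of this truncation, using whitespace splitting as a stand-in for real tokenization and a made-up window size: once the input exceeds the window, only the most recent tokens are kept.

```python
WINDOW_SIZE = 8  # illustrative; real models use thousands of tokens

def fit_to_window(tokens, window_size=WINDOW_SIZE):
    # Keep only the most recent tokens that fit in the context window.
    if len(tokens) <= window_size:
        return tokens
    return tokens[-window_size:]  # the oldest tokens are dropped

tokens = "the quick brown fox jumps over the lazy sleeping dog".split()
print(fit_to_window(tokens))  # the first two words fall out of the window
```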

    Why Context Windows Matter in LLMs

    The size of the context window directly influences a large language model’s performance and usefulness. A larger window allows the model to consider more text—whether it’s a long conversation, a detailed document, or complex instructions—leading to better understanding, more coherent outputs, and fewer hallucinations or errors caused by missing context. It also enhances the model’s ability to handle nuanced or extended interactions without losing track of earlier details. Consequently, innovations that expand context window sizes are crucial for improving how LLMs interact naturally and effectively across various applications.