In the realm of Large Language Model (LLM) chatbots, two of the most persistent user experience disruptions relate to streaming of responses:
- Markdown rendering jank: Syntax fragments being rendered as raw text until they form a complete Markdown element. This results in a jarring visual experience.
- Response delay: The long time it takes to formulate a response by making multiple LLM roundtrips while consulting external data sources. This results in the user waiting for an answer while staring at a spinner.
Here’s a dramatic demonstration of both problems at the same time:
For Sidekick, we've developed a solution that addresses both problems: A buffering Markdown parser and an event emitter. We multiplex multiple streams and events into one stream that renders piece-by-piece. This approach allows us to prevent Markdown rendering jank while streaming the LLM response immediately as additional content is resolved and merged into the stream asynchronously.
In this post, we'll dive into the details of our approach, aiming to inspire other developers to enhance their own AI chatbot interactions. Let's get started.
Selective Markdown buffering
Streaming poses a challenge to rendering Markdown. Character sequences for certain Markdown expressions remain ambiguous until a sequence marking the end of the expression is encountered. For example:
- Emphasis (strong) versus unordered list item: A "*" character at the beginning of a line could be either. Until either the closing "*" character is encountered (emphasis), or an immediately following whitespace character is encountered (list item start), it remains ambiguous whether this "*" will end up being rendered as emphasis or as a list item.
- Links: Until the closing parenthesis in a "[link text](link URL)" is encountered, an <a> HTML element cannot be rendered since the full URL is not yet known.
We solve this problem by buffering characters whenever we encounter a sequence that is a candidate for a Markdown expression and flushing the buffer when either:
- The parser encounters an unexpected character: We flush the buffer and render the entire sequence as raw text, treating the putative Markdown syntax as a false-positive.
- The full Markdown element is complete: We render the buffer content as a single Markdown element sequence.
Doing this while streaming requires the use of a stateful stream processor that can consume characters one-by-one. The stream processor either passes through the characters as they come in, or it updates the buffer as it encounters Markdown-like character sequences.
We use a Node.js Transform stream to perform this stateful processing. The transform stream runs a finite state machine (FSM) that is fed the individual characters of each stream chunk piped into it. Note that these are characters, not bytes: to iterate over the Unicode characters in a stream chunk, use an iterator (e.g. for..of over the chunk string). Also, if the chunks come from an LLM, you can rely on them being split at Unicode character boundaries.
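To make the "characters, not bytes" distinction concrete, here is a small sketch. The helper name `codePoints` is made up for this illustration. It shows that for..of walks a string by Unicode code point, while `length` and index access count UTF-16 code units.

```typescript
// Hypothetical helper: split a chunk into Unicode code points using for..of.
function codePoints(chunk: string): string[] {
  const out: string[] = [];
  for (const ch of chunk) out.push(ch); // one iteration per code point
  return out;
}

const sample = "a😀b";
console.log(sample.length);             // 4: the emoji occupies two UTF-16 code units
console.log(codePoints(sample).length); // 3: the emoji counts as one character
```

Feeding the state machine from a for..of loop therefore keeps characters outside the Basic Multilingual Plane intact, instead of splitting them into two halves.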
Here’s a reference TypeScript implementation that handles Markdown links:
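What follows is a minimal sketch written for illustration, not necessarily the exact Sidekick code. The names `MarkdownLink`, `LinkBufferingFsm`, `createLinkBufferingStream` and the `renderLink` callback are assumptions. It covers only inline links of the form [text](url), and it treats a newline or a nested "[" in the link text, and any whitespace in the URL, as a false positive.

```typescript
import { Transform } from "node:stream";

// A completed Markdown link handed to the caller (hypothetical shape).
interface MarkdownLink {
  text: string;
  url: string;
}

type State = "text" | "linkText" | "afterLinkText" | "linkUrl";

// Character-level FSM: plain text passes straight through, while anything
// that might become a [text](url) link is held back until it resolves.
class LinkBufferingFsm {
  private state: State = "text";
  private buffer = ""; // raw characters held back while the outcome is ambiguous
  private text = "";
  private url = "";
  private renderLink: (link: MarkdownLink) => string;

  constructor(renderLink: (link: MarkdownLink) => string) {
    this.renderLink = renderLink;
  }

  // Consume one character; return whatever is now safe to emit (possibly "").
  push(ch: string): string {
    if (this.state === "text") {
      if (ch !== "[") return ch; // not a link candidate: pass through
      this.state = "linkText";
      this.buffer = ch;
      return "";
    }
    if (this.state === "linkText") {
      if (ch === "[" || ch === "\n") return this.abort(ch);
      this.buffer += ch;
      if (ch === "]") this.state = "afterLinkText";
      else this.text += ch;
      return "";
    }
    if (this.state === "afterLinkText") {
      if (ch !== "(") return this.abort(ch);
      this.buffer += ch;
      this.state = "linkUrl";
      return "";
    }
    // Remaining state: "linkUrl".
    if (ch === ")") {
      const rendered = this.renderLink({ text: this.text, url: this.url });
      this.reset();
      return rendered; // the whole link is emitted as a single unit
    }
    if (/\s/.test(ch)) return this.abort(ch);
    this.buffer += ch;
    this.url += ch;
    return "";
  }

  // End of input: whatever is still buffered never completed, so emit it raw.
  flush(): string {
    const raw = this.buffer;
    this.reset();
    return raw;
  }

  // False positive: emit the buffer as raw text, then reprocess `ch` normally.
  private abort(ch: string): string {
    const raw = this.flush();
    return raw + this.push(ch);
  }

  private reset(): void {
    this.state = "text";
    this.buffer = "";
    this.text = "";
    this.url = "";
  }
}

// Wrap the FSM in a Node.js Transform stream.
function createLinkBufferingStream(
  renderLink: (link: MarkdownLink) => string,
): Transform {
  const fsm = new LinkBufferingFsm(renderLink);
  return new Transform({
    transform(chunk, _encoding, done) {
      let out = "";
      // Assumes chunks are split at Unicode character boundaries.
      for (const ch of String(chunk)) out += fsm.push(ch);
      if (out) this.push(out);
      done();
    },
    flush(done) {
      const rest = fsm.flush();
      if (rest) this.push(rest);
      done();
    },
  });
}
```

Under these assumptions, a response stream could be piped through `createLinkBufferingStream((link) => ...)` before it reaches the client. The callback decides what is emitted in place of each completed link.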
You can add support for additional Markdown elements by extending the state machine. Implementing support for the entire Markdown specification with a manually crafted state machine would be a huge undertaking, which would perhaps be better served by employing an off-the-shelf parser generator that supports push lexing/parsing.
Async content resolution and multiplexing
LLMs have a good grasp of general human language and culture, but they’re not a great source of up-to-date, accurate information. We therefore instruct LLMs to signal, through the use of tools, when they need information beyond their grasp.
The typical tool integration goes:
- Receive user input.
- Ask the LLM to consult one or more tools that perform operations.
- Receive tool responses.
- Ask the LLM to assemble the tool responses into a final answer.
The user waits for all steps to complete before seeing a response:
We’ve made a tweak to break the tool invocation and output generation out of the main LLM response, to let the initial LLM roundtrip directly respond to the user, with placeholders that get asynchronously populated:
Since the response is no longer a string that can be directly rendered by the UI, the presentation requires orchestration with the UI. We could handle this in two steps. First, we could perform the initial LLM roundtrip, and then we could let the UI make additional requests to the backend to populate the tool content. However, we can do better! We can multiplex asynchronously-resolved tool content into the main response stream:
The UI is responsible for splitting (demultiplexing) this multiplexed response into its components: First the UI renders the main LLM response directly to the user as it is streamed from the server. Then the UI renders any asynchronously resolved tool content into the placeholder area.
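As a rough sketch of the demultiplexing step, the UI can keep one slot for the main text and one slot per placeholder, and route each incoming event to the right slot. The event shape (`kind`, `delta`, `id`, `content`) and the class name `ResponseView` are assumptions made for this example.

```typescript
// Hypothetical event shapes for the multiplexed response.
type StreamEvent =
  | { kind: "text"; delta: string }               // next piece of the main LLM response
  | { kind: "card"; id: string; content: string }; // resolved content for one placeholder

// Minimal view model: the UI would re-render from this state after every event.
class ResponseView {
  mainText = "";
  cards = new Map<string, string>();

  apply(event: StreamEvent): void {
    if (event.kind === "text") {
      this.mainText += event.delta; // main response renders as it streams in
    } else {
      this.cards.set(event.id, event.content); // fills the matching placeholder
    }
  }
}
```

Because every event carries its own routing information, tool content can arrive in any order relative to the main text without the UI having to make extra requests.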
This would render on the UI as follows:
This approach lends itself to user requests with multiple intents. For example:
To multiplex multiple response streams into one, we use Server-Sent Events, treating each stream as a series of named events.
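A minimal sketch of that idea follows, assuming a `write` callback that appends text to the HTTP response. The class name `SseMultiplexer` and the event names are illustrative. Only the `event:` / `data:` framing, with a blank line terminating each event, comes from the Server-Sent Events format itself.

```typescript
// Serialize one named event in the Server-Sent Events wire format.
function formatSseEvent(event: string, data: string): string {
  // Each line of a multi-line payload needs its own "data:" field.
  const dataLines = data
    .split("\n")
    .map((line) => `data: ${line}`)
    .join("\n");
  return `event: ${event}\n${dataLines}\n\n`;
}

// Multiplexes the main response and late-arriving content into one stream.
class SseMultiplexer {
  private pending: Promise<void>[] = [];
  private write: (frame: string) => void;

  constructor(write: (frame: string) => void) {
    this.write = write;
  }

  // Emit an event right away (e.g. a chunk of the main LLM response).
  emit(event: string, data: string): void {
    this.write(formatSseEvent(event, data));
  }

  // Register content that resolves later; it is emitted whenever it is ready.
  defer(event: string, content: Promise<string>): void {
    this.pending.push(
      content.then(
        (data) => this.emit(event, data),
        (err) => this.emit(`${event}-error`, String(err)), // hypothetical error event
      ),
    );
  }

  // Wait for all deferred content, then signal the end of the response.
  async close(): Promise<void> {
    await Promise.all(this.pending);
    this.emit("done", "");
  }
}
```

In an HTTP handler, `write` would typically forward to the response object after setting the `Content-Type: text/event-stream` header. In the browser, each named event can be picked up with `EventSource.addEventListener`.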
Tying things together
Asynchronous multiplexing serendipitously ties back to the Markdown buffering we mentioned earlier. In our prompt, we tell the LLM to use special Markdown links whenever it wants to insert content that will get resolved asynchronously. Instead of “tools”, we call these “cards” because we tell the LLM to adjust its wording to the way the whole response will be presented to the user. In the “tool” world, tools are not touch points that a user is ever made aware of. In our case, we coordinate how content is rendered in the UI with what the LLM writes, so the LLM produces presentation-centric output in presentation language.
The special card links are links that use the “card:” protocol in their URLs. The link text is a terse version of the original user intent that is paraphrased by the LLM. For example, for this user input:
> How can I configure X?
The LLM output might look something like this:
Remember that we have a Markdown buffering parser that the main LLM output is piped to. Since these card links are Markdown, they get buffered and parsed by our Markdown parser. The parser calls a callback whenever it encounters a link. We check to see if this is a card link and fire off an asynchronous card resolution task. The main LLM response gets multiplexed along with any card content, and the UI receives all of this content as part of a single streamed response. We catch two birds with one net: Instead of having an additional stream parser sitting on top of the LLM response stream to extract some “tool invocation” syntax, we piggyback on the existing Markdown parser.
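The link callback described above could look roughly like the sketch below. Everything named here is an assumption for illustration: `CardDispatcher`, the `CardResolver` signature, the `card-N` identifiers, and the `{{card-N}}` placeholder syntax. The only detail taken from the text is that card links are recognized by the "card:" protocol in their URL.

```typescript
interface MarkdownLink {
  text: string;
  url: string;
}

// Hypothetical resolver: fetches or computes the content for one card.
type CardResolver = (target: string, intent: string) => Promise<string>;

class CardDispatcher {
  // Card id -> in-flight resolution, to be multiplexed into the response stream.
  readonly tasks = new Map<string, Promise<string>>();
  private nextId = 0;
  private resolveCard: CardResolver;

  constructor(resolveCard: CardResolver) {
    this.resolveCard = resolveCard;
  }

  // Link callback for the Markdown parser: returns the text to emit for this link.
  onLink(link: MarkdownLink): string {
    if (!link.url.startsWith("card:")) {
      return `[${link.text}](${link.url})`; // ordinary link: leave it untouched
    }
    const id = `card-${this.nextId++}`;
    // Start resolving now; the main response keeps streaming in the meantime.
    this.tasks.set(id, this.resolveCard(link.url.slice("card:".length), link.text));
    return `{{${id}}}`; // hypothetical placeholder the UI swaps for card content
  }
}
```

Each entry in `tasks` can then be handed to the multiplexer, so that card content is streamed under its identifier as soon as it resolves.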
The content for some cards can be resolved entirely at the backend, so only their final content arrives in the UI. The content for other cards is resolved into an intermediate presentation that the UI processes and renders (e.g. by making an additional request to a service). Either way, we stream everything as it is produced, and the user always has feedback that content is being generated.
As a means of transporting structure, Markdown takes fewer tokens than JSON or YAML, and it’s human-readable. We stick to Markdown as a narrow waist both for the backend-to-frontend transport (and rendering) and for LLM-to-backend invocations.
Buffering and joining stream chunks also enables alteration of Markdown before sending it to the frontend. (In our case we replace Markdown links with a card content identifier that corresponds to the card content that gets multiplexed into the response stream.)
Buffering and joining Markdown unlocks UX benefits, and it’s relatively easy to implement using an FSM.