
xai-docs/latest/content · Jun 27, 00:17 UTC

pages/developers/advanced-api-usage/context-compaction.md

MD·12 KB·271 lines

Advanced API Usage

Context Compaction

Every follow-up call resends all prior messages and pays input tokens for each of them, which gets expensive once a conversation grows past a few thousand tokens. Context compaction shrinks those messages into a single opaque item that preserves the salient state (system prompts, attached files, prior reasoning, and a compacted record of the turns) while dropping verbose tool output and back-and-forth.

You then pass that compaction item back into your next request verbatim, and the model continues the conversation as if the full history were still there.

  • Lower input cost — the next call only pays for the compacted context, not the original messages.
  • Lower latency — smaller payloads mean faster time-to-first-token.
  • Sharper responses — a tighter context keeps the model focused on the current task instead of getting distracted by stale tool output and old turns.
  • Longer conversations — keep multi-hour agent loops well under the model's context window.

[!NOTE]

Treat encrypted_content as opaque — do not parse or modify it. You can store the blob in your own database and pass it back unchanged in later requests; it is only meaningful when sent back to xAI's API.
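
One way to persist the compaction item is to serialize it as JSON and store it untouched. The sketch below uses SQLite from the Python standard library. The table layout and the open_store, save_compaction, and load_compaction helpers are hypothetical and not part of any xAI SDK.

```python
import json
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) a small SQLite store for compaction items."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS compactions ("
        "conversation_id TEXT PRIMARY KEY, item_json TEXT NOT NULL)"
    )
    return conn

def save_compaction(conn, conversation_id, item):
    """Store the compaction item exactly as received; never edit its fields."""
    conn.execute(
        "INSERT OR REPLACE INTO compactions VALUES (?, ?)",
        (conversation_id, json.dumps(item)),
    )
    conn.commit()

def load_compaction(conn, conversation_id):
    """Return the stored item, or None if the conversation has none."""
    row = conn.execute(
        "SELECT item_json FROM compactions WHERE conversation_id = ?",
        (conversation_id,),
    ).fetchone()
    return json.loads(row[0]) if row else None

# Round trip with a placeholder item shaped like the API's compaction item.
conn = open_store()
item = {"type": "compaction", "id": "cmp_abc123", "encrypted_content": "<opaque blob>"}
save_compaction(conn, "conv-1", item)
restored = load_compaction(conn, "conv-1")
```

The restored dictionary can go straight into the input array of a later request, since nothing in the round trip touches encrypted_content.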

When to compact

Compact when all of the following are true:

  • The conversation has grown large enough that input_tokens on each call is hurting cost or latency.
  • You still want the model to remember prior turns (otherwise just start a new conversation).
  • The current window still fits within the model's context limit (compaction shrinks the conversation — it cannot rescue a request that is already over the limit).

A typical pattern is to call the Compaction API every N turns inside an agent loop, or once whenever your bookkeeping shows the rendered context above a threshold you've chosen for your workload.
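
The conditions and triggers above can be combined into a small policy function. This is a sketch only: should_compact and its parameters are hypothetical names, and input_tokens is assumed to come from your own bookkeeping (for example, the usage reported on the previous response).

```python
def should_compact(turn, input_tokens, *, every_n_turns=None,
                   token_threshold=None, context_limit):
    """Decide whether to call the Compaction API before the next request.

    turn            -- 1-based count of completed turns
    input_tokens    -- current size of the rendered context, in tokens
    every_n_turns   -- compact on every Nth turn (optional trigger)
    token_threshold -- compact once the context exceeds this size (optional trigger)
    context_limit   -- the model's context window; compaction cannot help past it
    """
    if input_tokens > context_limit:
        # Already over the limit: prune or split first, compaction will not rescue it.
        return False
    on_schedule = every_n_turns is not None and turn % every_n_turns == 0
    over_threshold = token_threshold is not None and input_tokens > token_threshold
    return on_schedule or over_threshold
```

Either trigger can be left as None, so the same function covers a fixed "every N turns" schedule, a pure token threshold, or both together.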

Compaction API

Send the conversation you want to compact. The response contains a single compaction item that stands in for the entire prior conversation — you can safely drop the original messages from your client-side state, use the compaction item as the head of your next request, and append your new user turn after it.

curl:

# Step 1 — compact the long conversation
curl -s https://api.x.ai/v1/responses/compact \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $XAI_API_KEY" \
  -d '{
    "model": "grok-4.3",
    "input": [
      {"role": "system", "content": "You are a concise and knowledgeable science tutor."},
      {"role": "user", "content": "What is the Higgs boson and why is it important?"},
      {"role": "assistant", "content": "The Higgs boson is an elementary particle..."},
      {"role": "user", "content": "How does the Higgs mechanism actually work?"},
      {"role": "assistant", "content": "The Higgs mechanism works through spontaneous symmetry breaking..."}
    ]
  }'

# Step 2 — continue the conversation using the compacted output
curl -s https://api.x.ai/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $XAI_API_KEY" \
  -d '{
    "model": "grok-4.3",
    "input": [
      {
        "type": "compaction",
        "id": "cmp_abc123",
        "encrypted_content": "<paste encrypted_content from step 1>"
      },
      {"role": "user", "content": "Based on our earlier conversation, what gives particles their mass?"}
    ]
  }'

Python (xAI SDK):

import os
from xai_sdk import Client
from xai_sdk.chat import system, user

client = Client(api_key=os.environ["XAI_API_KEY"])

# Build up a chat normally — system prompt plus a few user/assistant turns.
# use_encrypted_content=True is recommended for reasoning models so the model's
# reasoning content from prior turns is preserved through the compaction.
chat = client.chat.create(model="grok-4.3", use_encrypted_content=True)
chat.append(system("You are a concise and knowledgeable science tutor."))

chat.append(user("What is the Higgs boson and why is it important?"))
chat.append(chat.sample())

chat.append(user("How does the Higgs mechanism actually work?"))
chat.append(chat.sample())

# ... many more turns ...

# Step 1 — compact the conversation. Pass the chat's accumulated messages
# straight into compact_context.
compact = client.chat.compact_context(
    model="grok-4.3",
    messages=chat.messages,
)
print(f"Compaction ID:    {compact.id}")
print(f"Dropped messages: {compact.dropped_message_count}")
print(f"Tokens used:      {compact.usage.total_tokens}")

# Step 2 — continue the conversation. chat.append(compact) clears the
# in-memory message list on the chat object and seeds it with just the
# compaction blob, so subsequent chat.sample() calls run on top of the
# compacted context instead of replaying the full prior history.
chat.append(compact)
chat.append(user("Based on our earlier conversation, what gives particles their mass?"))
print(chat.sample().content)

Python (OpenAI SDK):

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["XAI_API_KEY"],
    base_url="https://api.x.ai/v1",
)

# Step 1 — compact the long conversation
compacted = client.responses.compact(
    model="grok-4.3",
    input=[
        {"role": "system", "content": "You are a concise and knowledgeable science tutor."},
        {"role": "user", "content": "What is the Higgs boson and why is it important?"},
        {"role": "assistant", "content": "The Higgs boson is an elementary particle..."},
        {"role": "user", "content": "How does the Higgs mechanism actually work?"},
        {"role": "assistant", "content": "The Higgs mechanism works through spontaneous symmetry breaking..."},
    ],
)

print(f"Compaction ID:    {compacted.id}")
print(f"Dropped messages: {compacted.usage.dropped_message_count}")
print(f"Output tokens:    {compacted.usage.output_tokens}")

# Step 2 — continue the conversation. Spread compacted.output into the next input.
followup = client.responses.create(
    model="grok-4.3",
    input=[
        *compacted.output,  # use the compaction item verbatim — do not modify
        {"role": "user", "content": "Based on our earlier conversation, what gives particles their mass?"},
    ],
)

print(followup.output_text)

JavaScript (OpenAI SDK):

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.XAI_API_KEY,
  baseURL: "https://api.x.ai/v1",
});

// Step 1 — compact the long conversation
const compacted = await client.responses.compact({
  model: "grok-4.3",
  input: [
    { role: "system", content: "You are a concise and knowledgeable science tutor." },
    { role: "user", content: "What is the Higgs boson and why is it important?" },
    { role: "assistant", content: "The Higgs boson is an elementary particle..." },
    { role: "user", content: "How does the Higgs mechanism actually work?" },
    { role: "assistant", content: "The Higgs mechanism works through spontaneous symmetry breaking..." },
  ],
});

console.log(`Compaction ID:    ${compacted.id}`);
console.log(`Dropped messages: ${compacted.usage.dropped_message_count}`);
console.log(`Output tokens:    ${compacted.usage.output_tokens}`);

// Step 2 — continue the conversation. Spread compacted.output into the next input.
const followup = await client.responses.create({
  model: "grok-4.3",
  input: [
    ...compacted.output, // use the compaction item verbatim — do not modify
    { role: "user", content: "Based on our earlier conversation, what gives particles their mass?" },
  ],
});

console.log(followup.output_text);

The xAI SDK also exposes an AsyncClient with await client.chat.compact_context(...) and await chat.sample() for the same flow under asyncio.

Response shape

The REST endpoint (POST /v1/responses/compact) returns an OpenAI-compatible compaction object:

{
  "id": "cmp_01HZ9P0V8M2YQK3F7C4G6N5R2A",
  "object": "response.compaction",
  "created_at": 1748895600,
  "model": "grok-4.3",
  "output": [
    {
      "type": "compaction",
      "id": "cmp_01HZ9P0V8M2YQK3F7C4G6N5R2A",
      "encrypted_content": "<opaque blob>"
    }
  ],
  "usage": {
    "input_tokens": 12000,
    "input_tokens_details": { "cached_tokens": 0 },
    "output_tokens": 800,
    "output_tokens_details": { "reasoning_tokens": 240 },
    "total_tokens": 12800,
    "dropped_message_count": 45
  }
}

| Field | Description |
| --- | --- |
| id | Stable ID for this compaction (cmp_<uuid>). Also echoed on the inner compaction item. |
| object | Always "response.compaction". |
| output[].encrypted_content | Opaque blob containing the compacted conversation. |
| output[].type | Always "compaction". |
| output | An array containing a single compaction item. Pass it verbatim into your next request. |
| usage.dropped_message_count | Number of input messages folded into the compaction. |
| usage.input_tokens | Tokens in the pre-compaction conversation. |
| usage.output_tokens | Tokens generated for the compacted record. The blob the model rehydrates on the next call is roughly your preserved system prompt(s) plus this many tokens. |
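
As a worked example, the sample usage block above can be turned into a rough savings estimate. The sketch assumes, following the usage.output_tokens description, that the compacted context costs about the preserved system prompt plus output_tokens on each later call. The names estimate_savings and system_prompt_tokens are hypothetical, and the 50-token system prompt is an illustrative guess. It counts tokens rather than money, so it ignores any price difference between input and output tokens.

```python
def estimate_savings(usage, system_prompt_tokens=0):
    """Estimate per-call input-token savings from a compaction usage block."""
    before = usage["input_tokens"]
    after = system_prompt_tokens + usage["output_tokens"]  # approximate compacted size
    saved_per_call = before - after
    # The compaction call itself consumed usage["total_tokens"]; it pays for
    # itself once cumulative savings reach that amount (ceiling division).
    if saved_per_call > 0:
        calls_to_break_even = -(-usage["total_tokens"] // saved_per_call)
    else:
        calls_to_break_even = None  # compaction did not shrink the context
    return {
        "compacted_tokens": after,
        "saved_per_call": saved_per_call,
        "calls_to_break_even": calls_to_break_even,
    }

# Numbers taken from the sample response above.
usage = {"input_tokens": 12000, "output_tokens": 800, "total_tokens": 12800}
estimate = estimate_savings(usage, system_prompt_tokens=50)
```

With these numbers, each follow-up call carries roughly 850 tokens of history instead of 12,000, and in token terms the compaction call is recouped after two follow-up calls.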

[!WARNING]

Do not prune the compaction output. Treat the returned compaction item as the new "start" of the conversation — append new user turns after it, never before. Removing or reordering items inside the compacted output breaks the chain.
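
The ordering rule can be enforced with a small helper that builds the next request's input from the parsed JSON of a compaction response. This is a sketch: build_followup_input is a hypothetical name, and the response dictionary is assumed to follow the shape shown above.

```python
def build_followup_input(compact_response, new_messages):
    """Return the next request's input: compaction output first, new turns after."""
    output = compact_response["output"]
    for item in output:
        if item.get("type") != "compaction":
            raise ValueError(
                f"unexpected item type in compaction output: {item.get('type')!r}"
            )
    # Build a new list but leave every item untouched; the blob goes back verbatim.
    return [*output, *new_messages]

# Placeholder response shaped like the documented compaction object.
response = {
    "id": "cmp_abc123",
    "object": "response.compaction",
    "output": [
        {"type": "compaction", "id": "cmp_abc123", "encrypted_content": "<opaque blob>"}
    ],
}
next_input = build_followup_input(
    response,
    [{"role": "user", "content": "What gives particles their mass?"}],
)
```

Because the helper only concatenates lists, the compaction item in next_input is the same object that came back from the API, with no copying or re-serialization in between.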

In-place compaction in the xAI SDK

For long-running agent loops, the xAI SDK has a convenience method on a live Chat object: chat.compact() runs compaction against the chat's current messages and replaces them in-place with the compaction item. You can keep calling chat.sample() afterwards exactly as before — the server will rehydrate the compacted prefix on the next request.

import os
from xai_sdk import Client
from xai_sdk.chat import system, user

client = Client(api_key=os.environ["XAI_API_KEY"])

# use_encrypted_content=True preserves the model's reasoning content across
# turns, recommended when using reasoning models.
chat = client.chat.create(model="grok-4.3", use_encrypted_content=True)
chat.append(system("You are a helpful assistant. Keep answers brief."))

compact_every = 5
for turn in range(1, 100):
    chat.append(user(input("You: ")))
    response = chat.sample()
    print(f"Grok: {response.content}")
    chat.append(response)

    if turn % compact_every == 0:
        before = len(chat.messages)
        compact = chat.compact()
        print(
            f"[compacted {before} → {len(chat.messages)} messages | "
            f"dropped {compact.dropped_message_count} | "
            f"tokens used: {compact.usage.total_tokens}]"
        )

The same method is available on AsyncClient as await chat.compact().

Limits and gotchas

  • The conversation you compact must already fit in context. Compaction shrinks the conversation; it does not rescue an over-limit request. If your requests already fail with context_length_exceeded, prune or split the conversation before calling compact.
  • At most one compaction per call. The endpoint does one compaction pass per request.
  • encrypted_content is opaque. Do not parse, edit, or hand-merge multiple blobs. Always pass the full output array (or CompactContextResponse) back verbatim.
  • Re-compacting is fine. You can compact an already-compacted conversation again later — for example, when the conversation grows long after the previous compaction.
  • Token usage on the compaction call. The compaction itself uses tokens (visible in usage.input_tokens / usage.output_tokens). Pick a smaller / faster model for compaction if you are doing it frequently.
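
For the first point in the list above, one possible way to prune is to drop the oldest non-system messages until the conversation fits. In this sketch prune_to_fit and count_tokens are hypothetical, and the four-characters-per-token estimate is a crude stand-in for a real tokenizer. It also assumes plain text messages; conversations with tool calls would need paired calls and results dropped together.

```python
def count_tokens(message):
    """Very rough token estimate: about 4 characters per token (assumption)."""
    return max(1, len(message["content"]) // 4)

def prune_to_fit(messages, context_limit, count=count_tokens):
    """Drop the oldest non-system messages until the total fits context_limit."""
    kept = list(messages)  # work on a copy; the caller's list is not modified
    total = sum(count(m) for m in kept)
    i = 0
    while total > context_limit and i < len(kept):
        if kept[i]["role"] == "system":
            i += 1  # keep system prompts; look at the next-oldest message
            continue
        total -= count(kept[i])
        del kept[i]
    return kept

messages = [
    {"role": "system", "content": "x" * 40},      # about 10 tokens
    {"role": "user", "content": "a" * 400},       # about 100 tokens
    {"role": "assistant", "content": "b" * 400},  # about 100 tokens
    {"role": "user", "content": "c" * 40},        # about 10 tokens
]
pruned = prune_to_fit(messages, context_limit=130)
```

Here only the oldest user message is dropped, which brings the estimate from about 220 tokens down to about 120. The pruned list can then be sent to the Compaction API.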

Related

  • Generate Text — Responses API — the primary endpoint that compaction feeds into.
  • Prompt Caching — a complementary cost-reduction lever for unchanged prompt prefixes.
  • Chat API Reference — full request/response schema for the Compaction API.