# Context Compaction: The Art of Summarizing a 3-Hour Coding Session for Your LLM

Your AI pair programmer has a 200K token limit. After 3 hours, you're at 150K. What do you do?
## The Token Limit Problem
You're pair programming with an AI. Three hours in:
- 47 user messages
- 47 assistant responses
- 94 tool calls
- 94 tool results
**Total:** ~150K tokens. **Context window:** 200K tokens.

You have 50K tokens left. At this rate, you'll hit the limit in an hour.
Options:

- **Start a new session** — lose all context
- **Truncate old messages** — lose potentially important details
- **Summarize with the LLM** — expensive and slow
- **Context compaction** — smart compression
## What Is Context Compaction?
Context compaction intelligently reduces token usage while preserving semantic meaning:
**Before:** 847 messages, ~142K tokens
**After:** 23 messages, ~3K tokens

```
[Context Summary — compacted from 847 messages]

## Original Request
User asked to generate an orbital schema for a task management app...

## Actions Taken
1. Created schema `taskly.orb` with 3 orbitals
2. Validated — fixed 4 errors
3. Compiled to TypeScript shell
4. User requested adding "priority" field

## Current State
- Schema is valid and compiles cleanly
- Working directory: /home/user/projects/taskly
```
## The Compaction Pipeline

Almadar's context compaction follows an 8-step pipeline.

### Step 1: Estimate Tokens
```typescript
function estimateTokens(messages: Message[]): number {
  // Rough character-count heuristic; fine for deciding when to compact
  const totalChars = messages.reduce((sum, m) => {
    const content = typeof m.content === 'string'
      ? m.content
      : JSON.stringify(m.content);
    return sum + content.length;
  }, 0);
  return Math.ceil(totalChars / 4); // ~4 characters per token
}
```
### Step 2: Classify Messages

Not all messages are equal; each category gets its own compaction priority (a classifier sketch follows the table):
| Category | Examples | Compaction Priority |
|---|---|---|
| System | System prompt, skill instructions | 🔴 Never touch |
| Anchor | User's original request | 🟡 Preserve in summary |
| Tool-heavy | File reads, validation output | 🟢 Compress first |
| Reasoning | Assistant's analysis | 🟡 Summarize |
| Recent | Last N messages | 🔴 Never touch |
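
One way to realize the table is a single pass over roles and positions. A minimal sketch, where the `Message` shape and the "first user message is the anchor" rule are illustrative assumptions rather than Almadar's internals:

```typescript
type Category = 'system' | 'anchor' | 'tool-heavy' | 'reasoning' | 'recent';

interface Message {
  role: 'system' | 'user' | 'assistant' | 'tool';
  content: string | object;
}

function classify(messages: Message[], keepRecent = 20): Category[] {
  const anchor = messages.findIndex((m) => m.role === 'user'); // original request
  const recentStart = Math.max(0, messages.length - keepRecent);
  return messages.map((m, i) => {
    if (m.role === 'system') return 'system';   // never touched
    if (i >= recentStart) return 'recent';      // never touched
    if (i === anchor) return 'anchor';          // preserved in the summary
    if (m.role === 'tool') return 'tool-heavy'; // compressed first
    return 'reasoning';                         // summarized
  });
}
```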
### Step 3: Partition

Split the history into old and recent:

```typescript
const keepRecent = 20; // always keep the last 20 messages verbatim
const recent = messages.slice(-keepRecent);
const old = messages.slice(0, -keepRecent);
```
### Step 4: Compress Tool Results

Replace large tool outputs with one-line stubs:

```typescript
// Before: 850 lines of code
{
  role: 'tool',
  content: '<850 lines of TypeScript...>'
}

// After: one line
{
  role: 'tool',
  content: '[read_file: src/schema.ts — 850 lines]'
}
```
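
The substitution itself is one pass over the old messages. A minimal sketch, assuming each tool message carries its tool name and primary argument as metadata (`toolName` and `toolArg` are illustrative field names, not Almadar's):

```typescript
const STUB_THRESHOLD = 1_000; // chars; larger tool results become stubs

interface ToolMessage {
  role: 'tool';
  toolName: string; // e.g. 'read_file' (assumed metadata field)
  toolArg: string;  // e.g. 'src/schema.ts' (assumed metadata field)
  content: string;
}

type AnyMessage = ToolMessage | { role: 'system' | 'user' | 'assistant'; content: string };

function compressToolResults(messages: AnyMessage[]): AnyMessage[] {
  return messages.map((m) => {
    if (m.role !== 'tool' || m.content.length <= STUB_THRESHOLD) return m;
    const lines = m.content.split('\n').length;
    return { ...m, content: `[${m.toolName}: ${m.toolArg} — ${lines} lines]` };
  });
}
```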
### Step 5: Summarize (Optional)

For the `summarize` or `hybrid` strategies:

```typescript
const summaryPrompt = `
Summarize this conversation for an AI assistant.
Focus on:
1. What the user originally requested
2. What actions have been taken
3. What the current state is
4. Any errors encountered and how they were fixed
Be concise but comprehensive.
`;

const summary = await llm.generate(summaryPrompt, oldMessages);
```
### Step 6: Reassemble

```typescript
const compacted = [
  systemMessage,     // original system prompt
  summaryMessage,    // generated summary
  ...recentMessages, // last 20 messages, unchanged
];
```
### Step 7: Emit Event

```typescript
// Send a compaction notice to the UI
sse.send({
  type: 'compaction',
  data: {
    messagesBefore: 847,
    messagesAfter: 23,
    tokensBefore: 142000,
    tokensAfter: 3000,
    strategy: 'hybrid',
    summaryLength: summary.length,
  },
});
```
### Step 8: Persist

Store compaction metadata with the session:

```typescript
await sessionManager.recordCompaction(
  threadId,
  originalMessageCount,
  compactedMessageCount,
  'token_limit'
);
```
## Configuration Options

```typescript
interface CompactionConfig {
  maxTokens: number;          // token budget (default: 150000)
  triggerThreshold: number;   // 0-1, fraction of budget that triggers compaction (default: 0.75)
  keepRecentMessages: number; // always keep the last N messages (default: 20)
  strategy: 'truncate' | 'summarize' | 'hybrid';
  summaryModel?: string;      // optionally use a cheaper model for summaries
}
```
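
The two numbers combine multiplicatively: compaction fires once the estimate crosses `maxTokens * triggerThreshold`. A minimal sketch of the check (Almadar exposes this as `sessionManager.shouldCompactMessages`; the free function here is just for illustration):

```typescript
function shouldCompact(messages: Message[], config: CompactionConfig): boolean {
  return estimateTokens(messages) >= config.maxTokens * config.triggerThreshold;
}

// With the defaults (150000 × 0.75), compaction fires at ~112,500 estimated tokens.
```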
## Strategy Comparison

| Strategy | How It Works | Best For |
|---|---|---|
| `truncate` | Drop the oldest messages | Speed; no extra LLM call |
| `summarize` | LLM summarizes old messages | Maximum context preservation |
| `hybrid` | Compress tool results, summarize the rest | Balancing speed and quality |
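
In code, the three strategies can share one dispatch point. A sketch that assumes `summarize` wraps the Step 5 LLM call and `compressToolResults` is the Step 4 helper:

```typescript
// Assumed helpers from the pipeline steps above
declare function summarize(old: Message[], model?: string): Promise<Message>;
declare function compressToolResults(old: Message[]): Message[];

async function compactOld(old: Message[], config: CompactionConfig): Promise<Message[]> {
  switch (config.strategy) {
    case 'truncate':
      return []; // drop old messages entirely; fast, no LLM call
    case 'summarize':
      return [await summarize(old, config.summaryModel)];
    case 'hybrid':
      // stub out tool results first, then summarize what remains
      return [await summarize(compressToolResults(old), config.summaryModel)];
  }
}
```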
## Real-World Example

### Session: Building an E-Commerce Platform
**Hour 1:**
- Created Order, Product, User entities
- Set up CRUD traits
- Compiled successfully
[Messages: 15, Tokens: ~8K]
**Hour 2:**
- Added shopping cart orbital
- Implemented checkout flow
- Fixed validation errors
[Messages: 35, Tokens: ~25K]
**Hour 3:**
- Added payment integration
- Tested end-to-end
- Refactored for performance
[Messages: 67, Tokens: ~62K]
**Hour 4:**
- Added inventory management
- Realized: Running low on tokens!
Compaction triggered:
```
[Context Summary — compacted from 67 messages]

## Original Request
Build an e-commerce platform with product catalog,
shopping cart, and checkout.

## Entities Created
- Product: name, price, inventory
- User: email, name, addresses
- Order: items, total, status
- Cart: items, user relation

## Key Features Implemented
1. Product browsing (entity-table pattern)
2. Shopping cart (session-based)
3. Checkout wizard (3-step flow)
4. Payment integration (Stripe)
5. Inventory management

## Current State
- 5 orbitals defined
- All tests passing
- Ready for deployment

## Recent Focus
Adding inventory management and low-stock alerts.
```
Result: 67 messages → 1 system prompt + 1 summary + 20 recent = 22 messages
## Code Example: Using Compaction
```typescript
// Internal: Almadar's session management system
const sessionManager = new SessionManager({
  mode: 'firestore',
  firestoreDb: db,
  memoryManager,
  compactionConfig: {
    maxTokens: 150000,
    triggerThreshold: 0.8,
    keepRecentMessages: 10,
    strategy: 'hybrid',
  },
});

// Check whether compaction is needed
const shouldCompact = sessionManager.shouldCompactMessages(messages);

if (shouldCompact) {
  console.log('Compacting context...');

  // In your agent loop, trigger compaction before sending to the LLM
  const compacted = await compactMessages(messages, config);

  // Record for analytics
  await sessionManager.recordCompaction(
    threadId,
    messages.length,
    compacted.length,
    'token_limit'
  );
}

// Get compaction history
const history = sessionManager.getCompactionHistory(threadId);
console.log(`Session compacted ${history.length} times`);
// => Session compacted 3 times
```
## Real-World Analogy: Executive Summary
Context compaction is like an executive summary:
**Full Report (500 pages):**
- Every email
- Every meeting transcript
- Every spreadsheet
- Every draft

**Executive Summary (2 pages):**
- What we were asked to do
- What we did
- Current status
- Next steps
The CEO doesn't read 500 pages. They read the summary and the last few updates.
LLMs work the same way.
## Trade-offs

### What We Keep
- ✅ System instructions (critical)
- ✅ User's original request (context)
- ✅ Recent messages (current state)
- ✅ Error/success patterns (learning)
### What We Lose
- ❌ Exact tool output (replaced with stub)
- ❌ Intermediate reasoning (summarized)
- ❌ Exact file contents (can re-read)
### Mitigations

- **Re-read on demand:** if the LLM needs file contents, it can re-read them (see the sketch below)
- **Keep key decisions:** important choices are preserved in the summary
- **Track references:** original messages stay linked for debugging
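
The first mitigation works because the stub format from Step 4 keeps the file path recoverable. A sketch of the round trip (the regex assumes the exact stub format shown earlier):

```typescript
// Recover the original path from a compaction stub so the agent can
// re-read the file on demand.
const STUB_PATTERN = /^\[read_file: (.+) — \d+ lines\]$/;

function pathFromStub(stub: string): string | null {
  const match = STUB_PATTERN.exec(stub);
  return match ? match[1] : null;
}

pathFromStub('[read_file: src/schema.ts — 850 lines]'); // => 'src/schema.ts'
```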
## The Takeaway
Context windows are finite. Sessions can be long. Context compaction bridges the gap:

- **Smart compression:** keep the meaning, reduce the tokens
- **Configurable strategies:** trade speed against quality
- **Transparent:** the user sees what was compacted
- **Recoverable:** compressed data can be re-fetched
Because the best AI assistant isn't one with infinite memory — it's one that knows what to remember.
Learn more about Session Management.
