GenericAgent, a local self-evolving agent

GenericAgent (code, paper) is a self-evolving LLM‑powered agent framework capable of controlling your entire computer.

The system uses a minimal set of atomic tools to perform tasks such as web browsing, filesystem manipulation, code execution, keyboard and mouse control, and screen perception through vision capabilities. It can also interact with mobile devices via Android’s ADB interface.

GenericAgent (GA) is fully open‑source and can be accessed here. In the reported evaluations, it matches or outperforms leading agent systems such as Claude Code, OpenAI Codex, and OpenClaw on most benchmarks, while typically consuming fewer tokens and completing tasks in fewer steps.

The agent automatically continues to optimize over time through a self-evolving mechanism. When you give it a new task, GenericAgent solves it step-by-step and saves those steps as a new skill. The more you use it, the more skills it collects, growing into your personal library of skills that are custom-made for the way you work.

Key features

Self-evolving: Converts each completed task into a reusable skill.
Minimal architecture: Contains around 3,000 lines of code. The main agent loop is only about 100 lines long.
Browser-integrated control: Can operate directly within a real web browser while keeping login sessions active. It uses a set of 9 simple tools to control and interact with the system effectively.
Token efficiency: Keeps its working context under 30,000 tokens, far below the 200,000 to 1M tokens used by many other agents.

The context overload problem in AI agents

Traditional LLM agents struggle with long interactions. As conversations or tasks grow, they accumulate tool outputs, memory, and environmental feedback. This leads to context overload, where critical information gets buried or even lost.

A good AI agent needs enough information to understand the task, but not so much that it becomes overloaded. If the context is too complete, it may contain too much unnecessary information, making the agent slow and confused. If the context is too concise, important details may be missing, leading to mistakes or poor decisions (see the picture below).

GenericAgent balances completeness and conciseness (source: paper)

GenericAgent keeps the right balance between completeness and conciseness because its entire architecture is built around minimal atomic tools, self‑evolving skills, and LLM‑driven planning that avoids unnecessary steps.

The model

GenericAgent is a framework built around existing LLMs. It can integrate with models such as Claude, Gemini, and other major LLM APIs. The system consists of four tightly integrated modules: a minimal tool set, hierarchical memory, self-evolution layer, and structured browser extraction.

Overall framework of GenericAgent (source: paper)

A minimal tool set: To keep GA small (~3K lines of code), the agent avoids having a dedicated tool for every task. Instead, it uses 9 atomic tools for browser control, filesystem operations, screen vision, keyboard/mouse input, code execution, and ADB for mobile. Because the tools are composable, the agent can express complex workflows without needing dozens of specialized APIs.
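As a minimal sketch of how a few composable tools can cover a multi-step task, the snippet below registers three stand-in tools and chains them. The tool names (`file_write`, `file_read`, `code_run`), the `TOOLS` registry, and the `call` dispatcher are hypothetical illustrations, not GenericAgent's actual tools or API.

```python
import contextlib
import io
import os
import tempfile
from typing import Callable, Dict

# Hypothetical registry of atomic tools; the names are illustrative only.
TOOLS: Dict[str, Callable[..., str]] = {}

def tool(name: str):
    """Register a function as an atomic tool under the given name."""
    def decorator(fn):
        TOOLS[name] = fn
        return fn
    return decorator

@tool("file_write")
def file_write(path: str, text: str) -> str:
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return f"wrote {len(text)} chars"

@tool("file_read")
def file_read(path: str) -> str:
    with open(path, encoding="utf-8") as f:
        return f.read()

@tool("code_run")
def code_run(source: str) -> str:
    # Execute Python source in a fresh namespace and capture what it prints.
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {})
    return buffer.getvalue().strip()

def call(name: str, **kwargs) -> str:
    """Dispatch one tool call by name, as an agent loop might do."""
    return TOOLS[name](**kwargs)

# Compose the tools: write a script, read it back, then run it.
script = os.path.join(tempfile.mkdtemp(), "hello.py")
call("file_write", path=script, text="print(6 * 7)")
source = call("file_read", path=script)
output = call("code_run", source=source)
```

The point of the sketch is that the workflow lives in the sequence of calls rather than in any single tool, so new tasks need new plans, not new tools.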

Hierarchical memory: GA uses a structured memory system to keep its workspace organized and efficient. In many other AI systems, every past action and detail stays in the active memory. Over time, this creates a clutter of old data that confuses the AI and makes it harder for it to think clearly or make good decisions. To solve this, GA organizes memory into 2 levels:

The Always-On Layer: This is a very small, permanent part of the AI’s memory. It does not store every detail; instead, it acts like a simple map or index that tells the AI what information is available.
The Deep Memory Layers: All detailed records, past steps, and complex facts are compressed and stored in the “back” of the system. This information stays hidden until the AI specifically needs it.

When it learns something new, GA creates a very short summary to keep in the Always-On Layer. When it needs the full details later, it uses that summary as a pointer to pull the exact information it needs from the Deep Memory Layers.
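The two-level idea can be sketched as a small index of summaries that points into compressed storage. This is an illustrative toy under stated assumptions: the `TwoLevelMemory` class, its method names, the use of `zlib` for compression, and the sample record are all hypothetical and are not taken from the GenericAgent code base.

```python
import json
import zlib
from typing import Dict

class TwoLevelMemory:
    """Toy two-level memory: a small always-on index plus compressed deep storage."""

    def __init__(self) -> None:
        self.index: Dict[str, str] = {}    # key -> one-line summary (always in context)
        self._deep: Dict[str, bytes] = {}  # key -> compressed full record (loaded on demand)

    def remember(self, key: str, summary: str, details: dict) -> None:
        """Store the short summary in the index and the full record in deep storage."""
        self.index[key] = summary
        self._deep[key] = zlib.compress(json.dumps(details).encode("utf-8"))

    def prompt_view(self) -> str:
        """Text that would go into the active context: summaries only."""
        return "\n".join(f"{k}: {s}" for k, s in sorted(self.index.items()))

    def recall(self, key: str) -> dict:
        """Follow the index key as a pointer and decompress the full record."""
        return json.loads(zlib.decompress(self._deep[key]).decode("utf-8"))

memory = TwoLevelMemory()
memory.remember(
    "gmail_send",
    "How to send a file via Gmail (OAuth + attachment)",
    # Made-up record, for illustration only.
    {"steps": ["configure OAuth", "write send script"], "notes": "example detail"},
)
```

Only `prompt_view()` grows the active context, and it grows by one line per memory, while `recall()` loads the full details only when a task asks for them.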

Self-evolution layer: This is a mechanism designed to transform successful task execution into permanent operational knowledge. After the successful completion of a complex task, the agent takes the optimal execution path, filters out situational noise, such as errors or temporary workarounds, and extracts a refined, reusable skill. These refined skills are stored within a local Skill Tree to be further reused for other similar tasks. Past interaction histories are transformed into reusable Standard Operating Procedures (SOPs) and executable code.
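A minimal sketch of the distillation step, assuming the execution trace is a list of steps flagged as successful or not and as temporary workarounds or not. The `Step` and `SkillTree` types and the sample trace are hypothetical. In the real system this filtering would be done by the LLM rather than by boolean flags.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Step:
    action: str
    ok: bool                   # did the step succeed?
    workaround: bool = False   # temporary fix that should not be kept

@dataclass
class SkillTree:
    skills: Dict[str, List[str]] = field(default_factory=dict)

    def distill(self, name: str, trace: List[Step]) -> List[str]:
        """Keep only successful, non-workaround steps and store them as a skill."""
        procedure = [s.action for s in trace if s.ok and not s.workaround]
        self.skills[name] = procedure
        return procedure

# Illustrative trace: one failed attempt and one temporary workaround.
trace = [
    Step("install deps", ok=True),
    Step("open database", ok=False),
    Step("retry after short wait", ok=True, workaround=True),
    Step("locate local database", ok=True),
    Step("write read script", ok=True),
]
tree = SkillTree()
sop = tree.distill("read_wechat_messages", trace)
```

Here `sop` holds only the three steps worth repeating, which is the sense in which a noisy history becomes a reusable procedure.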

Structured browser extraction: This is one of the most powerful features of GA. While traditional web tools often struggle to read websites correctly, GA navigates a website and converts raw, unstructured HTML into organized, machine-readable data, such as a table or JSON. Instead of you spending hours copying and pasting data into a spreadsheet, the agent can do it for you. It can visit multiple websites, find the specific facts you need, and present them in a clean, easy-to-read format.
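To illustrate what "raw HTML to machine-readable data" means in the simplest case, the sketch below turns an HTML table into a list of JSON records using only Python's standard `html.parser` module. It is not GenericAgent's extraction code. The `TableExtractor` class, the `table_to_records` helper, and the sample page are made up for illustration, and real pages need far more robust handling.

```python
import json
from html.parser import HTMLParser
from typing import Dict, List

class TableExtractor(HTMLParser):
    """Collect the text of <th>/<td> cells, grouped by <tr> row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Only text inside a cell is kept; whitespace between tags is ignored.
        if self._in_cell:
            self.rows[-1][-1] += data.strip()

def table_to_records(html: str) -> List[Dict[str, str]]:
    """Use the first row as the header and map every other row onto it."""
    parser = TableExtractor()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]

# Made-up sample page fragment.
html = """
<table>
  <tr><th>Ticker</th><th>Price</th></tr>
  <tr><td>AAA</td><td>10.5</td></tr>
  <tr><td>BBB</td><td>7.25</td></tr>
</table>
"""
records = table_to_records(html)
print(json.dumps(records, indent=2))
```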

The next table showcases the self-evolution mechanism:

Task	What the agent does the first time	What was learned after
WeChat “Read my WeChat messages”	Install deps (dependencies) → reverse DB (figures out how to talk to the apps on your computer, like finding where WeChat stores messages) → write read script → save skill	How to decrypt and read the local database.
Stocks “Monitor stocks and alert me”	Install mootdx (downloads the necessary libraries for stocks)→ build selection flow → configure cron (setting up a digital alarm clock) → save skill	How to use specific financial libraries and set a schedule (cron).
Gmail “Send this file via Gmail”	Configure OAuth → write send script → save skill	How to handle secure logins (OAuth) and file attachments.

Table data: official documentation

As it completes tasks, the model turns those experiences into reusable skills, forming a skill tree that will be used in future sessions.

Advantages of using GenericAgent

Reduced computational cost
Faster execution
Better scalability

The most significant data point is the roughly 7x reduction in token consumption relative to OpenClaw: in standardized benchmarks, GenericAgent completed the same task suite using 3.9M tokens, compared to 27.3M for OpenClaw. It also uses fewer tokens than Claude Code and OpenAI Codex in most of the evaluations reported below.

This is largely due to its design goal of contextual information density, which keeps the active context small (often under 30K tokens), whereas other agents may expand to 100K–200K+ tokens during long workflows.

The table below shows the main differences between GA, Claude Code, OpenAI Codex, and OpenClaw.

Feature	GenericAgent	Claude Code / Codex	OpenClaw
Focus	General computer use + self-evolution	Specialized coding & repo management	Visual-based computer interaction
Code Size	~3K lines (Minimalist)	Large / Proprietary	~530K lines (Complex)
Learning	Saves custom “Skill Trees”	Fixed capability sets	Mostly session-based
Memory	Hierarchical / Compressed	Context-compaction	Persistent but can be noisy

Table data: official documentation

GenericAgent is designed for general computer use, allowing an AI to operate a full desktop environment and gradually improve its abilities. Claude Code and Codex, by contrast, focus mainly on software development tasks, and OpenClaw emphasizes vision‑based interaction, enabling an agent to control a computer by interpreting what appears on the screen.

Evaluation

GenericAgent was evaluated across five main areas: task completion and token efficiency, tool usage, memory effectiveness, self-evolution, and web browsing performance. The tests measured how efficiently the agent completes tasks, manages memory, uses its minimal toolset, learns from past interactions by creating reusable procedures and code, and handles complex web-based tasks in dynamic environments.

The table below reports task completion rate and token efficiency on two main agent benchmarks (SOP-Bench and Lifelong AgentBench) and on RealFin-benchmark. GA shows consistent performance across all three benchmarks.

Benchmark	Agent	Model	Accuracy	Input Tokens	Output Tokens	Total Tokens	Efficiency
SOP-Bench	GA	Claude Sonnet 4.6	100%	2.02M	53k	2.08M	0.48
	OpenClaw	Claude Sonnet 4.6	100%	2.60M	40k	2.64M	0.38
	Claude Code	Claude Sonnet 4.6	85%	1.23M	23k	1.25M	0.68
	GA	Minimax M2.7	90%	893k	32k	924k	0.97
	OpenClaw	Minimax M2.7	95%	2.91M	46k	2.96M	0.32
Lifelong AgentBench	GA	Claude Sonnet 4.6	100%	222k	20k	241k	4.15
	OpenClaw	Claude Sonnet 4.6	70%	1.43M	21k	1.45M	0.48
	Claude Code	Claude Sonnet 4.6	75%	800k	14k	814k	0.92
	GA	Minimax M2.7	90%	400k	23k	423k	2.12
	OpenClaw	Minimax M2.7	70%	1.20M	17k	1.22M	0.57
RealFin-benchmark	GA	Claude Sonnet 4.6	65%	102k	12k	114k	5.70
	Claude Code	Claude Opus 4.6	60%	290k	17k	307k	1.95
	Claude Code	Claude Sonnet 4.6	55%	226k	12k	238k	2.31
	OpenClaw	Claude Sonnet 4.6	35%	249k	2k	251k	1.39
	Codex	GPT-5.4	60%	838k	54k	892k	0.67

Table source: paper
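The Efficiency column is not defined in the text above. Its values are consistent with accuracy (as a fraction) divided by total tokens in millions. That definition is an inference from the numbers in the table, not a formula quoted from the paper, and the short check below tests it against a few rows.

```python
def efficiency(accuracy_pct: float, total_tokens: float) -> float:
    """Accuracy as a fraction divided by total tokens in millions (inferred definition)."""
    return (accuracy_pct / 100) / (total_tokens / 1_000_000)

# (benchmark, agent, model) -> (accuracy %, total tokens, efficiency reported in the table)
rows = {
    ("SOP-Bench", "GA", "Claude Sonnet 4.6"): (100, 2_080_000, 0.48),
    ("SOP-Bench", "Claude Code", "Claude Sonnet 4.6"): (85, 1_250_000, 0.68),
    ("Lifelong AgentBench", "GA", "Claude Sonnet 4.6"): (100, 241_000, 4.15),
    ("RealFin-benchmark", "GA", "Claude Sonnet 4.6"): (65, 114_000, 5.70),
    ("RealFin-benchmark", "Codex", "GPT-5.4"): (60, 892_000, 0.67),
}

computed = {key: round(efficiency(acc, tok), 2) for key, (acc, tok, _) in rows.items()}
for key, value in computed.items():
    print(key, "computed:", value, "reported:", rows[key][2])
```

Differences of about 0.01 can appear on other rows because the token counts in the table are already rounded. Read this way, a higher value means more accuracy per token spent, which is why Claude Code scores higher than GA on SOP-Bench despite lower accuracy.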

The next table shows the results for long-horizon, complex tasks. The evaluation includes 5 categories: document generation (PDF/PPT creation), SQL copilot query generation, experiment analysis report writing, procurement decision-making with web retrieval, and feasibility analysis for reproducing research papers. The table summarizes the average performance across the full set of long-horizon tasks.

Agent	# Tasks	Success	Total Tokens	Time (s)	Requests	Tool Calls
Claude Code	5	100.0%	537,413	320.8	32.6	22.6
GenericAgent	5	100.0%	188,829	220.8	11.0	12.8
OpenClaw	5	80.0%	633,101	183.1	15.0	16.6

Table source: paper

We can see that GenericAgent is the most balanced agent in this comparison. It matches Claude Code's 100% success rate while finishing 100 seconds sooner on average (220.8 s versus 320.8 s), and it issues the fewest requests and tool calls of the three agents. OpenClaw has the shortest average time (183.1 s) but completes only 80% of the tasks. GenericAgent used 188,829 tokens, compared to 537,413 for Claude Code and 633,101 for OpenClaw, which is roughly 2.85× fewer tokens than Claude Code and 3.35× fewer than OpenClaw.
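The token ratios can be checked directly from the Total Tokens column of the long-horizon table. The snippet is plain arithmetic on the reported numbers, and the variable names are illustrative.

```python
# Average total tokens per agent, copied from the long-horizon table.
tokens = {"GenericAgent": 188_829, "Claude Code": 537_413, "OpenClaw": 633_101}

baseline = tokens["GenericAgent"]
# How many times more tokens each other agent used than GenericAgent.
ratios = {
    name: round(count / baseline, 2)
    for name, count in tokens.items()
    if name != "GenericAgent"
}
print(ratios)  # {'Claude Code': 2.85, 'OpenClaw': 3.35}
```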

The table below shows prompt length after adding 20 skills under intensive usage with minimal input. GenericAgent (GA) prevents uncontrolled context growth.

System	Full Prompt Length (tokens)
OpenClaw	43,321
Codex	23,932
Claude Code	22,821
GenericAgent	2,298

Table source: paper
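For a sense of scale, the prompt lengths in the table can be turned into relative reductions. This is simple arithmetic on the reported values, with illustrative variable names.

```python
# Full prompt length in tokens after adding 20 skills, copied from the table.
prompt_tokens = {
    "OpenClaw": 43_321,
    "Codex": 23_932,
    "Claude Code": 22_821,
    "GenericAgent": 2_298,
}

ga = prompt_tokens["GenericAgent"]
# Percentage by which GenericAgent's prompt is shorter than each other system's.
reduction_pct = {
    name: round(100 * (1 - ga / length), 1)
    for name, length in prompt_tokens.items()
    if name != "GenericAgent"
}
print(reduction_pct)  # {'OpenClaw': 94.7, 'Codex': 90.4, 'Claude Code': 89.9}
```

GenericAgent's prompt is therefore about 90% shorter than those of Claude Code and Codex, and about 95% shorter than OpenClaw's.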

Quick start guide to GenericAgent

Follow the installation steps in the GitHub repository. In summary, there are two install methods:

run a one-line command that downloads and executes an install script
git-clone the repository and install the Python dependencies

Afterwards, you must set your LLM API key by editing the “mykey.py” file. GenericAgent supports OpenAI-compatible APIs and Anthropic's native Claude API.

Finally, you can launch one of the frontends specified in the README. You have these choices: a desktop GUI, a terminal UI and a Streamlit UI.

Conclusion

GenericAgent represents a significant departure from the “bigger is better” trend in AI development. While the industry often pursues performance through massive context windows and complex multi-agent orchestration, GA suggests that a minimalist seed architecture can be more effective, efficient, and adaptable.

By keeping its full prompt roughly 90% shorter than those of Claude Code and Codex and relying on just 9 atomic tools, GA avoids the complexity of traditional automation.

The agent’s true strength is its Self-Evolution Layer, which distills complex, trial-and-error interactions into permanent, executable SOPs. The agent becomes faster, cheaper, and more personalized with every task it completes.