OpenAI and Google Shocked by the First EVER Open Source AI Agent

AI Revolution

Video Summary

The release of GLM 4.6V by Zhipu AI marks a significant advancement in open-source AI, introducing the first multimodal model capable of treating images, videos, screenshots, and web pages as direct inputs for tool calling. This eliminates the need for cumbersome text conversions, enabling a more fluid loop of perception, understanding, and action, which is crucial for advanced AI agents. The model boasts an impressive 128,000-token context window, allowing it to process extensive documents or an hour of video in a single pass. One particularly striking feature is its ability to reconstruct websites from screenshots with pixel-perfect accuracy and even edit code based on visual cues.

The GLM 4.6V comes in two versions: a 106 billion parameter model for cloud setups and a 9 billion parameter "flash" version optimized for local devices. Both are MIT licensed, offering companies free deployment without restrictions. The pricing for the larger model is remarkably low at $0.3 per million input tokens and $0.9 per million output tokens, significantly undercutting competitors like GPT 4.5, Gemini 3 Pro, and Claude Opus. This combination of native multimodal capabilities, extensive context handling, open-source accessibility, and competitive pricing has generated substantial excitement, positioning GLM 4.6V as a foundational technology for future agent frameworks and a genuine game-changer in the AI landscape.

Short Highlights

  • GLM 4.6V is the first open-source multimodal model that treats images, videos, screenshots, and web pages as direct inputs for tool calling, bypassing text conversion.
  • The model features a 128,000-token context window, capable of processing approximately 150 pages of documents, 200 slides, or one hour of video in a single pass.
  • Zhipu AI released two versions: GLM 4.6V (106 billion parameters) for cloud and GLM 4.6V Flash (9 billion parameters) for local devices, both MIT licensed.
  • GLM 4.6V's pricing is significantly lower than competitors, with the larger model costing $0.3 per million input tokens and $0.9 per million output tokens.
  • The model demonstrates advanced capabilities in reconstructing front-ends from screenshots, performing visual web search, and handling complex mixed-content documents and videos.

Key Details

The Dawn of Native Multimodal Tool Calling [00:02]

  • GLM 4.6V represents a paradigm shift as the first open-source multimodal model capable of treating visual inputs (images, videos, screenshots, web pages) directly as real inputs for tool calling, rather than requiring them to be converted into text first.
  • This direct visual input bypasses slow and lossy textual descriptions, enabling a more integrated loop of perception, understanding, and action, which is crucial for advanced AI agents.
  • The open-source nature of GLM 4.6V means anyone can download, run locally, or build upon it without restrictions, a significant departure from previous closed-lab multimodal capabilities.

"The simplest way to put it is that this is the first open-source multimodal model that treats images, videos, screenshots, and even full web pages as real inputs for tool calling, not as some secondary thing that has to be squeezed into text first."

Extended Context and Dual Model Variants [00:50]

  • GLM 4.6V extends its training context to 128,000 tokens, allowing it to process roughly 150 pages of dense documents, 200 slides, or an entire hour of video in one go without fragmented pipelines.
  • Zhipu AI released two versions: the large GLM 4.6V with 106 billion parameters for cloud environments and high-performance clusters, and the "flash" version with 9 billion parameters optimized for local devices and low latency tasks.
  • Both models are MIT licensed, permitting companies to deploy them freely without concerns about proprietary code or enterprise fees.

"The crazy part is that the flash variant is free to use and both models are MIT licensed so companies can deploy them wherever they want without worrying about opening their code or paying enterprise level fees."

Groundbreaking Multimodal Tooling and Cost-Effectiveness [01:59]

  • The native multimodal tool calling system allows visual data to be used directly as parameters, enabling tools to return visual outputs like search result grids or rendered web pages, which the model then reasons with alongside text.
  • GLM 4.6V's pricing is exceptionally competitive: the 106B version costs $0.3 per million input tokens and $0.9 per million output tokens, significantly undercutting models like GPT 4.5 ($1.25 per million input plus output tokens), Gemini 3 Pro, and Claude Opus ($90 per million tokens).
  • Despite its smaller scale and lower cost, GLM 4.6V achieves benchmark scores that surpass larger models on long-context tasks, video summarization, and multimodal reasoning.

"GLM4.6V lands at $1.2 $2 total and somehow it delivers benchmark scores that beat models way above its size on long context tasks, video summarization, and multimodal reasoning."

Vision-Native Execution and Front-End Automation [03:26]

  • The model operates as a "vision native execution layer," directly processing visual data as parameters without text conversion, a feature not commonly found even in many closed-source models.
  • It supports URLs representing images or video frames, enabling precise targeting of visuals within large documents and avoiding file size limitations.
  • Capabilities extend to front-end automation, where GLM 4.6V can reconstruct full website layouts as HTML, CSS, and JavaScript from screenshots, and even edit code snippets based on visual instructions.

"It is basically a vision native execution layer, which is something even most closed source models do not really have right now."

Scalable Document and Video Processing [07:14]

  • The 128k context window is instrumental for handling mixed documents at scale, with examples of summarizing financial reports from multiple companies, extracting metrics, and building comparative tables in a single pass.
  • For video, an hour of footage fits within the context window, allowing for comprehensive summarization, key moment highlighting, and answering timestamp-specific questions.
  • This is achieved by treating video frames as visual tokens with temporal encoding, utilizing 3D convolutions and timestamp markers.

"And for video, one hour of footage fits into the same window. So the model can summarize the whole match, highlight key moments, and still answer questions about timestamps or goal sequences afterward."

Advanced Training and Architectural Innovations [08:00]

  • Zhipu AI employed a multi-stage training process involving massive pre-training, fine-tuning, and reinforcement learning, with an RL system that learns from verifiable tasks rather than human ratings.
  • Curriculum sampling ensures the model progressively tackles harder examples, and tool usage is integrated into its reward system, teaching it optimal planning and output structuring.
  • Architecturally, it is based on the AIMv2-Huge vision transformer with an MLP projector, adept at handling diverse image sizes and aspect ratios (up to 200:1), using 2D RoPE positioning and interpolation for precision.

"Instead, it learns from verifiable tasks, things with clear right or wrong answers like math problems, chart reading, coding interfaces, spatial reasoning, and video question answering."

Benchmark Performance and Ecosystem Impact [09:34]

  • GLM 4.6V significantly outperforms previous versions and competitors on benchmarks such as MathVista (88.2 vs. 84.6) and WebVoyager (81 vs. 68.4), and sets new state-of-the-art results on RefCOCO and TreeBench.
  • Even the smaller flash model surpasses comparable light models, showcasing its efficiency for local deployment.
  • The model's ability to maintain consistency with huge, mixed inputs surpasses even larger models like Llama 3 321B and Qwen3-VL 235B, owing to its synchronized vision and language systems for long, complex reasoning.

"On Math Vista, GLM 4.6V scored 88.2, beating the previous 4.5V's 84.6 and Quen 3 VL8B's 81.4."

A Foundation for Future Agents and Accessibility [10:29]

  • GLM 4.6V is positioned as a robust backbone for agent frameworks that require observation, planning, and action, closing the gap left by models that merely describe visuals.
  • The combination of an MIT license, a free lightweight version, and competitive pricing makes it an enterprise-ready solution accessible to startups and large companies alike.
  • Its immediate availability via Hugging Face for weights download, local running of the flash variant, an OpenAI-compatible API, and a desktop assistant app underscores its significant accessibility and disruptive potential.

"It positions itself as the backbone for agent frameworks that need to actually observe, plan, and act, not just describe what they are seeing."
