What an amazing feeling to see
something I created get recognized! I'm delighted to share that my AI-generated
art piece was selected as one of the 50 winners out of over 420 entries in our organization's recent Visions of the Kick-Off AI Art Competition.
The competition, dubbed
"Visions of the Kick-Off," was launched as part of OpenText's
Kick-Off (OKO) Event this year. It was a creative challenge extended to all
employees globally, inviting us to unleash our imaginations and showcase our vision
of the event through AI. The core theme was to create an AI-generated image
inspired by OKO, whether it reflected the energy, company priorities, or
memorable insights from the speakers.
Participating was straightforward
but required thought. The primary rule was that the image had to be fully
AI-generated based solely on a text prompt. We had the flexibility to use tools
like Copilot or any other AI image generator, though Copilot Chat in Teams was
highlighted as an internal option available to all employees. Beyond the
technicalities, there was a crucial emphasis on ensuring the image was
respectful, appropriate, and not offensive, given that entries would be
displayed for all OpenTexters to view. We were also limited to one entry per
employee, and it was essential to submit the exact prompt(s) used along with
the image. The deadline was firmly set for 7 PM ET on July 10, 2025.
The judging process was
comprehensive, with a panel reviewing submissions against several key criteria.
My focus, and indeed what the judges were looking for, centered on:
Creativity & originality:
Was the visual imaginative and unique, offering a fresh perspective?
Relevance to OKO: Was the
connection to the event clear?
Prompt quality: Was the prompt
well-crafted, thoughtful, and effective in guiding the AI?
Visual impact: Was the image
visually striking or engaging?
Knowing these criteria certainly
guided my approach to crafting the prompt and refining the image. The judging
panel expressed that they were "blown away by the talent and imagination
on display", which truly speaks to the high caliber of submissions.
My AI-generated art piece that earned a spot among the winners of the recently concluded Visions of the Kick-Off AI Art Competition.
When the announcement came on July
29, 2025, revealing the 50 winners, I was astonished. It's indeed gratifying to
have my work recognized. As a winner, my piece, along with the other winning
entries, will be featured in a special virtual gallery and celebrated across
company channels. Plus, there's some "amazing swag" headed my way!
This initiative has not only been
a fun challenge but also a powerful demonstration of how creativity can truly
shine when paired with technology. It's inspiring to see so many colleagues
embrace AI as a tool for artistic expression. This experience has reinforced
the exciting possibilities that emerge when we dare to imagine and create using
the AI tools we have today.
Google is once again redefining the boundaries of digital creativity. Its Gemini platform now lets users transform ordinary still images/photos into short, animated video clips, complete with sound. This fresh capability, revealed by David Sharon, who leads Multimodal Generation for Gemini Apps, is powered by the company’s latest video model, Veo 3.
How It Works
Breathing life into a static photo might sound like something out of a sci-fi movie, but with Gemini, the process feels intuitive and fun. Inside the Gemini interface, users can head over to the prompt area and select the “Videos” option. Once a photo is uploaded, all that’s left to do is describe what the scene should look like in motion, and optionally, suggest accompanying audio.
That’s all it takes. A few inputs later, your snapshot evolves into an eight-second animated video. Whether you're reimagining a childhood drawing or adding motion to a scenic photo from a recent hike, the possibilities feel nearly limitless. Finished videos can be downloaded or shared instantly with friends and family.
The AI Engine Behind the Art
Under the hood, all of this is made possible by Veo 3, Google's advanced video-generation engine. Introduced in May, this model is already making waves. It recently became available to Google AI Pro users across more than 150 countries.
And users are clearly loving it. In just the past seven weeks, over 40 million videos have been created using Veo 3 (both within Gemini and Flow -- Google’s AI-powered storytelling tool). People are using it to do everything from reimagining classic fairy tales with a modern spin to building ASMR experiences around nature’s most mesmerizing sounds.
Where and How to Try It
The photo-to-video feature is currently rolling out to Gemini AI Pro and Ultra users in select countries. Curious users can check it out by visiting gemini.google.com. The same tools are also available in Flow, which is more tailored for creators working on longer or more cinematic projects.
Built With Safety in Mind
As with all of Google’s AI innovations, the launch of this feature comes with a focus on responsibility and safety. Behind the scenes, the tech giant is running continuous “red teaming” simulations, essentially stress tests designed to catch problems before they reach real users.
Each AI-generated video is clearly marked with a visible watermark to indicate it was created by artificial intelligence. Additionally, every file includes a SynthID digital signature -- Google’s invisible watermarking system designed for traceability.
And user feedback is more than welcome. With a quick thumbs-up or thumbs-down on each video, creators can share their impressions. This feedback loop helps Google continuously fine-tune the experience and maintain high standards of safety.
This feature is more than just a novelty; it’s a glimpse into what the future of personal storytelling could look like. By giving users the ability to animate their memories, drawings, or ideas with just a few prompts, Google is turning imagination into a playable format. Whether you’re an artist, a content creator, or just someone curious to explore what AI can do, Gemini now offers a platform filled with limitless potential.
One of the highlights revealed during the Google
Cloud NEXT 2025 was the
Agent Development Kit (ADK), a new open-source
framework designed to streamline the creation and deployment of intelligent,
autonomous AI agents and advanced multi-agent systems. As the AI
landscape evolves beyond single-purpose models, building coordinated teams of
agents presents new challenges, which ADK aims to solve by providing a
full-stack, end-to-end development solution.
The
framework is the same one that powers agents within Google's own products,
including Agentspace and the Google Customer Engagement Suite (CES). By
open-sourcing ADK, Google intends to empower developers with powerful and
flexible tools for building in the rapidly changing agentic AI space. ADK is
built with flexibility in mind, supporting different models and deployment
environments, and designed to make agent development feel similar to
traditional software development.
Core
Pillars Guiding Development
ADK
provides capabilities across the entire agent development lifecycle. Its core
principles emphasize flexibility, modularity, and precise control over agent
behavior and orchestration. Key pillars include:
Multi-Agent
by Design:
Facilitates building modular and scalable applications by composing
specialized agents hierarchically, enabling complex coordination and
delegation.
Rich
Model Ecosystem:
Allows developers to choose the best model for their needs, integrating
with Google's Gemini models, Vertex AI Model Garden, and through LiteLLM,
a wide selection of models from providers like Anthropic, Meta, and
Mistral AI.
Rich
Tool Ecosystem:
Agents can be equipped with diverse capabilities using pre-built tools
(like Search or Code Execution), custom Function Tools, integrating
3rd-party libraries (like LangChain or LlamaIndex), or even using other
agents as tools via AgentTool. ADK also supports Google Cloud
tools, MCP tools, and those defined by OpenAPI specifications.
Built-in
Streaming: Enables
human-like conversations with agents through bidirectional audio and video
streaming capabilities, moving beyond text-only interactions into rich,
multimodal dialogue with just a few lines of code.
Flexible
Orchestration:
Workflows can be defined using Workflow Agents such as Sequential,
Parallel, and Loop agents for predictable pipelines, or by leveraging
LLM-driven dynamic routing via LlmAgent transfer for more adaptive
behavior (a minimal workflow sketch follows this list).
Integrated
Developer Experience:
Offers tools for local development, testing, and debugging, including a
powerful CLI, a visual Web UI to inspect execution step-by-step, and an
API Server.
Built-in
Evaluation:
Provides systematic assessment of agent performance by evaluating both
final response quality and the step-by-step execution against predefined
test cases.
Easy
Deployment: Agents
can be containerized and deployed anywhere, with seamless integration
options for Vertex AI Agent Engine, a fully managed, scalable, and
enterprise-grade runtime on Google Cloud.
Additional
core concepts include Sessions & Memory for managing conversational
context and agent state, Artifacts for handling files and binary data
tied to a session, and Callbacks for customizing behavior at specific
execution points. The Runtime manages execution, tracked via Events,
with Context providing relevant information.
A
notable architectural component is the Agent2Agent (A2A) Protocol, an
open protocol designed to facilitate communication and collaboration between
agents across different platforms and frameworks, promoting a more
interconnected ecosystem.
Getting
Started and Building with ADK
Building
a basic agent with ADK is designed for Pythonic simplicity, potentially
requiring fewer than 100 lines of code. Developers define an agent's logic,
tools, and information processing. ADK provides the structure for managing
state, orchestrating tool calls, and interacting with underlying LLMs.
The
framework primarily supports the Python programming language, with
future plans for additional language support. Installation is straightforward
using pip. The documentation includes quickstart guides and examples, such as a
simple "question_answer_agent" using a Google Search tool.
For
a more integrated experience, ADK includes a developer Web UI that can
be launched locally via a CLI command (adk web), providing a user-friendly
interface for running, testing, and debugging agents. Agents can also be run
and tested via the CLI (adk run) or accessed through an api_server.
ADK
truly excels in building multi-agent applications. This involves
creating teams of specialized agents that can delegate tasks based on
conversation context, utilizing hierarchical structures and intelligent
routing. An illustrative example shows a WeatherAgent delegating simple
greetings and farewells to specialized GreetingAgent and FarewellAgent
instances, while handling weather queries itself using a get_weather tool. Delegation
relies heavily on clear, distinct descriptions of each agent's capability,
which the LLM uses to route tasks effectively.
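A hedged sketch of that delegation pattern is shown below. It assumes the same Agent class plus a sub_agents parameter for hierarchical composition; the get_weather tool is a hypothetical stand-in that returns canned data rather than calling a real service.

```python
from google.adk.agents import Agent

def get_weather(city: str) -> dict:
    """Hypothetical tool: return canned weather data for a city."""
    return {"city": city, "forecast": "sunny", "temperature_c": 24}

greeting_agent = Agent(
    name="greeting_agent",
    model="gemini-2.0-flash",
    description="Handles simple greetings and hellos.",  # description drives routing
    instruction="Greet the user warmly and briefly.",
)

farewell_agent = Agent(
    name="farewell_agent",
    model="gemini-2.0-flash",
    description="Handles goodbyes and farewells.",
    instruction="Say a short, polite goodbye.",
)

root_agent = Agent(
    name="weather_agent",
    model="gemini-2.0-flash",
    description="Answers weather questions; delegates greetings and farewells.",
    instruction=(
        "Use the get_weather tool for weather questions. "
        "Hand greetings to greeting_agent and farewells to farewell_agent."
    ),
    tools=[get_weather],
    sub_agents=[greeting_agent, farewell_agent],
)
```

Because the root agent's LLM routes on the sub-agent descriptions, keeping those descriptions short and clearly non-overlapping is what makes the delegation reliable.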
Evaluation
and Deployment Path
Ensuring
agents behave reliably is crucial before deployment. ADK offers built-in evaluation
tools that allow developers to systematically assess performance by testing
execution paths and response quality against predefined datasets. These checks
can be run programmatically or via the ADK eval command-line tool or Web UI.
Once
satisfied with performance, ADK provides a clear path to production. Agents can
be containerized and deployed on various platforms, including Cloud Run
and Google Kubernetes Engine (GKE), or using Docker. For a fully managed,
scalable, and enterprise-grade runtime, ADK offers seamless integration with Vertex
AI Agent Engine on Google Cloud.
Optimized
for Google Cloud, but Versatile
While
offering flexibility across various tools and models, ADK is optimized for
seamless integration within the Google Cloud ecosystem, particularly with
Gemini models and Vertex AI. This integration allows developers to leverage
advanced Gemini capabilities and provides a native pathway to deploy on Vertex
AI.
Beyond
model integration, ADK enables agents to connect directly to enterprise systems
and data. This includes over 100 pre-built connectors, utilization of workflows
built with Application Integration, and access to data in systems like AlloyDB,
BigQuery, and NetApp, without requiring data duplication. Agents can also
securely tap into existing API investments managed through Apigee. This
comprehensive connectivity enhances ADK's power within the Google Cloud
environment.
Despite
this optimization, ADK is designed to be model and deployment agnostic. Through
integration with Vertex AI Model Garden and LiteLLM, it supports models from
providers like Anthropic, Meta, Mistral AI, and AI21 Labs. It also allows
integration with third-party libraries such as LangChain and CrewAI.
Open
Source and Community Driven
ADK
is released as an open-source framework, with its primary codebase
hosted on GitHub. The repository includes essential files like CHANGELOG.md,
CONTRIBUTING.md, LICENSE (Apache-2.0), and README.md, providing information and
guidelines for the project and community engagement.
Google
actively encourages contributions from the community, welcoming issue
reports, feature suggestions, documentation improvements, and code
contributions. There is already an active community, with a number of
contributors submitting changes. Stable releases are available weekly via pip,
while a development version can be installed directly from GitHub for access to
the latest changes (though this version may contain experimental features or
bugs).
Community
Perspectives and Comparisons
Initial
community reception of ADK includes significant praise for its excellent CLI
and developer-first tools (adk web, adk run, api_server), smooth building and
debugging process, and robust support for multiple model providers
beyond Google's, thanks to LiteLLM integration. Features like artifact
management for stateful agents and the AgentTool concept have also been
highlighted as useful additions. Deployment options, including the fully
managed Agent Engine and Cloud Run, are seen as providing appropriate levels of
control.
However,
community feedback also points out areas for potential improvement. Some
developers found initial setup steps cumbersome and perceived the different
agent types (LlmAgent, Sequential, Parallel, Loop) as potentially
over-engineered. The developer experience for evaluation and guardrails
is not consistently perceived as intuitive or smooth. Session state
management has been identified as a notable weak point, making it
challenging to work with effectively. A recurring concern is that the overall
developer experience seems optimized for advanced users, potentially posing a
higher barrier for beginners.
Compared
to other open-source frameworks:
LangChain/LangGraph: ADK is presented as a higher-level
framework potentially easier for simpler workflows, whereas LangGraph
offers lower-level, granular control, especially for persistence. ADK has
a built-in chat UI and monitoring, while LangChain is a widely adopted standard.
ADK integrates with LangChain tools, leveraging its extensive library.
CrewAI: Both facilitate complex agents.
ADK offers smoother integration with Google products, stronger enterprise
features like built-in RAG and connectors, and better deployment options
like Vertex AI Agent Engine. CrewAI focuses specifically on role-based multi-agent
collaboration.
AutoGen: Microsoft's AutoGen focuses on
swarms of specialized agents and multi-agent conversation frameworks. ADK
is a more general toolkit and uniquely offers built-in bidirectional
audio/video streaming. Some users prefer ADK.
OpenAI
Agents SDK: Both
are for multi-agent workflows. ADK supports multiple providers via LiteLLM
(including OpenAI models), while OpenAI Agents SDK focuses on native
integration with OpenAI endpoints. ADK includes artifact management.
Semantic
Kernel: A
model-agnostic SDK for building/orchestrating agents, with a focus on
enterprise integration. ADK integrates with Semantic Kernel via MCP and
A2A for cross-cloud collaboration. Semantic Kernel supports more languages
and has a plugin ecosystem.
LlamaIndex: Primarily focuses on connecting
LLMs to external data for Retrieval-Augmented Generation (RAG). ADK
integrates with LlamaIndex as a library to leverage its data
indexing/retrieval capabilities.
This
comparison highlights that ADK brings unique strengths, particularly its deep
integration with the Google Cloud ecosystem and the A2A protocol, while fitting
into a broader landscape of tools, each with its own focus and advantages.
Intended
Use Cases and Future Potential
Google
ADK is designed for a wide range of AI agent applications, from simple
automated tasks to complex, collaborative multi-agent systems. Its flexible
orchestration makes it suitable for applications requiring both deterministic
task execution and intelligent, context-aware routing. The ability to compose
specialized agents hierarchically is useful for breaking down intricate tasks.
The rich tool ecosystem and deployment flexibility broaden its applicability
across diverse scenarios. Built-in evaluation and emphasis on safety and
security make it suitable for applications where reliability and
trustworthiness are key.
Specific
potential applications mentioned include travel planning, retail automation,
and internal automation.
The
Agent Development Kit provides a robust, flexible, and open-source foundation
for building the next generation of AI applications. While community feedback
indicates areas for refinement, particularly in developer experience for
beginners and session state management, its strengths in modularity, Google
Cloud integration, and the innovative A2A protocol position it as a significant
tool in the evolving field of agentic AI. Its open-source nature and the
encouragement of community contributions suggest a promising path for future
development.
For
developers interested in exploring ADK, the official documentation and
quickstart guide are recommended starting points. Utilizing the developer Web
UI for local testing provides valuable hands-on experience. Considering
contributions to the GitHub project is also encouraged to engage with the
community and shape the framework's evolution. Developers with projects
requiring strong Google Cloud integration or leveraging Gemini models should
seriously evaluate ADK, while also considering other frameworks based on
specific project needs and infrastructure.
In the rapidly evolving world of
artificial intelligence, even seemingly small gestures of human courtesy
towards chatbots like ChatGPT come with a price tag. OpenAI CEO Sam Altman
recently revealed that users saying "please" and "thank you"
to the company's AI models is costing "tens of millions of dollars".
While the notion of politeness having a significant financial impact on a tech
giant might seem surprising, experts explain that this cost is a consequence of
how these powerful AI systems operate on an immense scale.
How AI Processes Language (And
Politeness)
Understanding the cost involves looking
into the technical underpinnings of AI chatbots. Large language models (LLMs)
like ChatGPT process text by breaking it down into smaller units called tokens.
These tokens can be words, parts of words, or even punctuation marks. When a
user inputs a prompt, the AI processes each token, requiring computational
resources like processing power and memory housed in massive data centers.
Generally, more tokens in a prompt require more computational resources.
Polite phrases such as
"please" and "thank you" typically add a small number of
tokens to a user's input, usually between two and four tokens in total.
According to OpenAI's API pricing, which charges based on token usage, the cost
per million tokens varies by model. For example, the GPT-3.5 Turbo model has an
input cost of $0.50 per million tokens. Based on this, adding three tokens for
politeness to a single prompt costs an exceedingly small amount – roughly
$0.0000015.
Scale: The Reason for the Millions
So, how does a cost of a fraction of a
cent per interaction balloon into "tens of millions of dollars"? The
answer lies in the sheer volume of daily usage. ChatGPT handles over one
billion queries daily. When a minuscule cost per interaction is multiplied by
billions of interactions, it accumulates into a substantial aggregate figure.
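As a rough back-of-the-envelope check, the sketch below multiplies the figures quoted above (GPT-3.5 Turbo input pricing, roughly three extra tokens, about one billion queries a day) by an assumed share of polite prompts; that 67 percent share borrows from the survey cited later in this article and is purely illustrative.

```python
# Back-of-the-envelope cost of polite tokens at pure API token prices.
PRICE_PER_TOKEN = 0.50 / 1_000_000   # USD per input token (GPT-3.5 Turbo)
EXTRA_TOKENS = 3                     # rough overhead of "please"/"thank you"
QUERIES_PER_DAY = 1_000_000_000      # reported daily query volume
POLITE_SHARE = 0.67                  # assumed share of polite prompts (illustrative)

per_prompt = EXTRA_TOKENS * PRICE_PER_TOKEN            # ~$0.0000015
per_day = QUERIES_PER_DAY * POLITE_SHARE * per_prompt  # ~$1,000
per_year = per_day * 365                               # well under $1 million

print(f"per polite prompt: ${per_prompt:.7f}")
print(f"per day:           ${per_day:,.0f}")
print(f"per year:          ${per_year:,.0f}")
```

Even with generous assumptions, token prices alone land far below "tens of millions," which is the gap the infrastructure costs discussed below help explain.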
Operating LLMs like ChatGPT requires a
vast infrastructure of data centers, high-performance servers, and specialized
processing chips (GPUs). These facilities consume substantial amounts of
energy. Estimates suggest the daily energy cost to run ChatGPT could be around
$700,000. A single query to GPT-4 is estimated to consume about 2.9 watt-hours
of electricity, significantly more than a standard search. Scaling this across
billions of daily queries results in millions of kilowatt-hours consumed daily.
Beyond electricity, these data centers also require significant water for
cooling systems. Generating a short 100-word email with GPT-4, for instance,
can use as much as 519 milliliters of water for cooling the servers involved.
The cost of processing polite language is embedded within this broader
framework of infrastructure, energy, and water expenses.
While some expert analyses, based purely
on token costs for models like GPT-3.5 Turbo, estimate the annual cost of
politeness to be significantly lower, around $146,000, Altman's statement of
"tens of millions" likely reflects the broader increase in
computational load and associated energy consumption across their vast
infrastructure. Even at "tens of millions," this figure represents a
non-negligible part of the overall operational costs for ChatGPT, which are
estimated to be in the hundreds of millions annually.
Why Users Are Polite, And Why It Might
Matter
Despite the cost, a significant portion
of users are polite to AI. A late 2024 survey found that 67 percent of US
respondents reported being nice to their chatbots. Among those, 55
percent said they did it "because it's the right thing to do,"
while 12 percent cited appeasing the algorithm, perhaps out of fear of a
future AI uprising. About two-thirds of people who are impolite said it was
for brevity.
Moreover, experts suggest that being
polite to AI might offer benefits beyond simple etiquette. Microsoft's design
manager Kurtis Beavers noted that using proper etiquette helps generate "respectful,
collaborative outputs," explaining that polite language "sets a
tone for the response". A Microsoft WorkLab memo added that generative AI
mirrors the levels of professionalism, clarity, and detail in user prompts.
Beavers also suggested that being polite ensures you get the same graciousness
in return and improves the AI's responsiveness and performance.
Research hints that polite and
well-phrased prompts could lead to higher quality and less biased AI outputs,
with one study finding a 9% improvement in AI accuracy. Additionally, current
interactions with AI are seen as contributing to the training data for future
models. Polite exchanges might help train AI to default towards helpfulness,
whereas curt interactions could reinforce transactional behavior, potentially
shaping the ethical frameworks of future AI. OpenAI CEO Sam Altman's comment
that the tens of millions spent on politeness were "well spent"
and his cryptic "you never know" remark could indicate that OpenAI
sees long-term strategic value in fostering these more natural interactions.
Contextualizing the Cost and
Environmental Impact
Ultimately, while politeness does add to
the computational load and contributes to energy consumption, the cost per
individual user remains minimal. The significant expense arises from the
aggregate effect across billions of interactions.
However, the discussion underscores a
broader issue: the substantial environmental footprint of AI. Data centers
powering AI already consume around 2 percent of the world's energy, a figure
projected to increase dramatically. Those "pleases" and
"thank yous," while seemingly small, contribute to this growing
energy demand. This reality has led some to suggest that for tasks like
writing a simple email, the most environmentally conscious choice might be to
bypass the chatbot entirely and write it yourself.
As AI becomes more integrated into daily
life, the balance between optimizing computational efficiency and fostering
positive, human-like interactions remains a key consideration. The debate over
the cost of politeness highlights this intersection of technical performance,
economic reality, environmental impact, and evolving human-AI relationships.
Last month, I stumbled across an article
about a new AI agent called Manus that was making waves in tech circles.
Developed by Chinese startup Monica, Manus promised something different from
the usual chatbots – true autonomy. Intrigued, I joined their waitlist without
much expectation.
Then yesterday, my inbox pinged with a
surprise: I'd been granted early access to Manus, complete with 1,000
complimentary credits to explore the platform. As someone who's tested every AI
tool from ChatGPT to Claude, I couldn't wait to see if Manus lived up to its
ambitious claims.
For context, Manus enters an
increasingly crowded field of AI agents. OpenAI released Operator in January,
Anthropic launched Computer Use last fall, and Google unveiled Project Mariner
in December. Each promises to automate tasks across the web, but Manus claims
to take autonomy further than its competitors.
This post shares my unfiltered
experience – what Manus is, how it works, where it shines, where it struggles,
and whether it's worth the hype. Whether you're considering joining the
waitlist or just curious about where AI agents are headed, here's my take on
being among the first to try this intriguing technology.
What Exactly Is Manus?
Manus (Latin for "hands")
launched on March 6th as what Monica calls a "fully autonomous AI
agent." Unlike conventional chatbots that primarily generate text within
their interfaces, Manus can independently navigate websites, fill forms,
analyze data, and complete complex tasks with minimal human guidance.
The name cleverly reflects its purpose –
to be the hands that execute tasks in digital spaces. It represents a
fundamental shift from AI that just "thinks" to AI that
"does."
Beyond Conversational AI
Traditional AI assistants like ChatGPT
excel at answering questions and generating content but typically can't take
action outside their chat interfaces. Manus bridges this gap by combining
multiple specialized AI models that work together to understand tasks, plan
execution steps, navigate digital environments, and deliver results.
According to my research, Manus uses a
combination of models including fine-tuned versions of Alibaba's open-source
Qwen and possibly components from Anthropic's Claude. This multi-model approach
allows it to handle complex assignments that would typically require human
intervention – from building simple websites to planning detailed travel
itineraries.
The Team Behind Manus
Monica (Monica.im) operates from Wuhan
rather than China's typical tech hubs like Beijing or Shanghai. Founded in 2022
by Xiao Hong, a graduate of Huazhong University of Science and Technology, the
company began as a developer of AI-powered browser extensions.
What started as a "ChatGPT for
Google" browser plugin evolved rapidly as the team recognized the
potential of autonomous agents. After securing initial backing from ZhenFund,
Monica raised Series A funding led by Tencent and Sequoia Capital China in
2023.
In an interesting twist, ByteDance
reportedly offered $30 million to acquire Monica in early 2024, but Xiao Hong
declined. By late 2024, Monica closed another funding round that valued the
company at approximately $100 million.
Current Availability
Manus remains highly exclusive. From
what I've gathered, less than 1% of waitlist applicants have received access
codes. The platform operates on a credit system, with tasks costing roughly $2
each. My 1,000 free credits theoretically allow for 500 basic tasks, though
complex assignments consume more credits.
Despite limited access, Manus has
generated considerable buzz. Several tech influencers have praised its
capabilities, comparing its potential impact to that of DeepSeek, another
Chinese AI breakthrough that surprised the industry last year.
How Manus Works
My first impression upon logging in was
that Manus offers a clean, minimalist interface. The landing page displays
previous sessions in a sidebar and features a central input box for task
descriptions. What immediately sets it apart is the "Manus's Computer"
viewing panel, which shows the agent's actions in real-time.
The Technical Approach
From what I've observed and researched,
Manus operates through several coordinated steps:
1. When you describe a task, Manus analyzes your request and breaks it into logical components.
2. It creates a step-by-step plan, identifying necessary tools and actions.
3. The agent executes this plan by navigating websites, filling forms, and analyzing information.
4. If it encounters obstacles, it attempts to adapt its approach.
5. Once complete, it delivers results in a structured format.
This process happens with minimal
intervention. Unlike chatbots that need continuous guidance, Manus works
independently after receiving initial instructions.
The User Experience
Using Manus follows a straightforward
pattern:
1. You describe your task in natural language.
2. Manus acknowledges and may ask clarifying questions.
3. The agent begins working, with its actions visible in the viewing panel.
4. For complex tasks, it might provide progress updates.
5. Upon completion, it delivers downloadable results in various formats.
One valuable feature is Manus's
asynchronous operation. Once a task begins, it continues in the cloud, allowing
you to disconnect or work on other things. This contrasts with some competing
agents that require constant monitoring.
Pricing Structure
Each task costs approximately $2 worth
of credits, though I've noticed complex tasks consume more. For instance, a
simple research assignment used 1 credit, while a detailed travel itinerary
planning task used 5 credits.
At current rates, regular use would
represent a significant investment. Whether this cost is justified depends
entirely on how much you value the time saved and the quality of results.
Limitations and Safeguards
Like all AI systems, Manus has
constraints. It cannot bypass paywalls or complete CAPTCHA challenges without
assistance. When encountering these obstacles, it pauses and requests
intervention.
The system also includes safeguards
against potentially harmful actions. It won't make purchases or enter payment
information without explicit confirmation and avoids actions that might violate
terms of service.
How Manus Compares to Competitors
The AI agent landscape has become
increasingly competitive, with major players offering their own solutions.
Based on my testing and research, here's how Manus stacks up:
Performance Benchmarks
Manus reportedly scores around 86.5% on
the General AI Assistants (GAIA) benchmark, though these figures remain
partially unverified. For comparison:
OpenAI's Operator achieves 38.1% on
OSWorld (testing general computer tasks) and 87% on WebVoyager (testing
browser-based tasks)
Anthropic's Computer Use scores 22.0%
on OSWorld and 56% on WebVoyager
Google's Project Mariner scores 83.5%
on WebVoyager
For context, human performance on
OSWorld is approximately 72.4%, indicating that even advanced AI agents still
fall short of human capabilities in many scenarios.
Key Differentiators
From my experience, Manus's most
significant advantage is its level of autonomy. While all these agents perform
tasks with some independence, Manus requires less intervention:
Manus operates asynchronously in the
cloud, allowing you to focus on other activities
Operator requires confirmation before
finalizing tasks with external effects
Computer Use frequently needs
clarification during execution
Project Mariner often pauses for
guidance and requires users to watch it work
Manus also offers exceptional
transparency through its viewing panel, allowing you to observe its process in
real-time. This builds trust and helps you understand how the AI approaches
complex tasks.
Regarding speed, the picture is mixed.
Manus can take 30+ minutes for complex tasks but works asynchronously. Operator
is generally faster but still significantly slower than humans. Computer Use
takes numerous steps for simple actions, while Project Mariner has noticeable
delays between actions.
Manus stands out for global
accessibility, supporting multiple languages including English, Chinese
(traditional and simplified), Russian, Ukrainian, Indonesian, Persian, Arabic,
Thai, Vietnamese, Hindi, Japanese, Korean, and various European languages. In
contrast, Operator is currently limited to ChatGPT Pro subscribers in the
United States.
The business models also differ
significantly. Manus uses per-task pricing at approximately $2 per task, while
Operator is included in the ChatGPT Pro subscription ($200/month). Computer Use
and Project Mariner's pricing models are still evolving.
Challenges Relative to Competitors
Despite its advantages, Manus faces
several challenges:
System stability issues, with
occasional crashes during longer tasks
Limited availability compared to
competitors
As a product from a relatively small
startup, it lacks the resources of tech giants backing competing agents
My Hands-On Experience
After receiving my access code
yesterday, I've tested Manus on various tasks of increasing complexity. Here's
what I've found:
Tasks I've Attempted
Research Task: Compiling a list of
top AI research papers from 2024 with summaries
Content Creation: Creating a
comparison table of electric vehicles with specifications
Data Analysis: Analyzing trends in a
spreadsheet of sales data
Travel Planning: Developing a one-week
Japan itinerary based on my preferences
Technical Task: Creating a simple
website portfolio template
Successes and Highlights
Manus performed impressively on several
tasks. The research assignment was particularly successful – Manus navigated
academic databases efficiently, organized information logically, and delivered
a well-structured document with proper citations.
For the electric vehicle comparison, it
created a detailed table with accurate, current information by navigating
multiple manufacturer websites. This would have taken me hours to compile
manually.
The travel planning task showcased
Manus's coordination abilities. It researched flights, suggested
accommodations at various price points, and created a day-by-day itinerary
respecting my preferences for cultural experiences and outdoor activities. It
even included estimated costs and transportation details.
Watching Manus work through the viewing
panel was fascinating. The agent demonstrated logical thinking, breaking
complex tasks into manageable steps and adapting when encountering obstacles.
Limitations and Frustrations
Despite these successes, Manus wasn't
without struggles. The data analysis task revealed limitations – while it
identified basic trends, its analysis lacked the depth a human analyst would
provide. The visualizations were functional but basic.
The website creation task encountered
several hiccups. Manus created a basic HTML/CSS structure but struggled with
complex responsive design elements. The result was usable but would require
significant refinement.
I experienced two system crashes during
longer tasks, requiring me to restart. In one case, Manus lost progress on a
partially completed task, which was frustrating.
When Manus encountered paywalls or
CAPTCHA challenges, it appropriately paused for intervention. While necessary,
this interrupted the otherwise autonomous workflow.
Overall User Experience
The interface is clean and intuitive,
and the viewing panel provides valuable transparency. Task results are
well-organized and easy to download. The asynchronous operation is particularly
valuable, allowing me to focus on other activities while Manus works.
However, load times can be lengthy,
especially for complex tasks. Occasional stability issues interrupt the
workflow, and the system sometimes struggles with nuanced instructions. There's
also limited ability to intervene once a task is underway.
Final Thoughts
After my initial day with Manus, I'm
cautiously optimistic about its potential. The agent demonstrates impressive
capabilities that genuinely save time on certain tasks. The research, content
creation, and planning functions are particularly strong.
However, stability issues, variable
performance across task types, and occasional need for human intervention
prevent Manus from being the truly autonomous assistant it aspires to be. It's
a powerful tool but one that still requires oversight and occasional course
correction.
The 1,000 free credits provide ample
opportunity to explore Manus's capabilities without immediate cost concerns.
Based on my usage, these should last several weeks with moderate use.
For early adopters and those with
specific use cases aligned with Manus's strengths, the value proposition is
compelling despite the $2 per-task cost. For professionals whose time is
valuable, the hours saved could easily justify the expense.
However, for general users or those with
tighter budgets, the current limitations and cost structure might make Manus a
luxury rather than a necessity.
As Manus evolves in response to user
feedback and competitive pressures, I expect many current limitations to be
addressed. The foundation is strong, and if Monica can improve stability and
refine capabilities in weaker areas, Manus could become an indispensable
productivity tool.
The autonomous AI revolution is just
beginning, and Manus represents one of its most intriguing early
manifestations. Whether it ultimately leads the field or serves as a stepping
stone to more capable systems remains to be seen, but its contribution to advancing
autonomous AI is already significant.
I'll continue experimenting with my
remaining credits, focusing on tasks where Manus excels, and will likely share
updates as I discover more about this fascinating technology.
As artificial
intelligence continues to grow more advanced—especially with the rapid rise of
Large Language Models (LLMs)—there’s been a persistent roadblock: how to
connect these powerful AI models to the massive range of tools, databases, and
services in the digital world without reinventing the wheel every time.
Traditionally,
every new integration—whether it's a link to an API, a business application, or
a data repository—has required its own unique setup. These one-off,
custom-built connections are not only time-consuming and expensive to develop,
but also make it incredibly hard to scale up when things evolve. Imagine trying
to build a bridge for every single combination of AI model and tool. That’s
what developers have been facing—what many call the "N×M problem":
integrating N different LLMs with M different tools requires N × M individual
solutions. Not ideal.
That’s where
the Model Context Protocol (MCP) steps in. Introduced by Anthropic in
late 2024, MCP is an open standard designed to simplify and standardize how AI
models connect to the outside world. Think of it as the USB-C of AI—one
universal plug that can connect to almost anything. Instead of developers
building custom adapters for every new tool or data source, MCP provides a
consistent, secure way to bridge the gap between AI and external systems.
Why
Integration Used to Be a Mess
Before MCP, AI
integration was like trying to wire a house with dozens of different plugs,
each needing a special adapter. Every tool—whether it's a database or a piece
of enterprise software—needed to be individually wired into the AI model. This
meant developers spent countless hours creating one-off solutions that were
hard to maintain and even harder to scale. As AI adoption grew, so did the
complexity and the frustration.
This fragmented
approach didn’t just slow things down—it also prevented different systems from
working together smoothly. There wasn’t a common language or structure, making
collaboration and reuse of integration tools nearly impossible.
MCP: A
Smarter Way to Connect AI
Anthropic
created MCP to bring some much-needed order to the chaos. The protocol lays out
a standard framework that lets applications pass relevant context and data to
LLMs while also allowing those models to tap into external tools when needed.
It’s designed to be secure, dynamic, and scalable. With MCP, LLMs can interact
with APIs, local files, business applications—you name it—all through a
predictable structure that doesn’t require starting from scratch.
How MCP Is
Built: Hosts, Clients, and Servers
The MCP
framework works using a three-part architecture that will feel familiar to
anyone with a background in networking or software development:
MCP Hosts are the AI-powered applications or
agents that need access to outside data—think tools like Claude Desktop or
AI-powered coding environments like Cursor.
MCP Clients live inside these host
applications and handle the job of talking to MCP servers. They manage the
back-and-forth communication, relaying requests and responses.
MCP Servers are lightweight programs that make
specific tools or data available through the protocol. These could connect
to anything from a file system to a web service, depending on the need.
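To ground the host/client/server split, here is a minimal client-side sketch using the official Python SDK's stdio transport. The server script name and tool name are hypothetical, and exact helper names may vary between SDK releases.

```python
# Minimal MCP client: spawn a local server over stdio and call one of its tools.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

server_params = StdioServerParameters(
    command="python",
    args=["weather_server.py"],  # hypothetical MCP server script
)

async def main() -> None:
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()          # protocol handshake
            tools = await session.list_tools()  # discover what the server exposes
            print([tool.name for tool in tools.tools])
            result = await session.call_tool(
                "get_forecast", arguments={"city": "Waterloo"}  # hypothetical tool
            )
            print(result)

asyncio.run(main())
```

Here the script plays the role of the host, ClientSession is the MCP client, and the spawned process is the MCP server.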
What MCP Can
Do: The Five Core Features
MCP enables
communication through five key features—simple but powerful building blocks
that allow AI to do more without compromising structure or security:
Prompts – These are instructions or
templates the AI uses to shape how it tackles a task. They guide the model
in real-time.
Resources – Think of these as reference
materials—structured data or documents the AI can “see” and use while
working.
Tools – These are external functions the
AI can call on to fetch data or perform actions, like running a database
query or generating a report.
Roots – A secure method for accessing
local files, allowing the AI to read or analyze documents without full,
unrestricted access.
Sampling – This allows the external systems
(like the MCP server) to ask the AI for help with specific tasks, enabling
two-way collaboration.
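The server side of those building blocks can be sketched with the SDK's FastMCP helper; the server name, tool, and resource below are made-up examples, assuming the decorator-based API from the official Python SDK.

```python
# weather_server.py -- a toy MCP server exposing one tool and one resource.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("weather-demo")  # hypothetical server name

@mcp.tool()
def get_forecast(city: str) -> str:
    """Return a canned forecast for the given city (stand-in for a real API call)."""
    return f"Forecast for {city}: sunny, 24 C"

@mcp.resource("docs://weather/readme")
def readme() -> str:
    """Reference material the model can read while working."""
    return "This server exposes simple demo weather data over MCP."

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```

Any MCP-aware host, whether Claude Desktop, an IDE, or the client sketch shown earlier, can then discover get_forecast and the readme resource without bespoke glue code.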
Unlocking
the Potential: Advantages of MCP
The adoption of
MCP offers a multitude of benefits compared to traditional integration methods.
It provides universal access through a single, open, and standardized
protocol. It establishes secure, standardized connections, replacing ad
hoc API connectors. MCP promotes sustainability by fostering an
ecosystem of reusable connectors (servers). It enables more relevant AI
by connecting LLMs to live, up-to-date, context-rich data. MCP offers unified
data access, simplifying the management of multiple data source
integrations. Furthermore, it prioritizes long-term maintainability,
simplifying debugging and reducing integration breakage. By offering a
standardized "connector," MCP simplifies AI integrations, potentially
granting an AI model access to multiple tools and services exposed by a single
MCP-compliant server. This eliminates the need for custom code for each tool or
API.
MCP in
Action: Applications Across Industries
The potential
applications of MCP span a wide range of industries. It aims to establish
seamless connections between AI assistants and systems housing critical data,
including content repositories, business tools, and development environments.
Several prominent development tool companies, including Zed, Replit,
Codeium, and Sourcegraph, are integrating MCP into their platforms to
enhance AI-powered features for developers. AI-powered Integrated Development
Environments (IDEs) like Cursor are deeply integrating MCP to provide
intelligent assistance with coding tasks. Early enterprise adopters like Block
and Apollo have already integrated MCP into their internal systems. Microsoft's
Copilot Studio now supports MCP, simplifying the incorporation of AI
applications into business workflows. Even Anthropic's Claude Desktop
application has built-in support for running local MCP servers.
A
Collaborative Future: Open Source and Community Growth
MCP was
officially released as an open-source project by Anthropic in November 2024.
Anthropic provides comprehensive resources for developers, including the
official specification and Software Development Kits (SDKs) for various
programming languages like TypeScript, Python, Java, and others. An open-source
repository for MCP servers is actively maintained, providing developers with
reference implementations. The open-source nature encourages broad
participation from the developer community, fostering a growing ecosystem of
pre-built, MCP-enabled connectors and servers.
Navigating
the Challenges and Looking Ahead
While MCP holds
immense promise, it is still a relatively recent innovation undergoing
development and refinement. The broader ecosystem, including robust security
frameworks and streamlined remote deployment strategies, is still evolving.
Some client implementations may have current limitations, such as the number of
tools they can effectively utilize. Security remains a paramount consideration,
requiring careful implementation of visibility, monitoring, and access
controls. Despite these challenges, the future outlook for MCP is bright. As
the demand for AI applications that seamlessly interact with the real world
grows, the adoption of standardized protocols like MCP is likely to increase
significantly. MCP has the potential to become a foundational standard in AI
integration, similar to the impact of the Language Server Protocol (LSP)
in software development.
A Smarter,
Simpler Future for AI Integration
The Model
Context Protocol represents a significant leap forward in simplifying the
integration of advanced AI models with the digital world. By offering a
standardized, open, and flexible framework, MCP has the potential to unlock a
new era of more capable, context-aware, and beneficial AI applications across
diverse industries. The collaborative, open-source nature of MCP, coupled with
the support of key players and the growing enthusiasm within the developer
community, points towards a promising future for this protocol as a cornerstone
of the evolving AI ecosystem.
Google's
Gemini ecosystem has expanded its capabilities with the introduction of Gemini
Deep Research, a sophisticated feature designed to revolutionize how users
conduct in-depth investigations online. Moving beyond the limitations of traditional search
engines, Deep Research acts as a virtual research assistant,
autonomously navigating the vast expanse of the internet to synthesize complex
information into coherent and insightful reports. This AI-powered tool promises
to significantly enhance research efficiency and provide valuable insights
across diverse domains for professionals, researchers, and individuals seeking
a deeper understanding of complex subjects.
Unpacking
Gemini Deep Research: Your Personal AI Research Partner
Gemini Deep
Research is integrated within the Gemini Apps, offering users a specialized
feature for comprehensive and real-time research on virtually any topic. It
operates as a personal AI research assistant, going beyond basic
question-answering to automate web browsing, information analysis, and
knowledge synthesis. The core objective is to significantly reduce the time
and effort typically associated with in-depth research, empowering users to
gain a thorough understanding of complex subjects much faster than with
conventional methods.
Unlike
traditional search methods that require users to manually navigate numerous
tabs and piece together information, Deep Research streamlines this process
autonomously. It navigates and analyzes potentially hundreds of websites,
thoughtfully processes the gathered information, and generates insightful,
multi-page reports. Many reports also offer an Audio Overview feature,
enhancing accessibility by allowing users to stay informed while multitasking.
This combination of autonomous research and accessible output formats sets
Gemini Deep Research apart from standard chatbots.
The
Mechanics of Deep Research: From Prompt to Insightful Report
Engaging with
Gemini Deep Research is designed to be intuitive, accessible through the Gemini
web or mobile app. The process begins with the user entering a clear and
straightforward research prompt. The system understands natural language,
eliminating the need for specialized prompting techniques.
Upon receiving
a prompt, Gemini Deep Research generates a detailed research plan tailored
to the specific topic. Importantly, users have the opportunity to review
and modify this plan before the research begins, allowing for targeted
investigation aligned with their specific objectives. Users can suggest
alterations and provide additional instructions using natural language.
Once the plan
is finalized, Deep Research autonomously searches and deeply browses the web
for relevant and up-to-date information, potentially analyzing hundreds of
websites. Transparency is maintained through options like "Sites
browsed," which lists the utilized websites, and "Show
thinking," which reveals the AI's steps.
A crucial
aspect is the AI's ability to engage in iterative reasoning and thoughtful
analysis of the gathered information. It continuously evaluates findings,
identifies key themes and patterns, and employs multiple passes of self-critique
to enhance the clarity, accuracy, and detail of the final report.
The culmination
is the generation of comprehensive and customized research reports
within minutes, depending on the topic's complexity. These reports often
include an Audio Overview and can be easily exported to Google Docs,
preserving formatting and citations. Clear citations and direct links to
original sources are always included, ensuring transparency and
facilitating easy verification.
Under the
Hood: Powering Deep Research
Gemini Deep
Research harnesses the power of Google's advanced Gemini models.
Initially powered by Gemini 1.5 Pro, known for its ability to process
large amounts of information, Deep Research was subsequently upgraded to the Gemini
2.0 Flash Thinking Experimental model. This "thinking model"
enhances reasoning by breaking down complex problems into smaller steps,
leading to more accurate and insightful responses.
At its core,
Deep Research operates as an agentic system, autonomously breaking down
complex problems into actionable steps based on a detailed, multi-step
research plan. This planning is iterative, with the model constantly
evaluating gathered information.
Given the
long-running nature of research tasks involving numerous model calls, Google
has developed a novel asynchronous task manager. This system maintains a
shared state, enabling graceful error recovery without restarting the entire
process and allowing users to return to results at their convenience.
To manage the
extensive information processed during a research session, Deep Research
leverages Gemini's large context window (up to 1 million tokens for
Gemini Advanced users). This is complemented by Retrieval-Augmented
Generation (RAG), allowing the system to effectively "remember"
information learned during a session, becoming increasingly context-aware.
The Gemini
models are trained on a massive and diverse multimodal and multilingual
dataset. This includes web documents, code, images, audio, and video.
Instruction tuning and human preference data ensure the models effectively
follow complex instructions and align with human expectations for quality.
Gemini 1.5 Pro utilizes a sparse Mixture-of-Experts (MoE) architecture
for increased efficiency and scalability.
Diverse
Applications Across Industries and Research
Gemini Deep
Research offers a wide range of applications, demonstrating its versatility.
Business Intelligence and Market
Analysis:
Competitive analysis, due diligence, identifying market trends.
Academic and Scientific Research: Literature reviews, summarizing
research papers, hypothesis generation.
Healthcare and Medical Research: Assisting in radiology reports,
summarizing health information, answering clinical questions, analyzing
medical images and genomic data.
Education: Lesson planning, grant writing,
creating assessment materials, supporting student research and
understanding.
Real-world
examples include planning home renovations, researching vehicles, analyzing
business propositions, benchmarking marketing campaigns, analyzing economic
downturns, researching product manufacturing, exploring interstellar travel
possibilities, researching game trends, assisting in coding, and conducting
biographical analysis. Industry-specific uses include accounting associations
analyzing tax reforms, professional development identifying skill gaps,
regulatory bodies assessing the impact of new regulations, and healthcare
streamlining radiology reports and summarizing patient histories.
The utility of
Deep Research is further enhanced by its integration with other Google tools
like Google Docs and NotebookLM, facilitating editing, collaboration, and
in-depth data analysis. The Audio Overview feature provides added
accessibility.
Navigating
the Competitive Landscape
Comparisons
with other AI platforms highlight Gemini Deep Research's unique strengths.
Gemini Deep Research vs. ChatGPT: Gemini excels in
research-intensive tasks and image analysis, focusing on verifiable facts.
ChatGPT is noted for creative writing and contextual explanations. User
experience preferences vary.
Gemini Deep Research vs. Grok: Grok is designed for real-time
data analysis and IT operations, with strong integration with the X
platform. Gemini offers broader research applications and handles diverse
data types.
Gemini Deep Research vs. DeepSeek: DeepSeek is strong in generating
structured and technically detailed responses, particularly for
programming and technical content. Gemini has shown superior overall
versatility and accuracy across a wider range of prompts and offers native
multimodal support.
Table 1: Comparison of Gemini Deep Research with Other AI Platforms (a side-by-side comparison across key features)

| Feature | Gemini Deep Research | ChatGPT Deep Research | Grok | DeepSeek |
| --- | --- | --- | --- | --- |
| Multimodal Input | Yes (Text, Images, Audio, Video) | Yes (Text, Images, PDFs) | No (Primarily Text) | No (Primarily Text) |
| Real-time Search | Yes (Uses Google Search) | Yes (Uses Bing) | Yes (Real-time data analysis, integrates with X) | Yes |
| Citation Support | Yes (Inline and Works Cited) | Yes (Inline and Separate List) | Yes | Yes |
| Planning | Yes (User-Reviewable Plan) | Yes | No Explicit Planning Mentioned | No Explicit Planning Mentioned |
| Reasoning | Advanced (Iterative, Self-Critique) | Advanced | Strong (Focus on real-time data) | Strong (Technical Reasoning) |
| Strengths | Research-heavy tasks, Image Analysis, Google Ecosystem Integration | | | |
| Pricing | Free (for some models), Paid (for advanced models) | | | |
The Future
Trajectory: Impact and Anticipated Enhancements
Gemini Deep
Research has the potential to fundamentally transform research across
various disciplines by automating information gathering, analysis, and
synthesis, leading to significant increases in efficiency and productivity. It
represents a step towards a future where AI actively collaborates in the
research lifecycle.
Future
developments aim to provide users with greater control over the browsing
process and expand information sources beyond the open web.
Continuous improvements in quality and efficiency are expected with the
integration of newer Gemini models. Deeper integration with other Google
applications will enable more personalized and context-aware responses.
Features like Audio Overview and personalization based on search history
indicate a trend towards a more integrated and user-centric research
experience.
Democratizing
In-Depth Analysis
Gemini Deep
Research is a powerful and evolving tool offering a sophisticated
approach to information retrieval and analysis. Its core capabilities in
autonomous web searching, iterative reasoning, and comprehensive report
generation have the potential to significantly enhance research efficiency
across numerous industries and academic fields. By providing user control and
delivering well-cited, synthesized information, Gemini Deep Research empowers
users to gain deeper insights and make more informed decisions. As the
technology advances, its role in the future of research and knowledge discovery
is poised to become increasingly significant, democratizing access to
in-depth analysis and accelerating the pace of innovation.