What an amazing feeling to see
something I created get recognized! I'm delighted to share that my AI-generated
art piece was selected as one of the 50 winners out of over 420 entries in our organization's recent Visions of the Kick-Off AI Art Competition.
The competition, dubbed
"Visions of the Kick-Off," was launched as part of OpenText's
Kick-Off (OKO) Event this year. It was a creative challenge extended to all
employees globally, inviting us to unleash our imaginations and showcase our vision
of the event through AI. The core theme was to create an AI-generated image
inspired by OKO, whether it reflected the energy, company priorities, or
memorable insights from the speakers.
Participating was straightforward
but required thought. The primary rule was that the image had to be fully
AI-generated based solely on a text prompt. We had the flexibility to use tools
like Copilot or any other AI image generator, though Copilot Chat in Teams was
highlighted as an internal option available to all employees. Beyond the
technicalities, there was a crucial emphasis on ensuring the image was
respectful, appropriate, and not offensive, given that entries would be
displayed for all OpenTexters to view. We were also limited to one entry per
employee, and it was essential to submit the exact prompt(s) used along with
the image. The deadline was firmly set for 7 PM ET on July 10, 2025.
The judging process was
comprehensive, with a panel reviewing submissions against several key criteria.
My focus, and indeed what the judges were looking for, centered on:
Creativity & originality:
Was the visual imaginative and unique, offering a fresh perspective?
Relevance to OKO: Was the
connection to the event clear?
Prompt quality: Was the prompt
well-crafted, thoughtful, and effective in guiding the AI?
Visual impact: Was the image
visually striking or engaging?
Knowing these criteria certainly
guided my approach to crafting the prompt and refining the image. The judging
panel expressed that they were "blown away by the talent and imagination
on display", which truly speaks to the high caliber of submissions.
My AI-generated art piece that earned a spot among the winners of the recently concluded Visions of the Kick-Off AI Art Competition.
When the announcement came on July
29, 2025, revealing the 50 winners, I was astonished. It's indeed gratifying to
have my work recognized. As a winner, my piece, along with the other winning
entries, will be featured in a special virtual gallery and celebrated across
company channels. Plus, there's some "amazing swag" headed my way!
This initiative has not only been
a fun challenge but also a powerful demonstration of how creativity can truly
shine when paired with technology. It's inspiring to see so many colleagues
embrace AI as a tool for artistic expression. This experience has reinforced
the exciting possibilities that emerge when we dare to imagine and create using
the AI tools we have today.
Google is once again redefining the boundaries of digital creativity. Its Gemini platform now lets users transform ordinary still images/photos into short, animated video clips, complete with sound. This fresh capability, revealed by David Sharon, who leads Multimodal Generation for Gemini Apps, is powered by the company’s latest video model, Veo 3.
How It Works
Breathing life into a static photo might sound like something out of a sci-fi movie, but with Gemini, the process feels intuitive and fun. Inside the Gemini interface, users can head over to the prompt area and select the “Videos” option. Once a photo is uploaded, all that’s left to do is describe what the scene should look like in motion, and optionally, suggest accompanying audio.
That’s all it takes. A few inputs later, your snapshot evolves into an eight-second animated video. Whether you're reimagining a childhood drawing or adding motion to a scenic photo from a recent hike, the possibilities feel nearly limitless. Finished videos can be downloaded or shared instantly with friends and family.
The AI Engine Behind the Art
Under the hood, all of this is made possible by Veo 3, Google's advanced video-generation engine. Introduced in May, this model is already making waves. It recently became available to Google AI Pro users across more than 150 countries.
And users are clearly loving it. In just the past seven weeks, over 40 million videos have been created using Veo 3 (both within Gemini and Flow -- Google’s AI-powered storytelling tool). People are using it to do everything from reimagining classic fairy tales with a modern spin to building ASMR experiences around nature’s most mesmerizing sounds.
Where and How to Try It
The photo-to-video feature is currently rolling out to Gemini AI Pro and Ultra users in select countries. Curious users can check it out by visiting gemini.google.com. The same tools are also available in Flow, which is more tailored for creators working on longer or more cinematic projects.
Built With Safety in Mind
As with all of Google’s AI innovations, the launch of this feature comes with a focus on responsibility and safety. Behind the scenes, the tech giant is running continuous “red teaming” simulations, essentially stress tests designed to catch problems before they reach real users.
Each AI-generated video is clearly marked with a visible watermark to indicate it was created by artificial intelligence. Additionally, every file includes a SynthID digital signature -- Google’s invisible watermarking system designed for traceability.
And user feedback is more than welcome. With a quick thumbs-up or thumbs-down on each video, creators can share their impressions. This feedback loop helps Google continuously fine-tune the experience and maintain high standards of safety.
This feature is more than just a novelty; it’s a glimpse into what the future of personal storytelling could look like. By giving users the ability to animate their memories, drawings, or ideas with just a few prompts, Google is turning imagination into a playable format. Whether you’re an artist, a content creator, or just someone curious to explore what AI can do, Gemini now offers a platform filled with limitless potential.
One of the highlights revealed during the Google
Cloud NEXT 2025 was the
Agent Development Kit (ADK), a new open-source
framework designed to streamline the creation and deployment of intelligent,
autonomous AI agents and advanced multi-agent systems. As the AI
landscape evolves beyond single-purpose models, building coordinated teams of
agents presents new challenges, which ADK aims to solve by providing a
full-stack, end-to-end development solution.
The
framework is the same one that powers agents within Google's own products,
including Agentspace and the Google Customer Engagement Suite (CES). By
open-sourcing ADK, Google intends to empower developers with powerful and
flexible tools for building in the rapidly changing agentic AI space. ADK is
built with flexibility in mind, supporting different models and deployment
environments, and designed to make agent development feel similar to
traditional software development.
Core
Pillars Guiding Development
ADK
provides capabilities across the entire agent development lifecycle. Its core
principles emphasize flexibility, modularity, and precise control over agent
behavior and orchestration. Key pillars include:
Multi-Agent
by Design:
Facilitates building modular and scalable applications by composing
specialized agents hierarchically, enabling complex coordination and
delegation.
Rich
Model Ecosystem:
Allows developers to choose the best model for their needs, integrating
with Google's Gemini models, Vertex AI Model Garden, and through LiteLLM,
a wide selection of models from providers like Anthropic, Meta, and
Mistral AI.
Rich
Tool Ecosystem:
Agents can be equipped with diverse capabilities using pre-built tools
(like Search or Code Execution), custom Function Tools, integrating
3rd-party libraries (like LangChain or LlamaIndex), or even using other
agents as tools via AgentTool. ADK also supports Google Cloud
tools, MCP tools, and those defined by OpenAPI specifications.
Built-in
Streaming: Enables
human-like conversations with agents through bidirectional audio and video
streaming capabilities, moving beyond text-only interactions into rich,
multimodal dialogue with just a few lines of code.
Flexible
Orchestration:
Workflows can be defined using Workflow Agents such as Sequential,
Parallel, and Loop agents for predictable pipelines, or by leveraging
LLM-driven dynamic routing via LlmAgent transfer for more adaptive
behavior (a minimal workflow sketch follows this list).
Integrated
Developer Experience:
Offers tools for local development, testing, and debugging, including a
powerful CLI, a visual Web UI to inspect execution step-by-step, and an
API Server.
Built-in
Evaluation:
Provides systematic assessment of agent performance by evaluating both
final response quality and the step-by-step execution against predefined
test cases.
Easy
Deployment: Agents
can be containerized and deployed anywhere, with seamless integration
options for Vertex AI Agent Engine, a fully managed, scalable, and
enterprise-grade runtime on Google Cloud.
Additional
core concepts include Sessions & Memory for managing conversational
context and agent state, Artifacts for handling files and binary data
tied to a session, and Callbacks for customizing behavior at specific
execution points. The Runtime manages execution, tracked via Events,
with Context providing relevant information.
A
notable architectural component is the Agent2Agent (A2A) Protocol, an
open protocol designed to facilitate communication and collaboration between
agents across different platforms and frameworks, promoting a more
interconnected ecosystem.
Getting
Started and Building with ADK
Building
a basic agent with ADK is designed for Pythonic simplicity, potentially
requiring fewer than 100 lines of code. Developers define an agent's logic,
tools, and information processing. ADK provides the structure for managing
state, orchestrating tool calls, and interacting with underlying LLMs.
The
framework primarily supports the Python programming language, with
future plans for additional language support. Installation is straightforward
using pip. The documentation includes quickstart guides and examples, such as a
simple "question_answer_agent" using a Google Search tool.
For
a more integrated experience, ADK includes a developer Web UI that can
be launched locally via a CLI command (adk web), providing a user-friendly
interface for running, testing, and debugging agents. Agents can also be run
and tested via the CLI (adk run) or accessed through an api_server.
ADK
truly excels in building multi-agent applications. This involves
creating teams of specialized agents that can delegate tasks based on
conversation context, utilizing hierarchical structures and intelligent
routing. An illustrative example shows a WeatherAgent delegating simple
greetings and farewells to specialized GreetingAgent and FarewellAgent
instances, while handling weather queries itself using a get_weather tool. Delegation
relies heavily on clear, distinct descriptions of each agent's capability,
which the LLM uses to route tasks effectively.
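A hedged sketch of that delegation pattern is shown below. It assumes the same Agent class plus a sub_agents parameter for hierarchical composition; the get_weather tool is a hypothetical stand-in that returns canned data rather than calling a real service.

```python
from google.adk.agents import Agent

def get_weather(city: str) -> dict:
    """Hypothetical tool: return canned weather data for a city."""
    return {"city": city, "forecast": "sunny", "temperature_c": 24}

greeting_agent = Agent(
    name="greeting_agent",
    model="gemini-2.0-flash",
    description="Handles simple greetings and hellos.",  # description drives routing
    instruction="Greet the user warmly and briefly.",
)

farewell_agent = Agent(
    name="farewell_agent",
    model="gemini-2.0-flash",
    description="Handles goodbyes and farewells.",
    instruction="Say a short, polite goodbye.",
)

root_agent = Agent(
    name="weather_agent",
    model="gemini-2.0-flash",
    description="Answers weather questions; delegates greetings and farewells.",
    instruction=(
        "Use the get_weather tool for weather questions. "
        "Hand greetings to greeting_agent and farewells to farewell_agent."
    ),
    tools=[get_weather],
    sub_agents=[greeting_agent, farewell_agent],
)
```

Because the root agent's LLM routes on the sub-agent descriptions, keeping those descriptions short and clearly non-overlapping is what makes the delegation reliable.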
Evaluation
and Deployment Path
Ensuring
agents behave reliably is crucial before deployment. ADK offers built-in evaluation
tools that allow developers to systematically assess performance by testing
execution paths and response quality against predefined datasets. These checks
can be run programmatically or via the ADK eval command-line tool or Web UI.
Once
satisfied with performance, ADK provides a clear path to production. Agents can
be containerized and deployed on various platforms, including Cloud Run
and Google Kubernetes Engine (GKE), or using Docker. For a fully managed,
scalable, and enterprise-grade runtime, ADK offers seamless integration with Vertex
AI Agent Engine on Google Cloud.
Optimized
for Google Cloud, but Versatile
While
offering flexibility across various tools and models, ADK is optimized for
seamless integration within the Google Cloud ecosystem, particularly with
Gemini models and Vertex AI. This integration allows developers to leverage
advanced Gemini capabilities and provides a native pathway to deploy on Vertex
AI.
Beyond
model integration, ADK enables agents to connect directly to enterprise systems
and data. This includes over 100 pre-built connectors, utilization of workflows
built with Application Integration, and access to data in systems like AlloyDB,
BigQuery, and NetApp, without requiring data duplication. Agents can also
securely tap into existing API investments managed through Apigee. This
comprehensive connectivity enhances ADK's power within the Google Cloud
environment.
Despite
this optimization, ADK is designed to be model and deployment agnostic. Through
integration with Vertex AI Model Garden and LiteLLM, it supports models from
providers like Anthropic, Meta, Mistral AI, and AI21 Labs. It also allows
integration with third-party libraries such as LangChain and CrewAI.
Open
Source and Community Driven
ADK
is released as an open-source framework, with its primary codebase
hosted on GitHub. The repository includes essential files like CHANGELOG.md,
CONTRIBUTING.md, LICENSE (Apache-2.0), and README.md, providing information and
guidelines for the project and community engagement.
Google
actively encourages contributions from the community, welcoming issue
reports, feature suggestions, documentation improvements, and code
contributions. There is already an active community, with a number of
contributors submitting changes. Stable releases are available weekly via pip,
while a development version can be installed directly from GitHub for access to
the latest changes (though this version may contain experimental features or
bugs).
Community
Perspectives and Comparisons
Initial
community reception of ADK includes significant praise for its excellent CLI
and developer-first tools (adk web, adk run, api_server), smooth building and
debugging process, and robust support for multiple model providers
beyond Google's, thanks to LiteLLM integration. Features like artifact
management for stateful agents and the AgentTool concept have also been
highlighted as useful additions. Deployment options, including the fully
managed Agent Engine and Cloud Run, are seen as providing appropriate levels of
control.
However,
community feedback also points out areas for potential improvement. Some
developers found initial setup steps cumbersome and perceived the different
agent types (LlmAgent, Sequential, Parallel, Loop) as potentially
over-engineered. The developer experience for evaluation and guardrails
is not consistently perceived as intuitive or smooth. Session state
management has been identified as a notable weak point, making it
challenging to work with effectively. A recurring concern is that the overall
developer experience seems optimized for advanced users, potentially posing a
higher barrier for beginners.
Compared
to other open-source frameworks:
LangChain/LangGraph: ADK is presented as a higher-level
framework potentially easier for simpler workflows, whereas LangGraph
offers lower-level, granular control, especially for persistence. ADK has
a built-in chat UI and monitoring, while LangChain is a widely adopted standard.
ADK integrates with LangChain tools, leveraging its extensive library.
CrewAI: Both facilitate complex agents.
ADK offers smoother integration with Google products, stronger enterprise
features like built-in RAG and connectors, and better deployment options
like Vertex AI Agent Engine. CrewAI focuses specifically on role-based multi-agent
collaboration.
AutoGen: Microsoft's AutoGen focuses on
swarms of specialized agents and multi-agent conversation frameworks. ADK
is a more general toolkit and uniquely offers built-in bidirectional
audio/video streaming. Some users prefer ADK.
OpenAI
Agents SDK: Both
are for multi-agent workflows. ADK supports multiple providers via LiteLLM
(including OpenAI models), while OpenAI Agents SDK focuses on native
integration with OpenAI endpoints. ADK includes artifact management.
Semantic
Kernel: A
model-agnostic SDK for building/orchestrating agents, with a focus on
enterprise integration. ADK integrates with Semantic Kernel via MCP and
A2A for cross-cloud collaboration. Semantic Kernel supports more languages
and has a plugin ecosystem.
LlamaIndex: Primarily focuses on connecting
LLMs to external data for Retrieval-Augmented Generation (RAG). ADK
integrates with LlamaIndex as a library to leverage its data
indexing/retrieval capabilities.
This
comparison highlights that ADK brings unique strengths, particularly its deep
integration with the Google Cloud ecosystem and the A2A protocol, while fitting
into a broader landscape of tools, each with its own focus and advantages.
Intended
Use Cases and Future Potential
Google
ADK is designed for a wide range of AI agent applications, from simple
automated tasks to complex, collaborative multi-agent systems. Its flexible
orchestration makes it suitable for applications requiring both deterministic
task execution and intelligent, context-aware routing. The ability to compose
specialized agents hierarchically is useful for breaking down intricate tasks.
The rich tool ecosystem and deployment flexibility broaden its applicability
across diverse scenarios. Built-in evaluation and emphasis on safety and
security make it suitable for applications where reliability and
trustworthiness are key.
Specific
potential applications mentioned include travel planning, retail automation,
and internal automation.
The
Agent Development Kit provides a robust, flexible, and open-source foundation
for building the next generation of AI applications. While community feedback
indicates areas for refinement, particularly in developer experience for
beginners and session state management, its strengths in modularity, Google
Cloud integration, and the innovative A2A protocol position it as a significant
tool in the evolving field of agentic AI. Its open-source nature and the
encouragement of community contributions suggest a promising path for future
development.
For
developers interested in exploring ADK, the official documentation and
quickstart guide are recommended starting points. Utilizing the developer Web
UI for local testing provides valuable hands-on experience. Considering
contributions to the GitHub project is also encouraged to engage with the
community and shape the framework's evolution. Developers with projects
requiring strong Google Cloud integration or leveraging Gemini models should
seriously evaluate ADK, while also considering other frameworks based on
specific project needs and infrastructure.
In the rapidly evolving world of
artificial intelligence, even seemingly small gestures of human courtesy
towards chatbots like ChatGPT come with a price tag. OpenAI CEO Sam Altman
recently revealed that users saying "please" and "thank you"
to the company's AI models is costing "tens of millions of dollars".
While the notion of politeness having a significant financial impact on a tech
giant might seem surprising, experts explain that this cost is a consequence of
how these powerful AI systems operate on an immense scale.
How AI Processes Language (And
Politeness)
Understanding the cost involves looking
into the technical underpinnings of AI chatbots. Large language models (LLMs)
like ChatGPT process text by breaking it down into smaller units called tokens.
These tokens can be words, parts of words, or even punctuation marks. When a
user inputs a prompt, the AI processes each token, requiring computational
resources like processing power and memory housed in massive data centers.
Generally, more tokens in a prompt require more computational resources.
Polite phrases such as
"please" and "thank you" typically add a small number of
tokens to a user's input, usually between two and four tokens in total.
According to OpenAI's API pricing, which charges based on token usage, the cost
per million tokens varies by model. For example, the GPT-3.5 Turbo model has an
input cost of $0.50 per million tokens. Based on this, adding three tokens for
politeness to a single prompt costs an exceedingly small amount – roughly
$0.0000015.
Scale: The Reason for the Millions
So, how does a cost of a fraction of a
cent per interaction balloon into "tens of millions of dollars"? The
answer lies in the sheer volume of daily usage. ChatGPT handles over one
billion queries daily. When a minuscule cost per interaction is multiplied by
billions of interactions, it accumulates into a substantial aggregate figure.
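As a rough back-of-the-envelope check, the sketch below multiplies the figures quoted above (GPT-3.5 Turbo input pricing, roughly three extra tokens, about one billion queries a day) by an assumed share of polite prompts; that 67 percent share borrows from the survey cited later in this article and is purely illustrative.

```python
# Back-of-the-envelope cost of polite tokens at pure API token prices.
PRICE_PER_TOKEN = 0.50 / 1_000_000   # USD per input token (GPT-3.5 Turbo)
EXTRA_TOKENS = 3                     # rough overhead of "please"/"thank you"
QUERIES_PER_DAY = 1_000_000_000      # reported daily query volume
POLITE_SHARE = 0.67                  # assumed share of polite prompts (illustrative)

per_prompt = EXTRA_TOKENS * PRICE_PER_TOKEN            # ~$0.0000015
per_day = QUERIES_PER_DAY * POLITE_SHARE * per_prompt  # ~$1,000
per_year = per_day * 365                               # well under $1 million

print(f"per polite prompt: ${per_prompt:.7f}")
print(f"per day:           ${per_day:,.0f}")
print(f"per year:          ${per_year:,.0f}")
```

Even with generous assumptions, token prices alone land far below "tens of millions," which is the gap the infrastructure costs discussed below help explain.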
Operating LLMs like ChatGPT requires a
vast infrastructure of data centers, high-performance servers, and specialized
processing chips (GPUs). These facilities consume substantial amounts of
energy. Estimates suggest the daily energy cost to run ChatGPT could be around
$700,000. A single query to GPT-4 is estimated to consume about 2.9 watt-hours
of electricity, significantly more than a standard search. Scaling this across
billions of daily queries results in millions of kilowatt-hours consumed daily.
Beyond electricity, these data centers also require significant water for
cooling systems. Generating a short 100-word email with GPT-4, for instance,
can use as much as 519 milliliters of water for cooling the servers involved.
The cost of processing polite language is embedded within this broader
framework of infrastructure, energy, and water expenses.
While some expert analyses, based purely
on token costs for models like GPT-3.5 Turbo, estimate the annual cost of
politeness to be significantly lower, around $146,000, Altman's statement of
"tens of millions" likely reflects the broader increase in
computational load and associated energy consumption across their vast
infrastructure. Even at "tens of millions," this figure represents a
non-negligible part of the overall operational costs for ChatGPT, which are
estimated to be in the hundreds of millions annually.
Why Users Are Polite, And Why It Might
Matter
Despite the cost, a significant portion
of users are polite to AI. A late 2024 survey found that 67 percent of US
respondents reported being nice to their chatbots. Among those, 55
percent said they did it "because it's the right thing to do,"
while 12 percent cited appeasing the algorithm, perhaps out of fear of a
future AI uprising. About two-thirds of people who are impolite said it was
for brevity.
Moreover, experts suggest that being
polite to AI might offer benefits beyond simple etiquette. Microsoft's design
manager Kurtis Beavers noted that using proper etiquette helps generate "respectful,
collaborative outputs," explaining that polite language "sets a
tone for the response". A Microsoft WorkLab memo added that generative AI
mirrors the levels of professionalism, clarity, and detail in user prompts.
Beavers also suggested that being polite ensures you get the same graciousness
in return and improves the AI's responsiveness and performance.
Research hints that polite and
well-phrased prompts could lead to higher quality and less biased AI outputs,
with one study finding a 9% improvement in AI accuracy. Additionally, current
interactions with AI are seen as contributing to the training data for future
models. Polite exchanges might help train AI to default towards helpfulness,
whereas curt interactions could reinforce transactional behavior, potentially
shaping the ethical frameworks of future AI. OpenAI CEO Sam Altman's comment
that the tens of millions spent on politeness were "well spent"
and his cryptic "you never know" remark could indicate that OpenAI
sees long-term strategic value in fostering these more natural interactions.
Contextualizing the Cost and
Environmental Impact
Ultimately, while politeness does add to
the computational load and contributes to energy consumption, the cost per
individual user remains minimal. The significant expense arises from the
aggregate effect across billions of interactions.
However, the discussion underscores a
broader issue: the substantial environmental footprint of AI. Data centers
powering AI already consume around 2 percent of the world's energy, a figure
projected to increase dramatically. Those "pleases" and
"thank yous," while seemingly small, contribute to this growing
energy demand. This reality has led some to suggest that for tasks like
writing a simple email, the most environmentally conscious choice might be to
bypass the chatbot entirely and write it yourself.
As AI becomes more integrated into daily
life, the balance between optimizing computational efficiency and fostering
positive, human-like interactions remains a key consideration. The debate over
the cost of politeness highlights this intersection of technical performance,
economic reality, environmental impact, and evolving human-AI relationships.
Last month, I stumbled across an article
about a new AI agent called Manus that was making waves in tech circles.
Developed by Chinese startup Monica, Manus promised something different from
the usual chatbots – true autonomy. Intrigued, I joined their waitlist without
much expectation.
Then yesterday, my inbox pinged with a
surprise: I'd been granted early access to Manus, complete with 1,000
complimentary credits to explore the platform. As someone who's tested every AI
tool from ChatGPT to Claude, I couldn't wait to see if Manus lived up to its
ambitious claims.
For context, Manus enters an
increasingly crowded field of AI agents. OpenAI released Operator in January,
Anthropic launched Computer Use last fall, and Google unveiled Project Mariner
in December. Each promises to automate tasks across the web, but Manus claims
to take autonomy further than its competitors.
This post shares my unfiltered
experience – what Manus is, how it works, where it shines, where it struggles,
and whether it's worth the hype. Whether you're considering joining the
waitlist or just curious about where AI agents are headed, here's my take on
being among the first to try this intriguing technology.
What Exactly Is Manus?
Manus (Latin for "hands")
launched on March 6th as what Monica calls a "fully autonomous AI
agent." Unlike conventional chatbots that primarily generate text within
their interfaces, Manus can independently navigate websites, fill forms,
analyze data, and complete complex tasks with minimal human guidance.
The name cleverly reflects its purpose –
to be the hands that execute tasks in digital spaces. It represents a
fundamental shift from AI that just "thinks" to AI that
"does."
Beyond Conversational AI
Traditional AI assistants like ChatGPT
excel at answering questions and generating content but typically can't take
action outside their chat interfaces. Manus bridges this gap by combining
multiple specialized AI models that work together to understand tasks, plan
execution steps, navigate digital environments, and deliver results.
According to my research, Manus uses a
combination of models including fine-tuned versions of Alibaba's open-source
Qwen and possibly components from Anthropic's Claude. This multi-model approach
allows it to handle complex assignments that would typically require human
intervention – from building simple websites to planning detailed travel
itineraries.
The Team Behind Manus
Monica (Monica.im) operates from Wuhan
rather than China's typical tech hubs like Beijing or Shanghai. Founded in 2022
by Xiao Hong, a graduate of Huazhong University of Science and Technology, the
company began as a developer of AI-powered browser extensions.
What started as a "ChatGPT for
Google" browser plugin evolved rapidly as the team recognized the
potential of autonomous agents. After securing initial backing from ZhenFund,
Monica raised Series A funding led by Tencent and Sequoia Capital China in
2023.
In an interesting twist, ByteDance
reportedly offered $30 million to acquire Monica in early 2024, but Xiao Hong
declined. By late 2024, Monica closed another funding round that valued the
company at approximately $100 million.
Current Availability
Manus remains highly exclusive. From
what I've gathered, less than 1% of waitlist applicants have received access
codes. The platform operates on a credit system, with tasks costing roughly $2
each. My 1,000 free credits theoretically allow for 500 basic tasks, though
complex assignments consume more credits.
Despite limited access, Manus has
generated considerable buzz. Several tech influencers have praised its
capabilities, comparing its potential impact to that of DeepSeek, another
Chinese AI breakthrough that surprised the industry last year.
How Manus Works
My first impression upon logging in was
that Manus offers a clean, minimalist interface. The landing page displays
previous sessions in a sidebar and features a central input box for task
descriptions. What immediately sets it apart is the "Manus's Computer"
viewing panel, which shows the agent's actions in real-time.
The Technical Approach
From what I've observed and researched,
Manus operates through several coordinated steps:
1. When you describe a task, Manus analyzes your request and breaks it into logical components.
2. It creates a step-by-step plan, identifying necessary tools and actions.
3. The agent executes this plan by navigating websites, filling forms, and analyzing information.
4. If it encounters obstacles, it attempts to adapt its approach.
5. Once complete, it delivers results in a structured format.
This process happens with minimal
intervention. Unlike chatbots that need continuous guidance, Manus works
independently after receiving initial instructions.
The User Experience
Using Manus follows a straightforward
pattern:
1. You describe your task in natural language.
2. Manus acknowledges and may ask clarifying questions.
3. The agent begins working, with its actions visible in the viewing panel.
4. For complex tasks, it might provide progress updates.
5. Upon completion, it delivers downloadable results in various formats.
One valuable feature is Manus's
asynchronous operation. Once a task begins, it continues in the cloud, allowing
you to disconnect or work on other things. This contrasts with some competing
agents that require constant monitoring.
Pricing Structure
Each task costs approximately $2 worth
of credits, though I've noticed complex tasks consume more. For instance, a
simple research assignment used 1 credit, while a detailed travel itinerary
planning task used 5 credits.
At current rates, regular use would
represent a significant investment. Whether this cost is justified depends
entirely on how much you value the time saved and the quality of results.
Limitations and Safeguards
Like all AI systems, Manus has
constraints. It cannot bypass paywalls or complete CAPTCHA challenges without
assistance. When encountering these obstacles, it pauses and requests
intervention.
The system also includes safeguards
against potentially harmful actions. It won't make purchases or enter payment
information without explicit confirmation and avoids actions that might violate
terms of service.
How Manus Compares to Competitors
The AI agent landscape has become
increasingly competitive, with major players offering their own solutions.
Based on my testing and research, here's how Manus stacks up:
Performance Benchmarks
Manus reportedly scores around 86.5% on
the General AI Assistants (GAIA) benchmark, though these figures remain
partially unverified. For comparison:
OpenAI's Operator achieves 38.1% on
OSWorld (testing general computer tasks) and 87% on WebVoyager (testing
browser-based tasks)
Anthropic's Computer Use scores 22.0%
on OSWorld and 56% on WebVoyager
Google's Project Mariner scores 83.5%
on WebVoyager
For context, human performance on
OSWorld is approximately 72.4%, indicating that even advanced AI agents still
fall short of human capabilities in many scenarios.
Key Differentiators
From my experience, Manus's most
significant advantage is its level of autonomy. While all these agents perform
tasks with some independence, Manus requires less intervention:
Manus operates asynchronously in the
cloud, allowing you to focus on other activities
Operator requires confirmation before
finalizing tasks with external effects
Computer Use frequently needs
clarification during execution
Project Mariner often pauses for
guidance and requires users to watch it work
Manus also offers exceptional
transparency through its viewing panel, allowing you to observe its process in
real-time. This builds trust and helps you understand how the AI approaches
complex tasks.
Regarding speed, the picture is mixed.
Manus can take 30+ minutes for complex tasks but works asynchronously. Operator
is generally faster but still significantly slower than humans. Computer Use
takes numerous steps for simple actions, while Project Mariner has noticeable
delays between actions.
Manus stands out for global
accessibility, supporting multiple languages including English, Chinese
(traditional and simplified), Russian, Ukrainian, Indonesian, Persian, Arabic,
Thai, Vietnamese, Hindi, Japanese, Korean, and various European languages. In
contrast, Operator is currently limited to ChatGPT Pro subscribers in the
United States.
The business models also differ
significantly. Manus uses per-task pricing at approximately $2 per task, while
Operator is included in the ChatGPT Pro subscription ($200/month). Computer Use
and Project Mariner's pricing models are still evolving.
Challenges Relative to Competitors
Despite its advantages, Manus faces
several challenges:
System stability issues, with
occasional crashes during longer tasks
Limited availability compared to
competitors
As a product from a relatively small
startup, it lacks the resources of tech giants backing competing agents
My Hands-On Experience
After receiving my access code
yesterday, I've tested Manus on various tasks of increasing complexity. Here's
what I've found:
Tasks I've Attempted
Research Task: Compiling a list of
top AI research papers from 2024 with summaries
Content Creation: Creating a
comparison table of electric vehicles with specifications
Data Analysis: Analyzing trends in a
spreadsheet of sales data
Travel Planning: Developing a one-week
Japan itinerary based on my preferences
Technical Task: Creating a simple
website portfolio template
Successes and Highlights
Manus performed impressively on several
tasks. The research assignment was particularly successful – Manus navigated
academic databases efficiently, organized information logically, and delivered
a well-structured document with proper citations.
For the electric vehicle comparison, it
created a detailed table with accurate, current information by navigating
multiple manufacturer websites. This would have taken me hours to compile
manually.
The travel planning task showcased
Manus's coordination abilities. It researched flights, suggested
accommodations at various price points, and created a day-by-day itinerary
respecting my preferences for cultural experiences and outdoor activities. It
even included estimated costs and transportation details.
Watching Manus work through the viewing
panel was fascinating. The agent demonstrated logical thinking, breaking
complex tasks into manageable steps and adapting when encountering obstacles.
Limitations and Frustrations
Despite these successes, Manus wasn't
without struggles. The data analysis task revealed limitations – while it
identified basic trends, its analysis lacked the depth a human analyst would
provide. The visualizations were functional but basic.
The website creation task encountered
several hiccups. Manus created a basic HTML/CSS structure but struggled with
complex responsive design elements. The result was usable but would require
significant refinement.
I experienced two system crashes during
longer tasks, requiring me to restart. In one case, Manus lost progress on a
partially completed task, which was frustrating.
When Manus encountered paywalls or
CAPTCHA challenges, it appropriately paused for intervention. While necessary,
this interrupted the otherwise autonomous workflow.
Overall User Experience
The interface is clean and intuitive,
and the viewing panel provides valuable transparency. Task results are
well-organized and easy to download. The asynchronous operation is particularly
valuable, allowing me to focus on other activities while Manus works.
However, load times can be lengthy,
especially for complex tasks. Occasional stability issues interrupt the
workflow, and the system sometimes struggles with nuanced instructions. There's
also limited ability to intervene once a task is underway.
Final Thoughts
After my initial day with Manus, I'm
cautiously optimistic about its potential. The agent demonstrates impressive
capabilities that genuinely save time on certain tasks. The research, content
creation, and planning functions are particularly strong.
However, stability issues, variable
performance across task types, and occasional need for human intervention
prevent Manus from being the truly autonomous assistant it aspires to be. It's
a powerful tool but one that still requires oversight and occasional course
correction.
The 1,000 free credits provide ample
opportunity to explore Manus's capabilities without immediate cost concerns.
Based on my usage, these should last several weeks with moderate use.
For early adopters and those with
specific use cases aligned with Manus's strengths, the value proposition is
compelling despite the $2 per-task cost. For professionals whose time is
valuable, the hours saved could easily justify the expense.
However, for general users or those with
tighter budgets, the current limitations and cost structure might make Manus a
luxury rather than a necessity.
As Manus evolves in response to user
feedback and competitive pressures, I expect many current limitations to be
addressed. The foundation is strong, and if Monica can improve stability and
refine capabilities in weaker areas, Manus could become an indispensable
productivity tool.
The autonomous AI revolution is just
beginning, and Manus represents one of its most intriguing early
manifestations. Whether it ultimately leads the field or serves as a stepping
stone to more capable systems remains to be seen, but its contribution to advancing
autonomous AI is already significant.
I'll continue experimenting with my
remaining credits, focusing on tasks where Manus excels, and will likely share
updates as I discover more about this fascinating technology.
As artificial
intelligence continues to grow more advanced—especially with the rapid rise of
Large Language Models (LLMs)—there’s been a persistent roadblock: how to
connect these powerful AI models to the massive range of tools, databases, and
services in the digital world without reinventing the wheel every time.
Traditionally,
every new integration—whether it's a link to an API, a business application, or
a data repository—has required its own unique setup. These one-off,
custom-built connections are not only time-consuming and expensive to develop,
but also make it incredibly hard to scale up when things evolve. Imagine trying
to build a bridge for every single combination of AI model and tool. That’s
what developers have been facing—what many call the "N×M problem":
integrating N different LLMs with M different tools requires N × M individual
solutions. Not ideal.
That’s where
the Model Context Protocol (MCP) steps in. Introduced by Anthropic in
late 2024, MCP is an open standard designed to simplify and standardize how AI
models connect to the outside world. Think of it as the USB-C of AI—one
universal plug that can connect to almost anything. Instead of developers
building custom adapters for every new tool or data source, MCP provides a
consistent, secure way to bridge the gap between AI and external systems.
Why
Integration Used to Be a Mess
Before MCP, AI
integration was like trying to wire a house with dozens of different plugs,
each needing a special adapter. Every tool—whether it's a database or a piece
of enterprise software—needed to be individually wired into the AI model. This
meant developers spent countless hours creating one-off solutions that were
hard to maintain and even harder to scale. As AI adoption grew, so did the
complexity and the frustration.
This fragmented
approach didn’t just slow things down—it also prevented different systems from
working together smoothly. There wasn’t a common language or structure, making
collaboration and reuse of integration tools nearly impossible.
MCP: A
Smarter Way to Connect AI
Anthropic
created MCP to bring some much-needed order to the chaos. The protocol lays out
a standard framework that lets applications pass relevant context and data to
LLMs while also allowing those models to tap into external tools when needed.
It’s designed to be secure, dynamic, and scalable. With MCP, LLMs can interact
with APIs, local files, business applications—you name it—all through a
predictable structure that doesn’t require starting from scratch.
How MCP Is
Built: Hosts, Clients, and Servers
The MCP
framework works using a three-part architecture that will feel familiar to
anyone with a background in networking or software development:
MCP Hosts are the AI-powered applications or
agents that need access to outside data—think tools like Claude Desktop or
AI-powered coding environments like Cursor.
MCP Clients live inside these host
applications and handle the job of talking to MCP servers. They manage the
back-and-forth communication, relaying requests and responses.
MCP Servers are lightweight programs that make
specific tools or data available through the protocol. These could connect
to anything from a file system to a web service, depending on the need.
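To ground the host/client/server split, here is a minimal client-side sketch using the official Python SDK's stdio transport. The server script name and tool name are hypothetical, and exact helper names may vary between SDK releases.

```python
# Minimal MCP client: spawn a local server over stdio and call one of its tools.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

server_params = StdioServerParameters(
    command="python",
    args=["weather_server.py"],  # hypothetical MCP server script
)

async def main() -> None:
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()          # protocol handshake
            tools = await session.list_tools()  # discover what the server exposes
            print([tool.name for tool in tools.tools])
            result = await session.call_tool(
                "get_forecast", arguments={"city": "Waterloo"}  # hypothetical tool
            )
            print(result)

asyncio.run(main())
```

Here the script plays the role of the host, ClientSession is the MCP client, and the spawned process is the MCP server.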
What MCP Can
Do: The Five Core Features
MCP enables
communication through five key features—simple but powerful building blocks
that allow AI to do more without compromising structure or security:
Prompts – These are instructions or
templates the AI uses to shape how it tackles a task. They guide the model
in real-time.
Resources – Think of these as reference
materials—structured data or documents the AI can “see” and use while
working.
Tools – These are external functions the
AI can call on to fetch data or perform actions, like running a database
query or generating a report.
Roots – A secure method for accessing
local files, allowing the AI to read or analyze documents without full,
unrestricted access.
Sampling – This allows the external systems
(like the MCP server) to ask the AI for help with specific tasks, enabling
two-way collaboration.
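The server side of those building blocks can be sketched with the SDK's FastMCP helper; the server name, tool, and resource below are made-up examples, assuming the decorator-based API from the official Python SDK.

```python
# weather_server.py -- a toy MCP server exposing one tool and one resource.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("weather-demo")  # hypothetical server name

@mcp.tool()
def get_forecast(city: str) -> str:
    """Return a canned forecast for the given city (stand-in for a real API call)."""
    return f"Forecast for {city}: sunny, 24 C"

@mcp.resource("docs://weather/readme")
def readme() -> str:
    """Reference material the model can read while working."""
    return "This server exposes simple demo weather data over MCP."

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```

Any MCP-aware host, whether Claude Desktop, an IDE, or the client sketch shown earlier, can then discover get_forecast and the readme resource without bespoke glue code.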
Unlocking
the Potential: Advantages of MCP
The adoption of
MCP offers a multitude of benefits compared to traditional integration methods.
It provides universal access through a single, open, and standardized
protocol. It establishes secure, standardized connections, replacing ad
hoc API connectors. MCP promotes sustainability by fostering an
ecosystem of reusable connectors (servers). It enables more relevant AI
by connecting LLMs to live, up-to-date, context-rich data. MCP offers unified
data access, simplifying the management of multiple data source
integrations. Furthermore, it prioritizes long-term maintainability,
simplifying debugging and reducing integration breakage. By offering a
standardized "connector," MCP simplifies AI integrations, potentially
granting an AI model access to multiple tools and services exposed by a single
MCP-compliant server. This eliminates the need for custom code for each tool or
API.
MCP in
Action: Applications Across Industries
The potential
applications of MCP span a wide range of industries. It aims to establish
seamless connections between AI assistants and systems housing critical data,
including content repositories, business tools, and development environments.
Several prominent development tool companies, including Zed, Replit,
Codeium, and Sourcegraph, are integrating MCP into their platforms to
enhance AI-powered features for developers. AI-powered Integrated Development
Environments (IDEs) like Cursor are deeply integrating MCP to provide
intelligent assistance with coding tasks. Early enterprise adopters like Block
and Apollo have already integrated MCP into their internal systems. Microsoft's
Copilot Studio now supports MCP, simplifying the incorporation of AI
applications into business workflows. Even Anthropic's Claude Desktop
application has built-in support for running local MCP servers.
A
Collaborative Future: Open Source and Community Growth
MCP was
officially released as an open-source project by Anthropic in November 2024.
Anthropic provides comprehensive resources for developers, including the
official specification and Software Development Kits (SDKs) for various
programming languages like TypeScript, Python, Java, and others. An open-source
repository for MCP servers is actively maintained, providing developers with
reference implementations. The open-source nature encourages broad
participation from the developer community, fostering a growing ecosystem of
pre-built, MCP-enabled connectors and servers.
Navigating
the Challenges and Looking Ahead
While MCP holds
immense promise, it is still a relatively recent innovation undergoing
development and refinement. The broader ecosystem, including robust security
frameworks and streamlined remote deployment strategies, is still evolving.
Some client implementations may have current limitations, such as the number of
tools they can effectively utilize. Security remains a paramount consideration,
requiring careful implementation of visibility, monitoring, and access
controls. Despite these challenges, the future outlook for MCP is bright. As
the demand for AI applications that seamlessly interact with the real world
grows, the adoption of standardized protocols like MCP is likely to increase
significantly. MCP has the potential to become a foundational standard in AI
integration, similar to the impact of the Language Server Protocol (LSP)
in software development.
A Smarter,
Simpler Future for AI Integration
The Model
Context Protocol represents a significant leap forward in simplifying the
integration of advanced AI models with the digital world. By offering a
standardized, open, and flexible framework, MCP has the potential to unlock a
new era of more capable, context-aware, and beneficial AI applications across
diverse industries. The collaborative, open-source nature of MCP, coupled with
the support of key players and the growing enthusiasm within the developer
community, points towards a promising future for this protocol as a cornerstone
of the evolving AI ecosystem.
Google's
Gemini ecosystem has expanded its capabilities with the introduction of Gemini
Deep Research, a sophisticated feature designed to revolutionize how users
conduct in-depth investigations online. Moving beyond the limitations of traditional search
engines, Deep Research acts as a virtual research assistant,
autonomously navigating the vast expanse of the internet to synthesize complex
information into coherent and insightful reports. This AI-powered tool promises
to significantly enhance research efficiency and provide valuable insights
across diverse domains for professionals, researchers, and individuals seeking
a deeper understanding of complex subjects.
Unpacking
Gemini Deep Research: Your Personal AI Research Partner
Gemini Deep
Research is integrated within the Gemini Apps, offering users a specialized
feature for comprehensive and real-time research on virtually any topic. It
operates as a personal AI research assistant, going beyond basic
question-answering to automate web browsing, information analysis, and
knowledge synthesis. The core objective is to significantly reduce the time
and effort typically associated with in-depth research, empowering users to
gain a thorough understanding of complex subjects much faster than with
conventional methods.
Unlike
traditional search methods that require users to manually navigate numerous
tabs and piece together information, Deep Research streamlines this process
autonomously. It navigates and analyzes potentially hundreds of websites,
thoughtfully processes the gathered information, and generates insightful,
multi-page reports. Many reports also offer an Audio Overview feature,
enhancing accessibility by allowing users to stay informed while multitasking.
This combination of autonomous research and accessible output formats sets
Gemini Deep Research apart from standard chatbots.
The
Mechanics of Deep Research: From Prompt to Insightful Report
Engaging with
Gemini Deep Research is designed to be intuitive, accessible through the Gemini
web or mobile app. The process begins with the user entering a clear and
straightforward research prompt. The system understands natural language,
eliminating the need for specialized prompting techniques.
Upon receiving
a prompt, Gemini Deep Research generates a detailed research plan tailored
to the specific topic. Importantly, users have the opportunity to review
and modify this plan before the research begins, allowing for targeted
investigation aligned with their specific objectives. Users can suggest
alterations and provide additional instructions using natural language.
Once the plan
is finalized, Deep Research autonomously searches and deeply browses the web
for relevant and up-to-date information, potentially analyzing hundreds of
websites. Transparency is maintained through options like "Sites
browsed," which lists the utilized websites, and "Show
thinking," which reveals the AI's steps.
A crucial
aspect is the AI's ability to engage in iterative reasoning and thoughtful
analysis of the gathered information. It continuously evaluates findings,
identifies key themes and patterns, and employs multiple passes of self-critique
to enhance the clarity, accuracy, and detail of the final report.
The culmination
is the generation of comprehensive and customized research reports
within minutes, depending on the topic's complexity. These reports often
include an Audio Overview and can be easily exported to Google Docs,
preserving formatting and citations. Clear citations and direct links to
original sources are always included, ensuring transparency and
facilitating easy verification.
Under the
Hood: Powering Deep Research
Gemini Deep
Research harnesses the power of Google's advanced Gemini models.
Initially powered by Gemini 1.5 Pro, known for its ability to process
large amounts of information, Deep Research was subsequently upgraded to the Gemini
2.0 Flash Thinking Experimental model. This "thinking model"
enhances reasoning by breaking down complex problems into smaller steps,
leading to more accurate and insightful responses.
At its core,
Deep Research operates as an agentic system, autonomously breaking down
complex problems into actionable steps based on a detailed, multi-step
research plan. This planning is iterative, with the model constantly
evaluating gathered information.
Given the
long-running nature of research tasks involving numerous model calls, Google
has developed a novel asynchronous task manager. This system maintains a
shared state, enabling graceful error recovery without restarting the entire
process and allowing users to return to results at their convenience.
To manage the
extensive information processed during a research session, Deep Research
leverages Gemini's large context window (up to 1 million tokens for
Gemini Advanced users). This is complemented by Retrieval-Augmented
Generation (RAG), allowing the system to effectively "remember"
information learned during a session, becoming increasingly context-aware.
The Gemini
models are trained on a massive and diverse multimodal and multilingual
dataset. This includes web documents, code, images, audio, and video.
Instruction tuning and human preference data ensure the models effectively
follow complex instructions and align with human expectations for quality.
Gemini 1.5 Pro utilizes a sparse Mixture-of-Experts (MoE) architecture
for increased efficiency and scalability.
Diverse
Applications Across Industries and Research
Gemini Deep
Research offers a wide range of applications, demonstrating its versatility.
Business Intelligence and Market
Analysis:
Competitive analysis, due diligence, identifying market trends.
Academic and Scientific Research: Literature reviews, summarizing
research papers, hypothesis generation.
Healthcare and Medical Research: Assisting in radiology reports,
summarizing health information, answering clinical questions, analyzing
medical images and genomic data.
Education: Lesson planning, grant writing,
creating assessment materials, supporting student research and
understanding.
Real-world
examples include planning home renovations, researching vehicles, analyzing
business propositions, benchmarking marketing campaigns, analyzing economic
downturns, researching product manufacturing, exploring interstellar travel
possibilities, researching game trends, assisting in coding, and conducting
biographical analysis. Industry-specific uses include accounting associations
analyzing tax reforms, professional development identifying skill gaps,
regulatory bodies assessing the impact of new regulations, and healthcare
streamlining radiology reports and summarizing patient histories.
The utility of
Deep Research is further enhanced by its integration with other Google tools
like Google Docs and NotebookLM, facilitating editing, collaboration, and
in-depth data analysis. The Audio Overview feature provides added
accessibility.
Navigating
the Competitive Landscape
Comparisons
with other AI platforms highlight Gemini Deep Research's unique strengths.
Gemini Deep Research vs. ChatGPT: Gemini excels in
research-intensive tasks and image analysis, focusing on verifiable facts.
ChatGPT is noted for creative writing and contextual explanations. User
experience preferences vary.
Gemini Deep Research vs. Grok: Grok is designed for real-time
data analysis and IT operations, with strong integration with the X
platform. Gemini offers broader research applications and handles diverse
data types.
Gemini Deep Research vs. DeepSeek: DeepSeek is strong in generating
structured and technically detailed responses, particularly for
programming and technical content. Gemini has shown superior overall
versatility and accuracy across a wider range of prompts and offers native
multimodal support.
Table 1: Comparison of Gemini Deep Research with Other AI Platforms (a side-by-side comparison across key features)

| Feature | Gemini Deep Research | ChatGPT Deep Research | Grok | DeepSeek |
| --- | --- | --- | --- | --- |
| Multimodal Input | Yes (Text, Images, Audio, Video) | Yes (Text, Images, PDFs) | No (Primarily Text) | No (Primarily Text) |
| Real-time Search | Yes (Uses Google Search) | Yes (Uses Bing) | Yes (Real-time data analysis, integrates with X) | Yes |
| Citation Support | Yes (Inline and Works Cited) | Yes (Inline and Separate List) | Yes | Yes |
| Planning | Yes (User-Reviewable Plan) | Yes | No Explicit Planning Mentioned | No Explicit Planning Mentioned |
| Reasoning | Advanced (Iterative, Self-Critique) | Advanced | Strong (Focus on real-time data) | Strong (Technical Reasoning) |
| Strengths | Research-heavy tasks, Image Analysis, Google Ecosystem Integration | | | |
| Pricing | Free (for some models), Paid (for advanced models) | | | |
The Future
Trajectory: Impact and Anticipated Enhancements
Gemini Deep
Research has the potential to fundamentally transform research across
various disciplines by automating information gathering, analysis, and
synthesis, leading to significant increases in efficiency and productivity. It
represents a step towards a future where AI actively collaborates in the
research lifecycle.
Future
developments aim to provide users with greater control over the browsing
process and expand information sources beyond the open web.
Continuous improvements in quality and efficiency are expected with the
integration of newer Gemini models. Deeper integration with other Google
applications will enable more personalized and context-aware responses.
Features like Audio Overview and personalization based on search history
indicate a trend towards a more integrated and user-centric research
experience.
Democratizing
In-Depth Analysis
Gemini Deep
Research is a powerful and evolving tool offering a sophisticated
approach to information retrieval and analysis. Its core capabilities in
autonomous web searching, iterative reasoning, and comprehensive report
generation have the potential to significantly enhance research efficiency
across numerous industries and academic fields. By providing user control and
delivering well-cited, synthesized information, Gemini Deep Research empowers
users to gain deeper insights and make more informed decisions. As the
technology advances, its role in the future of research and knowledge discovery
is poised to become increasingly significant, democratizing access to
in-depth analysis and accelerating the pace of innovation.