My personal review of 10+ AI agents and what actually works
The AI Agents Report Card you wish your boss gave you.

AI with ALLIE
The professional’s guide to quick AI bites for your personal life, work life, and beyond.
What you'll learn in this newsletter:
Which AI agents actually deliver value right now
Where even the best agents still fall embarrassingly short
The surprising truth about those sleek, impressive interfaces
The economics of delegating to AI (and when it's worth the premium)
Five practical takeaways to guide your AI strategy
Hey cyborgs,
What if AI could just handle tasks for you? Plan a week's worth of perfect meals, generate a detailed competitor analysis, or intelligently reschedule your entire day when one meeting shifts—all without you lifting a finger? An AI agent, or thousands of AI agents, that can reason through its own steps and execute on your behalf would be pretty life-changing…
I've been on an AI agent testing spree lately, and I dragged my entire team along for the ride. Last month, I assigned everyone to test at least two different AI agents and come back with demos to share and very opinionated first impressions on design, process, flow, performance, speed, and accuracy. After watching presentations, comparing notes, and hashing out what worked (and mostly what didn't), we've uncovered some fascinating insights about where these digital helpers actually stand today.
Allie note: you can steal my idea and have your team test AI agents too! Since many AI agent platforms cost money (agents are a more expensive workflow to run), I gave everyone a $100 budget—you can scale yours accordingly. You can also provide a list of agents you'd like your team to test. I'd recommend a quick primer on your company's AI policy before the test, plus a group discussion of potential internal use cases. And if you plan on using these agents in work scenarios that require company data, it's safer to give your employees or small team a set of documents with fake data instead.
The dream is seductive, but I'm just going to rip the bandaid off here: we're definitely living in the awkward teenage years of AI agents. They show flashes of brilliance followed by face-palm moments of utter incompetence.
And if you haven’t watched my Intro to AI Agents primer, you need to start there.
Our Testing Team Goes Wild
Our testing covered a broad range of players in the AI agent space, from the giants to the newcomers.
Each team member approached testing with different real-world scenarios—from selling jewelry to custom-framing art, scheduling personal errands around work meetings to gathering sales intelligence. Let's dive into what we discovered.
Task-Focused Agents
These tools aim to execute specific tasks with varying levels of autonomy.
Operator: OpenAI's literal machine
Operator earned a solid B-/C+ in our testing. It works for quick, straightforward tasks but lacks nuance. One team member used it to find a gluten-free restaurant near a concert venue in Minneapolis and successfully made a reservation (that’s a big deal!). Another tested it to find options for selling a unique piece of jewelry, though the results left her uncertain whether she was getting the best value.
This Operator test highlights a key insight that would reappear throughout our testing: AI agents excel when you need a solution (any decent restaurant nearby) but fall apart when you want the perfect solution.

Operator finding me two tickets to the next concert at the Sphere (listed in Canadian dollars because I took the screenshot while in Toronto)
We saw this limitation clearly when another test involved asking Operator to find a quilt pattern—something more specific with design requirements. After 15+ minutes of searching, the results were mediocre at best. It took words too literally, overweighting terms like "modern" and "curves" rather than understanding the overall intent (like maybe showing a wave pattern that is curvy).
One of our most amusing failures came from a keto meal planning request. When asked to create a week's worth of keto-friendly meals, Operator pulled from a random website and suggested just three items—bacon, avocados, and steak. "Here's your week's worth of food," it proclaimed, seemingly unaware that humans can't survive on the same three foods for a week straight. GIVE ME SOME PECANS, OPENAI.
The key insight: you're offloading work but NOT offloading thinking, and that's a critical distinction when evaluating these tools.
Perplexity AI Assistant: fabulous form, functional flop
I was genuinely excited about Perplexity's brand-new AI Assistant. The concept is enticing: a constant voice companion with “deep” app integrations that truly understands your digital life and handles tasks seamlessly across your device. I loved the form factor—finally, the ambient AI assistant we've been promised for years! It’s Siri’s way cooler older sister!
But the reality is essentially a data vacuum with minimal functional return.

Perplexity AI Assistant is voice-only, just talk to the amorphous blob on your screen!
The app immediately asks for sweeping access to your personal information: contacts, calendar, location, messages, and more. Yet despite these extensive permissions, it consistently fails at even basic tasks:
Can't properly map (confused Washington Sq Park with Washington Heights)
Can't book a correct Uber (kept defaulting to home address)
Can't share images
Can't share websites
Can't make reservations (pulled up OpenTable preloaded with search results)
Can't honor existing calendar entries (scheduled events directly over meetings despite having full calendar access)
The imbalance between what I wanted from the thing and what I got from the thing was stark: maximum data access, minimum capability. This asymmetry makes it feel less like a helpful assistant and more like a sophisticated data collection operation with a thin veneer of functionality.
Email drafting in the native mail app was its one bright spot. But this only highlights the absurdity: why does it need access to your entire digital life just to draft emails?!?
Most team members concluded that clicking an app directly remains faster than asking the assistant to open it. As one tester put it: "If I have to open an app, click onto the thing to talk to it, and then ask it to do the thing, I might as well just click to open the app I want to use instead of the middleman."
Perplexity is very clearly trying to replace Google search and Siri, and April brought headlines about a potential Samsung partnership. But I need more from this.
Manus AI: the best (but at what cost?)
Manus consistently emerged as the best performer in our task-focused testing—but the price tag is striking when compared to alternatives. At essentially $4 per task ($40-$199 per month for limited credits), it's dramatically more expensive than most other options. In my testing, I burned through 300 credits (a third of my initial allocation) on a single prompt.
Oh, and by the way, it took that prompt 15 minutes to complete.

Using Manus AI to research buying a yak for my family because why the hell not
The capabilities are impressive: excellent browser navigation, handling of pop-ups and ads, and helpful summaries of findings. But it's not a completely hands-free experience. You still need to craft detailed prompts, review the results, often take over manually at certain points, and sometimes redirect it when it gets stuck.
The fundamental question becomes: is Manus better enough to justify the premium? When ChatGPT costs $20/month with unlimited usage and many specialized tools are completely free, it's hard to justify paying 10x more for what amounts to an incremental improvement in capabilities. Yes, Manus performs better at certain tasks, but the value gap doesn't seem to match the price gap. Manus now gives 300 credits per day for free, so you can use it once a day with no fee. Test Manus here.
If you want to get the best sense of AI capabilities, I’d recommend one paid month of Manus to test it. Or you can watch my YouTube demo of Manus AI here.
Specialized Tools with Specific Strengths
Unlike general task assistants, these tools focus on doing one thing particularly well.
Clay AI: surprisingly sophisticated
Some of the most practical AI capabilities we tested emerged in the humble spreadsheet. Clay AI impressed with its sophisticated approach to sales intelligence, offering integrations with premium data services like Apollo, HubSpot, and Pitchbook.
What particularly stood out was Clay's waterfall approach to data enrichment. It first checks cheaper or free data sources before moving to more expensive ones to fill in missing information. This means you're not burning through expensive API calls unnecessarily. The system shows you exactly how many credits each data source costs, so you can easily put the most expensive options last in your waterfall sequence.

Clay AI waterfall in use (on the right) for data enrichment (the table)
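To make the waterfall idea concrete, here's a minimal sketch in Python of the cheapest-first fallback pattern described above. The source names, credit costs, and lookup functions are all hypothetical placeholders—this illustrates the logic, not Clay's actual API.

```python
# Hypothetical sketch of waterfall enrichment: query data sources in
# order of credit cost and stop as soon as one returns the field.
# Source names and per-call costs are made up for illustration.

def free_db_lookup(domain):      # cost: 0 credits
    return None                  # pretend the free source missed

def apollo_lookup(domain):       # cost: e.g. 2 credits
    return {"email": "ceo@example.com"}

def pitchbook_lookup(domain):    # cost: e.g. 10 credits
    return {"email": "ceo@example.com", "funding": "$12M"}

# Cheapest sources first, most expensive last.
WATERFALL = [
    ("free_db", 0, free_db_lookup),
    ("apollo", 2, apollo_lookup),
    ("pitchbook", 10, pitchbook_lookup),
]

def enrich(domain):
    credits_spent = 0
    for name, cost, lookup in WATERFALL:
        result = lookup(domain)
        credits_spent += cost        # you pay per query attempted
        if result:                   # stop at the first source that answers
            return result, credits_spent, name
    return None, credits_spent, None

print(enrich("example.com"))  # hits apollo for 2 credits, never touches pitchbook
```

The ordering is the whole trick: the expensive call only fires when everything cheaper has already come up empty.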
For sales operations and data enrichment, this specialized tool provides immediate business value without the frustration of general-purpose agents. I’m not surprised at all with some of the customer testimonials I found on their site (Anthropic used Clay AI and got a 3X improvement in their data enrichment rate and Notion used it to get a 40% increase in their auto-approval rate for their startup program).
Lindy AI: permission-heavy but pretty clean
Lindy AI requires granting a lot of permissions—creating an adoption barrier for many users. However, team members who tested it noted that the templates actually work, plus the interface has a cleaner layout than alternatives.
When compared to similar workflow automation tools, Lindy stood out for its more refined user experience, though the functionality remains focused on connecting various apps and automating repetitive tasks. The question is whether the improved interface justifies the extensive permissions required. And you have to LOVE flowcharts to make this tool work for you. I mean, just look at this screenshot below.

Lead Outreach template from Lindy AI - example flowchart
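For a sense of what these flowcharts boil down to, here's a hypothetical sketch of a lead-outreach flow expressed as plain data: a trigger, a few steps, and a branch. This is not Lindy's real template format—just the shape these tools encode.

```python
# Hypothetical sketch of a lead-outreach flow like the template above.
# NOT Lindy's actual format; it just shows the trigger -> steps -> branch
# structure that the flowchart UI is drawing.

lead_outreach_flow = {
    "trigger": {"type": "new_row", "source": "leads_spreadsheet"},
    "steps": [
        {"action": "enrich_lead", "fields": ["email", "company", "title"]},
        {"action": "draft_email", "template": "intro_v1"},
        {
            "action": "branch",
            "condition": "lead.title contains 'VP' or 'Chief'",
            "if_true": {"action": "send_for_review", "to": "ae_inbox"},  # humans check execs
            "if_false": {"action": "send_email", "delay_hours": 24},
        },
    ],
}
```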
One tester summed up the permission challenge: "I spun up some dummy accounts because I don't want to give these people access to all of my work email. It's creating a blocker for real-world testing."
Buyer beware: there is definitely some sketchiness with the Lindy blog, where they’re clearly using AI to generate SEO-optimized blog posts just advertising themselves against competitors. May 8th alone had almost 10 blog posts, and I disagree with a ton of the content inside each one (like when they ranked Alexa as a better assistant than ChatGPT). Don’t trust the blogs to give you honest coverage.
Proxy by Convergence: a Jekyll-and-Hyde experience
The free version of Proxy performed reasonably well at retrieving research papers from reputable sources—one team member noted it was particularly good at finding peer-reviewed papers on specific topics. If you're just looking for a research assistant, the free version might suffice for basic needs.
However, I have to talk about my bizarre experience with the paid "multi-agent" version. This premium tier received our harshest assessment, labeled a "marketing gimmick" by our team. Their multi-agent visualization shows one agent spawning three mini-agents that eventually recombine results—impressive in theory, but the agents fail non-stop, and the UX is terrible.

Proxy by Convergence interface with the mini agents view on the right
The layout is confusing, the experience is frustrating, and incessant emails about "completed tasks" continue long after you're done. One team member put it bluntly about the paid version: "I think it's a marketing gimmick, borderline scam."
That said, if you're interested in multi-agent orchestration, spending $20 for a month might be worth it purely as an educational tool. You'll get to see how a flow might be set up—but frankly, there are much better ways to spend $20. Like a decent lunch, or literally any other AI product on the market.
Microsoft Copilot AI Agents
There's a plague hitting all the big tech companies at once, and unfortunately it's this: the word “agent” doesn't have much meaning anymore. I want my agents to actually take action - like purchasing or sending or scheduling or calling for an appointment.

Microsoft Copilot Agents - both prebuilt and DIY
Microsoft actually has some really thoughtful design around connecting Copilot to your work documents and email, and that's what gets me excited about these “agents”. For example, other platforms only let you upload static PDFs to pull from (so if a document in your business changes, you have to manually swap out the old PDF for the new one). Microsoft Copilot can connect to living documents, so you can constantly layer AI over that set of data.
Microsoft has been a bit of a laggard behind the top 3 AI labs (OpenAI, Google, Anthropic), so I think it will be a bit before we see true AI agents inside of Copilot. For now, the agents, even the pre-built sales agent we tested out, feel like very high-quality GPTs.
Research & Information Tools
These agents specialize in gathering, analyzing, and synthesizing information.
ChatGPT Deep Research: ChatGPT's edge
For research-intensive tasks, ChatGPT's Deep Research emerged as a favorite in our comparative testing, and we preferred it over Perplexity Deep Research. One team member demonstrated ChatGPT Deep Research's capabilities by researching energy market trends related to the Ukraine-Russia conflict, even showing how it could deliver the entire analysis in Italian on request.
If you’re researching something in your own field, absolutely try it out immediately. And if you’re researching a brand-new topic you know nothing about, prepare to spend a lot of time validating.
I tried it on something I’m a master on and something you’d hope ChatGPT would be a master on: OpenAI’s history of launches. On a quick skim, at least two launches mentioned were off by a YEAR. Oof.
While impressive, these research tools still have limitations: they require patience as you wait for results, follow-up requests struggle to refine initial answers, chronological ordering gets jumbled, and spatial reasoning fails.
You.com ARI: amazing demo, stubborn on follow-ups
ChatGPT Deep Research still has pretty strict rate limits, so I've been frequently using You.com ARI (Advanced Research & Insights). The design of this tool is my favorite by far, mostly because of the PDF report that it provides as a download when it’s done. It looks like a report from McKinsey or something!
However, the ability to control the direction of the research was a bit problematic, so it’s not getting an A. When it kept pulling from questionable sources like "allaboutai.com" (a site I'd never heard of with unverifiable stats), I explicitly asked it to only give me highly credible sites, and provided a specific list of sources. Instead of improving, it went in the opposite direction - giving me nothing cited at all. Even with direct follow-up requests for better sourcing, the system couldn't or wouldn't deliver.

8-page real estate report generated by You.com in just a few minutes (table of contents)
Even the most basic research standards expect reliable citations and verifiable sources. You.com made it harder to hit that bar. We know to verify information from all tools (whether that’s Claude without browsing turned on, Perplexity with browsing, ChatGPT Deep Research, or You.com ARI), and there is no current horizontal AI research tool that is so accurate that it doesn’t require validation.
Key Takeaways From Our Testing
After our extensive testing of these tools (and others that have yet to be released), several patterns emerged about the current state of AI agents:
Satisficing succeeds where maximizing fails
Current agents excel at "satisficing" (finding good-enough solutions) but struggle with "maximizing" (finding optimal solutions). Operator can quickly find a restaurant near your concert venue but fails when asked to create the perfect keto meal plan. You.com will pull research that probably leads you down the right path, but fails when you need hyperspecific stats from targeted sources. This pattern held true across many of these task-focused agents—they're momentum-builders that help you overcome decision paralysis and take immediate action, but rarely deliver the nuanced, high-quality results you'd get from dedicated human effort. Said another way: if you need THE answer, and not just AN answer, use AI as a cheap second opinion, but not as a replacement.
Specialized tools deliver specific value
The most immediately useful agents aren't trying to do everything. Clay AI's focused approach to sales intelligence and Lindy's mini workflow templates deliver targeted value precisely because they don't overreach. However, they still require significant permission barriers that create adoption challenges for enterprises. Look for social proof (trusted recommendations, logos on site, tier one investors, solid fundraising, terms and conditions, free trial testing, YouTube demos and reviews) before over-exposing your data.
Interface appeal ≠ actual capability
Across all categories, the most visually impressive and intuitively appealing interfaces (Perplexity, Proxy's multi-agent visualization) weren't always the most performant. Design can even mask weak capabilities. Less flashy tools like Clay AI can deliver practical value with minimal fuss.
Price-to-value ratios vary dramatically
The cost-effectiveness spectrum is extreme. Manus charges an effective $4 per task for superior performance. ChatGPT offers Operator on Pro ($200/month) and Plus ($20/month). Clay AI provides specialized value for GTM teams. Proxy's paid tier charges premium prices for flashy visualizations with minimal practical benefit. The economic equation differs widely across tools and use cases. But AI-forward SMBs should continue to earmark $2,000-$3,000 per head per year for AI (including tools, compute, and upskilling). Enterprises might spend even more.
Full autonomy remains a mirage (for now)
Whether it's Operator suggesting three foods for a week-long meal plan (again, where art thou, pecans?), Perplexity scheduling over existing meetings, or You.com citing questionable sources, AI agents today consistently require human oversight, correction, and verification. The promise of "set it and forget it" automation simply isn't here yet. Today's agents are assistants, not replacements…and often not very competent ones.
What This Means For You
So where does all this leave us? Here are five practical takeaways to guide your AI agent strategy:
Be realistic about delegation boundaries. Use agents for "good enough" tasks where you just need a solution quickly. For tasks where quality and nuance matter, either handle them yourself or accept that you'll need to provide significant oversight to any AI agent.
Start with some specialized tools. Rather than general agents trying to do everything, domain-specific tools like Clay AI and industry-specific solutions deliver immediate value by focusing on what they do best. Ex: use Clay for sales intelligence, Deep Research for research tasks, Operator for quick decisions.
Beware the data access trap. When an agent asks for sweeping permissions, ask yourself if the demonstrated value justifies the privacy trade-off. Consider testing with dummy accounts first, as our team did with Lindy, to evaluate performance before connecting your real data.
Calculate your personal task economics. Before subscribing to premium agents, track how many delegable tasks you perform weekly and run the numbers (there's a back-of-napkin sketch after this list). Is Manus at $4/task worth it for your specific use case? Is it worth it to get a glimpse into the future? The answer depends entirely on your time value and what alternative solutions cost you.
Focus on complementary capabilities. The most effective agent strategy pairs AI capabilities with human expertise. Seek tools that enhance your work process rather than promising to replace it entirely—like Clay's intelligent data gathering or ChatGPT's ability to translate complex concepts or You.com researching a topic you know well.
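Here's that back-of-napkin task math as a runnable sketch. Every input is a placeholder—plug in your own task count, time saved, and hourly rate; the $4/task figure is the Manus estimate from earlier in this piece.

```python
# Back-of-napkin task economics. All inputs are placeholders; swap in
# your own numbers. The $4/task figure is the Manus estimate above.

tasks_per_week = 10          # delegable tasks you actually have
minutes_saved_per_task = 20  # time the agent genuinely saves you
your_hourly_rate = 75        # what your time is worth, $/hour
cost_per_task = 4.0          # e.g. Manus at ~$4/task

weekly_value = tasks_per_week * (minutes_saved_per_task / 60) * your_hourly_rate
weekly_cost = tasks_per_week * cost_per_task

print(f"Time saved is worth ${weekly_value:.2f}/week")   # $250.00 here
print(f"Agent costs         ${weekly_cost:.2f}/week")    # $40.00 here
print("Worth it!" if weekly_value > weekly_cost else "Do it yourself.")
```

Remember to discount the "time saved" number for oversight: if you spend ten minutes reviewing and redirecting a twenty-minute task, the agent only saved you ten.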
AI Agents and the Road Ahead
Before I sign off, it's worth noting that this testing captures AI agents at a specific moment in time (aka RIGHT NOW, people!), and it’s a snapshot that may soon look as outdated as your first smartphone. Mine was a Blackberry by the way, shoutout BBM.
Despite today’s face-palm failures, I’m aligned with Sam Altman’s bold declaration that “2025 is when agents will work.” The awkward teenage phase never lasts forever, and these agents are sure to grow up fast.
My prediction is that by the end of this year, we'll see at least one genuinely capable horizontal AI agent that doesn't require a computer science degree to operate. I fully expect the evolution to continue past agents that think a bacon-avocado-steak trifecta constitutes a complete meal plan to ones that actually understand keto nutrition action plans and won't confuse Washington Square Park with Washington Heights. THE FUTURE WILL HAVE PECANS!
We also know more is coming, like Project Mariner, Google's autonomous browser agent, and Agentspace, also from Google, which really does look like those Microsoft Copilot Agents I shared earlier.

Google Agent marketplace - screenshot from Agentspace teaser video
An AI agent requires goals, an agent core, memory, reasoning and planning, and tools. So it stands to reason that the player who handles everything on that list the best (at the best performance, with the most intuitive design, and at the right speed) will win. But hey - let’s check in next week, maybe the world will look completely different.
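To tie that checklist (goals, agent core, memory, reasoning and planning, tools) to something concrete, here's a minimal, hypothetical agent loop. The `llm_plan` function is a stand-in for whatever model a real agent core calls, and the tools are toys—this is the shape of the thing, not any vendor's implementation.

```python
# Minimal sketch of the agent anatomy above: a goal, an agent core
# (the loop plus a model call), memory, and tools. llm_plan() is a
# stand-in for a real model call; the tools are toy functions.

def search_web(query):
    return f"(pretend search results for {query!r})"

def send_email(to, body):
    return f"(pretend email sent to {to})"

TOOLS = {"search_web": search_web, "send_email": send_email}

def llm_plan(goal, memory):
    # Stand-in for the reasoning/planning step a real model performs:
    # decide the next action given the goal and what's been done so far.
    if not memory:
        return ("search_web", {"query": goal})
    return ("done", {})

def run_agent(goal, max_steps=5):
    memory = []  # running record of actions and results
    for _ in range(max_steps):
        action, args = llm_plan(goal, memory)
        if action == "done":
            return memory
        result = TOOLS[action](**args)  # execute the chosen tool
        memory.append((action, result))
    return memory  # step cap: the human-oversight escape hatch

print(run_agent("find keto meal ideas that aren't just bacon"))
```

Every agent we tested is some version of this loop; the quality differences come down to how well the planning step reasons and how gracefully the loop recovers when a tool call goes sideways.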
Helpful stats on AI agents to share with your friends
» The global AI agents market is projected to grow from $5.1 billion in 2024 to $47.1 billion by 2030, with a CAGR of 44.8% (source)
» 82% of companies plan to integrate AI agents in the next 1-3 years (source)
» 90% of users prefer to receive assistance from a human rather than a chatbot, despite AI offering better availability, faster answers, and more information (source)
» AI agents can increase task completion speed by 126% for programmers (source)
» 39% of consumers are comfortable with AI agents scheduling appointments while only 24% are comfortable with AI agents shopping for them (source)
Stay curious, stay informed,
Allie
P.S. Are you testing AI agents? In your personal life or at work? Which AI agent have you found most useful? Reply to this email and let me know. I'd love to hear about your experiences, especially if they contradict our findings.
🧠 I'm working on a new AI course, and I need your help. What best describes where you are in your AI journey?
TAKE THIS QUIZ TO SEE IF YOU’RE AI-FIRST

If you’re reading this newsletter, you’re probably not behind—but the clock is ticking.
Thousands of people ask me the same thing: “Where do I start with AI?”
Now I have a fun answer: Take the 3-minute AI-First Quiz. You’ll walk away with:
Your AI Archetype (10 unique types based on real usage data)
Your personal edge (or blind spot)
A tailored game plan to level up
The exact skills, tools, and mindset you need next
It’s designed to help you stop dabbling and start competing in the AI age with clarity.
Feedback is a Gift
I would love to know what you thought of this newsletter and any feedback you have for me. Do you have a favorite part? Wish I would change something? Felt confused? Please reply and share your thoughts or just take the poll below so I can continue to improve and deliver value for you all.
What did you think of this email course?
Learn more AI through Allie’s socials