Meta today announced a partnership with Cerebras Systems to power its new Llama API, offering developers access to inference speeds up to 18 times faster than traditional GPU-based solutions.
The announcement, made at Meta's inaugural LlamaCon developer conference in Menlo Park, positions the company to compete directly with OpenAI, Anthropic and Google in the fast-growing AI inference service market, where developers purchase tokens by the billions to power their applications.
“Meta has selected Cerebras to collaborate to deliver the ultra-fast inference they need to serve developers through their new Llama API,” said Julie Shin Choi, chief marketing officer at Cerebras, during a press briefing. “We at Cerebras are really, really excited to announce our first CSP hyperscaler partnership to deliver ultra-fast inference to all developers.”
The partnership marks Meta's formal entry into the business of selling AI computation, transforming its popular open-source Llama models into a commercial service. While Meta's Llama models have accumulated around one billion downloads, until now the company had not offered first-party cloud infrastructure for developers to build applications with them.
“This is very exciting, even without talking about Cerebras specifically,” said James Wang, a senior executive at Cerebras. “OpenAI, Anthropic, Google: they have built an entirely new business from scratch, which is the AI inference business. Developers building AI applications will buy tokens by the millions, sometimes by the billions.”

Breaking the speed barrier: How Cerebras supercharges Llama models
What distinguishes Meta's offering is the dramatic speed increase delivered by Cerebras' specialized inference chips. The Cerebras system delivers more than 2,600 tokens per second for Llama 4 Scout, compared to approximately 130 tokens per second for ChatGPT and around 25 tokens per second for DeepSeek, according to benchmarks from Artificial Analysis.
“If you just compare API-to-API, Gemini and GPT are all great models, but they all run at GPU speeds, which is roughly 100 tokens per second,” Wang explained. “And 100 tokens per second is fine for chat, but it's very slow for reasoning. It's very slow for agents. And people are struggling with that today.”
This speed advantage enables entirely new categories of applications that were previously impractical, including real-time agents, conversational low-latency voice systems, interactive code generation, and instant multi-step reasoning, all of which require chaining multiple LLM calls that can now be completed in seconds rather than minutes.
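To put those numbers in perspective, here is a rough back-of-envelope calculation, a sketch in Python using the throughput figures cited above; the workload shape (ten chained calls of 500 output tokens each) is an illustrative assumption, not a benchmark.

    # Rough arithmetic, not a benchmark: time to finish a hypothetical
    # 10-step agent chain (500 output tokens per step) at the throughput
    # figures this article cites from Artificial Analysis.
    THROUGHPUT_TOK_PER_S = {
        "Cerebras (Llama 4 Scout)": 2600,
        "ChatGPT (GPU-based)": 130,
        "DeepSeek (GPU-based)": 25,
    }

    CHAIN_STEPS = 10        # assumed number of chained LLM calls
    TOKENS_PER_STEP = 500   # assumed output tokens per call

    for name, tok_per_s in THROUGHPUT_TOK_PER_S.items():
        seconds = CHAIN_STEPS * TOKENS_PER_STEP / tok_per_s
        print(f"{name}: {seconds:.1f}s end to end")
    # Cerebras: ~1.9s; ChatGPT: ~38.5s; DeepSeek: ~200s (over 3 minutes)

The gap is what turns a multi-step agent from a coffee-break wait into an interactive experience.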
The Llama API represents a significant shift in Meta's strategy, transitioning from being a model provider to becoming a full-service AI infrastructure company. By offering an API service, Meta is creating a revenue stream from its AI investments while maintaining its commitment to open models.
“Meta is now in the business of selling tokens, and it's great for the American AI ecosystem,” Wang said at the press conference. “They bring a lot to the table.”
The API will offer tools for fine-tuning and evaluation, starting with Llama 3.3 8B, allowing developers to generate data, train on it, and test the quality of their custom models. Meta emphasized that it won't use customer data to train its own models, and models built using the Llama API can be transferred to other hosts, a clear differentiation from the more closed approaches of some competitors.
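Meta has not yet published full documentation for the service, but a developer-facing call might look something like the following sketch. Everything here beyond the Llama 3.3 8B model name is an assumption: the OpenAI-compatible interface, the base URL and the API key are illustrative placeholders, not confirmed details of the Llama API.

    # Hypothetical sketch only. Assumes an OpenAI-compatible
    # chat-completions interface; the base URL, API key and exact model
    # identifier are placeholders, not confirmed Llama API details.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.llama.example/v1",  # hypothetical endpoint
        api_key="YOUR_LLAMA_API_KEY",             # placeholder credential
    )

    response = client.chat.completions.create(
        model="llama-3.3-8b",  # the fine-tunable model the article mentions
        messages=[{"role": "user", "content": "Summarize this release note."}],
    )
    print(response.choices[0].message.content)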
Cerebras will power Meta's new service through its network of data centers located across North America, including facilities in Dallas, Oklahoma, Minnesota, Montreal and California.
“All of our data centers that serve inference are in North America at this time,” Choi explained. “We will be serving Meta with the full capacity of Cerebras. The workload will be balanced across all of these different data centers.”
The business arrangement follows what Choi described as “the classic compute provider to a hyperscaler” model, similar to how Nvidia provides hardware to major cloud providers. “They are reserving blocks of our compute so that they can serve their developer population,” she said.
Beyond Cerebras, Meta has also announced a partnership with Groq to provide fast inference options, giving developers multiple high-performance alternatives beyond traditional GPU-based inference.
Meta's entry into the inference API market with superior performance metrics could disrupt the established order dominated by OpenAI, Google and Anthropic. By combining the popularity of its open-source models with dramatically faster inference capabilities, Meta is positioning itself as a formidable competitor in the commercial AI space.
“Meta is in a unique position with 3 billion users, hyperscale data centers and a huge developer ecosystem,” according to Cerebras' presentation materials. The integration of Cerebras technology “helps Meta leapfrog OpenAI and Google in performance by approximately 20x.”
For Cerebras, the partnership represents a major milestone and a validation of its specialized AI hardware approach. “We have been building this wafer-scale engine for years, and we always knew the technology was first-rate, but ultimately it had to end up as part of someone else's hyperscale cloud. That was the final target from a commercial strategy perspective, and we have finally reached that milestone,” Wang said.
The Llama API is currently available as a limited preview, with Meta planning a broader rollout in the coming weeks and months. Developers interested in accessing ultra-fast Llama 4 inference can request early access by selecting Cerebras from the model options within the Llama API.
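Concretely, “selecting Cerebras from the model options” could surface as a request parameter along these lines. This is illustrative only: the endpoint, model identifier and provider field are hypothetical, not documented parameters of the service.

    # Illustrative only: how the provider selection described above might
    # look in a raw request. The endpoint, "model" value and "provider"
    # field are assumptions, not documented Llama API parameters.
    import requests

    resp = requests.post(
        "https://api.llama.example/v1/chat/completions",  # hypothetical endpoint
        headers={"Authorization": "Bearer YOUR_LLAMA_API_KEY"},
        json={
            "model": "llama-4-scout",   # hypothetical model identifier
            "provider": "cerebras",     # hypothetical: route to Cerebras hardware
            "messages": [{"role": "user", "content": "Plan a 5-step research agent."}],
        },
    )
    print(resp.json())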
“If you imagine a developer who doesn't know anything about Cerebras because we're a relatively small company, they can just click two buttons in Meta's standard SDK,” Wang said. “Having us on the back end of Meta's whole developer ecosystem is tremendous for us.”
Meta's choice of specialized silicon signals something deeper: in the next phase of AI, it's not just what your models know, but how quickly they can think it. In that future, speed isn't just a feature; it's the whole point.