In my first machine learning (ML) product manager role, a simple question sparked passionate debates across functions and leaders: How do we know if this product actually works? The product in question served both internal and external customers. The model enabled internal teams to identify the top issues our customers faced so that they could prioritize the right set of experiences to solve those problems. With such a complex network of interdependencies between internal and external customers, choosing the right metrics to capture the product's impact was critical to steering it toward success.
Not tracking whether your product is working well is like trying to land a plane without any air traffic control instructions. There is absolutely no way to make informed decisions for your customer without knowing what is going well or badly. Additionally, if you do not actively define the metrics, your team will come up with their own backup metrics. The risk of having multiple flavors of an "accuracy" or "quality" metric is that everyone will develop their own version, leading to a scenario in which not everyone is working toward the same outcome.
For example, when I reviewed my annual goal and the underlying metric with our engineering team, the immediate feedback was: "But this is a business metric; we already track precision and recall."
First, identify what you want to know about your AI product
Once you are tasked with defining the metrics for your product, where do you start? In my experience, the complexity of operating an ML product with multiple customers carries over into defining metrics for the model as well. What do I use to measure whether the model is working well? Measuring the outcome of internal teams prioritizing launches based on our models would not be fast enough; measuring whether the customer adopted the solutions recommended by our model risked drawing conclusions from an overly broad adoption metric (what if the customer did not adopt the solution because they simply wanted to reach a support agent?).
Fast-forward to the era of large language models (LLMs): instead of a single output from a single ML model, we now also have text responses, images and music as outputs. The product dimensions that require metrics multiply rapidly: formats, customers, type... the list goes on.
Across all my products, when I try to identify metrics, my first step is to distill what I want to know about the product's impact on customers into a few key questions. Identifying the right set of questions makes it easier to identify the right set of metrics. Here are a few examples:
- Did the customer get an output? → Coverage metric
- How long did it take for the product to provide an output? → Latency metric
- Did the user like the output? → Customer feedback, customer adoption and retention metrics
Once you identify your key questions, the next step is to identify a set of sub-questions for 'input' and 'output' signals. Output metrics are lagging indicators: you measure an event that has already happened. Input metrics and leading indicators can be used to identify trends or predict outcomes. Below are ways of attaching the right subset of lagging and leading indicators to the questions above; not every question needs a leading/lagging indicator. (A minimal sketch of how such a metric catalog might be encoded follows the list.)
- Did the customer get an output? → Coverage
- How long did it take for the product to provide an output? → Latency
- Did the user like the output? → Customer feedback, customer adoption and retention
- Did the user indicate that the output was right/wrong? (Output)
- Was the output good/fair? (Input)
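To make the lagging/leading distinction concrete, here is a minimal Python sketch (with hypothetical names, not a prescribed schema) of how a team might catalog its metrics alongside the question each one answers:

```python
from dataclasses import dataclass
from enum import Enum

class MetricNature(Enum):
    LAGGING = "lagging"   # output signal: measures an event that has already happened
    LEADING = "leading"   # input signal: helps identify trends or predict outcomes

@dataclass
class MetricDefinition:
    question: str          # the key question this metric answers
    name: str              # metric name, e.g. "coverage"
    nature: MetricNature   # lagging (output) or leading (input)

# Hypothetical catalog mapping the questions above to metrics
METRICS = [
    MetricDefinition("Did the customer get an output?", "coverage", MetricNature.LAGGING),
    MetricDefinition("How long did it take to provide an output?", "latency", MetricNature.LAGGING),
    MetricDefinition("Did the user indicate the output was right/wrong?", "explicit_feedback", MetricNature.LAGGING),
    MetricDefinition("Was the output good/fair?", "rubric_quality", MetricNature.LEADING),
]
```

Writing the catalog down in one place, whatever the format, keeps every team reading the same definitions instead of inventing parallel versions of "quality."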
The third and final step is to identify the method for collecting the metrics. Most metrics are collected at scale through new instrumentation via data engineering. In some cases, however (as with question 3 above), especially for ML-based products, you have the option of manual or automated evaluations that assess the model's outputs. While it is always best to develop automated evaluations, starting with manual evaluations for "Was the output good/fair?" and creating a rubric with definitions of good, fair and not good will help lay the groundwork for rigorous, tested automated evaluations.
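For the manual evaluation path, the rubric can be encoded directly so that every reviewer grades outputs against the same definitions. The sketch below assumes hypothetical grade labels and a simple tally; it is illustrative, not a prescribed implementation:

```python
from collections import Counter

# Hypothetical rubric: each grade needs a written definition the reviewers agree on
RUBRIC = {
    "good": "Output fully answers the request with no factual or formatting issues",
    "fair": "Output is usable but needs minor edits",
    "not_good": "Output is wrong, irrelevant or unusable",
}

def summarize_manual_evals(grades: list[str]) -> dict[str, float]:
    """Turn a list of reviewer grades into % good / fair / not_good."""
    unknown = [g for g in grades if g not in RUBRIC]
    if unknown:
        raise ValueError(f"Grades outside the rubric: {unknown}")
    counts = Counter(grades)
    total = len(grades)
    return {grade: round(100 * counts[grade] / total, 1) for grade in RUBRIC}

# Example: 10 manually reviewed outputs
print(summarize_manual_evals(["good"] * 6 + ["fair"] * 3 + ["not_good"]))
# {'good': 60.0, 'fair': 30.0, 'not_good': 10.0}
```

Once the rubric stabilizes, the same grade definitions can seed an automated evaluation.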
Example use cases: search, listing descriptions
The framework above can be applied to any ML-based product to identify the list of primary metrics for your product. Let's take search as an example.
Question | Metrics | Nature of metric |
---|---|---|
Did the customer get an output? → Coverage | % of search sessions in which search results are shown to the customer | Output |
How long did it take for the product to provide an output? → Latency | Time taken to show search results to the user | Output |
Did the user like the output? → Customer feedback, customer adoption and retention. Did the user indicate the output was right/wrong? (Output) Was the output good/fair? (Input) | % of search sessions with 'thumbs up' feedback on the search results from the customer, or % of search sessions with clicks from the customer; % of search results marked as 'good/fair' for each search term, per the quality rubric | Output; Input |
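As a rough illustration of how the output metrics in this table might be computed from instrumentation, here is a sketch that aggregates hypothetical search-session log records (the field names are assumptions, not a real schema):

```python
from statistics import median

# Hypothetical per-session log records emitted by instrumentation
sessions = [
    {"session_id": "s1", "results_shown": True,  "latency_ms": 120, "thumbs_up": True,  "clicked": True},
    {"session_id": "s2", "results_shown": True,  "latency_ms": 340, "thumbs_up": False, "clicked": False},
    {"session_id": "s3", "results_shown": False, "latency_ms": None, "thumbs_up": False, "clicked": False},
]

total = len(sessions)
with_results = [s for s in sessions if s["results_shown"]]

coverage_pct = 100 * len(with_results) / total                      # % of sessions with results shown
median_latency_ms = median(s["latency_ms"] for s in with_results)   # latency where results were returned
thumbs_up_pct = 100 * sum(s["thumbs_up"] for s in with_results) / len(with_results)
click_through_pct = 100 * sum(s["clicked"] for s in with_results) / len(with_results)

print(f"coverage: {coverage_pct:.1f}%, median latency: {median_latency_ms} ms, "
      f"thumbs-up: {thumbs_up_pct:.1f}%, clicks: {click_through_pct:.1f}%")
```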
How about a product that generates descriptions for a listing (whether a menu item on DoorDash or a product listing on Amazon)?
Question | Metrics | Nature of metric |
---|---|---|
Did the customer get an output? → Coverage | % of listings with a generated description | Output |
How long did it take for the product to provide an output? → Latency | Time taken to generate a description for the user | Output |
Did the user like the output? → Customer feedback, customer adoption and retention. Did the user indicate the output was right/wrong? (Output) Was the output good/fair? (Input) | % of listings with generated descriptions that required edits from the technical team/seller/customer; % of listing descriptions marked as 'good/fair', per the quality rubric | Output; Input |
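The same pattern applies here; a sketch under the same assumptions (hypothetical per-listing records, illustrative field names) might look like this:

```python
# Hypothetical per-listing records; field names are assumptions for illustration
listings = [
    {"listing_id": "a1", "description_generated": True,  "required_edit": False, "rubric_grade": "good"},
    {"listing_id": "a2", "description_generated": True,  "required_edit": True,  "rubric_grade": "fair"},
    {"listing_id": "a3", "description_generated": False, "required_edit": None,  "rubric_grade": None},
]

generated = [l for l in listings if l["description_generated"]]

coverage_pct = 100 * len(generated) / len(listings)                                 # output metric
edit_rate_pct = 100 * sum(l["required_edit"] for l in generated) / len(generated)   # output metric
good_or_fair_pct = 100 * sum(l["rubric_grade"] in ("good", "fair") for l in generated) / len(generated)  # input metric

print(f"coverage: {coverage_pct:.1f}%, edit rate: {edit_rate_pct:.1f}%, good/fair: {good_or_fair_pct:.1f}%")
```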
The approach outlined above is extensible to any ML-based product. I hope this framework helps you define the right set of metrics for your ML model.
Sharanya Rao is a group product manager at Intuit.