In my first machine learning (ML) product manager role, a simple question sparked passionate debates across functions and leaders: How do we know if this product actually works? The product in question served both internal and external customers. The model enabled internal teams to identify the top issues our customers faced so that they could prioritize the right set of experiences to solve those problems. With such a complex network of interdependencies between internal and external customers, choosing the right metrics to capture the product's impact was critical to steering it toward success.
Not tracking whether your product is working well is like trying to land a plane without any air traffic control instructions. There is absolutely no way to make informed decisions for your customer without knowing what is going well or badly. Additionally, if you do not actively define the metrics, your team will come up with their own backup metrics. The risk of having multiple flavors of an "accuracy" or "quality" metric is that everyone will develop their own version, leading to a scenario in which not everyone is working toward the same outcome.
For example, when I reviewed my annual goal and the underlying metric with our engineering team, the immediate feedback was: "But this is a business metric; we already track precision and recall."
First, identify what you want to know about your AI product
Once you are tasked with defining the metrics for your product, where do you start? In my experience, the complexity of operating an ML product with multiple customers carries over into defining metrics for the model as well. What do I use to measure whether the model is working well? Measuring the outcome of internal teams prioritizing launches based on our models would not be fast enough; measuring whether the customer adopted the solutions recommended by our model risked drawing conclusions from an overly broad adoption metric (what if the customer did not adopt the solution because they simply wanted to reach a support agent?).
Fast-forward to the era of large language models (LLMs): instead of a single output from a single ML model, we now also have text responses, images and music as outputs. The product dimensions that require metrics multiply rapidly: formats, customers, type... the list goes on.
Across all my products, when I try to identify metrics, my first step is to distill what I want to know about the product's impact on customers into a few key questions. Identifying the right set of questions makes it easier to identify the right set of metrics. Here are a few examples:
- Did the customer get an output? → Coverage metric
- How long did it take for the product to provide an output? → Latency metric
- Did the user like the output? → Customer feedback, customer adoption and retention metrics
Once you identify your key questions, the next step is to identify a set of sub-questions for 'input' and 'output' signals. Output metrics are lagging indicators: you measure an event that has already happened. Input metrics and leading indicators can be used to identify trends or predict outcomes. Below are ways of attaching the right subset of lagging and leading indicators to the questions above; not every question needs a leading/lagging indicator. (A minimal sketch of how such a metric catalog might be encoded follows the list.)
- Did the customer get an output? → Coverage
- How long did it take for the product to provide an output? → Latency
- Did the user like the output? → Customer feedback, customer adoption and retention
- Did the user indicate that the output was right/wrong? (Output)
- Was the output good/fair? (Input)
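To make the lagging/leading distinction concrete, here is a minimal Python sketch (with hypothetical names, not a prescribed schema) of how a team might catalog its metrics alongside the question each one answers:

```python
from dataclasses import dataclass
from enum import Enum

class MetricNature(Enum):
    LAGGING = "lagging"   # output signal: measures an event that has already happened
    LEADING = "leading"   # input signal: helps identify trends or predict outcomes

@dataclass
class MetricDefinition:
    question: str          # the key question this metric answers
    name: str              # metric name, e.g. "coverage"
    nature: MetricNature   # lagging (output) or leading (input)

# Hypothetical catalog mapping the questions above to metrics
METRICS = [
    MetricDefinition("Did the customer get an output?", "coverage", MetricNature.LAGGING),
    MetricDefinition("How long did it take to provide an output?", "latency", MetricNature.LAGGING),
    MetricDefinition("Did the user indicate the output was right/wrong?", "explicit_feedback", MetricNature.LAGGING),
    MetricDefinition("Was the output good/fair?", "rubric_quality", MetricNature.LEADING),
]
```

Writing the catalog down in one place, whatever the format, keeps every team reading the same definitions instead of inventing parallel versions of "quality."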
The third and final step is to identify the method for collecting the metrics. Most metrics are collected at scale through new instrumentation via data engineering. In some cases, however (as with question 3 above), especially for ML-based products, you have the option of manual or automated evaluations that assess the model's outputs. While it is always best to develop automated evaluations, starting with manual evaluations for "Was the output good/fair?" and creating a rubric with definitions of good, fair and not good will help lay the groundwork for rigorous, tested automated evaluations.
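For the manual evaluation path, the rubric can be encoded directly so that every reviewer grades outputs against the same definitions. The sketch below assumes hypothetical grade labels and a simple tally; it is illustrative, not a prescribed implementation:

```python
from collections import Counter

# Hypothetical rubric: each grade needs a written definition the reviewers agree on
RUBRIC = {
    "good": "Output fully answers the request with no factual or formatting issues",
    "fair": "Output is usable but needs minor edits",
    "not_good": "Output is wrong, irrelevant or unusable",
}

def summarize_manual_evals(grades: list[str]) -> dict[str, float]:
    """Turn a list of reviewer grades into % good / fair / not_good."""
    unknown = [g for g in grades if g not in RUBRIC]
    if unknown:
        raise ValueError(f"Grades outside the rubric: {unknown}")
    counts = Counter(grades)
    total = len(grades)
    return {grade: round(100 * counts[grade] / total, 1) for grade in RUBRIC}

# Example: 10 manually reviewed outputs
print(summarize_manual_evals(["good"] * 6 + ["fair"] * 3 + ["not_good"]))
# {'good': 60.0, 'fair': 30.0, 'not_good': 10.0}
```

Once the rubric stabilizes, the same grade definitions can seed an automated evaluation.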
Example use cases: search, listing descriptions
The framework above can be applied to any ML-based product to identify the list of primary metrics for your product. Let's take search as an example.
Question | Metrics | Nature of metric |
---|---|---|
Did the customer get an output? → Coverage | % of search sessions in which search results are shown to the customer | Output |
How long did it take for the product to provide an output? → Latency | Time taken to show search results to the user | Output |
Did the user like the output? → Customer feedback, customer adoption and retention. Did the user indicate the output was right/wrong? (Output) Was the output good/fair? (Input) | % of search sessions with 'thumbs up' feedback on the search results from the customer, or % of search sessions with clicks from the customer; % of search results marked as 'good/fair' for each search term, per the quality rubric | Output; Input |
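As a rough illustration of how the output metrics in this table might be computed from instrumentation, here is a sketch that aggregates hypothetical search-session log records (the field names are assumptions, not a real schema):

```python
from statistics import median

# Hypothetical per-session log records emitted by instrumentation
sessions = [
    {"session_id": "s1", "results_shown": True,  "latency_ms": 120, "thumbs_up": True,  "clicked": True},
    {"session_id": "s2", "results_shown": True,  "latency_ms": 340, "thumbs_up": False, "clicked": False},
    {"session_id": "s3", "results_shown": False, "latency_ms": None, "thumbs_up": False, "clicked": False},
]

total = len(sessions)
with_results = [s for s in sessions if s["results_shown"]]

coverage_pct = 100 * len(with_results) / total                      # % of sessions with results shown
median_latency_ms = median(s["latency_ms"] for s in with_results)   # latency where results were returned
thumbs_up_pct = 100 * sum(s["thumbs_up"] for s in with_results) / len(with_results)
click_through_pct = 100 * sum(s["clicked"] for s in with_results) / len(with_results)

print(f"coverage: {coverage_pct:.1f}%, median latency: {median_latency_ms} ms, "
      f"thumbs-up: {thumbs_up_pct:.1f}%, clicks: {click_through_pct:.1f}%")
```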
How about a product that generates descriptions for a listing (whether a menu item on DoorDash or a product listing on Amazon)?
Question | Metrics | Nature of metric |
---|---|---|
Did the customer get an output? → Coverage | % of listings with a generated description | Output |
How long did it take for the product to provide an output? → Latency | Time taken to generate a description for the user | Output |
Did the user like the output? → Customer feedback, customer adoption and retention. Did the user indicate the output was right/wrong? (Output) Was the output good/fair? (Input) | % of listings with generated descriptions that required edits from the technical team/seller/customer; % of listing descriptions marked as 'good/fair', per the quality rubric | Output; Input |
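The same pattern applies here; a sketch under the same assumptions (hypothetical per-listing records, illustrative field names) might look like this:

```python
# Hypothetical per-listing records; field names are assumptions for illustration
listings = [
    {"listing_id": "a1", "description_generated": True,  "required_edit": False, "rubric_grade": "good"},
    {"listing_id": "a2", "description_generated": True,  "required_edit": True,  "rubric_grade": "fair"},
    {"listing_id": "a3", "description_generated": False, "required_edit": None,  "rubric_grade": None},
]

generated = [l for l in listings if l["description_generated"]]

coverage_pct = 100 * len(generated) / len(listings)                                 # output metric
edit_rate_pct = 100 * sum(l["required_edit"] for l in generated) / len(generated)   # output metric
good_or_fair_pct = 100 * sum(l["rubric_grade"] in ("good", "fair") for l in generated) / len(generated)  # input metric

print(f"coverage: {coverage_pct:.1f}%, edit rate: {edit_rate_pct:.1f}%, good/fair: {good_or_fair_pct:.1f}%")
```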
The approach outlined above is extensible to any ML-based product. I hope this framework helps you define the right set of metrics for your ML model.
Sharanya Rao is a group product manager at Intuit.