This file contains the results of the benchmark we ran on June 1st, 2017 to compare natural language understanding services offering custom solutions (Wit, Luis, Api.ai, Alexa, and Snips) for seven intents. This benchmark and its results are described in this paper and this blog post.
Any publication based on these datasets must include a full citation to the following paper, in which the results were published by the Snips Team¹:

Coucke et al., "Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces" (2018), https://arxiv.org/abs/1805.10190, accepted for a spotlight presentation at the Privacy in Machine Learning and Artificial Intelligence workshop colocated with ICML 2018.
We focused on seven intents:
- SearchCreativeWork (e.g. Find me the I, Robot television show),
- GetWeather (e.g. Is it windy in Boston, MA right now?),
- BookRestaurant (e.g. I want to book a highly rated restaurant for me and my boyfriend tomorrow night),
- PlayMusic (e.g. Play the last track from Beyoncé off Spotify),
- AddToPlaylist (e.g. Add Diamonds to my roadtrip playlist),
- RateBook (e.g. Give 6 stars to Of Mice and Men),
- SearchScreeningEvent (e.g. Check the showtimes for Wonder Woman in Paris).
More than 2000 queries were generated for each intent using crowdsourcing. Sets of 10, 20, 50, 70, and around 2000 queries (depending on the intent) were randomly sampled several times to train the competitors. The training sets of up to 70 queries are drawn from the `train_<intent>.json` files, while the ~2000-query sets are provided in the `train_<intent>_full.json` files.
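For illustration, here is a minimal sketch of how a training subset could be drawn from one of these files, assuming the JSON layout used in this repository (the intent name maps to a list of queries, and each query is a list of text chunks, some of which carry an entity label); the file path and function names are illustrative:

```python
import json
import random

def load_queries(path, intent):
    # Load all queries for a given intent from a train_<intent>.json file.
    with open(path, encoding="utf-8") as f:
        dataset = json.load(f)
    return dataset[intent]  # list of queries, each of the form {"data": [chunks]}

def sample_training_set(queries, size, seed=0):
    # Randomly draw a training subset, as done for the 10/20/50/70-query runs.
    return random.Random(seed).sample(queries, size)

def query_text(query):
    # Reassemble the raw sentence from its annotated chunks.
    return "".join(chunk["text"] for chunk in query["data"])

# Hypothetical usage for the AddToPlaylist intent
queries = load_queries("AddToPlaylist/train_AddToPlaylist.json", "AddToPlaylist")
train_70 = sample_training_set(queries, 70)
print(query_text(train_70[0]))
```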
We focused here on slot filling (rather than intent classification), hence the benchmark was performed separately for each intent.
The results of the benchmark are reported in the `metrics` files, one per intent and per competitor. More specifically, precision and recall are reported for each slot. Each value appears 3 times, once per random sample drawn for each training-set size.
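As a reminder of how such per-slot figures are defined, here is a minimal sketch of the computation (the per-provider matching rules are described below; the data representation and function names are illustrative, not the exact evaluation code):

```python
from collections import defaultdict

def slot_precision_recall(gold, predicted):
    # gold and predicted are lists (one entry per query) of sets of
    # (slot_name, value) pairs. A predicted slot is a true positive if the
    # same pair appears in the gold annotation of that query.
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for gold_slots, pred_slots in zip(gold, predicted):
        for slot, value in pred_slots:
            if (slot, value) in gold_slots:
                tp[slot] += 1
            else:
                fp[slot] += 1
        for slot, value in gold_slots:
            if (slot, value) not in pred_slots:
                fn[slot] += 1

    metrics = {}
    for slot in set(tp) | set(fp) | set(fn):
        precision = tp[slot] / (tp[slot] + fp[slot]) if tp[slot] + fp[slot] else 0.0
        recall = tp[slot] / (tp[slot] + fn[slot]) if tp[slot] + fn[slot] else 0.0
        metrics[slot] = {"precision": precision, "recall": recall}
    return metrics
```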
In general, we gave each provider the benefit of the doubt. If anything, their performance is overestimated.
Api.ai: The following built-in entities were used: `@sys.date-time`, `@sys.geo-city-us`, `@sys.geo-country`, `@sys.geo-state-us`, `@sys.music-artist`, `@sys.music-genre`.
The model cannot really be trained programmatically: each intent still needs to be saved manually on the web console (or through a curl command with an updated cookie). Because the parser returns formatted values for the date-time and location built-in entities without span indices, we counted a prediction as correct if an entity of the same type was present in the original query (without checking for an exact match).
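To make this lenient matching rule concrete, here is an illustrative sketch (the entity representation and function name are hypothetical, not the exact evaluation code):

```python
def is_correct(predicted_type, gold_entities):
    # Lenient rule for entities returned without span indices: a prediction
    # counts as correct if the gold annotation of the query contains at least
    # one entity of the same type, regardless of its exact value or position.
    return any(entity["type"] == predicted_type for entity in gold_entities)
```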
Wit.ai: The following built-in entity was used: `wit$datetime`.
The model cannot really be trained programmatically: each entity needs to be set manually to 'free text' on the web console (or through a curl command with an updated cookie). Because the parser returns formatted values for the datetime built-in entity without span indices, we counted a prediction as correct if a datetime entity was also present in the original query (without checking for an exact match).
Luis: The following built-in entity was used: `datetimeV2`.
The model can be trained fully programmatically through the API reference.
Alexa: There is no API reference available, so we had to proceed manually.
The following built-in entities were used: `AMAZON.TIME`, `AMAZON.DATE`, `AMAZON.US_CITY`, `AMAZON.Country`, `AMAZON.US_STATE`, `AMAZON.MusicGroup`, `AMAZON.Genre`, `AMAZON.MusicRecording`, `AMAZON.MusicAlbum`, `AMAZON.MusicCreativeWorkType`, `AMAZON.MusicPlaylist`, `AMAZON.Book`, `AMAZON.CreativeWorkType`, `AMAZON.MovieTheater`, `AMAZON.Movie`.
Because the parser returns formatted values for the datetime built-in entity without span indices, we counted a prediction as correct if a datetime entity was also present in the original query (without checking for an exact match). We also took into account that this provider spells digits out as words (e.g. "6" is returned as "six").
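For illustration, digit-to-word normalization of this kind can be applied before comparing predicted and gold slot values; the sketch below is illustrative, not the exact evaluation code:

```python
# Spelled-out forms of standalone digits (illustrative mapping).
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def normalize(value):
    # Lowercase and replace standalone digit tokens with their spelled-out form.
    tokens = value.lower().split()
    return " ".join(DIGIT_WORDS.get(token, token) for token in tokens)

def values_match(predicted, gold):
    # Compare slot values after normalization, e.g. "6 stars" vs "six stars".
    return normalize(predicted) == normalize(gold)
```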
Snips: The following built-in entity was used: `snips/datetime`.
This benchmark was run by Snips, which is one of the solutions benchmarked. To avoid bias, it was run blind by a team independent from the data scientists working on the problem. The raw data is provided here to guarantee transparency.
¹ The Snips team joined Sonos in November 2019. These open datasets remain available, and access to them is now managed by the Sonos Voice Experience Team. Please email sve-research@sonos.com with any questions.