Q: IS THE PREDICTIVE TOOLKIT A NEW ADDITIONAL ADD-ON OR A NEW INCLUDED TOOL?
A: The Big Squid Predictive Tool Kit is a stand-alone app found in the Domo Appstore. It has been available for several months now. To learn more, visit the Domo Appstore and search for it.
Q: HOW ARE THE THREE PREDICTIVE MODELS IN "MODEL BUILDER" SELECTED? ARE THOSE ALWAYS THE SAME OR DO THE TYPE OF MODELS ADAPT ACCORDING TO THE DATA ANALYZED?
A: Depending on the business problem, and therefore the statistical paradigm, a variety of algorithms are applied, including traditional regression techniques as well as decision trees and other algorithm families. We use the algorithms we consider "best in class" among those that pertain to the paradigm (e.g., binary classification, regression, time series).
Q: CAN YOU PROGRAM YOUR OWN MODEL OR MODIFY THE OPTIONS PROVIDED IN MODEL BUILDER?
A: Currently the ability to program your own model isn’t available in the Predictive Toolkit, but what you can do is modify the datasets coming in from Domo with additional attributes or detail, and thereby potentially change the approach taken in building out that model.
We envision some future capabilities like these being enabled through a software development kit (SDK), where you could deploy your own code and run your own model alongside all the other built-in models preloaded in the Predictive Toolkit.
Q: HOW ARE THE DATA MODELS VALIDATED? DO YOU USE AUC (AREA UNDER THE CURVE) OR OTHER VALIDATION METHOD(S)?
A: We validate via several “goodness of fit” measures, depending on the type of statistical paradigm the dataset necessitates. For instance, a binary classification problem will validate based on goodness of fit in training and test portions of the dataset, how well training and test portions match, and parsimony. We encourage you to download the app from the Domo Appstore, run the demo, and click on ‘SciMode’ to see an example of this.
Q: HOW DO YOU VERIFY THE ACCURACY OF PREDICTION?
A: We validate the accuracy of the model by breaking the dataset into a training portion and a testing portion for each algorithm. The training portion is where the algorithm learns the patterns in the data; we then test what it learned against the test portion. In doing so we can assess the level of accuracy on a number of measures in both the training and test portions of that dataset.
Those are the underlying statistics you see for each algorithm: its level of accuracy on both the training and test portions of the exercise.
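To make "accuracy on the training and test portions" concrete, here is a minimal sketch in Python. The data and numbers are invented for illustration; this is not the Toolkit's actual scoring code, and real binary-classification validation would typically use several measures, not just one.

```python
def accuracy(predictions, actuals):
    """Fraction of predictions matching the actual outcomes --
    one simple 'goodness of fit' measure for a binary classifier."""
    correct = sum(p == a for p, a in zip(predictions, actuals))
    return correct / len(actuals)

# Invented binary-classification results for illustration only.
train_acc = accuracy([1, 0, 1, 1, 0, 1, 0, 1], [1, 0, 1, 1, 0, 1, 1, 1])
test_acc = accuracy([1, 0, 0, 1], [1, 0, 1, 1])
# A large drop from train_acc to test_acc is a warning sign that the
# model learned noise in the training data rather than real patterns.
```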
Q: HOW ARE THE TRAINING DATA SELECTED FOR BUILDING OUT THE STATISTICAL MODELS?
A: The Predictive Toolkit partitions the dataset randomly into training and test portions, allowing the machine learning (ML) algorithms to learn patterns in the data in the training set and then apply and test those findings in the testing phase.
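The random partition described above can be sketched as follows. The 75/25 split fraction and the fixed seed are illustrative assumptions, not the Toolkit's settings; the Toolkit performs its own partitioning internally.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=42):
    """Randomly partition rows into a training portion and a test
    portion. Shuffling first ensures the split is random rather than
    dependent on the order rows arrived in."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))  # a toy dataset of 100 "rows"
train, test = train_test_split(rows)
# Every row lands in exactly one of the two portions.
```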
Q: HOW MANY ROWS OF DATA ARE CONSIDERED STATISTICALLY SIGNIFICANT WITH THE PREDICTIVE TOOLKIT?
A: The answer is: it depends. We can build models with relatively few rows of data as long as there are relevant and rich attributes that go along with them.
Conversely, if the attribute set isn’t as rich, we typically need more observations (more rows) to compensate. So the answer depends on the complexity of the data and ultimately the richness of the attribute set.
Q: DOES PREDICTIVE TOOLKIT USE MULTI-REGRESSION ANALYSIS?
A: Depending on the business problem and therefore the statistical paradigm, a variety of algorithms are applied including traditional regression techniques, as well as decision tree and other algorithm families.
Q: HOW MANY DATASETS OR SOURCES/COLUMNS ARE NEEDED FOR THE PREDICTIVE TOOLKIT TO WORK/WORK OPTIMALLY?
A: It really depends. Some customers use essentially a single data source, and therefore a single dataset. Others have used the Domo ETL platform to transform and join multiple data sources together; one use case that comes to mind is combining sales and marketing data for a more holistic view of the sales and marketing process.
As for the number of columns and rows, this again depends on how rich the attributes are, meaning how relevant the variables are that you think influence the predicted outcome. If the attribute set is rich and relevant, fewer rows are needed.
Q: HOW DO YOU PROTECT AGAINST OVERFITTING?
A: Overfitting, or creating overly complex statistical models that have poor predictive performance, is a common pitfall for anyone using these algorithms who isn’t careful. Big Squid protects against overfitting through our scoring process in the Predictive Toolkit (PTK). This includes the way we score the algorithms themselves: reviewing parsimony (the number of parameters used to explain the data at a given level of accuracy), reviewing accuracy in both the training and test portions of the model, and making adjustments to compensate for what we perceive to be overfitting from a statistical standpoint.
If a model appears to be overfit, it will score lower as a result. As part of onboarding a customer, we go through that process, build out the model, and review the findings together to make sure we’re not overfitting.
The algorithms we use are very powerful, and that power can be a risk. By adjusting for it in the scoring of the models and then reviewing the results empirically, we can see whether a model is overfitting and reduce the number of attributes we bring into it.
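The scoring idea described above, penalising the train/test gap and rewarding parsimony, might be sketched like this. The formula, weights, and numbers are all invented for illustration; they are not Big Squid's actual scoring method.

```python
def overfit_gap(train_accuracy, test_accuracy):
    """Gap between training and test accuracy; a large positive gap
    suggests the model memorised the training data (overfitting)."""
    return train_accuracy - test_accuracy

def score_model(train_accuracy, test_accuracy, n_parameters,
                gap_weight=0.5, parsimony_weight=0.01):
    """Toy composite score: reward test accuracy, penalise the
    train/test gap, and penalise parameter count (parsimony).
    All weights here are invented for illustration."""
    gap = max(0.0, overfit_gap(train_accuracy, test_accuracy))
    return test_accuracy - gap_weight * gap - parsimony_weight * n_parameters

# An overfit, complex model (high train, low test accuracy) ends up
# scoring below a simpler model that generalises well.
complex_model = score_model(0.99, 0.70, n_parameters=40)
simple_model = score_model(0.85, 0.82, n_parameters=5)
```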
Q: HOW DO YOU PREPARE THE DATA TO SUPPORT AT DIFFERENT AGGREGATION LEVELS?
A: We approach this by building a DataFlow or dataset with the aggregation embedded in the dataset itself. Take the sales forecast example, where we’re trying to build a forecast at the store and department level, with multiple stores and multiple departments in each store. We can build a dataset where store and department are attributes, so we get a forecast at the store and department level.
When that result dataset is pushed back into Domo as a new dataset, you can then aggregate it back up to the corporate level or any other level you want. So that’s one way: configure the dataset so it has the right level of aggregation to power any drill-through and/or aggregation you need.
The second approach is to create different DataFlows for different aggregation levels and run them separately as different models through the toolkit. In this example we’d have corporate, regional, and store-level models.
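The roll-up step in the first approach can be sketched as follows: once store/department-level forecasts land back in Domo as a result dataset, they can be aggregated to any coarser level. All names and figures here are invented for illustration.

```python
from collections import defaultdict

# Toy store/department-level forecast rows, as they might come back
# from a predictive model (values are invented).
forecasts = [
    {"store": "A", "dept": "toys",    "forecast": 100.0},
    {"store": "A", "dept": "apparel", "forecast": 250.0},
    {"store": "B", "dept": "toys",    "forecast": 80.0},
    {"store": "B", "dept": "apparel", "forecast": 170.0},
]

def aggregate(rows, key):
    """Roll fine-grained forecasts up to a coarser level: per store
    with key="store", or company-wide with key=None."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key] if key else "total"] += row["forecast"]
    return dict(totals)

per_store = aggregate(forecasts, "store")
corporate = aggregate(forecasts, None)
```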
Q: WHAT LEVEL OF SUPPORT AND TRAINING IS OFFERED TO HELP MAKE THE BEST USE OF THE PREDICTIVE TOOLKIT?
A: When we onboard customers, we don’t simply drop the App in their Domo instance and expect them to take it and move forward on their own. Big Squid works closely with the customer to build out the first model, typically in a 30-day rollout covering data preparation and its best practices, construction of the dataset, algorithm selection, scoring, and addressing issues like overfitting.
We do that collaboratively with the customer, so at the end of that 30-day rollout the app is essentially self-service: you can go on to build your own models, or refine the models we built together, with a working knowledge of our best practices and methodology. Whether you adopt them in whole or in part is entirely up to the customer, but we do want that level of engagement as we roll the Predictive Toolkit out to customers.