Interview With Denis Vorotyntsev, Winner of the AutoML on Time Series Regression AutoSeries Challenge
SY: Congratulations on placing first in the AutoSeries competition. Could you tell us what AutoML is, the intent behind it, and why it could be of interest to those looking into machine learning?
DV: Thank you, first of all. This is actually a hard question, because in the data science community we have a lot of different definitions of machine learning, data science, and especially automated machine learning.
For example, could we assume that deep learning is a part of machine learning, or could we say that if I trained a simple linear regression I did some sort of data science or machine learning? It's really hard. To understand what automated machine learning is, we first need to understand what machine learning is in general.
For me, machine learning is actually a way of solving different tasks with the help of data. When we don't know the underlying laws of a process but we have a lot of data, we can apply machine learning to find such laws and make predictions for new, unseen data.
Machine learning is actually a way of automating our decisions. I'll give an example. In the past, it was common to make credit scoring decisions by hand: you apply for a loan, and then some person goes through your application and decides whether or not to give you the loan. Today you can hardly find a bank that doesn't do this automatically with machine learning. This is one example of how we can automate a process: we have a rich history of loans, some of them good, some of them not, and we train a model to make this decision automatically.
The same goes for movie recommendations. We could ask our friends, or maybe some movie experts, tell them which movies we watched and liked, and ask them to recommend something. But if we're talking about websites such as Netflix, it's of course not feasible to do that for a large user base. Here we can apply machine learning techniques to automate this process.
At a general level, machine learning is a process of automating our decisions. Automated machine learning is the process of automating decisions about machine learning, which is the next level of automation. We may see automated machine learning as a box: we give it our data, we say what we want to predict, we give it some metric by which we evaluate our results, and we expect this box to produce a model that is capable of making decisions for new, unseen data. We don't need any knowledge of data science; we don't need to understand different models, how to train them, how to tune hyperparameters, and all the other things that normally come from data science experts.
SY: So AutoML essentially automates the process of training a machine learning model.
DV: AutoML means that we want to optimize every step of our data science project, or as many steps as possible. It's actually a way of finding the parameters of this pipeline that maximize our score. For example, you may select different models, you might train these models with different features, you might select some of these features to keep in the final model or throw some out. To do this yourself, you need some experience with data science. Automated machine learning automates this process: you get a pretty good model without any knowledge of data science or machine learning.
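The idea of treating every pipeline choice as a searchable parameter can be sketched in a few lines. This is a toy illustration of the principle, not a real AutoML package: the candidate "models" and the data below are made up, and the search is a plain exhaustive loop over pipeline choices.

```python
# Toy sketch: treat "which model" and "which feature" as searchable
# hyperparameters, and pick the combination with the best score.

# Toy dataset: y is exactly 2 * x0; x1 is irrelevant noise.
X = [[1, 5], [2, 3], [3, 8], [4, 1], [5, 9]]
y = [2, 4, 6, 8, 10]

def mean_model(xs, ys):
    """Baseline: predict the training mean, ignoring features."""
    mean = sum(ys) / len(ys)
    return lambda row: mean

def slope_model(feature_idx):
    """Fit y = k * x[feature_idx] by least squares through the origin."""
    def fit(xs, ys):
        num = sum(x[feature_idx] * t for x, t in zip(xs, ys))
        den = sum(x[feature_idx] ** 2 for x in xs)
        k = num / den
        return lambda row: k * row[feature_idx]
    return fit

def mse(model, xs, ys):
    """Mean squared error of a fitted model on a dataset."""
    return sum((model(x) - t) ** 2 for x, t in zip(xs, ys)) / len(ys)

# Search space: model choice x feature choice. (A real system would
# score candidates on a held-out validation set, not training data.)
candidates = [("mean", mean_model)] + [
    (f"slope_x{i}", slope_model(i)) for i in range(2)
]
best_name, best_score = None, float("inf")
for name, fit in candidates:
    score = mse(fit(X, y), X, y)
    if score < best_score:
        best_name, best_score = name, score

print(best_name)  # the search discovers that x0 alone explains y
```

A real AutoML system searches a vastly larger space (models, features, preprocessing, tuning) with smarter strategies than brute force, but the framing is the same: pipeline choices become parameters, and the search maximizes a score.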
SY: In your experience, how do these tools and techniques compare to a data scientist? Are they at the level of an expert guiding you along the pipeline and the process, or are they still at an early stage and in need of more development?
DV: It depends. For example, in some areas of deep learning we've had great success finding architectures that outperform human-designed ones. This is actually the core idea of AutoML: the whole process of machine learning can be expressed as a set of hyperparameters.
Let's say you might select one model or another; the model you select is a kind of hyperparameter. You might optimize it, or maybe some hyperparameters of the model itself. Let's say you use a neural network: the number of layers is a hyperparameter, and so on and so forth. And if you go step by step through the data science project, from selecting the task, to selecting the data to use, to selecting the feature engineering process, the number of hyperparameters to optimize increases dramatically.
People don't know how to work in such a high-dimensional space. It's basically like chess: there are too many possible moves, and AI solutions outperform humans at that task. The same goes for optimizing hyperparameters in automated machine learning. When we are talking about the optimization of neural network architectures, I think very few people understand how decisions about the number of neurons or the number of layers in a deep neural network actually affect the final score. Automated search can outperform humans at this task.
SY: Cool. The idea behind AutoML sounds really good. Let's say there's a small team of data scientists, or maybe even people without any machine learning experience who are interested in machine learning. What are some of your concerns, or pitfalls, when applying AutoML?
Could a small team of data scientists, or non-experts, just hand some data off to an AutoML package and let it loose? What can they do to avoid some of the concerns that you have?
DV: The current state of AutoML is that we have several open-source packages; you can just install them and use them right away, which is pretty cool. We also have some companies that specialize in automated machine learning, and you could have them help you. But the problem is that when you're working on a really important project, when you need high scores for your model and you need to understand what is going on, you don't want a boxed solution. You want to build the solution yourself, maybe perform some advanced operations. I don't think AutoML can do that right now.
I'll give you an example. Suppose I have an online store where I sell paper and related materials. I don't have the money or the time to hire a data scientist; I just want a recommendation engine so that people who come to my website can look at some products and get recommendations for what else they could buy.
For me, such a boxed solution won't achieve the best possible score, but it's easy to use and it's plug and play: you put it somewhere on your website and that's it. That's all you need to know; AutoML in this case is the best option. But suppose instead I'm working in a bank and building credit scoring models. In that case, I'm really interested in achieving as high a score as possible. Maybe I will use some sophisticated techniques from recent papers, or maybe I will read the top blogs in the field and try to do the same in my pipeline. AutoML can't do that.
It is capable of optimizing some parameters, but it doesn't let you work freely. In short, that's the current problem with AutoML: if we're talking about the highest-scoring pipelines, AutoML is not the best choice right now.
SY: Could a small team of data scientists tweak some of the parameters that AutoML would be using, and maybe tune the AutoML process a little bit for that specific task to improve scores?
DV: Yes. For example, AutoML is really bad at feature engineering, because it's obviously completely unaware of the domain knowledge of your problem; it's not going to understand which features should be tweaked in what way to produce the highest possible score.
You could do this feature engineering yourself before pushing your data to an AutoML boxed solution. That's one way you could improve your scores. Or you could clean and pre-process the data yourself before pushing it into AutoML. There are multiple ways you could improve the performance of AutoML.
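The kind of domain-aware preparation DV describes can be sketched as a small preprocessing step run before handing the table to any AutoML tool. The data, feature names, and imputation default below are all hypothetical, for illustration only:

```python
# Hand-crafted feature engineering and cleaning done *before* AutoML:
# encode domain knowledge (weekend effects, revenue interaction) and
# handle missing values, so the AutoML tool receives a richer table.
from datetime import date

raw = [
    {"date": "2020-01-06", "price": "19.99", "units": 3},
    {"date": "2020-01-11", "price": "", "units": 5},  # missing price
]

def engineer(row, default_price=20.0):
    d = date.fromisoformat(row["date"])
    # Simple imputation for missing prices (a real pipeline would do better).
    price = float(row["price"]) if row["price"] else default_price
    return {
        "units": row["units"],
        "price": price,
        "is_weekend": d.weekday() >= 5,   # domain knowledge: weekend sales differ
        "revenue": price * row["units"],  # interaction feature AutoML may miss
    }

features = [engineer(r) for r in raw]
print(features)
# `features` is now ready to pass to an AutoML package's fit() call.
```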
SY: It sounds like AutoML can maybe get you to 80 to 90% of the level of an expert data scientist.
DV: Yes, and in 90% of cases that's more than enough. If you need to get a solution as fast as possible, 90% of the maximum accuracy is the best choice for you, because you can get that accuracy in a matter of a couple of hours with an AutoML solution.
SY: It sounds like AutoML is at a level where it could actually replace some smaller teams of data scientists and get reasonable accuracy when the application really only needs a certain level of performance.
DV: Well, the job of a data scientist is not only creating models. It's more about figuring out what problems need to be solved and how. When data scientists are working, we're not only creating the model: we're identifying the problem, we're generating ideas on how to solve it, we design an experiment, and then we create our models. That small part, creating the models, is actually the job of AutoML. Then we push it into production and check that everything is okay. As you can see, the pipeline of a data science project is pretty long, and the AutoML part is maybe 20% of the time.
SY: There's always going to be the communication of the story, and all the prep work that comes ahead of what the AutoML can actually do.
As we just talked about, people tend to think of AutoML as just the modeling aspect, but it really covers the entire process: pre-processing the data, feature engineering and selection, model selection and tuning, ensembling of models, and so on.
What parts of the data pipeline are best suited for automating? Where do you see the most value in terms of savings on human resources and computational resources relative to the performance of the end product? What aspects are best left to the data scientists to focus on?
DV: I think of those things you mentioned as the modelling part. Feature engineering, data pre-processing, and hyperparameter optimization are all part of modelling and can be automated with AutoML.
I think the most interesting part, the most important for me right now, is understanding why a model made a particular decision; that is, interpreting the model. The second area where AutoML could potentially help is finding biases in your data, because finding bias is really hard.
If we could automate this process, we could see, for example, that the predictions for one group of users differ from the predictions for another group. Let's say they differ by gender or age. If we could see that our model introduced such a bias into its predictions, we could retrain the model to ensure that our customers get the best results possible.
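The kind of automated check DV describes can be as simple as comparing a model's average prediction across groups. The group names, scores, and tolerance below are made up for illustration; real fairness auditing uses more careful metrics:

```python
# Minimal bias check: compare mean predicted score (e.g. approval
# probability) between two user groups and flag a large gap.
from collections import defaultdict

# Fake (group, predicted_probability) pairs standing in for model output.
predictions = [
    ("group_a", 0.81), ("group_a", 0.79), ("group_a", 0.80),
    ("group_b", 0.55), ("group_b", 0.61), ("group_b", 0.58),
]

sums = defaultdict(lambda: [0.0, 0])
for group, p in predictions:
    sums[group][0] += p
    sums[group][1] += 1

means = {g: total / n for g, (total, n) in sums.items()}
gap = abs(means["group_a"] - means["group_b"])
print(f"gap = {gap:.2f}")

THRESHOLD = 0.10  # illustrative tolerance, not a standard value
biased = gap > THRESHOLD  # if True, flag the model for retraining
```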
SY: Cool. One of the big concerns I've always heard from people who are thinking about machine learning but are a little hesitant is that it's a black box, and they need transparency into the model. But with new interpretability techniques we can gain that transparency, so we can actually automate those processes as well?
DV: Yes. In some open-source projects you can see examples of how to do this very easily. You don't need to calculate feature importances or track the different probability distributions across different groups of people yourself; you just need to write a couple of lines of code, and the rest is done by these automated solutions.
SY: What's your perception of the uptake of AutoML by industry? What do you think needs to happen for it to become more widespread?
DV: I think you will see more and more of these boxed solutions, as I mentioned before, for online stores, small online retailers, or maybe even in our day-to-day life. For example, let's say I want to solve the task of classifying cats versus dogs. I want to make an application for my phone where I could take a photo of some object and it will tell me if it's a cat or a dog. If I could automate the data collection, the data labeling, the model building, and the inference, that would be the best case for me.
For example, I take several photos, I upload them somewhere, and I say what I want to do; then it does the labeling, trains a model, and pushes the model to my phone. We can see potential uses of this pipeline everywhere, for example for shops to find damaged products, or for industry, where you have a lot of photos and you want to find damage in your process. If you could automate the whole process of data labeling, model creation, and delivery to the user, that would be really cool.
You can actually see that this is very similar to what consulting companies are doing. You could hire someone, a consultant or a freelancer, to perform one of these jobs; building a model this way is really similar.
SY: So in an ideal world we would have an end-to-end product where someone off the street with zero experience could just tell their phone what they want, and the end-to-end platform would take care of the entire thing for them?
DV: Yes, exactly. We're actually slowly moving in that direction, because if you check the state of AutoML a couple of years ago, AutoML could only produce a model. The model is just some file. What should I do with this file? I have no idea.
Currently we're at the point where AutoML produces the whole pipeline, so it's already pushed into production and you can run a command and get predictions. I believe that in, let's say, 10 years, we'll have something like: "Hey Siri, I want a dog-and-cat application, here is my data, do all the rest yourself." That will be the future, I think.
SY: There are techniques we can use for feature engineering, model selection, and tuning. How would the AutoML platform be able to select the appropriate data?
DV: Yes, this is actually a problem; there are a lot of topics that still need to be solved. But again, AutoML is not about the greatest score. If you need this dog-and-cat classifier to achieve as high an accuracy as possible, AutoML is not your solution, and you'll need to tweak a lot of parameters yourself. In some cases we don't need much training data.
For example, in computer vision and NLP tasks we can use some form of transfer learning. A model is trained on some huge corpus of labelled data; then, when we have new labelled data, we just retrain some part of that model to make a new one. This is possible. We could use a kind of model storage with different pre-trained models applicable to different tasks, for example medical imaging, image classification, sentiment analysis in NLP, and so on and so forth.
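The "retrain some part of a previous model" idea can be shown with a deliberately tiny stand-in: a frozen "base" that maps raw input to features, and a small head refit on new labelled data. Everything here is a toy (the base is a fixed function, not a real pretrained network), just to make the split between frozen and retrained parts concrete:

```python
# Toy transfer-learning sketch: freeze the feature extractor, retrain
# only a small linear "head" on the new task's labelled data.

def pretrained_base(x):
    """Frozen feature extractor (stand-in for a pretrained model's layers)."""
    return [x, x * x]  # maps raw input to a 2-feature representation

def fit_head(xs, ys):
    """Least-squares fit of a 2-weight linear head on frozen base features,
    solving the 2x2 normal equations directly."""
    feats = [pretrained_base(x) for x in xs]
    a = sum(f[0] * f[0] for f in feats)
    b = sum(f[0] * f[1] for f in feats)
    c = sum(f[1] * f[1] for f in feats)
    p = sum(f[0] * t for f, t in zip(feats, ys))
    q = sum(f[1] * t for f, t in zip(feats, ys))
    det = a * c - b * b
    w = [(p * c - q * b) / det, (q * a - p * b) / det]
    return lambda x: sum(wi * fi for wi, fi in zip(w, pretrained_base(x)))

# Small labelled set for the new task: y = 3x + x^2.
head = fit_head([1, 2, 3], [4, 10, 18])
print(head(4))  # only the head was fit; the base never changed
```

In practice the base would be a large pretrained network with its weights frozen, and the head a small trainable layer, but the division of labor is the same.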
Again, AutoML is not about top scores. It's about achieving good enough accuracy in a very short period of time, so you don't actually need much data in this case. You just want to get your results really fast.
SY: I see. So with more and more labeled data, and the fact that we are now developing models to label the data for us, that would probably be an essential key to building that end-to-end platform.
DV: Yes, you correctly mention that today we have a sort of semi-supervised labeling, so the labeling process can be easier for humans. For example, if you are classifying cats and dogs, there are some very simple heuristics: a simple model can produce an initial probability for you, and the job of the person vetting the data is just to say yes, this is correct, or no, this is incorrect and the picture needs to be classified another way. This is one way to automate the process.
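The review loop DV describes can be sketched as a simple triage: a model proposes a label with a confidence, confident cases are auto-accepted, and only uncertain ones go to a human. The scores and cutoff below are invented for illustration:

```python
# Model-assisted labeling: auto-accept confident predictions, queue
# uncertain ones for human review.

def propose_label(prob_dog):
    """Pretend model output: probability that the image shows a dog."""
    return ("dog", prob_dog) if prob_dog >= 0.5 else ("cat", 1 - prob_dog)

AUTO_ACCEPT = 0.9  # illustrative confidence cutoff

images = {"img1": 0.97, "img2": 0.08, "img3": 0.55}  # fake model scores
auto_labeled, needs_review = {}, []
for name, prob_dog in images.items():
    label, confidence = propose_label(prob_dog)
    if confidence >= AUTO_ACCEPT:
        auto_labeled[name] = label   # human just spot-checks these
    else:
        needs_review.append(name)    # human labels these from scratch

print(auto_labeled, needs_review)
```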
SY: You placed first in the AutoSeries competition, which applied AutoML to time series data. How does applying AutoML to time series data differ from, say, a classification problem?
DV: The whole time series problem is very different from working with regular classification or regular tabular data. The difference is that you need very strong validation, because if you overfit when solving a time series problem, your scores will drop dramatically.
Also, feature engineering is really important. In common classification tasks, you could create a lot of new features, select the most important among them, and repeat this process several times. Unfortunately, that's not possible in time series problems, because in time series problems you usually don't have much data to validate on. If you perform this feature selection process on a small amount of data, it is easy to overfit and get very low scores.
So there are two main problems: first, your validation should be as strong as possible, and second, your feature engineering approach should be reliable, so you can create as many features as you want.
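One common way to build the "strong validation" DV emphasizes is an expanding-window split: the model always trains on the past and validates on the period immediately after, never on shuffled future data. This is a generic sketch of that scheme, not DV's actual competition code:

```python
# Expanding-window cross-validation for time series: each fold trains
# on all data up to a cutoff and validates on the next block.

def expanding_window_splits(n_samples, n_folds, min_train):
    """Yield (train_indices, valid_indices) pairs in time order."""
    fold_size = (n_samples - min_train) // n_folds
    for k in range(n_folds):
        train_end = min_train + k * fold_size
        valid_end = train_end + fold_size
        yield list(range(train_end)), list(range(train_end, valid_end))

splits = list(expanding_window_splits(n_samples=10, n_folds=3, min_train=4))
for train, valid in splits:
    # Every validation index comes strictly after every training index.
    print(f"train={train} valid={valid}")
```

Because each fold's validation block lies entirely in the "future" relative to its training data, features or models that merely memorize the past get penalized, which is exactly the overfitting failure mode DV warns about.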
SY: What are some new developments in AutoML, or data science and machine learning in general, that you're most excited about? Where do you think we could focus more?
DV: I'm really excited about the progress in NLP. Maybe two or three years ago everyone was using word2vec and TF-IDF or some other way of transforming words into a numerical representation. Suddenly, in a matter of a couple of months, people turned completely to the Transformer architecture. I remember the time when we said you couldn't use the Transformer architecture, it was just not possible, it's a huge model. Suddenly everyone's talking about it and everyone uses it in production.
The whole concept of the Transformer architecture is quite cool, actually, because it allows you to get features, a representation, of your sequence data. This is really cool, and it's applicable in many domains. I believe the Transformer architecture will show great results in many domains where we deal with sequence data, say time series.
As an example, recently on Kaggle we had a competition about American football, predicting who would win. If you think about it for a moment, only a small fraction of the competition's data actually has a sequential, time series nature, but the Transformer showed the best scores among the models. It outperformed other models by a large margin because it is capable of finding the features that really matter for decision-making.
Another example is predicting the properties of molecules. Molecules have connections within them, and we could create features ourselves based on some domain knowledge of these properties and maybe the geometry of the molecule, but you could just feed the data to a Transformer and it will produce meaningful results, which is quite amazing.
SY: That sounds really cool. And as you said, even though it's a development that may have started in natural language processing, the technique can carry over to other sequential data, such as time series data.
DV: This is actually quite common in the field of data science. For example, convolutional neural networks were first applied to computer vision tasks and then carried over to NLP and time series problems. Now we have the Transformer architecture, and we take the idea and try to apply it to different tasks, for example time series prediction.
SY: Very cool. That's actually all the questions that I have. Is there anything that you'd like to add?
DV: No, I think we're cool.
SY: Great. A big thank you for taking the time to answer my questions.