What is Machine Learning and when should you use it?
In the last few years, artificial intelligence in general, and machine learning specifically, have become important topics in software development. At Rocketmakers we often see project proposals from clients that include machine learning, but after investigation we frequently advise against using it. While machine learning is grabbing all the technology headlines at the moment, it is overkill for many projects. Often there are simpler solutions we can build that solve the problem more quickly and more reliably. In this blog we want to explain a bit about machine learning, and how and when we use it.
What is machine learning?
Machine learning isn’t actually a specific technology; it describes a way of designing an algorithm that teaches itself how to do its job.
There are a few implications to be aware of:
Firstly, the algorithm learns over time: the longer you run it, the more it learns. Secondly, there are two stages to any machine learning algorithm: the first trains a model on everything it knows about the world, and the second interrogates that model with data from the real world, to see what the model thinks will happen next.
Thirdly, this means a machine learning algorithm is never completely accurate - anything it produces is its best prediction according to the model. If the model is very good and closely matches the real world, the algorithm can generate a prediction that closely matches the real world. Sometimes a close match is good enough, but that depends on the domain. For example, if an algorithm for identifying songs tells you the wrong thing you probably don't mind, but you want a self-driving car to be much more accurate! Often more training time can help, and usually more data will help, but sometimes the fundamental approach is wrong, and to get more accurate predictions you need to build a new algorithm and start again from scratch.
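To make the two stages concrete, here is a deliberately tiny sketch: a toy 1-nearest-neighbour classifier, where "training" just stores labelled examples and "prediction" queries the stored model with a new value. The data and function names are invented for illustration; a real project would use a library such as scikit-learn.

```python
# Stage 1: train a model on known examples.
# Stage 2: interrogate that model with new data.
# This is a toy 1-nearest-neighbour classifier, for illustration only.

def train(examples):
    """'Training' here just means remembering the labelled examples."""
    return list(examples)  # the stored examples *are* the model

def predict(model, x):
    """Query the model: return the label of the closest known example."""
    closest = min(model, key=lambda example: abs(example[0] - x))
    return closest[1]

# Tiny labelled dataset of (value, label) pairs
model = train([(1.0, "small"), (2.0, "small"), (9.0, "large"), (10.0, "large")])

print(predict(model, 1.5))  # close to the "small" examples
print(predict(model, 8.0))  # close to the "large" examples
```

Even this toy version shows the point about accuracy: the prediction is only as good as the examples the model has seen.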
What is the Rocketmakers approach?
Rocketmakers takes a top-down view, starting from the initial idea. It is important at this stage to establish whether the idea has the depth of data to warrant a machine learning approach, and if not, to consider a traditional algorithmic approach. This is a fundamental part of the process, ensuring we avoid scenarios such as undertrained or overfitted models, which lead to inaccurate and unreliable outputs.
So how do we use it?
Give the model a set of inputs and it predicts outputs
We make use of machine learning by first training our model, then passing input data to the trained model to generate predictions. These will generally be numeric outputs - for example, the likelihood that a provided image contains a cat, or the odds of a user wanting to purchase a specific item based on their past behaviour.
Note that these are predictions, not clear answers. Machine learning models will rarely offer absolute yes or no responses. Several kinds of problems have made good use of machine learning. The classic example is image recognition, but there are many others that machine learning can solve effectively. Time series data is one: if you have a system collecting a large amount of data, machine learning solutions can offer a glimpse into both past and future behaviours - clustering existing data to identify patterns and training models to predict future ones. As another example, machine learning models can be extremely effective at predictive text. The suggestions you get as you type a message on your phone come from feeding what you have typed so far into a model trained to predict which words are likely to come next.
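As a toy illustration of the predictive text example, here is a minimal bigram model: it counts which word follows which in a tiny invented corpus, then predicts the most likely next word along with an estimated probability - a prediction, not a definitive answer. Real keyboards use far more sophisticated models; the corpus and function names here are ours.

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count, for each word, how often each following word appears."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts

def predict_next(model, word):
    """Return the most likely next word and its estimated probability."""
    followers = model[word.lower()]
    if not followers:
        return None, 0.0
    best, count = followers.most_common(1)[0]
    return best, count / sum(followers.values())

corpus = [
    "machine learning is fun",
    "machine learning is hard",
    "machine learning needs data",
]
model = train_bigrams(corpus)
print(predict_next(model, "learning"))  # ('is', 0.666...)
```

Note that the model answers with a probability (here, "is" follows "learning" two times out of three), never a flat yes or no.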
The training is only as good as the inputs
In our experience, an equally complex challenge is data preparation, which starts with properly understanding the problem being solved. An understanding of the datasets is required in order to select the attributes and values to be ingested. In addition to identifying the appropriate datasets, a number of data cleansing and enrichment processes are often required before any processing by an ML engine. This is often one of the most fundamental steps in achieving the desired outcome.
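A minimal sketch of the kind of cleansing step described above, with hypothetical field names: incomplete records are dropped and a numeric field is normalised to a common scale before any training takes place.

```python
def clean(records):
    """Drop incomplete records and normalise 'price' to the 0-1 range."""
    complete = [r for r in records if r.get("price") is not None]
    prices = [r["price"] for r in complete]
    low, high = min(prices), max(prices)
    for r in complete:
        r["price_scaled"] = (r["price"] - low) / (high - low)
    return complete

raw = [
    {"town": "Bath", "price": 300_000},
    {"town": "Bristol", "price": None},  # missing value - dropped
    {"town": "Frome", "price": 200_000},
]
cleaned = clean(raw)
print(len(cleaned))                # 2
print(cleaned[0]["price_scaled"])  # 1.0
```

Real pipelines involve far more than this - deduplication, outlier handling, enrichment from other sources - but the principle is the same: the model never sees the raw data.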
A core element of any successful ML processing engine is the amount of data available. The majority of ML engines benefit significantly from a larger historical dataset. A good example is predicting house prices in a given area: an ML algorithm is much more likely to yield an accurate result from 10 years of historical data than from the past 6 months. Of course, with all data inputs there is a sweet spot for both attribute inclusion and the amount of historical data.
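The house-price example can be sketched with a simple least-squares trend line; with more years of history, the fitted trend is less sensitive to short-term noise. The figures below are invented, and a real model would use many more attributes than just the year.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# Ten years of (year, average price in £k) observations - invented figures
years = list(range(2014, 2024))
prices = [200 + 5 * (y - 2014) for y in years]  # steady upward trend

slope, intercept = fit_line(years, prices)
print(round(slope, 2))  # roughly +5k per year
```

A fit over ten points like this is far more stable than one over a handful of recent months, which is exactly the sweet-spot trade-off described above.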
What do good ML problems look like?
Problems which are good applications for machine learning techniques will generally involve a dataset which is large enough to provide reasonable training and test data. For example, let's say we have a giant database of images and we want to know how often these images contain a picture of a car. In this case we could split the dataset in two:
- Some of the data is used for training a model, showing it images that we know do or do not contain cars to help it "understand" the differences it needs to look out for
- A subset of the data is excluded from the training, so that we can test the model later and see if it can successfully identify the images containing cars without being explicitly told
There needs to be a lot of data to do this, though what "a lot" means will differ depending on the problem we're trying to solve and the approaches we intend to take.
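The split described above can be sketched as follows: shuffle the dataset reproducibly, hold back a portion for testing, and train only on the remainder. The 80/20 ratio used here is a common convention, not a fixed rule.

```python
import random

def train_test_split(data, test_fraction=0.2, seed=42):
    """Shuffle reproducibly, then hold back a fraction for testing."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)      # seeded, so repeatable
    split = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:split], shuffled[split:]  # (training set, test set)

images = [f"image_{i}.jpg" for i in range(100)]  # stand-in for real images
train_set, test_set = train_test_split(images)
print(len(train_set), len(test_set))  # 80 20
```

The important properties are that the two sets never overlap, and that the test set is kept away from training entirely - otherwise the model is being marked on questions it has already seen.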
Once we have enough data to train and test, there are still some things we need to take into account. We might find that when testing our model it consistently fails to identify some cars in the test data. It may be that it struggles to spot green cars, or is easily confused by trucks, vans or bicycles. This highlights the importance of ensuring that our training data reflects the range of what might happen, and that we're clear on what it is we're trying to understand. The dataset we're working on should be reasonably representative of the scope of what can happen, lest we end up with a model that thinks dogs are cars because a lot of the images happen to come from camera footage of a road alongside a local dog-walking path. That's a fairly overt error, but more subtle cases depend on context: counting bicycles might be invalid for an application attempting to track vehicle usage along a road for pollution estimation, but perfectly valid for another application using that data to estimate road usage throughout the day.
Even with the sort of near-perfect data that could produce an extremely accurate model, most machine learning approaches are not flawless. It's necessary to experiment with different ones, trying new approaches with the data and performing A/B comparisons to see which perform better. Problems with a clear right or wrong answer, where it's absolutely critical to consistently arrive at the correct conclusion, can often in theory be solved with machine learning, but at the cost of tremendous amounts of time and effort spent experimenting with and improving models to the point of near-infallibility.
Even then, should the shape of the data change we might find that an almost-perfect model becomes worse over time. To go back to the previous example, what happens when a new car comes out that's styled significantly differently to those upon which our model was trained? What happens when our pollution tracking application correctly identifies the growing number of electric vehicles on the road as cars, but then the assertions being made based on those identifications aren't taking into account the lack of emissions from these cars?
Machine learning applications tend to be best for solving problems that have some flexibility to their success rate. Correctly identifying that a car is present in a provided image 80% of the time is probably enough for a sampling-based estimation in most use cases, and even if we have a trained model that's achieving that level of success we can experiment and adapt to improve the model and increase that percentage further. This would obviously be a terrible success rate for some problems that would traditionally be tackled in computing, but many of these approaches allow us to emulate human-like (or better) reasoning on a scale at which actual humans cannot reasonably operate, without much of the gut-feel or bias which could lead to incorrect assertions being made.
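One way to quantify a success rate like the 80% figure above is to compare the model's predictions on held-back test data with the known labels. The predictions here are hard-coded purely for illustration; in practice they would come from a trained model.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the known labels."""
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)

labels      = ["car", "car", "not car", "car", "not car"]
predictions = ["car", "not car", "not car", "car", "not car"]

print(accuracy(predictions, labels))  # 0.8
```

Whether 0.8 is good enough is exactly the domain question discussed above: fine for a sampling-based estimate, unacceptable for a safety-critical system.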
Which tools should I use?
Cloud platforms have been a key enabler in making AI capabilities available to a wider set of users. With cloud-based services such as AWS SageMaker, Google AI, IBM Watson and many more, access to off-the-shelf systems is readily available, and they are often advertised as up and running in just a few clicks. While basic data upload, attribute selection and processing is possible, yielding a high-quality outcome that can be used to make data-driven decisions is much harder.
So, which one do you choose? There is no simple answer to this; however, it's important to remember that each operates in a different way, with varying levels of functionality and, most importantly, varying levels of visibility into what are often black boxes.
At Rocketmakers, we consider all aspects of data processing: the optimal inputs, time to calculation, the amount of data and transformation needed, and alignment with budget expectations. Machine learning, when used in the right context, can become a valuable function within any business, enabling data-driven decision making. We help our clients understand where and when to utilise the power of machine learning on their data and, importantly, support them where an alternative method would be more suitable.
This blog was a collaboration between Greg Skinner, Matt Harris, Clare Henning-Marsh and Adam Clarke