Mystical Data Science: A magic box or nothing special?

These days nobody is surprised by facial recognition systems or by artificially intelligent (AI) chess players winning against world champions. This magic happens thanks to Data Science and Machine Learning. But what do these terms mean? What types of algorithms stand behind this magic? What fascinating problems are they able to solve? In which areas can they be applied? This post discusses these questions and more.

What is Data Science about?

Of course, Data Science is about data. Some people confuse the terms "Data Science", "Big Data" and "Machine Learning". In fact, both Big Data and Machine Learning are parts of Data Science. Broadly, the main aim of Data Science is to extract meaning from data. Big Data is the part of Data Science that deals with huge volumes of data, while the task of Machine Learning is to learn from data. A great deal of a data scientist’s work is devoted to preparing, visualizing, cleaning, and analyzing data. Each of these steps is a separate branch of Data Science that requires different expertise. For instance, data analysis requires strong knowledge of mathematics, whereas data preparation is mostly a logical task.

Usually, the data is a big table with dozens of columns and hundreds of thousands of rows. Those rows and columns might contain different information. If we are talking about predicting future sales for a store, the table would hold information about sold products, time of purchase, prices and so on. If the task is to recognize faces in pictures, it would hold arrays of pixel values. The first step is always analyzing the data, and once the data is analyzed, data scientists start working their magic.

Machine Learning: how it works

Data analysis is indeed an interesting topic, but real miracles happen when machine learning comes into play. It's a magical tool that helps recognize faces, win a game of chess, or advise you on your future purchases.

There are three main paradigms of Machine Learning: supervised learning, unsupervised learning, and reinforcement learning. Let's talk about all of them.

Supervised learning

If there is a data set where the target value is known for each row of data, one can use "supervised learning" algorithms. Why is this type of Machine Learning called "supervised learning"? Because the correct answers are known: the algorithm iteratively makes predictions on the data and is corrected by a teacher/supervisor.

The main task of these algorithms is to determine how one variable depends on the rest of the parameters, so that the dependent variable can be predicted from the other features. Commonly in this kind of problem, one column contains the dependent variable, and the other columns contain independent variables. Furthermore, we assume that rows do not depend on one another: each row corresponds to a particular object (user, product, purchase, and so on) with its characteristics.

In general, the full process of supervised learning can be divided into several steps. The first step is splitting the data into training and testing parts. The second step is training (learning). During this part, the algorithm tries to find dependencies between the target value and the rest of the data. The third step is checking the accuracy of the model (supervision). The trained model predicts the dependent variable for the testing part, and its predictions are compared to the real target values, which are known in advance. Since it is possible to change some parameters of the learning process (the type of regularization, the number of independent features used, etc.), we can vary them and find the parameters that lead to predictions closest to the real values. After the model is trained, we can use it to make predictions for data where the target value is unknown.
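The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data is randomly generated toy data, the model is a simple k-nearest-neighbours classifier (chosen here for brevity, not named in this article), and the tunable parameter is the number of neighbours k. The function names are made up for this example.

```python
import math
import random
from collections import Counter


def make_toy_data(rng, n_per_class=40):
    """Made-up rows: two numeric features plus a known target (0 or 1)."""
    rows = []
    for label, center in ((0, 2.0), (1, 5.0)):
        for _ in range(n_per_class):
            features = (rng.gauss(center, 1.2), rng.gauss(center, 1.2))
            rows.append((features, label))
    return rows


def knn_predict(train_rows, features, k):
    """Predict the target as the majority label among the k nearest training rows."""
    nearest = sorted(train_rows, key=lambda row: math.dist(row[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train_rows, test_rows, k):
    """Share of test rows whose predicted target equals the known target."""
    hits = sum(knn_predict(train_rows, f, k) == label for f, label in test_rows)
    return hits / len(test_rows)


rng = random.Random(0)
rows = make_toy_data(rng)
rng.shuffle(rows)

# Step 1: split the data into training and testing parts (75% / 25%).
split = int(len(rows) * 0.75)
train_rows, test_rows = rows[:split], rows[split:]

# Steps 2 and 3: try several values of the parameter k and check each
# one against the known targets of the testing part.
scores = {k: accuracy(train_rows, test_rows, k) for k in (1, 3, 5, 7)}
best_k = max(scores, key=scores.get)
print("accuracy per k:", scores)
print("best k:", best_k)

# Finally, predict the target for a new row whose target is unknown.
print("prediction:", knn_predict(train_rows, (4.5, 4.8), best_k))
```

In practice, parameters are often tuned on a separate validation part, so that the testing part stays untouched until the final check.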

There are two typical kinds of problems that can be solved with the supervised learning paradigm: classification and regression. The names may sound intimidating, but in reality they are nothing special: both are simple and easy to understand.

A classification task is similar to the job of a bank clerk who decides whether to approve or reject your loan application based on the provided information and on previous experience. In the clerk's mind, people are classified by certain parameters, such as income level, marital status, age, etc. In this classification, for example, people whose income is low tend not to pay back on time. Computers can remember, analyze, and compute far better than humans, so it comes as no surprise that they can handle such tasks.

Regression is a bit trickier: now our bank clerk must decide not only whether to approve the loan, but also how much money the bank can lend. So the input stays the same, but the output is no longer a binary value (yes or no), as it was in the classification case; it is a number on a continuous scale.
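To make the difference concrete, here is a minimal sketch of the bank clerk's two jobs. Everything in it is an assumption made for illustration: the income and loan figures are invented, the classifier is a single income cut-off learned from past decisions, and the regression is an ordinary least-squares line. Real credit models use many more features.

```python
def fit_threshold(incomes, approved):
    """Classification: learn an income cut-off from past yes/no decisions.

    Assumes the toy data is separable, i.e. every approved applicant
    earns more than every rejected one.
    """
    highest_rejected = max(x for x, ok in zip(incomes, approved) if not ok)
    lowest_approved = min(x for x, ok in zip(incomes, approved) if ok)
    return (highest_rejected + lowest_approved) / 2


def fit_line(xs, ys):
    """Regression: ordinary least squares for y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return mean_y - slope * mean_x, slope


# Invented history: yearly income (thousands), past decision, past loan size (thousands).
incomes = [20, 30, 40, 50, 60]
approved = [False, False, True, True, True]
amounts = [5, 10, 15, 20, 25]

threshold = fit_threshold(incomes, approved)
intercept, slope = fit_line(incomes, amounts)

new_income = 45
decision = new_income > threshold               # classification: yes or no
loan_amount = intercept + slope * new_income    # regression: a number

print("approve:", decision)
print("suggested amount (thousands):", round(loan_amount, 2))
```

The input is the same income figure in both cases; only the kind of output changes.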

Unsupervised learning

In contrast to supervised learning, in unsupervised learning we do not have any dependent columns. However, the rows remain independent. As an example, we can take the history of purchases in a store. There is no "target value" or "dependent value" in a purchase. Nevertheless, this data can still be used. Assume your task is to make a product recommendation for a visitor, knowing their history of purchases.

Based on previous transactions, the algorithm determines the visitor’s preferences and finds users similar to them, creating clusters of customers, so that we end up with a set of clusters, each containing members with similar preferences. (This type of task is called clustering.) This way you can recommend a product that is popular among other members of the cluster the visitor belongs to. Another type of problem that can be solved with unsupervised learning algorithms is association. The program analyzes the sales history of all users and determines products that are commonly bought in the same transaction. In other words, the magic box finds product Y, which is usually bought together with product X. The association method is used to make recommendations as well.
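The association idea can be illustrated with a very small sketch that simply counts which products appear together in the same transaction. The baskets below are invented, and the `recommend` helper is a hypothetical name for this example. Real association-rule algorithms also weigh how common each product is on its own, which this sketch ignores.

```python
from collections import Counter
from itertools import combinations

# Invented sales history: each set is one transaction.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "jam"},
    {"cereal", "milk", "bread"},
]

# Count every pair of products bought in the same transaction.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1


def recommend(product, top_n=2):
    """Return the products most often bought together with `product`."""
    companions = Counter()
    for (first, second), count in pair_counts.items():
        if first == product:
            companions[second] += count
        elif second == product:
            companions[first] += count
    return [item for item, _ in companions.most_common(top_n)]


print(recommend("bread"))
print(recommend("cereal", top_n=1))
```

Note that there is no target column anywhere: the recommendations come purely from patterns in the transactions themselves.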

Reinforcement learning

Another Machine Learning paradigm is reinforcement learning. It is often used in games. In this case, the algorithm learns from its own experience.

Assume that your application’s task is to suggest the most attractive font size of a sign for a store. Since your application does not have any previous experience, the only choice is to try different options and compare the results. The application sets one font size, waits a week, and counts the number of people who visited the store during the week. Then it changes the size, waits another week, counts the visitors again, and so on until it finds a font size that works better. Reinforcement learning algorithms aim to reduce the number of iterations needed, so that you don’t have to spend ages experimenting.
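The font-size experiment can be sketched as follows. This is a simplified illustration, not a full reinforcement learning system: it uses an epsilon-greedy strategy (a common simple approach, picked here as an assumption), and the store is simulated with invented visitor numbers that the algorithm does not get to see directly.

```python
import random

# Invented average weekly visitors per font size; hidden from the algorithm.
TRUE_VISITORS = {12: 100, 18: 140, 24: 120}


def visitors_for_week(font_size, rng):
    """Simulated store: the average for this font size plus some noise."""
    return TRUE_VISITORS[font_size] + rng.uniform(-5, 5)


def choose_font_size(weeks=30, epsilon=0.2, seed=0):
    """Epsilon-greedy search; `weeks` must be at least the number of sizes."""
    rng = random.Random(seed)
    sizes = list(TRUE_VISITORS)
    totals = {size: 0.0 for size in sizes}
    tries = {size: 0 for size in sizes}

    for week in range(weeks):
        if week < len(sizes):
            size = sizes[week]          # try every option once first
        elif rng.random() < epsilon:
            size = rng.choice(sizes)    # explore: occasionally try a random size
        else:
            # exploit: keep the size with the best average so far
            size = max(sizes, key=lambda s: totals[s] / tries[s])
        totals[size] += visitors_for_week(size, rng)
        tries[size] += 1

    best = max(sizes, key=lambda s: totals[s] / tries[s])
    return best, tries


best_size, weeks_per_size = choose_font_size()
print("best font size:", best_size)
print("weeks spent on each size:", weeks_per_size)
```

Instead of giving every option the same number of weeks, the algorithm spends most of its time on the option that currently looks best and only occasionally re-checks the others.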

Of course, the harder the problem, the more iterations it takes to get a result. But the general idea is rather simple.

Nothing special, just data

Of course, there is a lot more to say about Data Science: more complicated yet interesting tasks, trickier algorithms. But for the basics of Data Science, this will be enough. If you feel like going deeper into the subject, you can read about neural networks and their implementations. You will find a lot of interesting topics there, but they are beyond the scope of this article.

Now you understand: whether they drive, predict, or recognize, they are not magicians but Data Science algorithms. All they need is data (the more the better) and a programmer's passion. There is no miracle, simply hard and diligent work.
