Hi, my name is Ajda. Welcome to the Introduction to Machine Learning series. In the following videos, I'll show you how to use data to discover interesting groups, build models, and make predictions. Although I'll be using Orange Data Mining software for demonstration purposes, these videos will focus primarily on machine learning concepts and techniques, rather than the toolbox itself.

Machine learning always starts with data, and that's the topic of the first video. While data in machine learning can come in many forms, the most common is a data table or spreadsheet. Let me show you a simple example of such a dataset. I will load data on 16 students and their grades in seven subjects. Each row represents one student, and the columns show their grades in each subject. We can also see that the grades are expressed as numbers. For example, Jena scored 39 in English and 18 in French, and performed brilliantly in Algebra, scoring 99.

In machine learning, rows are called data instances and represent our objects of interest. In this case, the objects of interest are students profiled with their scores in seven subjects. We also refer to data instances as examples or cases, while the columns are called data attributes or features. In addition to the input variables representing the students' scores, our dataset contains a meta-feature with the student names. In this dataset, we might want to identify groups of students with similar academic patterns, such as those who excel in math but need help with languages.

Now, let's switch to a socio-economic dataset from the Human Development Index database. This dataset is larger and contains 188 countries, each described with 50 variables. These variables represent various socio-economic factors, such as life expectancy, average years of schooling, and national income. I wonder which country has the longest schooling. It's Switzerland, followed by the United Kingdom.
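The table vocabulary introduced with the student grades (data instances, features, meta-features) and the column sorting used on the country data can be sketched in a few lines of plain Python. This is only an illustration, not how Orange stores data: Jena's three scores come from the video, while the other two students, the variable names, and the `sort_by` helper are invented for this example.

```python
# A tiny data table: each dict is one data instance (a row).
# Jena's scores are from the video; the other rows are made up for illustration.
rows = [
    {"name": "Jena", "English": 39, "French": 18, "Algebra": 99},
    {"name": "Student B", "English": 85, "French": 90, "Algebra": 40},  # hypothetical
    {"name": "Student C", "English": 45, "French": 30, "Algebra": 92},  # hypothetical
]

# Meta-features describe an instance but are not used as input variables.
meta = ["name"]

# Everything else is a feature (also called an attribute).
features = [column for column in rows[0] if column not in meta]


def sort_by(table, feature, descending=True):
    """Return the instances ordered by one feature, like sorting a spreadsheet column."""
    return sorted(table, key=lambda row: row[feature], reverse=descending)


best_in_algebra = sort_by(rows, "Algebra")[0]
print("Features:", features)
print("Highest Algebra score:", best_in_algebra["name"])
```

Keeping the name in a separate `meta` list mirrors the idea from the video: the name identifies a student, but it is not something we would use to compare students or group them.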
And on average, people live the longest in Hong Kong, Japan, and Italy. I can uncover interesting insights by simply sorting and analyzing the countries based on these variables. Note that this dataset contains three meta-features: the name of the country and its geographical position. In our socio-economic dataset, we might want to find groups of countries with similar characteristics and then see if these groups are geographically related. Of course, we would need to show our results on a world map.

You may have noticed that, until now, our datasets contained only numerical variables. But that is not always the case. Consider a dataset of over 1,400 employees. They are described with 32 attributes, including their age, travel frequency, daily earnings, and their work department. This dataset contains attributes of mixed types: age and daily earnings are numeric, while travel frequency and department are categorical. Notice the special gray feature, "Attrition," which indicates whether an employee has left the company. This feature is called a class, and the main task with such datasets is to build models that predict the class value based on the other attributes. For example, is a young employee with a low income more likely to leave the company than an older employee who travels a lot?

Machine learning works well with tabular data, but that shouldn't limit us. In recent years, artificial intelligence has advanced to the point where we can easily work with datasets of images or text. For example, consider a dataset of images of dogs. As before, we might want to find groups within our data (in this case, groups of dogs) or build a predictive model that can take an image of a dog and identify the dog's breed. Or consider a dataset of daily news articles, where we'd like to find all articles related to "artificial intelligence" or those that cover "cycling races." We'll learn that machine learning approaches to images, text, and other unstructured datasets
often start by representing them with numbers and converting them into data tables. This numerical representation is typically produced by pre-trained foundation models, the kind of models at the heart of current AI. Datasets come in all shapes, types, and sizes, and modern machine learning approaches can handle them all. Interestingly, regardless of the dataset, we perform the same common tasks on it. These tasks include finding similar data items, identifying groups of data instances, and building predictive models. We'll explore these tasks in our upcoming videos.
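To make the idea of "representing text with numbers and converting it into a data table" concrete, here is a minimal sketch in plain Python. Note that it deliberately swaps in a much simpler technique than the one mentioned in the video: instead of a pre-trained foundation model, it just counts words (a bag-of-words representation). The headlines and the helper names `to_row`, `cosine`, and `most_similar` are invented for this example.

```python
import math
from collections import Counter

# Hypothetical news headlines, invented for illustration.
articles = [
    "artificial intelligence beats champion",
    "cycling races start in spring",
    "new artificial intelligence model released",
]

# One column (feature) per distinct word across all articles.
vocabulary = sorted({word for text in articles for word in text.split()})


def to_row(text):
    """Turn one text into a row of numbers: how often each vocabulary word occurs."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]


# The unstructured texts are now an ordinary numeric data table.
table = [to_row(text) for text in articles]


def cosine(a, b):
    """Cosine similarity between two rows; 0.0 means no words in common."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(index):
    """Index of the article whose row is most similar to the given one."""
    others = [i for i in range(len(table)) if i != index]
    return max(others, key=lambda i: cosine(table[index], table[i]))


print("Most similar to article 0:", articles[most_similar(0)])
```

Once the texts are rows of numbers, "finding similar data items" becomes a comparison between rows. A foundation model would produce far richer numbers than word counts, but the resulting table is used in the same way.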