Over the past two and a half years, I’ve had the opportunity to invest much of my time in data science, machine learning (ML), and artificial intelligence (AI). This blog post answers many of the questions I had before learning about the work, techniques, jargon, and tooling in ML and AI.
What Are AI and ML?
Many companies market their systems or services as “powered by AI” when that’s often not the case. Gimmicky marketing like this will always exist, so it helps to first understand what AI and ML actually are, and what the different terms mean, because there are also many genuine use cases of AI and ML in the world today.
Artificial Intelligence is a technique for building systems that mimic human behavior or decision-making.
Machine Learning is a subset of AI that uses data to solve tasks. These solvers are models trained on data: they learn from the information provided to them, using techniques grounded in probability theory and linear algebra. ML algorithms use our data to learn and automatically solve predictive tasks.
Deep Learning is a subset of machine learning which relies on multilayered neural networks to solve these tasks.
Forms of Machine Learning
Given that machine learning is a fundamental basis for AI, it’s worthwhile to understand the different forms of machine learning.
There are three kinds of machine learning: supervised, unsupervised, and reinforcement learning. Each form solves problems differently.
Supervised Machine Learning
In supervised machine learning, we know about the data and the problem. Think of it as, “given a set of features x, we know the value of y,” and so in supervised learning, we create a function that approximates results based on some set of data.
There are two kinds of supervised learning: classification and regression. In a classification problem, we assign data to categories. For example, given a client’s medical information, a model predicts whether they test positive or negative for diabetes. In classification, our trained models, known as classifiers, assign data points to different groups.
If we instead wanted to solve a different problem, like predicting the future value of GameStop stock given the stock market’s history, we’d turn to regression. Regression models return numerical values; for example, given some sentences, a regression model might return the percent likelihood that the writer is happy or sad.
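To make the regression idea concrete, here is a minimal sketch using scikit-learn’s LinearRegression on made-up data (the numbers are purely illustrative, not real stock prices):

```python
# A minimal regression sketch: fit a line to toy data with scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression

# Features: time steps; target: a fictional, perfectly linear price series.
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

model = LinearRegression()
model.fit(X, y)  # learn a function approximating y from X

# Predict the value at the next time step; the model extrapolates the trend.
prediction = model.predict([[6]])
print(prediction[0])  # → 60.0
```

Real regression problems use many features and noisy targets, but the workflow (fit, then predict a number) is the same.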
Unsupervised Machine Learning
In unsupervised machine learning, our data is unlabelled. There are two forms of unsupervised machine learning: clustering and dimension reduction.
In clustering, we learn more about data points as they are clustered, or grouped together. This allows learned models to understand a data set, detect anomalies, and assign relationships between points, often allowing users to develop new categories or features about the data set.
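A quick hedged sketch of clustering with k-means: the points below are made up so that two groups are obvious, and the algorithm recovers them without any labels.

```python
# A minimal clustering sketch: group unlabelled 2-D points with k-means.
import numpy as np
from sklearn.cluster import KMeans

# Two visually obvious groups of points (toy data for illustration).
points = np.array([[1, 1], [1, 2], [2, 1],
                   [8, 8], [8, 9], [9, 8]])

# Ask k-means to find two clusters; no labels are provided.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
labels = kmeans.labels_  # the cluster assigned to each point
```

The first three points land in one cluster and the last three in the other, even though we never told the algorithm which group was which.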
In dimension reduction, we plot data points across different dimensions and feature sets to better understand our data. This enables techniques like feature selection and feature transformation. Dimension reduction mitigates the curse of dimensionality: the more features a data set has, the more data is needed, and processing many noisy features can hurt an ML model’s performance. For this reason, unsupervised techniques are often paired with supervised or reinforcement learning algorithms.
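As a small illustration of dimension reduction, here is a sketch using principal component analysis (PCA). The synthetic 3-D data below really varies along only one direction, so a single component captures almost all of it:

```python
# A minimal dimension-reduction sketch with PCA.
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 3-D points that in fact vary along a single direction,
# plus a little noise.
rng = np.random.default_rng(0)
t = rng.normal(size=(100, 1))
X = np.hstack([t, 2 * t, -t]) + rng.normal(scale=0.01, size=(100, 3))

# Project the 3-D data down to 1 dimension.
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                    # (100, 1)
print(pca.explained_variance_ratio_[0])   # close to 1.0
```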
Reinforcement Learning
In reinforcement learning (RL), a model learns over time by interacting with its environment and receiving feedback. A common technique is to combine deep learning with reinforcement learning to derive relationships between features of a data set that might not otherwise be found through human research. Deep RL has been very successful in the field of medicine as of late.
The Use Cases for AI
From transportation and medicine to natural language processing and computer vision, ML and AI have made a global market impact. Some analysts even predict that, in the future, deep learning will produce more market value than the internet.
That said, there are also many applications for AI in the field of software development and IT operations, including AI-driven operationalization, next-level delivery insights, and AI-augmented development. These are certainly relevant in the age of SaaS and cloud providers. The image below shows the different roles associated with AI and ML use cases. A solution can be a data source, an AI/ML learner, a decision-maker, or a combination of any of the three.
DevOps for AI and ML
It’s important to note the relationship between AI and DevOps flows both ways. AI and ML not only affect DevOps, but the same is true the other way around. MLOps strives to make the delivery of ML models safe, repeatable, and quick. Kubeflow is one example of a solution that is bringing ML and AI solutions to market with excellence expected from DevOps practices, principles, and culture.
To understand a portion of this pipeline better, let’s take a peek at a simple classification problem, leveraging the SKLearn library in Python.
Preparing Data and Training Models in Python
Some fields of information may be missing or inaccurate due to the data collection process or the instruments used. Data lakes, data repositories, and databases are all common ways of storing data. So data scientists, sometimes alongside domain experts, extract that data and preprocess it for an ML model or algorithm.
In this example, we are reading in some data from a CSV file and labeling the features using the pandas and NumPy libraries. This data set is a popular diabetes data set that contains diabetes patient records obtained by researchers from Washington University.
https://gist.github.com/tiffanyjachja/8ce99dd0ef3772ab1736f8a2b32956fd
As shown in the code snippets, we are reading the data set and filtering out any missing values. This data set was obtained through an automatic electronic recording device, so some fields are missing. It’s often the case that the average value for a particular feature will replace any missing values. This was true for this data set as well, with features like blood pressure.
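The full snippet lives in the gist linked above; the mean-replacement idea can be sketched in a few lines with pandas. The tiny DataFrame and column names below are stand-ins for the real CSV, purely for illustration:

```python
# A minimal sketch of filling missing values with each column's mean.
import numpy as np
import pandas as pd

# A tiny stand-in for the diabetes CSV; column names are illustrative.
df = pd.DataFrame({
    "blood_pressure": [80.0, np.nan, 70.0, 90.0],
    "glucose": [120.0, 140.0, np.nan, 100.0],
})

# Replace each missing value with the mean of its column.
df = df.fillna(df.mean())

print(df)  # no NaN values remain
```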
Training an ML Model
After data is cleaned and ready to be processed, the entire data set is split into a training set and a testing set. Validation sets are used during the training process to ensure a model does not overfit on data. Overfitting causes issues like poor performance on data outside the training set.
This training set data is used in the learning process. For this classification problem, which predicts patients with diabetes, the learning process is based on function approximation, as shown in the following diagram.
Here is the code to split data and train a decision tree classifier.
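The original code is in the gist; a self-contained sketch of the same steps might look like the following. Since the diabetes CSV isn’t bundled here, this example uses a binary-classification data set that ships with scikit-learn as a stand-in:

```python
# Split data, train a decision tree classifier, and check its accuracy.
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data set: bundled binary-classification data, used here in
# place of the diabetes CSV from the gist.
X, y = load_breast_cancer(return_X_y=True)

# Hold out 20% of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit the decision tree on the training set only.
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Evaluate on the held-out test set.
accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"test accuracy: {accuracy:.2f}")
```

The same split/fit/score pattern applies unchanged to the diabetes data once it’s loaded into a DataFrame.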
Now that we have the trained model and the results, we can evaluate the performance and accuracy of the model. If things look good, this model can be exported or deployed for consumption!
This blog post showcased some simple use cases for AI and ML. There are many other relevant solutions, tools, and libraries in this space. I hope this gives you a good look into this topic so that you are able to learn more about AI and ML in greater depth.
Thanks for reading. This content is actually based on a talk I gave at a North America DevOps Group developer meetup in 2021. If you enjoy this kind of guide and would like more AI/ML content, please let me know! Cheers!