Logistic Regression Basics
Logistic regression is a method for calculating probabilities for problems with a limited number of (often two) possible outcomes. Unlike linear regression, it guarantees the output can never be greater than 1 or less than 0.
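As a minimal sketch of how that bounding works (the weight and bias below are made-up numbers, not from a trained model), the sigmoid function squashes any linear score into the open interval (0, 1):

```python
import numpy as np

def sigmoid(z):
    # Squash any real-valued score into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# A hypothetical linear score w * x + b, then the logistic transform.
w, b = 2.0, -1.0
x = np.array([-3.0, 0.0, 0.5, 3.0])
print(sigmoid(w * x + b))  # every value lies strictly between 0 and 1
```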
Regularization is the set of steps we take to keep a model from overfitting the data. There are many ways to do this, but among the most important is penalizing model complexity. Generally: the simpler the model, the less likely we are to overfit.
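One common way to penalize complexity is L2 regularization, which adds the sum of the squared weights to the loss. A rough sketch (the lambda value here is an arbitrary choice for illustration):

```python
import numpy as np

def l2_regularized_loss(y_true, y_pred, weights, lam=0.1):
    # Mean squared error plus a penalty that grows as the weights grow.
    mse = np.mean((y_true - y_pred) ** 2)
    penalty = lam * np.sum(weights ** 2)
    return mse + penalty
```

The larger lam is, the harder the model is pushed toward small weights and a simpler fit.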
If loss on the validation data initially goes down but then starts to rise again after a certain number of iterations (while loss on the training data continues to fall), chances are we’re overfitting.
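A common response is early stopping: watch the validation loss and halt once it stops improving. A minimal sketch, where `train_step` and `validation_loss` are hypothetical callables standing in for your actual training loop and evaluation:

```python
def train_with_early_stopping(train_step, validation_loss, max_iterations=1000, patience=5):
    # Stop once validation loss has failed to improve for `patience` rounds in a row.
    best_loss = float("inf")
    bad_rounds = 0
    for _ in range(max_iterations):
        train_step()
        loss = validation_loss()
        if loss < best_loss:
            best_loss, bad_rounds = loss, 0
        else:
            bad_rounds += 1
            if bad_rounds >= patience:
                break  # validation loss is rising while training loss keeps falling
    return best_loss
```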
In linear regression problems, some data doesn’t lend itself to a direct linear solution. There are many ways to handle such data, but one particularly valuable one is feature crosses: creating new columns that combine existing features so the data becomes easier to separate with a linear model.
For example: suppose you have a data set where examples with negative x and y values OR positive x and y values correlate strongly with one label, while examples with one negative and one positive value correlate with the other label. By creating a feature cross x * y, every example with same-signed values becomes positive and every example with mixed signs becomes negative. Suddenly, a straight line becomes a feasible way of separating the labels, as in the sketch below.
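Here is that cross in pandas, using a tiny made-up data set:

```python
import pandas as pd

# Made-up examples: same-signed x and y belong to label 1, mixed signs to label 0.
df = pd.DataFrame({
    "x": [-2.0, -1.5, 1.0, 2.5, -1.0, 2.0],
    "y": [-1.0, -2.0, 2.0, 1.5, 2.0, -1.5],
    "label": [1, 1, 1, 1, 0, 0],
})

# The feature cross: positive when x and y share a sign, negative otherwise,
# so a single threshold at zero now separates the two labels.
df["x_times_y"] = df["x"] * df["y"]
print(df)
```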
When we receive data, how do we determine which of it is relevant and groom it into a form useful for building machine learning models? This challenge – often where the majority of a machine learning engineer’s time is spent – is called feature engineering.
If you separate data into training data and test data, you risk overfitting your results to the test data if you do multiple rounds of testing. How can you avoid that risk?
This is where validation data comes in. Validation data is a third set that sits between training and test data: you tune the model against it during development, and reserve the test set for a single, final check.
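One straightforward way to carve out the three sets, assuming your examples live in a single pandas data frame (the 70/15/15 proportions here are an arbitrary choice):

```python
import numpy as np
import pandas as pd

# Made-up data frame of 1,000 examples.
df = pd.DataFrame({"feature": np.random.rand(1000), "label": np.random.rand(1000)})

# Shuffle, then slice off roughly 70% training, 15% validation, 15% test.
shuffled = df.sample(frac=1.0, random_state=42).reset_index(drop=True)
train = shuffled.iloc[:700]
validation = shuffled.iloc[700:850]
test = shuffled.iloc[850:]
```

You iterate on the model with the training and validation sets, and touch the test set only once, at the very end.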
In machine learning, generalization refers to a model’s ability to accurately predict values for new data. Overfitting happens when you fit a model to the training data too tightly. Training sets are what you build a model against, whereas test sets are how you check the accuracy of that model (using data that never influenced the model).
Pandas is a Python library that unlocks “data frames” – a row/column style arrangement similar to spreadsheets – directly in Python. Much like Google Sheets or Microsoft Excel, a data frame has data cells, named columns, and numbered rows.
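A quick sketch of what that looks like (the values are made up for illustration):

```python
import pandas as pd

# Named columns, numbered rows, and data cells, much like a spreadsheet.
cities = pd.DataFrame({
    "city": ["A", "B", "C"],
    "population": [100_000, 250_000, 75_000],
})

print(cities)                      # rows are numbered 0, 1, 2
print(cities["population"].max())  # column operations, like a spreadsheet formula
```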
NumPy is a Python library for working with matrices. You can define matrices manually, from sequences, or via basic mathematical operations on existing matrices.
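For instance, each of those three approaches in a few lines:

```python
import numpy as np

manual = np.array([[1, 2, 3], [4, 5, 6]])   # defined by hand
sequence = np.arange(6).reshape(2, 3)       # built from a sequence, then reshaped
combined = manual + 2 * sequence            # element-wise math on existing matrices

print(combined)
```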
When dealing with massive amounts of data, it is often inefficient to try to compute model updates from the entire set. Instead, you want to work with a subset of that data at each step. Stochastic Gradient Descent takes a single example per iteration and computes the gradient from that one point.
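A rough sketch of that single-example update for a simple linear model with squared-error loss (the data and learning rate below are made up for illustration):

```python
import numpy as np

def sgd_step(w, b, x_i, y_i, learning_rate=0.1):
    # One stochastic step: the gradient comes from a single (x_i, y_i) example.
    error = (w * x_i + b) - y_i
    grad_w = 2 * error * x_i
    grad_b = 2 * error
    return w - learning_rate * grad_w, b - learning_rate * grad_b

# Made-up data roughly following y = 3x + 1.
rng = np.random.default_rng(0)
xs = rng.uniform(-1, 1, size=500)
ys = 3 * xs + 1 + rng.normal(scale=0.1, size=500)

w, b = 0.0, 0.0
for x_i, y_i in zip(xs, ys):
    w, b = sgd_step(w, b, x_i, y_i)
print(w, b)  # should end up close to 3 and 1
```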
In this article, we provide a basic overview of loss iteration, gradient descent, and learning rates. Concisely: loss iteration is looping over different weight values to find increasingly accurate ones for a model. Gradient descent is the idea that we can find better weights by following the gradient of the loss, stepping along it until we reach the “bottom” of a convex shape. The learning rate is the size of each of those steps as we move toward the best values for the model.
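To make that concrete, here is a minimal sketch of gradient descent on a single convex loss, L(x) = (x - 4)^2, whose gradient is 2 * (x - 4); the learning rate and iteration count are arbitrary choices:

```python
def gradient_descent(gradient, start, learning_rate, iterations):
    # Repeatedly step downhill along the negative gradient.
    x = start
    for _ in range(iterations):
        x = x - learning_rate * gradient(x)
    return x

# L(x) = (x - 4)^2 has its minimum at x = 4; its gradient is 2 * (x - 4).
minimum = gradient_descent(lambda x: 2 * (x - 4), start=0.0, learning_rate=0.1, iterations=100)
print(minimum)  # converges toward 4; too large a learning rate would overshoot or diverge
```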