Daily Reading 13
Linear Regressions
###
Can you explain the basic concept of linear regression and its purpose in the context of machine learning and data analysis?
Linear regression is using an easy to measure variable (called an explanatory variable) to assist in predicting a harder to directly measure, but intrinsically linked variable to the explanatory variable. (called the response variable.)
The simplest form of linear regression is when, as the the explanatory variable increases, the response variable increases in kind, at an unchanging rate. By getting numerous samples of measurement for the explanatory variable, and measuring the response variable under the same circumstances, educated guesses on what the response variable will be can be performed using only the explanatory value.
Having an accurate, and low margin of error method of determining a ballpark number for a highly difficult to measure variable is fundamentally useful, as if the method you’ve devised is accurate enough, you can get away with no longer measuring the difficult to determine variable directly at all.
Lets say you have a pile of quarters. You count how many you have. This measurement is extremely easy, and is just a matter of looking at them. But what if you were asked what the total weights of the different metal contents are in your pile of quarters? melting down, separating, and measuring the mass of the metals each time would be a lot of work! But if you did this process for exactly one quarter, you know the rest of them (probably) have the exact same ratios of metal. If you know what metals are in a single quarter, then the metals in 10 quarters is just your sample measurement times 10, which requires zero melting of coins at all!
Describe the process of implementing a linear regression model using Python’s Scikit Learn library, including the necessary steps and functions.
Numerous python imports are required for scikit learn before it can be used.
imports must be done for:
pandas
numpy
matplotlib.pyplot
as well as specific modules of the overall scikit set.
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
With these loaded, you need access to a CSV or other importable pandas-compatable data set, and then you need to provide sklearn’s train test split with info on which part of the data is the features (information) and target (information you are attempting to guess based on features.)
What is the purpose of splitting the dataset into train and test sets, and how does this contribute to the evaluation of a machine learning model’s performance?
Train Test Split is a method of validating that your proposed model (simulation of a link between values) is accurate and robust for many situations.
The standard approach for doing this is to get 75% of your data, chosen randomly, and group it into a “training set.” The remaining data is put into another different set called the “test set.”
The training set is used for training your model on what correlations it is looking for, and so is used for exactly that. After the training is done, the test set is used to see if your model can predict with accuracy what the target will be off of patterns found in the training set.
If this separated train and test method is not used, the model runs the risk of overfitting, where the found patterns are specialized happenstance that relates to your dataset, and is unrelated to the broader trends that are being looked for, rendering the model unfit for use with any other data provided later.