Lecture 10 - 11 #
Experimental Data #
- These lectures are talking about the interplay between statistics and experimental science. And how to get a valid statistical conclusion.
The Behavior of Springs #
Hooke’s law of elasticity:
F = -kx
- In other words, the force,
F
, stored in a spring is linearly related to the distance,x
, the spring has been compressed(or stretched). k
is called the spring constant.- All springs have an elastic limit, beyond which the law fails.
- In other words, the force,
For example, How much does a rider have to weigh to compress the spring on motorbike 1cm? ($k \approx 35,000 N/m$)
- $F = 0.01m * 35,000 N/m$
- $F = 350 N$
- $F = mass * acc$, $F = mass * 9.81m/s^2$
- $mass * 9.81m/s^2 = 350N$
- $mass \approx 35.68kg$
Finding
k
We can’t conduct the experiment perfectly, so, to hang a series of increasingly heavier weights on the spring, measure the stretch of the spring each time, and plot the results will be a better way.
First we should know:
F=-kx
k=-F/x
k=9.81*m/x
Then, we got some data
0.0865 0.1 0.1015 0.15 ... 0.4263 0.65 0.4562 0.7
Plot the data, we got:
- It has some measurement errors.
Next step is to fit curves to data.
When we fit a curve to a set of data, we are finding a fit that relates an independent variable (the mass) to an estimated value of a dependent variable (the distance)
To fit curves to data, we need to define an objective function that provides a quantitative assessment of how well the curve fits the data. In this case, a straight line, that is linear function.
Then, finding the best fit is an optimization problem. most commonly used objective function is called least squares:
- $\displaystyle\sum_{i=0}^{len(observed)-1}(observed[i] - predicted[i])^2$
Next is to use a successive approximation algorithm to find the best least-squares fit. PyLab provides a built-in function, polyfit:
pylab.polyfit(observedXVals, observedYVals, n)
- this function finds the coefficients of a polynomial of degree
n
that provides a best least-squares fit for the set of points defined by the arrays observedXVals and observedYVals. - The algorithm used by polyfit is called linear regression.
- this function finds the coefficients of a polynomial of degree
Visualizing the Fit
def getData(fileName): dataFile = open(fileName, 'r') distances = [] masses = [] dataFile.readline() #discard header for line in dataFile: d, m = line.split() distances.append(float(d)) masses.append(float(m)) dataFile.close() return (masses, distances) def labelPlot(): pylab.title('Measured Displacement of Spring') pylab.xlabel('|Force| (Newtons)') pylab.ylabel('Distance (meters)') def plotData(fileName): xVals, yVals = getData(fileName) xVals = pylab.array(xVals) yVals = pylab.array(yVals) xVals = xVals*9.81 #acc. due to gravity pylab.plot(xVals, yVals, 'bo', label = 'Measured displacements') labelPlot() def fitData(fileName): xVals, yVals = getData(fileName) xVals = pylab.array(xVals) yVals = pylab.array(yVals) xVals = xVals*9.81 #get force pylab.plot(xVals, yVals, 'bo', label = 'Measured points') labelPlot() a,b = pylab.polyfit(xVals, yVals, 1) # estYVals = a*xVals + b estYVals = pylab.polyval(model, xVals) print('a =', a, 'b =', b) pylab.plot(xVals, estYVals, 'r', label = 'Linear fit, k = ' + str(round(1/a, 5))) pylab.legend(loc = 'best') fitData('springData.txt')
k = 21.53686
Another Experiment #
to fit curves to these mystery data:
how good are these fits
we can see that quadratic model is better than linear model, but how bad a fit is the line and how good is the quadratic fit?
comparing mean squared error
def aveMeanSquareError(data, predicted): error = 0.0 for i in range(len(data)): error += (data[i] - predicted[i])**2 return error/len(data)
- then we got:
- Ave. mean square error for linear model = 9372.73078965
- Ave. mean square error for quadratic model = 1524.02044718
- Seems like, quadratic model is better than linear model. But we still have to ask, is the quadratic fit good in an absolute sense?
- The mean square error is useful for comparing two different models of the same data, but it’s not actually very useful for getting the absolute goodness of fit.
- then we got:
Coefficient of Determination ( R^2 ) #
$R^{2}=1-\frac{\sum_{i}(y_i-p_i)^2}{\sum_{i}(y_i-\mu_i)^2}$
- By comparing the estimation errors (the numerator) with the variability of the original values (the denominator), R^2 is intended to capture the proportion of variability in a data set that is accounted for by the statistical model provided by the fit
- Always between 0 and 1 when fit generated by a linear regression and tested on training data
- If R^2 = 1, the model explains all of the variability in the data. If R^2 = 0, there is no relationship between the values predicted by the model and the actual data.
def rSquared(observed, predicted): error = ((predicted - observed)**2).sum() meanError = error/len(observed) return 1 - (meanError/numpy.var(observed))
Testing Goodness of Fits
def genFits(xVals, yVals, degrees): models = [] for d in degrees: model = pylab.polyfit(xVals, yVals, d) models.append(model) return models def testFits(models, degrees, xVals, yVals, title): pylab.plot(xVals, yVals, 'o', label = 'Data') for i in range(len(models)): estYVals = pylab.polyval(models[i], xVals) error = rSquared(yVals, estYVals) pylab.plot(xVals, estYVals, label = 'Fit of degree '\ + str(degrees[i])\ + ', R2 = ' + str(round(error, 5))) pylab.legend(loc = 'best') pylab.title(title) xVals, yVals = getData('mysteryData.txt') degrees = (1, 2) models = genFits(xVals, yVals, degrees) testFits(models, degrees, xVals, yVals, 'Mystery Data')
- Quadratic model get 84%, and Linear get almost 0%.
- Since the degree of polynomial can affect the Goodness of Fits, what if we use some bigger ones? Can we get a tighter Fit?
A Tighter Fit May Not Be We Need, So How to Choose? #
Why We Build Models
- Help us understand the process that generated the data
- Help us make predictions about out-of-sample data
- A good model helps us do these things
How Mystery Data Was Generated
def genNoisyParabolicData(a, b, c, xVals, fName): yVals = [] for x in xVals: theoreticalVal = a*x**2 + b*x + c yVals.append(theoreticalVal + random.gauss(0, 35)) f = open(fName,'w') f.write('y x\n') for i in range(len(yVals)): f.write(str(yVals[i]) + ' ' + str(xVals[i]) + '\n') f.close()
Let’s Look at Two Data Sets
#parameters for generating data xVals = range(-10, 11, 1) a, b, c = 3, 0, 0 degrees = (2, 4, 8, 16) #generate data random.seed(0) genNoisyParabolicData(a, b, c, xVals, 'Dataset 1.txt') genNoisyParabolicData(a, b, c, xVals, 'Dataset 2.txt') xVals1, yVals1 = getData('Dataset 1.txt') models1 = genFits(xVals1, yVals1, degrees) testFits(models1, degrees, xVals1, yVals1, 'DataSet 1.txt') pylab.figure() xVals2, yVals2 = getData('Dataset 2.txt') models2 = genFits(xVals2, yVals2, degrees) testFits(models2, degrees, xVals2, yVals2, 'DataSet 2.txt')
Hence Degree 16 Is Tightest Fit, But
- We want model to work well on other data generated by the same process
- It needs to generalize
Cross Validate
- Generate models using one dataset, and then test it on the other
- Use models for Dataset 1 to predict points for Dataset 2
- Use models for Dataset 2 to predict points for Dataset 1
- Expect testing error to be larger than training error
- A better indication of generalizability than training error
pylab.figure() testFits(models1, degrees, xVals2, yVals2, 'DataSet 2/Model 1') pylab.figure() testFits(models2, degrees, xVals1, yVals1, 'DataSet 1/Model 2')
- Generate models using one dataset, and then test it on the other
Increasing the Complexity
- coefficient may be zero if Fitting a quadratic to a perfect line
- for example one:
- data: [{0, 0}, {1, 1}, {2, 2}, {3, 3}]
- Fit
y=ax^2+bx+c
, will gety=0x^2+1x+0
, which isy=x
- for example two:
- data: [{0, 0}, {1, 1}, {2, 2}, {3, 3.1}]
- Fit
y=ax^2+bx+c
, will gety=0.025x^2+0.955x+0.005
, which isy=x
- for example one:
- if data is noisy, can fit the noise rather than the underlying pattern in the data?
- for example:
- data: [{0, 0}, {1, 1}, {2, 2}, {3, 3}, {4, 20}]
- Fit
y=ax^2+bx+c
, we got:- R-squared = 0.7026
- suppose we had used a first-degree fit
model = pylab.polyfit(xVals, yVals, 1)
- for example:
- coefficient may be zero if Fitting a quadratic to a perfect line
Conclusion
- Choosing an overly-complex model can leads to overfitting to the training data
- Increases the risk of a model that works poorly on data not included in the training set
- On the other hand choosing an insufficiently complex model has other problems
- when we fit a line to data that was basically parabolic
Suppose We Don’t Have a Solid Theory to Validate Model #
- Use cross-validation results to guide the choice of model complexity
- If dataset small, use leave-one-out cross validation
- If dataset large enough, use k-fold cross validation or repeated-random-sampling validation
Leave-one-out Cross Validation #
Let
D
be the original data settestResults = [] for i in range(len(D)): training = D[:].pop(i) model = buildModel(training) testResults.append(test(model, D[i])) Average testResults
Repeated Random Sampling #
Let
D
be the original data setLet
n
be the number of random samplestestResults = [] for i in range(n) randomly partition D into two sets: training and test model = buildModel(training) testResults.append(test(model, test)) Average testResults
An Example, Temperature By Year #
Task: Model how the mean daily high temperature in the U.S. varied from 1961 through 2015
- Get means for each year and plot them
- Randomly divide data in half n times
- For each dimensionality to be tried
- Train on one half of data
- Test on other half
- Record r-squared on test data
- For each dimensionality to be tried
- Report mean r-squared for each dimensionality
- Code: lecture11-3.py
The Whole Data Set:
Results:
- Mean R-squares for test data
- For dimensionality 1 mean = 0.7535 Std = 0.0656
- For dimensionality 2 mean = 0.7291 Std = 0.0744
- For dimensionality 3 mean = 0.7039 Std = 0.0684
- Mean R-squares for test data
Line seems to be the winner
- Highest average r-squared
- Simplest model
Wrapping Up Curve Fitting #
- We can use linear regression to fit a curve to data
- Mapping from independent values to dependent values
- That curve is a model of the data that can be used to predict the value associated with independent values we haven’t seen (out of sample data)
- R-squared used to evaluate model
- Higher not always “better” because of risk of over fitting
- Choose complexity of model based on
- Theory about structure of data
- Cross validation
- Simplicity