Sunday, September 15, 2019

PCA and Me.

I recently watched Computerphile's video sequence on Data Analysis, which prompted me to share my own experience with PCA:

Early in my engineering career I worked in a field where we had to measure certain quantities with extraordinarily high accuracy.  When we found ourselves in a situation where we needed to buy an instrument that cost well over $1,000,000 (the instrument was so rare it was impossible to rent one), management suggested we start a side project to build our own instrument that would at least meet our immediate needs, and if we did it well, we could go on to sell it to compete against that million dollar instrument.  Our physicists and other scientists immediately dreamed up a "novel physics" sensor they predicted would be both more sensitive and less expensive.  They then built a "Proof of Concept" device in the lab, and its performance looked very promising.

My job was first to turn that table-covering lab experiment into something useful to our engineers, then to determine whether it could be manufactured and sold.  The lab device functioned horribly when removed from the lab.  The lab was temperature controlled, vibration isolated (optical table), light controlled (dark), sound controlled (anechoic wall coverings), EM controlled (Faraday cage, shields), and so on.

What they did in the lab was expose the sensor to known levels of the stimulus we wanted to detect and measure, then develop algorithms to map the raw sensor signal to the applied stimulus, then do several test runs to gather enough data to determine accuracy, precision and repeatability.  My job was to determine what would be needed to build a device that worked outside the lab, with enough quality and performance to meet our needs.

My first test was simply to repeat the lab test on my engineering workbench.  As previously mentioned, the results were horrible:  A quick plot instantly showed the device's output to be utterly unrelated to the applied stimulus, even after the raw output was post-processed with the algorithms used in the lab.  In fact, the raw output looked more like random noise.

This was no surprise!  Few sensors, if any, ever measure only one thing.  For example, the voltage sensor in a common hand-held multimeter is a circuit that is affected by many environmental stimuli other than the voltage present on the probes, such as temperature, electrical noise, pressure, humidity, and so on.  Yet portable multimeters with 6-digit accuracy can be had for only a few hundred dollars: Clearly, these other stimuli can be engineered out of the final product to the extent that 6-digit accuracy is achieved.

The lab environment is what's called a "single variable system": Everything but the desired stimulus was held constant.  My workbench was far "noisier".  The next step was to intentionally vary as many environmental factors as possible and see how the sensor responded.  Ideally, only one environmental factor would be varied at a time, but that's simply not practical outside a far larger lab.  So you go the opposite way: Take data with as much held stable as possible, then vary the factors one at a time or in combination, whichever is most practical (fast, easy, cheap), the primary requirement being to measure everything that can be measured in parallel with the desired applied stimulus.

The "pièce de résistance" of this effort was a long data set taken over days while simultaneously (and very carefully) varying as many environmental factors as possible, which in this case took place inside a temperature+humidity chamber that contained a miniature shake table (basically, a speaker with a plate on top instead of a cone), to which I added accelerometers to measure motion, and whatever other instruments I could find that measured anything and everything else to as high a precision as possible.

This setup made Frankenstein's Monster look pretty, and Rube Goldberg's devices look simple, elegant and sensible.   Getting all this data correctly gathered and recorded was its own nightmare, the most critical item being correctly tagging each and every measurement with the precise time at which it was taken.  (Timing deserves its own separate post.)

Each data point contains the time at which the data was taken, the value for each environmental parameter being measured (including the desired stimulus), and finally the raw value output by the sensor itself.  The correct term for a data point created from multiple measurements is a "sample", a term chosen specifically to remind us that we aren't seeing "actual physics", only what our instruments reveal to us.

Note: What I've described above is "time-series" data, which enables many additional analytical techniques to be applied, because time connects adjacent data points in ways few other parameters permit.  Most importantly, time-series data can be analyzed both in the time domain (much as is done in the video series) and in the frequency (or complex) domain.  The most well-known tool connecting these domains is the FFT (Fast Fourier Transform), though there are others.
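For illustration only (this is not the project's analysis code), a quick frequency-domain look at one logged channel might look like the following NumPy sketch; the 10 Hz sample rate and the synthetic signal are assumptions made just to keep it self-contained:

    import numpy as np

    fs = 10.0                                  # assumed sample rate, Hz
    t = np.arange(0, 600, 1 / fs)              # ten minutes of samples
    # Stand-in for one logged channel: a 0.5 Hz signal buried in noise.
    raw = np.sin(2 * np.pi * 0.5 * t) + 0.2 * np.random.randn(t.size)

    spectrum = np.fft.rfft(raw - raw.mean())   # remove the DC offset first
    freqs = np.fft.rfftfreq(raw.size, d=1 / fs)
    peak = freqs[np.argmax(np.abs(spectrum))]
    print(f"Dominant frequency: {peak:.2f} Hz")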

At the end, what you have is a truckload of data.  Several truckloads.  Millions of data points, each with up to a dozen attributes.   At that point, the data collection stops and the analysis starts.  The best place to start is with the largest, messiest data set.  First you condition the data as described in the videos.  Then the best tool to apply is PCA.
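As a rough sketch of what conditioning means here, assuming the samples have already been loaded into a single NumPy array (the shape and the random stand-in data are purely illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    # Stand-in array: one row per sample, one column per measured quantity
    # (environmental factors, the applied stimulus, and the raw sensor output).
    X = rng.normal(size=(100_000, 12))

    # Centre each column and scale it to unit variance so that no quantity
    # dominates the PCA simply because of its units or magnitude.
    X_cond = (X - X.mean(axis=0)) / X.std(axis=0)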

I followed an iterative process (a rough code sketch follows the list):
1. Run PCA.
2. Determine which environmental factor best correlates to PC1.
3. Remove that factor from the sensor data.
4. Repeat from Step 1 until PC1 no longer correlates with any of the environmental factors.
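
Here is a heavily simplified sketch of that loop using scikit-learn's PCA.  The correlation threshold, the factor names, and the purely linear "removal" in Step 3 are all assumptions made for illustration; as explained below, the real corrections were rarely that simple.

    import numpy as np
    from sklearn.decomposition import PCA

    def pc1_scores(X):
        # Project the conditioned samples onto the first principal component.
        return PCA(n_components=1).fit_transform(X).ravel()

    def iterate_corrections(sensor, factors, names, corr_threshold=0.3):
        # sensor:  1-D array of raw sensor values
        # factors: 2-D array, one column per environmental measurement
        # names:   labels for those columns
        sensor = sensor.copy()
        while True:
            # Step 1: PCA on the conditioned sensor + environmental data.
            X = np.column_stack([sensor, factors])
            X = (X - X.mean(axis=0)) / X.std(axis=0)
            pc1 = pc1_scores(X)

            # Step 2: find the environmental factor best correlated with PC1.
            corrs = np.array([abs(np.corrcoef(pc1, factors[:, i])[0, 1])
                              for i in range(factors.shape[1])])
            best = int(np.argmax(corrs))
            if corrs[best] < corr_threshold:
                return sensor              # Step 4: nothing left to correlate

            # Step 3 (greatly simplified): remove a least-squares linear fit
            # of that factor from the sensor data.
            slope, intercept = np.polyfit(factors[:, best], sensor, 1)
            print(f"Removing linear fit of {names[best]} (|r| = {corrs[best]:.2f})")
            sensor = sensor - (slope * factors[:, best] + intercept)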

Step 3 may sound simple, but it is very complex to do correctly.  Simply subtracting a normalized value from the raw sensor data is rarely useful, as the effect is seldom purely additive or purely linear.  We must determine if there are known/common transformations that will permit the environmental factor to account for most of PC1.  Temperature, for example, often has an exponential factor in its effect.

"But wait!" I hear you say. "Isn't PCA strictly a linear process?  How can you use it to derive an exponential correction?"  The simple answer is you can't, not directly.  So you cheat.  Given enough data, PCA can be applied to shorter chunks, permitting piece-wise linear corrections to be determined, from which the governing non-linear (exponential or polynomial) correction may be derived.  That's why multiple millions of samples are taken.

Not surprisingly, the first PC1 correlated with temperature, validating the truism "All sensors are thermometers".  Which is why every measurement instrument applies at least one, and often multiple, temperature corrections.

Next was vibration, with the matching truism "All sensors are microphones", which explains the shock mounts used within many instruments.  Just rapping your knuckle on the case of a $20K oscilloscope will often be enough to cause it to trigger due to piezo-electric effects present in the MLC capacitors used in the sensitive input amplifiers.  (See the EEVBlog videos on this.)

The above process has one huge, massive, terrible downside: It accumulates/amplifies all noise present in the data.  In my case, Step 4 was reached even before the stimulus of interest was matched by correlation with PC1!  The noise was dominant and correlated with nothing.

Which means we toss out all the data and start over, this time directly removing the environmental factors having the highest correlations.  There are two ways to remove the effect of an environmental factor from a sensor: Either hold it constant, or generate a compensating signal and subtract it to cancel the effect.

For the example of temperature, the sensor could be actively heated/cooled to keep it at a known temperature, something commonly done for precision time references such as crystals and atomic clocks (called putting them in an "oven").  This is called "effect prevention", and it is relatively expensive to implement within an instrument, so it is to be avoided unless absolutely necessary.  Some sensor materials work best only at a single temperature, so an oven is the only choice if such a sensor is to be used.

The other alternative is to reduce the effect as best we can, then generate a signal that matches the remaining effect and remove it from the sensor value.  This is called "effect compensation", and it is relatively cheap to implement, though it is always preferred to find sensor materials that don't need compensation.  For temperature, it can be as simple as wrapping the sensor in fiberglass with a temperature sensor inside.  That sensor could be an RTD, diode or thermocouple, whichever best matches the behavior observed in the raw sensor signal.  Then its signal is subtracted from, or divided out of, the sensor signal.
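
A minimal sketch of temperature compensation, assuming the internal temperature is logged alongside the raw output and that a low-order polynomial captures the observed behavior (both assumptions chosen for illustration):

    import numpy as np

    def fit_temperature_compensation(raw, temperature, order=2):
        # Fit a low-order polynomial model of the temperature effect, using
        # the logged internal temperature (e.g. from the RTD inside the wrap).
        return np.polyfit(temperature, raw, order)

    def compensate(raw, temperature, coeffs):
        # Subtract the modelled temperature effect from the raw signal.
        return raw - np.polyval(coeffs, temperature)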

Then it's time to repeat the data-gathering run in exactly the same way as before, along with the above analysis.  This cycle of gathering data, correlating, and removing effects continues until as many correlations as practically possible have been removed.  It should be no surprise that temperature had to be "removed" multiple times, generally by adding higher-order terms to the correction.

We soon reached the point where we could reliably extract a useful measurement for the desired stimulus from the single system sitting on my workbench.  It performed significantly worse than the million dollar instrument, but it met our immediate needs.  That device was more properly called an "escaped lab rat", in that while it definitely worked outside the lab, it wasn't anything close to being a commercial instrument.

A commercial instrument has two key features:  It can be calibrated, and once calibrated it provides useful results for an extended period of time.  In the example of the million dollar instrument, it had to be calibrated every time it was turned on, which meant it could never be turned off!  (This is not uncommon for ultra-high-end instruments.)  So part of the purchase price was an uninterruptible power supply.

Fortunately, our "escaped lab rat" could be turned on and off as needed, requiring only a 5-10 minute stabilization period before producing usable measurements.  Which gave us a very good reason to keep working, and management agreed, giving us a generous budget.  The larger budget was needed because this effort would be far greater than a single engineer at a workbench.

The project started just as my prior efforts did, with taking lots of data and doing lots of analysis.  Only this time the goal was to optimize everything to find the best solution for each correction, which meant testing multiple alternatives, and sometimes combining them.  This is when R&D (Research and Development) becomes "Product Development".  Being primarily an R&D engineer, I stayed with the project long enough to share my work, then moved on to other projects.

That instrument did make it to market, then completely took it over.  I'd love to say what that instrument was, or who made it, or what it sensed, but none of it was patented: The technologies used were considered so bleeding-edge that filing patents would expose them to the world, encouraging others to engineer around the patents rather than create something from first-principles as we did.  That makes what we did a Trade Secret, something I can't share until it otherwise becomes common knowledge (and is the reason why some NDAs have no expiration date).

The process shared above is common practice in sensor R&D, and is the best reason to become multidisciplinary:  While my university degree is in Computer Engineering (100% CS + 30% EE), I was an electronics lab technician before and during college, and also did well in physics, math and statistics.  I'm now a Systems Engineer, where I get to work at the highest product and technology levels, yet I'm still able to get some lab time in when a knotty problem arises.

Of all the skills I've accumulated, the most useful has been knowing when and how to use PCA, and knowing what to do with the results.  Statistics and data analysis methods allow us to tame the chaos of the real world, and to make sense of it.