I have worked with data-driven applications and databases for finance, telecom, military sensors, and bioinformatics since the late 1990s. For the past five years, I have focused on data management systems in support of scientific research and machine learning. I am sharing the knowledge gained through my work because it is relevant to other businesses.
Raw data has value, but that value can be increased by processing the data into information. Information is suitable for additional analysis (machine learning or otherwise), which converts it to knowledge. It takes effort to make these data transformations, but once done, you have data in a form that is suitable for reuse. It becomes simpler to use in subsequent analyses with new and better techniques. If you plan out your data strategy and automate these transformations with a data science pipeline, you can save your developers and data scientists a great deal of time, and enter data heaven!
Data Collection Strategy
The scientific method involves creating and testing hypotheses. Data Science is no different. The heart of a data strategy is to have hypotheses — questions you want your data to answer. When you know the questions, you will know what data you need and can design systems to collect the right data and prepare it for analysis. There are many considerations that should be applied enterprise wide to your data producing systems.
- Plan your naming (file conventions, column names, etc.), stick to it, be consistent
- Unify your web log formats, syslog formats, and other application logs
- Use standard formats to simplify handling, such as ISO standard timestamps
- Have a plan for empty columns (empty string vs null) and not a number (NaN) and how your analytical code will handle them
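As a sketch of the last point, here is one way to enforce a missing-value plan in Python. The single-sentinel policy and the `normalize_value` helper are illustrative assumptions, not a prescription; the point is that every producer and consumer agrees on one representation before analysis begins.

```python
import math

# Policy (an assumption for this sketch): empty strings, None, and NaN
# all normalize to one agreed-upon missing-value marker.
MISSING = None

def normalize_value(value):
    """Map the various 'empty' representations to the single marker."""
    if value is None or value == "":
        return MISSING
    if isinstance(value, float) and math.isnan(value):
        return MISSING
    return value

row = {"name": "sensor-7", "reading": float("nan"), "note": ""}
clean = {k: normalize_value(v) for k, v in row.items()}
```

With this in place, analytical code only ever tests for one marker instead of three different kinds of "empty".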
Note that these techniques will also make it easier to manage and assess the health of your enterprise systems, since they reduce the effort needed to report on your system logs.
If you have multiple sources of data in your enterprise, do they follow internal standards? It is far simpler to put data in the proper format at its time of creation than to touch every piece of data later to verify its format, or to make your analytical code handle similar data in different ways. This will save you significant development dollars and processing time.
Follow specs, use correct data type formats, and ensure you have good libraries for processing the data. This will make it easier to write analytical code, and the code will likely perform better. The ISO has standards for certain data types, especially timestamps.
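The ISO 8601 timestamp format is a good concrete example: it sorts lexicographically and parses with standard libraries, so no custom date-handling code is needed. A minimal Python illustration (the timestamp value is made up):

```python
from datetime import datetime, timezone

# ISO 8601 timestamps parse directly with the standard library (Python 3.7+).
ts = "2019-07-04T12:30:00+00:00"
parsed = datetime.fromisoformat(ts)

# Producers can emit the same standard form, keeping all systems consistent.
moment = datetime(2019, 7, 4, 12, 30, tzinfo=timezone.utc)
assert moment.isoformat() == ts
```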
Companies are going to drown in IoT data if they do not plan how to deal with it. Pay attention to data sampling rates — they add up quickly, and you probably don’t need it all. Learn to summarize the data to meet your analytical needs — maybe you just need alerts, 10-minute averages, or a Fourier Transform to show a pattern in the data. Is your data text or binary? Storing numerical IoT data as binary is much more efficient and will save money.
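As a sketch of both points, the stdlib-only Python below rolls per-second samples into 10-minute averages and compares the storage size of one reading as text versus as a packed binary double. The sample values are invented for illustration.

```python
import struct
from collections import defaultdict
from datetime import datetime

def ten_minute_averages(samples):
    """Average (iso_timestamp, value) samples into 10-minute buckets."""
    buckets = defaultdict(list)
    for ts, value in samples:
        t = datetime.fromisoformat(ts)
        # Truncate to the start of the 10-minute window.
        key = t.replace(minute=t.minute - t.minute % 10, second=0, microsecond=0)
        buckets[key].append(value)
    return {k.isoformat(): sum(v) / len(v) for k, v in sorted(buckets.items())}

samples = [
    ("2020-01-01T00:01:00", 10.0),
    ("2020-01-01T00:09:00", 20.0),
    ("2020-01-01T00:11:00", 30.0),
]
averages = ten_minute_averages(samples)

# Text vs. binary: the same double costs 16 bytes as text, 8 bytes packed.
text_size = len("123.456789012345")
binary_size = len(struct.pack("d", 123.456789012345))
```

At millions of readings per day, halving the per-value footprint (and keeping only the summaries you need) adds up quickly.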
- Do you have permission to collect the data you are collecting?
- Is the data being safeguarded per applicable regulations? This is especially important for medical and government/DoD data, and if you have European customers.
To get the ball rolling you need to curate your data. To curate means to store, make discovery easy, support sharing, annotate, and maintain provenance. Below I cover some specific areas that will help you curate your data.
The key to curation is to create metadata for datasets, preferably at the time of data deposit, since the person or system providing the data knows the metadata. What is metadata? File names, storage locations, file creator, points of contact (name, email, etc.), how the data was created, data source, keywords, creation dates, permissions, data description, data dictionary (field name descriptions), special data processing steps, etc. Your business domain may have additional data items.
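The metadata fields above map naturally onto a simple record type. Here is a hedged Python sketch; the class name, field names, and sample values are illustrative, and your domain will add its own fields.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, List

@dataclass
class DatasetMetadata:
    """A minimal metadata record, assembled from the fields listed above."""
    file_name: str
    storage_location: str
    creator: str
    contact_email: str
    data_source: str
    description: str
    keywords: List[str] = field(default_factory=list)
    created: str = ""                     # ISO 8601 timestamp
    permissions: str = "internal"
    data_dictionary: Dict[str, str] = field(default_factory=dict)

record = DatasetMetadata(
    file_name="sensor_2020_01.csv",
    storage_location="s3://example-bucket/raw/",  # hypothetical path
    creator="pipeline-ingest",
    contact_email="data-team@example.com",
    data_source="factory floor IoT gateway",
    description="Vibration sensor readings, one row per second",
    keywords=["iot", "vibration"],
    created="2020-01-31T23:59:59+00:00",
    data_dictionary={"v": "vibration amplitude (mm/s)"},
)
doc = asdict(record)  # plain dict, ready to serialize as JSON for indexing
```

Capturing this at deposit time, from the person or system that knows the answers, is far cheaper than reconstructing it later.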
Metadata gets loaded into a search engine, such as Elasticsearch, to support data discovery. Elasticsearch is great (I have been using it since 2015, and Apache Solr for years before that), and you can also use Kibana to report on your data. A metadata-based search engine performs very well and requires less hardware than a full-text search of the same data. In the case of numerical data (like IoT sensors), full-text search is not practical, so metadata is required.
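As an illustration (not the author's exact setup), the discovery side can be as simple as posting a JSON query against the metadata index. This sketch only builds the query body, so it runs without a cluster; the index name and field list are assumptions.

```python
def metadata_query(keywords, fields=("keywords", "description", "file_name")):
    """Build an Elasticsearch multi_match query body over metadata fields.

    A real deployment would send this with the elasticsearch client,
    roughly: es.search(index="metadata", body=metadata_query([...])).
    """
    return {
        "query": {
            "multi_match": {
                "query": " ".join(keywords),
                "fields": list(fields),
            }
        }
    }

query = metadata_query(["vibration", "iot"])
```

Because the index holds only small metadata documents rather than full dataset contents, it stays fast and cheap even as the underlying data grows.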
Metadata can also be easily used for creating links between documents and showing these using a Graph Database, such as Neo4J. We apply this technique to search results and show the user a graph of the results so they can visualize how the results are related to each other. This can find results more central to a search (you can see the data cluster) and also be used to launch a new search.
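One simple way to derive such links, sketched here with invented documents and stdlib Python, is to connect results that share metadata keywords; the resulting edge list could then be loaded into a graph database such as Neo4j or drawn for the user.

```python
from itertools import combinations

# Hypothetical search results: document id -> set of metadata keywords.
results = {
    "doc_a": {"iot", "vibration"},
    "doc_b": {"iot", "temperature"},
    "doc_c": {"finance"},
}

# Edge between any two documents with overlapping keywords,
# annotated with the keywords they share.
edges = [
    (a, b, sorted(results[a] & results[b]))
    for a, b in combinations(sorted(results), 2)
    if results[a] & results[b]
]
# doc_a and doc_b link through "iot"; doc_c has no shared keywords.
```

Documents with many edges sit near the center of the result graph, which is what makes the "central to a search" visualization possible.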
Data is a major component of your Intellectual Property and must be secured with multiple lines of defense: network defenses to restrict overall access, software-based access controls to limit internal access, and data field controls (encryption, hashing, or even obfuscation with dummy data columns) as a final defense. Threats abound: Middle East wipers destroying your data; organized crime using ransomware to lock your data or simply steal valuable data (credit cards, drug test results, Intellectual Property); and public disclosure of data that could eliminate a competitive advantage, cause embarrassment, or result in other damage. Just check the news for the attack of the week and the lessons learned. If you need help improving or establishing cybersecurity controls, see https://www.nist.gov/cyberframework. This is NOT just for government and is an excellent resource.
Quality Assurance (QA) is used to prevent defects, but what is a data defect? Examples are bad values (numbers out of range), miscategorized data, missing time series data, or a failed processing pipeline that leaves data untransformed (failure to normalize, for instance). To prevent these, you need robust exception handling. If a pipeline fails, do you attempt to restart it and restore the missing data? Are exceptions in code resulting in uninitialized variables (“null” values) being used in data fields? Good engineering practices are required for defect prevention, including code reviews, high unit test code coverage, web/database field validation and constraints, or even some Chaos Monkey testing to validate that you have robust pipelines.
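A minimal sketch of the restart idea, assuming a hypothetical `run_with_retries` helper: retry a failed pipeline step a bounded number of times, log every failure, and re-raise rather than continue with null data.

```python
import logging
import time

def run_with_retries(step, attempts=3, delay_seconds=0.0):
    """Run one pipeline step, retrying on failure instead of dropping data."""
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception:
            logging.exception("pipeline step failed (attempt %d/%d)", attempt, attempts)
            if attempt == attempts:
                raise  # surface the failure; never proceed with missing data
            time.sleep(delay_seconds)

# A flaky step (contrived for illustration) that succeeds on the third try.
calls = {"n": 0}
def flaky_step():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "transformed"

result = run_with_retries(flaky_step, attempts=3)
```

The key design choice is failing loudly at the end of the retries: a pipeline that swallows the exception quietly produces exactly the missing-data defects described above.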
Quality Control (QC) is used for defect detection. Once you have your data in storage, do you ever look at it for problems? Sampling the data and running it through test code that compares it against expected values will help. For example, testing against boundary values can identify anomalies. Check geolocation data against rules for latitude and longitude (and common default values of 0,0 and somewhere in Kansas). Check sensors against expected ranges of values — does 1,000,000 volts make sense? Probably not. Pay special attention to fields with no constraints, like a database column defined as varchar. Application-layer constraints can be bypassed by back-end processing or by direct APIs into a data storage system where the constraints don’t exist.
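The boundary and default-value checks above can be sketched in a few lines of Python; the function names and ranges are illustrative, not a standard.

```python
def check_geolocation(lat, lon):
    """Return a list of QC problems for one coordinate pair."""
    problems = []
    if not -90.0 <= lat <= 90.0:
        problems.append("latitude out of range")
    if not -180.0 <= lon <= 180.0:
        problems.append("longitude out of range")
    if (lat, lon) == (0.0, 0.0):
        problems.append("suspicious default (0, 0)")
    return problems

def check_sensor(value, low, high):
    """Flag readings outside the expected physical range."""
    return [] if low <= value <= high else [f"value {value} outside [{low}, {high}]"]

geo_issues = check_geolocation(0.0, 0.0)
volt_issues = check_sensor(1_000_000, low=0, high=500)  # 1,000,000 volts? No.
```

Run checks like these against a sample of stored data on a schedule, and the varchar columns and back-door APIs stop being silent sources of defects.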
Machine Learning Benefits
Data curation improves corporate knowledge about your business. It will help transform raw data to information and convert information to knowledge. Curation is not exclusively for ML; cleaning datasets and making them ready for analysis benefits any data science work. If you pursue Deep Learning, you will need more high-quality datasets for training and algorithm development than other techniques require.
Once you have good, curated data, there are many things you can do with it in machine learning. Details on these would take another post or two, but I want to get these ideas out.
- Automate creation of cross-validated training sets
- Conduct confusion matrix analysis to discover data problems — bad labels, wrong file types, not enough data in some categories
- Perform Automated Feature Engineering
- Identify problems in training data that could induce bias in a machine learning model
- Perform trend analysis to show how training set improvements improve model performance
- Report on type and amount of training data in system
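As one example from the list, confusion matrix analysis can be sketched without any ML framework: count (true, predicted) pairs, then look at the largest off-diagonal cell, which often points at bad labels or under-represented categories. The labels below are invented for illustration.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred):
    """Count (true_label, predicted_label) pairs."""
    return Counter(zip(y_true, y_pred))

def most_confused(matrix):
    """Return the off-diagonal cell with the most errors: a hint at
    bad labels or categories with too little training data."""
    errors = {pair: n for pair, n in matrix.items() if pair[0] != pair[1]}
    return max(errors, key=errors.get) if errors else None

y_true = ["cat", "cat", "dog", "dog", "dog", "bird"]
y_pred = ["cat", "dog", "dog", "cat", "cat", "bird"]
matrix = confusion_matrix(y_true, y_pred)
worst = most_confused(matrix)  # dogs predicted as cats, twice
```

If one cell dominates, inspect those training examples first; in my experience the model is often right and the labels are wrong.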
Automate Your Data Pipeline
Here is a collection of links to tools to help with workflow, data flow, and data management. I have been impressed with NiFi for data flow processing and I know other companies using some tools below. Some of these focus on machine learning.
- Open Sourcing Amundsen: A Data Discovery And Metadata Platform, by Tao Feng, Jin Hyuk Chang, Tamika Tannis, Daniel Won