Data from business is mined to generate intelligence. Some textbooks on data analytics are too complicated for the common man. This is a conversational book with concrete examples.
Wholeness of Data Analytics. Data can be analysed so that patterns can emerge. The ideas can be analysed so that they can better meet customer needs. Business produces data which needs to be mined in order to generate intelligence. A company can develop a balanced scorecard, etc. Reports should be customized into dashboards for easy viewing. Analytics can be used in sports. For instance, undervalued players with good stats who were not picked up by scouts can be detected via analytics. Analytics can improve a retail store’s business. Daily revenue and expenses can be tracked, etc. There are many types of patterns: temporal, spatial, and functional. The Pareto principle might reveal that 20% of the customers bring in 80% of the business, etc. Functional patterns involve test-taking skills. Patterns can be broken by the emergence of a black swan. Patterns can be hidden. Data mining can reveal interesting insights. There are five steps in a data processing chain: Data → Database → Data warehouse → Data mining → Data visualization. Data can come from many sources. Data can be nominal in nature. They could also be ordered. Data on a scale of 1 to 10 are called discrete numeric values. Some data need to be structured first and then analysed. Nowadays, every click is being recorded. Data modelling can ensure the integrity of data, and most databases follow the relational data model. Understand the one-to-many relationship. Orders and products follow a many-to-many relationship. Some databases are so large that they run into petabytes and exabytes. MySQL is an example of a free DBMS (Database Management System). In the data warehouse, the data is cleaned and rolled up to key dimensions for analysis. Data warehouses contain many types of lookup tables. Data mining is about recognizing innovative patterns in data. The value of the insight depends on what problem needs to be solved. Decision trees are commonly used. Regression is about finding the best-fit curve or line and is used for projections.
Artificial Neural Networks have their roots in the field of artificial intelligence. Clustering is used for market segmentation. Association rule mining looks for associations between data values, for example for cross-selling. Real-time analysis is the best.
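As a rough illustration of the Pareto principle mentioned above, the following sketch computes what share of revenue comes from the top 20% of customers. The function name `top_share` and the revenue figures are made up for the example; real customer data will rarely split this neatly.

```python
def top_share(revenues, top_fraction=0.2):
    """Return the share of total revenue contributed by the top customers."""
    ordered = sorted(revenues, reverse=True)
    # Always include at least one customer in the "top" group.
    n_top = max(1, round(len(ordered) * top_fraction))
    return sum(ordered[:n_top]) / sum(ordered)


# Hypothetical revenue per customer (10 customers).
revenues = [400, 400, 25, 25, 25, 25, 25, 25, 25, 25]
share = top_share(revenues)
print(f"Top 20% of customers bring in {share:.0%} of revenue")
```

With these invented numbers, two of the ten customers account for 800 of the 1,000 in revenue.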
Data mining is the act of digging into large amounts of raw data to discover unique nontrivial useful patterns. Data is cleaned up, and then special tools and techniques can be applied to search for patterns. Diving into clean and nicely organized data from the right perspectives can increase the chances of making the right discoveries. – Anil Maheshwari
Kryder’s law predicts that the density and capability of hard drive storage media will double every 18 months. As storage costs keep coming down at a rapid rate, there is a greater incentive to record and store more events and activities at a higher resolution. – Anil Maheshwari
Data Mining should be done to solve high-priority, high-value problems. Much effort is required to gather data, clean and organize it, mine it with many techniques, interpret the results, and find the right insight. It is important that there be a large expected payoff from finding the insight. – Anil Maheshwari
Business Intelligence Concepts and Applications. Information is the lifeblood of businesses. Organizations should collect their own data. In this modern age, almost everything is potentially connected to everything else. Khan Academy has developed many tools. There are also dashboards that are designed for careers. Do not ignore fast-changing information. BI can aid both strategic and operational decision making. Constantly scan your environment and create what-if analyses of possible scenarios. Decision models have to be revised after new insights have been revealed. One can rely on BI specialists or simply do the analysis oneself. Microsoft Excel can be an easy and effective BI tool. There are dashboarding systems which are updated as soon as the data changes. There are also open-source systems available, like Weka. Data analyst is now a popular job. BI can be applied to customer relationship management (example: maximize returns on marketing campaigns; improve customer retention; maximize customer value; identify and delight highly-valued customers; manage brand image). Healthcare (example: diagnosis of ailments; treatment effectiveness; wellness management; manage fraud and abuse; public health management). Education (example: student enrolment; course offerings; alumni pledges). Retail (example: optimizing inventory levels; improve store layout and sales promotion; optimize logistics for seasonal effects; minimize losses due to limited shelf life). Banking (example: automate the loan application process; detect fraudulent transactions; maximize customer value by cross-selling; optimize cash reserves with forecasting). Financial Services (example: predict change in stock or bond prices; assess effects of events on market movements; identify fraudulent activities in trading). Insurance (example: forecast claim costs; determine optimal rate plans; optimize marketing to specific customers; identify and prevent fraudulent claim activities).
Manufacturing (example: discover novel patterns to improve product quality; predict/prevent machinery failures). Telecom (example: churn management; marketing and product creation; network failure management; fraud management). Government (example: law enforcement; scientific research).
Data Warehousing. The data in the data warehouse (DW) is meant to be collated into reports. The DW must be kept updated. A good DW must be subject-oriented. It should be integrated and should also be time-variant (hold time series). The data should be non-volatile. It should also be summarized and not normalized. There are both functional data warehouses and enterprise data warehouses. The DW architecture consists of 1) data sources; 2) data transformation; 3) data mart; 4) accessing users and applications. Data should be extracted and aligned by key fields. It should be cleansed of irregularities and missing values. Data should be extracted on a frequent basis. There should be a proper data warehouse design, for instance, a ‘star schema’. The DW should be aligned with corporate strategy. Financial viability should also be established. It is crucial to manage user expectations as well.
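To make the ‘star schema’ idea a little more concrete, here is a minimal sketch using Python's built-in `sqlite3` module: one fact table surrounded by two dimension tables, rolled up by category and month. The table names, column names and figures are all hypothetical, and a real warehouse would be far larger and loaded by an extraction process rather than by hand.

```python
import sqlite3

# In-memory database standing in for a (tiny) data warehouse.
dw = sqlite3.connect(":memory:")
dw.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, month TEXT);
-- The fact table holds the measures and points at each dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    revenue    REAL
);
INSERT INTO dim_product VALUES (1, 'dairy'), (2, 'bakery');
INSERT INTO dim_date    VALUES (1, '2024-01'), (2, '2024-02');
INSERT INTO fact_sales  VALUES (1, 1, 100.0), (1, 2, 120.0),
                               (2, 1, 80.0),  (2, 1, 20.0);
""")

# Roll revenue up to the key dimensions: category and month.
rollup = dw.execute("""
    SELECT p.category, d.month, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_date d    ON d.date_id = f.date_id
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
""").fetchall()

for category, month, revenue in rollup:
    print(category, month, revenue)
```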
Data Mining. The past can reveal crucial data about the future. For example, if you know that ‘customers who buy cheese and milk also buy bread 90% of the time’, this is useful information. Targeting customers can be accurate. Not all data streams can provide valuable insight. It is important to choose where to collect data from. Curating data takes time and effort. Data can also be unstructured and semi-structured. The quality of data is paramount. It has to be reliable. Data must be cleansed: 1) duplicate data removed; 2) missing values filled; 3) data values transformed; 4) conditional formatting; 5) data elements adjusted; 6) outliers removed; 7) correct for bias in selection; 8) adjusted to same granularity; 9) increase information density. An output can be a decision tree. There is both supervised learning (for example, a decision tree that uses past decisions to guide future decisions) and unsupervised learning (ex: cluster analysis). Predictive accuracy = (Correct Predictions)/(Total Predictions). A model with more than 70% accuracy is considered good. Decision trees are easy to use. Some popular decision tree algorithms are 1) C5, 2) CART, 3) CHAID. Artificial Neural Networks (ANN) rely on feeding in data and then training the neural network to adjust its internal computations and parameters based on previous decisions. Clustering is a segmentation technique. Association rules identify cross-selling opportunities. There are many data mining tools available in the market. IBM’s SPSS Modeler is one of the best data mining tools available. Business Objects in SAP is also one of the leading BI suites in the market. Create an iterative sequence of steps. There are 6 steps for best practices: 1) business understanding; 2) creative solutions; 3) data should be clean and high quality; 4) develop algorithms/modelling; 5) create what-if scenarios; 6) roll-out and incorporate in business processes. You don’t need to be an expert to do data mining. Clear metadata is important.
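The predictive accuracy formula above is simple enough to sketch directly. The function name and the sample labels below are invented for illustration.

```python
def predictive_accuracy(predicted, actual):
    """Correct predictions divided by total predictions."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one prediction")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical model output versus what really happened.
predicted = ["yes", "no", "yes", "yes", "no"]
actual = ["yes", "no", "no", "yes", "no"]
print(predictive_accuracy(predicted, actual))  # 4 of 5 correct -> 0.8
```

By the 70% rule of thumb in the text, this made-up model would count as good.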
Decision Trees. They are efficient to use. For every question you ask, you will enter a separate branch. Humans need to learn from past experiences. The more data is provided, the more accurate the tree will be. The more variables there are, the more accurate the tree can also become. Look at past decisions. Not all the variables must be included in the decision tree. First, determine the root node of the tree. Which is the most important question to ask? That should be the root node. Choose the one that has the lowest error rate based on historical data. Similarly, the next node can be developed. There are algorithms for making decision trees. The tree should not be unusually long.
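One way to read "choose the one that has the lowest error rate" is sketched below: split the historical records on each candidate variable, predict the majority outcome in each branch, and count the mistakes. The dataset, the attribute names and the helper functions are hypothetical. Established algorithms such as C5 or CART use other splitting criteria, so treat this only as an illustration of the error-rate idea.

```python
from collections import Counter, defaultdict


def error_rate(rows, attribute, target):
    """Error rate when each branch of `attribute` predicts its majority outcome."""
    branches = defaultdict(Counter)
    for row in rows:
        branches[row[attribute]][row[target]] += 1
    # In each branch, every record outside the majority class is an error.
    errors = sum(sum(c.values()) - max(c.values()) for c in branches.values())
    return errors / len(rows)


def choose_root(rows, attributes, target):
    """Pick the attribute with the lowest error rate as the root node."""
    return min(attributes, key=lambda a: error_rate(rows, a, target))


# Hypothetical past decisions about whether to play outdoors.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]
print(choose_root(rows, ["outlook", "windy"], "play"))  # outlook
```

Here splitting on `outlook` misclassifies one record out of six and `windy` misclassifies two, so `outlook` becomes the root. The same procedure can then be repeated inside each branch to develop the next node.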
Regression. It is used to model the predictive relationship between several independent variables and one dependent variable. The idea is to find a best-fit curve or line. The quality of the fit is measured by the coefficient of correlation (R). If you can, read the book ‘The Signal and the Noise’ by Nate Silver. The strength of a correlation is measured from 0 to 1 (R itself ranges from -1 to 1, with the sign giving the direction of the relationship). 0 indicates no linear relationship at all. Some charts can be modelled using the simple linear regression model. There are also more complex quadratic regression equations. You can add other factors as well to improve the model. The model will then be able to explain even more. Correlation > 95% is considered good. There is also curvilinear regression. Sometimes, you need to square a factor so that the correlation will improve. There are also logistic regression models. These are useful because they can work with dependent variables that have binary values. Microsoft Excel is also powerful enough to perform simple regression analysis. However, data must be cleaned before regression can be applied. Adding too many variables will also hurt the predictive capability of the model. Regression models do not automatically take care of non-linearity.
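For the simple linear case, the best-fit line and the coefficient of correlation can be computed with the standard least-squares formulas. The sketch below uses only the standard library; the function name `fit_line` and the data points are made up for illustration.

```python
from math import sqrt


def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept, plus correlation R."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept, r = fit_line(xs, ys)
print(slope, intercept, r)  # 2.0 1.0 1.0
```

Because these invented points sit exactly on a line, R comes out as 1. Real data is noisier and R will be lower.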
Artificial Neural Networks. These are inspired by the information processing model of the brain. They are versatile systems. The system can be updated with improvements and then improve decision making along the way. There are many business applications of ANN. It can be used in stock price prediction. A neuron is a basic processing unit of the network. The output from this neuron can be transferred to another neuron in the network. It is possible for an ANN to be more accurate than humans. Every input is assigned some weight. Different algorithms can be utilized. For instance, one can use something called a ‘Multi-layer Perceptron’. However, ANNs are deemed to be black-box solutions. Optimally designing an ANN is not simple. Large datasets are required to hone an ANN.
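A single neuron can be sketched as a weighted sum of its inputs passed through an activation function. The sigmoid activation, the weights and the inputs below are assumptions chosen for the example; the text does not specify them, and training (adjusting the weights) is left out.

```python
from math import exp


def neuron_output(inputs, weights, bias=0.0):
    """Weighted sum of inputs squashed into (0, 1) by a sigmoid."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + exp(-total))


# Hypothetical inputs and weights; the weighted sum here is exactly 0.
first = neuron_output([1.0, 2.0], [0.5, -0.25])
print(first)  # 0.5

# The output of one neuron can be fed as an input to another neuron.
second = neuron_output([first, 1.0], [2.0, -1.0])
print(second)
```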
Cluster Analysis. This is used for identifying groups of similar things. There is no right or wrong way of assigning a cluster. Cluster analysis can be useful in market segmentation, text mining, etc. Heuristics can be used to determine the number of clusters. There are different ways to define what a cluster is. It could be the most commonly occurring value, etc. There are both Manhattan and Euclidean distances. The K-means algorithm minimizes the sum of squared distances between the data points and the centre points of their clusters. The K-means algorithm is simple to use and easy to implement. However, the user must specify the value of K.
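The two distance measures and the basic K-means loop can be sketched in a few lines. This is a bare-bones version under stated assumptions: the starting centres are supplied by the caller, the number of iterations is fixed, and the sample points are invented. Library implementations handle initialization and convergence more carefully.

```python
from math import sqrt


def manhattan(a, b):
    """Sum of absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Straight-line distance."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(points, centres):
    """Put each point into the cluster of its nearest centre."""
    clusters = [[] for _ in centres]
    for p in points:
        nearest = min(range(len(centres)), key=lambda i: euclidean(p, centres[i]))
        clusters[nearest].append(p)
    return clusters


def k_means(points, centres, iterations=10):
    """Alternate between assigning points and moving centres to cluster means."""
    for _ in range(iterations):
        clusters = assign(points, centres)
        centres = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centre
            for cluster, centre in zip(clusters, centres)
        ]
    return centres, assign(points, centres)


# Hypothetical 2-D points forming two obvious groups; K = 2 is chosen by the user.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centres, clusters = k_means(points, centres=[(0, 0), (10, 10)])
print(centres)
print(clusters)
```

The number of starting centres passed in plays the role of K, which is why the user has to decide it up front.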
Association Rule Mining. This is good as it helps to identify shopping patterns. It can also help to identify cross-selling opportunities. The data is categorical. Netflix also uses such algorithms. Are they manipulative in nature? It is important to find rules that satisfy the minimum support and minimum confidence. Algorithms can help identify the frequent item sets. The Apriori algorithm is one of the most common ones. Support levels and confidence levels must be established.
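Support and confidence follow their usual definitions: support is the fraction of transactions containing an item set, and confidence is the support of the whole rule divided by the support of its left-hand side. The sketch below computes both for a handful of invented shopping baskets. It does not implement Apriori itself, only the two measures that Apriori filters on.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """How often the consequent appears in baskets containing the antecedent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the rule never applies in this data
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base


# Hypothetical shopping baskets.
transactions = [
    {"milk", "cheese", "bread"},
    {"milk", "cheese", "bread"},
    {"milk", "bread"},
    {"cheese", "eggs"},
    {"milk", "cheese"},
]
print(support(transactions, {"milk", "cheese"}))                # 3 of 5 baskets
print(confidence(transactions, {"milk", "cheese"}, {"bread"}))  # 2 of those 3
```

A rule would be kept only if both numbers clear the minimum support and minimum confidence thresholds set by the analyst.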
Text Mining. This is about discovering knowledge from organized collections of textual databases. Text mining can be performed on social media. If you speak about certain topics on WhatsApp, the data might be mined. Facebook might be able to capitalize on information about you based on what you say on WhatsApp. Text mining is useful for people like Chief Knowledge Officers. It is possible to guess a person’s emotional state based on what the person says. For text mining to take place, the text must be structured and analysed. A bag of words will be picked and the frequency of occurrence will be computed. Data mining tools can help to classify and analyse them. Data should be cleaned of spelling errors. All the analysis can form a term-document matrix. Regression can also be performed. Documents with similar profiles can be bundled together. Text data must be converted into frequency data before it can be analysed. The most important thing is to ask the right question. Think outside the box. Be creative in proposing solutions. Go after the problem iteratively. Text mining is especially useful with the advent of social media.
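The bag-of-words and term-document matrix steps can be sketched with the standard library. The tokenizer below is deliberately crude (lowercase letters and apostrophes only), and the two review snippets are invented. Here each row is a document and each column is a term; some texts use the transposed layout.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def term_document_matrix(documents):
    """Return the sorted vocabulary and one row of term counts per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


# Hypothetical customer comments.
documents = ["The service was great", "Great food, slow service"]
vocabulary, matrix = term_document_matrix(documents)
print(vocabulary)
for row in matrix:
    print(row)
```

Once text is in this frequency form, the usual data mining techniques (clustering, regression and so on) can be applied to the rows.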
Web Mining. This is the art of discovering patterns from the Web. There is a huge volume of data being published every day. For a website, appearance, content and functionality matter a lot. It should have an aesthetic design. The content should be well planned. Data is useful for commercial advertising. Web mining is divided into ‘web content’, ‘web structure’ and ‘web usage’ mining. Text and applications can be analysed by the number of visits. A page with many interesting links is known as a ‘hub’. Information like clicks is captured in an ad server or proxy server, etc. Metadata can be gathered as well. Usage patterns can be analysed by ‘clickstream’ analysis. There are many business uses for web usage mining. PageRank is a fantastic web mining algorithm.
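The core idea of PageRank, that a page is important if important pages link to it, can be sketched as a simple iterative calculation over a link graph. The damping factor of 0.85, the fixed iteration count and the three-page graph are assumptions for the example, and the sketch expects every linked page to appear as a key in `links`. It illustrates the idea and is not a description of how any search engine computes rankings today.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively spread each page's rank across the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical three-page web: a links to b and c, b links to c, c links to a.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```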
Big Data. This refers to a data set that is extremely large and complex. This is an up-and-coming industry. It can help to identify trends. The challenge is always how you go about analysing it. There are both structured and unstructured data. Data can move quickly. Social media contains a lot of data. Big data will disrupt your business. How can you harness big data information for growth? The cost of storage is falling and the speed of access is much faster in modern times. There are many new tools to handle such data sets. Being customer-centric helps. The data collected should be able to help your customers. Advanced capabilities are required. The faster you analyse, the better it is. Plan for exponential growth.
Data Modelling Primer. There are 10 qualities of good data. Data should be 1) accurate; 2) persistent; 3) available; 4) accessible; 5) comprehensive; 6) analysable; 7) flexible; 8) scalable; 9) secure; 10) cost-effective. Nowadays, databases are mostly relational in nature. Data tables can be joined using key attributes. An entity should have certain attributes that describe it. There are different types of relationships: 1) one-to-one; 2) one-to-many; 3) many-to-many. Every entity must have a key attribute. This is easy to implement if a DBMS is being used. This is a popular way of linking data together.
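A one-to-many relationship joined through a key attribute can be sketched with Python's built-in `sqlite3` module. The `customer` and `orders` tables, their columns and their contents are hypothetical: one customer can have many orders, and each order carries the customer's key.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- Each entity has a key attribute (the primary key).
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
-- Many orders can point at one customer (one-to-many).
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    amount      REAL NOT NULL
);
INSERT INTO customer VALUES (1, 'Asha'), (2, 'Ben');
INSERT INTO orders   VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# Join the two tables on the key attribute and summarize per customer.
rows = db.execute("""
    SELECT c.name, COUNT(o.order_id), SUM(o.amount)
    FROM customer c
    JOIN orders o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id
    ORDER BY c.name
""").fetchall()

for name, order_count, total in rows:
    print(name, order_count, total)
```

A many-to-many relationship, such as orders and products, would typically need a third linking table holding pairs of keys.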