Data Mining Concepts

EMC2DATA

    Data mining is the process of discovering actionable information from large sets of data. Data mining uses mathematical analysis to derive patterns and trends that exist in data. Typically, these patterns cannot be discovered by traditional data exploration because the relationships are too complex or because there is too much data.

    These patterns and trends can be collected and defined as a data mining model. Mining models can be applied to specific scenarios, such as:

  • Forecasting: Estimating sales, predicting server loads or server downtime
  • Risk and probability: Choosing the best customers for targeted mailings, determining the probable break-even point for risk scenarios, assigning probabilities to diagnoses or other outcomes
  • Recommendations: Determining which products are likely to be sold together, generating recommendations
  • Finding sequences: Analyzing customer selections in a shopping cart, predicting next likely events
  • Grouping: Separating customers or events into clusters of related items, analyzing and predicting affinities
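
As a rough illustration of the recommendations scenario, the sketch below counts how often pairs of products appear in the same basket and suggests the most frequent companions of a given product. The basket data and the function names `pair_counts` and `recommend` are hypothetical and exist only for this example; this is plain co-occurrence counting, not the algorithm that any particular product uses.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets; each set holds the products in one purchase.
baskets = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "jam"},
    {"bread", "butter", "milk"},
]

def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of products."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

def recommend(product, baskets, top_n=2):
    """Return the products most often bought together with `product`."""
    related = Counter()
    for (a, b), n in pair_counts(baskets).items():
        if a == product:
            related[b] += n
        elif b == product:
            related[a] += n
    return [item for item, _ in related.most_common(top_n)]

print(recommend("bread", baskets))
```

A real mining model would also weigh how common each product is on its own, so that popular items do not dominate every recommendation.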
    Building a mining model is part of a larger process that includes everything from asking questions about the data and creating a model to answer those questions, to deploying the model into a working environment. This process can be defined by using the following six basic steps:

  • 1. Defining the Problem
  • 2. Preparing Data
  • 3. Exploring Data
  • 4. Building Models
  • 5. Exploring and Validating Models
  • 6. Deploying and Updating Models
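
The six steps above can be sketched end to end on a toy problem. The example below is a minimal illustration that uses only the Python standard library, not SQL Server tooling. The sales records, the choice of a least-squares line as the model, and the names `fit_line` and `predict` are all assumptions made for the sketch.

```python
from statistics import mean

# 1. Define the problem (assumed here): predict sales from advertising spend.
# Hypothetical (ad_spend, sales) records; None marks a missing value.
raw = [(1.0, 3.1), (2.0, 4.9), (3.0, None), (4.0, 9.2),
       (5.0, 10.8), (6.0, 13.1), (None, 7.0), (8.0, 17.0)]

# 2. Prepare data: drop records with missing values.
clean = [(x, y) for x, y in raw if x is not None and y is not None]

# 3. Explore data: look at simple summary statistics before modeling.
print("records:", len(clean), "mean sales:", mean(y for _, y in clean))

# 4. Build the model: fit a least-squares line on a training split.
train, holdout = clean[:4], clean[4:]

def fit_line(points):
    """Return (slope, intercept) of the least-squares line through points."""
    mx = mean(x for x, _ in points)
    my = mean(y for _, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return slope, my - slope * mx

slope, intercept = fit_line(train)

# 5. Explore and validate the model: mean absolute error on held-out records.
def predict(x):
    """Predict sales for a given advertising spend using the fitted line."""
    return slope * x + intercept

mae = mean(abs(predict(x) - y) for x, y in holdout)
print("slope:", round(slope, 2), "holdout MAE:", round(mae, 2))

# 6. Deploy and update: serve predict(), and refit when more data arrives.
updated_slope, updated_intercept = fit_line(clean)
```

Holding out records that the model never saw during fitting is what makes step 5 a meaningful check. Evaluating on the training records alone would overstate how well the model generalizes.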
    The following diagram describes the relationships between the steps in the process, and the technologies in Microsoft SQL Server that you can use to complete each step.

    [Diagram: Data Mining Concepts]

    The process illustrated in the diagram is cyclical, meaning that creating a data mining model is a dynamic and iterative process. After you explore the data, you may find that the data is insufficient to create the appropriate mining models, and that you therefore have to look for more data. Alternatively, you may build several models and then realize that the models do not adequately answer the problem you defined, and that you therefore must redefine the problem. You may have to update the models after they have been deployed because more data has become available. Each step in the process might need to be repeated many times in order to create a good model.
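
The iterative nature of the process can be pictured as a loop that keeps gathering data and rebuilding the model until validation succeeds. The sketch below is one hypothetical way to express that loop. The data batches, the error threshold, and the names `fit_line`, `holdout_error`, and `build_iteratively` are assumptions made for illustration only.

```python
from statistics import mean

def fit_line(points):
    """Return (slope, intercept) of the least-squares line through points."""
    mx = mean(x for x, _ in points)
    my = mean(y for _, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return slope, my - slope * mx

def holdout_error(model, holdout):
    """Mean absolute error of the model on records it was not fitted on."""
    slope, intercept = model
    return mean(abs(slope * x + intercept - y) for x, y in holdout)

# Hypothetical data that becomes available over time, in batches.
batches = [
    [(1, 4.0), (2, 4.0)],
    [(3, 7.0), (4, 9.0)],
    [(5, 11.0), (6, 13.0)],
]
holdout = [(10, 21.0), (12, 25.0)]

def build_iteratively(batches, holdout, max_error):
    """Refit after each new batch until the holdout error is acceptable."""
    data = []
    for round_number, batch in enumerate(batches, start=1):
        data.extend(batch)                    # more data has become available
        model = fit_line(data)                # rebuild the model
        error = holdout_error(model, holdout) # validate it again
        if error <= max_error:
            return model, error, round_number
    # No batch produced an adequate model: find more data or redefine the problem.
    raise ValueError("model did not reach the target error")

model, error, rounds = build_iteratively(batches, holdout, max_error=1.0)
print("rounds needed:", rounds)
```

The `ValueError` branch corresponds to the case described above in which the available data or the problem definition itself has to be revisited.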

    Microsoft SQL Server Data Mining provides an integrated environment for creating and working with data mining models.

    This environment includes SQL Server Data Tools (SSDT), which contains data mining algorithms and query tools that make it easy to build a comprehensive solution for a variety of projects, and SQL Server Management Studio, which contains tools for browsing models and managing data mining objects. For more information, see Creating Multidimensional Models Using SQL Server Data Tools (SSDT). For an example of how the SQL Server tools can be applied to a business scenario, see the Basic Data Mining Tutorial.