Friday, 22 February 2013

Techniques for Data Subset

In my previous posts, I explained about Data Subset in TDM & Implementation Approaches to Data Sub-setting.  In this post, I will explain some basic techniques with the help of which we can do a data subset.

  • First N records
    • This is a pretty simple technique, wherein the first N records of a table is retrieved from the Production database.  This can be achieved using a simple SQL Query such as
      • SELECT TOP 10000 * FROM DBO.CUSTOMERS
  • Based on a filter criteria
    • This is a scenario where the subset conditions can be based on a simple filter criteria like Age > 50, City = London, etc.  This is easier to implement in cases where the subset requirements are less complicated.  An example query for such technique would be
      • SELECT * FROM DBO.CUSTOMERS WHERE AGE > 50 AND CITY = 'LONDON'
  • Based on a complex SQL query
    • Sometimes the subset requirements can be more complicated.  It might involve dependencies across multiple tables.  What that means is multi-level queries, and complicated joins in SQL terms.  So a subset can be based on complex SQL queries.  For ex.  Fetch all the customers that have placed orders during the Christmas season.  This will translate into an SQL query which will involve both Customer as well as Orders table.  For example a query might be
      • SELECT * FROM CUSTOMERS WHERE CUSTID IN (SELECT DISTINCT  ORD_CUSTID FROM ORDERS WHERE ORDER_DATE > '1-DEC-2012' AND ORDER_DATE < '1-JAN-2013')
  • Sampling / Distribution based
    • Sampling is a process of picking a sample of the records, for example every 5000th record or based on the distribution of certain fields. For ex.  Fetching the records with 10% data with London city, 50% data with Chennai city, 40% data with Mumbai city.  This can be useful in scenarios where there is a need for specific sets of test data in larger quantities.  There are no straight forward SQL queries for this technique and is more complicated than the rest of the techniques.
A thing to be noted here is, the condition that is applied on the Parent table should be cascaded down to the Child tables also.  This is a critical requirement for any data subset as it will ensure data integrity across all the tables.

NOTE: The assumption is the RDBMS used here in our case is SQL Server, the concept is similar for other RDBMS also but the syntax might vary slightly.

Hope this post was informative.  Thanks for the read.  Comments are welcome.  Please feel free to add any techniques / debate on the techniques mentioned above.  Cheers.

About the Author

Rajaraman Raghuraman has nearly 8 years of experience in the Information Technology industry focusing on Product Development, R&D, Test Data Management and Automation Testing.  He has architected a TDM product from scratch and currently leads the TDM Product Development team in Cognizant.  He is passionate about Agile Methodologies and is a huge fan of Agile Development and Agile Testing.  He blogs at Test Data Management Blog & Agile Blog.  Connect with him on Google+

1 comment:

  1. Employing the right methods ensures that relevant information is extracted, Best VPNs Free reducing computational complexity and improving the overall efficiency of data-driven tasks."

    ReplyDelete