I have 3 tables with dates to which I want to join to a date dimension table but it is returning many duplicates with left joins

clip art of 
 a double-quote character

Question

I have 4 tables with dates to which I want to join to a date dimension table but it is returning many duplicates with left joins.

Tables are basically a date field which I want to count.

         mdate datetime, mordate varchar(10),fteam varchar(20)
         sdate datetime,fteam varchar(20)
         bdate datetime,fteam varchar(20)

These are actually one table with the separate dates which I am joining 3 times to the dimension table to get one dataset. Also this table

        compdate datetime, fteam varchar(20)

and the date dimension table as date in yyyymmdd,which I join on the date field.
as

        select cp.fteam,md.mdate,sd.sdate,bd.bdate,cp.cpdate,d.date

         into #resultstable

        from datedimension d 
        left join mdate md
        on d.date = convert(date,md.mdate,103)
        left join sentdate sd
        on d.date = convert(date,sd.sdate,103)
        left join bacdate bd
        on d.date = convert(date,bd.bdate,103)
        left join compdate cp
        on d.date = convert(date,cp.cdate,103) 

Doing this I want the dates in the date dimension to give me one date I can use a where clause on to get counts of each date from the 4 different tables for a report.

However it is giving me many repeats as each time a there is a matching date you get the same line repeated for the matching date on all tables.
This gives many counts which are wrong.
ie)if md table has a record 2 records for 2016/06/29 and cp has 3 and bd has six:
The dimension date result will be 36! for md when it should only be showing 2!, ie 6x3x2.

How can I join these tables with causing repeats and incorrect results.

I thought it would be a standard way to join fact tables with a dimension table to give accurate results and not duplicates as you are join sets together.

I have tried picking only the dates from each table only but it still gives repeats.
I cannot show a schema as company details but you can put together a hypothetical one from the tables shown.

asked 2016-07-29 by Haider Ali


Answer

What you are seeing is that because none of your fact tables relate to each other, you are essentially creating a Cartesian product for the fact tables–where they only relate to each other by date.

Consider this simplified version of your example, where I also include some sample data for “today”:

    CREATE TABLE #fact1 (id int identity, dt datetime, val varchar(5));
    CREATE TABLE #fact2 (id int identity, dt datetime, val varchar(5));
    CREATE TABLE #fact3 (id int identity, dt datetime, val varchar(5));
    CREATE TABLE #fact4 (id int identity, dt datetime, val varchar(5));
    CREATE TABLE #date (dt datetime, val varchar(5));
    GO
    
    INSERT INTO #fact1 (dt, val) VALUES (GETDATE(),'fact1');
    INSERT INTO #fact2 (dt, val) VALUES (GETDATE(),'fact2');
    INSERT INTO #fact3 (dt, val) VALUES (GETDATE(),'fact3');
    INSERT INTO #fact4 (dt, val) VALUES (GETDATE(),'fact4');
    WAITFOR DELAY '00:00:01';
    GO 5
    
    INSERT INTO #date (dt, val) VALUES (CAST(GETDATE() AS date),'Today');
    GO
    
    
    SELECT *
    FROM #date d
    JOIN #fact1 AS f1 ON d.dt = CAST(f1.dt AS date)
    JOIN #fact2 AS f2 ON d.dt = CAST(f2.dt AS date)
    JOIN #fact3 AS f3 ON d.dt = CAST(f3.dt AS date)
    JOIN #fact4 AS f4 ON d.dt = CAST(f4.dt AS date);
    GO
    
    DROP TABLE #fact1;
    DROP TABLE #fact2;
    DROP TABLE #fact3;
    DROP TABLE #fact4;
    DROP TABLE #date;
    GO

Note that 625 rows are returned. This is the Cartesian product of the four fact tables, which is then joined to the dimension table. This happens because there is no relation between the fact tables other than the date. As a result, any one row for “today” in one fact table is joined to every row for “today” in every other fact table.

Instead, consider how your four fact tables related WITHOUT the join to the date dimension table. Re-write your query so that the data makes sense before joining to the date dimension. Do the tables relate on something like an order_id or any other aspect?

If the fact tables only relate insomuch as you are aggregating them by date–then yes, you’ll need to take another approach:

a) Aggregate by date first, then join the aggregated sets together. This option makes the most sense if you only need the aggregated values, and don’t need the full details for your report.

    SELECT *
    FROM #date d
    JOIN (SELECT CAST(dt AS date) AS dt, count(*) AS dt_count 
          FROM #fact1 GROUP BY CAST(dt AS date)) AS f1 ON d.dt = f1.dt
    JOIN (SELECT CAST(dt AS date) AS dt, count(*) AS dt_count 
          FROM #fact2 GROUP BY CAST(dt AS date)) AS f2 ON d.dt = f2.dt
    JOIN (SELECT CAST(dt AS date) AS dt, count(*) AS dt_count 
          FROM #fact3 GROUP BY CAST(dt AS date)) AS f3 ON d.dt = f3.dt
    JOIN (SELECT CAST(dt AS date) AS dt, count(*) AS dt_count 
          FROM #fact4 GROUP BY CAST(dt AS date)) AS f4 ON d.dt = f4.dt;

b) Assign an arbitrary row_number() for each calendar day, then use that as a secondary join criterion. If the data doesn’t actually relate, this option might work, but the detailed result set is largely meaningless when all data in a single row doesn’t refer to a single entity. This might give you the right numbers, but is logically a useless result set.

    SELECT *
    FROM #date d
    JOIN (SELECT *, 
             ROW_NUMBER() OVER(PARTITION BY CAST(dt AS date) ORDER BY dt) AS row_num
          FROM #fact1 ) AS f1 ON d.dt = CAST(f1.dt AS date) 
    JOIN (SELECT *, 
             ROW_NUMBER() OVER(PARTITION BY CAST(dt AS date) ORDER BY dt) AS row_num
          FROM #fact2 ) AS f2 ON d.dt = CAST(f2.dt AS date) AND f1.row_num = f2.row_num
    JOIN (SELECT *, 
             ROW_NUMBER() OVER(PARTITION BY CAST(dt AS date) ORDER BY dt) AS row_num
          FROM #fact3 ) AS f3 ON d.dt = CAST(f3.dt AS date) AND f1.row_num = f3.row_num
    JOIN (SELECT *, 
             ROW_NUMBER() OVER(PARTITION BY CAST(dt AS date) ORDER BY dt) AS row_num
          FROM #fact4 ) AS f4 ON d.dt = CAST(f4.dt AS date) AND f1.row_num = f4.row_num;

c) Break this up into separate statements: one for each fact table. Optionally UNION those results into a single result set. This result set could then be further aggregated/grouped to give you the results you want.

    SELECT *, 'Fact 1' AS SourceTable
    FROM #date d
    JOIN #fact1 AS f1 ON d.dt = CAST(f1.dt AS date)
    UNION ALL
    SELECT *, 'Fact 2' AS SourceTable
    FROM #date d
    JOIN #fact2 AS f2 ON d.dt = CAST(f2.dt AS date)
    UNION ALL
    SELECT *, 'Fact 3' AS SourceTable
    FROM #date d
    JOIN #fact3 AS f3 ON d.dt = CAST(f3.dt AS date)
    UNION ALL
    SELECT *, 'Fact 4' AS SourceTable
    FROM #date d
    JOIN #fact4 AS f4 ON d.dt = CAST(f4.dt AS date);

In my opinion, options a & c offer the best solutions when the fact tables don’t otherwise relate to each other. Option b might work, but you would need to be very careful that your data is meaningful & doesn’t create confusing or erroneous results.

Additionally, while it is orthogonal to the question asked, keep in mind that applying a function to a join criteria (in this case, CONVERTing the date column) will prevent index usage, resulting in a table scan.

answered 2016-08-05 by Andy Mallon