Sunday, July 17, 2011

Blog Post: Flat File Source Changes in Denali

Nearly everyone uses the flat file source adapter, and nearly everyone that uses it runs into a common set of challenges. In Denali, we?ve made a couple of key enhancements to the Flat File Source to support the most commonly asked-for options:

  • Delimited files with varying numbers of columns per row (aka ?Ragged-Right? delimited files)
  • Delimited files with embedded qualifiers

 

Ragged-Right Delimited Files

A common way NULL data is represented at the end of rows in delimited files is without the column delimiter. Here?s an example:

OrderID

CustomerID

EmployeeID

ShipVia

10248

VINET

5

3

10249

TOMSP

6

10250

HANAR

4

2

When represented as a delimited file, it?s fairly common to see this kind of representation

OrderID,CustomerID,EmployeeID,ShipVia  10248,VINET,5,3  10249,TOMSP,6  10250,HANAR,4,2

Note the lack of a trailing comma after the ?6? in the second row. Customarily, most flat file parsers treat the lack of data on this row to mean there are NULLs in the remaining columns on the row.

Prior to Denali, SSIS took a different approach, where it would ignore any row delimiter until it believed it was parsing the last column of the row. So for example the flat file above might be parsed as:

OrderID

CustomerID

EmployeeID

ShipVia

10248

VINET

5

3

10249

TOMSP

610250

HANAR,4,2

Of course, since ?HANAR,4,2? isn?t a valid number, if ShipVia was an integer column, the parse would fail. And worse yet, there was little you could do to work around the limitation; your choices were to write your own parser, convert the data before parsing it, use the Flat File Source just to parse rows, or some such.

In Denali, this kind of file is parsed differently ? by default, we?ll always look for a row delimiter in unquoted data, and start a new row if it?s seen. So the table above is parsed as one would expect:

clip_image001

This new behavior is on by default, but can be disabled at any time using the AlwaysCheckForRowDelimiters property on the connection manager.

Embedded Qualifiers

Another custom in delimited files is the use of a qualifier character to ?escape? or embed a qualifier character into a qualified string, for example:

ID

Title

1148

Can?t Buy a Thrill

1149

Echoes, Silence, Patience & Grace

To allow the literal commas to be part of the string, the use of qualifiers is typical in a flat file representing this data. However if an apostrophe(') is used for qualifying, the apostrophe in the first row must be escaped, and a typical way to accomplish this is by doubling the qualifier character:

ID,Title  1148,'Can''t Buy a Thrill'  1149,'Echoes, Silence, Patience & Grace'

Prior to Denali, such files would always fail to parse: though SSIS supported qualifiers, SSIS had no support for embedded qualifiers, and would treat the first qualifier character (in this case the first apostrophe after the ?n? in ?Can?t Buy a Thrill?) as the end of that field, and throw an error due to the lack of a delimiter immediately after the apostrophe.

In Denali, this is fixed, and the data above is properly parsed as:

clip_image002

SONUS NETWORKS SONIC AUTOMOTIVE SKYWORKS SOLUTIONS SILICON LABORATORIES SI INTERNATIONAL

No comments:

Post a Comment