[Coding] .NET and Reading Excel Files (OLEDB vs. Interop)
Posted by Khatharsis on December 7, 2013
One of the tasks I’ve been working on for my project at work is parsing through an Excel spreadsheet. In a brief summary, the data from the spreadsheet will get transferred to a database. In a more detailed summary, the data from the spreadsheet will get translated into objects which then get put into the database.
But first, I wanted to make sure I was getting the correct values. So, my interim task was first to parse through the spreadsheet, stick it in a DataSet, and output onto an HTML page. The next task was to take the values in the DataSet, put it in an object, and pull out values from that object when displaying on the HTML page.
But first, the tricky part was getting .NET to talk to Office. There are two ways of going about this, without resorting to a third-party library. The first uses OLEDB which is included with .NET and doesn’t require Office to be installed. The second uses Interop, which requires Office to be installed. So, I took the easy route (OLEDB) as a quick proof-of-concept.
The following is a simple example of the OLEDB method:
internal DataSet ImportDataFromSpreadsheet()
{
string connection = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source={0};Extended Properties=\"Excel 8.0;IMEX=1;HDR=NO;TypeGuessRows=0;ImportMixedTypes=Text\""
string connStr = String.Format(connection,
HttpContext.Current.Server.MapPath("~/xls/sample_sheet.xls"));
using (OleDbConnection cmd = new OleDbConnection(connStr))
{
OleDbCommand command = new OleDbCommand("SELECT * FROM [Sheet1$]", cmd);
OleDbDataAdapter data = new OleDbDataAdapter();
data.SelectCommand = command;
DataSet ds = new DataSet();
data.Fill(ds);
return ds;
}
}
You may note the connection string has a few extended properties. The spreadsheet is in a fancy, pretty-looking format that, of course, does not map well to parsing. So, I wanted all of the data I import as a string value and handle conversions on the .NET end. OLEDB was pulling data and automatically converting them to a certain type based on the first few rows as sample data.
This meant that data would be missing. For example, a column that had mostly numbers but a few dashes here and there would return just the numbers and leave out the dashes because they aren’t numeric values. In a couple of cases, even the numeric values were left out.
There were two steps to resolving this. The first was to go through the spreadsheet and click on each problem cell, then click away so it would miraculously fix itself. This made the missing numeric values appear. The second was to add the extra extended properties: IMEX=1 to indicate there are mixed data columns, HDR=0 to indicate that the first row of data should not be treated as a header row, TypeGuessRows=0 to indicate that no rows should be used to guess at a column’s type, and ImportMixedTypes=Text to indicate mixed types should be imported as string. This made some of the other, non-numeric values appear, but not all. Which didn’t make much sense when one column was nearly identical to the next.
Unfortunately, the step to resolve this issue was down at the registry level and I didn’t want to muck around on my development machine and find that I’d have to muck around on the staging and production servers as well. So, I decided to try Interop just to get my hands a little wet with something new.
At this point, I also asked my mentor about what library we might be using in any of the existing applications to work with Excel. I was surprised to find Interop being referenced (still unclear on if it’s actually being used). But, in our case, having Interop used/referenced on an existing server that my project would likely reside on meant that I could easily use Interop without worrying too much about IT needing to set up the server (which includes taking down the live apps until the setup is complete).
Using Interop was like getting to know someone new who didn’t like to talk much about themselves. Tutorials were a little confusing and resulted in glazed eyes for me, but I finally forced myself to sit down and implement a proof-of-concept (similar with what I did with OLEDB) using Interop.
First, I had to add the COM reference (Microsoft Excel xx.x Object Library; not Microsoft Office xx.x Object Library!) and a using Microsoft.Office.Interop.Excel statement. Pretty straightforward. The reference shouldn’t be hard to find if you have Office installed. If you don’t have Office installed, then I assume you won’t be finding that library in the list.
The following is a simple example of the code I constructed:
internal DataSet ImportDataFromSpreadsheetUsingInterop()
{
Application excel = new Application();
Workbooks wbs = excel.Workbooks;
Workbook wb = wbs.Open(
HttpContext.Current.Server.MapPath("~/xls/sample_analyte.xls"), ReadOnly: true);
Worksheet ws = wb.Sheets[1];
Range range = ws.UsedRange;
DataSet ds = new DataSet();
ds.Tables.Add();
for (int i = 0; i < range.Columns.Count; i++)
{
DataColumn dc = new DataColumn("F"+i, System.Type.GetType("System.String"));
ds.Tables[0].Columns.Add(dc);
}
// Excel arrays are 1-based
for (int i = 1; i <= range.Rows.Count; i++)
{
string[] rowData = new string[range.Columns.Count];
for (int j = 1; j <= range.Columns.Count; j++)
{
rowData[j-1] = range.Cells[i, j].Text;
}
ds.Tables[0].Rows.Add(rowData);
}
range = null;
ws = null;
wbs = null;
if (wb != null)
wb.Close(false);
wb = null;
if (excel != null)
excel.Quit();
excel = null;
return ds;
}
One thing to remember with Interop is it's all 1-based instead of 0-based. What would normally be Sheets[0] is now Sheets[1] and Cells[1, 1] instead of Cells[0, 0]. You can end up with a "Exception from HRESULT: 0x800A03EC Error" exception when accidentally using a 0-base.
Another error I ran into was "Cannot perform runtime binding on a null reference" which resulted in adding a check to make sure Cell[i, j] was not equal to an empty string. Actually, this check is necessary if you plan on using the Cells[i, j].Value2.ToString(), but not necessary if you are using Cells[i, j].Text as I do above. I left the check in there for reference.
A third point I want to make is the using() statement that is invaluable for I/O operations is not applicable for Application or Workbook. So, be sure to null and Close()/Quit() the Interop references when done.
A fourth point is yes, I believe you need to manually loop through the data to construct a DataSet. It is unfortunate it's not already built-in, but the control over what you're actually pulling from the spreadsheet (e.g., a string over an int) and putting into the DataSet is nicer.
A final point that I ran into after patting myself on the back was attempting to open the Excel file and finding it was locked. By me. Or someone masquerading as me, i.e., my computer/ASP.NET server. So, I stopped my local ASP.NET server and closed Visual Studio. Nope. Turns out I had multiple EXCEL.EXE processes running, one for each time I ran/refreshed the page. I closed out of those via Task Manager and was able to open the Excel file.
Turns out, doing something like wb.Workbooks.Open() is bad because C# creates a temporary COM variable of type Workbooks. (See the third link at the end of this post for more; in essence "Never use 2 dots with com objects.") Even if you Close() and null wb, it won't release the Workbooks object that was created internally, thus leaving a link open to Excel. In addition to fixing this memory leak, I added a ReadOnly parameter to my Open() call because I'm not editing the file anyway and that should also mitigate any further holes in this simple concept code.
Since I wanted strings-only from the spreadsheet, .Text returns me the values I want and all of the values to boot. In addition, it doesn't truncate floating values from 1.000 to 1, which is a big plus. (Due to the nature of the values we are storing, they also need to be stored as strings regardless if they are actually numbers so keeping the resolution of floating points is a bonus.) Although I don't plan on using spreadsheet-direct-to-HTML, it does make debugging and making sure the values I'm pulling and seeing are the same values as those on the spreadsheet easier.
Some references that helped me with this task:
-Interop With .NET 4 - How to Read Data From Excel (2011)
-Faster MS Excel Reading using Office Interop Assemblies (2005)
-How to properly clean up Excel interop objects (2008)