External tables are used to read data from files or write data to files in Azure Storage. With Synapse SQL, one may use external tables for the purpose of reading external data using a dedicated SQL pool or serverless SQL pool.
The major applications of External tables in dedicated SQL pool and serverless SQL pool are:
- Querying Azure Blob Storage and Azure Data Lake Gen2 with Transact-SQL statements
- Storing query results to files in Azure Blob Storage or Azure Data Lake Storage using CETAS
- Importing data from Azure Blob Storage and Azure Data Lake Storage and store it in a dedicated SQL
Azure Synapse Analytics allows people to read Parquet files that are stored in the Azure Data Lake by making use of high-performance Parquet readers and T-SQL language. The prime characteristic of such high-performance Parquet readers is that they use native (C++) code to read Parquet files, as opposed to existing Polybase Parquet reader technology that leverages the Java code. Such native readers tend to be introduced in the serverless SQL pools in Azure Synapse Analytics workspaces. This native technology used in the serverless SQL pools has demonstrated superior performance in comparison to the existing Polybase external table in the dedicated SQL pools, in several experiments. One can even avail this native technology for reading Parquet files now in the dedicated SQL pools. One can create external tables that use native code to read Parquet files in the dedicated Pools in Azure Synapse Analytics, and subsequently, enhance the performance of their queries that access external Parquet files.
How does this new technology help people?
- Before: One could have only used Polybase TYPE=HADOOP external tables to read the Parquet files with the dedicated SQL pools. Based on the Java code, Polybase TYPE=HADOOP tables do not always meet the performance expectations. In many cases, Java data structures additionally have to be converted to the native structures.
- Now: Apart from the Polybase TYPE=HADOOP external tables, one may also use a new type of native external tables that are much faster. Delivering superior performance, native external tables are implemented with the usage of native code. The new native tables and the existing Polybase TYPE=HADOOP external tables can be used on the same dedicated pool. It is even possible to enjoy a 10x better performance while accessing external Parquet files just by removing the TYPE option from the external data source.
Hadoop vs native external tables
In dedicated SQL pools one can use two types of tables:
- Hadoop external tables: The Java technology is used in the Polybase Hadoop external table for reading external Parquet files. This is a widely used and available technology.
- Native external tables: Currently gated in the public purview, these new external tables use the native Parquet readers
The only syntax difference among these two table types are the external data source definitions:
- In case people want to use the existing Hadoop external tables, they would be required to create an external data source with the TYPE=HADOOP option.
- If they want to use the new native external tables to create an external data source, then it can be done without the TYPE option.
The syntax for the external tables is the same in both the situations. One just has to create an external table on top of the data source that they have created with or without the TYPE option. The external table will use native code or Java code on the basis of the TYPE attribute in the underlying EXTERNAL DATA SOURCE object.
On the whole, one should try to use native external tables in the dedicated SQL pools in Azure Synapse analytics to boost the performance of their queries that use the external tables on top of Parquet files.