rds_duckdb - ApsaraDB RDS - Alibaba Cloud Documentation Center

When you need to quickly obtain results from complex queries, you can use the rds_duckdb extension of ApsaraDB RDS for PostgreSQL. This extension exports local tables as column-oriented tables and uses vectorization capabilities to significantly accelerate complex queries without modifying original query statements, allowing you to retrieve results efficiently.

Feature overview

The rds_duckdb extension of ApsaraDB RDS for PostgreSQL is developed based on the efficient and resource-friendly DuckDB to enhance analytical query capabilities. You can use this extension to export local tables from an ApsaraDB RDS for PostgreSQL instance as column-oriented tables and enable AP query acceleration. This significantly accelerates complex queries to meet analytical service requirements.

Prerequisites

The major engine version of the instance is PostgreSQL 12 or later.
The minor engine version of the instance is 20241030 or later.
You have added rds_duckdb to the Running Value of shared_preload_libraries.
For more information about how to configure parameters, see Configure parameters. For example, change the Running Value to 'pg_stat_statements,auto_explain,rds_duckdb'.

Note

When the major engine version of the instance is PostgreSQL 15 and the minor engine version is 20250228, automatic incremental synchronization of column-oriented table data is enabled by default.

Create and delete the extension

Use a privileged account to create and delete the extension.

Create the extension

CREATE EXTENSION rds_duckdb;

View the DuckDB kernel version used by the extension

SELECT rds_duckdb.duckdb_version();

Delete the extension

DROP EXTENSION rds_duckdb;

Manage column-oriented tables

Create a column-oriented table

Execute the following statement to export a local table of the RDS instance as a column-oriented table. The local table can be a user table, materialized view, or foreign table. The column-oriented table is used to accelerate analytical queries.

Note

When the major engine version of the instance is PostgreSQL 15 and the minor engine version is 20250228, automatic incremental synchronization of column-oriented table data is enabled by default. Before you create a column-oriented table, you can configure the target ApsaraDB RDS for PostgreSQL instance and local table to implement automatic incremental synchronization of column-oriented table data. For more information, see Configure automatic incremental synchronization for column-oriented tables.

SELECT rds_duckdb.create_duckdb_table('local_table_name');

Refresh a column-oriented table

Execute the following statement to refresh the exported column-store table based on the latest data in the local table of the RDS instance and update the table schema and data:

SELECT rds_duckdb.refresh_duckdb_table('local_table_name');

View the size of a column-oriented table

SELECT rds_duckdb.duckdb_table_size('local_table_name');

View the sizes of all exported tables in the current database

SELECT rds_duckdb.duckdb_database_size();

Delete a column-oriented table

SELECT rds_duckdb.drop_duckdb_table('local_table_name');

Manage AP query acceleration

The rds_duckdb extension can be used to accelerate read-only queries. After you enable the AP query acceleration feature, if SQL statements need to be executed to query data from tables and DuckDB column-oriented tables exist for these tables, the SQL statements are executed in DuckDB to accelerate the queries. If the SQL statements are unsupported DML or DDL statements or column-oriented tables do not exist, the SQL statements are executed in ApsaraDB RDS for PostgreSQL.

For SQL statements that are executed in ApsaraDB RDS for PostgreSQL, the system displays a warning message in the following format: WARNING: Trying to execute an operation with non-duckdb tables(test), fallback to PG. Content in the parentheses () indicates the tables of the RDS instance for which DuckDB column-oriented tables do not exist.

For SQL statements that are not read-only, the system displays the following warning message: WARNING: Modification operations on DuckDB tables are currently not supported, fallback to PG.

Enable AP query acceleration

SET rds_duckdb.execution = on;

Configure AP query acceleration parameters

You can configure parameters in a session to manage the performance of the AP query acceleration. For example:

SET rds_duckdb.worker_threads = 32;
SET rds_duckdb.memory_limit = 16384;

Parameter

Description

Recommended value

rds_duckdb.worker_threads

The number of worker threads that are used for AP query acceleration.

Valid values: 1 to 255.

Default value: 1, which indicates that only one worker thread is used.

If you want to achieve significant performance improvements, we recommend that you set this parameter to the number of CPU cores.
This parameter is closely related to the hardware device on which the queries are run. A larger value indicates a higher CPU load when AP query acceleration is performed. You must configure this parameter based on your business requirements.
If you set this parameter to a higher value, you can obtain better performance but the CPU load also increases. If you set this parameter to a lower value, the acceleration performance is affected but the CPU load also decreases.

rds_duckdb.memory_limit

The maximum memory that can be used for AP query acceleration.

Unit: MB. You do not need to add a unit when you configure this parameter.

Valid values: 1 to INT32_MAX.

Default value: 100, which indicates that the upper limit is 100 MB.

If you want to achieve significant performance improvements, we recommend that you set this parameter to a large value that is supported by your workloads.
This parameter is closely related to the hardware device on which the queries are run. A larger value indicates a higher memory usage when AP query acceleration is performed. You must configure this parameter based on your business requirements.
The default value of this parameter is a conservative value. We recommend that you configure this parameter based on your business requirements.
If you set this parameter to a small value, the process of exporting the column-oriented table for a large table and the performance of AP query acceleration are affected.

Note

For more information about DuckDB parameters, see DuckDB.

Disable AP query acceleration

SET rds_duckdb.execution = off;

Configure automatic incremental synchronization for column-oriented tables

When the major engine version of the instance is PostgreSQL 15 and the minor engine version is 20250228, automatic incremental synchronization of column-oriented table data is enabled by default. Before you create a column-oriented table, you must perform the following operations to configure the target ApsaraDB RDS for PostgreSQL instance and local table.

In Extension management, check the version of the rds_duckdb extension and upgrade the extension to version 1.3.
Configure parameters to change the running value of the wal_level parameter to logical.
(Optional) If the target local table does not have a primary key, execute the following statement to set a REPLICA IDENTITY index as the replication key for the table:
```
ALTER TABLE <local_table_name> REPLICA IDENTITY USING INDEX <index_name>;
```
Use an account that has the replication permission or a privileged account to create a column-oriented table.

View the synchronization status and synchronization offset progress of a column-oriented table

SELECT * FROM rds_duckdb.duckdb_sync_stat;

Sample response and parameter description

Sample response:

 sync_table | sync_status_description |          sync_error_description          | confirmed_lsn 
------------+-------------------------+------------------------------------------+---------------
 test       | not syncing             | no primary key or replica identity index | 
 test2      | not syncing             | no primary key or replica identity index | 
 test3      | data syncing            | no errors                                | 0/250D1E8
 test4      | not syncing             | no primary key or replica identity index | 
 test5      | data syncing            | no errors                                | 0/250D1E8
 test6      | data syncing            | no errors                                | 0/250D1E8
 test7      | data syncing            | no errors                                | 0/250D1E8
 test8      | data syncing            | no errors                                | 0/250D1E8

Parameter description:

Parameter

Description

sync_status_description

The synchronization status of the column-oriented table.

not syncing: Incremental data synchronization is not performed.
data copying: The first full data replication is in progress.
data catchup: The incremental data that is generated during the full data replication is being applied.
data syncing: Normal incremental synchronization is in progress.

sync_error_description

The reason why incremental synchronization is not performed for the column-oriented table.

no errors: No issues exist.
dml replay conflict: A DML operation causes a conflict during incremental replay.
ddl replay conflict: A DDL operation causes a conflict during incremental replay.
no primary key or replica identity index: The ApsaraDB RDS for PostgreSQL table that corresponds to the column-oriented table does not have a primary key or a REPLICA IDENTITY index.
unsupported relation type: The type of the table to be synchronized is not supported, such as a partitioned table, view, or materialized view.
rds_duckdb.enable_sync not set: The global synchronization GUC is not enabled.
Note
When the major engine version of the instance is PostgreSQL 15 and the minor engine version is 20250228, this feature is enabled by default.
removing duckdb table: The column-oriented table is being deleted.

View SQL execution plans

Use the EXPLAIN statement to view the execution plans of SQL statements after the AP query acceleration feature is enabled or disabled. For example:

The following sample code provides an example on the execution plan of the SQL statements after the AP query acceleration feature is enabled:

Enable AP query acceleration

tpch_10x=# SET rds_duckdb.execution = on;
SET
tpch_10x=# EXPLAIN SELECT
tpch_10x-#     100.00 * sum(
tpch_10x(#         CASE WHEN p_type LIKE 'PROMO%' THEN
tpch_10x(#             l_extendedprice * (1 - l_discount)
tpch_10x(#         ELSE
tpch_10x(#             0
tpch_10x(#         END) / sum(l_extendedprice * (1 - l_discount)) AS promo_revenue
tpch_10x-# FROM
tpch_10x-#     lineitem,
tpch_10x-#     part
tpch_10x-# WHERE
tpch_10x-#     l_partkey = p_partkey
tpch_10x-#     AND l_shipdate >= date '1995-09-01'
tpch_10x-#     AND l_shipdate < CAST('1995-10-01' AS date);
                         QUERY PLAN
------------------------------------------------------------
 Custom Scan (DuckDBNode)  (cost=0.00..0.00 rows=0 width=0)
   DuckDB Plan:

 ┌───────────────────────────┐
 │         PROJECTION        │
 │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
 │        Projections:       │
 │       promo_revenue       │
 │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
 │   Estimated Cardinality:  │
 │             1             │
 └─────────────┬─────────────┘
 ┌─────────────┴─────────────┐
 │    UNGROUPED_AGGREGATE    │
 │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
 │        Aggregates:        │
 │          sum(#0)          │
 │          sum(#1)          │
 └─────────────┬─────────────┘
 ┌─────────────┴─────────────┐
 │         PROJECTION        │
 │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
 │        Projections:       │
 │ CASE  WHEN (prefix(p_type,│
 │    'PROMO')) THEN (CAST(  │
 │ (l_extendedprice * (1.000 │
 │    - CAST(l_discount AS   │
 │     DECIMAL(18,3)))) AS   │
 │   DECIMAL(20,5))) ELSE 0  │
 │         .00000 END        │
 │ (l_extendedprice * (1.000 │
 │    - CAST(l_discount AS   │
 │      DECIMAL(18,3))))     │
 │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
 │   Estimated Cardinality:  │
 │          6600339          │
 └─────────────┬─────────────┘
 ┌─────────────┴─────────────┐
 │         HASH_JOIN         │
 │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
 │         Join Type:        │
 │           INNER           │
 │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
 │        Conditions:        ├──────────────┐
 │   l_partkey = p_partkey   │              │
 │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │              │
 │   Estimated Cardinality:  │              │
 │          6600339          │              │
 └─────────────┬─────────────┘              │
 ┌─────────────┴─────────────┐┌─────────────┴─────────────┐
 │         SEQ_SCAN          ││         SEQ_SCAN          │
 │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
 │        Stringified:       ││        Stringified:       │
 │          lineitem         ││            part           │
 │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
 │        Projections:       ││        Projections:       │
 │         l_partkey         ││         p_partkey         │
 │      l_extendedprice      ││           p_type          │
 │         l_discount        ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
 │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││   Estimated Cardinality:  │
 │          Filters:         ││          2000000          │
 │ l_shipdate>='1995-09-01': ││                           │
 │ :DATE AND l_shipdate<'1995││                           │
 │     -10-01'::DATE AND     ││                           │
 │   l_shipdate IS NOT NULL  ││                           │
 │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││                           │
 │   Estimated Cardinality:  ││                           │
 │          11997210         ││                           │
 └───────────────────────────┘└───────────────────────────┘


(71 rows)

The following sample code provides an example on the execution plan of the SQL statements after the AP query acceleration feature is disabled:

Disable AP query acceleration

tpch_10x=# SET rds_duckdb.execution = off;
SET
tpch_10x=# EXPLAIN SELECT
    100.00 * sum(
        CASE WHEN p_type LIKE 'PROMO%' THEN
            l_extendedprice * (1 - l_discount)
        ELSE
            0
        END) / sum(l_extendedprice * (1 - l_discount)) AS promo_revenue
FROM
    lineitem,
    part
WHERE
    l_partkey = p_partkey
    AND l_shipdate >= date '1995-09-01'
    AND l_shipdate < CAST('1995-10-01' AS date);
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=1286740.42..1286740.43 rows=1 width=32)
   ->  Gather  (cost=1286739.96..1286740.37 rows=4 width=64)
         Workers Planned: 4
         ->  Partial Aggregate  (cost=1285739.96..1285739.97 rows=1 width=64)
               ->  Parallel Hash Join  (cost=1235166.04..1282419.39 rows=189747 width=33)
                     Hash Cond: (part.p_partkey = lineitem.l_partkey)
                     ->  Parallel Seq Scan on part  (cost=0.00..43232.15 rows=500016 width=29)
                     ->  Parallel Hash  (cost=1233776.40..1233776.40 rows=111171 width=20)
                           ->  Parallel Seq Scan on lineitem  (cost=0.00..1233776.40 rows=111171 width=20)
                                 Filter: ((l_shipdate >= '1995-09-01'::date) AND (l_shipdate < '1995-10-01'::date))
 JIT:
   Functions: 17
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(13 rows)

Performance Testing

This section uses Linux as an example to evaluate the performance improvements of the rds_duckdb extension in complex queries based on the standard TPC-H test. For more information, see Performance Testing of the rds_duckdb extension.